Ingestion connector seam — assemble, don't build, and ingestion owns the input contract

0013-ingestion-connector-seam

decision · read as: Explain · confidence: asserted · status: active · 2026-06-14 · owner: ingestion-engineer
Reversibility: two-way door

DEC-0013 — Ingestion connector seam (assemble, don't build; ingestion owns the input contract)

Reversibility: two-way door — on connector internals, parser choices, and normalization details; the Source / Provenance / SourceSpan input contract and the Connector seam are the durable parts.

Context

Extraction runtime architecture — the moat reserved @dossier/ingestion (name + dir + README stub) and fixed its role in two ways. First, ingestion is commodity: assemble, don't build ("Ingestion is commodity (assemble), but extraction → OKF is the moat (build)"). Second, the reserved @dossier/ingestion is the package that "defines the CleanDoc + Provenance input contract extraction consumes." With the moat (Extraction runtime architecture — the moat), the measurement layer (Live extraction eval harness — what we measure is what extraction optimizes for), the agentic foundation (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB), and the orchestration runtime (Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system) all built and green, the missing piece was the front door: the layer that pulls a client's raw knowledge in and hands extraction clean, provenance-stamped markdown. Until this existed, runLoop's ingest stage was a pass-through and the loop could not start from real client files.

This decision builds that reserved package, @dossier/ingestion: the connectors that normalize raw sources to clean markdown + provenance for extraction. The build was verified this session, all green and with no network: @dossier/ingestion has 19 tests green; extraction stayed green through the contract swap; runLoop now ingests real files end-to-end; repo-wide, 268 tests passed with 1 gated skip; all 6 packages build.

Options considered

1. Connectors — build them now vs. assemble behind a seam.

  • (a) Build the commodity connectors now (web crawl, files→markdown, SharePoint/M365). This gives maximum source coverage on day one, but it spends the team's scarce build budget on the commodity layer (Extraction runtime architecture — the moat is explicit: ingestion is "assemble," not the moat), pulls three vendor SDKs into the dependency tree, and puts CI on the network, breaking the offline-first invariant the whole monorepo holds to. It builds what the market already gives away.
  • (b) Ship the contract + one real offline connector; reserve the OSS/vendor connectors behind the seam (chosen). A Connector interface is the contract; a real LocalFilesConnector proves the contract end-to-end with no network; the three commodity connectors (Firecrawl / Unstructured / Microsoft 365) are reserved stubs, each the only place that would import its vendor SDK, each documenting its assembly + incremental/delta strategy. "Assemble, don't build" is encoded in the shape of the package, not just asserted in prose. This is the same seam-with-mock discipline already proven for the live ClaudeClient (Extraction runtime architecture — the moat), the Embedder (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB), and the AgentSdkOrchestrator (Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system).

2. Who owns the input contract — ingestion vs. extraction vs. a shared types package.

  • (a) Extraction owns it. Keep Source / Provenance / SourceSpan defined in @dossier/extraction and have ingestion depend on extraction. But ingestion is the producer of these shapes and extraction is the consumer — making the consumer own the contract inverts the dependency arrow and forces ingestion to depend on the moat just to know what it emits.
  • (b) A third shared @dossier/contracts types package. Clean in theory, but it adds a package whose only content is a handful of types, and it splits the input contract away from the one package whose entire job is producing it. More boundary, less cohesion.
  • (c) Ingestion owns it (chosen). The producer owns the contract. Source / Provenance / SourceSpan are canonically defined in @dossier/ingestion; @dossier/extraction re-exports them. This makes @dossier/ingestion a second leaf (depends on nothing internal, like the @dossier/okf keystone), @dossier/extraction depends on okf + ingestion, and there is no cycle. The input contract is a pure, portable shape owned by the layer that creates it.

Decision

1. Ingestion owns the input contract; it is a second leaf. Source / Provenance / SourceSpan are canonically defined in @dossier/ingestion (the producer/owner of the input contract per Extraction runtime architecture — the moat); @dossier/extraction re-exports them so its public surface is unchanged. The arrows: @dossier/ingestion depends on nothing internal (a second leaf alongside the @dossier/okf keystone); @dossier/extraction depends on okf + ingestion; no cycle. The input contract is a pure, portable shape — the producer owns it, the consumer re-exports it.
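The dependency arrows described above can be checked mechanically. The following is a minimal sketch, assuming only the three edges this decision states (other packages are left out because their edges are not described here); `findCycle` and `leaves` are hypothetical helper names, not part of any @dossier package.

```typescript
// Internal dependency edges as stated in this decision.
type Graph = Record<string, string[]>;

const internalDeps: Graph = {
  "@dossier/okf": [],
  "@dossier/ingestion": [],
  "@dossier/extraction": ["@dossier/okf", "@dossier/ingestion"],
};

// Depth-first search that returns the first cycle found, or null if the graph is acyclic.
function findCycle(graph: Graph): string[] | null {
  const state = new Map<string, "visiting" | "done">();
  const stack: string[] = [];

  const visit = (node: string): string[] | null => {
    if (state.get(node) === "done") return null;
    if (state.get(node) === "visiting") {
      // The node is already on the current path, so the path from it back to itself is a cycle.
      return [...stack.slice(stack.indexOf(node)), node];
    }
    state.set(node, "visiting");
    stack.push(node);
    for (const dep of graph[node] ?? []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    stack.pop();
    state.set(node, "done");
    return null;
  };

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}

// Leaves are packages with no internal dependencies.
function leaves(graph: Graph): string[] {
  return Object.keys(graph).filter((name) => (graph[name] ?? []).length === 0);
}
```

If ingestion ever imported from extraction, `findCycle` would report the loop, which is the regression the "second leaf" rule is meant to prevent.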

2. A Connector seam + one real offline connector; OSS/vendor connectors reserved. A Connector interface (ingest(): AsyncIterable<Source>) is the contract every source binds to. The real, shippable-now connector is LocalFilesConnector: it turns a directory into normalized markdown with stamped provenance, supports include/exclude globs, confines reads to the source directory (rejecting .. traversal and symlink escape), and skips binary files while flagging them. It is usable today with no network. A DocumentParser seam decouples format handling (the built-in TextMarkdownParser now; Unstructured reserved for docx/pdf). The three commodity connectors are reserved stubs, each the only place that would import its vendor SDK, each documenting its assembly + incremental/delta strategy:

  • FirecrawlConnector — web crawl.
  • UnstructuredConnector — files → markdown (docx/pdf, via the DocumentParser seam).
  • Microsoft365Connector — SharePoint / M365 via Microsoft Graph delta query + Copilot connectors.

A SyncCursor / IncrementalConnector type reserves live/delta sync so incremental ingestion has a home in the contract before any connector implements it.
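To make the seam concrete, here is a minimal sketch of a Connector with one offline implementation and one reserved stub. Only the `ingest(): AsyncIterable<Source>` signature comes from this decision. The field names on `Source` and `Provenance`, the `normalize` helper, and the `InMemoryConnector` and `ReservedConnector` classes are assumptions made for illustration; the real LocalFilesConnector and vendor stubs may look different.

```typescript
// Field names are hypothetical; the decision names these types but not their fields.
interface Provenance {
  source: string; // file path or URI the content came from
}

interface Source {
  markdown: string;
  provenance: Provenance;
}

// The seam as stated in the decision: one method returning an async stream of Source.
interface Connector {
  ingest(): AsyncIterable<Source>;
}

// Hypothetical normalization: unify line endings and drop trailing whitespace.
function normalize(text: string): string {
  return text.replace(/\r\n/g, "\n").replace(/[ \t]+$/gm, "");
}

// Offline stand-in that plays the role of a real connector without touching disk or network.
class InMemoryConnector implements Connector {
  private readonly docs: ReadonlyArray<{ uri: string; text: string }>;

  constructor(docs: ReadonlyArray<{ uri: string; text: string }>) {
    this.docs = docs;
  }

  async *ingest(): AsyncIterable<Source> {
    for (const doc of this.docs) {
      yield { markdown: normalize(doc.text), provenance: { source: doc.uri } };
    }
  }
}

// Reserved stub: satisfies the seam, imports no vendor SDK, and fails loudly if used.
class ReservedConnector implements Connector {
  private readonly vendor: string;

  constructor(vendor: string) {
    this.vendor = vendor;
  }

  ingest(): AsyncIterable<Source> {
    throw new Error(`${this.vendor} connector is reserved and not yet assembled`);
  }
}

// Callers depend only on the Connector interface, never on a concrete class.
async function collect(connector: Connector): Promise<Source[]> {
  const sources: Source[] = [];
  for await (const source of connector.ingest()) {
    sources.push(source);
  }
  return sources;
}
```

Because `collect` sees only the interface, replacing the in-memory connector with a vendor-backed one would not change the calling code, which is the property the seam is meant to guarantee.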

3. Provenance from atom zero. Every ingested Source carries provenance (the file/URI as source); it flows through extraction onto every emitted atom. The auditable-memory discipline (Adopt OKF as Dossier's canonical knowledge format / Dossier — The Knowledge Model (v0) principle 8 — provenance travels with each atom) therefore starts at ingestion, not after extraction.
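As an illustration of provenance flowing from a Source onto every atom, the sketch below splits markdown into paragraphs and stamps each resulting atom with the source's provenance and a span. The `Atom` shape, the character-offset fields on `SourceSpan`, and the paragraph-splitting `extractAtoms` function are all assumptions; real extraction is far richer, and this only shows the stamping discipline.

```typescript
// Hypothetical field names; the decision names the types but not their fields.
interface Provenance {
  source: string; // file path or URI stamped at ingestion
}

interface SourceSpan {
  start: number; // character offset into the source markdown (inclusive)
  end: number; // character offset into the source markdown (exclusive)
}

interface Source {
  markdown: string;
  provenance: Provenance;
}

interface Atom {
  text: string;
  provenance: Provenance;
  span: SourceSpan;
}

// Stand-in for extraction: one atom per paragraph, each carrying the source's provenance.
function extractAtoms(source: Source): Atom[] {
  const atoms: Atom[] = [];
  const paragraph = /[^\n]+(?:\n[^\n]+)*/g; // a run of consecutive non-blank lines
  let match: RegExpExecArray | null;
  while ((match = paragraph.exec(source.markdown)) !== null) {
    const text = match[0] ?? "";
    atoms.push({
      text,
      provenance: source.provenance, // stamped once at ingestion, copied onto every atom
      span: { start: match.index, end: match.index + text.length },
    });
  }
  return atoms;
}
```

Any atom can then be traced back with `source.markdown.slice(atom.span.start, atom.span.end)`, which returns the atom's text under these assumptions.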

4. Wired into the loop. runLoop's ingest stage now reads real files from sourceDir via LocalFilesConnector (it was a pass-through). The per-tenant loop (Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system) now ingests real files end-to-end: ingest → extract → emit → serve.
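The change to the ingest stage can be pictured as swapping one stage function for another while the rest of the pipeline stays untouched. This is a rough sketch only: runLoop's real signature is not given in this decision, so `LoopState`, `Stage`, `runStages`, and the stand-in `extract`, `emit`, and `serve` stages are hypothetical.

```typescript
interface Source {
  markdown: string;
  provenance: { source: string };
}

interface Connector {
  ingest(): AsyncIterable<Source>;
}

// Hypothetical loop state and stage shape.
interface LoopState {
  sources: Source[];
  atoms: string[];
  emitted: number;
  served: boolean;
}

type Stage = (state: LoopState) => Promise<LoopState>;

// Before: the ingest stage forwarded its input untouched.
const passThroughIngest: Stage = async (state) => state;

// After: the ingest stage drains a connector into the loop state.
function connectorIngest(connector: Connector): Stage {
  return async (state) => {
    const sources: Source[] = [];
    for await (const source of connector.ingest()) sources.push(source);
    return { ...state, sources };
  };
}

// Stand-ins for the downstream stages; they only show how data moves along.
const extract: Stage = async (state) => ({
  ...state,
  atoms: state.sources.map((s) => `${s.provenance.source}: ${s.markdown.trim()}`),
});
const emit: Stage = async (state) => ({ ...state, emitted: state.atoms.length });
const serve: Stage = async (state) => ({ ...state, served: state.emitted > 0 });

// Run the stages in order: ingest → extract → emit → serve.
async function runStages(stages: Stage[], initial: LoopState): Promise<LoopState> {
  let state = initial;
  for (const stage of stages) state = await stage(state);
  return state;
}
```

With `passThroughIngest` and an empty initial state nothing reaches the later stages; with `connectorIngest(...)` the same downstream stages receive real sources.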

Rationale

  • "Assemble, don't build" is encoded in the package shape, not just asserted. Extraction runtime architecture — the moat fixed ingestion as commodity. Reserving the three vendor connectors behind a Connector seam, each the single import site for its SDK, means the team's build budget stays on the moat, CI stays fully offline, and swapping or adding a source never touches the contract or the loop. It is the same seam discipline that already paid off for ClaudeClient, Embedder, and AgentSdkOrchestrator.
  • The producer owns the contract — so the arrows stay clean. Defining Source / Provenance / SourceSpan in @dossier/ingestion and re-exporting from @dossier/extraction makes ingestion a second leaf and keeps the dependency graph acyclic (okf and ingestion are leaves; extraction depends on both). The consumer never owns the producer's contract, and the input shape stays pure and portable.
  • Provenance from atom zero makes auditable memory true at the front door. Adopt OKF as Dossier's canonical knowledge format and Dossier — The Knowledge Model (v0) require provenance to travel with each atom; stamping it at ingestion (the file/URI as source) and flowing it through extraction means every served atom is traceable to where it came in — the sovereignty/audit promise of Dossier — Mission & North Star holds from the first byte, not retroactively.
  • One real offline connector proves the contract; the loop now starts from real files. LocalFilesConnector exercises the whole Connector → Source → extraction path with no network, and wiring it into runLoop turns the loop's first stage from a pass-through into real ingestion, so the end-to-end flow is demonstrated, not diagrammed.
  • Confidence: asserted, not verified. The package is built and green offline (@dossier/ingestion 19 tests; repo-wide 268 passed / 1 gated-skip; all 6 packages build), with the loop ingesting real files end-to-end. But the three vendor connectors are unbuilt, the input contract has only been exercised against local files, and incremental/delta sync is reserved, not run. This is design-level conviction backed by an offline single-connector run, not field evidence against a real client source.
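The two guards that the Decision attributes to LocalFilesConnector (confining reads against .. traversal and skipping binary files with a flag) can be sketched as pure functions. Everything here is an assumption about how such guards might look: `confine`, `looksBinary`, `decide`, and the `Decision` type are hypothetical names, paths are treated as POSIX-style strings, and the NUL-byte check is a common heuristic rather than the connector's documented method. Symlink escape cannot be detected lexically; that would need the real path resolved on disk, which this sketch deliberately leaves out.

```typescript
// Resolve a POSIX-style relative path against a root; return null if it would escape the root.
function confine(root: string, relative: string): string | null {
  if (relative.charAt(0) === "/") return null; // absolute paths bypass the root
  const parts: string[] = [];
  for (const segment of relative.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      if (parts.length === 0) return null; // would climb above the root
      parts.pop();
    } else {
      parts.push(segment);
    }
  }
  return [root.replace(/\/+$/, ""), ...parts].join("/");
}

// Heuristic: treat content as binary if a NUL byte appears in the first sampleSize bytes.
function looksBinary(bytes: Uint8Array, sampleSize: number = 8000): boolean {
  const end = Math.min(bytes.length, sampleSize);
  for (let i = 0; i < end; i++) {
    if (bytes[i] === 0) return true;
  }
  return false;
}

// A file is either ingested or skipped with a reason, so skips are flagged rather than silent.
type Decision =
  | { kind: "ingest"; path: string }
  | { kind: "skipped"; path: string; reason: "escapes-root" | "binary" };

function decide(root: string, relative: string, bytes: Uint8Array): Decision {
  const resolved = confine(root, relative);
  if (resolved === null) return { kind: "skipped", path: relative, reason: "escapes-root" };
  if (looksBinary(bytes)) return { kind: "skipped", path: resolved, reason: "binary" };
  return { kind: "ingest", path: resolved };
}
```

Returning a tagged result instead of silently dropping a file keeps the "skipped + flagged" behaviour visible to whatever reports on an ingestion run.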

Consequences

  • The platform's data flow is complete end-to-end. The loop now runs from raw client files all the way to a served, agent-queryable OKF KB (ingest → extract → emit → serve). With this, every reserved package from Extraction runtime architecture — the moat is built: okf, extraction, ingestion, mcp, runtime. The learning loop is a full data pipeline, not a partial one.
  • The connector seam is a standing reservation. When a real client source needs it, the live FirecrawlConnector / UnstructuredConnector / Microsoft365Connector drop into the existing Connector seam — each the single import site for its vendor SDK, with no change to the input contract, the loop, or the offline-CI invariant for the built core.
  • The input contract is now load-bearing and owned by ingestion. Source / Provenance / SourceSpan live in @dossier/ingestion and are re-exported by @dossier/extraction. Any change to these shapes is a change to the producer's contract that ripples to every consumer — it must stay a pure, portable shape, and ingestion must stay a leaf (no internal dependency) or the acyclic graph regresses.
  • Provenance is enforced from ingestion onward. Every ingested Source is stamped; new connectors must stamp provenance at ingest or they regress the auditable-memory guarantee (Adopt OKF as Dossier's canonical knowledge format / Dossier — The Knowledge Model (v0) principle 8).
  • The remaining deferred work is the network/vendor backends behind established seams: the three OSS/vendor connectors, the live Embedder (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB), and the AgentSdkOrchestrator (Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system), plus Claude Code plugin packaging for distribution. No new architecture is owed; the seams are in place.
  • Two-way vs. durable. Connector internals, parser choices, and normalization details are expected to evolve (two-way door). The Source / Provenance / SourceSpan input contract and the Connector seam are the durable, harder-to-reverse commitments.

Review

Wire the OSS/vendor connectors (Firecrawl / Unstructured / Microsoft 365) through the reserved seam when a real client source needs them; at that point, exercise the input contract against non-file sources and consider promoting confidence to verified. Also revisit the contract if a connector needs richer provenance or spans than Source / Provenance / SourceSpan currently carry (e.g. page/section anchors from PDFs, M365 delta tokens): widening the contract is the trigger to re-examine this decision.