Ingestion connector seam — assemble, don't build, and ingestion owns the input contract
0013-ingestion-connector-seam
DEC-0013 — Ingestion connector seam (assemble, don't build; ingestion owns the input contract)
Reversibility: two-way door — on connector internals, parser choices, and normalization details; the Source / Provenance / SourceSpan input contract and the Connector seam are the durable parts.
Context
Extraction runtime architecture — the moat reserved @dossier/ingestion (name + dir + README stub) and fixed its role twice: ingestion is commodity — assemble, don't build ("Ingestion is commodity (assemble), but extraction → OKF is the moat (build)"), and the reserved @dossier/ingestion is the package that "defines the CleanDoc + Provenance input contract extraction consumes." With the moat (Extraction runtime architecture — the moat), the measurement layer (Live extraction eval harness — what we measure is what extraction optimizes for), the agentic foundation (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB), and the orchestration runtime (Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system) all built and green, the missing piece was the front door: the layer that pulls a client's raw knowledge in and hands extraction clean, provenance-stamped markdown. Until this existed, runLoop's ingest stage was a pass-through and the loop could not start from real client files.
This decision builds that reserved package: @dossier/ingestion, the connectors that normalize raw sources to clean markdown + provenance for extraction. Build status, verified this session with no network access anywhere: @dossier/ingestion has 19 tests green; extraction stayed green through the contract swap; runLoop now ingests real files end-to-end; the repo-wide run is 268 passed / 1 gated-skip; all 6 packages build.
Options considered
1. Connectors — build them now vs. assemble behind a seam.
- (a) Build the commodity connectors now (web crawl, files→markdown, SharePoint/M365). Maximum source coverage day one — but it spends the team's scarce build budget on the commodity layer (Extraction runtime architecture — the moat is explicit: ingestion is "assemble," not the moat), pulls three vendor SDKs into the dependency tree, and networks CI — breaking the offline-first invariant the whole monorepo holds to. It builds what the market already gives away.
- (b) Ship the contract + one real offline connector; reserve the OSS/vendor connectors behind the seam (chosen). A Connector interface is the contract; a real LocalFilesConnector proves the contract end-to-end with no network; the three commodity connectors (Firecrawl / Unstructured / Microsoft 365) are reserved stubs, each the only place that would import its vendor SDK, each documenting its assembly + incremental/delta strategy. "Assemble, don't build" is encoded in the shape of the package, not just asserted in prose. This is the same seam-with-mock discipline already proven for the live ClaudeClient (Extraction runtime architecture — the moat), the Embedder (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB), and the AgentSdkOrchestrator (Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system).
2. Who owns the input contract — ingestion vs. extraction vs. a shared types package.
- (a) Extraction owns it. Keep Source / Provenance / SourceSpan defined in @dossier/extraction and have ingestion depend on extraction. But ingestion is the producer of these shapes and extraction is the consumer; making the consumer own the contract inverts the dependency arrow and forces ingestion to depend on the moat just to know what it emits.
- (b) A third shared @dossier/contracts types package. Clean in theory, but it adds a package whose only content is a handful of types, and it splits the input contract away from the one package whose entire job is producing it. More boundary, less cohesion.
- (c) Ingestion owns it (chosen). The producer owns the contract. Source / Provenance / SourceSpan are canonically defined in @dossier/ingestion; @dossier/extraction re-exports them. This makes @dossier/ingestion a second leaf (depends on nothing internal, like the @dossier/okf keystone), @dossier/extraction depends on okf + ingestion, and there is no cycle. The input contract is a pure, portable shape owned by the layer that creates it.
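The dependency claim in option 2(c) (two leaves, no cycle) can be checked mechanically. The following is a minimal sketch, not code from the repo: the `DepGraph` type and the `findCycle` and `leaves` helpers are hypothetical names, and the graph lists only the three packages whose edges this decision states.

```typescript
// Hypothetical map of internal dependencies. Only the edges named in this
// decision are included; the shape and helper names are assumptions.
type DepGraph = Record<string, readonly string[]>;

const chosen: DepGraph = {
  "@dossier/okf": [],
  "@dossier/ingestion": [],
  "@dossier/extraction": ["@dossier/okf", "@dossier/ingestion"],
};

// Depth-first search. Returns the first cycle found as a path of package
// names (first and last entries equal), or null if the graph is acyclic.
function findCycle(graph: DepGraph): string[] | null {
  const state = new Map<string, "visiting" | "done">();
  const stack: string[] = [];

  const visit = (pkg: string): string[] | null => {
    if (state.get(pkg) === "done") return null;
    if (state.get(pkg) === "visiting") {
      return [...stack.slice(stack.indexOf(pkg)), pkg];
    }
    state.set(pkg, "visiting");
    stack.push(pkg);
    for (const dep of graph[pkg] ?? []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    stack.pop();
    state.set(pkg, "done");
    return null;
  };

  for (const pkg of Object.keys(graph)) {
    const cycle = visit(pkg);
    if (cycle) return cycle;
  }
  return null;
}

// A leaf is a package with no internal dependencies.
function leaves(graph: DepGraph): string[] {
  return Object.keys(graph).filter((pkg) => (graph[pkg] ?? []).length === 0);
}
```

Under these assumptions, `findCycle(chosen)` finds no cycle and `leaves(chosen)` lists okf and ingestion. A check along these lines could guard the "ingestion must stay a leaf" constraint in CI, though whether the repo does so is not stated here.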
Decision
1. Ingestion owns the input contract; it is a second leaf. Source / Provenance / SourceSpan are canonically defined in @dossier/ingestion (the producer/owner of the input contract per Extraction runtime architecture — the moat); @dossier/extraction re-exports them so its public surface is unchanged. The arrows: @dossier/ingestion depends on nothing internal (a second leaf alongside the @dossier/okf keystone); @dossier/extraction depends on okf + ingestion; no cycle. The input contract is a pure, portable shape — the producer owns it, the consumer re-exports it.
2. A Connector seam + one real offline connector; OSS/vendor connectors reserved. A Connector interface (ingest(): AsyncIterable<Source>) is the contract every source binds to. The real, shippable-now connector is LocalFilesConnector: a directory → normalized markdown + stamped provenance, with include/exclude globs, read confined against .. / symlink escape, and binary files skipped + flagged. It is usable today with no network. A DocumentParser seam decouples format handling (built-in TextMarkdownParser now; Unstructured reserved for docx/pdf). The three commodity connectors are reserved stubs, each the only place that would import its vendor SDK, each documenting its assembly + incremental/delta strategy:
- FirecrawlConnector: web crawl.
- UnstructuredConnector: files → markdown (docx/pdf, via the DocumentParser seam).
- Microsoft365Connector: SharePoint / M365 via Microsoft Graph delta query + Copilot connectors.
A SyncCursor / IncrementalConnector type reserves live/delta sync so incremental ingestion has a home in the contract before any connector implements it.
3. Provenance from atom zero. Every ingested Source carries provenance (the file/URI as source); it flows through extraction onto every emitted atom. The auditable-memory discipline (Adopt OKF as Dossier's canonical knowledge format / Dossier — The Knowledge Model (v0) principle 8 — provenance travels with each atom) therefore starts at ingestion, not after extraction.
4. Wired into the loop. runLoop's ingest stage now reads real files from sourceDir via LocalFilesConnector (it was a pass-through). The per-tenant loop (Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system) now ingests real files end-to-end: ingest → extract → emit → serve.
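Because the ingest stage now reads real files from sourceDir, the read confinement named in decision 2 is what keeps a file path inside that directory. The sketch below shows only the lexical half of such a check, under stated assumptions: `confineRelativePath` is a hypothetical name, paths are POSIX-style and relative to sourceDir, and symlink escape is not covered. A real connector would also need to resolve the real path on disk (for example with Node's `fs.realpath`) and re-check the result.

```typescript
// Lexical confinement check (assumption: POSIX-style paths relative to
// sourceDir). Resolves "." and ".." segments and rejects any path that
// would climb above the root. Returns the normalized relative path, or
// null if the path escapes. Symlinks are NOT handled here.
function confineRelativePath(relPath: string): string | null {
  // An absolute path is never treated as inside the root.
  if (relPath.startsWith("/")) return null;

  const kept: string[] = [];
  for (const segment of relPath.split("/")) {
    if (segment === "" || segment === ".") continue; // skip empty and "."
    if (segment === "..") {
      if (kept.length === 0) return null; // would climb above sourceDir
      kept.pop();
      continue;
    }
    kept.push(segment);
  }
  return kept.join("/");
}
```

In this sketch a path such as `docs/../a.md` normalizes to `a.md`, while `../secret` and `docs/../../etc/passwd` are rejected.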
Rationale
- "Assemble, don't build" is encoded in the package shape, not just asserted. Extraction runtime architecture — the moat fixed ingestion as commodity. Reserving the three vendor connectors behind a Connector seam, each the single import site for its SDK, means the team's build budget stays on the moat, CI stays fully offline, and swapping or adding a source never touches the contract or the loop. It is the same seam discipline that already paid off for ClaudeClient, Embedder, and AgentSdkOrchestrator.
- The producer owns the contract, so the arrows stay clean. Defining Source / Provenance / SourceSpan in @dossier/ingestion and re-exporting from @dossier/extraction makes ingestion a second leaf and keeps the dependency graph acyclic (okf and ingestion are leaves; extraction depends on both). The consumer never owns the producer's contract, and the input shape stays pure and portable.
- Provenance from atom zero makes auditable memory true at the front door. Adopt OKF as Dossier's canonical knowledge format and Dossier — The Knowledge Model (v0) require provenance to travel with each atom; stamping it at ingestion (the file/URI as source) and flowing it through extraction means every served atom is traceable to where it came in. The sovereignty/audit promise of Dossier — Mission & North Star holds from the first byte, not retroactively.
- One real offline connector proves the contract; the loop now starts from real files. LocalFilesConnector exercises the whole Connector → Source → extraction path with no network, and wiring it into runLoop turns the loop's first stage from a pass-through into real ingestion, so the end-to-end flow is demonstrated, not diagrammed.
- Confidence: asserted, not verified. Built and green offline (@dossier/ingestion 19 tests; repo-wide 268 passed / 1 gated-skip; all 6 packages build), with the loop ingesting real files end-to-end. But the three vendor connectors are unbuilt, the input contract has only been exercised against local files, and incremental/delta sync is reserved, not run. This is design-level conviction backed by an offline single-connector run, not field evidence against a real client source.
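To make "provenance from atom zero" concrete, here is a small sketch of provenance being copied from an ingested Source onto every atom derived from it. Everything in it is an assumption for illustration: the field names on `Source`, `Provenance`, and `SourceSpan`, the `Atom` shape, and the line-per-atom `extractAtoms` stand-in bear no relation to the real extraction runtime.

```typescript
// Hypothetical shapes. Field names are illustrative, not the real contract.
interface Provenance {
  source: string; // file path or URI the content came from
}
interface SourceSpan {
  start: number; // character offset into the markdown (inclusive)
  end: number; // character offset into the markdown (exclusive)
}
interface Source {
  markdown: string;
  provenance: Provenance;
}
interface Atom {
  text: string;
  provenance: Provenance;
  span: SourceSpan;
}

// Stand-in for extraction: one atom per non-empty line. The point is only
// that each atom carries the provenance stamped at ingestion, plus a span
// locating it in the source markdown.
function extractAtoms(source: Source): Atom[] {
  const atoms: Atom[] = [];
  let offset = 0;
  for (const line of source.markdown.split("\n")) {
    const text = line.trim();
    if (text.length > 0) {
      const start = offset + line.indexOf(text);
      atoms.push({
        text,
        provenance: source.provenance,
        span: { start, end: start + text.length },
      });
    }
    offset += line.length + 1; // +1 for the "\n" removed by split
  }
  return atoms;
}
```

With this shape, slicing the source markdown by an atom's span returns exactly that atom's text, which is the kind of traceability the audit promise relies on.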
Consequences
- The platform's data flow is complete end-to-end. The loop now runs from raw client files all the way to a served, agent-queryable OKF KB (ingest → extract → emit → serve). With this, every reserved package from Extraction runtime architecture — the moat is built: okf, extraction, ingestion, mcp, runtime. The learning loop is a full data pipeline, not a partial one.
- The connector seam is a standing reservation. When a real client source needs it, the live FirecrawlConnector / UnstructuredConnector / Microsoft365Connector drop into the existing Connector seam, each the single import site for its vendor SDK, with no change to the input contract, the loop, or the offline-CI invariant for the built core.
- The input contract is now load-bearing and owned by ingestion. Source / Provenance / SourceSpan live in @dossier/ingestion and are re-exported by @dossier/extraction. Any change to these shapes is a change to the producer's contract that ripples to every consumer: it must stay a pure, portable shape, and ingestion must stay a leaf (no internal dependency) or the acyclic graph regresses.
- Provenance is enforced from ingestion onward. Every ingested Source is stamped; new connectors must stamp provenance at ingest or they regress the auditable-memory guarantee (Adopt OKF as Dossier's canonical knowledge format / Dossier — The Knowledge Model (v0) principle 8).
- The remaining deferred work is the network/vendor backends behind established seams: the three OSS connectors, the live Embedder (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB), and the AgentSdkOrchestrator (Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system), plus Claude Code plugin packaging for distribution. No new architecture is owed; the seams are in place.
- Two-way vs. durable. Connector internals, parser choices, and normalization details are expected to evolve (two-way door). The Source / Provenance / SourceSpan input contract and the Connector seam are the durable, harder-to-reverse commitments.
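The requirement that new connectors stamp provenance at ingest could be expressed as a small conformance check that any connector bound to the seam must pass. The sketch below assumes the `ingest(): AsyncIterable<Source>` signature stated in the decision; the `Source` and `Provenance` fields, the `InMemoryConnector` class, and the `findUnstamped` helper are hypothetical and stand in for whatever the package actually defines.

```typescript
// Hypothetical shapes. Only ingest(): AsyncIterable<Source> comes from the
// decision text; the fields below are assumptions.
interface Provenance {
  source: string; // file path or URI
}
interface Source {
  markdown: string;
  provenance: Provenance;
}
interface Connector {
  ingest(): AsyncIterable<Source>;
}

// Illustrative offline connector backed by an in-memory list of documents.
class InMemoryConnector implements Connector {
  private readonly docs: ReadonlyArray<{ uri: string; markdown: string }>;

  constructor(docs: ReadonlyArray<{ uri: string; markdown: string }>) {
    this.docs = docs;
  }

  async *ingest(): AsyncIterable<Source> {
    for (const doc of this.docs) {
      // Provenance is stamped at ingest, before anything downstream runs.
      yield { markdown: doc.markdown, provenance: { source: doc.uri } };
    }
  }
}

// Conformance check: returns the indexes of emitted Sources whose
// provenance is missing or blank. An empty result means the connector
// stamped every Source it produced.
async function findUnstamped(connector: Connector): Promise<number[]> {
  const unstamped: number[] = [];
  let index = 0;
  for await (const source of connector.ingest()) {
    const stamp = source.provenance?.source;
    if (typeof stamp !== "string" || stamp.trim() === "") {
      unstamped.push(index);
    }
    index++;
  }
  return unstamped;
}
```

Running such a check against each new connector in its own test suite is one way the guarantee could be kept from regressing; it needs no network as long as the connector under test can be pointed at fixtures.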
Review
Wire the OSS connectors (Firecrawl / Unstructured / Microsoft 365) through the reserved seam when a real client source needs them — at which point exercise the input contract against non-file sources and consider promoting confidence to verified. And revisit the contract if a connector needs richer provenance or spans than Source / Provenance / SourceSpan currently carry (e.g. page/section anchors from PDFs, M365 delta tokens) — widening the contract is the trigger to re-examine this decision.