Extraction runtime architecture — the moat

0008-extraction-runtime-architecture

decision read as Explain confidence verified status active 2026-06-14 owner principal-architect

Reversibility: one-way door

DEC-0008 — Extraction runtime architecture (the moat)

Reversibility: one-way door on the durable parts — repo topology, package boundaries, and the OKF schema-as-code keystone; two-way door on the pipeline internals, model ids, and eval shape.

Context

This is Dossier's first real product code. The repo is greenfield for code — only knowledge/ and .claude/ exist; no package.json, no src. Ingestion is commodity (assemble), but extraction → OKF is the moat (build): turning clean source content into atomic OKF concepts that conform to Dossier — The Knowledge Model (v0) and the typed edges that form the graph. We must set repo topology and package boundaries now so we don't repaint later, and pin the OKF model as code (the shared contract between knowledge-architect, who owns the model, and extraction-engineer, who targets it). Constraints already fixed: OKF is the system of record (Adopt OKF as Dossier's canonical knowledge format); Claude-primitives-first (Claude-primitives-first build strategy); TypeScript end-to-end on VoidZero/Vite (Vite 8 / Rolldown / Oxc / Vitest); GraphRAG over vector-only; multi-tenancy starts siloed per client (one client = one OKF git repo, single source of truth).

Options considered

1. Repo topology.

(a) Single package — fast to start, but the OKF schema, extraction, future ingestion, and MCP all collapse into one boundary; vendor/cache code leaks into the sovereign-data contract; no clean dependency arrows.
(b) pnpm monorepo with workspaces — explicit package boundaries, one shared keystone every package depends on, reserved names for not-yet-built layers, isolated builds/tests.
(c) Polyrepo — premature; cross-cutting refactors on a v0 schema become multi-repo choreography.

2. Claude integration for the typed extraction transform.

(a) Anthropic TS SDK with forced tool-use / structured outputs for each typed extraction call (schema-constrained JSON, validated not parsed-and-hoped).
(b) Claude Agent SDK driving the whole transform conversationally.
(c) Bespoke prompt-and-parse — rejected (violates DEC-0004 and the faithfulness bar).

3. Pipeline shape. A single mega-prompt ("here's a doc, emit OKF") vs a typed staged data-flow (load → segment → extract → validate → resolve/dedup → link → serialize), each stage with named input/output types and provenance threaded through.

4. OKF schema as code. Hand-rolled TS interfaces + ad-hoc YAML I/O vs a Zod schema mirroring Dossier — The Knowledge Model (v0) (base frontmatter + per-type fields + edge vocabulary) exposed as parse / serialize / validate, with types inferred from the schema (one source of truth).

Decision

1. Topology — pnpm monorepo. Four packages, dependency arrows pointing only inward to the keystone:

Package	Role	Build now?
`@dossier/okf`	Keystone. OKF schema-as-code — Zod schema mirroring Dossier — The Knowledge Model (v0) (base frontmatter + per-type fields + edge vocabulary), `parse(md)→Atom`, `serialize(Atom)→md`, `validate(Atom)→Result`, edge/graph helpers. Zero Claude/network deps. Every other package depends on it.	Yes
`@dossier/extraction`	The moat. The staged extraction pipeline (clean markdown + provenance → atomic OKF atoms + typed edges), the Claude client wrapper (forced-tool-use JSON), entity resolution/dedup, linking, and the eval harness. Depends on `@dossier/okf`.	Yes
`@dossier/ingestion`	Commodity connectors → clean markdown + provenance. Reserved — name + dir + README stub only.	No (reserve)
`@dossier/mcp`	Tenant-scoped MCP server exposing a client's OKF KB + GraphRAG to downstream agents. Reserved — name + dir + README stub only.	No (reserve)

Plus an internal @dossier/runtime slot is reserved for the Agent SDK orchestration (platform-engineer's domain, DEC-0004) — not built here; named so the orchestration story has a home and doesn't get crammed into @dossier/extraction.

2. Claude integration — split by responsibility, reconciling DEC-0004.

The typed extraction transform uses the Anthropic TypeScript SDK directly with forced tool use (a single tool whose input_schema is the JSON Schema generated from the @dossier/okf Zod schema; tool_choice forces it). This makes each segment → candidate atoms call schema-constrained and deterministic to validate. This lives in @dossier/extraction.
The runtime orchestration (resumable per-client ingest → extract → OKF, memory tool, context editing, retries) is the Claude Agent SDK story and belongs to the reserved @dossier/runtime (platform-engineer, DEC-0004). The extraction package exposes a clean async pipeline API the runtime calls — it does not own orchestration. Reconciliation: Agent SDK orchestrates the loop; the SDK's structured/tool-use calls do the typed extraction. Same primitive family, two layers.
Default model ids (June 2026 lineup): extraction transform defaults to claude-sonnet-4-6 (the workhorse for high-volume typed extraction); escalate to claude-opus-4-8 for hard/low-confidence segments and judgment (decision) extraction; claude-haiku-4-5-20251001 for cheap mechanical passes (segmentation hints, dedup candidate scoring). Model id is config, never hard-coded.
Prompt caching: cache the stable prefix — the system spec, the OKF tool schema, the model/extraction instructions, and the few-shot exemplars (the DXA atoms are the gold set). The per-segment source content is the only uncached tail. This is where the corpus-scale cost win lives.

3. Pipeline — a typed staged data-flow contract (each stage pure where possible, provenance threaded throughout):

load        Source(md + Provenance)          → CleanDoc
segment     CleanDoc                          → Segment[]      (each w/ source spans)
extract     Segment                           → CandidateAtom[] (typed, source spans, confidence)  [LLM, forced-tool-use]
validate    CandidateAtom                     → Result<Atom, ValidationError>   [@dossier/okf]
resolve     Atom[]                            → ResolvedAtom[]  (entity resolution + dedup; stable ids)
link        ResolvedAtom[]                    → KnowledgeGraph  (typed edges; enforces DEC-0007)
serialize   ResolvedAtom                      → OkfFile (path + md)             [@dossier/okf]
emit        OkfFile[]                         → write to client OKF git repo (siloed, system of record)

The knowledge graph is derived from frontmatter edges and is a replaceable cache (Adopt OKF as Dossier's canonical knowledge format) — never the source of truth. GraphRAG serving is reserved for @dossier/mcp, not built now. The link stage enforces the single-producer / no-orphan-artifact rule from The produces edge is canonical on the producing process only as a hard check.

4. OKF schema as code — @dossier/okf is the keystone. Zod schema mirrors Dossier — The Knowledge Model (v0) exactly: a base-frontmatter schema, a discriminated union on type (role | process | workflow | decision | system | policy | artifact | term | concept + vertical extensions), the edge vocabulary as typed fields, and Atom types inferred from the schema (single source of truth — types never drift from validation). Public surface: parse(md) → Atom, serialize(Atom) → md (round-trip stable, deterministic key order), validate(Atom) → Result, plus toJsonSchema() (feeds the extraction tool) and graph/edge helpers. Extensibility honored: only type is OKF-mandatory; unknown types fall back to concept; verticals register extra types without forking the core.

5. Tenant isolation (note-level — not built here). A per-client run is scoped by an explicit TenantContext (client id, OKF repo path/remote, model config, cache namespace) threaded through every stage — never read from ambient/global state. Output is written only to that client's siloed OKF git repo. This is the seam where the platform-engineer's future provisioning (@dossier/runtime + per-tenant repo/index/MCP) plugs in; extraction must accept the context as an argument, never assume a single tenant.

6. Build/test/CI. Vite 8 library builds per package (Rolldown), Vitest for tests, oxlint (Oxc) for lint, a root tsconfig.base.json with per-package tsconfig.json extending it (project references, strict). Tests run with no live API — the Claude client is an injected interface, mocked in tests against recorded fixtures; a real CLI entrypoint runs against ANTHROPIC_API_KEY when present. The DXA atom set is the first extraction eval's gold standard.

Rationale

Sovereignty & boundaries. Separate packages keep the sovereign-data contract (@dossier/okf) physically isolated from replaceable-cache/vendor code (extraction, future index/MCP). The dependency arrows encode Adopt OKF as Dossier's canonical knowledge format: everything depends on the format; the format depends on nothing.
The keystone protects the moat. One schema-as-code package means knowledge-architect and extraction-engineer share one contract; the extraction tool schema is generated from the validator, so "what we ask Claude for" and "what we accept" can never drift.
Faithfulness by construction. Forced tool use + schema validation makes a malformed atom unrepresentable, not merely caught later — directly serving the extraction-engineer's "a wrong atom is worse than a missing one" bar. Confidence is asserted: design-level conviction, to be moved to verified once the first eval runs against the DXA gold set.
Primitives, reconciled. Honors DEC-0004 without conflating layers: Agent SDK = orchestration (runtime); SDK structured/tool-use = the typed transform (extraction). Naming @dossier/runtime now prevents orchestration from leaking into the moat.
Cost at corpus scale. Caching the spec + OKF tool schema + few-shot, with model-tiering (Haiku/Sonnet/Opus by difficulty), is the difference between a demo and an economic pipeline over a real client's SharePoint.

Consequences

One-way doors: the four package names/boundaries and @dossier/okf as the universal dependency. Changing these later is a repaint — chosen deliberately now. The schema-as-code keystone becomes the single contract all future packages (ingestion, mcp, runtime, verticals) bind to.
Two-way doors: pipeline internals, default model ids, caching specifics, eval shape — expected to evolve as the first real corpus runs.
Downstream ripples: the @dossier/okf schema must encode Knowledge Model v0 and the The produces edge is canonical on the producing process only single-producer/no-orphan rule as validation. The reserved @dossier/ingestion defines the CleanDoc + Provenance input contract extraction consumes. The reserved @dossier/mcp will serve GraphRAG off the derived graph. The reserved @dossier/runtime is where the Agent SDK orchestration + per-tenant provisioning land.
Division of labor is fixed (see the build spec): platform-engineer owns scaffold/configs/build/CI + package skeletons; extraction-engineer owns the @dossier/okf schema-as-code, the pipeline, serialization, and the first eval — sequenced so skeletons land before the schema fills them.

Review

Revisit after the first extraction run against a real (or the DXA gold) corpus: does forced-tool-use + Zod give acceptable faithfulness/coverage, do the model tiers earn their cost, and does the staged contract hold? Promote confidence to verified once the eval passes. Reconsider package boundaries only if multi-tenant scale (DEC-0004's flagged MCP-isolation risk) forces it.

Validated (2026-06-14). Confidence promoted asserted → verified: the review condition is met. The monorepo is scaffolded on the VoidZero stack (pnpm workspaces, Vite 8/Rolldown, Vitest 4, oxlint, TS 6, ESM) with @dossier/okf (keystone) and @dossier/extraction (the moat) built, and @dossier/ingestion/@dossier/mcp/@dossier/runtime reserved — dependency arrows point inward to @dossier/okf only. @dossier/okf round-trips (parse → serialize → parse stable) over the real DXA gold atoms and validateGraph is clean over the full DXA set (DEC-0007 single-producer/no-orphan + dangling-edge checks encoded). The staged pipeline runs end-to-end on a MockClaudeClient; the first offline eval passes green against the DXA gold set (a live eval is scaffolded, gated behind ANTHROPIC_API_KEY). Full verification green with no network: typecheck, lint, test (152 passed / 1 gated-skipped), build. The architecture rationale above is unchanged — only the design-level conviction is now evidence-backed.