Codebase ingestion as the 4th connector — a three-layer deterministic code-graph substrate + git-mined "why", gated on a de-risk spike and dogfooded on this repo first

0040-codebase-ingestion-three-layer

Type: decision; Read as: Explain; Confidence: asserted; Status: active; Date: 2026-06-16; Owner: principal-architect

Reversibility: two-way door

DEC-0040 — Codebase ingestion (three-layer code-graph substrate + git-mined "why")

Context

The decision "Ingestion connector seam — assemble, don't build, and ingestion owns the input contract" shipped the ingestion front door with a Connector seam and reserved the three commodity connectors (Firecrawl web / Unstructured files / Microsoft365 SharePoint). This decision adds a fourth connector, codebase ingestion, and unlike the other three it is not commodity assembly. A client's codebase is the densest, least-documented, highest-decay reservoir of the tacit "why this, what we'd never do again" that Dossier — The Knowledge Model (v0) (principle 7) calls the IP that walks out the door, and mining decision atoms from a client's own git history is the most on-thesis output Dossier can produce (Dossier — Mission & North Star). For the DXA go-to-market the win is amplified: agencies hold heterogeneous client codebases that no code-search tool turns into owned institutional memory.

This was produced by a /deep-research workflow (run wf_011bb749-351, 2026-06-16, 22 sources / 25 adversarially-verified claims), pressure-tested by the Product Owner and Principal Platform Architect subagents, and ratified by the user on 2026-06-16. It is recorded as a full decision because it is roadmap-defining and future readers will ask three things: why a graph instead of flattened markdown? why does the LLM never author the structural graph? why gate the build behind a spike?

Options considered

How code becomes knowledge:

Flatten code to markdown and run the existing extraction pipeline unchanged — reuse the Source/markdown contract (Ingestion connector seam — assemble, don't build, and ingestion owns the input contract) verbatim. Rejected: the contract is structurally insufficient for code — the heading-splitter in extraction's segment stage would shatter source files into junk segments, losing the call/import structure that is the signal.
Let an LLM read the repo and extract a knowledge graph directly (LLM-authored structural graph). Rejected: per the June-2026 survey, deterministic AST-derived graphs beat LLM-extracted knowledge graphs on every axis — 40–70× faster, 9–21× cheaper, more complete (LLM extraction silently drops files), at equal-or-better correctness. Letting the model author structure is slower, costlier, and less complete.
(chosen) A deterministic code-graph substrate + a graph-native segmenter that reuses extraction's back half, leading with the git-mined "why". Code enters the pipeline as a graph; a deterministic substrate builds it with zero LLM; the LLM is used only where judgment is irreducible (explaining rationale), and every such output must clear the faithfulness floor of the "Live extraction eval harness — what we measure is what extraction optimizes for".

Substrate shape: a single universal extractor vs. (chosen) three layers — universal substrate / language packs / platform overlays — so "treats every codebase the same" is true at the taxonomy level while per-language and per-framework specificity lives in swappable, registry-driven data.

Code-graph storage: a graph database as a hard dependency now (Neo4j/Kuzu) vs. (chosen) an embedded-first derived cache behind a GraphStore seam, evidence-gated exactly like the vector backend (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB / First full-loop SERVE on a real external client — reconcile divergent extraction runs to one canonical KB on a quality rubric; lexical retrieval sufficient (VectorRetriever seam not yet needed)).
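
A minimal sketch of what a GraphStore seam with an embedded-first backend could look like. The decision only names the seam; the method names (upsert_node, upsert_edge, neighbors), the in-memory backend, and the index_imports helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol, Set, Tuple


class GraphStore(Protocol):
    """Hypothetical seam: callers depend on this interface, never on a backend."""

    def upsert_node(self, node_id: str, kind: str) -> None: ...

    def upsert_edge(self, src: str, kind: str, dst: str) -> None: ...

    def neighbors(self, node_id: str, kind: Optional[str] = None) -> List[str]: ...


@dataclass
class InMemoryGraphStore:
    """Embedded-first backend: a derived cache that can be rebuilt from the repo."""

    nodes: Dict[str, str] = field(default_factory=dict)
    edges: Set[Tuple[str, str, str]] = field(default_factory=set)

    def upsert_node(self, node_id: str, kind: str) -> None:
        self.nodes[node_id] = kind

    def upsert_edge(self, src: str, kind: str, dst: str) -> None:
        # Reject edges whose endpoints were never registered as nodes.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError(f"unknown endpoint in edge {src} -{kind}-> {dst}")
        self.edges.add((src, kind, dst))

    def neighbors(self, node_id: str, kind: Optional[str] = None) -> List[str]:
        # Outgoing neighbors, optionally filtered by edge kind, in stable order.
        return sorted(
            dst for src, k, dst in self.edges
            if src == node_id and (kind is None or k == kind)
        )


def index_imports(store: GraphStore, imports: Dict[str, List[str]]) -> None:
    """Load a file -> imported-files map through the seam only."""
    for src, targets in imports.items():
        store.upsert_node(src, "file")
        for dst in targets:
            store.upsert_node(dst, "file")
            store.upsert_edge(src, "imports", dst)
```

Because index_imports touches only the GraphStore interface, a later Neo4j or Kuzu backend could be swapped in without changing callers, which is the point of deferring that choice behind the seam.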

Sequencing: build the substrate first vs. (chosen) gate the entire build behind a cheap de-risk "why" spike (go/no-go) that tests the value thesis independently of the tree-sitter substrate.

Decision

Add codebase ingestion as the fourth connector, architected as three layers, and gate the build on a de-risk spike — dogfooded on this repo first.

The three layers

Universal deterministic substrate. A tree-sitter symbol graph (nodes: repo / dir / file / symbol; edges: contains / imports / calls / references) plus a git-history graph (nodes: commit / author / PR / issue / release; edges: authored_by / touches / merges / closes / co_changed). Zero LLM, zero network, fully offline. It treats every codebase identically at the node/edge taxonomy level, and the node/edge kinds are a CLOSED set.
Language packs. Per-language tree-sitter tag-query (.scm) + schema data that emit into the closed structural edge kinds. This is the unit of incremental language coverage — substrate-internal DATA, not a fork. "Treats every codebase the same" holds at the taxonomy level because a per-language query pack sits underneath; adding a language is adding a pack, never touching the substrate's closed taxonomy.
Platform overlays (DEFERRED to v2+). Deterministic annotators (Sitecore-/Next.js-/Terraform-aware) that enrich the base graph without forking it, registered via the same registry primitive as "OKF edge vocabulary is registry-driven — a vertical declares its own traversable edges" (registerType / registerLanguage / registerOverlay). They are annotate-only: they stamp semanticRole and add same-kind edges, and they never invent edge kinds. Judgmental interpretation (vs. deterministic detection) lives in Agent Skills, not overlays.
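
The relationship between the closed taxonomy and the language packs can be sketched as follows. This is an illustration under stated assumptions, not the substrate's actual API: the LanguagePack shape, the register_language function, the capture names, and the .scm paths are hypothetical. Only the node and edge kinds of the symbol graph come from the text above; the git-history kinds would form a second closed set in the same way.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class NodeKind(str, Enum):
    REPO = "repo"
    DIR = "dir"
    FILE = "file"
    SYMBOL = "symbol"


class EdgeKind(str, Enum):
    CONTAINS = "contains"
    IMPORTS = "imports"
    CALLS = "calls"
    REFERENCES = "references"


@dataclass(frozen=True)
class LanguagePack:
    """Hypothetical pack: pure data that maps query captures to closed edge kinds."""

    language: str
    tag_query_path: str              # path to a tree-sitter tag-query (.scm) file
    capture_to_edge: Dict[str, str]  # capture name -> edge kind value


_PACKS: Dict[str, LanguagePack] = {}


def register_language(pack: LanguagePack) -> None:
    """Accept a pack only if every edge it emits is in the closed set."""
    for kind in pack.capture_to_edge.values():
        EdgeKind(kind)  # raises ValueError for any kind outside the taxonomy
    _PACKS[pack.language] = pack


def registered_languages() -> List[str]:
    return sorted(_PACKS)
```

Under this sketch, adding a language means registering one more data record, and a pack that tries to introduce a new edge kind is rejected at registration rather than widening the taxonomy.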

The decisive design rules

Never let the LLM author the structural graph. Deterministic AST + git graphs are the substrate; the LLM is confined to the judgment layer. This mirrors Dossier's existing split — the deterministic `@dossier/okf` keystone + judgmental LLM extraction — applied to code.
Code enters as a GRAPH, not flattened markdown. A new graph-node segmenter (PageRank-scoped central nodes + their k-hop neighborhood as the prompt) replaces the heading-splitter, then REUSES the existing extraction back half — validate → resolve → link → reconcile → emit — unchanged. This is "reuse extraction's back half, replace its front half," not a second system.
Lead with the git-mined "why." The flow is: git log -L → filter trivial commits → linked PRs/issues (GitHub GraphQL as a reserved keyed connector, in the keyless-floor / keyed-premium spirit) → LLM-explained rationale → OKF decision atoms (confidence: inferred; provenance = the commit / PR SHA). Each mined rationale must clear the faithfulness floor of the "Live extraction eval harness — what we measure is what extraction optimizes for" or be DROPPED; fabricated rationale is never shipped (Dossier — The Knowledge Model (v0) principle 7; this is the fabrication guard the faithfulness judge exists to be). The symbol graph alone is a commodity (the IDE already gives a call graph); the "why" is what makes the universal substrate a product rather than inert structure.
Storage split. The code graph is a replaceable, embedded-first DERIVED CACHE behind a GraphStore seam (Neo4j/Kuzu deferred, evidence-gated exactly like the vector backend in MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB / First full-loop SERVE on a real external client — reconcile divergent extraction runs to one canonical KB on a quality rubric; lexical retrieval sufficient (VectorRetriever seam not yet needed)). OKF markdown+YAML in the client's git is the SYSTEM OF RECORD (Adopt OKF as Dossier's canonical knowledge format). Not every graph node becomes an atom — PageRank centrality filters deep-extract vs. summarize vs. skip; only distilled knowledge crosses into OKF.
Freshness — git-resident incremental refresh. Content-hash-keyed; git diff drives patch-only re-parse; the history graph is append-only; the freshness watermark is a SyncCursor wrapping a commit SHA (the reserved IncrementalConnector contract from Ingestion connector seam — assemble, don't build, and ingestion owns the input contract). Refresh flows through the compounding reconcile (The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime)), preserving human-curated (asserted / verified) atoms — a re-parse never clobbers curation.
Multi-tenant isolation hardening. Extraction runs inside the tenant silo (Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system / Fix git-per-tenant isolation when a tenant root is nested inside another repo); source code must NEVER enter the LLM prompt-cache prefix: the cacheable prefix stays tenant-agnostic, source is the uncached per-tenant tail, and cacheNamespace = clientId. New MCP code tools (search_code / get_code_neighborhood / explain_symbol_history) go through the confinePath isolation gate from "MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB" and serve structure + synthesis by default, with raw file bodies gated behind explicit tenant policy; the agentic foundation must not become a code-exfil surface.
OKF mapping — no new core types for v1. Code-graph elements distill into existing concept types — system / process / workflow / artifact / term / policy / decision / role. Framework-typed concepts (sitecore-template, terraform-module) appear only as vertical types via registerType (OKF edge vocabulary is registry-driven — a vertical declares its own traversable edges) when an agency wants them — extend, never fork.
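
The graph-node segmenter named in the design rules above (PageRank-scoped central nodes plus their k-hop neighborhood) can be sketched as below. This is a minimal illustration, assuming a plain power-iteration PageRank over directed edges and an undirected k-hop expansion; the function names, the damping factor, and the iteration count are illustrative choices, not values taken from the decision.

```python
from collections import deque
from typing import Dict, List, Sequence, Set, Tuple

Edge = Tuple[str, str]  # (source symbol, target symbol)


def pagerank(edges: Sequence[Edge], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Plain power-iteration PageRank over a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    out: Dict[str, List[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out[n] or nodes  # dangling nodes spread rank evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                nxt[t] += share
        rank = nxt
    return rank


def k_hop(edges: Sequence[Edge], start: str, k: int) -> Set[str]:
    """Nodes within k hops of start, ignoring edge direction."""
    adj: Dict[str, Set[str]] = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nb in adj.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return seen


def segments(edges: Sequence[Edge], top_n: int = 1,
             k: int = 1) -> List[Tuple[str, List[str]]]:
    """One segment per central node: the node plus its k-hop neighborhood."""
    rank = pagerank(edges)
    central = sorted(rank, key=lambda n: (-rank[n], n))[:top_n]
    return [(c, sorted(k_hop(edges, c, k))) for c in central]
```

Each returned segment would stand in for what the heading-splitter produces today, so the back half of extraction receives segments in either case. The same centrality scores could also drive the deep-extract vs. summarize vs. skip filter mentioned under the storage split.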

Scope, sequencing, and the gate

v1 cut: TypeScript + Python, single-tenant, offline; deliver architecture/structure atoms + git-mined decision atoms; provenance non-negotiable on every atom; dogfooded on THIS repo first (D:\github\dossier has ~38 hand-authored decision records, a built-in gold eval set), then one real client repo, following the dogfood-then-rollout discipline (First full-loop SERVE on a real external client — reconcile divergent extraction runs to one canonical KB on a quality rubric; lexical retrieval sufficient (VectorRetriever seam not yet needed)).
The build is GATED on a de-risk spike (go/no-go) BEFORE building any substrate infrastructure; see the task 'De-risk spike (GO/NO-GO) — mine the git "why" through the existing faithfulness judge, report two numbers'. The spike mines the "why" from two repos, this repo (recall vs. the ~38 gold decisions) and one messy real client repo, through the EXISTING faithfulness judge, and reports two numbers: (a) decision-recall vs. gold, (b) faithfulness-pass-rate + raw decision-yield on the messy repo. The why-layer is testable independently of the tree-sitter substrate, so this de-risks the entire value thesis cheaply on existing machinery. Those two numbers ARE the value thesis.
Roadmap order: finish the agentic board v1 review gate (Agentic board v1 — build the git-resident OKF task board (deterministic offline core, SDK reserved), resolving DEC-0024's four open questions and dogfooding Dossier's own repo first, currently one GitHub-Actions dispatch from done) → run the "why" spike → build substrate v1. Substrate v1 is sequenced AHEAD OF the hosted control plane (Build a fully-owned hosted control plane (do NOT adopt the Vercel claude-managed-agents starter); settle the system of record as hybrid / thin-control-plane with the client-owned OKF git repo canonical) and the remaining commodity connectors (Unstructured / M365). Language coverage beyond TS/Python is tracked as a contingent backlog; see Language-pack backlog — author per-language tree-sitter tag-query + schema packs beyond TS/Python (CONTINGENT on the v1 build proceeding).
Deferred to v2+: SCIP/LSIF precise indexing (per-toolchain, two-tier accuracy, not universal — kept OFF the substrate, behind a seam), languages beyond TS/Python, framework/platform overlays, the Neo4j/Kuzu backend, NL→Cypher serving, business/domain-logic extraction, and incremental-refresh runtime wiring.
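
The git-mined "why" flow that the spike exercises (filter trivial commits, explain, then keep or drop on the faithfulness floor) can be sketched as below. The LLM explainer and the faithfulness judge are passed in as plain callables because their real interfaces are not described here; the Commit and DecisionAtom shapes, the trivial-commit heuristic, and the floor parameter are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Commit:
    sha: str
    message: str


@dataclass(frozen=True)
class DecisionAtom:
    rationale: str
    provenance: str               # the commit SHA the rationale was mined from
    confidence: str = "inferred"  # mined atoms are never asserted or verified


# Hypothetical heuristic: prefixes that usually carry no design rationale.
TRIVIAL_PREFIXES = ("chore", "style", "fmt", "bump", "merge")


def is_trivial(commit: Commit) -> bool:
    msg = commit.message.strip().lower()
    return len(msg.split()) < 4 or msg.startswith(TRIVIAL_PREFIXES)


def mine_why(
    commits: Sequence[Commit],
    explain: Callable[[Commit], str],       # stand-in for the LLM explainer
    judge: Callable[[Commit, str], float],  # stand-in for the faithfulness judge
    floor: float,
) -> List[DecisionAtom]:
    """Keep only rationales that clear the floor; everything else is dropped."""
    atoms: List[DecisionAtom] = []
    for commit in commits:
        if is_trivial(commit):
            continue
        rationale = explain(commit)
        if judge(commit, rationale) >= floor:
            atoms.append(DecisionAtom(rationale=rationale, provenance=commit.sha))
    return atoms
```

A rationale that fails the judge produces no atom at all, so there is no path by which an unsupported explanation reaches the knowledge base with a weaker label.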

Rationale

Mission fit — the most on-thesis output we have. A client's own git history mined into owned decision atoms is exactly the "capture the why" thesis (Dossier — Mission & North Star / Dossier — The Knowledge Model (v0) principle 7), pointed at the highest-decay reservoir there is. The DXA GTM makes it sharper: agencies hold many heterogeneous client codebases and no existing code tool turns them into sovereign institutional memory.
Evidence-led, with honest caveats. The deterministic-over-LLM-extraction call rests on a June-2026 survey. The headline magnitudes (40–70× / 9–21×) come from a single non-peer-reviewed preprint (arXiv 2601.08773, Java-only eval) — but the DIRECTION is independently corroborated (RepoGraph, ICLR 2025; CodexGraph, NAACL 2025; Microsoft GraphRAG cost figures). Per-language engineering is a real, acknowledged cost — which is precisely why the language pack is its own contingent backlog and why the build is gated on a spike rather than committed up front.
Architectural congruence — the house architecture, not a new moat. It reuses the existing extraction back half, the isolation model (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB / Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system / Fix git-per-tenant isolation when a tenant root is nested inside another repo), the compounding reconcile (The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime)), the registry composability (OKF edge vocabulary is registry-driven — a vertical declares its own traversable edges), and the faithfulness eval (Live extraction eval harness — what we measure is what extraction optimizes for). It is the existing architecture aimed at code — fewer new primitives, lower risk.
The "why"-ceiling risk validates the product. Rationale is only recoverable where git/PR hygiene supports it — which is the argument FOR Dossier's "capture the why going forward" discipline (its own DEC-records) over pure retroactive mining. The spike measures whether real client history clears the bar, and that measurement is itself a finding worth having.
asserted, not verified — and the gate is the proof. Nothing here is built yet; this is a ratified architectural direction. The honest confidence is asserted, and the verification gate is explicit: the de-risk spike's two numbers. We do not imply verified for a design that has not run.

Consequences

Codebase ingestion becomes the fourth connector in the family defined by "Ingestion connector seam — assemble, don't build, and ingestion owns the input contract", but as a graph-native, judgment-bearing connector rather than a commodity markdown source. It is the only connector that emits decision atoms about the source's own history.
A graph-node segmenter is added to extraction's front half; its back half (validate → resolve → link → reconcile → emit) is reused unchanged — so this is an extension of the existing pipeline, not a parallel system to maintain.
A GraphStore seam joins the standing reservations (alongside the Embedder/VectorIndex and AgentSdkOrchestrator seams) — the embedded-first cache ships first; the Neo4j/Kuzu backend is built only when a real KB's needs are shown to demand it.
A new isolation invariant is committed: source code never enters the cacheable prompt prefix (cacheNamespace = clientId), and code-serving MCP tools default to structure+synthesis with raw bodies behind explicit policy. The confinePath gate from "MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB" extends to every new code tool.
Two follow-up tasks are filed: De-risk spike (GO/NO-GO) — mine the git "why" through the existing faithfulness judge, report two numbers (the immediate go/no-go gate), and Language-pack backlog — author per-language tree-sitter tag-query + schema packs beyond TS/Python (CONTINGENT on the v1 build proceeding) (contingent per-language packs beyond TS/Python). The board v1 review gate (Agentic board v1 — build the git-resident OKF task board (deterministic offline core, SDK reserved), resolving DEC-0024's four open questions and dogfooding Dossier's own repo first) precedes this work and is tracked by that decision's own Review section (one GitHub-Actions dispatch from done); it is not re-filed here to avoid duplicating its ownership.
The build does not start until the spike returns its two numbers — so the expensive, less-reversible substrate infrastructure is committed only after the value thesis is measured.
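
The prompt-cache isolation invariant listed above can be illustrated with a small sketch. The PromptParts shape, the build_prompt helper, and the prefix text are hypothetical; only the rule itself (tenant-agnostic cacheable prefix, source in the uncached tail, cache namespace equal to the client id) comes from this decision.

```python
from dataclasses import dataclass

# Tenant-agnostic instructions: the only text allowed in the cacheable prefix.
SYSTEM_PREFIX = "You explain code rationale. Cite only the evidence provided."


@dataclass(frozen=True)
class PromptParts:
    cacheable_prefix: str  # identical for every tenant
    uncached_tail: str     # per-tenant source; never cached
    cache_namespace: str   # equals the client id


def build_prompt(client_id: str, source_snippet: str) -> PromptParts:
    """Place source in the uncached tail and scope the cache to one client."""
    return PromptParts(
        cacheable_prefix=SYSTEM_PREFIX,
        uncached_tail=source_snippet,
        cache_namespace=client_id,
    )


def violates_isolation(parts: PromptParts, source_snippet: str) -> bool:
    """True if tenant source leaked into the cacheable prefix."""
    return bool(source_snippet) and source_snippet in parts.cacheable_prefix
```

A check like violates_isolation is cheap enough to run on every prompt assembly, which would turn the invariant into a guard rather than a convention.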

Review

Promote asserted → verified when the de-risk spike (De-risk spike (GO/NO-GO) — mine the git "why" through the existing faithfulness judge, report two numbers) reports its two numbers and they clear the go bar: (a) decision-recall against this repo's ~38 gold decision records is high enough to demonstrate the symbol+git substrate recovers real captured rationale, and (b) on a messy real client repo, faithfulness-pass-rate (through the existing judge of the "Live extraction eval harness — what we measure is what extraction optimizes for") plus raw decision-yield clear the floor that makes mined decision atoms shippable rather than discarded noise. A no-go is an equally valid outcome and would itself be recorded (with the measured numbers) as the reason not to build the substrate; the spike de-risks both directions. Revisit the three-layer split only if the closed node/edge taxonomy is shown insufficient for a real client stack (the escape hatch is a vertical type via registerType, never widening the substrate's closed set).
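
The spike's reported numbers could be computed along these lines. This is a sketch under stated assumptions: mined atoms are matched to gold decisions by id, raw decision-yield is taken to mean the count of atoms that pass the judge, and the floor is a caller-supplied parameter because this record does not fix a threshold.

```python
from typing import Iterable, Sequence


def decision_recall(gold_ids: Iterable[str], recovered_ids: Iterable[str]) -> float:
    """Fraction of gold decisions that the mining run recovered."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(gold & set(recovered_ids)) / len(gold)


def faithfulness_pass_rate(scores: Sequence[float], floor: float) -> float:
    """Fraction of mined rationales whose judge score clears the floor."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= floor) / len(scores)


def decision_yield(scores: Sequence[float], floor: float) -> int:
    """Assumed meaning of raw yield: how many mined atoms survive the judge."""
    return sum(1 for s in scores if s >= floor)
```

Reporting the pass rate and the yield together guards against a run that passes the judge at a high rate only because it mined very few candidates.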

On reversibility (two-way door). The swappable internals stay swappable — the GraphStore backend, the per-language packs, the platform overlays, the segmenter's centrality heuristic, and the exact code-MCP tool surface are all behind seams or registry-driven and can change without a rewrite. What this commits to and is harder to reverse is the house-architecture stance applied to code: the LLM never authors the structural graph; code enters as a graph, not flattened markdown; OKF in the client's git stays the system of record while the code-graph is a derived cache; provenance is non-negotiable on every emitted atom; source code never enters the cacheable prompt prefix. And the whole build is itself gated behind a cheap go/no-go spike, so the expensive, less-reversible substrate infrastructure is not committed until the value thesis is measured — the most consequential commitment is deferred behind the cheapest possible test.

Provenance

/deep-research workflow (run wf_011bb749-351, 2026-06-16; 22 sources / 25 adversarially-verified claims) → pressure-tested by the Product Owner and Principal Platform Architect subagents → ratified by the user on 2026-06-16. Evidence base: a June-2026 survey of code-graph vs. LLM-extracted knowledge graphs; headline magnitudes from arXiv 2601.08773 (non-peer-reviewed, Java-only) with direction corroborated by RepoGraph (ICLR 2025), CodexGraph (NAACL 2025), and Microsoft GraphRAG cost figures — recorded with those caveats explicit, not as settled fact. confidence: asserted — this is a ratified architectural direction; nothing is built, and the verification gate is the de-risk spike's two numbers (De-risk spike (GO/NO-GO) — mine the git "why" through the existing faithfulness judge, report two numbers). No shipped package source changed for this decision.