Codebase ingestion as the 4th connector — a three-layer deterministic code-graph substrate + git-mined "why", gated on a de-risk spike and dogfooded on this repo first
0040-codebase-ingestion-three-layer
- Reversibility: two-way door
DEC-0040 — Codebase ingestion (three-layer code-graph substrate + git-mined "why")
Context
Ingestion connector seam — assemble, don't build, and ingestion owns the input contract shipped the ingestion front door with a Connector seam and reserved the three commodity connectors (Firecrawl web / Unstructured files / Microsoft365 SharePoint). This decision adds a fourth connector, codebase ingestion, and unlike the other three it is not commodity assembly. A client's codebase is the densest, least-documented, highest-decay reservoir of the tacit "why this, what we'd never do again" that Dossier — The Knowledge Model (v0) (principle 7) calls the IP that walks out the door. Mining decision atoms from a client's own git history is therefore the most on-thesis output Dossier can produce (Dossier — Mission & North Star). For the DXA go-to-market the win is amplified: agencies hold heterogeneous client codebases that no code-search tool turns into owned institutional memory.
This was produced by a /deep-research workflow (run wf_011bb749-351, 2026-06-16, 22 sources / 25 adversarially-verified claims), pressure-tested by the Product Owner and Principal Platform Architect subagents, and ratified by the user on 2026-06-16. It is recorded as a full decision because it is roadmap-defining and future readers will ask three things: why a graph instead of flattened markdown? why does the LLM never author the structural graph? why gate the build behind a spike?
Options considered
How code becomes knowledge:
- Flatten code to markdown and run the existing extraction pipeline unchanged: reuse the `Source`/markdown contract (Ingestion connector seam — assemble, don't build, and ingestion owns the input contract) verbatim. Rejected: the contract is structurally insufficient for code. The heading-splitter in extraction's `segment` stage would shatter source files into junk segments, losing the call/import structure that is the signal.
- Let an LLM read the repo and extract a knowledge graph directly (LLM-authored structural graph). Rejected: per the June-2026 survey, deterministic AST-derived graphs beat LLM-extracted knowledge graphs on every axis: 40–70× faster, 9–21× cheaper, more complete (LLM extraction silently drops files), at equal-or-better correctness. Letting the model author structure is slower, costlier, and less complete.
- (chosen) A deterministic code-graph substrate + a graph-native segmenter that reuses extraction's back half, leading with the git-mined "why". Code enters the pipeline as a graph; a deterministic substrate builds it with zero LLM; the LLM is used only where judgment is irreducible (explaining rationale), and every such output clears the Live extraction eval harness — what we measure is what extraction optimizes for faithfulness floor.
Substrate shape: a single universal extractor vs. (chosen) three layers — universal substrate / language packs / platform overlays — so "treats every codebase the same" is true at the taxonomy level while per-language and per-framework specificity lives in swappable, registry-driven data.
Code-graph storage: a graph database as a hard dependency now (Neo4j/Kuzu) vs. (chosen) an embedded-first derived cache behind a GraphStore seam, evidence-gated exactly like the vector backend (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB / First full-loop SERVE on a real external client — reconcile divergent extraction runs to one canonical KB on a quality rubric; lexical retrieval sufficient (VectorRetriever seam not yet needed)).
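A minimal sketch of what an embedded-first store behind a seam could look like. Only the name `GraphStore` comes from this decision; the method names, the `GraphNode`/`GraphEdge` shapes, and `InMemoryGraphStore` are assumptions made for illustration, not the shipped interface.

```typescript
// Hypothetical shapes: only the name `GraphStore` appears in the decision.
type NodeId = string;

interface GraphNode {
  id: NodeId;
  kind: string;
}

interface GraphEdge {
  from: NodeId;
  to: NodeId;
  kind: string;
}

// The seam: callers depend on this interface, never on a concrete backend.
interface GraphStore {
  upsertNode(node: GraphNode): void;
  upsertEdge(edge: GraphEdge): void;
  neighbors(id: NodeId, edgeKind?: string): GraphNode[];
  clear(): void; // acceptable because the graph is a derived cache
}

// Embedded-first implementation: in-memory maps, no external service.
class InMemoryGraphStore implements GraphStore {
  private nodes = new Map<NodeId, GraphNode>();
  private edges = new Map<string, GraphEdge>();

  upsertNode(node: GraphNode): void {
    this.nodes.set(node.id, node);
  }

  upsertEdge(edge: GraphEdge): void {
    // Keyed by (from, kind, to) so re-ingesting the same edge is idempotent.
    this.edges.set(`${edge.from}|${edge.kind}|${edge.to}`, edge);
  }

  neighbors(id: NodeId, edgeKind?: string): GraphNode[] {
    const out: GraphNode[] = [];
    this.edges.forEach((edge) => {
      if (edge.from !== id) return;
      if (edgeKind !== undefined && edge.kind !== edgeKind) return;
      const target = this.nodes.get(edge.to);
      if (target !== undefined) out.push(target);
    });
    return out;
  }

  clear(): void {
    this.nodes.clear();
    this.edges.clear();
  }
}

const store: GraphStore = new InMemoryGraphStore();
store.upsertNode({ id: "src/a.ts", kind: "file" });
store.upsertNode({ id: "src/b.ts", kind: "file" });
store.upsertEdge({ from: "src/a.ts", to: "src/b.ts", kind: "imports" });
store.upsertEdge({ from: "src/a.ts", to: "src/b.ts", kind: "imports" }); // same key, no duplicate
```

Because callers only see the interface, a Neo4j or Kuzu implementation could later replace the in-memory class without touching them, and `clear()` followed by a rebuild from source would lose nothing that the OKF repo does not already hold.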
Sequencing: build the substrate first vs. (chosen) gate the entire build behind a cheap de-risk "why" spike (go/no-go) that tests the value thesis independently of the tree-sitter substrate.
Decision
Add codebase ingestion as the fourth connector, architected as three layers, and gate the build on a de-risk spike — dogfooded on this repo first.
The three layers
- Universal deterministic substrate. A tree-sitter symbol graph (nodes: `repo/dir/file/symbol`; edges: `contains/imports/calls/references`) plus a git-history graph (nodes: `commit/author/PR/issue/release`; edges: `authored_by/touches/merges/closes/co_changed`). Zero LLM, zero network, fully offline. It treats every codebase identically at the node/edge taxonomy level, and the node/edge kinds are a CLOSED set.
- Language packs. Per-language tree-sitter tag-query (`.scm`) + schema data that emit into the closed structural edge kinds. This is the unit of incremental language coverage: substrate-internal DATA, not a fork. "Treats every codebase the same" holds at the taxonomy level because a per-language query pack sits underneath; adding a language is adding a pack, never touching the substrate's closed taxonomy.
- Platform overlays (DEFERRED to v2+). Deterministic annotators (Sitecore-/Next.js-/Terraform-aware) that enrich the base graph without forking it, registered via the same registry primitive as OKF edge vocabulary is registry-driven — a vertical declares its own traversable edges (`registerType/registerLanguage/registerOverlay`). They are annotate-only (they stamp `semanticRole` and add same-kind edges) and never invent edge kinds. Judgmental interpretation (vs. deterministic detection) lives in Agent Skills, not overlays.
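The relationship between the closed taxonomy and the language packs can be sketched as a registry that rejects any pack trying to emit an edge kind outside the closed set. The edge kinds are the ones listed above and `registerLanguage` is named in this decision, but its signature, the `LanguagePack` shape, the capture names, and the `.scm` path are assumptions made for illustration.

```typescript
// Closed structural edge kinds, copied from the symbol-graph taxonomy above.
const SYMBOL_EDGE_KINDS = ["contains", "imports", "calls", "references"] as const;
type SymbolEdgeKind = (typeof SYMBOL_EDGE_KINDS)[number];

function isSymbolEdgeKind(kind: string): kind is SymbolEdgeKind {
  return (SYMBOL_EDGE_KINDS as readonly string[]).indexOf(kind) !== -1;
}

// Hypothetical pack shape: a tag-query file plus a mapping from query
// capture names to edge kinds. A pack is data, not substrate code.
interface LanguagePack {
  language: string;
  tagQueryPath: string; // location of the pack's .scm file (illustrative)
  captureToEdge: Record<string, string>;
}

const packs = new Map<string, LanguagePack>();

// Registration validates the pack against the closed set, so adding a
// language can never widen the taxonomy.
function registerLanguage(pack: LanguagePack): void {
  Object.keys(pack.captureToEdge).forEach((capture) => {
    const kind = pack.captureToEdge[capture];
    if (kind === undefined || !isSymbolEdgeKind(kind)) {
      throw new Error(
        `pack "${pack.language}" maps ${capture} to unknown edge kind "${kind}"`
      );
    }
  });
  packs.set(pack.language, pack);
}

registerLanguage({
  language: "typescript",
  tagQueryPath: "packs/typescript/tags.scm",
  captureToEdge: {
    "reference.call": "calls",
    "reference.import": "imports", // illustrative capture name
  },
});
```

Validating at registration time rather than at parse time is one way to keep the "closed set" promise cheap to enforce: a bad pack fails once, loudly, before any repository is ingested.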
The decisive design rules
- Never let the LLM author the structural graph. Deterministic AST + git graphs are the substrate; the LLM is confined to the judgment layer. This mirrors Dossier's existing split — the deterministic `@dossier/okf` keystone + judgmental LLM extraction — applied to code.
- Code enters as a GRAPH, not flattened markdown. A new graph-node segmenter (PageRank-scoped central nodes + their k-hop neighborhood as the prompt) replaces the heading-splitter, then REUSES the existing extraction back half (`validate → resolve → link → reconcile → emit`) unchanged. This is "reuse extraction's back half, replace its front half," not a second system.
- Lead with the git-mined "why." `git log -L` → filter trivial commits → linked PRs/issues (GitHub GraphQL as a reserved keyed connector, in the keyless-floor / keyed-premium spirit) → LLM-explained rationale → OKF `decision` atoms (confidence: inferred; provenance = the commit / PR SHA). Each mined rationale must clear the Live extraction eval harness — what we measure is what extraction optimizes for faithfulness floor or be DROPPED; it is never shipped fabricated (Dossier — The Knowledge Model (v0) principle 7; the fabrication guard the faithfulness judge exists to be). The symbol graph alone is a commodity (the IDE already gives a call graph); the "why" is what makes the universal substrate a product rather than inert structure.
- Storage split. The code graph is a replaceable, embedded-first DERIVED CACHE behind a `GraphStore` seam (Neo4j/Kuzu deferred, evidence-gated exactly like the vector backend in MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB / First full-loop SERVE on a real external client — reconcile divergent extraction runs to one canonical KB on a quality rubric; lexical retrieval sufficient (VectorRetriever seam not yet needed)). OKF markdown+YAML in the client's git is the SYSTEM OF RECORD (Adopt OKF as Dossier's canonical knowledge format). Not every graph node becomes an atom: PageRank centrality filters deep-extract vs. summarize vs. skip, and only distilled knowledge crosses into OKF.
- Freshness: git-resident incremental refresh. Content-hash-keyed; `git diff` drives patch-only re-parse; the history graph is append-only; the freshness watermark is a `SyncCursor` wrapping a commit SHA (the reserved `IncrementalConnector` contract from Ingestion connector seam — assemble, don't build, and ingestion owns the input contract). Refresh flows through the compounding reconcile (The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime)), preserving human-curated (`asserted`/`verified`) atoms, so a re-parse never clobbers curation.
- Multi-tenant isolation hardening. Extraction runs inside the tenant silo (Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system / Fix git-per-tenant isolation when a tenant root is nested inside another repo); source code must NEVER enter the LLM prompt-cache prefix. The cacheable prefix stays tenant-agnostic, source is the uncached per-tenant tail, and `cacheNamespace = clientId`. New MCP code tools (`search_code`/`get_code_neighborhood`/`explain_symbol_history`) go through MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB's `confinePath` isolation gate and serve structure + synthesis by default, with raw file bodies gated behind explicit tenant policy. The agentic foundation must not become a code-exfil surface.
- OKF mapping: no new core types for v1. Code-graph elements distill into existing concept types: `system/process/workflow/artifact/term/policy/decision/role`. Framework-typed concepts (`sitecore-template`, `terraform-module`) appear only as vertical types via `registerType` (OKF edge vocabulary is registry-driven — a vertical declares its own traversable edges) when an agency wants them. Extend, never fork.
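The graph-node segmenter described in the rules above (PageRank-scoped central nodes plus their k-hop neighborhood) can be sketched as follows. This is a toy under stated assumptions: the function names, the `CodeSegment` shape, the damping factor, the iteration count, and the choice to treat edges as undirected when collecting the neighborhood are all illustrative, not the planned implementation.

```typescript
type Edge = [string, string]; // [from, to]

// Plain power-iteration PageRank. Rank held by nodes with no outgoing
// edges is redistributed evenly so the total stays at 1.
function pageRank(nodes: string[], edges: Edge[], damping = 0.85, iterations = 50): Map<string, number> {
  const n = nodes.length;
  const outDegree = new Map<string, number>();
  nodes.forEach((id) => outDegree.set(id, 0));
  edges.forEach(([from]) => outDegree.set(from, (outDegree.get(from) ?? 0) + 1));

  let rank = new Map<string, number>();
  nodes.forEach((id) => rank.set(id, 1 / n));

  for (let i = 0; i < iterations; i++) {
    let danglingMass = 0;
    nodes.forEach((id) => {
      if ((outDegree.get(id) ?? 0) === 0) danglingMass += rank.get(id) ?? 0;
    });
    const next = new Map<string, number>();
    nodes.forEach((id) => next.set(id, (1 - damping) / n + (damping * danglingMass) / n));
    edges.forEach(([from, to]) => {
      const share = (rank.get(from) ?? 0) / (outDegree.get(from) ?? 1);
      next.set(to, (next.get(to) ?? 0) + damping * share);
    });
    rank = next;
  }
  return rank;
}

// Breadth-first walk up to k hops, ignoring edge direction (an assumption).
function kHopNeighborhood(seed: string, edges: Edge[], k: number): string[] {
  const adjacency = new Map<string, string[]>();
  const link = (a: string, b: string): void => {
    const list = adjacency.get(a) ?? [];
    list.push(b);
    adjacency.set(a, list);
  };
  edges.forEach(([from, to]) => {
    link(from, to);
    link(to, from);
  });

  const seen = new Set<string>([seed]);
  let frontier = [seed];
  for (let hop = 0; hop < k; hop++) {
    const nextFrontier: string[] = [];
    frontier.forEach((id) => {
      (adjacency.get(id) ?? []).forEach((neighbor) => {
        if (!seen.has(neighbor)) {
          seen.add(neighbor);
          nextFrontier.push(neighbor);
        }
      });
    });
    frontier = nextFrontier;
  }
  const members: string[] = [];
  seen.forEach((id) => members.push(id));
  return members.sort();
}

interface CodeSegment {
  center: string;
  members: string[];
}

// One segment per top-ranked node; each segment would become one prompt.
function segmentGraph(nodes: string[], edges: Edge[], topN: number, k: number): CodeSegment[] {
  const rank = pageRank(nodes, edges);
  const byRank = nodes.slice().sort((a, b) => (rank.get(b) ?? 0) - (rank.get(a) ?? 0));
  return byRank.slice(0, topN).map((center) => ({
    center,
    members: kHopNeighborhood(center, edges, k),
  }));
}

const demoNodes = ["a", "b", "c", "d", "util"];
const demoEdges: Edge[] = [["a", "util"], ["b", "util"], ["c", "util"], ["d", "a"]];
const segments = segmentGraph(demoNodes, demoEdges, 1, 1);
```

In this toy graph `util` is called by three symbols, so it ranks highest and becomes the segment center; `d` is two hops away and stays out of a 1-hop neighborhood. The same ranking could drive the deep-extract vs. summarize vs. skip triage mentioned under the storage split.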
Scope, sequencing, and the gate
- v1 cut: TypeScript + Python, single-tenant, offline; deliver architecture/structure atoms + git-mined `decision` atoms; provenance non-negotiable on every atom; dogfooded on THIS repo first (`D:\github\dossier` has ~38 hand-authored `decision` records, a built-in gold eval set), then one real client repo, following the dogfood-then-rollout / First full-loop SERVE on a real external client — reconcile divergent extraction runs to one canonical KB on a quality rubric; lexical retrieval sufficient (VectorRetriever seam not yet needed) discipline.
- The build is GATED on a de-risk spike (go/no-go) BEFORE building any substrate infrastructure; see De-risk spike (GO/NO-GO) — mine the git "why" through the existing faithfulness judge, report two numbers. Mine the "why" from two repos, this repo (recall vs. the ~38 gold decisions) and one messy real client repo, through the EXISTING faithfulness judge, and report two numbers: (a) decision-recall vs. gold, (b) faithfulness-pass-rate + raw decision-yield on the messy repo. The why-layer is testable independently of the tree-sitter substrate, so this de-risks the entire value thesis cheaply on existing machinery. Those two numbers ARE the value thesis.
- Roadmap order: finish the agentic board v1 review gate (Agentic board v1 — build the git-resident OKF task board (deterministic offline core, SDK reserved), resolving DEC-0024's four open questions and dogfooding Dossier's own repo first, currently one GitHub-Actions dispatch from done) → run the "why" spike → build substrate v1 → sequenced AHEAD OF the hosted control plane (Build a fully-owned hosted control plane (do NOT adopt the Vercel claude-managed-agents starter); settle the system of record as hybrid / thin-control-plane with the client-owned OKF git repo canonical) and the remaining commodity connectors (Unstructured / M365). Language coverage beyond TS/Python is tracked as a contingent backlog — see Language-pack backlog — author per-language tree-sitter tag-query + schema packs beyond TS/Python (CONTINGENT on the v1 build proceeding).
- Deferred to v2+: SCIP/LSIF precise indexing (per-toolchain, two-tier accuracy, not universal — kept OFF the substrate, behind a seam), languages beyond TS/Python, framework/platform overlays, the Neo4j/Kuzu backend, NL→Cypher serving, business/domain-logic extraction, and incremental-refresh runtime wiring.
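The spike's two numbers can be made concrete with a small sketch. Everything here is hypothetical: the `MinedDecision` shape, the `matchesGoldId` field (how a mined rationale gets matched to a gold record is left open by this decision), and the toy judge, which merely stands in for the existing faithfulness judge.

```typescript
interface MinedDecision {
  sha: string; // provenance: the commit or PR SHA
  rationale: string;
  matchesGoldId?: string; // hypothetical: gold record this candidate was matched to
}

// Stand-in signature for the existing faithfulness judge.
type FaithfulnessJudge = (candidate: MinedDecision) => boolean;

// Candidates that fail the judge are dropped, never shipped.
function keepFaithful(candidates: MinedDecision[], judge: FaithfulnessJudge): MinedDecision[] {
  return candidates.filter(judge);
}

// Number (a): share of gold decisions recovered by kept candidates.
function decisionRecall(kept: MinedDecision[], goldIds: string[]): number {
  const recovered = new Set<string>();
  kept.forEach((c) => {
    if (c.matchesGoldId !== undefined && goldIds.indexOf(c.matchesGoldId) !== -1) {
      recovered.add(c.matchesGoldId);
    }
  });
  return goldIds.length === 0 ? 0 : recovered.size / goldIds.length;
}

// Number (b): faithfulness pass rate plus raw decision yield.
function faithfulnessReport(candidates: MinedDecision[], judge: FaithfulnessJudge): { passRate: number; decisionYield: number } {
  const kept = keepFaithful(candidates, judge);
  return {
    passRate: candidates.length === 0 ? 0 : kept.length / candidates.length,
    decisionYield: kept.length,
  };
}

// Toy judge for the example only: passes any non-empty rationale.
const toyJudge: FaithfulnessJudge = (c) => c.rationale.length > 0;

const goldIds = ["gold-1", "gold-2", "gold-3", "gold-4"];
const mined: MinedDecision[] = [
  { sha: "a1", rationale: "Switched parser to avoid quadratic blowup", matchesGoldId: "gold-1" },
  { sha: "b2", rationale: "Split tenant roots after nested-repo bug", matchesGoldId: "gold-2" },
  { sha: "c3", rationale: "", matchesGoldId: "gold-3" },
  { sha: "d4", rationale: "Pinned dependency after upstream regression" },
  { sha: "e5", rationale: "" },
];

const recall = decisionRecall(keepFaithful(mined, toyJudge), goldIds);
const spikeReport = faithfulnessReport(mined, toyJudge);
```

In the spike itself, recall would be computed on this repo against its gold records, and the pass rate and yield on the messy client repo. Computing recall only over kept candidates mirrors the rule that an unfaithful rationale never ships.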
Rationale
- Mission fit: the most on-thesis output we have. A client's own git history mined into owned `decision` atoms is exactly the "capture the why" thesis (Dossier — Mission & North Star / Dossier — The Knowledge Model (v0) principle 7), pointed at the highest-decay reservoir there is. The DXA GTM makes it sharper: agencies hold many heterogeneous client codebases and no existing code tool turns them into sovereign institutional memory.
- Evidence-led, with honest caveats. The deterministic-over-LLM-extraction call rests on a June-2026 survey. The headline magnitudes (40–70× / 9–21×) come from a single non-peer-reviewed preprint (arXiv 2601.08773, Java-only eval), but the DIRECTION is independently corroborated (RepoGraph, ICLR 2025; CodexGraph, NAACL 2025; Microsoft GraphRAG cost figures). Per-language engineering is a real, acknowledged cost, which is precisely why the language pack is its own contingent backlog and why the build is gated on a spike rather than committed up front.
- Architectural congruence — the house architecture, not a new moat. It reuses the existing extraction back half, the isolation model (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB / Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system / Fix git-per-tenant isolation when a tenant root is nested inside another repo), the compounding reconcile (The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime)), the registry composability (OKF edge vocabulary is registry-driven — a vertical declares its own traversable edges), and the faithfulness eval (Live extraction eval harness — what we measure is what extraction optimizes for). It is the existing architecture aimed at code — fewer new primitives, lower risk.
- The "why"-ceiling risk validates the product. Rationale is only recoverable where git/PR hygiene supports it — which is the argument FOR Dossier's "capture the why going forward" discipline (its own DEC-records) over pure retroactive mining. The spike measures whether real client history clears the bar, and that measurement is itself a finding worth having.
- `asserted`, not `verified`, and the gate is the proof. Nothing here is built yet; this is a ratified architectural direction. The honest confidence is `asserted`, and the verification gate is explicit: the de-risk spike's two numbers. We do not imply `verified` for a design that has not run.
Consequences
- Codebase ingestion becomes the fourth connector in the Ingestion connector seam — assemble, don't build, and ingestion owns the input contract family, but as a graph-native, judgment-bearing connector rather than a commodity markdown source. It is the only connector that emits `decision` atoms about the source's own history.
- A graph-node segmenter is added to extraction's front half; its back half (`validate → resolve → link → reconcile → emit`) is reused unchanged, so this is an extension of the existing pipeline, not a parallel system to maintain.
- A `GraphStore` seam joins the standing reservations (alongside the `Embedder`/`VectorIndex` and `AgentSdkOrchestrator` seams). The embedded-first cache ships first; the Neo4j/Kuzu backend is built only when a real KB's needs are shown to demand it.
- A new isolation invariant is committed: source code never enters the cacheable prompt prefix (`cacheNamespace = clientId`), and code-serving MCP tools default to structure+synthesis with raw bodies behind explicit policy. The MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB `confinePath` gate extends to every new code tool.
- Two follow-up tasks are filed (De-risk spike (GO/NO-GO) — mine the git "why" through the existing faithfulness judge, report two numbers, the immediate go/no-go gate; Language-pack backlog — author per-language tree-sitter tag-query + schema packs beyond TS/Python (CONTINGENT on the v1 build proceeding), the contingent per-language packs beyond TS/Python). The board v1 review gate (Agentic board v1 — build the git-resident OKF task board (deterministic offline core, SDK reserved), resolving DEC-0024's four open questions and dogfooding Dossier's own repo first) precedes this work and is tracked by that decision's own `## Review` (one GitHub-Actions dispatch from done); it is not re-filed here to avoid duplicating its ownership.
- The build does not start until the spike returns its two numbers, so the expensive, less-reversible substrate infrastructure is committed only after the value thesis is measured.
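The isolation invariant above can be illustrated with a sketch of prompt assembly in which the cacheable prefix is a shared constant and tenant source only ever lands in the uncached tail. The `PromptParts` shape, `buildPrompt`, the guard function, and the instruction text are assumptions; only the rule and `cacheNamespace = clientId` come from this decision.

```typescript
interface PromptParts {
  cacheNamespace: string; // always the clientId
  cacheablePrefix: string; // tenant-agnostic instructions only
  uncachedTail: string; // per-tenant source, never cached
}

// Illustrative shared instructions; identical for every tenant.
const SHARED_INSTRUCTIONS =
  "Explain the rationale behind the change. Cite the commit SHA for every claim.";

function buildPrompt(clientId: string, sourceExcerpt: string): PromptParts {
  return {
    cacheNamespace: clientId,
    cacheablePrefix: SHARED_INSTRUCTIONS,
    uncachedTail: sourceExcerpt,
  };
}

// Guard suitable for a test or a runtime check: throws if any tenant
// snippet has leaked into the cacheable prefix.
function assertPrefixIsTenantAgnostic(parts: PromptParts, tenantSnippets: string[]): void {
  tenantSnippets.forEach((snippet) => {
    if (snippet.length > 0 && parts.cacheablePrefix.indexOf(snippet) !== -1) {
      throw new Error(`tenant source leaked into cacheable prefix for ${parts.cacheNamespace}`);
    }
  });
}

const acmeSource = "function billCustomer(id: string) { /* ... */ }";
const globexSource = "def reconcile_ledger(entries): ...";

const acmePrompt = buildPrompt("acme", acmeSource);
const globexPrompt = buildPrompt("globex", globexSource);
assertPrefixIsTenantAgnostic(acmePrompt, [acmeSource]);
assertPrefixIsTenantAgnostic(globexPrompt, [globexSource]);
```

Two tenants end up with byte-identical prefixes under different namespaces, so a cache hit on the prefix can never reveal another tenant's code.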
Review
Promote asserted → verified when the de-risk spike (De-risk spike (GO/NO-GO) — mine the git "why" through the existing faithfulness judge, report two numbers) reports its two numbers and they clear the go bar: (a) decision-recall against this repo's ~38 gold decision records is high enough to demonstrate the symbol+git substrate recovers real captured rationale, and (b) on a messy real client repo, faithfulness-pass-rate (through the existing Live extraction eval harness — what we measure is what extraction optimizes for judge) plus raw decision-yield clear the floor that makes mined decision atoms shippable rather than discarded noise. A no-go is an equally valid outcome and would itself be recorded (with the measured numbers) as the reason not to build the substrate — the spike de-risks both directions. Revisit the three-layer split only if the closed node/edge taxonomy is shown insufficient for a real client stack (the escape hatch is a vertical type via registerType, never widening the substrate's closed set).
On reversibility (two-way door). The swappable internals stay swappable — the GraphStore backend, the per-language packs, the platform overlays, the segmenter's centrality heuristic, and the exact code-MCP tool surface are all behind seams or registry-driven and can change without a rewrite. What this commits to and is harder to reverse is the house-architecture stance applied to code: the LLM never authors the structural graph; code enters as a graph, not flattened markdown; OKF in the client's git stays the system of record while the code-graph is a derived cache; provenance is non-negotiable on every emitted atom; source code never enters the cacheable prompt prefix. And the whole build is itself gated behind a cheap go/no-go spike, so the expensive, less-reversible substrate infrastructure is not committed until the value thesis is measured — the most consequential commitment is deferred behind the cheapest possible test.
Provenance
/deep-research workflow (run wf_011bb749-351, 2026-06-16; 22 sources / 25 adversarially-verified claims) → pressure-tested by the Product Owner and Principal Platform Architect subagents → ratified by the user on 2026-06-16. Evidence base: a June-2026 survey of code-graph vs. LLM-extracted knowledge graphs; headline magnitudes from arXiv 2601.08773 (non-peer-reviewed, Java-only) with direction corroborated by RepoGraph (ICLR 2025), CodexGraph (NAACL 2025), and Microsoft GraphRAG cost figures — recorded with those caveats explicit, not as settled fact. confidence: asserted — this is a ratified architectural direction; nothing is built, and the verification gate is the de-risk spike's two numbers (De-risk spike (GO/NO-GO) — mine the git "why" through the existing faithfulness judge, report two numbers). No shipped package source changed for this decision.