Codebase ingestion as the 4th connector — a three-layer deterministic code-graph substrate + git-mined "why", gated on a de-risk spike and dogfooded on this repo first

0040-codebase-ingestion-three-layer

Type: decision
Read as: Explain
Confidence: asserted
Status: active (2026-06-16)
Owner: principal-architect
Reversibility: two-way door

DEC-0040 — Codebase ingestion (three-layer code-graph substrate + git-mined "why")

Context

The earlier decision "Ingestion connector seam — assemble, don't build, and ingestion owns the input contract" shipped the ingestion front door with a Connector seam and reserved the three commodity connectors (Firecrawl web / Unstructured files / Microsoft365 SharePoint). This decision adds a fourth connector, codebase ingestion, and unlike the other three it is not commodity assembly. A client's codebase is the densest, least-documented, highest-decay reservoir of the tacit "why this, what we'd never do again" that "Dossier — The Knowledge Model (v0)" (principle 7) calls the IP that walks out the door, and mining decision atoms from a client's own git history is the most on-thesis output Dossier can produce (see "Dossier — Mission & North Star"). For the DXA go-to-market the win is amplified: agencies hold heterogeneous client codebases that no code-search tool turns into owned institutional memory.

This was produced by a /deep-research workflow (run wf_011bb749-351, 2026-06-16, 22 sources / 25 adversarially-verified claims), pressure-tested by the Product Owner and Principal Platform Architect subagents, and ratified by the user on 2026-06-16. It is recorded as a full decision because it is roadmap-defining and future readers will ask three things: why a graph instead of flattened markdown? why does the LLM never author the structural graph? why gate the build behind a spike?

Options considered

How code becomes knowledge:

  1. Flatten code to markdown and run the existing extraction pipeline unchanged, reusing the Source/markdown contract from "Ingestion connector seam — assemble, don't build, and ingestion owns the input contract" verbatim. Rejected: the contract is structurally insufficient for code. The heading-splitter in extraction's segment stage would shatter source files into junk segments, losing the call/import structure that is the signal.
  2. Let an LLM read the repo and extract a knowledge graph directly (LLM-authored structural graph). Rejected: per the June-2026 survey, deterministic AST-derived graphs beat LLM-extracted knowledge graphs on every axis — 40–70× faster, 9–21× cheaper, more complete (LLM extraction silently drops files), at equal-or-better correctness. Letting the model author structure is slower, costlier, and less complete.
  3. (chosen) A deterministic code-graph substrate plus a graph-native segmenter that reuses extraction's back half, leading with the git-mined "why". Code enters the pipeline as a graph; a deterministic substrate builds it with zero LLM; the LLM is used only where judgment is irreducible (explaining rationale), and every such output must clear the faithfulness floor of the "Live extraction eval harness — what we measure is what extraction optimizes for".
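
To make the contrast with options 1 and 2 concrete, here is a minimal sketch of deterministic structure extraction. It uses Python's standard-library `ast` module as a stand-in for the tree-sitter substrate this decision proposes; the function name `extract_edges`, the sample file `app.py`, and the triple shape are illustrative assumptions, not the project's API. No model is involved, so the same input always yields the same edges.

```python
import ast

# Hypothetical sample file; any parseable Python source would do.
SOURCE = '''
import os
from json import loads

def read_config(path):
    with open(path) as fh:
        return loads(fh.read())

def main():
    cfg = read_config(os.environ["CFG"])
    print(cfg)
'''


def extract_edges(source: str, file_name: str) -> set[tuple[str, str, str]]:
    """Return (source_node, edge_kind, target_node) triples for one file.

    Only top-level imports and functions are handled, and only calls to
    plain names (not attribute calls such as fh.read()) are recorded.
    """
    tree = ast.parse(source)
    edges: set[tuple[str, str, str]] = set()
    for node in tree.body:
        if isinstance(node, ast.Import):
            for alias in node.names:
                edges.add((file_name, "imports", alias.name))
        elif isinstance(node, ast.ImportFrom):
            edges.add((file_name, "imports", node.module or ""))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            edges.add((file_name, "contains", node.name))
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    edges.add((node.name, "calls", inner.func.id))
    return edges


edges = extract_edges(SOURCE, "app.py")
for edge in sorted(edges):
    print(edge)
```

A heading-based splitter applied to the same text would see no headings at all and could not recover that `main` calls `read_config`, which is the structure the chosen option treats as the signal.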

Substrate shape: a single universal extractor vs. (chosen) three layers — universal substrate / language packs / platform overlays — so "treats every codebase the same" is true at the taxonomy level while per-language and per-framework specificity lives in swappable, registry-driven data.

Code-graph storage: a graph database as a hard dependency now (Neo4j/Kuzu) vs. (chosen) an embedded-first derived cache behind a GraphStore seam, evidence-gated exactly like the vector backend (see "MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB" and "First full-loop SERVE on a real external client — reconcile divergent extraction runs to one canonical KB on a quality rubric; lexical retrieval sufficient (VectorRetriever seam not yet needed)").
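
As an illustration of what "embedded-first behind a seam" could look like, the sketch below defines a small `GraphStore` protocol and an in-memory implementation. Only the name GraphStore comes from this record; the method names `add_edges` and `neighbors`, the `InMemoryGraphStore` class, and the edge triple shape are assumptions made for the example.

```python
from typing import Iterable, Protocol

Edge = tuple[str, str, str]  # (source, edge_kind, target); assumed shape


class GraphStore(Protocol):
    """The seam: callers depend on this interface, not on a backend."""

    def add_edges(self, edges: Iterable[Edge]) -> None: ...

    def neighbors(self, node: str, kind: str) -> list[str]: ...


class InMemoryGraphStore:
    """Embedded backend. It is a derived cache, so it can be dropped and
    rebuilt from the repository at any time."""

    def __init__(self) -> None:
        self._out: dict[tuple[str, str], set[str]] = {}

    def add_edges(self, edges: Iterable[Edge]) -> None:
        for src, kind, dst in edges:
            self._out.setdefault((src, kind), set()).add(dst)

    def neighbors(self, node: str, kind: str) -> list[str]:
        # Sorted so results are stable across runs.
        return sorted(self._out.get((node, kind), set()))


def callees(store: GraphStore, symbol: str) -> list[str]:
    """Example consumer written against the seam only."""
    return store.neighbors(symbol, "calls")


store = InMemoryGraphStore()
store.add_edges([
    ("main", "calls", "read_config"),
    ("main", "calls", "print"),
    ("app.py", "contains", "main"),
])
print(callees(store, "main"))
```

Because `callees` is typed against the protocol, a Neo4j- or Kuzu-backed class with the same two methods could replace the in-memory one later without touching callers, which is the property the evidence gate relies on.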

Sequencing: build the substrate first vs. (chosen) gate the entire build behind a cheap de-risk "why" spike (go/no-go) that tests the value thesis independently of the tree-sitter substrate.

Decision

Add codebase ingestion as the fourth connector, architected as three layers, and gate the build on a de-risk spike — dogfooded on this repo first.

The three layers

  1. Universal deterministic substrate. A tree-sitter symbol graph (nodes: repo / dir / file / symbol; edges: contains / imports / calls / references) plus a git-history graph (nodes: commit / author / PR / issue / release; edges: authored_by / touches / merges / closes / co_changed). Zero LLM, zero network, fully offline. It treats every codebase identically at the node/edge taxonomy level, and the node/edge kinds are a CLOSED set.
  2. Language packs. Per-language tree-sitter tag-query (.scm) + schema data that emit into the closed structural edge kinds. This is the unit of incremental language coverage — substrate-internal DATA, not a fork. "Treats every codebase the same" holds at the taxonomy level because a per-language query pack sits underneath; adding a language is adding a pack, never touching the substrate's closed taxonomy.
  3. Platform overlays (DEFERRED to v2+). Deterministic annotators (Sitecore-/Next.js-/Terraform-aware) that enrich the base graph without forking it. They are registered via the same registry primitive (registerType / registerLanguage / registerOverlay) described in "OKF edge vocabulary is registry-driven — a vertical declares its own traversable edges". They are annotate-only: they stamp semanticRole and add same-kind edges, and they never invent edge kinds. Judgmental interpretation (as opposed to deterministic detection) lives in Agent Skills, not overlays.
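
The relationship between the three layers can be sketched as a closed set plus a registry. This is an assumption-laden illustration, not the project's implementation: `register_language` and `register_overlay` mirror the registerLanguage / registerOverlay primitives named above in Python naming, while `LanguagePack`, `Node`, `annotate`, the `.scm` path, and the Next.js rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Layer 1: the closed structural edge kinds listed for the symbol graph.
STRUCTURAL_EDGE_KINDS = frozenset({"contains", "imports", "calls", "references"})


@dataclass(frozen=True)
class LanguagePack:
    """Layer 2: per-language data that emits into the closed edge kinds."""
    language: str
    tag_query_file: str  # hypothetical path to a tree-sitter .scm tag query
    emits: frozenset


@dataclass
class Node:
    id: str
    kind: str
    semantic_role: Optional[str] = None


# Layer 3: an overlay sees (kind, id) and may only propose a role.
Overlay = Callable[[str, str], Optional[str]]


class Registry:
    def __init__(self) -> None:
        self._packs: dict[str, LanguagePack] = {}
        self._overlays: dict[str, Overlay] = {}

    def register_language(self, pack: LanguagePack) -> None:
        unknown = pack.emits - STRUCTURAL_EDGE_KINDS
        if unknown:
            # Adding a language must never widen the closed taxonomy.
            raise ValueError(f"{pack.language}: unknown edge kinds {sorted(unknown)}")
        self._packs[pack.language] = pack

    def register_overlay(self, name: str, overlay: Overlay) -> None:
        self._overlays[name] = overlay

    def annotate(self, node: Node) -> Node:
        # Overlays never receive the node itself, so they cannot change
        # its kind; the registry stamps the role on their behalf.
        for overlay in self._overlays.values():
            role = overlay(node.kind, node.id)
            if role is not None:
                node.semantic_role = role
        return node


registry = Registry()
registry.register_language(
    LanguagePack("python", "queries/python/tags.scm",
                 frozenset({"contains", "imports", "calls"}))
)


def nextjs_overlay(kind: str, node_id: str) -> Optional[str]:
    # Illustrative rule only: treat files under pages/ as Next.js pages.
    if kind == "file" and node_id.startswith("pages/"):
        return "nextjs:page"
    return None


registry.register_overlay("nextjs", nextjs_overlay)
page = registry.annotate(Node("pages/index.tsx", "file"))
print(page)
```

A pack that tried to emit, say, an `inherits` edge would be rejected at registration, which is one way the closed taxonomy could be enforced mechanically rather than by convention.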

The decisive design rules

Scope, sequencing, and the gate

Rationale

Consequences

Review

Promote asserted → verified when the de-risk spike (see De-risk spike (GO/NO-GO) — mine the git "why" through the existing faithfulness judge, report two numbers) reports its two numbers and they clear the go bar: (a) decision-recall against this repo's ~38 gold decision records is high enough to demonstrate that the symbol+git substrate recovers real captured rationale, and (b) on a messy real client repo, faithfulness-pass-rate (as scored by the judge in the existing "Live extraction eval harness — what we measure is what extraction optimizes for") plus raw decision-yield clear the floor that makes mined decision atoms shippable rather than discarded noise. A no-go is an equally valid outcome and would itself be recorded (with the measured numbers) as the reason not to build the substrate; the spike de-risks both directions. Revisit the three-layer split only if the closed node/edge taxonomy is shown to be insufficient for a real client stack (the escape hatch is a vertical type via registerType, never widening the substrate's closed set).
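
The spike's numbers reduce to simple ratios, sketched below under stated assumptions. The `MinedAtom` shape, the function names, the sample data, and especially the thresholds passed to `go_no_go` are hypothetical; this record does not state the go bar, so it is left as caller-supplied parameters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MinedAtom:
    """One decision atom mined from git history (assumed shape)."""
    matched_gold_id: Optional[str]  # gold decision record it recovers, if any
    faithful: bool                  # verdict of the faithfulness judge


def decision_recall(atoms: list[MinedAtom], gold_ids: set[str]) -> float:
    """Share of gold decision records recovered by at least one atom."""
    recovered = {a.matched_gold_id for a in atoms if a.matched_gold_id in gold_ids}
    return len(recovered) / len(gold_ids) if gold_ids else 0.0


def faithfulness_pass_rate(atoms: list[MinedAtom]) -> float:
    """Share of mined atoms the judge accepts."""
    return sum(a.faithful for a in atoms) / len(atoms) if atoms else 0.0


def go_no_go(recall: float, pass_rate: float, decision_yield: int, *,
             min_recall: float, min_pass_rate: float, min_yield: int) -> bool:
    """All bars must clear; the thresholds are the caller's assumption."""
    return (recall >= min_recall
            and pass_rate >= min_pass_rate
            and decision_yield >= min_yield)


# (a) Dogfood run: atoms mined from this repo against gold records.
gold = {"DEC-0001", "DEC-0002", "DEC-0003", "DEC-0004"}
dogfood = [
    MinedAtom("DEC-0001", True),
    MinedAtom("DEC-0002", True),
    MinedAtom("DEC-0003", False),
    MinedAtom(None, True),
]
recall = decision_recall(dogfood, gold)

# (b) Client run: no gold set, so only pass rate and raw yield apply.
client = [MinedAtom(None, True), MinedAtom(None, True),
          MinedAtom(None, True), MinedAtom(None, False)]
pass_rate = faithfulness_pass_rate(client)
decision_yield = len(client)

print(recall, pass_rate, decision_yield)
print(go_no_go(recall, pass_rate, decision_yield,
               min_recall=0.7, min_pass_rate=0.7, min_yield=3))
```

Keeping the thresholds outside the metric functions means a no-go can be recorded with the same measured numbers, only a different verdict.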

On reversibility (two-way door). The swappable internals stay swappable — the GraphStore backend, the per-language packs, the platform overlays, the segmenter's centrality heuristic, and the exact code-MCP tool surface are all behind seams or registry-driven and can change without a rewrite. What this commits to and is harder to reverse is the house-architecture stance applied to code: the LLM never authors the structural graph; code enters as a graph, not flattened markdown; OKF in the client's git stays the system of record while the code-graph is a derived cache; provenance is non-negotiable on every emitted atom; source code never enters the cacheable prompt prefix. And the whole build is itself gated behind a cheap go/no-go spike, so the expensive, less-reversible substrate infrastructure is not committed until the value thesis is measured — the most consequential commitment is deferred behind the cheapest possible test.

Provenance

/deep-research workflow (run wf_011bb749-351, 2026-06-16; 22 sources / 25 adversarially-verified claims) → pressure-tested by the Product Owner and Principal Platform Architect subagents → ratified by the user on 2026-06-16. Evidence base: a June-2026 survey of code-graph vs. LLM-extracted knowledge graphs. The headline magnitudes come from arXiv 2601.08773 (non-peer-reviewed, Java-only), with the direction corroborated by RepoGraph (ICLR 2025), CodexGraph (NAACL 2025), and Microsoft GraphRAG cost figures; they are recorded with those caveats explicit, not as settled fact. Confidence: asserted. This is a ratified architectural direction; nothing is built, and the verification gate is the de-risk spike's two numbers (see De-risk spike (GO/NO-GO) — mine the git "why" through the existing faithfulness judge, report two numbers). No shipped package source changed for this decision.