Live extraction eval harness — what we measure is what extraction optimizes for
0009-extraction-eval-methodology
- Reversibility
- two-way door
DEC-0009 — Extraction eval methodology (the measurement layer)
Reversibility: two-way door — the metric set, thresholds, and gold corpus evolve; the eval-on-the-seam, offline-first design is the durable part.
Context
"Extraction quality is the moat" (Extraction runtime architecture — the moat) is an assertion until it is a number. Claude-primitives-first build strategy names private evals as a pillar — "is the loop improving against business outcomes?" — and Extraction runtime architecture — the moat §6 explicitly left the live eval a TODO: the offline pipeline ran green on a MockClaudeClient, but nothing yet scored the real emitted OKF against a committed gold standard.
This decision defines that scorer: the live extraction eval harness, built into @dossier/extraction on the existing ClaudeClient seam and run() pipeline (no rewrites). It scores the real emitted OKF (parses the written files back, not the in-memory candidates), runs fully offline in CI (MockClaudeClient + mock judge), and exposes a gated live path. The judgments below are methodology IP: what we choose to measure is what extraction will be optimized for, so the metric set, the gold-corpus shape, and the pass bars are themselves consequential decisions — not implementation detail.
Options considered
- Human-rated only. A reviewer reads each run and grades it. Highest-fidelity per-judgment, but unrepeatable, un-CI-able, and impossible to gate a regression on — it cannot answer "did this change make extraction worse?" automatically.
- Structural match only. Score atom precision/recall, edge faithfulness, OKF validity, and graph integrity against gold reference atoms. Fully deterministic and CI-able — but blind to the highest-risk failure mode: a
decisionatom can match onid,type, title, and edges yet carry hallucinated rationale. Structure says "correct"; the why is fabricated. This is precisely the tacit-judgment IP (Dossier — The Knowledge Model (v0) principle 7) the moat exists to protect. - Structural + LLM-as-judge faithfulness (chosen). All of option 2's structural metrics plus an LLM-as-judge that scores, per rationale claim, whether the extracted atom's "why" is source-grounded — catching both omission (a real claim dropped) and fabrication (a claim with no source basis). The judge runs on the same
ClaudeClientseam, routed to theescalate/opus tier.
Decision
Adopt option 3. The harness scores the real emitted OKF on five metrics, against a small committed gold corpus, gated by default thresholds, offline-first with a gated live path.
1. Metric set.
- Atom precision / recall / F1 — predicted atoms matched to gold by
idprimary, with atype+ normalized-title fallback; type-accuracy reported alongside. - Edge faithfulness — micro P/R/F1 pooled across the typed edges (
owner,uses,produces,governed_by,stages,relates_to,decided_by). Set-valued edges are matched order-insensitive;stagesis order-SENSITIVE (a workflow is an ordered path through processes — Dossier — The Knowledge Model (v0)). - Validity rate — fraction of emitted atoms passing
okf.validate. Expected ≈ 1.0 under forced tool use (Extraction runtime architecture — the moat), but measured anyway as a guard against schema drift. - Graph integrity — error/warning counts from
RunResult.graphIssues; this is where The produces edge is canonical on the producing process only violations (multi-producer / orphan artifact / dangling edge) surface. - Faithfulness — the IP metric. An LLM-as-judge (same
ClaudeClientseam,escalate/opus tier) scores, per rationale claim, whether the extracted "why" is source-grounded. This is the guard on the judgment layer — extracteddecisionrationale must never be hallucinated (Dossier — The Knowledge Model (v0) principle 7; the tacit IP Nadella names in Dossier — Mission & North Star).
2. Gold-corpus methodology. A small, committed, self-contained corpus of authored source docs PAIRED with expected gold atoms — (source → reference-atoms) pairs that make precision/recall well-defined. Modeled on the real DXA gold shapes; coverage ≥2 process, ≥2 role, 1 system, 1 workflow, and 1 decision — the decision case is what exercises faithfulness.
3. Default thresholds (the loop's bar). Graded on the micro/pooled aggregate (macro also reported, to surface a single bad case rather than letting it average out):
| Metric | Default bar |
|---|---|
| Atom F1 | ≥ 0.6 |
| Edge F1 | ≥ 0.5 |
| Validity | == 1.0 |
| Graph errors | == 0 |
| Faithfulness | ≥ 0.8 |
These are deliberately baseline starting bars, to be raised as extraction improves — a floor, not a ceiling.
4. Offline-first, gated-live design. The judge is on the seam, so it is mockable. CI never hits the network (MockClaudeClient + mock judge). A human runs the live eval with ANTHROPIC_API_KEY to grade the real loop. Eval JSON reports are run artifacts (gitignored); the dataset + harness are committed.
Rationale
- Faithfulness is the metric structure can't see. Options 1 and 2 each fail at the decisive point: human-rated can't gate regressions; structural-only is blind to fabricated rationale. The LLM-judge closes exactly the gap between "the atom looks right" and "the why is true to the source" — the failure mode with the highest IP cost, because a wrongly-captured tacit judgment is worse than a missing one (Dossier — The Knowledge Model (v0) principle 7).
- What we measure is what we optimize for. Pinning the metric set is itself a steering decision: by grading atoms, typed edges, validity, graph integrity, and faithfulness, we point extraction at conformant, well-linked, honestly-grounded output — not just surface coverage.
- Re-parsing the emitted files scores the real product. Scoring the written OKF (not in-memory candidates) means the eval grades exactly what a client receives — serialization, validation, resolution, and linking all included.
- On the seam keeps it primitives-consistent and CI-safe. Putting the judge on the same
ClaudeClientinterface (Extraction runtime architecture — the moat) honors Claude-primitives-first build strategy, keeps CI deterministic and offline, and lets the live grade run with one env var. - Micro-aggregate with macro reported. Pooled scoring gives a stable headline number; reporting macro alongside prevents one bad atom from being averaged into invisibility.
asserted, notverified. This is a methodology choice, green against an authored DXA-shaped corpus — it has not yet been validated against a real client corpus. The thresholds are judgment, not yet evidence.
Consequences
- The moat is now measurable. Extraction regressions are catchable and improvements provable — Extraction runtime architecture — the moat §6's eval TODO is closed and Claude-primitives-first build strategy's evals pillar is concrete.
- The judgment layer is guarded. Fabricated
decisionrationale now fails an eval gate instead of shipping silently — the single highest-risk extraction failure has a tripwire. - The gold corpus is a maintained asset. It must evolve with Dossier — The Knowledge Model (v0) and the edge vocabulary; new edge types or atom types need gold coverage or the eval silently under-tests them.
- CI stays offline and free; live grading is a deliberate human act against
ANTHROPIC_API_KEY. Eval reports are gitignored run artifacts; only the dataset + harness are committed. - Thresholds are a two-way door, on purpose. They are baseline floors; the metric set / gold shapes / bars are expected to move. The durable, harder-to-reverse commitment is the eval-on-the-seam, offline-first, faithfulness-as-a-first-class-metric design.
Review
Revisit the thresholds after the first real client run — raise the bars as extraction improves, and re-examine whether atom F1 ≥ 0.6 / edge F1 ≥ 0.5 / faithfulness ≥ 0.8 are still floors or have become ceilings. Once a live judge tier is regularly wired, add a cost-$ metric (per-run / per-atom spend) so the eval grades economics alongside quality — and consider promoting confidence to verified once the methodology has held against a real (non-authored) corpus.