Live extraction eval harness — what we measure is what extraction optimizes for

0009-extraction-eval-methodology

decision read as Explain confidence asserted status active 2026-06-14 owner extraction-engineer
Reversibility
two-way door

DEC-0009 — Extraction eval methodology (the measurement layer)

Reversibility: two-way door — the metric set, thresholds, and gold corpus evolve; the eval-on-the-seam, offline-first design is the durable part.

Context

"Extraction quality is the moat" (Extraction runtime architecture — the moat) is an assertion until it is a number. Claude-primitives-first build strategy names private evals as a pillar — "is the loop improving against business outcomes?" — and Extraction runtime architecture — the moat §6 explicitly left the live eval a TODO: the offline pipeline ran green on a MockClaudeClient, but nothing yet scored the real emitted OKF against a committed gold standard.

This decision defines that scorer: the live extraction eval harness, built into @dossier/extraction on the existing ClaudeClient seam and run() pipeline (no rewrites). It scores the real emitted OKF (parses the written files back, not the in-memory candidates), runs fully offline in CI (MockClaudeClient + mock judge), and exposes a gated live path. The judgments below are methodology IP: what we choose to measure is what extraction will be optimized for, so the metric set, the gold-corpus shape, and the pass bars are themselves consequential decisions — not implementation detail.

Options considered

  1. Human-rated only. A reviewer reads each run and grades it. Highest-fidelity per-judgment, but unrepeatable, un-CI-able, and impossible to gate a regression on — it cannot answer "did this change make extraction worse?" automatically.
  2. Structural match only. Score atom precision/recall, edge faithfulness, OKF validity, and graph integrity against gold reference atoms. Fully deterministic and CI-able — but blind to the highest-risk failure mode: a decision atom can match on id, type, title, and edges yet carry hallucinated rationale. Structure says "correct"; the why is fabricated. This is precisely the tacit-judgment IP (Dossier — The Knowledge Model (v0) principle 7) the moat exists to protect.
  3. Structural + LLM-as-judge faithfulness (chosen). All of option 2's structural metrics plus an LLM-as-judge that scores, per rationale claim, whether the extracted atom's "why" is source-grounded — catching both omission (a real claim dropped) and fabrication (a claim with no source basis). The judge runs on the same ClaudeClient seam, routed to the escalate/opus tier.

Decision

Adopt option 3. The harness scores the real emitted OKF on five metrics, against a small committed gold corpus, gated by default thresholds, offline-first with a gated live path.

1. Metric set.

  • Atom precision / recall / F1 — predicted atoms matched to gold by id primary, with a type + normalized-title fallback; type-accuracy reported alongside.
  • Edge faithfulness — micro P/R/F1 pooled across the typed edges (owner, uses, produces, governed_by, stages, relates_to, decided_by). Set-valued edges are matched order-insensitive; stages is order-SENSITIVE (a workflow is an ordered path through processes — Dossier — The Knowledge Model (v0)).
  • Validity rate — fraction of emitted atoms passing okf.validate. Expected ≈ 1.0 under forced tool use (Extraction runtime architecture — the moat), but measured anyway as a guard against schema drift.
  • Graph integrity — error/warning counts from RunResult.graphIssues; this is where The produces edge is canonical on the producing process only violations (multi-producer / orphan artifact / dangling edge) surface.
  • Faithfulness — the IP metric. An LLM-as-judge (same ClaudeClient seam, escalate/opus tier) scores, per rationale claim, whether the extracted "why" is source-grounded. This is the guard on the judgment layer — extracted decision rationale must never be hallucinated (Dossier — The Knowledge Model (v0) principle 7; the tacit IP Nadella names in Dossier — Mission & North Star).

2. Gold-corpus methodology. A small, committed, self-contained corpus of authored source docs PAIRED with expected gold atoms(source → reference-atoms) pairs that make precision/recall well-defined. Modeled on the real DXA gold shapes; coverage ≥2 process, ≥2 role, 1 system, 1 workflow, and 1 decision — the decision case is what exercises faithfulness.

3. Default thresholds (the loop's bar). Graded on the micro/pooled aggregate (macro also reported, to surface a single bad case rather than letting it average out):

Metric Default bar
Atom F1 ≥ 0.6
Edge F1 ≥ 0.5
Validity == 1.0
Graph errors == 0
Faithfulness ≥ 0.8

These are deliberately baseline starting bars, to be raised as extraction improves — a floor, not a ceiling.

4. Offline-first, gated-live design. The judge is on the seam, so it is mockable. CI never hits the network (MockClaudeClient + mock judge). A human runs the live eval with ANTHROPIC_API_KEY to grade the real loop. Eval JSON reports are run artifacts (gitignored); the dataset + harness are committed.

Rationale

  • Faithfulness is the metric structure can't see. Options 1 and 2 each fail at the decisive point: human-rated can't gate regressions; structural-only is blind to fabricated rationale. The LLM-judge closes exactly the gap between "the atom looks right" and "the why is true to the source" — the failure mode with the highest IP cost, because a wrongly-captured tacit judgment is worse than a missing one (Dossier — The Knowledge Model (v0) principle 7).
  • What we measure is what we optimize for. Pinning the metric set is itself a steering decision: by grading atoms, typed edges, validity, graph integrity, and faithfulness, we point extraction at conformant, well-linked, honestly-grounded output — not just surface coverage.
  • Re-parsing the emitted files scores the real product. Scoring the written OKF (not in-memory candidates) means the eval grades exactly what a client receives — serialization, validation, resolution, and linking all included.
  • On the seam keeps it primitives-consistent and CI-safe. Putting the judge on the same ClaudeClient interface (Extraction runtime architecture — the moat) honors Claude-primitives-first build strategy, keeps CI deterministic and offline, and lets the live grade run with one env var.
  • Micro-aggregate with macro reported. Pooled scoring gives a stable headline number; reporting macro alongside prevents one bad atom from being averaged into invisibility.
  • asserted, not verified. This is a methodology choice, green against an authored DXA-shaped corpus — it has not yet been validated against a real client corpus. The thresholds are judgment, not yet evidence.

Consequences

  • The moat is now measurable. Extraction regressions are catchable and improvements provableExtraction runtime architecture — the moat §6's eval TODO is closed and Claude-primitives-first build strategy's evals pillar is concrete.
  • The judgment layer is guarded. Fabricated decision rationale now fails an eval gate instead of shipping silently — the single highest-risk extraction failure has a tripwire.
  • The gold corpus is a maintained asset. It must evolve with Dossier — The Knowledge Model (v0) and the edge vocabulary; new edge types or atom types need gold coverage or the eval silently under-tests them.
  • CI stays offline and free; live grading is a deliberate human act against ANTHROPIC_API_KEY. Eval reports are gitignored run artifacts; only the dataset + harness are committed.
  • Thresholds are a two-way door, on purpose. They are baseline floors; the metric set / gold shapes / bars are expected to move. The durable, harder-to-reverse commitment is the eval-on-the-seam, offline-first, faithfulness-as-a-first-class-metric design.

Review

Revisit the thresholds after the first real client run — raise the bars as extraction improves, and re-examine whether atom F1 ≥ 0.6 / edge F1 ≥ 0.5 / faithfulness ≥ 0.8 are still floors or have become ceilings. Once a live judge tier is regularly wired, add a cost-$ metric (per-run / per-atom spend) so the eval grades economics alongside quality — and consider promoting confidence to verified once the methodology has held against a real (non-authored) corpus.