Concept identity = type + (canonical-title OR prefix-stripped id), exact-match closure — dedup owned by the @dossier/okf keystone, in-pass + opt-in reconcile + loop default

0056-concept-identity-dedup-in-okf-keystone

decision · read as: Explain · confidence: verified · status: active · 2026-06-19 · owner: extraction-engineer
Reversibility: two-way door

DEC-0056 — Concept-identity dedup in the @dossier/okf keystone

Reversibility: two-way door — the normalization rules (prefix list, KEEP_PLURAL allowlist, singularization, the title∪id arms) are tunable data/policy and swappable; the durable commitments are that identity is OWNED by the keystone, that merge UNIONS provenance and never invents confidence or version chains, and that dedup is single-sourced — one identity model across resolve/reconcile/loop.

Context

The 33-page RBA Firecrawl run (First live FirecrawlConnector run against a real client source — field evidence for the reserved web seam) was the first time the learning loop ran past 3 pages, and it surfaced the moat's integrity gap that smaller runs hid: the KB leaked 32 same-concept clusters / 41 redundant atoms. The root cause was filed as Make the learning loop dedup/reconcile at scale (collapse same-type duplicate clusters; default-on compounding) and verified against the source before this decision:

  • resolve() (the in-pass dedup, packages/extraction/src/pipeline/resolve.ts) keyed id-first: id:<id> if present, else tt:<type>|<normalized-title>. Two atoms with divergent ids (agentic-ai vs rba-agentic-ai) or near-but-not-identical titles therefore survived as separate atoms.
  • reconcile() (the compounding merge, packages/okf/src/reconcile.ts) keyed strictly by atom id and was opt-in, default off — so it neither ran by default on this loop nor would have collapsed divergent-id clusters had it run.
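The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the code that was in resolve.ts: the Atom shape, the normalizeTitle stand-in, and the legacyKey name are all hypothetical, and only the key format (id:<id>, else tt:<type>|<normalized-title>) is taken from the text.

```typescript
// Hypothetical minimal atom shape; the real OKF atom carries more fields.
interface Atom {
  id?: string;
  type: string;
  title: string;
}

// ASSUMPTION: stand-in for the old title normalization (trim + lowercase only).
function normalizeTitle(title: string): string {
  return title.trim().toLowerCase();
}

// Id-first key as described in the text: the id wins whenever it is present,
// so the type|title fallback is never consulted for atoms that carry an id.
function legacyKey(atom: Atom): string {
  return atom.id
    ? "id:" + atom.id
    : "tt:" + atom.type + "|" + normalizeTitle(atom.title);
}

const bare: Atom = { id: "agentic-ai", type: "capability", title: "Agentic AI" };
const prefixed: Atom = { id: "rba-agentic-ai", type: "capability", title: "Agentic AI" };

// Same type and same title, yet the keys differ because the ids differ,
// so an id-first dedup keeps both atoms.
const sameUnderLegacyKey = legacyKey(bare) === legacyKey(prefixed);
```

Because the title is only consulted when the id is absent, two atoms that agree on type and title still fork as soon as their ids diverge.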

Strict id-equality can never collapse a concept emitted under divergent ids/titles. The loop needs a COARSER, identity-level question — "are these two atoms the same CONCEPT?" — owned in one place, or the compounding promise (The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime)) inverts: every run accretes duplicates instead of refining knowledge. This is a moat-integrity decision, not a cosmetic cleanup.

Options considered

  1. Keep strict id-equality everywhere (status quo). Rejected: it provably leaks at scale (the 32 clusters), and divergent-id forks (an rba- prefixed id vs the bare id) are exactly what a noisy multi-segment extraction produces.
  2. Fuzzy / embedding-similarity matching (substring, parenthetical/type-word stripping, vector similarity). Rejected — and this rejection is the heart of the decision. Fuzzier normalization over-merged distinct concepts: it would collapse "Co-Delivery" into "Co-Delivery Team", "Data Strategy" into "Data Strategy & Modernization", "Digital Strategy" into "Digital Strategy Practice". A wrong merge silently destroys a real concept; a missed duplicate is visible and recoverable. Faithfulness over coverage.
  3. Deterministic EXACT-match concept identity after minimal normalization, owned by the keystone (chosen). Two atoms are the same concept iff they share type AND either arm matches exactly:
    • title arm: the same canonicalTitle, computed as lowercase; & and / → space; punctuation → space; whitespace collapse; singularize only the trailing word, with a KEEP_PLURAL allowlist (services, analytics, operations, ops, devops, sales, logistics, communications) so domain plurals are never wrongly singularized.
    • id arm: the same stripIdPrefix(id), which strips leading vertical/client/type prefixes (rba/cx/sys/dxa/cap), repeatedly, so rba-cap-… collapses too; only a LEADING segment is stripped (interior tokens untouched).
     Merge is the transitive closure of (title-arm ∪ id-arm). Exact-match keeps near-but-distinct concepts apart by construction.
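The two identity arms of option 3 can be sketched as follows. The names canonicalTitle, stripIdPrefix, conceptKeys, and KEEP_PLURAL come from this record, but every function body here is an assumption rather than the code in identity.ts. In particular, the singularization rule is a deliberately naive "drop one trailing s", punctuation handling is ASCII-only, and the key string format and the AtomRef shape are hypothetical.

```typescript
// Hypothetical minimal shape; only id, type, and title matter for identity here.
interface AtomRef {
  id: string;
  type: string;
  title: string;
}

const KEEP_PLURAL = ["services", "analytics", "operations", "ops", "devops", "sales", "logistics", "communications"];
const ID_PREFIXES = ["rba", "cx", "sys", "dxa", "cap"];

// ASSUMPTION: naive rule (drop a single trailing "s"); the real rule is not given.
function singularize(word: string): string {
  if (KEEP_PLURAL.indexOf(word) !== -1) return word;
  const endsInSingleS = word.slice(-1) === "s" && word.slice(-2) !== "ss";
  return word.length > 3 && endsInSingleS ? word.slice(0, -1) : word;
}

function canonicalTitle(title: string): string {
  const words = title
    .toLowerCase()
    .replace(/[&\/]/g, " ")        // & and / become spaces
    .replace(/[^a-z0-9\s]/g, " ")  // ASSUMPTION: ASCII-only punctuation handling
    .split(/\s+/)                  // splitting also collapses whitespace runs
    .filter((w) => w.length > 0);
  const last = words.pop();
  if (last === undefined) return "";
  // Only the trailing word is singularized; earlier words are left alone.
  return words.concat(singularize(last)).join(" ");
}

// Matches one or more known prefixes at the START of the id only.
const LEADING_PREFIXES = new RegExp("^(?:(?:" + ID_PREFIXES.join("|") + ")-)+");

function stripIdPrefix(id: string): string {
  return id.replace(LEADING_PREFIXES, "");
}

// ASSUMPTION: the key string format is illustrative; type is part of both keys.
function conceptKeys(atom: AtomRef): [string, string] {
  return [
    "title:" + atom.type + "|" + canonicalTitle(atom.title),
    "id:" + atom.type + "|" + stripIdPrefix(atom.id),
  ];
}

// Pairwise check only; the transitive closure is a separate step.
function sameConceptPairwise(a: AtomRef, b: AtomRef): boolean {
  const [titleA, idA] = conceptKeys(a);
  const [titleB, idB] = conceptKeys(b);
  return titleA === titleB || idA === idB;
}
```

Under these assumptions, stripIdPrefix("rba-cap-agentic-ai") and stripIdPrefix("agentic-ai") agree, while canonicalTitle("Co-Delivery") and canonicalTitle("Co-Delivery Team") stay different strings, so the pairs that option 2 would have over-merged remain distinct.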

Decision

Own concept identity + dedup in the @dossier/okf keystone, exact-match after minimal normalization, and wire it in three places off ONE identity model.

  • Identity + merge live in packages/okf/src/identity.ts — the keystone owns atom identity, so it owns atom merge (the same reasoning that keeps reconcile() there). It exports canonicalTitle, stripIdPrefix, conceptKeys, pickCanonicalId, and dedupe. dedupe runs union-find over the two concept keys, merges each cluster (mergeCluster), and rewrites every inbound edge through a dropped→canonical id remap so no edge is stranded.
  • In-pass: packages/extraction/src/pipeline/resolve.ts was rewired to drive entirely off dedupe (it no longer keys id-first itself), layering only the extraction-specific concern on top — unioning each survivor's provenance + source spans across every atom that collapsed onto it.
  • Opt-in reconcile pre-pass: packages/okf/src/reconcile.ts gained a { dedupe?: boolean } option. When on, existing is self-deduped, incoming is deduped and its ids canonicalized the same way (so a fresh rba-agentic-ai lands on an existing agentic-ai instead of forking), then the normal id-keyed curation-guard merge runs. Default off preserves the strict-id path for callers that want it.
  • Loop default: dedup is wired into runLoop's compounding path and the dropped (folded-away) files are deleted, so the loop compounds without re-accreting clusters.
  • Guarded by an over-merge unit test that asserts the deliberately-rejected fuzzy merges do NOT happen (distinct concepts stay distinct).
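The dedupe flow described in the list above (union-find over concept keys, one survivor per cluster, edge rewrite through a dropped→canonical remap) can be sketched as below. This is an illustration under stated assumptions, not the keystone's implementation: the Atom and Edge shapes, the Sketch-suffixed function names, and the demo key function are hypothetical, the "shortest id wins" rule stands in for pickCanonicalId (whose real policy is not described here), and the merge step keeps only the first member's fields plus unioned edges.

```typescript
interface Edge { type: string; target: string }
interface Atom { id: string; type: string; title: string; edges: Edge[] }

// Union-find root lookup over atom indices, with path halving.
function findRoot(parent: number[], i: number): number {
  while (parent[i] !== i) {
    parent[i] = parent[parent[i]];
    i = parent[i];
  }
  return i;
}

// ASSUMPTION: shortest id wins, ties broken alphabetically.
function pickCanonicalIdSketch(ids: string[]): string {
  return ids.slice().sort((x, y) => x.length - y.length || (x < y ? -1 : x > y ? 1 : 0))[0];
}

function dedupeSketch(
  atoms: Atom[],
  keysOf: (atom: Atom) => string[],
): { atoms: Atom[]; remap: Record<string, string> } {
  const parent = atoms.map((_, i) => i);
  const firstSeen: Record<string, number> = Object.create(null);

  // Atoms sharing ANY key are unioned, which yields the transitive closure.
  atoms.forEach((atom, i) => {
    for (const key of keysOf(atom)) {
      if (key in firstSeen) parent[findRoot(parent, i)] = findRoot(parent, firstSeen[key]);
      else firstSeen[key] = i;
    }
  });

  // Group members under their root index.
  const clusters: Atom[][] = atoms.map(() => []);
  atoms.forEach((atom, i) => clusters[findRoot(parent, i)].push(atom));

  const remap: Record<string, string> = Object.create(null);
  const merged: Atom[] = [];
  for (const members of clusters) {
    if (members.length === 0) continue;
    const canonicalId = pickCanonicalIdSketch(members.map((m) => m.id));
    for (const m of members) remap[m.id] = canonicalId;
    const edges = members.reduce<Edge[]>((all, m) => all.concat(m.edges), []);
    merged.push({ ...members[0], id: canonicalId, edges });
  }

  // Rewrite every edge target through the remap, then drop exact duplicates.
  for (const atom of merged) {
    const seen: Record<string, true> = Object.create(null);
    atom.edges = atom.edges
      .map((e) => ({ type: e.type, target: e.target in remap ? remap[e.target] : e.target }))
      .filter((e) => {
        const k = e.type + "|" + e.target;
        if (k in seen) return false;
        seen[k] = true;
        return true;
      });
  }
  return { atoms: merged, remap };
}

// Hypothetical stand-in for conceptKeys: lowercase title arm + "rba-"-stripped id arm.
const demoKeys = (a: Atom): string[] => [
  "t:" + a.type + "|" + a.title.toLowerCase(),
  "i:" + a.type + "|" + a.id.replace(/^rba-/, ""),
];

const demoAtoms: Atom[] = [
  { id: "agentic-ai", type: "capability", title: "Agentic AI", edges: [] },
  { id: "rba-agentic-ai", type: "capability", title: "Agentic AI (RBA)", edges: [] },
  { id: "agentic-ai-platform", type: "capability", title: "Agentic AI (RBA)", edges: [] },
  { id: "co-delivery", type: "service", title: "Co-Delivery", edges: [{ type: "uses", target: "rba-agentic-ai" }] },
  { id: "co-delivery-team", type: "service", title: "Co-Delivery Team", edges: [] },
];

const demoResult = dedupeSketch(demoAtoms, demoKeys);
```

In the demo, the first and second atoms meet on the id arm and the second and third on the title arm, so all three collapse into one survivor even though the first and third share no key directly. The edge from co-delivery that pointed at the dropped rba-agentic-ai now points at the survivor, and the two Co-Delivery atoms stay separate.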

Rationale

  • Faithfulness over coverage — a wrong merge is worse than a missed dup. Both identity arms are EXACT-match after a small, defensible normalization — never fuzzy, substring, or embedding-similarity. That is precisely what keeps "Data Architecture" ≠ "Data Analytics", "Managed Support" ≠ "Managed Services", "Co-Delivery Engagement Model" ≠ "Co-Delivery TEAM Engagement Model". The KEEP_PLURAL allowlist exists for the same reason: blindly stripping a trailing s would collide distinct domain concepts.
  • One identity model, no drift. Identity is defined once in the keystone and consumed by resolve(), reconcile(), and the loop — not three forked notions of "same concept." This is the same single-source-of-truth discipline @dossier/okf already holds for the schema and the graph.
  • Provenance is sacred — UNIONED, never dropped (Adopt OKF as Dossier's canonical knowledge format). Every contributor's source survives on the merged atom (semicolon-joined distinct sources), typed edges are unioned so the merged atom is at least as connected as any contributor, and confidence is the highest authority present but is never upgraded past what was there (an all-inferred cluster stays inferred — we never invent verification).
  • No fabricated version chains. dedupe deliberately does NOT mint supersedes/superseded_by: these clusters are unmanaged duplicates, not version chains. A real version chain is a curation act, not an extraction side effect — consistent with the Dossier — The Knowledge Model (v0) versioning contract.
  • confidence: verified — field-measured on a real tenant, not asserted. Unlike a design-level conviction, this was measured against live data: the RBA KB went 494→453 atoms with 32→0 same-concept clusters, 0 dangling, orphan-artifacts 4→3 (a persona duplicate resolved), 453/453 strict-parse, the hero atom untouched. Test suites green: okf 167, extraction 81, repo 419. Applied as tenant commit 509c38d (the pre-dedup state recoverable at 75168d0).
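The provenance and confidence rules in the rationale above can be illustrated with a small merge helper. This is a sketch under assumptions, not mergeCluster itself: the Contributor shape and the function name are hypothetical, the confidence ladder is reduced to the two levels this record names (inferred and verified), and the exact separator used when joining sources is a guess.

```typescript
// ASSUMPTION: only the two levels named in this record, ordered inferred < verified.
type Confidence = "inferred" | "verified";
const RANK: Record<Confidence, number> = { inferred: 0, verified: 1 };

// Hypothetical slice of an atom: just the fields the merge rules talk about.
interface Contributor {
  source: string; // may already hold several semicolon-joined sources
  confidence: Confidence;
}

function mergeProvenanceSketch(members: Contributor[]): Contributor {
  if (members.length === 0) throw new Error("cannot merge an empty cluster");

  const sources: string[] = [];
  let best: Confidence = members[0].confidence;

  for (const member of members) {
    // Union of distinct sources: nothing is dropped, nothing is repeated.
    for (const part of member.source.split(";")) {
      const source = part.trim();
      if (source !== "" && sources.indexOf(source) === -1) sources.push(source);
    }
    // Take the highest confidence PRESENT; a level no member has is never produced.
    if (RANK[member.confidence] > RANK[best]) best = member.confidence;
  }

  // ASSUMPTION: "; " as the joiner; the record only says "semicolon-joined".
  return { source: sources.join("; "), confidence: best };
}

const allInferred = mergeProvenanceSketch([
  { source: "page-1", confidence: "inferred" },
  { source: "page-2", confidence: "inferred" },
  { source: "page-1", confidence: "inferred" },
]);

const mixed = mergeProvenanceSketch([
  { source: "page-3", confidence: "inferred" },
  { source: "page-4; page-3", confidence: "verified" },
]);
```

Because the result is always one of the members' own confidence values, an all-inferred cluster cannot come out verified, which is the "never invent verification" property stated above.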

Consequences

Operational lessons — the post-dedup RBA reference-tenant cleanup

The follow-on QA pass that took the RBA tenant the rest of the way to conformance- and graph-clean (commits 75168d0 → 509c38d (dedup) → 8229530; deterministic data surgery, no LLM re-extraction; the 4 tasks above were closed in it) surfaced three durable curation lessons. They are operational refinements within this decision's frame, recorded here so they are not lost to a one-line log entry:

Review

This record is confidence: verified. It is field-measured on a real tenant (RBA, 494→453, 32→0 clusters), guarded by an over-merge unit test, and green across okf/extraction/repo suites. The promotion gate is met. Revisit the normalization policy if a future tenant surfaces either a missed duplicate (a same-concept pair the exact-match arms don't catch) or, more seriously, an over-merge (two distinct concepts wrongly collapsed); the latter would be a faithfulness regression and should add a case to the over-merge guard test before any loosening. The loop-default policy (dedup on by default in the compounding path) inherits its review from The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime).