Concept identity = type + (canonical-title OR prefix-stripped id), exact-match closure — dedup owned by the @dossier/okf keystone, in-pass + opt-in reconcile + loop default

0056-concept-identity-dedup-in-okf-keystone

decision; read as: Explain; confidence: verified; status: active; 2026-06-19; owner: extraction-engineer

Reversibility: two-way door

DEC-0056 — Concept-identity dedup in the @dossier/okf keystone

Reversibility: two-way door — the normalization rules (prefix list, KEEP_PLURAL allowlist, singularization, the title∪id arms) are tunable data/policy and swappable; the durable commitments are that identity is OWNED by the keystone, that merge UNIONS provenance and never invents confidence or version chains, and that dedup is single-sourced — one identity model across resolve/reconcile/loop.

Context

The 33-page RBA Firecrawl run (First live FirecrawlConnector run against a real client source — field evidence for the reserved web seam) was the first time the learning loop ran past 3 pages, and it surfaced the moat's integrity gap that smaller runs hid: the KB leaked 32 same-concept clusters / 41 redundant atoms. The root cause was filed as Make the learning loop dedup/reconcile at scale (collapse same-type duplicate clusters; default-on compounding) and verified against the source before this decision:

- resolve() (the in-pass dedup, packages/extraction/src/pipeline/resolve.ts) keyed id-first: id:<id> if present, else tt:<type>|<normalized-title>. Two atoms with divergent ids (agentic-ai vs rba-agentic-ai) or near-but-not-identical titles therefore survived as separate atoms.
- reconcile() (the compounding merge, packages/okf/src/reconcile.ts) keyed strictly by atom id and was opt-in, default off, so it neither ran by default on this loop nor would have collapsed divergent-id clusters had it run.

Strict id-equality can never collapse a concept emitted under divergent ids/titles. The loop needs a COARSER, identity-level question — "are these two atoms the same CONCEPT?" — owned in one place, or the compounding promise (The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime)) inverts: every run accretes duplicates instead of refining knowledge. This is a moat-integrity decision, not a cosmetic cleanup.

Options considered

Keep strict id-equality everywhere (status quo). Rejected: it provably leaks at scale (the 32 clusters), and divergent-id forks (rba- prefixed vs bare) are exactly what a noisy multi-segment extraction produces.
Fuzzy / embedding-similarity matching (substring, parenthetical/type-word stripping, vector similarity). Rejected — and this rejection is the heart of the decision. Fuzzier normalization over-merged distinct concepts: it would collapse "Co-Delivery" into "Co-Delivery Team", "Data Strategy" into "Data Strategy & Modernization", "Digital Strategy" into "Digital Strategy Practice". A wrong merge silently destroys a real concept; a missed duplicate is visible and recoverable. Faithfulness over coverage.
Deterministic EXACT-match concept identity after minimal normalization, owned by the keystone (chosen). Two atoms are the same concept iff they share type AND either arm matches exactly:
- title arm: the same canonicalTitle, computed as lowercase; & and / → space; punctuation → space; whitespace collapse; singularize only the trailing word, with a KEEP_PLURAL allowlist (services, analytics, operations, ops, devops, sales, logistics, communications) so domain plurals are never wrongly singularized.
- id arm: the same stripIdPrefix(id), which strips leading vertical/client/type prefixes (rba/cx/sys/dxa/cap) repeatedly, so rba-cap-… collapses too; only a LEADING segment is stripped (interior tokens untouched).
Merge is the transitive closure of (title-arm ∪ id-arm). Exact-match keeps near-but-distinct concepts apart by construction.
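The two arms can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in packages/okf/src/identity.ts: the record names canonicalTitle, stripIdPrefix, the prefix list and the KEEP_PLURAL allowlist, but it does not spell out the singularization heuristic, the punctuation class, or the handling of an id that consists only of a prefix. Those parts (singularizeWord, the ASCII-only regex, the "keep at least one segment" guard, sameConceptDirect) are hypothetical stand-ins.

```typescript
// Allowlist and prefix list as given in this record.
const KEEP_PLURAL = new Set([
  "services", "analytics", "operations", "ops",
  "devops", "sales", "logistics", "communications",
]);
const ID_PREFIXES = new Set(["rba", "cx", "sys", "dxa", "cap"]);

// ASSUMPTION: the real singularization heuristic is not specified here.
// This naive stand-in drops one trailing "s" from longer words, skipping
// allowlisted plurals and words ending in "ss", "us" or "is".
function singularizeWord(word: string): string {
  if (KEEP_PLURAL.has(word)) return word;
  if (word.length > 3 && /s$/.test(word) && !/(ss|us|is)$/.test(word)) {
    return word.slice(0, -1);
  }
  return word;
}

// Title arm: lowercase, "&" and "/" to space, punctuation to space,
// whitespace collapse, then singularize ONLY the trailing word.
function canonicalTitle(title: string): string {
  const words = title
    .toLowerCase()
    .replace(/[&\/]/g, " ")
    .replace(/[^a-z0-9\s]/g, " ") // ASSUMPTION: ASCII-only punctuation class
    .split(/\s+/)
    .filter((w) => w.length > 0);
  const last = words.pop();
  if (last === undefined) return "";
  words.push(singularizeWord(last));
  return words.join(" ");
}

// Id arm: strip known LEADING prefixes repeatedly; interior tokens untouched.
// ASSUMPTION: at least one segment is always kept, so a bare "rba" stays "rba".
function stripIdPrefix(id: string): string {
  const segments = id.split("-");
  while (segments.length > 1 && ID_PREFIXES.has(segments[0])) {
    segments.shift();
  }
  return segments.join("-");
}

interface ConceptRef { id: string; type: string; title: string }

// Direct (single-hop) match: same type AND either arm matches exactly.
// The merge itself is the transitive closure of this relation.
function sameConceptDirect(a: ConceptRef, b: ConceptRef): boolean {
  if (a.type !== b.type) return false;
  return (
    canonicalTitle(a.title) === canonicalTitle(b.title) ||
    stripIdPrefix(a.id) === stripIdPrefix(b.id)
  );
}
```

Under this sketch, "Co-Delivery" normalizes to "co delivery" and "Co-Delivery Team" to "co delivery team", so the pair stays distinct, while rba-cap-agentic-ai and agentic-ai both reduce to agentic-ai on the id arm.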

Decision

Own concept identity + dedup in the @dossier/okf keystone, exact-match after minimal normalization, and wire it in three places off ONE identity model.

Identity + merge live in packages/okf/src/identity.ts — the keystone owns atom identity, so it owns atom merge (the same reasoning that keeps reconcile() there). It exports canonicalTitle, stripIdPrefix, conceptKeys, pickCanonicalId, and dedupe. dedupe runs union-find over the two concept keys, merges each cluster (mergeCluster), and rewrites every inbound edge through a dropped→canonical id remap so no edge is stranded.
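The union-find pass and the dropped→canonical edge remap described above might look roughly like the sketch below. It is illustrative only: the Atom and Edge shapes, the keysOf parameter (standing in for conceptKeys), the shortest-id tie-break inside pickCanonicalId, and the choice to collapse exact duplicate edges are all assumptions made for the example, and the merge step is reduced to an edge union rather than the real mergeCluster.

```typescript
interface Edge { rel: string; to: string }
interface Atom { id: string; type: string; title: string; edges: Edge[] }
type KeyFn = (atom: Atom) => string[];

// ASSUMPTION: hypothetical tie-break (shortest id, then lexicographic).
function pickCanonicalId(ids: string[]): string {
  return ids
    .slice()
    .sort((a, b) => a.length - b.length || (a < b ? -1 : a > b ? 1 : 0))[0];
}

function dedupe(atoms: Atom[], keysOf: KeyFn): { atoms: Atom[]; remap: Map<string, string> } {
  // Union-find over atom indices, with path halving in find().
  const parent = atoms.map((_, i) => i);
  const find = (i: number): number => {
    while (parent[i] !== i) {
      parent[i] = parent[parent[i]];
      i = parent[i];
    }
    return i;
  };
  const union = (a: number, b: number): void => {
    parent[find(a)] = find(b);
  };

  // Atoms that share ANY concept key are unioned; the transitive closure
  // falls out of the union-find structure.
  const firstSeen = new Map<string, number>();
  atoms.forEach((atom, i) => {
    keysOf(atom).forEach((key) => {
      const j = firstSeen.get(key);
      if (j === undefined) firstSeen.set(key, i);
      else union(i, j);
    });
  });

  // Group atoms by cluster root.
  const clusters = new Map<number, Atom[]>();
  atoms.forEach((atom, i) => {
    const root = find(i);
    const members = clusters.get(root);
    if (members) members.push(atom);
    else clusters.set(root, [atom]);
  });

  // One survivor per cluster; record dropped -> canonical in the remap.
  const remap = new Map<string, string>();
  const survivors: Atom[] = [];
  clusters.forEach((members) => {
    const canonicalId = pickCanonicalId(members.map((m) => m.id));
    members.forEach((m) => remap.set(m.id, canonicalId));
    const canonical = members.filter((m) => m.id === canonicalId)[0];
    const edges = members.reduce<Edge[]>((acc, m) => acc.concat(m.edges), []);
    survivors.push({ ...canonical, edges });
  });

  // Rewrite every edge target through the remap so no edge is stranded,
  // then collapse exact duplicates created by the union.
  survivors.forEach((atom) => {
    const seen = new Set<string>();
    atom.edges = atom.edges
      .map((e) => ({ rel: e.rel, to: remap.get(e.to) ?? e.to }))
      .filter((e) => {
        const key = e.rel + "|" + e.to;
        if (seen.has(key)) return false;
        seen.add(key);
        return true;
      });
  });

  return { atoms: survivors, remap };
}
```

Passing the key function in keeps the sketch independent of the normalization rules, which matches the record's framing of those rules as tunable policy while the ownership of identity stays fixed.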
In-pass: packages/extraction/src/pipeline/resolve.ts was rewired to drive entirely off dedupe (it no longer keys id-first itself), layering only the extraction-specific concern on top — unioning each survivor's provenance + source spans across every atom that collapsed onto it.
Opt-in reconcile pre-pass: packages/okf/src/reconcile.ts gained a { dedupe?: boolean } option. When on, existing is self-deduped, incoming is deduped and its ids canonicalized the same way (so a fresh rba-agentic-ai lands on an existing agentic-ai instead of forking), then the normal id-keyed curation-guard merge runs. Default off preserves the strict-id path for callers that want it.
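As a rough illustration of the id-canonicalization step in that pre-pass, the sketch below rewrites an incoming atom's id to the id of an existing atom that shares a concept key, and leaves everything untouched when the option is off. Only the { dedupe?: boolean } option comes from this record; alignIncomingIds, the Atom shape and the keysOf parameter are hypothetical names, both inputs are assumed to be already self-deduped, and the id-keyed curation-guard merge that runs afterwards is not modeled.

```typescript
interface Atom { id: string; type: string; title: string }
interface ReconcileOptions { dedupe?: boolean }
type KeyFn = (atom: Atom) => string[];

// Hypothetical helper: align incoming ids with existing ids by concept key.
function alignIncomingIds(
  existing: Atom[],
  incoming: Atom[],
  keysOf: KeyFn,
  options: ReconcileOptions = {}
): Atom[] {
  // Default off: the strict-id path sees the incoming atoms unchanged.
  if (!options.dedupe) return incoming;

  // Index every concept key of the existing KB by its owning atom id.
  const ownerByKey = new Map<string, string>();
  existing.forEach((atom) => {
    keysOf(atom).forEach((key) => {
      if (!ownerByKey.has(key)) ownerByKey.set(key, atom.id);
    });
  });

  // An incoming atom that hits any existing key adopts the existing id,
  // so the later id-keyed merge updates that atom instead of forking it.
  return incoming.map((atom) => {
    const hits = keysOf(atom)
      .map((key) => ownerByKey.get(key))
      .filter((id): id is string => id !== undefined);
    return hits.length > 0 ? { ...atom, id: hits[0] } : atom;
  });
}
```

With the option on, a fresh rba-agentic-ai whose stripped id matches an existing agentic-ai comes out carrying the id agentic-ai; with the option off it keeps its own id and the strict-id behaviour is preserved.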
Loop default: dedup is wired into runLoop's compounding path and the dropped (folded-away) files are deleted, so the loop compounds without re-accreting clusters.
Guarded by an over-merge unit test that asserts the deliberately-rejected fuzzy merges do NOT happen (distinct concepts stay distinct).

Rationale

Faithfulness over coverage — a wrong merge is worse than a missed dup. Both identity arms are EXACT-match after a small, defensible normalization — never fuzzy, substring, or embedding-similarity. That is precisely what keeps "Data Architecture" ≠ "Data Analytics", "Managed Support" ≠ "Managed Services", "Co-Delivery Engagement Model" ≠ "Co-Delivery TEAM Engagement Model". The KEEP_PLURAL allowlist exists for the same reason: blindly stripping a trailing s would collide distinct domain concepts.
One identity model, no drift. Identity is defined once in the keystone and consumed by resolve(), reconcile(), and the loop — not three forked notions of "same concept." This is the same single-source-of-truth discipline @dossier/okf already holds for the schema and the graph.
Provenance is sacred — UNIONED, never dropped (Adopt OKF as Dossier's canonical knowledge format). Every contributor's source survives on the merged atom (semicolon-joined distinct sources), typed edges are unioned so the merged atom is at least as connected as any contributor, and confidence is the highest authority present but is never upgraded past what was there (an all-inferred cluster stays inferred — we never invent verification).
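The merge rules in the paragraph above can be sketched as a small cluster-merge function. This is a hedged illustration, not the keystone's mergeCluster: the Atom shape, the single source string field, the "; " separator spacing, and the two-level confidence ladder (only inferred and verified are named in this record) are assumptions made for the example.

```typescript
type Confidence = "inferred" | "verified";

// ASSUMPTION: a two-level authority ladder; the real schema may have more levels.
const AUTHORITY: Record<Confidence, number> = { inferred: 0, verified: 1 };

interface Edge { rel: string; to: string }
interface Atom { id: string; source: string; confidence: Confidence; edges: Edge[] }

function mergeCluster(canonicalId: string, members: Atom[]): Atom {
  if (members.length === 0) throw new Error("cannot merge an empty cluster");

  // Provenance is unioned: every distinct contributing source survives.
  const sources: string[] = [];
  members.forEach((m) => {
    m.source
      .split(";")
      .map((s) => s.trim())
      .filter((s) => s.length > 0)
      .forEach((s) => {
        if (sources.indexOf(s) === -1) sources.push(s);
      });
  });

  // Typed edges are unioned, so the merged atom is at least as connected
  // as any single contributor.
  const seen = new Set<string>();
  const edges: Edge[] = [];
  members.forEach((m) => {
    m.edges.forEach((e) => {
      const key = e.rel + "|" + e.to;
      if (!seen.has(key)) {
        seen.add(key);
        edges.push({ rel: e.rel, to: e.to });
      }
    });
  });

  // Confidence is the highest level PRESENT among the members. It is chosen
  // from existing values only, so an all-inferred cluster stays inferred.
  const confidence = members
    .map((m) => m.confidence)
    .reduce((best, c) => (AUTHORITY[c] > AUTHORITY[best] ? c : best));

  // Deliberately no supersedes / superseded_by fields are minted here.
  return { id: canonicalId, source: sources.join("; "), confidence, edges };
}
```

Because the confidence value is selected from the members rather than computed, the function cannot produce a level that no contributor already carried.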
No fabricated version chains. dedupe deliberately does NOT mint supersedes/superseded_by: these clusters are unmanaged duplicates, not version chains. A real version chain is a curation act, not an extraction side effect — consistent with the Dossier — The Knowledge Model (v0) versioning contract.
confidence: verified — field-measured on a real tenant, not asserted. Unlike a design-level conviction, this was measured against live data: the RBA KB went 494→453 atoms with 32→0 same-concept clusters, 0 dangling, orphan-artifacts 4→3 (a persona duplicate resolved), 453/453 strict-parse, the hero atom untouched. Test suites green: okf 167, extraction 81, repo 419. Applied as tenant commit 509c38d (the pre-dedup state recoverable at 75168d0).

Consequences

The loop now compounds without accreting duplicate clusters. The compounding promise of The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime) holds at scale: a re-crawl refines knowledge instead of forking it, because divergent-id/-title variants resolve to one canonical atom on both the in-pass and the reconcile sides.
The live RBA tenant is now deduped (453 atoms, 0 clusters) — but it is NOT yet reference-tenant quality. The separate p2 follow-ups from the DEC-0055 QA pass remain OPEN: Fix extraction type-discipline — `system` used as a catch-all + non-slug ids (RBA run) (system mis-typing), Have extraction populate the accountability spine (owner / reports_to / members / decision_rights) (absent owner/reports_to/members), and the orphan-artifact case-study re-typing tracked in Resolve 4 orphan-artifact graph errors from the RBA Firecrawl run (link a producing process or prune). Closing dedup does not close those.
The normalization rules are tunable data, not a contract. The prefix list, KEEP_PLURAL allowlist, singularization heuristic, and the title∪id arms are policy that can be tightened or loosened (two-way door). The durable commitments are: identity is OWNED by the keystone; merge UNIONS provenance and never invents confidence or version chains; dedup is single-sourced across resolve/reconcile/loop.
clients/ is gitignored (Fix git-per-tenant isolation when a tenant root is nested inside another repo), so the 494→453 measurement lives in the tenant's own isolated repo (commit 509c38d), not in this repo's history. This record is the durable account of it.

Operational lessons — the post-dedup RBA reference-tenant cleanup

The follow-on QA pass took the RBA tenant the rest of the way to conformance- and graph-clean (75168d0 → 509c38d, the dedup commit → 8229530) by deterministic data surgery, with no LLM re-extraction, and closed the 4 tasks above. It surfaced three durable curation lessons, operational refinements within this decision's frame, recorded here so they are not lost to a one-line log entry:

(a) The system catch-all is a known extraction-TIME failure mode, and surgery is only the stopgap. Auto-minted stubs for workflow stages default to type: system (Deterministic edge-invariant repair stage (extraction Stage 5.5) mints system stubs), so a noisy run accretes mis-typed system atoms; the durable fix is type discipline at extraction (tracked by Fix extraction type-discipline — `system` used as a catch-all + non-slug ids (RBA run)), not the post-hoc re-typing that the cleanup performed (system 70→47).
(b) When a re-typed stub exact-title-matches an existing canonical atom, MERGE into the canonical; never honor a speculative type hint. This is a first-class interaction of the identity model above: re-typing a stub can expose new same-concept pairs (the cleanup surfaced 8 such merges, e.g. an OCM stub), which dedupe's title arm then folds. Single-source-of-truth beats a stub's guessed type.
(c) Owner grounding is discipline-matched, never blanket — partial-grounded beats fabricated. 72/103 processes given a grounded owner (discipline-matched to 9 roles) beats 103/103 fabricated; an ungrounded owner is left absent (Adopt OKF as Dossier's canonical knowledge format / Dossier — The Knowledge Model (v0) principle 8 — provenance discipline). The remaining structural/delivery-modeling depth is tracked as Close the RBA delivery-modeling taxonomy gap — capability→delivery `delivered_by` grounding + two persona-role cleanups (closed by Capability→delivery (`delivered_by`) is grounded by a deterministic practice→delivery-workflow bridge — a one-time curation pass, not an extraction-loop change).

Review

This record is confidence: verified. It is field-measured on a real tenant (RBA, 494→453, 32→0 clusters), guarded by an over-merge unit test, and green across okf/extraction/repo suites. The promotion gate is met. Revisit the normalization policy if a future tenant surfaces either a missed duplicate (a same-concept pair the exact-match arms don't catch) or, more seriously, an over-merge (two distinct concepts wrongly collapsed); the latter would be a faithfulness regression and should add a case to the over-merge guard test before any loosening. The loop-default policy (dedup on by default in the compounding path) inherits its review from The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime).