Concept identity = type + (canonical-title OR prefix-stripped id), exact-match closure — dedup owned by the @dossier/okf keystone, in-pass + opt-in reconcile + loop default
0056-concept-identity-dedup-in-okf-keystone
- Reversibility: two-way door
DEC-0056 — Concept-identity dedup in the @dossier/okf keystone
Reversibility: two-way door — the normalization rules (prefix list, KEEP_PLURAL allowlist, singularization, the title∪id arms) are tunable data/policy and swappable; the durable commitments are that identity is OWNED by the keystone, that merge UNIONS provenance and never invents confidence or version chains, and that dedup is single-sourced — one identity model across resolve/reconcile/loop.
Context
The 33-page RBA Firecrawl run (First live FirecrawlConnector run against a real client source — field evidence for the reserved web seam) was the first time the learning loop ran past 3 pages, and it surfaced the moat's integrity gap that smaller runs hid: the KB leaked 32 same-concept clusters / 41 redundant atoms. The root cause was filed as Make the learning loop dedup/reconcile at scale (collapse same-type duplicate clusters; default-on compounding) and verified against the source before this decision:
- `resolve()` (the in-pass dedup, `packages/extraction/src/pipeline/resolve.ts`) keyed id-first: `id:<id>` if present, else `tt:<type>|<normalized-title>`. Two atoms with divergent ids (`agentic-ai` vs `rba-agentic-ai`) or near-but-not-identical titles therefore survived as separate atoms.
- `reconcile()` (the compounding merge, `packages/okf/src/reconcile.ts`) keyed strictly by atom `id` and was opt-in, default off, so it neither ran by default on this loop nor would have collapsed divergent-id clusters had it run.
Strict id-equality can never collapse a concept emitted under divergent ids/titles. The loop needs a COARSER, identity-level question — "are these two atoms the same CONCEPT?" — owned in one place, or the compounding promise (The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime)) inverts: every run accretes duplicates instead of refining knowledge. This is a moat-integrity decision, not a cosmetic cleanup.
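The leak described above can be reproduced in a few lines. This is a minimal sketch of the pre-decision id-first keying, not the code in `resolve.ts`: the `Atom` shape, the `capability` type value, and the trim-plus-lowercase title normalization are assumptions made for illustration.

```typescript
// Hypothetical, reduced atom shape; the real OKF atom carries more fields.
interface Atom {
  id?: string;
  type: string;
  title: string;
}

// Pre-decision key as described above: id first, else type|normalized-title.
// The normalization used here (trim + lowercase) is an assumption.
function legacyKey(atom: Atom): string {
  if (atom.id) return "id:" + atom.id;
  return "tt:" + atom.type + "|" + atom.title.trim().toLowerCase();
}

// Keep the first atom seen for each key, drop later atoms with the same key.
function legacyResolve(atoms: Atom[]): Atom[] {
  const seen = new Map<string, Atom>();
  atoms.forEach((atom) => {
    const key = legacyKey(atom);
    if (!seen.has(key)) seen.set(key, atom);
  });
  return Array.from(seen.values());
}

// Same concept, divergent ids: the id-first key never compares the titles.
const survivors = legacyResolve([
  { id: "agentic-ai", type: "capability", title: "Agentic AI" },
  { id: "rba-agentic-ai", type: "capability", title: "Agentic AI" },
]);
```

Both atoms survive because each has an id, so the title fallback is never consulted. That is the gap the coarser, concept-level question is meant to close.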
Options considered
- Keep strict id-equality everywhere (status quo). Rejected: it provably leaks at scale (the 32 clusters), and divergent-id forks (`rba-`-prefix vs bare) are exactly what a noisy multi-segment extraction produces.
- Fuzzy / embedding-similarity matching (substring, parenthetical/type-word stripping, vector similarity). Rejected, and this rejection is the heart of the decision. Fuzzier normalization over-merged distinct concepts: it would collapse "Co-Delivery" into "Co-Delivery Team", "Data Strategy" into "Data Strategy & Modernization", "Digital Strategy" into "Digital Strategy Practice". A wrong merge silently destroys a real concept; a missed duplicate is visible and recoverable. Faithfulness over coverage.
- Deterministic EXACT-match concept identity after minimal normalization, owned by the keystone (chosen). Two atoms are the same concept iff they share `type` AND either arm matches exactly:
  - title arm: the same `canonicalTitle`. Lowercase; `&` and `/` → space; punctuation → space; whitespace collapse; singularize only the trailing word, with a `KEEP_PLURAL` allowlist (services, analytics, operations, ops, devops, sales, logistics, communications) so domain plurals are never wrongly singularized.
  - id arm: the same `stripIdPrefix(id)`. Strip leading vertical/client/type prefixes (`rba`/`cx`/`sys`/`dxa`/`cap`), repeatedly, so `rba-cap-…` collapses too; only a LEADING segment is stripped (interior tokens untouched).

  Merge is the transitive closure of (title-arm ∪ id-arm). Exact-match keeps near-but-distinct concepts apart by construction.
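The two identity arms can be sketched as follows. This is an illustration under stated assumptions, not the contents of `identity.ts`: the function names mirror the exports named in this record, but the singularization rule (drop one trailing `s` unless the word ends in `ss`, `us`, or `is`), the ASCII-only punctuation handling, the key string format, and the reduced `Atom` shape are hypothetical.

```typescript
// Hypothetical, reduced atom shape for illustration.
interface Atom {
  id?: string;
  type: string;
  title: string;
}

// Allowlist and prefix list as given in this record.
const KEEP_PLURAL = new Set<string>([
  "services", "analytics", "operations", "ops",
  "devops", "sales", "logistics", "communications",
]);
const ID_PREFIXES: string[] = ["rba", "cx", "sys", "dxa", "cap"];

// Assumed heuristic: drop a single trailing "s", except for allowlisted
// plurals and words ending in "ss", "us", or "is".
function singularizeTrailing(word: string): string {
  if (KEEP_PLURAL.has(word)) return word;
  if (word.length > 3 && /s$/.test(word) && !/(ss|us|is)$/.test(word)) {
    return word.slice(0, -1);
  }
  return word;
}

// Title arm: lowercase, "&" and "/" to space, punctuation to space,
// whitespace collapse, singularize only the trailing word. ASCII-only sketch.
function canonicalTitle(title: string): string {
  const words = title
    .toLowerCase()
    .replace(/[&\/]/g, " ")
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((w) => w.length > 0);
  const last = words.pop();
  if (last !== undefined) words.push(singularizeTrailing(last));
  return words.join(" ");
}

// Id arm: strip known prefixes from the LEADING segment only, repeatedly.
function stripIdPrefix(id: string): string {
  let rest = id;
  let changed = true;
  while (changed) {
    changed = false;
    ID_PREFIXES.forEach((prefix) => {
      const lead = prefix + "-";
      if (rest.length > lead.length && rest.indexOf(lead) === 0) {
        rest = rest.slice(lead.length);
        changed = true;
      }
    });
  }
  return rest;
}

// Both keys are scoped by type, so atoms of different types never match.
function conceptKeys(atom: Atom): string[] {
  const keys = ["title:" + atom.type + "|" + canonicalTitle(atom.title)];
  if (atom.id) keys.push("id:" + atom.type + "|" + stripIdPrefix(atom.id));
  return keys;
}
```

Under these rules `rba-cap-agentic-ai` and `agentic-ai` share an id key, while "Co-Delivery" and "Co-Delivery Team" produce different title keys and stay apart.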
Decision
Own concept identity + dedup in the @dossier/okf keystone, exact-match after minimal normalization, and wire it in three places off ONE identity model.
- Identity + merge live in `packages/okf/src/identity.ts`: the keystone owns atom identity, so it owns atom merge (the same reasoning that keeps `reconcile()` there). It exports `canonicalTitle`, `stripIdPrefix`, `conceptKeys`, `pickCanonicalId`, and `dedupe`. `dedupe` runs union-find over the two concept keys, merges each cluster (`mergeCluster`), and rewrites every inbound edge through a dropped→canonical id remap so no edge is stranded.
- In-pass: `packages/extraction/src/pipeline/resolve.ts` was rewired to drive entirely off `dedupe` (it no longer keys id-first itself), layering only the extraction-specific concern on top: unioning each survivor's `provenance` + source `spans` across every atom that collapsed onto it.
- Opt-in reconcile pre-pass: `packages/okf/src/reconcile.ts` gained a `{ dedupe?: boolean }` option. When on, `existing` is self-deduped, `incoming` is deduped and its ids canonicalized the same way (so a fresh `rba-agentic-ai` lands on an existing `agentic-ai` instead of forking), then the normal id-keyed curation-guard merge runs. Default off preserves the strict-id path for callers that want it.
- Loop default: dedup is wired into `runLoop`'s compounding path and the dropped (folded-away) files are deleted, so the loop compounds without re-accreting clusters.
- Guarded by an over-merge unit test that asserts the deliberately-rejected fuzzy merges do NOT happen (distinct concepts stay distinct).
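The union-find pass and the edge remap described in the list above can be sketched like this. It is a simplified illustration, not the `dedupe` in `identity.ts`: the `Atom` and `Edge` shapes, the injected `keysOf` callback, the shortest-id policy in `pickCanonicalId`, and the `supports` relation are assumptions, and the real `mergeCluster` does more than concatenate edges.

```typescript
// Hypothetical, reduced shapes for illustration.
interface Edge { rel: string; target: string }
interface Atom { id: string; type: string; title: string; edges: Edge[] }
type KeysOf = (atom: Atom) => string[];

// Assumed policy: shortest id wins, ties broken alphabetically.
function pickCanonicalId(ids: string[]): string {
  return ids.reduce((best, id) =>
    id.length < best.length || (id.length === best.length && id < best) ? id : best);
}

function dedupe(atoms: Atom[], keysOf: KeysOf): { atoms: Atom[]; remap: Map<string, string> } {
  // Union-find over atom ids: atoms sharing ANY concept key join one cluster,
  // which yields the transitive closure of the title and id arms.
  const parent = new Map<string, string>();
  const find = (id: string): string => {
    let root = id;
    let next = parent.get(root);
    while (next !== undefined && next !== root) {
      root = next;
      next = parent.get(root);
    }
    return root;
  };
  const keyOwner = new Map<string, string>();
  atoms.forEach((atom) => {
    keysOf(atom).forEach((key) => {
      const owner = keyOwner.get(key);
      if (owner === undefined) { keyOwner.set(key, atom.id); return; }
      const a = find(atom.id);
      const b = find(owner);
      if (a !== b) parent.set(a, b);
    });
  });

  // Group atoms by cluster root.
  const clusters = new Map<string, Atom[]>();
  atoms.forEach((atom) => {
    const root = find(atom.id);
    const members = clusters.get(root);
    if (members) members.push(atom); else clusters.set(root, [atom]);
  });

  // Merge each cluster onto its canonical id and record dropped -> canonical.
  const remap = new Map<string, string>();
  const merged: Atom[] = [];
  clusters.forEach((members) => {
    const canonicalId = pickCanonicalId(members.map((m) => m.id));
    const base = members.find((m) => m.id === canonicalId);
    if (!base) return;
    members.forEach((m) => { if (m.id !== canonicalId) remap.set(m.id, canonicalId); });
    const edges = members.reduce<Edge[]>((all, m) => all.concat(m.edges), []);
    merged.push({ ...base, edges });
  });

  // Rewrite every edge target through the remap so no edge is stranded,
  // dropping exact duplicates created by the union.
  const rewritten = merged.map((atom) => {
    const seen = new Set<string>();
    const edges: Edge[] = [];
    atom.edges.forEach((e) => {
      const target = remap.get(e.target) ?? e.target;
      const sig = e.rel + "|" + target;
      if (!seen.has(sig)) { seen.add(sig); edges.push({ rel: e.rel, target }); }
    });
    return { ...atom, edges };
  });
  return { atoms: rewritten, remap };
}

// Stand-in key function for the demo only (lowercased title, "rba-" stripped).
const demoKeys: KeysOf = (a) => [
  a.type + "|t:" + a.title.toLowerCase(),
  a.type + "|i:" + a.id.replace(/^rba-/, ""),
];

const result = dedupe([
  { id: "agentic-ai", type: "capability", title: "Agentic AI", edges: [] },
  { id: "rba-agentic-ai", type: "capability", title: "Agentic AI Practice", edges: [] },
  { id: "intake", type: "process", title: "Intake", edges: [{ rel: "supports", target: "rba-agentic-ai" }] },
], demoKeys);
```

In the demo, the two capability atoms collapse onto `agentic-ai`, and the `intake` edge that pointed at the dropped `rba-agentic-ai` id is rewritten to the canonical id.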
Rationale
- Faithfulness over coverage: a wrong merge is worse than a missed dup. Both identity arms are EXACT-match after a small, defensible normalization, never fuzzy, substring, or embedding-similarity. That is precisely what keeps "Data Architecture" ≠ "Data Analytics", "Managed Support" ≠ "Managed Services", "Co-Delivery Engagement Model" ≠ "Co-Delivery TEAM Engagement Model". The `KEEP_PLURAL` allowlist exists for the same reason: blindly stripping a trailing `s` would collide distinct domain concepts.
- One identity model, no drift. Identity is defined once in the keystone and consumed by `resolve()`, `reconcile()`, and the loop, not three forked notions of "same concept." This is the same single-source-of-truth discipline `@dossier/okf` already holds for the schema and the graph.
- Provenance is sacred: UNIONED, never dropped (Adopt OKF as Dossier's canonical knowledge format). Every contributor's `source` survives on the merged atom (semicolon-joined distinct sources), typed edges are unioned so the merged atom is at least as connected as any contributor, and confidence is the highest authority present but is never upgraded past what was there (an all-`inferred` cluster stays `inferred`; we never invent verification).
- No fabricated version chains. `dedupe` deliberately does NOT mint `supersedes`/`superseded_by`: these clusters are unmanaged duplicates, not version chains. A real version chain is a curation act, not an extraction side effect, consistent with the Dossier — The Knowledge Model (v0) versioning contract.
- `confidence: verified` (field-measured on a real tenant, not asserted). Unlike a design-level conviction, this was measured against live data: the RBA KB went 494→453 atoms with 32→0 same-concept clusters, 0 dangling, orphan-artifacts 4→3 (a persona duplicate resolved), 453/453 strict-parse, the hero atom untouched. Test suites green: okf 167, extraction 81, repo 419. Applied as tenant commit `509c38d` (the pre-dedup state recoverable at `75168d0`).
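The provenance and confidence rules in the list above can be illustrated with a small merge helper. This is a sketch under assumptions, not the keystone's `mergeCluster`: the reduced `Atom` shape, the two-level confidence ranking (only `inferred` and `verified` are named in this record), the `"; "` join format, and the `page-a`/`page-b` source strings are hypothetical.

```typescript
// Only these two confidence levels are named in this record; the real
// schema may define more. The ranking is an assumption.
type Confidence = "inferred" | "verified";
const RANK: Record<Confidence, number> = { inferred: 0, verified: 1 };

// Hypothetical, reduced atom shape for illustration.
interface Atom {
  id: string;
  source: string; // one source, or several joined by semicolons
  confidence: Confidence;
}

function mergeClusterSketch(members: Atom[], canonicalId: string): Atom {
  // Union provenance: every distinct contributor source survives.
  const sources: string[] = [];
  members.forEach((m) => {
    m.source.split(";").forEach((part) => {
      const s = part.trim();
      if (s.length > 0 && sources.indexOf(s) === -1) sources.push(s);
    });
  });

  // Highest confidence PRESENT in the cluster; nothing is ever upgraded
  // beyond what some contributor already carried.
  const confidence = members
    .map((m) => m.confidence)
    .reduce((best, c) => (RANK[c] > RANK[best] ? c : best));

  // Deliberately no supersedes / superseded_by: duplicates are not versions.
  return { id: canonicalId, source: sources.join("; "), confidence };
}

const allInferred = mergeClusterSketch([
  { id: "agentic-ai", source: "page-a", confidence: "inferred" },
  { id: "rba-agentic-ai", source: "page-b; page-a", confidence: "inferred" },
], "agentic-ai");

const mixed = mergeClusterSketch([
  { id: "agentic-ai", source: "page-a", confidence: "verified" },
  { id: "rba-agentic-ai", source: "page-b", confidence: "inferred" },
], "agentic-ai");
```

The all-`inferred` cluster stays `inferred` with both sources retained, while the mixed cluster keeps the `verified` level one contributor already had.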
Consequences
- The loop now compounds without accreting duplicate clusters. The compounding promise of The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime) holds at scale: a re-crawl refines knowledge instead of forking it, because divergent-id/-title variants resolve to one canonical atom on both the in-pass and the reconcile sides.
- The live RBA tenant is now deduped (453 atoms, 0 clusters), but it is NOT yet reference-tenant quality. The separate p2 follow-ups from the DEC-0055 QA pass remain OPEN: Fix extraction type-discipline — `system` used as a catch-all + non-slug ids (RBA run) (system mis-typing), Have extraction populate the accountability spine (owner / reports_to / members / decision_rights) (absent `owner`/`reports_to`/`members`), and the orphan-artifact case-study re-typing tracked in Resolve 4 orphan-artifact graph errors from the RBA Firecrawl run (link a producing process or prune). Closing dedup does not close those.
- The normalization rules are tunable data, not a contract. The prefix list, `KEEP_PLURAL` allowlist, singularization heuristic, and the title∪id arms are policy that can be tightened or loosened (two-way door). The durable commitments are: identity is OWNED by the keystone; merge UNIONS provenance and never invents confidence or version chains; dedup is single-sourced across `resolve`/`reconcile`/loop.
- `clients/` is gitignored (Fix git-per-tenant isolation when a tenant root is nested inside another repo), so the 494→453 measurement lives in the tenant's own isolated repo (commit `509c38d`), not in this repo's history. This record is the durable account of it.
Operational lessons — the post-dedup RBA reference-tenant cleanup
The follow-on QA pass that took the RBA tenant the rest of the way to conformance- and graph-clean (75168d0→509c38d dedup→8229530, by deterministic data surgery, no LLM re-extraction; the 4 tasks above were closed in it) surfaced three durable curation lessons — operational refinements within this decision's frame, recorded here so they are not lost to a one-line log entry:
- (a) The `system` catch-all is a known extraction-TIME failure mode, and surgery is only the stopgap. Auto-minted stubs for workflow `stages` default to `type: system` (Deterministic edge-invariant repair stage (extraction Stage 5.5) mints `system` stubs), so a noisy run accretes mis-typed `system` atoms; the durable fix is type discipline at extraction (tracked by Fix extraction type-discipline — `system` used as a catch-all + non-slug ids (RBA run)), not the post-hoc re-typing that the cleanup performed (`system` 70→47).
- (b) When a re-typed stub exact-title-matches an existing canonical atom, MERGE into the canonical; never honor a speculative type hint. This is a first-class interaction of the identity model above: re-typing a stub can expose new same-concept pairs (the cleanup surfaced 8 such merges, e.g. an OCM stub), which the `dedupe` arm then folds. Single-source-of-truth beats a stub's guessed type.
- (c) Owner grounding is discipline-matched, never blanket: partial-grounded beats fabricated. 72/103 processes given a grounded `owner` (discipline-matched to 9 roles) beats 103/103 fabricated; an ungrounded `owner` is left absent (Adopt OKF as Dossier's canonical knowledge format / Dossier — The Knowledge Model (v0) principle 8, provenance discipline). The remaining structural/delivery-modeling depth is tracked as Close the RBA delivery-modeling taxonomy gap — capability→delivery `delivered_by` grounding + two persona-role cleanups (closed by Capability→delivery (`delivered_by`) is grounded by a deterministic practice→delivery-workflow bridge — a one-time curation pass, not an extraction-loop change).
Review
This record is confidence: verified (field-measured on a real tenant: RBA, 494→453 atoms, 32→0 clusters), guarded by an over-merge unit test, and green across the okf/extraction/repo suites. The promotion gate is met. Revisit the normalization policy if a future tenant surfaces either a missed duplicate (a same-concept pair the exact-match arms don't catch) or, more seriously, an over-merge (two distinct concepts wrongly collapsed). An over-merge would be a faithfulness regression: add a case to the over-merge guard test before any loosening. The loop-default policy (dedup on by default in the compounding path) inherits its review from The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime).