Make the learning loop dedup/reconcile at scale (collapse same-type duplicate clusters; default-on compounding)

task-loop-reconcile-dedup-at-scale

type: task · confidence: verified · status: done (2026-06-19) · owner: extraction-engineer

source: log-auditor. Surfaced from the multi-surface FDE QA pass on the 33-page RBA Firecrawl tenant (DEC-0055); source ref-claims were re-verified against packages/extraction + packages/okf before filing. Closed DONE 2026-06-19 by DEC-0056 (verified against identity.ts/reconcile.ts/resolve.ts and the 494→453 field result).

The 33-page RBA Firecrawl run (see "First live FirecrawlConnector run against a real client source — field evidence for the reserved web seam") was the first time the loop ran past 3 pages. It produced 26 same-type duplicate clusters, reported as ~59 redundant atoms (~6.7% of 494), with 0 supersedes/superseded_by edges. The percentage matches the 33 surplus copies (59 atoms minus 26 survivors) rather than all 59 atoms, which would be ~11.9% of 494. These are unmanaged duplicates, not version chains: capability "Digital Strategy" ×4 (two citing the same URL), process "OCM" ×5, and a systemic pattern of rba-prefixed vs. bare ids (agentic-ai / rba-agentic-ai, …). This is the root-cause finding of the QA pass: several of the type-confusion clusters in "Fix extraction type-discipline — `system` used as a catch-all + non-slug ids (RBA run)" trace back to the same dedup leak.
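The cluster counts above can be reproduced with a small audit pass. The following is a minimal sketch, not the auditor's actual tooling: the `Atom` shape, `normalizeTitle`, and `findDuplicateClusters` are hypothetical names introduced for illustration, and grouping by type plus normalized title is an assumption about how "same-type duplicate" is judged. It reports both the atoms sitting in clusters and the surplus copies beyond one survivor per cluster, since those two numbers are easy to conflate.

```typescript
// Hypothetical atom shape for illustration; not the OKF schema.
interface Atom {
  id: string;
  type: string;
  title: string;
}

interface ClusterReport {
  clusters: Atom[][];      // groups with more than one member
  atomsInClusters: number; // every atom that sits in some cluster
  surplus: number;         // copies beyond one survivor per cluster
}

// Lowercase and collapse runs of non-alphanumerics so that
// "Digital Strategy" and "digital-strategy" compare equal.
function normalizeTitle(title: string): string {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

function findDuplicateClusters(atoms: Atom[]): ClusterReport {
  const groups = new Map<string, Atom[]>();
  for (const atom of atoms) {
    const key = `${atom.type}|${normalizeTitle(atom.title)}`;
    const group = groups.get(key) ?? [];
    group.push(atom);
    groups.set(key, group);
  }
  const clusters = Array.from(groups.values()).filter((g) => g.length > 1);
  const atomsInClusters = clusters.reduce((n, g) => n + g.length, 0);
  return { clusters, atomsInClusters, surplus: atomsInClusters - clusters.length };
}

// Made-up sample data, not taken from the RBA tenant.
const sample: Atom[] = [
  { id: "digital-strategy", type: "capability", title: "Digital Strategy" },
  { id: "rba-digital-strategy", type: "capability", title: "digital strategy" },
  { id: "ocm", type: "process", title: "OCM" },
  { id: "ocm-2", type: "process", title: "OCM " },
  { id: "ocm-3", type: "process", title: "ocm" },
  { id: "agentic-ai", type: "capability", title: "Agentic AI" },
];

const report = findDuplicateClusters(sample);
console.log(report.clusters.length, report.atomsInClusters, report.surplus);
```

On this sample the pass finds two clusters holding five atoms, three of which are surplus. A title-only grouping like this would miss clusters whose titles differ more than trivially, so it is a lower bound rather than a full entity-resolution audit.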

Root cause — stated as verified, not as the brief literally framed it

The QA brief said reconcile() "is NOT wired into runLoop's emit path." That is not literally true and was corrected before filing (a wrong claim must not enter the durable record):

reconcile() is imported (packages/extraction/src/index.ts:58) and called in the emit path (:265), but only when options.reconcile === true, and that flag defaults to false (:127, :192). On this run it was therefore off, and the loop took the overwrite path.
Even with the flag on, reconcile() keys strictly by atom id (packages/okf/src/reconcile.ts:11-13), so it would not collapse two different-id atoms of the same concept (agentic-ai vs rba-agentic-ai).
The in-pass dedup that should have caught these is resolve() (packages/extraction/src/pipeline/resolve.ts:14-21): it keys on id:<id> if present, else tt:<type>|<normalized-title>. Two atoms with divergent ids or near-but-not-identical titles therefore survive as separate atoms — exactly the observed clusters.
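The keying behaviour described for resolve() can be illustrated with a short sketch. This is a reconstruction from the description above, not a copy of resolve.ts: `CandidateAtom`, `resolveKey`, `dedupeInPass`, and the title normalization are hypothetical, and "first occurrence wins" is an assumption made only to keep the example small.

```typescript
// Hypothetical candidate shape; the real extraction types may differ.
interface CandidateAtom {
  id?: string;
  type: string;
  title: string;
}

// Assumed normalization: lowercase, collapse non-alphanumerics.
function normalizeTitle(title: string): string {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Key on id:<id> when an id is present, else tt:<type>|<normalized-title>.
function resolveKey(atom: CandidateAtom): string {
  return atom.id
    ? `id:${atom.id}`
    : `tt:${atom.type}|${normalizeTitle(atom.title)}`;
}

// Keep the first atom seen for each key (an assumption of this sketch).
function dedupeInPass(atoms: CandidateAtom[]): CandidateAtom[] {
  const byKey = new Map<string, CandidateAtom>();
  for (const atom of atoms) {
    const key = resolveKey(atom);
    if (!byKey.has(key)) byKey.set(key, atom);
  }
  return Array.from(byKey.values());
}

const survivors = dedupeInPass([
  { id: "agentic-ai", type: "capability", title: "Agentic AI" },
  { id: "rba-agentic-ai", type: "capability", title: "Agentic AI" },
  { type: "process", title: "OCM" },
  { type: "process", title: "ocm" },
]);

console.log(survivors.map(resolveKey));
```

The two id-less "OCM" atoms collapse to one key, but the two "Agentic AI" atoms keep distinct `id:` keys even though type and title match. Under this keying an id, once present, hides the title entirely, which is consistent with the prefix-vs-bare-id clusters observed in the run.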

So the true fix is two-part, and broader than "wire reconcile in."

Shape — two parts

One-shot now: run a reconcile/dedup post-pass over the current RBA KB to collapse the existing 26 clusters — union provenance, express a real version chain with supersedes, merge exact duplicates. Nothing is dropped; merging is the single-source-of-truth act (Adopt OKF as Dossier's canonical knowledge format, knowledge-model principle 2).
So it doesn't recur: canonicalize ids (remove the rba- prefix divergence at emit), tighten same-type entity resolution in resolve(), and decide and record whether reconcile should default on for live client runs. That policy is owned by "The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime)". If the loop wiring changes, the Platform / Runtime Engineer owns the runLoop emit-path change alongside the Knowledge-Extraction & GraphRAG Engineer.
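Both parts can be sketched together: canonicalize ids so prefixed and bare forms agree, then merge same-type atoms that share a canonical id while unioning their provenance. This is a minimal sketch under stated assumptions, not the shipped fix: `canonicalizeId`, `mergeByCanonicalId`, the `sources` field, and the idea of passing the tenant prefix as a parameter are all hypothetical. Expressing real version chains with supersedes is deliberately left out.

```typescript
// Hypothetical atom shape; `sources` stands in for provenance.
interface Atom {
  id: string;
  type: string;
  title: string;
  sources: string[];
}

// Slugify, then strip a leading "<tenantPrefix>-" so that
// "rba-agentic-ai" and "agentic-ai" resolve to the same id.
function canonicalizeId(id: string, tenantPrefix: string): string {
  const slug = id
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  const prefix = `${tenantPrefix}-`;
  return slug.startsWith(prefix) && slug.length > prefix.length
    ? slug.slice(prefix.length)
    : slug;
}

// Merge only within a type; nothing is dropped, provenance is unioned.
function mergeByCanonicalId(atoms: Atom[], tenantPrefix: string): Atom[] {
  const merged = new Map<string, Atom>();
  for (const atom of atoms) {
    const id = canonicalizeId(atom.id, tenantPrefix);
    const key = `${atom.type}|${id}`;
    const existing = merged.get(key);
    if (!existing) {
      merged.set(key, { ...atom, id, sources: Array.from(new Set(atom.sources)) });
    } else {
      existing.sources = Array.from(new Set([...existing.sources, ...atom.sources]));
    }
  }
  return Array.from(merged.values());
}

// Made-up sample data with placeholder URLs.
const out = mergeByCanonicalId(
  [
    { id: "agentic-ai", type: "capability", title: "Agentic AI", sources: ["https://example.com/a"] },
    { id: "rba-agentic-ai", type: "capability", title: "Agentic AI", sources: ["https://example.com/a", "https://example.com/b"] },
    { id: "rba-agentic-ai", type: "system", title: "Agentic AI", sources: ["https://example.com/c"] },
  ],
  "rba",
);

console.log(out);
```

Keeping the type in the merge key means the `system` atom stays separate here; cross-type confusion is a different fix. Stripping a prefix unconditionally could also merge atoms that are legitimately distinct, so a real implementation would likely want a review step or an allowlist before collapsing.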

Why a task, not a fix-in-place

This is a real engineering change across resolve(), id canonicalization, and a loop-default policy decision. It needs owner judgment and code, not a one-token hygiene fix. The one-shot is scoped to the RBA tenant OKF (clients/rba/tenants-firecrawl/rba-consulting, a gitignored sandbox per "Fix git-per-tenant isolation when a tenant root is nested inside another repo"); the durable fix is scoped to packages/extraction + packages/okf. Filed by the log-auditor from the QA pass with confidence: inferred (raised to verified when DEC-0056 closed the task).