Make the learning loop dedup/reconcile at scale (collapse same-type duplicate clusters; default-on compounding)
task-loop-reconcile-dedup-at-scale
Make the learning loop dedup/reconcile at scale
The 33-page RBA Firecrawl run ("First live FirecrawlConnector run against a real client source — field evidence for the reserved web seam"), the first time the loop ran past 3 pages, produced 26 same-type duplicate clusters / ~59 redundant atoms (~6.7% of 494) with 0 `supersedes`/`superseded_by` edges. These are unmanaged duplicates, not version chains: capability "Digital Strategy" ×4 (two citing the same URL), process "OCM" ×5, and a systemic `rba-`-prefix-vs-bare-id pattern (`agentic-ai` / `rba-agentic-ai`, …). This is the root-cause finding of the QA pass: several of the type-confusion clusters in "Fix extraction type-discipline — `system` used as a catch-all + non-slug ids (RBA run)" trace back to the same dedup leak.
Root cause — stated as verified, not as the brief literally framed it
The QA brief said reconcile() "is NOT wired into runLoop's emit path." That is not literally true and was corrected before filing (a wrong claim must not enter the durable record):
- `reconcile()` is imported (packages/extraction/src/index.ts:58) and is called in the emit path (:265), but only when `options.reconcile === true`, and that flag defaults to `false` (:127, :192). So on this run it was off (overwrite path).
- Even with the flag on, `reconcile()` keys strictly by atom `id` (packages/okf/src/reconcile.ts:11-13), so it would not collapse two different-id atoms of the same concept (`agentic-ai` vs `rba-agentic-ai`).
- The in-pass dedup that should have caught these is `resolve()` (packages/extraction/src/pipeline/resolve.ts:14-21): it keys on `id:<id>` if present, else `tt:<type>|<normalized-title>`. Two atoms with divergent ids or near-but-not-identical titles therefore survive as separate atoms, which is exactly the observed clusters.
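The keying rule described above can be sketched in a few lines to show why the `agentic-ai` / `rba-agentic-ai` pair survives. This is an illustrative reconstruction, not the code in resolve.ts: the `Atom` shape, `normalizeTitle`, `resolveKey`, and `dedupByKey` are hypothetical names, the normalization rule is assumed, and the sample type and title are made up for the example.

```typescript
// Hypothetical atom shape; the real type in packages/okf may differ.
interface Atom {
  id?: string;
  type: string;
  title: string;
}

// Assumed normalization: lowercase, collapse non-alphanumerics to single spaces.
function normalizeTitle(title: string): string {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Mirrors the described rule: key on the id if present, else on type + title.
function resolveKey(atom: Atom): string {
  return atom.id
    ? `id:${atom.id}`
    : `tt:${atom.type}|${normalizeTitle(atom.title)}`;
}

// Keeps the first atom seen for each key.
function dedupByKey(atoms: Atom[]): Atom[] {
  const seen = new Map<string, Atom>();
  for (const atom of atoms) {
    const key = resolveKey(atom);
    if (!seen.has(key)) seen.set(key, atom);
  }
  return Array.from(seen.values());
}

// Same concept, same type, same title, but divergent ids (illustrative data).
const survivors = dedupByKey([
  { id: "agentic-ai", type: "capability", title: "Agentic AI" },
  { id: "rba-agentic-ai", type: "capability", title: "Agentic AI" },
]);
console.log(survivors.length); // both atoms survive: the keys differ
```

Because the id branch wins whenever an id is present, the type-and-title fallback never gets a chance to match the pair, even though their titles are identical.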
So the true fix is two-part, and broader than "wire reconcile in."
Shape — two parts
- One-shot now: run a reconcile/dedup post-pass over the current RBA KB to collapse the existing 26 clusters: union provenance, express a real version chain with `supersedes`, merge exact duplicates. Nothing is dropped; merging is the single-source-of-truth act ("Adopt OKF as Dossier's canonical knowledge format", knowledge-model principle 2).
- So it doesn't recur: canonicalize ids (kill the `rba-` prefix divergence at emit), tighten `resolve()` same-type entity resolution, and decide + record whether `reconcile` should default on for live client runs ("The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime)" owns that policy). If the loop wiring changes, Platform / Runtime Engineer owns the `runLoop` emit-path change alongside the Knowledge-Extraction & GraphRAG Engineer.
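A minimal sketch of the two ideas above, id canonicalization and a provenance-preserving merge, might look like the following. Everything here is an assumption for illustration: the `Atom` shape, the `sources` provenance field, `TENANT_PREFIX`, `canonicalId`, and `collapseClusters` are hypothetical names, and the sketch only covers exact duplicates after canonicalization. It does not attempt `supersedes` version chains or near-title matching, which need owner judgment.

```typescript
// Hypothetical atom shape; `sources` stands in for whatever provenance OKF records.
interface Atom {
  id: string;
  type: string;
  title: string;
  sources: string[];
}

// Assumed tenant prefix, taken from the observed rba- vs bare-id pattern.
const TENANT_PREFIX = "rba-";

// Strips the tenant prefix so both id spellings map to one canonical id.
function canonicalId(id: string, prefix: string = TENANT_PREFIX): string {
  return id.startsWith(prefix) ? id.slice(prefix.length) : id;
}

// Collapses same-type atoms that share a canonical id, unioning provenance.
// The first atom seen supplies the title; no source is dropped.
function collapseClusters(atoms: Atom[]): Atom[] {
  const merged = new Map<string, Atom>();
  for (const atom of atoms) {
    const id = canonicalId(atom.id);
    const key = `${atom.type}|${id}`;
    const existing = merged.get(key);
    if (!existing) {
      merged.set(key, { ...atom, id, sources: Array.from(new Set(atom.sources)) });
    } else {
      existing.sources = Array.from(
        new Set([...existing.sources, ...atom.sources]),
      );
    }
  }
  return Array.from(merged.values());
}

// Illustrative data only; the URLs are placeholders.
const collapsed = collapseClusters([
  { id: "agentic-ai", type: "capability", title: "Agentic AI", sources: ["https://example.com/a"] },
  { id: "rba-agentic-ai", type: "capability", title: "Agentic AI", sources: ["https://example.com/b"] },
]);
console.log(collapsed.length, collapsed[0].sources);
```

Keying on type plus canonical id keeps the merge same-type only, so it cannot paper over the type-confusion clusters; those would still show up as separate atoms for review.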
Why a task, not a fix-in-place
This is a real engineering change across `resolve()`, id canonicalization, and a loop-default policy decision: it needs owner judgment + code, not a one-token hygiene fix. Scoped to the RBA tenant OKF for the one-shot (clients/rba/tenants-firecrawl/rba-consulting, a gitignored sandbox per "Fix git-per-tenant isolation when a tenant root is nested inside another repo") and to packages/extraction + packages/okf for the durable fix. Filed by the log-auditor from the QA pass; confidence: inferred.