Improve extraction EMIT-time type discipline for DXA vertical types (so future runs don't need post-hoc curation)
task-extraction-emit-time-vertical-type-discipline
Surfaced by the log-auditor while closing two sibling RBA type-discipline curation tasks, and flagged by the Knowledge-Extraction & GraphRAG Engineer as their shared root cause: the extraction EMIT path assigns vertical/concept types weakly, so the same class of mis-typing recurred twice on one tenant and was each time fixed by hand after the fact.
The one root cause behind two closed tasks
| Closed task | Mis-emit | Fixed by (stopgap) |
|---|---|---|
| Re-type 6 client-specific RBA `workflow` atoms as DXA `engagement`s (single-client SOW instances, not standing orchestrations) | single-client SOW instance emitted as generic `workflow` instead of DXA `engagement` | 6 atoms re-typed (2 new engagement nodes, 4 merged), tenant commit 975cc83 |
| Fix extraction type-discipline — `system` used as a catch-all + non-slug ids (RBA run) | deliverables / process phases emitted as `system` (the catch-all) instead of `artifact` / `process` | 23 atoms re-typed, ids slugified, tenant commit 8229530 |
Both were closed by deterministic data surgery on the RBA tenant — a one-time, post-hoc re-type. Both closing notes recorded the durable fix as a lesson rather than filed work: the surgery is the stopgap; the emit path is where the type call should be made. This task makes that durable fix real, so future extractions are type-disciplined at emit and need no curation pass.
What "emit-time type discipline" means
A heuristic on the extraction EMIT path (Extraction runtime architecture — the moat) that strengthens a concept's type from grounded signals in the source, conservatively:
- client-name + single-client-SOW signal → DXA `engagement` (a named one-off delivery for a specific client) rather than `workflow` (the org's standing path-through-nodes). The Dossier — The Knowledge Model (v0) is explicit on `workflow` vs `engagement` (Digital Experience Agency vertical as the first reference implementation).
- deliverable / output signal → `artifact`; phase / activity signal → `process`. `system` stays reserved to a tool/software/platform the org uses (the knowledge-model definition), never a catch-all.
- No grounding → conservative fallback. Faithfulness over coverage (Concept identity = type + (canonical-title OR prefix-stripped id), exact-match closure — dedup owned by the @dossier/okf keystone, in-pass + opt-in reconcile + loop default, knowledge-model principle 8): a defensible generic type beats a fabricated strong one.
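The three rules above can be sketched as a small pure function. This is a minimal illustration, not the extraction runtime's actual code: `strengthenType`, `GroundedSignals`, `TypeCall`, the rule ordering, and the generic `concept` fallback type are all hypothetical names and choices made for this example.

```typescript
// Hypothetical sketch: names and rule order are assumptions, not the runtime's API.
type ConceptType =
  | "engagement" | "workflow" | "artifact" | "process" | "system"
  | "concept"; // assumed generic fallback type; the source does not name one

// Signals that must be grounded in the source text, never guessed.
interface GroundedSignals {
  clientName?: string;       // a specific client named in the source
  singleClientSow: boolean;  // source reads as a one-off SOW for that client
  deliverable: boolean;      // source describes a deliverable / output
  phaseOrActivity: boolean;  // source describes a phase / activity
  toolOrPlatform: boolean;   // source names a tool/software/platform the org uses
}

interface TypeCall {
  type: ConceptType;
  grounded: boolean; // false when no rule fired and the fallback was used
  reason: string;
}

function strengthenType(emitted: ConceptType, s: GroundedSignals): TypeCall {
  const hasClient = (s.clientName ?? "").trim().length > 0;

  // Rule 1: named client + single-client SOW -> DXA engagement, not workflow.
  if (hasClient && s.singleClientSow) {
    return { type: "engagement", grounded: true, reason: "named client + single-client SOW" };
  }
  // Rule 2: deliverable -> artifact; phase / activity -> process.
  if (s.deliverable) {
    return { type: "artifact", grounded: true, reason: "deliverable / output signal" };
  }
  if (s.phaseOrActivity) {
    return { type: "process", grounded: true, reason: "phase / activity signal" };
  }
  // `system` is kept only when a tool/software/platform is actually grounded.
  if (s.toolOrPlatform) {
    return { type: "system", grounded: true, reason: "tool / software / platform signal" };
  }
  // Rule 3: no grounding -> conservative fallback; `system` is never a catch-all.
  const fallback: ConceptType = emitted === "system" ? "concept" : emitted;
  return { type: fallback, grounded: false, reason: "no grounded signal" };
}

// Usage: a deliverable that the emit path first typed as `system`.
const call = strengthenType("system", {
  singleClientSow: false,
  deliverable: true,
  phaseOrActivity: false,
  toolOrPlatform: false,
});
```

Returning a `grounded` flag and a `reason` alongside the type is a design choice for this sketch: it would let an eval judge or a reviewer see why a type call was made and tell a grounded call from a fallback.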
Why a task (forward-looking), not a fix-in-place or a new ADR
This is forward-looking loop improvement, not tenant-data cleanup: the RBA tenant is already type-disciplined, conformance-clean, and graph-clean after the two surgery passes (parse 100%, validateGraph 0/0). Building a heuristic on the emit path and validating it against the judge of the "Live extraction eval harness — what we measure is what extraction optimizes for" is an extraction-layer code change owned by the Knowledge-Extraction & GraphRAG Engineer, with the Principal Knowledge-Format Architect confirming the type calls. It is not a one-token hygiene correction, and it is not a new direction: it executes the lesson already recorded in Task Board — Audit Log / Dossier — Decision & Audit Log under DEC-0056's and DEC-0057's frames, so no new ADR is needed. The board was globbed before filing: the two sibling RBA tasks are the post-hoc curation (both done), and no open task covered the emit-time durable fix (a grep for "emit-time" / "type-discipline" / "heuristic" returned only those closed RBA surgery tasks). confidence: inferred (agent-filed).