Detector ensemble at the ingestion boundary with a MEASURED F2/recall target (Presidio recall-tuned + custom "P2" recognizers + cloud DLP) — measure detection, don't trust it

task-ingestion-detector-ensemble-measured-recall

task | confidence: inferred | status: backlog | 2026-06-19 | owner: ingestion-engineer
source: log-auditor, surfaced from recording 0059-untrusted-by-default-ingestion-serve-boundary (the sensitive-data + injection-defense research synthesis, research/2026-06-18-sensitive-data-and-injection-defense.md §5/§10/§12.1). Board globbed before filing: no open task covered a PII/PHI/PCI detector ensemble or a detection-recall harness (the existing extraction/dedup tasks are graph-quality, not sensitive-data detection).

Detector ensemble at the ingestion boundary with a measured F2/recall target

Layer 2 of the DEC-0059 boundary: Dossier's own detect-and-drop, before persistence/embedding/extraction (the chokepoint: the GraphRAG vector index is in-scope sensitive data because embeddings do not anonymize PII, so detection must precede the embed). Combine Presidio (recall-tuned, β=2) + custom recognizers for the client's "P2" tier + a cloud DLP as a second probabilistic opinion.

The non-negotiable: measure, don't trust

Every detector is probabilistic with irreducible false negatives, and vanilla Presidio is only ~0.35-0.38 F1 (configuration boosts F-score ~30%). So the deliverable is not "we ran Presidio"; it is a measured F2/recall number from a validation harness (Microsoft presidio-research, which scores at system/NER/recognizer level) against a labelled corpus, cleared against a documented bar. Tune for recall (β=2): a missed PII false negative that lands in the persisted store or index is worse than over-dropping.
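An added, runnable sketch of the two pieces above (not the task's, Presidio's or presidio-research's own code): a recall-first union of ensemble members, deduplicated so overlapping findings of one entity type count once, scored at entity level against a labelled corpus with a β=2 F-score, and a fail-closed recall gate. The corpus, the three detector stand-ins (regex stubs standing in for Presidio, the P2 recognizers and the cloud DLP), the entity names and the 0.90 bar are all constructed assumptions; a real harness keeps this shape and swaps in the actual recognizers and presidio-research scoring.

```python
"""Recall-gated scoring for an ingestion-boundary detector ensemble (added example).

Everything below is constructed: the corpus is not real data and the three
detectors are regex stand-ins, not Presidio, the P2 recognizers or a cloud DLP.
"""
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int      # exclusive
    entity: str


# Constructed labelled corpus: (text, gold spans).
CORPUS = [
    ("Contact alice@example.com for the file.", [Span(8, 25, "EMAIL")]),
    ("Member ID MRN-4482-9913 admitted on ward 3.", [Span(10, 23, "P2_MEMBER_ID")]),
    ("Card 4111 1111 1111 1111 was charged.", [Span(5, 24, "CREDIT_CARD")]),
    ("Call 555-0142 to reschedule.", [Span(5, 13, "PHONE")]),
    ("Build 2024 shipped on Tuesday.", []),
]


def regex_detector(patterns):
    """Build a stand-in detector from {entity: regex}; returns list[Span] for a text."""
    compiled = {ent: re.compile(rx) for ent, rx in patterns.items()}

    def detect(text):
        return [Span(m.start(), m.end(), ent)
                for ent, rx in compiled.items() for m in rx.finditer(text)]
    return detect


# Hypothetical ensemble members (stand-ins, not the real detectors).
PRESIDIO_STUB = regex_detector({
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "CREDIT_CARD": r"\b(?:\d{4}[ -]?){3}\d{4}\b",
})
P2_RECOGNIZER_STUB = regex_detector({"P2_MEMBER_ID": r"\bMRN-\d{4}-\d{4}\b"})
CLOUD_DLP_STUB = regex_detector({
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",      # duplicates Presidio's hit: must dedupe
    "LOCATION": r"\bward \d+\b",              # over-eager: a false positive here
    "DATE": r"\b(?:19|20)\d{2}\b",            # also a false positive on this corpus
})
ENSEMBLE = [PRESIDIO_STUB, P2_RECOGNIZER_STUB, CLOUD_DLP_STUB]


def ensemble_detect(text, detectors=ENSEMBLE):
    """Union of every member's spans (recall-first). Overlapping spans of the same
    entity type are merged so one finding is not counted twice in scoring."""
    merged = []
    for det in detectors:
        for s in det(text):
            dup = next((m for m in merged if m.entity == s.entity
                        and s.start < m.end and m.start < s.end), None)
            if dup is None:
                merged.append(s)
            else:
                merged.remove(dup)
                merged.append(Span(min(s.start, dup.start), max(s.end, dup.end), s.entity))
    return sorted(merged, key=lambda s: (s.start, s.end))


def score(corpus, detect=ensemble_detect, beta=2.0):
    """Entity-level scoring: a prediction is a hit if it overlaps a gold span of the
    same type (each gold span may be claimed once). Returns P, R, F-beta and misses."""
    tp = fp = fn = 0
    missed = []
    for text, gold in corpus:
        unclaimed = list(gold)
        for pred in detect(text):
            hit = next((g for g in unclaimed if g.entity == pred.entity
                        and pred.start < g.end and g.start < pred.end), None)
            if hit is None:
                fp += 1
            else:
                tp += 1
                unclaimed.remove(hit)
        fn += len(unclaimed)
        missed.extend(g.entity for g in unclaimed)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    fbeta = ((1 + b2) * precision * recall / (b2 * precision + recall)
             if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision,
            "recall": recall, "fbeta": fbeta, "missed": missed}


def recall_gate(result, recall_bar=0.90):
    """Fail closed: the ensemble is cleared for the boundary only when measured
    recall meets the documented bar (0.90 is an assumed value). Returns (cleared, reason)."""
    if result["recall"] >= recall_bar:
        return True, f"recall {result['recall']:.2f} >= bar {recall_bar:.2f}"
    return False, (f"recall {result['recall']:.2f} < bar {recall_bar:.2f}; "
                   f"missed entity types: {sorted(set(result['missed']))}")


if __name__ == "__main__":
    r = score(CORPUS)
    print(f"P={r['precision']:.2f} R={r['recall']:.2f} F2={r['fbeta']:.3f}")
    print(recall_gate(r))
```

On this corpus the union scores 3 hits, 2 false positives and 1 miss (P 0.60, R 0.75, F2 0.714 versus F1 0.667, which is the β=2 weighting favouring recall), and the gate refuses to clear because PHONE is missed by every member; that is exactly the silent false-negative class the bar exists to surface. Overlap matching is lenient (any overlap of the same type counts); a strict boundary match lowers the numbers and is the stricter setting to report alongside.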

This is a layer, not the guarantee

The architecture carries the guarantee (Untrusted-by-default ingestion & serve boundary: defense-in-depth to keep regulated data out and contain prompt injection); this detector is one layer. It pairs with the fail-closed quarantine (Fail-closed quarantine wrapper for the Unstructured file path (zero-element / encrypted-by-header / unknown-MIME → quarantine-by-default), because Unstructured fails EMPTY, not CLOSED) and the payload-free audit trail (Payload-free, tamper-evident audit/drop-log design: record the DECISION (per-item verdict + label snapshot + drop reason), NEVER the sensitive payload (the "audit paradox")) so that a residual false negative does not silently read as "clean."

Why a task, not a fix-in-place

Net-new ingestion-layer engineering (a configured ensemble + a measurement harness + a recall bar) — owner judgment + code, not hygiene. Detail and citations: research/2026-06-18-sensitive-data-and-injection-defense.md §5 and §10. confidence: inferred (agent-filed from DEC-0059).