Dumb-fast trace capture + off-hot-path distill/prune
0041-dumb-fast-trace-capture-off-hot-path-distill
- Reversibility
- two-way door
DEC-0041 — Dumb-fast trace capture + off-hot-path distill/prune
Reversibility: two-way door — every part is swappable (widen the matcher back to * to restore full capture; swap settings.json back to the node .mjs transport; the distiller/cleaner are off-hot-path scripts that touch no curated knowledge). The durable commitment is the two-tier shape itself — dumb-fast deterministic capture + smart off-hot-path distillation, with the human-curated layer as the single source of truth — which is DEC-0002 evolved, not replaced.
Context
An FDE deep read found that Dossier's own decision-capture/logging system felt slow and the repo felt bloated — and a ground-truth measurement pass this session (native Windows, PowerShell) confirmed real cost, not just perception:
- The PostToolUse hook was wired with matcher
"*"runningnode capture-trace.mjson every tool call (Establish the learning-loop & audit architecture). Measured: node cold-start ~122 ms, full hook ~147 ms/call — and ~99% of that is Node boot, not the script's work. Across a session of thousands of calls that is real, felt latency after every tool call. journal/traces/was unbounded: ~14 MB over 3 days, ~136 MB/month projected, with no rotation or pruning anywhere.- The felt bloat (distinct from repo bloat) also came from gitignored local scratch at the repo root —
bash.exe.stackdump, stray root PNGs,.playwright-mcp/— never tracked, but cluttering the working tree.
The capture mechanism is the raw machine tier of the learning loop (Establish the learning-loop & audit architecture). This decision keeps that two-layer model intact and evolves its capture mechanism for latency and bloat — it does not change what the curated layer is or who owns it.
Options considered
- Leave it (matcher
"*"+ per-callnode). Rejected: a measured ~147 ms tax on every tool call and an unbounded journal are exactly the "if a thing isn't [performant], fix it" bar in CLAUDE.md, on Dossier's own dogfood. - Keep
"*"but make the transport cheap. Helps the per-call cost but still fires (and appends) on every read/search/navigation call — capturing noise that is not a decision and still paying some cost on the majority of calls. Partial. - Keep
nodebut add async/batching inside the hook. Rejected: the cost is Node boot, not the work; batching inside a per-callnodespawn cannot remove the ~122 ms cold-start that dominates. - Narrow the matcher to mutating calls AND replace the transport with a dumb-fast append AND move all smart work off the hot path (chosen). Attacks all three measured problems at their root: fire on fewer calls, pay ~0 interpreter boot when we do, and defer every expensive operation to a deliberately-invoked off-hot-path script where latency is free.
Decision
Re-architect the learning loop's machine tier into a deliberately dumb-and-fast capture hook + a smart off-hot-path distiller, in four parts. This is the consequential, reversible call the user explicitly approved.
NARROW the PostToolUse matcher from
"*"toEdit|Write|MultiEdit|Bashin both.claude/settings.json(Dossier's dev repo) andplugin/dossier/hooks/hooks.json(the shipped plugin). Read/search/navigation calls are not decisions worth capturing — they were noise and latency. This deliberately relaxes the prior "every tool call is auto-captured" doctrine (the wording in CLAUDE.md and the spirit of Establish the learning-loop & audit architecture). Reverting is a one-token change — widen the matcher back to"*".Replace the per-call
nodespawn with a dumb-fast transport. On win32, a new.claude/hooks/capture-trace.cmd(cmd +findstr "^"raw append tojournal/traces/_pending.jsonl, no interpreter boot — measured ~75–84 ms warm vs node's ~147 ms, and ~0 on the now-narrowed-away majority of calls). The portablecapture-trace.mjsremains the transport that ships in the plugin (clients run any OS — Plugin + marketplace packaging — distribution as the agency wedge, built from the canonical .claude/ primitives) and the dev-repo fallback for non-win32 hosts. The capture is intentionally dumb: it appends the raw payload, unreshaped and unclipped; the tolerant distiller normalizes both the new raw-payload shape and the legacy{ts,tool,input,output}shape downstream.Move ALL the expensive work OFF the hot path into a new
scripts/trace-distill-prune.mjs(pnpm trace:prune): deterministic REDUCE of the raw journal (tool counts, files touched, bash commands, failure signals; tolerant of malformed lines, which it skips and counts) → DISTILL by handing the reduction (not the raw MBs) toclaude -punder the log-auditor's mandate (surface decisions missing fromknowledge/log.md, summarize the window, propose forward-looking improvements) → write a digest tojournal/digests/(gitignored review artifact) → PRUNE the raw file. The operator (or the log-auditor) promotes real items from a digest into the curated layer; curation stays human-judged (Establish the learning-loop & audit architecture). The distiller dogfoods the same subscriptionclaude -pCLI primitive the extraction moat uses (Claude-primitives-first build strategy / Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys)) — it mirrors@dossier/extraction'sdefaultCliRunner(win32cmd.exe /c claude, prompt on stdin) — so it adds CLI leverage in exactly the right place (off the hot path, no new dependency, no API key). Offline-testable via--no-llm(deterministic digest); retention via--keep-days(default 7); the live_pending.jsonlis always digested + cleared (it has no date and an always-fresh mtime, so age-based selection would never reach it).Add
scripts/clean-scratch.mjs(pnpm clean:scratch) to sweep gitignored local scratch. Safety rail: it refuses to delete anythinggit check-ignoredoes not already ignore;/screenshots/is the sanctioned capture home and is opt-in only.
Rationale
- The cost was Node boot, and the fix targets Node boot. With ~99% of the ~147 ms being cold-start, the only durable win is to stop booting an interpreter per call — hence the
.cmdtransport — and to fire on fewer calls — hence the narrowed matcher. Both were measured, not assumed. - Capture what's a decision, not every keystroke. A read or a search is navigation, not judgment; capturing it added latency and noise to the very signal the log-auditor must sift. Narrowing to mutating calls (
Edit|Write|MultiEdit|Bash) keeps the forensic record of what changed while dropping the record of what was looked at — an honest, reversible trade. - Latency is free off the hot path. The expensive operations (reshape, reduce, LLM distill, prune) don't belong in a per-call hook at all. Moving them into a deliberately-invoked script means they can be as smart and as slow as they need to be without taxing the session — and the distiller can use the full
claude -pprimitive rather than a fire-and-forget append. - Dogfood the moat's own CLI primitive in the loop. The distiller using the same subscription
claude -prunner as@dossier/extractionis Claude-primitives-first build strategy applied to our own learning loop: the recursive moat compounds on both sides, with no API key and no new dependency. - The single source of truth stays human-curated. Digests are gitignored intermediates, not knowledge. Nothing reaches
knowledge/without a human (or the log-auditor) promoting it — Establish the learning-loop & audit architecture's curated layer is untouched in kind, only fed differently. verified, notasserted. This is recorded against real measurement and live exercise this session, not design conviction — see Review for the exact evidence and its bounds.
Consequences
- Capture is now ~free on the hot path (win32): fires only on mutating calls, with no interpreter boot when it does. The portable
.mjsis unchanged in behavior and still what ships to clients (Plugin + marketplace packaging — distribution as the agency wedge, built from the canonical .claude/ primitives);settings.json's command is now win32-specific (a dev-host optimization), while the shipped plugin stays cross-platform. - The journal stops growing unbounded.
pnpm trace:prunereduces → distills → prunes;--keep-dayscontrols retention and the live_pending.jsonlis always cleared.journal/digests/is gitignored. - Doctrine relaxed, on the record. "Every tool call is auto-captured" (DEC-0002) becomes "every mutating tool call is auto-captured." CLAUDE.md was updated to say so; the cost is losing the forensic record of read/navigation calls. Reversible — widen the matcher to
"*". - A known win32 transport caveat, mitigated.
cmd + findstrcan mangle multibyte UTF-8 within string contents on pathological payloads; the structural NDJSON integrity is verified safe on large lines (a 100 KB single line was preserved intact and valid). Mitigated three ways: the tolerant distiller skips + counts malformed lines, the raw layer is ephemeral by design, and the portable.mjsfallback has no such limitation. - Do not re-introduce the cold-start elsewhere.
board-claim-guard.mjs(Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops §5) is a NodePreToolUsehook; wiring it standing as-is would add ~120 ms to every mutating call on top of capture. The README flags this explicitly, and a board task carries the follow-up — see Give board-claim-guard the dumb-fast treatment before it goes standing (no per-write node cold-start). This is the one place the cold-start could silently come back; it is now owned, not just noted. - The committed-hooks.json / generated-.mjs split is honored (Plugin + marketplace packaging — distribution as the agency wedge, built from the canonical .claude/ primitives):
plugin/dossier/hooks/hooks.jsonis hand-authored and committed (its matcher was narrowed here);plugin/dossier/hooks/capture-trace.mjsis generated from.claude/hooks/bybuild-plugin.mjsand gitignored. The plugin bundle stayed in sync (build-plugin --checkOK).
Review
Recorded verified against this session's evidence, with its bounds stated honestly (do not overstate):
- Latencies measured natively (PowerShell): node cold-start ~122 ms; full node hook ~147 ms; cmd+findstr append ~75–84 ms warm. A 100 KB single line was preserved intact and still valid JSON (the NDJSON structural contract holds).
- The new
.cmdhook fired live this session and correctly captured only Edit/Bash, not Read, to_pending.jsonlas valid NDJSON. - A real
claude -p(sonnet) distill of2026-06-15.jsonl(1,084 events) produced a substantive digest that surfaced genuine candidate decisions not in the log (font-source package strategy after E404s; WCAG 2.1 AA as the token contrast floor; RBA-as-first-pilot having no ADR; M365/SharePoint connector scope) plus forward-looking improvements (extract a sharedscripts/check-contrast.mjs— the WCAG helper was rewritten throwaway ~4×) and honest anomalies (a token test diverged from values mid-session). The tolerant parser handled 10,307 real events with 0 malformed. clean-scratchreclaimed 1.23 MB (5 targets), all confirmed gitignored.- Full suite green: 366 passed / 1 skipped, 40 files; plugin bundle in sync (
build-plugin --checkOK).
verified attaches to this re-architecture's behavior on win32 this session — the latency win, the matcher narrowing firing correctly, the distiller producing a real digest, the cleaner's safety rail. It does not assert the cmd transport is byte-safe across all possible payloads (the multibyte caveat above is the bound), nor that the cross-platform .mjs plugin path has been re-measured on a non-win32 host. Revisit if: the matcher misses a class of decision worth capturing (widen it); the cmd-transport multibyte caveat bites a real payload (fall back to the portable .mjs, which is unaffected); or the board-claim-guard goes standing without the dumb-fast treatment (Give board-claim-guard the dumb-fast treatment before it goes standing (no per-write node cold-start)).