Agentic board v1 — build the git-resident OKF task board (deterministic offline core, SDK reserved), resolving DEC-0024's four open questions and dogfooding Dossier's own repo first
0030-agentic-board-v1-build
- Reversibility
- two-way door
DEC-0030 — Agentic board v1 (build)
Reversibility: two-way door — the swappable internals (scheduler host, exact claim/lease mechanism, MCP-vs-direct access, the kanban renderer) remain swappable, exactly as DEC-0024 framed them; what this build makes harder to reverse is only what DEC-0024 already named durable (the board is OKF in the client's git and the git board, not the session, is the source of truth; autonomous loops are always bounded — caps + idempotency + pause flag + kill switch; governance is enforced by hooks, not model trust). One escalation is treated as a GATED, near-one-way door: flipping the workflow's reserved schedule: on arms an uncancellable loop (ScheduleWakeup has no external cancel API) — it requires the pause/idempotency/caps to be green and the worker proven on Dossier's own board first.
Context
Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops was a /deep-research-backed direction, not a built system — it explicitly carried four unresolved open questions and a ## Review that said "revisit when the board is actually built; promote toward verified only against a real end-to-end run." This session (2026-06-16) the board was planned in depth and built out: the open questions were resolved against the verified primitive constraints, the OKF task type was landed in the model + schema-as-code (the Principal Knowledge-Format Architect hand-off Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops reserved, now closed — see Dossier — The Knowledge Model (v0), Dossier — Work Items (the agentic board)), and the deterministic core was implemented and verified green offline.
This decision records the v1 build calls and resolves DEC-0024's four open questions. It sits on three load-bearing prerequisites already decided: Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system (the reserved AgentSdkOrchestrator seam this mirrors), The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime) (made repeated agent writes to a KB safe — the board's foundation), and KB-agnostic @dossier/site (renders any tenant's OKF KB) + runtime-driven site rendering + the Node-26 Windows build fix (every surface renders any tenant's KB via DOSSIER_KB/knowledgeDir(), which is what makes the dogfood→client path a config change). It honors Adopt OKF as Dossier's canonical knowledge format (the board is OKF in the client's git) and Claude-primitives-first build strategy (subagents/hooks/Agent-SDK seam).
Honest status up front. The @dossier/okf and @dossier/runtime pieces are verified green offline this session (tests pass, the seam is type-integral, the /board page renders over the real 8 task atoms). But no end-to-end GitHub Actions drain has run — the host, the claim race protocol, and the idempotent crash-resume are reasoned from verified primitives, not field-proven. So the overall direction is asserted, and DEC-0024's review condition ("a real dispatch that drains one task and lands a PR through review without a runaway") is not yet met.
Options considered
The five substrate/durability/access axes were already adjudicated in Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops (OKF-in-git; git-board-as-truth; bounded SDK loop; durable scheduler; hook governance; both direct + MCP access). This decision does not re-litigate them — it resolves the four open questions DEC-0024 left for the build, plus the v1 sequencing call.
OQ1 — durable host (cloud Routine vs GitHub Actions vs self-hosted runner).
- (a) GitHub Actions, `workflow_dispatch` (manual) ONLY, no active cron (chosen). `actions/checkout` materializes the tenant repo on disk (the OKF system of record, Adopt OKF as Dossier's canonical knowledge format) so the worker can read `knowledge/`, claim by editing frontmatter, commit, and push; `GITHUB_TOKEN` opens the PR that is the `review` handoff; `concurrency:{group}` serializes runs so two never race the same board; runner secrets gate the reserved key. A cold runner spends ~20-40s on checkout + install and is then bounded by the worker's caps, far below the Routines floor; free-tier minutes cover dogfooding here.
- (b) Cloud Routines (research preview). Rejected for v1 on the verified DEC-0024 §3 constraints: a fresh clone with NO local file access and a 1-hour minimum cannot operate a repo-resident worker that must edit frontmatter, commit, and push, and is too coarse for one-task draining.
- (c) Self-hosted Agent SDK runner. Reserved as the sovereignty fallback for a client whose repo cannot leave their network — the same workflow pointed at a different checkout, a config change, not a rebuild.
OQ2 — most robust git-native lock/lease under push races + stale leases + fresh-clone.
- (a) Single-file-per-task frontmatter claim/lease (`claimed_by` + `lease_expires`, quoted ISO-8601) enforced by a PreToolUse hook, made race-safe by the protocol around it (chosen). The hook (host process, outside the context window) is the deterministic gate (deny > ask > allow). Push race → one-file-per-task means a losing claim is rejected as a non-fast-forward push or serialized by `concurrency:{group}`. Stale lease → `lease_expires < now` (with a clock-skew grace) is reclaimable; the reclaimer overwrites the pair. Fresh-clone → moot in v1 (Routines rejected); the GH Actions checkout is a full worktree and the claim lives in committed frontmatter any clone sees.
- (b) A lock-file / ref namespace. Rejected (already in DEC-0024 §5): the frontmatter pair keeps state in the atom, single-source.
- (c) git-bug-style operation CRDT (Lamport + SHA-256 tiebreak). Reserved as the ceiling strictly for if/when many agents edit one board concurrently and single-file claiming proves insufficient.
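The stale-lease rule in option (a) can be sketched as a small pure function. This is a minimal illustration, not the shipped runtime code: the `LeaseFields` shape and `leaseState` name are hypothetical, the 5-minute grace is the value this record cites for the runtime and hook, and treating an unparseable timestamp as reclaimable is an assumption.

```typescript
// Hypothetical shape: only the two coordination fields named above.
interface LeaseFields {
  claimed_by?: string;
  lease_expires?: string; // quoted ISO-8601 in the task frontmatter
}

// Assumed value: 5 minutes of clock-skew grace.
const LEASE_STALE_GRACE_MS = 5 * 60 * 1000;

type LeaseState = "unclaimed" | "live" | "reclaimable";

function leaseState(
  task: LeaseFields,
  now: Date,
  graceMs: number = LEASE_STALE_GRACE_MS,
): LeaseState {
  if (!task.claimed_by || !task.lease_expires) return "unclaimed";
  const expires = Date.parse(task.lease_expires);
  // Assumption: an unparseable timestamp cannot prove a live lease.
  if (Number.isNaN(expires)) return "reclaimable";
  // Stale only once the expiry plus the grace window has fully passed.
  return expires + graceMs < now.getTime() ? "reclaimable" : "live";
}
```

Keeping the check pure (the clock is passed in) is what would let the runtime, the hook, and the site share one definition of "stale" and be tested offline.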
OQ3 — safest kill switch + idempotency given ScheduleWakeup has NO external cancel API.
- (a) Layered self-bounding + ship `workflow_dispatch` only (chosen). Because there is no external cancel API, the safest move is to never arm an unattended self-re-queuing loop in v1, then layer defense-in-depth so even an accidental loop is finite: a board-level pause flag checked at the start of every wake, per-run caps, a `run_id` idempotency tag, a max-fires circuit breaker, and a documented kill switch.
- (b) A Stop-hook circuit breaker / max-fires-in-git alone, with a live cron. Rejected for v1: relying on a brake while the loop is armed is strictly riskier than not arming it; the brakes are kept as defense-in-depth, not as the primary control.
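The layering in option (a) can be read as an ordered series of checks that each wake must pass before touching the board. The sketch below is illustrative only: `WakeState`, `Caps`, `checkWake`, and the `maxFires` counter are hypothetical names, and the order of the checks (pause first) is the only part taken from the text.

```typescript
// Hypothetical inputs; how each value is read from the repo is out of scope.
interface WakeState {
  pauseSentinelExists: boolean; // a pause sentinel file is present
  boardPausedFlag: boolean; // a paused flag is set on the board index
  firesSoFar: number; // wakes already recorded for this loop
  tasksDoneThisRun: number;
}

interface Caps {
  maxTasksPerRun: number; // the record gives a default of 1
  maxFires: number; // circuit-breaker ceiling (assumed name)
}

type Verdict = { proceed: true } | { proceed: false; reason: string };

function checkWake(state: WakeState, caps: Caps): Verdict {
  // The pause flag is checked first, before any claim is attempted.
  if (state.pauseSentinelExists || state.boardPausedFlag) {
    return { proceed: false, reason: "paused" };
  }
  // Circuit breaker: even an accidentally armed loop stays finite.
  if (state.firesSoFar >= caps.maxFires) {
    return { proceed: false, reason: "max-fires" };
  }
  // Per-run cap bounds the work a single wake can do.
  if (state.tasksDoneThisRun >= caps.maxTasksPerRun) {
    return { proceed: false, reason: "task-cap" };
  }
  return { proceed: true };
}
```

Each brake is independent, so removing one (for example, a missed pause flag) still leaves the caps and the circuit breaker in force.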
OQ4 — MCP-typed-tools vs direct file/CLI access, and the human-curation surface.
- (a) Direct frontmatter edits governed by the PreToolUse hook in v1; typed MCP board server reserved (chosen). DEC-0024 §4 explicitly refuted "MCP-only" and chose "allow both." Direct edits are the simplest path the built stack already supports (the worker reads `tasks/*.md` via the same loader the loop uses), and the governance that matters is at the hook layer regardless of transport, so making MCP a gate buys no safety while adding a network dependency that breaks offline CI (Extraction runtime architecture / Runtime orchestration & per-tenant control plane).
- (b) A typed MCP board server (`task_claim/update/complete`) as the gate. Reserved, not adopted: a genuine ergonomic upgrade for a future `AgentSdkBoardWorker` (a legible validated contract, a server-side seam for the same claim/lease check), shipped later as a separate package, never on the v1 critical path.
Sequencing — dogfood vs. client-first.
- (a) Dogfood Dossier's own `knowledge/` board FIRST (chosen). Prove + polish the bounded worker on this repo (default KB = `knowledgeDir()`), then point the identical workflow + worker at a client KB via `DOSSIER_KB`, never a second codebase.
- (b) Build straight for a client tenant. Rejected: autonomy is unproven; arming it on a client's repo before it is proven on our own violates the DEC-0024 §7 "human-initiated until proven" stance and the KB-agnostic @dossier/site dogfood-then-rollout discipline.
Decision
Build the agentic board v1 as a deterministic, offline, hook-governed git-resident OKF task board, dogfooded on Dossier's own repo first, with the live Agent SDK reserved behind a seam — resolving DEC-0024's four open questions as above. Concretely:
- Sequencing = dogfood-first; per-client rollout = config. The default target KB is this repo's own `knowledge/` via `knowledgeDir()`/`DOSSIER_KB` (KB-agnostic @dossier/site). Prove and polish the bounded worker on Dossier's own board, then point the same workflow + worker at a client KB by setting `DOSSIER_KB`: a configuration change, not a rebuild.
- OKF `task` type landed. A supporting `task` concept type is added to Dossier: The Knowledge Model (v0) (spine `status`/`priority`/`owner`-`assignee`/`dependencies`/`acceptance_criteria` + the coordination pair `claimed_by`/`lease_expires`; `status` is free text `z.string()` so verticals extend without forking; `dependencies` is an ordering hint, not a transitive hold) and mirrored in `@dossier/okf` schema-as-code (`TaskSchema` wired into `builtinSchemas`/`builtinTypes`/the exhaustive registry/the barrel + `TaskAtom` + `io.ts` `KEY_ORDER`). This closes the DEC-0024 hand-off.
- Runtime worker + seam. `DefaultBoardWorker` + `drainBoard` ship in a new `packages/runtime/src/board.ts`, independent of `loop.ts` (zero change to ingest→extract→emit→commit). It mirrors `orchestrator.ts` exactly: a `BoardWorker` interface is the contract `drainBoard` depends on; the deterministic, offline, no-LLM, no-network `DefaultBoardWorker` (name `default`) ships now; the live `AgentSdkBoardWorker` (name `agent-sdk`) is RESERVED. It is the sole place that would import `@anthropic-ai/claude-agent-sdk`, with `claim()` + `execute()` both rejecting `/RESERVED/`, byte-for-byte the `AgentSdkOrchestrator` pattern (Runtime orchestration & per-tenant control plane / Extraction runtime architecture). `drainBoard` runs four stages: select (deterministic total order: claimable = `backlog` or stale-lease, deps-honored as a soft gate, priority then id-tiebreak), claim (commit the claim, stamp a `run:<id>` idempotency tag, refresh the lease), a bounded deterministic step (no LLM), and transition + `commitAll`. All of it is bounded by `maxTasksPerRun` (default 1) and gated by the pause flag checked at the start of every wake.
- OQ1: GitHub Actions, manual only. `.github/workflows/board-drain.yml` ships `workflow_dispatch` ONLY; the `schedule:`/cron is commented with a "DO NOT UNCOMMENT" gate; `concurrency:{group: dossier-board-drain, cancel-in-progress: false}` serializes drains; least-privilege `permissions: {contents: write, pull-requests: write}`; the reserved `ANTHROPIC_API_KEY` is commented out; the PR step opens a review PR but never merges (one-PR-per-task). A self-hosted runner is the reserved sovereignty fallback.
- OQ2: single-file frontmatter claim/lease. `claimed_by` + `lease_expires` (quoted ISO-8601) per task, race-safe via the claim commit + `concurrency:{group}` serialization; a lease that has expired by more than the clock-skew grace is reclaimable; git-bug CRDT reserved as the heavy-concurrency ceiling.
- OQ3: layered self-bounding + documented kill switch. A board-level pause flag in two redundant forms (`knowledge/.board-pause` sentinel and `board_paused: true` on `tasks/index.md`) checked before any claim; `maxTasksPerRun` (default 1) + `maxTurns` + `maxBudgetUsd`; a `run:<id>` idempotency tag so a crash-mid-run re-wake converges instead of duplicating; a max-fires circuit breaker; and the documented 4-step kill switch in the workflow header (pause flag → standing "never enable the schedule until proven" rule → disable/delete the workflow → attended `Esc`/`CLAUDE_CODE_DISABLE_CRON=1`/kill). No active cron ships, because `ScheduleWakeup` has no external cancel API.
- Governance = a PreToolUse claim/lease hook, opt-in. `.claude/hooks/board-claim-guard.mjs` reads the target `tasks/**` atom's `claimed_by`/`lease_expires`, DENIES a write that breaches a live foreign lease (deny > ask > allow), ALLOWS the holder and reclaimable/unclaimed tasks, and FAILS-OPEN on any fs/parse error (never wedges the session; diagnostics to stderr, JSON-only on stdout, ~1500ms stdin timeout). It is confined to `<KB>/tasks` via an `isWithin` check before reading and is KB-agnostic via `DOSSIER_KB`. It is NOT auto-registered: `.claude/settings.json` carries only the PostToolUse capture hook (Establish the learning-loop & audit architecture); wiring the guard is opt-in and documented in its header + `.claude/hooks/README.md`.
- OQ4: direct edits + read-only site board + `review` gate. Agents and humans write the board by direct frontmatter edits governed by the hook; a typed `@dossier/mcp-board` server is reserved (activated alongside `AgentSdkBoardWorker`, never a gate). Human curation surface = the board markdown (edit/PR) + the read-only, KB-agnostic site `/board` route (`packages/site/src/pages/board.astro`, `getCollection(type === 'task')` against `knowledgeDir()`/`DOSSIER_KB`, KB-agnostic @dossier/site), which never writes the board; the `review` status is the approval handoff (a worker moves the task to `review` and opens a PR; a human merges to reach `done`).
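The worker seam described above can be sketched as one interface with a shipping implementation and a reserved one. This is a simplified illustration under assumptions, not the contents of `packages/runtime/src/board.ts`: the `TaskRef` type, the method signatures, and the `drainOne` helper are hypothetical; only the class names, the `default`/`agent-sdk` names, and the reject-with-`RESERVED` behaviour come from this record.

```typescript
// Hypothetical minimal task handle.
interface TaskRef {
  id: string;
}

// The contract the drain depends on (signatures assumed).
interface BoardWorker {
  readonly name: string;
  claim(task: TaskRef): Promise<void>;
  execute(task: TaskRef): Promise<string>;
}

// Deterministic, offline stand-in: no LLM, no network.
class DefaultBoardWorker implements BoardWorker {
  readonly name = "default";
  async claim(_task: TaskRef): Promise<void> {
    // A real worker would commit the claim here.
  }
  async execute(task: TaskRef): Promise<string> {
    return `step:${task.id}`;
  }
}

// Reserved seam: the only place a live SDK import would go.
class AgentSdkBoardWorker implements BoardWorker {
  readonly name = "agent-sdk";
  async claim(): Promise<void> {
    throw new Error("RESERVED: live Agent SDK worker is not enabled");
  }
  async execute(): Promise<string> {
    throw new Error("RESERVED: live Agent SDK worker is not enabled");
  }
}

// The caller sees only the interface, so swapping workers is a one-line change.
async function drainOne(worker: BoardWorker, task: TaskRef): Promise<string> {
  await worker.claim(task);
  return worker.execute(task);
}
```

Because the reserved class rejects on every call, a test can assert the seam stays closed without a key or a network connection.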
Rationale
- It resolves DEC-0024's open questions from the verified primitive constraints, not from preference. OQ1 falls out of Routines' fresh-clone/no-local-file-access + 1-hour floor vs. GH Actions' checkout/`GITHUB_TOKEN`/`concurrency` (DEC-0024 §3-4). OQ3's "ship manual only" is the only honest answer to "no external cancel API" (§7). OQ4's "direct edits, MCP reserved" is what §4's refutation of MCP-only already implies. Each answer is the protocol around the §5 frontmatter pair, not a heavier mechanism.
- Seam discipline is the moat, and it is exact. `BoardWorker` ↔ `Orchestrator`, `DefaultBoardWorker` ↔ `DefaultOrchestrator` (ships now, deterministic/offline), `AgentSdkBoardWorker` ↔ `AgentSdkOrchestrator` (the sole SDK plug point, rejects `/RESERVED/`), `drainBoard` ↔ `DefaultOrchestrator.orchestrate(runLoop)`. CI stays offline by construction: board governance can be verified offline before any live autonomy (Runtime orchestration & per-tenant control plane / Extraction runtime architecture).
- Governance is deterministic, not trusted, and fail-safe. The hook is fail-OPEN on error (a hook bug can never wedge the board) and fail-CLOSED on a live foreign lease; for the v1 default worker the actual race protection on a remote is the GH Actions `concurrency:{group}` serialization, not model trust.
- Sequencing keeps autonomy human-initiated and rollout cheap. Dogfooding on Dossier's own board first satisfies the §7 "human-initiated until proven" stance; because every surface is KB-agnostic (KB-agnostic @dossier/site), the client rollout is the same workflow + worker + hook reading a different `DOSSIER_KB`: config, not a second build.
- It rests on the compounding merge. The compounding merge made repeated agent writes to a KB safe (human curation is `preserved`, vanished atoms `orphaned` not deleted); without it, a board of agents writing to the KB would be unsafe. This build is the thing that prerequisite was for.
- `asserted`, not `verified`: the built pieces are green, the system is unproven. The `@dossier/okf` + `@dossier/runtime` code is reproduced green offline this session, but no end-to-end Actions drain has run, so the host, the claim race protocol, and the idempotent crash-resume are reasoned-from-primitives, not field-proven. The confidence sits on the direction, not on a run that has not happened.
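The fail-open / fail-closed split in the governance point can be illustrated with a pure decision function. This is a sketch under assumptions, not the shipped `board-claim-guard.mjs`: `readField`, `decideWrite`, the regex-based frontmatter read, and the `GRACE_MS` value are all hypothetical stand-ins, and a real hook would also handle stdin, path confinement, and the JSON output protocol.

```typescript
type Decision = "allow" | "deny";

const GRACE_MS = 5 * 60 * 1000; // assumed clock-skew grace

// Naive single-line frontmatter read; a real hook would use a YAML parser.
function readField(frontmatter: string, key: string): string | undefined {
  const re = new RegExp("^" + key + ':[ \\t]*"?([^"\\n]*?)"?[ \\t]*$', "m");
  const value = frontmatter.match(re)?.[1]?.trim();
  return value ? value : undefined;
}

function decideWrite(frontmatter: string, actor: string, now: Date): Decision {
  try {
    const holder = readField(frontmatter, "claimed_by");
    const expiresRaw = readField(frontmatter, "lease_expires");
    if (!holder || !expiresRaw) return "allow"; // unclaimed
    if (holder === actor) return "allow"; // the lease holder may write
    const expires = Date.parse(expiresRaw);
    if (Number.isNaN(expires)) {
      throw new Error(`unparseable lease_expires: ${expiresRaw}`);
    }
    // Fail closed only on a live foreign lease; a stale lease is reclaimable.
    return expires + GRACE_MS >= now.getTime() ? "deny" : "allow";
  } catch (err) {
    // Fail open: a parse error must never wedge the session.
    console.error("claim-guard sketch: failing open:", err);
    return "allow";
  }
}
```

The only path to `deny` is a successfully parsed, unexpired lease held by someone else; every error path resolves to `allow` with a diagnostic on stderr.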
Consequences
- DEC-0024's four open questions are resolved (recorded here), and its `## Review` condition is partially met. DEC-0024 should be read as "open questions resolved by DEC-0030; promotion to `verified` still pending a real end-to-end dispatch." Its core direction stands unchanged; this is the build that answers its follow-ups (see Review below for the cross-reference action).
- All v1 board pieces now exist in code: the `task` schema + model, `packages/runtime/src/board.ts` (`BoardWorker` + `DefaultBoardWorker` + `drainBoard`, `AgentSdkBoardWorker` reserved), `scripts/board-drain.mjs`, `.github/workflows/board-drain.yml`, `.claude/hooks/board-claim-guard.mjs` (+ `.claude/hooks/README.md`), and the read-only `/board` site surface, all verified green offline (see Verification).
- No live autonomy is armed. The workflow is manual-dispatch only; the claim-guard hook is opt-in (not in `settings.json`); the `AgentSdkBoardWorker` rejects. Enabling the reserved `schedule:` is a gated, near-one-way escalation (no cancel API) requiring the pause/idempotency/caps green and the worker proven on Dossier's own board first.
- Known integrity gaps inherited + amplified: routed, not silently shipped. (a) Dangling role edges: the seed tasks' `owner`/`assignee` and `[[role]]` body links resolve to no `type: role` atom (there are none in `knowledge/`; the roles live only as `.claude/agents/*.md`). This is a pre-existing KB-wide gap (10 committed decisions already use `owner: principal-architect` with no role atom), but the board makes `owner`/`assignee` first-class card fields wired through the route map, so the dead edges are now user-visible on `/board`. Routed to the Principal Knowledge-Format Architect (author `type: role` atoms in `knowledge/roles/`, or record a decision that owner/assignee are deliberately un-resolvable handles + flag them in `validateGraph`). (b) Stale-lease grace mismatch: the runtime + hook apply a 5-minute clock-skew grace before a lease is stale, but the site marks `stale/reclaimable` at `t <= now` (zero grace), so for ~5 minutes the `/board` can say "reclaimable" while the hook still denies and the selector won't pick it up. Routed: share one `LEASE_STALE_GRACE_MS` across surfaces. (c) Hook-doc overclaim: the hook header + README assert a worker "claim-commit-then-verify-HEAD" backstop that the shipping `DefaultBoardWorker` does not implement (its real race protection is the Actions `concurrency:{group}`; the in-worker verify-HEAD is reserved with `AgentSdkBoardWorker`). Routed: correct the hook docs to state the backstop precisely.
- A CI graph-lint gate is the right long-term catch for (a). The dangling role/`dependencies` edges are exactly what a `validateGraph` pass over `knowledge/**` would surface; that gate is not yet on (blocked behind the keystone-parser free-text conflict: Reconcile the decision reversibility field (free-text prose vs. the @dossier/okf enum), now Resolve the decision `reversibility` schema conformance gap). Sequencing that before leaning on `owner`/`assignee`/`dependencies` graph semantics converts these findings from "caught by a reviewer" to "caught by the loop."
- Build-side only. Nothing is added to the client-facing plugin subset; no sovereign behavior changes (the `/board` surface is read-only/zero-copy, Adopt OKF as Dossier's canonical knowledge format).
- Two-way vs. gated. The swappable internals stay swappable (host, claim/lease mechanism, MCP-vs-direct, renderer). The durable commitments are the three DEC-0024 already named (git-board-as-truth; always-bounded loops; hook-enforced governance). The one near-one-way door is enabling the `schedule:`.
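The dangling-edge class in gap (a) is mechanical to detect, which is why a graph-lint gate fits. The following is a hedged sketch of such a check, not the real `validateGraph`: the `Atom` and `Dangling` shapes and the `findDanglingEdges` name are hypothetical, and it assumes `owner`/`assignee` should resolve to a `type: role` atom while `dependencies` should resolve to any atom id.

```typescript
// Hypothetical in-memory view of loaded atoms.
interface Atom {
  id: string;
  type: string;
  owner?: string;
  assignee?: string;
  dependencies?: string[];
}

interface Dangling {
  from: string;
  field: string;
  target: string;
}

function findDanglingEdges(atoms: Atom[]): Dangling[] {
  const roleIds = new Set(atoms.filter((a) => a.type === "role").map((a) => a.id));
  const allIds = new Set(atoms.map((a) => a.id));
  const out: Dangling[] = [];
  for (const atom of atoms) {
    // Role handles must resolve to a role atom (assumed rule).
    for (const field of ["owner", "assignee"] as const) {
      const target = atom[field];
      if (target !== undefined && !roleIds.has(target)) {
        out.push({ from: atom.id, field, target });
      }
    }
    // Dependencies must resolve to some known atom id.
    for (const dep of atom.dependencies ?? []) {
      if (!allIds.has(dep)) out.push({ from: atom.id, field: "dependencies", target: dep });
    }
  }
  return out;
}
```

Run in CI over the loaded KB, a non-empty result would fail the build, which is the "caught by the loop" outcome the gate is meant to deliver.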
Verification (this session, offline)
- `pnpm -F @dossier/okf test`: green (the `task` schema prerequisite: `TaskSchema` + `task` in `builtinTypes`/registry, the union-branch count updated, task validation + JSON-schema cases added).
- `pnpm -F @dossier/runtime test`: green, 10 files / 60 tests, fully offline (no network, no key). Covered: the pause flag checked before any claim, `maxTasksPerRun` default 1, the `run:<id>` idempotency tag surviving parse→serialize→parse, the pure id-tiebroken selector, and the `AgentSdkBoardWorker` asserting `rejects(/RESERVED/)`, the same shape `orchestrator.test.ts` proves.
- `pnpm -F @dossier/runtime typecheck` (`tsc -b`): clean; the `BoardWorker` seam is type-integral. The runtime dependency chain builds clean (`dist/board.js` emitted).
- The `/board` page renders over the real atoms: an out-dir `astro build` (bypassing a Windows editor-watcher `dist` lock that blocks only the default `pnpm -F @dossier/site build` wrapper; an environment lock, not a code defect) emits `/board` populated with all 8 `knowledge/tasks/*.md` atoms across 6 lifecycle columns / 13 cards, zero empty-state. This is true end-to-end (OKF atom → content loader → view-model → `board.astro` → design CSS).
- Guardrails held (read-verified): `AgentSdkBoardWorker` still RESERVED; `.claude/settings.json` carries ONLY the PostToolUse capture (board-claim-guard NOT wired); `board-drain.yml` is `workflow_dispatch` only with cron commented, concurrency set, least-privilege, key commented, PR-opens-never-merges.
- NOT verified: no end-to-end GitHub Actions `workflow_dispatch` drain has run; the design is `asserted`, awaiting a real dispatch (below).
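The "pure id-tiebroken selector" exercised by the runtime tests can be pictured as a sort over claimable tasks. This sketch is an interpretation, not the shipped selector: the `Task` shape, numeric `priority` (lower is more urgent), the exclusion of `review`/`done` from stale-lease reclaim, and reading the "soft gate" as "tasks with unmet dependencies sort last but are not excluded" are all assumptions.

```typescript
// Hypothetical task view; field names follow the record, types are assumed.
interface Task {
  id: string;
  status: string;
  priority: number; // assumption: lower number = more urgent
  dependencies: string[];
  claimed_by?: string;
  lease_expires?: string;
}

function selectNext(tasks: Task[], now: Date, graceMs = 5 * 60 * 1000): Task | undefined {
  const done = new Set(tasks.filter((t) => t.status === "done").map((t) => t.id));
  const claimable = tasks.filter((t) => {
    if (t.status === "backlog") return true;
    if (t.status === "review" || t.status === "done") return false;
    if (!t.claimed_by || !t.lease_expires) return false;
    const expires = Date.parse(t.lease_expires);
    return !Number.isNaN(expires) && expires + graceMs < now.getTime(); // stale lease
  });
  // Soft gate: unmet dependencies push a task back, they do not block it.
  const depsUnmet = (t: Task): number => (t.dependencies.some((d) => !done.has(d)) ? 1 : 0);
  // Total order: deps-ready first, then priority, then id as the final tiebreak.
  return [...claimable].sort(
    (a, b) =>
      depsUnmet(a) - depsUnmet(b) ||
      a.priority - b.priority ||
      (a.id < b.id ? -1 : a.id > b.id ? 1 : 0),
  )[0];
}
```

Comparing ids with `<` rather than a locale-aware compare keeps the order identical on every runner, which is what makes the selection reproducible in offline tests.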
Review
Promote toward verified only against a real end-to-end run: trigger one workflow_dispatch board-drain on this repo that claims a single task, runs the bounded default step, commits the transition, and lands a PR through the review state without a runaway — exactly the bar Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops §Review set. That single run is the gate before the reserved schedule: is ever considered (a near-one-way door — no cancel API). Routed follow-ups, captured not closed:
- Cross-reference action: treat DEC-0024's four open questions as resolved by this record. When DEC-0024 is next edited, annotate its open-questions block + `## Review` with "resolved/built in DEC-0030 (promotion pending a real dispatch)." (Not rewriting DEC-0024 here: ids are stable and history is layered, per the DEC-0023/DEC-0028 precedent.)
- Principal Knowledge-Format Architect: resolve the dangling role edges. Author `type: role` atoms (or record a decision that `owner`/`assignee` are deliberately un-resolvable handles) and flag dangling `owner`/`assignee` in `validateGraph`. Sequence Resolve the decision `reversibility` schema conformance gap (the keystone parser) first, so a CI graph-lint over `knowledge/**` can catch this class.
- Astro Starlight Engineer / board workstream: align the stale-lease grace across the site, runtime, and hook (one shared `LEASE_STALE_GRACE_MS`); accept the `task` free-text `status` in the site content schema so `knowledge/tasks/` renders on our own surface (the integration-seam defect already logged this session: Render the board as a derived `/board` surface in @dossier/site).
- Runtime/architect: correct the hook docs to describe the backstop precisely (Actions `concurrency:{group}` serializes the default worker; in-worker verify-HEAD is reserved with `AgentSdkBoardWorker`); roll the demo `claimed` task's historical `lease_expires` to a live-looking value (or annotate it as an intentional stale-lease exemplar) so the dogfood board shows a healthy claim (Honor robots.txt in the keyless HttpConnector); trim the `tasks/index.md` "Current board" table to a link list (or label it a seed-time snapshot) so it does not become a second, drifting copy of `status`.
- Live activation (separate, gated): implement `AgentSdkBoardWorker.execute` as a bounded SDK session (`maxTurns` + `maxBudgetUsd`) behind `ANTHROPIC_API_KEY`, add a key-gated smoke test mirroring the live eval, wire the claim-guard into `settings.json`, and only then consider the reserved `schedule:`.