Agentic board v1 — build the git-resident OKF task board (deterministic offline core, SDK reserved), resolving DEC-0024's four open questions and dogfooding Dossier's own repo first

0030-agentic-board-v1-build

decision read as Explain confidence asserted status active 2026-06-16 owner principal-architect
Reversibility
two-way door

DEC-0030 — Agentic board v1 (build)

Reversibility: two-way door — the swappable internals (scheduler host, exact claim/lease mechanism, MCP-vs-direct access, the kanban renderer) remain swappable, exactly as DEC-0024 framed them; what this build makes harder to reverse is only what DEC-0024 already named durable (the board is OKF in the client's git and the git board, not the session, is the source of truth; autonomous loops are always bounded — caps + idempotency + pause flag + kill switch; governance is enforced by hooks, not model trust). One escalation is treated as a GATED, near-one-way door: flipping the workflow's reserved schedule: on arms an uncancellable loop (ScheduleWakeup has no external cancel API) — it requires the pause/idempotency/caps to be green and the worker proven on Dossier's own board first.

Context

Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops was a /deep-research-backed direction, not a built system — it explicitly carried four unresolved open questions and a ## Review that said "revisit when the board is actually built; promote toward verified only against a real end-to-end run." This session (2026-06-16) the board was planned in depth and built out: the open questions were resolved against the verified primitive constraints, the OKF task type was landed in the model + schema-as-code (the Principal Knowledge-Format Architect hand-off Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops reserved, now closed — see Dossier — The Knowledge Model (v0), Dossier — Work Items (the agentic board)), and the deterministic core was implemented and verified green offline.

This decision records the v1 build calls and resolves DEC-0024's four open questions. It sits on three load-bearing prerequisites already decided: Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system (the reserved AgentSdkOrchestrator seam this mirrors), The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime) (made repeated agent writes to a KB safe — the board's foundation), and KB-agnostic @dossier/site (renders any tenant's OKF KB) + runtime-driven site rendering + the Node-26 Windows build fix (every surface renders any tenant's KB via DOSSIER_KB/knowledgeDir(), which is what makes the dogfood→client path a config change). It honors Adopt OKF as Dossier's canonical knowledge format (the board is OKF in the client's git) and Claude-primitives-first build strategy (subagents/hooks/Agent-SDK seam).

Honest status up front. The @dossier/okf and @dossier/runtime pieces are verified green offline this session (tests pass, the seam is type-integral, the /board page renders over the real 8 task atoms). But no end-to-end GitHub Actions drain has run — the host, the claim race protocol, and the idempotent crash-resume are reasoned from verified primitives, not field-proven. So the overall direction is asserted, and DEC-0024's review condition ("a real dispatch that drains one task and lands a PR through review without a runaway") is not yet met.

Options considered

The five substrate/durability/access axes were already adjudicated in Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops (OKF-in-git; git-board-as-truth; bounded SDK loop; durable scheduler; hook governance; both direct + MCP access). This decision does not re-litigate them — it resolves the four open questions DEC-0024 left for the build, plus the v1 sequencing call.

OQ1 — durable host (cloud Routine vs GitHub Actions vs self-hosted runner).

  • (a) GitHub Actions, workflow_dispatch (manual) ONLY, no active cron (chosen). actions/checkout materializes the tenant repo on disk (the OKF system of record, Adopt OKF as Dossier's canonical knowledge format) so the worker can read knowledge/, claim by editing frontmatter, commit, and push; GITHUB_TOKEN opens the PR that is the review handoff; concurrency:{group} serializes runs so two never race the same board; runner secrets gate the reserved key. Cold runner ~20-40s checkout+install, then bounded by the worker's caps — far below the Routines floor, free-tier minutes cover dogfooding here.
  • (b) Cloud Routines (research preview). Rejected for v1 on the verified Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops §3 constraints: a fresh clone with NO local file access and a 1-hour minimum cannot operate a repo-resident worker that must edit frontmatter, commit, and push, and is too coarse for one-task draining.
  • (c) Self-hosted Agent SDK runner. Reserved as the sovereignty fallback for a client whose repo cannot leave their network — the same workflow pointed at a different checkout, a config change, not a rebuild.

OQ2 — most robust git-native lock/lease under push races + stale leases + fresh-clone.

  • (a) Single-file-per-task frontmatter claim/lease (claimed_by + lease_expires, quoted ISO-8601) enforced by a PreToolUse hook, made race-safe by the protocol around it (chosen). The hook (host process, outside the context window) is the deterministic gate (deny > ask > allow). Push race → one-file-per-task means a losing claim is rejected non-fast-forward / serialized by concurrency:{group}. Stale lease → lease_expires < now (with a clock-skew grace) is reclaimable; the reclaimer overwrites the pair. Fresh-clone → moot in v1 (Routines rejected); the GH Actions checkout is a full worktree and the claim lives in committed frontmatter any clone sees.
  • (b) A lock-file / ref namespace. Rejected (already in Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops §5) — the frontmatter pair keeps state in the atom, single-source.
  • (c) git-bug-style operation CRDT (Lamport + SHA-256 tiebreak). Reserved as the ceiling strictly for if/when many agents edit one board concurrently and single-file claiming proves insufficient.

OQ3 — safest kill switch + idempotency given ScheduleWakeup has NO external cancel API.

  • (a) Layered self-bounding + ship workflow_dispatch only (chosen). Because there is no external cancel API, the safest move is to never arm an unattended self-re-queuing loop in v1, then layer defense-in-depth so even an accidental loop is finite: a board-level pause flag checked at the start of every wake, per-run caps, a run_id idempotency tag, a max-fires circuit breaker, and a documented kill switch.
  • (b) A Stop-hook circuit breaker / max-fires-in-git alone, with a live cron. Rejected for v1 — relying on a brake while the loop is armed is strictly riskier than not arming it; the brakes are kept as defense-in-depth, not as the primary control.

OQ4 — MCP-typed-tools vs direct file/CLI access, and the human-curation surface.

Sequencing — dogfood vs. client-first.

Decision

Build the agentic board v1 as a deterministic, offline, hook-governed git-resident OKF task board, dogfooded on Dossier's own repo first, with the live Agent SDK reserved behind a seam — resolving DEC-0024's four open questions as above. Concretely:

  1. Sequencing = dogfood-first; per-client rollout = config. The default target KB is this repo's own knowledge/ via knowledgeDir()/DOSSIER_KB (KB-agnostic @dossier/site (renders any tenant's OKF KB) + runtime-driven site rendering + the Node-26 Windows build fix). Prove and polish the bounded worker on Dossier's own board, then point the same workflow + worker at a client KB by setting DOSSIER_KB — a configuration change, not a rebuild.

  2. OKF task type landed. A supporting task concept type is added to Dossier — The Knowledge Model (v0) (spine status/priority/owner-assignee/dependencies/acceptance_criteria + the coordination pair claimed_by/lease_expires; status is free text z.string() so verticals extend without forking; dependencies is an ordering hint, not a transitive hold) and mirrored in @dossier/okf schema-as-code (TaskSchema wired into builtinSchemas/builtinTypes/the exhaustive registry/the barrel + TaskAtom + io.ts KEY_ORDER). This closes the Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops hand-off.

  3. Runtime worker + seam. DefaultBoardWorker + drainBoard ship in a new packages/runtime/src/board.ts, independent of loop.ts (zero change to ingest→extract→emit→commit). It mirrors orchestrator.ts exactly: a BoardWorker interface is the contract drainBoard depends on; the deterministic, offline, no-LLM, no-network DefaultBoardWorker (name default) ships now; the live AgentSdkBoardWorker (name agent-sdk) is RESERVED — the sole place that would import @anthropic-ai/claude-agent-sdk, with claim() + execute() both rejecting /RESERVED/ — byte-for-byte the AgentSdkOrchestrator pattern (Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system / Extraction runtime architecture — the moat). drainBoard: select (deterministic total order — claimable = backlog or stale-lease, deps-honored as a soft gate, priority then id-tiebreak), claim (commit the claim, stamp a run:<id> idempotency tag, refresh the lease), a bounded deterministic step (no LLM), transition + commitAll, all bounded by maxTasksPerRun (default 1) and gated by the pause flag checked at the start of every wake.

  4. OQ1 — GitHub Actions, manual only. .github/workflows/board-drain.yml ships workflow_dispatch ONLY; the schedule:/cron is commented with a "DO NOT UNCOMMENT" gate; concurrency:{group: dossier-board-drain, cancel-in-progress: false} serializes drains; least-privilege permissions: {contents: write, pull-requests: write}; the reserved ANTHROPIC_API_KEY is commented out; the PR step opens a review PR but never merges (one-PR-per-task). A self-hosted runner is the reserved sovereignty fallback.

  5. OQ2 — single-file frontmatter claim/lease. claimed_by + lease_expires (quoted ISO-8601) per task, race-safe via the claim commit + concurrency:{group} serialization; a clock-skew grace makes an expired lease reclaimable; git-bug CRDT reserved as the heavy-concurrency ceiling.

  6. OQ3 — layered self-bounding + documented kill switch. A board-level pause flag in two redundant forms (knowledge/.board-pause sentinel and board_paused: true on tasks/index.md) checked before any claim; maxTasksPerRun (default 1) + maxTurns + maxBudgetUsd; a run:<id> idempotency tag so a crash-mid-run re-wake converges instead of duplicating; a max-fires circuit breaker; and the documented 4-step kill switch in the workflow header (pause flag → standing "never enable the schedule until proven" rule → disable/delete the workflow → attended Esc/CLAUDE_CODE_DISABLE_CRON=1/kill). No active cron shipsScheduleWakeup has no external cancel API.

  7. Governance = a PreToolUse claim/lease hook, opt-in. .claude/hooks/board-claim-guard.mjs reads the target tasks/** atom's claimed_by/lease_expires, DENIES a write that breaches a live foreign lease (deny > ask > allow), ALLOWS the holder and reclaimable/unclaimed tasks, and FAILS-OPEN on any fs/parse error (never wedges the session; diagnostics to stderr, JSON-only on stdout, ~1500ms stdin timeout) — confined to <KB>/tasks via an isWithin check before reading, KB-agnostic via DOSSIER_KB. It is NOT auto-registered: .claude/settings.json carries only the PostToolUse capture hook (Establish the learning-loop & audit architecture); wiring the guard is opt-in and documented in its header + .claude/hooks/README.md.

  8. OQ4 — direct edits + read-only site board + review gate. Agents and humans write the board by direct frontmatter edits governed by the hook; a typed @dossier/mcp-board server is reserved (activated alongside AgentSdkBoardWorker, never a gate). Human curation surface = the board markdown (edit/PR) + the read-only, KB-agnostic site /board route (packages/site/src/pages/board.astro, getCollection(type === 'task') against knowledgeDir()/DOSSIER_KB, KB-agnostic @dossier/site (renders any tenant's OKF KB) + runtime-driven site rendering + the Node-26 Windows build fix) which never writes the board; the review status is the approval handoff (a worker → review + opens a PR; a human merges to reach done).

Rationale

  • It resolves DEC-0024's open questions from the verified primitive constraints, not from preference. OQ1 falls out of Routines' fresh-clone/no-local-file-access + 1-hour floor vs. GH Actions' checkout/GITHUB_TOKEN/concurrency (Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops §3-4). OQ3's "ship manual only" is the only honest answer to "no external cancel API" (§7). OQ4's "direct edits, MCP reserved" is what §4's refutation of MCP-only already implies. Each answer is the protocol around the §5 frontmatter pair, not a heavier mechanism.
  • Seam discipline is the moat, and it is exact. BoardWorkerOrchestrator, DefaultBoardWorkerDefaultOrchestrator (ships now, deterministic/offline), AgentSdkBoardWorkerAgentSdkOrchestrator (the sole SDK plug point, rejects /RESERVED/), drainBoardDefaultOrchestrator.orchestrate(runLoop). CI stays offline by construction — board governance is provably correct offline before any live autonomy (Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system / Extraction runtime architecture — the moat).
  • Governance is deterministic, not trusted — and fail-safe. The hook is fail-OPEN on error (a hook bug can never wedge the board) and fail-CLOSED on a live foreign lease; for the v1 default worker the actual race protection on a remote is the GH Actions concurrency:{group} serialization, not model trust.
  • Sequencing keeps autonomy human-initiated and rollout cheap. Dogfooding on Dossier's own board first satisfies the §7 "human-initiated until proven" stance; because every surface is KB-agnostic (KB-agnostic @dossier/site (renders any tenant's OKF KB) + runtime-driven site rendering + the Node-26 Windows build fix), the client rollout is the same workflow + worker + hook reading a different DOSSIER_KB — config, not a second build.
  • It rests on the compounding merge. The compounding merge — the per-tenant learning loop accumulates by id + confidence instead of overwriting (okf reconcile() + opt-in reconcile in extraction/runtime) made repeated agent writes to a KB safe (human curation is preserved, vanished atoms orphaned not deleted); without it, a board of agents writing to the KB would be unsafe. This build is the thing that prerequisite was for.
  • asserted, not verified — the built pieces are green, the system is unproven. The @dossier/okf + @dossier/runtime code is reproduced green offline this session, but no end-to-end Actions drain has run, so the host, the claim race protocol, and the idempotent crash-resume are reasoned-from-primitives, not field-proven. The confidence sits on the direction, not on a run that has not happened.

Consequences

  • DEC-0024's four open questions are resolved (recorded here), and its ## Review condition is partially met. Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops should be read as "open questions resolved by DEC-0030; promotion to verified still pending a real end-to-end dispatch." Its core direction stands unchanged; this is the build that answers its follow-ups (see Review below for the cross-reference action).
  • All v1 board pieces now exist in code: the task schema + model, packages/runtime/src/board.ts (BoardWorker + DefaultBoardWorker + drainBoard, AgentSdkBoardWorker reserved), scripts/board-drain.mjs, .github/workflows/board-drain.yml, .claude/hooks/board-claim-guard.mjs (+ .claude/hooks/README.md), and the read-only /board site surface — verified green offline (see Verification).
  • No live autonomy is armed. The workflow is manual-dispatch only; the claim-guard hook is opt-in (not in settings.json); the AgentSdkBoardWorker rejects. Enabling the reserved schedule: is a gated, near-one-way escalation (no cancel API) requiring the pause/idempotency/caps green and the worker proven on Dossier's own board first.
  • Known integrity gaps inherited + amplified — routed, not silently shipped. (a) Dangling role edges: the seed tasks' owner/assignee and [[role]] body links resolve to no type: role atom (there are none in knowledge/; the roles live only as .claude/agents/*.md). This is a pre-existing KB-wide gap (10 committed decisions already use owner: principal-architect with no role atom), but the board makes owner/assignee first-class card fields wired through the route map, so the dead edges are now user-visible on /board. Routed to the Principal Knowledge-Format Architect (author type: role atoms in knowledge/roles/, or record a decision that owner/assignee are deliberately un-resolvable handles + flag them in validateGraph). (b) Stale-lease grace mismatch: the runtime + hook apply a 5-minute clock-skew grace before a lease is stale, but the site marks stale/reclaimable at t <= now (zero grace) — for ~5 minutes the /board can say "reclaimable" while the hook still denies and the selector won't pick it up. Routed: share one LEASE_STALE_GRACE_MS across surfaces. (c) Hook-doc overclaim: the hook header + README assert a worker "claim-commit-then-verify-HEAD" backstop that the shipping DefaultBoardWorker does not implement (its real race protection is the Actions concurrency:{group}; the in-worker verify-HEAD is reserved with AgentSdkBoardWorker). Routed: correct the hook docs to state the backstop precisely.
  • A CI graph-lint gate is the right long-term catch for (a). The dangling role/dependencies edges are exactly what a validateGraph pass over knowledge/** would surface; that gate is not yet on (blocked behind Reconcile the decision reversibility field — free-text prose vs. the @dossier/okf enum, the keystone-parser free-text conflict — now Resolve the decision `reversibility` schema conformance gap). Sequencing that before leaning on owner/assignee/dependencies graph semantics converts these findings from "caught by a reviewer" to "caught by the loop."
  • Build-side only. Nothing is added to the client-facing plugin subset; no sovereign behavior changes (the /board surface is read-only/zero-copy, Adopt OKF as Dossier's canonical knowledge format).
  • Two-way vs. gated. The swappable internals stay swappable (host, claim/lease mechanism, MCP-vs-direct, renderer). The durable commitments are the three DEC-0024 already named (git-board-as-truth; always-bounded loops; hook-enforced governance). The one near-one-way door is enabling the schedule:.

Verification (this session, offline)

  • pnpm -F @dossier/okf testgreen (the task schema prerequisite: TaskSchema + task in builtinTypes/registry, the union-branch count updated, task validation + JSON-schema cases added).
  • pnpm -F @dossier/runtime testgreen, 10 files / 60 tests, fully offline (no network, no key): the pause flag checked before any claim, maxTasksPerRun default 1, the run:<id> idempotency tag surviving parse→serialize→parse, the pure id-tiebroken selector, and the AgentSdkBoardWorker asserting rejects(/RESERVED/) — the same shape orchestrator.test.ts proves.
  • pnpm -F @dossier/runtime typecheck (tsc -b) — clean; the BoardWorker seam is type-integral. The runtime dependency chain builds clean (dist/board.js emitted).
  • The /board page renders over the real atoms: an out-dir astro build (bypassing a Windows editor-watcher dist lock that blocks only the default pnpm -F @dossier/site build wrapper — an environment lock, not a code defect) emits /board populated with all 8 knowledge/tasks/*.md atoms across 6 lifecycle columns / 13 cards, zero empty-state — true end-to-end (OKF atom → content loader → view-model → board.astro → design CSS).
  • Guardrails held (read-verified): AgentSdkBoardWorker still RESERVED; .claude/settings.json carries ONLY the PostToolUse capture (board-claim-guard NOT wired); board-drain.yml is workflow_dispatch only with cron commented, concurrency set, least-privilege, key commented, PR-opens-never-merges.
  • NOT verified: no end-to-end GitHub Actions workflow_dispatch drain has run — the design is asserted, awaiting a real dispatch (below).

Review

Promote toward verified only against a real end-to-end run: trigger one workflow_dispatch board-drain on this repo that claims a single task, runs the bounded default step, commits the transition, and lands a PR through the review state without a runaway — exactly the bar Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops §Review set. That single run is the gate before the reserved schedule: is ever considered (a near-one-way door — no cancel API). Routed follow-ups, captured not closed:

  • Cross-reference action: treat Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops's four open questions as resolved by this record — when DEC-0024 is next edited, annotate its open-questions block + ## Review with "resolved/built in DEC-0030 (promotion pending a real dispatch)." (Not rewriting DEC-0024 here — ids are stable and history is layered, per the DEC-0023/DEC-0028 precedent.)
  • Principal Knowledge-Format Architect — resolve the dangling role edges: author type: role atoms (or record a decision that owner/assignee are deliberately un-resolvable handles) and flag dangling owner/assignee in validateGraph. Sequence Resolve the decision `reversibility` schema conformance gap (the keystone parser) so a CI graph-lint over knowledge/** can catch this class.
  • Astro Starlight Engineer / board workstream — align the stale-lease grace across the site, runtime, and hook (one shared LEASE_STALE_GRACE_MS); accept the task free-text status in the site content schema so knowledge/tasks/ renders on our own surface (the integration-seam defect already logged this session — Render the board as a derived `/board` surface in @dossier/site).
  • Runtime/architect — correct the hook docs to describe the backstop precisely (Actions concurrency:{group} serializes the default worker; in-worker verify-HEAD is reserved with AgentSdkBoardWorker); roll the demo claimed task's historical lease_expires to a live-looking value (or annotate it as an intentional stale-lease exemplar) so the dogfood board shows a healthy claim (Honor robots.txt in the keyless HttpConnector); trim the tasks/index.md "Current board" table to a link list (or label it a seed-time snapshot) so it does not become a second, drifting copy of status.
  • Live activation (separate, gated) — implement AgentSdkBoardWorker.execute as a bounded SDK session (maxTurns + maxBudgetUsd) behind ANTHROPIC_API_KEY, add a key-gated smoke test mirroring the live eval, wire the claim-guard into settings.json, and only then consider the reserved schedule:.