Agentic "sprint board" architecture — a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops

0024-agentic-board-architecture

decision · read as: Explain · confidence: asserted · status: active · 2026-06-16 · owner: principal-architect

Reversibility: two-way door

DEC-0024 — Agentic "sprint board" architecture

Reversibility: two-way door — the internals are swappable (scheduler host, exact claim/lease mechanism, MCP-vs-direct board access, kanban renderer); the durable, harder-to-reverse commitments are that the board is OKF in the client's git and the git board (not the session) is the source of truth, that autonomous loops are always bounded (caps + idempotency + kill switch), and that governance is enforced by hooks, not model trust.

Context

In this session (2026-06-16), the founder ran the /deep-research workflow on a single question: how to implement a "sprint board" / work-item backlog that Dossier's agents continually and autonomously work on, built strictly on Claude primitives. Scope was locked first via clarifying questions: all four layers (loop runtime, board data model, human curation surface, multi-agent coordination); trigger model = scheduled/background autonomous; breadth = Claude-primitives only (no LangGraph / OpenAI / Devin). The research returned 22 sources, with 23 of 25 claims adversarially verified (most 3-0) against primary Anthropic docs (Agent SDK agent-loop, scheduled-tasks, hooks, Routines) plus git-native board OSS (Backlog.md, llm-backlog, ticket/tk, git-bug). All findings are current to ~June 2026 and reference fast-moving features (Routines research preview, Claude Code v2.1.72+).

This decision records the architecture-direction call for how Dossier will build the agentic board when it builds it. It is a direction backed by verified primitive-level facts — NOT a built system. Nothing here is implemented. The primitive-level mechanics (the SDK loop, caps, session semantics, hooks, cron/Routines surfaces) are verified against primary docs; the overall architecture is asserted design conviction, and real open questions remain (carried below).

This is also the natural successor to "Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system", which reserved the live AgentSdkOrchestrator behind a seam. The board is the work queue that a self-directed loop drains; the research refines how that reserved loop must actually run.

Options considered

1. Board substrate — where do work items live.

(a) OKF markdown+YAML in the client's git (chosen). One atomic task concept per work item; the git history IS the audit trail. The only option consistent with OKF-everywhere (Adopt OKF as Dossier's canonical knowledge format) and the sovereignty promise (Dossier — Mission & North Star). Validated pattern — Backlog.md, llm-backlog, and ticket/tk all do markdown+YAML-in-git task boards.
(b) External tracker (Linear / GitHub Projects / Jira). Rejected: not sovereign, not single-source-of-truth, off-platform — the client's backlog and its history would live in someone else's system.
(c) A database / run-store. Rejected: makes the platform the system of record for the client's own work, the exact lock-in that both "Adopt OKF as Dossier's canonical knowledge format" and "Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system" refuse.
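To make option (a) concrete, here is a minimal sketch of what one markdown+YAML task file might look like and how a worker could read it. The field names follow the frontmatter spine named in the Decision section, but the actual schema is not designed in this record: the values, the task ids, the `type` field and the `parse_task` helper are all hypothetical. The parser handles only flat `key: value` pairs and inline lists; a real implementation would likely use a proper YAML library.

```python
# Hypothetical task file: one atomic work item, frontmatter + markdown body.
TASK_FILE = """\
---
id: task-0042
type: task
status: backlog
priority: 2
owner: unassigned
dependencies: [task-0040, task-0041]
acceptance_criteria: Export endpoint returns CSV for a tenant
---
# Add CSV export

Free-form markdown body for humans and agents.
"""


def parse_task(text: str) -> tuple[dict, str]:
    """Split a task file into (frontmatter dict, markdown body).

    Supports only flat `key: value` lines and inline `[a, b]` lists.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening frontmatter fence")
    try:
        end = lines.index("---", 1)  # closing fence
    except ValueError:
        raise ValueError("missing closing frontmatter fence") from None

    meta: dict = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            inner = value[1:-1].strip()
            meta[key.strip()] = [v.strip() for v in inner.split(",")] if inner else []
        else:
            meta[key.strip()] = value  # all scalars are kept as strings
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


meta, body = parse_task(TASK_FILE)
```

Because each task is a single small file, a status change is a one-file diff, which is what lets the git history double as the audit trail.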

2. Durability anchor — what survives between runs.

(a) The git board is the source of truth; every wake is a stateless, self-describing task pickup (chosen). State lives in the board, not the loop.
(b) Rely on Agent SDK session resume. Rejected on verified durability caveats: SDK sessions persist the conversation, not the filesystem; session files are host-local (~/.claude/projects/<encoded-cwd>/*.jsonl); a cwd mismatch silently spawns a fresh session; and Anthropic's own docs advise NOT relying on session resume cross-host — "capture the results you need." Resumption is an optimization, never the system of record.
(c) Add a durable workflow engine (e.g. Temporal). Noted as the heavier ceiling; deferred as overkill for v1 — the git board already provides durable, inspectable state.
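The "stateless, self-describing task pickup" in option (a) can be sketched as a pure function over the board's current contents: a wake reads every task from git, then picks the next one without consulting any session memory. This is an illustration under stated assumptions, not a specified algorithm: the `next_task` name, the dict shape, and the rule that a lower `priority` number means more urgent are all hypothetical.

```python
from typing import Optional


def next_task(board: list[dict]) -> Optional[dict]:
    """Return the most urgent backlog task whose prerequisites are all done.

    `board` is the list of task frontmatter dicts loaded fresh from git on
    this wake; nothing is carried over from a previous run.
    """
    done = {t["id"] for t in board if t["status"] == "done"}
    ready = [
        t for t in board
        if t["status"] == "backlog" and set(t.get("dependencies", [])) <= done
    ]
    # Assumption: lower number = higher priority; id breaks ties deterministically.
    return min(ready, key=lambda t: (t["priority"], t["id"]), default=None)


board = [
    {"id": "task-1", "status": "done", "priority": 1, "dependencies": []},
    {"id": "task-2", "status": "backlog", "priority": 1, "dependencies": ["task-3"]},
    {"id": "task-3", "status": "in_progress", "priority": 2, "dependencies": []},
    {"id": "task-4", "status": "backlog", "priority": 3, "dependencies": ["task-1"]},
]
picked = next_task(board)  # task-2 is blocked by task-3, so task-4 is picked
```

Since the function depends only on the board, running it on a different host or after a lost session yields the same pick, which is the property that session resume cannot guarantee.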

3. Scheduler — what wakes the worker, unattended.

(a) A durable host — GitHub Actions / self-hosted Agent SDK runner / cloud Routines (chosen for unattended autonomy).
(b) Session-scoped /loop + ScheduleWakeup + the Cron tools (CronCreate/List/Delete). Rejected as the durability anchor: these only fire while Claude Code is open and idle (session-scoped). Fine for attended draining of the board; not for unattended background cadence. Sharp constraint to carry: cloud Routines (research preview) have a 1-hour minimum and run on a fresh clone with NO local file access, which likely makes GitHub Actions or a self-hosted SDK runner the realistic durable host for a repo-resident board worker.

4. Agent access to the board — how the loop reads/writes tasks.

(a) Allow BOTH direct file/CLI edits AND an optional MCP server (chosen). Both paths are valid.
(b) Mandate MCP-only (agents never touch files). Rejected — and explicitly refuted by the research (1-2): the claim that agents must go through MCP and never touch files did not hold. Do not mandate MCP-only; an optional typed MCP surface (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB) is an upgrade, not a gate.

5. Coordination — keeping concurrent agents off each other's work.

(a) A frontmatter claim/lease pair (e.g. claimed_by + lease_expires) guarded by a PreToolUse hook (chosen for v1). Simplest mechanism that is deterministic, not model-trusted.
(b) git-bug-style operation CRDT (Lamport clocks + SHA-256 tiebreak). Reserved as the ceiling for heavy concurrent multi-agent editing — adopt only if simple one-file-per-task claiming proves insufficient.
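The claim/lease check in option (a) is small enough to sketch. The function below is the kind of deterministic rule a PreToolUse hook could evaluate before letting an agent write to a task file. The `claimed_by` and `lease_expires` names come from this record; the `check_claim` function, the ISO-8601 timestamp format, and the choice to treat a missing or expired lease as reclaimable are assumptions for illustration. Wiring the result into an actual hook is out of scope here.

```python
from datetime import datetime, timezone


def check_claim(task: dict, agent_id: str, now: datetime) -> str:
    """Return "allow" or "deny" for a write to `task` by `agent_id`.

    Assumes `lease_expires` is an ISO-8601 string with a UTC offset and
    that `now` is timezone-aware.
    """
    holder = task.get("claimed_by")
    expires = task.get("lease_expires")
    if not holder or holder == agent_id:
        return "allow"  # unclaimed, or claimed by the caller
    if expires is None or datetime.fromisoformat(expires) <= now:
        return "allow"  # stale lease, e.g. left behind by a crashed agent
    return "deny"  # another agent holds a live lease


now = datetime(2026, 6, 16, 12, 0, tzinfo=timezone.utc)
task = {"claimed_by": "agent-a", "lease_expires": "2026-06-16T12:30:00+00:00"}
decision = check_claim(task, "agent-b", now)  # agent-a's lease is still live
```

This check alone does not resolve push races between two agents that claim at the same moment; that gap is one of the open questions recorded at the end of this decision.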

Decision

Dossier will implement its agentic "sprint board" as a git-resident OKF task board, worked by bounded Claude Agent SDK loops on a durable scheduler, with hooks as the governance/locking layer and explicit human approval gates. Concretely:

Board = OKF. One atomic markdown+YAML task concept per work item, in the client's own git repo (a new task concept type extending Dossier — The Knowledge Model (v0)). Frontmatter spine: status (e.g. backlog / claimed / in_progress / review / done / blocked), priority, owner/assignee, dependencies (prereq task ids → blocking/ordering), acceptance_criteria, plus OKF provenance. The git history is the audit trail.
The git board — not the agent session — is the durable source of truth. This is the load-bearing correction from the research. Because SDK sessions persist the conversation (not the filesystem), are host-local, and resume cross-host is explicitly discouraged, every wake is a fresh, self-describing task pickup from git; state lives in the board, not the loop.
Loop runtime = Claude Agent SDK, headless, BOUNDED. The SDK embeds Claude Code's autonomous agent loop as a standalone package (no CLI install). Each run is bounded by maxTurns + maxBudgetUsd (Anthropic: "Setting a budget is a good default for production agents"); a turn-capped run returns error_max_turns and can resume from the saved session_id with a higher cap. Persistent task rules (objective + acceptance criteria) belong in CLAUDE.md (re-injected every request, survives lossy auto-compaction); a PreCompact hook can archive the transcript — tying into the existing journal/traces/ capture (Establish the learning-loop & audit architecture).
Scheduling = a DURABLE surface, not session-scoped /loop. Session /loop, ScheduleWakeup, and the Cron tools fire only while Claude Code is open and idle. Durable unattended cadence requires cloud Routines (research preview, 1-hour minimum, fresh clone, no local file access), Desktop scheduled tasks, or GitHub Actions. Routines' no-local-file-access + 1-hour floor likely make GitHub Actions or a self-hosted Agent SDK runner the realistic durable host for a repo-resident board worker.
Governance/locking = hooks. PreToolUse hooks run in the host process outside the agent context window, and a deny short-circuits the tool call (deny > ask > allow) — making claim/lock enforcement deterministic, not model-trusted. The claim/lease (claimed_by + lease_expires) is a frontmatter pair guarded by such a hook. git-bug's operation-based CRDT (Lamport clocks + SHA-256 tiebreak) is the conflict-free-concurrent-edit ceiling if simple one-file-per-task claiming proves insufficient.
Human curation = the board files + approval gates. Tasks are human-readable markdown, so humans curate by editing/PR; a thin kanban view can render over the OKF files later (Backlog.md proves a terminal + drag-drop web board over markdown-in-git). The canonical agent-on-board workflow is one-task-per-session / one-PR-per-task with a plan-before-coding human approval checkpoint (the "agent inbox" pattern); a review state is the handoff gate.
Bound the runaway-loop hazard. ScheduleWakeup has no external cancel API (verified; the lone non-unanimous claim at 2-1 — the runaway-pattern itself was 3-0). Today's escapes are human Esc, 7-day recurring-task expiry, CLAUDE_CODE_DISABLE_CRON=1, or killing the process. Any self-re-queuing loop MUST be self-bounded: turn/budget caps + idempotency keys + a board-level pause flag checked at every wake + a documented kill switch.
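The deny > ask > allow ordering in the governance item above is what makes a single deny decisive. A minimal sketch of that precedence rule, assuming several hooks each return one of the three decisions as a string; the `resolve` helper and the `PRECEDENCE` table are hypothetical names, and this is not the SDK's own implementation.

```python
# Higher number = more restrictive. Ordering taken from "deny > ask > allow".
PRECEDENCE = {"deny": 2, "ask": 1, "allow": 0}


def resolve(decisions: list[str]) -> str:
    """Combine hook decisions: the most restrictive one wins."""
    if not decisions:
        raise ValueError("no hook decisions to combine")
    return max(decisions, key=PRECEDENCE.__getitem__)


outcome = resolve(["allow", "deny", "ask"])  # one deny overrides the rest
```

Under this rule a claim-guard hook never needs the cooperation of other hooks or of the model: its deny cannot be outvoted.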

Rationale

It is the only design consistent with the two hard constraints. OKF-everywhere (Adopt OKF as Dossier's canonical knowledge format) and Claude-primitives-first (Claude-primitives-first build strategy) jointly determine the answer: the board is just more OKF, and the workers are just the Agent SDK loop Dossier already reserved. There was no real second design that honored both.
It activates, rather than duplicates, the reserved AgentSdkOrchestrator seam from Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system. The board is the work queue that a self-directed loop drains; the research refines how that reserved loop must actually run — bounded, resumable, git-as-truth, hook-governed. This connects the reserved seam to a concrete operating model without rewriting the control plane.
Sovereignty stays literally true. The board and its full history live in the client's own git, consistent with the git-per-tenant iteration record from Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system and Adopt OKF as Dossier's canonical knowledge format (the client owns the record of how their work was done, not the platform).
Governance is deterministic, not trusted. Putting claim/lock enforcement in a PreToolUse hook (host process, outside the context window, deny short-circuits the call) means correctness does not depend on the model choosing to behave — the same discipline that makes the audit hook reliable (Establish the learning-loop & audit architecture).
Asserted, not verified. The primitive-level facts (SDK loop, caps, session semantics, hooks, cron/Routines surfaces) are verified against primary Anthropic docs, but the overall architecture is asserted design conviction: nothing is built and real open questions remain. This is a direction, not a proven system, and the four open questions below are unresolved.

Consequences

A new OKF task concept type is implied — to be designed in Dossier — The Knowledge Model (v0) (frontmatter spine sketched in the Decision above: status / priority / owner-assignee / dependencies / acceptance_criteria + provenance). Hand-off candidate for the Principal Knowledge-Format Architect; the schema is NOT designed here — recorded as a follow-up only.
The reserved AgentSdkOrchestrator seam now has a concrete operating model to satisfy when it is built: bounded runs, git-as-truth stateless wakes, hook-governed claims, durable scheduling. The reservation in "Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system" and this direction must stay in sync.
A durable scheduling host must be chosen and stood up before any unattended autonomy ships — and it is likely GitHub Actions or a self-hosted SDK runner, given Routines' no-local-file-access + 1-hour floor.
A kill switch and self-bounding design are mandatory, not optional, for any self-re-queuing loop — because ScheduleWakeup has no external cancel API.
Two-way vs. durable. The internals are expected to evolve (scheduler host, exact claim/lease mechanism, MCP-vs-direct access, kanban renderer — all two-way doors). The harder-to-reverse commitments are: (a) the board is OKF in the client's git and the git board — not the session — is the source of truth; (b) autonomous loops are always bounded (caps + idempotency + kill switch); (c) governance is enforced by hooks, not model trust.
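The mandatory self-bounding commitment can be pictured as a guard that every wake must pass before doing any work. The sketch below combines three of the mechanisms this record names (a board-level pause flag, a max-fires counter, and idempotency keys); the `BoardControl` structure, its field names, and `admit_wake` are hypothetical, and the safest concrete design remains an open question. Per-run turn and budget caps are assumed to be enforced separately by the loop runtime.

```python
from dataclasses import dataclass, field


@dataclass
class BoardControl:
    """Control state that would live in the git board, not in the session."""
    paused: bool = False          # kill switch: a human flips this to stop all wakes
    fires: int = 0                # how many wakes have been admitted so far
    max_fires: int = 50           # hard ceiling on admitted wakes (arbitrary example)
    seen_keys: set = field(default_factory=set)  # idempotency keys already processed


def admit_wake(ctl: BoardControl, idempotency_key: str) -> tuple[bool, str]:
    """Decide whether this wake may run; record it if so."""
    if ctl.paused:
        return False, "paused"
    if ctl.fires >= ctl.max_fires:
        return False, "max_fires reached"
    if idempotency_key in ctl.seen_keys:
        return False, "duplicate wake"
    ctl.fires += 1
    ctl.seen_keys.add(idempotency_key)
    return True, "ok"
```

The pause flag is checked first so that a human stop takes effect even when the counter and keys would otherwise admit the wake. In a real board the updated control state would need to be committed back to git for the bound to hold across hosts.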

Open questions (unresolved follow-ups — recorded, not answered)

Can a durable cloud Routine (no local file access, fresh clone, 1-hour min) actually operate a repo-resident board worker that reads/claims a task, commits work, and pushes a branch/PR each run — or does that force GitHub Actions / a self-hosted Agent SDK runner? What is the real end-to-end latency/cost of each?
What is the most robust git-native lock/lease mechanism for concurrent agents (frontmatter claimed_by + lease_expires guarded by a PreToolUse hook vs. a lock-file / ref namespace vs. git-bug CRDT merge) under push races, stale leases from crashed agents, and the fresh-clone Routine model?
Given ScheduleWakeup has no external cancel API, what is the safest concrete kill switch + idempotency design (max-fires counter in git, board-level pause flag checked every wake, per-wake budget cap, Stop-hook circuit breaker)?
Should Dossier expose the board to agents primarily via a custom MCP server (typed tools: task_claim / update / complete with hook-enforced governance) or via direct file/CLI edits — and what is the human curation surface (thin kanban over OKF files vs. reuse an existing renderer), keeping OKF markdown the single source of truth?

Review

Revisit when the board is actually built — promote toward verified only against a real end-to-end run (a worker that wakes on a durable host, claims a task from git, does the work bounded, commits/pushes, and hands off through review without a runaway). Re-examine each open question against ground truth at that point, and re-check the primitive-level facts against current docs: this decision references fast-moving features (Routines research preview, Claude Code v2.1.72+, the Cron/ScheduleWakeup surfaces) current to ~June 2026 — several may change shape, gain a cancel API, or relax the Routines constraints, which would directly reopen options here.