Stay serialize-only for intra-tenant drains — keep the per-task worktree mechanism built-but-dormant until a measured throughput trigger fires

0062-defer-intra-tenant-worktree-parallelism

decision · read as: Explain · confidence: asserted · status: active · 2026-06-20 · owner: principal-architect
Reversibility: two-way door

DEC-0062 — Stay serialize-only for intra-tenant drains; keep worktree parallelism built-but-dormant

Reversibility: two-way door. The deferral is fully reversible — the day the trigger below fires, this decision is revisited and the pinned design (below) is built behind the already-proven withTaskWorktree seam, activation being a flag, not a rewrite. What this decision deliberately does not take is the one-way door (relaxing the Inv 4 per-tenant drain lock to a per-task lock); keeping that door shut is the conservative, reversible call. The pinned design exists so that when the door is opened, the topology is already decided rather than re-litigated under load.

This record disposes of the topology call that DEC-0053 §4 (Invariant 4) explicitly designed toward and carried forward as reserved "scale" work. It was surfaced to the Principal Platform Architect as the question "Activate parallel intra-tenant drains via per-task git worktrees, or stay serialize-only?" (a one-way-door topology call). The worktree isolation mechanism is built and offline-proven (packages/runtime/src/worktree.ts, packages/runtime/test/worktree.test.ts): the test runs a real git worktree add in a temp repo, two tasks edit and commit concurrently without corrupting each other, and every path is confineToTenant-gated. This decision resolves whether to activate it. It does not.

Context — the pressure that would justify activation does not exist

The mechanism is de-risked; the need is the question, and the ground truth says there is none:

  • The platform runs one drain, one task at a time. drainBoardSerialized serializes per tenant (Inv 4 holds); scripts/board-drain.mjs calls a single drainBoard with maxTasksPerRun: 1 (one task per session, per §6 of "Agentic 'sprint board' architecture: a git-resident OKF task board worked by bounded, hook-governed Agent SDK loops"). There is no scheduler, no Actions matrix, and no caller anywhere that dispatches two concurrent drains on one tenant. Verified this session against the actual invocation sites: board.ts, board-drain.mjs, and agency-phase0-{dogfood,live}.mjs are the only drain entry points, and none run concurrently.
  • One early client tenant (RBA, under gitignored clients/), plus the dogfood repo. Cross-tenant parallelism is already free (one MCP server + confineToTenant per tenant — DEC-0053 §4); the only thing worktrees buy is intra-tenant parallelism, i.e. running two tasks for the same client at once. No client is generating that load, and no latency SLO is being missed.
  • There is no observability that even measures intra-tenant queue depth or drain latency yet. Activating a one-way-door topology change to relieve a pressure we are not measuring would be building for an imagined future — the textbook YAGNI failure, and exactly the kind of premature-complexity cost a future-self regrets (the bar this role is measured against).

DEC-0053 §4's "serialize NOW, design TOWARD worktrees" was the right call then and is the right call now. The "design toward" obligation has been fully discharged by building + proving the mechanism. "Toward" is not "now."

Options considered

  1. Activate now (relax Inv 4 to a per-task lock; wire N concurrent withTaskWorktree drains). Rejected. Takes a one-way door (drain-lock relaxation, new merge + budget-accrual topology, a real test surface to maintain) to buy throughput nobody is requesting. Cost: permanent complexity in the isolation contract + an irreversible loosening of the safest invariant, against zero measured benefit. Negative expected value today.
  2. Delete the mechanism as premature; rebuild if ever needed. Rejected. Wasteful and anti-leverage — the mechanism is built and proven, costs ~0 to keep dormant, and its existence is precisely what makes future activation a flag not a rewrite. Deleting it would re-incur the build + re-derisk cost later. Keep proven optionality; don't pay to destroy it.
  3. Stay serialize-only; keep the mechanism built-but-dormant; pin the activation design + a measurable trigger. Chosen. Inv 4 (serialize) is reaffirmed as the standing topology. The mechanism stays behind the proven withTaskWorktree seam, annotated as deliberately dormant. The one-way door stays shut. The three sub-questions are pre-resolved (below) so the day the trigger fires, an FDE builds a decided design under load instead of litigating topology under load.

Decision

Serialize-only is the standing intra-tenant drain topology. Do not activate per-task worktree parallelism. Keep worktree.ts built-but-dormant behind withTaskWorktree; annotate it as such (header amended to point here). Hold the Inv 4 per-tenant drain lock as-is.

The explicit revisit trigger (the one-way door reopens when ALL three hold)

Revisit this decision — and build the pinned design — when all of:

  1. A real tenant has sustained intra-tenant queue depth. At least one tenant accumulates ≥ 3 simultaneously claimable, independent (dependencies-free or dependency-satisfied) tasks that a single serialized drain cannot clear inside the tenant's expected wake cadence — i.e. work that is genuinely parallelizable is repeatedly waiting on the serial lock, not on dependencies or human review.
  2. That backlog has a latency cost someone is paying. The serial drain's wall-clock-to-review for a tenant's board misses an explicit, written latency expectation (a client SLO, or a dogfood throughput target we set deliberately) — i.e. serialization is a measured bottleneck, not a hypothetical one.
  3. We can measure it. Per-tenant drain latency / intra-tenant queue depth is observable (a counter, a board metric) so activation's benefit is verifiable, not asserted.

The first sustained breach of (1)+(2) with (3) in place flips this. Until then, serialize-only stands. (A single bursty board, or a tenant whose tasks are mostly dependency-chained — which serialize anyway — does not trip it; the trigger is sustained, parallelizable, latency-costly load.)
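
As a rough illustration, the three conditions could be evaluated as a single predicate over per-wake samples. This is a sketch under assumptions, not existing code: WakeSample, TriggerInput, shouldRevisit, and the sustainedWakes window are hypothetical names, and this record does not define how many consecutive breaches count as "sustained". Only the threshold of 3 claimable independent tasks comes from condition 1.

```typescript
// Hypothetical shapes for illustration; no such metrics exist in the codebase yet.
interface WakeSample {
  claimableIndependentTasks: number; // claimable tasks with no unmet dependencies
  clearedWithinCadence: boolean;     // did the serialized drain clear them before the next wake?
  wallClockToReviewMs: number;       // serial drain's wall-clock-to-review for the board
}

interface TriggerInput {
  samples: WakeSample[] | null;        // null: queue depth / latency is not observable (condition 3 unmet)
  latencyExpectationMs: number | null; // null: no explicit, written latency expectation (condition 2 unmet)
  sustainedWakes: number;              // ASSUMPTION: consecutive breaching wakes that count as "sustained"
}

const MIN_PARALLELIZABLE_TASKS = 3; // from condition 1

function shouldRevisit(input: TriggerInput): boolean {
  const samples = input.samples;
  const expectation = input.latencyExpectationMs;
  if (samples === null) return false;     // (3) we cannot measure it
  if (expectation === null) return false; // (2) nobody has written down a latency expectation
  if (input.sustainedWakes < 1) return false;

  const recent = samples.slice(-input.sustainedWakes);
  if (recent.length < input.sustainedWakes) return false; // not enough history to call it sustained

  // (1) + (2): every wake in the window shows parallelizable work stuck behind
  // the serial lock AND a missed latency expectation. One quiet wake inside
  // the window is enough to return false.
  return recent.every(
    (s) =>
      s.claimableIndependentTasks >= MIN_PARALLELIZABLE_TASKS &&
      !s.clearedWithinCadence &&
      s.wallClockToReviewMs > expectation,
  );
}
```

With sustainedWakes set to 3, one bursty wake followed by two quiet ones returns false, which matches the parenthetical above: a single burst does not trip the trigger.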

The three open sub-questions — PRE-RESOLVED (the pinned design a future activation builds to)

So the door is decided, not deferred and vague. The three resolutions below are the design an FDE builds to when the trigger fires. They are recorded now while the context is fresh, but they do not activate anything.

  1. Merge topology → PRs the human gate disposes, NOT fast-forward to main. Each per-task worktree commits to its own dossier/task/<id> branch; activation lands those as PRs (or their local-merge analogue) that a human disposes via the already-built Inv 3 gate in dispose.ts: approveTask (review → done plus the real merge commit) or rejectTask. A fast-forward-to-main path is explicitly rejected: it would let an agentic branch reach done/main without the human disposition, violating Inv 3 (only a human merge moves a task to done), the non-bypassability that dispose.ts made structural. So parallelism changes where work is staged (isolated branches), never who closes the loop (the human, through the existing single done-writing path). The worker still transitions only to review; the branch is what the human merges on approve.
  2. Budget accrual → the per-team split (budget.ts) is the apportionment key; accrual SUMS across worktrees against the unchanged tenant ceiling; the kill switch stays tenant-level. decideTeamBudget (DEC-0053 §5, built) is exactly the apportionment key for dividing the tenant envelope across concurrent drains: each concurrent worktree-drain carries a teamId and accrues against both its team sub-envelope and the outer tenant ceiling. The concurrent drains' spentThisDrainUsd must be summed into a single tenant-level running total before the post-drain enforceBudget check (the scheduler owns this sum — a per-drain check alone could let N drains each pass individually while collectively breaching). The hard-stop / board-pause kill switch is unchanged and stays tenant-level: a team breach denies that team's drain without pausing the board (other teams continue); only the tenant-ceiling breach trips the .board-pause sentinel — exactly enforceBudget's current deniedBy !== 'team' rule. Concurrency does not weaken the kill switch; it just requires the scheduler to feed it the summed accrual, not a single drain's.
  3. Drain-lock relaxation (THE one-way door) → a per-task sub-lock NESTED under a RETAINED per-tenant coordination lease; the isolation contract is: worktree path = isolation unit, shared object store, serialized ref update. Inv 4 does not fully relax to a free-for-all per-task lock. The contract that keeps tenant safety intact:
    • Retain a per-tenant coordination lease (the existing .drain-lock shape) as the outer bound that gates admission — it caps concurrency at a configured N and owns the summed budget accrual (sub-question 2). It is no longer "one drain," but it is still the single tenant-level coordinator.
    • Add a per-task sub-lock keyed on the worktree path / branch — the worktree is the isolation unit (proven: separate checkout, own branch, shared .git object store, no interleaved git add -A). Two tasks may run concurrently iff they hold distinct worktree sub-locks.
    • Serialize the one shared mutation: the ref update / merge to the tenant main line. Concurrent work is safe (isolated trees); concurrent integration is not (one HEAD). Branch creation and per-worktree commits parallelize; the merge-to-main hop (sub-question 1's PR disposition) is serialized through the human gate, which is already a serialization point. The object store is shared and append-only under concurrent commits to distinct refs — the proven-safe case.
    • Isolation contract, stated: a concurrent drain may read/write only inside its own confineToTenant-gated worktree path; it may commit only to its own dossier/task/<id> branch; it may not touch another worktree's path, another task's branch, the tenant main ref (only the human merge does), or another tenant's subtree. Inv 1/2/3/6/7 are untouched by construction — they live inside AgentSdkBoardWorker.execute(), which is per-task and worktree-agnostic.
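
Sub-questions 2 and 3 meet in the retained tenant-level coordinator: it gates admission (cap N, one sub-lock per worktree) and owns the summed accrual. A minimal sketch of that shape, under stated assumptions: TenantCoordinator, tryAdmit, finish, checkBudget, and the Ceilings and BudgetVerdict types are hypothetical names invented here, not the API of budget.ts, worktree.ts, or enforceBudget, and the in-memory Set stands in for whatever lock files a real build would use.

```typescript
// Hypothetical illustration only; not the API of budget.ts or worktree.ts.
interface Ceilings {
  tenantCeilingUsd: number;
  teamCeilingsUsd: Record<string, number>;
}

interface BudgetVerdict {
  tenantTotalUsd: number;
  deniedTeams: string[]; // teams over their sub-envelope; other teams continue
  pauseBoard: boolean;   // true only when the summed total breaches the tenant ceiling
}

class TenantCoordinator {
  private readonly heldWorktrees = new Set<string>(); // per-task sub-locks, keyed on worktree path
  private readonly spentByTeamUsd = new Map<string, number>();
  private readonly maxConcurrent: number;
  private readonly ceilings: Ceilings;

  constructor(maxConcurrent: number, ceilings: Ceilings) {
    this.maxConcurrent = maxConcurrent;
    this.ceilings = ceilings;
  }

  // Outer lease: admit a task only if its worktree sub-lock is free and cap N is not reached.
  tryAdmit(worktreePath: string): boolean {
    if (this.heldWorktrees.has(worktreePath)) return false;
    if (this.heldWorktrees.size >= this.maxConcurrent) return false;
    this.heldWorktrees.add(worktreePath);
    return true;
  }

  // A drain finished: release its sub-lock and fold its spend into the running totals.
  finish(worktreePath: string, teamId: string, spentThisDrainUsd: number): void {
    this.heldWorktrees.delete(worktreePath);
    this.spentByTeamUsd.set(teamId, (this.spentByTeamUsd.get(teamId) ?? 0) + spentThisDrainUsd);
  }

  // Check the SUMMED accrual, so N drains that each pass alone cannot collectively breach.
  checkBudget(): BudgetVerdict {
    let tenantTotalUsd = 0;
    const deniedTeams: string[] = [];
    for (const [teamId, spent] of this.spentByTeamUsd) {
      tenantTotalUsd += spent;
      const teamCeiling = this.ceilings.teamCeilingsUsd[teamId];
      if (teamCeiling !== undefined && spent > teamCeiling) deniedTeams.push(teamId);
    }
    return { tenantTotalUsd, deniedTeams, pauseBoard: tenantTotalUsd > this.ceilings.tenantCeilingUsd };
  }
}
```

For example, three drains of 4 USD each under a 10 USD tenant ceiling and 5 USD team ceilings would each pass a per-drain check, but the summed 12 USD sets pauseBoard. A single team spending 6 USD against its 5 USD sub-envelope lands in deniedTeams while pauseBoard stays false, mirroring the deniedBy !== 'team' rule described in sub-question 2.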

Consequences

  • Inv 4 (serialize) is reaffirmed as the standing topology, not merely "not yet relaxed." DEC-0053's verified scope is unchanged: parallelism remains explicitly not covered (consistent with DEC-0053 §Promotion item 2, which carried scale forward as reserved).
  • worktree.ts is annotated dormant-by-decision (header amended to cite this record) so a future reader knows the mechanism's non-activation is a decided topology stance with a trigger, not an oversight or unfinished work.
  • The carried-forward "scale" reservation in DEC-0053 is now half-disposed: the per-team budget split (§5) was built; intra-tenant parallelism (§4) is hereby decided to defer with a trigger. DEC-0053's only remaining open scale item collapses to "activate worktrees when DEC-0062's trigger fires."
  • No code changes to the drain path. Serialize-only is the existing behavior; this decision changes a comment and disposes a task. No test churn, no new surface to maintain — the correct cost for a deferral.

Relation to DEC-0053

This is the dated disposition of the §4 one-way door that DEC-0053 ("Agentic-agency runtime topology: compile personas from the OKF graph and activate the reserved BoardWorker over the deterministic spine") designed toward and that its §Promotion item 2 carried forward as reserved scale work. It does not re-grade DEC-0053's frontmatter (the verified scope already excludes parallelism). It converts "reserved, undecided" into "deferred, decided, with a measurable trigger and a pinned activation design."