Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system

0012-runtime-orchestration-control-plane

decision · read as: Explain · confidence: asserted · status: active · 2026-06-14 · owner: platform-engineer

Reversibility: two-way door

DEC-0012 — Runtime orchestration & per-tenant control plane

Reversibility: two-way door — on orchestration internals, the registry/provisioning shape, and loop-stage wiring; the loop shape (provision → ingest → extract → OKF → serve), git-per-tenant as the iteration record, and the siloed-subtree isolation model are the durable parts.

Context

Extraction runtime architecture — the moat reserved @dossier/runtime (name only) as the home for "the Agent SDK orchestration + per-tenant provisioning" — explicitly so orchestration "doesn't get crammed into @dossier/extraction" and "never leaks into the moat." Claude-primitives-first build strategy names the Claude Agent SDK as "the ingestion→extraction→OKF runtime." With the moat (Extraction runtime architecture — the moat, @dossier/extraction), the measurement layer (Live extraction eval harness — what we measure is what extraction optimizes for), and the agentic foundation (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB, @dossier/mcp) all built and green, the missing piece was the layer that ties them into one runnable flow per client — the place where Dossier's learning loop stops being a set of integrated packages and becomes a system you can run end-to-end for a tenant.

This decision builds that reserved package: @dossier/runtime — the per-tenant orchestration runtime + control plane. It wires the whole loop into one runnable, resumable, siloed flow: provision → ingest → extract → OKF → serve. It reuses @dossier/extraction's run() and @dossier/mcp's createServer — it does not reimplement either; it is the conductor over the keystone (@dossier/okf) and the two built layers.

The build (verified this session, all green, no network): @dossier/runtime 21 tests green; repo-wide 246 passed / 1 gated-skip. The headline test runs extract → emit → serve as one loop, offline: provisionTenant → runLoop (a MockClaudeClient over a DXA discovery/solution-design source) emits 3 valid OKF atoms into the tenant's siloed repo → loadKnowledgeBase / createServer over that same repo → MCP search_concepts / get_related find the freshly-extracted atoms. The loop is real, not a diagram.

Options considered

1. Orchestration engine — deterministic control plane now vs. live Agent SDK now.

(a) Build the live AgentSdkOrchestrator now (import @anthropic-ai/claude-agent-sdk; resumable multi-source onboarding, memory tool, context editing, a curation sub-agent). Maximum capability on day one, but it needs network access and an API key, which breaks the offline-first CI invariant the whole monorepo holds to, and it couples the control plane to the SDK before the control plane itself is proven. The orchestration logic would be untestable in offline CI.
(b) Ship a deterministic DefaultOrchestrator now, reserve the Agent SDK behind an Orchestrator seam (chosen). The control plane — provisioning, the registry, the staged loop wiring, the git-per-tenant iteration record, isolation enforcement — is the valuable, testable core and is fully exercisable offline. The Orchestrator interface lets a concrete AgentSdkOrchestrator (the only place that would import @anthropic-ai/claude-agent-sdk) drop in later without a control-plane rewrite. This mirrors the seam discipline already proven twice: the reserved live Embedder in MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB and the reserved live ClaudeClient in Extraction runtime architecture — the moat.
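The seam in option (b) can be sketched roughly as below. This is a minimal illustration, not the shipped code: the record names Orchestrator, DefaultOrchestrator and AgentSdkOrchestrator but gives no signatures, so the TenantContext fields, LoopResult, the Stage type and the runLoop method shape are all assumptions made for the example.

```typescript
// Hypothetical shapes: only the three class/interface names come from the record.
interface TenantContext {
  clientId: string;
  repoDir: string;
}

interface LoopResult {
  atomsEmitted: number;
  stagesRun: string[];
}

// A stage is any async step over one tenant that reports how many atoms it emitted.
// Real stages would wrap @dossier/extraction's run() and @dossier/mcp's createServer.
type Stage = {
  name: string;
  run: (tenant: TenantContext) => Promise<number>;
};

interface Orchestrator {
  runLoop(tenant: TenantContext): Promise<LoopResult>;
}

// Deterministic: runs the injected stages in order, with no network and no API key.
class DefaultOrchestrator implements Orchestrator {
  private readonly stages: Stage[];

  constructor(stages: Stage[]) {
    this.stages = stages;
  }

  async runLoop(tenant: TenantContext): Promise<LoopResult> {
    const result: LoopResult = { atomsEmitted: 0, stagesRun: [] };
    for (const stage of this.stages) {
      result.atomsEmitted += await stage.run(tenant);
      result.stagesRun.push(stage.name);
    }
    return result;
  }
}

// Reserved: the only class that would import @anthropic-ai/claude-agent-sdk.
// Until it is built, it fails loudly instead of silently doing nothing.
class AgentSdkOrchestrator implements Orchestrator {
  async runLoop(_tenant: TenantContext): Promise<LoopResult> {
    throw new Error("AgentSdkOrchestrator is reserved and not built yet");
  }
}
```

Because callers depend only on the Orchestrator interface, swapping the deterministic implementation for a live one would be a change at the construction site rather than in the control plane.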

2. Tenant topology — siloed subtree-per-tenant vs. pooled.

(a) Pooled: one shared workspace/store, tenants distinguished by a key. Efficient, but a single bug or crafted id becomes a cross-client leak. That violates the sovereignty promise in Dossier — Mission & North Star, and it is the same MCP-isolation risk that Claude-primitives-first build strategy flagged for review.
(b) Siloed subtree-per-tenant (chosen). provisionTenant creates a separate workspace root/<clientId>/ (its own OKF repo + a dossier.tenant.json manifest); a TenantRegistry (list / get / deprovision) manages lifecycle. Each tenant is a distinct subtree — the isolation boundary made concrete on disk, matching DEC-0008's "one client = one OKF git repo" and DEC-0011's "one server = one tenant."
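A rough sketch of what siloed provisioning and the registry in option (b) might look like on disk. The names provisionTenant, TenantRegistry, list / get / deprovision and dossier.tenant.json come from this record; the manifest fields, the "okf" directory name, the function signatures and the return types are assumptions. Client id validation is deliberately left out here to keep the example short; per decision item 4, real code must run every id through the isolation gate first.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical manifest fields; the record only names the file.
interface TenantManifest {
  clientId: string;
  createdAt: string;
  vcs: "git" | "none";
}

interface Tenant {
  clientId: string;
  workspace: string; // root/<clientId>/
  repoDir: string; // the tenant's OKF repo inside the workspace
}

const MANIFEST = "dossier.tenant.json";
const REPO_DIR = "okf"; // assumed directory name

// NOTE: clientId is not validated in this sketch; real code must gate it first.
function provisionTenant(root: string, clientId: string, vcs: "git" | "none" = "none"): Tenant {
  const workspace = path.join(root, clientId);
  if (fs.existsSync(workspace)) {
    throw new Error(`tenant already provisioned: ${clientId}`);
  }
  const repoDir = path.join(workspace, REPO_DIR);
  fs.mkdirSync(repoDir, { recursive: true });
  const manifest: TenantManifest = { clientId, createdAt: new Date().toISOString(), vcs };
  fs.writeFileSync(path.join(workspace, MANIFEST), JSON.stringify(manifest, null, 2));
  return { clientId, workspace, repoDir };
}

class TenantRegistry {
  private readonly root: string;

  constructor(root: string) {
    this.root = root;
  }

  // A directory counts as a tenant only if it carries a manifest.
  list(): string[] {
    if (!fs.existsSync(this.root)) return [];
    return fs
      .readdirSync(this.root, { withFileTypes: true })
      .filter((e) => e.isDirectory() && fs.existsSync(path.join(this.root, e.name, MANIFEST)))
      .map((e) => e.name)
      .sort();
  }

  get(clientId: string): Tenant | undefined {
    const workspace = path.join(this.root, clientId);
    if (!fs.existsSync(path.join(workspace, MANIFEST))) return undefined;
    return { clientId, workspace, repoDir: path.join(workspace, REPO_DIR) };
  }

  // Removes the whole subtree; returns false if the tenant was not provisioned.
  deprovision(clientId: string): boolean {
    const tenant = this.get(clientId);
    if (!tenant) return false;
    fs.rmSync(tenant.workspace, { recursive: true, force: true });
    return true;
  }
}

// Usage: provision one tenant under a throwaway root.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), "dossier-demo-"));
const acme = provisionTenant(demoRoot, "acme");
```

Deriving the registry from the directory tree, rather than from a separate index, keeps the subtree itself as the only source of truth for which tenants exist. That is a choice of this sketch, not something the record specifies.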

3. Iteration record — git-per-tenant vs. a central store.

(a) Central run store / DB of loop iterations. Convenient for cross-tenant operator queries, but it makes the platform the system of record for the client's own learning history — the exact lock-in Adopt OKF as Dossier's canonical knowledge format exists to refuse.
(b) Git-per-tenant (chosen). When vcs: 'git', the loop runs git init on the tenant OKF repo and git add -A && git commit after each extract — so every loop iteration is a diff in the client's own git history. The client owns the record of how their institutional memory was learned, forever. Uses node:child_process — no new dependency.
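The git-per-tenant record in option (b) amounts to three git commands driven through node:child_process. The sketch below assumes hypothetical helper names (initTenantRepo, commitIteration), a hypothetical GitRunner type, and a made-up commit message format; only the commands themselves (git init, git add -A, git commit) come from this record.

```typescript
import { execFileSync } from "node:child_process";

// Runs one git command inside a tenant repo. Injectable so tests can record the
// calls instead of depending on a local git install.
type GitRunner = (args: string[], cwd: string) => void;

// execFileSync passes arguments as an array, so nothing is interpreted by a shell.
const realGit: GitRunner = (args, cwd) => {
  execFileSync("git", args, { cwd, stdio: "pipe" });
};

// Called once when the tenant is configured with vcs: 'git'.
function initTenantRepo(repoDir: string, git: GitRunner = realGit): void {
  git(["init"], repoDir);
}

// Called after each extract so every loop iteration becomes one commit.
// With the real runner, `git commit` exits non-zero when nothing changed and
// execFileSync throws; how the shipped loop handles that case is not stated here.
function commitIteration(repoDir: string, iteration: number, git: GitRunner = realGit): void {
  git(["add", "-A"], repoDir);
  git(["commit", "-m", `loop iteration ${iteration}`], repoDir);
}
```

With the default runner, `commitIteration(tenant.repoDir, 1)` would stage and commit whatever the extract step wrote. The real commit also needs a git identity (user.name and user.email) configured for the repo.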

Decision

1. A per-tenant control plane with siloed provisioning. provisionTenant creates a siloed workspace root/<clientId>/ containing the tenant's OKF repo and a dossier.tenant.json manifest; a TenantRegistry exposes list / get / deprovision for lifecycle. Each tenant is a separate subtree — the isolation boundary made concrete, not a query filter. Tenant context is threaded explicitly through every operation; no ambient/global state (consistent with the TenantContext discipline from Extraction runtime architecture — the moat and the TenantConfig discipline from MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB).

2. The orchestrated loop is the deterministic, testable core; the Agent SDK is reserved behind a seam. An Orchestrator interface defines the loop. The shipped DefaultOrchestrator runs the staged loop directly and offline (no network, no key) — it is the valuable, testable control plane. A concrete AgentSdkOrchestrator is reserved: it is the only place that would import @anthropic-ai/claude-agent-sdk, and it is where resumable multi-source onboarding, the memory tool, and a curation sub-agent will live. It is deferred because it requires network + key (which would break offline CI) — and because the control plane is the part that needs to be proven first. This honors DEC-0008's mandate that "the Agent SDK is reserved to @dossier/runtime so orchestration never leaks into the moat."

3. Git-per-tenant is the learning-loop iteration record, in the client's own history. With vcs: 'git', the loop runs git init on the tenant OKF repo and commits (git add -A && git commit) after each extract. Every iteration of the loop is therefore a diff in the client's own git history, so the client owns the record of how their memory compounded. This satisfies Adopt OKF as Dossier's canonical knowledge format (one client = one git repo = system of record; replaceable caches everywhere else). Implemented with node:child_process; no new dependency.

4. Tenant isolation is enforced at every boundary. assertClientId rejects path-traversal characters and separators in client ids; confineToTenant blocks .., sibling-prefix escapes (acme must not reach acme-evil), and absolute-path escapes. Both gate every read, write, and discovery operation. The TenantRegistry runs ids through the same gate (get('../beta') throws). Isolation is a tested boundary, not an intention.

5. The loop is the integration, end-to-end and offline. runLoop composes the existing layers — it calls @dossier/extraction's run() to extract and emit OKF atoms into the tenant's siloed repo, then @dossier/mcp's loadKnowledgeBase / createServer serve that same repo. The headline test proves the full path offline (provision → runLoop on a MockClaudeClient → 3 valid OKF atoms → MCP search_concepts / get_related find them). Provision → ingest → extract → OKF → serve is one runnable flow.
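The isolation gate from item 4 can be illustrated with a small lexical check. This is a sketch under stated assumptions: the function names assertClientId and confineToTenant come from this record, but the signatures, the exact id pattern and the error messages are hypothetical. The record says only that traversal characters and separators are rejected and that .., sibling-prefix and absolute-path escapes are blocked.

```typescript
import * as path from "node:path";

// Assumed allow-list: letters, digits, '_' and '-', starting with a letter or digit.
// Anything containing '.', '/', '\\' or whitespace is rejected, as is the empty string.
function assertClientId(clientId: string): string {
  if (!/^[a-z0-9][a-z0-9_-]*$/i.test(clientId)) {
    throw new Error(`invalid client id: ${JSON.stringify(clientId)}`);
  }
  return clientId;
}

// Resolves `target` against the tenant workspace and throws if it lands outside it.
function confineToTenant(root: string, clientId: string, target: string): string {
  const workspace = path.resolve(root, assertClientId(clientId));
  // path.resolve collapses ".." segments and lets an absolute target override the base,
  // so both kinds of escape show up as a resolved path outside the workspace.
  const resolved = path.resolve(workspace, target);
  // Comparing against workspace + separator is what stops "acme" from matching "acme-evil".
  if (resolved !== workspace && !resolved.startsWith(workspace + path.sep)) {
    throw new Error(`path escapes tenant ${clientId}: ${target}`);
  }
  return resolved;
}
```

For example, `confineToTenant("/srv/tenants", "acme", "atoms/a.md")` resolves inside the acme subtree, while `"../acme-evil/x"`, `"/etc/passwd"` and `".."` all throw. The check is purely lexical: it does not follow symlinks, so a hardened version would also need to compare real paths (for instance via fs.realpathSync).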

Rationale

The control plane is the valuable, testable core — so build and prove it first. Provisioning, the registry, loop wiring, git-per-tenant, and isolation are all fully exercisable with zero network. Reserving the Agent SDK behind the Orchestrator seam means CI stays offline and the orchestration capability drops in later without a control-plane rewrite — the same seam discipline that already paid off for the Embedder (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB) and the ClaudeClient (Extraction runtime architecture — the moat).
Reuse, not reimplementation, keeps the layers honest. @dossier/runtime conducts @dossier/extraction's run() and @dossier/mcp's createServer over the shared @dossier/okf keystone. The runtime owns orchestration; it never re-derives extraction or serving — so the layers can't disagree about what the loop does.
Siloed subtree + git-per-tenant is sovereignty made operational. Isolation is the process/subtree boundary, not a query filter (the conservative read of the MCP-isolation risk Claude-primitives-first build strategy flagged). And the client's own git — not a platform DB — is the record of every loop iteration, so Dossier — Mission & North Star's "owned, compounding learning loop" is literally true on disk: they own the memory and the history of how it was learned.
Asserted, not verified. Built and green offline (@dossier/runtime 21 tests; repo-wide 246 passed / 1 gated-skip), with the full loop proven end-to-end on a mock. But the live AgentSdkOrchestrator is unbuilt, real ingestion is reserved, and the isolation model is not yet battle-tested at real multi-tenant scale. This is design-level conviction backed by an offline single-tenant run, not field evidence.

Consequences

The last DEC-0008 reservation is closed. With @dossier/runtime built, all four architecture layers exist in code and are integrated: ingest (reserved/commodity) · extract + eval (the moat) · serve (the agentic foundation, MCP) · orchestrate (this runtime) — all on the @dossier/okf keystone, per-tenant siloed. Dossier's learning loop is now a runnable system, not a design.
The Agent SDK is a standing reservation. When resumable multi-source onboarding / memory / curation are needed, the live AgentSdkOrchestrator drops into the existing Orchestrator seam — the only place @anthropic-ai/claude-agent-sdk imports — with no control-plane rewrite and no change to the offline-CI invariant for the deterministic core.
Git-per-tenant is now load-bearing for sovereignty. The client's own git history is the iteration record; any future operator/analytics view must derive from it, never replace it — widening that into a central system of record would breach Adopt OKF as Dossier's canonical knowledge format.
Isolation correctness is a tested boundary (assertClientId + confineToTenant on every read/write/discovery; the registry gates ids). New code paths in @dossier/runtime must go through the gate or they regress the sovereignty guarantee.
Two-way vs. durable. Orchestration internals, the registry/provisioning shape, and loop-stage wiring are expected to evolve (two-way door). The loop shape (provision → ingest → extract → OKF → serve), git-per-tenant as the iteration record, and the siloed-subtree isolation model are the durable, harder-to-reverse commitments.

Review

Revisit at real multi-tenant scale: does siloed subtree-per-tenant + git-per-tenant hold operationally (provisioning throughput, repo proliferation, deprovision/retention), or does scale pressure force a different topology? Separately, activate the AgentSdkOrchestrator (resumable multi-source onboarding, memory tool, curation sub-agent) and wire real ingestion (@dossier/ingestion) through the loop when the work demands it. At that point, re-examine the orchestration design against a real client run and consider promoting confidence to verified.