Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system

0012-runtime-orchestration-control-plane

decision · confidence: asserted · status: active · 2026-06-14 · owner: platform-engineer · reversibility: two-way door

DEC-0012 — Runtime orchestration & per-tenant control plane

Reversibility: two-way door — on orchestration internals, the registry/provisioning shape, and loop-stage wiring; the loop shape (provision → ingest → extract → OKF → serve), git-per-tenant as the iteration record, and the siloed-subtree isolation model are the durable parts.

Context

Extraction runtime architecture — the moat reserved @dossier/runtime (name only) as the home for "the Agent SDK orchestration + per-tenant provisioning" — explicitly so orchestration "doesn't get crammed into @dossier/extraction" and "never leaks into the moat." Claude-primitives-first build strategy names the Claude Agent SDK as "the ingestion→extraction→OKF runtime." With the moat (Extraction runtime architecture — the moat, @dossier/extraction), the measurement layer (Live extraction eval harness — what we measure is what extraction optimizes for), and the agentic foundation (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB, @dossier/mcp) all built and green, the missing piece was the layer that ties them into one runnable flow per client — the place where Dossier's learning loop stops being a set of integrated packages and becomes a system you can run end-to-end for a tenant.

This decision builds that reserved package: @dossier/runtime — the per-tenant orchestration runtime + control plane. It wires the whole loop into one runnable, resumable, siloed flow: provision → ingest → extract → OKF → serve. It reuses @dossier/extraction's run() and @dossier/mcp's createServer — it does not reimplement either; it is the conductor over the keystone (@dossier/okf) and the two built layers.

The build (verified this session, all green, no network): @dossier/runtime 21 tests green; repo-wide 246 passed / 1 gated-skip. The headline test runs extract → emit → serve as one loop, offline: provisionTenant → runLoop (a MockClaudeClient over a DXA discovery/solution-design source) emits 3 valid OKF atoms into the tenant's siloed repo → loadKnowledgeBase / createServer over that same repo → MCP search_concepts / get_related find the freshly-extracted atoms. The loop is real, not a diagram.

Options considered

1. Orchestration engine — deterministic control plane now vs. live Agent SDK now.

  • (a) Build the live AgentSdkOrchestrator now (import @anthropic-ai/claude-agent-sdk; resumable multi-source onboarding, memory tool, context editing, a curation sub-agent). Maximum capability day one — but it needs network + an API key, which breaks the offline-first CI invariant the whole monorepo holds to, and it couples the control plane to the SDK before the control plane itself is proven. The orchestration prose would be untestable.
  • (b) Ship a deterministic DefaultOrchestrator now, reserve the Agent SDK behind an Orchestrator seam (chosen). The control plane — provisioning, the registry, the staged loop wiring, the git-per-tenant iteration record, isolation enforcement — is the valuable, testable core and is fully exercisable offline. The Orchestrator interface lets a concrete AgentSdkOrchestrator (the only place that would import @anthropic-ai/claude-agent-sdk) drop in later without a control-plane rewrite. This mirrors the seam discipline already proven twice: the reserved live Embedder in MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB and the reserved live ClaudeClient in Extraction runtime architecture — the moat.
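To make the seam in option (b) concrete, here is a minimal sketch of how an Orchestrator interface could separate the deterministic loop from a reserved SDK-backed implementation. The record names Orchestrator, DefaultOrchestrator, and AgentSdkOrchestrator but does not give their signatures, so the method name runLoop on the interface, the Stage and LoopResult types, and the injected stage functions are all assumptions made for illustration.

```typescript
// Hypothetical shapes: only the three type names come from the decision record.
type Stage = "ingest" | "extract" | "okf" | "serve";

interface LoopResult {
  clientId: string;
  completed: Stage[];
}

// The seam: the control plane depends only on this interface.
interface Orchestrator {
  runLoop(clientId: string): Promise<LoopResult>;
}

// Each stage is injected, so the deterministic orchestrator needs no network or key.
type StageFn = (clientId: string) => Promise<void>;

class DefaultOrchestrator implements Orchestrator {
  private readonly stages: ReadonlyArray<readonly [Stage, StageFn]>;

  constructor(stages: ReadonlyArray<readonly [Stage, StageFn]>) {
    this.stages = stages;
  }

  async runLoop(clientId: string): Promise<LoopResult> {
    const completed: Stage[] = [];
    for (const [name, run] of this.stages) {
      await run(clientId); // a thrown error stops the loop at this stage
      completed.push(name);
    }
    return { clientId, completed };
  }
}

// Reserved placeholder: a future version would be the only module that
// imports @anthropic-ai/claude-agent-sdk. Nothing is imported here.
class AgentSdkOrchestrator implements Orchestrator {
  async runLoop(_clientId: string): Promise<LoopResult> {
    throw new Error("AgentSdkOrchestrator is reserved and not yet built");
  }
}
```

Because callers hold an Orchestrator rather than a concrete class, swapping in the SDK-backed implementation later would not require touching provisioning, the registry, or isolation code.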

2. Tenant topology — siloed subtree-per-tenant vs. pooled.

  • (a) Pooled — one shared workspace/store, tenants distinguished by a key. Efficient, but a single bug or crafted id is a cross-client leak — against Dossier — Mission & North Star's sovereignty promise and the same MCP-isolation risk Claude-primitives-first build strategy flagged for review.
  • (b) Siloed subtree-per-tenant (chosen). provisionTenant creates a separate workspace root/<clientId>/ (its own OKF repo + a dossier.tenant.json manifest); a TenantRegistry (list / get / deprovision) manages lifecycle. Each tenant is a distinct subtree — the isolation boundary made concrete on disk, matching DEC-0008's "one client = one OKF git repo" and DEC-0011's "one server = one tenant."
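The siloed layout in option (b) can be sketched as follows. This is an illustration under stated assumptions, not the shipped @dossier/runtime code: the function signatures, the manifest fields, the okf directory name, and the allow-list used for client ids are hypothetical. Only the names provisionTenant, TenantRegistry (list / get / deprovision), assertClientId, and the dossier.tenant.json manifest come from the record.

```typescript
import { existsSync, mkdirSync, readdirSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import * as path from "node:path";

const MANIFEST = "dossier.tenant.json";

// Hypothetical manifest shape; the record does not list its fields.
interface TenantManifest {
  clientId: string;
  createdAt: string;
}

// Simplified id gate (assumption): an allow-list of safe characters.
function assertClientId(clientId: string): void {
  if (!/^[A-Za-z0-9][A-Za-z0-9_-]*$/.test(clientId)) {
    throw new Error(`invalid client id: ${JSON.stringify(clientId)}`);
  }
}

// Create root/<clientId>/ with an OKF repo directory and a manifest.
function provisionTenant(root: string, clientId: string): TenantManifest {
  assertClientId(clientId);
  const dir = path.join(root, clientId);
  if (existsSync(path.join(dir, MANIFEST))) {
    throw new Error(`tenant already provisioned: ${clientId}`);
  }
  mkdirSync(path.join(dir, "okf"), { recursive: true }); // "okf" is an assumed name
  const manifest: TenantManifest = { clientId, createdAt: new Date().toISOString() };
  writeFileSync(path.join(dir, MANIFEST), JSON.stringify(manifest, null, 2));
  return manifest;
}

class TenantRegistry {
  private readonly root: string;

  constructor(root: string) {
    this.root = root;
  }

  // A tenant is any subdirectory of root that carries a manifest.
  list(): string[] {
    if (!existsSync(this.root)) return [];
    return readdirSync(this.root, { withFileTypes: true })
      .filter((e) => e.isDirectory() && existsSync(path.join(this.root, e.name, MANIFEST)))
      .map((e) => e.name)
      .sort();
  }

  get(clientId: string): TenantManifest {
    assertClientId(clientId); // ids pass the same gate as provisioning
    const file = path.join(this.root, clientId, MANIFEST);
    if (!existsSync(file)) throw new Error(`unknown tenant: ${clientId}`);
    return JSON.parse(readFileSync(file, "utf8")) as TenantManifest;
  }

  deprovision(clientId: string): void {
    assertClientId(clientId);
    rmSync(path.join(this.root, clientId), { recursive: true, force: true });
  }
}
```

Deriving the registry from the directory tree, rather than from a separate index file, keeps the on-disk subtrees as the only source of truth in this sketch; whether the real registry does the same is not stated in the record.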

3. Iteration record — git-per-tenant vs. a central store.

  • (a) Central run store / DB of loop iterations. Convenient for cross-tenant operator queries, but it makes the platform the system of record for the client's own learning history — the exact lock-in Adopt OKF as Dossier's canonical knowledge format exists to refuse.
  • (b) Git-per-tenant (chosen). When vcs: 'git', the loop runs git init on the tenant OKF repo and git add -A && git commit after each extract — so every loop iteration is a diff in the client's own git history. The client owns the record of how their institutional memory was learned, forever. Uses node:child_process — no new dependency.
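The commit-after-extract behaviour in option (b) might look roughly like the sketch below. It uses node:child_process as the record states, but everything else is an assumption: the helper names (git, ensureRepo, emitAtom, commitIteration), the fixed commit identity passed with -c, and the choice to skip the commit when nothing changed. It also runs git add -A and git commit as two execFileSync calls rather than one shell command joined with &&, which avoids shell interpolation of the message.

```typescript
import { execFileSync } from "node:child_process";
import { existsSync, mkdirSync, writeFileSync } from "node:fs";
import * as path from "node:path";

// Run one git command inside the tenant repo and return its stdout.
function git(repoDir: string, args: string[]): string {
  return execFileSync("git", args, { cwd: repoDir, encoding: "utf8" });
}

// Idempotent init: only runs `git init` when the repo has no .git directory.
function ensureRepo(repoDir: string): void {
  mkdirSync(repoDir, { recursive: true });
  if (!existsSync(path.join(repoDir, ".git"))) {
    git(repoDir, ["init", "--quiet"]);
  }
}

// Hypothetical stand-in for the extract stage: writes one atom file.
function emitAtom(repoDir: string, fileName: string, body: string): void {
  writeFileSync(path.join(repoDir, fileName), body);
}

// Stage everything and commit. Returns false when the iteration changed nothing.
function commitIteration(repoDir: string, message: string): boolean {
  git(repoDir, ["add", "-A"]);
  // `git status --porcelain` prints nothing when the tree is clean.
  if (git(repoDir, ["status", "--porcelain"]).trim() === "") return false;
  git(repoDir, [
    "-c", "user.name=dossier-runtime",            // assumed identity so the
    "-c", "user.email=runtime@dossier.invalid",   // commit works without global config
    "commit", "--quiet", "-m", message,
  ]);
  return true;
}
```

With helpers like these, a loop iteration would call ensureRepo once, let the extract stage write its atoms, then call commitIteration, so each iteration that changes the knowledge base appears as exactly one commit in the tenant's history.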

Decision

1. A per-tenant control plane with siloed provisioning. provisionTenant creates a siloed workspace root/<clientId>/ containing the tenant's OKF repo and a dossier.tenant.json manifest; a TenantRegistry exposes list / get / deprovision for lifecycle. Each tenant is a separate subtree — the isolation boundary made concrete, not a query filter. Tenant context is threaded explicitly through every operation; no ambient/global state (consistent with the TenantContext discipline from Extraction runtime architecture — the moat and the TenantConfig discipline from MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB).

2. The orchestrated loop is the deterministic, testable core; the Agent SDK is reserved behind a seam. An Orchestrator interface defines the loop. The shipped DefaultOrchestrator runs the staged loop directly and offline (no network, no key) — it is the valuable, testable control plane. A concrete AgentSdkOrchestrator is reserved: it is the only place that would import @anthropic-ai/claude-agent-sdk, and it is where resumable multi-source onboarding, the memory tool, and a curation sub-agent will live. It is deferred because it requires network + key (which would break offline CI) — and because the control plane is the part that needs to be proven first. This honors DEC-0008's mandate that "the Agent SDK is reserved to @dossier/runtime so orchestration never leaks into the moat."

3. Git-per-tenant is the learning-loop iteration record, in the client's own history. With vcs: 'git', the loop git inits the tenant OKF repo and commits (git add -A && git commit) after each extract. Every iteration of the loop is therefore a diff in the client's own git — the client owns the record of how their memory compounded, satisfying Adopt OKF as Dossier's canonical knowledge format (one client = one git repo = system of record; replaceable caches everywhere else). Implemented with node:child_process; no new dependency.

4. Tenant isolation is enforced at every boundary. assertClientId rejects path-traversal characters and separators in client ids; confineToTenant blocks .., sibling-prefix escapes (acme must not reach acme-evil), and absolute-path escapes. Both gate every read, write, and discovery operation. The TenantRegistry runs ids through the same gate (get('../beta') throws). Isolation is a tested boundary, not an intention.

5. The loop is the integration, end-to-end and offline. runLoop composes the existing layers — it calls @dossier/extraction's run() to extract and emit OKF atoms into the tenant's siloed repo, then @dossier/mcp's loadKnowledgeBase / createServer serve that same repo. The headline test proves the full path offline (provision → runLoop on a MockClaudeClient → 3 valid OKF atoms → MCP search_concepts / get_related find them). Provision → ingest → extract → OKF → serve is one runnable flow.
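The isolation gates described in item 4 can be illustrated with a small sketch. The names assertClientId and confineToTenant come from the record; the signatures, the allow-list pattern, and the resolve-then-prefix-check approach are assumptions about one reasonable way to meet the stated requirements (reject traversal in ids, block .., sibling-prefix escapes, and absolute paths). This sketch does not resolve symlinks, which a production gate would likely also need to consider.

```typescript
import * as path from "node:path";

// Reject anything that is not a plain token. An allow-list is stricter than
// the "path-traversal characters and separators" deny-list the record
// describes; that choice is an assumption of this sketch.
function assertClientId(clientId: string): string {
  if (!/^[A-Za-z0-9][A-Za-z0-9_-]*$/.test(clientId)) {
    throw new Error(`invalid client id: ${JSON.stringify(clientId)}`);
  }
  return clientId;
}

// Resolve a tenant-relative path and verify it stays inside root/<clientId>/.
function confineToTenant(root: string, clientId: string, relPath: string): string {
  const tenantDir = path.resolve(root, assertClientId(clientId));
  // path.resolve normalises ".." segments, and an absolute relPath replaces
  // tenantDir entirely, so both kinds of escape show up in `target`.
  const target = path.resolve(tenantDir, relPath);
  // Compare against tenantDir plus a separator so that "acme" does not
  // accept "acme-evil" as a prefix match.
  if (target !== tenantDir && !target.startsWith(tenantDir + path.sep)) {
    throw new Error(`path escapes tenant ${clientId}: ${relPath}`);
  }
  return target;
}
```

For example, confineToTenant(root, "acme", "okf/atom.md") would return a path under root/acme/, while "../acme-evil/x" or "/etc/passwd" as the relative path would throw. Calling a gate like this at the top of every read, write, and discovery function is what turns the isolation rule into something a test can exercise.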

Rationale

  • The control plane is the valuable, testable core — so build and prove it first. Provisioning, the registry, loop wiring, git-per-tenant, and isolation are all fully exercisable with zero network. Reserving the Agent SDK behind the Orchestrator seam means CI stays offline and the orchestration capability drops in later without a control-plane rewrite — the same seam discipline that already paid off for the Embedder (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB) and the ClaudeClient (Extraction runtime architecture — the moat).
  • Reuse, not reimplementation, keeps the layers honest. @dossier/runtime conducts @dossier/extraction's run() and @dossier/mcp's createServer over the shared @dossier/okf keystone. The runtime owns orchestration; it never re-derives extraction or serving — so the layers can't disagree about what the loop does.
  • Siloed subtree + git-per-tenant is sovereignty made operational. Isolation is the process/subtree boundary, not a query filter (the conservative read of the MCP-isolation risk Claude-primitives-first build strategy flagged). And the client's own git — not a platform DB — is the record of every loop iteration, so Dossier — Mission & North Star's "owned, compounding learning loop" is literally true on disk: they own the memory and the history of how it was learned.
  • Asserted, not verified. Built and green offline (@dossier/runtime 21 tests; repo-wide 246 passed / 1 gated-skip), with the full loop proven end-to-end on a mock. But the live AgentSdkOrchestrator is unbuilt, real ingestion is reserved, and the isolation model is not yet battle-tested at real multi-tenant scale. This is design-level conviction backed by an offline single-tenant run, not field evidence.

Consequences

  • The last DEC-0008 reservation is closed. With @dossier/runtime built, all four architecture layers exist in code and are integrated: ingest (reserved/commodity) · extract + eval (the moat) · serve (the agentic foundation, MCP) · orchestrate (this runtime) — all on the @dossier/okf keystone, per-tenant siloed. Dossier's learning loop is now a runnable system, not a design.
  • The Agent SDK is a standing reservation. When resumable multi-source onboarding / memory / curation are needed, the live AgentSdkOrchestrator drops into the existing Orchestrator seam — the only place @anthropic-ai/claude-agent-sdk imports — with no control-plane rewrite and no change to the offline-CI invariant for the deterministic core.
  • Git-per-tenant is now load-bearing for sovereignty. The client's own git history is the iteration record; any future operator/analytics view must derive from it, never replace it — widening that into a central system of record would breach Adopt OKF as Dossier's canonical knowledge format.
  • Isolation correctness is a tested boundary (assertClientId + confineToTenant on every read/write/discovery; the registry gates ids). New code paths in @dossier/runtime must go through the gate or they regress the sovereignty guarantee.
  • Two-way vs. durable. Orchestration internals, the registry/provisioning shape, and loop-stage wiring are expected to evolve (two-way door). The loop shape (provision → ingest → extract → OKF → serve), git-per-tenant as the iteration record, and the siloed-subtree isolation model are the durable, harder-to-reverse commitments.

Review

Revisit at real multi-tenant scale: does siloed subtree-per-tenant + git-per-tenant hold operationally (provisioning throughput, repo proliferation, deprovision/retention), or does scale pressure force a different topology? Separately, when the work demands it, activate the AgentSdkOrchestrator (resumable multi-source onboarding, memory tool, curation sub-agent) and wire real ingestion (@dossier/ingestion) through the loop. At that point, re-examine the orchestration design against a real client run and consider promoting confidence to verified.