Build a fully-owned hosted control plane (do NOT adopt the Vercel claude-managed-agents starter); settle the system of record as hybrid / thin-control-plane with the client-owned OKF git repo canonical

0032-hosted-control-plane-not-vercel-starter

Type: decision
Read as: Explain
Confidence: verified
Status: active, 2026-06-16
Owner: principal-architect
Reversibility: one-way door

DEC-0032 — Build a fully-owned hosted control plane (NOT the Vercel starter); system of record = hybrid / thin-control-plane, OKF git repo canonical

Reversibility: one-way door on the topology commitment it settles. Making the host the system of record (the starter's model) would be expensive to unwind and is the exact lock-in that "Adopt OKF as Dossier's canonical knowledge format" exists to refuse, so "git-repo-canonical, host-stores-only-derived-state" is treated as a durable architectural invariant, not a revisitable preference. Two-way door on the implementation surface: the hosted app framework, the sandbox provider, the durable-workflow engine, and the dashboard tech are all swappable and not yet chosen (see the OPEN sub-decisions under Review).

Context

The operator (DXA owner) evaluated vercel-labs/claude-managed-agents-starter (https://github.com/vercel-labs/claude-managed-agents-starter) as a possible foundation to host Dossier's app + dashboard + multi-tenant control plane, keeping the Claude Code plugin (Plugin + marketplace packaging — distribution as the agency wedge, built from the canonical .claude/ primitives) as a downloadable installable. He had built a prior project on it, liked it, and was worried the current structure (Claude Code plugin + Astro site + git-resident OKF) won't scale as features grow. He explicitly asked the research to settle WHERE THE SYSTEM OF RECORD SHOULD LIVE — thin control plane over client repos vs host-as-primary vs hybrid — and named the requirements the hosted layer must support: persistent DB/storage beyond git, auth + multi-tenant isolation, long-running/background server-side agent runs, and general future-proofing.

This was answered by a /deep-research workflow: 5 angles, 20 sources fetched, 25 claims adversarially verified (3-vote) — 21 confirmed / 4 killed. The findings, refutations, and follow-ups below are the verified output of that workflow, not assertion. This record carries the why exactly as the research established it; where a claim was refuted, it is recorded as refuted so we do not act on it later.

Time-sensitivity (load-bearing). All facts are dated Apr–Jun 2026 against fast-moving betas — Anthropic Managed Agents (beta header managed-agents-2026-04-01), @anthropic-ai/sdk 0.86.x, Vercel Workflows. Re-verify before acting.

Options considered

The system-of-record axis (the operator's explicit question).

  • (a) Host-as-primary — the hosted platform's database is the system of record; client repos (if any) are exports/mirrors. This is the starter's model (see Finding 4). Rejected: it is the literal inverse of the sovereignty thesis in "Dossier — Mission & North Star" and the exact lock-in that "Adopt OKF as Dossier's canonical knowledge format" exists to refuse.
  • (b) Thin control plane only — the host holds nothing durable; every read re-derives from the client repo. Maximally sovereign but operationally weak: no fast serving, no run history, no auth/tenant store.
  • (c) Hybrid / thin-control-plane (chosen) — the client-owned OKF git repo stays CANONICAL (system of record); the host stores only DERIVED state (vector/index caches, run history, session metadata, auth, repo pointers) and orchestrates agent runs against the repo. This is the DoltHub control-plane / data-plane split (Finding 7): a control plane (hosted API + reconciler over desired state) sitting over per-customer data instances (the client OKF repos). It satisfies the operator's four requirements without moving the system of record off the client's git.
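The split in option (c) can be pictured as a small reconcile step: the host records which commit its derived state was built from and compares that with the client repo's head. The sketch below is only an illustration under stated assumptions; `RepoPointer`, `ReconcileAction`, and `planReconcile` are hypothetical names, not part of the research output or of any library.

```typescript
// Hypothetical sketch: the client repo head is canonical; the host only
// tracks which commit its derived state (indexes, caches) was built from.
interface RepoPointer {
  tenantId: string;
  repoUrl: string;
  indexedCommit: string | null; // commit the derived index was built from
}

type ReconcileAction =
  | { kind: "noop"; tenantId: string }
  | { kind: "rebuild-derived"; tenantId: string; from: string | null; to: string };

// Compare desired state (repo head) with observed state (indexed commit).
// The host never writes canonical data here; it only schedules a rebuild.
function planReconcile(pointer: RepoPointer, repoHead: string): ReconcileAction {
  if (pointer.indexedCommit === repoHead) {
    return { kind: "noop", tenantId: pointer.tenantId };
  }
  return {
    kind: "rebuild-derived",
    tenantId: pointer.tenantId,
    from: pointer.indexedCommit,
    to: repoHead,
  };
}
```

Because the only durable host field is a commit pointer, deleting the host's state loses nothing that cannot be rebuilt from the repo, which is the property the hybrid option is meant to guarantee.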

The foundation axis (fork the starter vs build owned).

  • (a) Fork claude-managed-agents-starter as the product base. Rejected on four verified grounds:
    1. It is an UNLICENSED demo/reference template, not a maintained framework (Finding 1): license=null (no LICENSE file, /license 404, no license field in package.json), 58 stars / 5 forks, substantive work was an 8-day burst Apr 8–16 2026, only docs edits since (last push 2026-05-27). No usage/redistribution grant — read it as a pattern reference, do not fork it as a product base.
    2. It runs agents via Anthropic's HOSTED "Claude Managed Agents" (CMA) (Finding 3): lib/managed-agents.ts calls client.beta.sessions.create with agent/environment_id/vault_ids; model, harness, tools, and session state live on Anthropic's platform, the app is a thin poller. This is a different primitive than Dossier's self-hosted Agent SDK / claude -p runtime (Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys), Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system) — adopting it moves the agent runtime (and arguably session state) onto Anthropic, and CMA bills via API key, likely incompatible with the current Pro/Max subscription-auth setup (see the reference-sdk-subscription-auth memory).
    3. Its persistence model makes the HOST the system of record (Finding 4): one Postgres table (managed_agent_session) holds metadata + repo pointer columns only; the agent event log lives in the durable workflow run (replayable). NO git-as-source-of-truth — the core reason not to adopt it wholesale.
    4. Its multi-tenancy is per-USER, not per-CLIENT/org (Finding 5): Better Auth + Sign in with Vercel + row-level ownership checks; per-user Anthropic "Vaults" hold MCP tokens but are workspace-scoped — NOT a tenant boundary. Dossier needs per-client/org isolation; isolation here depends entirely on app-level ownership checks.
  • (b) Build a fully-owned hosted control plane (chosen), taking the starter's patterns (not its code/license) and reconciling them to the sovereignty thesis.

Decision

Do NOT migrate Dossier onto the Vercel starter as a foundation. Build a fully-owned hosted control plane, and settle the system-of-record question as HYBRID / thin-control-plane: the client-owned OKF git repo is CANONICAL; the host stores only DERIVED state and orchestrates agent runs against the repo. Keep the plugin (Plugin + marketplace packaging — distribution as the agency wedge, built from the canonical .claude/ primitives) as a downloadable installable (the sovereign local on-ramp); the hosted app is the managed convenience layer over the same OKF substrate. This is a directional decision ACCEPTED — four sub-decisions remain OPEN (below).

Verified findings adopted as design inputs. Beyond the rejection grounds above:

  • The Claude Agent SDK is STATEFUL (Finding 8): query() spawns a claude CLI subprocess per session owning a shell, cwd, and an on-disk JSONL transcript; N sessions = N subprocesses. The host needs persistent sandboxes, not stateless serverless functions.
  • Agent SDK multi-tenant isolation requires EXPLICIT config (Finding 9): defaults leak one tenant's CLAUDE.md/settings into another's session. Required knobs — settingSources:[], CLAUDE_CODE_DISABLE_AUTO_MEMORY=1, per-tenant CLAUDE_CONFIG_DIR, per-tenant cwd per query(), per-tenant proxy egress rules. Mandatory given Dossier hands each client a CLAUDE.md + OKF context (Plugin + marketplace packaging — distribution as the agency wedge, built from the canonical .claude/ primitives).
  • The Agent SDK is platform-agnostic (Finding 10): Anthropic names Vercel Sandbox, Cloudflare Sandboxes, Fly Machines, Modal, Daytona, E2B as fitting providers — no host lock-in.
  • Anthropic does NOT auto-mount git repos into sandboxes (Finding 11): Dossier must build the repo-staging step (clone/checkout the client OKF repo into the sandbox, run, commit back). This is the net-new engineering the control-plane pattern requires, and it is exactly the seam that keeps git canonical: the same one-client-one-repo isolation invariant that "Fix git-per-tenant isolation when a tenant root is nested inside another repo" protects and that "Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system" established.
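The isolation knobs from Finding 9 can be gathered into one per-tenant settings object so that no session is started with defaults. This is a minimal sketch, not the SDK's API: `TenantSessionConfig`, `tenantSessionConfig`, and the directory layout are hypothetical, and how these values are actually handed to `query()` should be checked against the current Agent SDK documentation. Per-tenant proxy egress rules are deliberately left out because they live outside the process configuration.

```typescript
// Hypothetical helper: derive per-tenant session settings from the knobs
// named in Finding 9. Directory layout and names are assumptions.
interface TenantSessionConfig {
  cwd: string;                  // per-tenant working directory
  settingSources: string[];     // empty: load no ambient settings or CLAUDE.md
  env: Record<string, string>;  // per-tenant environment for the CLI subprocess
}

function tenantSessionConfig(sandboxRoot: string, tenantId: string): TenantSessionConfig {
  // Reject ids that could escape the tenant directory (e.g. "../other").
  if (!/^[a-z0-9][a-z0-9-]*$/.test(tenantId)) {
    throw new Error(`invalid tenant id: ${tenantId}`);
  }
  const base = `${sandboxRoot}/${tenantId}`;
  return {
    cwd: `${base}/repo`,
    settingSources: [],
    env: {
      CLAUDE_CODE_DISABLE_AUTO_MEMORY: "1",
      CLAUDE_CONFIG_DIR: `${base}/.claude-config`,
    },
  };
}
```

Building the whole object in one place makes it harder to set one knob and forget another, which matters because the leak described in Finding 9 comes from defaults rather than from an explicit mistake.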

Reusable ideas taken from the research (patterns, not code):

  • The durable-workflow execution model (Finding 6): poll → sleep() (releases compute) → replay-from-last-step on crash (app/workflows/tail-session.ts). Satisfies server-side long-running runs without a local session, and is reusable whether agents are CMA-hosted or self-hosted. Equivalents: Cloudflare Workflows/Durable Objects, Inngest, Temporal, Vercel Workflows.
  • The DoltHub control-plane / data-plane split (Finding 7): a hosted control plane (API + reconciler polling desired state) over per-customer data instances → maps to control plane (hosted app, derived state) over client-owned OKF git repos (data plane, canonical).
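The replay-from-last-step behaviour in Finding 6 can be illustrated without any workflow engine. The sketch below assumes a hypothetical in-memory journal (`Journal`, `DurableRun`, `pollWorkflow` are illustrative names); a real engine such as those listed above would persist the journal and would also model `sleep()`, which is omitted here. Steps are synchronous only to keep the example short.

```typescript
// Hypothetical in-memory model of replay-from-last-step. A real engine
// persists the journal; here it is a plain Map that survives the "crash".
type Journal = Map<string, unknown>;

class DurableRun {
  private journal: Journal;

  constructor(journal: Journal) {
    this.journal = journal;
  }

  // Run `fn` once per step name; on replay, return the journaled result.
  step<T>(name: string, fn: () => T): T {
    if (this.journal.has(name)) {
      return this.journal.get(name) as T;
    }
    const result = fn();
    this.journal.set(name, result);
    return result;
  }
}

// Workflow body: deterministic apart from what happens inside each step.
function pollWorkflow(run: DurableRun, poll: () => string): string[] {
  const seen: string[] = [];
  for (let i = 0; i < 3; i++) {
    seen.push(run.step(`poll-${i}`, poll));
  }
  return seen;
}
```

If a poll throws partway through, re-running `pollWorkflow` with the same journal skips the steps that already completed and resumes at the failed one, so side effects inside completed steps are not repeated.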

Rationale

  • It answers the operator's question from the sovereignty thesis, not preference. "Where should the system of record live?" → the client's OKF git repo, always (Adopt OKF as Dossier's canonical knowledge format, Dossier — Mission & North Star). The starter's strongest property (durable, replayable, hosted runs) is precisely the property that makes the host the system of record — the one thing Dossier cannot adopt. So we take the pattern (durable workflow + control/data-plane split) and invert the ownership (git canonical, host derived).
  • Owned, not a fork — for legal and architectural reasons. The starter is unlicensed (Finding 1) — there is no grant to build a product on it. And even with a grant, its host-as-SoR persistence and Anthropic-hosted runtime are the wrong substrate. Building owned is the only path that keeps both the license and the topology clean.
  • It preserves the runtime sovereignty we already chose. Dossier's runtime is self-hosted Agent SDK / claude -p on subscription auth (Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys), Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system). CMA would move the runtime and session state onto Anthropic and break subscription auth (API-key billing) — a regression on a decision already made.
  • It scales the way the operator wants without abandoning the thesis. Persistent DB/storage beyond git (derived state in Postgres), auth + tenant isolation (org-level, with the explicit Agent SDK isolation knobs), and long-running background runs (durable workflow over persistent sandboxes) are all delivered by the hybrid control plane — the concern that "the current structure won't scale" is addressed without making the host primary.
  • The net-new engineering is the right seam. Repo-staging (clone → run in an isolated sandbox → commit back) is exactly where git stays canonical and where the one-client-one-repo invariant ("Fix git-per-tenant isolation when a tenant root is nested inside another repo") is enforced in the hosted world. It extends the git-per-tenant model of "Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system" from a local subtree to a hosted sandbox.
  • Confidence is "verified" for the research, not the build. The 25 claims were adversarially 3-vote verified against 20 primary sources; the topology call follows directly from the confirmed findings (and is reinforced by the refuted ones). The rating covers the research basis and the directional call; it is not a claim that any hosted control plane has been built. That is the OPEN follow-up work below, and the time-sensitive facts must be re-verified before acting.
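The repo-staging seam named in the rationale (clone, run in an isolated sandbox, commit back) can be expressed as a plan of git invocations that is built before anything executes. This is a sketch under assumptions: `StagedRun` and `stagingPlan` are hypothetical names, the shallow single-branch clone is one possible choice rather than a decided one, and credential handling and the agent run itself are out of scope.

```typescript
// Hypothetical planner: returns the git invocations for one staged run as
// argv arrays (no shell), so they can be audited before being executed.
interface StagedRun {
  repoUrl: string;
  branch: string;
  workdir: string; // per-tenant directory inside the sandbox
  message: string; // commit message for the agent's changes
}

function stagingPlan(run: StagedRun): { before: string[][]; after: string[][] } {
  return {
    // Stage: shallow-clone the client repo into the sandbox.
    before: [
      ["git", "clone", "--depth", "1", "--branch", run.branch, run.repoUrl, run.workdir],
    ],
    // Commit back: the repo, not the host, receives the result.
    after: [
      ["git", "-C", run.workdir, "add", "--all"],
      ["git", "-C", run.workdir, "commit", "--message", run.message],
      ["git", "-C", run.workdir, "push", "origin", run.branch],
    ],
  };
}
```

A worker would run the `before` commands, start the agent with its working directory set to `workdir`, then run the `after` commands. Note that `git commit` exits non-zero when there is nothing to commit, so a real worker would need to treat an unchanged tree as a successful no-op.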

Consequences

Refuted claims — record as DO-NOT-RELY

These were killed in adversarial verification and must not be acted on without fresh evidence:

  • (killed 0-3) Vercel Sandbox tiered timeouts "5h Pro / 45m Hobby / 5m default" — do NOT quote; verify current limits before sizing runs.
  • (killed 1-2) "Firecracker microVM per-session == per-tenant isolation" — per-session ≠ per-tenant; the tenant boundary must be designed explicitly.
  • (killed 0-3) "Sandbox runs have no persistence" — the persistent-sandbox pattern DOES preserve state on stop/resume; but ephemeral runs still need external durable state, so keep the system of record (git) + derived state (Postgres) outside the sandbox.
  • (killed 1-2) "Dolt customers can clone the hosted DB locally, so a hosted serving layer is automatically portable" — do not assume portability; it must be designed as repo-canonical + host-derived.

Review

This is a directional decision with four sub-decisions intentionally left OPEN — promote each as it is ratified, and re-verify the time-sensitive (Apr–Jun 2026, fast-moving beta) facts first.

OPEN sub-decisions (unresolved follow-ups):

  1. Agent RUNTIME — self-hosted Agent SDK in own sandbox (stronger sovereignty, keeps subscription auth, more ops) vs Anthropic-hosted CMA (simpler ops, Anthropic owns state, API-key billing). Thesis points to self-hosted; not yet ratified.
  2. Sandbox host choice + verified real execution ceiling (the prior timeout numbers were refuted — measure the actual limits).
  3. Hybrid boundary precision — host holds ONLY index cache + run metadata (repo strictly canonical) vs ALSO a queryable Postgres mirror of OKF concepts (faster serving, a second copy to reconcile). Sets reconciliation complexity and the strength of the sovereignty claim.
  4. Per-tenant isolation unit + cost — dedicated sandbox-per-tenant (DoltHub-style strong isolation, higher infra cost) vs shared container with the documented Agent SDK isolation knobs (Finding 9).

Recommended sequencing (for the record):

  1. Ratify runtime = self-hosted Agent SDK (sub-decision 1).
  2. Stand up the control-plane skeleton: hosted app + Postgres (derived state only) + org-level auth + a repos pointer table; no agent execution yet.
  3. Build repo-staging + run worker on one sandbox provider with the isolation knobs; dogfood on Dossier's own repo first (the dogfood-then-rollout discipline and sequencing call from "Agentic board v1 — build the git-resident OKF task board (deterministic offline core, SDK reserved), resolving DEC-0024's four open questions and dogfooding Dossier's own repo first").
  4. Wrap in a durable workflow (poll/sleep/replay, Finding 6).
  5. Then build dashboard surfaces (verticals, skill/agent packs, harnesses) on the owned foundation.
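For step 2 of the sequencing, the repos pointer table with org-level ownership could look roughly like the in-memory stand-in below. It is only a sketch: `RepoRow`, `RepoPointers`, and `getForOrg` are hypothetical, and a real implementation would sit on Postgres with the same org-scoped read path. The point it illustrates is the difference from the starter's per-user checks: every read is keyed by the caller's org.

```typescript
// Hypothetical stand-in for the `repos` pointer table: rows belong to an org,
// and every read takes the caller's org so the check cannot be skipped.
interface RepoRow {
  id: string;
  orgId: string;
  repoUrl: string;
}

class RepoPointers {
  private rows = new Map<string, RepoRow>();

  add(row: RepoRow): void {
    this.rows.set(row.id, row);
  }

  // Returns undefined both for "no such repo" and "belongs to another org",
  // so callers cannot probe for the existence of other tenants' repos.
  getForOrg(orgId: string, repoId: string): RepoRow | undefined {
    const row = this.rows.get(repoId);
    return row !== undefined && row.orgId === orgId ? row : undefined;
  }
}
```

The table holds a pointer to the client repo and nothing canonical, which keeps the skeleton consistent with the derived-state-only rule even before any agent execution exists.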

Key sources (cite + re-verify):

  • https://github.com/vercel-labs/claude-managed-agents-starter (primary; repo, package.json, lib/, app/workflows/, schema.ts)
  • https://vercel.com/kb/guide/claude-managed-agent-vercel
  • https://vercel.com/templates/next.js/claude-managed-agents
  • https://vercel.com/changelog/run-claude-managed-agents-with-vercel-sandbox
  • https://platform.claude.com/docs/en/managed-agents/overview (+ /self-hosted-sandboxes, /vaults)
  • https://code.claude.com/docs/en/agent-sdk/hosting (statefulness, isolation knobs, provider list)
  • https://www.dolthub.com/blog/2022-06-06-hosted-infrastructure/ (+ 2022-05-18-hosted-dolt, /docs/products/hosted/infrastructure) — control/data-plane pattern
  • https://vercel.com/docs/workflows (durable workflow model)
  • GitHub vercel-labs/vercel-openclaw CLAUDE.md (single-instance persistent-sandbox reference)