Web ingestion — a keyless HttpConnector by default, Firecrawl wired as the premium path, and a first-class CLI web-ingest mode

0021-web-ingestion-keyless-http-and-firecrawl

decision | read as: Explain | confidence: asserted | status: active | 2026-06-15 | owner: ingestion-engineer

Reversibility: two-way door

DEC-0021 — Web ingestion: keyless HttpConnector + Firecrawl wired + CLI web-ingest

Reversibility: two-way door — connector internals, crawl policy, and the SDK-defer mechanism are swappable; the durable parts are the DEC-0013 Connector seam, URL provenance from the page in, and offline-by-construction CI.

Context

DEC-0013 ("Ingestion connector seam — assemble, don't build, and ingestion owns the input contract") shipped the Connector seam plus one real offline connector (LocalFilesConnector) and reserved the three commodity connectors as stubs. One of them, FirecrawlConnector (web crawl), was left as a NOT_WIRED stub. The platform could ingest local files end-to-end but had no real path to feed a website. The first external website run (the milestone recorded under "Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys)" and "Fix git-per-tenant isolation when a tenant root is nested inside another repo") only worked by staging raw HTML by hand to stand in for the reserved connector: a manual scaffold, not a shippable command.

Two forces shaped what to build:

The user's standing preference for no API keys (Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys) made extraction keyless on the subscription; ingestion still had no keyless web path).
"Feed a website" should be a first-class command, not a throwaway script — the realistic agency motion is "point Dossier at a client's site."

This decision builds web ingestion in @dossier/ingestion + @dossier/runtime, verified this session, all green, with one live reproduced CLI run.

Options considered

1. The default web path — keyless HttpConnector vs. only the (keyed) Firecrawl path.

(a) Only wire Firecrawl (the reserved DEC-0013 stub) as the single web path. Rejected: every web ingest would then require a Firecrawl key or self-host — directly contradicting the no-API-keys preference, and putting a paid/keyed dependency on the default "feed a website" motion. The premium extractor should be an upgrade, not the floor.
(b) Build a net-new keyless HttpConnector as the default, with Firecrawl as the premium upgrade (chosen). HttpConnector uses global fetch (Node 22+) — no SDK, no API key — does a bounded same-host BFS crawl from seed URL(s), routes each page's HTML through the existing TextMarkdownParser (htmlToMarkdown) seam, and streams one Source per page. It is the keyless floor; Firecrawl becomes the keyed quality upgrade (main-content extraction, JS render) for when the naive reducer isn't good enough. This makes @dossier/ingestion ship two real connectors (LocalFiles + Http) behind the same seam.
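
The bounded same-host crawl described in option (b) can be sketched as follows. This is a minimal illustration under stated assumptions, not the HttpConnector source: crawlSameHost, FetchLike, PageResponse, CrawlResult, and the regex-based extractLinks are hypothetical names, the fetch function is injected rather than read from the global, and maxPages is assumed to count ingested pages rather than requests.

```typescript
// Minimal response surface the crawl needs; the global fetch Response has these members.
interface PageResponse {
  ok: boolean;
  url: string; // final URL after any redirects
  headers: { get(name: string): string | null };
  text(): Promise<string>;
}

// Injected fetch seam (hypothetical name), so tests never touch the network.
type FetchLike = (url: string) => Promise<PageResponse>;

interface CrawlResult {
  pages: { url: string; html: string }[];
  skipped: { url: string; reason: string }[];
}

// Naive href extraction: an assumption of this sketch, not the connector's parser.
function extractLinks(html: string, base: string): string[] {
  const links: string[] = [];
  for (const match of html.matchAll(/href\s*=\s*["']([^"'#]+)/gi)) {
    try {
      links.push(new URL(match[1] ?? "", base).toString());
    } catch {
      // Ignore hrefs that do not resolve to a valid URL.
    }
  }
  return links;
}

async function crawlSameHost(
  seeds: string[],
  fetchPage: FetchLike,
  maxPages = 1,
): Promise<CrawlResult> {
  const queue = [...seeds]; // FIFO frontier gives breadth-first order
  const seen = new Set(seeds);
  const hosts = new Set(seeds.map((seed) => new URL(seed).host));
  const result: CrawlResult = { pages: [], skipped: [] };

  while (queue.length > 0 && result.pages.length < maxPages) {
    const url = queue.shift() as string;
    const res = await fetchPage(url);
    if (!res.ok) {
      result.skipped.push({ url, reason: "non-2xx" });
      continue;
    }
    if (!(res.headers.get("content-type") ?? "").includes("text/html")) {
      result.skipped.push({ url, reason: "non-HTML" });
      continue;
    }
    const html = await res.text();
    if (html.trim() === "") {
      result.skipped.push({ url, reason: "empty" });
      continue;
    }
    const finalUrl = res.url || url; // stamp the page that was actually served
    result.pages.push({ url: finalUrl, html });
    for (const link of extractLinks(html, finalUrl)) {
      // Same-host scoping: only enqueue unseen links on a seed's host.
      if (hosts.has(new URL(link).host) && !seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return result;
}
```

In a live run the caller would pass the global fetch, for example `crawlSameHost([seed], (url) => fetch(url), 5)`; a test passes a fake that serves fixtures.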

2. The Firecrawl SDK — a package dependency vs. a deferred dynamic import.

(a) Add @mendable/firecrawl-js to package.json. Rejected: it pulls a vendor SDK into the leaf's dependency tree and risks networking CI — breaking the offline-by-construction invariant the whole monorepo holds (Ingestion connector seam — assemble, don't build, and ingestion owns the input contract). The leaf must stay lean.
(b) Defer the SDK behind a dynamic await import(...) inside ingest(), never a package dependency (chosen). Module load stays offline; the SDK is imported only at live-call time, wrapped with a clear "install …" error. Priority order: an injected client (the offline-test seam) → the deferred SDK via apiKey/baseUrl → a clear config error. Tests inject a fake client, so CI never imports the SDK — the same seam-with-mock discipline as the live ClaudeClient (Extraction runtime architecture — the moat), the Embedder (MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB), and ClaudeCodeClient's CliRunner (Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys)).
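
The resolution order in option (b) can be sketched like this. It is an illustration under assumptions, not the connector's #resolveClient: WebClient, ResolveOptions, makeClient, and loadSdk are hypothetical names, and the step that turns the loaded SDK module into a client is left to a caller-supplied adapter because this record does not describe the SDK's constructor surface.

```typescript
// Hypothetical minimal client surface for this sketch.
interface WebClient {
  scrape(url: string): Promise<string>;
}

interface SdkConfig {
  apiKey: string | undefined;
  baseUrl: string | undefined;
}

interface ResolveOptions {
  client?: WebClient; // injected client: the offline-test seam
  apiKey?: string;
  baseUrl?: string;
  // Hypothetical adapter from the loaded SDK module to a WebClient.
  makeClient?: (sdk: unknown, config: SdkConfig) => WebClient;
  // Hypothetical override so tests can simulate a missing or present SDK.
  loadSdk?: () => Promise<unknown>;
}

// Typed as string (not a literal) so the compiler does not try to resolve the package.
const SDK_SPECIFIER: string = "@mendable/firecrawl-js";

async function resolveClient(options: ResolveOptions): Promise<WebClient> {
  // 1. An injected client always wins, so CI never imports the SDK.
  if (options.client) return options.client;

  // 2. Otherwise load the SDK lazily, only when a live call is configured.
  if (options.apiKey || options.baseUrl) {
    if (!options.makeClient) {
      throw new Error("resolveClient: a makeClient adapter is required for the SDK path");
    }
    const loadSdk = options.loadSdk ?? (() => import(SDK_SPECIFIER));
    let sdk: unknown;
    try {
      sdk = await loadSdk();
    } catch {
      throw new Error(`Firecrawl SDK not available: install ${SDK_SPECIFIER} to run live`);
    }
    return options.makeClient(sdk, { apiKey: options.apiKey, baseUrl: options.baseUrl });
  }

  // 3. Nothing usable was configured.
  throw new Error("FirecrawlConnector needs an injected client, an apiKey, or a baseUrl");
}
```

Because the import happens inside the function body, loading the module that defines resolveClient performs no network access and does not require the SDK to be installed.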

3. Surfacing web ingest — a first-class CLI mode vs. a one-off script.

(a) A throwaway harness script (as the first website milestone used). Rejected: it isn't a product surface; "feed a website" is the realistic agency motion and deserves to be supported, not staged by hand each time.
(b) A first-class dossier-runtime run --url mode (chosen). Runtime httpIngest/firecrawlIngest factories (Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system) plus dossier-runtime run --url <seed> [--pages N] [--connector http|firecrawl]. --connector firecrawl requires FIRECRAWL_API_KEY (clear refusal otherwise — no silent network). Composes with the keyless subscription transport: the headline zero-key command is dossier-runtime run --subscription --url <url> --pages N --client <id> --root <dir> (keyless HttpConnector + the Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys) subscription extraction).
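
The flag handling for the --url mode, including the key refusal, might look like the sketch below. The flag names and the FIRECRAWL_API_KEY variable come from this record; planWebIngest, WebIngestPlan, the hand-rolled parser, and the CLI-level default of one page are assumptions for illustration, not the contents of packages/runtime/src/cli.ts. Flags such as --subscription, --client, and --root are parsed but otherwise ignored here.

```typescript
type ConnectorName = "http" | "firecrawl";

interface WebIngestPlan {
  url: string;
  pages: number;
  connector: ConnectorName;
}

// Validates web-ingest flags before any connector is constructed.
function planWebIngest(
  argv: string[],
  env: Record<string, string | undefined>,
): WebIngestPlan {
  // Collect "--flag value" pairs; a flag with no value is stored as "true".
  const flags = new Map<string, string>();
  for (let i = 0; i < argv.length; i += 1) {
    const arg = argv[i];
    if (arg === undefined || !arg.startsWith("--")) continue;
    const next = argv[i + 1];
    if (next !== undefined && !next.startsWith("--")) {
      flags.set(arg.slice(2), next);
      i += 1;
    } else {
      flags.set(arg.slice(2), "true");
    }
  }

  const url = flags.get("url");
  if (!url || url === "true") throw new Error("--url <seed> is required for web ingest");

  const pages = Number(flags.get("pages") ?? "1"); // default of 1 is an assumption
  if (!Number.isInteger(pages) || pages < 1) {
    throw new Error("--pages must be a positive integer");
  }

  const connector = flags.get("connector") ?? "http"; // keyless path is the default
  if (connector !== "http" && connector !== "firecrawl") {
    throw new Error("--connector must be http or firecrawl");
  }

  // Refuse before touching the network: the keyed path needs its key up front.
  if (connector === "firecrawl" && !env["FIRECRAWL_API_KEY"]) {
    throw new Error("--connector firecrawl requires FIRECRAWL_API_KEY");
  }

  return { url, pages, connector };
}
```

Validating up front keeps the refusal a plain configuration error rather than a failed request partway through a crawl.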

Decision

Build a keyless HttpConnector as the default web path, wire FirecrawlConnector as the premium upgrade behind a deferred SDK, and add a first-class dossier-runtime run --url CLI mode.

HttpConnector — the first real keyless network connector (packages/ingestion/src/connectors/http.ts). Global fetch (Node 22+), no SDK, no API key. Bounded BFS crawl from seed URL(s) (maxPages, default 1), same-host by default, each page's HTML routed through the existing TextMarkdownParser (htmlToMarkdown) seam, streaming one Source per page with the live page URL stamped as provenance.source (URL provenance from the web in — Adopt OKF as Dossier's canonical knowledge format). Non-2xx / non-HTML / empty responses are skipped and recorded on skipped. robots.txt is NOT yet honored (documented limitation / follow-up). This is the default "feed a website" path and the keyless answer to the no-API-keys preference.
FirecrawlConnector wired — the premium web path (packages/ingestion/src/connectors/firecrawl.ts), replacing the Ingestion connector seam — assemble, don't build, and ingestion owns the input contract NOT_WIRED stub. Main-content extraction + JS render; scrape + crawl modes; URL provenance. The Firecrawl SDK import is deferred (await import('@mendable/firecrawl-js') inside ingest(), wrapped with a clear "install …" error) so module load stays offline and the SDK is never a package dependency. Resolution priority: injected client (offline-test seam) → deferred SDK via apiKey/baseUrl → clear config error. sync() is honestly reserved with the documented content-hash / change-tracking SyncCursor strategy. It needs a Firecrawl key / self-host to run live — the keyed upgrade over the keyless HttpConnector.
Runtime web-ingest + CLI mode. httpIngest/firecrawlIngest factories in @dossier/runtime (packages/runtime/src/loop.ts) + a first-class dossier-runtime run --url <seed> [--pages N] [--connector http|firecrawl] mode (packages/runtime/src/cli.ts). --connector firecrawl requires FIRECRAWL_API_KEY (clear refusal otherwise). Headline keyless command: dossier-runtime run --subscription --url <url> --pages N --client <id> --root <dir>.

This realizes part of Ingestion connector seam — assemble, don't build, and ingestion owns the input contract (which reserved Firecrawl / Unstructured / M365 as stubs): Firecrawl is now wired, and a net-new keyless HttpConnector exists — so @dossier/ingestion now ships two real connectors (LocalFiles + Http) plus a wired premium Firecrawl; Unstructured / Microsoft365 remain reserved.
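
The "one Source per page, stamped with the live URL" behavior can be pictured with the short sketch below. Only provenance.source and the htmlToMarkdown seam are named in this record; the Source shape beyond that field, the FetchedPage type, and the toSources function are hypothetical, and the parser is passed in as a plain function rather than the real TextMarkdownParser.

```typescript
// Hypothetical Source shape: only provenance.source is taken from the record.
interface Source {
  content: string;
  provenance: { source: string };
}

interface FetchedPage {
  requestedUrl: string; // the URL the crawl asked for
  finalUrl: string;     // the URL that actually served the page, after redirects
  html: string;
}

type HtmlToMarkdown = (html: string) => string;

// Streams one Source per page, routing HTML through the injected parser seam.
async function* toSources(
  pages: Iterable<FetchedPage> | AsyncIterable<FetchedPage>,
  htmlToMarkdown: HtmlToMarkdown,
): AsyncGenerator<Source> {
  for await (const page of pages) {
    yield {
      content: htmlToMarkdown(page.html),
      // Stamp the final URL so a redirected request is attributed to the page served.
      provenance: { source: page.finalUrl },
    };
  }
}
```

Yielding from an async generator lets a downstream consumer start on the first page while later pages are still being fetched.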

Rationale

Keyless by default honors the user's preference and the GTM. A web ingest that needs no key (HttpConnector + the Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys) subscription transport) lets an agency point Dossier at a client's site with zero API-key management — the floor is free, Firecrawl is the paid upgrade for quality. The premium/keyless split puts the cost exactly where the value is.
The SDK defer keeps the leaf lean and CI offline. Importing @mendable/firecrawl-js only at live-call time (never a package.json dependency) means module load stays offline and the leaf carries no vendor weight — the same seam discipline that already keeps AnthropicClaudeClient/ClaudeCodeClient/Embedder from networking CI. Tests inject a fake client/fetch, so the suite never touches the SDK or the network.
URL provenance from the page in keeps auditable memory true on the web path. HttpConnector stamps the live page URL as provenance.source (it even followed a data-services→data-analytics redirect and stamped the final page), so the Ingestion connector seam — assemble, don't build, and ingestion owns the input contract "provenance from atom zero" property and the Adopt OKF as Dossier's canonical knowledge format sovereignty/audit promise hold for web sources, not just files — closing the gap Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys) flagged (file-path provenance there; URL provenance belongs to the web path).
A first-class CLI mode, not a script. "Feed a website" is the realistic agency motion; making it dossier-runtime run --url (with a clear key refusal on the Firecrawl path) turns the manually-staged first-website run into a supported, reproducible command.
Asserted, not verified. Built and verified green this session: repo-wide typecheck clean, lint 0 errors, all 8 packages build, pnpm test 289 passed / 1 skipped (up from 277, a net +12: +6 HttpConnector and +6 Firecrawl offline tests, +2 runtime web-ingest tests, minus the dropped obsolete Firecrawl-"reserved" assertions), plugin:check in sync, plus one live reproduced CLI run (see Consequences). That is design-level conviction backed by a single live run, not multi-corpus or market validation; robots.txt is unhonored and the Firecrawl path is unexercised live.
- 2026-06-19: the FirecrawlConnector hosted-API path is now field-proven (scoped promotion). Both #resolveClient branches have live evidence per "First live FirecrawlConnector run against a real client source — field evidence for the reserved web seam": the injected-client seam drove a 75-page live crawl of rbaconsulting.com, and the deferred-SDK import branch drove a live @mendable/firecrawl-js v4.28.1 scrape of the same site (status 200, no SDK-surface drift). The frontmatter confidence stays asserted because the model defines confidence as a single enum, gated on the broadest still-unproven claim in this record; the verified scope is the hosted-API path only. Still asserted (the residuals): the self-hosted baseUrl path, multi-site generalization, crawl scale beyond ~75 pages, sync()/incremental, and the keyless HttpConnector (held back because robots.txt is still unhonored). DEC-0013's broader connector-seam claim also stays asserted (it still covers the unbuilt Unstructured / M365 connectors).

Consequences

@dossier/ingestion now ships two real connectors + a wired premium one. LocalFilesConnector (offline files) and HttpConnector (keyless web) are both live behind the Connector seam; FirecrawlConnector is wired as the keyed premium web path. UnstructuredConnector and Microsoft365Connector remain reserved — this realizes part, not all, of Ingestion connector seam — assemble, don't build, and ingestion owns the input contract's reservation.
The keyless web-ingest loop runs end-to-end, live. Verified, reproduced this session: dossier-runtime run --subscription --url https://www.rbaconsulting.com/what-we-do/data-services/ --pages 1 --client rba-cli … produced 16 OKF atoms, 0 rejected, 0 graph issues, committed (4926144) into the tenant's own isolated repo (the Fix git-per-tenant isolation when a tenant root is nested inside another repo fix held — main untouched at b74b9b6); provenance stamped the live URL (the connector followed a redirect and stamped the final page).
CI stays offline by construction. HttpConnector tests inject fetch; Firecrawl tests inject a client; no SDK was added to package.json — the offline-first invariant the monorepo holds is preserved.
Honest open items. robots.txt is not yet honored in HttpConnector (documented follow-up); the FirecrawlConnector hosted-API path is now field-proven — both #resolveClient branches (injected-client seam + deferred-SDK import) ran live against rbaconsulting.com per First live FirecrawlConnector run against a real client source — field evidence for the reserved web seam (2026-06-19) — but its self-hosted baseUrl path remains unexercised live; UnstructuredConnector / Microsoft365Connector stay reserved; the naive htmlToMarkdown reducer is noisy on raw marketing HTML (the reason Firecrawl's main-content extraction exists as the upgrade).
Two-way vs. durable. Connector internals, crawl policy (same-host default, maxPages), and the SDK-defer mechanism are swappable (two-way door). The durable commitments are the Ingestion connector seam — assemble, don't build, and ingestion owns the input contract Connector seam, URL provenance from the page in, and offline-by-construction CI.
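
The offline-by-construction property relies on tests injecting their own fetch. A minimal sketch of such a stand-in follows; offlineFetch and FakeResponse are hypothetical names and this is not the project's test helper. It serves only fixtures and fails loudly on anything else, so an unplanned request shows up as a test failure instead of a network call.

```typescript
interface FakeResponse {
  ok: boolean;
  status: number;
  url: string;
  headers: { get(name: string): string | null };
  text(): Promise<string>;
}

// Builds a fetch stand-in that serves only fixtures and records every request.
function offlineFetch(fixtures: Record<string, string>) {
  const calls: string[] = [];
  const fetchFake = async (url: string): Promise<FakeResponse> => {
    calls.push(url);
    const body = fixtures[url];
    if (body === undefined) {
      // Any URL without a fixture is treated as an attempted live request.
      throw new Error(`offline test attempted an unfixtured request: ${url}`);
    }
    return {
      ok: true,
      status: 200,
      url,
      headers: {
        get: (name) => (name.toLowerCase() === "content-type" ? "text/html" : null),
      },
      text: async () => body,
    };
  };
  return { fetchFake, calls };
}
```

The recorded calls array also lets a test assert exactly which URLs a connector asked for, which is useful when checking same-host scoping.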

Review

The FirecrawlConnector hosted-API path was promoted to verified (scoped) on 2026-06-19: both #resolveClient branches (injected-client seam + deferred-SDK import) have live field evidence from the rbaconsulting.com run ("First live FirecrawlConnector run against a real client source — field evidence for the reserved web seam"). The frontmatter enum stays asserted, and the verified scope is carried in Rationale/Consequences. Remaining bar before a broader promotion to verified: a second real site AND the self-hosted baseUrl path exercised live. The keyless HttpConnector stays asserted while robots.txt is unhonored. Confirm that crawl behavior (same-host scoping, redirects, maxPages) holds across multiple sites, and validate the self-host premium main-content path against the naive reducer. Honor robots.txt in HttpConnector before any broad/multi-page crawling (the standing follow-up). Revisit the crawl contract if real sources need cross-host crawling, depth/breadth limits beyond maxPages, or page/section provenance anchors. Wiring UnstructuredConnector / Microsoft365Connector remains the rest of DEC-0013's reserved work.