Web ingestion — a keyless HttpConnector by default, Firecrawl wired as the premium path, and a first-class CLI web-ingest mode
0021-web-ingestion-keyless-http-and-firecrawl
DEC-0021 — Web ingestion: keyless HttpConnector + Firecrawl wired + CLI web-ingest
Reversibility: two-way door — connector internals, crawl policy, and the SDK-defer mechanism are swappable; the durable parts are the DEC-0013 Connector seam, URL provenance from the page in, and offline-by-construction CI.
Context
DEC-0013 ("Ingestion connector seam — assemble, don't build, and ingestion owns the input contract") shipped the Connector seam plus one real offline connector (LocalFilesConnector) and reserved the three commodity connectors as stubs, among them FirecrawlConnector (web crawl), left as a NOT_WIRED stub. The platform could ingest local files end-to-end but had no real path to feed a website. The first external website run (the milestone under "Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys)" and "Fix git-per-tenant isolation when a tenant root is nested inside another repo") only worked by staging raw HTML by hand to stand in for the reserved connector: a manual scaffold, not a shippable command.
Two forces shaped what to build:
- The user's standing preference for no API keys. "Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys)" made extraction keyless on the subscription, but ingestion still had no keyless web path.
- "Feed a website" should be a first-class command, not a throwaway script — the realistic agency motion is "point Dossier at a client's site."
This decision builds web ingestion in @dossier/ingestion + @dossier/runtime, verified this session, all green, with one live reproduced CLI run.
Options considered
1. The default web path — keyless HttpConnector vs. only the (keyed) Firecrawl path.
- (a) Only wire Firecrawl (the reserved DEC-0013 stub) as the single web path. Rejected: every web ingest would then require a Firecrawl key or self-host — directly contradicting the no-API-keys preference, and putting a paid/keyed dependency on the default "feed a website" motion. The premium extractor should be an upgrade, not the floor.
- (b) Build a net-new keyless HttpConnector as the default, with Firecrawl as the premium upgrade (chosen). HttpConnector uses global fetch (Node 22+), with no SDK and no API key. It does a bounded same-host BFS crawl from the seed URL(s), routes each page's HTML through the existing TextMarkdownParser (htmlToMarkdown) seam, and streams one Source per page. It is the keyless floor; Firecrawl becomes the keyed quality upgrade (main-content extraction, JS render) for when the naive reducer isn't good enough. This makes @dossier/ingestion ship two real connectors (LocalFiles + Http) behind the same seam.
2. The Firecrawl SDK — a package dependency vs. a deferred dynamic import.
- (a) Add @mendable/firecrawl-js to package.json. Rejected: it pulls a vendor SDK into the leaf's dependency tree and risks networking CI, breaking the offline-by-construction invariant the whole monorepo holds ("Ingestion connector seam — assemble, don't build, and ingestion owns the input contract"). The leaf must stay lean.
- (b) Defer the SDK behind a dynamic await import(...) inside ingest(), never a package dependency (chosen). Module load stays offline; the SDK is imported only at live-call time, wrapped with a clear "install …" error. Priority order: an injected client (the offline-test seam) → the deferred SDK via apiKey / baseUrl → a clear config error. Tests inject a fake client, so CI never imports the SDK. This is the same seam-with-mock discipline as the live ClaudeClient ("Extraction runtime architecture — the moat"), the Embedder ("MCP agentic foundation — tenant-scoped GraphRAG over the OKF KB"), and ClaudeCodeClient's CliRunner ("Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys)").
3. Surfacing web ingest — a first-class CLI mode vs. a one-off script.
- (a) A throwaway harness script (as the first website milestone used). Rejected: it isn't a product surface; "feed a website" is the realistic agency motion and deserves to be supported, not staged by hand each time.
- (b) A first-class dossier-runtime run --url mode (chosen). Runtime httpIngest / firecrawlIngest factories ("Runtime orchestration & per-tenant control plane — the learning loop becomes a runnable system") plus dossier-runtime run --url <seed> [--pages N] [--connector http|firecrawl]. --connector firecrawl requires FIRECRAWL_API_KEY (clear refusal otherwise, no silent network). It composes with the keyless subscription transport: the headline zero-key command is dossier-runtime run --subscription --url <url> --pages N --client <id> --root <dir> (keyless HttpConnector + the subscription extraction from "Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys)").
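As a rough sketch of the flag handling in option 3(b), the connector choice and key refusal might look like the following. The planWebIngest function, the WebIngestArgs and WebIngestPlan shapes, and the error wording are hypothetical; only the flag names, the keyless http default, and the FIRECRAWL_API_KEY requirement come from this record. Falling back to one page when --pages is omitted is an assumption based on the connector's stated maxPages default.

```typescript
// Hypothetical shapes for illustration; not the real @dossier/runtime types.
type ConnectorName = "http" | "firecrawl";

interface WebIngestArgs {
  url: string;        // --url <seed>
  pages?: number;     // --pages N
  connector?: string; // --connector http|firecrawl
}

interface WebIngestPlan {
  connector: ConnectorName;
  seed: string;
  maxPages: number;
  apiKey?: string;
}

type Env = Record<string, string | undefined>;

function isConnectorName(value: string): value is ConnectorName {
  return value === "http" || value === "firecrawl";
}

function planWebIngest(args: WebIngestArgs, env: Env): WebIngestPlan {
  const connector = args.connector ?? "http"; // the keyless path is the default
  if (!isConnectorName(connector)) {
    throw new Error(`Unknown --connector "${connector}" (expected http or firecrawl)`);
  }
  const maxPages = args.pages ?? 1; // assumption: mirrors the connector default of 1
  if (!Number.isInteger(maxPages) || maxPages < 1) {
    throw new Error("--pages must be a positive integer");
  }
  if (connector === "firecrawl") {
    const apiKey = env["FIRECRAWL_API_KEY"];
    if (!apiKey) {
      // Refuse before any connector is built, so nothing reaches the network silently.
      throw new Error("--connector firecrawl requires FIRECRAWL_API_KEY");
    }
    return { connector, seed: args.url, maxPages, apiKey };
  }
  return { connector, seed: args.url, maxPages };
}
```

Passing the environment in as a plain record, rather than reading it globally, keeps the refusal path testable offline.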
Decision
Build a keyless HttpConnector as the default web path, wire FirecrawlConnector as the premium upgrade behind a deferred SDK, and add a first-class dossier-runtime run --url CLI mode.
- HttpConnector: the first real keyless network connector (packages/ingestion/src/connectors/http.ts). Global fetch (Node 22+), no SDK, no API key. Bounded BFS crawl from seed URL(s) (maxPages, default 1), same-host by default, each page's HTML routed through the existing TextMarkdownParser (htmlToMarkdown) seam, streaming one Source per page with the live page URL stamped as provenance.source (URL provenance from the web in, per "Adopt OKF as Dossier's canonical knowledge format"). Non-2xx / non-HTML / empty responses are skipped and recorded on skipped. robots.txt is NOT yet honored (documented limitation / follow-up). This is the default "feed a website" path and the keyless answer to the no-API-keys preference.
- FirecrawlConnector wired: the premium web path (packages/ingestion/src/connectors/firecrawl.ts), replacing the NOT_WIRED stub from "Ingestion connector seam — assemble, don't build, and ingestion owns the input contract". Main-content extraction + JS render; scrape + crawl modes; URL provenance. The Firecrawl SDK import is deferred (await import('@mendable/firecrawl-js') inside ingest(), wrapped with a clear "install …" error) so module load stays offline and the SDK is never a package dependency. Resolution priority: injected client (offline-test seam) → deferred SDK via apiKey / baseUrl → clear config error. sync() is honestly reserved with the documented content-hash / change-tracking SyncCursor strategy. It needs a Firecrawl key / self-host to run live: it is the keyed upgrade over the keyless HttpConnector.
- Runtime web-ingest + CLI mode. httpIngest / firecrawlIngest factories in @dossier/runtime (packages/runtime/src/loop.ts) + a first-class dossier-runtime run --url <seed> [--pages N] [--connector http|firecrawl] mode (packages/runtime/src/cli.ts). --connector firecrawl requires FIRECRAWL_API_KEY (clear refusal otherwise). Headline keyless command: dossier-runtime run --subscription --url <url> --pages N --client <id> --root <dir>.
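The crawl policy described for HttpConnector (bounded BFS, same-host, maxPages default 1) can be illustrated with a minimal sketch. This is not the code in packages/ingestion/src/connectors/http.ts: the names sameHostLinks, crawl, FetchedHtml, and FetchHtml are hypothetical, the regex-based link extraction is a simple stand-in, and counting only successfully fetched pages against maxPages is an assumption.

```typescript
// Hypothetical shapes; the real connector's types are not shown in this record.
interface FetchedHtml {
  finalUrl: string; // URL after any redirects
  html: string;
}

// Injected fetcher (tests can pass a fake). Returns null for a skipped page.
type FetchHtml = (url: string) => Promise<FetchedHtml | null>;

// Naive href extraction, kept deliberately simple for the sketch.
function sameHostLinks(html: string, baseUrl: string, host: string): string[] {
  const links: string[] = [];
  const hrefPattern = /href="([^"]+)"/g;
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      const next = new URL(match[1] ?? "", baseUrl); // resolves relative links
      next.hash = ""; // fragments point at the same page
      if (next.host === host && !links.includes(next.href)) links.push(next.href);
    } catch {
      // Unparseable href: ignore it.
    }
  }
  return links;
}

async function crawl(seed: string, fetchHtml: FetchHtml, maxPages = 1): Promise<string[]> {
  const start = new URL(seed);
  const queue: string[] = [start.href]; // FIFO queue gives breadth-first order
  const seen = new Set<string>(queue);
  const visited: string[] = [];
  while (queue.length > 0 && visited.length < maxPages) {
    const url = queue.shift() as string;
    const page = await fetchHtml(url);
    if (page === null) continue; // skipped pages do not count (assumption)
    visited.push(page.finalUrl);
    for (const link of sameHostLinks(page.html, page.finalUrl, start.host)) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return visited;
}
```

Because the fetcher is a parameter, a test can hand crawl an in-memory map of pages and never open a socket, which is the same injection idea the record describes for the connector's own tests.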
This realizes part of Ingestion connector seam — assemble, don't build, and ingestion owns the input contract (which reserved Firecrawl / Unstructured / M365 as stubs): Firecrawl is now wired, and a net-new keyless HttpConnector exists — so @dossier/ingestion now ships two real connectors (LocalFiles + Http) plus a wired premium Firecrawl; Unstructured / Microsoft365 remain reserved.
Rationale
- Keyless by default honors the user's preference and the GTM. A web ingest that needs no key (HttpConnector + the subscription transport from "Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys)") lets an agency point Dossier at a client's site with zero API-key management: the floor is free, and Firecrawl is the paid upgrade for quality. The premium/keyless split puts the cost exactly where the value is.
- The SDK defer keeps the leaf lean and CI offline. Importing @mendable/firecrawl-js only at live-call time (never a package.json dependency) means module load stays offline and the leaf carries no vendor weight. This is the same seam discipline that already keeps AnthropicClaudeClient / ClaudeCodeClient / Embedder from networking CI. Tests inject a fake client / fetch, so the suite never touches the SDK or the network.
- URL provenance from the page in keeps auditable memory true on the web path. HttpConnector stamps the live page URL as provenance.source (it even followed a data-services → data-analytics redirect and stamped the final page), so the "provenance from atom zero" property of "Ingestion connector seam — assemble, don't build, and ingestion owns the input contract" and the sovereignty/audit promise of "Adopt OKF as Dossier's canonical knowledge format" hold for web sources, not just files. This closes the gap that "Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys)" flagged (file-path provenance there; URL provenance belongs to the web path).
- A first-class CLI mode, not a script. "Feed a website" is the realistic agency motion; making it dossier-runtime run --url (with a clear key refusal on the Firecrawl path) turns the manually-staged first-website run into a supported, reproducible command.
- Confidence: asserted, not verified. Built and verified green this session: repo-wide typecheck clean, lint 0 errors, all 8 packages build, pnpm test 289 passed / 1 skipped (was 277 → +12: 6 HttpConnector + 6 Firecrawl offline tests, +2 runtime web-ingest, minus the dropped obsolete Firecrawl-"reserved" assertions), plugin:check in sync, plus one live reproduced CLI run (below). That is design-level conviction backed by a single live run, not multi-corpus or market validation; robots.txt is unhonored and the Firecrawl path is unexercised live.
  - 2026-06-19: the FirecrawlConnector hosted-API path is now field-proven (scoped promotion). Both #resolveClient branches have live evidence per "First live FirecrawlConnector run against a real client source — field evidence for the reserved web seam": the injected-client seam drove a 75-page live crawl of rbaconsulting.com, and the deferred-SDK import branch drove a live @mendable/firecrawl-js v4.28.1 scrape of the same site (status 200, no SDK-surface drift). The frontmatter confidence stays asserted because the model defines confidence as a single enum and it is gated on the broadest still-unproven claim in this record; the verified SCOPE is the hosted-API path only. Still asserted (the residuals): the self-hosted baseUrl path, multi-site generalization, crawl scale beyond ~75 pages, sync() / incremental, and the keyless HttpConnector (which robots.txt-unhonored covers). DEC-0013's broader connector-seam claim also stays asserted (it still covers the unbuilt Unstructured / M365 connectors).
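The provenance rule in the rationale can be shown with a small sketch: each fetched page either becomes a Source stamped with the final URL or lands on the skipped list. The FetchedPage, Source, and Outcome shapes and the pageToSource name are hypothetical. The record only states that provenance.source carries the live page URL and that non-2xx, non-HTML, and empty responses are skipped. The htmlToMarkdown parameter stands in for the TextMarkdownParser seam.

```typescript
// Hypothetical shapes; only provenance.source and the skip rules come from this record.
interface FetchedPage {
  status: number;
  finalUrl: string;           // URL after any redirects
  contentType: string | null; // Content-Type header, if present
  body: string;
}

interface Source {
  content: string;
  provenance: { source: string };
}

type SkipReason = "non-2xx" | "non-html" | "empty";

type Outcome =
  | { kind: "source"; source: Source }
  | { kind: "skipped"; url: string; reason: SkipReason };

function pageToSource(page: FetchedPage, htmlToMarkdown: (html: string) => string): Outcome {
  const skip = (reason: SkipReason): Outcome => ({ kind: "skipped", url: page.finalUrl, reason });
  if (page.status < 200 || page.status > 299) return skip("non-2xx");
  if (!(page.contentType ?? "").toLowerCase().includes("text/html")) return skip("non-html");
  if (page.body.trim() === "") return skip("empty");
  return {
    kind: "source",
    source: {
      content: htmlToMarkdown(page.body),
      // Stamp the final URL, so a redirected seed records the page actually read.
      provenance: { source: page.finalUrl },
    },
  };
}
```

Keeping this step a pure function of the response means the redirect case can be checked with a hand-built FetchedPage and no network access.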
Consequences
- @dossier/ingestion now ships two real connectors + a wired premium one. LocalFilesConnector (offline files) and HttpConnector (keyless web) are both live behind the Connector seam; FirecrawlConnector is wired as the keyed premium web path. UnstructuredConnector and Microsoft365Connector remain reserved: this realizes part, not all, of the reservation in "Ingestion connector seam — assemble, don't build, and ingestion owns the input contract".
- The keyless web-ingest loop runs end-to-end, live. Verified, reproduced this session: dossier-runtime run --subscription --url https://www.rbaconsulting.com/what-we-do/data-services/ --pages 1 --client rba-cli … produced 16 OKF atoms, 0 rejected, 0 graph issues, committed (4926144) into the tenant's own isolated repo (the "Fix git-per-tenant isolation when a tenant root is nested inside another repo" fix held; main untouched at b74b9b6); provenance stamped the live URL (the connector followed a redirect and stamped the final page).
- CI stays offline by construction. HttpConnector tests inject fetch; Firecrawl tests inject a client; no SDK was added to package.json, so the offline-first invariant the monorepo holds is preserved.
- Honest open items. robots.txt is not yet honored in HttpConnector (documented follow-up); the FirecrawlConnector hosted-API path is now field-proven, since both #resolveClient branches (injected-client seam + deferred-SDK import) ran live against rbaconsulting.com per "First live FirecrawlConnector run against a real client source — field evidence for the reserved web seam" (2026-06-19), but its self-hosted baseUrl path remains unexercised live; UnstructuredConnector / Microsoft365Connector stay reserved; the naive htmlToMarkdown reducer is noisy on raw marketing HTML (the reason Firecrawl's main-content extraction exists as the upgrade).
- Two-way vs. durable. Connector internals, crawl policy (same-host default, maxPages), and the SDK-defer mechanism are swappable (two-way door). The durable commitments are the Connector seam from "Ingestion connector seam — assemble, don't build, and ingestion owns the input contract", URL provenance from the page in, and offline-by-construction CI.
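The SDK-defer mechanism and its resolution priority (injected client, then deferred SDK, then config error) might be sketched as follows. The WebClient surface, the FirecrawlConfig shape, and the resolutionBranch, loadSdk, and resolveClient names are hypothetical. How a client is built from the loaded SDK module is left to a caller-supplied function, because this sketch does not assume the SDK's constructor. Treating either apiKey or baseUrl alone as sufficient for the SDK branch is also an assumption.

```typescript
// Hypothetical minimal client surface; the real connector's client type may differ.
interface WebClient {
  scrape(url: string): Promise<{ markdown: string }>;
}

interface FirecrawlConfig {
  client?: WebClient; // injected client (offline-test seam)
  apiKey?: string;
  baseUrl?: string;   // self-hosted endpoint
}

type Branch = "injected" | "sdk" | "config-error";
type BuildClient = (sdk: unknown, cfg: FirecrawlConfig) => WebClient;

const SDK_NAME = "@mendable/firecrawl-js";

// Pure decision: which branch of the priority order applies.
function resolutionBranch(cfg: FirecrawlConfig): Branch {
  if (cfg.client) return "injected";
  if (cfg.apiKey || cfg.baseUrl) return "sdk";
  return "config-error";
}

// Deferred import: the specifier is a variable, so nothing is loaded
// (or type-checked against the package) until this function runs.
async function loadSdk(): Promise<unknown> {
  const specifier: string = SDK_NAME;
  try {
    return await import(specifier);
  } catch {
    throw new Error(`Firecrawl SDK not available: install ${SDK_NAME} to run live`);
  }
}

async function resolveClient(
  cfg: FirecrawlConfig,
  build: BuildClient,
  load: () => Promise<unknown> = loadSdk,
): Promise<WebClient> {
  const branch = resolutionBranch(cfg);
  if (branch === "injected") return cfg.client as WebClient;
  if (branch === "sdk") return build(await load(), cfg);
  throw new Error("FirecrawlConnector: pass a client, an apiKey, or a baseUrl");
}
```

Since the injected branch returns before load is ever called, a test suite that always passes a fake client never triggers the import, which is what keeps CI offline in this arrangement.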
Review
The FirecrawlConnector hosted-API path was promoted to verified (scoped) on 2026-06-19 — both #resolveClient branches (injected-client seam + deferred-SDK import) have live field evidence from the rbaconsulting.com run (First live FirecrawlConnector run against a real client source — field evidence for the reserved web seam); the frontmatter enum stays asserted and the verified scope is carried in Rationale/Consequences. Remaining bar before a broader verified: a second real site AND the self-hosted baseUrl path exercised live (the keyless HttpConnector stays asserted — robots.txt is still unhonored). Confirm crawl behavior (same-host scoping, redirects, maxPages) holds across multiple sites, and validate the self-host premium main-content path vs. the naive reducer. Honor robots.txt in HttpConnector before any broad/multi-page crawling (the standing follow-up). Revisit the crawl contract if real sources need cross-host crawling, depth/breadth limits beyond maxPages, or page/section provenance anchors. Wiring UnstructuredConnector / Microsoft365Connector remains the rest of Ingestion connector seam — assemble, don't build, and ingestion owns the input contract's reserved work.