First live FirecrawlConnector run against a real client source — field evidence for the reserved web seam

0055-firecrawl-first-live-client-crawl

decision · read as: Explain · confidence: verified · status: active · 2026-06-18 · owner: ingestion-engineer · reversibility: two-way door

DEC-0055 — First live FirecrawlConnector run against a real client source

Reversibility: two-way door — this is a field-evidence record; the durable commitments it informs (the Connector seam, URL provenance, offline-by-construction CI) are owned by DEC-0013 / DEC-0021, not re-decided here.

Context

The FirecrawlConnector was reserved behind the Connector seam, as a NOT_WIRED stub, in DEC-0013 (Ingestion connector seam — assemble, don't build, and ingestion owns the input contract). It was then wired in DEC-0021 (Web ingestion — a keyless HttpConnector by default, Firecrawl wired as the premium path, and a first-class CLI web-ingest mode), which added real scrape/crawl modes, deferred the SDK behind await import('@mendable/firecrawl-js'), and made an injected FirecrawlConfig.client the offline-test seam.

Both decisions named the same outstanding gate in their Review sections: exercise the path against a real, non-file client source and consider promoting confidence. DEC-0013 put it as "Wire the OSS connectors through the reserved seam when a real client source needs them — at which point exercise the input contract against non-file sources and consider promoting confidence to verified"; DEC-0021 put it as "run the live FirecrawlConnector against a key/self-host to validate the premium main-content path vs. the naive reducer". DEC-0021's own live run had used the keyless HttpConnector on a single page (16 atoms); the wired Firecrawl path itself was recorded as unexercised live.

This decision records the first time the wired Firecrawl path ran against a real client website, supplying the field evidence both Review gates asked for. It is a milestone record, not a new architectural commitment: the durable parts (the Connector seam, URL provenance from the page in, offline-by-construction CI) remain owned by DEC-0013 (Ingestion connector seam) and DEC-0021 (Web ingestion).

Options considered

This run was an exercise of an already-decided architecture, so the live "options" were operational, and two are worth recording because they shaped what the evidence proves:

  1. How to drive the live SDK: install @mendable/firecrawl-js vs. inject a thin REST client through the seam. A thin REST client over Firecrawl's hosted v2 API (POST /v2/crawl, then poll GET /v2/crawl/{id}) was injected through the connector's FirecrawlConfig.client seam. The SDK was deliberately not installed, honoring the statement in DEC-0013 (Ingestion connector seam) that "the SDK is an optional, never-bundled dependency." Honest caveat: this exercised the connector's real crawl→Source mapping and its URL-provenance logic against live data, but it did NOT exercise the connector's await import('@mendable/firecrawl-js') line (packages/ingestion/src/connectors/firecrawl.ts, #resolveClient, the apiKey/baseUrl branch), so that path stays unvalidated live. The injected REST harness lives in clients/rba/harness/, which is gitignored (see Fix git-per-tenant isolation when a tenant root is nested inside another repo), so it is a local sandbox, not committed to the Dossier repo.
  2. Where the loop output lands — overwrite the prior staged tenant vs. a fresh siloed tenant. A fresh siloed tenant (clients/rba/tenants-firecrawl/rba-consulting) was used, leaving the prior 3-page staged run intact at clients/rba/tenants/ for diffing. clients/ is gitignored — the tenant OKF is a local sandbox, not committed to the Dossier repo.
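
The start-then-poll flow described in option 1 can be sketched as follows. This is a minimal illustration, not the harness in clients/rba/harness/: the function name crawlViaRest, the FetchLike type, the option names, and the response fields read here (id, status, data, creditsUsed) are assumptions, and pagination of large results is ignored. Only the two endpoints come from this record.

```typescript
// Hypothetical shapes: the real FirecrawlConfig.client interface is not shown in this record.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<any> }>;

interface CrawledPage {
  markdown?: string;
  metadata?: { sourceURL?: string };
}

interface CrawlResult {
  status: string;
  creditsUsed?: number;
  pages: CrawledPage[];
}

async function crawlViaRest(
  startUrl: string,
  opts: {
    apiKey: string;
    limit: number;
    fetchImpl: FetchLike; // injected so the sketch can run offline
    baseUrl?: string;
    pollMs?: number;
    maxPolls?: number;
  },
): Promise<CrawlResult> {
  const base = opts.baseUrl ?? "https://api.firecrawl.dev";
  const headers = {
    Authorization: `Bearer ${opts.apiKey}`,
    "Content-Type": "application/json",
  };

  // Start the crawl job: POST /v2/crawl.
  const started = await opts.fetchImpl(`${base}/v2/crawl`, {
    method: "POST",
    headers,
    body: JSON.stringify({ url: startUrl, limit: opts.limit }),
  });
  if (!started.ok) throw new Error(`crawl start failed: HTTP ${started.status}`);
  const { id } = await started.json();

  // Poll GET /v2/crawl/{id} until the job completes or fails.
  const maxPolls = opts.maxPolls ?? 600;
  for (let attempt = 0; attempt < maxPolls; attempt++) {
    const res = await opts.fetchImpl(`${base}/v2/crawl/${id}`, { headers });
    if (!res.ok) throw new Error(`crawl poll failed: HTTP ${res.status}`);
    const body = await res.json();
    if (body.status === "completed") {
      return { status: body.status, creditsUsed: body.creditsUsed, pages: body.data ?? [] };
    }
    if (body.status === "failed" || body.status === "cancelled") {
      throw new Error(`crawl ${body.status}`);
    }
    await new Promise((resolve) => setTimeout(resolve, opts.pollMs ?? 2000));
  }
  throw new Error("crawl did not complete within the polling budget");
}
```

Taking the fetch function as a parameter mirrors the injected-client idea: an offline test can pass a stub, while a live run can pass the platform fetch.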

Decision

Record the live run as field evidence for the wired Firecrawl path, with the SDK-import-line caveat stated, and leave the DEC-0013 / DEC-0021 confidence-promotion question OPEN for the owner.

What ran (verified from the run):

  • Live crawl: 75 pages from https://www.rbaconsulting.com/ via the real FirecrawlConnector, completed, 123 Firecrawl credits, driven through the injected-client seam backed by a thin REST client over Firecrawl's hosted v2 API.
  • Curation: Core tier — 33 of 75 pages kept (home + 17 service pages + 11 case studies + 4 audience pages); blog (13) deferred; 29 dropped as archive / utility / thin.
  • Loop: the full real learning loop (ingest → extract → validate → resolve → link → serialize → emit OKF → git commit) on the Claude subscription (claude -p, forced sonnet; see Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys)) into the fresh siloed tenant.
  • Result: stages ingest:ok extract:ok emit:ok commit:ok; 494 atoms / 654 edges / 0 rejected / 4 graphIssues; 203 extract calls, 1 failure; commit 75168d0 in the tenant's own isolated repo (see Fix git-per-tenant isolation when a tenant root is nested inside another repo); ~5380s.
  • Upgrade vs. the prior 3-page staged run: atoms 48→494; clients 1→10; systems 0→75; capabilities 29→210; processes 10→95; workflows 1→15.
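
The curation and upgrade figures above can be cross-checked with a few lines of arithmetic. This is only a consistency check on the numbers as recorded; the variable names are illustrative and nothing here is measured independently.

```typescript
// Figures copied from the run summary above.
const crawledPages = 75;
const keptByKind: Array<[string, number]> = [
  ["home", 1],
  ["service pages", 17],
  ["case studies", 11],
  ["audience pages", 4],
];
const deferredBlogPages = 13;
const droppedPages = 29;

// Kept pages should sum to the Core tier size, and all buckets to the crawl size.
const keptTotal = keptByKind.reduce((sum, [, n]) => sum + n, 0);
const accountedFor = keptTotal + deferredBlogPages + droppedPages;

// Growth in atoms relative to the prior 3-page staged run (48 -> 494).
const atomGrowth = (494 / 48).toFixed(1);

console.log(keptTotal);                      // 33
console.log(accountedFor === crawledPages);  // true
console.log(atomGrowth);                     // "10.3"
```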

The three integration learnings below (Consequences) are pinned as durable notes against the day the Firecrawl/M365 paths are wired for production use.

Rationale

  • The evidence is real and material — so confidence: verified for this record. This is a milestone record of a run that demonstrably happened (commit 75168d0, measured credits/atoms/edges/timings); the record's own factual claims are evidence-backed. The connector's crawl→Source mapping and URL provenance held against live, messy marketing HTML at 75-page scale — a real upgrade over DEC-0021's single keyless page.
  • The four graphIssues are structural, not hallucinations. They are orphan-artifact errors (severity "error" under the rule in The produces edge is canonical on the producing process only — every artifact needs exactly one producing process), and provenance was verified on at least one (the agribusiness artifact is grounded in /operations-and-business-leaders/). They are a known, bounded class of gap with a clear fix (link a producing process or prune), not a faithfulness failure.
  • The SDK-import line is the honest residual. Because the live client was injected through the seam (not the deferred SDK import), the await import(...) branch in #resolveClient is still unexercised live. Stating this keeps the record from over-claiming what the run proves.
  • Promotion is the owner's call, not this record's. Whether to promote the Firecrawl confidence on DEC-0013 (Ingestion connector seam) and/or DEC-0021 (Web ingestion) to verified, and whether to validate the real SDK path, needs the architect/owner's framing (e.g. whether "verified" should require the actual await import(SDK) line to have run, and across more than one site). This record supplies the field evidence; it does not assert the promotion. See Review.
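
The orphan-artifact class of graphIssue can be illustrated with a small check for the "exactly one producing process" rule. This is a sketch under stated assumptions: the node and edge shapes, the function name findProducerIssues, and the issue codes are hypothetical, since the real graph schema and validator are not shown in this record.

```typescript
// Hypothetical shapes; the real OKF graph schema is not reproduced in this record.
interface GraphNode {
  id: string;
  kind: string; // e.g. "artifact" or "process"
}

interface GraphEdge {
  type: string; // e.g. "produces"
  from: string;
  to: string;
}

interface GraphIssue {
  severity: "error";
  code: "orphan-artifact" | "multiple-producers"; // hypothetical codes
  artifactId: string;
}

function findProducerIssues(nodes: GraphNode[], edges: GraphEdge[]): GraphIssue[] {
  const kindById = new Map<string, string>();
  for (const node of nodes) kindById.set(node.id, node.kind);

  // Count "produces" edges per artifact, counting only edges that start at a process.
  const producerCount = new Map<string, number>();
  for (const edge of edges) {
    if (edge.type !== "produces") continue;
    if (kindById.get(edge.from) !== "process") continue;
    producerCount.set(edge.to, (producerCount.get(edge.to) ?? 0) + 1);
  }

  const issues: GraphIssue[] = [];
  for (const node of nodes) {
    if (node.kind !== "artifact") continue;
    const count = producerCount.get(node.id) ?? 0;
    if (count === 0) {
      issues.push({ severity: "error", code: "orphan-artifact", artifactId: node.id });
    } else if (count > 1) {
      issues.push({ severity: "error", code: "multiple-producers", artifactId: node.id });
    }
  }
  return issues;
}
```

Under this sketch, both fixes named above clear the issue: linking a producing process adds the missing edge, and pruning removes the artifact node.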

Consequences

Review

Items 1 and 2 both RESOLVED 2026-06-19.

  1. Promote the Firecrawl confidence on DEC-0013 (Ingestion connector seam) and/or DEC-0021 (Web ingestion)? RESOLVED 2026-06-19. The owner promoted the FirecrawlConnector hosted-API path to verified on DEC-0021 only, and scoped: both #resolveClient branches (the injected-client seam crawl and the deferred-SDK-import scrape, see item 2) now have live field evidence from this run. DEC-0013 stays asserted, because its broader connector-seam claim still covers the unbuilt Unstructured / M365 connectors. The promotion is carried in DEC-0021's body; its confidence enum stays asserted, gated on the broadest unproven claim. Bar for the rest: a second real site plus the self-hosted baseUrl path. The keyless HttpConnector stays asserted (robots.txt unhonored).
  2. Validate the real SDK path? CLOSED 2026-06-19. The deferred-import branch (#resolveClient, the apiKey/baseUrl branch) was exercised live. @mendable/firecrawl-js v4.28.1 was installed only in gitignored node_modules, and the committed manifest was reverted afterward, so the SDK is never a hard dependency and DEC-0013 (Ingestion connector seam) stays intact. The SDK drove a real scrape of rbaconsulting.com through the connector, yielding 9,421 chars of markdown at status=200 with no SDK-surface drift (mod.default is the constructor; scrapeUrl(url, opts) resolves). Both of the connector's client paths (injected seam + deferred SDK import) now have live field evidence. Validated via clients/rba/harness/firecrawl-sdk-validate.mjs.
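
The two client paths discussed in items 1 and 2 can be pictured with a sketch like the one below. It is not the connector's actual #resolveClient: the names ScrapeClient, FirecrawlConfigSketch, and resolveClient are hypothetical, and the constructor options passed to the SDK ({ apiKey, apiUrl }) are an assumption. The record itself confirms only that mod.default is the constructor and that scrapeUrl(url, opts) resolves.

```typescript
// Hypothetical shapes; the real FirecrawlConfig is not reproduced in this record.
interface ScrapeClient {
  scrapeUrl(url: string, opts?: Record<string, unknown>): Promise<unknown>;
}

interface FirecrawlConfigSketch {
  client?: ScrapeClient; // injected seam (offline tests, REST harness)
  apiKey?: string;       // hosted API path
  baseUrl?: string;      // self-hosted path
}

async function resolveClient(config: FirecrawlConfigSketch): Promise<ScrapeClient> {
  // Branch 1: an injected client wins, so the SDK is never loaded.
  if (config.client) return config.client;

  if (!config.apiKey && !config.baseUrl) {
    throw new Error("need an injected client, an apiKey, or a baseUrl");
  }

  // Branch 2: load the SDK only when it is actually needed. A non-literal
  // specifier keeps the compiler from requiring the package at build time.
  const specifier: string = "@mendable/firecrawl-js";
  const mod: any = await import(specifier);

  // Assumption: the SDK constructor accepts { apiKey, apiUrl }.
  return new mod.default({ apiKey: config.apiKey, apiUrl: config.baseUrl }) as ScrapeClient;
}
```

Because the import sits behind the injected-client check, offline CI that always injects a client never touches the package, which is consistent with keeping the SDK out of the committed manifest.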

This record itself promotes to nothing further; it is verified as a faithful account of the run.