First live FirecrawlConnector run against a real client source — field evidence for the reserved web seam
0055-firecrawl-first-live-client-crawl
- Reversibility
- two-way door
DEC-0055 — First live FirecrawlConnector run against a real client source
Reversibility: two-way door — this is a field-evidence record; the durable commitments it informs (the Connector seam, URL provenance, offline-by-construction CI) are owned by DEC-0013 / DEC-0021, not re-decided here.
Context
The FirecrawlConnector was reserved behind the Connector seam in Ingestion connector seam — assemble, don't build, and ingestion owns the input contract (a NOT_WIRED stub) and wired in Web ingestion — a keyless HttpConnector by default, Firecrawl wired as the premium path, and a first-class CLI web-ingest mode (real scrape/crawl modes, SDK deferred behind await import('@mendable/firecrawl-js'), the offline-test seam being an injected FirecrawlConfig.client). Both decisions named the same outstanding gate in their Review sections: exercise the path against a real, non-file client source and consider promoting confidence — Ingestion connector seam — assemble, don't build, and ingestion owns the input contract ("Wire the OSS connectors through the reserved seam when a real client source needs them — at which point exercise the input contract against non-file sources and consider promoting confidence to verified"), and Web ingestion — a keyless HttpConnector by default, Firecrawl wired as the premium path, and a first-class CLI web-ingest mode ("run the live FirecrawlConnector against a key/self-host to validate the premium main-content path vs. the naive reducer"). DEC-0021's own live run had used the keyless HttpConnector on a single page (1-page, 16 atoms); the wired Firecrawl path itself was recorded as unexercised live.
This decision records the first time the wired Firecrawl path ran against a real client website — supplying the field evidence both Review gates asked for. It is a milestone record, not a new architectural commitment: the durable parts (the Connector seam, URL provenance from the page in, offline-by-construction CI) remain owned by Ingestion connector seam — assemble, don't build, and ingestion owns the input contract / Web ingestion — a keyless HttpConnector by default, Firecrawl wired as the premium path, and a first-class CLI web-ingest mode.
Options considered
This run was an exercise of an already-decided architecture, so the live "options" were operational, and two are worth recording because they shaped what the evidence proves:
- How to drive the live SDK: install @mendable/firecrawl-js vs. inject a thin REST client through the seam. A thin REST client over Firecrawl's hosted v2 API (POST /v2/crawl + poll GET /v2/crawl/{id}) was injected through the connector's FirecrawlConfig.client seam, deliberately not installing the SDK, honoring Ingestion connector seam — assemble, don't build, and ingestion owns the input contract's "the SDK is an optional, never-bundled dependency." Honest caveat: this exercised the connector's real crawl→Source mapping and its URL-provenance logic against live data, but it did NOT exercise the connector's await import('@mendable/firecrawl-js') line (packages/ingestion/src/connectors/firecrawl.ts#resolveClient, the apiKey/baseUrl branch); that path stays unvalidated live. The injected REST harness lives in clients/rba/harness/ (gitignored; see Fix git-per-tenant isolation when a tenant root is nested inside another repo), so it is a local sandbox, not committed to the Dossier repo.
- Where the loop output lands: overwrite the prior staged tenant vs. a fresh siloed tenant. A fresh siloed tenant (clients/rba/tenants-firecrawl/rba-consulting) was used, leaving the prior 3-page staged run intact at clients/rba/tenants/ for diffing. clients/ is gitignored, so the tenant OKF is a local sandbox, not committed to the Dossier repo.
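The two client paths discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the connector's actual code: the CrawlClient surface, the resolveClient signature, the error message, and the loadSdk parameter are hypothetical. Only the FirecrawlConfig.client seam, the apiKey/baseUrl branch, and the deferred import are taken from this record. The sketch shows why a run driven through an injected client never reaches the deferred-import line.

```typescript
// Hypothetical minimal client surface; the real connector's interface may differ.
interface CrawlPage {
  url: string;
  markdown: string;
}

interface CrawlClient {
  crawl(url: string, limit: number): Promise<CrawlPage[]>;
}

interface FirecrawlConfig {
  client?: CrawlClient; // injected seam: offline tests, or a thin REST client
  apiKey?: string;
  baseUrl?: string;
}

// Stand-in for `await import('@mendable/firecrawl-js')`, passed in as a
// parameter so this sketch runs without the SDK installed (an assumption).
type SdkLoader = (cfg: FirecrawlConfig) => Promise<CrawlClient>;

async function resolveClient(
  cfg: FirecrawlConfig,
  loadSdk: SdkLoader,
): Promise<CrawlClient> {
  // Branch 1: an injected client wins, so the SDK is never loaded.
  if (cfg.client) return cfg.client;
  // Branch 2: only an API key or a self-host URL triggers the deferred load.
  if (cfg.apiKey || cfg.baseUrl) return loadSdk(cfg);
  throw new Error("FirecrawlConnector: no client, apiKey, or baseUrl configured");
}
```

Because branch 1 returns before the loader is consulted, evidence gathered through an injected client says nothing about branch 2. That is the caveat the first option records.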
Decision
Record the live run as field evidence for the wired Firecrawl path, with the SDK-import-line caveat stated, and leave the DEC-0013 / DEC-0021 confidence-promotion question OPEN for the owner.
What ran (verified from the run):
- Live crawl: 75 pages from https://www.rbaconsulting.com/ via the real FirecrawlConnector, completed, 123 Firecrawl credits, driven through the injected-client seam backed by a thin REST client over Firecrawl's hosted v2 API.
- Curation: Core tier, 33 of 75 pages kept (home + 17 service pages + 11 case studies + 4 audience pages); blog (13) deferred; 29 dropped as archive / utility / thin.
- Loop: the full real learning loop (ingest → extract → validate → resolve → link → serialize → emit OKF → git commit) on the Claude subscription (claude -p, forced sonnet; see Subscription-backed extraction is a first-class transport — ClaudeCodeClient (no API keys)) into the fresh siloed tenant.
- Result: stages ingest:ok extract:ok emit:ok commit:ok; 494 atoms / 654 edges / 0 rejected / 4 graphIssues; 203 extract calls, 1 failure; commit 75168d0 (in the tenant's own isolated repo; see Fix git-per-tenant isolation when a tenant root is nested inside another repo); ~5380s.
- Upgrade vs. the prior 3-page staged run: atoms 48→494; clients 1→10; systems 0→75; capabilities 29→210; processes 10→95; workflows 1→15.
The three integration learnings below (Consequences) are pinned as durable notes against the day the Firecrawl/M365 paths are wired for production use.
Rationale
- The evidence is real and material, so confidence: verified for this record. This is a milestone record of a run that demonstrably happened (commit 75168d0, measured credits/atoms/edges/timings); the record's own factual claims are evidence-backed. The connector's crawl→Source mapping and URL provenance held against live, messy marketing HTML at 75-page scale, a real upgrade over DEC-0021's single keyless page.
- The four graphIssues are structural, not hallucinations. They are orphan-artifact errors (severity error per The produces edge is canonical on the producing process only — every artifact needs exactly one producing process), and provenance was verified on at least one (the agribusiness artifact is grounded in /operations-and-business-leaders/). They are a known, bounded class of gap with a clear fix (link a producing process or prune), not a faithfulness failure.
- The SDK-import line is the honest residual. Because the live client was injected through the seam (not the deferred SDK import), the await import(...) branch in #resolveClient is still unexercised live. Stating this keeps the record from over-claiming what the run proves.
- Promotion is the owner's call, not this record's. Whether to promote Ingestion connector seam — assemble, don't build, and ingestion owns the input contract's and/or Web ingestion — a keyless HttpConnector by default, Firecrawl wired as the premium path, and a first-class CLI web-ingest mode's Firecrawl confidence to verified, and whether to validate the real SDK path, needs the architect/owner's framing (e.g. whether "verified" should require the actual await import(SDK) line to have run, and across more than one site). This record supplies the field evidence; it does not assert the promotion. See Review.
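The orphan-artifact rule behind the four graphIssues can be illustrated with a small check. This is a hedged sketch only: the Atom, Edge, and GraphIssue shapes, the direction of the produces edge (process to artifact), and the function name findProducerIssues are assumptions made for illustration, not the validator's actual API. The artifact id employee-persona comes from this run; the process id persona-research is invented.

```typescript
// Hypothetical shapes; the real OKF atom and edge types may differ.
interface Atom { id: string; type: string }
interface Edge { from: string; to: string; kind: string }
interface GraphIssue { severity: "error"; atomId: string; message: string }

// Assumes a produces edge points from the producing process to the artifact.
function findProducerIssues(atoms: Atom[], edges: Edge[]): GraphIssue[] {
  const processIds = new Set(
    atoms.filter((a) => a.type === "process").map((a) => a.id),
  );

  // Count, per target atom, how many processes produce it.
  const producerCount = new Map<string, number>();
  for (const e of edges) {
    if (e.kind === "produces" && processIds.has(e.from)) {
      producerCount.set(e.to, (producerCount.get(e.to) ?? 0) + 1);
    }
  }

  // Every artifact needs exactly one producing process.
  const issues: GraphIssue[] = [];
  for (const a of atoms) {
    if (a.type !== "artifact") continue;
    const n = producerCount.get(a.id) ?? 0;
    if (n === 1) continue;
    issues.push({
      severity: "error",
      atomId: a.id,
      message:
        n === 0
          ? "orphan artifact: no producing process"
          : `artifact has ${n} producing processes`,
    });
  }
  return issues;
}

const sampleAtoms: Atom[] = [
  { id: "employee-persona", type: "artifact" },
  { id: "persona-research", type: "process" }, // invented id, for illustration
];

// No produces edge: the artifact is reported as an orphan.
const orphaned = findProducerIssues(sampleAtoms, []);

// Linking a producing process clears the issue (the "link" half of the fix).
const linked = findProducerIssues(sampleAtoms, [
  { from: "persona-research", to: "employee-persona", kind: "produces" },
]);
```

The check depends only on graph structure and never inspects source text, which is why a failure here is a structural gap and not a faithfulness failure.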
Consequences
- Field evidence now exists for the wired Firecrawl path. Ingestion connector seam — assemble, don't build, and ingestion owns the input contract / Web ingestion — a keyless HttpConnector by default, Firecrawl wired as the premium path, and a first-class CLI web-ingest mode previously recorded Firecrawl as wired-but-unexercised-live; that is no longer true for the crawl→Source→loop path (the SDK-import branch excepted).
- Three durable integration learnings, to pin for when Firecrawl/M365 are wired for production:
  - Cumulative poll data. Firecrawl v2 crawl-status data is cumulative on every poll, so collecting per-poll duplicates pages. Collect only on the terminal completed response, then drain pagination via next.
  - Transient 5xx. Firecrawl status/pagination endpoints 502 intermittently under load; retry transient 5xx.
  - Free resume by job-id. Long crawls outlive a short client poll deadline; resuming by job-id is free (status reads don't bill credits). A reusable resume tool (clients/rba/harness/firecrawl-job.mjs, gitignored) was built for this.
- Three follow-up gaps filed as board tasks (do not overstate severity):
  - Resolve 4 orphan-artifact graph errors from the RBA Firecrawl run (link a producing process or prune): the 4 orphan-artifact error-severity graph issues (agribusiness-grain-transport-automation-case-study, caleres-personalization-roadmap, employee-persona, employee-personas); structural, provenance-verified, fix = link a producing process or prune.
  - Add a retry/repair path for extraction segments that fail on malformed model JSON: 1 segment (#82) failed on malformed model JSON → its atoms were silently lost (ok=false, atoms=0); a retry/repair path would recover these.
  - Harden the RBA harness subscription-client spawn — drop shell:true (Node DEP0190): minor hardening; clients/rba/harness/subscription-client.mjs spawns claude with shell:true → Node DEP0190 deprecation warning.
- A later multi-surface FDE/QA pass (2026-06-19) scaled the run 3→33 pages and surfaced four more follow-ups. The KB is structurally valid + secure (494/494 source, 494/494 inferred, 0 secrets, 0 dangling frontmatter edges; landing/graph/docs all built and rendered faithfully) but not yet reference-tenant quality until the loop/extraction gaps below close. All filed as board tasks, confidence: inferred:
  - Make the learning loop dedup/reconcile at scale (collapse same-type duplicate clusters; default-on compounding) (p1, root cause): 26 same-type duplicate clusters / ~59 redundant atoms (~6.7%) with 0 supersedes edges; resolve() leaks different-id / near-title dups at scale and reconcile() is opt-in + keys by id only. One-shot dedup of the current KB + a durable fix (id canonicalization, tighter resolve(), a default-on decision).
  - Fix extraction type-discipline — `system` used as a catch-all + non-slug ids (RBA run) (p2): 18/25 system atoms mis-typed (UX deliverables → artifact, process phases → process) with non-slug ids; corroborated by the docs build's ugly URLs.
  - Have extraction populate the accountability spine (owner / reports_to / members / decision_rights) (p2): the accountability layer is absent, with owner 0/494 and reports_to/members/decision_rights 0; 18 roles own nothing, 95 processes have no owner.
  - Make the docs renderer registry-aware for vertical edges + render decision judgment fields (p2, the only surface-side fix): @dossier/okf-view's EDGES drops the DXA vertical edges (~25 spine atoms render no Related nav) and OkfMeta.astro renders no decision context/decision/rationale.
  - The existing Resolve 4 orphan-artifact graph errors from the RBA Firecrawl run (link a producing process or prune) was UPDATED (not re-filed) with the QA root-cause nuance: the 2 case studies are mis-typed delivery outcomes (→ engagement), and the persona pair is a duplicate-that-lost-its-producer (overlaps the dedup task).
- Two-way door. This is a field-evidence record; it re-decides nothing. The durable commitments it informs are owned by Ingestion connector seam — assemble, don't build, and ingestion owns the input contract / Web ingestion — a keyless HttpConnector by default, Firecrawl wired as the premium path, and a first-class CLI web-ingest mode.
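The three integration learnings above can be sketched as one collection routine. This is a minimal sketch under stated assumptions, not the gitignored harness itself: the StatusBody shape (status, data, next), the status strings, the backoff and polling constants, and the injected getJson/sleep helpers are all hypothetical, chosen so the logic runs offline against a scripted transport.

```typescript
// Assumed response shape, based only on what this record describes.
interface StatusBody {
  status: string; // assumed values: "scraping" | "completed" | "failed"
  data?: unknown[];
  next?: string | null;
}
interface HttpResult { status: number; body: StatusBody }

// Injected transport and timer (hypothetical), so the sketch needs no network.
type GetJson = (url: string) => Promise<HttpResult>;
type Sleep = (ms: number) => Promise<void>;

// Learning 2: retry transient 5xx with a bounded exponential backoff.
async function getWithRetry(
  getJson: GetJson,
  url: string,
  sleep: Sleep,
  maxAttempts = 5,
): Promise<StatusBody> {
  for (let attempt = 1; ; attempt++) {
    const res = await getJson(url);
    if (res.status < 400) return res.body;
    if (res.status < 500) throw new Error(`HTTP ${res.status} for ${url}`);
    if (attempt >= maxAttempts) {
      throw new Error(`HTTP ${res.status} after ${attempt} attempts: ${url}`);
    }
    await sleep(250 * 2 ** (attempt - 1));
  }
}

// Learning 3: the only input is the job's status URL, so calling this again
// after a missed deadline resumes the same job instead of starting a new crawl.
async function collectCrawl(
  statusUrl: string,
  getJson: GetJson,
  sleep: Sleep,
  maxPolls = 600,
): Promise<unknown[]> {
  for (let poll = 0; poll < maxPolls; poll++) {
    const body = await getWithRetry(getJson, statusUrl, sleep);
    if (body.status === "failed") throw new Error("crawl failed");
    if (body.status !== "completed") {
      // Learning 1: data is cumulative while the job runs, so ignore it here.
      await sleep(2000);
      continue;
    }
    // Terminal response: take its data once, then drain pagination via next.
    const pages: unknown[] = [...(body.data ?? [])];
    let next = body.next;
    while (next) {
      const page = await getWithRetry(getJson, next, sleep);
      pages.push(...(page.data ?? []));
      next = page.next;
    }
    return pages;
  }
  throw new Error("poll deadline reached; resume later with the same job id");
}
```

Injecting getJson and sleep mirrors the connector's offline-test seam: the retry and pagination logic can be exercised against scripted responses without spending credits.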
Review
Items 1 and 2 both RESOLVED 2026-06-19.
1. Promote the Firecrawl confidence on Ingestion connector seam — assemble, don't build, and ingestion owns the input contract and/or Web ingestion — a keyless HttpConnector by default, Firecrawl wired as the premium path, and a first-class CLI web-ingest mode? RESOLVED 2026-06-19. The owner promoted the FirecrawlConnector hosted-API path to verified on Web ingestion — a keyless HttpConnector by default, Firecrawl wired as the premium path, and a first-class CLI web-ingest mode only, scoped: both #resolveClient branches (injected-client seam crawl + deferred-SDK-import scrape, see item 2) now have live field evidence from this run. Ingestion connector seam — assemble, don't build, and ingestion owns the input contract stays asserted; its broader connector-seam claim still covers the unbuilt Unstructured / M365 connectors. The promotion is carried in DEC-0021's body (its confidence enum stays asserted, gated on the broadest unproven claim). Bar for the rest: a second real site + the self-hosted baseUrl path; the keyless HttpConnector stays asserted (robots.txt unhonored).
2. Validate the real SDK path? CLOSED 2026-06-19. The deferred-import branch (#resolveClient → apiKey/baseUrl) was exercised live: @mendable/firecrawl-js v4.28.1 (installed only in gitignored node_modules, with the committed manifest reverted afterward so the SDK is never a hard dependency; Ingestion connector seam — assemble, don't build, and ingestion owns the input contract stays intact) drove a real scrape of rbaconsulting.com through the connector, yielding 9,421 chars of markdown at status=200 with no SDK-surface drift (mod.default is the constructor; scrapeUrl(url, opts) resolves). Both of the connector's client paths (injected seam + deferred SDK import) now have live field evidence. Validated via clients/rba/harness/firecrawl-sdk-validate.mjs.
This record itself promotes to nothing further; it is verified as a faithful account of the run.