Web-crawl hardening for the live Firecrawl path — pin >=2.0.1, network-layer internal/link-local egress blocklist, never carry client session credentials

task-firecrawl-web-crawl-hardening

task confidence inferred status backlog 2026-06-19 owner ingestion-engineer

source log-auditor — surfaced recording 0059-untrusted-by-default-ingestion-serve-boundary (research §7a, the Firecrawl/web-crawl controls), directly relevant to the now-field-proven Firecrawl path (DEC-0055/DEC-0021). Board globbed before filing — no open task covered Firecrawl SSRF/version-pin/egress-blocklist or web-crawl credential hygiene (task-http-connector-robots-txt is the keyless HttpConnector robots gap, a different connector + concern; the DEC-0055 follow-up tasks are graph-quality/harness, not web-crawl security).

Web-crawl hardening for the live Firecrawl path

Security hardening of the live Firecrawl web path proven in "First live FirecrawlConnector run against a real client source — field evidence for the reserved web seam" (the first live FirecrawlConnector run against rbaconsulting.com). It applies DEC-0059's untrusted-by-default boundary to the web source.

Three architectural closures (from the research)

Pin self-hosted Firecrawl ≥ 2.0.1 AND block internal/link-local egress at the network layer. Two SSRF CVEs motivate the pin (CVE-2024-56800, fixed in 1.1.1; CVE-2025-57818, fixed in 2.0.1), but the playwright-service SSRF was deemed un-patchable in code, so the network-layer blocklist is the real control, not the version pin alone.
Never inject client session credentials into a public crawl. scrapeOptions.headers can carry cookies and auth headers, so nothing from the client's session may be placed there for a public crawl.
Keep safe-by-default scoping on, and don't mistake crawl filters for a security boundary. allowExternalLinks, allowSubdomains, and crawlEntireDomain all default to false; limit defaults to 10000. Firecrawl's includePaths/excludePaths are crawl-frontier filters, not a security boundary (excludePaths has confirmed silent-fail bugs), so pair them with a Dossier-owned pre-ingest allowlist and post-crawl host validation. robots.txt is honored by default but is advisory, not a PII control: a public-but-disallowed page can still contain PII, so it still hits the L2 detector (Detector ensemble at the ingestion boundary with a MEASURED F2/recall target (Presidio recall-tuned + custom "P2" recognizers + cloud DLP) — measure detection, don't trust it).
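The three closures above can be sketched as small application-side guards. This is a minimal illustration under stated assumptions, not the Dossier implementation: every name here (is_internal_address, strip_credential_headers, off_allowlist_urls, CREDENTIAL_HEADERS) is hypothetical, the header denylist is a guess at what a deployment would need, and the code makes no Firecrawl calls. The address check is defense in depth only; per the research, the network-layer egress blocklist remains the real control.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical denylist of headers that can carry a client session.
# Assumption: extend this to match the deployment's actual auth scheme.
CREDENTIAL_HEADERS = {"cookie", "authorization", "proxy-authorization", "x-api-key"}


def is_internal_address(ip: str) -> bool:
    """True if ip is private, loopback, link-local, or otherwise non-public."""
    addr = ipaddress.ip_address(ip)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it is judged as IPv4.
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    # is_global is False for private, loopback, and link-local ranges;
    # multicast is rejected explicitly as well.
    return (not addr.is_global) or addr.is_multicast


def strip_credential_headers(headers: dict[str, str]) -> dict[str, str]:
    """Drop session-bearing headers before they reach a public crawl request."""
    return {k: v for k, v in headers.items() if k.lower() not in CREDENTIAL_HEADERS}


def off_allowlist_urls(urls: list[str], allowed_hosts: set[str]) -> list[str]:
    """Post-crawl host validation: return URLs whose host is not allowlisted.

    Matching is exact, so subdomains are rejected unless listed explicitly.
    """
    allowed = {h.lower() for h in allowed_hosts}
    rejected = []
    for url in urls:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if parts.scheme not in ("http", "https") or host not in allowed:
            rejected.append(url)
    return rejected


if __name__ == "__main__":
    print(is_internal_address("169.254.169.254"))
    print(strip_credential_headers({"Cookie": "sid=1", "User-Agent": "dossier"}))
    print(off_allowlist_urls(
        ["https://example.com/about", "https://example.com.evil.test/x"],
        {"example.com"},
    ))
```

In this sketch the allowlist is applied to the crawl results rather than trusted to the crawler's own path filters, which matches the point that includePaths/excludePaths are frontier filters and not a boundary. Any URL returned by off_allowlist_urls would be dropped before ingestion.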

Why a task, not a fix-in-place

This is real security engineering on the now-field-proven web path (version pin + network egress blocklist + credential hygiene + host-validation allowlist) and needs owner judgment plus code. It is distinct from "Honor robots.txt in the keyless HttpConnector" (the keyless HttpConnector robots gap, a different connector). Detail and citations: research/2026-06-18-sensitive-data-and-injection-defense.md §7a. confidence: inferred (agent-filed from DEC-0059).