Honor robots.txt in the keyless HttpConnector

task-http-connector-robots-txt

task · confidence: asserted · status: claimed · 2026-06-16 · owner: ingestion-engineer
source board-curator (knowledge-architect) — from 0021-web-ingestion-keyless-http-and-firecrawl (documented follow-up "robots.txt not yet honored")

The source atom "Web ingestion — a keyless HttpConnector by default, Firecrawl wired as the premium path, and a first-class CLI web-ingest mode" (0021-web-ingestion-keyless-http-and-firecrawl) shipped the HttpConnector, the first real keyless network connector (global fetch, no SDK, no API key, a bounded same-host BFS crawl), and explicitly recorded "robots.txt not yet honored" as a follow-up. This task closes that gap so the polite-crawler contract holds before the connector is used at any scale.

In progress (claim/lease in action)

This atom is claimed: claimed_by: claude-agent-ingest-20260616, lease_expires: 2026-06-16T15:30:00Z. While the lease is live, the hook from "Build the PreToolUse claim/lease governance hook" (once built) denies a competing agent's edit to this task; after the lease expires, the task becomes reclaimable. The claim is coordination only: the owner/assignee (Ingestion & Connectors Engineer) is unchanged.
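The lease rule described above can be sketched as a single check. This is an illustrative sketch only: the hook is not built yet, and the `Claim` shape and the `mayEdit` name are hypothetical, modeled on the claimed_by and lease_expires fields shown on this atom. Treating the exact expiry instant as already reclaimable is also an assumption.

```typescript
// Hypothetical claim shape, mirroring claimed_by / lease_expires on the atom.
interface Claim {
  claimedBy: string;
  leaseExpires: string; // ISO-8601 timestamp, e.g. "2026-06-16T15:30:00Z"
}

// Returns true when `agentId` may edit the task at time `now`.
// Unclaimed tasks and the claim holder are always allowed; any other
// agent is allowed only once the lease has expired.
function mayEdit(claim: Claim | undefined, agentId: string, now: Date): boolean {
  if (!claim) return true;
  if (claim.claimedBy === agentId) return true;
  return now.getTime() >= Date.parse(claim.leaseExpires);
}
```

A governance hook could call a check like this before permitting an edit, leaving owner/assignee untouched because the claim is consulted only for coordination.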

Shape

Fetch /robots.txt for the seed host once, parse the applicable user-agent group, and gate each candidate URL through an allow-check before enqueuing it. Honor crawl-delay. Fail open (crawl allowed) if robots.txt is missing or returns a non-2xx status, consistent with crawler norms. Keep the connector keyless and inject fetch in tests so CI stays offline, following the seam discipline of "Ingestion connector seam — assemble, don't build, and ingestion owns the input contract".
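A minimal sketch of this shape follows, under stated assumptions. All names (`FetchLike`, `RobotsPolicy`, `parseRobots`, `isAllowed`, `loadRobots`) are hypothetical and do not come from the HttpConnector source. Matching is simplified to exact user-agent tokens and plain path prefixes (no `*` or `$` wildcards), with the longest matching rule winning. Failing open on a network error is an added assumption, and note that RFC 9309 recommends assuming disallow on 5xx responses, whereas this sketch follows the task's fail-open wording for every non-2xx status.

```typescript
// Hypothetical minimal fetch contract, so tests can inject a stub and stay offline.
type FetchLike = (url: string) => Promise<{ ok: boolean; text(): Promise<string> }>;

interface Rule { allow: boolean; path: string }
interface RobotsPolicy { rules: Rule[]; crawlDelayMs: number }

const ALLOW_ALL: RobotsPolicy = { rules: [], crawlDelayMs: 0 };

// Parse robots.txt and return the group for `agent`, else the "*" group,
// else an allow-all policy.
function parseRobots(body: string, agent: string): RobotsPolicy {
  const groups = new Map<string, RobotsPolicy>();
  let current: RobotsPolicy[] = [];
  let inAgentLines = false;
  for (const raw of body.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (key === "user-agent") {
      if (!inAgentLines) current = []; // first agent line starts a new group
      inAgentLines = true;
      const name = value.toLowerCase();
      const policy = groups.get(name) ?? { rules: [], crawlDelayMs: 0 };
      groups.set(name, policy);
      current.push(policy);
    } else if (key === "allow" || key === "disallow" || key === "crawl-delay") {
      inAgentLines = false;
      for (const policy of current) {
        if (key === "crawl-delay") {
          const seconds = Number(value);
          if (Number.isFinite(seconds) && seconds >= 0) policy.crawlDelayMs = seconds * 1000;
        } else if (value !== "") { // an empty Disallow restricts nothing
          policy.rules.push({ allow: key === "allow", path: value });
        }
      }
    }
  }
  return groups.get(agent.toLowerCase()) ?? groups.get("*") ?? ALLOW_ALL;
}

// Longest matching prefix wins; Allow wins a tie; no match means allowed.
function isAllowed(policy: RobotsPolicy, path: string): boolean {
  let best: Rule | undefined;
  for (const rule of policy.rules) {
    if (!path.startsWith(rule.path)) continue;
    if (!best || rule.path.length > best.path.length ||
        (rule.path.length === best.path.length && rule.allow)) {
      best = rule;
    }
  }
  return best ? best.allow : true;
}

// Fetch robots.txt once per seed origin; fail open on non-2xx or network error.
async function loadRobots(origin: string, agent: string, fetchImpl: FetchLike): Promise<RobotsPolicy> {
  try {
    const res = await fetchImpl(`${origin}/robots.txt`);
    if (!res.ok) return ALLOW_ALL;
    return parseRobots(await res.text(), agent);
  } catch {
    return ALLOW_ALL; // assumption: an unreachable host is treated like a missing file
  }
}
```

In production the connector could pass the global fetch as `fetchImpl`; in tests a stub returning a canned body keeps CI offline. The BFS loop would then call `isAllowed` on each candidate path before enqueuing and wait `crawlDelayMs` between requests.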