Honor robots.txt in the keyless HttpConnector
task-http-connector-robots-txt
The earlier task "Web ingestion: a keyless HttpConnector by default, Firecrawl wired as the premium path, and a first-class CLI web-ingest mode" shipped the HttpConnector, the first real, keyless network connector (global fetch, no SDK, no API key, a bounded same-host BFS crawl), and explicitly recorded "robots.txt not yet honored" as a follow-up. This task closes that gap so the polite-crawler contract holds before the connector is used at any scale.
In progress (claim/lease in action)
This atom is claimed (claimed_by: claude-agent-ingest-20260616, lease_expires: 2026-06-16T15:30:00Z). While the lease is live, the PreToolUse claim/lease governance hook (once built; see Build the PreToolUse claim/lease governance hook) denies a competing agent's edit to this task; after the lease expires, the task becomes reclaimable. The claim is coordination only: the owner/assignee (Ingestion & Connectors Engineer) is unchanged.
Shape
Fetch /robots.txt for the seed host once, parse the applicable user-agent group, and gate each candidate URL through an allow-check before enqueuing it. Honor crawl-delay. Fail open (crawl allowed) if robots.txt is missing or returns a non-2xx status, consistent with crawler norms. Keep the connector keyless and inject fetch in tests so CI stays offline (Ingestion connector seam: assemble, don't build, and the ingestion owns the input contract seam discipline).
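A minimal sketch of the shape above, not the connector's actual code. The names FetchLike, RobotsPolicy, parseRobots, isAllowed and loadRobots, the host example.test and the agent name examplebot are all hypothetical. The sketch assumes plain prefix matching (no `*` or `$` wildcards), exact user-agent token matching with a fallback to the `*` group, and that a network error also fails open, which the task text does not specify.

```typescript
// Minimal fetch shape so tests can inject a fake; the global fetch also fits it.
type FetchLike = (url: string) => Promise<{ ok: boolean; text(): Promise<string> }>;

interface Rule { allow: boolean; path: string }
interface RobotsPolicy { rules: Rule[]; crawlDelaySeconds?: number }

// Used when robots.txt is absent: no rules means every path is allowed.
const ALLOW_ALL: RobotsPolicy = { rules: [] };

// Parse robots.txt and return the group for userAgent, else the "*" group.
function parseRobots(body: string, userAgent: string): RobotsPolicy {
  const groups = new Map<string, RobotsPolicy>();
  let current: RobotsPolicy[] = [];
  let inAgentRun = false; // consecutive User-agent lines share one group
  for (const raw of body.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (key === "user-agent") {
      if (!inAgentRun) current = [];
      inAgentRun = true;
      const name = value.toLowerCase();
      const policy = groups.get(name) ?? { rules: [] };
      groups.set(name, policy);
      current.push(policy);
      continue;
    }
    inAgentRun = false;
    for (const policy of current) {
      if (key === "allow" || key === "disallow") {
        // An empty value (e.g. "Disallow:") restricts nothing, so skip it.
        if (value !== "") policy.rules.push({ allow: key === "allow", path: value });
      } else if (key === "crawl-delay" && value !== "") {
        const seconds = Number(value);
        if (isFinite(seconds) && seconds >= 0) policy.crawlDelaySeconds = seconds;
      }
    }
  }
  return groups.get(userAgent.toLowerCase()) ?? groups.get("*") ?? ALLOW_ALL;
}

// Longest matching prefix wins; Allow wins a tie; no match means allowed.
function isAllowed(policy: RobotsPolicy, path: string): boolean {
  let best: Rule | undefined;
  for (const rule of policy.rules) {
    if (!path.startsWith(rule.path)) continue;
    const longer = !best || rule.path.length > best.path.length;
    const tieAllow = !!best && rule.path.length === best.path.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best ? best.allow : true;
}

// Fetch robots.txt once per seed origin, failing open when it is unavailable.
async function loadRobots(
  origin: string,
  userAgent: string,
  fetchImpl: FetchLike,
): Promise<RobotsPolicy> {
  try {
    const res = await fetchImpl(`${origin}/robots.txt`);
    if (!res.ok) return ALLOW_ALL; // missing or non-2xx: crawl allowed
    return parseRobots(await res.text(), userAgent);
  } catch {
    return ALLOW_ALL; // assumption: a network error is treated the same way
  }
}

// Usage with an injected fake fetch, as an offline test would do it.
const fakeFetch: FetchLike = async () => ({
  ok: true,
  text: async () =>
    "User-agent: *\nDisallow: /private\nAllow: /private/docs\nCrawl-delay: 2\n",
});

loadRobots("https://example.test", "examplebot", fakeFetch).then((policy) => {
  const candidates = ["/", "/private/x", "/private/docs/a"];
  // Gate each candidate before it would be enqueued by the BFS crawl.
  const enqueued = candidates.filter((path) => isAllowed(policy, path));
  console.log(enqueued, policy.crawlDelaySeconds);
});
```

Passing fetch as a parameter keeps the parser and the allow-check pure, so CI can exercise them without a network. The crawl loop would wait crawlDelaySeconds between requests when the value is set. Note that RFC 9309 asks crawlers to treat 5xx responses as a full disallow; the sketch follows the fail-open policy stated in this task instead.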