Goal: collect a large list of public pages (no login, no paywalls) on a daily schedule (and near-real-time when it makes sense) without tripping bot protections. Stealth first; I’ll also add a short path to become a recognized/allow-listed crawler if the business later wants it.
I would keep three constraints in front of me at all times:
- Public surface only. If a page requires login or access control, I would skip it.
- Polite footprint. I would stay under per-site budgets (low RPS, controlled concurrency), spread traffic out, and avoid hammering.
- No brittle tricks. The system would lean on consistent identity (device, TLS, IP) and human-like pacing instead of playing whack-a-mole.
Cloudflare/Akamai will be scoring on:
- IP reputation & rate (residential vs cloud; bursts; geography).
- TLS/JA3 & HTTP/2 fingerprint (does the stack look like Chrome/Firefox or a script?).
- Browser/device fingerprint (UA vs features, fonts, WebGL/Canvas, timezone).
- Behavior (navigation path, timing jitter, interaction signals).
- Challenges (managed JS checks, Turnstile/hCaptcha, “Under Attack” pages, 403/429).
My design would respect that model instead of fighting it.
High-level architecture:

```mermaid
flowchart LR
  subgraph ControlPlane
    S[Scheduler] --> Q[Task Queue]
    Policy[Per-Domain Policy + Budgets]
    S --> Policy
  end
  Q --> W1[Fetcher Worker]
  Q --> W2[Fetcher Worker]
  Q --> Wn[Fetcher Worker]
  subgraph Worker
    Prof[Profile & Session Manager]
    Net[Proxy/Egress Manager]
    Sel["Path Selector (API/Sitemap/HTML)"]
    Rend["Renderer (Headless only if needed)"]
    Prof --> Sel --> Rend
    Net --> Rend
  end
  Rend --> N[Normalizer/Parser]
  N --> Dedupe[De-dupe & Change Detector]
  Dedupe --> Store[Storage/Warehouse]
  Rend --> Obs[Observability]
  Obs --> ControlPlane
```
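For concreteness, here is a minimal sketch of the objects the control plane might hand a worker. All names and defaults are illustrative, not a fixed schema; the budget values are placeholders for the starting budgets listed later.

```python
from dataclasses import dataclass, field

@dataclass
class DomainPolicy:
    # Per-domain budgets owned by the control plane (placeholder defaults;
    # see the starting budgets later in this doc).
    max_concurrency: int = 1
    max_rps_per_ip: float = 0.3
    min_delay_ms: int = 300
    max_delay_ms: int = 1200
    allow_headless: bool = True

@dataclass
class FetchTask:
    url: str
    domain: str
    profile_id: str                  # pinned browser profile for this domain/session window
    egress_id: str                   # sticky IP / proxy assignment
    policy: DomainPolicy = field(default_factory=DomainPolicy)
    reason_code: str | None = None   # filled in by the worker when the task completes
```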
Routing order inside a worker:
- If the site exposes an API/feed/sitemap, I would prefer that.
- Otherwise I would fetch plain HTML with a stable, browser-looking stack.
- If the page is JS-heavy or returns a managed challenge, I would escalate to a headless renderer for that task only, cache the resulting cookies, and fall back to HTTP for the next N pages.
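A sketch of that routing order inside a worker. Everything referenced here (`has_api_or_sitemap`, `fetch_structured`, `fetch_html`, `looks_like_challenge`, `fetch_with_headless`, and the session's `prefer_http_for` hint) is a hypothetical helper, named only to show the decision flow.

```python
def fetch(task, session):
    """Try the cheapest path first; escalate to a headless renderer only when forced."""
    if has_api_or_sitemap(task.domain):              # registry lookup (hypothetical)
        return fetch_structured(task, session)       # API / feed / sitemap path

    resp = fetch_html(task, session)                 # plain HTTP with a browser-looking stack
    if resp.ok and not looks_like_challenge(resp):
        return resp

    # JS-heavy page or managed challenge: render this one task, keep the cookies,
    # then go back to plain HTTP for the next N pages on the same session.
    resp = fetch_with_headless(task, session)
    session.cookies.update(resp.cookies)
    session.prefer_http_for(n_pages=50)              # hypothetical session hint
    return resp
```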
Device identity (per site, per session window): I would pin a realistic browser profile and reuse it:
- Same UA and `Sec-CH-UA*` client hints, same timezone/locale, same screen/DPR, same fonts/WebGL/Canvas outputs.
- Persist cookies/localStorage/IndexedDB to a profile directory so requests look like one user, not a new device each hit.
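One way to pin and persist a profile like that is Playwright's persistent context; the locale/timezone/viewport values and the one-directory-per-domain layout are my own illustrative choices.

```python
from playwright.sync_api import sync_playwright

PROFILE_DIR = "profiles/example.com"   # one profile directory per domain/session window

with sync_playwright() as p:
    # launch_persistent_context keeps cookies/localStorage/IndexedDB on disk,
    # so every run presents the same device instead of a fresh one.
    ctx = p.chromium.launch_persistent_context(
        PROFILE_DIR,
        headless=True,
        locale="en-US",
        timezone_id="America/New_York",
        viewport={"width": 1366, "height": 768},
    )
    page = ctx.new_page()
    page.goto("https://example.com/")
    ctx.close()
```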
Network egress: I would use sticky residential/ISP IPs:
- Keep one IP per worker/session for ~30 days (or per crawl window) in the same city/region as the target’s main audience or a neutral region the site expects.
- Avoid ASN class flips (residential ↔ cloud).
- Enforce low RPS and add jitter; spread load across many IPs when the corpus is large.
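A sketch of keeping one egress per (domain, worker) for the whole window. The proxy endpoints are placeholders; real pools come from the provider's API, and banned IPs would be removed from the pool out of band.

```python
import hashlib
import time

# Placeholder endpoints; in practice these come from the residential/ISP proxy provider.
PROXY_POOL = [
    "http://user:pass@proxy-1.example.net:8000",
    "http://user:pass@proxy-2.example.net:8000",
]
STICKY_WINDOW_S = 30 * 24 * 3600   # keep the same exit IP for ~30 days

def sticky_proxy(domain: str, worker_id: str, now: float | None = None) -> str:
    """Deterministically map (domain, worker, 30-day window) to one proxy.

    The same pair gets the same exit IP for the whole window and only moves
    when the window rolls over -- no per-request rotation.
    """
    window = int((now or time.time()) // STICKY_WINDOW_S)
    digest = hashlib.sha256(f"{domain}:{worker_id}:{window}".encode()).hexdigest()
    return PROXY_POOL[int(digest, 16) % len(PROXY_POOL)]
```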
TLS/HTTP2: I would make the TLS/JA3 and ALPN match real browsers. If I’m not using a browser, I would impersonate Chrome/Firefox handshakes so the TLS layer doesn’t out me.
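If the fetch path is plain HTTP rather than a real browser, a library such as curl_cffi (one option among several; the exact impersonation target names depend on the library version) can mimic a Chrome handshake. The URL, proxy, and headers below are illustrative.

```python
from curl_cffi import requests   # pip install curl_cffi

# impersonate="chrome" makes the TLS (JA3) and HTTP/2 fingerprint match a recent
# Chrome build, so the handshake agrees with the browser-looking headers we send.
resp = requests.get(
    "https://example.com/sitemap.xml",
    impersonate="chrome",
    proxies={"https": "http://user:pass@proxy-1.example.net:8000"},  # the sticky egress from above
    headers={"Accept-Language": "en-US,en;q=0.9"},
    timeout=30,
)
print(resp.status_code, len(resp.content))
```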
Pacing and navigation:
- I would “enter” via a hub page or sitemap URL, then fan out, avoiding cold deep-links for every single URL.
- Between requests I would wait a random 300–1200 ms, and add longer gaps after bursts or when I cross section boundaries.
- I would cap to 1–3 parallel fetches per domain per IP and keep global ceilings tight (tuned per target).
- I would never open parallel tabs that fetch the same domain with the same session.
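A sketch of those pacing rules with asyncio; the semaphore limit and sleep ranges mirror the numbers above, and `fetch_fn` stands in for whatever fetch path the worker selected.

```python
import asyncio
import random

_DOMAIN_LIMITS: dict[str, asyncio.Semaphore] = {}

def domain_semaphore(domain: str, limit: int = 2) -> asyncio.Semaphore:
    """Cap parallel fetches per domain (1-3 per the budget; default 2 here)."""
    return _DOMAIN_LIMITS.setdefault(domain, asyncio.Semaphore(limit))

async def paced_fetch(domain: str, url: str, fetch_fn, pages_done: int):
    async with domain_semaphore(domain):
        # 300-1200 ms of jitter between page requests...
        await asyncio.sleep(random.uniform(0.3, 1.2))
        # ...plus a longer pause after every burst of ~20 pages.
        if pages_done and pages_done % 20 == 0:
            await asyncio.sleep(random.uniform(2.0, 5.0))
        return await fetch_fn(url)
```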
Scheduling:
- Daily: I would schedule a full sweep during the site’s off-peak hours, partitioned across workers so the job finishes within the window without spiking per-domain budgets.
- Deltas: I would maintain ETag/Last-Modified and simhash/LSH fingerprints; I’d poll known “changey” endpoints more often, and only re-crawl a detail page when a parent/listing indicates a change.
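A sketch of the delta path using standard conditional requests; `seen` is whatever store holds per-URL validators, and `http_get` is the worker's fetch function. The simhash step is only hinted at in a comment.

```python
def fetch_if_changed(url: str, seen: dict, http_get):
    """Issue a conditional GET; return new content, or None when unchanged (304)."""
    headers = {}
    prev = seen.get(url, {})
    if prev.get("etag"):
        headers["If-None-Match"] = prev["etag"]
    if prev.get("last_modified"):
        headers["If-Modified-Since"] = prev["last_modified"]

    resp = http_get(url, headers=headers)
    if resp.status_code == 304:
        return None   # unchanged: skip parsing and skip re-crawling child pages

    seen[url] = {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
        # A content fingerprint (simhash/LSH) would also go here, to catch
        # servers that ignore conditional headers.
    }
    return resp.content
```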
I would classify responses and react differently:
- 429 / “Rate limited”: exponential backoff for that domain+IP, reduce concurrency, and reschedule.
- 403 / “Access denied” / CF/Akamai interstitial HTML: escalate this task to headless; if solved, cache cookies and return to HTTP for subsequent tasks on the same session.
- Repeated challenges (≥2 in 24h on same domain+profile): trip a circuit breaker for that domain, cool down, and require operator review or policy change.
- Hard bans on an IP: retire that IP for a long TTL, mark reputation bad, and rotate it out.
I would keep reason codes on the task so I can tune policies with data (rate_limit, cf_js_challenge, captcha, tls_mismatch, etc.).
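A sketch of the classification step that attaches those reason codes; the challenge markers and backoff parameters are placeholders to tune against real responses.

```python
import random

# Illustrative markers only; tune against real Cloudflare/Akamai challenge pages.
CHALLENGE_MARKERS = ("just a moment", "cf-chl", "turnstile")

def classify(status: int, body: str) -> str:
    """Map a response to a reason code that drives backoff/escalation policy."""
    if status == 429:
        return "rate_limit"        # exponential backoff for this domain+IP
    if status == 403 or any(m in body.lower() for m in CHALLENGE_MARKERS):
        return "cf_js_challenge"   # escalate this one task to headless
    if status in (200, 304):
        return "ok"
    return "http_error"

def backoff_seconds(attempt: int, base: float = 30.0, cap: float = 3600.0) -> float:
    """Exponential backoff with full jitter for a rate-limited domain+IP."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```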
Storage:
- I would write raw responses (compressed) to blob storage with `{domain}/{date}/…` prefixes, and write normalized records to a warehouse with columns like `url, seen_at, hash, fields…, source, status, reason_code`.
- I would keep diff tables so downstream consumers can subscribe to changes only.
- I would log just enough for debugging (status, timing, proxy id, profile id), never PII.
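A small sketch of the blob key and normalized record; the column names follow the list above, and everything else (hashing choices, JSON-encoded fields) is illustrative.

```python
import gzip
import hashlib
import json
from datetime import datetime, timezone

def raw_payload(body: bytes) -> bytes:
    """Raw responses go to blob storage compressed."""
    return gzip.compress(body)

def blob_key(domain: str, url: str, seen_at: datetime) -> str:
    """{domain}/{date}/... prefix so raw snapshots partition cleanly by day."""
    return f"{domain}/{seen_at:%Y-%m-%d}/{hashlib.sha256(url.encode()).hexdigest()}.html.gz"

def normalized_row(url: str, body: bytes, fields: dict, source: str,
                   status: int, reason_code: str) -> dict:
    return {
        "url": url,
        "seen_at": datetime.now(timezone.utc).isoformat(),
        "hash": hashlib.sha256(body).hexdigest(),   # this hash feeds the diff tables
        "fields": json.dumps(fields),
        "source": source,
        "status": status,
        "reason_code": reason_code,
    }
```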
Starting budgets (tuned per target):
- Per-domain concurrency: 1–3 (start at 1).
- Per-IP RPS ceiling: 0.2–0.5/s (i.e., 1 req every 2–5 s).
- Delays: 300–1200 ms between page requests; +2–5 s after every 20 pages.
- Retry budget: 2 retries max per URL with backoff; then park for that run.
- Headless use: ≤ 5–10% of tasks should need it; if higher, I would re-evaluate selectors/headers/TLS.
- IP lifecycle: min 7 days; prefer 30 days; retire early on bans.
- Snapshot window: pick a 2–6 h off-peak window; shard by site section.
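One way to enforce the per-IP RPS ceiling is a small token bucket per sticky IP; the 0.3 req/s default below sits inside the 0.2–0.5/s range above and is, like everything here, a starting point.

```python
import time

class TokenBucket:
    """At most `rate` requests per second for one egress IP, with no real bursts."""

    def __init__(self, rate: float = 0.3, burst: int = 1):
        self.rate = rate               # tokens per second (0.2-0.5/s per the budget)
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            time.sleep((1.0 - self.tokens) / self.rate)

_buckets: dict[str, TokenBucket] = {}

def throttle(egress_id: str) -> None:
    """One bucket per sticky IP, shared by every task that uses that egress."""
    _buckets.setdefault(egress_id, TokenBucket()).acquire()
```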
Stealth is my default (public pages, polite budgets). If the business wants durability with key sites, I would:
- Publish a crawler page (UA string, contact, purpose).
- Announce static IP ranges and keep them stable.
- Ask the site owner (or their WAF support) for allow-listing/verified-bot status with strict rate caps. This flips many challenges off and is the only truly scalable path for strategic targets.
What I would not do:
- I would not solve CAPTCHAs at scale or try to fake endless fingerprints; it’s noisy and brittle.
- I would not rotate IPs every few requests; that looks exactly like abuse.
- I would not exceed polite budgets even if I “can”. This would tank reputation.
End-to-end flow for one worker session:

```mermaid
sequenceDiagram
  participant S as Scheduler
  participant W as Worker
  participant P as Profile/Session
  participant E as Egress (Sticky IP)
  participant T as Target
  S->>W: task(url)
  W->>P: load profile (domain-scoped)
  W->>E: allocate sticky residential IP
  W->>T: GET hub/sitemap (HTTP/2, real headers)
  T-->>W: 200 + links (or JS challenge)
  alt challenge
    W->>T: solve with headless once, cache cookies
  end
  loop for each target page (budgeted)
    W->>T: GET page (reuse cookies, same IP)
    T-->>W: 200 | 304 | 429 | 403/challenge
    W-->>W: classify + backoff/escalate as rules dictate
  end
  W-->>S: emit normalized rows + reason codes
```
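Tying the sketches together, a compressed synchronous version of that sequence for one worker. `open_profile_session`, `solve_once_with_headless`, and `emit` are hypothetical glue; the other helpers are the illustrative sketches from earlier sections.

```python
import random
import time

def run_session(domain: str, urls: list[str], worker_id: str = "worker-1") -> None:
    proxy = sticky_proxy(domain, worker_id)           # sticky egress for the whole session
    session = open_profile_session(domain, proxy)     # hypothetical: pinned profile + cookies

    for url in urls:
        throttle(proxy)                               # per-IP RPS ceiling (token bucket)
        time.sleep(random.uniform(0.3, 1.2))          # human-like jitter between pages

        resp = session.get(url)                       # requests-like response assumed
        code = classify(resp.status_code, resp.text)
        if code == "cf_js_challenge":
            resp = solve_once_with_headless(session, url)   # hypothetical escalation helper
        elif code == "rate_limit":
            time.sleep(backoff_seconds(attempt=1))
            resp = session.get(url)

        emit(normalized_row(url, resp.content, fields={}, source="html",
                            status=resp.status_code, reason_code=code))
```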