Large-scale scraping (public data only)

Goal: crawl a large list of public pages (no login, no paywalls) on a daily schedule (and in near real time when it makes sense) without tripping bot protections. Stealth first; I would also keep a short path to becoming a recognized/allow-listed crawler if the business later wants it.


1) North-star rules (so I don’t burn the target)

I would keep three constraints in front of me at all times:

  1. Public surface only. If a page requires login or access control, I would skip it.
  2. Polite footprint. I would stay under per-site budgets (low RPS, controlled concurrency), spread traffic out, and avoid hammering.
  3. No brittle tricks. The system would lean on consistent identity (device, TLS, IP) and human-like pacing instead of playing whack-a-mole.

2) What the bot protections will be doing (so I can pre-empt them)

Cloudflare/Akamai will be scoring on:

  • IP reputation & rate (residential vs cloud; bursts; geography).
  • TLS/JA3 & HTTP/2 fingerprint (does the stack look like Chrome/Firefox or a script?).
  • Browser/device fingerprint (UA vs features, fonts, WebGL/Canvas, timezone).
  • Behavior (navigation path, timing jitter, interaction signals).
  • Challenges (managed JS checks, Turnstile/hCaptcha, “Under Attack” pages, 403/429).

My design would respect that model instead of fighting it.


3) System I would ship

flowchart LR
  subgraph ControlPlane
    S[Scheduler] --> Q[Task Queue]
    Policy[Per-Domain Policy + Budgets]
    S --> Policy
  end

  Q --> W1[Fetcher Worker]
  Q --> W2[Fetcher Worker]
  Q --> Wn[Fetcher Worker]

  subgraph Worker
    Prof[Profile & Session Manager]
    Net[Proxy/Egress Manager]
    Sel["Path Selector (API/Sitemap/HTML)"]
    Rend["Renderer (Headless only if needed)"]
    Prof --> Sel --> Rend
    Net --> Rend
  end

  Rend --> N[Normalizer/Parser]
  N --> Dedupe[De-dupe & Change Detector]
  Dedupe --> Store[Storage/Warehouse]
  Rend --> Obs[Observability]
  Obs --> ControlPlane

Routing order inside a worker:

  1. If the site exposes an API/feed/sitemap, I would prefer that.
  2. Otherwise I would fetch plain HTML with a stable, browser-looking stack.
  3. If the page is JS-heavy or returns a managed challenge, I would escalate to a headless renderer for that task only, cache the resulting cookies, and fall back to HTTP for the next N pages.
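
A minimal sketch of that routing decision (the Task fields and path names are my own placeholders, not an existing API):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    url: str
    api_url: Optional[str] = None      # structured source, if the per-domain policy knows one
    sitemap_url: Optional[str] = None

def choose_path(task: Task, challenged: bool) -> str:
    """Pick the cheapest fetch path that should work for this task."""
    # 1. Prefer an API/feed/sitemap when the site exposes one.
    if task.api_url or task.sitemap_url:
        return "api"
    # 2. Default to plain HTML over a browser-looking HTTP stack.
    if not challenged:
        return "html"
    # 3. Escalate to the headless renderer only for tasks that hit a managed
    #    challenge; the caller caches the cookies it earns and drops back to
    #    "html" for the next N pages.
    return "headless"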

4) Identity & network (the part that would keep me from getting banned)

Device identity (per site, per session window): I would pin a realistic browser profile and reuse it:

  • Same UA and Sec-CH-UA*, same timezone/locale, same screen/DPR, same fonts/WebGL/Canvas outputs.
  • Persist cookies/localStorage/IndexedDB to a profile directory so requests look like one user, not a new device each hit.
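
A minimal sketch of that pinning, assuming Playwright is the headless renderer (the profile directory layout and the pinned values are illustrative):

from playwright.sync_api import sync_playwright

PROFILE_DIR = "profiles/example.com/worker-01"   # hypothetical per-domain layout

def open_pinned_profile():
    # A persistent context keeps cookies/localStorage/IndexedDB on disk, so
    # every visit from this worker looks like the same returning device.
    p = sync_playwright().start()
    ctx = p.chromium.launch_persistent_context(
        PROFILE_DIR,
        headless=True,
        locale="en-US",                   # pinned, never rotated mid-window
        timezone_id="America/New_York",   # should match the egress IP's region
        viewport={"width": 1440, "height": 900},
        device_scale_factor=2,
    )
    return p, ctx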

Network egress: I would use sticky residential/ISP IPs:

  • Keep one IP per worker/session for ~30 days (or per crawl window) in the same city/region as the target’s main audience or a neutral region the site expects.
  • Avoid ASN class flips (residential ↔ cloud).
  • Enforce low RPS and add jitter; spread load across many IPs when the corpus is large.
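
A sketch of how I would make sessions sticky, assuming a residential provider that keys sticky sessions off the proxy username (the gateway, credentials, and username format below are placeholders; every provider differs):

PROXY_HOST = "gw.residential-provider.example:8000"   # placeholder gateway
PROXY_USER = "customer-abc"                           # placeholder credentials
PROXY_PASS = "secret"

def sticky_proxy(domain: str, worker_id: int) -> dict:
    # Derive a stable session id from domain + worker so the same worker keeps
    # the same exit IP for the whole crawl window instead of rotating per request.
    session_id = f"{domain.replace('.', '-')}-{worker_id}"
    url = f"http://{PROXY_USER}-session-{session_id}:{PROXY_PASS}@{PROXY_HOST}"
    return {"http": url, "https": url}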

TLS/HTTP2: I would make the TLS/JA3 and ALPN match real browsers. If I'm not using a browser, I would impersonate Chrome/Firefox handshakes so the TLS layer doesn't give me away.
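
One way to get that without a full browser is a client that replays real browser handshakes, e.g. curl_cffi; a sketch (the set of impersonation target names depends on the installed version):

from curl_cffi import requests as browser_requests

def fetch_like_chrome(url: str, proxies: dict | None = None):
    # curl_cffi replays a real Chrome TLS/JA3 and HTTP/2 handshake, so the
    # transport fingerprint matches the browser headers being sent.
    return browser_requests.get(
        url,
        impersonate="chrome",
        proxies=proxies,       # e.g. the sticky_proxy(...) dict from above
        timeout=30,
    )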


5) Pacing & navigation (so my traffic looks normal)

  • I would “enter” via a hub page or sitemap URL, then fan out—avoiding cold deep-links for every single URL.
  • Between requests I would wait a random 300–1200 ms, and add longer gaps after bursts or when crossing section boundaries.
  • I would cap to 1–3 parallel fetches per domain per IP and keep global ceilings tight (tuned per target).
  • I would never open parallel tabs that fetch the same domain with the same session.
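
A sketch of the pacing layer, assuming asyncio workers (fetch_coro is any awaitable that performs the actual request; the limits mirror the knobs in section 9):

import asyncio
import random

DOMAIN_LIMITS: dict[str, asyncio.Semaphore] = {}

def domain_slot(domain: str, limit: int = 1) -> asyncio.Semaphore:
    # Cap parallel fetches per domain (1–3; start at 1).
    return DOMAIN_LIMITS.setdefault(domain, asyncio.Semaphore(limit))

async def paced_fetch(domain: str, fetch_coro, pages_done: int = 0):
    async with domain_slot(domain):
        result = await fetch_coro
    # 300–1200 ms jitter between requests on the same domain...
    await asyncio.sleep(random.uniform(0.3, 1.2))
    # ...plus a longer breather after each burst of ~20 pages.
    if pages_done and pages_done % 20 == 0:
        await asyncio.sleep(random.uniform(2.0, 5.0))
    return result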

6) Daily snapshots + near-real-time deltas

  • Daily: I would schedule a full sweep during the site’s off-peak hours, partitioned across workers so the job finishes within the window without spiking per-domain budgets.
  • Deltas: I would maintain ETag/Last-Modified and simhash/LSH fingerprints; I’d poll known “changey” endpoints more often, and only re-crawl a detail page when a parent/listing indicates a change.
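
A sketch of the delta check (plain requests here for brevity; in practice this call would go through the browser-shaped client and sticky proxy above):

import hashlib
import requests

def fetch_if_changed(url: str, cached: dict) -> dict | None:
    """Conditional GET; returns None when the page has not changed."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]

    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None   # server confirms: unchanged since the last sweep

    # Cheap exact-content fingerprint; a simhash/LSH near-duplicate check
    # would replace this to also ignore trivial template noise.
    digest = hashlib.sha256(resp.content).hexdigest()
    if digest == cached.get("hash"):
        return None

    return {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
        "hash": digest,
        "body": resp.text,
    }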

7) Error & challenge handling (explicit, deterministic)

I would classify responses and react differently:

  • 429 / “Rate limited”: exponential backoff for that domain+IP, reduce concurrency, and reschedule.
  • 403 / “Access denied” / CF/Akamai interstitial HTML: escalate this task to headless; if solved, cache cookies and return to HTTP for subsequent tasks on the same session.
  • Repeated challenges (≥2 in 24h on same domain+profile): trip a circuit breaker for that domain, cool down, and require operator review or policy change.
  • Hard bans on an IP: retire that IP for a long TTL, mark reputation bad, and rotate it out.

I would keep reason codes on the task so I can tune policies with data (rate_limit, cf_js_challenge, captcha, tls_mismatch, etc.).
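
A sketch of that classifier (the challenge markers are illustrative heuristics, not an exhaustive list of Cloudflare/Akamai signatures):

def classify(status: int, body: str) -> str:
    """Map a response to a reason code that drives backoff/escalation."""
    text = body.lower()
    if status == 429:
        return "rate_limit"        # back off exponentially, cut concurrency
    if "captcha" in text or "turnstile" in text or "hcaptcha" in text:
        return "captcha"           # counts toward the circuit breaker
    if status == 403 or "just a moment" in text:
        return "cf_js_challenge"   # escalate this task to headless once
    if status >= 500:
        return "server_error"      # retry within the retry budget
    return "ok"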


8) Storage, schema, and auditing

  • I would write raw responses (compressed) to blob storage with {domain}/{date}/… prefixes, and write normalized records to a warehouse with columns like url, seen_at, hash, fields…, source, status, reason_code.
  • I would keep diff tables so downstream consumers can subscribe to changes only.
  • I would log just enough for debugging (status, timing, proxy id, profile id), never PII.
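
A sketch of the layout (the key scheme and column names follow the list above; the exact types are illustrative):

from dataclasses import dataclass
from datetime import datetime, timezone

def blob_key(domain: str, url_hash: str) -> str:
    # Raw compressed responses land under {domain}/{date}/... prefixes.
    date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return f"{domain}/{date}/{url_hash}.html.gz"

@dataclass
class PageRecord:
    # Normalized warehouse row; "fields" carries the parsed payload.
    url: str
    seen_at: str          # ISO-8601 UTC timestamp
    hash: str             # content fingerprint used by the change detector
    fields: dict
    source: str           # "api" | "html" | "headless"
    status: int
    reason_code: str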

9) Concrete knobs I would start with (and change based on telemetry)

  • Per-domain concurrency: 1–3 (start at 1).
  • Per-IP RPS ceiling: 0.2–0.5/s (i.e., 1 req every 2–5 s).
  • Delays: 300–1200 ms between page requests; +2–5 s after every 20 pages.
  • Retry budget: 2 retries max per URL with backoff; then park for that run.
  • Headless use: at most 5–10% of tasks should need it; if higher, I would re-evaluate selectors/headers/TLS.
  • IP lifecycle: min 7 days; prefer 30 days; retire early on bans.
  • Snapshot window: pick a 2–6 h off-peak window; shard by site section.
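
Expressed as a per-domain policy object the scheduler and workers would read (a sketch; the names are my own):

from dataclasses import dataclass

@dataclass
class DomainPolicy:
    # Starting values from the list above; tuned per target from telemetry.
    max_concurrency: int = 1              # 1–3 per domain, start at 1
    max_rps_per_ip: float = 0.25          # ~1 request every 4 s
    delay_ms: tuple = (300, 1200)         # jitter between page requests
    burst_pause_s: tuple = (2, 5)         # extra pause after every 20 pages
    max_retries: int = 2                  # then park the URL for this run
    headless_budget: float = 0.10         # alert if >10% of tasks need headless
    ip_min_days: int = 7                  # prefer 30; retire early on bans
    snapshot_window_h: tuple = (2, 6)     # off-peak, sharded by site section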

10) Compliance & “well-known crawler” path

Stealth is my default (public pages, polite budgets). If the business wants durability with key sites, I would:

  • Publish a crawler page (UA string, contact, purpose).
  • Announce static IP ranges and keep them stable.
  • Ask the site owner (or their WAF support) for allow-listing/verified-bot status with strict rate caps. This flips many challenges off and is the only truly scalable path for strategic targets.

11) What I would not do

  • I would not solve CAPTCHAs at scale or try to fake endless fingerprints; it’s noisy and brittle.
  • I would not rotate IPs every few requests; that looks exactly like abuse.
  • I would not exceed polite budgets even if I “can”. This would tank reputation.

12) Minimal sequence I would implement

sequenceDiagram
  participant S as Scheduler
  participant W as Worker
  participant P as Profile/Session
  participant E as Egress (Sticky IP)
  participant T as Target

  S->>W: task(url)
  W->>P: load profile (domain-scoped)
  W->>E: allocate sticky residential IP
  W->>T: GET hub/sitemap (HTTP/2, real headers)
  T-->>W: 200 + links (or JS challenge)
  alt challenge
    W->>T: solve with headless once, cache cookies
  end

  loop for each target page (budgeted)
    W->>T: GET page (reuse cookies, same IP)
    T-->>W: 200 | 304 | 429 | 403/challenge
    W-->>W: classify + backoff/escalate as rules dictate
  end
  W-->>S: emit normalized rows + reason codes