Skip to content

Instantly share code, notes, and snippets.

@rsrini7
Created November 23, 2025 17:29
Show Gist options
  • Select an option

  • Save rsrini7/d2eb9b724bb56f4f54619a4fea237fbc to your computer and use it in GitHub Desktop.

Select an option

Save rsrini7/d2eb9b724bb56f4f54619a4fea237fbc to your computer and use it in GitHub Desktop.
cloudflare-down-nov-2025

Cloudflare Outage (November 18, 2025) – Step-By-Step Technical Incident Analysis

1. Permissions Change on ClickHouse

  • Engineer ran an RBAC update to improve security.
  • Example:
    ALTER TABLE permissions REVOKE ALL ON database.* FROM ROLE 'botmgmt_reader';
    

2. Config Generation Bug

  • Automated system queries ClickHouse for bot mitigation rules.
  • Due to the change, query output unexpectedly returns duplicate records (one per DB shard).
  • Example (intended vs buggy):
    -- Intended:
    SELECT feature, pattern FROM metadata.features;
    -- Bug — wrong scope, too broad:
    SELECT * FROM shards.metadata.features; -- Duplicates for each shard
    

3. Oversized Config Distributes Globally

  • Config file grows from ~60 entries to 200+ (with duplicates).
  • File auto-propagated every 5 minutes to all data centers.
  • Example (JSON excerpt):
    // Expected
    { "rules": [ {"id": "bot-1", "pattern": "abc.*def"} ] }
    // Faulty
    { "rules": [
        {"id": "bot-1", "pattern": "abc.*def"},  // repeated for every shard
        ...
    ]}
    

4. Service Crash & Flap Loop

  • Bot mitigation service tries to load malformed config and crashes (panic or assertion fails).
  • Service restarts, re-fetches config, crashes again ("flap loop").
  • Example (Rust-like):
    let config = parse_config("botmgmt.json").unwrap();
    assert!(config["rules"].len() < 128); // Triggers panic
    

5. Cascade Failure (Tightly Coupled Services)

  • Major dependent services (WAF, DDoS, analytics) also fail as they require bot mitigation output.

6. Global Outage (Homogeneous Deployment)

  • All 330+ Cloudflare data centers run identical stacks; failure is instant and everywhere.

7. Response & Remediation

  • Engineers halt automated config propagation.
  • Deploy last known-good config manually.
  • Example:
    cloudflarectl config upload --service=botmgmt --file=botmgmt_good.json
    
  • Outage resolved; services stabilize.

8. Postmortem Action Items (Code/Config Best Practices)

  • Validation on config load (limit size/format before applying):
    if len(config['rules']) > 128:
        raise Exception("Config file too large")
    
  • Graceful degradation: fallback logic for core components, not crash.
  • Progressive/Canary rollout: avoid full blast changes.
  • Monitor config growth and alert: automated warning for abnormal size or shape.
  • Treat config as code: version control, audit, change review, rollback on error.

Key System Design Lessons

  • Redundancy is not enough: Homogeneous systems suffer global blast radius if config/code is flawed.
  • Design for degradation: Core dependencies should allow default/fallback operational modes.
  • Limit blast radius: Use staged, regional rollouts and canarying.
  • Monitor + Alert: Proactive metrics for config, service health, error spikes.
  • Config is as critical as source code: Apply the same discipline—review, test, rollback.

Cloudflare Outage (November 18, 2025): Technical, Business, and Financial Analysis

1. Incident Timeline (Technical Focus)

  • 11:20 UTC: Engineer updates RBAC on ClickHouse.
    ALTER TABLE permissions REVOKE ALL ON database.* FROM ROLE 'botmgmt_reader';
    
  • Bot Mitigation config generation bug: Duplicate results for each DB shard.
    SELECT * FROM shards.metadata.features; -- Unexpected duplicate entries
    
  • Oversized config file auto-propagated globally to 330+ data centers.
    { "rules": [ ...200+ entries with duplicates... ] }
    
  • Service crash loop: Bot mitigation process panics on large config, all sites begin failing.
    assert!(config["rules"].len() < 128); // Panic here
    
  • Cascade failure: Dependent services (WAF, DDoS, analytics) all go down.
  • Global blast radius: All regions impacted simultaneously due to identical config distribution.
  • Mitigation: Engineers halt config propagation, manually inject known-good config.
    cloudflarectl config upload --service=botmgmt --file=botmgmt_good.json
    
  • Postmortem: Add config validation, canary rollouts, fallback logic, monitoring, treat config as code.

2. Business & Financial Impacts

  • Major sites affected: Twitter/X, ChatGPT, Shopify, Spotify, League of Legends, banks, and many more.
  • Internet Impact: ~20% of global internet traffic disrupted.
  • Estimated global economic loss:
    • $5–15 billion USD per hour
    • Total loss estimate: $60 billion USD (for 3+ hours outage)
  • Cloudflare Inc. market cap movement:
    • Lost $16 billion in market cap in a single trading day (video/chart source).
  • Reputational damage:
    • Reduced trust in Cloudflare's reliability, increased regulatory and customer scrutiny.
    • Highlighted the risk of single points of failure in globally homogeneous critical internet infrastructure.

3. Key Lessons & Best Practices

  • Validate configs at generation and before propagation.
  • Use staged/canary deployments.
  • Build fallback/default logic for critical-path dependencies.
  • Monitor system and config growth, alert on anomalies.
  • Treat configuration artifacts with same rigor as production code.
  • Design infrastructure for graceful degradation, not catastrophic collapse.
  • Map "blast radius" and apply extraordinary care/validation to central dependencies.

Summary:
A single configuration error—amplified by tight coupling, homogeneous deployment, and lack of safeguards—temporarily broke a fifth of the global internet and incurred one of the highest downtime losses in history.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment