Cloudflare Outage (November 18, 2025) – Step-By-Step Technical Incident Analysis

1. Permissions Change on ClickHouse

Engineer ran an RBAC update to improve security.

Example:

ALTER TABLE permissions REVOKE ALL ON database.* FROM ROLE 'botmgmt_reader';

2. Config Generation Bug

Automated system queries ClickHouse for bot mitigation rules.
Due to the change, query output unexpectedly returns duplicate records (one per DB shard).

Example (intended vs buggy):

-- Intended:
SELECT feature, pattern FROM metadata.features;
-- Bug — wrong scope, too broad:
SELECT * FROM shards.metadata.features; -- Duplicates for each shard

3. Oversized Config Distributes Globally

Config file grows from ~60 entries to 200+ (with duplicates).
File auto-propagated every 5 minutes to all data centers.

Example (JSON excerpt):

// Expected
{ "rules": [ {"id": "bot-1", "pattern": "abc.*def"} ] }
// Faulty
{ "rules": [
    {"id": "bot-1", "pattern": "abc.*def"},  // repeated for every shard
    ...
]}

4. Service Crash & Flap Loop

Bot mitigation service tries to load malformed config and crashes (panic or assertion fails).
Service restarts, re-fetches config, crashes again ("flap loop").

Example (Rust-like):

let config = parse_config("botmgmt.json").unwrap();
assert!(config["rules"].len() < 128); // Triggers panic

5. Cascade Failure (Tightly Coupled Services)

Major dependent services (WAF, DDoS, analytics) also fail as they require bot mitigation output.

6. Global Outage (Homogeneous Deployment)

All 330+ Cloudflare data centers run identical stacks; failure is instant and everywhere.

7. Response & Remediation

Engineers halt automated config propagation.
Deploy last known-good config manually.

Example:

cloudflarectl config upload --service=botmgmt --file=botmgmt_good.json

Outage resolved; services stabilize.

8. Postmortem Action Items (Code/Config Best Practices)

Validation on config load (limit size/format before applying):

if len(config['rules']) > 128:
    raise Exception("Config file too large")

Graceful degradation: fallback logic for core components, not crash.
Progressive/Canary rollout: avoid full blast changes.
Monitor config growth and alert: automated warning for abnormal size or shape.
Treat config as code: version control, audit, change review, rollback on error.

Key System Design Lessons

Redundancy is not enough: Homogeneous systems suffer global blast radius if config/code is flawed.
Design for degradation: Core dependencies should allow default/fallback operational modes.
Limit blast radius: Use staged, regional rollouts and canarying.
Monitor + Alert: Proactive metrics for config, service health, error spikes.
Config is as critical as source code: Apply the same discipline—review, test, rollback.

Cloudflare Outage (November 18, 2025): Technical, Business, and Financial Analysis

1. Incident Timeline (Technical Focus)

11:20 UTC: Engineer updates RBAC on ClickHouse.

ALTER TABLE permissions REVOKE ALL ON database.* FROM ROLE 'botmgmt_reader';

Bot Mitigation config generation bug: Duplicate results for each DB shard.

SELECT * FROM shards.metadata.features; -- Unexpected duplicate entries

Oversized config file auto-propagated globally to 330+ data centers.
```
{ "rules": [ ...200+ entries with duplicates... ] }
```
Service crash loop: Bot mitigation process panics on large config, all sites begin failing.
```
assert!(config["rules"].len() < 128); // Panic here
```
Cascade failure: Dependent services (WAF, DDoS, analytics) all go down.
Global blast radius: All regions impacted simultaneously due to identical config distribution.
Mitigation: Engineers halt config propagation, manually inject known-good config.
```
cloudflarectl config upload --service=botmgmt --file=botmgmt_good.json
```
Postmortem: Add config validation, canary rollouts, fallback logic, monitoring, treat config as code.

2. Business & Financial Impacts

Major sites affected: Twitter/X, ChatGPT, Shopify, Spotify, League of Legends, banks, and many more.
Internet Impact: ~20% of global internet traffic disrupted.
Estimated global economic loss:
- $5–15 billion USD per hour
- Total loss estimate: $60 billion USD (for 3+ hours outage)
Cloudflare Inc. market cap movement:
- Lost $16 billion in market cap in a single trading day (video/chart source).
Reputational damage:
- Reduced trust in Cloudflare's reliability, increased regulatory and customer scrutiny.
- Highlighted the risk of single points of failure in globally homogeneous critical internet infrastructure.

3. Key Lessons & Best Practices

Validate configs at generation and before propagation.
Use staged/canary deployments.
Build fallback/default logic for critical-path dependencies.
Monitor system and config growth, alert on anomalies.
Treat configuration artifacts with same rigor as production code.
Design infrastructure for graceful degradation, not catastrophic collapse.
Map "blast radius" and apply extraordinary care/validation to central dependencies.

Summary:
A single configuration error—amplified by tight coupling, homogeneous deployment, and lack of safeguards—temporarily broke a fifth of the global internet and incurred one of the highest downtime losses in history.

rsrini7/cloudflare-down-nov-2025.md

Select an option

No results found