- An engineer ran an RBAC update on the ClickHouse cluster to improve security.
- Example:
REVOKE ALL ON database.* FROM botmgmt_reader; -- revoke the reader role's access (illustrative)
- Automated system queries ClickHouse for bot mitigation rules.
- Due to the change, query output unexpectedly returns duplicate records (one per DB shard).
- Example (intended vs buggy):
-- Intended (scoped query):
SELECT feature, pattern FROM metadata.features;
-- Buggy (scope too broad, reads every shard):
SELECT * FROM shards.metadata.features; -- duplicates, one row per shard
- Config file grows from ~60 entries to 200+ (with duplicates).
- File auto-propagated every 5 minutes to all data centers.
- Example (JSON excerpt):
// Expected
{ "rules": [ {"id": "bot-1", "pattern": "abc.*def"} ] }
// Faulty
{ "rules": [
    {"id": "bot-1", "pattern": "abc.*def"},
    // ...repeated for every shard
] }
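- Example (hypothetical sketch of the generation and 5-minute propagation loop; names like fetch_feature_rows and push_to_edge are assumptions, not Cloudflare's actual pipeline):
import json
import time

def generate_config(rows):
    # rows come straight from the metadata query; with the buggy shard-wide
    # SELECT the same rule appears once per shard and nothing deduplicates it
    return {"rules": [{"id": r["id"], "pattern": r["pattern"]} for r in rows]}

def propagation_loop(fetch_feature_rows, push_to_edge):
    # no size or uniqueness check: whatever the query returned is shipped everywhere
    while True:
        config = generate_config(fetch_feature_rows())
        push_to_edge(json.dumps(config))
        time.sleep(300)  # re-generate and propagate every 5 minutes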
- Bot mitigation service tries to load malformed config and crashes (panic or assertion fails).
- Service restarts, re-fetches the config, and crashes again (crash loop / "flapping").
- Example (Rust-like):
let config = parse_config("botmgmt.json").unwrap(); // panics if the file fails to parse
assert!(config["rules"].len() < 128); // triggers the panic once the oversized config arrives
- Major dependent services (WAF, DDoS, analytics) also fail as they require bot mitigation output.
- All 330+ Cloudflare data centers run identical stacks; failure is instant and everywhere.
- Engineers halt automated config propagation.
- Deploy last known-good config manually.
- Example:
cloudflarectl config upload --service=botmgmt --file=botmgmt_good.json
- Outage resolved; services stabilize.
- Validation on config load (limit size/format before applying):
if len(config['rules']) > 128:
    raise ValueError("Config file too large")
- Graceful degradation: fallback logic for core components, not crash (see the fallback sketch below).
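- Example (hypothetical fallback loader; the file names and MAX_RULES limit are illustrative, not the real service's logic):
import json

MAX_RULES = 128

def load_config(new_path="botmgmt.json", last_good_path="botmgmt_good.json"):
    # Prefer the freshly propagated config, but fall back to the last
    # known-good copy instead of crashing when parsing or validation fails.
    try:
        with open(new_path) as f:
            config = json.load(f)
        if len(config["rules"]) > MAX_RULES:
            raise ValueError("rule count exceeds limit")
        return config
    except (OSError, ValueError, KeyError):
        with open(last_good_path) as f:
            return json.load(f)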
- Progressive/Canary rollout: avoid full-blast, everywhere-at-once changes (see the staged rollout sketch below).
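- Example (sketch of a staged rollout gate; the stage fractions, soak time, and push_config/healthy hooks are assumptions, not Cloudflare's deployment tooling):
import time

STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of data centers reached at each stage

def staged_rollout(datacenters, push_config, healthy, soak_seconds=60):
    # Push to a small canary slice first; halt before widening the blast
    # radius if the canary slice looks unhealthy.
    for fraction in STAGES:
        batch = datacenters[: max(1, int(len(datacenters) * fraction))]
        for dc in batch:
            push_config(dc)  # assumed idempotent, so re-pushing earlier DCs is safe
        time.sleep(soak_seconds)
        if not healthy(batch):
            raise RuntimeError("canary unhealthy; rollout halted")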
- Monitor config growth and alert: automated warning for abnormal size or shape (see the growth-check sketch below).
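- Example (sketch of a pre-propagation growth check; the thresholds and alert hook are illustrative):
def config_growth_ok(new_rules, previous_count, alert, max_growth=1.5, hard_cap=128):
    # Block propagation and alert if the generated config contains duplicate
    # rule ids or grows abnormally versus the previous run.
    ids = [r["id"] for r in new_rules]
    if len(ids) != len(set(ids)):
        alert("duplicate rule ids in generated config")
        return False
    if len(ids) > hard_cap or len(ids) > previous_count * max_growth:
        alert(f"config grew from {previous_count} to {len(ids)} rules")
        return False
    return True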
- Treat config as code: version control, audit, change review, rollback on error.
- Redundancy is not enough: Homogeneous systems suffer global blast radius if config/code is flawed.
- Design for degradation: Core dependencies should allow default/fallback operational modes.
- Limit blast radius: Use staged, regional rollouts and canarying.
- Monitor + Alert: Proactive metrics for config, service health, error spikes.
- Config is as critical as source code: Apply the same discipline—review, test, rollback.
- 11:20 UTC: Engineer updates RBAC on ClickHouse.
REVOKE ALL ON database.* FROM botmgmt_reader; -- revoke the reader role's access (illustrative)
- Bot Mitigation config generation bug: Duplicate results for each DB shard.
SELECT * FROM shards.metadata.features; -- Unexpected duplicate entries
- Oversized config file auto-propagated globally to 330+ data centers.
{ "rules": [ ...200+ entries with duplicates... ] } - Service crash loop: Bot mitigation process panics on large config, all sites begin failing.
assert!(config["rules"].len() < 128); // Panic here - Cascade failure: Dependent services (WAF, DDoS, analytics) all go down.
- Global blast radius: All regions impacted simultaneously due to identical config distribution.
- Mitigation: Engineers halt config propagation, manually inject known-good config.
cloudflarectl config upload --service=botmgmt --file=botmgmt_good.json
- Postmortem: Add config validation, canary rollouts, fallback logic, and monitoring; treat config as code.
- Major sites affected: Twitter/X, ChatGPT, Shopify, Spotify, League of Legends, banks, and many more.
- Internet Impact: ~20% of global internet traffic disrupted.
- Estimated global economic loss:
- $5–15 billion USD per hour
- Total loss estimate: $60 billion USD (for a 3+ hour outage)
- Cloudflare Inc. market cap movement:
- Lost $16 billion in market cap in a single trading day (video/chart source).
- Reputational damage:
- Reduced trust in Cloudflare's reliability, increased regulatory and customer scrutiny.
- Highlighted the risk of single points of failure in globally homogeneous critical internet infrastructure.
- Validate configs at generation and before propagation.
- Use staged/canary deployments.
- Build fallback/default logic for critical-path dependencies.
- Monitor system and config growth, alert on anomalies.
- Treat configuration artifacts with same rigor as production code.
- Design infrastructure for graceful degradation, not catastrophic collapse.
- Map "blast radius" and apply extraordinary care/validation to central dependencies.
Summary:
A single configuration error—amplified by tight coupling, homogeneous deployment, and lack of safeguards—temporarily broke a fifth of the global internet and incurred one of the highest downtime losses in history.