- Before merging a PR
- Before deploying to production
- Any change touching APIs, persistence, queues, or external services
Identify failure modes that only show up under real production conditions: retries, partial failures, concurrency, and restarts.
You are reviewing this code as if it were already running in production under real load.
Assume the following WILL happen:
- requests will be retried (clients, load balancers, frameworks)
- downstream services will time out or partially fail
- duplicate requests or events will occur
- the process may crash or restart mid-execution
Analyze this code and identify:
- Failure modes (what can go wrong)
- Blast radius (how far the failure spreads)
- Data safety risks (duplication, corruption, loss)
For each issue, classify it as:
- BLOCKER – can cause production incidents or data loss
- RISK – works now but unsafe at scale
- OK – failure is contained or self-healing
For each BLOCKER or RISK:
- describe a concrete failure scenario
- propose the smallest possible fix
Focus on real behavior, not style or formatting.
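
To illustrate the kind of finding this review is meant to produce, here is a minimal sketch of one BLOCKER (a retried request creating duplicate records) together with a small fix. Everything in it is hypothetical and exists only for illustration: the `Ledger` class, the `charge_unsafe` and `charge_idempotent` handlers, the `deliver_with_retry` helper, and the idea of a caller-supplied idempotency key. An in-memory dictionary stands in for what would normally be a unique constraint in a real datastore.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """In-memory stand-in for a persistence layer (hypothetical)."""
    charges: list = field(default_factory=list)
    seen_keys: dict = field(default_factory=dict)


def charge_unsafe(ledger, customer_id, amount_cents):
    # BLOCKER: each retry of the same logical request appends another charge.
    ledger.charges.append((customer_id, amount_cents))
    return len(ledger.charges) - 1


def charge_idempotent(ledger, idempotency_key, customer_id, amount_cents):
    # Small fix: deduplicate on a caller-supplied key and return the
    # original result for repeats. In a real database this check and the
    # write would need to be atomic (for example, via a unique constraint).
    if idempotency_key in ledger.seen_keys:
        return ledger.seen_keys[idempotency_key]
    ledger.charges.append((customer_id, amount_cents))
    charge_id = len(ledger.charges) - 1
    ledger.seen_keys[idempotency_key] = charge_id
    return charge_id


def deliver_with_retry(handler, attempts=3):
    """Simulate a client resending the same request because responses were lost."""
    result = None
    for _ in range(attempts):
        result = handler()
    return result


if __name__ == "__main__":
    unsafe = Ledger()
    deliver_with_retry(lambda: charge_unsafe(unsafe, "c1", 500))
    print("unsafe charges:", len(unsafe.charges))

    safe = Ledger()
    deliver_with_retry(lambda: charge_idempotent(safe, "req-1", "c1", 500))
    print("idempotent charges:", len(safe.charges))
```

In a review written against this prompt, the unsafe handler would be reported as a BLOCKER with the concrete scenario "client times out and retries; customer is charged more than once", and the idempotency key would be the proposed fix.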