arun0009/production-failure-audit.md

Created January 29, 2026 04:02

Workflow: Production Failure & Blast Radius Audit

When to use

Before merging a PR
Before deploying to production
Any change touching APIs, persistence, queues, or external services

Goal

Identify failure modes that only show up under real production conditions: retries, partial failures, concurrency, and restarts.
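Example (illustrative, not part of the original workflow)

As a minimal sketch of a failure mode that only appears under retries, the snippet below compares a handler that applies a charge on every call with one that deduplicates on an idempotency key. `PaymentService`, `charge_unsafe`, `charge_idempotent`, `send_with_retry`, and the key `req-42` are hypothetical names invented for this illustration; a real service would likely keep the key in durable storage rather than in memory.

```python
class PaymentService:
    """Hypothetical in-memory service, used only to illustrate the failure mode."""

    def __init__(self):
        self.charges = []   # every charge that was actually applied
        self._seen = {}     # idempotency key -> result of the first attempt

    def charge_unsafe(self, account, amount):
        # Every call applies a new charge, so a retried request double-charges.
        self.charges.append((account, amount))
        return len(self.charges)

    def charge_idempotent(self, key, account, amount):
        # A repeated key returns the first result instead of charging again.
        if key in self._seen:
            return self._seen[key]
        self.charges.append((account, amount))
        self._seen[key] = len(self.charges)
        return self._seen[key]


def send_with_retry(call, attempts=2):
    """Simulate a client that never sees the first response and retries."""
    result = None
    for _ in range(attempts):
        result = call()
    return result


unsafe = PaymentService()
send_with_retry(lambda: unsafe.charge_unsafe("acct-1", 100))

safe = PaymentService()
send_with_retry(lambda: safe.charge_idempotent("req-42", "acct-1", 100))

# The unsafe handler records two charges for one logical request;
# the idempotent handler records one.
print(len(unsafe.charges), len(safe.charges))
```

In the terms of this workflow, the unsafe handler would probably be classified as a BLOCKER (data duplication on retry), and the key check is one candidate for the smallest possible fix.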

Prompt (paste into Cascade)

You are reviewing this code as if it is already running in production under real load.

Assume the following WILL happen:

requests will be retried (clients, load balancers, frameworks)
downstream services will time out or partially fail
duplicate requests or events will occur
the process may crash or restart mid-execution

Analyze this code and identify:

Failure modes (what can go wrong)
Blast radius (how far the failure spreads)
Data safety risks (duplication, corruption, loss)

For each issue, classify it as:

BLOCKER – can cause production incidents or data loss
RISK – works now but unsafe at scale
OK – failure is contained or self-healing

For each BLOCKER or RISK:

describe a concrete failure scenario
propose the smallest possible fix

Focus on real behavior, not style or formatting.
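
Example finding (illustrative, not part of the prompt)

To show what "a concrete failure scenario" and "the smallest possible fix" might look like, here is a sketch of a crash between two writes. It uses Python's standard `sqlite3` module with an in-memory database; the `accounts` table, the `Crash` exception (standing in for a process restart), and the function names are assumptions made up for this example.

```python
import sqlite3


class Crash(Exception):
    """Stands in for the process dying mid-execution."""


def make_db():
    # isolation_level=None puts the connection in autocommit mode,
    # so transactions exist only where the code opens them explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    return conn


def transfer_unsafe(conn, amount, crash=False):
    # Failure scenario: the debit is committed on its own, so a crash
    # before the credit loses money.
    conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = 'a'", (amount,))
    if crash:
        raise Crash()
    conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = 'b'", (amount,))


def transfer_atomic(conn, amount, crash=False):
    # Smallest fix: put both writes in one transaction.
    conn.execute("BEGIN")
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = 'a'", (amount,))
        if crash:
            raise Crash()
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = 'b'", (amount,))
        conn.execute("COMMIT")
    except Exception:
        # The simulated crash is an exception, so the rollback is explicit here.
        conn.execute("ROLLBACK")
        raise


def total(conn):
    return conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]


broken = make_db()
try:
    transfer_unsafe(broken, 40, crash=True)
except Crash:
    pass

fixed = make_db()
try:
    transfer_atomic(fixed, 40, crash=True)
except Crash:
    pass

# Total balance after the simulated crash: unsafe version, then atomic version.
print(total(broken), total(fixed))
```

Both databases start with a total balance of 100. After the simulated crash the unsafe version is left at 60, while the atomic version still holds 100. A review following this workflow might report the unsafe version as a BLOCKER, with the failure scenario "process restarts after the debit, before the credit" and the single transaction as the proposed fix.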
