@arun0009
Created January 29, 2026 04:02
Workflow: Production Failure & Blast Radius Audit

When to use

  • Before merging a PR
  • Before deploying to production
  • Any change touching APIs, persistence, queues, or external services

Goal

Identify failure modes that only show up under real production conditions: retries, partial failures, concurrency, and restarts.


Prompt (paste into Cascade)

You are reviewing this code as if it is already running in production under real load.

Assume the following WILL happen:

  • requests will be retried (clients, load balancers, frameworks)
  • downstream services will time out or partially fail
  • duplicate requests or events will occur
  • the process may crash or restart mid-execution

Analyze this code and identify:

  1. Failure modes (what can go wrong)
  2. Blast radius (how far the failure spreads)
  3. Data safety risks (duplication, corruption, loss)

For each issue, classify it as:

  • BLOCKER – can cause production incidents or data loss
  • RISK – works now but unsafe at scale
  • OK – failure is contained or self-healing

For each BLOCKER or RISK:

  • describe a concrete failure scenario
  • propose the smallest possible fix

Focus on real behavior, not style or formatting.
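
Example finding (illustration only, not part of the prompt)

To make the classification concrete, here is a minimal sketch of the kind of BLOCKER the audit is meant to surface, together with a smallest-possible fix. Everything in it is hypothetical: `PaymentService`, `charge_unsafe`, `charge_idempotent`, and `deliver_twice` are invented for illustration, and the in-memory list and dict only stand in for a real database table and a deduplication store.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Hypothetical in-memory service, used only to illustrate the audit."""

    charges: list = field(default_factory=list)  # stands in for a payments table
    seen: dict = field(default_factory=dict)     # idempotency key -> charge id

    def charge_unsafe(self, customer: str, cents: int) -> int:
        # BLOCKER: every delivery of the same request appends a new charge,
        # so a retried request bills the customer twice.
        self.charges.append((customer, cents))
        return len(self.charges) - 1

    def charge_idempotent(self, key: str, customer: str, cents: int) -> int:
        # Smallest fix: if this idempotency key was already processed,
        # return the earlier result instead of charging again.
        if key in self.seen:
            return self.seen[key]
        self.charges.append((customer, cents))
        charge_id = len(self.charges) - 1
        self.seen[key] = charge_id
        return charge_id


def deliver_twice(call):
    """Simulate a client or load balancer retrying after a lost response."""
    return [call(), call()]


if __name__ == "__main__":
    unsafe = PaymentService()
    deliver_twice(lambda: unsafe.charge_unsafe("cust-1", 500))
    print("unsafe charges:", len(unsafe.charges))

    safe = PaymentService()
    deliver_twice(lambda: safe.charge_idempotent("req-1", "cust-1", 500))
    print("idempotent charges:", len(safe.charges))
```

In a real service the key lookup and the write would need to be atomic, for example through a unique constraint on the key column, since the check-then-insert shown here is itself unsafe under concurrency. That gap is the sort of item the audit would likely classify as RISK.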
