@eduardo-matos
Last active February 21, 2026 14:32
An example incident-response playbook for a software engineering team.

Phase 1: Identification & Triage

The goal here is to determine whether the "blip" is a genuine emergency and how many users it is affecting.

  • Declare the Incident: If a service is down, data is corrupted, or security is breached, officially declare an incident in your communication channel (e.g., Slack/Teams).

  • Assign Key Roles:

    • Incident Commander (IC): Leads the response, makes final decisions, and keeps the team focused.
    • Communications Lead: Handles updates to stakeholders and status pages.
    • Operations/SME: The engineers actually digging into the code or infrastructure.
  • Set Severity Level:

    • SEV1: Critical system down, total data loss, or security breach.
    • SEV2: Major functionality impaired for a large subset of users.
    • SEV3: Minor bug or performance degradation.
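The severity ladder above can be sketched as a small classifier. This is an illustrative sketch only: the field names and the "large subset" threshold are assumptions, not part of the playbook.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignal:
    """Signals observed during triage. Fields are illustrative assumptions."""
    system_down: bool = False
    data_loss: bool = False
    security_breach: bool = False
    users_affected_pct: float = 0.0  # share of users impacted, 0-100


def classify_severity(signal: IncidentSignal) -> str:
    """Map triage signals to the SEV levels defined in the playbook."""
    if signal.system_down or signal.data_loss or signal.security_breach:
        return "SEV1"  # critical system down, total data loss, or breach
    if signal.users_affected_pct >= 25:  # "large subset" cutoff is an assumption
        return "SEV2"  # major functionality impaired for many users
    return "SEV3"  # minor bug or performance degradation
```

Encoding the ladder in code keeps severity decisions consistent across responders instead of being re-argued in the war room each time.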

Phase 2: Communication & Coordination

Chaos happens when everyone is working in silos.

  • Establish a "War Room": Create a dedicated meeting link or Slack channel.
  • Internal Updates: Provide a "Pulse" update every 30–60 minutes for SEV1/2 incidents.
  • External Updates: Update the public-facing status page. Rule of thumb: Be honest, but don't over-share technical gore until the root cause is confirmed.
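A "Pulse" update is easier to sustain when its shape is templated. Below is a minimal sketch of a message builder for the war-room channel; the cadence table mirrors the 30–60 minute rule above, while the message layout itself is an assumption.

```python
from datetime import datetime, timezone

# Cadence from the playbook: SEV1/2 get a pulse every 30-60 minutes.
PULSE_INTERVAL_MIN = {"SEV1": 30, "SEV2": 60}


def format_pulse(severity: str, summary: str, actions: list[str]) -> str:
    """Build an internal 'Pulse' update for the incident channel."""
    ts = datetime.now(timezone.utc).strftime("%H:%M UTC")
    lines = [
        f"[{severity}] Pulse update @ {ts}",
        f"Status: {summary}",
        "In progress:",
    ]
    lines += [f"  - {action}" for action in actions]
    if severity in PULSE_INTERVAL_MIN:
        lines.append(f"Next update in {PULSE_INTERVAL_MIN[severity]} min.")
    return "\n".join(lines)
```

The Communications Lead can paste the result into Slack or the status tooling; the fixed format also makes timeline reconstruction in the post-mortem much faster.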

Phase 3: Containment & Mitigation

The priority is to stop the bleeding, not necessarily to find the permanent "fix."

  • The "Revert" First Rule: If a recent deployment aligns with the start of the incident, roll back immediately. Do not try to "roll forward" with a quick fix unless a rollback is impossible.
  • Isolate: If a specific service is failing or under attack, use load balancers or feature flags to shunt traffic away from the "sick" component.
  • Scale Up: If the issue is resource exhaustion, throw more compute at it (horizontal scaling) to buy time for investigation.
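The feature-flag "shunt" described above amounts to a kill switch in front of the sick component. The sketch below uses a toy in-memory `FlagClient` as a stand-in for a real flag service such as LaunchDarkly; `call_recommendation_service` is a hypothetical downstream call, not a real API.

```python
class FlagClient:
    """Toy in-memory flag store; a stand-in for a real feature-flag service."""

    def __init__(self) -> None:
        self._flags: dict[str, bool] = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str, default: bool = True) -> bool:
        return self._flags.get(name, default)


flags = FlagClient()


def call_recommendation_service(user_id: str) -> list[str]:
    # Stand-in for the real (hypothetical) downstream service call.
    return [f"rec-1-for-{user_id}", f"rec-2-for-{user_id}"]


def fetch_recommendations(user_id: str) -> list[str]:
    """Call the downstream service unless its kill switch has been thrown."""
    if not flags.is_enabled("recommendations-enabled"):
        return []  # degraded-but-safe fallback while the component is sick
    return call_recommendation_service(user_id)


# During an incident, the IC flips the switch and traffic stops hitting
# the failing component instantly, with no deploy required:
#   flags.set("recommendations-enabled", False)
```

The key property is that the fallback path degrades gracefully (an empty list) instead of propagating the failure to callers.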

Phase 4: Resolution & Recovery

Once the system is stable, ensure it stays that way.

  • Verification: Use monitoring tools (Datadog, New Relic, etc.) to confirm that error rates have returned to baseline and latency is normal.
  • Staged Reintroduction: If traffic was diverted, slowly bleed it back in to ensure the system can handle the load.
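The staged reintroduction above can be sketched as a stepwise traffic ramp with a bail-out. The percentages, soak time, and the `set_traffic_weight`/`healthy` hooks are assumptions; in practice they would wire into your load balancer and monitoring.

```python
import time

# Illustrative ramp schedule: percent of traffic restored per stage.
RAMP_STEPS = [5, 25, 50, 100]


def staged_reintroduction(set_traffic_weight, healthy, soak_seconds=300):
    """Bleed traffic back in stepwise; divert it again if health degrades.

    set_traffic_weight: callable(pct) that shifts pct% of traffic back.
    healthy: callable() -> bool, e.g. "error rate at baseline, latency normal".
    """
    for pct in RAMP_STEPS:
        set_traffic_weight(pct)
        time.sleep(soak_seconds)  # let metrics settle before judging health
        if not healthy():
            set_traffic_weight(0)  # bail out: divert all traffic away again
            return False
    return True
```

Gating each step on the same baseline metrics used for verification (error rate, latency) is what prevents a premature "all clear" from re-triggering the incident.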

Phase 5: Post-Incident Review (Post-Mortem)

An incident is a terrible thing to waste. This should happen within 48–72 hours of the resolution.

  • Blame-Free Culture: Focus on process and systemic failures, not individual "human error."
  • Timeline Reconstruction: What happened, and when did we notice?
  • Action Items: Run a "Five Whys" analysis and create a Jira ticket for each resulting fix.
    • Example: "Why did the DB crash?" → "Because it ran out of memory." → "Why?" → "Because the leak wasn't caught in staging." → Action item: add memory-leak detection to the CI/CD pipeline.
  • Publish: Release the official postmortem document and share it with your peers (or publicly, where appropriate).

Examples of tools

  • Alerting: incident.io, PagerDuty, Opsgenie
  • Observability: Grafana, Prometheus, Honeycomb
  • Feature Flags: LaunchDarkly (for instant "kill switches")
  • Communication: Slack, Zoom, Statuspage.io