The goal here is to determine if the "blip" is a genuine emergency and how many people it’s affecting.
Declare the Incident: If a service is down, data is corrupted, or security is breached, officially declare an incident in your communication channel (e.g., Slack/Teams).
Assign Key Roles:
- Incident Commander (IC): Leads the response, makes final decisions, and keeps the team focused.
- Communications Lead: Handles updates to stakeholders and status pages.
- Operations/SME: The engineers actually digging into the code or infrastructure.
Set Severity Level:
- SEV1: Critical system down, total data loss, or security breach.
- SEV2: Major functionality impaired for a large subset of users.
- SEV3: Minor bug or performance degradation.
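The severity definitions above can be captured in a small triage helper. This is an illustrative sketch, not standard tooling: the function name and the 20% "large subset" threshold are assumptions you'd tune to your own org.

```python
def classify_severity(system_down: bool, data_loss: bool,
                      security_breach: bool, pct_users_affected: float) -> str:
    """Map incident impact to a severity level per the definitions above.

    The 20% cutoff for "a large subset of users" is a hypothetical
    threshold; pick whatever your on-call policy actually defines.
    """
    if system_down or data_loss or security_breach:
        return "SEV1"  # critical system down, total data loss, or breach
    if pct_users_affected >= 20.0:
        return "SEV2"  # major functionality impaired for many users
    return "SEV3"      # minor bug or performance degradation
```

Encoding the thresholds in code (or a runbook table) keeps the 3 a.m. triage decision mechanical instead of a debate.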
Chaos happens when everyone is working in silos.
- Establish a "War Room": Create a dedicated meeting link or Slack channel.
- Internal Updates: Provide a "Pulse" update every 30–60 minutes for SEV1/2 incidents.
- External Updates: Update the public-facing status page. Rule of thumb: Be honest, but don't over-share technical gore until the root cause is confirmed.
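A consistent template makes pulse updates fast to write and easy to scan. A minimal sketch, assuming you post plain text to Slack/Teams; the function and field names are invented for illustration:

```python
from datetime import datetime, timezone

def pulse_update(sev: str, summary: str, actions: list[str], next_by: str) -> str:
    """Format a recurring internal "pulse" update for the incident channel."""
    ts = datetime.now(timezone.utc).strftime("%H:%M UTC")
    lines = [
        f"[{sev}] Pulse update @ {ts}",
        f"Status: {summary}",
        "In progress:",
    ]
    lines += [f"  - {action}" for action in actions]
    lines.append(f"Next update by: {next_by}")
    return "\n".join(lines)
```

The "Next update by" line matters most: it stops stakeholders from pinging the war room for news.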
The priority is to stop the bleeding, not necessarily to find the permanent "fix."
- The "Revert" First Rule: If a recent deployment aligns with the start of the incident, roll back immediately. Do not try to "roll forward" with a quick fix unless a rollback is impossible.
- Isolate: If a specific service is failing or under attack, use load balancers or feature flags to shunt traffic away from the "sick" component.
- Scale Up: If the issue is resource exhaustion, throw more compute at it (horizontal scaling) to buy time for investigation.
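The priority order above can be sketched as a decision helper. This is a hypothetical illustration of the logic, not a real automation API; in practice the "actions" it returns map to your deploy tool, load balancer, or autoscaler:

```python
def choose_mitigation(recent_deploy: bool, rollback_possible: bool,
                      component_failing: bool, resource_exhausted: bool) -> str:
    """Pick the first mitigation to try; the ordering encodes priority."""
    if recent_deploy and rollback_possible:
        return "rollback"     # revert-first rule: undo the suspect deploy
    if component_failing:
        return "isolate"      # shunt traffic away from the sick component
    if resource_exhausted:
        return "scale_out"    # add capacity to buy investigation time
    return "investigate"      # no obvious mitigation; keep digging
```

Note that rollback wins even when scaling would also help: it is the cheapest action to reverse if it turns out to be wrong.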
Once the system is stable, ensure it stays that way.
- Verification: Use monitoring tools (Datadog, New Relic, etc.) to confirm that error rates have returned to baseline and latency is normal.
- Staged Reintroduction: If traffic was diverted, slowly bleed it back in to ensure the system can handle the load.
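The staged reintroduction above can be sketched as a ramp with a health gate. This assumes you can set a traffic percentage (via a load balancer or feature flag) and poll a health check; both callables are placeholders for your real infrastructure:

```python
import time

def staged_reintroduction(set_traffic_pct, is_healthy,
                          steps=(10, 25, 50, 100), soak_seconds=0):
    """Bleed traffic back in stages, aborting if any health check fails.

    set_traffic_pct: callable taking a percentage (hypothetical LB/flag hook).
    is_healthy: callable returning True if error rates/latency are at baseline.
    """
    for pct in steps:
        set_traffic_pct(pct)
        time.sleep(soak_seconds)   # let the new load soak before judging it
        if not is_healthy():
            set_traffic_pct(0)     # bail out: divert traffic away again
            return False
    return True
```

The soak period is the point: a service that survives 10% for a minute may still fall over at 50%, so check after every step, not just at the end.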
An incident is a terrible thing to waste. This should happen within 48–72 hours of the resolution.
- Blame-Free Culture: Focus on process and systemic failures, not individual "human error."
- Timeline Reconstruction: What happened, and when did we notice?
- Action Items: Run the "Five Whys" and create a Jira ticket for each resulting fix.
- Example: "Why did the DB crash?" → "Because it ran out of memory." → "Why?" → "Because the leak wasn't caught in staging." → Action Item: Add memory leak detection to the CI/CD pipeline.
- Postmortem: Publish an official postmortem and share it with your peers.