In the past two weeks, we've held two Availability Review meetings featuring excellent presenters. These meetings facilitated fruitful discussions on how we can reflect on and learn from customer incidents. (In case you missed any, you can find the recordings for 08-27 and 09-04.)
To enhance the efficiency of our AR meetings, here's a guide on the current Availability Review process and how AR issues should be completed. We're also integrating GHES-specific requirements into overall GitHub automations. Future improvements are expected to ease and eliminate more manual steps.
| # | Step | Info |
|---|---|---|
| 1 | Availability Review created at end of GHES SEV 1 | this will be automated in future |
| 2 | Incident DRI fills out the Availability Review | create an issue |
| 3 | DRI creates incident repair item | incident repair item board |
| 4 | Assigned Availability Champion reviews then approves Availability Review | open AR issues |
| 5 | Incident DRI presents incident to team | scheduling |
- Make sure the `ghes`, `incident`, and `in-progress` labels are added to the issue (this will be automated in future).
- Replace all `TODO` fields with relevant information.
- Add links to the issue, channel, docs, and other artifacts produced during incident handling.
- Win: reflect on what went well in how we responded to this issue: how we were alerted to the problem, the tooling that supported the process, playbooks, automations, and individuals' contributions.
- Incident Timeline: a detailed incident timeline is expected, starting from when the customer reports the issue until the engineer disengages. The Incident Commander (aka the GHES Manager Oncall) should have already created the timeline while handling the SEV 1. This information can be provided either as a link to a file in `Timeline/Event Document link` or in the section `How did we respond to the incident?`.
- Follow up issues: incident repair items or tasks for runbooks should be created as part of filling out the availability review issue. Add an `incident-repair` label and ensure the incident repair items are linked in the Availability issue's `Short Term Repair Items` or `Long Term Repair Items` section so they can be tracked in the availability review board. Short Term items should be work we NEED to complete within the next Patch cycle; Long Term items should be within the next Feature release (or two).
- Availability Risk Registry Items: identify items that you would like to propose be tracked as a risk during the Availability Review. These are items that, if not repaired, have a high likelihood of recurrence, have recurred multiple times already, or have the potential to cause data loss or long-running outages for our customers. Items that require fan-out work and funding discussions can also be added here if need be. These issues are triaged and assigned by the Availability Champions weekly.
- Check the Ready for Review `review:ready` box when the content is complete to signal the Availability Champion to review the issue.
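Until the label and `TODO` steps are automated, the first two checklist items can be checked mechanically. The following is only a rough sketch in Python, not the actual GitHub automation: the function name `ar_issue_problems` and the idea of passing the issue's labels and body text in as plain values are assumptions made for illustration.

```python
# Hypothetical helper: NOT part of any existing GitHub automation.
# Labels the checklist says every AR issue should carry.
REQUIRED_LABELS = {"ghes", "incident", "in-progress"}


def ar_issue_problems(labels, body):
    """Return a list of problems; an empty list means these two checks pass."""
    problems = []

    # Checklist item: the ghes, incident, and in-progress labels are present.
    missing = REQUIRED_LABELS - set(labels)
    if missing:
        problems.append("missing labels: " + ", ".join(sorted(missing)))

    # Checklist item: every TODO field has been replaced.
    todo_count = body.count("TODO")
    if todo_count:
        problems.append(f"{todo_count} TODO field(s) still unfilled")

    return problems


if __name__ == "__main__":
    # Example: one label missing and one TODO left in the body.
    print(ar_issue_problems(["ghes", "incident"], "Impact: TODO\nTimeline: see doc"))
```

An empty result only means the labels and `TODO` fields look complete; the timeline, repair items, and risk registry sections still need a human read before ticking `review:ready`.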
- We have two time slots for the biweekly availability review meeting, alternating between:
  - Tuesdays 10am EDT/4pm CET
  - Wednesdays 4pm PDT/7pm EDT
- The presentation will be scheduled after the AR issue is reviewed and approved by the Availability Champion.
Thanks to @lindarodgers and @thorrsson for providing direction and clarification on our process!
We're still in the early stages of forming our culture and procedures. Everyone in GHES is part of this journey, and we welcome any suggestions you may have.
Process in a blick -- is this a typo? Do you mean blink? Or is blick a cool German word that I don't know about?
I like the checklist for filling out the AR issue. For the incident timeline, please add a note that the Incident Commander (aka the GHES Manager Oncall) should have already created the timeline.
The incident repair items only need the `incident-repair` label added. The manager for that AOR will determine priority.

Agreed -- we need direction from Tim on the `Availability Risk Registry` section and how it applies to GHES.