A practical guide to the Availability Review process

In the past two weeks, we've held two Availability Review meetings featuring excellent presenters. These meetings facilitated fruitful discussions on how we can reflect on and learn from customer incidents. (In case you missed either, you can find the recordings for 08-27 and 09-04.) 🤗

To make our AR meetings more efficient, here's a guide to the current Availability Review process and how AR issues should be completed. We're also integrating GHES-specific requirements into the overall GitHub automations. Future improvements are expected to simplify or eliminate more of the manual steps. 😌

🐾 Process at a glance:

#	Step	Info
1	Availability Review is created at the end of a GHES SEV 1	this will be automated in the future
2	Incident DRI fills out the Availability Review	create an issue
3	Incident DRI creates incident repair items	incident repair item board
4	Assigned Availability Champion reviews, then approves, the Availability Review	open AR issues
5	Incident DRI presents the incident to the team	scheduling

📋 Checklist for filling out the Availability Review issue

Make sure the ghes, incident, and in-progress labels are added to the issue (this will be automated in the future).
Replace all TODO fields with relevant information.
Add links to the issue, channel, docs, and other artifacts produced during incident handling.
Win: reflect on what went well in how we responded to this issue: how we were alerted to the problem, the tooling that supported the process, playbooks, automations, and individuals' contributions.
Incident Timeline: provide a detailed incident timeline, starting from when the customer reported the issue and ending when engineers disengaged. The Incident Commander (aka the GHES Manager On-call) should already have created the timeline while handling the SEV 1. This information can be provided either as a link to a file in Timeline/Event Document link or in the section How did we respond to the incident?
Follow-up issues: incident repair items or tasks for runbooks should be created as part of filling out the Availability Review issue. Add an incident-repair label and ensure the incident repair items are linked in the Availability Review issue's Short Term Repair Items or Long Term Repair Items section so they can be tracked in the availability review board. Short Term items should be work we NEED to complete within the next Patch cycle; Long Term items should be completed within the next Feature release (or two).
Availability Risk Registry Items: identify items that you would like to propose be tracked as a risk during the Availability Review. These are issues that, if not repaired, have a high likelihood of recurring, have already recurred multiple times, or have the potential to cause data loss or long-running outages for our customers. Items that require fan-out work and funding discussions can also be added here if need be. These issues are triaged and assigned by the Availability Champions weekly.
Check the Ready for Review review:ready box when the content is complete to signal the Availability Champion to review the issue.
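
Until the label step is automated, the first two checklist items can be pre-checked with a small script. The following is only a rough sketch under stated assumptions: the function name check_ar_issue is hypothetical, the issue is assumed to be available as a list of label names plus the raw body text, and nothing here reflects the actual GitHub automation.

```python
import re

# Labels named in the checklist above.
REQUIRED_LABELS = {"ghes", "incident", "in-progress"}


def check_ar_issue(labels, body):
    """Return a list of problems found; an empty list means none were found.

    `labels` is a list of label names and `body` is the issue text.
    Both inputs are assumptions about how the issue data is supplied.
    """
    problems = []

    # Checklist item 1: all required labels must be present.
    missing = REQUIRED_LABELS - set(labels)
    if missing:
        problems.append("missing labels: " + ", ".join(sorted(missing)))

    # Checklist item 2: no TODO placeholders may remain in the body.
    todo_count = len(re.findall(r"\bTODO\b", body))
    if todo_count:
        problems.append(f"{todo_count} TODO placeholder(s) still present")

    return problems
```

Calling check_ar_issue(["ghes"], "Win: TODO") would report both the missing labels and the leftover placeholder. The remaining checklist items (timeline quality, repair items, risk registry proposals) need human judgment and are deliberately left out.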

📆 Schedule the presentation of Availability Review issues

We have two time slots for the biweekly Availability Review meeting, alternating between:

Tuesdays 10am EDT/4pm CEST

Wednesdays 4pm PDT/7pm EDT

The presentation will be scheduled after the AR issue is reviewed and approved by the Availability Champion.
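
If you want to double-check what a slot means in your own time zone, here is a minimal sketch of the conversion. It assumes fixed daylight-saving offsets (EDT = UTC-4, PDT = UTC-7, and UTC+2 for Central European summer time), and the two dates are made-up examples of a Tuesday and a Wednesday, not actual scheduled meetings.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, valid only while daylight saving time is in effect (assumption).
EDT = timezone(timedelta(hours=-4), "EDT")
PDT = timezone(timedelta(hours=-7), "PDT")
CEST = timezone(timedelta(hours=2), "CEST")


def convert(slot, target_tz):
    """Express a timezone-aware meeting slot in another time zone."""
    return slot.astimezone(target_tz)


# Example dates only: 2024-09-10 is a Tuesday, 2024-09-18 is a Wednesday.
tuesday_slot = datetime(2024, 9, 10, 10, 0, tzinfo=EDT)
wednesday_slot = datetime(2024, 9, 18, 16, 0, tzinfo=PDT)

for slot in (tuesday_slot, wednesday_slot):
    for tz in (EDT, PDT, CEST):
        print(slot.strftime("%H:%M %Z"), "is", convert(slot, tz).strftime("%Y-%m-%d %H:%M %Z"))
```

Note that the Wednesday slot falls on Thursday morning for Central European attendees, which is one reason the two slots alternate.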

Thanks to @lindarodgers and @thorrsson for providing direction and clarification on our process!💝

We're still in the early stages of forming our culture and procedures. Everyone in GHES is part of this journey, and we welcome any suggestions you may have.

Suggested changes to these sections

Follow-up issues: incident repair items or tasks to create runbooks should be created as part of filling out the Availability Review issue. Add an incident-repair label and ensure the incident repair items are linked in the Availability Review issue so they can be tracked in the availability review board. Short Term items should be work we NEED to complete within the next Patch cycle; Long Term items should be completed within the next Feature release (or two).

and

Availability Risk Registry Items: identify items that you would like to propose be tracked as a risk during the Availability Review. These are issues that, if not repaired, have a high likelihood of recurring, have already recurred multiple times, or have the potential to cause data loss or long-running outages for our customers. Items that require fan-out work and funding discussions can also be added here if need be. These issues are triaged and assigned by the Availability Champions weekly.

zheng022/ar-process.md
