@zheng022
Last active September 9, 2024 19:23
A practical guide to the Availability Review process

In the past two weeks, we've held two Availability Review meetings featuring excellent presenters. These meetings led to fruitful discussions on how we can reflect on and learn from customer incidents. (In case you missed any, you can find the recordings for 08-27 and 09-04.) 🤗

To make our AR meetings more efficient, here's a guide to the current Availability Review process and how AR issues should be completed. We're also integrating GHES-specific requirements into the overall GitHub automations. Future improvements are expected to ease or eliminate more of the manual steps. 😌

🐾 Process at a glance:

| # | Step | Info |
|---|------|------|
| 1 | Availability Review created at end of GHES SEV 1 | this will be automated in the future |
| 2 | Incident DRI fills out the Availability Review | create an issue |
| 3 | DRI creates incident repair items | incident repair item board |
| 4 | Assigned Availability Champion reviews, then approves, the Availability Review | open AR issues |
| 5 | Incident DRI presents the incident to the team | scheduling |
📋 Checklist for filling out the Availability Review issue

  • Make sure the ghes, incident, and in-progress labels are added to the issue (this will be automated in the future).
  • Replace all TODO fields with relevant information.
  • Add links to the issues, channels, docs, and other artifacts produced during incident handling.
  • Win: reflect on what went well in how we responded to this issue: how we were alerted to the problem, the tooling that supported the process, playbooks, automations, and individuals' contributions.
  • Incident Timeline: provide a detailed timeline, from the moment the customer reported the issue until engineers disengaged. The Incident Commander (aka the GHES Manager Oncall) should have already created the timeline while handling the SEV 1. This information can be provided either as a link to a file in Timeline/Event Document link or in the section How did we respond to the incident?
  • Follow up issues: incident repair items or tasks to create runbooks should be created as part of filling out the availability review issue. Add an incident-repair label and ensure the incident repair items are linked in the Availability issue's Short Term Repair Items § or Long Term Repair Items § so they can be tracked in the availability review board. Short Term items should be work we NEED to complete within the next Patch cycle; Long Term items should land within the next Feature release (or two).
  • Availability Risk Registry Items: identify items that you would like to propose be tracked as a risk during the Availability Review. These are fixes that, if not made, leave a high likelihood of recurrence, cover problems that have already recurred multiple times, or address the potential for data loss or long-running outages for our customers. Items that require fan-out work and funding discussions can also be added here if need be. These issues are triaged and assigned by the Availability Champions weekly.
  • Check the Ready for Review review:ready box when the content is complete to signal the Availability Champion to review the issue.
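
A few of the checklist items above are mechanical enough to check with a script. The following is a minimal sketch, not part of any existing GitHub automation: it assumes the issue has already been fetched into a plain dict with hypothetical `labels` and `body` keys, and that the ready box appears in the body as a Markdown task-list item containing `review:ready`. The function name `check_ar_issue` is made up for illustration.

```python
import re

# Labels the checklist requires on every Availability Review issue.
REQUIRED_LABELS = {"ghes", "incident", "in-progress"}

# Assumption: the "Ready for Review" box is a Markdown task-list item
# whose text contains "review:ready".
READY_BOX = re.compile(r"^\s*[-*] \[[xX]\].*review:ready", re.MULTILINE)


def check_ar_issue(issue):
    """Return a list of human-readable problems; an empty list means none found.

    `issue` is a hypothetical dict: {"labels": [...], "body": "..."}.
    """
    problems = []

    # Checklist item: required labels are present.
    missing = REQUIRED_LABELS - set(issue.get("labels", []))
    if missing:
        problems.append("missing labels: " + ", ".join(sorted(missing)))

    # Checklist item: no TODO placeholders remain in the body.
    body = issue.get("body", "")
    todo_count = len(re.findall(r"\bTODO\b", body))
    if todo_count:
        problems.append(f"{todo_count} TODO field(s) still unfilled")

    # Checklist item: the ready box is ticked.
    if not READY_BOX.search(body):
        problems.append("Ready for Review box is not checked")

    return problems
```

The remaining items (wins, timeline, repair items, risk registry proposals) need human judgment, so the sketch deliberately leaves them out.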

📆 Schedule the presentation of Availability Review issues

  • We have two time slots for the biweekly Availability Review meeting, alternating between:

Tuesdays 10am EDT/4pm CEST

Wednesdays 4pm PDT/7pm EDT

  • Presentations are scheduled after the AR issue has been reviewed and approved by an Availability Champion.
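
To make the alternation concrete, here is a small sketch that lists upcoming slots. It rests on assumptions not stated in this guide: that each slot recurs every two weeks, offset by one week from the other, and that the two meetings mentioned in the introduction (08-27 and 09-04, taken to be in 2024) are the anchors. The names `SLOTS` and `upcoming_meetings` are hypothetical; adjust the anchors if the real calendar differs.

```python
from datetime import date, timedelta

# Assumed anchors: the two meetings mentioned in the introduction,
# taken to be in 2024. Each slot is assumed to repeat every 14 days.
SLOTS = [
    (date(2024, 8, 27), "Tuesday 10am EDT"),
    (date(2024, 9, 4), "Wednesday 4pm PDT"),
]


def upcoming_meetings(after, count):
    """Return the next `count` (date, slot) pairs strictly after `after`."""
    meetings = []
    for anchor, label in SLOTS:
        day = anchor
        # Advance this slot to its first occurrence after `after`.
        while day <= after:
            day += timedelta(days=14)
        # Collect enough occurrences from each slot to cover `count`.
        for i in range(count):
            meetings.append((day + timedelta(days=14 * i), label))
    # Merge both slots chronologically and keep the earliest `count`.
    return sorted(meetings)[:count]
```

For example, `upcoming_meetings(date(2024, 9, 9), 3)` would give a Tuesday, then a Wednesday, then a Tuesday under these assumptions.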

Thanks to @lindarodgers and @thorrsson for providing direction and clarification on our process!

We're still in the early stages of forming our culture and procedures. Everyone in GHES is part of this journey, and we welcome any suggestions you may have.

@lindarodgers

Process in a blick -- is this a typo? Do you mean blink? Or is blick a cool German word that I don't know about?

I like the checklist for filling out the AR issue. For the incident timeline, please add a note that the Incident Commander (aka the GHES Manager Oncall) should have already created the timeline.

The incident repair items only need the incident-repair label added. The manager for that AOR will determine priority.

Agreed -- we need direction from Tim on the Availability Risk Registry section and how it applies to GHES.

@thorrsson

thorrsson commented Sep 6, 2024

Suggested changes to these sections

Follow up issues: incident repair items or tasks to create runbooks should be created as part of filling out the availability review issue. Add an incident-repair label and ensure the incident repair items are linked in the Availability issue so they can be tracked in the availability review board. Short Term items should be work we NEED to complete within the next Patch cycle; Long Term items should be within the next Feature release (or two).

and

Availability Risk Registry Items: Identify items that you would like to propose be tracked as a risk during the Availability review. These are fixes that, if not repaired, have a high likelihood of recurrence, have recurred multiple times now, or have the potential to cause data loss or long-running outages for our customers. Items that require fan-out work and funding discussions can also be added here if need be. These issues are triaged and assigned by the Availability Champions weekly.

@zheng022
Author

zheng022 commented Sep 9, 2024

Process in a blick -- is this a typo? Do you mean blink? Or is blick a cool German word that I don't know about?

😅 God, exactly. "Auf einen Blick" is German for "at a glance". 🙈

@lindarodgers

This looks great, @zheng022! I love that you added the first paragraph to promote the GHES Availability Review meetings.

😅 God, exactly. "Auf einen Blick" is German for "at a glance". 🙈

And to think that my mother told me in 9th grade that I should NOT take German because I'd never use it!!
