SRE@Xero: Managing Incidents Part III

Karthik Nilakant · Published in Xero Developer · May 18, 2018


In this final part of my series on incident management (see part one and part two), I’ll be focusing on the post-incident review, which we refer to as the “postmortem process”. I think this is a misnomer, because by definition, anything that died during an incident must have been brought back to life before the incident came to a close. However, “post-resurrection process” seems like a bit of a mouthful.

In many ways, what happens after an incident is the most critical part, because without improvement we risk repeating the cycle. In this post, I’ll cover the parts that we automate, how we structure the review process and how we analyse our progress over time.

Templates and timelines

We have two key objectives for any postmortem that we run:

  1. Ensure that we capture as much knowledge from the incident as we can, both from a “cause” and “effect” perspective.
  2. Ensure that we come up with a set of actions that will prevent the incident from recurring.

Sometimes the knowledge capture process will feed into the action plan, and sometimes the action plan will produce additional knowledge.

Time is of the essence — postmortem activities have more momentum when the incident is still fresh in everyone’s mind. We aim to limit postmortem activities to two months; anything that needs more time is incorporated into product teams’ longer-term strategic planning.

A key person in the management of a postmortem is the postmortem facilitator — this doesn’t need to be someone with a deep technical understanding of the incident, but ultimately the facilitator is responsible for ensuring we capture sufficient knowledge and act on what we have found. For each postmortem, this process begins with a kickoff meeting.

Asking the hard questions

At the kickoff meeting, the facilitator guides a team of key technical and non-technical people involved in the incident through the knowledge capture process.

We use our chat bot (Multivac) to generate a postmortem template for each incident, pre-filled with a number of key details from the incident to help us jog our memory. The sketch below gives a rough idea of what that generated document covers.
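
As a rough illustration (not Multivac’s actual implementation), here’s a minimal sketch of how a bot might stamp out a postmortem template pre-filled with incident details. The `Incident` fields and the section headings are assumptions, loosely based on the question areas listed later in this post.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical incident record -- the field names are illustrative only.
@dataclass
class Incident:
    key: str
    title: str
    started: datetime
    resolved: datetime
    responders: list

def render_postmortem_template(incident: Incident) -> str:
    """Stamp out a postmortem document pre-filled with incident details."""
    sections = ["Triggers", "Detection", "Causes", "Response", "Resolution"]
    lines = [
        f"# Postmortem: {incident.key} - {incident.title}",
        f"**Started:** {incident.started:%Y-%m-%d %H:%M} UTC",
        f"**Resolved:** {incident.resolved:%Y-%m-%d %H:%M} UTC",
        f"**Responders:** {', '.join(incident.responders)}",
        "",
        "## Summary",
        "_What happened, in one or two sentences?_",
    ]
    for section in sections:
        lines += ["", f"## {section}", "_To be completed at the kickoff meeting._"]
    lines += ["", "## Follow-up actions", "_Tracked as tickets in the backlog._"]
    return "\n".join(lines)
```

The real bot also drafts the kickoff invitation, but the principle is the same: automate the boring transcription so the facilitator can focus on the discussion.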

We include the list of responders in the invitation to the kickoff meeting. Multivac generates a template for this invitation automatically, but we leave it to the facilitator to find a suitable time and curate the guest list.

During the kickoff meeting, the facilitator primarily acts as a scribe, by gathering answers to the various fields in the postmortem template. In some cases, the facilitator will need to guide the discussion by asking more pointed questions. Over time, some of the more common questions I’ve found myself asking include:

  • Triggers: was this due to a software bug in a recent release? Was this a capacity-related issue?
  • Detection: have appropriate metrics, logs and alerts been set up? Do we need to add anything to our change / release management processes? Could this have been caught earlier in the UAT / QA process?
  • Causes: have the contributing factors been identified? Do we need to spend more time investigating? (Note the use of the plural here — we’ve found that trying to isolate a single root cause is usually not productive, echoing the sentiment of this post and this one.)
  • Response: was it easy to identify the components involved in the issue? Were the right teams engaged? Is additional access required so that more teams can support this service? Was the incident clearly communicated to the wider audience? Have we captured the entire timeline of the issue (i.e. the pre-/post-incident period)?
  • Resolution: was a workaround put in place to resolve this issue? What steps are involved in moving to a permanent solution? Can we create additional run books / documentation to help with this issue if it recurs?

In many cases, just asking a few of these questions will help to invigorate the conversation and ultimately generate more actions for follow-up. If we ask the right questions at the kickoff meeting, the list of follow-up tasks comes naturally.

Rather than track actions within the postmortem document, we generate action plans within our agile software management system, Jira. Generating Jira tickets for each action allows us to farm out postmortem activities to product team backlogs, and transition each task through their own workflows. The main task for the facilitator after this point is to ensure progress is made on each action, much like an agile team facilitator.
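
As a sketch of what that automation can look like, the snippet below files one Jira ticket per agreed action using Jira’s REST API. The base URL, credentials, project keys and incident key are all placeholders, and I’m not claiming this is exactly how our tooling does it.

```python
import os
import requests

JIRA_URL = "https://example.atlassian.net"   # placeholder instance
AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_TOKEN"])

def create_action_ticket(project_key: str, incident_key: str,
                         summary: str, description: str) -> str:
    """Create a Jira ticket for a single postmortem follow-up action."""
    payload = {
        "fields": {
            "project": {"key": project_key},        # owning team's backlog
            "issuetype": {"name": "Task"},
            "summary": f"[{incident_key}] {summary}",
            "description": description,
            "labels": ["postmortem", incident_key],  # used for later reporting
        }
    }
    resp = requests.post(f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=AUTH)
    resp.raise_for_status()
    return resp.json()["key"]   # e.g. "TEAM-123"

# One ticket per action, farmed out to the owning team's project:
actions = [
    ("PLAT", "Add alerting on queue depth", "See the postmortem document for context."),
    ("SRE", "Write a runbook for the failover workaround", "Covers the manual steps used during the incident."),
]
for project, summary, description in actions:
    print(create_action_ticket(project, "INC-042", summary, description))
```

In this sketch, labelling each ticket with its incident key is what would make the reporting described in the next section a straightforward Jira query.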

Taking care of business

The process I’ve outlined above allows us to put each incident under the microscope. In SRE, we’re also interested in the wider view: how are we doing on postmortems overall? Are any trends emerging? What have we achieved? Simply by virtue of keeping records of our incidents, together with summaries of causes and effects, we can use a range of simple tools to extract long-term trends and patterns:

  • Postmortem tracking — our postmortem tracking report shows us all postmortems that are currently in progress, and how many of the actions are still outstanding. We also use it to categorise incidents based on the main cause (for example, incidents caused by software releases versus third-party outages).
  • Monthly review — in our monthly review meeting, we select several in-progress postmortems to report to a wider audience. Bringing issues to the attention of teams across the business encourages adoption of better practices and tooling, so that we avoid hitting the same problems.
A simple “word cloud” built from six months of postmortem review documents. I have deliberately obfuscated certain terms to avoid disparaging any particular vendor.
  • Text analysis — the plain text that makes up our knowledge capture documents provides a wealth of information for text mining. For example, we can perform a frequency analysis of our incident summaries that shows us the systems and technologies that have been subject to reliability issues most frequently. Publishing this information helps internal teams in their decision-making process for adopting new technologies. (A sketch of this kind of analysis appears after this list.)
Ownership of postmortem follow-up actions over the second half of 2017. The blue area shows tasks owned by SRE, whereas the red area shows tasks owned by other teams at Xero.
  • Tracking ownership — at Xero, the overall aim of SRE is to help product teams take ownership of their own reliability. We use the basic data we record for incidents and postmortems as one way to track this. For example, the graph above shows how teams have taken ownership of postmortem follow-up actions. Drilling further into this data reveals the leading and trailing teams in this space, which helps us plan how we engage with them.
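
To show how little tooling the text analysis needs, here’s a minimal sketch of a term-frequency analysis over a directory of exported postmortem documents. The directory layout, file format and stop-word list are assumptions, not a description of our actual pipeline.

```python
import re
from collections import Counter
from pathlib import Path

# Words that appear in any postmortem and carry no signal.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "we", "on",
              "for", "that", "this", "is", "it", "with", "as", "were", "by"}

def term_frequencies(postmortem_dir: str, top_n: int = 25) -> list:
    """Count the most frequent terms across all postmortem documents."""
    counts = Counter()
    for doc in Path(postmortem_dir).glob("*.md"):   # assumed export format
        words = re.findall(r"[a-z][a-z0-9\-]+", doc.read_text().lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

if __name__ == "__main__":
    for term, count in term_frequencies("./postmortems"):
        print(f"{count:4d}  {term}")
```

The ownership graph above falls out of similarly simple aggregation: once each follow-up ticket records an owning team, counting tickets per team per month is a simple group-by.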

The show must go on

Hopefully these three posts have given some insight into the beginning, middle and end of our incident management story. However, the story hasn’t really ended from Xero SRE’s perspective, as there is always room for improvement. I’ll end this series by highlighting some of our recent and upcoming initiatives to bolster our incident management practice:

  • Developing a parallel framework for security incident management based on our general ChatOps workflow.
  • Refactoring the discovery functionality out of Multivac and into a microservice platform, so that it can be consumed by other teams’ tools.
  • Evolving Multivac into a next-generation chat bot.
  • Improving our postmortem template and review processes to track more detail and reach a wider audience.

Finally, I’ve been seeking feedback about our incident management initiatives within the wider technical community. At AWS Summit Sydney last month, I gave a talk sponsored by PagerDuty that focused on the “on-call as code” system I discussed in part one. Xero will also be presenting two talks at SRECon Asia in June: our head of site reliability, Piers Chamberlain, will be speaking about some of the ideas I discussed in the previous section, and my talk will focus on the use of automation throughout our incident management process. If you’ve been following this series, I’d be keen to hear your thoughts, so feel free to get in touch.

AWS Summit Sydney, April 2018
