A Crash Course on Incident Response, Part 3
In Part 1, I discussed the important aspects of a good incident management practice, including effective communication, clearly defined stakeholders, and timely resolution. In Part 2, we explored the key tactical aspects of incident response. This final installment focuses on improving your response capabilities and, potentially, reducing the frequency and impact of future incidents.
Presumably, you’re reading this because you’d like to improve your incident response practice. The best way to improve is to conduct high-quality “retrospectives” following incidents. Drills and game-day exercises can also help, but there is no substitute for examining real-life incidents. Every organization has its own culture, technical capabilities, and technology profile. We’ll talk through how to continually improve your incident management practice.
What is a Retro?
Simply put, a “retro” means looking back at an incident and the response to it in order to understand and learn from what happened. By looking back, you can identify what caused the incident and prevent, or better handle, similar incidents in the future.
In practice, running effective retros is challenging. Let’s dive into timing, then discuss how to run them effectively. Finally, we’ll cover the artifact produced, often called a “postmortem,” “incident write-up,” or “after-action report.”
Retro Timing
Retros should be conducted promptly following an incident’s resolution. Generally speaking, try to hold the retro within a few days of the incident. You want to give everyone time to decompress and reflect a bit, but you also want the incident fresh in their minds.
You’ll need information and data to learn from before you can have a meaningful discussion. If your response procedures collect that information throughout the response process, as we suggested in Part 2, you’ll be better prepared to conduct the retro quickly. If they don’t, you’ll need to allow time to gather that information, and you should do so as soon as possible.
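To illustrate, here is a minimal sketch of the kind of lightweight timeline capture that makes retro preparation fast. The IncidentLog helper, the incident identifier, and the event strings are all hypothetical; in practice you would hook something like this into whatever chat, paging, or ticketing tooling you already use.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    """Hypothetical helper that timestamps events as they happen,
    so the retro timeline doesn't have to be reconstructed later."""
    incident_id: str
    events: list = field(default_factory=list)

    def record(self, event: str) -> None:
        # Capture the event with a UTC timestamp at the moment it occurs.
        self.events.append((datetime.now(timezone.utc), event))

    def timeline(self) -> str:
        # Render a chronological timeline, ready to paste into the postmortem.
        return "\n".join(
            f"{ts.isoformat(timespec='seconds')}  {event}"
            for ts, event in sorted(self.events)
        )


log = IncidentLog("INC-2024-0042")  # hypothetical identifier
log.record("Monitoring alert: elevated 5xx rate on checkout service")
log.record("Incident commander engaged; severity declared")
log.record("Rolled back latest release; error rate recovered")
print(log.timeline())
```

Even a shared doc or a dedicated chat channel serves the same purpose; the point is to capture timestamps as events happen rather than reconstructing them days later.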
Running a Retro
A facilitator should be designated who is responsible for ensuring the conversation is productive and valuable. They should guide the team through the incident so that all the needed topics are covered in sufficient depth. This should not be a box-ticking exercise; superficial or lazy answers should be drilled into and discussed. The facilitator also needs to keep the conversation moving and ensure it doesn’t devolve into blaming people.
Your attendee group is one of the most important considerations. You should have representation from each group of key participants present at the discussion, including any groups you tried, but failed, to engage.
The retro should be “blameless.” Blameless means you focus on the events rather than placing blame on specific people. Remember that you’re not there to designate a scapegoat, you’re there to improve practices, processes, technology, etc.
I generally suggest that you do not have management (especially senior management) present, unless they were directly involved. You want frank, open discussion about what happened and why. People should be able to speak freely without fear of harming their career. Many of us strive to create a safe and open environment, one where constructive critical feedback is welcomed. However, a manager’s presence will alter a conversation and reduce people’s willingness to openly discuss issues and challenges — especially those around processes or resourcing.
Everyone should prepare to discuss the area they are responsible for. It is helpful for the development team(s) to talk through their system’s failure(s) and the organizational process, practice, system design, or technical reasons that led to them. The infrastructure teams should discuss their systems’ handling of the failures. The customer support team should talk through customer impacts, incident communications, messaging, and other relevant aspects. The response team should talk through the incident response procedures and practices.
The “Five Whys” technique is commonly used during retros to explore cause-and-effect relationships and identify a root cause, but be careful with it: incidents in complex systems are rarely due to a single root cause. It’s important to avoid making assumptions, especially with the benefit of hindsight.
Leave enough time for quality discussion and review of the incident; 30 minutes will probably not suffice. Try not to build the schedule so tight that good discussion can’t occur. Sharing information in advance is advantageous, but remember that this process is about discussion and learning.
Incident Postmortem
The output of the retro is a “postmortem” document. These can be shared and stored for future reference and learning. I personally believe the value of these documents lies far more in the process and discussions leading to their creation than in the document itself. In some cases, a cleaned-up version will be produced for sharing externally, but these are typically internal documents.
As a rule of thumb, try to avoid calling out specific individuals by name; that helps prevent anyone from becoming a scapegoat. Add or remove sections to better suit your needs and requirements, but postmortems typically include the following information (a minimal skeleton sketch follows the list):
- Incident identifier and date.
- Brief, plain-language description of the incident and its impact.
- Timeline of all key events related to the incident. This should be as complete and comprehensive as possible, including key development events, releases, operational changes, key monitoring and alerting events, customer impact and reports, incident management information and actions, etc.
- Discussion of the causes and contributing factors. I deliberately don’t call this section the “root cause” because, again, an incident is rarely the result of a single cause; be sure to articulate and discuss the factors that compounded to produce the incident. This should include the technical aspects, processes, and practices that contributed.
- Discussion of the impact. How severe was it? How many users were affected?
- Discussion of the resolution. How was the incident resolved? What options were considered and ruled out, and why was the chosen resolution selected?
- Incident response analysis. This section should walk through the response, from the detection of the incident (whether via automated monitoring, customer reports, internal users, etc.), through the engagement of the incident management team, to the incident communications. The goal here is to identify ways the incident management process could be improved.
- Lessons learned and takeaways. This should be a short, concise section. Articulate the key takeaways from each of the above sections. Ideally, you identify several opportunities for improvement. Even if the entire process worked very well, I like to look for ways to improve, whether that’s incident communication, monitoring, or development processes.
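To make that structure concrete, here is a minimal sketch of a script that renders an empty postmortem skeleton with those sections. The section names mirror the list above; the function name and the document format are illustrative, not a prescribed standard.

```python
# Sections mirror the postmortem structure described above.
POSTMORTEM_SECTIONS = [
    "Incident Identifier and Date",
    "Description and Impact Summary",
    "Timeline of Key Events",
    "Causes and Contributing Factors",
    "Impact Discussion",
    "Resolution",
    "Incident Response Analysis",
    "Lessons Learned and Takeaways",
]


def postmortem_skeleton(incident_id: str, date: str) -> str:
    """Render an empty postmortem document for the team to fill in."""
    lines = [f"# Postmortem: {incident_id} ({date})", ""]
    for section in POSTMORTEM_SECTIONS:
        lines += [f"## {section}", "", "_TODO_", ""]
    return "\n".join(lines)


print(postmortem_skeleton("INC-2024-0042", "2024-03-18"))  # hypothetical values
```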
This does not require a fancy tool; something such as Google Docs, Office 365, or a wiki is fine. I generally store postmortems in a read-only shared folder so that the wider team can learn from them. Any that contain sensitive information might need to be stored in a location with more restricted access.
What types of incidents should be retroed?
Not all incidents require a retro and corresponding write-up. When you’re introducing a more robust incident management practice, you might start with only the most severe and impactful incidents (frequently called “SEV1”). If you’re trying to effect change and instill a more disciplined approach, you might require a retro and write-up for any production incident.
It is also reasonable to require a retro for all production incidents, but only require a formal write-up for more severe incidents (say a SEV2 or higher). For less severe incidents, a simple description of the issue and its resolution might suffice. I typically calibrate my SEV definitions such that a SEV2 or higher requires a full write-up, but anything less severe only requires an abbreviated version.
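One way to keep that calibration unambiguous is to codify it somewhere responders can see it. Here is a minimal sketch assuming a simple SEV1 through SEV4 scale; the scale and the policy values are illustrative and should match your own severity definitions.

```python
# Hypothetical policy: SEV1 is most severe, SEV4 least severe.
# Every production incident gets a retro; the write-up depth varies.
RETRO_POLICY = {
    "SEV1": {"retro": True, "write_up": "full"},
    "SEV2": {"retro": True, "write_up": "full"},
    "SEV3": {"retro": True, "write_up": "abbreviated"},
    "SEV4": {"retro": True, "write_up": "abbreviated"},
}


def required_artifacts(severity: str) -> str:
    """Return a human-readable summary of what a given severity requires."""
    policy = RETRO_POLICY[severity]
    retro = "retro required" if policy["retro"] else "no retro required"
    return f"{severity}: {retro}, {policy['write_up']} write-up"


for sev in RETRO_POLICY:
    print(required_artifacts(sev))
```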
Conclusion
Avoiding incidents entirely is impossible. Instead, focus your attention on resolving them quickly and improving continuously. Every incident is an opportunity to learn and improve some aspect of how you develop, test, deliver, operate, and support your software. If you treat them as such and invest in developing a mature incident management process, you’ll be paid back. Getting better at detecting and recovering from incidents lowers risk. Getting better at managing and communicating around incidents lowers stress.
You should invest in improving incident communications, as discussed in Part 1 of this series. Few things will de-stress an incident as much as high quality, effective communication.
Designate an incident commander to own the response and coordination. Simple, streamlined, well-defined coordination and management protocols will reduce uncertainty and ensure incidents are handled consistently and reliably. Complexity in your incident management procedures will rarely benefit you, so keep it simple. Well-defined protocols also ensure you’ll be able to collect the information needed to improve the process over time.
Finally, conduct good retros where all aspects of the incident are discussed and reviewed. Actively seek out things to improve and you’ll find that incidents become less stressful over time. Retros should never be about assigning blame, but about learning and improving.
With good practices and a constant effort to improve, incidents don’t have to be chaotic and stressful.
Real Kinetic helps companies develop effective incident management programs. Learn more about working with us.