Incident Management training
With a growing product comes bigger responsibilities. As the visibility and use of the product grew, we were asked to help reduce both the frequency of incidents and their impact.
The context
Initial conversation with the client
The client was, like many companies, faced with weekly minor incidents and, from time to time, major ones. Their approach to those was fairly ad hoc when we first talked to them. While they always managed to solve the issue, we could see the following symptoms.
- Superhero syndrome. With a team that was still small, habits from the early days were still present. The engineer with the most years in the company usually took care of incidents, sometimes with a few others joining in, but the brunt of the load fell on that one engineer. They knew it all and could fix things quickly, yet nobody else could chime in.
- Lack of structure. With that one engineer busy solving the issue and not communicating, there was little the other engineers could do. And anyone else involved could rarely communicate properly with the rest of the company due to a lack of context.
- Lack of communication. As a few members of the engineering team were taking care of the incident, they didn't communicate what they were doing. As is often the case, the rest of the company, when they were aware of the issue, grew worried and fired messages across Slack channels to get information. Since most of the engineering team wasn't involved in solving the problem, they also sent messages around.
- Lack of post-mortems. There was little record of past incidents, of what was identified and fixed while handling them, or of the lessons learned.
Things to consider
Introducing an incident response plan
Incidents are always stressful for the organization going through them, at all levels. The clients are stressed, the C-level executives are stressed, the sales and support teams are stressed, and so is the engineering team, of course. Stress comes from the risks involved in each incident and the uncertainty around them. A big part of the engineering team's work when handling an incident is to remove that uncertainty for themselves and for the rest of the stakeholders, inside or outside the company.
Anatomy of an incident response
As is our habit, the training started with an introduction to what an incident is and how to grade it.
An incident is usually defined as an unplanned interruption or reduction in the quality of a service. Each team can define its own scale, but as a starting point, we used the following (a small code sketch follows the list).
- Minor incident. The service is not working as normal, but major features are not impacted, and users can still use the product, albeit in a limited manner. An example would be incorrect styling on a web page.
- Major incident. The service is still reachable, but at least one major feature is not working as expected, and no data is lost. An example would be a chat service unable to list the history of chats between two people.
- Critical incident. The service is either unreachable, or one or more major features are not working as expected and some data is lost.
- Security Incident. While this could be attached to any of the three levels listed before, a security incident for an internet product can be defined as any unauthorized or unexpected event compromising the confidentiality, integrity, or availability of the product or its data. This can include any attempt or successful action to gain unauthorized access to the product, its data, or its users' information.
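To make the scale concrete, here is a minimal sketch in Python of how such a grading could be encoded. The enum names, the `is_security` flag, and the example incident are illustrative assumptions, not the client's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """Illustrative incident scale, from least to most severe."""
    MINOR = 1      # degraded but usable, no major feature impacted
    MAJOR = 2      # at least one major feature broken, no data loss
    CRITICAL = 3   # service unreachable, or major feature broken with data loss


@dataclass
class Incident:
    title: str
    severity: Severity
    # A security incident can sit at any severity level, so it is tracked
    # as an orthogonal flag rather than as a fourth level.
    is_security: bool = False


# Example: a chat service that cannot list chat history would be graded MAJOR.
outage = Incident(title="Chat history not loading", severity=Severity.MAJOR)
```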
Chosen solutions
What was done
After introducing those, we discussed an initial incident response process relying primarily on Chapter 14 of Google's SRE book and our previous experience. We chose this approach because it is simple and clear, without too much complexity, allowing even a team of 8 to 10 people to get a first version of a process in place and grow from there.
- Roles and separation of responsibilities. To resolve the lack of structure and start reducing the superhero syndrome, we introduced a pair of roles: incident operator and spokesperson. The first is a mix of 'Incident Command' and 'Operational Work'. The second is solely focused on communication and logging. Whenever an incident was spotted, the person on call would drop a message stating the start of the incident management process and call upon someone else to join the war room.
- Command and operation. As the team was small (fewer than 10 people), and so was the stack, it was difficult to keep incident command and operational work separate. Instead, the team elected to have one person "in charge" for most incidents, unless it was a critical one and more people were needed.
- War room. We created a separate engineering-only Slack channel to avoid having anyone and everyone barge into conversations during an incident. All engineers could join, but it was away from day-to-day operations, thus allowing for focused discussions.
- Communication. As this was a major issue, it was important to put a lot of emphasis on it. We elected to have one person focused on communication across two Slack channels: the general engineering channel, to keep everyone up to date on how things were going and whether help was needed, and a dedicated company-wide channel, to keep the rest of the company informed. Every update came from the spokesperson, and only that person, and included an ETA for the next update (see the sketch after this list). At first, that person was also in charge of creating the incident log file and recording what was seen, decided, and done.
- Overall. By default, the person assuming on-call duties that week would be the de facto incident operator, and they would call on someone else to second them as spokesperson. If they did not feel up to the task, they could ring up other team members and hand off the responsibilities.
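To illustrate how these roles and the communication cadence fit together, here is a hedged sketch in Python. The channel names, the `post` helper, and the message wording are assumptions made for the example; the real updates went out over Slack, and the team's actual tooling is not shown here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


def post(channel: str, message: str) -> None:
    """Stand-in for whatever messaging integration is in place (Slack here)."""
    print(f"[{channel}] {message}")


@dataclass
class IncidentResponse:
    title: str
    operator: str                                 # incident command + operational work
    spokesperson: str                             # communication and logging only
    war_room: str = "#eng-incident-war-room"      # engineering-only channel (assumed name)
    company_channel: str = "#incidents"           # company-wide channel (assumed name)
    log: list[str] = field(default_factory=list)

    def declare(self) -> None:
        """The on-call engineer announces the incident and opens the war room."""
        post(self.war_room,
             f"Incident declared: {self.title}. "
             f"Operator: {self.operator}, spokesperson: {self.spokesperson}.")

    def update(self, status: str, next_update_in_minutes: int) -> None:
        """Only the spokesperson communicates, and every update carries an ETA."""
        eta = datetime.now() + timedelta(minutes=next_update_in_minutes)
        post(self.company_channel, f"{status} Next update by {eta:%H:%M}.")
        self.log.append(f"{datetime.now():%H:%M} {status}")


# Example: the on-call engineer declares the incident, then a first update goes out.
response = IncidentResponse(title="Checkout errors", operator="on-call engineer",
                            spokesperson="second engineer")
response.declare()
response.update("Issue identified, mitigation in progress.", next_update_in_minutes=30)
```

The design choice worth noting is that only the spokesperson posts to the company-wide channel and every update carries an ETA for the next one; that is what removed most of the uncertainty described earlier.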
The process aimed at bringing the service back as fast as possible, but in the most orderly manner. First, we trained the team to recognize the level of the incident, then to figure out whether it was possible to solve it directly or whether it had to be mitigated first.
We also covered the importance of clear communication and task assignment to avoid freelancing and conflicts.
Finally, we discussed how to write incident reports (post-mortems) and communicate about them with the rest of the engineering team and the rest of the company.
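As an idea of what such a report could capture, below is a minimal post-mortem skeleton in Python. The section names follow common post-mortem practice and are an assumption, not the client's exact template.

```python
POSTMORTEM_SECTIONS = [
    "Summary",             # what happened, in two or three sentences
    "Impact",              # who was affected, for how long, at which severity
    "Timeline",            # detection, mitigation, and resolution timestamps
    "Root cause",          # what actually went wrong
    "What went well / what went poorly",
    "Action items",        # follow-ups with owners, discussed in retrospectives
]


def blank_postmortem(title: str) -> str:
    """Render an empty report the spokesperson can fill in from the incident log."""
    body = "\n\n".join(f"## {section}\n_TODO_" for section in POSTMORTEM_SECTIONS)
    return f"# Post-mortem: {title}\n\n{body}"


print(blank_postmortem("Chat history not loading"))
```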
The result
The effects
As the company started to follow those principles, the effects were quickly noticed across the company. After the first use of the process, many people shared their delight at not having to wait to know that something was happening, that it was already being looked after, and when it was solved. Many had not expected incidents could be handled that way.
But it was not the only impact. Using the post-mortems, the team could reflect on key indicators such as Mean Time To Recovery (MTTR) and Change Failure Rate. After a couple of months, they could see that most of their incidents were triggered by changes in the codebase and bugs introduced by those. They could cover this during retrospectives and consider solutions together.
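As a sketch of how those indicators can be derived from the post-mortems, the snippet below computes MTTR and change failure rate over a handful of hypothetical incident records. The field names and numbers are illustrative only, not the client's data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    started_at: datetime
    resolved_at: datetime
    caused_by_change: bool   # was the incident triggered by a code change?


def mean_time_to_recovery(incidents: list[IncidentRecord]) -> timedelta:
    """Average time between the start of an incident and its resolution."""
    total = sum((i.resolved_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


def change_failure_rate(incidents: list[IncidentRecord], deployments: int) -> float:
    """Share of deployments that led to an incident."""
    failures = sum(1 for i in incidents if i.caused_by_change)
    return failures / deployments


# Hypothetical month: 3 incidents out of 40 deployments, 2 caused by code changes.
records = [
    IncidentRecord(datetime(2023, 5, 2, 10, 0), datetime(2023, 5, 2, 11, 30), True),
    IncidentRecord(datetime(2023, 5, 9, 14, 0), datetime(2023, 5, 9, 14, 45), False),
    IncidentRecord(datetime(2023, 5, 23, 9, 0), datetime(2023, 5, 23, 10, 0), True),
]
print(mean_time_to_recovery(records))                 # 1:05:00
print(change_failure_rate(records, deployments=40))   # 0.05
```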
Managing incidents was also an opportunity for team members to look into areas of the codebase they were not yet familiar with, thus spreading knowledge. It also allowed less senior members to lead or co-lead the incident response by pairing with a senior team member, which was seen as key to supporting their career growth.
After a few months, the team updated the process and adjusted it to fit their team and context better now that they had some experience managing incidents. We were no longer needed and moved on to other projects, but we were happy to have been of help.
The training was delivered as an additional part of an already running contract, which allowed us not only to give the initial training but also to pair with team members during several incidents, thus ensuring solid adoption and comprehension of the practice.