TSM - Managing Accidents and Guilt in the DevOps Culture

Claudiu Demian - Systems Administrator

Accidents are unwanted, unpredictable events, yet they are sure to occur from time to time. As a system becomes more complex, the probability that it will crash, or that the people using it will make mistakes, also increases. The way in which we respond to this kind of crisis has a major impact on the organizational culture and on the performance of the team and the company.

Let us consider the following situation:

Sabin is a system administrator with 3 years of experience in the field. For a year now, he has been working for IT SRL, where he manages servers and Linux virtual machines in a team of 5 engineers. Recently, their manager decided that they needed to standardize the names of the servers, because they had run out of Game of Thrones character names. He therefore decided that each category of machines must be codified with names and colours, according to their roles. For example, from now on the DNS servers will be called red-01, red-02, and the webservers black-01… black-xy.

Sabin's task was to run a kernel update on the webservers used for staging (called blackest-01…blackest-05 etc.). To do this, he prepared a command that reads the names of the servers from a file, connects to them via SSH and runs the commands necessary for the update.

Because it did not seem like an important change, Sabin ran the command on a Friday afternoon and, because the upgrade was taking too long, went home. What he did not notice was that he had passed the wrong file as the argument: blackserv, the file listing the production servers, instead of blackest, the file listing the staging servers.

As a result, the update affected the entire network, and it took 4 hours for the on-call colleague to be notified that the sites had stopped working, to find the source of the problem and to solve it.

On Monday morning, on finding out about the problem and its consequences, Sabin's manager reacted only by telling him that such mistakes are unacceptable and that another mistake like this would bring about penalties.

These situations appear often because people make mistakes. The manager's reaction can be considered natural. After all, a manager's role is to make sure that, eventually, the team delivers what they promised (available systems in this case).

However, if we analyse the experiences that companies have had over time, in which the managers' reactions to mistakes have been exactly like the one we described, we are in for a big surprise.

Sabin, our main character, feels guilty about the situation he created: he caused downtime for the product, and one of his colleagues had to intervene to correct his mistake. What his manager told him further strengthened this feeling of unease and may have created a fear of mistakes. From now on, he will probably think twice before making automated changes on the servers, or he may avoid making those changes himself altogether. Moreover, if a new incident appears, he might be less willing to talk about everything that happened, for fear of being singled out again.

His colleagues will have similar feelings. Incidents will be reported less and less, and there will be a tendency to shift the blame and avoid accountability. To avoid being considered the guilty one when problems appear, even problems that are not due to human error, people will make excuses and look for all sorts of ways to avoid certain tasks.

These are the situations that lead to the development of a culture of fear, which is based on the principle of the rotten apple: if we find a rotten apple in a big heap of apples, we take it out and throw it away. If we apply this to people in our work environments, and behave similarly with the people who made a mistake, the long-term results might be very far from our expectations.

John Allspaw, from Etsy, considers that accidents and mistakes should be seen as learning opportunities, not as occasions to point fingers and toss the blame ball around (i). If each employee who was part of an incident is given the opportunity to describe it from his or her perspective, the entire organization can learn something from the event. Moreover, measures can then be taken so that those mistakes may be avoided in the future.

The context in which this informational exchange takes place is called the "blameless postmortem", or retrospective. This is usually a meeting that is organized soon after the incident or after finishing a project (successfully or not). The stages of a postmortem can be summarized as follows (ii):

Setting the context: "We have organized this meeting to try and get everyone's thoughts, to find out what happened and learn from the incident, not to find whom to blame."
Describing the incident or the project
Establishing the chronology: what happened and when, what was said and when, what the employees were thinking when they made the decisions; for this step, a chat room dedicated to this kind of problem would be useful
Determining which other factors have contributed to the incident: these factors can be both personal (stress, fatigue, personal problems etc.), as well as professional (communication problems in the team or among different teams, technical problems etc.)
Describing the impact of the incident on the clients/business
Describing the steps taken to fix the problem
Establishing a certain course of action in order to avoid these problems in the future and to improve the process: it is recommended that these actions respect the SMART principles (Specific, Measurable, Agreed, Realistic, Time-bound) (iii)

Also, these meetings should include everyone who was involved in or affected by the incident, even if the latter are only there to learn what happened and what they should do in similar situations. After the postmortem, those who participated can put together a document describing the event or problem and the steps taken to fix it, along with tips on how to avoid these problems in the future. Amazon offers some good templates for tackling this kind of event (iv).
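
As an illustration only, the postmortem document described above could be captured in a small data structure that mirrors the stages of the meeting. The sketch below is in Python; the class name, the field names and the sample values are assumptions made for this example (the sample values loosely follow Sabin's story) and are not taken from any of the templates cited in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical record of a blameless postmortem; field names are illustrative."""
    title: str
    description: str
    timeline: list = field(default_factory=list)              # (when, what) pairs
    contributing_factors: list = field(default_factory=list)
    impact: str = ""
    fix: str = ""
    action_items: list = field(default_factory=list)

    def render(self):
        """Return the postmortem as a plain-text document."""
        lines = [self.title, "", "Description: " + self.description, "", "Timeline:"]
        lines += [f"  {when}: {what}" for when, what in self.timeline]
        lines += ["", "Contributing factors:"]
        lines += [f"  - {factor}" for factor in self.contributing_factors]
        lines += ["", "Impact: " + self.impact, "Fix: " + self.fix, "", "Action items:"]
        lines += [f"  - {item}" for item in self.action_items]
        return "\n".join(lines)


# Sample values are illustrative and based on the story told in this article.
report = Postmortem(
    title="Kernel update applied to production web servers",
    description="A staging kernel update was run against the production host list.",
    timeline=[
        ("Friday afternoon", "update started with the blackserv file instead of blackest"),
        ("Within 4 hours", "on-call colleague was notified, found the cause and fixed it"),
    ],
    contributing_factors=[
        "similar file names for the staging and production host lists",
        "change started on a Friday afternoon and left unattended",
    ],
    impact="The sites stopped working until the on-call colleague intervened.",
    fix="The on-call colleague identified the update as the cause and solved the problem.",
    action_items=["Have a colleague review the host list and command before bulk changes"],
)
print(report.render())
```

Note that the record has no field for "who is to blame": it describes what happened and what will change, in line with the blameless approach.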

Self-confidence and trust in your colleagues are vital elements in successfully managing incidents. This trust is easy to obtain if everyone knows what they need to do in case of an emergency. This is why some companies have clear procedures for this sort of situation. Server Density (v) offers an example of such a procedure:

Open a Jira ticket for the incident.
Stop (acknowledge) the PagerDuty notification.
Go to the chat room for incidents.
Search the knowledge base. Maybe this type of alert has occurred before and a solution already exists.
If the problem affects the users, post a notification on the site.
If the person on-call cannot solve the problem, escalate it to the people who can.

If such a procedure is in place, it creates a feeling of safety for the employees, especially for the new hires, because they know what to do if a problem appears. Furthermore, the colleagues not involved in solving the issue know which steps were taken, which brings a certain amount of predictability. The written record of the steps taken to solve the problem, together with the discussions in the chat room, is obviously useful during the crisis and will later serve as the starting point for the postmortem.
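
To make the predictability concrete, the procedure listed above can be sketched as a short function that returns the steps the on-call person is expected to follow. This is only an illustration in Python: the function name and its two parameters are hypothetical, and the step texts paraphrase the list above rather than quoting the Server Density playbook.

```python
def on_call_steps(affects_users, on_call_can_resolve):
    """Return the ordered checklist for an incident (illustrative sketch)."""
    steps = [
        "Open a Jira ticket for the incident",
        "Stop (acknowledge) the PagerDuty notification",
        "Go to the chat room for incidents",
        "Search the knowledge base for previous occurrences",
    ]
    # The last two steps are conditional, exactly as in the written procedure.
    if affects_users:
        steps.append("Post a notification on the site")
    if not on_call_can_resolve:
        steps.append("Escalate to the people who can solve the problem")
    return steps


# Example: an incident that affects users and that the on-call person cannot fix alone.
for number, step in enumerate(on_call_steps(True, False), start=1):
    print(number, step)
```

The first four steps are always the same, whatever the incident; only the user notification and the escalation depend on the situation.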

Managers can also organize intervention drills for downtime, similar to the fire drills organized by firefighters. These exercises are good because they ensure the procedure is tested and followed by everybody, and they increase the trust between the members of the team. Such exercises are also called war games and can even include production if the Chaos Monkey approach is followed.

If we go back to the example from the beginning of this article, we might ask what Sabin, his manager, and the team could learn from the entire situation. The first aspect could be a technical one: updating the servers through scripts or commands written by each administrator individually implies extra work and is more susceptible to errors. Using a system for configuration automation and orchestration could reduce the number of errors and would allow for better visibility of the infrastructure (the colleague who is on-call could spot the problem faster). Another advantage would come from peer review: if Sabin had presented his plan and the exact command he planned to use to a colleague, the colleague would probably have noticed the mistake Sabin was about to make.
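
A small technical safeguard in the same spirit is to check the host list against the expected naming rule before anything is executed, and to produce a plan that a colleague can review. The Python sketch below assumes that staging webservers are named blackest-NN, as in the story above; the function names, the pattern and the placeholder command are hypothetical, and the sketch only builds the SSH command lines without running them.

```python
import re
from pathlib import Path

# Assumed naming rule from the story: staging webservers are called blackest-NN.
STAGING_PATTERN = re.compile(r"blackest-\d+")


def load_hosts(path):
    """Read one hostname per line, skipping blank lines and '#' comments."""
    hosts = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.append(line)
    return hosts


def check_hosts(hosts, pattern=STAGING_PATTERN):
    """Return the hosts that do NOT match the expected environment pattern."""
    return [host for host in hosts if not pattern.fullmatch(host)]


def plan_update(hosts, command, pattern=STAGING_PATTERN):
    """Build the SSH command lines for review; raise if any host is unexpected."""
    unexpected = check_hosts(hosts, pattern)
    if unexpected:
        raise ValueError(f"refusing to run: unexpected hosts {unexpected}")
    return [["ssh", host, command] for host in hosts]


# Example: a production name in the list stops the plan before anything runs.
try:
    plan_update(["blackest-01", "black-01"], "<update command>")
except ValueError as error:
    print(error)
```

With a check like this, passing the production file by mistake would have failed immediately, and the printed plan for a correct list gives the reviewing colleague something concrete to read.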

Any mistake can be considered a learning opportunity and an opportunity to improve the organizational process. If someone does something they are not supposed to do, it is because there is no policy in place preventing them from doing so. Likewise, if a person does not know how to tackle a problem, it means there is no policy to guide them and they need more training.

Looking for someone to blame for a problem actually shifts the attention from what is really important and hinders the evolution and development of efficient organizational policies and practices, inhibiting the employees' desire to change and become better individuals.

Bibliography

https://codeascraft.com/2012/05/22/blameless-postmortems/

http://chef.github.io/devops-kungfu/#/59

http://www.slideshare.net/jhand2/its-not-your-fault-blameless-post-mortems

https://aws.amazon.com/message/5467D2/

https://blog.serverdensity.com/whats-on-call-playbook/