Reducing Human Error in Software Based Services

As I’m writing this Texas and Louisiana continue to deal with the impact of Hurricane Harvey. Hurricane Irma is heading towards Florida. Los Angeles just experienced the biggest fire to burn in its history (La Tuna Fire). And in the last three months there have been two different collisions between US Navy vessels and civilian ships that resulted in 17 fatalities and multiple injuries.

The interaction and relationships between people, actions, and events in high risk endeavors is awe inspiring. Put aside the horrific loss of life and think of the amount of stress and chaos involved. Imagine knowing that your actions can have irreversible consequences. Though they can’t be changed I’m fascinated with efforts to prevent them from repeating.

Think of those interactions between people, actions and events as change. There are examples of software systems having critical or fatal consequences when failing to handle them. For most of us the impact might be setbacks delaying work or at most a financial consequence to our employer or ourselves. While the impact may differ there are benefits from learning from professions other than our own that deal with change on a daily basis.

Our job as systems or ops engineers should be on how to build, maintain, troubleshoot, and retire the systems we’re responsible for. But there’s been a shift building that has us focusing more on becoming experts at evaluating new technology.

Advances in our tooling has allowed us to rebuild, replace, or re-provision from failures. This starts introducing complacency, because the tools start to have more context of the issues than us. It shifts our focus away from reaching a better idea of what’s happening.

As the complexity and the number of systems involved increases our ability to understand what is happening and how they interact hasn’t kept up. If you have any third-party dependencies, what are the chances they’re going through a similar experience? How much of an impact does this have on your understanding of what’s happening in your own systems?

Atrophy of Basics Skills

The increased efficiency of our tooling creates a Jevons paradox. This is an economic idea that as the efficiency of something increases it will lead to more consumption instead of reducing it. Named after William Jevons who in the 19th century noticed that the consumption of coal increased after the release of a new steam engine design. The improvements with this new design increased the efficiency of the coal-fired steam engine over it’s predecessors. This fueled a wider adoption of the steam engine. It became cheaper for more people to use a new technology and this led to the increased consumption of coal.

For us the engineer’s time is the coal and the tools are the new coal-fired engine. As the efficiency of the tooling increases we tend to use more of the engineer’s time. The adoption of the tooling increases while the number of engineers tends to remain flat. Instead of bringing in more people we tend to try to do more with the people we have.

This contributes to an atrophying of the basic skills needed to do the job. Things like troubleshooting, situational awareness, and being able to hold a mental model of what’s happening. It’s a journeyman’s process to build them. Actual production experience is the best teacher and the best feedback is from your peers. Tools are starting to replace the opportunities for people to have those experiences to learn from.

Children of the Magenta and the Dangers of Automation

For most of us improving the efficiency of an engineer’s time will look like some sort of automation. And while there are obvious benefits there are some not so obvious negatives. First, automation can hide the context of what is, has, and will happen from us. How many times have you heard or asked yourself “What’s it doing now?”

“A lot of what’s happening is hidden from view from the pilots. It’s buried. When the airplane starts doing something that is unexpected and the pilot says ‘hey, what’s it doing now?’ — that’s a very very standard comment in cockpits today.’”
– William Langewiesche, journalist and former American Airlines captain.

In May 31, 2009 228 people died when Air France 447 lost altitude from 35,000 feet and pancaked into the Atlantic Ocean. A pressure probe had iced over preventing the aircraft from determining its speed. This caused the autopilot to disengage and the “fly-by-wire” system switched into a different mode.

“We appear to be locked into a cycle in which automation begets the erosion of skills or the lack of skills in the first place and this then begets more automation.” – William Langewiesche

Four years later Asiana Airlines flight 214 crashed on their final approach into SFO. It came in short of the runway striking the seawall. The NTSB report shows the flight crew mismanaged the initial approach and the aircraft was above the desired glide path. The captain responded by selecting the wrong autopilot mode, which caused the auto throttle to disengage. He had a faulty mental model of the aircraft’s automation logic. This over-reliance on automation and lack of understanding the systems was cited as major factors leading to the accident.

This has been described as “Children of the Magenta” due to the information presented in the cockpit from the autopilot being magenta in color. It was coined by Capt. Warren “Van” Vanderburgh at American Airlines Flight Academy. There are different levels of automation in an aircraft and he argues that by reducing the level of automation you can reduce the workload in some situations. The amount of automation should match the current conditions of the environment. It’s a 25 minute video that’s worth watching, but it boils down to this. Pilots have become too dependent on automation in general and are losing the skills needed to safely control their aircraft.

This led a Federal Aviation Administration task force on cockpit technology to urge airlines to have their pilots spend more time flying by hand. This focus of returning to the basic skills needed is similar to the report released from the Government Accountability Office (GAO) regarding the impact of maintenance and training on the readiness of the US Navy.

Based on updated data, GAO found that, as of June 2017, 37 percent of the warfare certifications for cruiser and destroyer crews based in Japan—including certifications for seamanship—had expired. This represents more than a fivefold increase in the percentage of expired warfare certifications for these ships since GAO’s May 2015 report.

What if Automation is part of the Product?

Knight Capital was a global financial services firm engaging in market making. They used high-frequency trading algorithms to manage a little over a 17% market share on NYSE and almost 17% on NASDAQ. In 2012 a new program was about to be released at the NYSE called Retail Liquidity Program (RLP). Knight Capital made a number of changes to it’s systems and software to handle the change. This included adding new code to an automated, high speed, algorithmic router called SMARS that would send orders into the market.

This was intended to replace some deprecated code in SMARS called Power Peg. Except it wasn’t deprecated and it was still being used. The new code for RLP even used the same feature flag as Power Peg. During the deploy one server was skipped and no one noticed the deployment was incomplete. When the feature flag for SMARS was activated it triggered the Power Peg code on that one server that was missing the update. After 45 minutes of routing millions of order into the market (4 million executions on 154 stocks) Knight Capital lost around $460 million dollars. In this case automation could have helped.

Automation is not a bad thing. You need to be thoughtful and clear on how it’s being used and how it functions. In Ten Challenges for Making Automation a “Team Player” in Joint Human-Agent Activity the authors provide a guideline for this. They show the interaction between people and automation can be improved by meeting four basic requirements (here I’m thinking that robots only had three laws). They then go on to describe the ten biggest challenges to satisfy them.

Four Basic Requirements

  1. Enter into an agreement, which we’ll call a Basic Compact, that the participants intend to work together
  2. Be mutually predictable in their actions
  3. Be mutually directable
  4. Maintain common ground

Complexity and Chaos

Complexity has a tendency to bring about chaos. Due to the difficulty of people to understand the system(s). Incomplete visibility into what is happening. The number of possibilities within them, including both the normal events, mutability of the data, and the out of ordinary. That encompasses failures, people using the system(s) in unexpected ways, large spikes in requests, and bad actors.

If this is the environment that we find ourselves working in we can only control the things we bring into it. That includes making sure we have a grasp on the basics of our profession and maintain the best possible understanding of what’s happening with our systems. This should allow us to work around issues as they happen and decrease our dependencies on our tools. There’s usually more than one way to get information or make a change. Knowing the basics and staying aware of what is happening can get us through the chaos.

There are three people whose works can help navigate these kind of conditions. John Boyd, Richard I. Cook, and Edward Tufte. Boyd gives us reference on how to work within chaos and how to use it to our own advantage. While Cook shows how complexity can fail and suggests ways to find the causes of those failures. And Tufte explains how we can reduce the complexity of the information we’re working with.

Team Resource Management

This leads us to a proven approach that has been used in aviation, fire fighting, and emergency medicine that we can adapt for our own use. In 1973 NASA started research into human factors in aviation safety. Several years later two Boeing 747s collided on the runway and killed 583 people. This prompted a workshop titled “Resource Management on the Flight Deck” that included the NASA researchers and the senior officers responsible for aircrew training from the major airlines. The result of this workshop was a focus on training to reduce the primary causes of aviation accidents called Crew Resource Management (CRM).

They saw the primary causes as human error and communication problems. The training would reinforce the importance of communication and orientating to what is actually happening. It changes the cultural muscle memory so when we start to see things go bad, we speak up and examine it. And it stresses a culture where authority may be questioned.

We’ve already adapted the Incident Command System for our own use… why not do the same with CRM?

What’s the Trend of Causes for Failures?

We don’t have a group like the NTSB or NASA focused on reducing failures in what we do. Yes, there are groups like USENIX, Apache Software Foundation, Linux Foundation, and the Cloud Native Computing Foundation. But I’m not aware of any of them tracking and researching the common causes of failures.

After a few searches for any reference to a postmortem, retro, or outages I came up with this list. These were pulled from searches on Lobsters, Hacker News, Slashdot, TechCrunch, Techmeme, High Scalability, and ArsTechnica. And in this very small nonscientific sample almost half are due to what I would call human error. There also the four power related causes. Don’t take anything out of this list other than the following. We would benefit from having more transparency of the failures in our industry and understanding their causes. 

Date Description and Link Cause
9/7/2017 Softlayer GLBS Outage Unclear
8/26/2017 BGP Leak caused internet outage in Japan Unknown
8/21/2017 Honeycomb outage In Progress
8/4/2017 Visual Studio Team Services outage Human Error
8/2/2017 Issues with Visual Studio Team Services Failed Dependency
5/18/2017 Let’s Encrypt OCSP outage Human Error
3/16/2017 Square Deploy Triggered Load
2/28/2017 AWS S3 Outage Human Error
2/9/2017 Instapaper Outage Cause & Recovery Human Error
1/31/2017 Gitlab Outage Human Error
1/22/2017 United Airlines grounded two hours, computer outage Unknown
10/21/2016 Dyn DNS DDoS DDoS
10/18/2016 Google Compute Engine Human Error
5/10/2016 SalesForce Failure after mitigating a power outage
1/28/2016 Github Service outage Cascading failure after power outage
1/19/2016 Twitter Human Error
7/27/2015 Joyent Manta Outage Locks Blocks The Data
1/10/2014 Dropbox Outage post-mortem Human Error
1/8/2014 GitHub Outage Human Error
3/3/2013 Cloudflare outage Unintended Consequences
8/1/2012 Knight Capital Human Error
6/30/2012 AWS power failure Double Failure of Generators during Power Outage
10/11/2011 Blackberry World Wide 3 Day Outage Hardware failure Hardware Failure and Failed Backup Process
11/15/2010 GitHub Outage Config Error Human Error

“For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.” – R. P. Feynman Rogers Commission Report