Reducing Human Error in Software-Based Services

As I’m writing this, Texas and Louisiana continue to deal with the impact of Hurricane Harvey. Hurricane Irma is heading towards Florida. Los Angeles just experienced the biggest fire in its history (the La Tuna Fire). And in the last three months there have been two separate collisions between US Navy vessels and civilian ships that resulted in 17 fatalities and multiple injuries.

The interactions and relationships between people, actions, and events in high-risk endeavors are awe-inspiring. Put aside the horrific loss of life and think of the amount of stress and chaos involved. Imagine knowing that your actions can have irreversible consequences. Those outcomes can’t be changed, but I’m fascinated by the efforts to prevent them from repeating.

Think of those interactions between people, actions, and events as change. There are examples of software systems having critical or even fatal consequences when they fail to handle that change. For most of us the impact is more likely a setback that delays work or, at most, a financial cost to our employer or ourselves. While the impact may differ, there is a benefit in learning from professions outside our own that deal with change on a daily basis.

Our job as systems or ops engineers should be to build, maintain, troubleshoot, and retire the systems we’re responsible for. But there’s a shift building that has us focusing more on becoming experts at evaluating new technology.

Advances in our tooling have allowed us to rebuild, replace, or re-provision our way out of failures. This starts to introduce complacency, because the tools start to have more context about the issues than we do. It shifts our focus away from building a better understanding of what’s happening.

As the complexity and the number of systems involved increase, our ability to understand what is happening and how they interact hasn’t kept up. If you have any third-party dependencies, what are the chances they’re going through a similar experience? How much of an impact does this have on your understanding of what’s happening in your own systems?

Atrophy of Basic Skills

The increased efficiency of our tooling creates a Jevons paradox. This is the economic observation that as the efficiency of something increases, consumption of it tends to increase rather than decrease. It’s named after William Jevons, who in the 19th century noticed that the consumption of coal increased after the release of a new steam engine design. The improvements in this new design made the coal-fired steam engine more efficient than its predecessors, which fueled wider adoption of the steam engine. It became cheaper for more people to use the new technology, and this led to increased consumption of coal.

For us, the engineer’s time is the coal and the tools are the new coal-fired engine. As the efficiency of the tooling increases, we tend to consume more of the engineer’s time. Adoption of the tooling grows while the number of engineers tends to remain flat. Instead of bringing in more people, we try to do more with the people we have.

This contributes to an atrophying of the basic skills needed to do the job: troubleshooting, situational awareness, and being able to hold a mental model of what’s happening. Building those skills is a journeyman’s process. Actual production experience is the best teacher, and the best feedback comes from your peers. Tools are starting to replace the opportunities people have to gain those experiences and learn from them.

Children of the Magenta and the Dangers of Automation

For most of us, improving the efficiency of an engineer’s time will look like some sort of automation. And while there are obvious benefits, there are some not-so-obvious negatives. First, automation can hide from us the context of what is happening, what has happened, and what will happen. How many times have you heard or asked yourself “What’s it doing now?”

“A lot of what’s happening is hidden from view from the pilots. It’s buried. When the airplane starts doing something that is unexpected and the pilot says ‘hey, what’s it doing now?’ — that’s a very, very standard comment in cockpits today.”
– William Langewiesche, journalist and former American Airlines captain.

On May 31, 2009, 228 people died when Air France 447 lost altitude from 35,000 feet and pancaked into the Atlantic Ocean. A pressure probe had iced over, preventing the aircraft from determining its airspeed. This caused the autopilot to disengage, and the “fly-by-wire” system switched into a different control mode.

“We appear to be locked into a cycle in which automation begets the erosion of skills or the lack of skills in the first place and this then begets more automation.” – William Langewiesche

Four years later Asiana Airlines Flight 214 crashed on its final approach into SFO, coming in short of the runway and striking the seawall. The NTSB report shows the flight crew mismanaged the initial approach and the aircraft was above the desired glide path. The captain responded by selecting the wrong autopilot mode, which caused the autothrottle to disengage. He had a faulty mental model of the aircraft’s automation logic. This over-reliance on automation and lack of understanding of the systems were cited as major factors leading to the accident.

This has been described as “Children of the Magenta,” because the information presented in the cockpit by the autopilot is magenta in color. The term was coined by Capt. Warren “Van” Vanderburgh at the American Airlines Flight Academy. There are different levels of automation in an aircraft, and he argues that by reducing the level of automation you can reduce the workload in some situations. The amount of automation should match the current conditions of the environment. It’s a 25-minute video that’s worth watching, but it boils down to this: pilots have become too dependent on automation in general and are losing the skills needed to safely control their aircraft.

This led a Federal Aviation Administration task force on cockpit technology to urge airlines to have their pilots spend more time flying by hand. This focus on returning to basic skills is similar to a report released by the Government Accountability Office (GAO) regarding the impact of maintenance and training on the readiness of the US Navy.

Based on updated data, GAO found that, as of June 2017, 37 percent of the warfare certifications for cruiser and destroyer crews based in Japan—including certifications for seamanship—had expired. This represents more than a fivefold increase in the percentage of expired warfare certifications for these ships since GAO’s May 2015 report.

What if Automation is part of the Product?

Knight Capital was a global financial services firm engaged in market making. They used high-frequency trading algorithms to manage a little over 17% market share on the NYSE and almost 17% on NASDAQ. In 2012 the NYSE was about to release a new program called the Retail Liquidity Program (RLP). Knight Capital made a number of changes to its systems and software to handle the change, including adding new code to an automated, high-speed, algorithmic router called SMARS that sends orders into the market.

The new code was intended to replace some deprecated code in SMARS called Power Peg. Except the old code hadn’t actually been removed and could still be triggered. The new RLP code even reused the same feature flag as Power Peg. During the deploy one server was skipped, and no one noticed the deployment was incomplete. When the feature flag was activated it triggered the Power Peg code on the one server missing the update. After 45 minutes of routing millions of orders into the market (4 million executions across 154 stocks), Knight Capital had lost around $460 million. In this case automation could have helped, for example a check like the sketch below that refuses to proceed until every server is running the same release.
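
Here’s a minimal sketch of that idea: verify the whole fleet is on the expected build before touching the feature flag. The hostnames, the /version endpoint, and the build id are hypothetical placeholders for illustration, not anything from Knight Capital’s actual systems.

    #!/usr/bin/env python3
    """Sketch: refuse to enable a feature flag until every server reports
    the expected build. Hosts, port, endpoint, and build id are hypothetical."""

    import sys
    import urllib.request

    HOSTS = ["trade01", "trade02", "trade03"]  # hypothetical fleet
    EXPECTED_BUILD = "rlp-2012.07.31"          # hypothetical build id

    def deployed_build(host):
        # Assumes each server exposes its build id over a plain HTTP endpoint.
        url = f"http://{host}:8080/version"
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode().strip()

    def main():
        stale = []
        for host in HOSTS:
            try:
                build = deployed_build(host)
            except OSError as err:
                stale.append((host, f"unreachable ({err})"))
                continue
            if build != EXPECTED_BUILD:
                stale.append((host, build))
        if stale:
            for host, build in stale:
                print(f"REFUSING to enable flag: {host} is on {build}")
            return 1
        print(f"All hosts on {EXPECTED_BUILD}; safe to enable the flag.")
        return 0

    if __name__ == "__main__":
        sys.exit(main())

Even something this small would have turned “one server was skipped” from a silent condition into a hard stop.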

Automation is not a bad thing, but you need to be thoughtful and clear about how it’s being used and how it functions. In Ten Challenges for Making Automation a “Team Player” in Joint Human-Agent Activity the authors provide a guideline for this. They show that the interaction between people and automation can be improved by meeting four basic requirements (here I’m thinking that robots only had three laws). They then go on to describe the ten biggest challenges in satisfying them. A small sketch of what the second and third requirements might look like in practice follows the list.

Four Basic Requirements

  1. Enter into an agreement, which we’ll call a Basic Compact, that the participants intend to work together
  2. Be mutually predictable in their actions
  3. Be mutually directable
  4. Maintain common ground
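
To make the second and third requirements concrete, here’s a toy sketch of what “predictable” and “directable” automation might look like in an ops script: it states its whole plan up front and lets the operator confirm, skip, or abort each step. The plan contents and commands are hypothetical.

    #!/usr/bin/env python3
    """Toy sketch of predictable, directable automation: announce the plan,
    then let the operator approve, skip, or abort each step."""

    import subprocess

    # Hypothetical plan; each entry is a description and the command to run.
    PLAN = [
        ("drain web01",   ["echo", "drain web01"]),
        ("restart nginx", ["echo", "restart nginx on web01"]),
        ("undrain web01", ["echo", "undrain web01"]),
    ]

    def run_plan(plan):
        # Predictable: show the operator the whole plan before doing anything.
        print("Proposed plan:")
        for i, (desc, _) in enumerate(plan, 1):
            print(f"  {i}. {desc}")
        # Directable: the operator can confirm, skip, or abort each step.
        for desc, cmd in plan:
            answer = input(f"{desc}? [y]es / [s]kip / [a]bort: ").strip().lower()
            if answer.startswith("a"):
                print("Aborting; nothing further will run.")
                return
            if answer.startswith("s"):
                print(f"Skipped: {desc}")
                continue
            subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        run_plan(PLAN)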

Complexity and Chaos

Complexity has a tendency to bring about chaos: people have difficulty understanding the system(s), visibility into what is happening is incomplete, and the number of possibilities within them is large, spanning normal events, the mutability of the data, and the out of the ordinary. That last part encompasses failures, people using the system(s) in unexpected ways, large spikes in requests, and bad actors.

If this is the environment we find ourselves working in, we can only control the things we bring into it. That includes making sure we have a grasp of the basics of our profession and maintaining the best possible understanding of what’s happening with our systems. This should allow us to work around issues as they happen and decrease our dependence on our tools. There’s usually more than one way to get information or make a change. Knowing the basics and staying aware of what is happening can get us through the chaos.

There are three people whose work can help us navigate these kinds of conditions: John Boyd, Richard I. Cook, and Edward Tufte. Boyd gives us a reference for how to work within chaos and how to use it to our advantage. Cook shows how complex systems fail and suggests ways to find the causes of those failures. And Tufte explains how we can reduce the complexity of the information we’re working with.

Team Resource Management

This leads us to a proven approach, used in aviation, firefighting, and emergency medicine, that we can adapt for our own use. In 1973 NASA started research into human factors in aviation safety. Several years later two Boeing 747s collided on a runway, killing 583 people. This prompted a workshop titled “Resource Management on the Flight Deck” that included the NASA researchers and the senior officers responsible for aircrew training at the major airlines. The result was Crew Resource Management (CRM): training focused on reducing the primary causes of aviation accidents.

They saw the primary causes as human error and communication problems. The training reinforces the importance of communication and of orienting to what is actually happening. It changes the cultural muscle memory so that when we start to see things go bad, we speak up and examine it. And it stresses a culture where authority may be questioned.

We’ve already adapted the Incident Command System for our own use… why not do the same with CRM?

What’s the Trend of Causes for Failures?

We don’t have a group like the NTSB or NASA focused on reducing failures in what we do. Yes, there are groups like USENIX, Apache Software Foundation, Linux Foundation, and the Cloud Native Computing Foundation. But I’m not aware of any of them tracking and researching the common causes of failures.

After a few searches for any reference to a postmortem, retro, or outage I came up with the list below. These were pulled from searches on Lobsters, Hacker News, Slashdot, TechCrunch, Techmeme, High Scalability, and ArsTechnica. In this very small, nonscientific sample almost half are due to what I would call human error. There are also the four power-related causes. Don’t take anything from this list other than the following: we would benefit from more transparency about the failures in our industry and a better understanding of their causes.

Date | Description and Link | Cause
-----|----------------------|------
9/7/2017 | SoftLayer GLBS outage | Unclear
8/26/2017 | BGP leak caused internet outage in Japan | Unknown
8/21/2017 | Honeycomb outage | In Progress
8/4/2017 | Visual Studio Team Services outage | Human Error
8/2/2017 | Issues with Visual Studio Team Services | Failed Dependency
5/18/2017 | Let’s Encrypt OCSP outage | Human Error
3/16/2017 | Square | Deploy-Triggered Load
2/28/2017 | AWS S3 outage | Human Error
2/9/2017 | Instapaper outage cause & recovery | Human Error
1/31/2017 | GitLab outage | Human Error
1/22/2017 | United Airlines grounded two hours, computer outage | Unknown
10/21/2016 | Dyn DNS DDoS | DDoS
10/18/2016 | Google Compute Engine | Human Error
5/10/2016 | Salesforce | Failure after mitigating a power outage
1/28/2016 | GitHub service outage | Cascading failure after power outage
1/19/2016 | Twitter | Human Error
7/27/2015 | Joyent Manta outage | Locks block the data
1/10/2014 | Dropbox outage post-mortem | Human Error
1/8/2014 | GitHub outage | Human Error
3/3/2013 | Cloudflare outage | Unintended Consequences
8/1/2012 | Knight Capital | Human Error
6/30/2012 | AWS power failure | Double failure of generators during power outage
10/11/2011 | BlackBerry worldwide 3-day outage | Hardware failure and failed backup process
11/15/2010 | GitHub outage (config error) | Human Error

“For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.” – R. P. Feynman, Rogers Commission Report

Valid Security Announcements

A checklist item for the next time you need to create a landing page for a security announcement.

Make sure the certificate and the whois on the domain being used actually reference the name of your company.

My wife sends me a link to this.

I then find the page for the actual announcement from Equifax.

I go to the dedicated website www.equifaxsecurity2017.com and find it’s using a Cloudflare SSL certificate.

The certificate chain doesn’t mention Equifax other than the DNS names used in the cert (*.equifaxsecurity2017.com and equifaxsecurity2017.com).
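
If you want to check this without a browser, a few lines of Python can pull the certificate and print who it was issued to and by. This is just a sketch of the kind of check I mean; it only needs the standard library.

    #!/usr/bin/env python3
    """Sketch: fetch a site's TLS certificate and print the subject, issuer,
    and DNS names so you can see whether the company is actually referenced."""

    import socket
    import ssl

    def cert_names(host, port=443):
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        # getpeercert() returns RDNs as nested tuples; flatten them to dicts.
        return {
            "subject": dict(rdn[0] for rdn in cert["subject"]),
            "issuer": dict(rdn[0] for rdn in cert["issuer"]),
            "subjectAltName": cert.get("subjectAltName", ()),
        }

    if __name__ == "__main__":
        for field, value in cert_names("www.equifaxsecurity2017.com").items():
            print(f"{field}: {value}")

Running it against the announcement site should show the same thing the browser did: nothing in the subject or issuer that references Equifax, only the DNS names.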

What happens if I do a whois?

$ whois equifaxsecurity2017.com
   Domain Name: EQUIFAXSECURITY2017.COM
   Registry Domain ID: 2156034374_DOMAIN_COM-VRSN
   Registrar WHOIS Server: whois.markmonitor.com
   Registrar URL: http://www.markmonitor.com
   Updated Date: 2017-08-25T15:08:31Z
   Creation Date: 2017-08-22T22:07:28Z
   Registry Expiry Date: 2019-08-22T22:07:28Z
   Registrar: MarkMonitor Inc.
   Registrar IANA ID: 292
   Registrar Abuse Contact Email: abusecomplaints@markmonitor.com
   Registrar Abuse Contact Phone: +1.2083895740
   Domain Status: clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited
   Domain Status: clientTransferProhibited https://icann.org/epp#clientTransferProhibited
   Domain Status: clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited
   Name Server: BART.NS.CLOUDFLARE.COM
   Name Server: ETTA.NS.CLOUDFLARE.COM
   DNSSEC: unsigned

Now I want to see if I’m impacted. I click on “Check Potential Impact” and am taken to a new site (trustedidpremier.com/eligibility/eligibility.html).

And we get another certificate and a whois lacking any reference back to Equifax.

$ whois trustedidpremier.com
   Domain Name: TRUSTEDIDPREMIER.COM
   Registry Domain ID: 2157515886_DOMAIN_COM-VRSN
   Registrar WHOIS Server: whois.registrar.amazon.com
   Registrar URL: http://registrar.amazon.com
   Updated Date: 2017-08-29T04:59:16Z
   Creation Date: 2017-08-28T17:25:35Z
   Registry Expiry Date: 2018-08-28T17:25:35Z
   Registrar: Amazon Registrar, Inc.
   Registrar IANA ID: 468
   Registrar Abuse Contact Email: registrar-abuse@amazon.com
   Registrar Abuse Contact Phone: +1.2062661000
   Domain Status: ok https://icann.org/epp#ok
   Name Server: NS-1426.AWSDNS-50.ORG
   Name Server: NS-1667.AWSDNS-16.CO.UK
   Name Server: NS-402.AWSDNS-50.COM
   Name Server: NS-934.AWSDNS-52.NET
   DNSSEC: unsigned

I’m not suggesting that equifaxsecurity2017.com is malicious, but if you’re going to the trouble of setting up a page like this, make sure your certificate and whois actually reference the company making the announcement. Looking at the creation dates for the domains and the Not Valid Before dates on the certs, there was plenty of time to get domains and certificates created that reference Equifax.

How I Minimize Distractions

Being clear on what’s important. Being honest with myself about my own habits. Protecting how and where my attention is being pulled. Managing my own calendar.

I know I’m going to want to work on personal stuff, surf a bit, and do some reading. Knowing that I’m a morning person, I start my day off before work on something for myself. Usually it falls under reading, writing, or coding.

After that I usually scan through my feed reader and read anything that catches my eye, surf a few regular sites and check my personal email.

When I do start work, the first thing I do is about an hour of reactive work. All of those notifications and requests needing my attention get worked on. Right now this usually looks like catching up on email, Slack, GitHub, and Trello.

I then try to have a few chunks of time blocked off in my calendar throughout the week as “Focused Work”. This creates space to focus on what I need to work on while leaving time to be available for anyone else.

The key has been managing my own calendar to allow time for my attention to be directed on the things I want, the things I need, and the needs of others.

I do keep a running text file where I write out the important or urgent things that need to be done for the day. I’ll also add notes when something interrupts me. When I used to write this out on paper I used this sheet from Dave Seah called The Emergent Task Timer. I found it helped to write out what needs to be done each day and to track what things are pulling my attention away from more important things.

Because the type of work that I do can be interrupt-driven, I orient my team in a similar fashion. Creating that same uninterrupted time for everyone allows us to minimize the impact of distractions. During the week there are blocks of time where one person is the interruptible person while everyone else snoozes notifications in Slack, ignores email, and gets some work done.

This also means focusing on minimizing the impact of alert and notification fatigue. Have zero tolerance for things that page you repeatedly or add unnecessary notifications to your day.

The key really is just those four things I listed at the beginning.

You have to be clear on what is important, both for yourself and for those you’re accountable to.

You have to be honest with yourself about your own habits. If you keep telling yourself you’re going to do something, but you never do… well, there’s probably something else at play. Maybe you’re more in love with the idea than with actually doing it.

You need to protect your attention from being pulled away by things that are not important or urgent.

You can work towards those three points by managing your own calendar. Besides tracking the passage of days, a calendar schedules where your attention will be focused. If you don’t manage it, others will manage it for you.

I view these as ideals that I aim for. The circumstances I’m in might not always allow me to do so, but they give me direction on how I work each day.

And yeah when I’m in the office I use headphones.

Chai 2000 - 2017

I lost my little buddy of 17 years this week.

Chai, I’m happy you had a peaceful morning in the patio at our new house doing what you loved. Sitting under bushes and in the sun. Even in your last day you still comforted us. You were relaxed throughout the morning, when I picked you up and held you as we drove to the vet till the end.

Chai, thank you for everything, we love you and you will be missed.

Why We Use Buzzwords

WebOps, SecOps, NetOps, DevOps, DevSecOps, ChatOps, NoOps, DataOps. They’re not just buzzwords. They’re a sign of a lack of resources.

I believe one reason people keep coming up with newer versions of *Ops is that there isn’t enough time or enough people to own the priorities being prepended to it. Forget about the acronyms and the buzzwords and ask yourself why a new *Ops keeps coming up.

It’s not always marketing hype. When people are struggling to find the resources to address one or multiple concerns they’ll latch onto anything that might help.

When systems administrators started building and supporting the services that ran websites, we started calling it WebOps to differentiate that type of work from the more traditional SA role.

When DevOps came about, it was a reaction from SAs trying to build things at a speed that matched the needs of the companies that employed them. The success of using the DevOps label to frame that work encouraged others to orient themselves around it as a way to achieve that same level of success.

Security is an early example; I can remember seeing talks and posts about using DevOps as a framework to meet its needs.

Data-related work seemed to be mostly absent. Instead we got a number of services, consultants, and companies focused on providing a better database or some kind of big data product.

Reflecting back, there are two surprises. The first is an early trend of including nontechnical departments in the DevOps framework; adding your marketing department was one I saw a couple of different posts and talks on.

The second is that even with software-defined networking (SDN) providing a programmatic path for traditional network engineering into the DevOps framework, it really hasn’t been a standout. Most of the tools available are tied to big, expensive gear, and this seems to have kept network engineering outside of DevOps. The difference is that if you’re using a cloud platform that provides SDN, then DevOps can cover the networking side.

ChatOps is the other interesting one, because it’s focused on the least technical thing: the difficulty people have communicating with other parts of their company and easily finding answers to the basic questions that frequently come up.

This got me thinking about breaking the different types of engineering and development work needed into three common groups, plus the few roles that have either been removed from some companies, outsourced, or left understaffed.

The three common groups are: software engineering, covering front-end, back-end, mobile, and tooling; systems engineering, providing guidance and support for the entire product lifecycle, including traditional SA and operations roles, SRE, and tooling; and data engineering, which covers traditional DBA roles, analytics, and big data.

Then you have QA, network engineering, and information security, which seem to have gone away unless you’re at a company that’s big enough to provide a dedicated team or that has a specific requirement for them.

QA, it seems, has moved from being a role into a responsibility everyone is accountable for.

Network engineering and security are two areas I’m starting to wonder about: will they be dispersed across the three common groups just as QA has largely been? Maybe SDN will move control of the network to software and systems engineers. There is an obvious benefit to holding everyone accountable for security, as we’ve done with QA.

This wouldn’t mean there’s no need for network and security professionals anymore, but that you would probably only find them in certain circumstances.

Which brings me back to my original point. Why the need for all of these buzzwords? With the one exception, every single version of *Ops is focused on the non-product-feature work. We’re seeing reductions in the number of roles focused on that type of work while at the same time its importance has increased.

I think it’s important that we’re honest with ourselves and consider that maybe the way we do product management and project planning hasn’t caught up with the changes in the non-product areas of engineering. If it had, do you think we would have seen all of these new buzzwords?