Brain Dump

Every wonder why you have so many ideas when you’re in the bathroom taking a shower, or a bath, brushing your teeth or that other thing we use the room for?

Ever notice that there are no screens in there to pull or hold our attention?

Take a minute and count how many screens are around you right now. How many different TVs, phones, tablets, computers, e-readers, portable video game consoles, smartwatches, or VR headsets are within your sight? How many of these are in the bathroom?

p.s. please don’t admit to owning a pair of AR glasses.

While having a discussion with my wife, who had a MacBook on her lap, a iPad closed on the ottoman next to her feet, and her iPhone within reach I said…

“the nuance is blurred” and then I had no reply as she was still reading whatever it was she was reading.

I waited a minute and then asked her “you know I just asked you a question right?”

She replied “yes, something about being blurred.”

I repeated what I had said “yeah the nuance is blurred” as I started to walk to the bathroom I added “Blur as in the band, as in song number two, as in what I’m going to do.”

Which I did and then noticed there are no screens in the bathroom and wondered maybe this is why we have flashes of clarity and creativity in here. Which reminded me we recently joked about getting an Amazon Echo in here so we could ask Alexa to take notes so we don’t forget things.

So as I step back into the living room, even before I’m through the bathroom door I’m calling out to her “Don’t say anything! I had an idea that I need to write down before I forget!” She was looking at me as I stepped out and I was left with the impression she was holding onto something to tell me.

I grabbed my laptop, sit down and start writing. At this point I’ve read everything that you’ve read so far to her and she laughs.

We talk about the fact there’s a screen in every room in the house except for the bathroom. We go over the semantics of are there really screens in the bedroom. Because you know the iPads we both have can follow us around. She uses hers through out the day and I leave mine at the bed side table to watch something when I go to bed. Which isn’t really going to bed as it’s just laying down and watching TV. No wonder I get so little sleep.

We talk about the screens even in our car. Hell the car even emails and sends us texts messages when it gets low on windshield wiper fluid. Our car has a drinking problem and it likes to let us know about it.

Used to be the only screen in the house was the one television set the family had and if you were well-off your parents had a second set in their bedroom to watch Johnny Carson together.

Instead we watch different things on our own screen that we hold out in front of us or sit on our bellies. Recently my wife asked if there’s an app so we can both watch the same thing synced up on our individual iPads. We laughed when we realized that yes there is and it’s called a television.

We seem to miss the opportunities we have to build something new and we underestimate the value of what we lost. Instead we build things that place each of us in our own individual world. Walled off with wireless headsets and a microphone, where you can be talking with someone who is ignoring the people around them or maybe you’re just listening to music that a machine picked to play for you.

Yes there is usually at least one screen in the bathroom called a mirror. It’s passive and only reflects back what we bring to it. That’s the key difference. That screen is passive and is used as a tool so we can brush our teeth or have a moment of self reflection.

Anyways just wanted to capture this before I lost.

I never did find out what she was holding on to tell me.

Attrition from The Small Things

If you find yourself thinking about what work lies ahead in 2018 consider the following. Doesn’t matter if you’re only thinking of the first month, quarter or the entire year. How many changes can your team handle before it culminates to where they’re no longer capable of performing their normal operations? How about for yourself?

While you’re considering your answer frame change as anything from work related to existing or new projects, ongoing work supporting existing products, incidents, pivots in priorities or business models, new regulations, reorgs, to just plain interruptions.

Every individual and team has a point where their capabilities become ineffective. We have different tools and methodologies that track time spent on tasks. We track complexity (story points) delivered in a given timeframe (sprints). Burn up charts show tasks added across time, but this is a very narrow view of change. A holistic view of the impact of change on a team is missing. One that can show how change wears down on the people we’re responsible for and ourselves.

Reduction of Capability

Before we started needing data for most decisions we placed trust in individuals to do their job. When people were being pushed too far too fast they might push back. This still happens, but the early signs of it are often drowned out by data or a mantra of stick with the process. It’s developed into a narrow focus that has eroded trust in experience to drive us towards our goals. This has damaged some of the basic leadership skills needed and it has focused our industry on efficiency over effectiveness. I’m also starting to think this is creating a tendency where people are second guessing their own abilities due to the inabilities of others.

This reinforces a culture where leaders stop trusting the opinions of people doing the work or those who are close to it. When people push back the leaders have a choice to either listen and take into account the feedback or to double down on the data and methods used. This contributed in creating the environments where the labels “10x”, “Rock Stars” and “Ninjas” started being applied to engineers, designers, and developers.

heroics — “behavior or talk that is bold or dramatic, especially excessively or unexpectedly so: the makeshift team performed heroics.” — New Oxford American Dictionary

Ever think about why we apply the label heroics or hero when teams or people are able to pull through in the end? If the output of work and the frequency of changes were plotted I’d bet you’ll find the point where sustaining normal operations was impracticable or improbable was passed before these labels are used.

Last month’s fatal Amtrak derailment that killed three people was traveling more than twice the speed limit (80 mph in a 30 mph zone). The automated system (positive train control) designed to prevent these types of conditions from happening while installed was not activated. Was this fatal accident on the inaugural run of a new Amtrak route an example of where normal operations were no longer possible? Is this any different than the fatal collisions involving US Navy ships last year due to over-burdened personnel and equipment?

For the derailment it looks like a combination of failing to use available safety systems and following safety guidelines contributed to the accident. There’s also the question was the crew given training to build awareness of the new route. The Navy collisions looks to be the result of the strain of trying to do too much with too few people and resources. This includes individuals working too many hours, a reduction in training, failure to verify readiness, and a backlog of maintenance on the equipment, aircraft and ships.

The cadence of change was greater than what these organizations were capable of supporting.

For most of us working as engineers, designers, developers, product managers, or as online support we wouldn’t consider ourselves to be in a high-risk occupation. But the work we do impacts peoples lives in small to massive ways. These examples are something that we should be learning from. We should also acknowledge that we’re not good at being aware of the negative impacts of the tempo of change on our people.

There’s a phrase and image that can illustrate the dependencies between people, processes, and systems. It’s called the “Swiss Cheese Model” and it highlights when shortcomings between them line up it can allow a problem to happen. It also shows how the strengths from each is able to support the weaknesses of others.

Swiss Cheese Model of Accident Causation

Illustration by David Mack CC BY-SA 3.0.

We have runbooks, playbooks, incident management processes, and things to help us understand what is happening in our products and systems. Remember that these things are not absolute and they’re fallible. The systems and processes we put into place are never final, they’re ideas maintained as long as they stay relevant and then removed when they are no longer necessary. This requires awareness and diligence.

In any postmortem I’ve participated in or read through there were early signs that conditions were unusual. Often people fail to recognize a difference between what is happening and what is expected to happen. This is the point where a difference can start to develop into a problem if we ignore it. If you think you see something that doesn’t seem right you need to speak up.

After the Apollo 1 fire Gene Kranz gave a speech to his team at NASA that is knows as the Kranz Dictum. He begins by stating they work in a field that cannot tolerate incompetence. He then immediately holds himself and every part of the Apollo program accountable for their failures to prevent the deaths of Gus Grissom, Ed White, and Roger Chaffee.

From this day forward, Flight Control will be known by two words: “Tough” and “Competent.” Tough means we are forever accountable for what we do or what we fail to do. We will never again compromise our responsibilities. Every time we walk into Mission Control we will know what we stand for. Competent means we will never take anything for granted. We will never be found short in our knowledge and in our skills. — Gene Kranz

I take this as doing the work to protect the people involved. For us this should include ourselves, the people in our organizations, and our customers. Protection is gained when we’re thorough and accountable; sufficient training and resources are given; communication is concise and assertive; and we have an awareness of what is happening.

When I compare the derailment and collisions, what Kranz was speaking too, any emergency I responded to as a fire fighter, or any incident I worked as an engineer there are similarities. They’re the results from the attrition of little things that continued unabated.

Andon Cord for People

Alerting, availability, continuous integration/deployment, error rates, logging, metrics, monitoring, MTBF, MTTF, MVP, observability, reliability, resiliency, SLA, SLI, SLO, telemetry, throughput and uptime.

We build tools and we have all kinds of words and acronyms to help us frame our thoughts around the planning, building, maintaining and supporting of products. We even allow machines to bug us to fix them, including waking us up in the middle of the night. Why don’t we have the same level of response when people break?

One of the many things that came out of the Toyota Production System is Andon. It gives individuals the ability to stop the production line when a problem is found and call for help.

We talk about rapid feedback loops and iterative workflows, but we don’t talk about feedback from the people close to the work as a way of continuous improvement. We should be giving people the ability to pull the cord when there is an issue that impacts the ability for them or someone else on the team to perform. And that doesn’t mean only technical issues.

What would happen if your on-call staff had horrible time that they’re spent after their first night? Imagine if we gave our people the same level of support that we give our machines? Give them an andon cord to pull (i.e. page) that would get them the help they need.

As you’re planning don’t forget about your people. Could you track the frequency of changes happening to your team? Then plot the impact of that against the work completed? Think about providing an andon cord for them. How could you build a culture where people feel responsible to speak up when they see something that doesn’t line up with what we expect?

“People, ideas and technology. In that order!” — John Boyd.

Too many times we think a solution or problem is technical. More often than not it’s about a breakdown of communication and then sometimes not having the right people or protecting them.

The ideas from Boyd are a good example of how our industry fails to fully understand a concept before using it. If you’ve heard the phrase OODA Loop you’ve probably seen a circular image with points for Observe, Orient, Decide and Act. The thing is he never drew just a single loop. He gave a way to frame an environment and a process to help guide us through the unknowns. And it puts the people first by using their experience so when they recognize something for what it is they can act on it immediately. It was always more than a loop. It was a focus on the people and organizations.

Reducing Human Error in Software Based Services

I removed the second half of this, because the ideas weren’t fully thought out yet.

As I’m writing this Texas and Louisiana continue to deal with the impact of Hurricane Harvey. Hurricane Irma is heading towards Florida. Los Angeles just experienced the biggest fire to burn in its history (La Tuna Fire). And in the last three months there have been two different collisions between US Navy vessels and civilian ships that resulted in 17 fatalities and multiple injuries.

The interaction and relationships between people, actions, and events in high risk endeavors is awe inspiring. Put aside the horrific loss of life and think of the amount of stress and chaos involved. Imagine knowing that your actions can have irreversible consequences. Though they can’t be changed I’m fascinated with efforts to prevent them from repeating.

Think of those interactions between people, actions and events as change. There are examples of software systems having [critical][1] or [fatal][2] [consequences][3] when failing to handle them. For most of us the impact might be setbacks delaying work or at most a financial consequence to our employer or ourselves. While the impact may differ there are benefits from learning from professions other than our own that deal with change on a daily basis.

Our job as systems or ops engineers should be on how to build, maintain, troubleshoot, and retire the systems we’re responsible for. But there’s been a shift building that has us focusing more on becoming experts at evaluating new technology.

Advances in our tooling has allowed us to rebuild, replace, or re-provision from failures. This starts introducing complacency, because the tools start to have more context of the issues than us. It shifts our focus away from reaching a better idea of what’s happening.

As the complexity and the number of systems involved increases our ability to understand what is happening and how they interact hasn’t kept up. If you have any third-party dependencies, what are the chances they’re going through a similar experience? How much of an impact does this have on your understanding of what’s happening in your own systems?

Atrophy of Basics Skills

The increased efficiency of our tooling creates a [Jevons paradox][4]. This is an economic idea that as the efficiency of something increases it will lead to more consumption instead of reducing it. Named after William Jevons who in the 19th century noticed that the consumption of coal increased after the release of a new steam engine design. The improvements with this new design increased the efficiency of the coal-fired steam engine over it’s predecessors. This fueled a wider adoption of the steam engine. It became cheaper for more people to use a new technology and this led to the increased consumption of coal.

For us the engineer’s time is the coal and the tools are the new coal-fired engine. As the efficiency of the tooling increases we tend to use more of the engineer’s time. The adoption of the tooling increases while the number of engineers tends to remain flat. Instead of bringing in more people we tend to try to do more with the people we have.

This contributes to an atrophying of the basic skills needed to do the job. Things like troubleshooting, situational awareness, and being able to hold a mental model of what’s happening. It’s a journeyman’s process to build them. Actual production experience is the best teacher and the best feedback is from your peers. Tools are starting to replace the opportunities for people to have those experiences to learn from.

Children of the Magenta and the Dangers of Automation

For most of us improving the efficiency of an engineer’s time will look like some sort of automation. And while there are obvious benefits there are some not so obvious negatives. First, automation can hide the context of what is, has, and will happen from us. How many times have you heard or asked yourself “What’s it doing now?”

“A lot of what’s happening is hidden from view from the pilots. It’s buried. When the airplane starts doing something that is unexpected and the pilot says ‘hey, what’s it doing now?’ — that’s a very very standard comment in cockpits today.’”
– William Langewiesche, journalist and former American Airlines captain.

In May 31, 2009 228 people died when Air France 447 lost altitude from 35,000 feet and pancaked into the Atlantic Ocean. A pressure probe had iced over preventing the aircraft from determining its speed. This caused the autopilot to disengage and the “fly-by-wire” system switched into a different mode.

“We appear to be locked into a cycle in which automation begets the erosion of skills or the lack of skills in the first place and this then begets more automation.” – William Langewiesche

Four years later Asiana Airlines flight 214 crashed on their final approach into SFO. It came in short of the runway striking the seawall. The NTSB report shows the flight crew mismanaged the initial approach and the aircraft was above the desired glide path. The captain responded by selecting the wrong autopilot mode, which caused the auto throttle to disengage. He had a faulty mental model of the aircraft’s automation logic. This over-reliance on automation and lack of understanding the systems was cited as major factors leading to the accident.

This has been described as “Children of the Magenta” due to the information presented in the cockpit from the autopilot being magenta in color. It was coined by Capt. Warren “Van” Vanderburgh at American Airlines Flight Academy. There are different levels of automation in an aircraft and he argues that by reducing the level of automation you can reduce the workload in some situations. The amount of automation should match the current conditions of the environment. It’s a 25 minute video that’s worth watching, but it boils down to this. Pilots have become too dependent on automation in general and are losing the skills needed to safely control their aircraft.

This led a Federal Aviation Administration task force on cockpit technology to urge airlines to have their pilots spend more time flying by hand. This focus of returning to the basic skills needed is similar to the [report released][5] from the Government Accountability Office (GAO) regarding the impact of maintenance and training on the readiness of the US Navy.

Based on updated data, GAO found that, as of June 2017, 37 percent of the warfare certifications for cruiser and destroyer crews based in Japan—including certifications for seamanship—had expired. This represents more than a fivefold increase in the percentage of expired warfare certifications for these ships since GAO’s May 2015 report.

Complexity and Chaos

Complexity has a tendency to bring about chaos. Due to the difficulty of people to understand the system(s) and an incomplete visibility into what is happening. If this is the environment that we find ourselves working in we can only control the things we bring into it. That includes making sure we have a grasp on the basics of our profession and maintain the best possible understanding of what’s happening with our systems. This should allow us to work around issues as they happen and decrease our dependencies on our tools. There’s usually more than one way to get information or make a change. Knowing the basics and staying aware of what is happening can get us through the chaos.

Valid Security Announcements

A checklist item for the next time you need to create a landing page for a security announcement.

Make sure the certificate and the whois on the domain being used actually references the name of your company.

My wife sends me a link to this.

I then find the page for the actual announcement from Equifax.

Go to the dedicated website and find it’s using a Cloudflare SSL certificate.

The certificate chain doesn’t mention Equifax other than the DNS names used in the cert (* and

What happens if I do a whois?

$ whois
   Registry Domain ID: 2156034374_DOMAIN_COM-VRSN
   Registrar WHOIS Server:
   Registrar URL:
   Updated Date: 2017-08-25T15:08:31Z
   Creation Date: 2017-08-22T22:07:28Z
   Registry Expiry Date: 2019-08-22T22:07:28Z
   Registrar: MarkMonitor Inc.
   Registrar IANA ID: 292
   Registrar Abuse Contact Email:
   Registrar Abuse Contact Phone: +1.2083895740
   Domain Status: clientDeleteProhibited
   Domain Status: clientTransferProhibited
   Domain Status: clientUpdateProhibited
   DNSSEC: unsigned

Now I want to see if I’m impacted. Click on the “Check Potential Impact” and I’m taken to a new site (

And we get another certificate and a whois lacking any reference back to Equifax.

$ whois
   Registry Domain ID: 2157515886_DOMAIN_COM-VRSN
   Registrar WHOIS Server:
   Registrar URL:
   Updated Date: 2017-08-29T04:59:16Z
   Creation Date: 2017-08-28T17:25:35Z
   Registry Expiry Date: 2018-08-28T17:25:35Z
   Registrar: Amazon Registrar, Inc.
   Registrar IANA ID: 468
   Registrar Abuse Contact Email:
   Registrar Abuse Contact Phone: +1.2062661000
   Domain Status: ok
   Name Server: NS-1426.AWSDNS-50.ORG
   Name Server: NS-1667.AWSDNS-16.CO.UK
   Name Server: NS-402.AWSDNS-50.COM
   Name Server: NS-934.AWSDNS-52.NET
   DNSSEC: unsigned

I’m not suggesting that the site equifaxsecurity2017 is malicious, but if you’re going to the trouble of setting up a page like this make sure your certificate and whois actually references back to the company making the announcement. If you look at the creation dates for the domain and the Not Valid Before dates on the certs they had plenty of time to get domains and certificates created that would reference themselves.

How I Minimize Distractions

Being clear on what’s important. Being honest with myself about my own habits. Protect how and where my attention is being pulled. Manage my own calendar.

I know I’m going to want to work on personal stuff, surf a bit, and do some reading. Knowing that I’m a morning person I start my day off before work on something for myself. Usually it falls under reading, writing, or coding.

After that I usually scan through my feed reader and read anything that catches my eye, surf a few regular sites and check my personal email.

When I do start work the first thing I do is about one hour of reactive work. All of those notifications and requests needing my attention get worked on. Right now this usually looks like catching up on email, Slack, Github, and Trello.

I then try to have a few chunks of time blocked off in my calendar throughout the week as “Focused Work”. This creates space to focus on what I need to work on while leaving time to be available for anyone else.

The key has been managing my own calendar to allow time for my attention to be directed on the things I want, the things I need, and the needs of others.

I do keep a running text file where I write out the important or urgent things that need to be done for the day. I’ll also add notes when something interrupts me. When I used to write this out on paper I used this sheet from Dave Seah called The Emergent Task Timer. I found it helped to write out what needs to be done each day and to track what things are pulling my attention away from more important things.

Because the type of work that I do can be interrupt driven I orientate my team in a similar fashion. By creating that same uninterrupted time for everyone it allows us to minimize the impact of distractions. During the week there are blocks of time where one person is the interruptible person while everyone else snoozes notifications in Slack, ignores email and gets some work done.

This also means focusing on minimizing the impact of alert / notification fatigue. Have a zero tolerance on things that page you repeatedly or adds unnecessary notifications to your day

The key really is just those four things I listed at the beginning.

You have to be clear in what is important both for yourself and for those you’re accountable to.

You have to be honest with yourself about your own habits. If you keep telling yourself you’re going to do something, but you never do… well there’s probably something else at play. Maybe you’re more in love with the idea rather than with the doing it.

You need to protect your attention from being pulled away for things that are not important or urgent.

You can work towards those three points by managing your own calendar. Besides tracking the passage of days a calendar schedules where your attention will be focused on. If you don’t manage it others will manage it for you.

I view these as ideals that I try to aim for. The circumstances I’m in might not allow me to always to so, but it does give me direction on how I work each day.

And yeah when I’m in the office I use headphones.