Take a Break

SysAdvent is taking a break this year and I suggest you do the same.

Instead of cramming to stay current, getting ready for your next job interview, and working late, spend the month on something else.

One month focusing on a series of blog posts, books, or tutorials isn’t going to make up for anything you need to catch up on.

And if you’re like me, it will take you days to unwind from work before you can actually relax. So instead focus on your family, friends, and yourself.

Do something that doesn’t require a screen, to be compiled, or a git push.

There will be plenty more months ahead to learn.

Too Many Tools

A couple of months back I decided to replace macOS with Linux on my personal computer. This was based around two things. The first was the idea that I was using too many different tools on a daily basis and I wanted to thin out the list. For some unknown reason I was using multiple text editors (Ulysses, iA Writer, FoldingText, and nvALT). The second was the frustration I’ve experienced with the changes in macOS from the influence of iOS, quality issues, and Apple’s history of shedding one adapter for another in its hardware.

I spent a fair amount of time debating whether this was really necessary or whether I was just looking for another thing to tinker with. Then there was the question of whether I would just install Linux on my MacBook Air or go looking for a new laptop. In the end I decided that I wanted to use the same OS I use for work on a new laptop that would have the best support for Linux. So I started looking for a replacement for the MacBook Air and Mac Mini I was using for personal use.

The first laptop I bought for myself was a Titanium PowerBook with Mac OS X and OS 9. At the time I was working as a contractor at IBM Global Services doing Unix support for SAP systems. There were plenty of different operating systems being used, including a mix of mainframe, HP-UX, SunOS, Solaris, IRIX, Tru64, AIX, NT4, and OS/2 Warp. There were also a number of us running personal systems with BeOS, NeXTSTEP, NetBSD, SuSE, Slackware, or Red Hat. The point being there were a lot of different ideas and approaches on what an operating system could be.

This was when the operating system you used was generally decided by what you needed to do with it. The majority of uses could be satisfied with Windows, many creative and print (design and layout) needs were handled with Mac OS, and a smaller number could be addressed with a Unix-like operating system.

For people who were building and supporting things for the web, OS X delivered something that was missing. It brought a mix of a Unix-like OS, a polished GUI, and support from a single-source vendor. The core pieces were released and open-sourced as Darwin, Jordan Hubbard (a co-founder of FreeBSD) joined Apple, and there were at least three different efforts around building or porting package management tools for it. This new direction from Apple created a lot of excitement. Since that TiBook I’ve owned four different Macs and it’s only been during the last year that I’ve considered something else.

Today the OS doesn’t matter as much.

Apple seems to be losing its place as the OS of choice for development work. Microsoft has added a Linux subsystem and a built-in SSH client to Windows, contributed research and work around containers and unikernels, released a version of Linux (Azure Sphere), and runs one of the top three public clouds. Linux now looks to be the leading OS for servers and mobile. You have Android, Chrome OS, IoT devices, support for ARM-based devices, and Linux is pulling its weight in iron in the private and public clouds. Extending the computer to mobile and having the cloud as a way to pass messages between devices has opened up our options.

Now, as to which distro would replace macOS: I thought about and tried Arch, but it just ended up reminding me of Gentoo. I don’t want installing and configuring Linux to be the equivalent of re-rolling the dice for the best D&D character sheet. The majority of the machines I work with are using one of the Ubuntu LTS releases. Because of this it became the obvious pick for the OS, but it was the choice of hardware that took the most time.

Originally I was looking at System76, Lenovo, and Dell. At the time there was a bit of a debate between System76 and GNOME over firmware updates from LVFS, and some of the comments from System76 left me with the impression that this is a company I’ll avoid for now. It was a review from Linux Journal that introduced me to the Librem laptops from Purism. Their focus on privacy and security, delivered in a product with a simple aesthetic, is what sold me.

I settled on buying a Librem 15v3 and planned to install Ubuntu 18.04 LTS. I did give thought to sticking with Purism’s own OS (PureOS) and I may try it at a later time, but for now I want to be using the same tools at work and at home.

The thing that was the most intimidating was the idea of using Linux as a desktop replacement again. It’s been almost two decades since I last used a GUI in Linux. That’s a long time to catch up on changes in GNOME and KDE, and then I came across AppImage, Flatpak, and Snaps.

Deliver Update Publish Execute System (DUPES)

My first thought was: great, another set of package tools to navigate. But I think that was a short-sighted view of what they provide. They’re more than just package and dependency managers. They’re closer to the architecture of an app store paired with the process to execute an app. To make it easier for myself I just started thinking of them as “dupes” instead of a package management tool.

Deliver: how the application is delivered to the user.

Update: how the application updates (e.g., auto, delta).

Publish: the process of building the app with dependencies for delivery.

Execute: how the application is run and the level of isolation provided.

System: tooling to provide the previous four points.

The Browser, Editor, and Terminal

Most of the time I’m using one of three different types of applications: a browser, a text editor, or a terminal. The other native applications are tools for specific purposes. In some cases I use the iOS version, because it’s easier to hold a tablet while reading, streaming something, or playing music.

Before I placed the order I started listing out the applications I use on a regular basis. A number of these apps were macOS / iOS only, but the ones I really cared about keeping had either a native Linux or an iOS version available. The idea became to use Linux on the laptop and continue with the pairing of iOS and watchOS on the phone, tablet, and watch. All I needed was to add a stand for the iPad Pro, reuse the keyboard from the Mac Mini, and use the cloud to transfer files, and I was set.

Any development, personal projects, or anything technical would be done on the Librem. Things like budgeting, health / fitness tracking, reading, writing, and streaming would be on the iPad Pro, iPhone, or Apple Watch.

Note: The lack of an IM client that supports Messages has been the most difficult hurdle. My wife and I both have iPhones and she has a Mac. By giving up macOS I’m forced to rely on my phone or tablet when chatting with her. And if one of us wants to send the other a link there’s a bit of a shell game needed if I want it on or send it from Linux.

Impressions after a Month

So far I’ve been happy with my choice. The only issues I’ve had are that, after an Ubuntu update, the firmware needed for the Bluetooth module was uninstalled, and that the wireless seems to require wicd as the network manager instead of managing the wireless through settings or the GNOME shell.

The configuration I ordered for the Librem included a 500GB NVMe drive that’s used for the OS and a second 120GB SSD. I run GitLab and Nextcloud locally as containers, both with their data mounted on the second drive. This gives me some flexibility in how to transfer files between devices and I’ve enjoyed having a local set of tools I control.

There are a few things on the list to consider: replacing 1Password with Bitwarden and deciding how I’ll do backups. Right now I’m thinking of something like Restic with B2 as the remote repository.

Overall the change feels like I made the right choice.

Brain Dump

Ever wonder why you have so many ideas when you’re in the bathroom taking a shower or a bath, brushing your teeth, or doing that other thing we use the room for?

Ever notice that there are no screens in there to pull or hold our attention?

Take a minute and count how many screens are around you right now. How many different TVs, phones, tablets, computers, e-readers, portable video game consoles, smartwatches, or VR headsets are within your sight? How many of these are in the bathroom?

p.s. please don’t admit to owning a pair of AR glasses.

While having a discussion with my wife, who had a MacBook on her lap, an iPad closed on the ottoman next to her feet, and her iPhone within reach, I said…

“the nuance is blurred” and then I had no reply as she was still reading whatever it was she was reading.

I waited a minute and then asked her, “You know I just asked you a question, right?”

She replied “yes, something about being blurred.”

I repeated what I had said “yeah the nuance is blurred” as I started to walk to the bathroom I added “Blur as in the band, as in song number two, as in what I’m going to do.”

Which I did and then noticed there are no screens in the bathroom and wondered maybe this is why we have flashes of clarity and creativity in here. Which reminded me we recently joked about getting an Amazon Echo in here so we could ask Alexa to take notes so we don’t forget things.

So as I step back into the living room, even before I’m through the bathroom door I’m calling out to her “Don’t say anything! I had an idea that I need to write down before I forget!” She was looking at me as I stepped out and I was left with the impression she was holding onto something to tell me.

I grab my laptop, sit down, and start writing. At this point I’ve read everything that you’ve read so far to her and she laughs.

We talk about the fact there’s a screen in every room in the house except for the bathroom. We go over the semantics of whether there are really screens in the bedroom. Because you know the iPads we both have can follow us around. She uses hers throughout the day and I leave mine on the bedside table to watch something when I go to bed. Which isn’t really going to bed, as it’s just lying down and watching TV. No wonder I get so little sleep.

We talk about the screens even in our car. Hell, the car even emails and sends us text messages when it gets low on windshield wiper fluid. Our car has a drinking problem and it likes to let us know about it.

Used to be the only screen in the house was the one television set the family had and if you were well-off your parents had a second set in their bedroom to watch Johnny Carson together.

Instead we watch different things on our own screen that we hold out in front of us or sit on our bellies. Recently my wife asked if there’s an app so we can both watch the same thing synced up on our individual iPads. We laughed when we realized that yes there is and it’s called a television.

We seem to miss the opportunities we have to build something new and we underestimate the value of what we lost. Instead we build things that place each of us in our own individual world. Walled off with wireless headsets and a microphone, where you can be talking with someone who is ignoring the people around them or maybe you’re just listening to music that a machine picked to play for you.

Yes there is usually at least one screen in the bathroom called a mirror. It’s passive and only reflects back what we bring to it. That’s the key difference. That screen is passive and is used as a tool so we can brush our teeth or have a moment of self reflection.

Anyways, just wanted to capture this before I lost it.

I never did find out what she was holding onto to tell me.

Attrition from The Small Things

If you find yourself thinking about what work lies ahead in 2018 consider the following. Doesn’t matter if you’re only thinking of the first month, quarter or the entire year. How many changes can your team handle before it culminates to where they’re no longer capable of performing their normal operations? How about for yourself?

While you’re considering your answer frame change as anything from work related to existing or new projects, ongoing work supporting existing products, incidents, pivots in priorities or business models, new regulations, reorgs, to just plain interruptions.

Every individual and team has a point where their capabilities become ineffective. We have different tools and methodologies that track time spent on tasks. We track complexity (story points) delivered in a given timeframe (sprints). Burn up charts show tasks added across time, but this is a very narrow view of change. A holistic view of the impact of change on a team is missing. One that can show how change wears down on the people we’re responsible for and ourselves.

Reduction of Capability

Before we started needing data for most decisions we placed trust in individuals to do their job. When people were being pushed too far too fast they might push back. This still happens, but the early signs of it are often drowned out by data or a mantra of stick with the process. It’s developed into a narrow focus that has eroded trust in experience to drive us towards our goals. This has damaged some of the basic leadership skills needed and it has focused our industry on efficiency over effectiveness. I’m also starting to think this is creating a tendency where people are second guessing their own abilities due to the inabilities of others.

This reinforces a culture where leaders stop trusting the opinions of people doing the work or those who are close to it. When people push back the leaders have a choice to either listen and take into account the feedback or to double down on the data and methods used. This contributed to creating the environments where the labels “10x”, “Rock Stars” and “Ninjas” started being applied to engineers, designers, and developers.

heroics — “behavior or talk that is bold or dramatic, especially excessively or unexpectedly so: the makeshift team performed heroics.” — New Oxford American Dictionary

Ever think about why we apply the label heroics or hero when teams or people are able to pull through in the end? If the output of work and the frequency of changes were plotted, I’d bet you’d find that the point where sustaining normal operations became impracticable or improbable was passed before these labels were used.

The Amtrak train in last month’s fatal derailment, which killed three people, was traveling at more than twice the speed limit (80 mph in a 30 mph zone). The automated system (positive train control) designed to prevent these types of conditions from happening was installed but not activated. Was this fatal accident on the inaugural run of a new Amtrak route an example of where normal operations were no longer possible? Is this any different than the fatal collisions involving US Navy ships last year due to over-burdened personnel and equipment?

For the derailment it looks like a combination of failing to use available safety systems and failing to follow safety guidelines contributed to the accident. There’s also the question of whether the crew was given training to build awareness of the new route. The Navy collisions look to be the result of the strain of trying to do too much with too few people and resources. This includes individuals working too many hours, a reduction in training, failure to verify readiness, and a backlog of maintenance on the equipment, aircraft and ships.

The cadence of change was greater than what these organizations were capable of supporting.

For most of us working as engineers, designers, developers, product managers, or as online support we wouldn’t consider ourselves to be in a high-risk occupation. But the work we do impacts people’s lives in small to massive ways. These examples are something that we should be learning from. We should also acknowledge that we’re not good at being aware of the negative impacts of the tempo of change on our people.

There’s a phrase and image that can illustrate the dependencies between people, processes, and systems. It’s called the “Swiss Cheese Model” and it highlights that when the shortcomings in each of them line up, a problem can slip through. It also shows how the strengths of each are able to support the weaknesses of the others.

Swiss Cheese Model of Accident Causation

Illustration by David Mack CC BY-SA 3.0.
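One rough way to put numbers on the model is to treat each layer (people, process, system) as an independent chance of missing a problem. The sketch below assumes independence, which real organizations rarely have, and every probability in it is made up for illustration.

```python
import math


def chance_of_slipping_through(miss_rates):
    """Probability a problem passes every layer, assuming the layers
    fail independently (a simplifying assumption, not a claim)."""
    return math.prod(miss_rates)


# Hypothetical miss rates for people, process, and system layers.
healthy = chance_of_slipping_through([0.1, 0.1, 0.1])

# Same layers, but the "people" layer is worn down and misses half the time.
worn_down = chance_of_slipping_through([0.5, 0.1, 0.1])

if __name__ == "__main__":
    print(f"healthy layers:   {healthy:.4f}")
    print(f"worn-down people: {worn_down:.4f}")
```

Under these assumed numbers, weakening a single layer makes a problem five times more likely to get through, even though the other two layers are unchanged.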

We have runbooks, playbooks, incident management processes, and things to help us understand what is happening in our products and systems. Remember that these things are not absolute and they’re fallible. The systems and processes we put into place are never final, they’re ideas maintained as long as they stay relevant and then removed when they are no longer necessary. This requires awareness and diligence.

In any postmortem I’ve participated in or read through there were early signs that conditions were unusual. Often people fail to recognize a difference between what is happening and what is expected to happen. This is the point where a difference can start to develop into a problem if we ignore it. If you think you see something that doesn’t seem right you need to speak up.

After the Apollo 1 fire Gene Kranz gave a speech to his team at NASA that is known as the Kranz Dictum. He begins by stating they work in a field that cannot tolerate incompetence. He then immediately holds himself and every part of the Apollo program accountable for their failures to prevent the deaths of Gus Grissom, Ed White, and Roger Chaffee.

From this day forward, Flight Control will be known by two words: “Tough” and “Competent.” Tough means we are forever accountable for what we do or what we fail to do. We will never again compromise our responsibilities. Every time we walk into Mission Control we will know what we stand for. Competent means we will never take anything for granted. We will never be found short in our knowledge and in our skills. — Gene Kranz

I take this as doing the work to protect the people involved. For us this should include ourselves, the people in our organizations, and our customers. Protection is gained when we’re thorough and accountable; sufficient training and resources are given; communication is concise and assertive; and we have an awareness of what is happening.

When I compare the derailment and collisions, what Kranz was speaking to, any emergency I responded to as a firefighter, or any incident I worked as an engineer, there are similarities. They’re the results of the attrition of little things that continued unabated.

Andon Cord for People

Alerting, availability, continuous integration/deployment, error rates, logging, metrics, monitoring, MTBF, MTTF, MVP, observability, reliability, resiliency, SLA, SLI, SLO, telemetry, throughput and uptime.

We build tools and we have all kinds of words and acronyms to help us frame our thoughts around the planning, building, maintaining and supporting of products. We even allow machines to bug us to fix them, including waking us up in the middle of the night. Why don’t we have the same level of response when people break?

One of the many things that came out of the Toyota Production System is Andon. It gives individuals the ability to stop the production line when a problem is found and call for help.

We talk about rapid feedback loops and iterative workflows, but we don’t talk about feedback from the people close to the work as a way of continuous improvement. We should be giving people the ability to pull the cord when there is an issue that impacts the ability for them or someone else on the team to perform. And that doesn’t mean only technical issues.

What would happen if your on-call staff had such a horrible time that they’re spent after their first night? Imagine if we gave our people the same level of support that we give our machines. Give them an andon cord to pull (i.e., a page) that would get them the help they need.

As you’re planning don’t forget about your people. Could you track the frequency of changes happening to your team? Then plot the impact of that against the work completed? Think about providing an andon cord for them. How could you build a culture where people feel responsible to speak up when they see something that doesn’t line up with what we expect?
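As a starting point for tracking change against work completed, here is a minimal sketch. The change log, the weekly task counts, and the threshold of three changes per week are all hypothetical; a real team would pull these from its own ticketing and incident tools and pick its own limit.

```python
from collections import Counter
from datetime import date

# Hypothetical change log: (date, kind of change). All entries are made up.
CHANGES = [
    (date(2018, 1, 2), "incident"),
    (date(2018, 1, 3), "priority pivot"),
    (date(2018, 1, 9), "interruption"),
    (date(2018, 1, 16), "reorg"),
    (date(2018, 1, 17), "incident"),
    (date(2018, 1, 18), "new project"),
    (date(2018, 1, 19), "incident"),
]

# Hypothetical work completed per ISO week (e.g. tasks closed).
COMPLETED = {1: 12, 2: 11, 3: 5}


def changes_per_week(changes):
    """Count change events per ISO week number."""
    return Counter(day.isocalendar()[1] for day, _kind in changes)


def weekly_report(changes, completed):
    """Return (week, change_count, work_completed) rows sorted by week."""
    counts = changes_per_week(changes)
    weeks = sorted(set(counts) | set(completed))
    return [(w, counts.get(w, 0), completed.get(w, 0)) for w in weeks]


def overloaded_weeks(report, max_changes):
    """Weeks where the number of changes exceeded an assumed threshold."""
    return [week for week, n_changes, _done in report if n_changes > max_changes]


if __name__ == "__main__":
    report = weekly_report(CHANGES, COMPLETED)
    for week, n_changes, done in report:
        print(f"week {week}: {n_changes} changes, {done} tasks completed")
    print("weeks over threshold:", overloaded_weeks(report, max_changes=3))
```

The numbers only start the conversation. A week flagged here is a prompt to ask the team what happened, not a replacement for listening to them.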

“People, ideas and technology. In that order!” — John Boyd.

Too many times we think a solution or problem is technical. More often than not it’s about a breakdown of communication and then sometimes not having the right people or protecting them.

The ideas from Boyd are a good example of how our industry fails to fully understand a concept before using it. If you’ve heard the phrase OODA Loop you’ve probably seen a circular image with points for Observe, Orient, Decide and Act. The thing is he never drew just a single loop. He gave a way to frame an environment and a process to help guide us through the unknowns. And it puts the people first by using their experience so when they recognize something for what it is they can act on it immediately. It was always more than a loop. It was a focus on the people and organizations.

Reducing Human Error in Software Based Services

I removed the second half of this, because the ideas weren’t fully thought out yet.

As I’m writing this Texas and Louisiana continue to deal with the impact of Hurricane Harvey. Hurricane Irma is heading towards Florida. Los Angeles just experienced the biggest fire to burn in its history (La Tuna Fire). And in the last three months there have been two different collisions between US Navy vessels and civilian ships that resulted in 17 fatalities and multiple injuries.

The interactions and relationships between people, actions, and events in high-risk endeavors are awe-inspiring. Put aside the horrific loss of life and think of the amount of stress and chaos involved. Imagine knowing that your actions can have irreversible consequences. Though those consequences can’t be changed, I’m fascinated with efforts to prevent them from repeating.

Think of those interactions between people, actions and events as change. There are examples of software systems having [critical][1] or [fatal][2] [consequences][3] when failing to handle them. For most of us the impact might be setbacks delaying work or at most a financial consequence to our employer or ourselves. While the impact may differ there are benefits from learning from professions other than our own that deal with change on a daily basis.

Our job as systems or ops engineers should be on how to build, maintain, troubleshoot, and retire the systems we’re responsible for. But there’s been a shift building that has us focusing more on becoming experts at evaluating new technology.

Advances in our tooling have allowed us to rebuild, replace, or re-provision after failures. This starts introducing complacency, because the tools start to have more context of the issues than we do. It shifts our focus away from reaching a better idea of what’s happening.

As the complexity and the number of systems involved increases our ability to understand what is happening and how they interact hasn’t kept up. If you have any third-party dependencies, what are the chances they’re going through a similar experience? How much of an impact does this have on your understanding of what’s happening in your own systems?

Atrophy of Basic Skills

The increased efficiency of our tooling creates a [Jevons paradox][4]. This is an economic idea that as the efficiency of something increases it will lead to more consumption instead of reducing it. It’s named after William Jevons, who in the 19th century noticed that the consumption of coal increased after the release of a new steam engine design. The improvements with this new design increased the efficiency of the coal-fired steam engine over its predecessors. This fueled a wider adoption of the steam engine. It became cheaper for more people to use a new technology and this led to the increased consumption of coal.

For us the engineer’s time is the coal and the tools are the new coal-fired engine. As the efficiency of the tooling increases we tend to use more of the engineer’s time. The adoption of the tooling increases while the number of engineers tends to remain flat. Instead of bringing in more people we tend to try to do more with the people we have.
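The arithmetic behind that analogy can be sketched in a few lines. The figures below are invented purely for illustration: better tooling halves the hours each task takes, but the cheaper cost per task triples how many tasks get asked of the same team.

```python
def engineer_hours(tasks_per_week, hours_per_task):
    """Total engineer time consumed in a week."""
    return tasks_per_week * hours_per_task


# Hypothetical numbers: before the new tooling, 10 tasks at 4 hours each.
before = engineer_hours(tasks_per_week=10, hours_per_task=4.0)

# After: each task takes half the time, but demand grows to 30 tasks.
after = engineer_hours(tasks_per_week=30, hours_per_task=2.0)

if __name__ == "__main__":
    print(f"before: {before} hours/week")  # 40.0
    print(f"after:  {after} hours/week")   # 60.0
```

Efficiency per task doubled, yet total consumption of the engineer’s time went up by half, which is the pattern Jevons described for coal.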

This contributes to an atrophying of the basic skills needed to do the job. Things like troubleshooting, situational awareness, and being able to hold a mental model of what’s happening. It’s a journeyman’s process to build them. Actual production experience is the best teacher and the best feedback is from your peers. Tools are starting to replace the opportunities for people to have those experiences to learn from.

Children of the Magenta and the Dangers of Automation

For most of us improving the efficiency of an engineer’s time will look like some sort of automation. And while there are obvious benefits there are some not so obvious negatives. First, automation can hide the context of what is, has, and will happen from us. How many times have you heard or asked yourself “What’s it doing now?”

“A lot of what’s happening is hidden from view from the pilots. It’s buried. When the airplane starts doing something that is unexpected and the pilot says ‘hey, what’s it doing now?’ That’s a very very standard comment in cockpits today.”
– William Langewiesche, journalist and former American Airlines captain.

On June 1, 2009, 228 people died when Air France 447 lost altitude from 35,000 feet and pancaked into the Atlantic Ocean. A pressure probe had iced over, preventing the aircraft from determining its speed. This caused the autopilot to disengage and the “fly-by-wire” system switched into a different mode.

“We appear to be locked into a cycle in which automation begets the erosion of skills or the lack of skills in the first place and this then begets more automation.” – William Langewiesche

Four years later Asiana Airlines flight 214 crashed on its final approach into SFO. It came in short of the runway, striking the seawall. The NTSB report shows the flight crew mismanaged the initial approach and the aircraft was above the desired glide path. The captain responded by selecting the wrong autopilot mode, which caused the auto throttle to disengage. He had a faulty mental model of the aircraft’s automation logic. This over-reliance on automation and lack of understanding of the systems were cited as major factors leading to the accident.

This has been described as “Children of the Magenta” due to the information presented in the cockpit from the autopilot being magenta in color. The phrase was coined by Capt. Warren “Van” Vanderburgh at American Airlines Flight Academy. There are different levels of automation in an aircraft and he argues that by reducing the level of automation you can reduce the workload in some situations. The amount of automation should match the current conditions of the environment. His talk is a 25-minute video that’s worth watching, but it boils down to this: pilots have become too dependent on automation in general and are losing the skills needed to safely control their aircraft.

This led a Federal Aviation Administration task force on cockpit technology to urge airlines to have their pilots spend more time flying by hand. This focus of returning to the basic skills needed is similar to the [report released][5] from the Government Accountability Office (GAO) regarding the impact of maintenance and training on the readiness of the US Navy.

Based on updated data, GAO found that, as of June 2017, 37 percent of the warfare certifications for cruiser and destroyer crews based in Japan—including certifications for seamanship—had expired. This represents more than a fivefold increase in the percentage of expired warfare certifications for these ships since GAO’s May 2015 report.

Complexity and Chaos

Complexity has a tendency to bring about chaos, because people have difficulty understanding the system(s) and have incomplete visibility into what is happening. If this is the environment that we find ourselves working in, we can only control the things we bring into it. That includes making sure we have a grasp on the basics of our profession and maintain the best possible understanding of what’s happening with our systems. This should allow us to work around issues as they happen and decrease our dependencies on our tools. There’s usually more than one way to get information or make a change. Knowing the basics and staying aware of what is happening can get us through the chaos.