It Was Never About Ops


For a while I’ve been thinking about Susan J. Fowler’s Ops Identity Crisis post. There were bits I agreed with and some I did not.

My original reaction to the post was pretty pragmatic. I had concerns (and still do) about adding more responsibilities onto software engineers (SWE). It’s already fairly common to have them responsible for QA, database administration and security tasks but now ops issues are being considered as well.

I suspect there is an as-yet unmeasured negative impact on the productivity of teams that keep expanding the role of SWE. You end up deprioritizing the operations and systems related concerns, because new features and bug fixes will always win out when scheduling tasks.

Over time I’ve refined my reaction to this. The traditional operations role is the hole that you’ve been stuck in and engineering is how you get out of that hole. It’s not that you don’t need Ops teams or engineers anymore. It’s simply that you’re doing it wrong.

It was never solely about operations. There’s always been an implied hint of manual effort for most Ops teams. We’ve seen a quick return from having SWE handle traditional ops tasks, but that doesn’t mean that the role won’t be needed anymore. Previously we’ve been able to add people to ops to continue to scale with the growth of the company, but those days are gone for most. What needs to change is how we approach the work and the skills needed to do the job.

When you’re unable to scale to meet demand you’ll end up feeling stuck in a reactive and constantly interruptible mode of working. This can then make operations feel more like a burden than a benefit. This way of thinking is part of the reason why I think many of the tools and services created by ops teams are not thought of as actual products.

Ever since we got an API to interact with the public cloud and then later the private cloud we’ve been trying to redefine the role of ops. As the ecosystem of tools has grown and changed over time we’ve continued to refine that role. While thinking on the impact of Fowler’s post I know that I agree with her that the skills needed are different from what they were eight years ago, but the need for the role hasn’t decreased. Instead it’s grown to match the growth of the products it has been supporting. This got me thinking about how I’ve been working during those eight years, and looking back it’s easy to see what worked and what didn’t. These are the bits that worked for me.

First, don’t call it ops anymore. Sometimes names do matter. By continuing to use “Ops” in the team name or position we continue to hold onto that reactive mindset.

Make scalability your main priority and start building the products and services that become the figurative ladder to get you out of the hole you’re in. I believe you can get there by focusing on three things: Reliability, Availability, and Serviceability.

For anything to scale it first needs to be reliable and available, otherwise people are not going to use it. To meet the demands of the growth you’re seeing, the products also need to be serviceable. You must be able to safely make changes to them in a controlled and observable fashion.

Every product built out of these efforts should be considered a first-class citizen. The public-facing services and the internal ones should be considered equals. If your main priority is scaling to meet the demands of your growth, then there should be no difference in how you design, build, maintain, or consider anything no matter where it is in the stack.

Focus on scaling your engineers and making them force multipliers for the company. There is a cognitive load placed on individuals, teams, and organizations for the work they do. Make sure to consider this in the same way we think of the load capacity of a system. At a time when we’re starting to see companies break their products into smaller more manageable chunks (microservices), we’re close to doing the exact opposite for our people and turning the skills needed to do the work into one big monolith.

If you’ve ever experienced burnout yourself, or seen any of your peers go through it, what do you think is going to happen as we continue to pile on additional responsibilities?

The growth we’re seeing is the result of businesses starting to run into web scale problems.

Web scale describes the tendency of modern sites – especially social ones – to grow at (far-)greater-than-linear rates. Tools that claim to be “web scale” are (I hope) claiming to handle rapid growth efficiently and not have bottlenecks that require rearchitecting at critical moments. The implication for “web scale” operations engineers is that we have to understand this non-linear network effect. Network effects have a profound effect on architecture, tool choice, system design, and especially capacity planning.

Jacob Kaplan-Moss
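
To make “greater-than-linear” a little more concrete, here is a minimal sketch. It assumes, purely for illustration, that the load on a social site tracks the number of possible pairwise connections between its users. That model and the function names are my own assumptions and are not taken from the quote above.

```python
# Toy model (an assumption for illustration): load is proportional to the
# number of possible pairwise connections between users, n * (n - 1) / 2.

def pairwise_connections(users: int) -> int:
    """Return how many distinct user-to-user pairs exist among `users`."""
    return users * (users - 1) // 2


def growth_table(user_counts):
    """Pair each user count with its number of possible connections."""
    return [(n, pairwise_connections(n)) for n in user_counts]


if __name__ == "__main__":
    # Each step doubles the user count.
    for users, links in growth_table([1_000, 2_000, 4_000]):
        print(f"{users:>6} users -> {links:>10,} possible connections")
```

In this toy model every doubling of users roughly quadruples the possible connections, which is one way to picture why adding people or hours at a linear rate eventually stops keeping up.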

The real problem I think we’ve always been facing is making sure you have the people you need to do the job. Before we hit web scale issues we could usually keep up by having people pull all-nighters, working through weekends or, if you’re lucky, hiring more. Hard work simply can no longer make up for shortcomings in planning or preparation. The problem has never been with the technology or the challenges you’re facing. It’s always been about having the right people.

In short you can …

  1. Expect services to grow at a non-linear rate.
  2. Keep up with this growth by scaling both your people and your products.
  3. Scale your people by giving them the time and space to focus on scaling your products.
  4. Scale your products by focusing on Reliability, Availability, and Serviceability.

To think that new tools or services will be the only (or main) answer to the challenges brought on by growth is a mistake. You will always need people to understand what is happening and then design, implement, and maintain your solutions. These new tools and services should increase the impact of each member of your team, but it is a mistake to think they will replace a role.