Ops Engineer?

Let’s be honest: systems administration, whether on bare metal or in the cloud, is often worse than a thankless job. If the site is up and running, you’ll get no thanks. If it goes down, you’d better get it back up quickly…and then explain what just broke. If you need to schedule downtime, well, you have to schedule it for 4 AM on a Saturday and still show up chipper on Monday.

I’ve seen too many ops engineers work themselves to the bone fire-fighting, scaling, and migrating the foundation on which entire businesses stand, as if running a full-on marathon…at a sprint. They get no chance to breathe, find normalcy, or arrive at the autonomy and purpose we all seek in our work.

Not on my watch. Here are the practices we employ at LearnZillion to make sure our environment is a livable, enjoyable, and rewarding place to be an ops engineer.

We maintain a sane software engineer to ops engineer ratio. I recently talked with an ops engineer who was responsible for the systems behind the company’s 60-person software engineering team. I wish this were an extreme situation, or at least a sustainable one, but it’s not. This isn’t the first time I’ve heard it either. Whether the software engineers are great or sucky, you’re in for a rough ride when the ratio is stacked against you. Don’t let this happen. Systems take serious work to build and maintain. Don’t ever let an employee drown in work.

We deploy during working hours whenever possible. Our engineering team practices no-downtime, continuous delivery within a time window that allows for issues to shake out before staff go home for the day or weekend. We typically ship Monday through Thursday 8 AM to 3 PM. If a completely shippable deliverable misses that window, we often wait until the next reasonable workday to deploy. We don’t want anyone, in software or in ops, paged while out of the office. It’s a terrible way to live. Strive to keep work at work.
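
For what it’s worth, a window like this is easy to encode in a deploy script. Here is a minimal sketch, not our actual tooling: the function name inDeployWindow is made up for illustration, and it assumes the machine’s local time zone is the one the team works in.

```javascript
// Hypothetical helper: returns true when `date` falls inside the deploy
// window described above (Monday through Thursday, 8 AM up to 3 PM).
// Assumption: the local time zone of the machine is the team's time zone.
function inDeployWindow(date) {
  var day = date.getDay();     // 0 = Sunday, 1 = Monday, ..., 6 = Saturday
  var hour = date.getHours();  // 0-23, local time
  var isShipDay = day >= 1 && day <= 4;     // Monday through Thursday
  var isShipHour = hour >= 8 && hour < 15;  // 8:00 AM through 2:59 PM
  return isShipDay && isShipHour;
}

// Example: warn instead of shipping when outside the window.
if (!inDeployWindow(new Date())) {
  console.log('Outside the deploy window; wait for the next reasonable workday.');
}
```

A check like this only nudges people; the judgment call about whether a late deliverable can wait still belongs to the team.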

We have reasonable maintenance windows. It took a bit of Google Analytics investigation and some convincing inside the company, but our maintenance window starts at 8 PM EST when we need one. Will this affect users? Yes. Is this the ideal time for users? No. Do we want to save our ops engineers from burnout, sleep deprivation, and insanity, and allow them to live life? Yes! Since we practice continuous delivery, maintenance that requires our site to be offline is rare, so it’s a reasonable trade-off.

We assume it’s a software issue until ops is proven guilty. Too many people outside an engineering department, and even insufficiently experienced software engineers, assume the computers are to blame when things go down (guilty!). Operations issues happen, but software changes or software engineering flubs are usually at fault. We make sure our issue escalation process assumes this reality. Our ops engineer is our last line of defense, not our first.

We make space for proactive ops engineering. Imagine you’re in a sinking ship and you’re told to keep bailing water, even though there’s a plug and hammer at your feet that would stop one source of the leaking. That’s what it’s like to be deprived of space to make your work life better. Nowadays, software engineers are given space to pay off tech debt. Not only does this make it easier for them to ship features in the long run, it also makes their working environment less toxic. Help your ops engineers make time for proactive work. Tell your software engineers to endure that less important but painful problem they’re complaining about just a little longer so that ops gets the space it needs to address the top issues on its list too.

We check in regularly. Ops engineers are a part of our standard kick-off meetings and stand-ups. They have an equal voice at the table. They serve the needs of the business like the rest of us, but they are not subservient. We connect out-of-band to see how things are going too.

We pay them competitively. We send them to meetups, conferences, and training just like software engineers. We let them go to the dentist when they need to. We praise them for their work. We treat them well. Do you?

The 10x Engineer and Delegated Responsibility

Whenever I do an introductory phone call with an engineering candidate, I make sure to explain my management style and how my approach directs our team’s process. Our process is agile, but it is decidedly not a formal Agile methodology. It’s not Agile Scrum; it’s not Extreme Programming; it’s not Kanban. Instead, it’s delegated responsibility in a culture of continuous deployment. I delegate the responsibility of something important to an employee–usually in the form of a significant feature–and let them take it from concept through implementation to deployment.

One of our co-founders serves as our product manager, and we have an experience design team that translates spoken words into diagrams and pictures. However, I make it very clear to my team that any text or visual content they receive is merely a representation of product vision. We need them to guide us from here to there. The people on the front lines–the ones doing the actual building of code and product–are the ones most equipped with information. They face the real constraints of the problem domain and existing code base; they have the best insights into how we can be most economical with their time; and they have the capacity to see all the options before us. I’m there to help them sift through that information, when needed, and to be that supportive coach, but my goal is for them to be carrying us forward. I manage, but I aim to lead, not micromanage.

Delegated responsibility is a very common and efficient practice in the business world. However, the software industry has largely abandoned it in favor of practices and processes that shift responsibility onto a team of replaceable cogs. The team is expected to churn through a backlog of dozens of insignificantly small bits of larger features, which often lack foresight into the constraints that will be discovered and the interdependencies between smaller bits that result in developer deadlock. On top of this, a generalized backlog of small pieces either creates room for misinterpretation by omitting the full context around features or results in excessive communication overhead (see The Mythical Man-Month).

We are most definitely inspired by Agile. We build a minimum viable product iteratively. We build-measure-learn, pair program when needed, collaborate, peer review each step of the way, and let our QA engineer find our leaky parts. However, my team members are individually responsible for their work and ship whenever they have something ready to show the world.

Some candidates would much rather be working on a team with equally-shared responsibility, collective code ownership, and continuous pair programming. I realize some people need this model, which is why I always discuss it with potential hires. However, others thrive with delegated responsibility. They take ownership, require little to no management or direction, make the right decisions, take pride in what they have built with their own two hands, and are extremely productive. Not surprisingly, others understand their code. It integrates well with the code base. They avoid the dangers that formal methodologies try to curtail. Often they are, or are becoming, that 10x developer. They are liberated, thrilled, and at their best working in this environment. It’s a joy to provide it to them.

If this sort of environment sounds exciting to you, please check out our careers page at LearnZillion.

Pivotal Tracker Dashboard

UPDATE: The script I wrote is no longer needed for the latest release of Pivotal Tracker, as pinned panes persist between browser refreshes. The script below is for the last release of classic Pivotal Tracker on June 20, 2013.

I like Pivotal Tracker. It’s a step up from the cumbersome ticket tracking systems I’ve used in the past. As a manager though, it’s too hard to see how my individual team members are doing and get an overall picture of how the team is doing at the same time. I’m only given a single backlog, which intermingles everyone’s tickets. I can see all tickets for a single engineer by using search, and I can pin each search results panel to get what I want. But if I reload the page, I lose all my efforts to build a usable dashboard. This is what I’m aiming for but without the hassle: a view of our current sprint (column one), our backlog (column two), our icebox (column three), and each engineer’s backlog (the remaining columns).

Yeah, the screenshot is a bit small here, but I’m not allowed to show you what we’re working on. With a little TamperMonkey grease, you too can have a comprehensive and persistent dashboard. (GreaseMonkey if you’re still using Firefox.) Here is the script to pull it off. All you have to do is customize the project number and list of engineer name abbreviations.

// ==UserScript==
// @name       Pivotal Tracker Dashboard
// @namespace  https://www.pivotaltracker.com/s/projects/
// @version    1.0
// @description  Show Pivotal Tracker panels for each engineer
// @match      https://www.pivotaltracker.com/s/projects/453961
// @copyright  2012, Ian Lotinsky
// ==/UserScript==
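function main() {
  // Wait for Tracker to finish rendering before driving the search box.
  setTimeout(function() {
    // Customize: name abbreviations of the engineers to show.
    var devs = 'mms, ay, hkb, bh, js, jw, np'.split(', ');
    for (var i = 0; i < devs.length; i++) {
      // Search for each engineer's work, then pin the results panel.
      $('.search .std').attr('value', 'mywork:' + devs[i]);
      $('#search_button').click();
      $('.pin_search').last().click();
    }
    // Clear the search box once every panel is pinned.
    $('.search .std').attr('value', '');
  }, 2000);
}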

// Source: http://stackoverflow.com/questions/2303147/injecting-js-functions-into-the-page-from-a-greasemonkey-script-on-chrome
var script = document.createElement('script');
script.appendChild(document.createTextNode('('+ main +')();'));
(document.body || document.head || document.documentElement).appendChild(script);

Now, this is a bit of a hack, so not everything is roses. You still have to move tickets around in the standard backlog and icebox; you can’t move tickets around inside the developer backlogs since they’re just pinned search result panels; and you need to refresh your browser from time to time to refresh the developer backlogs. It’s not as smooth as a first-class feature, but it gets the job done.
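
If the manual refresh gets tedious, one option is to have the userscript reload the page on a timer, since every reload re-runs the script and rebuilds the pinned panels. This is only a sketch that I haven’t wired into the script above: REFRESH_MINUTES and minutesToMs are names invented for illustration, and the 15-minute interval is an arbitrary assumption.

```javascript
// Hypothetical add-on: reload the page periodically so the pinned search
// panels are rebuilt with fresh results.
// Assumption: 15 minutes is an acceptable interval; adjust to taste.
var REFRESH_MINUTES = 15;

// Convert minutes to the milliseconds that setTimeout expects.
function minutesToMs(minutes) {
  return minutes * 60 * 1000;
}

// Only schedule the reload when running inside a browser page.
if (typeof window !== 'undefined') {
  window.setTimeout(function () {
    window.location.reload();
  }, minutesToMs(REFRESH_MINUTES));
}
```

Keep in mind that a reload throws away anything you were in the middle of typing, so a longer interval is probably the safer default.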

Team Debt

I’m currently having a blast leading the technical team behind the LivingSocial Takeout & Delivery web site. One of the challenges of a growing team is maintaining appropriate amounts of communication. You want everyone to know everything that’s important, but not everything. Otherwise, you end up being a case study in The Mythical Man-Month.

Although our team did not follow this plan when it was ramping up, hindsight reveals that a growing team needs a team debt management strategy. After mulling it over for a while, I’m fairly sure that if I lead a new team in the future, we will follow this path:

First engineer to join the team

Sets up the source code repository
Writes a starter project README
Provisions the application and team notification email addresses
Wires up application notification email(s)

Second engineer

Sets up the continuous integration (CI) server
Provisions the CI notification email address(es)
Wires up CI notification emails

Third engineer

Sets up the team’s Campfire
Wires up commit and deployment notifications (Campfire and/or email)

Fourth engineer

Sets up a scrubbed production database dump that engineers can use for local development

What tech team debt tools do you typically employ, and when do you employ them?