How Well Do You Treat Your Sysadmin/DevOps/Ops Engineer?

Let’s be honest: systems administration, whether on bare metal or in the cloud, is often worse than a thankless job. If the site is up and running, you’ll get no thanks. If it goes down, you’d better get it back up quickly…and then explain what just broke. If you need to schedule downtime, well, you have to schedule it for 4 AM on a Saturday and still show up chipper on Monday.

I’ve seen too many ops engineers work themselves to the bone fire-fighting, scaling, and migrating the foundation on which entire businesses stand, as if running a marathon…at a full sprint. They get no chance to breathe, no normalcy, and no path to the autonomy or purpose we all seek from our work.

Not on my watch. Here are the practices we employ at LearnZillion to make sure our environment is a livable, enjoyable, and rewarding place to be an ops engineer.

We maintain a sane software engineer to ops engineer ratio. I recently talked with an ops engineer who was responsible for the systems behind the company’s 60-person software engineering team. I wish this were an extreme situation, or at least a sustainable one, but it’s not. It isn’t the first time I’ve heard it, either. Whether the software engineers are great or sucky, you’re in for a rough ride when the ratio is stacked against you. Don’t let this happen. Systems take serious work to build and maintain. Don’t ever let an employee drown in work.

We deploy during working hours whenever possible. Our engineering team practices no-downtime, continuous delivery within a time window that allows for issues to shake out before staff go home for the day or weekend. We typically ship Monday through Thursday 8 AM to 3 PM. If a completely shippable deliverable misses that window, we often wait until the next reasonable workday to deploy. We don’t want anyone, in software or in ops, paged while out of the office. It’s a terrible way to live. Strive to keep work at work.

We have reasonable maintenance windows. It took a bit of Google Analytics investigation and some convincing inside the company, but our maintenance window starts at 8 PM EST when we need one. Will this affect users? Yes. Is this the ideal time for users? No. Do we want to save our ops engineers from burnout, sleep deprivation, and insanity, and allow them to live life? Yes! Since we practice continuous delivery, maintenance that requires our site to be offline is rare, so it’s a reasonable trade-off.

We assume it’s a software issue until ops is proven guilty. Too many people outside an engineering department, and even insufficiently experienced software engineers, assume the computers are to blame when things go down (guilty!). Operations issues happen, but a software change or a software engineering flub is usually at fault. We make sure our issue escalation process assumes this reality. Our ops engineer is our last line of defense, not our first.

We make space for proactive ops engineering. Imagine you’re in a sinking ship and you’re told to keep bailing water, even though there’s a plug and a hammer at your feet that would stop a source of the leak. That’s what it’s like to be deprived of the space to make your work life better. Nowadays, software engineers are given space to pay off tech debt. Not only does this make it easier for them to ship features in the long run, it also makes their working environment less toxic. Help your ops engineers make time for proactive work. Tell your software engineers to endure the painful but less important issues they’re complaining about just a little longer, so that ops gets the space it needs to address the top items on its list too.

We check in regularly. Ops engineers are a part of our standard kick-off meetings and stand-ups. They have an equal voice at the table. They serve the needs of the business like the rest of us, but they are not subservient. We also connect out of band to see how things are going.

We pay them competitively. We send them to meetups, conferences, and training just like software engineers. We let them go to the dentist when they need to. We praise them for their work. We treat them well. Do you?

The 10x Engineer and Delegated Responsibility

Whenever I do an introductory phone call with an engineering candidate, I make sure to explain my management style and how my approach directs our team’s process. Our process is agile, but it is decidedly not a formal Agile methodology. It’s not Agile Scrum; it’s not Extreme Programming; it’s not Kanban. Instead, it’s delegated responsibility in a culture of continuous deployment. I delegate the responsibility of something important to an employee–usually in the form of a significant feature–and let them take it from concept through implementation to deployment.

One of our co-founders serves as our product manager, and we have an experience design team that translates spoken words into diagrams and pictures. However, I make it very clear to my team that any text or visual content they receive is merely a representation of the product vision. We need them to guide us from here to there. The people on the front lines–the ones actually building the code and product–are the ones most equipped with information. They face the real constraints of the problem domain and the existing code base; they have the best insights into how we can be most economical with their time; and they have the capacity to see all the options before us. I’m there to help them sift through that information when needed, and to be that supportive coach, but my goal is for them to be carrying us forward. I manage, but I aim to lead, not micromanage.

Delegated responsibility is a very common and efficient practice in the business world. However, in the software industry it has largely been displaced by practices and processes that shift responsibility onto a team of replaceable cogs. The team is expected to churn through a backlog of dozens of insignificantly small bits of larger features, a breakdown that often lacks foresight into the constraints that will be discovered and the interdependencies between those smaller bits that result in developer deadlock. On top of this, a generalized backlog of small pieces creates room for misinterpretation by omitting the full context around features, or results in excessive communication overhead (see The Mythical Man-Month).

We are most definitely inspired by Agile. We build a minimum viable product iteratively. We build-measure-learn, pair program when needed, collaborate, peer review each step of the way, and let our QA engineer find our leaky parts. However, my team members are individually responsible for their work and ship whenever they have something ready to show the world.

Some candidates would much rather be working on a team with equally-shared responsibility, collective code ownership, and continuous pair programming. I realize some people need this model, which is why I always discuss it with potential hires. However, others thrive with delegated responsibility. They take ownership, require little to no management or direction, make the right decisions, take pride in what they have built with their own two hands, and are extremely productive. Not surprisingly, others understand their code. It integrates well with the code base. They avoid the dangers that formal methodologies try to curtail. Often they are, or are becoming, that 10x developer. They are liberated, thrilled, and at their best working in this environment. It’s a joy to provide it to them.

If this sort of environment sounds exciting to you, please check out our careers page at LearnZillion.

Document the Why

Like many coders, I am a proponent of writing self-documenting code. The more I have worked with intentional code that omits unnecessary or misleading comments, the more efficient I have been as a software engineer. I read clear code and I understand what is going on.

However, regardless of whether I have spelunked self-documenting code or code with a glut of extraneous comments, both styles often omit why something is being done when it’s not obvious. Maybe an API you’re interacting with has a bug, so you have to do things in a roundabout or undocumented way. Maybe there is a non-obvious edge case that your main conditional branch covers that another programmer would expect to be solved in a more conventional manner. Maybe you made a calculated business decision for a particular user experience when other sites behave differently.

Make it obvious why a non-obvious approach was taken. Save your fellow engineers and your future self from re-exploring the explored, re-arguing the argued, and re-deciding the decided.
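
To make this concrete, here is a small, hypothetical JavaScript sketch (the vendor quirk it describes is invented purely for illustration). A one-line “why” comment is often all it takes to keep the next reader from “simplifying” a deliberate workaround.

    // Why: the (hypothetical) vendor search API returns HTTP 200 with an
    // empty body instead of a 404 when a record is missing, so we treat an
    // empty response as "not found" rather than letting JSON.parse throw.
    function findRecord(responseBody) {
      if (responseBody === '') {
        return null; // deliberate: see the "why" above before changing this
      }
      return JSON.parse(responseBody);
    }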

Google Analytics Crash Course Notes

Thinking that you will adequately learn Google Analytics by clicking around the product, even over years, is a foolish concept. You will only understand a subset of its features and how they work together. You need to do your homework.

I cannot improve upon Google Analytics’ (GA) own crash course, titled Google Analytics IQ Lessons. It covers just about all the material in the paid Analytics courses (101, 201, and 301) at just the right level–not too high, and not too deep.

Here are my notes on the key gotchas and the items to configure for your GA Web Properties. I’ve also linked to other helpful learning resources. As is my standard practice, these notes are mostly for my own reference down the road, so they’re not comprehensive. However, I figure others can benefit from them as well.

Gotchas

  • Incognito mode and other browser privacy sessions count as new Visitors, Visits, and Page Views, as if the user had cleared his cookies. Not a huge surprise to most, I’m sure. (Although other trackers can still track you.)
  • Visits are separated by exits from the site or a 30-minute cookie timeout while on the site. Advertising Campaign attribution expires after a 6-month cookie timeout. Both are customizable.
  • Time on Exit Pages is not tracked because time is calculated between page loads on the same site. This also means that Bounce Page time is not tracked either. This has serious implications for some genres of sites, like blogs where much of the traffic goes into and out of a single article. Know how to track Exit Page times and Bounce Pages if you need to (a sketch follows this list).
  • A Visitor can only trigger a Goal conversion once during a Visit, but can trigger an E-commerce Goal multiple times in a Visit.
  • Filters are applied between raw data capture and the Account’s Profile where the data is ultimately stored. Even if you change a Filter that sits in-between the raw feed and Profile, you cannot recover historical data. Try accomplishing the same filtering with Advanced Segments instead, which don’t run the risk of losing data. At the least, you should use Advanced Segments or other features to test concepts before creating a real Filter for them.
  • Domains and subdomains can break tracking in many glorious ways–especially E-commerce Goal tracking. (A subdomain sketch follows this list.)
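
If you do need rough timing for Exit Pages and Bounce Pages, one common workaround with the classic ga.js async tracker is to fire a timed event so GA gets a second timestamp for the last page a Visitor views. This is only a sketch: the Web Property ID and 30-second threshold below are placeholders, and the event also keeps those Visits from counting as bounces, so apply it knowingly.

    var _gaq = _gaq || [];
    _gaq.push(['_setAccount', 'UA-XXXXXX-1']); // placeholder Web Property ID
    _gaq.push(['_trackPageview']);

    // After 30 seconds, send an event so GA has a later timestamp for this
    // page; note that the Visit will no longer be counted as a bounce.
    setTimeout(function () {
      _gaq.push(['_trackEvent', 'Engagement', 'Read for 30 seconds',
                 window.location.pathname]);
    }, 30000);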

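For the subdomain case in particular, the usual fix with classic ga.js is to scope the tracking cookies to the parent domain so a Visit doesn’t restart (and E-commerce Goal attribution doesn’t break) when a user crosses from www to a subdomain. A minimal sketch, assuming a hypothetical example.com:

    var _gaq = _gaq || [];
    _gaq.push(['_setAccount', 'UA-XXXXXX-1']);     // placeholder Web Property ID
    _gaq.push(['_setDomainName', '.example.com']); // share cookies across subdomains
    _gaq.push(['_trackPageview']);
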
Basic Checklist

  • Always have a raw Profile that has no Filters, Advanced Segments, etc.
  • Have a Profile that excludes internal IP addresses so you’re not tracking yourself and your staff as they click around your site.
  • Have a Profile that exclusively tracks internal IP addresses for debugging Google Analytics code on your site.
  • Use the Google Analytics Debugger Chrome Extension for your own debugging and analysis of competitors’ tracking.
  • Enable Auto-Tagging between Google AdWords and Analytics if you are using both products.
  • If you create an AdWords Profile, set up two Filters to focus in on AdWords traffic (Campaign Source: google, Campaign Medium: cpc).
  • Set up E-commerce tracking (a sketch follows this list).
  • Set up Goal tracking.
  • Set up Internal Site Search tracking. (It’s much easier than you think.)
  • Utilize _addIgnoredOrganic to attribute Organic Search Visits for your web site’s address (i.e., someone searching for “example.com”) to Direct Visits instead (sketched after this list).
  • Set up appropriate Custom Variables to track additional information about Visitors, Visits, and Page Views (also sketched after this list).
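
For reference, E-commerce tracking with the classic ga.js async tracker boils down to describing the transaction, describing its line items, and then sending them together, typically on the order-confirmation page. The IDs, names, and amounts below are placeholders for this sketch:

    var _gaq = _gaq || [];
    _gaq.push(['_setAccount', 'UA-XXXXXX-1']); // placeholder Web Property ID
    _gaq.push(['_trackPageview']);

    // Describe the transaction...
    _gaq.push(['_addTrans',
      '1234',           // transaction ID (required)
      'Example Store',  // affiliation
      '29.99',          // total
      '2.40',           // tax
      '0.00',           // shipping
      'Washington',     // city
      'DC',             // state
      'USA'             // country
    ]);

    // ...then each line item (the transaction ID ties them together)...
    _gaq.push(['_addItem',
      '1234',           // transaction ID (must match _addTrans)
      'SKU-001',        // SKU
      'Annual Plan',    // product name
      'Subscriptions',  // category
      '29.99',          // unit price
      '1'               // quantity
    ]);

    // ...then send it all to GA.
    _gaq.push(['_trackTrans']);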

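The last two checklist items are one-line tracker calls in the same classic ga.js syntax. Here is a sketch that assumes a hypothetical site at example.com and a made-up “Member Type” visitor-level Custom Variable; both calls must be pushed before the _trackPageview they should apply to.

    var _gaq = _gaq || [];
    _gaq.push(['_setAccount', 'UA-XXXXXX-1']); // placeholder Web Property ID

    // Treat Organic Search Visits for our own address as Direct Visits.
    _gaq.push(['_addIgnoredOrganic', 'example.com']);
    _gaq.push(['_addIgnoredOrganic', 'www.example.com']);

    // Visitor-scoped Custom Variable in slot 1 (scope 1 = visitor).
    // 'Member Type' and 'teacher' are hypothetical values for this sketch.
    _gaq.push(['_setCustomVar', 1, 'Member Type', 'teacher', 1]);

    _gaq.push(['_trackPageview']);
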
Hopefully, all of this will help us do a better job of optimizing our average customer lifetime value (LTV).