SRE Weekly Issue #58

SPONSOR MESSAGE

“The How and Why of Minimum Viable Runbooks.” Get the free ebook from VictorOps.

Articles

I’m going to break my normal format and post this outage up here in the article section. Here’s why: GitLab was extremely open about this incident, their incident response process, and even the actual incident response itself.

Linked is their blog post about the incident, with an analysis written 24 hours afterward that runs circles around the postmortems many other companies release days after an outage. They also linked to their raw incident response notes (a Google Doc).

Here’s what really blows me away: they live-streamed their incident response on YouTube. They’re also working on their postmortem document publicly in a merge request and tracking remediations publicly in their issue tracker. Incredible.

Their openness is an inspiration to all of us. Here are a couple of snippets from the email I sent them earlier this week, which is (understandably) still awaiting a response:

[…] I’m reaching out with a heartfelt thank you for your openness during and after the incident. Sharing your incident response notes and conclusions provides an unparalleled educational resource for engineers at all kinds of companies. Further, your openness encourages similar sharing at other companies. The benefit to the community is incalculable, and on behalf of my readers, I want to thank you!

[…] Incidents are difficult and painful, but it’s the way that a company conducts itself during and after that leaves a lasting impression.

Julia Evans is back this week with a brand new zine about networking. It’ll be posted publicly in a couple weeks, but until then, you can get your own shiny copy just by donating to the ACLU (who have been doing a ton of awesome work!). Great idea, Julia!

You can now read the Google SRE book online for free! Pretty nifty. Thanks Google.

An in-depth dive into how Twitter scales. I’m somewhat surprised that they only moved off third-party hosting as recently as 2010. Huge thanks to Twitter for being so open about their scaling challenges and solutions.

Here’s a good intro to unikernels, if you’re unfamiliar with them. The part that caught my attention is under the heading, “How Do You Debug the Result?”. I’m skeptical of the offered solution, “just log everything you need to debug any problem”. If that worked, I’d never need to pull out strace and lsof, yet I find myself using them fairly often.

This article reads a whole lot more like “process problems” than “human error”. Gotta love the flashy headline, though.

Just what exactly does that “five nines” figure in that vendor’s marketing brochures mean, anyway?
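As a rough back-of-the-envelope illustration (my own, not from the linked article), here’s how each additional “nine” of availability translates into allowed downtime per year:

    # Illustrative sketch: allowed downtime per year for N nines of availability.
    SECONDS_PER_YEAR = 365 * 24 * 60 * 60

    for nines in range(2, 6):
        availability = 1 - 10 ** -nines        # e.g. five nines -> 0.99999
        downtime_minutes = SECONDS_PER_YEAR * (1 - availability) / 60
        print(f"{availability:.3%} uptime = about {downtime_minutes:,.1f} minutes of downtime per year")

Five nines works out to just over five minutes of total downtime per year.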

Your customers may well take to Twitter to tell you (and everyone else) about your outages. PagerDuty shares suggestions for how to handle it and potentially turn it to your advantage.

This operations team all agreed to work a strict 9-to-5 and avoid checking email or Slack after hours. They shared their experience every day in a “dark standup” on Slack: a text-based report of whether each engineer was falling behind and what they’d do with the extra hours they would normally have worked. They wrote up their conclusions in this article, and it’s an excellent read.

Faced with limited financing and a high burn rate, many startups focus on product development and application coding at the expense of operations engineering.

And the result is operational technical debt.

It’s really interesting to me that paying physicians extra for on-call shifts seems to be an industry standard. Of all my jobs, only one provided special compensation for on-call. It made the rather rough on-call work much more palatable. Does your company provide compensation or Time Off In Lieu (TOIL)? I’d love it if you’d write an article about the reasons behind the policy and the pros and cons!

Bringing non-traditional Ops folks, including developers, on-call can be a tricky process. Initial reactions tend to be highly polarized, either total pushback and refusal, or a meek acceptance coupled with fear of the unknown. For the former, understanding the root of the refusal is useful. For the latter, providing clarity and training is important.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Kind of neat, especially in this era of infrastructures built on (mostly) commodity hardware.

The Emperor has no clothes on, NoOps isn’t a thing, and you still have to monitor your serverless applications. Sorry about that.

Outages

  • Telstra
    • A fire at an exchange resulted in an outage and somehow also caused SMSes to be mis-delivered to random recipients.
  • Heroku
    • Full disclosure: Heroku is my employer.
  • Google App Engine
  • 123-Reg