
SRE Weekly Issue #68

SPONSOR MESSAGE

Incident management is essential to modern DevOps environments. Learn why in the eBook, “Making the Case for Real-time Incident Management” from your friends at VictorOps. http://try.victorops.com/realtime_incident_mgmt/SREweekly

Articles

The big story this week is the release of the inaugural issue of Increment, a newsletter by Stripe, edited by Susan Fowler. They bill it as “A digital magazine about how teams build and operate software systems at scale” and the first issue, dedicated to on-call, certainly delivers. Below, I’ll include my short take on each article in the issue.

Increment interviewed over thirty companies to build a picture of the common practices in incident response. I’m actually pretty surprised to hear that “it turns out that they all follow similar (if not completely identical) incident response processes”, but apparently the commonalities don’t stop at just process:

Slack and PagerDuty appear to be two points of failure across the entire tech industry

Bonus content: Julia Evans shared her notes on Twitter.

Next up, Increment addresses the dichotomy of ops teams versus developers on call for their code. It turns out that the latter practice is more prevalent than I’d realized.

After laying a solid groundwork of suggestions for avoiding burn-out in on-call, this next Increment article raises a really important point: on-call affects people differently based on privilege. Example: single parents are going to have a much harder time of it.

[…] if you set up an on-call rotation with a schedule or intensity that assumes the participants have no real responsibilities outside of the office, you are limiting the people who will be able to participate on your team.

Remember a couple of months back when GitLab live-streamed their incident response? Increment caught up with their CEO to give us this in-depth interview about their radical transparency.

Increment shares tips and key practices for setting up on-call, targeted to companies ranging in size from 0-10 employees all the way up to 10,000+.

Increment rounds out their issue with advice in the form of quotes from six of the companies they interviewed.

The other big news of the week is the official launch of Honeycomb.io. If you haven’t had a chance to check it out yet, here’s an introduction, and you can also sign up for a free one-month trial.

Outages

  • Melbourne IT
    • A DDoS took out their DNS service, taking out customer domains and also sites that they host for customers. While this is a news article and not a formal post-analysis, it does include some pretty interesting technical detail from an interview with their CTO. I’m not sure that he did himself any favors by quoting the definition of their SLA:

      “People look at 99.9 per cent and think that’s seconds of downtime, but you work it out and it’s 45 minutes.”

  • Google Cloud HTTP(S) Load Balancer
    • Google Cloud LB threw 502s for 25% of requests in a 22-minute period. They released this post-analysis 7 days later, and I have to say, the root cause is pretty interesting – and sadly familiar.

      A bug in the HTTP(S) Load Balancer configuration update process caused it to revert to a configuration that was substantially out of date.

SRE Weekly Issue #67

SPONSOR MESSAGE

Are your incident management skills sharp, or are you continuously fighting fires? Take the free, online incident management assessment from VictorOps and compare your practices against leading DevOps methodologies: http://try.victorops.com/ima/sreweekly

Articles

This article is about the risks of automation. While automation can reduce risk by making errors less likely, it also disengages human operators from what’s actually happening, meaning that they’re less likely to catch and correct problems.

The author spent seven months sifting through, categorizing, and documenting over 1700 production incidents. The result was impressive: a massive improvement in the SRE team’s incident response process and documentation. It’s got me wondering if we can do something similar at $JOB.

Thanks to Steven Farlie for this one.

A danger of a microservice architecture is that one failing service can affect those that depend on it, even indirectly. The Netflix API handles over 10000 requests per second, and it was carefully designed to avoid the case where a slow dependency breaks unrelated requests.

Without taking steps to ensure fault tolerance, 30 dependencies each with 99.99% uptime would result in 2+ hours downtime/month (99.99%^30 = 99.7% uptime = 2+ hours in a month).
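The arithmetic behind that quote is worth working through: availabilities compound by multiplication, so 30 dependencies at 99.99% each leave roughly 99.7% overall. Here's a minimal Python sketch of the calculation. It assumes that failures are independent, that every request needs all 30 dependencies, and that a month is 30 days; the function names are mine, not Netflix's.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def combined_availability(per_dependency: float, count: int) -> float:
    """Availability of a request that needs `count` independent dependencies."""
    return per_dependency ** count


def monthly_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per month at the given availability."""
    return (1.0 - availability) * HOURS_PER_MONTH


overall = combined_availability(0.9999, 30)
print(f"combined availability: {overall:.2%}")
print(f"expected downtime: {monthly_downtime_hours(overall):.2f} hours/month")
```

Under those assumptions the result lands a little over two hours per month, which matches the "2+ hours" figure in the quote.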

Nuclear Family is an interactive play in which the audience is presented with critical decisions as the characters move inexorably toward a nuclear plant disaster. The goal is to demonstrate local rationality, the principle that people make the best decision they can with the information they have at hand, even if in retrospect that decision led to an adverse outcome.

Last year, PagerDuty moved toward giving developers operational responsibility for the systems they create. The really cool thing about their transition is that they have hard stats on reduction of incidents, decrease in MTTR, and increase in changes deployed to production.

This post is primarily a new feature announcement, but the intro section is just awesome. I love the idea of designing a system with empathy for your future self that will be on call for it.

A short but enlightening blog post on designing systems to degrade gracefully.

when weird stuff happens, make sure it doesn’t cause harm you didn’t expect or plan for.

Outages

  • Razer
    • Notably, this outage reset the careful customizations that people had made to their peripherals.

      Thanks to Steven Farlie for this one.

  • Heroku
    • Heroku had a 2-day-long disruption that spanned 3 status site posts.

      Full disclosure: Heroku is my employer.

  • DigitalOcean
    • DigitalOcean accidentally deleted their primary database, resulting in a ~5-hour outage.

      A process performing automated testing was misconfigured using production credentials.

SRE Weekly Issue #66

SPONSOR MESSAGE

Are your incident management skills sharp, or are you continuously fighting fires? Take the free, online incident management assessment from VictorOps and compare your practices against leading DevOps methodologies: http://try.victorops.com/ima/sreweekly

Articles

I hope you’ll enjoy reading this debug session as much as I enjoyed co-writing it. My former co-worker and I did some serious digging to get to the bottom of an unexpected EADDRINUSE that caused a production incident.

Full disclosure: Heroku is my employer.

Distributed filesystems provide high availability by duplicating data. In this research paper, the researchers created errorfs, a FUSE plugin that passes through a backing filesystem but introduces single-bit errors. Result: almost all major distributed filesystems missed the error, resulting in corruption.

The part I like most about this article is the emphasis on the difference between DR and HA.

Full disclosure: Heroku, my employer, is mentioned.

The S3 outage a month ago is a great reminder that chaos experiments are useful not just for taking down parts of our own infrastructure, but also simulating the failure of external dependencies.

There are several core HumanOps principles, but the most important one to remember is that human health impacts business health.

It’s about time that we recognised that engineers are humans who get stressed and need downtime and that there are strong business as well as social reasons why these needs should be met.

Impressively quickly, USENIX has posted the videos from SRECon17 Americas! I’ve linked to a post by Woodland Hunter, whose review of SRECon I featured here two weeks ago, with links to the talks he reviewed and more.

The first article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

PagerDuty theorizes that if developers don’t trust the incident response process, they’ll fear outages and thus be less productive. Proper incident management eases that fear so that they feel safer deploying code.

This article could be titled, “Use these three wacky tricks to reduce your pages by 100x!” In all seriousness, the methods described are aggregation (group related alerts), routing (sort alerts by team), and classification (page-worthy alerts versus warnings).
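To show how the three methods fit together, here's a rough Python sketch. It is not the article's implementation: the `Alert` fields, the "critical"/"warning" severity labels, and the rule that a group pages if any alert in it is critical are all assumptions I made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    team: str
    severity: str  # hypothetical labels: "critical" or "warning"


def aggregate(alerts):
    """Aggregation: bucket related alerts by (service, check)."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.service, alert.check)].append(alert)
    return groups


def plan_notifications(alerts):
    """Routing and classification: one entry per group, sorted by team."""
    plan = defaultdict(lambda: {"page": [], "warn": []})
    for key, group in aggregate(alerts).items():
        team = group[0].team  # routing: assumes a group belongs to one team
        # Classification: page only if something in the group is critical.
        is_critical = any(a.severity == "critical" for a in group)
        plan[team]["page" if is_critical else "warn"].append(key)
    return dict(plan)


alerts = [
    Alert("api", "latency", "web", "critical"),
    Alert("api", "latency", "web", "critical"),
    Alert("api", "latency", "web", "warning"),
    Alert("db", "disk", "data", "warning"),
]
print(plan_notifications(alerts))
```

In this toy run, four raw alerts collapse into a single page for the web team and a single warning for the data team.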

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

A nice primer on using tc to induce latency, which is really important when testing the resiliency of systems to network instability. Thanks, Julia!
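For a rough idea of what this looks like in practice, here's a small Python sketch that builds the `tc` command lines for adding and removing a fixed delay. It assumes the `netem` qdisc (which may or may not be the exact approach in the linked post), and the helper names and the `eth0` interface are placeholders of mine. The sketch only prints the commands; actually running them requires root on a Linux host.

```python
import shlex


def netem_delay_command(interface, delay_ms, jitter_ms=0):
    """Build the argv for adding a netem delay on `interface`."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    cmd = ["tc", "qdisc", "add", "dev", interface, "root",
           "netem", "delay", f"{delay_ms}ms"]
    if jitter_ms > 0:
        cmd.append(f"{jitter_ms}ms")  # optional jitter around the delay
    return cmd


def netem_clear_command(interface):
    """Build the argv for removing the root qdisc again."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


# Print rather than execute: running these needs root privileges.
print(shlex.join(netem_delay_command("eth0", 100, jitter_ms=20)))
print(shlex.join(netem_clear_command("eth0")))
```

Building the argv as a list (rather than a shell string) keeps it safe to hand to `subprocess.run` later without worrying about quoting.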

Here’s the second half of Stephen Thorne’s commentary on “Embracing Risk”, the third chapter in Google’s SRE book.

As your company grows in infrastructure size, number of employees, load, and other areas, how do you change your incident response to cope?

Outages

  • Azure status history
    • While following up on an outage from a couple of weeks ago, I came upon this archive of Azure incidents, several with detailed postmortems. It’s a goldmine of interesting RCAs, but I wish they’d give each its own page for easy linking.

SRE Weekly Issue #65

SPONSOR MESSAGE

Got ChatOps? This 75-page ebook from O’Reilly Media covers ChatOps from concept to deployment. Get started managing operations in group chat today. Download your free copy here: http://try.victorops.com/sreweekly/chatops

Articles

Look, a new newsletter about monitoring! I’m really excited to see what they have to offer.

And another new newsletter! Like Monitoring Weekly, I’m betting this one will have a lot of articles of interest to SREs.

VictorOps held a webinar last Thursday to present and discuss the concept of context in incident management. Just paging in a responder isn’t enough: we need to get them up to speed on the incident as soon as possible. Ideally, the page itself would include snapshots of relevant graphs, links to playbooks, etc. But if we’re not careful and add too much information, the responder is overloaded by a “sandstorm” of irrelevant data. The webinar also touched on “time to learn”, that is, post-incident learning.

This webinar was created by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Here’s the next in Stephen Thorne’s series of commentary on chapters of the SRE book. I like that Google makes an effort not to be too reliable for fear of setting expectations too high, and they’re also realistic in their availability goals: no end-user will notice a 20-second outage.

Writing an API, system, server or really anything people might make use of? Don’t make the default timeout be infinite.
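As a small illustration of that advice, here's a minimal Python sketch of a connection helper whose default timeout is finite. The helper name `open_connection` and the 10-second default are hypothetical choices of mine, not something from the linked post; `socket.create_connection` is the standard library call underneath.

```python
import socket

# Hypothetical default: finite, so a hung peer can't block callers forever.
DEFAULT_TIMEOUT_SECONDS = 10.0


def open_connection(host, port, timeout=DEFAULT_TIMEOUT_SECONDS):
    """Open a TCP connection that gives up after `timeout` seconds.

    Callers who really want to wait forever can pass timeout=None,
    but they have to opt in explicitly.
    """
    return socket.create_connection((host, port), timeout=timeout)
```

The point of the design is that the safe behavior is what you get when you don't think about it, and the risky behavior takes a deliberate argument.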

PagerDuty really has been churning out excellent articles in the past couple of weeks. [Spoiler Alert] The five things are: internal communication, monitoring, a public status site, a support ticket system, and a defined incident response procedure.

Keep on rockin’ it, PagerDuty. This time they identify common problems that hinder incident response and give suggestions on how to fix them.

The author reviews their experience at SRECon17 Americas, including interesting bits on Julia Evans, training, recruiting, and diversity.

I love that the ideas we’re talking about regarding human error apply even to commercial cannabis growing.

Sadly, little is known about the nature of these errors, mainly because our quest for the truth ends where it should begin, once we know it was a human error or is “someone’s fault.”

The newer and shinier the technology, the bigger the production risk.

In other words, software that has been around for a decade is well understood and has fewer unknowns.

Outages

  • King’s College London storage system outage and data loss
    • King’s College London’s HP storage system suffered a routine failure that, due to a firmware bug, resulted in loss of the entire array. Linked is an incredibly detailed PDF including multiple contributing factors and many remediations. Example: primary backups were to another folder on the same storage system, and secondary tape backups were purposefully incomplete.
  • Ryanair
    • This one’s interesting to me because it seems to have been self-inflicted due to a flash sale.
  • Apple Store
    • Another (possibly) self-inflicted outage due to a sale.
  • Microsoft Azure
  • Discord Status – Connectivity Issues
    • Finally, my search alert for “thundering herd” paid off! I hadn’t heard of Discord before now, but they sure do write a great postmortem. Did you know that the thundering herd is a sports team?

SRE Weekly Issue #64

SPONSOR MESSAGE

Got ChatOps? This 75-page ebook from O’Reilly Media covers ChatOps from concept to deployment. Get started managing operations in group chat today. Download your free copy here: http://try.victorops.com/sreweekly/chatops

Articles

I wasn’t able to make it to SRECon17 Americas this year, but it sounds like it was a great time. (day two summary)

My heroine, Julia Evans, gave the plenary session at SRECon17 Americas, all about how to learn how to be an excellent engineer (or really anything!). She proved herself once again not just as an excellent student, but also an inspiring teacher. The best part is that she posted the abstract, slides, and a transcript of her talk shortly after giving it! This is a really excellent resource for folks like me that weren’t there, and I hope more talk-givers will follow her example.

This article is long, but I wish I’d carved out time for it long ago, because it’s really incredible and well worth the read. John Allspaw uses the SEC analysis of the Knight Capital incident as a starting point to introduce and discuss the problems with Counterfactual Thinking (“if the engineer had just done ___, this wouldn’t have happened”).

Rolling back a flawed code release can have significant risk. It doesn’t always fix the problem because the erroneous code may have had effects on other parts of the system. Sometimes, as in the Knight Capital incident, a rollback exacerbates the problem.

This is part two of an annotation of the Google SRE book by Stephen Thorne, a Google SRE. Part three is available too.

Here’s an interesting idea: using metadata about incidents as a proxy for measuring technical debt. PagerDuty goes over the definition of technical debt before diving into measuring it.

GitLab posted an update on “team-member-1”, the engineer who entered the commands that caused their production DB to be erased. I love that they posted this, because I for one was worried about “team-member-1” as a second victim.

During an incident, emotions can run strong. How can we set up incident response so as to provide the best environment for our responders?

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Outages

  • AWS Route 53
    • Route 53 had a control plane outage, though actual query responses were unaffected.
  • Square
    • Square suffered a 2-hour outage, and if this postmortem is any indication, they learned a lot from it. This bit is interesting in light of the article above about rollbacks:

      We rolled back all software changes that happened leading up to the incident. This is a non-negotiable response to any customer-impacting event; our engineers are trained to undo any change that happened before an incident regardless of how plausible it is that the change caused the issue.

  • StatusPage.io
    • This happened during Square’s outage and impacted their ability to communicate.
  • CBS
    • CBS’s site was down, so people couldn’t fill out their fantasy sportsball brackets 1 hour before the game started.
A production of Tinker Tinker Tinker, LLC