General

SRE Weekly Issue #13

SRECon16 registration is open, and I’m excited to say that my colleague Courtney Eckhardt and I will be giving a talk together! If you come to the conference, I’d love it if you’d say hi.

Articles

A deep-dive on EVCache, Netflix’s open source sharding and replication layer on top of memcached.

EVCache is one of the critical components of Netflix’s distributed architecture, providing globally replicated data at RAM speed so any member can be served from anywhere.
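The core pattern is simpler than it sounds: write every value to a memcached replica in each zone, then read from the nearest one. Here’s a minimal conceptual sketch of that pattern, not Netflix’s actual EVCache client (which adds async replication, zone fallback, cache warming, and much more); it assumes the third-party pymemcache library, and the replica hostnames are made up.

```python
# Conceptual sketch only -- not Netflix's EVCache client. Assumes the
# third-party pymemcache library; the replica hostnames are made up.
from pymemcache.client.base import Client

# One memcached endpoint per replica "zone" (hypothetical hosts).
REPLICAS = [
    Client(("cache-zone-a.example.com", 11211)),
    Client(("cache-zone-b.example.com", 11211)),
    Client(("cache-zone-c.example.com", 11211)),
]

def replicated_set(key, value, ttl=300):
    """Write the value to every zone's replica so any zone can serve reads."""
    for client in REPLICAS:
        client.set(key, value, expire=ttl)

def nearest_get(key, preferred=0):
    """Read from the local zone first, falling back to the others on a miss."""
    order = REPLICAS[preferred:] + REPLICAS[:preferred]
    for client in order:
        value = client.get(key)
        if value is not None:
            return value
    return None
```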

This is a guest post from one of our customers, Aaron, Director of Support Systems at CageData. He’s talking about making alerts actionable and why that’s important.

TechCrunch gives us this overview of the field of SRE, including its origins, motivations, and guesses about its future.

Everyone’s favorite OpenSSL vulnerability of the year. I hope you all had a relatively easy patch day.

A short but sweet analysis of an intermittent bug caused by inconsistent date formatting. The author uses the term “blameful postmortem” to mean finding reasons that explain how the client application was written with faulty date parsing logic (tl;dr: the server side truncated trailing zeroes in the fractional seconds). Really, I think this is less about blame than it is about understanding the full context in which an error was able to occur, and that’s exactly what a blameless postmortem is all about.
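To make the failure mode concrete, here’s a tiny sketch (not the actual client code) of how a parser that insists on exactly three fractional-second digits behaves once the server starts truncating trailing zeroes:

```python
# Sketch of the failure mode described above, not the actual client code.
# A parser that requires exactly three fractional-second digits accepts
# "...56.500Z" but rejects "...56.5Z" once trailing zeroes are truncated.
import re
from datetime import datetime

STRICT = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d{3})Z$")

def parse_strict(ts):
    m = STRICT.match(ts)
    if m is None:
        raise ValueError(f"unparseable timestamp: {ts!r}")
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    return base.replace(microsecond=int(m.group(2)) * 1000)

print(parse_strict("2016-03-01T12:34:56.500Z"))  # parses fine

try:
    print(parse_strict("2016-03-01T12:34:56.5Z"))
except ValueError as err:
    # Only fails when the fractional seconds happen to end in zero:
    # an intermittent bug.
    print("intermittent failure:", err)
```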

Incidents can uncover technical debt in a system. Fixing the technical debt is often necessary if a repeat incident is to be avoided, but it can be difficult to track and allocate resources to make it happen. This article from PagerDuty suggests a method for tracking technical debt uncovered by incidents.

When multiple incidents occur simultaneously, things can get hairy and you need to have an organized incident response structure. This article is about firefighting, but we can take their lessons and apply them to SRE.

PagerDuty advocates for a model I’ve heard referred to as “Total Service Ownership”, where dev teams handle incident response for their subsystems rather than “throwing them over the wall” for Ops to support. Courtney and I will be talking about this and more at SRECon16 next month.

Outages

  • Telstra
    • No free data day for this one.

  • Gopher
    • Metafilter revived their gopher server after 15 years of downtime.

  • Salesforce.com
    • Full disclosure: Salesforce.com (parent company of my employer, Heroku) is mentioned.

  • Uganda

SRE Weekly Issue #12

Articles

What an excellent resource! This repo contains a pile of postmortems for our reading and learning pleasure. I’m linking to the repo now, but I don’t promise not to call out specific awesome postmortems from it in the future.

When you’re in the trenches trying to get the service back up and running, it can be hard to find the time to tell everyone else in your company what’s going on. It’s critically important though, as Statuspage.io writes in this article.

Full disclosure: Heroku, my employer, is mentioned.

Digital Ocean shares this overview of the basic concepts involved in high availability.

This article discusses a method of computing the availability of an overall system made up of individual components with differing availabilities. It gives general formulas and methods that are fairly simple, yet powerful.
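For reference, the standard first-order formulas look like the sketch below: components in series multiply their availabilities, while redundant replicas multiply their unavailabilities. The article’s exact method may differ, and the numbers here are only illustrative.

```python
# The usual first-order formulas for composing component availabilities;
# the linked article's exact method may differ.
from math import prod

def series_availability(components):
    """All components must be up: multiply their availabilities."""
    return prod(components)

def parallel_availability(replicas):
    """Any replica can serve: one minus the chance that all are down."""
    return 1 - prod(1 - a for a in replicas)

# Illustrative example: a load balancer (99.99%) in front of two redundant
# app servers (99.9% each) and a single database (99.95%).
app_tier = parallel_availability([0.999, 0.999])           # ~0.999999
system = series_availability([0.9999, app_tier, 0.9995])   # ~0.9994
print(f"{system:.4%}")
```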

What do you do when you have to modify an existing production system that has less-than-wonderful code quality? This article is an impassioned plea to test the heck out of your changes and always try to release production-quality code the first time.

Google is launching a reverse-proxy for DDoS mitigation. Interestingly, it’s only for news and free speech sites and it’s completely free.

Outages

SRE Weekly Issue #11

Articles

The big scary news this week was the buffer overflow vulnerability in glibc’s getaddrinfo function. Aside from the obvious impact of intrusions on availability, this bug requires us to roll virtually every production service in our fleets. If we don’t have good procedures in place, now is when we find out with downtime.
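As a rough, hedged example of the kind of fleet audit this implies: after the glibc package is upgraded, processes that haven’t been restarted still map the old, now-deleted library, and /proc shows those mappings with a “(deleted)” suffix. A sketch of that check on a single Linux host:

```python
# A rough sketch (not a vetted audit tool): find processes that still map a
# replaced libc, i.e. processes that haven't been restarted since the upgrade.
import os

def stale_libc_pids():
    stale = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/maps") as maps:
                for line in maps:
                    if "(deleted)" in line and ("libc.so" in line or "libc-" in line):
                        stale.append(int(pid))
                        break
        except OSError:
            continue  # the process exited, or we lack permission to read it
    return stale

if __name__ == "__main__":
    print("processes still mapping a replaced libc:", stale_libc_pids())
```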

Bryan Cantrill, CTO of Joyent, really crystallized the gross feelings that have been rolling around in my mind with regard to unikernels. I would point colleagues to this article if they suggested we should deploy unikernel applications. He makes lots of good points, especially this one:

Unikernels are entirely undebuggable. There are no processes, so of course there is no ps, no htop, no strace — but there is also no netstat, no tcpdump, no ping!

I find the implicit denial of debugging production systems to be galling, and symptomatic of a deeper malaise among unikernel proponents: total lack of operational empathy.

Atlassian dissects their response to a recent outage and in the process shares a lot of excellent detail on their incident response and SRE process. I love that they’re using the Incident Commander system (though under a different name). This could have (and probably has) come out of my mouth:

The primary goal of the incident team is to restore service. This is not the same as fixing the problem completely – remember that this is a race against the clock and we want to focus first and foremost on restoring customer experience and mitigating the problem. A quick and dirty workaround is often good enough for now – the emphasis is on “now”!

My heart goes out to those passengers hurt and killed and to their families, but also to the controller who made the error. There’s a lot to investigate here about how a single human was in such a position that a single error could cause such devastation. Hopefully there are ways in which the system can be remediated to prevent such catastrophes in the future.

Like medicine, we can learn a lot about how to prevent and deal with errors from the hard lessons learned in aviation.

You’d think technically advanced aircraft would be safer with all that information and fancy displays. Why they’re not has a lot to do with how our brains work.

When I saw Telstra offer a day of free data to its customers to make up for last week’s outage, I cringed. I’m impressed that they survived last Sunday as Australia used 1.8 petabytes of data.

In this article, the author describes discovering that a service he previously ignored and assumed saw very little traffic actually served a million requests per day.

If ignorance isn’t an operational strategy, what is? Paranoia. You should code and run your systems like a large group of Internet lunatics are out to abuse the hell out of them.

This is a great intro to Chaos Engineering, a field I didn’t know existed, born out of Netflix’s Chaos Monkey. This is the first article in what the author promises will be a biweekly series.

Thanks to Devops Weekly for this one.

Outages

SRE Weekly Issue #10


This week’s issue is packed with really meaty articles, which is a nice departure from last week’s somewhat sparse issue.

Articles

So much about what modern medicine has learned about system failures applies directly to SRE, usually without any adaptation required. In this edition of The Morning Paper, Adrian Colyer gives us his take on an excellent short paper by an MD. My favorite quotes:

Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.

When new technologies are used to eliminate well understood system failures or to gain high precision performance they often introduce new pathways to large scale, catastrophic failures.

The Software Evolution & Architecture Lab of the University of Zurich is doing a study on modern continuous deployment practices. I’m really interested to see the results, especially where CD meets reliability, so if you have a moment, please hop on over and add your answers. Thanks to Gerald Schermann at UZH for reaching out to me for this.

I’ve been debating with myself whether or not to link to Emerson Network Power’s survey of datacenter outage costs and causes. The report itself is mostly just uninteresting numbers and it’s behind a signup-wall. However, this article is a good summary of the report and links in other interesting stats.

Facebook algorithmically generated hundreds of millions of custom-tailored video montages for its birthday celebration. How they did it without dedicating specific hardware to the task and without impacting production is a pretty interesting read.

Administering ElasticSearch can be just as complicated and demanding as MySQL. This article has an interesting description of SignalFX’s method for resharding without downtime.

This is a pretty interesting report that I’d never heard of before. It’s long (60 pages), but worth the read for a few choice tidbits. For example, I’ve seen this over and over in my career:

Yet, delayed migrations jeopardize business productivity and effectiveness, as companies experience poor system performance or postpone replacement of hardware past its shelf life.

Also, I was surprised that even now, over 70% of respondents said they still use “Tape Backup / Off-site Storage”. I wonder if people are lumping S3 into that.

Never miss an ack or you’ll be in even worse trouble.

More on last week’s outage. I have to figure “voltage regular” means power supply. Everyone hates simultaneous failure.

A full seven years after they started migration, Netflix announced this week that their streaming service is now entirely run out of AWS. That may seem like a long time until you realize that Netflix took a comprehensive approach to the migration:

Arguably, the easiest way to move to the cloud is to forklift all of the systems, unchanged, out of the data center and drop them in AWS. But in doing so, you end up moving all the problems and limitations of the data center along with it. Instead, we chose the cloud-native approach, rebuilding virtually all of our technology and fundamentally changing the way we operate the company.

Outages

  • Telstra
  • Visual Studio Online
    • Caused by a memory-hogging bug in MS SQL Server’s query planner.

  • TNReady
    • Tennessee (US state) saw an outage of the new online version of its school system’s standardized tests.

  • CBS Sports App
    • During the Super Bowl is a terrible time to fail, but of course that’s exactly when failure is most likely, due to the peak in demand.

  • TPG Submarine Fiber Optic Cable
    • This one has some really interesting discussion about how the fiber industry handles failures.

  • Apple Pay

SRE Weekly Issue #9

Articles

I spoke too soon in the last issue! GitHub has posted an extremely thorough postmortem that answers any questions one might have had about last week’s outage. I like the standard they’re holding themselves to for timely communication:

One of the biggest customer-facing effects of this delay was that status.github.com wasn’t set to status red until 00:32am UTC, eight minutes after the site became inaccessible. We consider this to be an unacceptably long delay, and will ensure faster communication to our users in the future.

Just monitoring servers isn’t enough to detect an outage. Sometimes even detailed service monitoring can miss an overall performance degradation that involves multiple services in an infrastructure. In this blog post, PagerDuty suggests also monitoring key business metrics (logins, purchase rate, etc).
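A minimal sketch of the idea: compare a key business metric against its recent baseline and page if it craters, even when every host-level check is green. The function name and threshold below are hypothetical stand-ins for whatever metrics store and alerting hook you actually use.

```python
# Hedged sketch: alert when a business metric (e.g. logins per minute) drops
# well below its recent baseline, even if all host-level checks look healthy.
from statistics import median

def business_metric_alert(recent, current, floor_ratio=0.5):
    """Return True (alert) if the current rate is below half the recent median."""
    baseline = median(recent)
    return baseline > 0 and current < floor_ratio * baseline

# Example: the last few samples averaged ~120 logins/min, now we see 31.
if business_metric_alert(recent=[118, 125, 130, 122, 119], current=31):
    print("ALERT: login rate well below baseline -- page the on-call")
```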

In this case, “yesterday” was in 2013, but this is an excellent postmortem from Mailgun that can serve as an example for all of us.

A customer’s perspective on a datacenter outage, with emphasis on the need for early, frequent, and thorough communication from service providers.

A nicely detailed outage postmortem, including the gory details of the train of thought the engineers followed on the way to a solution. They hint at an important technique that’s not discussed nearly enough, in my opinion: judicious application of bandaid solutions to resolve the outage and allow engineers to continue their interrupted personal time. It’s not necessary to fix a problem the “right” way in the moment, and carefully-applied bandaids help reduce on-call burnout.

How can we be sure (or at least sort of confident) that distributed systems won’t fail? They can be incredibly complex, and their failures can be even more complex. Caitie McCaffrey gives us this ACM Queue article about methods for formal and informal verification.

Efficiently testing distributed systems is not a solved problem, but by combining formal verification, model checking, fault injection, unit tests, canaries, and more, you can obtain higher confidence in system correctness.

Medium has announced a commitment to publishing postmortems for all outages. I’d love to see more companies making a commitment like this. Thanks to reader Pete Shima for this link.

Outages

A production of Tinker Tinker Tinker, LLC