
SRE Weekly Issue #11

Articles

The big scary news this week was the buffer overflow vulnerability in glibc’s getaddrinfo function. Aside from the obvious impact of intrusions on availability, this bug requires us to roll virtually every production service in our fleets. If we don’t have good procedures in place for that, now is when we find out the hard way, through downtime.
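For reference, the bug is CVE-2015-7547, and shipping fixed packages isn’t enough on its own: every long-running process linked against the old libc also needs a restart. Here’s a hedged sketch of a per-host check you might run during a fleet sweep; the 2.23 threshold is an assumption, since distros backport the fix without bumping the version.

    # A quick inventory check, not a substitute for your distro's advisory:
    # report this host's glibc version so a fleet-wide sweep can flag
    # likely-vulnerable hosts. The 2.23 threshold is illustrative, and every
    # service linked against the old libc still needs a restart after patching.
    import platform

    libc, version = platform.libc_ver()
    if libc == "glibc" and tuple(int(p) for p in version.split(".")[:2]) < (2, 23):
        print(f"glibc {version}: possibly vulnerable to CVE-2015-7547, plan a roll")
    else:
        print(f"{libc or 'unknown libc'} {version}: confirm against your distro's advisory")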

Bryan Cantrill, CTO of Joyent, really crystallized the gross feelings that have been rolling around in my mind with regard to unikernels. I would point colleagues to this article if they suggested we should deploy unikernel applications. He makes lots of good points, especially this one:

Unikernels are entirely undebuggable. There are no processes, so of course there is no ps, no htop, no strace — but there is also no netstat, no tcpdump, no ping!

I find the implicit denial of debugging production systems to be galling, and symptomatic of a deeper malaise among unikernel proponents: total lack of operational empathy.

Atlassian dissects their response to a recent outage and in the process shares a lot of excellent detail on their incident response and SRE process. I love that they’re using the Incident Commander system (though under a different name). This could have (and probably has) come out of my mouth:

The primary goal of the incident team is to restore service. This is not the same as fixing the problem completely – remember that this is a race against the clock and we want to focus first and foremost on restoring customer experience and mitigating the problem. A quick and dirty workaround is often good enough for now – the emphasis is on “now”!

My heart goes out to the passengers hurt and killed and to their families, but also to the controller who made the error. There’s a lot to investigate here about how a single human was put in a position where a single error could cause such devastation. Hopefully there are ways in which the system can be remediated to prevent such catastrophes in the future.

Like medicine, we can learn a lot about how to prevent and deal with errors from the hard lessons learned in aviation.

You’d think technically advanced aircraft would be safer with all that information and fancy displays. Why they’re not has a lot to do with how our brains work.

When I saw Telstra offer a day of free data to its customers to make up for last week’s outage, I cringed. I’m impressed that they survived last Sunday as Australia used 1.8 petabytes of data.

In this article, the author describes discovering that a service he previously ignored and assumed saw very little traffic actually served a million requests per day.

If ignorance isn’t an operational strategy, what is? Paranoia. You should code and run your systems like a large group of Internet lunatics are out to abuse the hell out of them.

This is a great intro to Chaos Engineering, a field I didn’t know existed, born out of Netflix’s Chaos Monkey. This is the first article in what the author promises will be a biweekly series.

Thanks to Devops Weekly for this one.

Outages

SRE Weekly Issue #10


This week’s issue is packed with really meaty articles, which is a nice departure from last week’s somewhat sparse issue.

Articles

So much about what modern medicine has learned about system failures applies directly to SRE, usually without any adaptation required. In this edition of The Morning Paper, Adrian Colyer gives us his take on an excellent short paper by an MD. My favorite quotes:

Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.

When new technologies are used to eliminate well understood system failures or to gain high precision performance they often introduce new pathways to large scale, catastrophic failures.

The Software Evolution & Architecture Lab of the University of Zurich is doing a study on modern continuous deployment practices. I’m really interested to see the results, especially where CD meets reliability, so if you have a moment, please hop on over and add your answers. Thanks to Gerald Schermann at UZH for reaching out to me for this.

I’ve been debating with myself whether or not to link to Emerson Network Power’s survey of datacenter outage costs and causes. The report itself is mostly just uninteresting numbers and it’s behind a signup-wall. However, this article is a good summary of the report and links in other interesting stats.

Facebook algorithmically generated hundreds of millions of custom-tailored video montages for its birthday celebration. How they did it without dedicating specific hardware to the task and without impacting production is a pretty interesting read.

Administering ElasticSearch can be just as complicated and demanding as MySQL. This article has an interesting description of SignalFX’s method for resharding without downtime.
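The article has the specifics; as a generic sketch (not necessarily SignalFX’s exact approach), the usual zero-downtime pattern is to create a new index with the target shard count, copy data over with _reindex, and atomically swap an alias so clients never notice. The URLs, index names, and shard count below are illustrative.

    # Sketch of the common alias-swap resharding pattern against the
    # Elasticsearch REST API. In practice you'd also dual-write or run a
    # follow-up delta reindex to catch writes that land during the copy.
    import requests

    ES = "http://localhost:9200"

    # 1. Create a new index with the desired shard count.
    requests.put(f"{ES}/events_v2", json={"settings": {"number_of_shards": 12}})

    # 2. Copy documents from the old index while clients keep using the alias.
    requests.post(f"{ES}/_reindex", json={
        "source": {"index": "events_v1"},
        "dest": {"index": "events_v2"},
    })

    # 3. Atomically repoint the alias so readers and writers move in one step.
    requests.post(f"{ES}/_aliases", json={
        "actions": [
            {"remove": {"index": "events_v1", "alias": "events"}},
            {"add": {"index": "events_v2", "alias": "events"}},
        ],
    })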

This is a pretty interesting report that I’d never heard of before. It’s long (60 pages), but worth the read for a few choice tidbits. For example, I’ve seen this over and over in my career:

Yet, delayed migrations jeopardize business productivity and effectiveness, as companies experience poor system performance or postpone replacement of hardware past its shelf life.

Also, I was surprised that even now, over 70% of respondents said they still use “Tape Backup / Off-site Storage”. I wonder if people are lumping S3 into that.

Never miss an ack or you’ll be in even worse trouble.

More on last week’s outage. I have to figure “voltage regulator” means power supply. Everyone hates simultaneous failure.

A full seven years after they started migration, Netflix announced this week that their streaming service is now entirely run out of AWS. That may seem like a long time until you realize that Netflix took a comprehensive approach to the migration:

Arguably, the easiest way to move to the cloud is to forklift all of the systems, unchanged, out of the data center and drop them in AWS. But in doing so, you end up moving all the problems and limitations of the data center along with it. Instead, we chose the cloud-native approach, rebuilding virtually all of our technology and fundamentally changing the way we operate the company.

Outages

  • Telstra
  • Visual Studio Online
    • Caused by a memory-hogging bug in MS SQL Server’s query planner.

  • TNReady
    • Tennessee (US state) saw an outage of the new online version of its school system’s standardized tests.

  • CBS Sports App
    • The Super Bowl is a terrible time to fail, but of course that’s also when failure is most likely, given the peak in demand.

  • TPG Submarine Fiber Optic Cable
    • This one has some really interesting discussion about how the fiber industry handles failures.

  • Apple Pay

SRE Weekly Issue #9

Articles

I spoke too soon in the last issue! Github has posted an extremely thorough postmortem that answers any questions one might have had about last week’s outage. I like the standard they’re holding themselves to for timely communication:

One of the biggest customer-facing effects of this delay was that status.github.com wasn’t set to status red until 00:32am UTC, eight minutes after the site became inaccessible. We consider this to be an unacceptably long delay, and will ensure faster communication to our users in the future.

Just monitoring servers isn’t enough to detect an outage. Sometimes even detailed service monitoring can miss an overall performance degradation that involves multiple services in an infrastructure. In this blog post, PagerDuty suggests also monitoring key business metrics (logins, purchase rate, etc).
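A minimal sketch of the idea, with a hypothetical metrics fetch and paging call: compare a business metric against its own recent baseline and page on a large drop, even when host-level checks are all green.

    # Sketch: watch a business metric (logins per minute) and page when it
    # drops well below its recent baseline. page() and the metric source are
    # hypothetical stand-ins for your paging integration and metrics store.
    import statistics

    def page(message):
        # Stand-in for a real paging integration.
        print("PAGE:", message)

    def check_login_rate(recent_rates, current_rate, min_samples=30):
        """recent_rates: logins/min samples from the last N minutes."""
        if len(recent_rates) < min_samples:
            return  # not enough history to establish a baseline
        baseline = statistics.median(recent_rates)
        if baseline > 0 and current_rate < 0.5 * baseline:
            page(f"Logins at {current_rate}/min, under half the {baseline}/min baseline")

    # Example: a healthy baseline around 200 logins/min, currently seeing 40.
    check_login_rate(recent_rates=[200] * 60, current_rate=40)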

In this case, “yesterday” is in 2013, but this is an excellent postmortem from Mailgun that can serve as an example for all of us.

A customer’s perspective on a datacenter outage, with emphasis on the need for early, frequent, and thorough communication from service providers.

A nicely detailed outage postmortem, including the gory details of the train of thought the engineers followed on the way to a solution. They hint at an important technique that’s not discussed nearly enough, in my opinion: judicious application of bandaid solutions to resolve the outage and allow engineers to continue their interrupted personal time. It’s not necessary to fix a problem the “right” way in the moment, and carefully-applied bandaids help reduce on-call burnout.

How can we be sure (or at least sort of confident) that distributed systems won’t fail? They can be incredibly complex, and their failures can be even more complex. Caitie McCaffrey gives us this ACM Queue article about methods for formal and informal verification.

Efficiently testing distributed systems is not a solved problem, but by combining formal verification, model checking, fault injection, unit tests, canaries, and more, you can obtain higher confidence in system correctness.

Medium has announced a commitment to publishing postmortems for all outages. I’d love to see more companies making a commitment like this. Thanks to reader Pete Shima for this link.

Outages

SRE Weekly Issue #8


If you only read two articles this week, make them the first two. They’re excellent and exactly the kind of content I’m looking for. If you come across (or write!) anything that would go well in SRE Weekly, I’d love it if you’d toss a link my way.

Articles

Liz Fong-Jones, a Googler and co-chair of SRECon, describes a scale of activities SRE teams engage in, from the basics (keeping the service operating) to having the freedom to improve the service.

This is a really awesome paper. Two Googlers describe in detail the pitfalls of failover-based systems and explain how they design multi-homed active/active services. If Google has learned a lesson, we’d all do well to learn from it, too:

Our experience has been that bolting failover onto previously singly-homed systems has not worked well. These systems end up being complex to build, have high maintenance overhead to run, and expose complexity to users. Instead, we started building systems with multi-homing designed in from the start, and found that to be a much better solution. Multi-homed systems run with better availability and lower cost, and result in a much simpler system overall.
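Obviously a toy compared to what the paper describes, but here’s a sketch of the client-side flavor of active/active, assuming two made-up regional endpoints and a /healthz check: every region takes traffic all the time, and an unhealthy one is simply skipped rather than “failed over” to.

    # Toy sketch of an active/active client: all regions serve traffic, and an
    # unhealthy region is skipped with no failover event. Endpoints and the
    # health-check path are illustrative.
    import random
    import requests

    REGIONS = [
        "https://us-east.example.com",
        "https://eu-west.example.com",
    ]

    def healthy(base_url):
        try:
            return requests.get(f"{base_url}/healthz", timeout=1).ok
        except requests.RequestException:
            return False

    def call(path):
        candidates = [r for r in REGIONS if healthy(r)]
        random.shuffle(candidates)
        for base in candidates:
            try:
                return requests.get(f"{base}{path}", timeout=2)
            except requests.RequestException:
                continue  # try the next live region
        raise RuntimeError("no healthy region available")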

A review of CloudHarmony’s numbers on various cloud providers’ availability in 2015 versus 2014, along with a discussion of how customers deal with outages. I’m a little puzzled by this one:

That’s also partly why most public cloud workloads aren’t used for production or mission-critical applications.

I’m pretty sure plenty of mission-critical stuff is running in EC2, for example.

The team at parall.ax chose Lambda because there are no long-lived servers, and they could offload all the work of scaling their app up and down with demand to Amazon.

Randall Munroe takes on an important question: is it possible to siphon water from Europa to Earth? Okay, the only relation to SRE is that a team of Google SREs submitted the question, but I really love What If.

VictorOps distilled their Minimum Viable Runbooks series (featured here previously) into a polished PDF, with their usual high quality and style.

During an outage this week, Vodafone admitted that they forgot to update their status site. They are looking into an automated system to make updates during outages.

I’ve mostly worked jobs without compensation for on-call, but one job had it. Compensation is nice, but in that case it was there to offset a truly heinous level of pages, so it was small comfort. If you have any good articles about the merits and pitfalls of on-call compensation, please send them my way.

Outages

Lots of downtime this week, including some recurrences and some big names.

SRE Weekly Issue #7

A big thanks to Charity Majors (@mipsytipsy) for tweeting about SRE Weekly and subsequently octupling my subscriber list!

Articles

This article is gold. CaitieM explains why clients can’t be trusted, even when they’re written in-house. She describes how her team avoided an outage during the Halo 4 launch by turning off non-essential functionality. Had she trusted the clients, she might not have built in the kill switches that let her shed the excessive load caused by a buggy client.
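Here’s a minimal sketch of the kill-switch idea (the feature name and flag store are invented): non-essential handlers consult a runtime flag, so operators can shed their load without a deploy.

    # Sketch of a kill switch for a non-essential feature. The flag store here
    # is just a dict; in practice it would live in a config service or
    # feature-flag system that operators can flip at runtime.
    FLAGS = {"presence_updates": True}

    def do_presence_update(request):
        # Stand-in for the real (expensive) non-essential work.
        return {"status": 200}

    def handle_presence_update(request):
        if not FLAGS.get("presence_updates", False):
            # Shed the load: fail fast and cheap while the switch is off.
            return {"status": 503, "body": "presence temporarily disabled"}
        return do_presence_update(request)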

Facebook recently released a live video streaming feature. Because they’re Facebook, they’re dealing with a scale that existing solutions can’t even come close to supporting (think millions of viewers for celebrity live video broadcasts). This article goes into detail about how they handle that level of concurrency for live streaming. I especially like the bit about request coalescing.
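Request coalescing isn’t unique to Facebook; as a rough sketch of the general technique (the fetch function is a stand-in), concurrent requests for the same hot key share a single backend fetch instead of stampeding the origin.

    # Sketch of request coalescing: the first caller for a hot key does the
    # expensive fetch, and concurrent callers for the same key wait for that
    # in-flight result instead of hitting the origin themselves. Nothing is
    # cached once the fetch completes; this only dedupes concurrent work.
    import threading

    _lock = threading.Lock()
    _inflight = {}   # key -> Event signalled when the leader's fetch finishes
    _results = {}    # key -> most recently fetched value

    def coalesced_get(key, fetch):
        with _lock:
            event = _inflight.get(key)
            leader = event is None
            if leader:
                event = threading.Event()
                _inflight[key] = event
        if leader:
            try:
                _results[key] = fetch(key)
            finally:
                with _lock:
                    _inflight.pop(key, None)
                event.set()
            return _results[key]
        event.wait()
        return _results.get(key)  # None if the leader's fetch raised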

Best. I pretty much only like the parodies of Uptown Funk.

This is a really great little essay comparing running a large infrastructure with flying a plane by instruments. Paying attention to just one or two instruments without understanding the big picture results in errors.

Thanks to Devops Weekly for this one.

An awesome incident response summary for an outage caused by domain name expiration. The live Grafana charts are awesome, along with the dashboard snapshot. It’s exciting to see how far that project has come!

Calculating availability is hard. Really hard. First, you have to define just what constitutes availability in your system. Once you’ve decided how you calculate availability, you’ve defined the goalposts for improving it. In this article, VividCortex presents a general, theoretical formula for availability and a corresponding 3D graph that shows that improving availability involves both increasing MTBF and reducing MTTR.
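VividCortex’s formula is more general than this, but the classic relationship underneath it, with made-up numbers plugged in, shows why the surface slopes along both axes:

    A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}},
    \qquad
    \mathrm{MTBF} = 720\ \mathrm{h},\ \mathrm{MTTR} = 1\ \mathrm{h}
    \;\Rightarrow\;
    A = \frac{720}{721} \approx 99.86\%

Doubling MTBF to 1440 hours gets you to roughly 99.93%, and so does halving MTTR to half an hour: you can buy availability on either axis.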

TechCentral.ie gives us this opinion piece on the frequency of outages in major cloud providers. The author argues that, though reported outages may seem major, they still rarely cause violation of SLAs, and service availability is still probably better than individual companies could manage on their own.

Full disclosure: Heroku, my employer, is mentioned.

An external post-hoc analysis of the recent outage at JetBlue, with speculation on the seeming lack of effective DR plans at JetBlue and Verizon. The article also mentions the massive outage at 365 Main’s San Francisco datacenter in 2007, which is definitely worth a read if you missed that one.

Linden Lab Systems Engineer April wrote up a detailed postmortem of the multiple failures that went into a rough weekend for Second Life users. I worked on recovery from at least a few failures in that central database in my several years at Linden, and it’s pretty tricky managing the thundering herd that floods through the gates when you reopen them. Good luck folks, and thanks for the excellent write-up!
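Not Linden Lab’s actual approach, but one common way to blunt that thundering herd is to ramp admissions gradually with per-client jitter instead of opening the gates all at once; the ramp length below is invented.

    # Sketch of a jittered, ramped reopen: rather than letting every waiting
    # client reconnect at once, admit an increasing fraction of login attempts
    # over a ramp window. Rejected clients retry later with their own random
    # backoff, which spreads the load instead of synchronizing it.
    import random
    import time

    RAMP_SECONDS = 1800  # reopen over 30 minutes (made-up figure)
    _reopened_at = time.time()

    def admit_login():
        """Return True if this login attempt should be admitted right now."""
        elapsed = time.time() - _reopened_at
        allowed_fraction = min(1.0, elapsed / RAMP_SECONDS)
        return random.random() < allowed_fraction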

Netflix has taken the Chaos Monkey to the next level. Now their automated system investigates the services a given request touches and injects artificial failures in various dependencies to see if they cause end-user errors. It takes a lot of guts to decide that purposefully introducing user-facing failures is the best way to ultimately improve reliability.

…we’re actually impacting 500 members requests in a day, some of which are further mitigated by retries. When you’re serving billions of requests each day, the impact of these experiments is very small.
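The mechanics, greatly simplified and certainly not Netflix’s actual implementation, look something like this: tag a tiny sample of live requests at the edge, then fail one chosen dependency call only for tagged requests. The names and sampling rate below are invented.

    # Simplified sketch of request-scoped fault injection: a small sample of
    # live requests is tagged, and calls to one chosen dependency fail only
    # for those tagged requests.
    import random

    INJECTION_RATE = 0.0001          # fraction of requests in the experiment
    FAILED_DEPENDENCY = "ratings"    # dependency under test (made up)

    def tag_request(request_context):
        """Decide at the edge whether this request participates."""
        request_context["inject_failure_in"] = (
            FAILED_DEPENDENCY if random.random() < INJECTION_RATE else None
        )

    def call_dependency(name, request_context, real_call):
        """Wrap every dependency call; fail it only for tagged requests."""
        if request_context.get("inject_failure_in") == name:
            raise RuntimeError(f"injected failure in dependency '{name}'")
        return real_call()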

Outages

Only a few this week, but they were whoppers!

  • Twitter
    • Twitter suffered a massive outage at least 2 hours long with sporadic availability for several hours after. Hilariously, they posted status about the outage on Tumblr.

  • Comcast (SF Bay area)
  • Africa
    • This is the first time I’ve had an entire continent in this section. Most of Africa’s Internet was cut off from the rest of the world due to a pair of fiber cuts. South Africa was hit especially hard.
