SRE Weekly Issue #7

A big thanks to Charity Majors (@mipsytipsy) for tweeting about SRE Weekly and subsequently octupling my subscriber list!

Articles

This article is gold. CatieM explains why clients can’t be trusted, even when they’re written in-house. She describes how her team avoided an outage during the Halo 4 launch by turning off non-essential functionality. Had she trusted the clients, she might not have built in the kill switches that let her shed the excessive load caused by a buggy client.
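
In case it helps make the idea concrete, here's a minimal, generic sketch of a server-side kill switch for shedding non-essential work. None of the feature names below come from the article; they're purely illustrative.

    # Generic sketch of a server-side kill switch for non-essential features.
    # Feature names are hypothetical, not from the Halo 4 services.
    NON_ESSENTIAL_FEATURES = {"presence_updates", "rich_stats", "friend_suggestions"}
    DISABLED_FEATURES = set()   # flipped at runtime from a config store or admin tool

    def handle_request(feature, handler, request):
        if feature in DISABLED_FEATURES:
            # Shed load: answer cheaply instead of doing the expensive work.
            return {"status": "degraded", "feature": feature}
        return handler(request)

    # During an incident, an operator sheds the non-essential load:
    DISABLED_FEATURES.update(NON_ESSENTIAL_FEATURES)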

Facebook recently released a live video streaming feature. Because they’re Facebook, they’re dealing with a scale that existing solutions can’t even come close to supporting (think millions of viewers for celebrity live video broadcasts). This article goes into detail about how they handle that level of concurrency for live streaming. I especially like the bit about request coalescing.
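
Request coalescing (sometimes called single-flight) means that when many viewers ask for the same not-yet-cached segment, only one request goes to the origin and everyone else waits for that single result. Here's a rough sketch of the general pattern in Python; it's not Facebook's code, and fetch_from_origin is a hypothetical stand-in for the expensive backend call.

    import threading

    class CoalescingCache:
        """Only one request per key goes to the origin; concurrent requests
        for the same key wait for that single in-flight result."""

        def __init__(self, fetch_from_origin):
            self._fetch = fetch_from_origin   # the expensive origin call
            self._cache = {}                  # key -> cached value
            self._inflight = {}               # key -> Event signalling completion
            self._lock = threading.Lock()

        def get(self, key):
            while True:
                with self._lock:
                    if key in self._cache:
                        return self._cache[key]
                    event = self._inflight.get(key)
                    if event is None:
                        # First requester for this key: fetch on behalf of everyone.
                        event = self._inflight[key] = threading.Event()
                        we_fetch = True
                    else:
                        we_fetch = False
                if not we_fetch:
                    # Someone else is already fetching; wait, then re-check the cache.
                    event.wait()
                    continue
                try:
                    value = self._fetch(key)
                    with self._lock:
                        self._cache[key] = value
                finally:
                    with self._lock:
                        del self._inflight[key]
                    event.set()
                return value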

Best. I pretty much only like the parodies of Uptown Funk.

This is a really great little essay comparing running a large infrastructure with flying a plane by instruments. Paying attention to just one or two instruments without understanding the big picture results in errors.

Thanks to Devops Weekly for this one.

An excellent incident response summary for an outage caused by domain name expiration. The live Grafana charts are a nice touch, along with the dashboard snapshot. It’s exciting to see how far that project has come!

Calculating availability is hard. Really hard. First, you have to define just what constitutes availability in your system. Once you’ve decided how you calculate availability, you’ve defined the goalposts for improving it. In this article, VividCortex presents a general, theoretical formula for availability and a corresponding 3D graph that shows that improving availability involves both increasing MTBF and reducing MTTR.
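
For reference, the classic steady-state relationship is availability = MTBF / (MTBF + MTTR); the article's formula is more general, but a quick illustration with made-up numbers shows why both knobs matter.

    def availability(mtbf_hours, mttr_hours):
        """Classic steady-state availability: uptime / (uptime + downtime)."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Made-up numbers: failing once a month (720 h) and taking 1 h to recover
    # gives ~99.86% availability; halving MTTR or doubling MTBF each removes
    # roughly half of the remaining downtime.
    print(availability(720, 1.0))   # 0.99861...
    print(availability(720, 0.5))   # 0.99930...
    print(availability(1440, 1.0))  # 0.99930...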

TechCentral.ie gives us this opinion piece on the frequency of outages in major cloud providers. The author argues that, though reported outages may seem major, they rarely result in SLA violations, and service availability is still probably better than individual companies could manage on their own.

Full disclosure: Heroku, my employer, is mentioned.

An external post-hoc analysis of the recent outage at JetBlue, with speculation on the seeming lack of effective DR plans at JetBlue and Verizon. The article also mentions the massive outage at 365 Main’s San Francisco datacenter in 2007, which is definitely worth a read if you missed that one.

Linden Lab Systems Engineer April wrote up a detailed postmortem of the multiple failures that added up to a rough weekend for Second Life users. I worked on recovery from at least a few failures of that central database during my several years at Linden, and managing the thundering herd that floods through the gates when you reopen them is pretty tricky. Good luck folks, and thanks for the excellent write-up!
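
One common, generic way to tame that thundering herd is to have clients reconnect with capped exponential backoff plus jitter so they don't all arrive in lockstep. A rough sketch, not Linden Lab's actual approach:

    import random
    import time

    def reconnect_with_jitter(connect, base=1.0, cap=300.0):
        """Retry a connection with capped exponential backoff and full jitter,
        so a crowd of clients doesn't reconnect all at once."""
        attempt = 0
        while True:
            try:
                return connect()
            except ConnectionError:
                attempt += 1
                delay = random.uniform(0, min(cap, base * 2 ** attempt))
                time.sleep(delay)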

Netflix has taken the Chaos Monkey to the next level. Now their automated system investigates the services a given request touches and injects artificial failures in various dependencies to see if they cause end-user errors. It takes a lot of guts to decide that purposefully introducing user-facing failures is the best way to ultimately improve reliability.

…we’re actually impacting 500 members’ requests in a day, some of which are further mitigated by retries. When you’re serving billions of requests each day, the impact of these experiments is very small.
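
The underlying pattern is to wrap calls to a dependency and fail a tiny, controlled fraction of them, then verify that fallbacks keep end users from seeing errors. Here's a rough sketch of that pattern; it is not Netflix's FIT implementation, and the class name and random-sampling strategy are assumptions for illustration.

    import random

    class FailureInjectingClient:
        """Wraps a dependency client and fails a small, configurable fraction
        of calls to test that fallbacks actually protect users."""

        def __init__(self, real_client, failure_rate=0.0001):
            self.real_client = real_client
            self.failure_rate = failure_rate  # e.g. ~1 in 10,000 requests

        def call(self, *args, **kwargs):
            if random.random() < self.failure_rate:
                raise RuntimeError("injected dependency failure")
            return self.real_client.call(*args, **kwargs)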

Outages

Only a few this week, but they were whoppers!

  • Twitter
    • Twitter suffered a massive outage lasting at least two hours, with sporadic availability for several hours afterward. Hilariously, they posted status updates about the outage on Tumblr.

  • Comcast (SF Bay area)
  • Africa
    • This is the first time I’ve had an entire continent in this section. Most of Africa’s Internet was cut off from the rest of the world due to a pair of fiber cuts. South Africa was hit especially hard.
