SRE Weekly Issue #36

Last week’s DevOps & SRE AMA was super fun! Thanks to the panelists for participating. Recordings should be posted soon.


Frustrated by the lack of tools available to automate incident response? Learn how ChatOps can help manage your operations through group chat in the latest book from O’Reilly. Get your copy here:


This is the second half of Server Density’s series on the lessons they learned as they transitioned to a multi-datacenter architecture. There are lots of interesting tidbits in here, such as an explanation of how they handle failover to the secondary DC and what they do if that goes wrong.

Full disclosure: Heroku, my employer, is mentioned.

Here’s the second half of Mathias Lafeldt’s series that seeks to apply Richard Cook’s How Complex Systems Fail to web systems. The article is great, but the really awesome part is the thoughtful responses by Cook himself to both parts one and two, linked at the end of this article.

Here’s a postmortem for last week’s outage that involved a migration gone awry.

Thanks to Jonathan Rudenberg for this one.

A patent holding firm alleges that the USPTO overstepped its authority in declaring that a system outage (reported in issue #4) would be treated as a national holiday for deadline purposes, and that this decision led to the plaintiff being sued.

Burnout is a crucially important consideration in a field with on-call work. VictorOps has a few tips for alleviating burnout gleaned from this year’s Monitorama.

Edith Harbaugh says that staging servers present a reliability risk that doesn’t outweigh their benefit. This article is an update to her original article, which I also recommend reading.

GitHub uses HAProxy to balance reads across its read-only MySQL replicas, which is a common method. Their technique for excluding lagging nodes while avoiding entirely emptying the pool if all nodes are lagging is pretty neat.

Thanks to Devops Weekly for this one.
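One common way to implement that pattern (a hypothetical sketch, not necessarily GitHub's exact setup) is a small health endpoint on each replica that HAProxy polls via `option httpchk`: it answers 200 when replication lag is acceptable and 503 when it isn't, so lagging replicas drop out of the primary pool. A separate `backup` backend in haproxy.cfg that skips the lag check keeps the pool from emptying entirely when every replica is lagging. The names and threshold below are assumptions for illustration.

```python
MAX_LAG_SECONDS = 5  # assumed threshold; tune for your workload

def health_status(lag_seconds):
    """Map replication lag to an HTTP status code for the load balancer.

    `lag_seconds` would come from e.g. Seconds_Behind_Master on the replica;
    None means replication is broken or not running.
    """
    if lag_seconds is None:
        return 503  # replication broken: remove from pool
    if lag_seconds > MAX_LAG_SECONDS:
        return 503  # lagging: remove from primary pool
    return 200      # healthy: keep serving reads
```

In haproxy.cfg, the primary backend would health-check this endpoint, while a `backup`-flagged backend (or a second backend with a laxer check) serves as the last-resort pool so reads still flow when everything is lagging.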

A highly detailed deep-dive on Serverless — what it means, benefits, and drawbacks. I especially enjoyed the #NoOps section:

[Ops] also means at least monitoring, deployment, security, networking and often also means some amount of production debugging and system scaling. These problems all still exist with Serverless apps and you’re still going to need a strategy to deal with them. In some ways Ops is harder in a Serverless world because a lot of this is so new.


Full disclosure: Heroku, my employer, is mentioned.


SRE Weekly Issue #35


What is Modern Incident Management? Download the Incident Management Buyer’s Guide to learn all about it, and the value it provides. Get your copy here:


Whoops, here’s one that got lost in my review queue. Charity Majors (one of the usual suspects here at SRE Weekly) wrote one of her characteristically detailed and experience-filled posts on how to isolate your production, staging, and development environments in AWS.

I can’t quite tell how much of this is John Allspaw’s writing and how much is written by the US Forest Service, but I love it all. Here’s a bulleted list of points driving home the fact that we constantly strike a balance between risk and safety.

Server Density added multi-datacenter redundancy to their infrastructure in 2013, and they were kind enough to document what they learned. In this first of two articles, they outline different kinds of multi-datacenter setups and go over the kinds of things you’ll have to think about as you retrofit your application.

This short opinion piece raises an excellent idea: SLAs aren’t for recouping the cost you incurred due to an outage, they are for making a point to a service provider about the outage.

Southwest has released some numbers on the impact of last month’s outage that resulted in thousands of cancelled flights.

Netflix gives us a rundown of how they prepare a title for release by pre-filling caches in their in-house CDN. I like the part about timing pre-filling during off-peak hours to avoid impacting the service.

How much is your company willing to invest for a truly effective DR solution? This article asks that question and along the way digs into what an effective DR solution looks like and why it costs so much.


  • Syria
    • The Syrian government shut internet access down to prevent cheating on school exams.

  • Mailgun
    • The linked postmortem is really interesting: Mailgun experienced an outage when their domain registrar abruptly placed their domain on hold. The registrar was subsequently largely uncommunicative, hampering incident resolution. Lesson learned: make sure you can trust your registrar, because they have the power to ruin your day.

  • Belnet
    • The linked article has some intriguing detail about a network equipment failure that caused a routing loop.

  • Australia’s census website
    • This caught my eye:

      Revolution IT simulated an average sustained peak of up to 350 submissions per second, but only expected up to 250 submissions per second.

      Load testing only 40% above expected peak demand? That seems like a big red flag to me.

  • Reddit
  • Etisalat (UAE ISP)
  • Vodafone
  • Google Drive
  • AT&T
  • Delta Air Lines
    • A datacenter power system failure resulted in cancelled flights worldwide.

SRE Weekly Issue #34


What is Modern Incident Management? Download the Incident Management Buyer’s Guide to learn all about it, and the value it provides. Get your copy here:


Winston is Netflix’s tool for runbook automation, based on the open source StackStorm. Winston helps reduce pager burden by filtering out false-positive alerts, collecting information for human responders, and remediating some issues automatically.
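The dispatch pattern described above can be sketched roughly as follows. This is an illustrative toy, not Netflix's actual Winston code: the signature names, runbook table, and return values are all assumptions. Each incoming alert is checked against known false-positive signatures, auto-remediated when a runbook exists, and otherwise escalated to a human with diagnostics attached.

```python
# Hypothetical signatures and remediations for illustration only.
KNOWN_FALSE_POSITIVES = {"transient-dns-blip"}
RUNBOOKS = {"disk-full": lambda alert: "rotated logs"}

def handle_alert(alert):
    """Return (action, detail) for an alert dict with a 'signature' key."""
    sig = alert["signature"]
    if sig in KNOWN_FALSE_POSITIVES:
        return ("suppress", None)  # filter out the false positive
    if sig in RUNBOOKS:
        # A runbook exists: remediate automatically and record what was done.
        return ("auto-remediate", RUNBOOKS[sig](alert))
    # No automation available: gather context and page a human.
    return ("page", {"signature": sig, "diagnostics": "collected output"})
```

The pager-burden win comes from the first two branches: a human is only woken for the cases automation can't classify or fix.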

Is it valid for those working on non-life-critical systems to try to draw on lessons learned in safety-critical fields like surgery and air traffic control? John Allspaw, citing Dr. Richard Cook, answers with an emphatic yes.

The best HA infrastructure design in the world won’t save you when your credit card on file expires.

There’s a huge amount of detail on both PostgreSQL and MySQL in this article, including some sneaky edge-case pitfalls that prompted Uber to look for a new database.

This article goes into a good amount of depth on setting up a Cassandra cluster to survive a full AZ outage.
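The core of that setup can be captured in a single keyspace definition (a sketch, assuming an EC2 deployment and a datacenter named `us-east`; not the article's exact configuration). With the EC2 snitch, Cassandra treats each availability zone as a rack, and NetworkTopologyStrategy with a replication factor of 3 places one replica per AZ, so QUORUM reads and writes (2 of 3 replicas) keep succeeding when an entire AZ goes dark.

```sql
-- Hypothetical keyspace: one replica per AZ via the EC2 snitch's
-- zone-to-rack mapping, RF=3 across the us-east datacenter.
CREATE KEYSPACE app
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3
  };
```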

When a Maryland, US county’s emergency services went offline for two hours, 100 calls were missed, possibly contributing to two deaths. In the vein of last week’s theme of complex failures:

“This is really complex, and a lot of dominoes fell in a way that people just didn’t expect,” said Marc Elrich, chairman of the Public Safety Committee.

Here’s the third (final?) installment in this series. This one has some fascinating details on a topic near and dear to my heart: live migration of a database. Their use of DRBD and synchronous replication is especially intriguing.

Ooh, this is gonna be fun. Catchpoint and O’Reilly are hosting an AMA (Ask Me Anything) with DevOps and SRE folks, including Liz Fong-Jones and Charity Majors, both of whose articles have been featured here previously. The questions posted so far look pretty great.


SRE Weekly Issue #33


Downtime is expensive — in more ways than one. Learn the costs of downtime and how to minimize them in the new eBook from VictorOps, “Making the Case for Real-Time Incident Management.”


Here’s another great article urging caution when adopting new tools. Codeship’s Jessica Kerr categorizes technologies into a continuum of risk, from single-developer tools all the way up to new databases. She goes into a really excellent amount of detail, providing examples of how adopting a new technology can come back to bite you.

After several recent incidents of nations cutting off or severely curtailing internet connectivity, the UN took a stand, as reported in this Register article:

The United Nations officially condemned the practice of countries shutting down access to the internet at a meeting of the Human Rights Council on Friday.

Is it possible to design an infrastructure and/or security environment in which a rogue employee cannot take down the service?

Mathias Lafeldt is back in this latest issue of Production Ready. In this part 1 of 2, he reviews Richard Cook’s classic How Complex Systems Fail, with an eye toward applying it to web systems.

And with a nod to Lafeldt for the link, here’s another classic from John Allspaw on complexity of failures.

In the same way that you should never list “human error” as a root cause, if you only have a single root cause, you haven’t dug deep enough.

SGX released a postmortem for their mid-July outage in the form of a press release. Just as Allspaw tells us, the theoretically simple root cause (disk failure) was exacerbated by a set of complicating factors.

In this recap of a joint webinar, Threat Stack and VictorOps share 7 methods to avoid and reduce alert fatigue.


SRE Weekly Issue #32


Downtime is expensive — in more ways than one. Learn the costs of downtime and how to minimize them in the new eBook from VictorOps, “Making the Case for Real-Time Incident Management.”


It’s tempting to use the newest shiny stack when building a new system. Dan McKinley argues that you should limit yourself to only a few shiny technologies to avoid excessive operational burden.

[…] the long-term costs of keeping a system working reliably vastly exceed any inconveniences you encounter while building it. 

Quick on the draw, Pete Shima gives us a review of Stack Exchange’s outage postmortem (linked below) as part of the Operations Incident Board’s Postmortem Report Reviews project. Thanks, Pete!

Next month in Seattle will be the second annual Chaos Community Day, an event full of presentations on chaos engineering. I wish I could attend!

As the world becomes more and more dependent on the services we administer, outages become more and more likely to put real people in danger. Here’s a rundown of how dangerous last week’s four-hour outage in the US National Weather Service was.

An interesting opinion piece arguing that Microsoft Azure is more robust than Google’s and Amazon’s offerings.

This week, I’m trying to catch all the articles being written about Pokémon GO. Here’s one that supposes the problem might be a lack of sufficient testing.

Pokémon GO is blowing up like crazy, and I don’t just mean in popularity. Forbes has a lot to say about the complete lack of communication during and after outages, and we’d do well to listen. This article reads a lot like a recipe for how to communicate well to your userbase about outages.

Here’s the continuation of last month’s article on Netflix’s billing migration.


SRE WEEKLY © 2015 Frontier Theme