SRE Weekly Issue #32


Downtime is expensive — in more ways than one. Learn the costs of downtime and how to minimize them in the new eBook from VictorOps, “Making the Case for Real-Time Incident Management.”


It’s tempting to use the newest shiny stack when building a new system. Dan McKinley argues that you should limit yourself to only a few shiny technologies to avoid excessive operational burden.

[…] the long-term costs of keeping a system working reliably vastly exceed any inconveniences you encounter while building it. 

Quick on the draw, Pete Shima gives us a review of Stack Exchange’s outage postmortem (linked below) as part of the Operations Incident Board’s Postmortem Report Reviews project. Thanks, Pete!

The second annual Chaos Community Day, an event full of presentations on chaos engineering, takes place in Seattle next month. I wish I could attend!

As the world becomes more and more dependent on the services we administer, outages become more and more likely to put real people in danger. Here’s a rundown of how dangerous last week’s four-hour outage at the US National Weather Service was.

An interesting opinion piece arguing that Microsoft Azure is more robust than Google’s and Amazon’s offerings.

This week, I’m trying to catch all the articles being written about Pokémon GO. Here’s one that supposes the problem might be a lack of sufficient testing.

Pokémon GO is blowing up like crazy, and I don’t just mean in popularity. Forbes has a lot to say about the complete lack of communication during and after outages, and we’d do well to listen. This article reads a lot like a recipe for how to communicate well to your userbase about outages.

Here’s the continuation of last month’s article on Netflix’s billing migration.


Updated: July 25, 2016 — 9:41 am