SRE Weekly Issue #33


Downtime is expensive — in more ways than one. Learn the costs of downtime and how to minimize them in the new eBook from VictorOps, “Making the Case for Real-Time Incident Management.”


Here’s another great article urging caution when adopting new tools. Codeship’s Jessica Kerr categorizes technologies into a continuum of risk, from single-developer tools all the way up to new databases. She goes into a really excellent amount of detail, providing examples of how adopting a new technology can come back to bite you.

After several recent incidents of nations cutting off or severely curtailing internet connectivity, the UN took a stand, as reported in this Register article:

The United Nations officially condemned the practice of countries shutting down access to the internet at a meeting of the Human Rights Council on Friday.

Is it possible to design an infrastructure and/or security environment in which a rogue employee cannot take down the service?

Mathais Lafeldt is back in this latest issue of Production Ready. In this part 1 of 2, he reviews Richard Cook’s classic How Complex Systems Fail, with an eye toward applying it to web systems.

And with a nod to Lafeldt for the link, here’s another classic from John Allspaw on complexity of failures.

In the same way that you shouldn’t ever have root cause “human error”, if you only have a single root cause, you haven’t dug deep enough.

SGX released a postmortem for their mid-July outage in the form of a press release. Just as Allspaw tells us, the theoretically simple root cause (disk failure) was exacerbated by a set of complicating factors.

In this recap of a joint webinar, Threat Stack and VictorOps share 7 methods to avoid and reduce alert fatigue.


Updated: July 31, 2016 — 10:40 pm
SRE WEEKLY © 2015 Frontier Theme