How many minutes per month is 99.95% availability? What about 99.957%? Here’s a tool that’ll give you a quick answer, by the author of awesome-sre.
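
The arithmetic behind that question is short enough to sketch. This is only an illustration, not the linked tool's implementation: the function name `allowed_downtime_minutes` is hypothetical, and the 30-day month is an assumption (the tool may define a month differently).

```python
def allowed_downtime_minutes(availability_percent: float, days: float = 30.0) -> float:
    """Return the minutes of downtime permitted over `days` at the given availability.

    Assumption: a "month" is 30 days (43,200 minutes) unless `days` says otherwise.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    # The downtime budget is the unavailable fraction of the period.
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.95, 99.957):
        print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min/month")
```

Under the 30-day assumption, 99.95% works out to about 21.6 minutes per month and 99.957% to about 18.6 minutes.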

This article is a partial transcript of Catchpoint’s Chaos Engineering and DiRT AMA.

In chaos engineering, we’re saying, “Look, these systems are just too complicated at this point to reasonably expect an engineer to know all of the properties that a system has.”

Somewhat intro-level, but I like this little gem:

“[…] we have to put some requests at risk for the sake of protecting our overall availability. We want to keep that risk to a minimum. This raises the question: What is the smallest experiment we can run that still gives us confidence in the result?”

This article chronicles New Relic’s attempt to test a new system to prove that it was ready for production.

SQS, Kafka, and others tout features like “exactly once” and “FIFO”, but there are necessarily some pretty big caveats and edge cases to those features that really can’t be ignored.

Really, the title should be “The Google SRE Model”. This article discusses Google’s philosophy that the SRE team is optional for any given system, but that when SRE isn’t around, the team that owns the system should be doing the work SRE would otherwise do.

SYNQ pushes for transparency in incident response and commits to publishing their RCAs (like this one). They also include a simple template for RCAs at the end of the article.


  • AWS
    • us-east-1 had another one-AZ network outage.
  • Poloniex (altcoin exchange)
  • Skype
  • British Airways
  • Canada
    • A large portion of Canada had a major mobile phone and internet outage due to a fiber cut.
  • Heroku
    • Heroku has had a string of major outages, marked as red on their status page. Apologies for not linking to them individually and as they’ve happened, but here’s a link to their historical list. No public statement has been posted yet.

      Full disclosure: Heroku is my employer.

