SRE Weekly Issue #82


The definitive guide for DevOps Post-Incident Reviews (AKA – Postmortems). Learn why traditional methods don’t work – and why fast incident response isn’t enough. Download your free copy of the 90+ page eBook from O’Reilly Media and VictorOps.


Increment issue #2 is out! Want to hear what it was like for these three big companies to move to the cloud? Read on.

This article covers a lot of ground, from general strategy to specific methods for estimating capacity needs. I love this:

Perhaps surprisingly for engineers who work in mission-critical business applications, occasional spikes of 90%+ of our users being entirely unable to use the sole application of our company was an entirely acceptable engineering tradeoff versus sizing our capacity against our peak loads.

I love the insight this article gives me into the huge networks of big CDNs.

Key point: don’t count your chickens before they’ve recovered.

The MTTR time should be stopped when there is verification that all systems are once again operating as expected and end users are no longer negatively affected

Scalyr explains how to move beyond specific playbooks to create a general incident response plan.

Here’s a nice little how-to:

A recent challenge for one of the teams I am currently involved was to find a way in AWS CloudWatch:

  1. To alert if the metric breaches a specified threshold.
  2. To alert if a particular metric has not been sent to CloudWatch within a specified interval.
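One way to cover both requirements with a single alarm is sketched below. It is not necessarily the approach the linked article takes. It relies on the `TreatMissingData` setting of CloudWatch's PutMetricAlarm API, which can count periods with no data points as breaching. The alarm name, namespace, metric name, and threshold are hypothetical placeholders, and the helper only builds the request parameters so that it runs without AWS credentials.

```python
def build_alarm_params(name, namespace, metric, threshold,
                       period_seconds=300, evaluation_periods=1,
                       sns_topic_arn=None):
    """Build keyword arguments for CloudWatch's PutMetricAlarm API call."""
    params = {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        # Requirement 1: alarm when the metric rises above the threshold.
        "ComparisonOperator": "GreaterThanThreshold",
        # Requirement 2: treat periods with no data points as breaching,
        # so a metric that stops arriving also puts the alarm in ALARM.
        "TreatMissingData": "breaching",
    }
    if sns_topic_arn:
        # Optional notification target (hypothetical SNS topic ARN).
        params["AlarmActions"] = [sns_topic_arn]
    return params


# Hypothetical names and values, for illustration only.
params = build_alarm_params("queue-depth-high", "MyApp", "QueueDepth",
                            threshold=100.0)

# With boto3 installed and AWS credentials configured, the alarm
# could then be created like this:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

A single alarm cannot tell "metric too high" apart from "metric missing". If the two conditions need different notifications, two separate alarms on the same metric are an alternative.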

And another short how-to, this one on developing Prometheus with HA.

Self-care is critical in tech, not only for us as individuals, but for the health and reliability of the entire organization. Overstretched engineers make mistakes. This article introduces a new resource: a curated, open-source repository of self-care resources.


Updated: July 23, 2017 — 10:58 pm
A production of Tinker Tinker Tinker, LLC