SRE Weekly Issue #72


Concerned about downtime? VictorOps helps you prepare, respond, and recover from IT and DevOps Incidents. Swing by our product center to learn how and start your trial.


Idempotence is a critically important tool in building a reliable system. Stripe explains the concept and shows how they wrap theoretically non-idempotent actions like charging a credit card into safely idempotent API calls.

Here’s an account of an effort to move from server-based paging (this server is down) to functional-based alerting (this user action isn’t working), with a resulting impressive reduction in out-of-hours paging.

It pays to study up and deeply understand what a simple metric like “cpu utilization” really means.

Why am I linking to AWS’s status site? Look closely, and you’ll see that the “green checkmark i” symbol has been replaced with a far more noticeable blue circle with a white diamond. Check out the old icon here for comparison. End of an era, or just another way of presenting the same information?

The author introduces a new Ruby gem, grpc-commons that makes it easy to add circuit breaker and statsd support to a grpc client.

Along with being a tutorial on setting up Zipkin with Python, this article also explains some basic Zipkin concepts.

PagerDuty is apparently trying to position itself as more than just a paging service, with a few new features around the entire incident lifecycle. I’m especially interested in checking out the new postmortem tooling.

I included this article last week, but my link was outdated and returned a 404. Here’s the corrected link — sorry about that!

I put a call out for a review of Elastic’s new beta anomaly detection feature last week, and here one is! Thanks to an Elastic employee for forwarding this link to me.

This article cautions one to be careful to look past an obvious root cause, because a deeper systemic or policy problem may be lurking behind it.

Serverless / FaaS abstract away traditional provisioning, and they make it really easy to ignore planning for resource usage.

Wow, what a concept:

you can think of […] reliable systems […] as successfully imagining all of the potential things that could go wrong

This 2.5-minute podcast from Todd Conklin has a really great question: to achieve reliability, do we have to try to imagine in advance all of the possible ways our systems could fail?

A patient was given an incorrect syringe resulting in a 5x insulin overdose. Brigham and Women’s Hospital reports on the accident and what they’re doing to prevent mistakes of this sort in the future.

Consumers today have increasingly high expectations for digital applications and service performance, but do IT personnel feel equipped to rise to the occasion? In this survey, we uncover the extent of the digital services expectation gap between consumers and IT teams as well as top strategies teams are using to solve digital disruption challenges.


  • Our First Kubernetes Outage – Saltside Engineering
    • Kudos to the Saltside folks for sharing a public postmortem for an internal, non-customer-impacting outage!

      This is public postmortem for an a complete shutdown of our internal Kubernetes cluster. It’s shared with you all so everyone may learn.

  • “Re-experience the fun of customizing your Place Page!” A Tale of Oops from Ops
    • Ouch. Linden Lab’s ops team discovered the hard way that they didn’t have a working backup copy of some customer data. The best part of this article is the discussion of the “Shrek Ears” tradition at Linden. It’s one of the things I remember most fondly from my time there, and having worn the ears a few times in my day, I can attest to the fact that it’s a great way to handle the psychological impact of making a mistake.
  • Chase (bank)
  • Facebook
Updated: May 14, 2017 — 10:09 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme