SRE Weekly Issue #215

I skipped last week’s issue to set up a new swing set for my kids (gotta give ’em something to do while they’re stuck at home). I’m still a bit behind on articles, and I’ll catch up over the next couple of weeks.

A message from our sponsor, VictorOps:

Our people and tools need to be connected now more than ever before. That’s why VictorOps is offering free, 90-day extended Enterprise trials for on-call incident response and alert management, up to 100 users, to anyone who needs it:

https://go.victorops.com/sreweekly-extended-trials-for-incident-response

Articles

The “messy” details of our human/computer systems are their hidden strength.

Lorin Hochstein

In this accident report, learn how two pilots lost situational awareness, with disastrous consequences.

Air Safety Institute

Without a structured strategy, and careful consideration of the full SLO lifecycle, SLOs risk partial implementation. This can result in low ROI and, in many cases, a complete failure.

Danny Mican — Squadcast

The cloud’s multiple availability zones and regions can be powerful, but it’s hard to get a multi-region architecture correct.

Serhat Can — OpsGenie

A useful little JavaScript tool: plug in an availability percentage (e.g. 99.99%), and get back the number of minutes you can be down in a day, month, quarter, or year.

Hexadecimal
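
For a feel of the arithmetic behind a calculator like this, here is a minimal Python sketch. It is not the linked tool's code: the function name `downtime_budget` and the period lengths (30-day month, 90-day quarter, 365-day year) are assumptions chosen for illustration, and the real tool may count periods differently.

```python
# Allowed downtime for a given availability target.
# Assumed period lengths: 30-day month, 90-day quarter, 365-day year.
PERIOD_MINUTES = {
    "day": 24 * 60,
    "month": 30 * 24 * 60,
    "quarter": 90 * 24 * 60,
    "year": 365 * 24 * 60,
}


def downtime_budget(availability_percent: float) -> dict:
    """Return the allowed downtime in minutes for each period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    # Fraction of the period during which the service may be down.
    unavailable_fraction = (100 - availability_percent) / 100
    return {
        period: minutes * unavailable_fraction
        for period, minutes in PERIOD_MINUTES.items()
    }


if __name__ == "__main__":
    for period, minutes in downtime_budget(99.99).items():
        print(f"{period}: {minutes:.2f} minutes")
```

Under those assumptions, 99.99% works out to roughly 4.32 minutes of downtime in a 30-day month and about 52.56 minutes in a year.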

Azure Pipelines had an incident of delayed builds at the end of March. Find out more in this post-incident analysis.

Chad Kimes – Microsoft

Google published another book in their SRE series. This short summary gives an overview of what’s inside along with an explanation of the motivation for another book. See also: Google’s announcement

Todd Hoff — High Scalability

The pendulum is swinging back, and folks, including early champion Uber, are starting to see the downsides of a plethora of microservices.

Todd Hoff — High Scalability

Outages

Updated: April 19, 2020 — 9:55 pm
A production of Tinker Tinker Tinker, LLC