SRE Weekly Issue #215

I skipped last week's issue to set up a new swing set for my kids (gotta give ’em something to do while they’re stuck at home). I’m still a bit behind on articles, and I’ll catch up over the next couple of weeks.

Articles

Embracing the beautiful mess

The “messy” details of our human/computer systems are their hidden strength.

Lorin Hochstein

Accident Case Study: Just a Short Flight

In this accident report, learn how two pilots lost situational awareness, with disastrous consequences.

Air Safety Institute

Succeeding With Service Level Objectives

Without a structured strategy, and careful consideration of the full SLO lifecycle, SLOs risk partial implementation. This can result in low ROI and, in many cases, a complete failure.

Danny Mican — Squadcast

Back to Basics: Why Global Infrastructure Matters

The cloud’s multiple availability zones and regions can be powerful, but it’s hard to get a multi-region architecture correct.

Serhat Can — OpsGenie

SLA Uptime calculator

A useful little JavaScript tool: plug in an availability percentage (e.g. 99.99%), and get back the number of minutes you can be down in a day, month, quarter, or year.

Hexadecimal
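
The arithmetic behind a calculator like the one described above is simple enough to sketch. The following Python snippet is only an illustration, not the linked tool's code: the function name `allowed_downtime_minutes` and the period lengths (a 30-day month, a 91.25-day quarter, a 365-day year) are assumptions made for this example, and the real tool may define its periods differently.

```python
# Assumed period lengths in days; the linked tool may use other conventions.
PERIOD_DAYS = {"day": 1, "month": 30, "quarter": 91.25, "year": 365}

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> dict:
    """Return the downtime budget in minutes for each assumed period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    # Fraction of time the service is allowed to be unavailable.
    unavailable_fraction = (100 - availability_percent) / 100
    return {
        period: days * MINUTES_PER_DAY * unavailable_fraction
        for period, days in PERIOD_DAYS.items()
    }


if __name__ == "__main__":
    for period, minutes in allowed_downtime_minutes(99.99).items():
        print(f"{period}: {minutes:.2f} minutes")
```

Under these assumptions, 99.99% availability works out to roughly 52.56 minutes of downtime per year, or about 4.3 minutes in a 30-day month.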

Hosted Pools Availability Degradation

Azure Pipelines had an incident of delayed builds at the end of March. Find out more in this post-incident analysis.

Chad Kimes – Microsoft

Free Google Book: Building Secure and Reliable Systems

Google published another book in their SRE series. This short summary gives an overview of what’s inside along with an explanation of the motivation for another book. See also: Google’s announcement

Todd Hoff — High Scalability

One Team at Uber is Moving from Microservices to Macroservices

The pendulum is swinging back, and folks are starting to see the downsides of a plethora of microservices, including early champion Uber.

Todd Hoff — High Scalability

Outages

Quibi
- Quibi had issues on their launch day.
Deliveroo
Google Cloud Platform IAM
- Click through for their interesting post-incident analysis.
Cloudflare
- Here’s their post-incident analysis that details a remote hands request gone awry.
Chef
Hulu
Lots of Banks in the US
- Banks went down around the time when customers were checking to see if their economic stimulus payments had arrived.
Petnet (smart pet feeder)
Snapchat
Twitter
Fastly
Reddit
DoorDash
StackPath
