It’s been four years since I started SRE Weekly. I’m having a ton of fun and learning a lot, and I can’t tell you all how happy it makes me that you read the newsletter.
A huge thank you to everyone who writes amazing SRE content every week. Without you folks, SRE Weekly would be nothing. Thanks also to everyone who sends links in — I definitely don’t catch every interesting article!
Articles
Here’s an intro to the Learning From Incidents community. I can’t wait to see what these folks write. They’re coming out of the gate fast, with a post every day for the first week.
Nora Jones
In order to understand how things went wrong, we need to first understand how they went right
I love the move toward using the term “operational surprise” rather than “incident”.
Lorin Hochstein
Fascinating detail about the space shuttle Columbia’s accident, and the confusing jargon at NASA that may have contributed.
Dwayne A. Day — The Space Review
Google released free material (slides, handbooks, worksheets) to help you run a workshop on effective SLOs.
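If SLOs are new to you, the underlying arithmetic is refreshingly simple. Here's a quick illustrative sketch (my own, not taken from Google's material) of how an availability target turns into an error budget:

    # Illustrative only, not from Google's workshop material.
    # A 99.9% availability SLO over a 30-day window leaves about 43 minutes of error budget.
    slo = 0.999
    window_minutes = 30 * 24 * 60                      # 43,200 minutes in a 30-day window
    error_budget_minutes = (1 - slo) * window_minutes
    print(f"{error_budget_minutes:.1f} minutes of allowed unavailability")   # 43.2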
Lots of really interesting detail about how LinkedIn routes traffic to datacenters and what happens when a datacenter goes down.
Nishant Singh — LinkedIn
Our field is learning a ton, and it can be tempting to short-circuit that learning. It takes time to really grok and integrate what we’re learning.
Now it may be easy to accept all of this and think “Yeah yeah, I got it. Let me at that ‘resilience’. I’m going to ‘add so much resilience’ to my system!”.
Will Gallego
I like the distinction between “unmanaged” and “untrained” incident response.
Jesus Climent — Google
This chronicle of learning about observability makes for an excellent reading list for those just diving in.
Mads Hartmann
Outages
- GitLab — Analysis of November 28th outage
- A change to roll out ip-tables to other non gitlab.com hosts was inadvertently applied to the database hosts. That change to host firewalling caused all web and api hosts to lose connectivity to the database. The change has been rolled back and we are now restarting host processes.
- Disney+
- Dexcom Diabetes Alerts
- Blood sugar monitors failed to send alerts for days. Parents rely on these monitors to track their diabetic children’s blood sugar levels.
- AOL Mail
- DRS black-out during Abu Dhabi GP
- A service failure prevented drivers from using the Drag Reduction System (DRS).
- Discord
- Related to the Google Compute Engine incident below. They also had another incident today.
- Heroku incident #1930 followup
- Heroku
- Google Compute Engine
- High latency in I/O operations to SSD-based persistent disks.