It’s been four years since I started SRE Weekly. I’m having a ton of fun and learning a lot, and I can’t tell you all how happy it makes me that you read the newsletter.
A huge thank you to everyone who writes amazing SRE content every week. Without you folks, SRE Weekly would be nothing. Thanks also to everyone who sends links in — I definitely don’t catch every interesting article!
Articles
Here’s an intro to the Learning From Incidents community. I can’t wait to see what these folks write. They’re coming out of the gate fast, with a post every day for the first week.
Nora Jones
In order to understand how things went wrong, we need to first understand how they went right
I love the move toward using the term “operational surprise” rather than “incident”.
Lorin Hochstein
Fascinating detail about the space shuttle Columbia’s accident, and the confusing jargon at NASA that may have contributed.
Dwayne A. Day — The Space Review
Google released free material (slides, handbooks, worksheets) to help you run a workshop on effective SLOs.
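If SLOs are new to you, the underlying arithmetic is refreshingly simple. Here's a quick illustrative sketch (my own, not taken from Google's material) of how an availability target turns into an error budget:

    # Illustrative only, not from Google's workshop material.
    # A 99.9% availability SLO over a 30-day window leaves about 43 minutes of error budget.
    slo = 0.999
    window_minutes = 30 * 24 * 60                      # 43,200 minutes in a 30-day window
    error_budget_minutes = (1 - slo) * window_minutes
    print(f"{error_budget_minutes:.1f} minutes of allowed unavailability")   # 43.2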
Lots of really interesting detail about how LinkedIn routes traffic to datacenters and what happens when a datacenter goes down.
Nishant Singh — LinkedIn
Our field is learning a ton, and it can be tempting to short-circuit that learning. It takes time to really grok and integrate what we’re learning.
Now it may be easy to accept all of this and think “Yeah yeah, I got it. Let me at that ‘resilience’. I’m going to ‘add so much resilience’ to my system!”.
Will Gallego
I like the distinction between “unmanaged” and “untrained” incident response.
Jesus Climent — Google
This chronicle of learning about observability makes for an excellent reading list for those just diving in.
Mads Hartmann
Outages
- GitLab — Analysis of November 28th outage
- A change to roll out ip-tables to other non gitlab.com hosts was inadvertently applied to the database hosts. That change to host firewalling caused all web and api hosts to lose connectivity to the database. The change has been rolled back and we are now restarting host processes.
- Disney+
- Dexcom Diabetes Alerts
- Blood sugar monitors failed to send alerts for days. Parents rely on these monitors to track their diabetic children’s blood sugar levels.
- AOL Mail
- DRS black-out during Abu Dhabi GP
- A service failure prevented drivers from using the Drag Reduction System (DRS).
- Discord
- Related to the Google Compute Engine incident below. They also had another incident today.
- Heroku incident #1930 followup
- Heroku
- Google Compute Engine
- High latency in I/O operations to SSD-based persistent disks.