I’m having a hard time wrapping my head around the fact that this issue marks 5 years of SRE Weekly. A massive thank you to everyone who writes the content I feature here every week, and also to all of you who subscribe!
Articles
Every service needs a couple of big hammers that are easy to swing.
Jennifer Mace — O’Reilly and Google
Answer: automation. Lots of automation. And automation of the automation.
Fred Lin, Harish Dattatraya Dixit, and Sriram Sankar — Facebook
Oh, how quaint! This article was written back when people traveled for the holidays.
Ashley Roof — Transposit
Surprise! Fortunately, there are some ways to fix this limitation.
Heidi Howard, Ittai Abraham — Decentralized Thoughts
A common question when a company is implementing incident management is: why do we need this process?
It turns out that the easiest way to answer this question is to look at the world of unsuccessful incident management.
Kintaba
Whether you’re new to Just Culture or an old hand, there’s a lot of great detail in this article.
Tory Thompson — Firehouse
Not sold yet on full service ownership for development teams? This interview may help.
Vivian Chan — PagerDuty
While ostensibly about Jeli.io, this article makes a great case for why incident analysis is important in general and what kind of data we should be trying to gather.
John Allspaw — Adaptive Capacity Labs
A new feature roll-out resulted in impaired service for some customers.
The adaptive universe: where adaptations to challenges feed back and cause more challenges, requiring more adaptations.
Lorin Hochstein
Our first GraphQL release was twice as slow as our old REST API. Here’s how we fixed it.
Another great example of making a duplicate request to a new API in the background to test it before deploying it.
Michael P. Geraci — OkCupid
Outages
- Google Workspace Status Dashboard
  - All Google services that use OAuth were unreachable due to an issue with Google’s User ID service. Click through for their report. This one caused issues for the start of my daughters’ school day since Meet and Classroom were down.
- Google Cloud Status Dashboard
- Gmail
  - Delivery of messages to @gmail.com addresses failed permanently and would not be retried. This report by Google has the details.
- Microsoft Outlook
- Galileo (satellite navigation system)
- Spotify