Articles
Do you need an incident commander? (Yes.) This article covers a couple of different strategies for staffing your incident command rotation.
Ryan McDonald — FireHydrant
What an interesting idea: an insurance plan that pays out automatically when a cloud provider has an outage.
L.S. Howard — Insurance Journal
Full disclosure: Fastly, my employer, is mentioned.
LaunchDarkly revamped the way their on-call system works. Learn about the experience through the eyes of a newly onboarded engineer.
Anna Baker — LaunchDarkly (via The New Stack)
Catchpoint’s yearly SRE Report is out with four key findings. You have to fill out a form with your email address, and then the link to download the report is presented in your browser.
Catchpoint
This article shows why one-thread-per-request can be a bottleneck and presents alternatives.
Ron Pressler — Parallel Universe (via High Scalability)
And this is a truth about incidents: there are always more signals than there is attention available.
It’s so true.
Fred Hebert — Honeycomb
If you’ve ever even considered running a retrospective, read this article.
This is my favorite piece of advice from this article:
If you think ‘this might be a stupid question,’ ask it.
Emily Ruppe — Jeli
I’m still not sure how I feel about AIOps. Fortunately, this article takes a measured stance while providing some useful insight.
Conclusion: AI won’t replace SREs – but it can help
JJ Tang — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.
Outages
- Google Cloud Traffic Director
  Google has already posted a preliminary outage report at the link above.
- Spotify
  This one involved the Traffic Director outage mentioned above, as per Spotify’s outage report here.
- Discord
  This one was also related to the Traffic Director outage, according to the final update on their status post.
- TikTok