Articles
Do you need an incident commander? (Yes.) This article covers a couple of different strategies for staffing your incident command rotation.
Ryan McDonald – FireHydrant
What an interesting idea: an insurance plan that pays out automatically when a cloud provider has an outage.
L.S. Howard – Insurance Journal
Full disclosure: Fastly, my employer, is mentioned.
LaunchDarkly revamped the way that their on-call system works. Learn about the experience through the eyes of a newly-onboarded engineer.
Anna Baker – LaunchDarkly (via The New Stack)
Catchpoint’s yearly SRE Report is out with four key findings. You have to fill out a form with your email address, and then the link to download the report is presented in your browser.
Catchpoint
This article shows why one-thread-per-request can be a bottleneck and presents alternatives.
Ron Pressler – Parallel Universe (via High Scalability)
And this is a truth about incidents: there are always more signals than there is attention available.
It’s so true.
Fred Hebert – Honeycomb
If you’ve ever even considered running a retrospective, read this article.
This is my favorite piece of advice from this article:
If you think “this might be a stupid question,” ask it.
Emily Ruppe – Jeli
I’m still not sure how I feel about AIOps. Fortunately, this article takes a measured stance while providing some useful insight.
Conclusion: AI won’t replace SREs – but it can help
JJ Tang – Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.
Outages
- Google Cloud Traffic Director
  Google has already posted a preliminary outage report at the link above.
- Spotify
  This one involved the Traffic Director outage mentioned above, as per Spotify’s outage report here.
- Discord
  This one was also related to the Traffic Director outage, according to the final update on their status post.
- TikTok