Do you need an incident commander? (Yes.) This article covers a couple of different strategies for staffing your incident command rotation.
Ryan McDonald — FireHydrant
What an interesting idea: an insurance plan that pays out automatically when a cloud provider has an outage.
L.S. Howard — Insurance Journal
Full disclosure: Fastly, my employer, is mentioned.
LaunchDarkly revamped the way their on-call system works. Learn about the experience through the eyes of a newly onboarded engineer.
Anna Baker — LaunchDarkly (via The New Stack)
Catchpoint’s yearly SRE Report is out with four key findings. To download the report, you have to fill out a form with your email address; the download link then appears in your browser.
This article shows why one-thread-per-request can be a bottleneck and presents alternatives.
Ron Pressler — Parallel Universe (via High Scalability)
And this is a truth about incidents: there are always more signals than there is attention available.
It’s so true.
Fred Hebert — Honeycomb
If you’ve ever even considered running a retrospective, read this article.
This is my favorite piece of advice from this article:
If you think ‘this might be a stupid question,’ ask it.
Emily Ruppe — Jeli
I’m still not sure how I feel about AIOps. Fortunately, this article takes a measured stance while providing some useful insight.
Conclusion: AI won’t replace SREs – but it can help
JJ Tang — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.
- Google Cloud Traffic Director
Google has already posted a preliminary outage report at the link above.
Spotify’s outage involved the Traffic Director outage mentioned above, according to Spotify’s outage report.
This one was also related to the Traffic Director outage, according to the final update on their status post.