SRE Weekly Issue #313

Articles

Do you need an incident commander? (Yes.) This article covers a couple of different strategies for staffing your incident command rotation.

Ryan McDonald — FireHydrant

How Cloud Downtime Insurance Became a Thing

What an interesting idea: an insurance plan that pays out automatically when a cloud provider has an outage.

L.S. Howard — Insurance Journal
Full disclosure: Fastly, my employer, is mentioned.

Diary of a First-Time On-Call Engineer

LaunchDarkly revamped how their on-call system works. Learn about the experience through the eyes of a newly onboarded engineer.

Anna Baker — LaunchDarkly (via The New Stack)

2021 SRE Report

Catchpoint’s yearly SRE Report is out with four key findings. To download it, you have to fill out a form with your email address; the download link then appears in your browser.

Catchpoint

Little’s Law, Scalability and Fault Tolerance: The OS is your bottleneck. What you can do?

This article shows why one-thread-per-request can be a bottleneck and presents alternatives.

Ron Pressler — Parallel Universe (via High Scalability)

On the Brittleness of Dashboards

And this is a truth about incidents: there are always more signals than there is attention available.

It’s so true.

Fred Hebert — Honeycomb

Incident Analysis 101: Facilitating the Learning Review

If you’ve ever even considered running a retrospective, read this article.

This is my favorite piece of advice from this article:

If you think ‘this might be a stupid question,’ ask it.

Emily Ruppe — Jeli

What Does AIOps Mean for SREs? It’s Complicated.

I’m still not sure how I feel about AIOps. Fortunately, this article takes a measured stance while providing some useful insight.

Conclusion: AI won’t replace SREs – but it can help

JJ Tang — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Outages

Google Cloud Traffic Director
- Google has already posted a preliminary outage report at the link above.
Spotify
- This one involved the Traffic Director outage mentioned above, as per Spotify’s outage report here.
Discord
- This one was also related to the Traffic Director outage, according to the final update on their status post.
Polygon
TikTok
