SRE Weekly Issue #262

Articles

Chaos Engineering isn’t adding chaos to your systems—it’s seeing the chaos that already exists in your systems.

Along with four prerequisites, this article also includes three myths about chaos engineering that might be making you hesitant to start.

Courtney Nash — Verica

Managing On-Call in a Pandemic

This one’s from May of last year. Almost a year on, it’s interesting to see which of these we’ve already implemented.

Ashley Roof — Transposit

Being Just Reliable Enough

An amusing parable illustrating why you shouldn’t try to be too reliable.

Andrew Ford — Indeed

Google debunks Russian claims that fire was connected to service outage

In the Outages section of last week’s issue, you’ll find two unrelated events referenced in this article: one about Russian internet censorship gone awry and another about a major datacenter fire.

Eric Johansson — Verdict

How to Analyze Contributing Factors Blamelessly

Along with what’s in the title, this article also covers the difference between an RCA and a contributing factors analysis.

Emily Arnott — Blameless

Rethinking site capacity projections with Capacity Analyzer

Lots of detail on how LinkedIn is improving their traffic forecasts. Warning/enticement: math contained within.

Deepanshu Mehndiratta — LinkedIn

Testing in Production for Safety and Sanity

Everyone is testing in production; some organizations admit it and plan for it.

How to do it right, what can happen if it goes wrong, and how to limit the blast radius.

Heidi Waterhouse — LaunchDarkly

How we found and fixed a rare race condition in our session handling

Remember when GitHub logged you out? Ah, I remember it like it was last week. I mean, the week before. Here’s GitHub’s troubleshooting story about what went wrong.

Dirkjan Bussink — GitHub

Outages

Google Cloud Platform
- GCP had a major multi-region networking issue, due to a routing glitch. Click through for their followup post.
US National Oceanic and Atmospheric Administration (NOAA)
- This outage impaired NOAA’s tsunami early warning system.
Facebook, Instagram, and WhatsApp
TikTok
Elevated error rates
Microsoft Teams and other services
- Click through for a highly detailed description of what went wrong. I can’t link directly to the incident in question, so you’ll have to scroll down to 3/15.
