I love this crystal clear argument based on statistics and research. MTTR as a metric is simply meaningless.
Courtney Nash — Verica
Their steps for better communication during an outage:
- Provide context to minimise speculation
- Explain what you’re doing to demonstrate you’re ‘on it’
- Set some expectations for when things will return to normal
- Tell people what they should do
- Let folks know when you’ll be updating them next
Chris Evans — incident.io
Despite checking in advance to be sure their systems would support the new Let’s Encrypt certificate chain, they ran into trouble.
[…] we discovered that several HTTP client libraries our systems use were using their own vendored root certificates.
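To illustrate the trap they hit: in Python, the system trust store and a library's vendored bundle live in different places, so a client can keep trusting roots you thought you'd updated. A minimal sketch (certifi is the bundle vendored by `requests`; treated here as an illustrative example, not the libraries from the incident):

```python
import ssl

# Where OpenSSL looks for the system trust store by default.
paths = ssl.get_default_verify_paths()
print("system CA file:", paths.cafile or paths.openssl_cafile)

# Many client libraries skip the system store and ship their own bundle.
# requests, for example, uses certifi's CA file:
try:
    import certifi
    print("vendored bundle:", certifi.where())
except ImportError:
    print("certifi not installed")
```

If the two paths differ, updating the OS trust store won't touch what that client actually verifies against.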
This is the best case I’ve seen yet against multi-cloud infrastructure. I really like the airline analogy.
Roblox had a major, multi-day outage starting on October 28. I don’t usually include game outages in the Outages section, since they’re so common and there’s rarely much to learn from them, but I sure do like a good post-incident report. Thanks, folks!
David Baszucki — Roblox
When you’re sending small TCP packets, two optimizations can conspire to introduce an artificial 40-millisecond (not megasecond…) delay.
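The classic culprit pair here is Nagle’s algorithm interacting with delayed ACKs, and the usual escape hatch is `TCP_NODELAY`. A minimal sketch in Python (assuming you want small writes flushed immediately):

```python
import socket

# Create a TCP socket and disable Nagle's algorithm so small writes go
# out immediately instead of waiting (up to ~40 ms) for the peer's
# delayed ACK to flush them.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back: a nonzero value means Nagle is disabled.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```

Disabling Nagle trades a little extra packet overhead for latency, which is usually the right call for small, latency-sensitive request/response traffic.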
Here’s Google’s follow-up report for their October 25-26 Meet outage.
Should you count failed requests toward your SLI if the client retries and succeeds? A good argument can be made on either side.
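A tiny sketch of how the two counting policies diverge, using a hypothetical request log (the data is illustrative, not from the article):

```python
# Each entry: (request_id, attempt outcomes in order). A request that
# fails once but succeeds on retry counts as a failure per-attempt,
# but a success per-request.
requests = [
    ("a", ["ok"]),
    ("b", ["fail", "ok"]),    # failed, client retried and succeeded
    ("c", ["fail", "fail"]),  # never succeeded
]

# Policy 1: every attempt counts toward the SLI.
attempts = [o for _, outcomes in requests for o in outcomes]
per_attempt_sli = attempts.count("ok") / len(attempts)

# Policy 2: each logical request counts once; success if any attempt succeeded.
per_request_sli = sum("ok" in outcomes for _, outcomes in requests) / len(requests)

print(per_attempt_sli)   # 2 of 5 attempts = 0.4
print(per_request_sli)   # 2 of 3 requests ≈ 0.667
```

The per-attempt number captures the pain your servers (and retrying clients) experienced; the per-request number captures what users ultimately got. Which one your SLO should track is exactly the debate in the thread.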
u/Sufficient_Tree4275 and other Reddit users
Mercari restructured its SRE team, moving toward an embedded model to adapt to their growing microservice architecture.
ShibuyaMitsuhiro — Mercari
There’s a really great discussion in this episode about leaving slack in the system in the form of bits of capacity and inefficiency that can be drawn upon to buy time during an outage.
Courtney Nash, with guests Liz Fong-Jones and Fred Hebert — Verica
Here’s how non-SREs can use SRE principles to improve their systems.
Laurel Frazier — Transposit