Here’s a good intro to creating SLOs, including a section on best practices.
When they started to get complaints from customers, they knew it was time to get serious about measuring and monitoring their reliability.
arun — Reputation
As an SRE and sysadmin with 10+ years of industry experience, I wanted to write up a few scenarios that are real threats to the integrity of the bird site over the coming weeks.
What follows is a thread with dozens of realistic failure scenarios, many of which apply to more than just Twitter.
@MosquitoCapital on Twitter
A few amusing anecdotes reveal deeper lessons in SRE.
David Cassel — The New Stack
A resilient system like Twitter isn’t likely to go down instantly just because of a few changes. It’s much more likely to slowly degrade, per this article.
Christopher Carbone — Daily Mail
It’s really interesting to see where this write-up differs from a video summary of the same accident by Mentour Pilot. Given the differences, I wonder whether there are even more details that both left out.
This is a really great description of common ground breakdown, referencing Woods and Klein.