Articles
Faced with a difficult hiring market for SREs, they embarked on a well-designed, carefully thought-out program to hire and train entry-level folks as SREs, and it worked!
Thomas Betts — InfoQ
No matter how good your tooling is, how experienced you are, or how much you’ve prepared, incidents can still be hard.
Five people share what they find hardest during incident response.
Chris Evans — incident.io
This one has a lot of ideas about how to guide developers toward full ownership of their services in production.
Ambassador
In this post, I will cover the following modes of system resilience:
- Adaptive Response
- Superior Monitoring
- Coordinated Resilience
- Heterogeneous Systems
- Dynamic Repositioning
- Requisite Availability
Ash P — Cruform
Root cause of success: unpatched security vulnerability
TMW a security vulnerability allows you to break into your infrastructure, averting disaster during an incident.
Lorin Hochstein, with incident story by Eric Dobbs
A migration didn’t go as planned, and customer traffic lost its way.
Heroku
I’m a big believer in human-in-the-loop automation. My favorite part of this article was this:
A further problem is that full automation — which aims to take the human out of the picture — requires a complete, nuanced understanding of a system and all potential outcomes, paradoxically resulting in heightened system complexity.
Tina Huang — Transposit
Outages
- Google Voice
- Assembled
  For some users, Assembled’s styling was not rendering and caused the application to be unusable.
  “Root cause”: CSS
- Apple Store
- United Airlines
- TikTok
- Slack
- GCash
- Solana (Cryptocurrency)