Articles
I love it when companies publish their incident management documentation! Atlassian’s offering is high-quality — both in content and production value. The Major Incident Manager Cheatsheet at the end is worth distributing to your team.
Atlassian
Netflix shares more about their N+1 AWS region redundancy design, and it all revolves around accurately modeling demand.
Niosha Behnam — Netflix
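As a rough illustration of the arithmetic behind N+1 regional redundancy (a generic sketch, not Netflix's actual model), the snippet below estimates how much capacity each region needs so that the survivors can absorb global peak demand when one region is evacuated. The function name, the even-split assumption, and the example numbers are all hypothetical.

```python
def required_capacity_per_region(peak_demand_rps: float, regions: int) -> float:
    """Capacity each region needs so the remaining regions can carry the
    full peak demand after any single region is lost.

    Assumption (hypothetical): traffic from the failed region is spread
    evenly across the surviving regions.
    """
    if regions < 2:
        raise ValueError("need at least two regions to survive losing one")
    # With one region down, (regions - 1) regions share the whole load.
    return peak_demand_rps / (regions - 1)


def steady_state_utilization(regions: int) -> float:
    """Fraction of provisioned capacity in use when all regions are healthy."""
    if regions < 2:
        raise ValueError("need at least two regions to survive losing one")
    # Each region normally serves 1/regions of demand but is sized for
    # 1/(regions - 1) of it.
    return (regions - 1) / regions


if __name__ == "__main__":
    # Made-up numbers: 90,000 requests/second at peak across 3 regions.
    print(required_capacity_per_region(90_000, 3))
    print(steady_state_utilization(3))
```

The sizing is only as good as the peak demand estimate fed into it, which is why an accurate demand model matters so much for this kind of design.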
Interactions between simple microservices can lead to unexpected emergent behaviors.
To restate: this system is not complicated. But it is complex.
Avdi Grimm
What we had in the two downed airplanes was a textbook failure of airmanship.
While I don’t necessarily agree with the blame-laden language of this article, it provides some interesting new details. It strikes me that, while MCAS may not be directly responsible for the crashes, it made it significantly harder to recover from contemporaneous pilot errors.
William Langewiesche — The New York Times
My favorite part is the role-playing scenarios that compare debugging a problem with observability tooling to debugging it with traditional tools.
Charity Majors
Tuning your TCP stack is important on busy servers.
Ram Lakshmanan
Outages
- Google Cloud Platform
  - This incident primarily affected the control plane of many GCP services. It stemmed from a cascading failure in an important key-value store used by all of them.
- Facebook and Instagram
- Google Maps
- GoDaddy
- Target (retailer)
- Discord
- Fastly
- Squarespace
- GitHub