I love it when companies publish their incident management documentation! Atlassian’s offering is high-quality in both content and production value. The Major Incident Manager Cheatsheet at the end is worth distributing to your team.
Netflix shares more about their N+1 AWS region redundancy design, and it all revolves around accurately modeling demand.
Niosha Behnam — Netflix
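To see why demand modeling matters so much for N+1 redundancy, here is a minimal sketch of the underlying arithmetic. It is not Netflix's method: it assumes traffic can be split evenly across the surviving regions, and the function names and numbers are hypothetical.

```python
def per_region_capacity(peak_demand: float, regions: int) -> float:
    """Capacity each region needs so the others can absorb one region's failure.

    Assumption: demand is redistributed evenly across the surviving regions.
    """
    if regions < 2:
        raise ValueError("N+1 redundancy needs at least two regions")
    return peak_demand / (regions - 1)


def headroom_fraction(regions: int) -> float:
    """Extra capacity per region, relative to an even split with all regions up."""
    if regions < 2:
        raise ValueError("N+1 redundancy needs at least two regions")
    return regions / (regions - 1) - 1


if __name__ == "__main__":
    # Hypothetical global peak of 90,000 requests per second over 3 regions.
    print(per_region_capacity(90_000, 3))
    print(headroom_fraction(3))
```

Under these assumptions, every region's size is derived directly from the peak demand estimate, so an estimate that is too low leaves the surviving regions short exactly when one region is down.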
Interactions between simple microservices can lead to unexpected emergent behaviors.
To restate: this system is not complicated. But it is complex.
What we had in the two downed airplanes was a textbook failure of airmanship.
While I don’t necessarily agree with the blame-laden language of this article, it provides some interesting new details. It strikes me that, while MCAS may not be directly responsible for the crashes, it made it significantly harder to recover from contemporaneous pilot errors.
William Langewiesche — The New York Times
My favorite part is the role-playing scenarios, which walk through debugging a problem with observability tooling and with traditional tools.
Tuning your TCP stack is important on busy servers.