SRE Weekly Issue #187

A message from our sponsor, VictorOps:

Machine learning (ML) isn’t just a buzzword anymore — it’s affecting how we communicate, shop, live and respond to critical DevOps incidents. Grab your spot for this free webinar to learn about driving success in incident management with machine learning:

http://try.victorops.com/sreweekly/machine-learning-webinar

Articles

I love it when companies publish their incident management documentation! Atlassian’s offering is high-quality — both in content and production value. The Major Incident Manager Cheatsheet at the end is worth distributing to your team.

Atlassian

Netflix shares more about their N+1 AWS region redundancy design, and it all revolves around accurately modeling demand.

Niosha Behnam — Netflix

Interactions between simple microservices can lead to unexpected emergent behaviors.

To restate: this system is not complicated. But it is complex.

Avdi Grimm

What we had in the two downed airplanes was a textbook failure of airmanship.

While I don’t necessarily agree with the blame-laden language of this article, it provides some interesting new details. It strikes me that, while MCAS may not be directly responsible for the crashes, it made it significantly harder to recover from contemporaneous pilot errors.

William Langewiesche — The New York Times

My favorite parts are the role-playing scenarios that contrast debugging a problem with observability tooling against debugging it with traditional tools.

Charity Majors

Tuning your TCP stack is important on busy servers.

Ram Lakshmanan

Outages

Updated: September 29, 2019 — 8:36 pm
A production of Tinker Tinker Tinker, LLC