SRE Weekly Issue #187

Articles

I love it when companies publish their incident management documentation! Atlassian’s offering is high-quality — both in content and production value. The Major Incident Manager Cheatsheet at the end is worth distributing to your team.

Atlassian

Evolving Regional Evacuation

Netflix shares more about their N+1 AWS region redundancy design, and it all revolves around accurately modeling demand.

Niosha Behnam — Netflix

Simple is Complex

Interactions between simple microservices can lead to unexpected emergent behaviors.

To restate: this system is not complicated. But it is complex.

Avdi Grimm

What Really Brought Down the Boeing 737 Max?

What we had in the two downed airplanes was a textbook failure of airmanship.

While I don’t necessarily agree with the blame-laden language of this article, it provides some interesting new details. It strikes me that, while MCAS may not be directly responsible for the crashes, it made it significantly harder to recover from contemporaneous pilot errors.

William Langewiesche — The New York Times

Observability — A 3-Year Retrospective

My favorite part is the role-playing scenarios of debugging a problem with observability tooling and traditional tools.

Charity Majors

TCP: Out of Memory — Consider Tuning TCP_Mem

Tuning your TCP stack is important on busy servers.

Ram Lakshmanan

Outages

Google Cloud Platform
- This incident primarily affected the control plane of many GCP services. It stemmed from a cascading failure in an important key-value store used by all of them.
Facebook and Instagram
Google Maps
GoDaddy
Target (retailer)
Discord
Fastly
- Plus two others.
Squarespace
GitHub
