This is a presentation by Laura Nolan (with text transcript) all about cascading failure, what causes it, how to avoid it, and how to deal with it when it happens.
I love how succinct this is:
[…] in any system where we design to fail over, so any mechanism at all that redistributes load from a failed component to still working components, we create the potential for a cascading failure to happen.
Laura Nolan — Slack (presented at InfoQ)
It’s so easy to explain an incident by describing how management could have prevented it by investing additional resources.
Lorin goes on to explain the “trap” part: it’s easy to stop investigating an incident too soon and declare the cause to be “greedy executives”, which prevents us from learning more.
Reddit redesigned one of their caching systems in 2020, and it paid off handsomely during the GameStop saga. This article discusses the redesign and considers what would have happened without it.
Garrett Hoffman — Reddit
The lessons are:
- Do retrospectives for small incidents first.
- Do a retrospective soon after the incident.
- Alert on the user experience.
All great advice, and #1 is an interesting idea I hadn’t heard before.
Robert Ross — FireHydrant
We can’t engineer reliability in a vacuum. This is a great explainer on how SRE siloing happens, the problems it causes, and how to break SRE out of its shell.
JJ Tang — Rootly
This ASRS (Aviation Safety Reporting System) Callback issue has some real-world examples of resilient systems in action.
Facing a common set of Kubernetes node failure modes, Cloudflare uses open source tools (including one they published) to perform automatic restarts.
In the past 30 days, we’ve used the above automatic node remediation process to action 571 nodes. That has saved our humans a considerable amount of time.
Andrew DeMaria — Cloudflare