SRE Weekly Issue #277

A message from our sponsor, StackHawk:

Planetly saved weeks of work by implementing StackHawk instead of building an internal ZAP service. See how:
https://sthwk.com/planetly-stackhawk

Articles

Remember all those Robinhood outages? The US financial regulatory agency is making Robinhood repay folks for the losses they sustained as a result and also fining them for other reasons.

Michelle Ong, Ray Pellecchia, Angelita Plemmer Williams, and Andrew DeSouza — FINRA

This is brilliant and I wish I’d thought of it years ago:

One of the things we’ve previously seen during database incidents is that a set of impacted tables can provide a unique fingerprint to identify a feature that’s triggering issues.

Courtney Wang — Reddit

The suggested root cause involves consolidation in cloud providers and the importance of DNS.

Alban Kwan — CircleID

Full disclosure: Fastly, my employer, is mentioned.

This paper is about recognizing normalization of deviance and techniques for dealing with it. This tidbit really made me think:

[…] they might have been taught a system deviation without realizing that it was so […]

Bus Horiz

Blameless incident analysis is often at odds with a desire to “hold people accountable”. This article explores that conflict and techniques for managing the needs involved.

Christina Tan and Emily Arnott — Blameless

What can you do if you’re out of error budget but you still want to deliver new features? Get creative.

Paul Osman — Honeycomb

I am going to go through the variation we use to upskill our on-call engineers, which we call “The Kobayashi Maru”, a name we borrowed from the Star Trek training exercise that tests the character of Starfleet cadets.

Bruce Dominguez

Outages

Updated: July 4, 2021 — 9:03 pm
A production of Tinker Tinker Tinker, LLC