More details have emerged about the Twitter break-in last week, leading some to utter the quote above. Here’s a take on why it shouldn’t be seen as a matter of “stupidity”.
The data in your database should be consistent… but then again, incidents shouldn’t happen, right? Slack accepts that things routinely go wrong with data at their scale, and they have a framework and a set of tools to deal with it.
Paul Hammond and Samantha Stoller — Slack
I learned a lot from this article. My favorite obstacle is “distancing through differencing”, e.g. “we would never have responded to an incident that way”.
Thai Wood — Learning from Incidents
[…] SRE, that is SRE as defined by Google, is not applicable for most organizations.
Expert advice on what questions to ask as you try to figure out what your critical path is (and why you would want to know what it is).
This podcast episode was kind of like a preview of J. Paul Reed and Tim Heckman’s joint talk at https://srefromhome.com/. I love how they refer to the pandemic as a months-long incident and point out that if you’re always in an incident, then you’re never in an incident.
Julie Gunderson and Mandi Walls — Page it to the Limit
I love a good dual-write story. Here’s how LinkedIn transitioned to a new messaging storage mechanism.
Pradhan Cadabam and Jingxuan (Rex) Zhang — LinkedIn
- GGPoker had issues during a World Series of Poker (WSOP) event.
- Fastly (control plane)
  - Full disclosure: Fastly is my employer.
- Google Cloud Platform
  - Several GCP components were impacted, including Layer 7 Load Balancers.