SRE Weekly Issue #229

A message from our sponsor, StackHawk:

Read about how to build test-driven security with StackHawk + Travis CI + Docker Compose.
https://www.stackhawk.com/blog/test-driven-security-with-travis-ci-and-docker-compose?utm_source=SREWeekly

Articles

More details have emerged about the Twitter break-in last week, leading some to utter the quote above. Here’s a take on how to see it as not being about “stupidity”.

Lorin Hochstein

The data in your database should be consistent… but then again, incidents shouldn’t happen, right? Slack accepts that things routinely go wrong with data at their scale, and they have a framework and a set of tools to deal with it.

Paul Hammond and Samantha Stoller — Slack

I learned a lot from this article. My favorite obstacle is “distancing through differencing”, e.g. “we would never have responded to an incident that way”.

Thai Wood — Learning from Incidents

[…] SRE, that is SRE as defined by Google, is not applicable for most organizations.

Sanjeev Sharma

Expert advice on what questions to ask as you try to figure out what your critical path is (and why you would want to know what it is).

Charity Majors

This podcast episode was kind of like a preview of J. Paul Reed and Tim Heckman’s joint talk at https://srefromhome.com/. I love how they refer to the pandemic as a months-long incident, and point out that if you’re always in an incident then you’re never in an incident.

Julie Gunderson and Mandi Walls — Page it to the Limit

I love a good dual-write story. Here’s how LinkedIn transitioned to a new messaging storage mechanism.

Pradhan Cadabam and Jingxuan (Rex) Zhang — LinkedIn

Outages

Updated: July 26, 2020 — 8:00 pm
A production of Tinker Tinker Tinker, LLC