SRE Weekly Issue #229

Articles

More details have emerged about the Twitter break-in last week, leading some to utter the quote above. Here’s a take on how to see it as not being about “stupidity”.

Lorin Hochstein

Data Consistency Checks

The data in your database should be consistent… but then again, incidents shouldn’t happen, right? Slack accepts that things routinely go wrong with data at their scale, and they have a framework and a set of tools to deal with it.

Paul Hammond and Samantha Stoller — Slack

Obstacles to Learning from Incidents

I learned a lot from this article. My favorite obstacle is “distancing through differencing”, e.g. “we would never have responded to an incident that way”.

Thai Wood — Learning from Incidents

You don’t need SRE. What you need is SRE.

[…] SRE, that is SRE as defined by Google, is not applicable for most organizations.

Sanjeev Sharma

Questionable Advice: “What’s the critical path?”

Expert advice on what questions to ask as you try to figure out what your critical path is (and why you would want to know what it is).

Charity Majors

Thinking About Your Humans With J. Paul Reed

This podcast episode was kind of like a preview of J. Paul Reed and Tim Heckman’s joint talk at https://srefromhome.com/. I love how they refer to the pandemic as a months-long incident, and point out that if you’re always in an incident then you’re never in an incident.

Julie Gunderson and Mandi Walls — Page it to the Limit

Rebuilding messaging: How we bootstrapped our platform

I love a good dual-write story. Here’s how LinkedIn transitioned to a new messaging storage mechanism.

Pradhan Cadabam and Jingxuan (Rex) Zhang — LinkedIn

Outages

Garmin
Snapchat
Tweetdeck
GGPoker
- GGPoker had issues during a World Series of Poker (WSOP) event.
Fastly (control plane)
- Full disclosure: Fastly is my employer.
Squarespace
- Squarespace had a rough week, with the following incidents:
  - July 21
  - July 22 (includes a detailed follow-up analysis)
  - July 24
  - July 24
Google Cloud Platform
- Several GCP components were impacted, including Layer 7 Load Balancers.
