SRE Weekly Issue #248

Articles

It’s really easy to get an “uptime” SLO wrong, and a lying SLO can give you a false sense of security.

Piyush Verma — Last9

I love this quote. I feel like this is the “root cause” of every incident:

As for the underlying cause of the incident (or the “root cause” if you insist on using such language), that has to be the fact that our assumptions as teams or individuals are ultimately formed by our past experiences.

Oliver Leaver-Smith — Sky Betting & Gaming

Complexity Has to Live Somewhere

I really love the concept of requisite complexity. This article has me thinking about a big project I’m working on in a new light.

Fred Hebert

The Boring Option

They expected to max out an integer primary key column sometime in 2021. Then the pandemic hit and their timetable suddenly accelerated along with their traffic.

Jeff Pollard — Strava

Scary sysadmin Halloween stories

I shouldn’t enjoy reading these so much… got any of your own to share?

Dean Wilson

Borrow Expertise With Runbook Automation

The idea of borrowing expertise makes me think of Bainbridge’s Ironies of Automation.

Mandi Walls — PagerDuty

Heroku Incident #2127 Follow-Up: Issues with starting new dynos

Heroku’s report explains how their service was impacted as a result of the big Amazon Kinesis outage a couple weeks back.

Heroku

Setting Business Goals with SLOs

This primer focuses on ensuring that your SLOs actually match up with business objectives.

Irving Popovetsky — Honeycomb

Outages

AT&T
- An interesting Twitter thread about a router near San Francisco, California, USA that was flipping bits in packets for weeks. Folks took to Twitter to try to get AT&T’s attention, and they finally fixed it.
Robinhood
Facebook Messenger & Instagram
Microsoft stuff
- Office 365
- Teams
- SharePoint
- OneDrive
Reddit
