It’s really easy to get an “uptime” SLO wrong, and a lying SLO can give you a false sense of security.
Piyush Verma — Last9
I love this quote. I feel like this is the “root cause” of every incident:
As for the underlying cause of the incident (or the “root cause” if you insist on using such language), that has to be the fact that our assumptions as teams or individuals are ultimately formed by our past experiences.
Oliver Leaver-Smith — Sky Betting & Gaming
I really love the concept of requisite complexity. This article has me thinking about a big project I’m working on in a new light.
They expected to max out an integer primary key column sometime in 2021. Then the pandemic hit and their timetable suddenly accelerated along with their traffic.
Jeff Pollard — Strava
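To make the timetable shift concrete, here is a rough sketch of the arithmetic behind that kind of forecast. It assumes a signed 32-bit key column and a steady insert rate. Neither detail comes from the article, and the function names and all numbers are hypothetical.

```python
from datetime import date, timedelta

# Assumption: the key is a signed 32-bit integer. The article only says "integer".
INT32_MAX = 2**31 - 1


def days_until_exhaustion(current_max_id: int, rows_per_day: int,
                          limit: int = INT32_MAX) -> int:
    """Whole days left before the key column runs out at a steady insert rate."""
    if rows_per_day <= 0:
        raise ValueError("rows_per_day must be positive")
    remaining = max(limit - current_max_id, 0)
    return remaining // rows_per_day


def exhaustion_date(today: date, current_max_id: int, rows_per_day: int,
                    limit: int = INT32_MAX) -> date:
    """Projected calendar date on which the key space is used up."""
    return today + timedelta(
        days=days_until_exhaustion(current_max_id, rows_per_day, limit))


# Hypothetical numbers: doubling the insert rate halves the remaining runway.
baseline = days_until_exhaustion(2_000_000_000, 1_000_000)
after_surge = days_until_exhaustion(2_000_000_000, 2_000_000)
```

A linear projection like this is only a first estimate; a traffic surge changes `rows_per_day`, which is exactly why such a deadline can move up without warning.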
I shouldn’t enjoy reading these so much… got any of your own to share?
The idea of borrowing expertise makes me think of Bainbridge’s Ironies of Automation.
Mandi Walls — PagerDuty
Heroku’s report explains how their service was impacted by the big Amazon Kinesis outage a couple of weeks back.
This primer focuses on ensuring that your SLOs actually match up with business objectives.
Irving Popovetsky — Honeycomb
- An interesting Twitter thread about a router near San Francisco, California, USA, that was flipping bits in packets for weeks. Folks took to Twitter to try to get AT&T’s attention, and the company finally fixed it.
- Facebook Messenger & Instagram
- Microsoft stuff
- Office 365