SRE Weekly Issue #248

A message from our sponsor, StackHawk:

Join StackHawk and Snyk on Wednesday to learn how to automate application security testing with GitHub Actions. Register for the webinar here:
https://sthwk.com/stackhawk-snyk

Articles

It’s really easy to get an “uptime” SLO wrong, and a lying SLO can give you a false sense of security.

Piyush Verma — Last9

I love this quote. I feel like this is the “root cause” of every incident:

As for the underlying cause of the incident (or the “root cause” if you insist on using such language), that has to be the fact that our assumptions as teams or individuals are ultimately formed by our past experiences.

Oliver Leaver-Smith — Sky Betting & Gaming

I really love the concept of requisite complexity. This article has me thinking about a big project I’m working on in a new light.

Fred Hebert

They expected to max out an integer primary key column sometime in 2021. Then the pandemic hit and their timetable suddenly accelerated along with their traffic.

Jeff Pollard — Strava
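
To get a feel for how a traffic spike pulls that kind of deadline forward, here is a rough sketch of the arithmetic. It assumes a signed 32-bit key and a steady insert rate. The function names and all the numbers are made up for illustration and are not taken from the article.

```python
from datetime import date, timedelta

# Largest value a signed 32-bit integer column can hold (assumed column type).
INT32_MAX = 2**31 - 1


def days_until_exhaustion(current_max_id: int, inserts_per_day: int,
                          limit: int = INT32_MAX) -> int:
    """Whole days of headroom left at a constant insert rate."""
    if inserts_per_day <= 0:
        raise ValueError("inserts_per_day must be positive")
    remaining = max(limit - current_max_id, 0)
    return remaining // inserts_per_day


def projected_exhaustion_date(today: date, current_max_id: int,
                              inserts_per_day: int,
                              limit: int = INT32_MAX) -> date:
    """Calendar date on which the key space runs out, under the same assumption."""
    return today + timedelta(
        days=days_until_exhaustion(current_max_id, inserts_per_day, limit)
    )


if __name__ == "__main__":
    # Hypothetical figures: 2 billion ids used, 1 million inserts per day.
    print(days_until_exhaustion(2_000_000_000, 1_000_000))
    # Same table after traffic doubles: the headroom is roughly halved.
    print(days_until_exhaustion(2_000_000_000, 2_000_000))
```

A linear projection like this is only a first approximation, since real insert rates are rarely constant, but it shows why a sudden jump in traffic can turn a comfortable timetable into an urgent one.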

I shouldn’t enjoy reading these so much… got any of your own to share?

Dean Wilson

The idea of borrowing expertise makes me think of Bainbridge’s Ironies of Automation.

Mandi Walls — PagerDuty

Heroku’s report explains how their service was impacted as a result of the big Amazon Kinesis outage a couple weeks back.

Heroku

This primer focuses on ensuring that your SLOs actually match up with business objectives.

Irving Popovetsky — Honeycomb

Outages

Updated: December 13, 2020 — 8:19 pm
A production of Tinker Tinker Tinker, LLC