SRE Weekly Issue #284

Like last week, I prepared this week’s issue in advance, so there is no Outages section. Have a great week!

A message from our sponsor, StackHawk:

Trying to automate application and API security testing? See how StackHawk and Burp Suite Enterprise stack up:
https://sthwk.com/burp-enterprise

Articles

SoundCloud is clear that they are not at Google scale. It’s interesting to see how they apply SRE principles at their own scale.

Björn “Beorn” Rabenstein — SoundCloud

Here’s why Target set up their ELK stack, and how they used it to troubleshoot a problem in Elasticsearch itself.

Dan Getzke — Target

A key point in this article is that calculating your error budget as just “100% – SLO” goes about things backward.

Adam Hammond — Squadcast
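
For reference, here is a minimal sketch of the conventional “100% – SLO” arithmetic that the article pushes back on. It is not the article’s recommended approach, only the baseline calculation; the function name `error_budget` and the 30-day window are assumptions chosen for illustration.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Allowed unavailability within `window` under the plain 100% - SLO formula.

    Hypothetical helper for illustration; not taken from the linked article.
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    # The budget is simply whatever fraction of the window the SLO does not cover.
    budget_fraction = (100.0 - slo_percent) / 100.0
    return window * budget_fraction


# A 99.9% SLO over an assumed 30-day window leaves roughly 43 minutes of budget.
budget = error_budget(99.9, timedelta(days=30))
print(f"Error budget: {budget.total_seconds() / 60:.1f} minutes")
```

Note that the budget here falls out of the SLO as a leftover, which is the ordering the article calls backward.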

Shopify periodically scales up their systems to test that they’ll be ready for big events like Black Friday / Cyber Monday.

Kathryn Tang — Shopify

In this post, we’ll focus on service ownership. Why is service ownership important? How should teams self-organize to achieve it? Where’s the best place to start?

Cortex

This fun troubleshooting story hinges on the internal details of how PostgreSQL’s sequences work.

Pete Hamilton — incident.io

Updated: August 22, 2021 — 7:38 pm
A production of Tinker Tinker Tinker, LLC