View on sreweekly.com
Like last week, I prepared this week’s issue in advance, so no Outages section. Have a great week!
SoundCloud is very clear that they are not at Google scale. It’s interesting to see how they apply SRE principles at their scale.
Björn “Beorn” Rabenstein — SoundCloud
Here’s why Target set up their ELK stack, and how they used it to troubleshoot a problem in Elasticsearch itself.
Dan Getzke — Target
A key point in this article is that calculating your error budget as just “100% – SLO” goes about things backward.
Adam Hammond — Squadcast
They periodically scale up their systems just to test and be sure they’ll be ready for big events like Black Friday / Cyber Monday.
Kathryn Tang — Shopify
In this post, we’ll focus on service ownership. Why is service ownership important? How should teams self-organize to achieve it? Where’s the best place to start?
This fun troubleshooting story hinges on the internal details of how PostgreSQL’s sequences work.
Pete Hamilton — incident.io
I’m on vacation enjoying the sunny beaches in Maine with my family, so I prepared this week’s issue in advance. No outages section, save for one big one I noticed due to direct personal experience. See you all next week!
We needed a way to deploy our new service seamlessly, and to roll back that deploy should something go wrong. Ultimately many, many things did go wrong, and every bit of failure tolerance put into the system proved to be worth its weight in gold because none of this was visible to customers.
Geoffrey Plouviez — Cloudflare
I especially like the idea of tailoring retrospective documents to disparate audiences — you may have more than you realize.
Emily Arnott — Blameless
An analysis of two incidents from the venerable John Allspaw. These are from 2012, back when he was at Etsy, and yet there’s still a ton we can learn now by reading them.
John Allspaw — Etsy
An account of how Gojek responds to production issues, and why the RCA is a critical part of the process.
Sooraj Rajmohan — Gojek
Type carefully… or rather, design resilient systems.
JJ Tang — Rootly
Requiring development teams to fully own their services can lead to siloing and redundancy. Heroku works to ameliorate that by embedding SREs in development teams.
Johnny Boursiquot — Salesforce (presented at QCon)
I’ve shared some articles here suggesting doing away with incident metrics like MTTR entirely. This author says that they are useful, but the numbers must be properly contextualized.
Vanessa Huerta Granda — Learning From Incidents
Everything could be fine, or we could be failing to report or missing problems altogether — we’re flying blind.
Chris Evans — incident.io