Articles
In this post, I’ll share how we ensured that Meet’s available service capacity was ahead of its 30x COVID-19 usage growth, and how we made that growth technically and operationally sustainable by leveraging a number of site reliability engineering (SRE) best practices.
Samantha Schaevitz — Google
I love the concept of “battleshorts” just as much as I’ve been enjoying this series of articles analyzing STAMP.
Lorin Hochstein
Honeycomb had 5 incidents in just over a week, prompting not only their normal incident investigation process, but a meta-analysis of all five together.
Emily Nakashima — Honeycomb
Why is Chromium responsible for half of the DNS queries to the root nameservers? And why do they all return NXDOMAIN?
Matthew Thomas — APNIC
“That Moment” when your fire suppression system triggers and the fire department shows up. This is part war story and part description of incident response practices.
Ariel Pisetzky — Taboola
An overload in an internal blob storage system impacted many dependent services.
Sharding as a service, now there’s an interesting idea.
Gerald Guo, Thawan Kooburat — Facebook
In Kubernetes Operators: Automating the Container Orchestration Platform, authors Jason Dobies and Joshua Wood describe an Operator as “an automated Site Reliability Engineer for its application.” Given an SRE’s multifaceted experience and diverse workload, this is a bold statement. So what exactly can the Operator do?
Emily Arnot — Blameless
Outages
- Zoom
- Slack
- Let’s Encrypt
- NZX (New Zealand Stock Exchange)
- eBay
- Garmin
- Heroku
- Fastly
- Also this one.
  Full disclosure: Fastly is my employer.
- Also this one.
- Cloudflare