Articles
They make a really clear case for why traditional metrics and monitoring couldn’t help them solve their problems.
Mads Hartmann
This article commemorates the death of NASA flight director Glynn Lunney by showing the SRE lessons we can learn from him.
Robert Barron
I like that this focuses on human factors.
Kevin Casey
As you scale, the expectations and the challenges around reliability both grow, and handling them is difficult. You’ll need to maintain your development velocity and build customer trust through transparency.
Blameless
Uber’s customers are especially likely to be moving around and going in and out of tunnels, losing connectivity along the way. That means it’s difficult to tell when the client should fail over to a different server.
Sivabalan Narayanan, Rajesh Mahindra, and Christopher Francis — Uber
Here’s one I missed from last November. Some good stuff to learn from, especially if you run Vault on Kubernetes.
This outage was caused by a cascading failure stemming from our secrets management engine, which is a dependency of almost all of the production GoCardless services.
Ben Wheatley — GoCardless
Outages
- Gmail and a ton of other Android apps
- This one’s kind of weird. Google presented it as a Gmail outage, but it was actually a problem with the Android System WebView component. Tons of apps were crashing.
- MangaDex
- Canvas