They make a really clear case for why traditional metrics and monitoring couldn’t help them solve their problems.
This article commemorates the death of NASA flight director Glynn Lunney by showing the SRE lessons we can learn from him.
I like that this focuses on human factors.
Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency.
Uber’s customers are especially likely to be moving around and going in and out of tunnels, losing connectivity along the way. That means it’s difficult to tell when the client should fail over to a different server.
Sivabalan Narayanan, Rajesh Mahindra, and Christopher Francis — Uber
Here’s one I missed from last November. Some good stuff to learn from, especially if you run Vault on Kubernetes.
This outage was caused by a cascading failure stemming from our secrets management engine, which is a dependency of almost all of the production GoCardless services.
Ben Wheatley — GoCardless