Sorry about the automation fail and resend! That definitely wasn’t issue #1.
This article discusses building failure management directly into our systems, using Erlang as a case study.
Building on experience with their previous load-shedding library, Uber built a new one that requires no configuration.
Jakob Holdgaard Thomsen, Vladimir Gavrilenko, Jesper Lindstrom Nielsen, and Timothy Smyth — Uber
These folks found a way to get tag names and values from other people’s AWS resources. I know this is more security- than SRE-related but the technique is just so cool!
Daniel Grzelak — Plerion
How much does it cost to improve resilience? What’s the ROI? It’s fuzzy, but we still need to do it.
Check it out, it’s an entire SRE conference I was totally unaware of!
It’s an SLI/SLO/SLA explainer, but with a twist: a pros and cons list for each of the three.
Laura Clayton — UptimeRobot
A great Reddit thread for some schadenfreude… and perhaps you’d like to share your own story?
u/New_Detective_1363 and others — reddit
What an interesting cause for an incident: the service your customers have pointed your product at decides to block your requests, effectively DoSing your systems.
Tomas Koprusak — UptimeRobot
The CAP theorem is useful as a theory, but what does it actually mean in practice?
neda — ReadySet