Sorry about the automation fail and resend! That definitely wasn’t issue #1.
This article discusses building failure management directly into our systems, using Erlang as a case study.
Jamie Allen
Building on experience with their previous load-shedding library, Uber built a new one that requires no configuration.
Jakob Holdgaard Thomsen, Vladimir Gavrilenko, Jesper Lindstrom Nielsen, and Timothy Smyth — Uber
These folks found a way to get tag names and values from other people’s AWS resources. I know this is more security- than SRE-related but the technique is just so cool!
Daniel Grzelak — Plerion
How much does it cost to improve resilience? What’s the ROI? It’s fuzzy, but we still need to do it.
Will Gallego
Check it out, it’s an entire SRE conference I was totally unaware of!
SREday
It’s an SLI/SLO/SLA explainer, but with a twist: a pros and cons list for each of the three.
Laura Clayton — UptimeRobot
A great Reddit thread for some schadenfreude… and perhaps you’d like to share your own story?
u/New_Detective_1363 and others — reddit
What an interesting cause for an incident: the service your customers have pointed your product at decides to block your requests, effectively DoSing your systems.
Tomas Koprusak — UptimeRobot
The CAP theorem is useful as a theory, but what does it actually mean in practice?
neda — ReadySet