Whoa. This is the best thing ever. I feel like I want to make this the official theme song of SRE Weekly.
Their auto-scaling algorithm needed a tweak. Before: scale up by N instances. After: scale up by an amount proportional to the current number of instances.
Fran Garcia — Reddit
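To make the difference between the two policies concrete, here is a minimal sketch. The function names, the fixed step of 10 instances, and the 20 percent growth factor are assumptions chosen for illustration; they are not taken from Reddit's writeup.

```python
def scale_up_fixed(current: int, step: int = 10) -> int:
    """Old-style policy: always add the same number of instances."""
    return current + step


def scale_up_proportional(current: int, percent: int = 20, min_step: int = 1) -> int:
    """New-style policy: add a share of the current fleet size.

    Integer math with a ceiling so small fleets still grow by at least
    min_step instances.
    """
    step = max(min_step, (current * percent + 99) // 100)
    return current + step


if __name__ == "__main__":
    for fleet in (10, 100, 1000):
        print(fleet, scale_up_fixed(fleet), scale_up_proportional(fleet))
```

With a fixed step, a fleet of 1000 grows by only 1 percent per scaling action, while a proportional step keeps the relative growth the same at any fleet size.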
Here’s a look at incidents and reliability challenges that have occurred in outer space, and what SREs stand to learn from them.
JJ Tang — Rootly
This one includes 3 key things to remember while load testing. My favorite: test the whole system, not just parts.
SRE is as much about building consensus and earning buy-in as it is about actual engineering.
The definition of NoOps in this article is more clear than others I’ve seen. It’s not about firing your operations team — their skill set is still necessary.
Even though I know what observability is, I got a lot out of this article. It has some excellent examples of questions that are hard to answer with traditional dashboards, and includes my new favorite term:
The industrial term for this problem is Watermelon Metrics; A situation where individual dashboards look green, but the overall performance is broken and red inside.
Nishant Modak and Piyush Verma — Last9
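One way a watermelon can arise, sketched under stated assumptions: every service meets its own target, but a request that must pass through all of them in series does not. The 99 percent target, the five services, and the independence of their failures are all hypothetical and are not taken from the Last9 article.

```python
import math


def is_green(success_rate: float, target: float = 0.99) -> bool:
    """A dashboard is 'green' when the success rate meets the target."""
    return success_rate >= target


def end_to_end_success(rates: list[float]) -> float:
    """Success rate of a request that must pass through every service,
    assuming failures are independent (an assumption for this sketch)."""
    return math.prod(rates)


if __name__ == "__main__":
    # Five hypothetical services, each comfortably above its 99% target.
    per_service = [0.995] * 5
    overall = end_to_end_success(per_service)
    print("per-service green:", all(is_green(r) for r in per_service))
    print("end-to-end green:", is_green(overall))
```

Each individual dashboard reports green, yet the user-facing success rate falls below the same target, which is why a metric measured at the edge of the whole system is worth having alongside the per-service ones.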
Instead, we should consider the fields where practitioners are responsible for controlling a dynamic process that’s too complex for humans to fully understand.
In this epic troubleshooting story, a weird curl bug coupled with Linux memory tuning parameters led to unexpected CPU consumption in an unrelated process.
Pavlos Parissis — Booking.com
Learning a lesson from a rough Black Friday in 2019, these folks used load testing to gather hard data on how they would likely fare in 2020.
Mathieu Garstecki — Back Market