The network is not reliable. What are the implications and what can we do about it?
This goes beyond a run-of-the-mill severity levels article and digs into a couple of common pitfalls.
Some good tips in here, esp. the one about brevity.
Ashley Sawatsky — Rootly
Or, Eleven things we have learned as Site Reliability Engineers at Google
Adrienne Walcer, Kavita Guliani, Mikel Ward, Sunny Hsiao, and Vrai Stacey — Google
Good lessons to learn here that apply more broadly than just EKS.
Christian Alexánder Polanco Valdez — Adevinta
This article is about project management, but a lot of the skills discussed apply to aspects of SRE at Staff+ levels.
Sannie Lee — Thoughtworks (via martinfowler.com)
Now this is more like it: there’s a healthy dose of skepticism woven through this article, including things genAI probably won’t be good for, and potential pitfalls.
Jesse Robbins — Heavybit
There are two different ways of alerting on SLOs, for two very different audiences, as explained in this article. Ostensibly this is a product feature announcement, but you don’t need to be using the product to get a lot out of this.
Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.