Datadog has fully merged their SRE and Security teams.
In this post, we’ll look at essential elements of SRE and security, the benefits we’ve realized by combining the two disciplines, and what that approach looks like for us.
Bianca Lankford — Datadog
I love the way this article describes three different audiences for your communication during incidents. It describes what each audience is looking for and gives both positive and negative examples of how to communicate with them.
Hamed Silatani — Uptime Labs
My favorite part of this article is the section on where to run your load tests: production, staging, or something else?
Tom Elliot
What is complexity? This article gives a clear definition and breaks down the qualities one can find in a complex system. Then it goes over various methods of dealing with that complexity.
Teiva Harsanyi — The Coder Cafe
Cloudflare has a history of doing some pretty interesting things with sockets in Linux, and of taking us along for the journey with highly detailed explanations. This article is no exception, sharing the unique challenges encountered when restarting processes that handle UDP streams.
Marek Majkowski
This article examines the standard Friday deploy prohibition and ultimately pushes back.
Ok… but why not?
Adrien Guéret — OpenClassrooms
This article introduces STAMP (System-Theoretic Accident Model and Processes), a framework being adopted at Google, after first explaining the shortcomings in traditional SRE practices that prompted the change.
Jorge Lainfiesta — Rootly
I really love this framing of what’s wrong with picking a single root cause.
Lorin Hochstein