SRE Weekly Issue #478

Datadog has fully merged their SRE and Security teams.

In this post, we’ll look at essential elements of SRE and security, the benefits we’ve realized by combining the two disciplines, and what that approach looks like for us.

  Bianca Lankford — Datadog

I love the way this article describes three different audiences for your communication during incidents. It describes what each audience is looking for and gives both positive and negative examples of how to communicate with them.

  Hamed Silatani — Uptime Labs

My favorite part of this article is the section on where to run your load tests: production, staging, or something else?

  Tom Elliot

What is complexity? This article gives a clear definition and breaks down the qualities one can find in a complex system. Then it goes over various methods of dealing with that complexity.

  Teiva Harsanyi — The Coder Cafe

Cloudflare has a history of doing some pretty interesting things with sockets in Linux, and of taking us along for the journey with highly detailed explanations. This article is no exception, sharing the unique challenges encountered when restarting processes that handle UDP streams.

  Marek Majkowski

This article examines the standard Friday deploy prohibition and ultimately pushes back.

Ok… but why not?

  Adrien Guéret — OpenClassrooms

This article introduces the STAMP (System-Theoretic Accident Model and Processes) framework being adopted at Google, after first explaining the shortcomings in traditional SRE practices that prompted Google to adopt STAMP.

  Jorge Lainfiesta — Rootly

I really love this framing of what’s wrong with picking a single root cause.

  Lorin Hochstein

Updated: May 25, 2025 — 10:04 pm
A production of Tinker Tinker Tinker, LLC