SRE Weekly Issue #412

A message from our sponsor, FireHydrant:

FireHydrant’s new and improved MTTX analytics dashboard is here! See which services are most affected by incidents, where they take the longest to detect (or acknowledge, mitigate, resolve … you name it), and how metrics and statistics change over time.

Can a single dashboard that covers your entire system really exist?

  Jamie Allen

This one makes the case for having a group of specially trained incident commanders to handle SEV-1 (worst-case) outages, separate from your normal ICs.

  Jonathan Word

This article lays out a strategy for gaining buy-in by making three specific, sequential arguments.

  Emily Arnott — Blameless

This article explores the varying ways that SRE is implemented through a set of four archetypes.

  Alex Ewerlöf

It turns out that assigning ephemeral ports to connections in Linux is way more complicated than it might seem at first glance, and there’s room for optimization, as this article explains.

  Frederick Lawler — Cloudflare

While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol, Simple Precision Time Protocol (SPTP), that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources.

  Oleg Obleukhov and Ahmad Byagowi — Meta

Far more than just a list of links, this article gives an overview of each topic before pointing you in the right direction for more information.

  Fred Hebert

Building on the groundwork laid out in our first article about the initial steps in Incident Management (IM) at Dyninno Group, this second installment will explore the practicalities of streamlining and implementing these strategies.

  Vladimirs Romanovskis

Updated: February 18, 2024 — 4:51 pm
A production of Tinker Tinker Tinker, LLC