Can a single dashboard that covers your entire system really exist?
Jamie Allen
This one makes the case for having a group of specially trained incident commanders to handle SEV-1 (worst-case) outages, separate from your normal ICs.
Jonathan Word
This article lays out a strategy for gaining buy-in by making three specific, sequential arguments.
Emily Arnott — Blameless
This article explores the varying ways that SRE is implemented through a set of four archetypes.
Alex Ewerlöf
It turns out that assigning ephemeral ports to connections in Linux is way more complicated than it might seem at first glance, and there’s room for optimization, as this article explains.
Frederick Lawler — Cloudflare
While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol, Simple Precision Time Protocol (SPTP), that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources.
Oleg Obleukhov and Ahmad Byagowi — Meta
Far more than just a list of links, this article gives an overview of each topic before pointing you in the right direction for more information.
Fred Hebert
Building on the groundwork laid out in our first article about the initial steps in Incident Management (IM) at Dyninno Group, this second installment will explore the practicalities of streamlining and implementing these strategies.
Vladimirs Romanovskis