SRE Weekly Issue #412

Can a single dashboard that covers your entire system really exist?

Jamie Allen

This one makes the case for having a group of specially trained incident commanders to handle SEV-1 (worst-case) outages, separate from your normal ICs.

Jonathan Word

Getting Buy-in from Management on Reliability Investments

This article lays out a strategy for gaining buy-in by making three specific, sequential arguments.

Emily Arnott — Blameless

SRE Archetypes

This article uses a set of four archetypes to explore the varying ways that SRE is implemented.

Alex Ewerlöf

connect() – why are you so slow?

It turns out that assigning ephemeral ports to connections in Linux is way more complicated than it might seem at first glance, and there’s room for optimization, as this article explains.

Frederick Lawler — Cloudflare

Simple Precision Time Protocol at Meta

While deploying Precision Time Protocol (PTP) at Meta, we’ve developed a simplified version of the protocol (Simple Precision Time Protocol – SPTP) that can offer the same level of clock synchronization as unicast PTPv2 more reliably and with fewer resources.

Oleg Obleukhov and Ahmad Byagowi — Meta

A Distributed Systems Reading List

Far more than just a list of links, this article gives an overview of each topic before pointing you in the right direction for more information.

Fred Hebert

Streamlining and Implementing Incident Management at Dyninno

Building on the groundwork laid out in our first article about the initial steps in Incident Management (IM) at Dyninno Group, this second installment will explore the practicalities of streamlining and implementing these strategies.

Vladimirs Romanovskis

