SRE Weekly Issue #330

Thanks for all the well-wishes as I took a sick day last week. I’m feeling much better!

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

Is your status page status.yourcompany.com? If so, read this article, then get yourself a new domain.

  Eduardo Messuti — Statuspal

The author used my favorite technique for getting up to speed on a company: analyzing a recent incident.

  Vanessa Huerta Granda — Jeli

There are a number of lessons I learned guiding weeks-long backcountry leadership courses for teens that I carried with me into my roles in incident management. In this blog post, I’ll share three that stand out.

  Ryan McDonald — FireHydrant

I really like these articles about interpreting SRE in a way that makes sense for your organization. SRE is still constantly evolving.

  Steve Smith — Equal Experts

The author led an incident just 3 months into their tenure. Here’s what they learned.

  Milly Leadley — incident.io

While SRE and DevOps type job explainers have been written ad nauseam, I found there’s relatively little online about Observability Teams and roles. I figured I’d share a bit about my experience on an O11y Team.

  Eric Mustin

I found the contrast between this one and the previous article interesting. The previous one includes a quote from Brendan Gregg:

Let me try some observability first. (Means: Let me look at the system without changing it.)

  Jessica Kerr — Honeycomb

In June, we experienced four incidents resulting in significant impact to multiple GitHub.com services. This report also sheds light on an incident that impacted several GitHub.com services in May.

  GitHub

Using the Webb telescope as an example, this article describes the progression of a system toward production operation using a metaphor of 3 days.

  Robert Barron — IBM

Outages

SRE Weekly Issue #329

Articles

A primer on what makes a good runbook.

Runbooks are most effective when they are readily available, easily actionable, and up-to-date and accurate.

  Cortex

In this article, we describe the architecture and implementation of our SRE infrastructure, how it is used and how it was adopted.

  Philipp Gündisch and Vladyslav Ukis — Siemens

After an explanation of tech debt, this article goes into a possible solution: having on-call folks fix lingering problems in between pages.

  Dormain Drewitz — The New Stack

I’ve read plenty of articles about service ownership, but this one has something new: a discussion of how to divvy up a monolith into separate “services” for teams to own.

  Hannah Culver — PagerDuty

The folks at Sendinblue have chronicled their journey to better incident response, and there’s a lot here to learn from.

  Tanguy Antoine — Sendinblue

Incidents will always happen, but thankfully they have plenty of upsides, as this article explains.

  Andre King — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

You’re not getting paged. Is it because you’ve fixed all the things, or has your alerting atrophied?

  Boris Cherkasky

The folks at incident.io are here with the results of their survey of on-call practices. I like the focus on compensation for being on-call.

  incident.io

Outages

SRE Weekly Issue #328

Articles

Less than 12 hours after their outage, Cloudflare posted this detailed run-down of what happened.

  Tom Strick and Jeremy Hartman — Cloudflare

Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.

  Marc Brooker

By “derivatives”, the author means rate-of-change, like Prometheus’s irate(). Derivatives have their place, but this article has good reasons to reconsider using them for alerts.

  Boris Cherkasky
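
To make "rate-of-change" concrete, here is a minimal Python sketch of an instant rate taken from only the last two samples of a counter, next to a rate averaged over the whole window. It is a simplified illustration, not Prometheus's implementation or the article's code: the function names, the (timestamp, value) sample shape, and the counter-reset handling are assumptions made for this example.

```python
from typing import Sequence, Tuple

# Hypothetical sample shape: (unix_timestamp_seconds, counter_value)
Sample = Tuple[float, float]


def instant_rate(samples: Sequence[Sample]) -> float:
    """Per-second rate computed from the last two samples only."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    dt = t_last - t_prev
    if dt <= 0:
        raise ValueError("timestamps must be strictly increasing")
    # Assumption: a drop in the counter means it was reset to zero,
    # so the increase since the reset is just the latest value.
    increase = v_last - v_prev if v_last >= v_prev else v_last
    return increase / dt


def average_rate(samples: Sequence[Sample]) -> float:
    """Per-second rate averaged over every sample in the window."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return increase / elapsed


# A steady counter with one burst in the final scrape interval.
window = [(0.0, 0.0), (15.0, 30.0), (30.0, 60.0), (45.0, 660.0)]
print(instant_rate(window), average_rate(window))
```

For this window the instant rate is 40.0 per second while the window average is about 14.67, so a single bursty scrape interval dominates the two-sample figure. That sensitivity is one reason to think twice before alerting on it.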

In this article, I’ll dive into what it takes to get into site reliability engineering, how to adopt it within your own organization and some of the core principles and best practices you’ll need to keep in mind as you move forward in your SRE maturity journey.

  Jemiah Sius — devops.com

I have given and received this question in many SRE interviews, and it’s famously used by Google in their interviews. This article dissects the question and its merits and downsides for the benefit of both interviewers and interviewees.

  Will Gallego

Outages

  • Cloudflare
    • Cloudflare had a major outage, taking many sites and services with it.

SRE Weekly Issue #327

Articles

Even when your system has redundancy, sometimes all the redundant copies fail at once because of what they share in common.

  Marc Brooker

Feature flags make it easy to roll out database schema migrations without downtime. This example uses double-writing and a data migration script.

  Tom Hombergs — Reflectoring
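
The double-writing pattern mentioned above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the article's code: the in-memory `UserStore`, the `Flags` object, and the `name` to `display_name` column rename are all hypothetical stand-ins for a real database and feature-flag service.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Flags:
    """Hypothetical feature flags controlling the migration steps."""
    write_new: bool = False  # also write to the new column
    read_new: bool = False   # serve reads from the new column


@dataclass
class UserStore:
    """In-memory stand-in for a table with an old and a new column."""
    flags: Flags
    rows: Dict[int, Dict[str, Optional[str]]] = field(default_factory=dict)

    def save_name(self, user_id: int, name: str) -> None:
        row = self.rows.setdefault(user_id, {"name": None, "display_name": None})
        row["name"] = name  # the old column is always written until cleanup
        if self.flags.write_new:
            row["display_name"] = name  # double-write while the flag is on

    def get_name(self, user_id: int) -> Optional[str]:
        row = self.rows[user_id]
        if self.flags.read_new and row["display_name"] is not None:
            return row["display_name"]
        return row["name"]  # fall back to the old column


def backfill(store: UserStore) -> int:
    """Data migration script: copy old values into rows that lack the new one."""
    migrated = 0
    for row in store.rows.values():
        if row["display_name"] is None and row["name"] is not None:
            row["display_name"] = row["name"]
            migrated += 1
    return migrated


flags = Flags()
store = UserStore(flags)
store.save_name(1, "Ada")    # written before the migration started
flags.write_new = True       # step 1: start double-writing
store.save_name(2, "Grace")
backfill(store)              # step 2: migrate the rows written earlier
flags.read_new = True        # step 3: switch reads to the new column
print(store.get_name(1), store.get_name(2))
```

Because each step sits behind its own flag, any of them can be switched off again without a deploy. Dropping the old column would be a final, separate step once reads have been stable on the new one.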

Like some kind of Netflix of SRE writing, incident.io just dropped an entire guide on incident management, ready for bingeing. My favorite is the section on on-call compensation.

  Chris Evans — incident.io

A major part of SRE is deciding what level of reliability makes sense, and how prepared you should be. This article drives that point home with an analogy to the James Webb Space Telescope.

  Robert Barron — IBM

Ably posted this design overview of their HA real-time messaging system, with lots of juicy details.

  Jo Stichbury — Ably

An advice columnist helps a newbie on-caller ease into the pager life.

  Liz Fong-Jones — Honeycomb

I like that this article advocates using different templates for different kinds of retrospectives with different goals.

  Myra Nizami — Blameless

Yes, we need more of this! The skills covered are: Communication, Empathy, Teamwork, Motivation, and Documentation.

  Paul Marsicovetere — Formidable

Outages

A production of Tinker Tinker Tinker, LLC