SRE Weekly Issue #328

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating incident channel, Jira and Zoom, paging and adding responders, postmortem timeline, setting up reminders, and more. Book a demo (+ get a snazzy Rootly lego set):
https://rootly.com/demo/

Articles

Less than 12 hours after their outage, Cloudflare posted this detailed run-down of what happened.

  Tom Strick and Jeremy Hartman — Cloudflare

Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.

  Marc Brooker

By “derivatives”, the author means rate-of-change, like Prometheus’s irate(). Derivatives have their place, but this article has good reasons to reconsider using them for alerts.

  Boris Cherkasky
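
For readers unfamiliar with rate-of-change functions, here is a minimal sketch of the idea in Python. It is loosely modeled on what irate() does (a per-second rate from the two most recent samples of a counter), but it is a simplified illustration, not Prometheus's implementation; the `instant_rate` name and the sample layout are assumptions made for this example.

```python
from typing import Sequence, Tuple

# Hypothetical sample layout: (unix timestamp in seconds, counter value).
Sample = Tuple[float, float]


def instant_rate(samples: Sequence[Sample]) -> float:
    """Per-second rate of change computed from the last two samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        raise ValueError("timestamps must be strictly increasing")
    delta = v_last - v_prev
    if delta < 0:
        # Assumption: a drop means the counter restarted from zero,
        # so the increase since the reset is just the latest value.
        delta = v_last
    return delta / (t_last - t_prev)


# Only the last two samples matter: (190 - 130) / (30 - 15)
print(instant_rate([(0, 100), (15, 130), (30, 190)]))
```

Because only two samples feed the result, a single noisy scrape can swing the value sharply, which is one reason to think carefully before alerting on it.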

In this article, I’ll dive into what it takes to get into site reliability engineering, how to adopt it within your own organization and some of the core principles and best practices you’ll need to keep in mind as you move forward in your SRE maturity journey.

  Jemiah Sius — devops.com

I have given and received this question in many SRE interviews, and it’s famously used by Google in their interviews. This article dissects the question and its merits and downsides for the benefit of both interviewers and interviewees.

  Will Gallego

Outages

  • Cloudflare
    • Cloudflare had a major outage, taking many sites and services with it.

SRE Weekly Issue #327

Articles

Even when your system has redundancy, sometimes all the redundant copies fail at once because of what they share in common.

  Marc Brooker

Feature flags make it easy to roll out database schema migrations without downtime. This example uses double-writing and a data migration script.

  Tom Hombergs — Reflectoring
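
As a rough illustration of the double-writing pattern mentioned above, here is a minimal Python sketch. It is not the article's code: the `Flags`, `UserStore`, and `backfill` names, the in-memory "tables", and the name-splitting schema change are all hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Flags:
    # Hypothetical feature flags, toggled at runtime without a deploy.
    double_write: bool = False
    read_from_new: bool = False


@dataclass
class UserStore:
    flags: Flags
    old_table: dict = field(default_factory=dict)  # id -> "First Last"
    new_table: dict = field(default_factory=dict)  # id -> (first, last)

    def save(self, user_id: int, first: str, last: str) -> None:
        # The old schema is always written, so rolling back stays safe.
        self.old_table[user_id] = f"{first} {last}"
        if self.flags.double_write:
            self.new_table[user_id] = (first, last)

    def load(self, user_id: int) -> str:
        if self.flags.read_from_new:
            first, last = self.new_table[user_id]
            return f"{first} {last}"
        return self.old_table[user_id]


def backfill(store: UserStore) -> None:
    """Data migration script: copy rows written before double-writing began."""
    for user_id, full_name in store.old_table.items():
        if user_id not in store.new_table:
            first, _, last = full_name.partition(" ")
            store.new_table[user_id] = (first, last)


flags = Flags()
store = UserStore(flags)
store.save(1, "Ada", "Lovelace")   # written to the old schema only
flags.double_write = True          # step 1: start double-writing
store.save(2, "Alan", "Turing")    # written to both schemas
backfill(store)                    # step 2: migrate the older rows
flags.read_from_new = True         # step 3: switch reads to the new schema
print(store.load(1), "|", store.load(2))
```

The ordering is the point of the sketch: reads move to the new schema only after double-writing is on and the backfill has finished, so no request sees a missing row.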

Like some kind of Netflix of SRE writing, incident.io just dropped an entire guide on incident management, ready for bingeing. My favorite is the section on on-call compensation.

  Chris Evans — incident.io

A major part of SRE is deciding what level of reliability makes sense, and how prepared you should be. This article drives that point home with an analogy to the James Webb Space Telescope.

  Robert Barron — IBM

Ably posted this design overview of their HA real-time messaging system, with lots of juicy details.

  Jo Stichbury — Ably

An advice columnist helps a newbie on-caller ease into the pager life.

  Liz Fong-Jones — Honeycomb

I like that this article advocates using different templates for different kinds of retrospectives with different goals.

  Myra Nizami — Blameless

Yes, we need more of this! The skills covered are: Communication, Empathy, Teamwork, Motivation, and Documentation.

  Paul Marsicovetere — Formidable

Outages

SRE Weekly Issue #326

Articles

Catchpoint and Blameless have teamed up on this year’s SRE survey. They’ve sweetened the deal with two $5 donations to charity for every survey completed. Go do it!

  Kurt Andersen — Blameless

I sure miss the good old “checkmark-i” icon. Oh wait, no I don’t.

  Jeff Martens — Metrist

How can you handle failure gracefully? Click through for 6 strategies to consider.

  Boris Cherkasky — Riskified

Declaring the first incident when you start a new job can be intimidating, but it really shouldn’t be. Let’s look at some common fears, and work out how to address them.

  Isaac Seymour — incident.io

The incident involved fiber equipment failure and a suboptimal automated remediation.

  Google

This is a primer on Urgency and Impact in incidents, including the difference between them and how to use them.

  Noor-ul-Anam Ruqayya — Blameless

Running retrospectives on near-miss incidents can be highly valuable, as this article discusses.

  Vanessa Huerta Granda — Jeli

Outages

SRE Weekly Issue #325

Articles

This article really upends the concept of “human error”, in an intriguing way.

  Lorin Hochstein

A key part of building reliable systems is often overlooked: continuously learning.

In the highly dynamic CI/CD environment, engineers with stale or outdated knowledge of the system are less able to detect, diagnose or repair anomalous behavior in their systems […]

  Laura Maguire (jeli.io) — The New Stack

This is the story of how an organization transitioned from a single NOC-like on-call team to individualized alerts routed to the relevant team.

  Boris Cherkasky

This guide has a set of key factors you should consider when building a new SRE team in order to increase the likelihood of success.

  Ash P — SREPath

My favorite pitfall discussed in this article: avoid committing to every possible remediation action from every incident.

  incident.io

This article, written by one of the authors of the Google SRE book, is a critical look at the state of SRE and what the future holds.

Today, I believe we cannot successfully answer several key questions about SRE.

  Niall Murphy

This interview goes into the thorny challenges around building a reliable app based on third-party services. It delves into the lack of reliable reporting we commonly see from cloud service providers and what ideal reporting would look like.

  Jeff Martens (interviewing Jeff Smith) — Metrist

Outages

SRE Weekly Issue #324

Articles

We’ll start off this week with a recap of a KubeCon talk that urges leaving the concept of “human error” behind.

  Jennifer Riggins — The New Stack
  Talk by Silvia Pina

Just to be clear, they’re saying the tips are written by Instacart’s first SRE — they’re not tips aimed oddly specifically at the second Instacart SRE. Good tips, too.

  Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

This is a really good point, and well argued. Then there’s an amusing bit at the end about alerting on the number of WARNING-level log messages generated by the system as a proxy for overall health.

  Chris Siebenmann

In this post, I’m going to expand on the values we’re currently using at Honeycomb to monitor on-call health, why we think they’re good, and some of the challenges we’re still encountering.

  Fred Hebert — Honeycomb

Internal and external communication are critical in an incident, second (perhaps) only to actually resolving the problem. Read this article to learn about who you need to communicate with, how to talk to them, and how to prepare in advance.

  Hannah Culver — PagerDuty

If you’re playing the hero role at your organization, you might be unintentionally masking the need for better incident management practices.

  Malcolm Preston — FireHydrant

Outages

A production of Tinker Tinker Tinker, LLC