General

SRE Weekly Issue #325

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call; paging and adding responders; building the postmortem timeline; setting up reminders; and more. Book a demo (+ get a snazzy Rootly Lego set):
https://rootly.com/demo/

Articles

This article really upends the concept of “human error” in an intriguing way.

  Lorin Hochstein

A key part of building reliable systems is often overlooked: continuous learning.

In the highly dynamic CI/CD environment, engineers with stale or outdated knowledge of the system are less able to detect, diagnose or repair anomalous behavior in their systems […]

  Laura Maguire (jeli.io) — The New Stack

This is the story of how an organization transitioned from a single NOC-like on-call team to individualized alerts routed to the relevant team.

  Boris Cherkasky

This guide has a set of key factors you should consider when building a new SRE team in order to increase the likelihood of success.

  Ash P — SREPath

My favorite pitfall discussed in this article: avoid committing to every possible remediation action from every incident.

  incident.io

This article, written by one of the authors of the Google SRE book, is a critical look at the state of SRE and what the future holds.

Today, I believe we cannot successfully answer several key questions about SRE.

  Niall Murphy

This interview goes into the thorny challenges around building a reliable app based on third-party services. It delves into the lack of reliable reporting we commonly see from cloud service providers and what ideal reporting would look like.

  Jeff Martens (interviewing Jeff Smith) — Metrist

Outages

SRE Weekly Issue #324

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call; paging and adding responders; building the postmortem timeline; setting up reminders; and more. Book a demo (+ get a snazzy Rootly Lego set):
https://rootly.com/demo/

Articles

We’ll start off this week with a recap of a KubeCon talk that urges leaving the concept of “human error” behind.

  Jennifer Riggins — The New Stack
  Talk by Silvia Pina

Just to be clear, they’re saying the tips are written by Instacart’s first SRE — they’re not tips aimed oddly specifically at the second Instacart SRE. Good tips, too.

  Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

This is a really good point, and well argued. Then there’s an amusing bit at the end about alerting on the number of WARNING-level log messages generated by the system as a proxy for overall health.

  Chris Siebenmann
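
As a rough illustration of that last idea (my own minimal sketch in Python, not code from the article; the log path and thresholds are made up): count recent WARNING lines and alert when the rate looks abnormal.

    # Minimal sketch: alert when a log file accumulates too many WARNING
    # lines per sampling interval. Path and thresholds are illustrative,
    # and this ignores log rotation, which a real check would handle.
    import time

    LOG_PATH = "/var/log/myapp.log"   # hypothetical log file
    THRESHOLD = 50                    # max WARNINGs tolerated per interval
    INTERVAL = 300                    # seconds between checks

    def warning_count(path: str) -> int:
        with open(path) as f:
            return sum(1 for line in f if "WARNING" in line)

    last = warning_count(LOG_PATH)
    while True:
        time.sleep(INTERVAL)
        current = warning_count(LOG_PATH)
        if current - last > THRESHOLD:
            print(f"ALERT: {current - last} WARNINGs in the last {INTERVAL}s")
        last = current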

In this post, I’m going to expand on the values we’re currently using at Honeycomb to monitor on-call health, why we think they’re good, and some of the challenges we’re still encountering.

  Fred Hebert — Honeycomb

Internal and external communication are critical in an incident, second (perhaps) only to actually resolving the problem. Read this article to learn about who you need to communicate with, how to talk to them, and how to prepare in advance.

  Hannah Culver — PagerDuty

If you’re playing the hero role at your organization, you might be unintentionally masking the need for better incident management practices.

  Malcolm Preston — FireHydrant

Outages

SRE Weekly Issue #323

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call; paging and adding responders; building the postmortem timeline; setting up reminders; and more. Book a demo (+ get a snazzy Rootly Lego set):
https://rootly.com/demo/

Articles

I chatted with Emily Arnott of Blameless for a solid hour about everything from the origins of this newsletter and how I make it, to my thoughts on SRE and where it’s going. Somehow she managed to fit it all into this article. Thanks, Emily!

  Emily Arnott — Blameless

The section on TTR (Time To Recovery) really caught my eye, both by confirming that MTTR is generally not a useful metric and by finding one case where TTR does seem to be predictive.

The Spotify engineering blog seems to be down as of publication time, so here’s the archive.org version.

  Clint Byrum — Spotify

SRE concepts apply wonderfully well to compliance and governance. Each field has a lot to learn from the other.

  Jennifer Riggins — The New Stack

More than ever, we should all be focused on shipping great products, retaining high-demand engineers, and building trust with customers. And investing in a thoughtful incident management strategy is one way to get there. Let’s explore how.

  Robert Ross — FireHydrant

At this week’s DevOps Enterprise Summit (DOES) Europe, Vanguard talked about how they moved from a traditional architecture to running mostly in the cloud, adopted site reliability engineering, and even built their own customer-facing SaaS.

  Jennifer Riggins — The New Stack

This article has a great discussion of the risks of larger, less frequent deploys. It goes on to explain how they transitioned to smaller and more frequent deploys while focusing on safety.

  Will Sewell — Monzo

What makes this article special is its focus on addressing the common concerns that people have when you try to get them to own their code for its full lifecycle. It offers practical advice to win folks over.

  Martha Lambert — incident.io

Sounds like there were some pretty great talks at SRECon. I gotta admit, I’m kinda having some FOMO.

  Emily Arnott — Blameless

Outages

SRE Weekly Issue #322

Bit of a short issue this week. This morning, I stepped on my phone, crushing it mightily beneath my bootheel. Unfortunately a lot of my automation for reviewing articles is on there… thank goodness I have functioning backups.

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call; paging and adding responders; building the postmortem timeline; setting up reminders; and more. Book a demo (+ get a snazzy Rootly Lego set):
https://rootly.com/demo/

Articles

What? Actually, it’s a pretty good analogy.

  Emily Arnott — Blameless

Mercari posted this update to their previous article on their embedded SRE team, with more details on how the embedding model works.

  Taichi Nakashima — Mercari

Interesting things happen when you combine tail latency with a microservice architecture.

  Marc Brooker
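
The effect is easy to see with a little arithmetic. If each downstream service independently has a 1% chance of responding at p99-tail latency (an illustrative assumption, not a figure from the article), a request that fans out to many services hits at least one slow tail surprisingly often:

    # Probability that a request touching n services sees at least one
    # p99-tail response, assuming independent latencies (a simplification).
    def p_hit_tail(n_services: int, tail_prob: float = 0.01) -> float:
        return 1 - (1 - tail_prob) ** n_services

    for n in (1, 10, 50, 100):
        print(f"{n:3d} services -> {p_hit_tail(n):.0%} chance of a tail hit")
    # prints roughly 1%, 10%, 39%, and 63%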

Their starting point was paging for every single exception raised by their application. Here’s how they tempered that a bit to get a handle on their paging volume.

  Lisa Karlin Curtis — incident.io
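
One common way to temper exception-based paging (my illustration of the general technique, not necessarily what incident.io did): fingerprint exceptions and page only when a fingerprint crosses a rate threshold, rather than on every occurrence.

    # Illustrative sketch: page only when one exception fingerprint
    # exceeds a rate threshold, instead of paging on every exception.
    # Threshold and window values are made up.
    import time
    from collections import defaultdict

    PAGE_THRESHOLD = 10   # occurrences per window before we page
    WINDOW = 600          # 10-minute window

    counts = defaultdict(list)

    def should_page(exc: BaseException) -> bool:
        fingerprint = f"{type(exc).__name__}:{exc.args[:1]}"
        now = time.time()
        recent = [t for t in counts[fingerprint] if now - t < WINDOW]
        recent.append(now)
        counts[fingerprint] = recent
        return len(recent) >= PAGE_THRESHOLD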

This article draws from the “SRE Hierarchy” in Google’s SRE book (which itself is a reference to Maslow’s hierarchy of needs). It recasts the SRE hierarchy as a path to maturity.

  Ash P. — SREPath

Google posted this summary of an incident from late April. A configuration change had the unintended effect of causing livestream view requests to fail.

  Google

Outages

  • Xbox
    • I don’t normally bother with game outages, but this one caught my eye. During the 4-day outage, customers were unable to play Xbox games that they had already purchased.

  • Twitter
  • Coinbase

SRE Weekly Issue #321

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call; paging and adding responders; building the postmortem timeline; setting up reminders; and more. Book a demo (+ get a snazzy Rootly Lego set):
https://rootly.com/demo/

Articles

A researcher explains how they implemented their microservice failure testing tool at DoorDash. The tool, Filibuster, automatically discovers microservice dependencies and injects faults, avoiding the need to design specific individual failure scenarios.

  Christopher Meiklejohn — DoorDash
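
The core idea is to systematically inject each fault type at each discovered call site instead of hand-writing scenarios. A toy sketch of that loop (the call sites and fault types are made-up examples; this is not Filibuster's API, and Filibuster itself discovers dependencies automatically by instrumenting services):

    # Toy sketch of exhaustive fault injection across discovered call
    # sites. All names here are invented for illustration.
    import itertools

    call_sites = ["user-service", "cart-service", "payment-service"]
    faults = [ConnectionError, TimeoutError]

    def run_test(failing_site, fault):
        """Stand-in for running one test request with a fault injected."""
        print(f"injecting {fault.__name__} at {failing_site}")
        # ...run the request and assert the system degrades gracefully...

    for site, fault in itertools.product(call_sites, faults):
        run_test(site, fault)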

Last week, I shared Atlassian’s outage write-up. This link is a Twitter thread with a critique.

I feel like it is perhaps not a “good look” to repeatedly try to sell your product in your writeup about your product’s catastrophic outage

  @ReinH

“Error” serves a number of functions for an organization: as a defense against entanglement, the illusion of control, as a means for distancing, and as a marker for a failed investigation.

  Eric Dobbs

This is a write-up posted in January for an incident that occurred during an infrastructure migration. I feel like I can relate to every one of the learnings.

  Enom (Tucows)

In the past two years, I’ve been participating in on-call rotations as a Site Reliability Engineer at Vinted. Here are some of the practical lessons I’ve learned about the process.

  Ernestas Narmontas

This article is all about finding out what risks exist that may impact your ability to meet your SLOs. Once you’ve done that, you can determine whether your SLOs are realistic.

  Ayelet Sachto — Google
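
One concrete way to run that kind of analysis (a sketch of the general technique, with made-up numbers; not necessarily the article's method): estimate each risk's expected annual downtime and compare the total against the error budget your SLO allows.

    # Sketch: compare estimated annual downtime from known risks against
    # the error budget implied by an SLO. All figures are invented.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    slo = 0.999                                   # 99.9% availability target
    error_budget = (1 - slo) * MINUTES_PER_YEAR   # ~526 minutes/year

    # risk name: (incidents per year, minutes of downtime per incident)
    risks = {
        "bad deploy": (4, 30),
        "database failover": (2, 45),
        "upstream provider outage": (1, 120),
    }

    expected = sum(freq * mins for freq, mins in risks.values())
    print(f"budget {error_budget:.0f} min/yr, expected {expected:.0f} min/yr")
    print("SLO looks realistic" if expected <= error_budget else "SLO is at risk")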

When your organization chooses to implement SLOs, how do you get everyone on board? This two-part series has an in-depth look at how Klarna did it.

  Andrew Cartine — Klarna

Subtitle: And why do SRE teams need PMs?

After laying out the reasons why SREs need PMs, this article goes into detail about what a PM can bring to an SRE team.

  António Araújo — detech.ai

BellJar helps users find cyclic dependencies in their services by running totally isolated VMs and requiring users to explicitly enable every external dependency needed to bootstrap each service. It has a really neat feature: automatically generating runbooks based on these test cases.

  Christopher Bunn and Jie Huang — Meta
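
The detection step itself is classic graph territory. Here's a minimal cycle-finding sketch over a made-up service dependency graph (illustrative only, not BellJar's code):

    # Minimal DFS cycle detection over a service dependency graph.
    # The graph is an invented example with a deliberate cycle.
    deps = {
        "web": ["auth", "db"],
        "auth": ["db", "cache"],
        "cache": ["auth"],   # auth <-> cache form a cycle
        "db": [],
    }

    def find_cycle(graph):
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {node: WHITE for node in graph}

        def visit(node, path):
            color[node] = GRAY
            for dep in graph.get(node, []):
                if color[dep] == GRAY:   # back edge: cycle found
                    return path[path.index(dep):] + [dep]
                if color[dep] == WHITE:
                    cycle = visit(dep, path + [dep])
                    if cycle:
                        return cycle
            color[node] = BLACK
            return None

        for node in graph:
            if color[node] == WHITE:
                cycle = visit(node, [node])
                if cycle:
                    return cycle
        return None

    print(find_cycle(deps))   # -> ['auth', 'cache', 'auth']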

This week, I watched Netflix’s Meltdown: Three Mile Island, a documentary about the nuclear accident in the US in 1979. It’s not exactly a post-incident write-up, but there’s a lot in there about normalization of deviance, situational awareness, and risk-taking (both in and out of incidents).

  Netflix

Outages

  • Slack
  • Heroku
    • Heroku’s been dealing with a security incident since April 13. They performed a mass password reset of all accounts, and their GitHub integration has been disabled for days.

  • Roblox