General

SRE Weekly Issue #327

lex

June 19, 2022

Articles

Even when your system has redundancy, sometimes all the redundant copies fail at once because of what they share in common.

Marc Brooker

Zero Downtime Database Changes with Feature Flags

Feature flags make it easy to roll out database schema migrations without downtime. This example uses double-writing and a data migration script.

Tom Hombergs — Reflectoring

Incident Management Guide

Like some kind of Netflix of SRE writing, incident.io just dropped an entire guide on incident management, ready for bingeing. My favorite is the section on on-call compensation.

Chris Evans — incident.io

Known Unknowns —Webb Struck by Meteoroid!

A major part of SRE is deciding what level of reliability makes sense, and how prepared you should be. This article drives that point home with an analogy to the James Webb Space Telescope.

Robert Barron — IBM

A multi-region AWS architecture for low latency edge messaging

Ably posted this design overview of their HA real-time messaging system, with lots of juicy details.

Jo Stichbury — Ably

Going On Call for the First Time

An advice columnist helps a newbie on-caller ease into the pager life.

Liz Fong-Jones — Honeycomb

Retrospective Template (What They Are & How To Use One)

I like that this article advocates using different templates for different kinds of retrospectives with different goals.

Myra Nizami — Blameless

Five Essential Non-technical Skills for SRE Success

Yes, we need more of this! The skills covered are: Communication, Empathy, Teamwork, Motivation, and Documentation.

Paul Marsicovetere — Formidable

Outages

SRE Weekly Issue #326

lex

June 12, 2022

General

Comments

View on sreweekly.com

Articles

Calling all Reliability Practitioners: Participate in the SRE Survey 2022

Catchpoint and Blameless have teamed up on this year’s SRE survey. They’ve sweetened the deal with two $5 donations to charity for every survey completed. Go do it!

Kurt Andersen — Blameless

The New AWS Status Page: Updated Look, Same Problems?

I sure miss the good old “checkmark-i” icon. Oh wait, no I don’t.

Jeff Martens — Metrist

Engineering For Failure

How can you handle failure gracefully? Click through for 6 strategies to consider.

Boris Cherkasky — Riskified

Declare early, declare often: why you shouldn’t hesitate to raise an incident

Declaring the first incident when you start a new job can be intimidating, but it really shouldn’t be. Let’s look at some common fears, and work out how to address them.

Isaac Seymour — incident.io

Google Incident Report — Google Cloud Networking incident on May 20, 2022

The incident involved fiber equipment failure and a suboptimal automated remediation.

Google

Incident Priority Matrix (Understanding Impact and Urgency)

This is a primer on Urgency and Impact in incidents, including the difference between them and how to use them.

Noor-ul-Anam Ruqayya — Blameless

Oops, That Almost Happened – Jeli

Running retrospectives on near-miss incidents can be highly valuable, as this article discusses.

Vanessa Huerta Granda — Jeli

Outages

SRE Weekly Issue #325

lex

June 5, 2022

General

Comments

View on sreweekly.com

Articles

Imagine there’s no human error…

This article really upends the concept of “human error”, in an intriguing way.

Lorin Hochstein

Readiness to Learn: Safely and Reliably Deploy to the Cloud

A key part of building reliable systems is often overlooked: continuously learning.

In the highly dynamic CI/CD environment, engineers with stale or outdated knowledge of the system are less able to detect, diagnose or repair anomalous behavior in their systems […]

Laura Maguire (jeli.io) — The New Stack

My Big Fat Monolithic Alert

This is the story of how an organization transitioned from a single NOC-like on-call team to individualized alerts routed to the relevant team.

Boris Cherkasky

Starting an SRE team from scratch [Quick Guide]

This guide has a set of key factors you should consider when building a new SRE team in order to increase the likelihood of success.

Ash P — SREPath

Incident postmortem pitfalls

My favorite pitfall discussed in this article: avoid committing to every possible remediation action from every incident.

incident.io

What SRE could be

This article, written by one of the authors of the Google SRE book, is a critical look at the state of SRE and what the future holds.

Today, I believe we cannot successfully answer several key questions about SRE.

Niall Murphy

DevOps Expert Jeff Smith On Third-Party Reliability Data

This interview goes into the thorny challenges around building a reliable app based on third-party services. It delves into the lack of reliable reporting we commonly see from cloud service providers and what ideal reporting would look like.

Jeff Martens (interviewing Jeff Smith) — Metrist

Outages

SRE Weekly Issue #324

lex

May 29, 2022

General

Comments

View on sreweekly.com

Articles

The Need to Decouple Human Error from Incident Response

We’ll start off this week with a recap of a KubeCon talk that urges leaving the concept of “human error” behind.

Jennifer Riggins — The New Stack
Talk by Silvia Pina

5 Tips If You’re the 1st SRE Hire by Instacart’s First SRE

Just to be clear, they’re saying the tips are written by Instacart’s first SRE — they’re not tips aimed oddly specifically at the second Instacart SRE. Good tips, too.

Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

Systems should expose a (simple) overall health metric as well as specifics

This is a really good point, and well argued. Then there’s an amusing bit at the end about alerting on the number of WARNING-level log messages generated by the system as a proxy for overall health.

Chris Siebenmann

Tracking On-Call Health

In this post, I’m going to expand on the values we’re currently using at Honeycomb to monitor on-call health, why we think they’re good, and some of the challenges we’re still encountering.

Fred Hebert — Honeycomb

When incident response requires business response, who should you notify?

Internal and external communication are critical in an incident, second (perhaps) only to actually resolving the problem. Read this article to learn about who you need to communicate with, how to talk to them, and how to prepare in advance.

Hannah Culver — PagerDuty

We can’t all be Shaq: why it’s time for the SRE hero to pass the ball and how to get there

If you’re playing the hero role at your organization, you might be unintentionally masking the need for better incident management practices.

Malcolm Preston — FireHydrant

Outages

SRE Weekly Issue #323

lex

May 22, 2022

General

Comments

View on sreweekly.com

Articles

A Chat with Lex Neva of SRE Weekly

I chatted with Emily Arnott of Blameless for a solid hour about everything from the origins of this newsletter and how I make it, to my thoughts on SRE and where it’s going. Somehow she managed to fit it all into this article. Thanks, Emily!

Emily Arnott — Blameless

Failing Forward — How We Grow from Incidents

The section on TTR (Time To Recovery) really caught my eye, both by confirming that MTTR is generally not a useful metric, and also finding one case where TTR does seem to be predictive.

The Spotify engineering blog seems to be down as of this publishing, so here’s the archive.org version.

Clint Byrum — Spotify

Can SRE Bring Governance and Compliance into the Future?

SRE concepts apply wonderfully well to compliance and governance. Each field has a lot to learn from the other.

Jennifer Riggins — The New Stack

The not-so-obvious positive outcomes of great incident management

More than ever, we should all be focused on shipping great products, retaining high-demand engineers, and building trust with customers. And investing in a thoughtful incident management strategy is one way to get there. Let’s explore how.

Robert Ross — FireHydrant

Vanguard’s Iterative Enterprise SRE Transformation

At this week’s DevOps Enterprise Summit (DOES) Europe, Vanguard talked about how they made the move from traditional architecture to the majority in the cloud, adopted site reliability engineering and even built their own customer-facing SaaS.

Jennifer Riggins — The New Stack

How we deploy to production over 100 times a day

This article has a great discussion of the risks of larger, less frequent deploys. It goes on to explain how they transitioned to smaller and more frequent deploys while focusing on safety.

Will Sewell — Monzo

How to empower your team to own incident response

What makes this article special is its focus on addressing the common concerns that people have when you try to get them to own their code for its full lifecycle. It offers practical advice to win folks over.

Martha Lambert — incident.io

SREcon 2022 Americas Wrap Up

Sounds like there were some pretty great talks at SRECon. I gotta admit, I’m kinda having some FOMO.

Emily Arnott — Blameless

SRE Weekly Issue #327

Articles

Outages

SRE Weekly Issue #326

Articles

Outages

SRE Weekly Issue #325

Articles

Outages

SRE Weekly Issue #324

Articles

Outages

SRE Weekly Issue #323

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

General

A message from our sponsor, Rootly:

Articles

Outages

A message from our sponsor, Rootly:

Articles

Outages

A message from our sponsor, Rootly:

Articles

Outages

A message from our sponsor, Rootly:

Articles

Outages

A message from our sponsor, Rootly:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues