SRE WEEKLY – Page 26 – scalability, availability, incident response, automation

SRE Weekly Issue #359

lex

February 12, 2023

Articles

the Data Reliability Engineering team is here to monitor, automate and manage pipelines to enable our partner USDE teams to have the ease of mind to tackle projects to help Mercari move forward.

Lameyer Daniel and Ohshima Takako — Mercari

Recruiting developers into Site Reliability Engineering (SRE)

Hiring in the Site Reliability Engineering (SRE) space is notoriously difficult. So it makes sense to figure out how to expand the hiring pool beyond existing SREs.

Ash Patel — SREpath

The yaml document from hell

SREs end up writing a lot of YAML. I mean, a lot. Fortunately it’s a really simple language with no hidden gotchas, right? Right?!

Ruud van Asseldonk

DNS Outage on 2023-01-25

Two Terraform changes that were developed and tested individually went out to production simultaneously, with unexpected results.

Jan David Nose — Rust

The technology behind GitHub’s new code search

Code search is a different beast from normal English-language searching. Regexes, punctuation, no word stemming, and GitHub’s scale made this a challenging design.

Timothy Clem — GitHub

Your non-technical teams should be using incident management tools, too

This article argues that folks outside of engineering are doing incident response, whether they call it that or not.

incident.io

Quick! Grab all the evidence: Capturing application state for post-incident forensics.

In incidents, we’re concentrating on resolving impact as quickly as possible, and this can impair our ability to gather the information we need after the fact in order to actually figure out what happened.

Jake Cohen — PagerDuty

SRE Weekly Issue #358

lex

February 5, 2023

Articles

Seamless critical traffic migration with CoreDNS request rewrite feature

A new spin on changing the engines on a jet in flight: using DNS request/response rewriting to transition an application over without modification.

lainra — Mercari

Putting a number on scalability

How much additional capacity can you get for a dollar?

Dan Slimmon

How We Manage Incident Response at Honeycomb

Dealing with the unknown, limited cognitive bandwidth, coordination patterns, psychological safety and feeding information back into the organization.

Fred Hebert — The New Stack
Full disclosure: Honeycomb is my employer.

SRE Transformation: our thoughts

How do you enable adoption of SRE principles at a large, mature company that has little capacity for innovation?

the value proposition of “SRE” is the idea that you can handle an exponentially growing business with a logarithmically growing payroll.

Layer Aleph

How to Setup Multi-burn rate Windows Alert on Service Level Objectives

Read this one to learn about four attributes of good alerting and how to ensure your SLO burn rate alerts are effective.

Saheed Oladosu
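
For readers who have not met multi-window burn rate alerts before, here is a minimal sketch of the general idea. It is not taken from the linked article. The rule of a 1-hour window paired with a 5-minute window at a 14.4x burn rate is a commonly cited example for a 30-day SLO and is assumed here; the names `BurnRateRule`, `burn_rate` and `should_alert` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    """One alerting rule: a long window, a short window, and a threshold."""
    long_window: str    # label only, e.g. "1h"
    short_window: str   # label only, e.g. "5m"
    threshold: float    # burn rate that both windows must reach


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent.

    A value of 1.0 means the budget would last exactly the SLO period.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_alert(rule: BurnRateRule,
                 long_error_ratio: float,
                 short_error_ratio: float,
                 slo_target: float) -> bool:
    """Fire only when both windows are burning at or above the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening, so the alert resets soon after recovery.
    """
    return (burn_rate(long_error_ratio, slo_target) >= rule.threshold
            and burn_rate(short_error_ratio, slo_target) >= rule.threshold)


# Assumed example rule: 1h and 5m windows at a 14.4x burn rate.
page_rule = BurnRateRule(long_window="1h", short_window="5m", threshold=14.4)
```

With a 99.9% SLO, a 2% error ratio in both windows is roughly a 20x burn and would page. If the last five minutes have recovered to 0.05% errors, the short window falls below the threshold and the alert stays quiet.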

Bad Observability

There’s plenty of content out there telling you how to implement observability, or what good looks like. But what about bad observability? What are some anti-patterns to watch out for?

Stephen Townshend — SquaredUp

On-call with Dave O’Connor

This is an interview about on-call with Twilio’s VP of SRE, who also spent 17 years as an SRE at Google.

Elena Boroda

Adding Zonal Resiliency to Etsy’s Kafka Cluster: Part 1

They started with a (mostly) single-availability-zone Kafka deployment. Here’s how they transitioned to a multi-zone architecture that can survive a single AZ failure.

Andrey Polyakov and Kamya Shethia — Etsy

SRE Weekly Issue #357

lex

January 29, 2023

Articles

3 tips for reducing stress during incident response efforts

Panic takes time and energy away from swift incident response, leading to second-guessing, a higher likelihood of mistakes, and analysis paralysis. Here are three tips to minimize it.

Malcolm Preston — incident.io

The FAA outage: On public incident reports and seeking second stories

A great explanation of why we need to wait for more details on the FAA NOTAM outage. My favorite part is the list of clues to whether an incident report might be useful: Time, Artifacts, Jargon, and Narrative.

Thai Wood — Resilience Roundup

Rundown of LinkedIn’s SRE practices

Lots of juicy details about a large SRE organization and how they work.

Ash Patel — SREpath

Cloudflare incident on January 24, 2023

A deploy accidentally wiped authentication tokens for some internal Cloudflare services, causing an outage for those services.

Kenny Johnson and Sam Rhea — Cloudflare

The Staging Dichotomy: Part One

eBay thought about adopting “test in production” and eliminating staging, but they determined that their use case really does require a staging environment. They carefully selected and anonymized real production data to use as test cases in staging.

Senthil Padmanabhan — eBay

Can We Stop With Those Horrible “System Overview” Dashboards Already?

This article has a really great section explaining the pitfalls of full system dashboards.

Boris Cherkasky

5 Exciting Predictions for SRE in 2023

The first one is my favorite:

Economic factors will force companies to look for more efficient ways of managing reliability

I’m not sure if that will happen, but it’s an interesting theory.

Emily Arnott

Remote First Incident Response

This author shares what they learned in adapting to running incidents remotely once the pandemic hit.

Emily Ruppe — Jeli

SRE Weekly Issue #356

lex

January 22, 2023

Thanks to all of you who took the time to share your ideas about choosing incidents to investigate! I got some great answers and I’m looking forward to pulling them together into an article.

I decided to give this GPT-3 thing a spin. It turns out that it absolutely can assemble a newsletter with links to the week’s top SRE stories, each with a short description. It even includes authors. The authors are even real people. The URLs, though… well, they look real, but they’re mysteriously all 404s, and the articles don’t actually exist. Guess you’re stuck with me for now!

Articles

Platform Engineering as a Startup

This article takes the idea of “internal customers” to its logical conclusion, by treating the platform in the same way as a startup company.

Adam Buggia — Sym

Blameless Postmortems & Bayes’ Theorem

This article uses nifty probability formulas to show that blaming an engineer for an incident may well result in diminished reliability and efficiency.

Dan Slimmon

CircleCI incident report for January 4, 2023 security incident

Here’s a report on the CircleCI security incident at the start of the year. There’s some good stuff in there about not blaming the specific engineer whose device was attacked.

Rob Zuber — CircleCI

Counting Forest Fires

A hot take on how not to measure your incident response process.

Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.

How eBay’s Notification Platform Used Fault Injection in New Ways

eBay’s notification platform team built a fault-tolerant, resilient system by injecting faults at the application level.

Wei Chen — eBay

A small mistake does not a complex systems failure make

This one succinctly sums up why I haven’t covered the NOTAM outage much yet.

If a small mistake was sufficient to take down a complex system, then our systems would be crashing all of the time.

Lorin Hochstein

Production postmortem: The heisenbug server

Don’t you love when merely running strace fixes the problem?

Oren Eini

Question of Intent: The crash of Garuda Indonesia flight 200

This air accident seems on its face to be a clear-cut story of negligence. There’s far more to it, and the author goes into detail on why blaming the captain can damage air safety industry-wide.

Admiral Cloudberg

SRE Weekly Issue #355

lex

January 15, 2023

Articles

Which incidents aren’t worth analyzing?

I’m trying something new: I’m looking for input from you, dear readers!

This link is a Google Form where I’m asking for ideas that I might turn into a blog post or conference talk. If you’re game, I’d love to hear what you think.

Join Jeli and Honeycomb for an Incident Response and Analysis Discussion

Here’s the panel for this webinar:

Vanessa Huerta Granda (Jeli)
Emily Ruppe (Jeli)
Liz Fong-Jones (Honeycomb)
Fred Hebert (Honeycomb)

Honestly, with that set of names, I’d listen even if they were just discussing the weather.
Full disclosure: Honeycomb, my employer, is mentioned.

The near crash of Air Canada flight 759

This week saw an outage of the NOTAM system which disseminates important information to aircraft pilots in the US. As a result, all flights in the US were grounded.

There’s not much in the way of interesting detail available yet, but I did see a mention of this air incident in which NOTAMs played a significant part. Mentour Pilot also covered this one.

Admiral Cloudberg

A New Definition of Reliability

In essence, this new reliability is:

The health of your system

Weighed based on customer expectations and happiness

Prioritized based on your current capabilities

This article focuses on the sociotechnical aspects of reliability.

Jim Gochee — The New Stack

When to Alert on What?

Here are some guidelines for what kind of alerting works best for services at various stages of maturity.

Ali Sattari

Creating Safety is Dangerous Work

The actions we take to avert a potential problem can introduce their own risks.

Will Gallego

Need your own incident post-mortem template? Here’s ours

This one’s from the incident.io folks.

incident.io

Why I only page on downtime. ONLY.

I often meet with skepticism when I say that server monitoring systems should only page when a service stops doing its work.

Read on to find out why.

Dan Slimmon
