SRE WEEKLY – Page 31 – scalability, availability, incident response, automation

SRE Weekly Issue #337

lex

September 4, 2022

Thanks for all the vacation well-wishes! It was really great and relaxing. Take vacations, it’s important for reliability!

While I was out, I shipped the past two issues with content prepared in advance, and without the Outages section. This gave me a chance to really think hard about the value of the Outages section versus the time and effort I put into it.

I’ve decided to put the Outages section on hiatus for the time being. For notable outages, I’ll include them in the main section, on a case-by-case basis. Read on if you’re interested in what went into this decision.

The Outages section has always been of lower quality than the rest of the newsletter. I have no scientific process for choosing which Outages make the cut — mostly it’s just whatever shows up in my Google search alerts and seems “important”, minus a few arbitrary categories that don’t seem particularly interesting like telecoms and games. I do only a cursory review of the outage-related news articles I link to, and often they’re on poor-quality sites with a ton of intrusive ads. Gathering the list of Outages has begun taking more and more of my time, and I’d much rather spend that effort on curating quality content, so that’s what I’m going to do going forward.

10 Things I Learned From My First Incident Review

Every one of these 10 items is enough reason to read this article! This makes me want to go investigate some incidents right now.

Fischer Jemison — Jeli

Slowing Down to Speed Up – Circuit Breakers for Slack’s CI/CD

Slack shares with us in great detail why they use circuit breakers and how they rolled them out.

Frank Chen — Slack

Tips to Make Your On-Call Process Less Stressful

My favorite part of this one is the section on expectations. We need to socialize this to help reduce the pressure on folks going on call for the first time.

Prakya Vasudevan — Squadcast

Why Status Pages Are Lying to You and What To Do About It

Status pages are marketing material. Prove me wrong.

Ellen Steinke — Metrist

Using incidents to level up your teams

incidents have unusually high information density compared with day-to-day work, and they enable you to piggy-back on the experience of others

Lisa Karlin Curtis — incident.io

How we store and process millions of orders daily

These folks realized that they had two different use cases for the same data: real-time transactions and batch processing. Rather than try to find one DB that could support both, they keep two copies of the data, one for each use case.

Xi Chen and Siliang Cao — Grab

Live Your Best Life With Structured Events

It’s all about gathering enough information that you can ask new questions when something goes wrong, rather than being stuck with only answers to the questions you thought to ask in advance.

Charity Majors

How Discord Supercharges Network Disks for Extreme Low Latency

They needed the speed of local ephemeral SSDs but the reliability of network-based persistent disks. The solution: a Linux MD option to mirror the data but prefer reading from the local disks. Neat!

Glen Oakley — Discord

Operating system upgrades at LinkedIn’s scale

OS upgrades can be risky. LinkedIn developed a system to unify OS upgrade procedures and make them much less risky.

Hengyang Hu, Dinesh Dhakal, and Kalyanasundaram Somasundaram — LinkedIn

SRE Weekly Issue #336

lex

August 28, 2022

Articles

What it’s like to work as an embedded microservices SRE

In this article, I will introduce several improvements being made by the Microservices SRE Team, embedded with other teams.

MizumotoShota — Mercari

What should be on a SLI dashboard

What really stood out to me in this article is the Service Info section. A dashboard will quickly atrophy and lose its meaning without an explanation of what it’s for.

Ali Sattari

SRE: From Theory to Practice: What’s Difficult About Incident Command?

When things go wrong, who is in charge? And what does it feel like to do that role?

This is a summary of a forum discussion about incident command, in case you don’t have time to listen to the whole thing.

Emily Arnott — Blameless

Complex Adaptive Systems and ITSM

Complex systems are weird, and a traditional deterministic view such as in older ITIL iterations doesn’t capture the situation. We need to evolve our practices.

Jon Stevens-Hall

Latency- and Throughput-Optimized Clusters Under Load

How can you design and interpret metrics for systems optimized for latency or throughput?

Dan Slimmon

The Latency/Throughput Tradeoff: Why Fast Services Are Slow And Vice Versa

You can optimize for latency or throughput in a given system, but not both, since the two are directly at odds.

Dan Slimmon

SRE Weekly Issue #335

lex

August 21, 2022

Articles

How an incident transformed Razorpay — Improving the 5 Why RCA format

I really like that “Missing” section in their incident retrospective template. Gotta be careful with “Missed” though, that sounds like it could slide toward blame.

Varun Achar — Razorpay

Uvalde: a reasonable officer

“Unreasonable” is a great way to avoid learning from an incident:

Labeling the responders actions as unreasonable enables us to explain away the failures in the law enforcement response as deficiencies with the individual responders.

Lorin Hochstein

Does the Fastly outage justify “Single Point of Failure” headlines?

The author of this post doesn’t argue the fact that Fastly is clearly a single point of failure for many of their customers. But does that really matter?

Jon Stevens-Hall
Full disclosure: Fastly, my employer, is mentioned.

Big Problems and Small Problems under load

Small problems can pile up unnoticed and interact weirdly to make a Big Problem that is incredibly hard to untangle. Maybe we should hunt down the small problems before they have a chance to trigger a Big one.

Dan Slimmon

Stop apologizing for bugs

Apologizing for bugs encourages a lot of problematic thought patterns, much in the same way as blaming people for incidents.

Dan Slimmon

SRE Weekly Issue #334

lex

August 14, 2022

I’ll be on vacation starting next Sunday (yay!). That means the next two issues will be prepared in advance, so there won’t be an Outages section.

Articles

Handling third-party provider outages

Should you go multi-cloud? What should you do during an incident involving a third-party dependency? What about after? Read this one for all that and more.

Lisa Karlin Curtis — incident.io
Full disclosure: Fastly, my employer, is mentioned.

Common ground breakdown in Uvalde

An introduction to the concept of common ground breakdown, using the Uvalde shooting in the US as a case study.

Lorin Hochstein

r/sre – How do you handle weekly commitments during your on call rotation?

The comments section is full of some pretty great advice, including questions you can ask while interviewing to suss out whether the on-call culture is going to be livable.

u/dicksoutfoeharambe (and others) — reddit

Lessons from the TSB failure: a perfect storm of waterfall failures

From the archives, this is an analysis of a report on the 2018 major outage at TSB Bank in the UK.

Jon Stevens-Hall

What is Backoff For?

You can determine whether backoff will actually help your system, and this article does a great job of telling you how.

Marc Brooker

An Incident Command Training Handbook

I’ve read (and written) plenty of IC training guides, but this is the first time I’ve come across the concept of a “Hands-Off Update”. I’m definitely going to use that!

Dan Slimmon

No observability without theory

This is a really great explanation of observability from an angle I haven’t seen before.

a metric dashboard only contributes to observability if its reader can interpret the curves they’re seeing within a theory of the system under study.

Dan Slimmon

Outages

Twitter
Google Search
- Did you catch the Google search outage? I’ve never seen one like it — that’s how rare they are. Google shared a tidbit of information about what went wrong — and it wasn’t the datacenter explosion folks speculated about.
Peloton

SRE Weekly Issue #333

lex

August 7, 2022

Articles

Is SRE Just Ops with a New Name?

They asked four people and got four answers that run the gamut.

Jeff Martens — Metrist

Automated Incident Management Through Slack

How Airbnb automates incident management in a world of complex, rapidly evolving ensemble of microservices.

Includes an overview of their ChatOps system that would make for a great blueprint to build your own.

Vlad Vassiliouk — Airbnb

Don’t overcategorise incidents

Rigidly categorizing incidents can cause problems, according to this article.

From the customer’s viewpoint… well why would they care what kind of technical classification it is being forced into?

Jon Stevens-Hall

Best Practices for Fixing Your Alerts

Lots of great advice in this one.

If no human needs to be involved, it’s pure automation.

If it doesn’t need a response right now, it’s a report.

If the thing you’re observing isn’t a problem, it’s a dashboard.

If nothing actually needs to be done, you should delete it.

Leon Adato — New Relic
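
The four rules quoted above from the alerting article can be read as a small decision procedure. Here is a minimal sketch of that reading in Python. The `Signal` fields, the `classify` function, and the order in which the rules are checked are all assumptions made for illustration; the quoted text does not say which rule wins when more than one applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """Hypothetical description of something you are tempted to alert on."""
    actionable: bool   # does anything actually need to be done?
    is_problem: bool   # is the thing being observed a problem?
    needs_human: bool  # does a human need to be involved?
    urgent: bool       # does it need a response right now?


def classify(signal: Signal) -> str:
    """Apply the four quoted rules in an assumed order of precedence."""
    if not signal.actionable:
        return "delete"      # nothing needs to be done
    if not signal.is_problem:
        return "dashboard"   # not a problem, just something to look at
    if not signal.needs_human:
        return "automation"  # no human needs to be involved
    if not signal.urgent:
        return "report"      # no response needed right now
    return "alert"           # urgent, actionable, and needs a person


if __name__ == "__main__":
    disk_filling_slowly = Signal(
        actionable=True, is_problem=True, needs_human=True, urgent=False
    )
    print(classify(disk_filling_slowly))
```

Only a signal that passes all four checks ends up as something that pages a person, which is the point the quoted rules make.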

Driving a customer-focused incident response process

Using the recent Atlassian outage as a case study, this article explains the importance of communication during an incident, then goes over best practices.

Martha Lambert — incident.io

SRE: From Theory to Practice | What’s difficult about on-call?

My favorite part about this is the advice to “lower the cost of being wrong”. Important in any case, but especially during incident response.

Emily Arnott — Blameless

GitHub Availability Report: July 2022

There are some interesting incidents in this issue: one involving DNS and another with an overload involving over-eager retries.

Jakub Oleksy — GitHub

Top SRE Interview Questions You Should Know

A great read both for interviewers and interviewees.

Myra Nizami — Blameless

When Microservices Are a Bad Idea

Their main advice is to avoid starting with a microservice architecture, and only transition to one after your monolith has matured and you have a good reason to do so.

Tomas Fernandez and Dan Ackerson — semaphore
