General

SRE Weekly Issue #335

lex

August 21, 2022

Articles

How an incident transformed Razorpay — Improving the 5 Why RCA format

I really like that “Missing” section in their incident retrospective template. Gotta be careful with “Missed” though, that sounds like it could slide toward blame.

Varun Achar — Razorpay

Uvalde: a reasonable officer

“Unreasonable” is a great way to avoid learning from an incident:

Labeling the responders actions as unreasonable enables us to explain away the failures in the law enforcement response as deficiencies with the individual responders.

Lorin Hochstein

Does the Fastly outage justify “Single Point of Failure” headlines?

The author of this post doesn’t argue the fact that Fastly is clearly a single point of failure for many of their customers. But does that really matter?

Jon Stevens-Hall
Full disclosure: Fastly, my employer, is mentioned.

Big Problems and Small Problems under load

Small problems can pile up unnoticed and interact weirdly to make a Big Problem that is incredibly hard to untangle. Maybe we should hunt down the small problems before they have a chance to trigger a Big one.

Dan Slimmon

Stop apologizing for bugs

Apologizing for bugs encourages a lot of problematic thought patterns, much in the same way as blaming people for incidents.

Dan Slimmon

SRE Weekly Issue #334

lex

August 14, 2022

General

Comments

View on sreweekly.com

I’ll be on vacation starting next Sunday (yay!). That means the next two issues will be prepared in advance, so there won’t be an Outages section.

Articles

Handling third-party provider outages

Should you go multi-cloud? What should you do during an incident involving a third-party dependency? What about after? Read this one for all that and more.

Lisa Karlin Curtis — incident.io
Full disclosure: Fastly, my employer, is mentioned.

Common ground breakdown in Uvalde

An introduction to the concept of common ground breakdown, using the Uvalde shooting in the US as a case study.

Lorin Hochstein

r/sre – How do you handle weekly commitments during your on call rotation?

The comments section is full of some pretty great advice, including questions you can ask while interviewing to suss out whether the on-call culture is going to be livable.

u/dicksoutfoeharambe (and others) — reddit

Lessons from the TSB failure: a perfect storm of waterfall failures

From the archives, this is an analysis of a report on the 2018 major outage at TSB Bank in the UK.

Jon Stevens-Hall

What is Backoff For?

You can determine whether backoff will actually help your system, and this article does a great job of telling you how.

Marc Brooker

An Incident Command Training Handbook

I’ve read (and written) plenty of IC training guides, but this is the first time I’ve come across the concept of a “Hands-Off Update”. I’m definitely going to use that!

Dan Slimmon

No observability without theory

This is a really great exlpanation of observability from an angle I haven’t seen before.

a metric dashboard only contributes to observability if its reader can interpret the curves they’re seeing within a theory of the system under study.

Dan Slimmon

Outages

Twitter
Google Search
- Did you catch the Google search outage? I’ve never seen one like it — that’s how rare they are. Google shared a tidbit of information about what went wrong — and it wasn’t the datacenter explosion folks speculated about.
Peloton

SRE Weekly Issue #333

lex

August 7, 2022

General

Comments

View on sreweekly.com

Articles

Is SRE Just Ops with a New Name?

They asked four people and got four answers that run the gamut.

Jeff Martens — Metrist

Automated Incident Management Through Slack

How Airbnb automates incident management in a world of complex, rapidly evolving ensemble of microservices.

Includes an overview of their ChatOps system that would make for a great blueprint to build your own.

Vlad Vassiliouk — Airbnb

Don’t overcategorise incidents

Rigidly categorizing incidents can cause problems, according to this article.

From the customer’s viewpoint… well why would they care what kind of technical classification it is being forced into?

Jon Stevens-Hall

Best Practices for Fixing Your Alerts

Lots of great advice in this one.

If no human needs to be involved, it’s pure automation.

If it doesn’t need a response right now, it’s a report.

If the thing you’re observing isn’t a problem, it’s a dashboard.

If nothing actually needs to be done, you should delete it.

Leon Adato — New Relic

Driving a customer-focused incident response process

Using the recent Atlassian outage as a case study, this article explains the importance of communication during an incident, then goes over best practices.

Martha Lambert — incident.io

SRE: From Theory to Practice | What’s difficult about on-call?

My favorite part about this is the advice to “lower the cost of being wrong”. Important in any case, but especially during incident response.

Emily Arnott — Blameless

GitHub Availability Report: July 2022

There are some interesting incidents in this issue: one involving DNS and another with an overload involving over-eager retries.

Jakub Oleksy — GitHub

Top SRE Interview Questions You Should Know

A great read both for interviewers and interviewees.

Myra Nizami — Blameless

When Microservices Are a Bad Idea

Their main advice is to avoid starting with a microservice architecture, and only transition to one after your monolith has matured and you have a good reason to do so.

Tomas Fernandez and Dan Ackerson — semaphore

Outages

SRE Weekly Issue #332

lex

July 31, 2022

General

Comments

View on sreweekly.com

Articles

How Razorpay’s Notification Service Handles Increasing Load

Their notification service had complex load characteristics that made scaling up a tricky proposition.

Anand Prakash — Razorpay

How we improved on-call life by reducing pager noise

Coalescing alerts and adding dependencies in AlertManager were the key to reducing this team’s excessive pager load.

steveazz — GitLab

What’s allowed to count as a cause: ALERRT edition

Lorin Hochstein has started a series of blog posts on what we can learn about incident response from the Uvalde school shooting tragedy in the US. This article looks at how an organization’s perspective can affect their retrospective incident analysis.

Lorin Hochstein

The fog of war in Uvalde

My claim here is that we should assume the officer is telling the truth and was acting reasonably if we want to understand how these types of failure modes can happen.

Every retrospective ever:

We must assume that a person can act reasonably and still come to the wrong conclusion in order to make progress.

Lorin Hochstein

User settings, Lamport clocks and lightweight formal methods

How do you synchronize state between multiple browsers and a backend, and ensure that everyone’s state will eventually converge? These folks explain how they did it, and a bug they found through testing.

Jakub Mikians — Airspace Intelligence

MTTR: lower isn’t always better

MTTR is a mean, so it doesn’t tell you anything about the number of incidents, among other potential pitfalls.

Dan Slimmon

Google Cloud Platform outage report: europe-west2 cooling failure

Last week, I included a GCP outage in europe-west2. This week, Google posted this report about what went wrong, and it’s got layers.

Bonus: another GCP outage report

Google

It’s time to leave the leap second in the past

Meta wants to do away with leap seconds, because they make it especially difficult to create reliable systems.

Oleg Obleukhov and Ahmad Byagowi — Meta

3 common pitfalls of post-mortems

If you’re anywhere near incident analysis in your organization, you need to read this list.

Milly Leadley — incident.io

Outages

SRE Weekly Issue #331

lex

July 24, 2022

General

Comments

View on sreweekly.com

Articles

DisasterCast – A podcast about scary things and how to stop them happening

I’ve been listening to this podcast this week and I love it! Each episode covers a disaster, safety theory, and other topics — with no ads. Their site is down right now, but the podcast is available on the usual platforms.

Drew Rae — DisasterCast

An 8 Step Guide to Go From a Clueless to a Production-aware Software Engineer

If we want to get folks to own their code in production, we need to teach them how to think like an SRE.

Boris Cherkasky

3 mistakes I’ve made at the beginning of an incident (and how not to make them)

Let’s look at three mistakes I’ve made during those stressful moments during the beginning of an incident — and discuss how you can avoid making them.

The mistakes are:

Mistake 1: We didn’t have a plan.
Mistake 2: We weren’t production ready.
Mistake 3: We fell down a cognitive tunnel.

Robert Ross — FireHydrant

When to kill the canary

At what point does your canary test indicate failure? Should the criteria be the same as your normal production alerting?

Øystein Blixhavn

On Counting Alerts

This is a followup to a previous article about on-call health. In this one, the author shares metrics about the number of alerts and discusses whether this number is useful.

Fred Hebert — Honeycomb

High Availability on Razorpay Payments Dashboard

Their dashboard crashed for 50% of user sessions, so they had a lot of work ahead of them. Find out how they got crash-free sessions to 99.9% and improved their time to respond to incidents.

Sandesh Damkondwar — Razorpay

@atoonk on Twitter summarizing the Rogers Communications outage

Rogers Communications, a major telecom in Canada, had a country-wide outage earlier this month. I don’t normally include telecom outages in the Outages section because they rarely share information that we can learn from.

This time, Rogers released a (redacted) report on their outage, and this Twitter thread summarizes the key points.

@atoonk on Twitter

Outages

Microsoft Teams and Office 365
Microsoft blames storage error for Teams outage
Google Cloud Storage
Google Cloud europe-west2 region
- Preliminary root cause has been identified as multiple concurrent failures to our redundant cooling systems within one of the buildings that hosts the europe-west2-a zone for the europe-west2 region.

SRE Weekly Issue #335

Articles

SRE Weekly Issue #334

Articles

Outages

SRE Weekly Issue #333

Articles

Outages

SRE Weekly Issue #332

Articles

Outages

SRE Weekly Issue #331

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

General

A message from our sponsor, Rootly:

Articles

A message from our sponsor, Rootly:

Articles

Outages

A message from our sponsor, Rootly:

Articles

Outages

A message from our sponsor, Rootly:

Articles

Outages

A message from our sponsor, Rootly:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues