General

SRE Weekly Issue #241

lex

October 25, 2020

Articles

A quick note on last week’s issue: Google posted an updated version of their Google Chat incident summary with the “confidential” language removed. They also updated the content at the original link.

June 15, 2020 T-Mobile Network Outage Report

T-Mobile, one of the main mobile phone carriers in the US, had a major outage earlier this year. This report is essentially a retrospective performed by the US FCC (Federal Communications Commission). The report details the satisfyingly complex interplay of contributing factors in the incident.

US Federal Communications Commission

Failing over without falling over

How can you be sure your failover plan will actually work? Hint: it’s almost certainly not going to work properly the first time you try it.

Adrian Cockcroft

3 Ways SRE Can Boost your Business Value

In this blog post, we’ll look at the business value of SRE through customer focus, observability, and efficiency.

Emily Arnott — Blameless

Building Netflix’s Distributed Tracing Infrastructure

Netflix has some interesting ideas around sampling, performance, and storage for their tracing system.

Maulik Pandey — Netflix

10 Days of Errors

Oh, I do love reading stories of systems failing in interesting ways. This first installment contains five of the 10.

Yoz Grahame — LaunchDarkly

Preparing for peak holiday shopping in 2020: War rooms go virtual

Black Friday is coming. Here are some ideas on how to deal with the rush — and how to analyze how you dealt with it when it’s over.

Nelly Wilson — Google

The Chaos Engineering Book

Two of my favorite authors/speakers have conspired to create a book on one of my favorite topics. Take my money! Oh wait, they’re giving it away, too?!

Nora Jones and Casey Rosenthal

Outages

Slack
- I missed this one from October 16 in last week’s issue.
Disney Plus

SRE Weekly Issue #240

lex

October 18, 2020

Articles

Google Cloud Issue Summary — Google Chat — 2020-09-17

This interesting post-incident analysis is marked as “Google Customer Confidential – Not for publication or distribution”, but Google linked it directly from their public status page. I normally would not include a seemingly “leaked” incident report like this, but in this case I think the “confidential” label is erroneous.

Google

40 milliseconds of latency that just would not go away

I keep re-learning and re-forgetting about TCP_NODELAY.

Rachel By the Bay
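
For readers who, like me, keep forgetting what the option looks like in practice: a delay of roughly 40 ms on small writes is commonly attributed to Nagle's algorithm interacting with delayed ACKs, and TCP_NODELAY turns Nagle's algorithm off for a socket. The snippet below is only a minimal sketch using Python's standard socket module. The helper name make_low_latency_socket is made up for illustration, and this is not necessarily the fix described in the linked post.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY asks the kernel to send small writes immediately
    # instead of holding them back to coalesce with later data.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
# Read the option back; a nonzero value means it is enabled.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

The trade-off is more, smaller packets on the wire, so this tends to suit request/response traffic rather than bulk transfers.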

“Manual” and “Automated” are just words

The distinction between the two is a lot more nuanced than it may seem. What are we really trying to say with those words?

Michael Nygard

Heroku incident #2110 follow-up

This incident from the week before last involved a Let’s Encrypt API rate limit.

Fixing Linux filesystem performance regressions

Don’t you hate when you’re minding your own business upgrading your OS, and you run smack into a kernel bug in the ext4fs code?

…ext4 performance on kernel versions above 4.5 and below 5.6 suffers severely in the presence of concurrent sequential I/O on rotating disks.

Ryan Underwood — LinkedIn

Identifying and protecting against the largest DDoS attacks

Google discusses DDoS attacks and how they deal with them, including a 2.5Tbps attack in 2017.

Damian Menscher — Google

How I Broke `git push heroku main`

I love these first-hand incident stories. This one is from an engineer at Heroku who was a contributing factor in an incident last month.

Damien Mathieu — Heroku (Salesforce)

Outages

BitBay
Twitter
- It definitely was not taken down purposefully to protect a US presidential election candidate.
TikTok
Crunchyroll
Instagram
Barnes and Noble
- Nook e-readers have experienced a days-long service disruption.
keepthescore
- Linked is their blog post, “We deleted the production database by accident”.
  Be sure to check out the HackerNews discussion about this article, too.
  
  Caspar — Keepthescore
FanDuel
- This incident seems to be ongoing, October 12 to present.

SRE Weekly Issue #239

lex

October 11, 2020

Articles

Respect your natural scaling limits

Don’t scale up farther than you need to! If you won’t ever see more than 100 RPS, don’t architect for 100,000.

Ayende Rahien

The Many Shapes of Site Reliability Engineering

This one covers several common patterns of SRE practice and then offers insight on what to look for as you design your own SRE team.

Rob Cummings — Slalom Build

Abstractions and implicit preconditions

Abstractions make us more productive, and, indeed, we humans can’t build complex systems without them. But we need to be able to peel away the abstraction layers when things go wrong, so we can discover the implicit precondition that’s been violated.

Lorin Hochstein

Keeping CALM: When Distributed Consistency Is Easy

Coordination between nodes in a distributed system can kill performance. What kinds of problems require coordination? The CALM theorem can tell us.

Joseph M. Hellerstein and Peter Alvaro — Communications of the ACM

The Ultimate, Free Incident Retrospective Template

Here’s another good post-incident analysis document template that you can use as inspiration for your own.

Hannah Culver — Blameless

4 Signs Software Reliability Should be Your Top Priority

As your product ages, it transitions from “cool new thing” to “tool everyone uses and expects to Just Work”. Your reliability needs will change accordingly.

Lyon Wong — Blameless

Outages

PagerDuty
- 95% of event submissions (your systems telling PagerDuty to trigger an alert) failed for about an hour. They posted some detail about what went wrong.
Slack
- Their latest update on this outage contains some detail about what went wrong.
Telegram
Microsoft Office 365
Coles Supermarkets
Adobe Creative Cloud
GitHub

SRE Weekly Issue #238

lex

October 4, 2020

My daughters asked earlier today what I do at work, and I explained all about SRE, reliability, and the importance of work-life balance. They said to tell you they say hi!

Articles

On Call Shouldn’t Suck: A Guide For Managers

Lots of really great advice in here. And really, with a title like that, I couldn’t resist reading it!

Charity Majors

Follow-up for Google Cloud Infrastructure Components Incident #20010

Last week, I mentioned a Google Cloud Platform outage that affected multiple services. Here’s the detailed post-analysis by Google.

Google

Team Play with a Powerful and Independent Agent: A Full-Mission Simulation Study

This one is along the lines of the classic Ironies of Automation paper by Bainbridge. How can automation be a team player, and what happens when it isn’t?

Nadine Sarter and David Woods (original paper)

Thai Wood — Resilience Roundup (summary)

Applying Chaos Engineering in Healthcare: Getting Started with Sensitive Workloads

How can you use chaos engineering when failures in the system can be critical and even life-threatening?

Carl Chesser — InfoQ

This is your Guide for Implementing SRE in NOCs

In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.

Emily Arnott — Blameless

Is your microservice a distributed monolith?

This article suggests using chaos engineering to tell if your microservice-based architecture is secretly a monolith in disguise.

Andre Newman — Gremlin

Outages

Slack
Radware
- An accidental BGP hijack by Telstra took down Radware.
Twitter
Tokyo Stock Exchange
- The Tokyo Stock Exchange was down for an entire day, the first time that’s ever happened.
Fastly
Squarespace
Google Search Indexing
Microsoft Azure outage #SM79-F88
- A problem with Azure Active Directory caused trouble for Office 365 and other Microsoft services. Click through for their detailed follow-up.

SRE Weekly Issue #237

lex

September 27, 2020

Articles

Postmortem — why Allegro went down

They fully expected their deep-discount sale to drive traffic, but they didn’t expect their system to handle the increase in the way that it did.

Michał Kosmulski — Allegro

Zero-Downtime Kubernetes Deployments

Pre-stop hooks, liveness probes, and readiness probes were key to smoothly transitioning their services from a home-grown container system to Kubernetes.

Oliver Leaver-Smith — Sky Betting & Gaming
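
To make the interplay between those three pieces a little more concrete, here is a minimal sketch of the state a service might keep so that its probe endpoints answer sensibly during shutdown. Everything in it is an assumption for illustration: the HealthState class, its method names, and the choice of 200/503 status codes are hypothetical and are not taken from the linked article.

```python
import threading


class HealthState:
    """Tracks what liveness and readiness probes should report (hypothetical)."""

    def __init__(self):
        self._draining = threading.Event()

    def begin_shutdown(self):
        # Intended to be called from a pre-stop hook or a SIGTERM handler;
        # that wiring is assumed and not shown here.
        self._draining.set()

    def liveness_status(self):
        # The process is still healthy while it drains, so the liveness
        # probe keeps passing and no restart is requested.
        return 200

    def readiness_status(self):
        # Failing readiness signals that new traffic should go elsewhere
        # while in-flight requests finish.
        return 503 if self._draining.is_set() else 200


state = HealthState()
ready_before = state.readiness_status()
state.begin_shutdown()
ready_after = state.readiness_status()
alive_after = state.liveness_status()
```

The point of keeping the two answers separate is that a draining instance should stop receiving new requests without being treated as broken.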

Feelings during incident response

The experience of responding to an incident can evoke emotions that run the gamut.

Mads Hartmann

Join SRE Classroom NALSD workshops

Google has released course materials for the first of a series of classes on NALSD (“non-abstract large systems design”). This first one is about a distributed Pub-Sub system.

Jenny Liao and Salim Virji — Google

Why you should write up your own incident

Usually, doing a post-analysis on an incident you were in is an anti-pattern because you’re likely to introduce bias. But sometimes, it can lead you to learn more than you would have otherwise.

Lorin Hochstein

Outages

Datadog
G Suite
Google Cloud Platform
Let’s Encrypt
- Google CT logs had an issue, impairing Let’s Encrypt’s ability to issue certificates.
Tesla
Apple
Reddit
Heroku
Connectivity Issues
Crypto.com (cryptocurrency exchange)
- The CEO says a database issue (nearly) opened up the possibility for arbitrage.
