SRE WEEKLY – Page 45 – scalability, availability, incident response, automation

SRE Weekly Issue #266

lex

April 18, 2021

General

Comments

View on sreweekly.com

Articles

Airplane takes off a metric ton heavier than expected after computer error weighs adults as children

This one was brought to my attention by Dr. Richard Cook, who also pointed me to the AAIB incident report.

Dr. Cook went on to share these insights with me, which I’ve copied here with permission:

Note:

the subtle interactions allowed the manual correction to be lost during the interval between recognizing the software problem and having the corrected software functionally ‘catch’ the Ms/Miss title mixup;

the incident is attributed to “a simple flaw in the programming of the IT system” rather than failure of the workarounds that were put in place after the problem was recognized;

the report is careful to demonstrate that the flaws in the system made only a slight difference to the flight parameters;

the report does not describe any IT process changes whatsoever!

The report has the effect of making the incident appear to be an unfortunate series of occurrences rather than being emblematic of the way that these sorts of processes are vulnerable.

Catchpoint Announces Virtual SRE Community Event on June 10

Last year’s SRE From Home event was awesome, and this year’s iteration looks to be just as great.

Catchpoint

The Case of the Connection Timeout

This is fun! Try your hand at troubleshooting a connection issue in this game-ified role-play scenario.

BONUS CONTENT: Read about the author’s motivations, design decisions, and plans here.

Julia Evans

The Five Pillars of Resilience Engineering

Do we need to have some kind of Pillars Registry? Note, these are more like pillars of high availability than resilience engineering.

Hector Aguilar — Okta

Incident analysis as guerrilla case study research

I love this idea that we’re trying to get deep incident analysis done even though that may not be the actual goal of the organization.

As LFI analysts, we’re exploiting this desire for closure to justify spending time examining how work is really done inside of the system.

Lorin Hochstein

Having On-call Nightmares? Runbooks can Help you Wake Up.

This is well worth a read if only for the on-call scenario at the start. Yup, been there. We miss you, Harry.

Harry Hull — Blameless

Platform engineering vs. site reliability engineering (SRE): here’s what you need to know

What’s the difference? Click through to learn about the distinction they’re drawing.

Amir Kazemi — effx

We Don’t Get Bitter, We Get Better

The New York Times’s Operations Engineering group developed an Operational Maturity Assessment and uses it to have collaborative conversations with teams about their systems.

Authro: The NYT Open Team — New York Times

Outages

G-Suite
- Google posted this “Mini Incident Report while full Incident Report is prepared.”
Slack
Docker Hub
Robinhood
Twitter
Elevated CDN Errors
Heroku
- Heroku had a series of incidents this week (1, 2, 3, 4).

SRE Weekly Issue #265

lex

April 11, 2021

General

Comments

View on sreweekly.com

Articles

Insights into a Product SRE team at LinkedIn

Here’s a great look into how LinkedIn’s embedded SREs work.

[…] the mission for Product SRE is to “engineer and drive product reliability by influencing architecture, providing tools, and enhancing observability.”

Zaina Afoulki and Lakshmi Namboori — LinkedIn

DNS propagation does not exist

It’s all just other people’s caches.

Ruurtjan Pul

Advice for someone moving from SRE to backend engineering

Recently there was a Reddit post asking for advice about moving from Site Reliability Engineering to Backend Eng. I started writing a response to it, the response got long, and so I turned it into a blog post.

Charles Cary — Shoreline

The Mightiest Monolith

This is the first in a series about lessons SREs can learn from the space shuttle program. The author likens earlier spacecraft to microservices and the Shuttle to a monolith.

Robert Barron

The 5 characteristics of high reliability organizations

This article is ostensibly about Emergency Medical Services (EMS), but as is so often the case, it’s directly applicable to SRE. The 5 characteristics are enlightening, and so is the fictitious anecdote about an EMT rattled from a previous incident.

Ems1

How we scaled the GitHub API with a sharded, replicated rate limiter in Redis

Simple solution meets reality. I like how we get to see what they did when things didn’t quite work out as they were hoping.

Robert Mosolgo — GitHub

GitHub Availability Report: March 2021

They did the work to convert a database column to a 64-bit integer before it was too late. Unfortunately, one of their library dependencies didn’t use 64-bit integers.

Keith Ballinger — GitHub

Learning from incidents: getting Sidekiq ready to serve a billion jobs

In this post, I’ll walk you through one of our first ever Sidekiq incidents and how we improved our Sidekiq implementation as a result of this incident.

Nakul Pathak — Scribd

Outages

Let’s Encrypt
Uber
Multiple Airlines’ Online Booking Sites
- An error in Google’s flight information service caused problems at multiple sites that consume it.
Tinder
BBC Website
Facebook, Instagram, and WhatsApp
Stellar.org (cryptocurrency)
WazirX (cryptocurrency exchange)
Microsoft Azure and other services
- Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure.

SRE Weekly Issue #264

lex

April 4, 2021

General

Comments

View on sreweekly.com

Articles

Balancing act: the current limits of AWS network load balancers

This well-researched article caught me by surprise. It’s shocking that Ably received advice from AWS to stay under 400,000 simultaneous connections, despite Amazon’s own documentation stating support for “millions of connections per second”.

Paddy Byers — Ably

A Journey Into SRE

This blog is about how a group of hard-working individuals, with unique skills and working methods, managed to create a successful SRE team.

There’s a lot of detail about what their SREs do and how they communicate, with 3 projects as case studies.

Sergio Galvan — Algolia

March 2 incident update

This is an incident followup from an incident at Deno earlier this year. Their CDN saw their heavy use of .ts files (TypeScript, a JavaScript variant) and mistakenly assumed they were MPEG transport segments, a violation of the CDN’s ToS. Oops.

Luca Casonato — Deno

Kubernetes Supports Nine Pillars of SRE

Wait, there are 9 now?

Marc Hornbeek — Container Journal

Frequently Asked Questions on Deviations

There’s a nice little discussion of why “human error” is not a good enough answer for why a deviation (from standard operating procedure) happened.

Susan J. Schniepp and Steven J. Lynn — Pharmaceutical Technolog

How To Get Fooled By Metrics

They deployed an optimization that skipped sending some requests to the backend… and the backend metrics got worse. Why? Hint: aggregate metrics.

Dominik Sandjaja — Trivago

Outages

Twitter
National Weather Service (US)
reddit
Squarespace.com
- Squarespace.com itself, but not user sites.
Microsoft 365

SRE Weekly Issue #263

lex

March 28, 2021

General

Comments

View on sreweekly.com

Articles

[Increment: Reliability] Tracing a path to observability

They make a really clear case for why traditional metrics and monitoring couldn’t help them solve their problems.

Mads Hartmann

Glynn Lunney — SRE Leadership

This article commemorates the death of NASA flight director Glynn Lunney by showing the SRE lessons we can learn from him.

Robert Barron

7 top Site Reliability Engineer (SRE) job interview questions

I like that this focuses on human factors.

Kevin Casey

How to Scale for Reliability and Trust

Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency.

Blameless

Engineering Failover Handling in Uber’s Mobile Networking Infrastructure

Uber’s customers are especially likely to be moving around and going in and out of tunnels, losing connectivity along the way. That means it’s difficult to tell when the client should fail over to a different server.

Sivabalan Narayanan, Rajesh Mahindra, and Christopher Francis — Uber

Incident review: Service outage on 25 October 2020

Here’s one I missed from last November. Some good stuff to learn from, especially if you run Vault on kubernetes.

This outage was caused by a cascading failure stemming from our secrets management engine, which is a dependency of almost all of the production GoCardless services.

Ben Wheatley — GoCardless

Outages

Gmail and a ton of other Android apps
- This one’s kind of weird. Google presented it as a Gmail outage, but it’s actually a problem with the Android system webview component. Tons of apps were crashing.
MangaDex
Canvas
Instagram

SRE Weekly Issue #262

lex

March 21, 2021

General

Comments

View on sreweekly.com

Articles

The Prerequisites for Chaos Engineering

Chaos Engineering isn’t adding chaos to your systems—it’s seeing the chaos that already exists in your systems.

Along with four prerequisites, this article also includes 3 myths about chaos engineering that might be making you feel hesitant about starting.

Courtney Nash — Verica

Managing On-Call in a Pandemic

This one’s from May of last year. Almost a year on, it’s interesting to see which of these we’ve already implemented.

Ashley Roof — Transposit

Being Just Reliable Enough

An amusing parable illustrating why not to try to be too reliable.

Andrew Ford — Indeed

Google debunks Russian claims that fire was connected to service outage

In the Outages section of last week’s issue, you’ll find two unrelated events referenced in this article: one about Russian internet censorship gone awry and another about a major datacenter fire.

Eric Johansson — Verdict

How to Analyze Contributing Factors Blamelessly

Along with what’s in the title, this article also covers the difference between an RCA and a contributing factors analysis.

Emily Arnott — Blameless

Rethinking site capacity projections with Capacity Analyzer

Lots of detail on how LinkedIn is improving their traffic forecasts. Warning/enticement: math contained within.

Deepanshu Mehndiratta — LinkedIn

Testing in Production for Safety and Sanity

Everyone is testing in production, some organizations admit and plan for it.

How to do it right, what can happen if it goes wrong, and how to limit the blast radius.

Heidi Waterhouse — LaunchDarkly

How we found and fixed a rare race condition in our session handling

Remember when GitHub logged you out? Ah, I remember it like it was last week. I mean, the week before. Here’s GitHub’s troubleshooting story about what went wrong.

Dirkjan Bussink — GitHub

Outages

Google Cloud Platform
- GCP had a major multi-region networking issue, due to a routing glitch. Click through for their followup post.
US National Oceanic and Atmospheric Administration (NOAA)
- This outage impaired NOAA’s tsunami early warning system.
Facebook, Instagram, and WhatsApp
TikTok
Elevated error rates
Microsoft Teams and other services
- Click through for a highly detailed description of what went wrong. I can’t link directly to the incident in question, so you’ll have to scroll down to 3/15.

SRE Weekly Issue #266

Articles

Outages

SRE Weekly Issue #265

Articles

Outages

SRE Weekly Issue #264

Articles

Outages

SRE Weekly Issue #263

Articles

Outages

SRE Weekly Issue #262

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, StackHawk:

Articles

Outages

A message from our sponsor, StackHawk:

Articles

Outages

A message from our sponsor, StackHawk:

Articles

Outages

A message from our sponsor, StackHawk:

Articles

Outages

A message from our sponsor, StackHawk:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues