SRE WEEKLY – Page 50 – scalability, availability, incident response, automation

SRE Weekly Issue #269

lex

May 9, 2021

Articles

Edgar: Solving Mysteries Faster with Observability

We built Edgar to ease this burden, by empowering our users to troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata.

Kevin Lew, Maulik Pandey, Narayanan Arunachalam, Dustin Haffner, Andrei Ushakov, Seth Katz, Greg Burrell, Ram Vaithilingam, Mike Smith and Elizabeth Carretto — Netflix

The Comprehensive Site Reliability Engineering (SRE) PDF

The PDF covers 5 main areas:

Availability
Performance
Monitoring
Incident Response
Preparation

No account required or form to fill out to download the PDF.

Splunk/VictorOps

What are MTTx Metrics Good For? Let’s Find Out.

This one’s especially interesting for the section about what MTTx metrics aren’t good for, and the following section on how to improve them.

Emily Arnott — Blameless

Resiliency and Disaster Recovery with Kafka

If you’re interested in deploying Kafka in a multi-region configuration, eBay has put quite a bit of thought into this and has a lot to share.

Engin Yoeyen — eBay

What Chaos Engineering Is (and Isn’t)

Straight from someone who was there from the start. The “what chaos engineering is not” section is especially enlightening.

Casey Rosenthal — Verica

Heroku incident #2226 follow-up: Private Space apps experiencing domain to SSL cert mapping errors

The last paragraph regarding “unknown unknowns” is noteworthy.

Heroku

Failover Conf follow-up: Your team and culture questions answered!

There are some great questions in here on blamelessness and full service ownership.

James Thigpen — Gremlin

Outages

Google Cloud Platform us-west2 region
- They posted a detailed follow-up at the above link.
TikTok
Network Solutions and Register.com
Singapore Exchange (SGX)
reddit
Parler

SRE Weekly Issue #268

lex

May 2, 2021

Articles

Manageable On-Call for Companies without Money Printers

The SRE book has a chapter covering on-call, but it’s best suited for huge-scale companies. What should the rest of us do?

Utsav Shah

Breaking the top five myths around chaos engineering

If you’re feeling hesitant about chaos engineering, or you’re trying to convince someone who is, this might be useful. The myths are:

Myth #1: Chaos engineering is testing in production
Myth #2: Chaos engineering is about randomly breaking things
Myth #3: Chaos engineering is only for large, modern distributed systems
Myth #4: We don’t need more chaos – we already have plenty!
Myth #5: Chaos engineering is only for very mature teams/products

Mikolaj Pawlikowski

Seeing Like an SRE: Site Reliability Engineering as High Modernism

Drawing parallels to the high modernism movement during the Cold War, this article raises interesting questions about the direction SRE is going, and system administration in general.

Laura Nolan — USENIX

Is faster actually safer? How software physics beats human psychology

Riffing off of a tweet by Charity Majors, this article explores the idea that moving faster can actually be safer, despite an urge one may feel to slow down.

Bruce Johnston

NTSB Aircraft Accident Report: Eastern Air Lines, May 5, 1983

An extreme oversimplification of this incident would be: multiple engine failure on a plane subsequent to a maintenance error on all engines. This accident is cited as a reason to have separate mechanics work on each engine, in hopes of avoiding duplicated errors.

US National Transportation Safety Board (multiple authors)

How we ship code faster and safer with feature flags

[…] in order to ship new features and improvements faster while lowering the risk in our deployments, we have a simple but powerful tool: feature flags.

Alberto Gimeno — GitHub
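
For readers new to the idea, here is a minimal sketch of a percentage-based feature flag. It is not GitHub's implementation; the function names (`bucket`, `is_enabled`), the flag name, and the hashing scheme are all assumptions chosen for illustration. Each actor hashes to a stable bucket from 0 to 99, so raising the rollout percentage only ever adds actors and never flips anyone back off.

```python
import hashlib


def bucket(flag: str, actor_id: str) -> int:
    """Map a (flag, actor) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{actor_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, actor_id: str, rollout_percent: int) -> bool:
    """Return True if this actor falls inside the rollout percentage."""
    return bucket(flag, actor_id) < rollout_percent


# Hypothetical usage: gate a new code path behind a 10% rollout.
if is_enabled("new_checkout", "user-42", rollout_percent=10):
    pass  # new code path would go here
else:
    pass  # existing code path stays the default
```

Setting the percentage to 0 acts as a kill switch without a deploy, which is most of the risk reduction the article describes.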

Reverse debugging at scale

This one blew my mind. By recording instruction execution traces in a ring buffer, they’re able to reconstruct enough information to step through the execution leading up to a crash — even though they weren’t running the application under a debugger!

Walter Erquinigo, David Carrillo-Cisneros, Alston Tang — Facebook
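
The ring buffer is the piece that makes always-on recording affordable: it keeps only the most recent events and silently discards the oldest. The toy below illustrates just that idea, assuming a hypothetical `TraceRing` class that stores strings; the system in the article records instruction execution traces, which this sketch does not attempt to model.

```python
from collections import deque


class TraceRing:
    """Fixed-size buffer that retains only the most recent events."""

    def __init__(self, capacity: int):
        # A deque with maxlen drops the oldest entry once it is full.
        self._events = deque(maxlen=capacity)

    def record(self, event: str) -> None:
        self._events.append(event)

    def snapshot(self) -> list:
        """Return the retained events, oldest first, e.g. at crash time."""
        return list(self._events)


ring = TraceRing(capacity=3)
for step in range(5):
    ring.record(f"step-{step}")

# Only the last three events survive to be inspected after the fact.
print(ring.snapshot())
```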

The Plane Paradox: More Automation Should Mean More Training

Automation is supposed to take some of the load off of the human operator, right? But in reality, humans need to build a mental model of what the automation is doing in order to use it safely and effectively.

Shem Malmquist — WIRED

Outages

SRE Weekly Issue #267

lex

April 25, 2021

Articles

SRE Case Study: Mysterious Traffic Imbalance

Yet more proof that DNS behavior varies way more than is obvious at first glance. Who the heck thought longest common prefix matching was a good idea?

Charles Li — eBay

Fast and flexible observability with canonical log lines

The application may log multiple lines during the lifecycle of a request. Stripe has found it invaluable to also log one final line with a full summary of the request.

Brandur Leach — Stripe
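
As a rough sketch of the canonical log line idea (not Stripe's code), the class below collects fields while a request is handled and renders them as one `key=value` line at the end. The class name, the field names, and the `canonical-log-line` prefix are assumptions for illustration; the clock is injectable only so the example is easy to test.

```python
import time


class CanonicalLogLine:
    """Accumulates fields during a request and renders one summary line."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self._fields = {}

    def set(self, key, value):
        """Record a field; later values for the same key overwrite earlier ones."""
        self._fields[key] = value

    def render(self):
        """Build the single summary line, adding the elapsed time in milliseconds."""
        fields = dict(self._fields)
        fields["duration_ms"] = round((self._clock() - self._start) * 1000, 1)
        pairs = " ".join(f"{key}={value}" for key, value in fields.items())
        return f"canonical-log-line {pairs}"


# Hypothetical request handling: different layers add what they know.
line = CanonicalLogLine()
line.set("http_method", "GET")
line.set("http_path", "/v1/charges")
line.set("http_status", 200)
print(line.render())
```

Because every request produces exactly one line with the same keys, questions like "which paths returned errors in the last hour" become a single filter rather than a join across scattered log lines.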

Google Incident Report — April 12, 2021

This is a followup with more detail on the G-Suite outage I reported here last week. A database issue caused two separate outages.

Google

The top 3 mistakes companies make with SLOs, SLAs, and SLIs

Really great advice about 3 common pitfalls in implementing SL*s.

Cortex

Going solid: a model of system dynamics and consequences for patient safety – Resilience Roundup

This research paper explores the marginal boundary, a set of conditions beyond which a system enters a different operating mode and an accident is much more likely. It discusses the concept of coupling between seemingly unrelated parts of the system and shows how economic incentives can push a system toward this boundary.

Dr. Richard Cook and Jens Rasmussen (Original paper)

Thai Wood — Resilience Roundup (summary)

Vodafone Idea BGP Leak – Global Routing System Must Implement MANRS

This is an analysis of a recent BGP leak with a discussion about how the impact from such events can be mitigated through emerging best practices.

Alessandro Improta and Luca Sani — Catchpoint

How to Successfully Hand Over Systems

How do you hand over ownership of a system, transferring enough knowledge that the new owners can maintain its availability and reliability successfully?

Aleksandra Gavrilovska — SoundCloud

Resiliency Planning for High-Traffic Events

Shopify works toward Black Friday / Cyber Monday all year long, through a combination of load testing, failure mode analysis, game days, and incident analysis.

Ryan McIlmoyl — Shopify

Outages

Microsoft Azure web portal
Microsoft 365
Discord
google.com.ar
- This one’s interesting. A random person was able to buy the domain name google.com.ar, despite the fact that its registration had not expired.

SRE Weekly Issue #266

lex

April 18, 2021

Articles

Airplane takes off a metric ton heavier than expected after computer error weighs adults as children

This one was brought to my attention by Dr. Richard Cook, who also pointed me to the AAIB incident report.

Dr. Cook went on to share these insights with me, which I’ve copied here with permission:

Note:

the subtle interactions allowed the manual correction to be lost during the interval between recognizing the software problem and having the corrected software functionally ‘catch’ the Ms/Miss title mixup;

the incident is attributed to “a simple flaw in the programming of the IT system” rather than failure of the workarounds that were put in place after the problem was recognized;

the report is careful to demonstrate that the flaws in the system made only a slight difference to the flight parameters;

the report does not describe any IT process changes whatsoever!

The report has the effect of making the incident appear to be an unfortunate series of occurrences rather than being emblematic of the way that these sorts of processes are vulnerable.

Catchpoint Announces Virtual SRE Community Event on June 10

Last year’s SRE From Home event was awesome, and this year’s iteration looks to be just as great.

Catchpoint

The Case of the Connection Timeout

This is fun! Try your hand at troubleshooting a connection issue in this game-ified role-play scenario.

BONUS CONTENT: Read about the author’s motivations, design decisions, and plans here.

Julia Evans

The Five Pillars of Resilience Engineering

Do we need to have some kind of Pillars Registry? Note, these are more like pillars of high availability than resilience engineering.

Hector Aguilar — Okta

Incident analysis as guerrilla case study research

I love this idea that we’re trying to get deep incident analysis done even though that may not be the actual goal of the organization.

As LFI analysts, we’re exploiting this desire for closure to justify spending time examining how work is really done inside of the system.

Lorin Hochstein

Having On-call Nightmares? Runbooks can Help you Wake Up.

This is well worth a read if only for the on-call scenario at the start. Yup, been there. We miss you, Harry.

Harry Hull — Blameless

Platform engineering vs. site reliability engineering (SRE): here’s what you need to know

What’s the difference? Click through to learn about the distinction they’re drawing.

Amir Kazemi — effx

We Don’t Get Bitter, We Get Better

The New York Times’s Operations Engineering group developed an Operational Maturity Assessment and uses it to have collaborative conversations with teams about their systems.

The NYT Open Team — New York Times

Outages

G-Suite
- Google posted this “Mini Incident Report while full Incident Report is prepared.”
Slack
Docker Hub
Robinhood
Twitter
Elevated CDN Errors
Heroku
- Heroku had a series of incidents this week (1, 2, 3, 4).

SRE Weekly Issue #265

lex

April 11, 2021

Articles

Insights into a Product SRE team at LinkedIn

Here’s a great look into how LinkedIn’s embedded SREs work.

[…] the mission for Product SRE is to “engineer and drive product reliability by influencing architecture, providing tools, and enhancing observability.”

Zaina Afoulki and Lakshmi Namboori — LinkedIn

DNS propagation does not exist

It’s all just other people’s caches.

Ruurtjan Pul

Advice for someone moving from SRE to backend engineering

Recently there was a Reddit post asking for advice about moving from Site Reliability Engineering to Backend Eng. I started writing a response to it, the response got long, and so I turned it into a blog post.

Charles Cary — Shoreline

The Mightiest Monolith

This is the first in a series about lessons SREs can learn from the space shuttle program. The author likens earlier spacecraft to microservices and the Shuttle to a monolith.

Robert Barron

The 5 characteristics of high reliability organizations

This article is ostensibly about Emergency Medical Services (EMS), but as is so often the case, it’s directly applicable to SRE. The 5 characteristics are enlightening, and so is the fictitious anecdote about an EMT rattled from a previous incident.

EMS1

How we scaled the GitHub API with a sharded, replicated rate limiter in Redis

Simple solution meets reality. I like how we get to see what they did when things didn’t quite work out as they were hoping.

Robert Mosolgo — GitHub

GitHub Availability Report: March 2021

They did the work to convert a database column to a 64-bit integer before it was too late. Unfortunately, one of their library dependencies didn’t use 64-bit integers.

Keith Ballinger — GitHub
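
To make the failure mode concrete, here is a small sketch of what happens when an ID that fits in a 64-bit column passes through code that still assumes a signed 32-bit integer. It is an illustration only, not a reconstruction of GitHub's dependency; `fits_signed` is a hypothetical helper, and Python's `struct` module stands in for any fixed-width serialization layer.

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit value


def fits_signed(value: int, fmt: str) -> bool:
    """Return True if value can be packed with the given struct format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False


next_id = INT32_MAX + 1

# "<i" is a signed 32-bit integer; "<q" is a signed 64-bit integer.
print(fits_signed(INT32_MAX, "<i"))  # the last ID a 32-bit field can hold
print(fits_signed(next_id, "<i"))    # one more no longer fits in 32 bits
print(fits_signed(next_id, "<q"))    # but fits comfortably in 64 bits
```

The column migration handles the third case; every library between the database and the caller also has to stop assuming the first.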

Learning from incidents: getting Sidekiq ready to serve a billion jobs

In this post, I’ll walk you through one of our first ever Sidekiq incidents and how we improved our Sidekiq implementation as a result of this incident.

Nakul Pathak — Scribd

Outages

Let’s Encrypt
Uber
Multiple Airlines’ Online Booking Sites
- An error in Google’s flight information service caused problems at multiple sites that consume it.
Tinder
BBC Website
Facebook, Instagram, and WhatsApp
Stellar.org (cryptocurrency)
WazirX (cryptocurrency exchange)
Microsoft Azure and other services
- Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure.
