SRE Weekly Issue #169

lex

April 21, 2019

Articles

Boeing’s crisis: AOPA safety expert weighs in

My coworker pointed me toward this article, and we had a really great conversation. I shared this article that I’d linked previously here, and it hit me: Boeing (and the FAA?) assumed MCAS was fine because a failure in it would look like a normal kind of failure with an established recovery procedure.

The problem is, we’ve seen that the recovery procedure can fail if the plane is moving so fast toward the ground that the pilots can’t physically pull it out of a dive. And it seems possible that no one knew that the recovery mechanism had this fatal vulnerability. This has all the hallmarks of a classic complex failure.

Thanks to John Goerzen for this one.

Richard McSpadden — AOPA

Colm MacCárthaigh on Twitter: Heartbleed

Pretty much any thread by Colm MacCárthaigh is a great read.

I think right around this minute is just about exactly 5 years since the Heartbleed vulnerability in OpenSSL became public. I remember the day vividly, and if you’re interested, allow me to tell you about how the day, and the subsequent months, and years unfolded …

Colm MacCárthaigh

A New Bee’s First Oncall

Find out why going on call made sense for a Developer Advocate and how it went.

Liz Fong-Jones — Honeycomb

Some internet outages predicted for the coming month as ‘768k Day’ approaches

As the BGP route table grows, some devices will soon run out of space to store it all.

Catalin Cimpanu
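
To make the 768k Day item concrete, here is a minimal sketch of the kind of headroom check an operator might run against their own routers. Everything in it is hypothetical: the device names, the per-device route limits (I use 768 × 1024 = 786,432 purely for illustration), and the 5% safety margin are assumptions, not figures from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FibCapacity:
    """A router and the number of IPv4 routes its forwarding table can hold."""
    name: str
    max_ipv4_routes: int  # hypothetical limit; check your vendor's documentation


def headroom(device: FibCapacity, current_routes: int) -> int:
    """Free route slots left on the device (negative means already over)."""
    return device.max_ipv4_routes - current_routes


def at_risk(devices, current_routes: int, margin: float = 0.05):
    """Names of devices whose free slots are below `margin` of their capacity."""
    return [
        d.name
        for d in devices
        if headroom(d, current_routes) < d.max_ipv4_routes * margin
    ]


if __name__ == "__main__":
    fleet = [
        FibCapacity("edge-old", 768 * 1024),   # assumed limit for illustration
        FibCapacity("edge-new", 1_000_000),    # assumed limit for illustration
    ]
    print(at_risk(fleet, current_routes=760_000))
```

The point of the margin is to raise the flag before the table is actually full, since the route table keeps growing.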

Minimising the Risk of Data Damage

Logical damage to the data in a DB is exactly the kind of risk that means there’s no such thing as a true rollback (see You Can’t Have a Rollback Button).

Benji Weber

Peering into the future of Resilience Engineering in Tech

Our field is evolving toward adopting resilience engineering, and it’s not an easy process. This post goes into some detail on the mental struggle and points in the direction we need to go to get there.

Will Gallego [Note: Will is my coworker]

Outages

Gmail Suffers Two-Hour Global Outage: Reports 04/18/2019
Google Oauth
- Seems like this may have effectively taken down Gmail.
Grindr
1&1 Ionos

SRE Weekly Issue #168

lex

April 14, 2019

Articles

How to Get Into SRE

This one’s great for folks that are new to SRE, and it’s also an enlightening read for seasoned SREs. What caught me most was the Definition section, on what it means to be an SRE.

Alice Goldfuss

Chaos Engineering Traps

In this articlization of a conference talk, the author lays out 8 common pitfalls in chaos engineering, with detailed example stories related to them. It goes much deeper than mere chaos engineering into the theory of how to operate complex systems.

Nora Jones

Ghosts in the machines

Automation can have unintended effects — and can tend to not have the effect we hope it will.

Thanks to Greg Burek for this one.

Courtney Nash

What SREs can learn from Aviation industry?

Having recently binge-watched Air Emergency, I felt that SREs can learn many things from the aviation industry.

Anshul Patel

Notes on running production code

Lessons learned by a software engineer on supporting their code in production.

Kashyap Kondamudi

The CASE Method: Better Monitoring For Humans

CASE stands for Context-heavy, Actionable, Symptom-based, and Evaluated. That last one’s really key. The author proposes setting an expiration time for your alerts after which time you should evaluate them to make sure that they still make sense.

Cory Watson
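
The "Evaluated" idea above can be sketched in a few lines. This is only an illustration of giving each alert a review deadline, not the author's tooling: the `AlertRule` type, the rule names, and the 90-day default interval are all made up for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class AlertRule:
    """An alert plus the bookkeeping needed to re-evaluate it periodically."""
    name: str
    last_evaluated: date
    review_interval_days: int = 90  # assumed default, pick what fits your team

    def review_due(self) -> date:
        """Date after which this alert should be looked at again."""
        return self.last_evaluated + timedelta(days=self.review_interval_days)


def due_for_evaluation(rules, today: date):
    """Rules whose review date has passed, most overdue first."""
    overdue = [r for r in rules if r.review_due() <= today]
    return sorted(overdue, key=AlertRule.review_due)


if __name__ == "__main__":
    rules = [
        AlertRule("high-error-rate", date(2019, 1, 1)),
        AlertRule("queue-latency", date(2019, 3, 15)),
    ]
    for rule in due_for_evaluation(rules, today=date(2019, 4, 14)):
        print(f"{rule.name}: review was due {rule.review_due()}")
```

A report like this could run weekly and open a ticket for each overdue alert, so that stale alerts get questioned instead of quietly paging people forever.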

Outages

Heroku: (EU) routing issues for ssl:endpoint applications
- Heroku posted this followup for an outage on April 2.
The Travis CI Blog: Incident review for slow booting Linux builds outage
- The outage happened March 27-28.
Azure VMs — North Central US
- Since deep-linking to Azure incident summaries doesn’t work and this one is especially interesting, I’ll quote it here:
  
  Azure Storage team made a configuration change on 9 April 2019 at 21:30 UTC to our back-end infrastructure in North Central US to improve performance and latency consistency for Azure Disks running inside Azure Virtual Machines. This change was designed to be transparent to customers. It was enabled following our normal deployment process, first to our test environment, and lower impact scale units before being rolled out to the North Central US region. However, this region hit bugs which impacted customer VM availability. Due to a bug, VM hosts were able to establish session with the storage scale unit but hit issues when trying to receive/send data from/to storage scale unit. This situation was designed to be handled with fallback to our existing data path, but an additional bug led to failure in the fallback path and resulted in VM reboots.
Facebook, Instagram, and WhatsApp

SRE Weekly Issue #167

lex

April 7, 2019

Articles

Conference Report: SRECon Americas 2019

This is an awesome write-up of SRECon, but the part I really love is the intro. The author gives voice to a growing tension I’ve seen in our field, as we try to adopt the tenets of Safety II which can seem to be at odds with traditional SRE practices. There’s a lot here that we SREs need to work out as our profession matures, and I’m really enjoying the process.

Tanya Reilly

Control Is an Illusion

Experts recommend trying to keep the concepts of blame, root cause, and hindsight bias out of our retrospective investigations. This insightful article explains that they all stem from the illusion that we are in full control of our systems.

Thanks to Will Gallego for this one.

Ryan Frantz

What We Learned from the Recent Mandrill Outage

Here’s a top-notch followup analysis from Mailchimp on the Mandrill outage last month. Their PostgreSQL DB ran out of transaction IDs (a common failure mode), causing a painful outage. Tons of great stuff here including a mention of rotating ICs every 3 hours to prevent exhaustion and allow them to sleep.

Mailchimp
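
For readers who haven't run into transaction ID exhaustion before, here is a rough sketch of how one might watch for it. PostgreSQL transaction IDs are 32-bit, so only about 2^31 of them can sit between the oldest unfrozen transaction and the current one. The sketch assumes you have already fetched the XID age yourself (for example with `SELECT datname, age(datfrozenxid) FROM pg_database`); the warning and critical fractions are my own illustrative thresholds, not values from Mailchimp's write-up or from PostgreSQL.

```python
# Roughly 2 billion XIDs are usable relative to the oldest unfrozen transaction.
XID_WRAPAROUND_LIMIT = 2**31


def xid_headroom(xid_age: int) -> int:
    """Transaction IDs left before the database reaches the wraparound limit."""
    return XID_WRAPAROUND_LIMIT - xid_age


def wraparound_status(
    xid_age: int,
    warn_fraction: float = 0.5,      # assumed threshold for illustration
    critical_fraction: float = 0.8,  # assumed threshold for illustration
) -> str:
    """Classify an XID age as 'ok', 'warn', or 'critical'."""
    used = xid_age / XID_WRAPAROUND_LIMIT
    if used >= critical_fraction:
        return "critical"
    if used >= warn_fraction:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # Hypothetical ages, as if returned by age(datfrozenxid) per database.
    for db, age in {"app": 200_000_000, "mail": 1_900_000_000}.items():
        print(db, wraparound_status(age), xid_headroom(age))
```

Alerting on the fraction used, rather than waiting for the database to complain, leaves time for vacuuming to catch up.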

Ethiopian crash: Boeing 737 Max pilots followed expected procedures, aviation officials say

And here’s where things get really interesting. Incidents are never as simple as they seem from the outside, and the 737 MAX situation is no exception. I anxiously await the full report, in which we’ll hear more about the confluence of contributing factors that must have been involved here.

Thom Patterson — CNN

The true story behind the deadliest air disaster of all time

There’s a lot in this, and I don’t feel comfortable summarizing it with a little blurb about lessons learned. Chilling though it is, I’m glad I read it.

Thanks to Sri Ray for this one.

Patrick Smith — The Telegraph

Production ready code is much more than error handling – Ayende @ Rahien

I consider a system to be production ready when it has, not error handling inside a particular component, but actual dedicated components related to failure handling (note the difference from error handling), management of failures and its mitigations.

Ayende Rahien

Outages

Travis CI
Slack
- And this one.
Google Cloud Platform (us-central1)
Heroku
Instagram
Squarespace
- Click for another A+ followup analysis from Squarespace. Thanks, folks!

SRE Weekly Issue #166

lex

March 31, 2019

SRECon was amazing! The talk line-up was mind-blowing, and it was great to meet many of you there. A big thanks to all the speakers for making this one a conference to remember.

Articles

OnCall of Duty

One of my favorite moments of SRECon: during their talk, Dorothy Jung and Wenting Wang unveiled this choose-your-own-adventure-style game for practicing your incident response skills. See if you can resolve the incident before your stress level gets too high!

Chie Shu, Dorothy Jung, Joel Salas, Dennis So, Sam Faber-Manning, and Wenting Wang — Yelp

SRECon19 Americas interesting tidbits

Last week was only the second SRECon I’ve managed to attend. Rather than post raw notes from all the talks I attended, I tried something different: I only wrote down the really big stuff that made me think or blew my mind. I’m hoping that just reading this might give those of you that weren’t able to attend a taste of the conference.

Lex Neva

John Allspaw on Twitter: HABA/MABA

Inspired by SRECon, John Allspaw posted this Twitter thread on the “Humans Are Better At” / “Machines Are Better At” concept.

Who will argue with “make the computers do the easy/tedious stuff so humans can do the difficult/interesting stuff”? (apparently, I will)

John Allspaw

Lion Air 737 MAX crew had seconds to react, Boeing simulation finds

This article goes into what the pilots of the Lion Air 737 Max 8 (and presumably the Ethiopian Airlines one as well) would have had to do in order to regain control over the aircraft. We’re starting to get hints of the task saturation and alert overload both sets of pilots may have faced as they tried to handle the situation:

The Lion Air crew would have had to accomplish this while dealing with a host of alerts, including differences in other sensor data between the pilot and co-pilot positions that made it unclear what the aircraft’s altitude was.

Thanks to Courtney Eckhardt for this one.

Sean Gallagher — Ars Technica

Pilot Who Hitched a Ride Saved Lion Air 737 Day Before Deadly Crash

The day before Lion Air’s 737 Max 8 crash last fall, the exact same plane had a similar failure to the one that may have taken that plane down the next day.

Thanks to Courtney Eckhardt for this one.

Alan Levin and Harry Suhartono — Bloomberg

Calvin: fast distributed transactions for partitioned database systems

Calvin is interesting for (at least) two reasons: first, it’s designed to work with an existing database, and second, it manages an impressively fast transaction throughput rate.

Adrian Colyer (summary) — The Morning Paper

Thomson et al. (original paper)

Monitoring distributed systems means first, do no harm

This article draws an interesting parallel between two talks at SRECon last week, about making sure that your monitoring doesn’t itself cause incidents.

Beth Pariseau — TechTarget

Outages

Hosted Graphite
Sabre
- Sabre provides the infrastructure behind several airlines, and this outage affected customers that were traveling.
American Express
Santander (bank)
Reddit
Ionos/1&1 (hosting provider)
Canada Revenue Agency

SRE Weekly Issue #165

lex

March 24, 2019

As I write this, I’m headed to New York City for SRECon19 Americas, and I can’t wait! If you’re there, come hit me up for some SRE Weekly swag, made using open source software.

Articles

The Complaints Pilots Filed About Boeing’s 737 Max

As we discover more about the Boeing 737 MAX accidents, this author trawled through the ASRS database looking for related complaints.

Thanks to Greg Burek for this one.

James Fallows — The Atlantic

40 Years of Safer Aviation Through Reporting

Learn about ASRS, the Aviation Safety Reporting System. Pilots and other aviation crew can report concerns anonymously, and the results are summarized regularly and reported to the FAA, NTSB, and other organizations.

Thanks to Greg Burek for this one.

Jerry Colen — NASA

Boeing 737 rudder issues

I caught wind of a previous Boeing 737 issue from the 90s during a personal conversation this week. There’s an interesting parallel to the current 737 MAX issue, as Boeing blamed pilots for incorrectly responding to a “normal” flight incident for which pilots are routinely trained.

Various — Wikipedia

Deploying a pager-free sleep period

Dr Justine Jordan gives a personal account of how on-duty napping during extended overnight in-hospital duty hours as a trainee doctor eased her fatigue levels and raised her state of alertness.

Dr. Justine Jordan — Irish Medical Times

Designing resilient systems beyond retries (Part 1): Rate-Limiting

Circuit breakers are great, but they depend on every client being configured correctly. A server-side rate-limiting solution is more robust.

Michael Cartmell — Grab
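
As a rough illustration of what server-side rate limiting can look like, here is a small token-bucket sketch with one bucket per client. This is a swapped-in textbook technique, not Grab's implementation: the class names, the per-client keying, and the rate and capacity values are all assumptions for the example.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so initial bursts succeed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; otherwise reject the request."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class RateLimiter:
    """Server-side limiter that keeps an independent bucket per client ID."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets = {}

    def allow(self, client_id: str) -> bool:
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[client_id] = bucket
        return bucket.allow()


if __name__ == "__main__":
    limiter = RateLimiter(rate=5.0, capacity=10)  # hypothetical limits
    accepted = sum(limiter.allow("client-a") for _ in range(20))
    print(f"accepted {accepted} of 20 burst requests")
```

Because the server enforces the limit itself, a misconfigured or misbehaving client only exhausts its own bucket instead of degrading the service for everyone.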

Authorization at LinkedIn’s Scale

The concept of an ACL-based authorization system is simple enough, but can be a challenge to maintain at scale.

Michael Leong — LinkedIn

Isolating Root Cause: March 13th Facebook Outage

We can tell one thing from the outside: it wasn’t a BGP issue.

Alec Pinkham — AppNeta

Outages

Google BigQuery
MySpace
- Social networking company Myspace has apologized for apparently losing 12 years’ worth of music uploaded to its site, following a server migration error — a loss potentially amounting to 50 million songs.
  
  Matthew Robinson — CNN
Fastly
- Fastly suffered a series of incidents in South America. Full disclosure: Fastly is my employer.
iVote registration system (New South Wales, AU)
- The voter registration system failed on the eve of the election.
Basecamp
- Basecamp suffered a major outage, hot on the heels of another outage a couple weeks back. Thanks to Greg Burek for this one.
  DHH — Basecamp
British Government Petition Site Crashes As People Demand Brexit Cancelation
LinkedIn
Spark (medical alert system)
