SRE Weekly Issue #146

A message from our sponsor, VictorOps:

Automation can be used to help classify incident severity and route alerts to the right person or team. Learn how SRE teams are leveraging a refined incident classification and alert routing process to improve system reliability:

http://try.victorops.com/sreweekly/classifying-incident-severity

Articles

NRE Labs is a no-strings-attached, community-centered initiative to bring the skills of automation within reach for everyone. Through short, simple exercises, all right here in the browser, you can learn the tools, skills, and processes that will put you on the path to becoming a Network Reliability Engineer.

Tips on designing your on-call to be fair to the humans involved, including gems like an automatic day off after a middle-of-the-night page.

David Mytton — StackPath

GitHub’s major outage stemmed from a brief cut in connectivity between two of their data centers.

Errata: Last week I mentioned the possibility of a network cut and cited an article about GitHub’s database architecture. I should have credited @dbaops, who made the connection.

Rumors of undocumented packet rate limits in EC2 abound, and I’ve personally run afoul of them. Backed by direct experimentation, this article unmasks the limits.

Matthew Barlocker — Blue Matador
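
If you’re curious to poke at this yourself, here’s a rough sketch of one way to probe for a packets-per-second ceiling (not the article’s actual methodology, and the peer address is a placeholder): blast minimal UDP packets at a neighbor and watch for the send rate to plateau.

```python
import socket
import time

# A crude probe for a packet-rate ceiling: send tiny UDP packets as fast
# as possible and report the achieved rate each second. A hard plateau
# well below line rate hints at a per-instance limit.
# The target address is a placeholder for a peer in the same VPC.
TARGET = ("10.0.0.2", 9999)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = b"x"  # minimal payload: we care about packets, not bandwidth

window_start, sent = time.monotonic(), 0
while True:
    sock.sendto(payload, TARGET)
    sent += 1
    now = time.monotonic()
    if now - window_start >= 1.0:
        print(f"{sent} packets/sec")
        window_start, sent = now, 0
```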

This sounds an awful lot like those packet rate limits from the previous article…

Chris McFadden — SparkPost

Ever hear of that traffic intersection where they took out all of the signs, and suddenly everyone drove more safely? Woolworths tried a similar experiment with their stores, with interesting results.

Sidney Dekker — Safety Differently

Find out how they discovered the bug and what they did about it. Required reading if you use gRPC, since in some cases it fails to obey timeouts.

Ciaran Gaffney and Fran Garcia — Hosted Graphite
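
For context, setting a deadline on a call with the Python gRPC client looks like the sketch below (the `EchoStub`/`Echo` service is hypothetical); the article’s finding is that you can’t always count on the deadline being honored, so treat it as advisory rather than a guarantee.

```python
import grpc

# Hypothetical generated stub, for illustration only:
# from echo_pb2 import EchoRequest
# from echo_pb2_grpc import EchoStub

def echo_with_deadline(stub, request):
    try:
        # `timeout` sets the gRPC deadline for this call, in seconds.
        return stub.Echo(request, timeout=0.5)
    except grpc.RpcError as err:
        if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
            return None  # fail fast instead of hanging the caller
        raise
```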

when we sit with a team to plan the experiment, that is when the light goes on… they start realising how many things they missed and they start cataloging what bad things could happen if something goes bad…

Russ Miles — ChaosIQ

Outages

SRE Weekly Issue #145

A message from our sponsor, VictorOps:

When SRE teams track incident management KPIs and benchmarks, they can better optimize the way they operate, helping SREs create more resilient teams and build more reliable systems:

http://try.victorops.com/sreweekly/top-incident-management-kpis

Articles

An article on looking past human error when investigating air sports accidents, drawing on the writing of Don Norman. Special emphasis on slips versus mistakes:

“Slips tend to occur more frequently to skilled people than to novices
[…]

Mara Schmid — Blue Skies Magazine

A VP of NS1 explains how his company rewrote and deployed their core service without downtime.

Shannon Weyrick — NS1

This guide from Hosted Graphite has a ton of great advice and reads almost as if they’ve released their internal incident response guidelines. Bonus content: check out this exemplary post-incident followup from their status site.

Fran Garcia — Hosted Graphite

Check it out, Atlassian posted their incident management documentation publicly!

On Monday I gave a talk at DOES18 called “All the World’s a Platform”, where I talked about a bunch of the lessons learned by using and abusing and running and building platforms at scale.

I promised to do a blog post with the takeaways, so here they are.

Charity Majors

[…] at a certain point, it’s too expensive to keep fixing bugs because of the high-opportunity cost of building new features. You need to decide your target for stability just like you would availability, and it should not be 100%.

Kristine Pinedo — Bugsnag

Maelstrom is Facebook’s tool to assist engineers in safely moving traffic off of impaired infrastructure.

Adrian Colyer — The Morning Paper (summary)
Veeraraghavan et al. — Facebook (original paper)

Attempting to stamp out failure entirely can have the paradoxical effect of reducing resiliency to anomalous situations. Instead, we need to handle failure constructively.

Daniel Hummerdal — Safety Differently


Outages

SRE Weekly Issue #144

A message from our sponsor, VictorOps:

Customers expect reliability, even in today’s era of CI/CD and Agile software development. That’s why SRE is more important than ever. Learn about the importance of getting buy-in from your entire team when taking on SRE:

http://try.victorops.com/sreweekly/organizational-sre-support

Articles

GitLab is incredibly open with their policies, and incident management is no exception.

GitLab

Ooh, new newsletter! This one focuses specifically on resiliency. It seems to have just a few articles each week with in-depth summaries.

Thai Wood

This article starts with a fictitious(?) account of the kind of failure that can occur when teams step on each other’s toes in a serverless environment. It goes on to discuss techniques for dealing with this class of problems, including careful permission management.

Toby Fee — jaxenter

Sometimes fixing a rarely-occurring bug can be especially difficult. Recording enough information all the time to debug those rare failures would be too resource-intensive. Check out this fascinating technique for working backward from a memory dump to infer the prior contents of memory in the time leading up to a failure.

Adrian Colyer — The Morning Paper (summary)
Cui et al. (original paper)

An introduction to the concept of reactive systems, including a description of their high-level architectural features.

Sinkevich Uladzimir — The Server Side

Initially, you can improve reliability by studying incidents to find out what went wrong. This article explains why that strategy will only get you so far.

Thanks to Thomas Depierre for this one.
Sidney Dekker — Safety Differently

Chaos Monkey Guide for Engineers – Tips, Tutorials, and Training

Gremlin released this huge guide on Chaos Monkey, covering theory, practice, further reading, and lots of other resources.

Gremlin, Inc.

Outages

SRE Weekly Issue #143

SPONSOR MESSAGE

Minimum viable runbooks are a way to spend less time building runbooks and more time using them. Learn more about creating actionable runbooks to support SRE and make on-call suck less:

http://try.victorops.com/sreweekly/minimum-viable-runbooks

Articles

There’s some great statistics theory in here. The challenge is: how can you have accurate, useful A/B tests without having to wait a long time to get a big enough sample size? Can you bail out early if you know the test has already failed? Can you refine the new feature mid-test?

Callie McRee and Kelly Shen — Etsy
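
If you want to play with the “bail out early” idea, here’s a sketch of one classical approach, Wald’s sequential probability ratio test, applied to a conversion-rate test. The baseline and target rates and the error bounds are made-up values, and this isn’t necessarily the method Etsy describes.

```python
import math

def sprt_decision(conversions, trials, p0=0.10, p1=0.12,
                  alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: rate == p0 vs. H1: rate == p1.

    Returns 'accept' (evidence favors p1), 'reject' (favors p0),
    or 'continue' (keep collecting data).
    """
    # Log-likelihood ratio after `conversions` successes in `trials`.
    llr = (conversions * math.log(p1 / p0)
           + (trials - conversions) * math.log((1 - p1) / (1 - p0)))
    upper = math.log((1 - beta) / alpha)  # stop early: the variant wins
    lower = math.log(beta / (1 - alpha))  # stop early: the test has failed
    if llr >= upper:
        return 'accept'
    if llr <= lower:
        return 'reject'
    return 'continue'

# Check after each batch of traffic, e.g. sprt_decision(130, 1000)
```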

Don’t just rename your Ops team to “SRE” and expect anything different, says this author.

Ernest Mueller — The Agile Admin

Great idea:

So what if we monitor the percentage of requests that are over the threshold instead? To alert us when our SLAs are violated, we can trigger alarms when that percentage is greater than 1% over some predefined time window.

Yan Cui
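
As a concrete (if simplified) sketch of that idea, with an assumed 500 ms latency threshold and the 1% budget from the quote:

```python
SLA_THRESHOLD_MS = 500     # assumed per-request latency budget
MAX_VIOLATION_RATE = 0.01  # alert when >1% of requests blow the budget

def should_alert(latencies_ms):
    """latencies_ms: request latencies observed over the alerting window."""
    if not latencies_ms:
        return False
    over = sum(1 for latency in latencies_ms if latency > SLA_THRESHOLD_MS)
    return over / len(latencies_ms) > MAX_VIOLATION_RATE
```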

There’s a ton of detail here, and it’s a great read. Lots of juicy tidbits about PoP selection, load balancing, and performance monitoring.

Oleg Guba and Alexey Ivanov — Dropbox

Full disclosure: Fastly, my employer, is mentioned.

Even as a preliminary report, there’s a lot to digest here about what caused the series of gas explosions last month in Massachusetts (US). I feel like I’ve been involved in incidents with similar contributing factors.

US National Transportation Safety Board (NTSB)

This isn’t just a recap of a bad day, although the outage description is worth reading by itself. Readers also gain insight into the evolution of this engineer’s career and mindset, from entry-level to Senior SRE.

Katie Shannon — LinkedIn

GitLab, in their trademark radically open style, goes into detail on the reasons behind the recent increase in the reliability of their service.

Andrew Newdigate — GitLab

Five nines are key when you consider that Twilio’s service uptime can literally mean life and death. Click through to find out why.

Charlie Taylor — Blameless

Outages

  • Travis CI
  • Google Compute Engine us-central1-c
    • I can’t really summarize this incident report well, but I highly recommend reading it.
  • Azure
    • Duplicated here since I can’t deep-link:

      Summary of impact: Between 01:22 and 05:50 UTC on 13 Oct 2018, a subset of customers using Storage in East US may have experienced intermittent difficulties connecting to resources hosted in this region. Other services leveraging Storage in the region may have also experienced impact related to this incident.

  • Instagram
  • Heroku
    • This one’s notable for the duration: about 10 days of diminished routing performance due to a bad instance.
  • Microsoft Outlook

SRE Weekly Issue #142

SPONSOR MESSAGE

Becoming a reliability engineer takes a unique set of skills and a breadth of knowledge. See what it takes to become an SRE, and use this as a resource to quickly ramp up new SREs:

http://try.victorops.com/sreweekly/becoming-a-reliability-engineer

Articles

The big news this week is the story from Bloomberg alleging a spy chip on SuperMicro motherboards. I say “alleging” because Amazon and Apple have issued unequivocal denials.

Jordan Robertson and Michael Riley — Bloomberg

A plan was in the works in the months before the Pulse nightclub mass shooting in Florida (US) in 2016, designed to get victims out of a “hot” zone. The story of why it wasn’t implemented echoes the kind of organizational failings we see as SREs.

Abe Aboraya — ProPublica

Facebook is at it again! Here’s a new system based on a state machine driven by Chef.

Declan Ryan — Facebook

Google has produced a new guide on designing DR in Google Cloud Platform:

We’ve put together a detailed guide to help steer you through setting up a DR plan. We heard your feedback on previous versions of these DR articles and now have an updated four-part series to help you design and implement your DR plans.

Grace Mollison — Google

[…] you must be part of the team working on the system. You cannot be someone that hurts a system and then wait for others to fix the problem.

Jan Stenberg — InfoQ

If you’ve ever been woken in the middle of the night just to see that an alert could have been resolved by adding another server or two behind the load balancer, you need capacity plans, and you need them yesterday.

Evan Smith — Hosted Graphite

[…] our industry has finally reached the tipping point at which it has become viable to build distributed systems from scratch, at a fast pace of iteration and low cost of operation, all while still having a small team to execute

The author argues that it’s possible to avoid building tech debt while still retaining the velocity a new startup needs.

Santiago Suarez Ordoñez — Blameless, Inc.

From a single host, to a bigger host, to leader/follower replication and active/active setups. The distinction between active/active and “Multi-Active” is worth reading.

Sean Loiselle — Cockroach Labs

Outages

A production of Tinker Tinker Tinker, LLC