SRE Weekly Issue #266

Articles

Airplane takes off a metric ton heavier than expected after computer error weighs adults as children

This one was brought to my attention by Dr. Richard Cook, who also pointed me to the AAIB incident report.

Dr. Cook went on to share these insights with me, which I’ve copied here with permission:

Note:

- the subtle interactions allowed the manual correction to be lost during the interval between recognizing the software problem and having the corrected software functionally ‘catch’ the Ms/Miss title mixup;
- the incident is attributed to “a simple flaw in the programming of the IT system” rather than failure of the workarounds that were put in place after the problem was recognized;
- the report is careful to demonstrate that the flaws in the system made only a slight difference to the flight parameters;
- the report does not describe any IT process changes whatsoever!

The report has the effect of making the incident appear to be an unfortunate series of occurrences rather than being emblematic of the way that these sorts of processes are vulnerable.

Catchpoint Announces Virtual SRE Community Event on June 10

Last year’s SRE From Home event was awesome, and this year’s iteration looks to be just as great.

Catchpoint

The Case of the Connection Timeout

This is fun! Try your hand at troubleshooting a connection issue in this gamified role-play scenario.

BONUS CONTENT: Read about the author’s motivations, design decisions, and plans here.

Julia Evans

The Five Pillars of Resilience Engineering

Do we need to have some kind of Pillars Registry? Note, these are more like pillars of high availability than resilience engineering.

Hector Aguilar — Okta

Incident analysis as guerrilla case study research

I love this idea that we’re trying to get deep incident analysis done even though that may not be the actual goal of the organization.

As LFI analysts, we’re exploiting this desire for closure to justify spending time examining how work is really done inside of the system.

Lorin Hochstein

Having On-call Nightmares? Runbooks can Help you Wake Up.

This is well worth a read if only for the on-call scenario at the start. Yup, been there. We miss you, Harry.

Harry Hull — Blameless

Platform engineering vs. site reliability engineering (SRE): here’s what you need to know

What’s the difference? Click through to learn about the distinction they’re drawing.

Amir Kazemi — effx

We Don’t Get Bitter, We Get Better

The New York Times’s Operations Engineering group developed an Operational Maturity Assessment and uses it to have collaborative conversations with teams about their systems.

Author: The NYT Open Team — New York Times

Outages

G-Suite
- Google posted this “Mini Incident Report while full Incident Report is prepared.”
Slack
Docker Hub
Robinhood
Twitter
Elevated CDN Errors
Heroku
- Heroku had a series of incidents this week (1, 2, 3, 4).
