SRE Weekly Issue #173

I’m back! Thank you all so much for the outpouring of support while SRE Weekly was on hiatus.  My recovery is going nicely and I’m starting to catch up on my long backlog of articles to review.  I’m going to skip trying to list all the outages that occurred since the last issue and instead just focus on a couple of interesting follow-up posts.

A message from our sponsor, VictorOps:

Alert fatigue will kill team morale. Take a look at some great ways to avoid alert fatigue and why doing so is important for employee health and incident resolution speed:

http://try.victorops.com/SREWeekly/Avoiding-Alert-Fatigue

Articles

So many awesome concepts packed into this article. Here are just a couple:

Seen in this light, “severity” could be seen as a currency that product owners and/or hiring managers could use to ‘pay’ for attention.

This yields the logic that if a customer was affected, learning about the incident is worth the effort, and if no customers experienced negative consequences for the incident, then there must not be much to learn from it.

John Allspaw — Adaptive Capacity Labs

This has more in common with the server behind sreweekly.com than I perhaps ought to admit:

Additionally, lots can be done for scalability regarding infrastructure: I’ve kept everything on a single, smaller server basically as a matter of stubbornness and wanting to see how far I can push a single VPS.

Simon Fredsted

A Reddit engineer explains a hidden gotcha of pg_upgrade that caused an outage I reported here previously.

Jason Harvey — Reddit

This has “normalization of deviance” all over it.

Taylor Dolven — The Miami Herald

The deep details around MCAS are starting to come out. This article tells a tale of organizational pressures and compartmentalization that is all too familiar to me.

Jack Nicas, David Gelles and James Glanz — New York Times

Outages

  • Google
    • Click through for Google’s blog post about the outage that impacted Google Cloud Platform, YouTube, Gmail, and Google Drive. A configuration change intended for a small number of servers was incorrectly applied more broadly, causing reduced network capacity. The similarity to the second Heroku outage below is striking.
  • Heroku Incident #1776 Follow-up
    • An expired SSL certificate caused control plane impact and some impact to running applications. A minimal expiry-check sketch follows this list.
  • Heroku Incident #1789 Follow-up
    • A configuration change intended for a testing environment was mistakenly applied to production, resulting in 100% of requests in the EU failing.
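Expired certificates are one of the more preventable failure modes, and checking expiry is cheap to automate. Here is a minimal sketch in Go of the kind of check that could flag a certificate like the one behind incident #1776 before it lapses; the host and warning threshold are illustrative assumptions, not anything from the Heroku follow-up.

	// certcheck: connect to an endpoint over TLS and warn if the leaf
	// certificate expires within a chosen window. Hypothetical host and
	// threshold; wire the exit codes into whatever monitoring you run.
	package main

	import (
		"crypto/tls"
		"fmt"
		"os"
		"time"
	)

	func main() {
		const host = "example.com:443"         // endpoint to watch (placeholder)
		const warnWithin = 14 * 24 * time.Hour // alert two weeks before expiry

		conn, err := tls.Dial("tcp", host, &tls.Config{})
		if err != nil {
			fmt.Fprintf(os.Stderr, "dial %s: %v\n", host, err)
			os.Exit(2)
		}
		defer conn.Close()

		// The leaf certificate is first in the peer certificate list.
		cert := conn.ConnectionState().PeerCertificates[0]
		remaining := time.Until(cert.NotAfter)

		if remaining < warnWithin {
			fmt.Printf("WARNING: %s certificate expires in %s (at %s)\n",
				host, remaining.Round(time.Hour), cert.NotAfter)
			os.Exit(1)
		}
		fmt.Printf("OK: %s certificate valid for another %s\n",
			host, remaining.Round(time.Hour))
	}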