SRE Weekly Issue #145

Articles

An article on looking past human error in investigating air sports (definition) accidents, drawing on the writing of Don Norman. Special emphasis on slips versus mistakes:

“Slips tend to occur more frequently to skilled people than to novices […]”

Mara Schmid — Blue Skies Magazine

Five Ways to Tackle Digital Transformation Without Downtime

A VP of NS1 explains how his company rewrote and deployed their core service without downtime.

Shannon Weyric — NS1

Status page updates: It’s all about timing

This guide from Hosted Graphite has a ton of great advice and reads almost as if they’ve released their internal incident response guidelines. Bonus content: check out this exemplary post-incident followup from their status site.

Fran Garcia — Hosted Graphite

Atlassian Incident Management Handbook

Check it out, Atlassian posted their incident management documentation publicly!

Ten Platform Commandments

On Monday I gave a talk at DOES18 called “All the World’s a Platform”, where I talked about a bunch of the lessons learned by using and abusing and running and building platforms at scale.

I promised to do a blog post with the takeaways, so here they are.

Charity Majors

Not All Bugs Are Worth Fixing (And That’s Okay)

[…] at a certain point, it’s too expensive to keep fixing bugs because of the high-opportunity cost of building new features. You need to decide your target for stability just like you would availability, and it should not be 100%.

Kristine Pinedo — Bugsnag

Maelstrom: mitigating datacenter-level disasters by draining interdependent traffic safely and efficiently

Maelstrom is Facebook’s tool to assist engineers in safely moving traffic off of impaired infrastructure.

Adrian Colyer — The Morning Paper (summary)
Veeraraghavan et al. — Facebook (original paper)

Failure differently

Attempting to stamp out failure entirely can have the paradoxical effect of reducing resiliency to anomalous situations. Instead, we need to handle failure constructively.

Daniel Hummerdal — Safety Differently

Outages

Postmortem: RDS Clogs & Cache-Refresh Crash Loops – Honeycomb
- I guess it’s probably mean of me, but I always get excited when Honeycomb has an outage, because I love reading their followup analyses. This one expertly deconstructs a messy incident with lots of contributing factors.
  
  Rachel Fong — Honeycomb
GitHub
- GitHub had a severe outage this week. Their brief summary (linked above) brings to mind the mention of the risk of data center isolation in this article from July:
  - GitHub Engineering Adopts New Architecture for MySQL High Availability
Travis CI
- Caused by the GitHub outage.
Fastly
- Fastly had a rough week:
  - Degraded Performance in South Africa
  - Elevated Errors in DCA/Ashburn
  - CDN Degraded Shield Performance
  - Elevated Errors in GIG/Rio de Janeiro
  - Elevated Errors in HHN/Frankfurt

  Full disclosure: Fastly is my employer.
PagerDuty
YouTube
Heroku
- Also this one and a few other minor ones.
Snapchat
BitBucket
- The above is a total outage for one hour. They also had a less severe incident the previous day.
Reddit
iCloud
