SRE Weekly Issue #49

My vacation visiting the Mouse was awesome! I had a lot of time to work on a little project of mine. And now I’m back just in time for Black Friday and Cyber-whatever. Good luck out there, everyone!

Articles

Price-Checking for Prescriptions Results in Dangerous Combination of Medications

Another issue of BWH’s Safety Matters, this time about a prescribing accident. The system seems to have been almost designed to cause this kind of error, so it’s good to hear that a fix was already about to be deployed.

Do You Make This Critical Root Cause Analysis (RCA) Mistake?

This is a great article on identifying the true root cause(s) of an incident, as opposed to stopping at just a “direct cause”. I only wish it were titled “Use These Five Weird Tricks to Run Your RCA!”

How Etsy Uses Code “Slush” to Manage Development During the Holidays

Etsy describes how they do change management during the holidays:

[…] at Etsy, we still push code during the holidays, just more carefully, it’s not a true code freeze, but a cold, melty mixture of water and ice. Hence, Slush.

Production Ready: Always Leave the Campground Cleaner Than You Found It

This issue of Production Ready is about battling code rot through incrementally refactoring an area of a codebase while you’re doing development work that touches it.

5 ways to hone your production incident postmortems

Shutterstock shares some tips they’ve learned from writing postmortems. My favorite part is about recording a timeline of events in an incident. I’ve found that reading an entire chat transcript for an incident can be tedious, so it can be useful to tag items of interest using a chat-bot command or a search keyword so that you can easily find them later.
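For the curious, here is a minimal sketch of the tagging idea, assuming the transcript is available as plain lines of text and that responders mark noteworthy messages with a `#timeline` keyword. The tag name and the `extract_timeline` helper are my own inventions for illustration, not anything from the Shutterstock article.

```python
from typing import Iterable, List


def extract_timeline(transcript_lines: Iterable[str], tag: str = "#timeline") -> List[str]:
    """Return only the transcript lines that carry the tag, with the tag removed."""
    timeline = []
    for line in transcript_lines:
        if tag in line:
            # Drop the tag and collapse any leftover double spaces.
            timeline.append(" ".join(line.replace(tag, " ").split()))
    return timeline


if __name__ == "__main__":
    chat = [
        "14:02 alice: #timeline db failover started",
        "14:03 bob: anyone else seeing 500s?",
        "14:10 alice: traffic recovered #timeline",
    ]
    for entry in extract_timeline(chat):
        print(entry)
```

The same filter works just as well as a search in your chat tool; the point is only that a consistent keyword makes the timeline recoverable later.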

OUTAGE! AMA on-demand video

The “Outage!” AMA happened while I was on vacation, and I still haven’t had a chance to listen to it. Here’s a link in case you’d like to.

10 DevOps Interview Questions to Gauge a Candidate’s Real Knowledge

My favorite:

If something breaks in production, how do you know about it?

Weaver: Ill-Behaved Microservice Emulator

Weaver is a tool to help you identify problems in your microservice consumers by doing “bad” things like responding slowly to a fraction of requests.
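To make the idea concrete, here is a rough sketch of “responding slowly to a fraction of requests”. This is not Weaver's code or API; `make_flaky`, `slow_fraction`, and `delay_seconds` are hypothetical names, and the handler is just a plain function standing in for a service endpoint.

```python
import random
import time
from typing import Any, Callable, Optional


def make_flaky(
    handler: Callable[[Any], Any],
    slow_fraction: float = 0.1,
    delay_seconds: float = 2.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[Any], Any]:
    """Wrap a handler so that roughly slow_fraction of calls are delayed."""
    rng = rng or random.Random()

    def wrapped(request: Any) -> Any:
        # rng.random() is in [0.0, 1.0), so this fires for about
        # slow_fraction of requests.
        if rng.random() < slow_fraction:
            sleep(delay_seconds)
        return handler(request)

    return wrapped


if __name__ == "__main__":
    echo = make_flaky(lambda req: f"ok: {req}", slow_fraction=0.5, delay_seconds=0.2)
    for i in range(5):
        start = time.monotonic()
        print(echo(i), f"({time.monotonic() - start:.2f}s)")
```

Pointing a consumer at a wrapper like this is a cheap way to find out whether its timeouts and retries behave the way you think they do.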

How Barclays Avoids Downtime Chaos

Barclays reduced load on their mainframe by adding MongoDB as a caching layer to handle read requests. What the heck does “mainframe” mean in this decade, anyway?

SOASTA Report: Online Holiday Shoppers Will Only Wait for Two Seconds

We’d do well to remember during this holiday season that several seconds of latency in web requests is tantamount to an outage.

You Are Not Paid to Write Code

Tyler Treat gives us an eloquently presented argument for avoiding writing code as much as possible, for the sake of stability.

Outages

So far, no big-name Black Friday outages. We’ll see what Cyber Monday has in store.

Everest (datacenter)
- Everest suffered a cringe-worthy network outage following a power failure. Power went off and on a couple of times, prompting their stacked Juniper routers to assume they had failed to boot and go into failure recovery mode. Unfortunately, the secondary OS partitions on the two devices contained different JunOS versions, so they wouldn’t stack properly.
  
  I’d really like to read the RFO on the power outage itself, but I can’t find it. If anyone has a link, could you please send it my way?
Argos
