SRE Weekly Issue #175

A message from our sponsor, VictorOps:

Looking to go serverless? Beau Christensen, VictorOps Director of Platform Engineering, and Tom McLaughlin, Founder of ServerlessOps, sat down to talk about when VictorOps decided to venture into AWS:

http://try.victorops.com/SREWeekly/going-serverless

Articles

This and other enlightened reflections on incident reviews can be found in this article:

Many organizations have driven post-incident reviews to become pallid, vapid, mechanical exercises whose value is limited to producing a defensible argument that management is occurring.

Richard Cook — Adaptive Capacity Labs

In this post, I’ll describe how we monitor our DNS systems and how we used an array of tools to investigate and fix an unexpected spike in DNS errors that we encountered recently.

Jeff Jo — Stripe

“Multi-cloud” never really lived up to its hype, did it? This article argues that a multi-cloud strategy is only useful in specific, constrained situations.

Disco Posse

I love how they used idempotency to avoid downtime and missed or repeated transactions during the cutover.

Miguel Carranza — RevenueCat
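
The idempotency idea is worth sketching. Here’s a minimal, hypothetical Python illustration (the store and names are mine, not RevenueCat’s): the same logical transaction can be submitted by both the old and new pipeline during a cutover, and only the first submission has any effect.

# Hypothetical sketch of an idempotency-key pattern, not RevenueCat's actual code.
# A persistent store with a unique index on the key would replace this dict.
processed = {}

def apply_transaction(idempotency_key, payload):
    """Apply a transaction at most once; replays return the stored result."""
    if idempotency_key in processed:
        return processed[idempotency_key]      # duplicate submission: no-op
    result = {"status": "applied", **payload}  # real side effect would happen here
    processed[idempotency_key] = result
    return result

# During a cutover, the old and new pipelines can both submit the same event safely:
apply_transaction("purchase-123", {"user": "u1", "amount_cents": 499})
apply_transaction("purchase-123", {"user": "u1", "amount_cents": 499})  # replay, ignored
assert len(processed) == 1

Because replays are harmless, anything that might have been missed can simply be re-sent, which is what removes the usual downtime-versus-duplication tradeoff during a migration.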

This is either really clever or just unsporting.

Tonya Garcia — MarketWatch

This article discusses six kinds of SRE team (“kitchen sink”, infrastructure, tools, product/application, embedded, and consulting) and the pros and cons of each.

Gustavo Franco and Matt Brown — Google

If you see half the incidents this quarter compared to last, does it actually mean anything, statistically speaking? The math in this article applies equally well to SRE, and casts a shadow on the idea of tracking “metrics” like MTTR.

Marloes Nitert — Safety Differently
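
To make the statistical point concrete, here’s a minimal sketch (my numbers, not the article’s) that asks how often pure chance would produce a drop of that size, assuming incidents arrive at a roughly Poisson rate:

# Hypothetical Monte Carlo check: is "half as many incidents this quarter" just noise?
# Counts are made up; the assumption is that incidents arrive roughly Poisson-distributed.
import numpy as np

rng = np.random.default_rng(42)
last_quarter, this_quarter = 8, 4

# Null hypothesis: both quarters share the same underlying incident rate.
pooled_rate = (last_quarter + this_quarter) / 2
sims = rng.poisson(pooled_rate, size=(100_000, 2))

# How often does chance alone produce at least this large a drop?
p_value = np.mean((sims[:, 0] - sims[:, 1]) >= (last_quarter - this_quarter))
print(f"p-value ~ {p_value:.2f}")  # comes out well above 0.05 at counts this small

At single-digit incident counts, a 50% drop is entirely consistent with no real change, which is roughly the caution the article above is pointing at.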

This field guide to debugging is the synthesis of a bunch of contributions by folks on Twitter, forged into an article by the inimitable Julia Evans. Maybe a zine is in the works?

Julia Evans

Outages

SRE Weekly Issue #174

A special treat awaits you in the Outages section this week: five awesome incident followups!

A message from our sponsor, VictorOps:

Creating on-call schedules for your SRE team(s) can be challenging. We’ve put together a short list of tips, tricks, and tools you can use to better organize your on-call rotations and help your SRE efforts:

http://try.victorops.com/SREWeekly/SRE-On-Call-Tips

Articles

This is a study of every high-severity production incident across Microsoft Azure services over a span of six months whose root cause was a software bug.

Adrian Colyer (summary)

Liu et al., HotOS’19 (original paper)

PagerDuty created this re-enactment of an incident response phone bridge. It’s obviously fairly heavily redacted and paraphrased, but it’s still quite educational. It includes interludes where terms such as Incident Commander are explained.

George Miranda — PagerDuty

Outages

  • Google Calendar
  • Netflix
  • Hulu
  • Joyent May 27 2014 outage followup
    • In this 2014 outage followup, we learn that a Joyent engineer accidentally rebooted an entire datacenter:

      The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the data center.

  • Salesforce May 17 outage followup
    • Click through to read about the massive Salesforce outage last month. A database edit script contained a bug that ran an UPDATE without its WHERE clause, granting elevated permissions to more users than intended. Salesforce shut down broad chunks of their service to prevent data leakage.
  • Second Life mid-May outage followup
    • Linden Lab posted about a network maintenance that went horribly wrong, resulting in a total outage.

      Everything started out great. We got the first new core router in place and taking traffic without any impact at all to the grid. When we started working on the second core router, however, it all went wrong.

      April Linden — Linden Lab

  • Monzo May 30 outage followup
    • Monzo posted this incredibly detailed followup for an outage from several weeks ago. Not only does it give us a lot of insight into their incident response process, but I also got to learn about how UK bank transfers work. Thanks to an anonymous reader for this one.

      Nicholas Robinson-Wall — Monzo

  • Google Cloud Platform June 2 outage followup
    • Along with the blog post I linked to last week, Google also posted this technical followup for their major June 2 outage. I’ve never seen one of their followups even close to this long or detailed, and that’s saying a lot.

SRE Weekly Issue #173

I’m back! Thank you all so much for the outpouring of support while SRE Weekly was on hiatus.  My recovery is going nicely and I’m starting to catch up on my long backlog of articles to review.  I’m going to skip trying to list all the outages that occurred since the last issue and instead just focus on a couple of interesting follow-up posts.

A message from our sponsor, VictorOps:

Alert fatigue will kill team morale. Take a look at some great ways to avoid alert fatigue and why it’s important for employee health and incident resolution speed:

http://try.victorops.com/SREWeekly/Avoiding-Alert-Fatigue

Articles

So many awesome concepts packed into this article. Here are just a couple:

Seen in this light, “severity” could be seen as a currency that product owners and/or hiring managers could use to ‘pay’ for attention.

This yields the logic that if a customer was affected, learning about the incident is worth the effort, and if no customers experienced negative consequences for the incident, then there must not be much to learn from it.

John Allspaw — Adaptive Capacity Labs

This has more in common with the server behind sreweekly.com than I perhaps ought to admit:

Additionally, lots can be done for scalability regarding infrastructure: I’ve kept everything on a single, smaller server basically as a matter of stubbornness and wanting to see how far I can push a single VPS.

Simon Fredsted

A Reddit engineer explains a hidden gotcha of pg_upgrade that caused an outage I reported here previously.

Jason Harvey — Reddit

This has “normalization of deviance” all over it.

Taylor Dolven — The Miami Herald

The deep details around MCAS are starting to come out. This article tells a tale that is all too familiar to me about organizational pressures and compartmentalization.

Jack Nicas, David Gelles and James Glanz — New York Times

Outages

  • Google
    • Click through for Google’s blog post about the outage that impacted Google Cloud Platform, YouTube, Gmail, and Google Drive. A configuration change intended for a small number of servers was incorrectly applied more broadly, causing reduced network capacity. The similarity to the second Heroku outage below is striking.
  • Heroku Incident #1776 Follow-up
    • An expired SSL certificate caused control plane impact and some impact to running applications.
  • Heroku Incident #1789 Follow-up
    • A configuration change intended for a testing environment was mistakenly applied to production, resulting in 100% of requests in the EU failing.

SRE Weekly Issue #172

A message from our sponsor, VictorOps:

[You’re Invited] Puppet, Splunk and VictorOps are teaming up for a live webinar on powering continuous improvement by combining analytics, incident response and automation. Learn best practices for releasing better applications faster, without the fire drills.

http://try.victorops.com/sreweekly/continuous-improvement-webinar

Articles

An experienced pilot and programmer details the background behind the 737 MAX’s MCAS system and discusses the risks and motivations involved.

Boeing’s solution to its hardware problem was software.

Thanks to John Goerzen for this one.

Gregory Travis — IEEE Spectrum

A detailed analysis of a paper by Eric Hollnagel and David Woods on designing systems that include humans and computers.

The operator detects failures better when he participates in system control as opposed to functioning only as a monitor…

Thai Wood (summary)

An essay on the difference in philosophies between Safety I and Safety II and on understanding how our systems succeed rather than focusing on how they fail.

Ryan Frantz

Azure’s Project Tardigrade is exploring interesting ideas like keeping VMs resident in memory even when the host kernel reboots. This reminds me of another similarly-named project.

Chris Kanaracus — TechTarget

This is a followup to an article from last week about a Honeycomb incident, going into more detail on what went wrong and how they figured it out using Honeycomb itself.

Douglas Soo — Honeycomb

On Feb 15th, 2019, a slave node in Redis, an in-memory data structure storage, failed requiring a replacement.

[…]

This blog post describes Grab’s post-mortem findings for the outage caused by the Redis Cluster failure.

Michael Cartmell, Jiahao Huang, and Sandeep Kumar — Grab

I like how their chosen solution fetches from all the datacenters in the normal case, so they don’t experience a sudden shift in traffic pattern during a failover.

Preslav Le — Dropbox
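
The shape of that design is easy to sketch. A hypothetical Python version (names and topology are mine, not Dropbox’s): every request fans out to all datacenters and takes the first good answer, so losing a datacenter changes latency, not the traffic pattern.

# Hypothetical sketch of an "always fan out" read path, not Dropbox's implementation.
import concurrent.futures

DATACENTERS = ["dc-east", "dc-west", "dc-central"]

def fetch_from(dc, key):
    """Placeholder for a real cross-datacenter read."""
    if dc == "dc-central":
        raise TimeoutError(f"{dc} unavailable")  # simulate a failed datacenter
    return f"{key}@{dc}"

def fan_out_read(key):
    """Query every datacenter in parallel and return the first successful response."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(DATACENTERS)) as pool:
        futures = [pool.submit(fetch_from, dc, key) for dc in DATACENTERS]
        errors = []
        for fut in concurrent.futures.as_completed(futures):
            try:
                return fut.result()
            except Exception as exc:  # a failed datacenter simply drops out of the race
                errors.append(exc)
    raise RuntimeError(f"all datacenters failed: {errors}")

print(fan_out_read("user-42"))  # still succeeds with dc-central down

Since the failover path is the same code as the normal path, a datacenter loss doesn’t suddenly redirect a flood of traffic at the survivors.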

Outages

A production of Tinker Tinker Tinker, LLC