
SRE Weekly Issue #173

I’m back! Thank you all so much for the outpouring of support while SRE Weekly was on hiatus.  My recovery is going nicely and I’m starting to catch up on my long backlog of articles to review.  I’m going to skip trying to list all the outages that occurred since the last issue and instead just focus on a couple of interesting follow-up posts.

A message from our sponsor, VictorOps:

Alert fatigue will kill team morale. Take a look at some great ways to avoid alert fatigue and why it’s important for employee health and incident resolution speed:

http://try.victorops.com/SREWeekly/Avoiding-Alert-Fatigue

Articles

So many awesome concepts packed into this article. Here are just a couple:

Seen in this light, “severity” could be seen as a currency that product owners and/or hiring managers could use to ‘pay’ for attention.

This yields the logic that if a customer was affected, learning about the incident is worth the effort, and if no customers experienced negative consequences for the incident, then there must not be much to learn from it.

John Allspaw — Adaptive Capacity Labs

This shares more in common with the server behind sreweekly.com than I perhaps ought to admit to:

Additionally, lots can be done for scalability regarding infrastructure: I’ve kept everything on a single, smaller server basically as a matter of stubbornness and wanting to see how far I can push a single VPS.

Simon Fredsted

A Reddit engineer explains a hidden gotcha of pg_upgrade that caused an outage I reported here previously.

Jason Harvey — Reddit

This has “normalization of deviance” all over it.

Taylor Dolven — The Miami Herald

The deep details around MCAS are starting to come out. This article tells a tale that is all too familiar to me about organizational pressures and compartmentalization.

Jack Nicas, David Gelles and James Glanz — New York Times

Outages

  • Google
    • Click through for Google’s blog post about the outage that impacted Google Cloud Platform, YouTube, Gmail, and Google Drive. A configuration change intended for a small number of servers was incorrectly applied more broadly, causing reduced network capacity. The similarity to the second Heroku outage below is striking.
  • Heroku Incident #1776 Follow-up
    • An expired SSL certificate caused control plane impact and some impact to running applications. (See the expiry-check sketch just after this list.)
  • Heroku Incident #1789 Follow-up
    • A configuration change intended for a testing environment was mistakenly applied to production, resulting in 100% of requests in the EU failing.
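
Expired certificates are the kind of failure a scheduled check can catch weeks in advance. Here’s a minimal sketch in Python using only the standard library; the hostname and the 30-day threshold are illustrative placeholders, not anything from Heroku’s follow-up.

    # Minimal sketch: warn when a host's TLS certificate is close to expiry.
    import socket
    import ssl
    import time

    def days_until_expiry(hostname: str, port: int = 443) -> float:
        """Return the number of days until the host's TLS certificate expires."""
        ctx = ssl.create_default_context()
        with socket.create_connection((hostname, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
                cert = tls.getpeercert()
        expires = ssl.cert_time_to_seconds(cert["notAfter"])
        return (expires - time.time()) / 86400

    if __name__ == "__main__":
        remaining = days_until_expiry("example.com")  # placeholder hostname
        if remaining < 30:  # alert well before the certificate lapses
            print(f"WARNING: certificate expires in {remaining:.0f} days")

Run something like this on a schedule against every externally visible endpoint, and an expired certificate becomes a ticket instead of an outage.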

SRE Weekly Issue #172

A message from our sponsor, VictorOps:

[You’re Invited] Puppet, Splunk and VictorOps are teaming up for a live webinar on powering continuous improvement by combining analytics, incident response and automation. Learn best practices for releasing better applications faster, without the fire drills.

http://try.victorops.com/sreweekly/continuous-improvement-webinar

Articles

An experienced pilot and programmer details the background behind the 737 MAX’s MCAS system and discusses the risks and motivations involved.

Boeing’s solution to its hardware problem was software.

Thanks to John Goerzen for this one.

Gregory Travis — IEEE Spectrum

A detailed analysis of a paper by Eric Hollnagel and David Woods on designing systems that include humans and computers.

The operator detects failures better when he participates in system control as opposed to functioning only as a monitor…

Thai Wood (summary)

An essay on the difference in philosophies between Safety I and Safety II and on understanding how our systems succeed rather than focusing on how they fail.

Ryan Frantz

Azure’s Project Tardigrade is exploring interesting ideas like keeping VMs resident in memory even when the host kernel reboots. This reminds me of another similarly-named project.

Chris Kanaracus — TechTarget

This is a follow-up to an article from last week about a Honeycomb incident, going into more detail on what went wrong and how they figured it out using Honeycomb itself.

Douglas Soo — Honeycomb

On Feb 15th, 2019, a slave node in Redis, an in-memory data structure storage, failed requiring a replacement.

[…]

This blog post describes Grab’s post-mortem findings for the outage caused by the Redis Cluster failure.

Michael Cartmell, Jiahao Huang, and Sandeep Kumar — Grab

I like how their chosen solution fetches from all the datacenters in the normal case, so they don’t experience a sudden shift in traffic pattern during a failover.

Preslav Le — Dropbox
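
The fan-out idea is simple enough to sketch. This is a minimal illustration rather than Dropbox’s implementation; fetch_from and the region names are hypothetical placeholders. Every read goes to all datacenters and the first success wins, so losing a datacenter mostly shows up as latency rather than as a new traffic pattern.

    # Minimal sketch of a fan-out read: query every datacenter in parallel
    # and return the first success. fetch_from and REGIONS are hypothetical.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    REGIONS = ["us-east", "us-west", "eu-central"]  # illustrative names

    def fetch_from(region: str, key: str) -> bytes:
        """Placeholder for a real per-datacenter read."""
        raise NotImplementedError

    def fan_out_read(key: str) -> bytes:
        """Ask every region at once; the first region to answer wins."""
        errors = []
        with ThreadPoolExecutor(max_workers=len(REGIONS)) as pool:
            futures = {pool.submit(fetch_from, region, key): region
                       for region in REGIONS}
            for fut in as_completed(futures):
                try:
                    return fut.result()
                except Exception as exc:  # a down region simply loses the race
                    errors.append((futures[fut], exc))
        raise RuntimeError(f"all regions failed: {errors}")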

Outages

SRE Weekly Issue #171

A message from our sponsor, VictorOps:

[You’re Invited] Puppet, Splunk and VictorOps are teaming up for a live webinar on powering continuous improvement by combining analytics, incident response and automation. Learn best practices for releasing better applications faster, without the fire drills.

http://try.victorops.com/sreweekly/continuous-improvement-webinar

Articles

TL;DR: Prefer investing in recovery over prevention.

Make failure a non-event rather than trying to prevent it. You won’t succeed in fully preventing failures, and you’ll only end up out of practice at recovering from them.

Aaron Blohowiak

They had me at “normalization of deviance”. I’ll read pretty much anything with that in the title.

Tim Davies — Fast Jet Performance

Monzo’s system is directly integrated with Slack, helping you manage your incident and track what happens. Check out their video presentation for more details.

Monzo

Me too! Great thread.

Nolan Caudill and others

I love Honeycomb incident reviews, I really do.

Douglas Soo

Born from a Twitter argument thread, this article goes into depth about why Friday change freezes can do far more harm than good.

Charity Majors

Outages

SRE Weekly Issue #170

A message from our sponsor, VictorOps:

Our latest list of the top 12 server monitoring tools can help your SRE team get started in building a comprehensive monitoring strategy. Drive deeper service reliability through effective server monitoring:

http://try.victorops.com/sreweekly/top-server-monitoring-software

Articles

This myth is a misguided belief that engineers are like Laplace’s Demon; they maintain an accurate mental model of the system, foresee all the consequences of their actions, predict where the business is going, and are careful enough to avoid mistakes.

Aaron Blohowiak — Netflix

I highly recommend watching some of the talks or at least perusing slides.

The concern is that incidents have been investigated by parties that were involved in or related to the incident, raising the possibility of conflicts of interest. In a small company, this may be unavoidable, but we should at least keep the risk in mind.

Patrick Kingsland — Railway Technology

An absolute treasure trove of links to many articles and papers on resilience engineering. Beyond just links, there are short profiles of 30+ important thinkers in the field. I’m going to be busy for a while.

@lorin (GitHub)

This is about project retrospectives, but it applies equally well to incident retrospectives.

Dominika Bula — Red Hat

Here’s a counterpoint to an article I linked to last week.

Karl Bode — Motherboard

Outages

A production of Tinker Tinker Tinker, LLC