SRE Weekly Issue #60

Sorry I’m late this week!  My family experienced a low-redundancy event as two grown-ups and one kid (so far) have been laid low by Norovirus.

That said, I’m glad that the delay provided me the opportunity to share this first article so soon after it was published.

SPONSOR MESSAGE

Interested in ChatOps? Get the free 75 page O’Reilly report covering everything from basic concepts to deployment strategies. http://try.victorops.com/sreweekly/chatops

Articles

Susan Fowler’s articles have been featured here several times previously, and she’s one of my all-time favorite authors. Now it seems that while she was busy writing awesome articles and a book, she was also dealing with a terribly toxic and abhorrent environment of sexual harassment and discrimination at Uber. I can only be incredibly thankful that somehow, despite their apparent best efforts, Uber did not manage to drive Susan out of engineering, as happens all too often in this kind of scenario.

Even, and perhaps especially, if we think we’re doing a good job preventing the kind of abusive environment Susan described, it’s quite possible we’re just not aware of the problems. Likely, even. This kind of situation is unfortunately incredibly common.

Wow, what a cool idea! GitLab open-sourced their runbooks. Not only are their runbooks well-structured and great as examples, some of them are general enough to apply to other companies.

Every line of code has some probability of having an undetected flaw that will be seen in production. Process can affect that probability, but it cannot make it zero. Large diffs contain many lines, and therefore have a high probability of breaking when given real data and real traffic.

Full disclosure: Heroku, my employer, is mentioned.
Thanks to Devops Weekly for this one.
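
The claim about large diffs quoted above can be made concrete with a little arithmetic. This is only a rough sketch, not something taken from the article: it assumes every changed line has the same small chance of hiding a flaw and that lines fail independently of one another, neither of which holds exactly in real code. The function name and the per-line probability of 0.001 are made up for illustration.

```python
def p_diff_breaks(lines_changed: int, p_flaw_per_line: float) -> float:
    """Chance that at least one changed line hides an undetected flaw.

    Assumes (hypothetically) that each line is flawed independently
    with the same probability p_flaw_per_line.
    """
    # P(no line is flawed) = (1 - p)^n, so P(at least one flaw) = 1 - (1 - p)^n.
    return 1.0 - (1.0 - p_flaw_per_line) ** lines_changed


if __name__ == "__main__":
    # Illustrative per-line flaw probability; not a measured value.
    p = 0.001
    for n in (10, 100, 1000):
        print(f"{n:>5} lines changed: {p_diff_breaks(n, p):.1%} chance of a flaw")
```

Under these assumptions the risk grows quickly with diff size, and no process improvement that merely lowers the per-line probability can bring it to zero.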

TIL: cgroup memory limits can cause a group of processes to use swap even when the system as a whole is not under memory pressure. Thanks again, Julia Evans!

This week from VictorOps is a nifty primer on structuring your team’s on-call and incident response. I love when a new concept catches my eye like this one:

While much has been said about the importance of keeping after-action analysis blameless, I think it is doubly important to keep escalations blameless. A lone wolf toiling away in solitude makes for a great comic book, but rarely leads to effective resolution of incidents in complex systems.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Open source IoT platform ThingsBoard’s authors share a detailed account of how they diagnosed and fixed reliability and throughput issues in their software so that it could handle 30k incoming events per second.

There’s both theory and practice in this article, which opens with an architecture discussion and then continues into the steps to deploy a first version in a testing Azure environment on your workstation.

I don’t often link to new product announcements, but DigitalOcean’s new Load Balancer product caught my attention. It looks to be squarely aimed at improving on Amazon’s ELB product.

Okay, apparently I do link to product announcements often. Google unveiled a new beta product this week for their Cloud Platform: Cloud Spanner. It’s based on their Spanner paper from 2012, and they’re making some big claims.

Cloud Spanner is the first and only relational database service that is both strongly consistent and horizontally scalable. […] With automatic scaling, synchronous data replication, and node redundancy, Cloud Spanner delivers up to 99.999% (five 9s) of availability for your mission critical applications.
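
To put "five 9s" in perspective, here is a quick back-of-the-envelope calculation of how much downtime a given availability percentage allows per year. It is a sketch under simple assumptions: a year of 365.25 days, and availability measured as plain uptime over wall-clock time. Google's own definition for Cloud Spanner may differ, and the function name is hypothetical.

```python
# Average length of a year in minutes, counting leap years (an assumption).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.2f} minutes of downtime per year")
```

By this measure, 99.999% leaves only a little over five minutes of downtime in a whole year.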

Outages

  • US National Weather Service
    • The U.S. National Weather Service said on Tuesday it suffered its first-ever outage of its data system during Monday’s blizzard in New England, keeping the agency from sending out forecasts and warnings for more than two hours. [Reuters]

  • The Travis CI Blog: Postmortem for 2017-02-04 container-based Infrastructure issues
    • A garden-variety bug in a newly-deployed version was exacerbated by a failed rollback, in a perfect example of a complex failure with a complex intersection of contributing factors.
  • Instapaper Outage Cause & Recovery
    • Last week, I incorrectly stated that Instapaper’s database hit a performance cliff. In actuality, their RDS instance was, unbeknownst to them, running on an ext3 filesystem, which limits any single file to 2TB. Their only resolution path when they hit that limit was to mysqldump all their data and restore into a new DB running on ext4.

      Even if we had executed perfectly, from the moment we diagnosed the issue to the moment we had a fully rebuilt database, the total downtime would have been at least 10 hours.

Updated: February 20, 2017 — 8:42 pm