
SRE Weekly Issue #63

SPONSOR MESSAGE

Free Incident Management Maturity Assessment. Learn how your team ranks against leading DevOps practices and get helpful tips on how to improve. http://try.victorops.com/ima/sreweekly

Articles

I love the analogy: you can’t work around a slow drain with a bigger sink.

Stephen Thorne, a Google SRE, annotates the first chapter of the Google SRE book with his personal opinions and interpretations.

The author of this short article starts with the blooper during the Oscars and beautifully segues into a description of techniques organizations can use to halt the propagation of errors.

This webinar looks really interesting, and I’m going to try to see it. It’s about the importance of providing context to incident responders, how much to provide, and how to provide it.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Outages

SRE Weekly Issue #62

S3 fails, and suddenly it’s SRE “go-time” at companies everywhere!  I don’t know about you, but I sure am exhausted.

SPONSOR MESSAGE

DevOps incident management at its finest. Start your free trial of VictorOps. http://try.victorops.com/sreweekly/trial

Articles

When you do as the title suggests, you realize that network partitions move from the realm of the theoretical to the everyday.

Asana shares their “Five Whys” process, which they use not only for outages but even for missed deadlines. This caught my eye:

Our team confidently focuses on problem mitigation while fighting a fire, knowing that there will be time for post-mortem and long-term fixes later.

Using Route 53 as a case study, AWS engineers explain how they carefully designed their deploy process to reduce impact from failed deploys.

One method to reduce potential impact is to shape your deployment strategies around the failure conditions of your service. Thus, when a deployment fails, the service owner has more control over the blast radius as well as the scope of the impact.
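
The article’s specifics are worth reading in full; purely as an illustration of the idea (the cell names, thresholds, and functions below are mine, not AWS’s), a rollout that limits blast radius might look something like this:

```python
import time

# Hypothetical sketch: deploy one isolated "cell" at a time, let it bake,
# and halt if health regresses, so a bad release only ever touches one cell.
CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]
BAKE_SECONDS = 300           # how long to watch a cell before moving on
ERROR_RATE_THRESHOLD = 0.01  # halt the rollout above 1% errors

def deploy_to(cell: str, version: str) -> None:
    """Stand-in for whatever actually ships the new version to a cell."""
    print(f"deploying {version} to {cell}")

def error_rate(cell: str) -> float:
    """Stand-in for a query against the cell's monitoring."""
    return 0.0

def staged_rollout(version: str) -> bool:
    for cell in CELLS:
        deploy_to(cell, version)
        time.sleep(BAKE_SECONDS)
        if error_rate(cell) > ERROR_RATE_THRESHOLD:
            print(f"halting rollout: {cell} unhealthy on {version}")
            return False  # impact stays contained to this one cell
    return True
```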

GitHub used a data-driven approach when migrating a storage load from Redis to MySQL. It’s a good thing they did, because a straight one-for-one translation would have overloaded MySQL.
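
The write-up is worth reading for the methodology itself; as a generic sketch of one way to gather that kind of data (not GitHub’s actual tooling), you can shadow a small fraction of production reads onto the candidate backend and compare latencies before committing to the migration:

```python
import random
import time

SHADOW_FRACTION = 0.01  # mirror 1% of production reads to the candidate backend

def timed(fn, *args):
    start = time.monotonic()
    result = fn(*args)
    return result, time.monotonic() - start

def read_with_shadow(key, redis_read, mysql_read, record_metric):
    # Serve the request from the current backend as usual.
    value, redis_latency = timed(redis_read, key)
    record_metric("redis.read_latency", redis_latency)

    # For a small sample of requests, also exercise the candidate backend,
    # record its latency, and throw the result away. A real implementation
    # would do this off the request path so production latency is unaffected.
    if random.random() < SHADOW_FRACTION:
        _, mysql_latency = timed(mysql_read, key)
        record_metric("mysql.read_latency", mysql_latency)

    return value
```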

We’ve heard before that it’s important to make sure that your alerts are actionable. I like that this article goes into some detail on why we sometimes tend to create inactionable alerts before explaining how to improve your alerting.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Ubuntu backported a security fix into Xenial’s kernel last month, and unfortunately, they introduced a regression. Under certain circumstances, the kernel will give up way too easily when attempting to find memory to satisfy an allocation and will needlessly trigger the OOM killer. A fix was released on February 20th.

Need to tell someone that their perpetual motion machine (er, CAP-satisfying system) won’t work? Low on time? Use this handy checklist to explain why their idea won’t work.

GitLab seriously considered fleeing the cloud for a datacenter, and they asked the community for feedback. That feedback was very useful and was enough to change their minds. The common theme: “you are not an infrastructure company, so why try to be one?”

If you’ve got a firehose of events going into your metrics/log aggregation system, you may need to reduce load on it by only sending in a portion of your events. Do you pick one out of every N? Honeycomb’s makers suggest an interesting alternative: tag each sampled event you send as representing N events from the source, where N is allowed to vary between samples.
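
To make that concrete, here’s a minimal sketch (my own illustration, not Honeycomb’s code) of per-event sampling where each kept event records how many source events it stands in for, so totals can still be estimated downstream:

```python
import random

def sample_rate_for(event: dict) -> int:
    # Choose N per event: keep every error, but only 1 in 100 routine successes.
    return 1 if event.get("status", 200) >= 500 else 100

def maybe_sample(event: dict):
    n = sample_rate_for(event)
    if random.randrange(n) == 0:
        return {**event, "sample_rate": n}  # this event represents n events
    return None  # dropped

def estimated_total(sampled_events) -> int:
    # Reconstruct an approximate count from the weighted sample.
    return sum(e["sample_rate"] for e in sampled_events)
```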

Outages

  • Amazon S3
    • Amazon S3 in the us-east-1 region went down, taking many sites and services down with it, including Trello, Heroku, portions of Slack and GitHub, and tons more. Amazon’s status page had a note at the top but was otherwise green across the board for hours. Meanwhile, nearly 100% of S3 requests failed and many other AWS services burned as well. Their outage summary (linked above) indicated that the outage uncovered a dependency of their status site on S3. Oops. Once they got that fixed a few hours later, they posted something I’ve never seen before: actual red icons. Full disclosure: Heroku is my employer.
  • Joyent: Postmortem for July 27 outage of the Manta service
    • Here’s a deeply technical post-analysis of a PostgreSQL outage that Joyent experienced in 2015. A normally benign automatic maintenance operation (an auto-vacuum) turned into a total DB lockup due to their workload.
  • PagerDuty
  • GoDaddy
    • DDoS attack on their nameservers.

SRE Weekly Issue #61

A fairly large Outages section this week as I experiment with including post-analyses there even for older incidents.

SPONSOR MESSAGE

Interested in ChatOps? Get the free 75 page O’Reilly report covering everything from basic concepts to deployment strategies. http://try.victorops.com/sreweekly/chatops

Articles

Every week, there’s an article with a title like this (just like with “costs of downtime”). Almost every week, they’re total crap, but this one from PagerDuty is a bit better than the rest. The bit that interests me is the assertion that a microservice-based architecture “makes maintenance much easier” and “makes your app more resilient”. Sure it can, but it can also just mean that you trade one problem for 1300 problems.

Coping with that complexity requires a different approach to monitoring and alert management. You need to do much more than treat incident management as a process of responding to alerts in the order they come in or assuming that every alert requires action.

This post explains why a flexible, nuanced approach to alert management is vital, and how to implement it.

HelloFresh live-migrated their infrastructure to an API gateway to facilitate a transition to microservices. They kindly wrote up their experience, which is especially educational because their first roll-out attempt didn’t go as planned.

[…] our first attempt at going live was pretty much a disaster. Even though we had a quite nice plan in place we were definitely not ready to go live at that point.

In this issue, Mathias shows us the benefits of “dogfooding” and cases where it can break down. I like the way the feedback loop is shortened, so that developers feel a painful user experience and have incentive to quickly fix it. It reminds me a lot of the feedback loop you get when developers go on call for the services they write.

A breakdown of four categories of monitoring tools using the “2×2” framework. I like the mapping of “personas” (engineering roles) to the monitoring types they tend to find most useful.

Outages

  • Cloudflare: “Cloudbleed”
    • Cloudflare experienced a minor outage due to mitigating a major leak of private information. They posted this (incredibly!) detailed analysis of the bug and their response to it. Other vendors such as PagerDuty, Monzo, TechDirt, and MaxMind posted responses to the outage. There’s also this handy list of sites using Cloudflare.
  • mailgun
    • Here’s a really interesting postmortem for a Mailgun outage I linked to in January. What apparently started off as a relatively minor outage was significantly exacerbated “due to human error”. The intriguing bit: the “corrective actions” section makes no mention at all of process improvements to make the system more resilient to this kind of error.
  • All Circuits are Busy Now: The 1990 AT&T Long Distance Network Collapse
    • In 1990, the entire AT&T phone network experienced a catastrophic failure, and 50% of all calls failed. The analysis is pretty interesting and shows us that a simple bug can break even an incredibly resilient distributed system.

      the Jan. 1990 incident showed the possibility for all of the modules to go “crazy” at once, how bugs in self-healing software can bring down healthy systems, and the difficulty of detecting obscure load- and time-dependent defects in software.

  • vzaar
    • They usually fork a release branch off of master, test it, and push that out to production. This time, they accidentally pushed master to production. How do I know that? Because they published this excellent post-analysis of the incident just two days after it happened.
  • U.S. Dept. of Homeland Security
    • This article has some vague mention of an expired certificate.
  • YouTube
  • CD Baby
  • Facebook

SRE Weekly Issue #60

Sorry I’m late this week!  My family experienced a low-redundancy event as two grown-ups and one kid (so far) have been laid low by Norovirus.

That said, I’m glad that the delay provided me the opportunity to share this first article so soon after it was published.

SPONSOR MESSAGE

Interested in ChatOps? Get the free 75 page O’Reilly report covering everything from basic concepts to deployment strategies. http://try.victorops.com/sreweekly/chatops

Articles

Susan Fowler’s articles have been featured here several times previously, and she’s one of my all-time favorite authors. Now it seems that while she was busy writing awesome articles and a book, she was also dealing with a terribly toxic and abhorrent environment of sexual harassment and discrimination at Uber. I can only be incredibly thankful that somehow, despite their apparent best efforts, Uber did not manage to drive Susan out of engineering as happens all too often in this kind of scenario.

Even, and perhaps especially, if we think we’re doing a good job preventing the kind of abusive environment Susan described, it’s quite possible we’re just not aware of the problems. Likely, even. This kind of situation is unfortunately incredibly common.

Wow, what a cool idea! GitLab open-sourced their runbooks. Not only are their runbooks well-structured and great as examples, but some of them are also general enough to apply to other companies.

Every line of code has some probability of having an undetected flaw that will be seen in production. Process can affect that probability, but it cannot make it zero. Large diffs contain many lines, and therefore have a high probability of breaking when given real data and real traffic.

Full disclosure: Heroku, my employer, is mentioned.
Thanks to DevOps Weekly for this one.
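
The arithmetic behind that quote about diff size is simple: if each line independently carries some small probability p of a defect, an n-line diff ships at least one defect with probability 1 - (1 - p)^n. A toy calculation (the value of p is made up purely for illustration):

```python
# Illustrative only: pretend each line has an independent 0.1% chance of
# carrying a defect that slips past review and testing.
p = 0.001

def prob_of_defect(lines: int) -> float:
    return 1 - (1 - p) ** lines

print(prob_of_defect(10))    # ~0.01: a 10-line diff is almost always fine
print(prob_of_defect(1000))  # ~0.63: a 1000-line diff probably breaks something
```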

TIL: cgroup memory limits can cause a group of processes to use swap even when the system as a whole is not under memory pressure. Thanks again, Julia Evans!
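
If you want to see this for yourself on a host, the cgroup’s own accounting tells the story. A rough sketch, assuming the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory (paths differ under cgroup v2):

```python
from pathlib import Path

def cgroup_memory_report(group: str) -> dict:
    # Compare a cgroup's usage against its limit and show how much of its
    # memory has been pushed to swap, even when the host has free RAM.
    base = Path("/sys/fs/cgroup/memory") / group
    limit = int((base / "memory.limit_in_bytes").read_text())
    usage = int((base / "memory.usage_in_bytes").read_text())
    stats = dict(
        line.split() for line in (base / "memory.stat").read_text().splitlines()
    )
    return {
        "limit_bytes": limit,
        "usage_bytes": usage,
        "swapped_bytes": int(stats.get("swap", 0)),  # needs swap accounting enabled
    }

# e.g. cgroup_memory_report("my-service")
```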

This week from VictorOps is a nifty primer on structuring your team’s on-call and incident response. I love when a new concept catches my eye like this one:

While much has been said about the importance of keeping after-action analysis blameless, I think it is doubly important to keep escalations blameless. A lone wolf toiling away in solitude makes for a great comic book, but rarely leads to effective resolution of incidents in complex systems.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Open source IoT platform ThingsBoard’s authors share a detailed account of how they diagnosed and fixed reliability and throughput issues in their software so that it could handle 30k incoming events per second.

There’s both theory and practice in this article, which opens with an architecture discussion and then continues into the steps to deploy a first version in a test Azure environment from your workstation.

I don’t often link to new product announcements, but DigitalOcean’s new Load Balancer product caught my attention. It looks to be squarely aimed at improving on Amazon’s ELB product.

Okay, apparently I do link to product announcements often.  Google unveiled a new beta product this week for their Cloud Platform: Cloud Spanner. Based on their Spanner paper from 2012, they have some big claims.

Cloud Spanner is the first and only relational database service that is both strongly consistent and horizontally scalable. […] With automatic scaling, synchronous data replication, and node redundancy, Cloud Spanner delivers up to 99.999% (five 9s) of availability for your mission critical applications.

Outages

  • US National Weather Service
    • The U.S. National Weather Service said on Tuesday it suffered its first-ever outage of its data system during Monday’s blizzard in New England, keeping the agency from sending out forecasts and warnings for more than two hours. [Reuters]

  • The Travis CI Blog: Postmortem for 2017-02-04 container-based Infrastructure issues
    • A garden-variety bug in a newly-deployed version was exacerbated by a failed rollback, in a perfect example of a complex failure with a complex intersection of contributing factors.
  • Instapaper Outage Cause & Recovery
    • Last week, I incorrectly stated that Instapaper’s database hit a performance cliff. In actuality, their RDS instance was, unbeknownst to them, running on an ext3 filesystem with its 2TB-per-file limit. Their only resolution path when they ran out of space was to mysqldump all their data and restore it into a new DB running on ext4.

      Even if we had executed perfectly, from the moment we diagnosed the issue to the moment we had a fully rebuilt database, the total downtime would have been at least 10 hours.

SRE Weekly Issue #59

Much like I did with telecoms, I’ve decided that it’s time to stop posting every MMO game outage that I see go by.  They rarely share useful postmortems and they’re frequently the target of DDoS attacks.  If I see an intriguing one go by though, I’ll be sure to include it.

SPONSOR MESSAGE

Interested in ChatOps? Get the free 75 page O’Reilly report covering everything from basic concepts to deployment strategies. http://try.victorops.com/sreweekly/chatops

Articles

Here’s a great article about burnout in the healthcare sector. There’s mention of second victims (see also Sidney Dekker) and a vicious circle: burnout leads to mistakes, which lead to adverse patient outcomes, which lead to guilt and frustration, which lead back to burnout.

Every week, I find and ignore at least one bland article about the “huge cost of downtime”. They almost never have anything interesting or new to say. This article by PagerDuty takes a different approach that I find refreshing, starting off by defining “downtime” itself.

A frustrated CEO speaks out against AWS’s infamously sanguine approach to posting on their status site.

As mentioned last week, here’s the final, published version of GitLab’s postmortem for their incident at the end of last month.

An ideal environment is one in which you can make mistakes but easily and quickly recover from them with minimal to no impact.

MongoDB contracted Jepsen to test their new replication protocol. Jepsen found some issues, which are fixed, and now MongoDB gets a clean bill of health. Pretty impressive! Even cooler is that the Mongo folks have integrated Jepsen’s tests into their CI.

Outages

  • Instapaper
    • Instapaper hit a performance cliff with their database, and the only path forward was to dump all data and load it into a new, more powerful DB instance.
  • Google Cloud Status Dashboard
    • Google released a postmortem for a network outage at the end of January.
  • OWASA (Orange County, NC, USA water authority)
    • OWASA had to cut off the municipal water supply for 3 days after an accidental overfeed of fluoride into the drinking supply. They engaged in an impressive post-analysis and released a detailed root cause analysis document. It was a pretty interesting read, and I highly recommend clicking through to the PDF and reading it.  There you’ll see that “human error” was a proximal but by no means root cause of the outage, especially since the human in question corrected their error after just 12 seconds.