
SRE Weekly Issue #39

SPONSOR MESSAGE

Got ChatOps? Download the free eBook from O’Reilly Media and VictorOps: http://try.victorops.com/devopsweekly/chatops

Want even more? Meet the author on Sept 8th in a live stream event: http://try.victorops.com/devopsweekly/chatops/livestream

Articles

A+ article! Susan Fowler has been a developer, an ops person, and now an SRE. That means she’s well-qualified to give an opinion on who should be on call, and she says that the answer is developers (in most cases). Bonus content includes “What does SRE become if developers are on call?”

[…] if you are going to be woken up in the middle of the night because a bug you introduced into code caused an outage, you’re going to try your hardest to write the best code you possibly can, and catch every possible bug before it causes an outage.

Thanks to DevOps Weekly for this one.

I figured this new zine from Julia Evans would be mostly review for me. WRONG. I’d never heard of dstat, opensnoop, execsnoop, or perf before, but I sure will be using them now. As far as I can tell, Julia wants to learn literally everything, and better yet, she wants to teach us what she learned and how she learned it. Hats off to her.

“While we’ve got the entire system down to do X, shall we do Y also?”

This article argues that we should never do Y. If something goes wrong, we won’t know whether to roll back X or Y, and it’ll take twice as long to figure out which one is to blame.

This week, Mathias introduces “system blindness”: our flawed understanding of how a system works, combined with our lack of awareness of just how incomplete that understanding is. Whether we realize it or not, we struggle to mentally model the intricate interconnections in the increasingly complex systems we’re building.

There are no side effects, just effects that result from our flawed understanding of the system.

I’ve mentioned Spokes (formerly DGit) here previously. This time, GitHub shares the details on how they designed Spokes for high durability and availability.

TIL: Ruby can suffer from Java-style stop-the-world garbage collection freezes.

Here’s a recap of a talk about Facebook’s “Project Storm”, given by VP Jay Parikh at @Scale. Project Storm involved retrofitting Facebook’s infrastructure to handle the failure of entire datacenters.

“I was having coffee with a colleague just before the first drill. He said, ‘You’re not going to go through with it; you’ve done all the prep work, so you’re done, right?’ I told him, ‘There’s only one way to find out’ [if it works].”

Here’s an interview with Jason Hand of VictorOps about the importance of a blameless culture. He mentions the idea that “Why?” is an inherently blameful kind of question (hat tip to John Allspaw’s “The Infinite Hows”). I have to say that I’m not sure I agree with Jason’s other point that we shouldn’t bother attempting incident prevention, though. Just look at the work the aviation industry has done toward accident prevention.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

SCALE has opened their CFP, and one of the chairs told me that they’d “love to get SRE focused sessions on open-source.”

Outages

  • British Airways
  • FLOW (Jamaica telecom)
  • SSP
    • SSP provides a SaaS for insurance companies to run their business on. They’re dealing with a ten-plus-day outage initially caused by some kind of power issue that fried their SAN. As a result, they’re going to decommission the datacenter in question.
  • Heroku
    • Full disclosure: Heroku is my employer.
  • Azure
    • Two EU regions went down simultaneously.
  • Overwatch (game)
  • Asana
    • Linked is a postmortem with an interesting set of root causes. A release went out that increased CPU usage, but it didn’t cause issues until peak traffic the next day. Asana is brave for enabling comments on their postmortem — not sure I’d have the stomach for that. Thanks to an anonymous contributor for this one.
  • ESPN’s fantasy football
    • Unfortunate timing, being down on opening day.

SRE Weekly Issue #38

Welcome to the many new subscribers that joined in the past week. I’m not sure who I have to thank for the sudden surge, but whoever you are, thanks!

SPONSOR MESSAGE

Got ChatOps? Download the free eBook from O’Reilly Media and VictorOps: http://try.victorops.com/devopsweekly/chatops

Want even more? Meet the author on Sept 8th in a live stream event: http://try.victorops.com/devopsweekly/chatops/livestream

Articles

What can the fire service learn about safety from the aviation industry? A 29-year veteran in the fire service answers that question in detail. We could in turn apply all of those lessons to operating complex infrastructures.

I’m surprised that I haven’t come across the term “High Reliability Organization” before reading the previous article. Here’s Wikipedia’s article on HROs.

A high reliability organization (HRO) is an organization that has succeeded in avoiding catastrophes in an environment where normal accidents can be expected due to risk factors and complexity.

Etsy instruments their deployment system to strike a vertical line on their Graphite graphs for every deploy. This helps them quickly figure out whether a deploy happened right before a key metric took a turn for the worse.
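
The same trick is easy to reproduce. Here’s a minimal sketch (not Etsy’s actual tooling) that assumes a Carbon plaintext listener on localhost:2003 and a made-up metric name, deploys.myapp; each deploy records one datapoint, which Graphite’s drawAsInfinite() renders as a vertical line on a graph.

    import socket
    import time

    CARBON_HOST = "localhost"  # assumption: Carbon's plaintext listener runs here
    CARBON_PORT = 2003         # default plaintext protocol port

    def mark_deploy(metric="deploys.myapp"):
        """Record one datapoint in Graphite to mark a deploy."""
        line = f"{metric} 1 {int(time.time())}\n"
        with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
            sock.sendall(line.encode("ascii"))

    if __name__ == "__main__":
        mark_deploy()
        # On a dashboard graph, render the marker as a vertical line with:
        #   target=drawAsInfinite(deploys.myapp)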

A really interesting dive into the world of subsea network cables and the impact that cuts can have on businesses worldwide.

How closely can you really mimic production in your testing environments? In a way we’re all testing in production, and this article talks about getting that fact out in the open.

I wrote this article on my terrible little blog back in 2008 — forgive the horrid theme and apparently broken unicode support. This was well before I worked in Linden Lab’s Ops team, back when I was making a living as a user selling content in Second Life. What’s interesting to me in reading this article 8 years later is the user perspective on the impact of the string of bad outages, and especially Linden’s poor communication during outages.

More on the impact of Delta Air Lines’ major outage last month.

Most often a catastrophic failure is not due to a lack of standards, but a breakdown or circumvention of established procedures that compounded into a disastrous outcome. Multilayer complex systems outages signify management failure to drive change and improvement.

Outages

SRE Weekly Issue #37

SPONSOR MESSAGE

Frustrated by the lack of tools available to automate incident response? Learn how ChatOps can help manage your operations through group chat in the latest book from O’Reilly. Get your copy here: http://try.victorops.com/l/44432/2016-08-19/f2xt33

Articles

Sometimes I follow chains of references from article to article until I find a new author to follow, and this time it’s Kelly Sommers. In this gem, she debunks the rarity of network partitions by recasting them as availability partitions. If half of your nodes aren’t responding because their CPUs are pegged, you still have a network partition.

most partitions I’ve experienced have nothing to do with network infrastructure failures

Two engineers from MMO company DIGIT gave this short, nicely detailed interview in which they outline how they achieve HA on AWS.

Here’s a recording of the DevOps/SRE AMA from a couple weeks back, in case you missed it.

A blog post by Skyline, which is launching its new deployment-as-a-service offering. The intro is pretty great, outlining the inherent risks in changing code and releasing new code into production.

Other online schema-change tools I’m familiar with (e.g. pt-online-schema-change) use triggers to keep a new table in sync with changes while copying old rows over. Instead, gh-ost monitors changes by connecting to MySQL as a replication slave and reading the binary log. Very clever! This article goes into several reasons why this is a much better approach.
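
gh-ost itself is written in Go, so the following isn’t its code; it’s just a minimal Python sketch of the “connect as a replica and stream the binlog” idea, using the python-mysql-replication library. The connection settings, schema, and table names here are made up for illustration.

    from pymysqlreplication import BinLogStreamReader
    from pymysqlreplication.row_event import (
        DeleteRowsEvent,
        UpdateRowsEvent,
        WriteRowsEvent,
    )

    # Hypothetical connection settings.
    MYSQL = {"host": "127.0.0.1", "port": 3306, "user": "repl", "passwd": "secret"}

    # Pose as a replica and stream row changes for the table being migrated,
    # instead of installing triggers on it.
    stream = BinLogStreamReader(
        connection_settings=MYSQL,
        server_id=100,  # must be unique among the server's replicas
        only_schemas=["app"],
        only_tables=["users"],  # the table under migration
        only_events=[DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent],
        blocking=True,
        resume_stream=True,
    )

    for event in stream:
        for row in event.rows:
            # A real migration tool would apply each change to the shadow
            # ("ghost") copy of the table here.
            print(type(event).__name__, row)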

Outages

SRE Weekly Issue #36

Last week’s DevOps & SRE AMA was super fun! Thanks to the panelists for participating. Recordings should be posted soon.

SPONSOR MESSAGE

Frustrated by the lack of tools available to automate incident response? Learn how ChatOps can help manage your operations through group chat in the latest book from O’Reilly. Get your copy here: http://try.victorops.com/l/44432/2016-08-19/f2xt33

Articles

This is the second half of Server Density’s series on the lessons they learned as they transitioned to a multi-datacenter architecture. There are lots of interesting tidbits in here, such as an explanation of how they handle failover to the secondary DC and what they do if that goes wrong.

Full disclosure: Heroku, my employer, is mentioned.

Here’s the second half of Mathias Lafeldt’s series that seeks to apply Richard Cook’s How Complex Systems Fail to web systems. The article is great, but the really awesome part is the thoughtful responses by Cook himself to both parts one and two, linked at the end of this article.

Here’s a postmortem for last week’s outage that involved a migration gone awry.

Thanks to Jonathan Rudenberg for this one.

A patent-holding firm alleges that the USPTO overstepped its authority when it declared that a system outage (reported in issue #4) would be treated as a national holiday for purposes of deadlines, and that this led to the plaintiff being sued.

Burnout is a crucially important consideration in a field with on-call work. VictorOps has a few tips for alleviating burnout gleaned from this year’s Monitorama.

Edith Harbaugh says that staging servers present a reliability risk that doesn’t outweigh their benefit. This article is an update to her original article, which I also recommend reading.

GitHub uses HAProxy to balance across its read-only MySQL replicas, which is a common approach. Their technique for excluding lagging replicas, while avoiding emptying the pool entirely if all of them are lagging, is pretty neat (see the sketch below).

Thanks to DevOps Weekly for this one.
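
As a rough sketch of the general pattern (not GitHub’s actual setup), here’s a hypothetical health-check endpoint that could run alongside each replica and be polled via HAProxy’s “option httpchk”: it returns 200 while replication lag is under a made-up threshold and 503 otherwise. Keeping the pool from emptying when every replica lags would be handled in the HAProxy configuration itself (e.g. backup servers), which isn’t shown here.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    import pymysql          # assumption: PyMySQL is installed on each replica host
    import pymysql.cursors

    MAX_LAG_SECONDS = 5  # hypothetical threshold

    def replica_lag_seconds():
        """Return this replica's lag in seconds, or None if replication is broken."""
        conn = pymysql.connect(host="127.0.0.1", user="haproxy_check",
                               password="secret",
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                status = cur.fetchone()
                return status and status["Seconds_Behind_Master"]
        finally:
            conn.close()

    class LagCheck(BaseHTTPRequestHandler):
        def do_GET(self):
            lag = replica_lag_seconds()
            healthy = lag is not None and lag <= MAX_LAG_SECONDS
            self.send_response(200 if healthy else 503)
            self.end_headers()
            self.wfile.write(f"lag={lag}\n".encode())

    if __name__ == "__main__":
        # HAProxy would point its HTTP check at this port on each replica.
        HTTPServer(("0.0.0.0", 8080), LagCheck).serve_forever()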

A highly detailed deep-dive on Serverless — what it means, benefits, and drawbacks. I especially enjoyed the #NoOps section:

[Ops] also means at least monitoring, deployment, security, networking and often also means some amount of production debugging and system scaling. These problems all still exist with Serverless apps and you’re still going to need a strategy to deal with them. In some ways Ops is harder in a Serverless world because a lot of this is so new.

#ServerlessIsMadeOfServers

Full disclosure: Heroku, my employer, is mentioned.

Outages

SRE Weekly Issue #35

SPONSOR MESSAGE

What is Modern Incident Management? Download the Incident Management Buyer’s Guide to learn all about it, and the value it provides. Get your copy here: http://try.victorops.com/l/44432/2016-08-04/dwp8lc

Articles

Whoops, here’s one that got lost in my review queue. Charity Majors (one of the usual suspects here at SRE Weekly) wrote one of her characteristically detailed and experience-filled posts on how to isolate your production, staging, and development environments in AWS.

I can’t quite tell how much of this is John Allspaw’s writing and how much is written by the US Forest Service, but I love it all. Here’s a bulleted list of points driving home the fact that we constantly strike a balance between risk and safety.

Server Density added multi-datacenter redundancy to their infrastructure in 2013, and they were kind enough to document what they learned. In this first of two articles, they outline different kinds of multi-datacenter setups and go over the kinds of things you’ll have to think about as you retrofit your application.

This short opinion piece raises an excellent idea: SLAs aren’t for recouping the cost you incurred due to an outage; they’re for making a point to a service provider about the outage.

Southwest has released some numbers on the impact of last month’s outage that resulted in thousands of cancelled flights.

Netflix gives us a rundown of how they prepare a title for release by pre-filling caches in their in-house CDN. I like the part about timing pre-filling during off-peak hours to avoid impacting the service.

How much is your company willing to invest for a truly effective DR solution? This article asks that question and along the way digs into what an effective DR solution looks like and why it costs so much.

Outages

  • Syria
    • The Syrian government shut internet access down to prevent cheating on school exams.

  • Mailgun
    • Linked is a really interesting postmortem: Mailgun experienced an outage when their domain registrar abruptly placed their domain on hold. The registrar was subsequently largely uncommunicative, hampering incident resolution. Lesson learned: make sure you can trust your registrar, because they have the power to ruin your day.

  • Belnet
    • The linked article has some intriguing detail about a network equipment failure that caused a routing loop.

  • Australia’s census website
    • This caught my eye:

      Revolution IT simulated an average sustained peak of up to 350 submissions per second, but only expected up to 250 submissions per second.

      Load testing only 40% above expected peak demand? That seems like a big red flag to me.

  • Reddit
  • Etisalat (UAE ISP)
  • Vodafone
  • Google Drive
  • AT&T
  • Delta Air Lines
    • A datacenter power system failure resulted in cancelled flights worldwide.
