
SRE Weekly Issue #29

Articles

I can’t summarize this awesome article well enough, so I’m just going to quote Charity a bunch:

the outcomes associated with operations (reliability, scalability, operability) are the responsibility of *everyone* from support to CEO.

if you have a candidate come in and they’re a jerk to your office manager or your cleaning person, don’t fucking hire that person because having jerks on your team is an operational risk

If you try and just apply Google SRE principles to your own org according to their prescriptive model, you’re gonna be in for a really, really bad time.

Traffic spikes can be incredibly difficult to handle, foreseen or not. Packagecloud.io details its efforts to survive a daily spike of 600% of normal traffic in March.

This checklist is aimed toward deployment on Azure, but a lot of the items could be generalized and applied to infrastructures deployed elsewhere.

In-depth detail surrounding the multiple failures of TNReady mentioned earlier this year (issues #10 and #20).

A two-sided debate, both sides of which are Gareth Rushgrove (maintainer of the excellent Devops Weekly). Should we try to adopt Google’s way of doing things in our own infrastructures? For example, error budgets:

What if you’re operating an air traffic control system or a nuclear power station? Your goal is probably closer to zero outages
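To make the error-budget side of that debate concrete, here's a quick sketch of the arithmetic (my own illustration, not from the article): an availability target implies a fixed allowance of downtime, and that allowance shrinks fast as you add nines.

```python
# Quick illustration (mine, not the article's): how an availability target
# translates into an error budget of allowable downtime per 30-day month.
def error_budget_minutes(availability: float, period_days: int = 30) -> float:
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability)

for target in (0.999, 0.9999, 0.99999):
    print(f"{target} -> {error_budget_minutes(target):.2f} minutes/month")
# 0.999   -> 43.20 minutes/month
# 0.9999  ->  4.32 minutes/month
# 0.99999 ->  0.43 minutes/month (about 26 seconds)
```

With a goal that close to zero outages, there's essentially no budget left to spend on risky change, which is exactly the tension the quote is pointing at.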

Outages

SRE Weekly Issue #28

A more packed issue this week to make up for missing last week. This issue is coming to you from beautiful Cape Cod, where my family and I are vacationing for the week.

Articles

In April, Google Compute Engine suffered a major outage that was reported here. I wrote up this review for the Operations Incident Board’s Postmortem Report Reviews project.

Migration of a service without downtime can be an incredibly challenging engineering feat. Netflix details its effort to migrate its billing system, complete with tens of terabytes of RDBMS data, into EC2.

Our primary goal was to define a secure, resilient and granular path for migration to the Cloud, without impacting the member experience.

Ransomware is designed to really ruin your day. It not only corrupts your in-house data; it also tries to encrypt your backup. Even if it’s off-site. Does your backup/recovery strategy stand up to this kind of failure?

VictorOps gives us this shiny, number-filled PDF that you can use as ammunition to convince execs that downtime really matters.

Students of Holberton School’s full-stack engineer curriculum are on-call and actually get paged in the middle of the night. Nifty idea. Why should on-call training only happen on the job?

I think the rumble strip is a near-perfect safeguard.

That’s Pre-Accident Podcast’s Todd Conklin on rumble strips, the warning tracks on the sides of highways. This short (4-minute) podcast asks: can we apply the principles behind rumble strips to our infrastructures?

The FCC adds undersea cable operators to the list of mandatory reporters to the NORS (Network Outage Reporting System). But companies such as AT&T claim that the reporting will be of limited value, since outages that have no end-user impact (due to redundant undersea links) must still be reported.

Microsoft updated its article on designing highly available apps using Azure. These kinds of articles are important. In theory, no one ought to go down just because one EC2 or Azure region goes down.
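To make the "don't depend on a single region" point a bit more concrete, here's a minimal sketch of one small piece of it: failing a read over to a second region's endpoint. The URLs and timeout are invented, and this isn't Microsoft's guidance, just the general shape of the idea.

```python
# Hypothetical sketch of region failover on a read path: try the primary
# region's endpoint, fall back to the secondary if it errors or times out.
# The URLs and timeout are invented for illustration.
import requests

REGION_ENDPOINTS = [
    "https://api-east.example.com/status",   # primary (hypothetical)
    "https://api-west.example.com/status",   # secondary (hypothetical)
]

def fetch_status(timeout_s: float = 2.0) -> dict:
    last_error = None
    for url in REGION_ENDPOINTS:
        try:
            resp = requests.get(url, timeout=timeout_s)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            last_error = exc  # move on to the next region
    raise RuntimeError(f"all regions failed, last error: {last_error}")
```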

SignalFX published this four-part series on avoiding spurious alerts in metric-based monitoring systems. The tutorial bits are specific to SignalFX, but the general principles could be applied to any metric-based alerting system.

Thanks to Aneel at SignalFX for this one.
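One general technique for cutting down on spurious alerts, whether or not it's the one the series leans on, is to require a threshold to hold for several consecutive samples before paging. Here's a tool-agnostic sketch (mine, not SignalFX's):

```python
# Tool-agnostic sketch (mine, not from the series): only fire an alert when a
# metric stays above its threshold for N consecutive samples, which suppresses
# the one-sample spikes that would otherwise page someone at 3am.
from collections import deque

class SustainedThresholdAlert:
    def __init__(self, threshold: float, required_samples: int):
        self.threshold = threshold
        self.window = deque(maxlen=required_samples)

    def observe(self, value: float) -> bool:
        """Record a sample; return True only when the window is full and
        every sample in it exceeds the threshold."""
        self.window.append(value)
        return (len(self.window) == self.window.maxlen
                and all(v > self.threshold for v in self.window))

# Example: alert on CPU > 90% only if it persists for 5 samples in a row.
alert = SustainedThresholdAlert(threshold=90.0, required_samples=5)
for cpu in [95, 96, 40, 95, 96, 97, 98, 99]:
    if alert.observe(cpu):
        print("page on-call: CPU has been high for 5 consecutive samples")
```

The trade-off, of course, is added detection latency: the longer the required duration, the later you find out about a real problem.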

Outages

SRE Weekly Issue #27

Sorry I’m a tad late this week!

If you only have time to read one long article, I highly recommend this first one.

Articles

This fascinating series delves deeply into the cascade of failures leading up to the nearly fatal overdose of a pediatric patient hospitalized for a routine colonoscopy. It’s a five-article series, and it’s well worth every minute you’ll spend reading it. Human error, interface design, misplaced trust in automation, learning from aviation; it’s all here in abundance and depth.

In this second part of a two-part series (featured here last week), Charity Majors delves into what operations means as we move toward a “serverless” infrastructure.

If you chose a provider, you do not get to just point your finger at them in the post mortem and say it’s their fault. You chose them, it’s on you.

Interesting, though I have to say I’m a bit skeptical when I hear someone target six nines. Especially when they say this:

“Redefining five nines is redefining them to go up to six nines,” said James Feger, CenturyLink’s vice president of network strategy and development […]

The Pre-Accident Podcast reminds us that incident response is just as important as incident prevention.

As automated remediation increases, the problems that actually hit our pagers become more complex and higher-level. This opinion piece from PagerDuty explores that trend and where it’s leading us.

A high-level overview of the difference between HA and DR and Netflix’s HA testing tool, Chaos Monkey.
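For anyone who hasn't seen Chaos Monkey, the core idea fits in a few lines. This is a stripped-down sketch of the concept, not Netflix's implementation, and both helper functions are placeholders:

```python
# Stripped-down sketch of the Chaos Monkey idea, not Netflix's code: pick a
# random instance from a group and terminate it, so you learn during business
# hours whether the group really tolerates losing a member. Both helpers are
# placeholders for real cloud-provider calls.
import random

def list_instances(group: str) -> list[str]:
    """Placeholder: return the instance IDs in the given group."""
    return ["i-0a1", "i-0b2", "i-0c3"]

def terminate(instance_id: str) -> None:
    """Placeholder: call the provider's terminate API."""
    print(f"terminating {instance_id}")

def chaos_experiment(group: str, probability: float = 0.25) -> None:
    instances = list_instances(group)
    if instances and random.random() < probability:
        terminate(random.choice(instances))

chaos_experiment("web-frontend")
```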

Outages

SRE Weekly Issue #26

Articles

Here’s Charity Majors being awesome as always. There’s a reason this article is first this week. In this first of two articles, Charity recaps her recent talk at serverlessconf, in which she argues that you can never get away from operations, no matter how “serverless” you go.

[…] no matter how pretty the abstractions are, you’re still dealing with dusty old concepts like “persistent state” and “queries” and “unavailability” and so forth […]

I’m still laughing about #NoDevs. Thought-leadering through trolling FTW.

This is an older article (2011), but it’s still well worth reading. Facebook began automating remediation of standard hardware failure, and then they reinvested the time saved into improving the automation.

Today, the FBAR service is run by two full-time engineers, but according to the most recent metrics, it’s doing the work of 200 full-time system administrators.
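FBAR itself isn't public, but the pattern the article describes (detect a standard failure, try the scripted fix, and only involve a human when that fails) looks roughly like this sketch, where every function is a hypothetical placeholder:

```python
# Rough sketch of the detect -> remediate -> escalate loop the article
# describes. FBAR's internals aren't public; every function below is a
# hypothetical placeholder for a site-specific implementation.
import time

def find_failed_hosts() -> list[str]:
    """Hypothetical: query monitoring for hosts failing a standard check."""
    return []

def attempt_remediation(host: str) -> bool:
    """Hypothetical: run the scripted fix (power cycle, reimage, etc.)."""
    return False

def escalate_to_human(host: str) -> None:
    """Hypothetical: file a ticket or page someone for the unusual cases."""
    print(f"escalating {host} to a human")

def remediation_loop(poll_interval_s: int = 60) -> None:
    while True:
        for host in find_failed_hosts():
            if not attempt_remediation(host):
                escalate_to_human(host)
        time.sleep(poll_interval_s)
```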

A system that doesn’t auto-scale to meet demand can be unreliable in the face of demand spikes. But auto-scaling adds complexity to a system, and increasing complexity can also decrease reliability. This article outlines a method to attempt to reason about auto-scaling based on multiple metrics. Bonus TIL: Erlang threads busy-wait for work.
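I won't reproduce the article's method here, but purely as an illustration of reasoning from more than one metric, here's a sketch where a scale-up only happens when a majority of signals agree. The metric names and thresholds are invented:

```python
# Illustration only (not the article's algorithm): scale up only when several
# independent signals agree, to avoid flapping on a single noisy metric.
# Metric names and thresholds are invented for the example.
from dataclasses import dataclass

@dataclass
class Metrics:
    cpu_utilization: float      # 0.0 - 1.0
    request_queue_depth: int    # pending requests
    p95_latency_ms: float       # 95th percentile latency

def desired_replicas(current: int, m: Metrics) -> int:
    signals = [
        m.cpu_utilization > 0.80,
        m.request_queue_depth > 100,
        m.p95_latency_ms > 500,
    ]
    if sum(signals) >= 2:                   # at least two signals agree: scale up
        return current + 1
    if not any(signals) and current > 1:    # everything quiet: scale down
        return current - 1
    return current                          # mixed signals: hold steady

print(desired_replicas(3, Metrics(0.9, 250, 620)))  # -> 4
print(desired_replicas(3, Metrics(0.2, 5, 80)))     # -> 2
```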

A run-down of basic techniques for avoiding and dealing with human error. I like this article for a couple of choice quotes, such as: “human error scales up” — as your infrastructure grows bigger, the scope of potential damage from a single error also grows bigger.

The latest in Mathias Lafeldt’s Production Ready series is this article about complexity.

The more complex a system, the more difficult it is to build a mental model of the system, and the harder it becomes to operate and debug it.

Outages

SRE Weekly Issue #25

Articles

This blows my mind. Chef held a live, public retrospective meeting for a recent production incident. I love this idea and I can only hope that more companies follow suit. The transparency is great, but more than that is their sharing of their retrospective process itself. They have a well-defined format for retrospectives including a statement of blamelessness at the beginning. Kudos to Chef for this, and thanks to Nell Shamrell-Harrington for posting the link on Hangops.

The actual incident was fairly interesting too. The crux of it comes down to this quote that we’ve probably all uttered ourselves at one point or another:

The further distant staging is from production, the more likely we are to introduce a bug.

PagerDuty has this explanation of alert fatigue and some tips on preventing it. One thing they missed in their list of impacts of alert fatigue: employee attrition, which directly impacts reliability.

For the network-heads out there, here’s an article on how to set up Anycast routing.

As we become more dependent on our mobile phones, the FCC is gathering information on provider outages. I, for one, wouldn’t be able to call 911 (emergency services) if AT&T had an outage, because I don’t have a land line.

I love this article if only for its title. It’s short, but its thesis bears considering: all the procedure documentation in the world won’t help you if you can’t find it during an incident, or it can’t practically be followed.

The only procedure that is worth a damn is one that has been successfully followed in the heat of battle.

So when legacy vendors suggest that the Salesforce outage calls cloud into question, they tend to ignore the fact that their own systems suffer regular outages. They just rely on the fact that few people know about them.

Full disclosure: Salesforce (parent company of my employer, Heroku) is mentioned.

An introduction to the application of formal mathematical verification to network configurations. A good overview, but I wish it went into more practical detail.

[…] a software application designer might want to know that her code will never crash or that it will never execute certain functions without an authorized login. These are simple, practical questions – but answering them is computationally challenging because of the enormous number of possible ways code may be executed, […]
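The article stays high level, so here's a toy of my own to show the kind of question a verifier answers exhaustively: can anything entering at the internet edge reach the database host? Real tools answer this symbolically over actual device configs; the topology below is invented.

```python
# Toy illustration of the kind of property a network verifier checks:
# "no packet entering at the internet edge can reach the database host."
# Real tools do this symbolically over device configs; here we just walk a
# tiny, invented forwarding graph exhaustively.
FORWARDING = {                       # node -> nodes it can forward to
    "internet": ["edge-fw"],
    "edge-fw": ["web-1", "web-2"],
    "web-1": ["app-1"],
    "web-2": ["app-1"],
    "app-1": ["db-1"],
}

def reachable(src: str) -> set[str]:
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(FORWARDING.get(node, []))
    return seen

violations = {"db-1"} & reachable("internet")
if violations:
    print(f"policy violation: internet can reach {sorted(violations)}")
```

In this made-up topology the check fails (app-1 forwards to db-1), which is exactly the kind of path a verifier surfaces automatically instead of waiting for an attacker or an outage to find it.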

Earlier this year, I featured a story about Pinboard.in and IFTTT. IFTTT released this official apology and explanation of the problems Pinboard.in’s author outlined, and they (unofficially) promised to retain support through the end of 2016. Pinboard.in is an integral part of how I produce SRE Weekly every week, so I’m glad to see that this turned out for the best.

This article is more on the theoretical side than practical, and it’s a really interesting read. It’s the second in a series, but I recommend reading both at once (or skipping the first).

A fault-tolerant system is one in which the unanticipated actions of a subcomponent do not bubble out as unanticipated behavior from the system as a whole.
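In code terms, that quote suggests putting a boundary around each subcomponent that translates whatever unanticipated failure it produces into one of the few outcomes the rest of the system has been promised. A minimal sketch (my framing, not the article's), with an invented recommendation-service example:

```python
# Minimal sketch of the idea in the quote: wrap a subcomponent behind a
# boundary so that however it fails, callers only ever see the small set of
# outcomes the system as a whole has promised. The recommendation-service
# example is invented for illustration.
def fetch_recommendations(user_id: str) -> list[str]:
    """Hypothetical flaky subcomponent: may time out, crash, or return junk."""
    raise TimeoutError("backend took too long")

DEFAULT_RECOMMENDATIONS = ["popular-item-1", "popular-item-2"]

def recommendations_with_boundary(user_id: str) -> list[str]:
    try:
        result = fetch_recommendations(user_id)
        # Validate the shape so junk data can't bubble out either.
        if isinstance(result, list) and all(isinstance(x, str) for x in result):
            return result
    except Exception:
        pass  # any unanticipated failure is absorbed at the boundary
    return DEFAULT_RECOMMENDATIONS  # an anticipated, documented fallback

print(recommendations_with_boundary("user-42"))  # -> the fallback list
```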

Outages

  • Twitter
  • NS1
    • NS1’s CEO posted this incredibly detailed and excellent postmortem on the sophisticated DDoS attacks they suffered.

  • Pirate Bay
  • WhatsApp
  • Virginia (US state) government network
  • Walmart MoneyCard
  • Telstra
    • Telstra has had a hell of a time this year. This week social media and news were on fire with this days-long Telstra outage. This time, they’re offering customers a $25 credit instead of a free data day. Click through for Telstra’s explanation of what went wrong.

  • GitLab
    • Linked is their post-incident analysis.

  • Kimbia (May 3)
    • A couple weeks ago, Kimbia, a company that helps non-profits raise funds, suffered a massive failure. This occurred during Give Local America, a huge fundraising day for thousands of non-profits in the US, with the result that many organizations had a hard time accepting donations.
