General

SRE Weekly Issue #31

Huge thanks to SRE Weekly’s new sponsor, VictorOps!

SPONSOR MESSAGE

Downtime is expensive — in more ways than one. Learn the costs of downtime and how to minimize them in the new eBook from VictorOps, “Making the Case for Real-Time Incident Management.”

Articles

Opzzz is a new app that graphs sleep data (from a Fitbit) against pager alerts (from PagerDuty or Server Density). I love this idea!

By correlating sleep data with on call incidents, we can then illustrate the human cost of on-call work.
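Just to illustrate the kind of correlation an app like this performs (a toy sketch of my own in Python, not Opzzz’s code, with made-up data), you can count how many alerts land inside recorded sleep windows:

```python
from datetime import datetime

# Hypothetical exports: sleep windows from a Fitbit dump and alert
# timestamps from a PagerDuty/Server Density export (all times UTC).
sleep_windows = [
    (datetime(2016, 7, 1, 23, 30), datetime(2016, 7, 2, 7, 0)),
    (datetime(2016, 7, 2, 23, 45), datetime(2016, 7, 3, 6, 30)),
]
alerts = [
    datetime(2016, 7, 2, 3, 12),   # woke the responder
    datetime(2016, 7, 2, 14, 5),   # daytime alert
    datetime(2016, 7, 3, 2, 40),   # woke the responder again
]

# Count the alerts that interrupted sleep: the "human cost" signal.
interrupted = sum(
    1 for alert in alerts
    if any(start <= alert <= end for start, end in sleep_windows)
)
print(f"{interrupted} of {len(alerts)} alerts landed during sleep")
```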

Speaking of measuring sleep data against pages, Etsy is doing that too with their open source on-call analysis tool Opsweekly. Engineers also classify their events based on whether they were actionable.

We’ve been doing this for a year and we are seeing an increasingly improving signal to noise ratio.
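The signal-to-noise idea is easy to track even without Opsweekly. Here’s a rough sketch (mine, not Etsy’s, using made-up data) that computes the ratio from a week of hand-classified alerts:

```python
from collections import Counter

# Hypothetical week of classified alerts: True = actionable, False = noise.
classified_alerts = [True, False, False, True, False, True, False, False]

counts = Counter(classified_alerts)
actionable, noise = counts[True], counts[False]
ratio = actionable / noise if noise else float("inf")
print(f"actionable={actionable} noise={noise} signal:noise={ratio:.2f}")
```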

Slides from a talk on a really important topic. There are some great resource links included.

I’m a firm believer in work/life balance, especially as it pertains to on-call. I have a reputation for rigidly defending my personal time and that of my co-workers. I strongly feel that this is the best thing I can do for my company because exhaustion and burnout are huge reliability risks. Read this article if you’re trying to figure out how to improve your on-call experience and aren’t sure how to start.

FBAR, Facebook’s Auto-Remediation system, was mentioned here last month. This week, they posted an update explaining AMH, their system for safely handling maintenance of blocks of servers.

Pingdom released this set of short postmortems for last week’s series of outages.

A really detailed article about how one company got Docker into production safely and reliably. I especially love the parts about nginx cutover (when deploying new container versions) and supervising running containers. With the common refrain that Docker isn’t ready for production, it’s nice to see how GoCardless did it — but it’s also interesting to see how much tooling they felt compelled to write in-house.
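The post explains GoCardless’s approach far better than I can, but as a rough sketch of the general blue/green cutover idea (my own illustration; the config path and helper are hypothetical, not their tooling), a deploy step might rewrite an nginx upstream to point at the newly started container and then reload:

```python
import subprocess

NGINX_UPSTREAM_CONF = "/etc/nginx/conf.d/app_upstream.conf"  # hypothetical path

def cut_over(new_container_port: int) -> None:
    """Point nginx at the freshly started container, then reload gracefully."""
    conf = (
        "upstream app {\n"
        f"    server 127.0.0.1:{new_container_port};\n"
        "}\n"
    )
    with open(NGINX_UPSTREAM_CONF, "w") as f:
        f.write(conf)
    # Validate the config before reloading so a bad deploy can't take nginx down.
    subprocess.run(["nginx", "-t"], check=True)
    subprocess.run(["nginx", "-s", "reload"], check=True)

# Example: the new container version is listening on port 8081.
# cut_over(8081)
```

A graceful reload lets old workers finish their in-flight connections while new requests go to the new container, which is what makes a cutover like this zero-downtime.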

What good is an arbitrary number of nines from a cloud service provider if their transit links go down? Or if vast swathes of end-users can’t reach your site due to a major internet disruption? ServiceNow’s vice president argues that cloud providers must pay attention to “real availability” and partner with their customers to deal with external threats to availability.

Last month, Bitfinex (a bitcoin exchange) experienced multiple outages, and the subsequent bitcoin sell-off caused the price of bitcoin to drop 7.5%. Bitcoin’s lack of regulation is a blessing, but is it also a curse?

How can I even intro a gem like this? John Allspaw’s essay on blameless and just culture at Etsy is a classic, and it’s a great read even if you’re well-versed in the topic. I especially liked the concept of the “Second Story”.

Outages

SRE Weekly Issue #30

Articles

How did I not know about HumanOps before now?? Their site is great, as is their manifesto. A large part of what I do at $JOB is to study and improve the human aspects of operations.

The wellbeing of human operators impacts the reliability of systems.

Slides from Charity Majors’s talk at HumanOps. Some choice tidbits in there, and I can’t wait until they post the audio.

Here’s a description of how Server Density handles their on-call duties. They use a hybrid approach with some alerts going to devs and some handled by a dedicated ops team. This idea is really intriguing to me:

After an out-of-hours alert the responder gets the following 24 hours off from on-call. This helps with the social/health implications of being woken up multiple nights in a row.
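Purely as a thought experiment (this is not Server Density’s implementation; the names and data are made up), the rule is simple enough to encode in an escalation script:

```python
from datetime import datetime, timedelta

# Hypothetical record of each responder's last out-of-hours page (UTC).
last_out_of_hours_page = {
    "alice": datetime(2016, 7, 3, 3, 15),
    "bob": datetime(2016, 6, 28, 2, 40),
    "carol": None,
}

def next_responder(rotation, now):
    """Return the first person in the rotation who wasn't paged out of
    hours within the previous 24 hours."""
    for person in rotation:
        last = last_out_of_hours_page.get(person)
        if last is None or now - last > timedelta(hours=24):
            return person
    return rotation[0]  # everyone was up last night; fall back to the schedule

# Alice was paged at 03:15 today, so she gets the day off and Bob is next.
print(next_responder(["alice", "bob", "carol"], datetime(2016, 7, 3, 22, 0)))
```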

This article is written by Netflix’s integration testing team, which is obviously not their SRE team. Nevertheless, integration testing at Netflix is important to ensure that new features start out working reliably and stay working after they’re out.

The pitfall discussed in this article is a lack of packet-level visibility that hampers operators’ ability to quickly diagnose network issues. The article starts by outlining the issue, then discusses methods of mitigating it, including Tap as a Service.

This article makes the case for out-of-band management (OOBM) tools in responding to network issues. It’s a good review, especially for those who have experience primarily or solely with cloud infrastructure.

Now there’s an inflammatory article title — it reeks of the NoOps debate. I would argue that a microservice architecture makes an RCA just as necessary, if not more so.

Former Slideshare engineer Sylvain Kalache shares this war story about DevOps gone awry. I’d say there’s a third takeaway not listed in the article: DevOps need not mean full access to the entire infrastructure for everyone.

Outages

SRE Weekly Issue #29

Articles

I can’t summarize this awesome article well enough, so I’m just going to quote Charity a bunch:

the outcomes associated with operations (reliability, scalability, operability) are the responsibility of *everyone* from support to CEO.

if you have a candidate come in and they’re a jerk to your office manager or your cleaning person, don’t fucking hire that person because having jerks on your team is an operational risk

If you try and just apply Google SRE principles to your own org according to their prescriptive model, you’re gonna be in for a really, really bad time.

Traffic spikes can be incredibly difficult to handle, foreseen or not. Packagecloud.io details its efforts to survive a daily spike of 600% of normal traffic in March.

This checklist is aimed toward deployment on Azure, but a lot of the items could be generalized and applied to infrastructures deployed elsewhere.

In-depth detail surrounding the multiple failures of TNReady mentioned earlier this year (issues #10 and #20).

A two-sided debate, both sides of which are argued by Gareth Rushgrove (maintainer of the excellent Devops Weekly). Should we try to adopt Google’s way of doing things in our own infrastructures? For example, error budgets:

What if you’re operating an air traffic control system or a nuclear power station? Your goal is probably closer to zero outages
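For reference, the error budget arithmetic behind that question is simple. Here’s a quick sketch of how much downtime a few availability targets leave you per 30-day window:

```python
# Downtime allowed per 30-day window at a few common availability targets.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes

for target in (0.999, 0.9999, 0.99999):
    budget_minutes = WINDOW_MINUTES * (1 - target)
    print(f"{target * 100:g}% availability -> {budget_minutes:.2f} minutes of error budget")
```

At 99.999%, that works out to roughly 26 seconds a month, which leaves essentially no room for routine change.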

Outages

SRE Weekly Issue #28

A more packed issue this week to make up for missing last week. This issue is coming to you from beautiful Cape Cod, where my family and I are vacationing for the week.

Articles

In April, Google Compute Engine suffered a major outage that was reported here. I wrote up this review for the Operations Incident Board’s Postmortem Report Reviews project.

Migration of a service without downtime can be an incredibly challenging engineering feat. Netflix details their effort to migrate their billing system, complete with its tens of terabytes of RDBMS data, into EC2.

Our primary goal was to define a secure, resilient and granular path for migration to the Cloud, without impacting the member experience.

Ransomware is designed to really ruin your day. It not only corrupts your in-house data; it also tries to encrypt your backups, even if they’re off-site. Does your backup/recovery strategy stand up to this kind of failure?

VictorOps gives us this shiny, number-filled PDF that you can use as ammunition to convince execs that downtime really matters.

Students of Holberton School’s full-stack engineer curriculum are on-call and actually get paged in the middle of the night. Nifty idea. Why should on-call training happen only on the job?

I think the rumble strip is a near-perfect safeguard.

That’s Pre-Accident Podcast’s Todd Conklin on rumble strips, the grooved warning strips along highway shoulders. This short (4-minute) podcast asks the question: can we apply the principles behind rumble strips in our infrastructures?

The FCC adds undersea cable operators to the list of mandatory reporters to the NORS (Network Outage Reporting System). But companies such as AT&T claim that the reporting will be of limited value, since outages that have no end-user impact (due to redundant undersea links) must still be reported.

Microsoft updated its article on designing highly available apps using Azure. These kinds of articles are important. In theory, no one ought to go down just because one EC2 or Azure region goes down.

SignalFX published this four-part series on avoiding spurious alerts in metric-based monitoring systems. The tutorial bits are specific to SignalFX, but the general principles could be applied to any metric-based alerting system.

Thanks to Aneel at SignalFX for this one.
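One principle that applies to any metric-based alerting system, SignalFX or otherwise: require a condition to hold across a window of samples rather than firing on a single spike. A toy sketch of the idea (mine, not from the series):

```python
from collections import deque

def sustained_threshold(samples, threshold, window=5):
    """Yield True only when every sample in the trailing window exceeds
    the threshold, so a single spike doesn't page anyone."""
    recent = deque(maxlen=window)
    for value in samples:
        recent.append(value)
        yield len(recent) == window and all(v > threshold for v in recent)

# The lone spike to 95 doesn't fire; five consecutive high samples do.
cpu = [40, 95, 42, 91, 92, 93, 94, 95]
print(list(sustained_threshold(cpu, threshold=90)))
```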

Outages

SRE Weekly Issue #27

Sorry I’m a tad late this week!

If you only have time to read one long article, I highly recommend this first one.

Articles

This fascinating series delves deeply into the cascade of failures leading up to the nearly fatal overdose of a pediatric patient hospitalized for a routine colonoscopy. It’s a five-article series, and it’s well worth every minute you’ll spend reading it. Human error, interface design, misplaced trust in automation, learning from aviation; it’s all here in abundance and depth.

In this second part of a two part series (featured here last week), Charity Majors delves into what operations means as we move toward a “serverless” infrastructure.

If you chose a provider, you do not get to just point your finger at them in the post mortem and say it’s their fault.  You chose them, it’s on you.

Interesting, though I have to say I’m a bit skeptical when I hear someone target six nines. Especially when they say this:

“Redefining five nines is redefining them to go up to six nines,” said James Feger, CenturyLink’s vice president of network strategy and development […]

The Pre-Accident Podcast reminds us that incident response is just as important as incident prevention.

As automated remediation increases, the problems that actually hit our pagers become more complex and higher-level. This opinion piece from PagerDuty explores that trend and where it’s leading us.

A high-level overview of the difference between HA and DR, along with a look at Netflix’s HA testing tool, Chaos Monkey.
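If you want a feel for the Chaos-Monkey-style approach without adopting Netflix’s tooling, the core idea fits in a few lines. This sketch is my own illustration, assuming AWS and boto3 (the chaos-group tag is hypothetical); it is not Chaos Monkey itself:

```python
import random

import boto3
from botocore.exceptions import ClientError

def terminate_one_random_instance(tag_value, region="us-east-1", dry_run=True):
    """Pick one running instance carrying the chaos tag and terminate it.
    Keep dry_run=True until you trust your recovery automation."""
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-group", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instances = [
        i["InstanceId"]
        for reservation in resp["Reservations"]
        for i in reservation["Instances"]
    ]
    if not instances:
        return None
    victim = random.choice(instances)
    try:
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as err:
        # With DryRun=True, AWS reports "would have succeeded" as an error code.
        if err.response["Error"]["Code"] != "DryRunOperation":
            raise
    return victim
```

The point isn’t the termination itself; it’s forcing yourself to prove, continuously, that losing an instance is a non-event.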

Outages
