
SRE Weekly Issue #34

SPONSOR MESSAGE

What is Modern Incident Management? Download the Incident Management Buyer’s Guide to learn all about it, and the value it provides. Get your copy here: http://try.victorops.com/l/44432/2016-08-04/dwp8lc

Articles

Winston is Netflix’s tool for runbook automation, based on the open source StackStorm. Winston helps reduce pager burden by filtering out false-positive alerts, collecting information for human responders, and remediating some issues automatically.
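The pattern is worth internalizing even if you don't use StackStorm: re-check the alert before acting, gather the context a responder would want, auto-remediate the failure modes you understand, and only then page a human. A minimal sketch of that flow (not Netflix's actual code; the alert fields and remediation hooks are hypothetical):

```python
# Illustrative sketch of the alert-triage flow described above: validate
# the alert, gather context for responders, and auto-remediate known
# failure modes before paging anyone. All names here are hypothetical.

def handle_alert(alert, page_oncall, run_diagnostics, restart_service):
    # 1. Filter false positives: re-check the signal before acting.
    if not alert["still_firing"]():
        return "suppressed: transient or false positive"

    # 2. Collect the information a human responder would want.
    context = run_diagnostics(alert["service"])

    # 3. Try automated remediation for known failure modes.
    if alert["kind"] == "process_down":
        restart_service(alert["service"])
        if alert["still_firing"]():
            page_oncall(alert, context)  # remediation failed, escalate
            return "escalated after failed remediation"
        return "auto-remediated"

    # 4. Unknown failure mode: page a human, with the collected context.
    page_oncall(alert, context)
    return "escalated"
```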

Is it valid for those working on non-life-critical systems to try to draw on lessons learned in safety-critical fields like surgery and air traffic control? John Allspaw, citing Dr. Richard Cook, answers with an emphatic yes.

The best HA infrastructure design in the world won’t save you when your credit card on file expires.

There’s a huge amount of detail on both PostgreSQL and MySQL in this article, including some sneaky edge-case pitfalls that prompted Uber to look for a new database.

This article goes into a good amount of depth on setting up a Cassandra cluster to survive a full AZ outage.
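For a rough idea of what that entails: map Cassandra racks to AZs via a cloud-aware snitch, keep a replication factor of at least three, and read/write at LOCAL_QUORUM so losing one AZ's replica doesn't block requests. A hedged sketch using the Python driver (keyspace, datacenter name, and contact points are placeholders, not taken from the article):

```python
# Sketch: a keyspace replicated across three racks (mapped to AZs by the
# snitch) so that LOCAL_QUORUM still succeeds with one AZ down.
# Addresses, keyspace, and DC name are placeholders.
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.1.10", "10.0.2.10", "10.0.3.10"])  # one node per AZ
session = cluster.connect()

# RF=3 in the local DC: with racks mapped to AZs, each AZ holds one replica.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us_east': 3}
""")

# LOCAL_QUORUM (2 of 3) tolerates the loss of a full AZ's replica.
stmt = SimpleStatement(
    "SELECT * FROM app.users WHERE id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
rows = session.execute(stmt, ("some-user-id",))
```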

When emergency services in a Maryland (US) county went offline for two hours, 100 calls were missed, possibly contributing to two deaths. In the vein of last week’s theme of complex failures:

“This is really complex, and a lot of dominoes fell in a way that people just didn’t expect,” said Marc Elrich, chairman of the Public Safety Committee.

Here’s the third (final?) installment in this series. This one has some fascinating details on a topic near and dear to my heart: live migration of a database. Their use of DRBD and synchronous replication is especially intriguing.

Ooh, this is gonna be fun. Catchpoint and O’Reilly are hosting an AMA (Ask Me Anything) with DevOps and SRE folks, including Liz Fong-Jones and Charity Majors, both of whose articles have been featured here previously. The questions posted so far look pretty great.

Outages

SRE Weekly Issue #33

SPONSOR MESSAGE

Downtime is expensive — in more ways than one. Learn the costs of downtime and how to minimize them in the new eBook from VictorOps, “Making the Case for Real-Time Incident Management.”

Articles

Here’s another great article urging caution when adopting new tools. Codeship’s Jessica Kerr categorizes technologies into a continuum of risk, from single-developer tools all the way up to new databases. She goes into a really excellent amount of detail, providing examples of how adopting a new technology can come back to bite you.

After several recent incidents of nations cutting off or severely curtailing internet connectivity, the UN took a stand, as reported in this Register article:

The United Nations officially condemned the practice of countries shutting down access to the internet at a meeting of the Human Rights Council on Friday.

Is it possible to design an infrastructure and/or security environment in which a rogue employee cannot take down the service?

Mathias Lafeldt is back in this latest issue of Production Ready. In part 1 of 2, he reviews Richard Cook’s classic How Complex Systems Fail, with an eye toward applying it to web systems.

And with a nod to Lafeldt for the link, here’s another classic from John Allspaw on complexity of failures.

In the same way that you shouldn’t ever have a root cause of “human error”, if you only have a single root cause, you haven’t dug deep enough.

SGX released a postmortem for their mid-July outage in the form of a press release. Just as Allspaw tells us, the theoretically simple root cause (disk failure) was exacerbated by a set of complicating factors.

In this recap of a joint webinar, Threat Stack and VictorOps share 7 methods to avoid and reduce alert fatigue.

Outages

SRE Weekly Issue #32

SPONSOR MESSAGE

Downtime is expensive — in more ways than one. Learn the costs of downtime and how to minimize them in the new eBook from VictorOps, “Making the Case for Real-Time Incident Management.”

Articles

It’s tempting to use the newest shiny stack when building a new system. Dan McKinley argues that you should limit yourself to only a few shiny technologies to avoid excessive operational burden.

[…] the long-term costs of keeping a system working reliably vastly exceed any inconveniences you encounter while building it. 

Quick on the draw, Pete Shima gives us a review of Stack Exchange’s outage postmortem (linked below) as part of the Operations Incident Board’s Postmortem Report Reviews project. Thanks, Pete!

Next month in Seattle will be the second annual Chaos Community Day, an event full of presentations on chaos engineering. I wish I could attend!

As the world becomes more and more dependent on the services we administer, outages become more and more likely to put real people in danger. Here’s a rundown of how dangerous last week’s four-hour outage at the US National Weather Service was.

An interesting opinion piece that argues that Microsoft Azure is more robust than Google’s and Amazon’s offerings.

This week, I’m trying to catch all the articles being written about Pokémon GO. Here’s one that supposes the problem might be a lack of sufficient testing.

Pokémon GO is blowing up like crazy, and I don’t just mean in popularity. Forbes has a lot to say about the complete lack of communication during and after outages, and we’d do well to listen. This article reads a lot like a recipe for how to communicate well to your userbase about outages.

Here’s the continuation of last month’s article on Netflix’s billing migration.

Outages

SRE Weekly Issue #31

Huge thanks to SRE Weekly’s new sponsor, VictorOps!

SPONSOR MESSAGE

Downtime is expensive — in more ways than one. Learn the costs of downtime and how to minimize them in the new eBook from VictorOps, “Making the Case for Real-Time Incident Management.”

Articles

Opzzz is a new app that graphs sleep data (from a Fitbit) against pager alerts (from PagerDuty or Server Density). I love this idea!

By correlating sleep data with on call incidents, we can then illustrate the human cost of on-call work.
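The core idea is simple enough to sketch: take sleep intervals from one API and alert timestamps from the other, then count which alerts landed inside a sleep window. A toy illustration with made-up data (the real app pulls from the Fitbit and PagerDuty/Server Density APIs):

```python
# Toy version of the idea: given sleep windows and alert timestamps,
# count how many alerts interrupted sleep. The data below is made up;
# the real app fetches it from the Fitbit and PagerDuty/Server Density APIs.
from datetime import datetime

sleep_windows = [  # (fell asleep, woke up)
    (datetime(2016, 8, 1, 23, 30), datetime(2016, 8, 2, 7, 0)),
    (datetime(2016, 8, 2, 23, 45), datetime(2016, 8, 3, 6, 30)),
]
alerts = [
    datetime(2016, 8, 2, 3, 12),   # woke someone up
    datetime(2016, 8, 2, 14, 5),   # during the day
    datetime(2016, 8, 3, 1, 40),   # woke someone up
]

interrupted = [
    a for a in alerts
    if any(start <= a <= end for start, end in sleep_windows)
]
print(f"{len(interrupted)} of {len(alerts)} alerts interrupted sleep")
```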

Speaking of measuring sleep data against pages, Etsy is doing that too with their open source on-call analysis tool Opsweekly. Engineers also classify their events based on whether they were actionable.

We’ve been doing this for a year and we are seeing an increasingly improving signal to noise ratio.

Slides from a talk on a really important topic. There are some great resource links included.

I’m a firm believer in work/life balance, especially as it pertains to on-call. I have a reputation for rigidly defending my personal time and that of my co-workers. I strongly feel that this is the best thing I can do for my company because exhaustion and burnout are huge reliability risks. Read this article if you’re trying to figure out how to improve your on-call experience and aren’t sure how to start.

FBAR, Facebook’s Auto-Remediation system, was mentioned here last month. This week, they posted an update explaining AMH, their system for safely handling maintenance of blocks of servers.

Pingdom released this set of short postmortems for last week’s series of outages.

A really detailed article about how one company got Docker into production safely and reliably. I especially love the parts about nginx cutover (when deploying new container versions) and supervising running containers. With the common refrain that Docker isn’t ready for production, it’s nice to see how GoCardless did it — but it’s also interesting to see how much tooling they felt compelled to write in-house.
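The nginx cutover is the part I’d most want to steal. The general shape (my paraphrase, not necessarily GoCardless’s exact implementation) is: start the new container on a fresh port, repoint the upstream, reload nginx gracefully so in-flight requests drain, then retire the old container. A rough sketch, with container names, ports, and paths all hypothetical:

```python
# Rough sketch of a blue/green-style nginx cutover for a new container
# version: bring up the new container, rewrite the upstream, gracefully
# reload nginx, then stop the old one. Names, ports, and paths are
# hypothetical; health checks and error handling are omitted for brevity.
import subprocess
import time

UPSTREAM_CONF = "/etc/nginx/conf.d/app_upstream.conf"

def cutover(new_image, new_port, old_container):
    # Start the new version alongside the old one.
    subprocess.run(
        ["docker", "run", "-d", "--name", f"app_{new_port}",
         "-p", f"{new_port}:8080", new_image],
        check=True,
    )
    time.sleep(5)  # stand-in for a real health check against the new port

    # Point nginx's upstream at the new container and reload gracefully,
    # so in-flight requests to the old backend are allowed to finish.
    with open(UPSTREAM_CONF, "w") as f:
        f.write(f"upstream app {{ server 127.0.0.1:{new_port}; }}\n")
    subprocess.run(["nginx", "-t"], check=True)        # validate config
    subprocess.run(["nginx", "-s", "reload"], check=True)

    # Retire the old container once traffic has cut over.
    subprocess.run(["docker", "stop", old_container], check=True)
```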

What good is an arbitrary number of nines from a cloud service provider if their transit links go down? Or if vast swathes of end-users can’t reach your site due to a major internet disruption? ServiceNow’s vice president argues that cloud providers must pay attention to “real availability” and partner with their customers to deal with external threats to availability.

Last month, Bitfinex (a bitcoin exchange) experienced multiple outages, and the subsequent sell-off caused the price of bitcoin to drop 7.5%. Bitcoin’s lack of regulation is a blessing, but is it also a curse?

How can I even intro a gem like this? John Allspaw’s essay on blameless and just culture at Etsy is a classic, and it’s a great read even if you’re well-versed in the topic. I especially liked the concept of the “Second Story”.

Outages

SRE Weekly Issue #30

Articles

How did I not know about HumanOps before now?? Their site is great, as is their manifesto. A large part of what I do at $JOB is to study and improve the human aspects of operations.

The wellbeing of human operators impacts the reliability of systems.

Slides from Charity Majors’s talk at HumanOps. Some choice tidbits in there, and I can’t wait until they post the audio.

Here’s a description of how Server Density handles their on-call duties. They use a hybrid approach with some alerts going to devs and some handled by a dedicated ops team. This idea is really intriguing to me:

After an out-of-hours alert the responder gets the following 24 hours off from on-call. This helps with the social/health implications of being woken up multiple nights in a row.
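As a purely hypothetical illustration of how you might automate that rule (the article doesn’t show any scheduling code), the escalation logic could simply skip anyone still inside their 24-hour recovery window:

```python
# Hypothetical sketch of the "24 hours off after an out-of-hours alert"
# rule: skip responders who handled an off-hours page within the last day.
from datetime import datetime, timedelta

def next_responder(rotation, last_offhours_alert, now=None):
    """rotation: ordered list of names; last_offhours_alert: name -> datetime."""
    now = now or datetime.now()
    for person in rotation:
        last = last_offhours_alert.get(person)
        if last and now - last < timedelta(hours=24):
            continue  # still inside their recovery window
        return person
    return rotation[0]  # everyone is resting: fall back to the scheduled person

# Example: Alice was paged at 3am today, so Bob takes the next alert.
rotation = ["alice", "bob", "carol"]
last = {"alice": datetime.now() - timedelta(hours=6)}
print(next_responder(rotation, last))  # -> "bob"
```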

This article is written by Netflix’s integration testing team, which is obviously not their SRE team. Nevertheless, integration testing at Netflix is important to ensure that new features start out working reliably and stay working after they’re out.

The pitfall discussed in this article is a lack of packet-level visibility that hampers operators’ ability to quickly diagnose network issues. The article starts by outlining the issue, then discusses methods of mitigating it, including Tap as a Service.

This article makes the case for out-of-band management (OOBM) tools in responding to network issues. It’s a good review, especially for those whose experience is primarily or solely with cloud infrastructure.

Now there’s an inflammatory article title — it reeks of the NoOps debate. I would argue that a microservice architecture makes RCAs just as necessary, if not more so.

Former Slideshare engineer Sylvain Kalache shares this war story about DevOps gone awry. I’d say there’s a third takeaway not listed in the article: DevOps need not mean full access to the entire infrastructure for everyone.

Outages
