SRE Weekly Issue #44

SPONSOR MESSAGE

DevOps Executive Webinar: Security for Startups in a DevOps World. http://try.victorops.com/l/44432/2016-10-12/fgh7n3

Articles

With all the “NoOps” and “Serverless” stuff floating around, do we need ops? Susan Fowler says not necessarily, but that we do need ops skills.

VictorOps is gathering data for the 2016 edition of their yearly State of On-Call Report (2015’s if you missed it). Please click the link above and take the survey if you have a moment! The report provides some pretty awesome stats that we can all use to improve the on-call experience at our organizations.

This survey is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Scalyr writes about cascading failure scenarios, using the DynamoDB outage of September 20th, 2015 (no, not this year’s September DynamoDB outage) as a case study.

Capacity problems are a common type of failure, and often they’re of this “cascading” variety. A system that’s thrashing around in a failure state often uses more resources than it did when it was healthy, creating a self-reinforcing overload.
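
One common way that feedback loop gets started is naive retry behavior: when a dependency slows down, clients retry immediately and multiply the load right when the system can least absorb it. As a generic illustration (mine, not from the Scalyr article), here is a minimal sketch of capped exponential backoff with full jitter, a standard way to blunt that amplification; the function names are illustrative:

    import random
    import time

    def backoff_delay(attempt, base=0.1, cap=10.0):
        """Capped exponential backoff with full jitter: sleep a random amount
        up to min(cap, base * 2**attempt) so retries spread out instead of
        hammering an already-overloaded dependency in lockstep."""
        return random.uniform(0, min(cap, base * 2 ** attempt))

    def call_with_retries(do_request, max_attempts=5):
        """Retry a flaky call a bounded number of times, backing off between tries."""
        for attempt in range(max_attempts):
            try:
                return do_request()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # give up rather than retrying forever
                time.sleep(backoff_delay(attempt))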

Check it out! Apparently this newsletter started around the same time that SRE Weekly did. Content includes a lot of really nifty stuff about Linux system administration.

I previously linked to a two-part series by Mathias Lafeldt on writing postmortems. At my request, Jimdo graciously agreed to release their (previously) internal postmortem about the incident that prompted him to write the articles. Thanks so much, Mathias!

A review of what sounds like a really interesting play about just culture, blameless retrospectives, and restorative justice in aviation, based on real events.

Thanks to Mathias Lafeldt for this one.

When you’re big like Facebook, sometimes reliability means essentially building your own Internet.

If you haven’t had time to watch Matt Ranney’s talk on Scaling Uber to 1000 Microservices, check out this detailed summary. Growing your engineering force 10x over a year while still keeping the service reliable is a pretty impressive feat.

PagerDuty shares some tips for lowering your MTTR, but first they ask the important question: how are you measuring MTTR, and is lowering it meaningful?
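
To make the measurement question concrete, here is a small sketch of my own (not from the PagerDuty post) showing how two reasonable definitions of MTTR over the same hypothetical incidents produce different numbers, depending on whether the clock starts at detection or at acknowledgement:

    from datetime import datetime as dt

    # Hypothetical incident records; all timestamps are illustrative.
    incidents = [
        {"detected": dt(2016, 10, 1, 3, 0), "acked": dt(2016, 10, 1, 3, 20), "resolved": dt(2016, 10, 1, 4, 0)},
        {"detected": dt(2016, 10, 5, 14, 0), "acked": dt(2016, 10, 5, 14, 2), "resolved": dt(2016, 10, 5, 14, 30)},
    ]

    def mean_minutes(deltas):
        deltas = list(deltas)
        return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

    # MTTR measured from detection to resolution...
    mttr_from_detection = mean_minutes(i["resolved"] - i["detected"] for i in incidents)
    # ...versus from acknowledgement to resolution. Same incidents, different number.
    mttr_from_ack = mean_minutes(i["resolved"] - i["acked"] for i in incidents)

    print(mttr_from_detection, mttr_from_ack)  # 45.0 vs. 34.0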

David Christensen riffs on Charity Majors’s concept of “3 Types of Code”: “no code” (SaaS, PaaS, etc.), “someone else’s code”, and “your code”. Try to spend as much development time as possible writing code that supports what makes your business unique (your key differentiator).

Julia Evans is back with a write-up of the lessons she’s learned as she’s begun to gain an understanding of operations. My favorite bit:

Stage 2.5: learn to be scared
I think learning to be scared is a really important skill – you should be worried about upgrading a database safely, or about upgrading the version of Ruby you’re using in production. These are dangerous changes!

SysAdvent is happening again this year! Click the link above if you’d like to propose an article or volunteer to be an editor.

Outages

  • United Airlines
  • Yahoo mail
  • Google Cloud
  • FNB (South Africa bank)
  • GlobalSign (SSL certificate authority)
    • GlobalSign had a major problem in their PKI that resulted in all of their certificates being treated as revoked. They’ve posted a detailed postmortem that’s pretty heavy on deep SSL details, but the basic story is that their OCSP service misinterpreted a routine action as a request to revoke their intermediate CA certificate. Yikes. I love this quote and the mental image of a panicked party with streamers and ribbon-cutting that it conjures up:

      Our AlphaSSL and CloudSSL customers had to wait a few hours more while an emergency key ceremony was held to create alternatives.

SRE Weekly Issue #43

Dreamforce this past week was insanely busy but tons of fun. My colleague Courtney Eckhardt and I gave a shorter version of our SRECon16 talk on SRE and human factors.

SPONSOR MESSAGE

Downtime costs a lot more than you think. Learn why – and how to make the case for Real-time Incident Management. http://try.victorops.com/l/44432/2016-07-13/dpn2qw

Articles

A theme here in the past few issues has been the insane growth in complexity in our infrastructures. Honeycomb is a new tool-as-a-service to help you make sense of that complexity through event-based introspection. Think ELK or Splunk, but opinionated and way faster. The goal is to give you the ability to reach a state of flow in asking and answering questions about your infrastructure, so you can understand it more deeply, find problems you didn’t know you had, and discover new questions to ask. Here’s where I started getting really interested:

We have optimized Honeycomb for speed, for rapid iteration, for explorability. Waiting even 10-15 seconds for a view to load will cut your focus, will take you out of the moment, will break the spell of exuberant creative flow.

Mathias Lafeldt rocks it again, this time with a great essay on finding root causes for an incident. I love the idea of using the term “Contributing Conditions” instead. And the Retrospective Prime Directive is so on-point I’ve gotta re-quote it here:

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

This paper review by The Morning Paper reminds us of the importance of checking return codes and properly handling errors. Best part: solid statistical evidence.

A followup note on Rachel Kroll’s hilarious and awesome story about 1213486160 (a.k.a. “HTTP”). Basically, if you see a weird number showing up in your logs, it might be a good idea to try interpreting it as a string!
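
A quick way to try that (a generic snippet of mine, not code from Rachel’s post) is to reinterpret the integer’s bytes as ASCII:

    def int_as_ascii(n, byteorder="big"):
        """Reinterpret an integer's bytes as text; handy when a mystery number
        in your logs is really a string read from the wrong offset."""
        raw = n.to_bytes((n.bit_length() + 7) // 8, byteorder)
        return raw.decode("ascii", errors="replace")

    print(int_as_ascii(1213486160))            # 'HTTP'
    print(int_as_ascii(1213486160, "little"))  # 'PTTH'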

A solid basic primer on Netflix’s chaos engineering tools, with some info about the history and motivation behind them. I love the bit about how they ran into issues when Chaos Monkey terminated itself. Oops.

This article should really be titled, Make Sure Your DNS Is Reliable! It’s easy to forget that all the HA in the world won’t help your infrastructure if the traffic never reaches it due to a DNS failure. And here’s a really good corollary:

Even if your status site is on a separate subdomain, web host, etc… it will still be unavailable if your DNS goes down.

We’ve had a couple of high-profile airline computer system failures this year. Here’s an analysis of the difficulty companies are having bolting new functionality onto systems from the 90s and earlier, even as those systems try to support higher volume due to airline mergers. You may want to skip the bits toward the end that read like an ad, though.

I don’t think I’ve ever been at a company with a dedicated DBA role. It’s becoming a thing of the past, and instead ops folks (and increasingly developers) are becoming the new DBAs. Charity Majors tells us that we need to apply proper operational principles to our datastores: one change at a time, proper deploy and rollback plans, etc.

I love this idea: it’s an exercise in building your own command-line shell. It’s important to have a good grounding in the fundamentals of how processes get spawned and IO works in POSIX systems. Occasionally that’s the only way you can get to the root cause(s) of a really thorny incident.
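
If you want a taste of those fundamentals without writing a whole shell, here’s a minimal fork/exec/wait loop (my own POSIX sketch, not code from the linked exercise):

    import os
    import shlex

    def run(command_line):
        """Spawn a child the way a shell does: fork, exec in the child,
        wait in the parent. POSIX only."""
        argv = shlex.split(command_line)
        pid = os.fork()
        if pid == 0:
            # Child: replace this process image with the requested program.
            try:
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)  # exec failed; never fall through into the parent's code
        # Parent: block until the child exits, then extract its exit status.
        _, status = os.waitpid(pid, 0)
        return os.WEXITSTATUS(status)

    print(run("ls -l /tmp"))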

Outages

SRE Weekly Issue #42

SPONSOR MESSAGE

[WEBINAR] The Do’s and Dont’s of Post-Incident Analysis. Join VictorOps and Datadog to get an inside look at how to conduct modern post-incident analysis. Sign up now: http://try.victorops.com/l/44432/2016-09-21/f8k6rn

Articles

Netflix’s API has an advanced circuit-breaker system including a defined automated fallback plan for every dependency.
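
For readers who haven’t met the pattern, here’s a deliberately tiny, generic sketch of a circuit breaker with a fallback (my own illustration of the concept, not Netflix’s Hystrix implementation; the remote_recommendation_service in the usage lines is a made-up dependency): after enough consecutive failures it stops calling the dependency and serves the predefined fallback until a cool-off period passes.

    import time

    class CircuitBreaker:
        """Toy circuit breaker: after max_failures consecutive errors, skip the
        real call and return the fallback until reset_after seconds have passed."""

        def __init__(self, call, fallback, max_failures=5, reset_after=30.0):
            self.call, self.fallback = call, fallback
            self.max_failures, self.reset_after = max_failures, reset_after
            self.failures, self.opened_at = 0, None

        def __call__(self, *args, **kwargs):
            if self.opened_at is not None:
                if time.time() - self.opened_at < self.reset_after:
                    return self.fallback(*args, **kwargs)  # circuit open: fail fast
                self.opened_at, self.failures = None, 0    # cool-off over: try again
            try:
                result = self.call(*args, **kwargs)
                self.failures = 0
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.time()
                return self.fallback(*args, **kwargs)

    get_recommendations = CircuitBreaker(
        call=lambda user: remote_recommendation_service(user),  # hypothetical dependency
        fallback=lambda user: [],                               # defined fallback: empty list
    )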

This is Sidney Dekker’s course on Just Culture, including a full explanation of Restorative Just Culture. I especially like the concept of Second Victims of incidents: the practitioner (e.g. an engineer) who was directly involved in the incident.

 Your practitioners are not necessarily the cause of the incident. They themselves are the recipients of trouble deeper inside your organization.

Think you know how TCP works? There are sneaky edge-cases that can cause an outage if you don’t know about them. Example: a MySQL replicating slave will happily report “0 seconds behind master” indefinitely while waiting on a connection to the master that’s long-since silently failed.
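
One generic defense against half-open connections like that (my own illustration of the TCP-level mechanism; MySQL also has its own slave_net_timeout setting for this) is to enable keepalive probes so the kernel eventually notices that the peer is gone:

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel to probe an idle connection and tear it down if the peer is dead.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific tuning: start probing after 60s idle, probe every 10s, give up after 5 misses.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)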

Etsy shares the operational issues they encountered as they moved toward an API/microservice architecture. I especially like the detail about limiting concurrent in-flight sub-requests per root request across the entire request tree.
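
As a rough, in-process sketch of that pattern (my own illustration, not Etsy’s implementation): carry a shared concurrency budget with the request, so every fan-out made on behalf of that request draws from the same limit.

    import asyncio

    class RequestContext:
        """Carries a per-root-request budget for concurrent sub-requests.
        In a real system the remaining budget would be propagated across
        service boundaries (e.g. in a header); this sketch stays in-process."""

        def __init__(self, max_inflight_subrequests=10):
            self.sem = asyncio.Semaphore(max_inflight_subrequests)

        async def subrequest(self, coro):
            async with self.sem:   # wait if this request's budget is exhausted
                return await coro

    async def fetch(name):
        await asyncio.sleep(0.01)  # stand-in for a network call
        return name

    async def handle_root_request():
        ctx = RequestContext(max_inflight_subrequests=4)
        # 20 sub-requests, but never more than 4 in flight for this root request.
        return await asyncio.gather(
            *(ctx.subrequest(fetch("dep-%d" % i)) for i in range(20))
        )

    asyncio.run(handle_root_request())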

My co-worker at Heroku, Stella Cotton, gave this rockin’ keynote at RailsConf 2016. She covers load testing and performance bottleneck diagnosis, and most of what she says applies not just to Rails.

Here’s a summary of a talk about Uber’s system that stores live location data of riders and drivers. They run Cassandra in containers managed by Mesos.

With an MVP, you’re just trying to get into the market and test the waters as quickly as possible, so there’s a temptation to leave considerations like scalability for later. But what if your MVP is unexpectedly successful?

Systems We Love is a new conference modeled after the popular Papers We Love. It looks really interesting, and they’re saying they already have a lot of great proposals.

Travis CI shares more about a major outage last month.

A nice incident response primer from Scalyr.

Outages

SRE Weekly Issue #41

SPONSOR MESSAGE

[WEBINAR] The Do’s and Dont’s of Post-Incident Analysis. Join VictorOps and Datadog to get an inside look at how to conduct modern post-incident analysis. Sign up now: http://try.victorops.com/l/44432/2016-09-21/f8k6rn

Articles

Trestus is a new tool to generate a status page from a Trello board. Neat idea!

Your card can include markdown like any other Trello card and that will be converted to HTML on the generated status page, and any comments to the card will show up as updates to the status (and yes, markdown works in these too).

This issue of Production Ready by Mathias Lafeldt is an excellent intro to writing post-incident analysis documents. I can’t wait for the sequel, in which he’ll address root causes.

Adrian Colyer of The Morning Paper gave a talk at Operability.IO with a round-up of his favorite write-ups of operations-related papers. I really love the fascinating trend of “I have no idea what I’m doing” — tools that help us infer interconnections, causality, and root causes in our increasingly complex infrastructures. Rather than try (and in my experience, usually fail) to document our massively complicated infrastructures in the face of increasing employee turnover rates, let’s just accept that this is impossible and write tools to help us understand our systems.

And for fun, a couple of amusing tweets I came across this week:

Me: oh sorry, I got paged
Date: are you a doctor?
Me: uh
Nagios: holy SHIT this cert expires in SIXTY DAYS
Me: …yes

— Alice Goldfuss (@alicegoldfuss) (check out her awesome talk at SRECon16 about the Incident Command System)

We just accidentally nuked all our auto-scaling stuff and everything shutdown. We’re evidently #serverless now.

— Honest Status Page (@honest_update)

@mipsytipsy @ceejbot imagine you didn’t know anything about dentistry and decided we don’t need to brush our teeth any more. That’s NoOps.

— Senior Oops Engineer (@ReinH)

Netflix documents the new version of their frontend gateway system, Zuul 2. They moved from blocking IO to async, which allows them to handle persistent connections from clients and better withstand retry storms and other spikes.

The advantages of async systems sound glorious, but the above benefits come at a cost to operations. […] It is difficult to follow a request as events and callbacks are processed, and the tools to help with debugging this are sorely lacking in this area.

In last week’s issue, I linked to a chapter from Susan Fowler’s upcoming book on microservices. Here’s an article summarizing her recent talk at Velocity about the same subject: how to make microservices operable. She should know: Uber runs over 1300 microservices. Also summarized is her fellow SRE Tom Croucher’s keynote talk about outages at Uber.

In this first of a series, GitHub lays out the design of their new load balancing solution. It’s pretty interesting due to a key constraint: git clones of huge repositories can’t resume if the connection is dropped, so they need to avoid losing connections whenever possible.

I’m embarrassed to say that I haven’t yet found the time to take my copy of the SRE book from its resting place on my shelf, but here’s another review with a good amount of detail on the highlights of the book.

Live migration of VMs while maintaining TCP connections makes sense — the guest’s kernel holds all the connection state. But how about live migrating containers? The answer is a Linux feature called TCP connection repair.
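
For the curious: TCP_REPAIR (used by CRIU) lets a privileged process put a socket into repair mode and then dump or restore its sequence numbers and queues. Here’s a minimal sketch of just that first step, assuming Linux and CAP_NET_ADMIN; Python’s socket module doesn’t export the constant, so it’s defined by hand from linux/tcp.h:

    import socket

    TCP_REPAIR = 19  # from <linux/tcp.h>; not exported by Python's socket module

    def enter_repair_mode(sock):
        """Put a TCP socket into repair mode so its state (sequence numbers,
        send/receive queues) can be dumped and later restored elsewhere.
        Requires CAP_NET_ADMIN; Linux only."""
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1)

    def leave_repair_mode(sock):
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 0)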

The SSP story (linked here two issues ago) is getting even more interesting. They apparently decided not to switch to their secondary datacenter in order to avoid losing up to fifteen minutes’ worth of data, instead taking a week+ outage.

While, in SRE, we generally don’t have to worry about our deploys literally blowing up in our faces and killing us, I find it valuable to look to other fields to learn from how they manage risk. Here’s an article about a tragic accident at UCLA in which a chemistry graduate student was severely injured and later died. A PhD chemist I know mentioned to me that the culture of safety in academia is much less rigorous than in the industry, perhaps due in part to a differing regulatory environment.

Outages

SRE Weekly Issue #40

SPONSOR MESSAGE

Take a bite out of all things DevOps with video series, DevChops. Get easy to digest explanations of most-used DevOps terms and concepts in 90 seconds or less. Watch now: http://try.victorops.com/l/44432/2016-09-16/f7gpzp

Articles

Adrian Colyer summarizes James Hamilton’s 2007 paper in this edition of The Morning Paper. There’s a lot of excellent advice here — some I knew explicitly, some I mostly implement without thinking about it, and some I’d never thought about. The paper is great, but even if you don’t have time to read it, Colyer’s digest version is well worth a browse.

Susan Fowler (featured here a couple weeks ago) has a philosophy of failure in her life that I find really appealing as an SRE:

We can learn something about how to become the best versions of ourselves from how we engineer the best complex systems in the world of software engineering.

And while we’re on the subject of Susan Fowler, she’s got a book coming soon about writing reliable microservices. In the linked ebook-version of the second chapter, she goes over the requirements for a production-ready microservice: stability, reliability, scalability, fault-tolerance, catastrophe-preparedness, performance, monitoring, and documentation.

Pinterest explains how they broke their datastore up into 4096(!) shards on 4 pairs of MySQL servers (later 8192 on 8 pairs). It’s an interesting approach, although in essence it treats MySQL as a glorified key-value store for JSON documents.
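
The usual trick in this kind of scheme is to embed the shard number in the object ID itself, so any ID can be routed to its shard without a directory lookup. A sketch of the general idea follows; the bit layout is illustrative, not necessarily Pinterest’s exact format:

    SHARD_BITS = 13     # enough for 8192 logical shards
    LOCAL_ID_BITS = 50  # room for ids local to each shard

    def make_id(shard_id, local_id):
        """Pack the shard number into the high bits of a 64-bit object id."""
        return (shard_id << LOCAL_ID_BITS) | local_id

    def shard_for(object_id):
        """Recover the shard directly from the id, with no lookup table."""
        return object_id >> LOCAL_ID_BITS

    pin_id = make_id(shard_id=4093, local_id=123456)
    assert shard_for(pin_id) == 4093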

Do you use Kerberos or similar to authenticate your SSH users? What happens if there’s an incident that’s bad enough to take down your auth infrastructure? I hadn’t realized that openSSH supports CAs, but Facebook shows us that PKI support is easy and feature-rich.

Another project from Facebook: a load balancer for DHCP. Facebook found that anycast was not distributing requests evenly across DHCP servers, so they wrote a load balancer in Go.

In post-incident analysis, the fundamental attribution error is the tendency to blame the people involved when someone else caused an incident, but to blame the system when we ourselves were involved. This 4-minute segment from the Pre-Accident Podcast explains fundamental attribution error in more detail.

411 is Etsy’s new tool that runs scheduled queries against Elasticsearch and alerts on the result.
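
The core loop of a tool like that is pleasantly small. Here’s a hedged sketch (my own, not 411’s code) using the elasticsearch Python client: run a saved query on a schedule and alert whenever it returns hits; the index name, query, and cluster address are all made up.

    import time

    from elasticsearch import Elasticsearch  # third-party: pip install elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # hypothetical cluster address

    QUERY = {
        "query": {
            "bool": {
                "must": [{"match": {"message": "segfault"}}],              # illustrative saved search
                "filter": [{"range": {"@timestamp": {"gte": "now-5m"}}}],  # only the last 5 minutes
            }
        }
    }

    def alert(count):
        print("ALERT: %d matching log lines in the last 5 minutes" % count)  # stand-in for paging

    while True:
        result = es.search(index="logs-*", body=QUERY)
        total = result["hits"]["total"]  # a dict with a "value" key in newer ES versions
        count = total["value"] if isinstance(total, dict) else total
        if count:
            alert(count)
        time.sleep(300)  # re-check every 5 minutes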

Outages

  • ING Bank
    • Here’s a terribly interesting root cause: during a test, the fire response system emitted an incredibly loud sound while dumping an inert gas into the datacenter — probably loud enough to cause hearing damage. This caused failure in multiple key spinning hard drives. Remember shouting at hard drives?
  • Heroku Status
    • Heroku released a followup with details on last week’s outage.

      Full disclosure: Heroku is my employer.

  • Gmail for Work
  • Microsoft Azure
    • Major outage involving most DNS queries for Azure resources failing. Microsoft posted a report including a root cause analysis.