
SRE Weekly Issue #6

Articles

A discussion of failing fast, degrading gracefully, and applying back-pressure to avoid cascading failure in a service-oriented architecture.
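
To make the back-pressure idea concrete, here’s a rough sketch of my own (not from the article): a bounded queue in front of a downstream dependency that fails fast once it’s saturated, so callers can degrade gracefully instead of piling on.

    import queue

    # Hypothetical work queue guarding a downstream dependency. The bound
    # applies back-pressure: when it fills, we reject new work immediately
    # instead of letting latency and memory grow without limit.
    work_queue = queue.Queue(maxsize=100)

    def submit(request):
        try:
            work_queue.put_nowait(request)  # fail fast if saturated
            return "accepted"
        except queue.Full:
            # Rejecting here keeps the cascade from spreading upstream;
            # callers can fall back to a degraded response.
            return "rejected: try again later"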

Many times, it’s our own internal services which cause the biggest DoS attacks on ourselves.

A SUSE developer introduces kGraft, SUSE’s system for live kernel patching. Anyone who survived the AWS reboot-a-thon is probably a big fan of live kernel patching solutions.

Avoiding burnout in on-call is critical. This article is a description of the “urgency” feature in PagerDuty, but it makes a generally applicable point: don’t wake someone for something just because it’s critical; only wake them if it needs immediate action.
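
As a toy illustration of that rule (my own sketch, not PagerDuty’s API), an alert router only pages when immediate action is needed:

    def page_on_call(alert):
        print(f"PAGE (wake someone up): {alert['summary']}")

    def queue_for_business_hours(alert):
        print(f"LOW URGENCY (handle tomorrow): {alert['summary']}")

    def notify(alert):
        # Severity alone doesn't justify a 3am page; only alerts that
        # need immediate human action should wake anyone.
        if alert["needs_immediate_action"]:
            page_on_call(alert)
        else:
            queue_for_business_hours(alert)

    notify({"summary": "backup disk 80% full", "needs_immediate_action": False})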

This is a review/update of the 1994 article on the fallacies of distributed computing. The fallacies still hold true, and anyone designing a large-scale service should heed them. The fallacies:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn’t change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.
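
For fallacy #1 alone, even a basic client needs timeouts and bounded, jittered retries. A minimal sketch of my own (the URL handling and limits are illustrative):

    import random
    import time
    import urllib.request

    def fetch_with_retries(url, attempts=3, timeout=2.0):
        """Fetch a URL while assuming the network is NOT reliable."""
        for attempt in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.read()
            except OSError:
                if attempt == attempts - 1:
                    raise
                # Exponential backoff with jitter so retries don't synchronize.
                time.sleep(2 ** attempt + random.random())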

As I get into SRE Weekly, I repeatedly run across articles that I probably should have read long ago in my career. Hopefully they’re new to some of you, too.

Every position I’ve held has involved supporting reliability in a 24/7 service, but let’s be realistic: it’s unlikely someone would have died as a result of an outage. In cars, reliability takes on a whole new meaning. I first got interested in MISRA and the other standards surrounding the code running in cars when I read some technical write-ups of the investigation surrounding the “unintended acceleration” incidents a few years back. This article discusses how devops practices are being applied in the development of vehicle code.

Evidence has come out that the recent major power outage in Ukraine was a network-based attack (I can’t make myself say “cyber-” anything).

I should have seen this coming.

One blogger’s take on the JetBlue outage.

It’s very hard to create an entirely duplicate universe where you can test plan B. And it’s even harder to keep testing it regularly and make sure it actually works. To wit: your snow plow often doesn’t start after the first snow because it’s been sitting idle all summer.

The SRECon call for participation is now open!

Sean Cassidy has discovered an easy and indistinguishable phishing method for LastPass in Chrome, with a slightly less simple and effective method for Firefox. This one’s important for availability because many organizations rely heavily on LastPass. Compromising the right employee’s vault could spell big trouble and possibly downtime.

Outages

SRE Weekly Issue #5

Articles

What does owning your availability really mean? Brave New Geek argues that it simply means owning your design decisions. I love this quote:

An SLA is not an insurance policy or a hedge against the business impact of an outage, it’s merely a refund policy.

Apparently last week’s BBC outage was “just a test”. Now we have to defend our networks against misdirected hacktivism?

Increased deployment automation leads to the suggestion that developers can now “do ops” (see also: “NoOps”). This author explains why operations is much more than deployment.

Full disclosure: Heroku, my employer, is briefly mentioned.

Tips on how to move toward rapid releases without drastically increasing your risk of outages. They cite the Knight Capital automated trading mishap as a cautionary example, along with Starbucks and this week’s Oyster outage.

Facebook uses configuration for many facets of its service, and they embrace “configuration as code”. They make extensive use of automated testing and canary deployments to keep things safe.

Thousands of changes made by thousands of people is a recipe for configuration errors – a major source of site outages.
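
A hedged sketch of what “configuration as code” can look like in miniature (my own example, not Facebook’s tooling): changes are validated automatically and rolled out to a deterministic canary slice of hosts first.

    import hashlib
    import json

    def validate(config):
        # Automated checks run before any config change ships.
        assert 0 < config["timeout_ms"] <= 5000, "timeout_ms out of range"
        assert config["cache_size"] > 0, "cache_size must be positive"

    def in_canary(hostname, percent):
        # Stable hashing keeps the same hosts in the canary group.
        bucket = int(hashlib.sha1(hostname.encode()).hexdigest(), 16) % 100
        return bucket < percent

    new_config = json.loads('{"timeout_ms": 800, "cache_size": 4096}')
    validate(new_config)
    if in_canary("web-1234", percent=5):
        print("apply new config on this canary host")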

PagerDuty shares a few ideas about how and why to do retrospective analysis of incidents.

Another talk from QCon. Netflix’s Nitesh Kant explains how an asynchronous microservice architecture naturally supports graceful degradation. (thanks to DevOps Weekly for the link)
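
Here’s the general shape of that kind of graceful degradation, as a sketch of my own (the function names and timeout are illustrative, not Netflix’s code): a slow dependency is bounded by a timeout and replaced with a safe fallback rather than taking the whole response down with it.

    import asyncio

    async def fetch_recommendations(user_id):
        await asyncio.sleep(5)  # pretend this dependency is slow today
        return ["personalized", "items"]

    async def render_page(user_id):
        try:
            recs = await asyncio.wait_for(fetch_recommendations(user_id), timeout=0.5)
        except asyncio.TimeoutError:
            recs = ["popular", "items"]  # degraded, but the page still renders
        return {"user": user_id, "recommendations": recs}

    print(asyncio.run(render_page(42)))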

One of the fallacies of distributed computing. This ACM Queue article is an informal survey of all sorts of fascinating ways that networks fail.

Outages

SRE Weekly Issue #4

Articles

A nifty-looking packet generator with packets crafted by Lua scripts. If this thing lives up to the hype in its documentation, it’d be pretty awesome! Thanks to Chris Maynard for the link and for the sleepless days and nights we spent mucking with trafgen’s source.

Just as we design systems to be monitored, this article suggests that we should design systems to be audited. Doing the work up front and incrementally rather than as an afterthought can take the pain out of auditing.

A nice intro to structured logging. I’m a big fan of ELK, and especially using Logstash to alert on events that might be difficult to catch otherwise.
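
If you haven’t tried structured logging yet, the gist is to emit fields instead of free text so Logstash (or anything else) can match on them. A minimal sketch using Python’s standard logging module (the field names are my own):

    import json
    import logging
    import sys

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            # One JSON object per line: easy for Logstash to parse and alert on.
            return json.dumps({
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                "request_id": getattr(record, "request_id", None),
            })

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("checkout")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    log.info("payment declined", extra={"request_id": "abc123"})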

I looked at a few “lessons learned from black Friday 2015” articles, but they’re all low on good technical detail. My consolation prize is this article that seems eerily appropriate, given Target’s outage on Cyber Monday.

The strategy of turning away only some requesters to avoid a full site outage is interesting, but I could see it causing a thundering herd problem if not done carefully, where folks just repeatedly hit reload and cause more traffic.
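
One way to shed load without inviting that herd (a sketch of my own, with made-up numbers) is to reject a fraction of traffic proportional to the overload and hand back a jittered Retry-After so the reloads don’t all land at once:

    import random

    CAPACITY = 1000  # requests/sec we can safely serve (illustrative)

    def handle(current_load):
        """Return (status, headers) after deciding whether to shed this request."""
        if current_load > CAPACITY:
            shed_probability = 1 - (CAPACITY / current_load)
            if random.random() < shed_probability:
                # Jittered Retry-After spreads out the retries and reloads.
                return 503, {"Retry-After": str(random.randint(30, 120))}
        return 200, {}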

These “predictions” (suggestions, really) about load testing may be a review for some, but this article caught my interest because it was the first time I’d heard the term Performance Engineering. Definitely a field worth paying attention to as it becomes more prevalent, due to its overlap with SRE.

Modern medicine has been working through very similar issues to SRE, related to controlling the impact of human error through process design and analysis of human factors. We stand to learn a lot from articles such as this one. For example, they’ve been doing the “blameless retrospective” for a long time:

As the attitude to adverse events has changed from the defensive “blame and shame culture” to an open and transparent healthcare delivery system, it is timely to examine the nature of human errors and their impact on the quality of surgical health care.

A speedy and detailed postmortem from Valve on the Steam issue on Christmas.

Outages

This issue covers Christmas and New Year’s, and we have quite a list of outages. Notably lacking from this list is Xbox Live, despite threats reported in the last issue.

SRE Weekly Issue #3

Articles

I love this article! Simplicity is especially important to us in SRE, as we try to gain a holistic understanding of the service in order to understand its failure modes. Complexity is the enemy. Every bit of complexity can hide a design weakness that threatens to take down your service.

I once heard that it takes Google SRE hires 6 months to come up to speed. I get that Google is huge, but is it possible to reduce this kind of spin-up time by aggressively simplifying?

This Reddit thread highlights the importance of management’s duty to adequately staff on-call teams and make on-call not suck. My favorite quote:

If you’re the sole sysadmin ANYWHERE then your company is dropping the ball big-time with staffing. One of anything means no backup. Point to the redundant RAID you probably installed in the server for them and say “See, if I didn’t put in multiples, you’d be SOL every time one of those drives fails, same situation except if I fail, I’m not coming back and the new one you have to replace me with won’t be anywhere near as good.”

Whether or not you celebrate, I hope this holiday is a quiet and incident-free one for all of you. Being on call during the holidays can be an especially quick path to burnout and pager fatigue. As an industry, it’s important that we come up with ways to keep our services operational during the holidays with minimal interruption to folks’ family/vacation time.

Even if you think you’ve designed your infrastructure to be bulletproof, there may be weaknesses lurking.

Molly-guard is a tool to help you avoid rebooting the wrong host. Back at ${JOB[0]}, I mistakenly rebooted the host running the first production trial of a deploy tool that I’d spent 6 months writing, when it was 95% complete. Oops.

The final installment in a series on writing runbooks. The biggest takeaway for me is the importance of including a runbook link in every automated alert. Especially useful for those 3am incidents.
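
One lightweight way to guarantee the link is there (my own sketch; the fields and wiki URL are hypothetical) is to make the runbook part of the alert definition itself:

    ALERTS = {
        "high_error_rate": {
            "condition": "error_rate > 0.05 for 5m",
            "runbook": "https://wiki.example.com/runbooks/high-error-rate",
        },
    }

    def render_page(alert_name):
        # Every page carries its runbook, so it's one click away at 3am.
        alert = ALERTS[alert_name]
        return f"{alert_name}: {alert['condition']}\nRunbook: {alert['runbook']}"

    print(render_page("high_error_rate"))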

In this talk from last year (video & transcript), Blake Gentry talks about how Heroku’s incident response had evolved. Full disclosure: I work for Heroku. We still largely do things the same way, although now there’s an entire team dedicated solely to the IC role.

Outages

SRE Weekly Issue #2

I’m still working out all of the kinks for SRE Weekly, so the issue for this “week” is hot on the heels of the last one as I clear out my backlog of articles. Coming soonish: decent CSS.

Articles

Managing the burden of on-call is critical to any organization’s incident response. Tired incident responders make mistakes, miss pages, and don’t perform as effectively. In SRE, we can’t afford to ignore this stuff. Thanks to VictorOps for doing the legwork on this!

A talk at QCon from LinkedIn about how they spread out to multiple datacenters.

A review of designing a disaster recovery solution, and where virtualization fits in the picture.

Not strictly related to reliability (unless you’re providing ELK as a service, of course), but I’ve found ELK to be very valuable in detecting and investigating incidents. Scaling ELK well can be an art, and in this article, Etsy describes how they set theirs up.

This series of articles is actually the first time I’d seen mention of DRaaS. I’m not sure I’m convinced that it makes sense to hire an outside firm to handle your DR, but it’s an interesting concept.

Outages

A weekend outage for Rockstar.
A large hospital network in the US went down, making health records unavailable.
Snapchat suffered an extended outage.
Anonymous is suspected to be involved.
A case-sensitivity bug took down Snapchat, among other users of Google Cloud.
Google’s postmortem analyses are excellent, in my opinion. We can learn a lot from the issues they encounter, given such a thorough explanation.