
SRE Weekly Issue #3

Articles

I love this article! Simplicity is especially important to us in SRE, as we try to gain a holistic understanding of the service in order to understand its failure modes. Complexity is the enemy. Every bit of complexity can hide a design weakness that threatens to take down your service.

I once heard that it takes Google SRE hires 6 months to come up to speed. I get that Google is huge, but is it possible to reduce this kind of spin-up time by aggressively simplifying?

This Reddit thread highlights the importance of management’s duty to adequately staff on-call teams and make on-call not suck. My favorite quote:

If you’re the sole sysadmin ANYWHERE then your company is dropping the ball big-time with staffing. One of anything means no backup. Point to the redundant RAID you probably installed in the server for them and say “See, if I didn’t put in multiples, you’d be SOL every time one of those drives fails, same situation except if I fail, I’m not coming back and the new one you have to replace me with won’t be anywhere near as good.”

Whether or not you celebrate, I hope this holiday is a quiet and incident-free one for all of you. Being on call during the holidays can be an especially quick path to burnout and pager fatigue. As an industry, it’s important that we come up with ways to keep our services operational during the holidays with minimal interruption to folks’ family/vacation time.

Even if you think you’ve designed your infrastructure to be bulletproof, there may be weaknesses lurking.

Molly-guard is a tool to help you avoid rebooting the wrong host. Back at ${JOB[0]}, I mistakenly rebooted the host running the first production trial of a deploy tool that I’d spent 6 months writing, when it was 95% complete. Oops.
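
To illustrate the idea (this is not molly-guard’s actual implementation, which is a set of shell wrappers around shutdown and friends), here’s a hypothetical Python sketch of a guard that refuses to reboot until you type the hostname of the machine you’re actually logged into:

    #!/usr/bin/env python3
    """Hypothetical molly-guard-style check: refuse to reboot unless the
    operator types the hostname of the machine they are logged into."""
    import socket
    import subprocess
    import sys

    def guarded_reboot():
        hostname = socket.gethostname().split(".")[0]
        answer = input("Type the hostname of the machine you want to reboot: ")
        if answer.strip() != hostname:
            print("Hostname mismatch -- not rebooting.", file=sys.stderr)
            sys.exit(1)
        # Only reached if the operator typed the right hostname.
        subprocess.run(["shutdown", "-r", "now"], check=True)

    if __name__ == "__main__":
        guarded_reboot()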

The final installment in a series on writing runbooks. The biggest takeaway for me is the importance of including a runbook link in every automated alert. Especially useful for those 3am incidents.
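
As a minimal sketch of that takeaway (the alert fields and URL below are made up, not any particular alerting system’s API), the runbook link can live in the alert definition itself so it arrives with the page:

    # Hypothetical alert definitions; the runbook URL travels with the alert
    # so it appears directly in the notification the on-call engineer gets.
    ALERTS = [
        {
            "name": "HighErrorRate",
            "condition": "error_rate > 0.05 for 5m",
            "severity": "page",
            "runbook": "https://wiki.example.com/runbooks/high-error-rate",
        },
    ]

    def format_page(alert):
        """Render the notification text, always including the runbook link."""
        return (f"[{alert['severity'].upper()}] {alert['name']}: {alert['condition']}\n"
                f"Runbook: {alert['runbook']}")

    if __name__ == "__main__":
        for alert in ALERTS:
            print(format_page(alert))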

In this talk from last year (video & transcript), Blake Gentry talks about how Heroku’s incident response had evolved. Full disclosure: I work for Heroku. We still largely do things the same way, although now there’s an entire team dedicated to only the IC role.

Outages

SRE Weekly Issue #2

I’m still working out all of the kinks for SRE Weekly, so the issue for this “week” is hot on the heels of the last one as I clear out my backlog of articles. Coming soonish: decent CSS.

Articles

Managing the burden of on-call is critical to any organization’s incident response. Tired incident responders make mistakes, miss pages, and don’t perform as effectively. In SRE, we can’t afford to ignore this stuff. Thanks to VictorOps for doing the legwork on this!
A talk at QCon from LinkedIn about how they spread out to multiple datacenters.
A review of designing a disaster recovery solution, and where virtualization fits in the picture.
Not strictly related to reliability (unless you’re providing ELK as a service, of course), but I’ve found ELK to be very valuable in detecting and investigating incidents. Scaling ELK well can be an art, and in this article, Etsy describes how they set theirs up.
This series of articles is actually the first time I’d seen mention of DRaaS. I’m not sure I’m convinced that it makes sense to hire an outside firm to handle your DR, but it’s an interesting concept.

Outages

A weekend outage for Rockstar.
A large hospital network in the US went down, making health records unavailable.
Snapchat suffered an extended outage.
Anonymous is suspected to be involved.
A case-sensitivity bug took down Snapchat, among other users of Google Cloud.
Google’s postmortem analyses are excellent, in my opinion. We can learn a lot from the issues they encounter, given such a thorough explanation.

SRE Weekly Issue #1

Articles

An excellent discussion of the need to look at human error in a broader context.
A cogent argument that code freezes increase risk rather than reducing it.
An interesting outline of a hardware platform with duplicate everything (CPU, RAM, etc.) claiming 7+ nines of availability. I’m not sure I’m convinced of its utility in all but a few niche areas, but it’s a neat concept.
A short discussion of how Netflix prepares for the holidays.
A new tool that checks connectivity through real requests. Is it enough to monitor internal services from a central monitoring machine? What if service A is unreachable only to hosts in cluster B, but Nagios can see it just fine? I’ve seen that before, and it made me wonder whether I need to monitor everything from everywhere (see the sketch below).
Instagram posted an account of growing to double the traffic and multiple regions.
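
Here’s a minimal sketch of that “monitor everything from everywhere” idea (the endpoints are made up; a real deployment would pull them from service discovery): each host probes its dependencies with real requests and reports what it can and can’t reach, rather than trusting a single central vantage point.

    #!/usr/bin/env python3
    """Hypothetical per-host connectivity probe: make real HTTP requests from
    this host, so reachability problems specific to one cluster aren't masked
    by a healthy central monitoring machine."""
    import socket
    import urllib.request

    # Made-up health-check endpoints.
    ENDPOINTS = {
        "service-a": "http://service-a.internal.example.com/healthz",
        "service-b": "http://service-b.internal.example.com/healthz",
    }

    def probe(url, timeout=3):
        """Return True if the endpoint answers with HTTP 200 within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    if __name__ == "__main__":
        me = socket.gethostname()
        for name, url in ENDPOINTS.items():
            status = "ok" if probe(url) else "UNREACHABLE"
            print(f"{me} -> {name}: {status}")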

An example of a retrospective analysis process. I especially like this quote:

We also ask what led a person to believe that what they did was the right choice. Rarely does someone intend to do the wrong thing.

What happens when an operator who knows all of the secrets is suddenly unavailable? How do you make their secrets available without compromising security?

Outages

Black Friday, Cyber Monday, and the weekend in between make up a critical time for sites to remain available. This year, some notable companies had a hard time.

An older but nice postmortem analysis posted by Slack in October.
PSN was down over Black Friday weekend.
Neiman Marcus lost out on much of Black Friday.
Newegg also had issues on Black Friday.
Argos’s Black Friday Deals page was down.
eBay suffered an outage the day before Thanksgiving.
PayPal suffered a major outage during cyber weekend.
Google Compute Engine saw issues with some of its transit traffic stemming from a new BGP peer accidentally announcing far more routes than it should have. They posted a nicely detailed analysis.
Target had downtime on Cyber Monday.
Time Warner Cable customers were frustrated by “intermittent availability” (read: outages) on Cyber Monday, hampering their ability to get in on all the deals.