SRE Weekly Issue #3

Articles

I love this article! Simplicity is especially important to us in SRE, as we try to gain a holistic understanding of a service so that we can reason about its failure modes. Complexity is the enemy: every bit of it can hide a design weakness that threatens to take down your service.

I once heard that it takes Google SRE hires 6 months to come up to speed. I get that Google is huge, but is it possible to reduce this kind of spin-up time by aggressively simplifying?

This Reddit thread highlights the importance of management’s duty to adequately staff on-call teams and make on-call not suck. My favorite quote:

If you’re the sole sysadmin ANYWHERE then your company is dropping the ball big-time with staffing. One of anything means no backup. Point to the redundant RAID you probably installed in the server for them and say “See, if I didn’t put in multiples, you’d be SOL every time one of those drives fails, same situation except if I fail, I’m not coming back and the new one you have to replace me with won’t be anywhere near as good.”

Whether or not you celebrate, I hope this holiday is a quiet and incident-free one for all of you. Being on call during the holidays can be an especially quick path to burnout and pager fatigue. As an industry, it’s important that we come up with ways to keep our services operational during the holidays with minimal interruption to folks’ family/vacation time.

Even if you think you’ve designed your infrastructure to be bulletproof, there may be weaknesses lurking.

Molly-guard is a tool to help you avoid rebooting the wrong host. Back at ${JOB[0]}, I mistakenly rebooted the host running the first production trial of a deploy tool I’d spent 6 months writing; it was 95% complete at the time. Oops.
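For the curious, here’s a minimal Python sketch of the core idea. The real molly-guard is a shell wrapper installed ahead of halt/reboot/shutdown that prompts for the hostname before proceeding; this is just an illustration of that pattern, not its actual implementation:

```python
#!/usr/bin/env python3
"""Sketch of a molly-guard-style reboot wrapper: before rebooting,
require the operator to type this machine's hostname, so a reboot
typed into the wrong SSH window fails safe."""

import socket
import subprocess
import sys


def guarded_reboot() -> None:
    hostname = socket.gethostname()
    # Ask the operator to prove they know which machine they're on.
    typed = input("Please type the hostname of the machine to reboot: ")
    if typed.strip() != hostname:
        print(f"Hostname mismatch; refusing to reboot {hostname}.",
              file=sys.stderr)
        sys.exit(1)
    # Hostname matched: hand off to the real reboot command.
    subprocess.run(["/sbin/reboot"], check=True)


if __name__ == "__main__":
    guarded_reboot()
```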

The final installment in a series on writing runbooks. The biggest takeaway for me is the importance of including a runbook link in every automated alert. Especially useful for those 3am incidents.
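As a sketch of the practice: make it impossible to fire an alert without a runbook attached, so the on-call engineer lands on remediation steps instead of searching a wiki half-asleep. The alert names and URLs below are hypothetical; in a real setup this mapping would likely live in your alerting config (e.g. as alert annotations):

```python
"""Sketch: refuse to build an alert payload that lacks a runbook link."""

# Hypothetical mapping from alert name to runbook URL.
RUNBOOKS = {
    "HighErrorRate": "https://wiki.example.com/runbooks/high-error-rate",
    "DiskAlmostFull": "https://wiki.example.com/runbooks/disk-almost-full",
}


def build_alert(name: str, summary: str) -> dict:
    """Build an alert payload, requiring a runbook link for every alert."""
    runbook = RUNBOOKS.get(name)
    if runbook is None:
        raise ValueError(f"Alert {name!r} has no runbook; write one first.")
    return {"alert": name, "summary": summary, "runbook_url": runbook}


if __name__ == "__main__":
    print(build_alert("HighErrorRate", "5xx rate above 5% for 10 minutes"))
```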

In this talk from last year (video & transcript), Blake Gentry talks about how Heroku’s incident response evolved. Full disclosure: I work for Heroku. We still largely do things the same way, although now there’s an entire team dedicated solely to the incident commander (IC) role.

Outages
