SRE Weekly Issue #73

SPONSOR MESSAGE

Concerned about downtime? VictorOps helps you prepare, respond, and recover from IT and DevOps Incidents. Swing by our product center to learn how and start your trial. http://try.victorops.com/SREWeekly/ProductCenter

Articles

ELBs (Amazon’s Elastic Load Balancers) depend on clients properly respecting DNS round-robin record sets. This article follows a debugging session in excellent detail as they try to answer the question: why are our clients preferring (and overloading) just one ELB IP?

Sarah Schieffer Riehl shares her take on ServerlessConf Austin 2017. She’s got a healthy dose of skepticism that I like, concluding that “serverful and serverless architectures don’t do the same things.” I like this bit:

For processes that require polling or any kind of server wakefulness, converting to a serverless architecture can be an exercise in “serverless for serverless’ sake”.

Wow, this dovetails so well into the Todd Conklin’s “Safety Moment” from last week, on imagining all the possible things that could go wrong.  I’d love to hear more thoughts along these lines: is it possible to design a reliable system without envisioning the majority of things that could go wrong?

PagerDuty outlines an incident lifecycle management policy based on ITIL.

DropBox created Cape for “asynchronous processing of billions of events a day, powering many Dropbox features”. Example: you upload a text file, and a Cape job indexes it immediately for full-text searching. I’d love to hear more on why existing solutions didn’t fit the bill, although they do cover their requirements in depth.

When I signed on for my first SRE position, I had no idea how huge a part vendor relations would play in ensuring reliability.

Initially, LinkedIn’s SRE team hired engineers only based on technical skill. As they’ve grown, they’ve discovered the importance of collaboration skills as well.

StatusPage.io explains the reasons for having a solid incident communication policy and guides you through setting one up.

As the title suggest, this ACM Queue article goes into some depth on the kinds of calculations one might make when designing a reliable system. Specifically, they focus on service dependencies and introduce Google’s “rule of the extra 9”: a dependency should have one more nine of reliability than the thing that critically depends on it.

At the next conference, when somebody tries to sell you a circuit breaker talk, tell them that this is only the starter and ask for the main course.

Outages

Updated: May 21, 2017 — 8:59 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme