SRE Weekly Issue #13

SRECon16 registration is open, and I’m excited to say that my colleague Courtney Eckhardt and I will be giving a talk together! If you come to the conference, I’d love it if you’d say hi.

Articles

A deep-dive on EVCache, Netflix’s open source sharding and replication layer on top of memcached.

EVCache is one of the critical components of Netflix’s distributed architecture, providing globally replicated data at RAM speed so any member can be served from anywhere.

This is a guest post from one of our customers, Aaron, Director of Support Systems at CageData. He’s talking about making alerts actionable and why that’s important.

TechCrunch gives us this overview of the field of SRE, including its origins, motivations, and guesses about its future.

Everyone’s favorite OpenSSL vulnerability of the year. I hope you all had a relatively easy patch day.

A short but sweet analysis of an intermittent bug caused by inconsistent date formatting. The author uses the term “blameful postmortem” to mean finding reasons that explain how the client application was written with faulty date parsing logic (tl;dr: the server side truncated trailing zeroes in the fractional seconds). Really, I think this is less about blame than it is about understanding the full context in which an error was able to occur, and that’s exactly what a blameless postmortem is all about.
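To make the failure mode concrete, here is a minimal sketch of how this class of bug can arise. It is not the code from the linked article: the timestamp layout, the three-digit assumption, and the function names `strict_parse` and `tolerant_parse` are all hypothetical, chosen only to illustrate a client that breaks when the server drops trailing zeroes.

```python
import re
from datetime import datetime, timezone

# Hypothetical strict client: assumes the server always sends exactly
# three fractional digits, e.g. "2016-03-06T22:22:00.120Z".
STRICT = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z$")


def strict_parse(stamp: str) -> datetime:
    """Reject anything that does not carry exactly three fractional digits."""
    if not STRICT.match(stamp):
        raise ValueError(f"unexpected timestamp format: {stamp!r}")
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def tolerant_parse(stamp: str) -> datetime:
    """Accept zero to six fractional digits, so truncated zeroes are harmless."""
    body = stamp[:-1] if stamp.endswith("Z") else stamp
    # strptime's %f accepts one to six digits and pads on the right,
    # so ".12" is read as 120000 microseconds.
    fmt = "%Y-%m-%dT%H:%M:%S.%f" if "." in body else "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(body, fmt).replace(tzinfo=timezone.utc)
```

With these definitions, `strict_parse("2016-03-06T22:22:00.12Z")` raises `ValueError`, while `tolerant_parse` reads the same string as 120 milliseconds past the second. The bug is intermittent because only timestamps whose fractional part happens to end in zero get shortened.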

Incidents can uncover technical debt in a system. Fixing the technical debt is often necessary if a repeat incident is to be avoided, but it can be difficult to track and allocate resources to make it happen. This article from PagerDuty suggests a method for tracking technical debt uncovered by incidents.

When multiple incidents occur simultaneously, things can get hairy and you need to have an organized incident response structure. This article is about firefighting, but we can take its lessons and apply them to SRE.

PagerDuty advocates for a model I’ve heard referred to as “Total Service Ownership”, where dev teams handle incident response for their subsystems rather than “throwing them over the wall” for Ops to support. Courtney and I will be talking about this and more at SRECon16 next month.

Outages

  • Telstra
    • No free data day for this one.

  • Gopher
    • Metafilter revived their gopher server after 15 years of downtime.

  • Salesforce.com
    • Full disclosure: Salesforce.com (parent company of my employer, Heroku) is mentioned.

  • Uganda
Updated: March 6, 2016 — 10:22 pm
A production of Tinker Tinker Tinker, LLC