SRE Weekly Issue #13


SRECon16 registration is open, and I’m excited to say that my colleague Courtney Eckhardt and I will be giving a talk together! If you come to the conference, I’d love it if you’d say hi.

Articles

The Netflix Tech Blog: Caching for a Global Netflix

A deep-dive on EVCache, Netflix’s open source sharding and replication layer on top of memcached.

EVCache is one of the critical components of Netflix’s distributed architecture, providing globally replicated data at RAM speed so any member can be served from anywhere.

Actionable Alerts: Reducing False Positives & Making On-call Suck Less – VictorOps

This is a guest post from one of our customers, Aaron, Director of Support Systems at CageData. He’s talking about making alerts actionable and why that’s important.

Are site reliability engineers the next data scientists?

TechCrunch gives us this overview of the field of SRE, including its origins, motivations, and guesses about its future.

DROWN Attack

Everyone’s favorite OpenSSL vulnerability of the year. I hope you all had a relatively easy patch day.

A “Principled”, Blameful Post-Mortem

A short but sweet analysis of an intermittent bug caused by inconsistent date formatting. The author uses the term “blameful postmortem” to mean finding reasons that explain how the client application was written with faulty date parsing logic (tl;dr: the server side truncated trailing zeroes in the fractional seconds). Really, I think this is less about blame than it is about understanding the full context in which an error was able to occur, and that’s exactly what a blameless postmortem is all about.
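
To make the failure mode concrete, here is a minimal Python sketch of how truncated fractional seconds can trip up a strict parser only some of the time. Everything in it is an assumption for illustration: the timestamp format, the function names, and the three-digit millisecond convention are hypothetical and are not taken from the linked article.

```python
import re
from datetime import datetime, timezone


def format_truncating(dt):
    """Hypothetical server-side formatter: drops trailing zeroes
    from the millisecond field (assumed behavior, for illustration)."""
    base = dt.strftime("%Y-%m-%dT%H:%M:%S")
    frac = f"{dt.microsecond // 1000:03d}".rstrip("0")
    return f"{base}.{frac}Z" if frac else f"{base}Z"


# Hypothetical strict client: insists on exactly three fractional digits.
STRICT = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z$")


def strict_client_accepts(text):
    return STRICT.match(text) is not None


def tolerant_parse(text):
    """Accepts zero to six fractional digits and pads on the right."""
    m = re.match(
        r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d{1,6}))?Z$", text
    )
    if m is None:
        raise ValueError(f"unrecognized timestamp: {text!r}")
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    micros = int((m.group(2) or "").ljust(6, "0"))
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


usual = datetime(2016, 3, 1, 12, 0, 0, 123000, tzinfo=timezone.utc)
unlucky = datetime(2016, 3, 1, 12, 0, 0, 120000, tzinfo=timezone.utc)

print(format_truncating(usual), strict_client_accepts(format_truncating(usual)))
print(format_truncating(unlucky), strict_client_accepts(format_truncating(unlucky)))
print(tolerant_parse(format_truncating(unlucky)) == unlucky)
```

Under these assumptions, the strict client only fails when the millisecond value happens to end in zero, which is one way a bug like this could look intermittent. A parser that tolerates a variable number of fractional digits sidesteps the problem.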

Reducing Technical Debt With Incident Management

Incidents can uncover technical debt in a system. Fixing the technical debt is often necessary if a repeat incident is to be avoided, but it can be difficult to track and allocate resources to make it happen. This article from PagerDuty suggests a method for tracking technical debt uncovered by incidents.

Are You Prepared To Handle More Than The “Routine” Incident?

When multiple incidents occur simultaneously, things can get hairy and you need to have an organized incident response structure. This article is about firefighting, but we can take its lessons and apply them to SRE.

7 Benefits of Incident Management in Supporting Applications

PagerDuty advocates for a model I’ve heard referred to as “Total Service Ownership”, where dev teams handle incident response for their subsystems rather than “throwing them over the wall” for Ops to support. Courtney and I will be talking about this and more at SRECon16 next month.

Outages

Telstra
- No free data day for this one.
Gopher
- Metafilter revived their gopher server after 15 years of downtime.
Salesforce.com
- Full disclosure: Salesforce.com (parent company of my employer, Heroku) is mentioned.
Uganda
