SRE Weekly Issue #42

SPONSOR MESSAGE

[WEBINAR] The Do’s and Dont’s of Post-Incident Analysis. Join VictorOps and Datadog to get an inside look at how to conduct modern post-incident analysis. Sign up now: http://try.victorops.com/l/44432/2016-09-21/f8k6rn

Articles

Netflix’s API has an advanced circuit-breaker system including a defined automated fallback plan for every dependency.
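
For readers new to the pattern, here is a minimal sketch of a circuit breaker with a per-dependency fallback. It is not Netflix's implementation: the `CircuitBreaker` class, its thresholds, and the injectable `clock` parameter are all hypothetical names chosen for illustration.

```python
import time


class CircuitBreaker:
    """Hypothetical minimal circuit breaker; an illustration, not Netflix's code."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback):
        # While the circuit is open, skip the dependency and use the fallback.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

One breaker instance per dependency, each paired with its own fallback (a cached value, a default, an empty list), roughly matches the "defined fallback for every dependency" idea described above.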

This is Sidney Dekker’s course on Just Culture, including a full explanation of Restorative Just Culture. I especially like the concept of Second Victims of incidents: the practitioners (e.g. engineers) who were directly involved in the incident.

 Your practitioners are not necessarily the cause of the incident. They themselves are the recipients of trouble deeper inside your organization.

Think you know how TCP works? There are sneaky edge-cases that can cause an outage if you don’t know about them. Example: a MySQL replicating slave will happily report “0 seconds behind master” indefinitely while waiting on a connection to the master that’s long-since silently failed.
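
One common mitigation for silently dead connections is TCP keepalive. This may or may not be what the linked article recommends, so treat it as an assumption. The sketch below turns keepalive on for a Python socket. The `enable_keepalive` helper and its default timings are hypothetical, and the `TCP_KEEP*` options are only set where the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Hypothetical helper: ask the kernel to probe an idle connection."""
    # SO_KEEPALIVE itself is widely available.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning knobs are platform-specific, so set each only if it exists.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the connection is dropped
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

With the kernel probing the peer, a blocked read on a dead connection eventually fails with an error instead of waiting forever. An application-level heartbeat or read timeout is another way to get the same effect.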

Etsy shares the operational issues they encountered as they moved toward an API/microservice architecture. I especially like the detail about limiting concurrent in-flight sub-requests per root request across the entire request tree.
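
To make the per-root-request limit concrete, here is a single-process sketch using `asyncio`. It is not Etsy's code. `RootRequest`, `sub_request`, `handle_root`, and the dictionary describing the request tree are all hypothetical, and a real system would have to carry the budget across service boundaries.

```python
import asyncio


class RootRequest:
    """Hypothetical per-root-request state shared by the whole request tree."""

    def __init__(self, max_in_flight):
        self.limit = asyncio.Semaphore(max_in_flight)
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, for illustration


async def sub_request(root, name, children_of):
    # Hold a slot only while this node's own call is in flight, and release it
    # before fanning out, so a parent waiting on children cannot starve them.
    async with root.limit:
        root.in_flight += 1
        root.peak = max(root.peak, root.in_flight)
        await asyncio.sleep(0.01)  # stand-in for the real network call
        root.in_flight -= 1
    children = children_of.get(name, [])
    results = await asyncio.gather(
        *(sub_request(root, child, children_of) for child in children)
    )
    return [name] + [n for sub in results for n in sub]


async def handle_root(tree, max_in_flight=2):
    # One semaphore per root request, passed down to every descendant.
    root = RootRequest(max_in_flight)
    names = await sub_request(root, "root", tree)
    return names, root.peak
```

The point of sharing one semaphore across the tree, rather than limiting each fan-out separately, is that a deep or wide tree cannot multiply its concurrency at every level.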

My co-worker at Heroku, Stella Cotton, gave this rockin’ keynote at RailsConf 2016. She covers load testing and performance bottleneck diagnosis, and most of what she says applies not just to Rails.

Here’s a summary of a talk about Uber’s system that stores live location data of riders and drivers. They run Cassandra in containers managed by Mesos.

With an MVP, you’re just trying to get into the market and test the waters as quickly as possible, so there’s a temptation to leave considerations like scalability for later. But what if your MVP is unexpectedly successful?

Systems We Love is a new conference modeled after the popular Papers We Love. It looks really interesting, and they’re saying they already have a lot of great proposals.

Travis CI shares more about a major outage last month.

A nice incident response primer from Scalyr.

Outages

Updated: October 2, 2016 — 8:14 pm
A production of Tinker Tinker Tinker, LLC