SRE Weekly Issue #94

SPONSOR MESSAGE

All Day DevOps is on Oct. 24th! This FREE, online conference offers 100 DevOps-focused sessions across six different tracks. Learn more & register: http://bit.ly/2waBukw

Articles

This article by the Joint Commission opened my eyes to just how far medicine in the US is from being a High Reliability Organization (HRO). It’s long, but I’m really glad I read it.

HROs recognize that the earliest indicators of threats to organizational performance typically appear in small changes in the organization’s operations.

[…] in several instances, particularly those involving the rapid identification and management of errors and unsafe conditions, it appears that today’s hospitals often exhibit the very opposite of high reliability.

Increment issue #3 is out this week, and Alice Goldfuss gives us this juicy article on staging environments. I especially love the section on their potential pitfalls.

For all their advantages, if staging environments are built incorrectly or used for the wrong reasons, they can sometimes make products less stable and reliable.

A Honeycomb engineer gives us a deep-dive into Honeycomb’s infrastructure and shows how they use their product itself (in a separate, isolated installation) to debug problems in their production service. Microservices are key to allowing them to diagnose and fix problems.

This is a nice summary of a paper by Google employees entitled “The Tail at Scale”. Tail (99th-percentile) latency can really bite you if you’re composing microservices, and the paper has some suggestions for how to deal with it.
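To make the tail-latency point concrete, here’s a quick back-of-the-envelope sketch (mine, not from the summary) of the paper’s 1-in-100 example: if each backend is slow for only 1% of requests, a request that fans out to 100 backends and waits for all of them will be slow about 63% of the time.

    # Probability that a fan-out request hits at least one slow backend,
    # assuming each backend is independently slow for 1% of requests.
    def p_slow(fan_out, p_backend_slow=0.01):
        return 1.0 - (1.0 - p_backend_slow) ** fan_out

    for n in (1, 10, 100):
        print(f"fan-out {n:>3}: {p_slow(n):.0%} of requests see tail latency")
    # fan-out   1: 1% of requests see tail latency
    # fan-out  10: 10% of requests see tail latency
    # fan-out 100: 63% of requests see tail latency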

This post by VictorOps recommends moving away from Root Cause Analysis (RCA) toward a Cynefin-based method.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

I love the idea of detecting race conditions through static analysis. It sounds hard, but the key is that RacerD aims only to avoid false positives, accepting some false negatives.

RacerD has been running in production for 10 months on our Android codebase and has caught over 1000 multi-threading issues which have been fixed by Facebook developers before the code reaches production.

Our business requires us to deliver near-100% uptime for our API, but after multiple outages that nearly crippled our business, we became obsessed with eliminating single points of failure. In this post, I’ll discuss how we use Fastly’s edge cloud platform and other strategies to make sure we keep our customers’ websites up and running.

Full disclosure: Heroku, my employer, is mentioned.

Outages

  • Honeycomb
    • Honeycomb had a partial outage on the 17th due to a Kafka bug, and they posted an analysis the next day (nice!). They chronicle their discovery of a Kafka split-brain scenario through snapshots of the investigation they did using their dogfood instance of Honeycomb.
  • Visual Studio Team Services
    • Linked is an absolutely top-notch post-incident analysis by Microsoft. The bug involved is fascinating and their description had me on the edge of my seat (yes, I’m an incident nerd).
  • Heroku
    • Heroku posted a followup for an outage in their API. Faulty rate-limiting logic prevented the service from surviving a flood of requests. Earlier in the week, they posted a followup for incident #1297 (link). Full disclosure: Heroku is my employer.
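As an aside, here’s a generic token-bucket sketch of the kind of rate limiting that an API flood like this puts to the test. This is purely my illustration; the followup doesn’t describe Heroku’s actual implementation.

    import time

    class TokenBucket:
        """Minimal token-bucket rate limiter (illustrative only)."""

        def __init__(self, rate, capacity):
            self.rate = rate          # tokens refilled per second
            self.capacity = capacity  # maximum burst size
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            # Refill based on elapsed time, capped at the bucket's capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # caller should shed the request, e.g. with HTTP 429

    limiter = TokenBucket(rate=100, capacity=200)
    if not limiter.allow():
        print("429 Too Many Requests")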
A production of Tinker Tinker Tinker, LLC