
SRE Weekly Issue #98

SPONSOR MESSAGE

Attending AWS re:Invent 2017? Visit the VictorOps booth, schedule a meeting, or join us for some after hours fun. See you in Vegas! http://try.victorops.com/SREWeekly/AWS

Articles

I’ve mentioned Blackrock3 Partners here before: a team of veteran firefighters who train IT incident responders in the same Incident Management System used by firefighters and other disaster responders. Until now, they’ve only done training and consulting directly with companies.

Now, for the first time, they are opening a training session to the public, so you can learn to be an Incident Commander (IC) without having to be at a company that contracts with them. Their training will significantly up your incident response game, so don’t miss out on this. Click through for information on tickets.

Blackrock3 Partners has not provided me with compensation in any form for including this link.

This is John Allspaw’s 30-minute talk at DOES17, and it contains so much awesomeness that I really hope you’ll make time for it. Here are a couple of teasers (paraphrased):

Treat incidents as unplanned investments in your infrastructure.

Perform retrospectives not to brainstorm remediation items but to understand where your mental model of the system went wrong.

Here’s some more detail on Slack’s major outage on Halloween, in the form of a summary of an interview with their director of infrastructure, Julia Grace.

Google claims a lot with Cloud Spanner. Does it deliver? I’d really like to see a balanced, deeply technical review, so if you know of one, please drop me a link.

With this release, we’ve extended Cloud Spanner’s transactions and synchronous replication across regions and continents. That means no matter where your users may be, apps backed by Cloud Spanner can read and write up-to-date (strongly consistent) data globally and do so with minimal latency for end users.
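
To put that claim in concrete terms, here’s a minimal sketch using the google-cloud-spanner Python client. The instance ID, database ID, and accounts table are placeholders of mine, not anything from Google’s post; the point is just that a snapshot with no staleness bound gives you a strong read, and writes go through replicated read-write transactions.

    # Minimal sketch: strong read and read-write transaction with the
    # google-cloud-spanner client. IDs and schema are made-up placeholders.
    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("my-instance").database("my-database")

    # A snapshot with no staleness options performs a strong read: it reflects
    # every transaction committed before the read, regardless of region.
    with database.snapshot() as snapshot:
        for row in snapshot.execute_sql("SELECT id, balance FROM accounts"):
            print(row)

    # Writes run in a transaction that Spanner replicates synchronously.
    def credit_account(transaction):
        transaction.execute_update(
            "UPDATE accounts SET balance = balance + 10 WHERE id = 1"
        )

    database.run_in_transaction(credit_account)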

Ever been on-call for work and your baby? I think a fair number of us can relate. Thankfully, it sounds like these folks realized that it’s not exactly a best practice to have the parent of a 5-day-old preemie be on call…

Here’s a nice pair of articles on fault tolerance and availability. In the first post (linked above), the author defines the terms “fault”, “error”, and “failure”. The second post starts with definitions of “availability” and “stability” and covers ways of achieving them.

John Allspaw, former CTO of Etsy and author of a ton of awesome articles I’ve featured here, is moving on to something new.

Along with Dr. Richard Cook and Dr. David Woods, I’m launching a company we’re calling Adaptive Capacity Labs, and it’s focused on helping companies (via consulting and training) build their own capacity to go beyond the typical “template-driven” postmortem process and treat post-incident review as the powerful insight lens it can be.

I’m really hoping to have an opportunity to try out their training, because I know it’s going to be awesome.

Outages

  • Heroku
    • Heroku suffered an outage caused by Daylight Saving Time, according to this incident report. Happens to someone every year. (A short illustration of the DST pitfall follows this list.) Full disclosure: Heroku is my employer.
  • Google Docs
  • Discord
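
Not Heroku’s actual bug, but here’s a minimal illustration of why wall-clock time around a DST transition is such a dependable source of incidents. On 2017-11-05, 1:30 AM US Eastern time happened twice, and anything keyed on local time has to decide which of the two instants it means:

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+

    eastern = ZoneInfo("America/New_York")

    # 1:30 AM on 2017-11-05 occurred twice: first in EDT, then again in EST
    # after clocks fell back at 2:00 AM. PEP 495's fold flag picks which one.
    first = datetime(2017, 11, 5, 1, 30, tzinfo=eastern, fold=0)
    second = datetime(2017, 11, 5, 1, 30, tzinfo=eastern, fold=1)

    print(first.astimezone(timezone.utc))   # 2017-11-05 05:30:00+00:00 (EDT)
    print(second.astimezone(timezone.utc))  # 2017-11-05 06:30:00+00:00 (EST)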

SRE Weekly Issue #97

SPONSOR MESSAGE

Attending AWS re:Invent 2017? Visit the VictorOps booth, schedule a meeting, or join us for some after hours fun. See you in Vegas! http://try.victorops.com/SREWeekly/AWS

Articles

Last month, I linked to an article on Xero’s incident response process, and I said:

I find it interesting that incident response starts off with someone filling out a form.

This article goes into detail on how the form works, why they have it, and the actual questions on the form! Then they go on to explain their “on-call configuration as code” setup, which is really nifty. I can’t wait to see part II and beyond.

Spokes is GitHub’s system for storing distributed replicas of git repositories. This article explains how they can do this over long distances in a reasonable amount of time (and why that’s hard). I especially love the “Spokes checksum” concept.

From the CEO of NS1, a piece on the value of checklists in incident response.

Here’s another great guide on the hows and whys of secondary DNS, including options for dealing with nonstandard record types that aren’t compatible with AXFR.
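
For reference, a standard full zone transfer is easy to drive programmatically. Here’s a minimal sketch using the dnspython library, with a placeholder zone name and primary nameserver address; provider-specific record types (ALIAS-style records, for example) are exactly the data that doesn’t come across this way, which is the gap the article is about.

    # Minimal AXFR (full zone transfer) sketch with dnspython.
    # The zone name and primary nameserver address are placeholders.
    import dns.query
    import dns.zone

    zone = dns.zone.from_xfr(dns.query.xfr("203.0.113.53", "example.com"))

    # Walk every record that came across in the transfer.
    for name, ttl, rdata in zone.iterate_rdatas():
        print(name, ttl, rdata)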

From a customer’s perspective, “planned downtime” and “outage” often mean the same thing.

“serverless” != “NoOps”

Willis stresses the importance of integrating with existing operations processes rather than replacing them. “Serverless is just another form of compute. … All the core principles that we’ve really learned about high-performance organizations apply differently … but the principles stay the same,” he said.

When we use root cause analysis, says Michael Nygard, we narrow our focus to counterfactuals that get in the way of finding out what really happened.

CW: hypothetical violent imagery

Outages

This week had a weirdly large number of outages!

SRE Weekly Issue #96

SPONSOR MESSAGE

Integrate VictorOps into your SRE ops to support faster recovery and improved post-incident analysis. Get your free trial started today: http://try.victorops.com/SREWeekly/FreeTrial

Articles

Here’s the recording of my Velocity 2017 talk, posted on YouTube with permission from O’Reilly (thanks!). Want to learn about some gnarly DNS details?

I fell in love with this after reading just the title, and it only got better from there. Why add debug statements haphazardly when an algorithm can automatically figure out where they’ll be most effective? I especially love the analysis of commit histories to build stats on when debug statements were added to various open source projects.

Julia Evans is back with another article about Kubernetes. Along with explaining how it all fits together, she describes a few things that can go wrong and how to fix them.

In this introductory post of a four-part series, we learn why chaos testing a lambda-based infrastructure is especially challenging.

I love the idea of a service that automatically optimizes things even without knowing anything about their internals. Mmm, cookies.

What we are releasing is unfortunately not going to be readily consumable. It is also not an OSS project that will be maintained in any way. The goal is to provide a snapshot of what Lyft does internally (what is on each dashboard, what stats do we look at, etc.). Our hope is having that as a reference will be useful in developing new dashboards for your organization.

It’s not a secret since they published a paper about it. This is an intriguing idea, but I’m wondering whether it’s really more effective than staging environments tend to be in practice.

A history of the SRE profession and a description of how New Relic does SRE.

Full disclosure: Heroku, my employer, is mentioned.

Outages

SRE Weekly Issue #95

SPONSOR MESSAGE

Integrate VictorOps into your SRE ops to support faster recovery and improved post-incident analysis. Get your free trial started today: http://try.victorops.com/SREWeekly/FreeTrial

Articles

Chaos Engineering and Jepsen-style testing are still in their infancy. As this ACM Queue article explains, figuring out what kind of failure to test is still a manual process involving building a mental model of the system. Can we automate it?

GitLab shares the story of how they implemented connection pooling and load balancing with read-only replicas in PostgreSQL.
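
The article covers GitLab’s own setup; purely as a generic illustration of the read/write split (not their implementation), here’s a minimal Python sketch using psycopg2 with placeholder hostnames: writes always hit the primary, and plain reads are spread across the read-only replicas.

    # Hypothetical read/write split (not GitLab's implementation).
    # Hostnames and credentials are placeholders.
    import random
    import psycopg2

    PRIMARY_DSN = "host=db-primary dbname=app user=app"
    REPLICA_DSNS = [
        "host=db-replica-1 dbname=app user=app",
        "host=db-replica-2 dbname=app user=app",
    ]

    def write_conn():
        # Writes (and anything that needs read-your-writes) go to the primary.
        return psycopg2.connect(PRIMARY_DSN)

    def read_conn():
        # Plain reads are balanced across the read-only replicas.
        return psycopg2.connect(random.choice(REPLICA_DSNS))

    with read_conn() as conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM projects")
        print(cur.fetchone())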

When you have 600,000(!!) tables in one MySQL database, traditional migration tools like mysqldump or AWS’s Database Migration Service show cracks. The folks at PressBooks used a different tool instead: mydumper.

AWS Lambda spans multiple availability zones in each region. The author wonders whether it would be more reliable to run separate installations of Lambda in each availability zone, to protect against a failure in Lambda itself.

High-cardinality fields are where all the interesting data exist, says Charity Majors of Honeycomb. But that’s exactly where most monitoring systems break down, leaving you to throw together hacks to work around their limitations.

Google shares some best practices for building Service Level Objectives.
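
Not from the linked piece specifically, but the error budget arithmetic behind an availability SLO is worth having at your fingertips: the tighter the objective, the less downtime you have to spend per window.

    # Error budget arithmetic for an availability SLO over a 30-day window.
    def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
        """Minutes of full downtime allowed per window at a given SLO."""
        return (1 - slo) * window_days * 24 * 60

    for slo in (0.99, 0.999, 0.9999):
        print(f"{slo:.2%} SLO -> {downtime_budget_minutes(slo):.1f} min / 30 days")

    # 99.00% SLO -> 432.0 min / 30 days
    # 99.90% SLO -> 43.2 min / 30 days
    # 99.99% SLO -> 4.3 min / 30 days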

Hosted Graphite brings candidates in to work with them for a day and pays them for their time.

Grueling is right: their entire team came to the office over the weekend to work on the outage. Lesson learned:

When something goes horribly wrong, don’t bring everybody in. More ideas are good to a point, but if you don’t solve it in the window of a normal human’s ability to stay awake, the value they are giving you goes down exponentially as they get tired.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

Google’s Project Aristotle discovered that the number one predictor of successful teams is psychological safety. The anecdotes in this piece show how psychological safety is also critical in analyzing incidents.

Outages

SRE Weekly Issue #94

SPONSOR MESSAGE

All Day DevOps is on Oct. 24th! This FREE, online conference offers 100 DevOps-focused sessions across six different tracks. Learn more & register: http://bit.ly/2waBukw

Articles

This article by the Joint Commission opened my eyes to just how far medicine in the US is from being a High Reliability Organization (HRO). It’s long, but I’m really glad I read it.

HROs recognize that the earliest indicators of threats to organizational performance typically appear in small changes in the organization’s operations.

[…] in several instances, particularly those involving the rapid identification and management of errors and unsafe conditions, it appears that today’s hospitals often exhibit the very opposite of high reliability.

Increment issue #3 is out this week, and Alice Goldfuss gives us this juicy article on staging environments. I love the section on potential pitfalls with staging environments.

For all their advantages, if staging environments are built incorrectly or used for the wrong reasons, they can sometimes make products less stable and reliable.

A Honeycomb engineer gives us a deep-dive into Honeycomb’s infrastructure and shows how they use their product itself (in a separate, isolated installation) to debug problems in their production service. Microservices are key to allowing them to diagnose and fix problems.

This is a nice summary of a paper by Google employees entitled “The Tail at Scale”. 99th-percentile behavior can really bite you when you’re composing microservices. The paper has some suggestions for how to deal with this.
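
The paper’s headline example is easy to reproduce: if a single server is slow for one request in 100, a request that fans out to 100 such servers in parallel sees at least one straggler about 63% of the time.

    # Probability that a fan-out request is delayed by at least one straggler,
    # assuming each of n parallel calls independently exceeds its p99 latency
    # with probability 0.01 (the example from "The Tail at Scale").
    def p_at_least_one_slow(n: int, p_slow: float = 0.01) -> float:
        return 1 - (1 - p_slow) ** n

    for n in (1, 10, 100):
        print(f"fan-out {n:>3}: {p_at_least_one_slow(n):.1%} of requests hit a straggler")

    # fan-out   1: 1.0% of requests hit a straggler
    # fan-out  10: 9.6% of requests hit a straggler
    # fan-out 100: 63.4% of requests hit a straggler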

This post by VictorOps recommends moving away from Root Cause Analysis (RCA) toward a Cynefin-based method.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

I love the idea of detecting race conditions through static analysis. It sounds hard, but the key is that RacerD seeks only to avoid false positives, not false negatives.

RacerD has been running in production for 10 months on our Android codebase and has caught over 1000 multi-threading issues which have been fixed by Facebook developers before the code reaches production.

Our business requires us to deliver near-100% uptime for our API, but after multiple outages that nearly crippled our business, we became obsessed with eliminating single points of failure. In this post, I’ll discuss how we use Fastly’s edge cloud platform and other strategies to make sure we keep our customers’ websites up and running.

Full disclosure: Heroku, my employer, is mentioned.

Outages

  • Honeycomb
    • Honeycomb had a partial outage on the 17th due to a Kafka bug, and they posted an analysis the next day (nice!). They chronicle their discovery of a Kafka split-brain scenario through snapshots of the investigation they did using their dogfood instance of Honeycomb.
  • Visual Studio Team Services
    • Linked is an absolutely top-notch post-incident analysis by Microsoft. The bug involved is fascinating and their description had me on the edge of my seat (yes, I’m an incident nerd).
  • Heroku
    • Heroku posted a followup for an outage in their API. Faulty rate-limiting logic prevented the service from surviving a flood of requests. (A generic rate-limiting sketch follows below.) Earlier in the week, they posted a followup for incident #1297 (link). Full disclosure: Heroku is my employer.
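
The followup doesn’t share code, and this isn’t Heroku’s implementation, but a token bucket is the textbook shape of rate-limiting logic like the kind described: requests spend tokens that refill at a fixed rate, so a flood gets shed (say, with a 429) rather than taking the backend down with it. A minimal sketch:

    # Hypothetical token-bucket rate limiter (not Heroku's implementation).
    import time

    class TokenBucket:
        def __init__(self, rate: float, capacity: float):
            self.rate = rate          # tokens added per second
            self.capacity = capacity  # maximum burst size
            self.tokens = capacity
            self.updated = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill based on elapsed time, capped at the bucket's capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # shed this request (e.g. respond with HTTP 429)

    bucket = TokenBucket(rate=100, capacity=200)  # 100 req/s, bursts up to 200
    if not bucket.allow():
        print("rate limited")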