SRE Weekly Issue #117


“If it ain’t broke—let’s break it, fix it, then break it again, then fix it again.” Read more about making your SRE team(s) more proactive through chaos engineering:


Brilliant, just brilliant. This isn’t just another “there isn’t just one root cause” article to skip over. The author takes time to explain the concept with cogent examples and useful metaphors. This one really caught my eye:

What’s the root cause of success?
[…] When building a successful project, there’s never just one thing that goes right for it to succeed.

Will Gallego

This episode of Food Fight is an hour-long interview with guests Rob Schnepp, Ron Vidal, and Chris Hawley, the three firefighters behind Blackrock 3 Partners. It’s a great intro to the Incident Management System, and well worth a listen.

Shout-out to Maple Player, an Android audio player with a really high-quality tempo increase feature. I was able to listen at 1.5x speed and still understand everything; otherwise, I wouldn’t have had time this week.

Nell Shamrell-Harrington and Nathen Harvey

Here’s one from the archives, an incident report from 2013. After a temporary network partition in a Redis cluster, the replicas all tried to resynchronize at once, overloading the master. One of the results was that some customers got repeatedly charged for the same thing.


You have to design a system such that the natural thing to do yields a good result and doesn’t put anyone in harm’s way.

Rachel Kroll

I thought consistent hashing was largely solved. I was wrong! There are some good solutions out there, but you have to evaluate their relative trade-offs and pick the right one for your use case.

Damian Gryski

Full disclosure: Damian Gryski is my coworker at Fastly.
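
For anyone who hasn't seen consistent hashing in code, here is a minimal sketch of the classic ring-with-virtual-nodes variant. It is only one of several approaches and not necessarily the one the article favors. The `HashRing` class, the `vnodes` parameter, and the use of MD5 as the hash are my own illustrative assumptions.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical ring: each node owns `vnodes` points on a circle of hash values."""

    def __init__(self, nodes, vnodes=100):
        ring = []
        for node in nodes:
            for i in range(vnodes):
                ring.append((_hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._points = [point for point, _ in ring]

    def lookup(self, key: str) -> str:
        """Return the node owning the first point clockwise from the key's hash."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = HashRing(["cache-a", "cache-b", "cache-c"])
    print(ring.lookup("user:42"))
```

The property that makes this attractive is that removing a node only moves the keys that node owned; every other key stays where it was. The trade-offs (memory for the ring, evenness of load) are exactly what the article asks you to weigh.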

As you read this article, consider the ethical imperative of system reliability: in some cases, reliability can literally mean life and death. That’s only going to be more common in the coming years.

Yonatan Zunger

Our service needs to be available 24/7, without question. In order to ensure this happens, the LogicMonitor TechOps team uses HashiCorp Packer, Terraform, and Consul to dynamically build infrastructure for disaster recovery (DR) in a reliable and sustainable way.

Randall Thomson — LogicMonitor

On Tuesday, 13 March 2018 at 12:04 UTC a database query was accidentally run against our production database which truncated all tables.

Oof. Sorry, Travis folks, but a sincere thanks for sharing your experience with us.

Konstantin Haase — Travis CI

I like these “preliminary results” better than the kinds of aggregate statistics you normally get from a survey report. There are real quotes from free-form survey answers, including a couple of real gems. There’s a link to download the actual survey report if you’re into that, too.

Dawn Parzych — Catchpoint


SRE Weekly Issue #116


How can breaking something also fix it? Controlled chaos engineering can help your SRE team(s) better understand your systems and ultimately improve site reliability. See how VictorOps is incorporating “Game Days” to bolster their systems and their SRE culture:


The BBC suffered two simultaneous major outages that broke their online streaming product and forced their website into a limited-functioning mode. This post-incident followup explains what happened and how they dealt with it.

Richard Cooper — BBC

Bursting is a hidden reliability risk that has bitten me hard in the past. Click through for an explanation of the risk and how to mitigate it.

Michael Wittig — Cloudonaut

This post has the most concise definition I’ve seen yet for observability, along with a quiz that will tell you whether you’re Doing It Right™.

the power to ask new questions of your system, without having to ship new code or gather new data in order to ask those new questions

Charity Majors — Honeycomb

This debugging story is an entertaining read, and it’s also got some useful stuff to watch out for in your systems.

Tick tick tick. Time is hard.

Rachel Kroll

Solid knowledge of how DNS works is critical for SREs. This repo contains an introduction to DNS written to be far more approachable than the (many!) DNS RFCs. It’s a work in progress but there’s a lot of good content already.

Bert Hubert and others

Within this post, we’ll discuss growth planning, the challenges associated with being part of a remote team, and some of the unexpected advantages geographically distributed SRE teams can offer.

Akhil Ahuja — LinkedIn

Her thread starts here and continues being awesome:

Real talk, you should never have a paging alert on a system stats metric. Or a single host anything metric. (Or an aggregate host metric, or an aggregate divided by host count, or …)

Charity Majors


SRE Weekly Issue #115


SREcon addresses engineering resilience, reliability, and performance in complex distributed systems. Join us to grab Jason Hand’s new SRE book, and attend a book signing w/ Nicole Forsgren and Jez Humble. March 27-29.


Metrics like Mean Time to Detection (MTTD), Resolution (MTTR), and the like pave over all of the incredibly valuable details of the individual incidents. If you place a lot of emphasis on aggregate incident response metrics, this article may cause you to rethink your methods.

Incidents are unplanned investments. When you focus solely on shallow data you are giving up the return on those investments that you can realize by deeper and more elaborate analysis.

John Allspaw — Adaptive Capacity Labs
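
A quick arithmetic illustration of how a mean can pave over detail. The incident durations below are made-up numbers for the sake of the example, not data from the article, and `mttr` is a hypothetical helper.

```python
from statistics import mean


def mttr(durations_minutes):
    """Mean time to resolution for a list of incident durations, in minutes."""
    return mean(durations_minutes)


# Invented example data: two quarters with very different incident profiles.
quarter_a = [30, 30, 30, 30]  # four routine incidents
quarter_b = [5, 5, 5, 105]    # three trivial incidents and one very long one

if __name__ == "__main__":
    print(mttr(quarter_a), mttr(quarter_b))  # both are 30
```

Both quarters report the same MTTR, yet the 105-minute incident in the second is almost certainly where the interesting lessons are, and the aggregate gives no hint that it exists.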

Duct tape: you know, all the little shell scripts you have in your ~/bin directory that you wrote because your system’s tooling got in your way or didn’t do what you needed? Find that, according to this article, and you’ll find interesting things to work on to make the system better. I’d add that these rough edges are often also the kinds of things that contribute to incidents.

Rachel Kroll

A thoughtful and detailed incident post-analysis, including an in-depth discussion of the weeks-long investigation to determine the contributing factors. The outage involved the interaction of Pacemaker and Postgres.

Chris Sinjakli, Harry Panayiotou, Lawrence Jones, Norberto Lopes and Raúl Naveiras — GoCardless

Here’s a nice overview of chaos engineering, including a mention of a tool I wasn’t aware of for applying chaos to Docker containers.

Jennifer Riggins — The New Stack

The question in the title refers to the gathering of metrics from many systems in an infrastructure. Do they push their metrics in, or should the system pull metrics from each host instead? This Prometheus author explains why they pull and how it scales.

Julius Volz — Prometheus

A primer on achieving seamless deployments with Docker, including examples.

Jussi Nummelin — Kontena

I had some extra time for reviewing content this week, and I took the opportunity to listen to this episode of the Food Fight podcast, with a focus on observability. The discussion is really excellent, and there are some really thought-provoking moments.

Nell Shamrell-Harrington, with Nathen Harvey, Charity Majors, and Jamie Osler

How? By writing runbooks. This article takes you through how, why, and what tools to use as you develop runbooks for your systems.

Francesco Negri — Buildo

As a security-focused company, it only makes sense that Threat Stack would focus on safety when giving developers access to operate their software in production.

We believe that good operations makes for good security. Reducing the scope of engineers’ access to systems reduces the noise if we ever have to investigate malicious activity.

Pete Cheslock — Threat Stack


  • Data Action
    • Data Action is a dependency of many Australian banks.
  • Travis CI
  • S3
    • Amazon S3 had a pair of outages for connections through VPC Endpoints. The Travis CI, Datadog, and New Relic outages were around the same time, but I can’t tell conclusively whether they were related.
  • Datadog
  • New Relic

SRE Weekly Issue #114


Why is design so important to data-driven teams, and what does it mean for observability? See what several experts have to say.


The FCC has released a report on the major Level 3 outage in October of 2016. This article serves as a good TL;DR of what went wrong and includes a link to the full report.

Brian Santo — Fierce Telecom

They had an awesome approach: use RSpec to create a test suite of HTTP requests and run it continuously during the deployment to ensure that nothing changed from the end-user’s perspective. Bonus points for generating tests automatically.

Jacob Bednarz — Envato
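
The general idea can be sketched in a few lines. This is Python rather than RSpec, and it is my own rough illustration rather than Envato's implementation: `fetch` is a hypothetical callable that you would back with a real HTTP client, returning something comparable such as a `(status, body)` tuple.

```python
def snapshot(fetch, paths):
    """Record the response for each path before the deployment starts."""
    return {path: fetch(path) for path in paths}


def changed_paths(baseline, fetch):
    """Return the paths whose current response no longer matches the baseline."""
    return [path for path, expected in baseline.items() if fetch(path) != expected]


def watch(baseline, fetch, rounds):
    """Re-check the baseline `rounds` times; stop early at the first difference."""
    for _ in range(rounds):
        diff = changed_paths(baseline, fetch)
        if diff:
            return diff
    return []
```

In practice you would run `watch` in a loop for the duration of the rollout and treat any non-empty result as a signal to pause or roll back. Responses with timestamps or request IDs would need normalizing before comparison.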

Netflix reduced the time it takes to evacuate a failed AWS region from 50 minutes to just 8.

Luke Kosewski, Amjith Ramanujam, Niosha Behnam, Aaron Blohowiak, and Katharina Probst — Netflix

I don’t usually link to talks, but this talk transcript reads almost like an article, and it’s a good one. The premise: if you’re not monitoring well, then you can’t safely test in production. Scalyr found a few ways in which their monitoring showed cracks, and now they’re sharing it with us.

Steven Czerwinski — Scalyr

Design carefully, especially around retries, lest you create a thundering herd that makes it much harder to recover from an outage. That lesson and more, in this article on shooting yourself in the foot at web scale.

Benjamin Campbell — Business Computing World
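
One common way to keep retries from turning into a thundering herd is capped exponential backoff with random jitter, so that clients which failed together don't all come back together. Here is a minimal sketch of that technique; it isn't taken from the article, and the function names and default values are assumptions for illustration.

```python
import random
import time


def backoff_delays(base=0.1, cap=10.0, count=4, rng=random.random):
    """Yield `count` delays, each uniform in [0, min(cap, base * 2**n))."""
    for n in range(count):
        yield rng() * min(cap, base * (2 ** n))


def call_with_retries(op, attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call op() up to `attempts` times, sleeping a jittered delay between tries."""
    delays = backoff_delays(base, cap, attempts - 1)
    while True:
        try:
            return op()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the last error to the caller
            sleep(delay)
```

The randomness is the important part: plain exponential backoff still leaves every client retrying at the same moments, just less often. A retry budget or circuit breaker on top of this would limit the total extra load further.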

Have I mentioned how much I love GitLab’s openness? Here’s how they handle on-call shift transitions in their remote-only organization.

John Jarvis — GitLab

What is the definition of a distributed system, and why are they difficult? I really love the definition in the second tweet.

Charity Majors

I sure love a good troubleshooting story. This one has a pretty excellent failure mode, A+ investigative technique, and an emphasis on following something through until you find an answer.

Rachel Kroll

This discussion of how and why to create a globally-distributed SRE team may only apply to bigger companies, but it’s got a lot of useful bits in it. I just have to stop laughing at the acronym “GD”…

Akhil Ahuja — LinkedIn


SRE Weekly Issue #113


Grafana and VictorOps help teams visualize time series metrics across incident management. Here’s what you need to know:


The best kind of engineer is one that understands not only their specialty, but at least something about the fields adjacent to theirs. The empathy this confers allows one to work incredibly effectively across the company. For SREs, this is even more important.

[…] many of us are finding that the most valuable skill sets sit at the intersection of two or more disciplines.

Charity Majors — Honeycomb

GitLab held a session about recognizing and preventing burnout at their recent employee summit. They share the best tips in this article, and true to their radically open culture, they also added what they learned to their employee handbook, which is publicly available.

Clement Ho — GitLab

Here’s a post-analysis for a Travis CI incident early last year. Despite a couple of easy targets that could have been labelled as “root cause”, they instead skillfully laid out all of the contributing factors and left it at that.

Travis CI

What indeed? The same thing, just organized differently. There’s a lot of great analysis here about how ops roles can adapt to a serverless infrastructure, and how teams can best make use of ops folks.

Tom McLaughlin — ServerlessOps

Charity Majors wants you to look forward to on-call. This superb write-up of her recent conference talk explains why folks should think of on-call as an enjoyable privilege and how to shape your on-call to get there.

Jennifer Riggins

The Canary Analysis Service is Google’s internal tool that automatically analyzes canary runs and decides whether performance has been negatively impacted. My favorite section is the Lessons Learned.

Štěpán Davidovič with Betsy Beyer — ACM Queue


  • Snapchat
  • 123 Reg (hosting provider)
    • Customers lost files added since 123 Reg’s last valid backup from August, 2017.
  • partypoker
  • eBay
  • Signal and Telegram (messenger apps)
  • Alexa
    • I missed this one last week — it was apparently due to the AWS outage I reported on.
  • TD Bank
  • Oculus Rift
    • A code-signing certificate expired, rendering some existing VR headsets non-functional. Oculus is issuing a $15 store credit to affected customers.

      Because of the particulars of what expired and how it happened, the company wasn’t able to simply push an update out to users because the expired certificate was blocking Oculus’ standard software update system.
