SRE Weekly Issue #80


New eBook for DevOps pros: The Dev and Ops Guide to Incident Management offers 25+ pages of essential insight into building teams and improving your response to downtime.


I had no idea there were so many tracing systems in Linux! Fortunately Julia Evans did, and she learned all about them so that she could explain them to us.

There’s strace, and ltrace, kprobes, and tracepoints, and uprobes, and ftrace, and perf, and eBPF, and how does it all fit together and what does it all MEAN?

What do you get when a high school teacher switches careers, goes to boot camp, and becomes an SRE? In this case, we get Krishelle Hardson-Hurley, who wrote this really great intro to the SRE field. She also included a set of links to other SRE materials. Thanks for the link to SRE Weekly, Krishelle!

This issue of Production Ready is a transcript (with slides) of Mathias’s talk at ContainerDays on doing chaos engineering in a container-based infrastructure. I really like the idea of attaching a side-car container to inject latency using tc.

Here’s an interesting side-effect from an IPO: Redfin was obliged to mention the fact that its website runs out of a single datacenter.

This article, part of a series on structured event logging, contains some tips on structuring your events well to get the most out of your logs.

I’d never thought about what IT systems must exist on a cruise ship before. This article left me wanting to know more, so I found this ZDNet article with pictures and descriptions of another cruise ship datacenter layout.


SRE Weekly Issue #79




Asking “what failed?” can point an investigation in an entirely different and more productive direction.

[…] the power you have is not in the answer to your question; it’s in the question […]

If you’re planning to write reliable, well-performing server code in Linux, you’ll need to know how to use epoll. Here’s Julia Evans to tell you what she learned about epoll and related syscalls.

Tyler Treat reconciles Kafka 0.11’s exactly-once semantics with his classic article, “You Cannot Have Exactly-Once Delivery”.

A “refcard” from Dzone covering a wide range of SRE basics, including load balancing, caching, clustering, redundancy, and fault tolerance.

A PagerDuty engineer applies on-the-job expertise to labor, delivery, and parenting. Lots of concepts translate pretty well. Some… not so much.

As an SRE, I want “quality” code to be shipped so that our system is reliable. But what am I really after? Sam Stokes says we should avoid using the term “quality” in favor of finding common ground and understanding the whole situation.

The reality is that doing anything in the real world involves difficult decisions in the face of constraints.

The value of logs is in what questions you can answer with them.

A sample rate of 20 means “there were 19 other events just like this one”. A sample rate of 1 means “this event is interesting on its own”.
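The arithmetic behind that sample-rate convention can be sketched as follows. This is a minimal illustration rather than any particular logging tool's API; the event dictionaries and the `sample_rate` field name are assumptions made for the example.

```python
def estimated_total(events):
    """Estimate how many original events a sampled stream represents.

    Each kept event stands in for `sample_rate` originals (itself included),
    so the estimate is the sum of the sample rates, not the number of rows.
    """
    return sum(e.get("sample_rate", 1) for e in events)


def estimated_count_where(events, predicate):
    """Estimate how many original events matched `predicate`."""
    return sum(e.get("sample_rate", 1) for e in events if predicate(e))


# Hypothetical sampled log: routine successes kept at 1-in-20,
# the error kept unsampled because it is interesting on its own.
sampled = [
    {"status": 200, "sample_rate": 20},
    {"status": 200, "sample_rate": 20},
    {"status": 500, "sample_rate": 1},
]

total = estimated_total(sampled)
errors = estimated_count_where(sampled, lambda e: e["status"] == 500)
```

Three stored events stand in for roughly 41 originals here, which is why queries over sampled logs need to weight each event by its sample rate instead of counting rows.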

The Signiant team previously had no dedicated solution for incident communication. As a result, any hiccup in service resulted in a flooded queue for service agents and a stuffed inbox of “what’s going on here” notes from internal team members.

In practice, a message broker is a service that transforms network errors and machine failures into filled disks.

Queues inevitably run in two states: full, or empty.

You can use a message broker to glue systems together, but never use one to cut systems apart.


  • Fastly
  • Rackspace
  • Pinboard
    • Pinboard experienced a bit of feature degradation as its admin replaced a disk. I’m only including this because it meant that I couldn’t post this issue on time. ;)

      Pinboard’s really awesome, and I wouldn’t be able to put together this newsletter without it. The API is super-simple to use, and I’m able to save and classify links right on my phone. A+, would socially bookmark with again.

SRE Weekly Issue #78




This Master’s thesis by Crista Vesel seeks to answer the question, “How does the language used in the U.S. Forest Service’s Serious Accident Investigation Guide bias accident investigation analysis?” It’s an awe-inspiring analysis, drawing on Dekker, Woods, Cook, and other authors I’ve linked here repeatedly.

The most exciting part for me was the confirmation of some vague thoughts I’ve had around the use of passive versus active voice in retrospectives. By using passive voice, we can seek to reduce the kind of blaming that is inherent in active/agentive language.

It’s by Julia Evans. Just read it.

Being responsible for my programs’ operations makes me a better developer

PagerDuty again draws on ITIL, this time to outline an example system for classifying incident impact and urgency in order to determine priority.
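As a rough sketch of how an impact/urgency classification can determine priority, the lookup below uses a small matrix. The three levels and the specific priority numbers are assumptions chosen for illustration; they are not taken from PagerDuty's article or from any ITIL publication.

```python
LEVELS = ("high", "medium", "low")

# Hypothetical matrix keyed by (impact, urgency); 1 is the highest priority.
PRIORITY = {
    ("high", "high"): 1, ("high", "medium"): 2, ("high", "low"): 3,
    ("medium", "high"): 2, ("medium", "medium"): 3, ("medium", "low"): 4,
    ("low", "high"): 3, ("low", "medium"): 4, ("low", "low"): 5,
}


def priority(impact, urgency):
    """Look up an incident's priority from its impact and urgency levels."""
    if impact not in LEVELS or urgency not in LEVELS:
        raise ValueError(f"impact and urgency must each be one of {LEVELS}")
    return PRIORITY[(impact, urgency)]
```

Keeping the mapping in a table rather than in nested conditionals makes it easy for a team to review and adjust the policy without touching the lookup logic.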

PagerDuty’s take on automating chaos runs periodically, of course, and also includes a chat-bot that lets folks trigger one-off host failures.

Unfortunately, ChaosCat is significantly tied into our internal infrastructure tooling. For the moment this means we won’t be open-sourcing it.

This article is an overview of Microsoft’s DRaaS offering, Azure Site Recovery. Protip: you can just scroll past the signup-gate if you don’t feel like entering your email address.

Grab evaluated a couple of existing solutions but went with a simple custom sharding layer as a method to scale out their Redis usage.
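One common shape for a simple sharding layer is to hash each key and pick a shard by modulo. The sketch below illustrates that general idea only; it is not Grab's design, the shard names are hypothetical, and no Redis client is involved.

```python
import zlib


class ShardRouter:
    """Map keys to shards with a stable hash (illustrative sketch only)."""

    def __init__(self, shards):
        if not shards:
            raise ValueError("at least one shard is required")
        self._shards = list(shards)

    def shard_for(self, key):
        # zlib.crc32 is stable across processes, unlike Python's built-in
        # hash(), which is randomized per interpreter run for strings.
        index = zlib.crc32(key.encode("utf-8")) % len(self._shards)
        return self._shards[index]


# Hypothetical shard names standing in for separate Redis instances.
router = ShardRouter(["redis-a", "redis-b", "redis-c"])
target = router.shard_for("user:42")
```

The trade-off with plain modulo sharding is that changing the number of shards remaps most keys, so a layer like this suits a fixed shard count or a planned migration.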


  • Rollbar
  • LinkedIn
  • Skype
    • Suspected DDoS.
  • ATO (Australian Tax Office)
  • Dyn
    • Dyn suffered a long outage, and they posted an amazing 28 detailed updates to their status site before all was said and done. That’s something to aspire to.
  • Heroku
    • Heroku posted a followup for their series of incidents early this month. Sorry for missing posting those outages when they happened! Full disclosure: Heroku is my employer.

SRE Weekly Issue #77

I really love that some of you are taking vacations. Preventing burnout is really critical for improving reliability. That said, if you’d please exempt my address from your vacation auto-responder, that’d be super-cool ;)




Last week, I linked to a Reddit story of an engineer who was unfairly fired for a mistake on their first day. Dr. Richard Cook picked this up and wrote up a great analysis of the underlying organizational issues.

Thanks to John Allspaw for this one.

This was released the week before last, but it took me a while to digest it. The ATO did a very thorough post-analysis on their two outages and released this polished report. I like that they took full responsibility for the outage even though it was an issue with a fully-managed vendor SAN offering, and they clearly sought to learn as much as possible.

Pinterest tech lead Suman Karumuri explains how they use distributed tracing and the benefits it’s brought them.

With these new use cases, we see tracing infrastructure as the third pillar of monitoring our services in addition to metrics and log search systems.

Frustrated by Willie Walsh’s public statement regarding British Airways’s major outage, TripWire founder Gene Kim took it upon himself to write an open letter of apology as if he were an airline CEO. It’s pretty great.

This article explores several options for HA with Nginx: put an ELB in front of it, Route 53 with health checks, or an elastic IP switched either by keepalived or a Lambda function.

I’ve been following GitLab’s blog since their engineer accidentally deleted their database earlier this year, and I’m glad I did. This article touches on all sorts of topics near to my heart: preventing burnout, examining incident response metrics, enforcing vacations, incident command, and having developers go on-call for what they wrote.

The costs associated with running a full-capacity redundant system in a secondary site can be numerous and subtle. Those costs can be especially hard to swallow when expected returns on infrastructure investments prove elusive.

Netflix explains in depth the careful scientific experiments they perform in production in order to improve the QoE (quality of experience).


  • Google Cloud Services
    • 62-minute multiple-zone total internet outage in asia-northeast1. Postmortem linked, including a description of several contributing factors.

      We apologize for the impact this issue had on our customers, and especially to those customers with deployments across multiple zones in the asia-northeast1 region. We recognize we failed to deliver the regional reliability that multiple zones are meant to achieve.

  • Coinbase
  • YouTube

SRE Weekly Issue #76

This week, I had the awesome opportunity to attend a short-form training session on the Incident Management System (the broader system that includes Incident Command) given by Blackrock 3 Partners. Shout-out to Rob, Ron, and Chris – it was awesome meeting you guys, and I really enjoyed our conversations!


Upcoming webinar: Top 10 Practices of Highly Successful DevOps Incident Management Teams.


In case you missed it, Uber kicked off this and another investigation in response to a blog post by Susan Fowler, an SRE whose writing I’ve featured here a number of times. I’m pleased at this first step by Uber and I’m looking forward to what comes next. It might be a leave of absence for Uber’s CEO, although no decision has been made yet.

Here’s the 2013 article that started it all. If you’re unfamiliar with Jepsen, it’s an article series on testing various distributed data systems for partition tolerance, along with a companion tool set for inducing failures.

For those not completely “cloud native” (ugh) by this point, here’s a nifty primer on some of the BGP tricks you’ll need to know if you manage your own IP transit links.

Redis has a pretty big gotcha regarding deletion of expired keys, as these engineers discovered. In fact, my experience with Redis was full of operational gotchas like this.

This poor anonymous Reddit poster had a very bad day. The community rallied around them to explain that no, the anonymous poster is not to blame. One of the top commenters is Yorick Peterse, the engineer who inadvertently deleted GitLab’s main database earlier this year. Click through to see blamelessness in action.

PagerDuty is deeply invested in the Incident Management System, and most especially Incident Command. This article is a great overview, and if you want more, don’t forget that they also released their incident response documentation a while back, including their Incident Commander training material.

The main theme in this article is the direct relationship between increasing complexity and difficulty in attaining high reliability. I like the mention of microservices as a trade-off and not a panacea.

Automation doesn’t replace ops, it augments it. Abstraction doesn’t replace ops, it hides it. Function as a service doesn’t remove complexity, it increases it exponentially.


  • Amazon product pages went down today in a rare outage
    • The linked story was for an outage on June 7th. There was at least one additional similar outage on June 9th (source: personal experience).
  • Verelox
    • Dutch hosting provider Verelox is having a really rough time:

      First of all, we want to offer our apologies for any inconvenience. Unfortunately, an ex administrator has deleted all customer data and wiped most servers.

      Ouch. Good luck, folks.
