SRE Weekly Issue #79


New eBook for DevOps pros: The Dev and Ops Guide to Incident Management offers 25+ pages of essential insight into building teams and improving your response to downtime.


Asking “what failed?” can point an investigation in an entirely different and more productive direction.

[…] the power you have is not in the answer to your question; it’s in the question […]

If you’re planning to write reliable, well-performing server code in Linux, you’ll need to know how to use epoll. Here’s Julia Evans to tell you what she learned about epoll and related syscalls.

Tyler Treat reconciles Kafka 0.11’s exactly-once semantics with his classic article, “You Cannot Have Exactly-Once Delivery”.

A “refcard” from DZone covering a wide range of SRE basics, including load balancing, caching, clustering, redundancy, and fault tolerance.

A PagerDuty engineer applies on-the-job expertise to labor, delivery, and parenting. Lots of concepts translate pretty well. Some… not so much.

As an SRE, I want “quality” code to be shipped so that our system is reliable. But what am I really after? Sam Stokes says we should avoid using the term “quality” in favor of finding common ground and understanding the whole situation.

The reality is that doing anything in the real world involves difficult decisions in the face of constraints.

The value of logs is in what questions you can answer with them.

A sample rate of 20 means “there were 19 other events just like this one”. A sample rate of 1 means “this event is interesting on its own”.
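To make that arithmetic concrete, here is a minimal Python sketch of how sampled events might be weighted when counting. The event layout and the `sample_rate` field name are assumptions for illustration, not taken from the linked article.

```python
# Hypothetical sampled events: each kept event stands in for `sample_rate`
# original events (itself plus sample_rate - 1 similar ones that were dropped).
events = [
    {"status": 200, "sample_rate": 20},
    {"status": 200, "sample_rate": 20},
    {"status": 500, "sample_rate": 1},
]


def estimated_counts(sampled_events):
    """Estimate original per-status counts by summing sample rates."""
    counts = {}
    for event in sampled_events:
        status = event["status"]
        counts[status] = counts.get(status, 0) + event["sample_rate"]
    return counts


print(estimated_counts(events))  # {200: 40, 500: 1}
```

Two kept 200s at a rate of 20 represent roughly 40 original requests, while the single 500 at a rate of 1 counts only as itself.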

The Signiant team previously had no dedicated solution for incident communication. As a result, any hiccup in service resulted in a flooded queue for service agents and a stuffed inbox of “what’s going on here” notes from internal team members.

In practice, a message broker is a service that transforms network errors and machine failures into filled disks.

Queues inevitably run in two states: full, or empty.

You can use a message broker to glue systems together, but never use one to cut systems apart.
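The full-or-empty observation above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: fixed per-tick arrival and service rates (real traffic is bursty) and an unbounded queue. The function name and parameters are hypothetical.

```python
def queue_depths(arrivals_per_tick, served_per_tick, ticks):
    """Return the backlog at the end of each tick for fixed rates."""
    depth = 0
    depths = []
    for _ in range(ticks):
        # Backlog changes by the difference in rates and cannot go negative.
        depth = max(0, depth + arrivals_per_tick - served_per_tick)
        depths.append(depth)
    return depths


print(queue_depths(3, 5, 5))  # consumers faster: [0, 0, 0, 0, 0]
print(queue_depths(5, 3, 5))  # producers faster: [2, 4, 6, 8, 10]
```

When consumers keep up, the backlog sits at zero. When they do not, it grows every tick until some limit, such as a disk, is reached.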


  • Fastly
  • Rackspace
  • Pinboard
    • Pinboard experienced a bit of feature degradation as its admin replaced a disk. I’m only including this because it meant that I couldn’t post this issue on time. ;)

      Pinboard’s really awesome, and I wouldn’t be able to put together this newsletter without it. The API is super-simple to use, and I’m able to save and classify links right on my phone. A+, would socially bookmark with again.

Updated: July 3, 2017 — 9:04 am
A production of Tinker Tinker Tinker, LLC