SRE Weekly Issue #269

A message from our sponsor, StackHawk:

Tune in to ZAPCon After Hours this Tuesday at 8 am PT to learn how to include automated security testing in your builds with ZAP.


We built Edgar to ease this burden, by empowering our users to troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata.

Kevin Lew, Maulik Pandey, Narayanan Arunachalam, Dustin Haffner, Andrei Ushakov, Seth Katz, Greg Burrell, Ram Vaithilingam, Mike Smith and Elizabeth Carretto — Netflix

The PDF covers 5 main areas:

  1. Availability
  2. Performance
  3. Monitoring
  4. Incident Response
  5. Preparation

No account or form is required to download the PDF.


This one’s especially interesting for its section on what MTTx metrics aren’t good for and the section that follows on how to improve them.

Emily Arnott — Blameless

If you’re interested in deploying Kafka in a multi-region configuration, eBay has put quite a bit of thought into this and has a lot to share.

Engin Yoeyen — eBay

Straight from someone who was there from the start. The “what chaos engineering is not” section is especially enlightening.

Casey Rosenthal — Verica

The last paragraph regarding “unknown unknowns” is noteworthy.


There are some great questions in here on blamelessness and full service ownership.

James Thigpen — Gremlin


Updated: May 9, 2021 — 10:02 pm