SRE Weekly Issue #269

A message from our sponsor, StackHawk:

Tune in to ZAPCon After Hours this Tuesday at 8 am PT to learn how to include automated security testing in your builds with ZAP.
http://sthwk.com/after-hours-3

Articles

We built Edgar to ease this burden, by empowering our users to troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata.

Kevin Lew, Maulik Pandey, Narayanan Arunachalam, Dustin Haffner, Andrei Ushakov, Seth Katz, Greg Burrell, Ram Vaithilingam, Mike Smith and Elizabeth Carretto — Netflix

The PDF covers 5 main areas:

  1. Availability
  2. Performance
  3. Monitoring
  4. Incident Response
  5. Preparation

No account or form is required to download the PDF.

Splunk/VictorOps

This one’s especially interesting for the section about what MTTx metrics aren’t good for, and the following section on how to improve them.

Emily Arnott — Blameless

If you’re interested in deploying Kafka in a multi-region configuration, eBay has put quite a bit of thought into this and has a lot to share.

Engin Yoeyen — eBay

Straight from someone who was there from the start. The “what chaos engineering is not” section is especially enlightening.

Casey Rosenthal — Verica

The last paragraph regarding “unknown unknowns” is noteworthy.

Heroku

There are some great questions in here on blamelessness and full service ownership.

James Thigpen — Gremlin

Outages

Updated: May 9, 2021 — 10:02 pm
A production of Tinker Tinker Tinker, LLC