SRE Weekly Issue #47

SPONSOR MESSAGE

Downtime sucks. Learn how leading minds in tech respond to outages on the Nov. 16th “Ask Me Anything” from Catchpoint & O’Reilly Media: http://try.victorops.com/AMA

Articles

Next year, SRECon is expanding to three events: Americas, EMEA, and Asia. The Americas event is also moving from Santa Clara to San Francisco, which I, for one, am especially grateful for. The CFP for SRECon17 Americas just opened up, and proposals are due November 30th, so get cracking! I can’t wait to see what all of you have to share!

I have a somewhat dim view of automated anomaly detection in metrics based on my experience with a few tools, but if Datadog’s algorithms live up to their description, they might really have something worthwhile.

When a responder gets an anomaly alert, he or she needs to know exactly why the alert triggered. The monitor status page for anomaly alerts shows what the metric in question looked like over the alert’s evaluation window, overlaid with the algorithm’s predicted range for that metric.

This issue of Production Ready chronicles Mathias Lafeldt’s effort to create a staging environment. I like the emphasis on using an entirely separate AWS account for staging. This is increasingly becoming a best practice.

What’s causing all that API request latency? Here’s an interesting debug run using Honeycomb. Negative HTTP status codes? Sure, that’s totally a thing, right?

I love this idea: Susan Fowler notes that large, complex systems are constantly changing, and this makes reproducing bugs difficult or impossible. Her suggestion is to log enough that you can logically reconstruct the state of the system at the time the bug occurred. This is the same kind of thing the Honeycomb folks are saying: throw a lot of information into your logs, just in case you might need it to debug something unforeseen.

Outages

Updated: November 6, 2016 — 8:57 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme