SRE Weekly Issue #100


Whoa, it’s issue #100! Thank you all so much for reading.

SPONSOR MESSAGE

Integrate VictorOps into your SRE ops to support faster recovery and improved post-incident analysis. Get your free trial started today: http://try.victorops.com/SREWeekly/FreeTrial

Articles

Richard Cook wrote this short, incredibly insightful essay on how we can use incidents to improve our mental model of the system.

An incident is the message from the underlying system about where the people who made and who operate that system are uncalibrated.

A nifty trip through a debugging session that shows the importance of being able to dig into high-cardinality fields in your monitoring system.

Various sources list a handful of key metrics to keep an eye on, including request rate, error rate, and latency. This 6-part series defines the golden signals and shows how to monitor them in several popular systems.
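To make the three signals named above a bit more concrete, here is a minimal sketch that computes request rate, error rate, and a latency percentile from a batch of request records. It is not taken from the series: the `Request` record, the `summarize` function, the choice of HTTP 5xx as "error", and the nearest-rank p95 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    duration_ms: float
    status: int


def summarize(requests, window_seconds):
    """Summarize one observation window of requests.

    Assumptions: an "error" is any status >= 500, and latency is
    reported as the nearest-rank 95th percentile.
    """
    count = len(requests)
    if count == 0:
        return {"request_rate": 0.0, "error_rate": 0.0, "p95_latency_ms": None}

    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)

    # Nearest-rank percentile: rank = ceil(0.95 * count), computed with
    # integer arithmetic to avoid floating-point rounding surprises.
    rank = (95 * count + 99) // 100

    return {
        "request_rate": count / window_seconds,  # requests per second
        "error_rate": errors / count,            # fraction of failed requests
        "p95_latency_ms": durations[rank - 1],
    }
```

A percentile is used rather than a mean because a small number of very slow requests can hide behind a healthy-looking average.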

This article explains some downsides of Thrift and introduces the author’s solution: Frugal, a Thrift wrapper.

re:Invent 2017 is over (whew) and now we have a raft of new products and features to play with. I’m going to leave the detailed analysis for Last Week in AWS and just point out a few bits of special interest to SREs:

  • Hibernation for spot instances
  • T2 unlimited
  • EC2 spread placement groups
  • Aurora DB multi-master support (preview)
  • DynamoDB global tables

Etsy details their caching setup and explains the importance of consistent hashing in cache cluster design. I hadn’t heard of their practice of “cache smearing” before, and I like it.
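For readers who haven’t run into consistent hashing, here is a minimal sketch of a hash ring with virtual nodes. It is a generic textbook version, not Etsy’s implementation: the `HashRing` class, the node names, the MD5-based hash, and the default of 100 virtual nodes per server are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Illustrative consistent-hash ring (hypothetical, not a real library)."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node is placed at many positions so load spreads evenly.
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            if point in self._owner:
                continue  # extremely unlikely collision; keep first owner
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            if self._owner.get(point) == node:
                del self._owner[point]
                self._points.remove(point)

    def lookup(self, key):
        """Return the node owning the first ring position after the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {f"user:{i}": ring.lookup(f"user:{i}") for i in range(1000)}
ring.remove("cache-b")
moved = [k for k, node in before.items() if ring.lookup(k) != node]
# Only keys that lived on the removed node change owners.
assert all(before[k] == "cache-b" for k in moved)
```

The property that matters for a cache cluster is the one checked at the end: losing a server only remaps the keys that server held, instead of reshuffling nearly every key the way a plain `hash(key) % server_count` scheme would.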

[…] “Success is ironically one of the progenitors of accidents when it leads to overconfidence and cutting corners or making tradeoffs that increase risk.” […]

Gremlin had an incident that was caused by filled disks. Because they’re Gremlin, they now purposefully fill a disk on a random server every day just to make sure their systems deal with it gracefully, a practice they call “continuous chaos”.

Google’s CRE team (Customer Reliability Engineering) discusses when to post public followups and how to write them. I love their idea of investigating where they got lucky during an incident, catching cases where things could have been much worse if not for serendipity. I’m going to start using that.

Outages

Updated: December 3, 2017 — 9:43 pm
A production of Tinker Tinker Tinker, LLC