SRE Weekly Issue #100

Whoa, it’s issue #100! Thank you all so much for reading.

Articles

Richard Cook wrote this short, incredibly insightful essay on how we can use incidents to improve our mental model of the system.

An incident is the message from the underlying system about where the people who made and who operate that system are uncalibrated.

a short example of why dimensions are suuuuuper valuable

A nifty trip through a debugging session that shows the importance of being able to dig into high-cardinality fields in your monitoring system.

How to Monitor the SRE Golden Signals

Various sources list a handful of key metrics to keep an eye on, including request rate, error rate, latency, and others. This 6-part series defines the golden signals and shows how to monitor them in several popular systems.
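
To make the signals named above concrete, here is a minimal sketch of computing request rate, error rate, and a latency percentile from a batch of request records. It is not taken from the linked series: the `Request` fields, the `golden_signals` helper, the choice to count only 5xx responses as errors, and the nearest-rank p99 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record shape; real systems expose these via logs or metrics.
    timestamp: float    # seconds since epoch
    status: int         # HTTP status code
    latency_ms: float   # time taken to serve the request


def golden_signals(requests, window_seconds):
    """Summarize one time window of requests into three signals."""
    total = len(requests)
    if total == 0:
        return {"request_rate": 0.0, "error_rate": 0.0, "p99_latency_ms": None}

    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)

    # Nearest-rank 99th percentile, using integer ceiling division
    # to avoid floating-point rounding surprises.
    latencies = sorted(r.latency_ms for r in requests)
    rank = -(-total * 99 // 100)

    return {
        "request_rate": total / window_seconds,   # requests per second
        "error_rate": errors / total,             # fraction of failed requests
        "p99_latency_ms": latencies[rank - 1],
    }
```

A percentile is used for latency rather than a mean because a small number of very slow requests can hide behind a healthy-looking average.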

Thrift on Steroids: A Tale of Scale and Abstraction

This article explains some downsides of Thrift and introduces the author’s solution: Frugal, a Thrift wrapper.

re:Invent 2017 | New Products & Services

re:Invent 2017 is over (whew) and now we have a raft of new products and features to play with. I’m going to leave the detailed analysis for Last Week in AWS and just point out a few bits of special interest to SREs:

Hibernation for spot instances
T2 unlimited
EC2 spread placement groups
Aurora DB multi-master support (preview)
DynamoDB global tables

How Etsy caches: hashing, Ketama, and cache smearing

Etsy details their caching setup and explains the importance of consistent hashing in cache cluster design. I hadn’t heard of their practice of “cache smearing” before, and I like it.
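
For readers new to consistent hashing, a rough sketch of the idea follows. This is not Etsy's code and not the Ketama library itself: the `HashRing` class, the node names, and the 100 virtual points per node are assumptions chosen for illustration. The point it shows is that removing a cache node only remaps the keys that lived on that node.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas   # virtual points per node (assumed value)
        self._points = []          # sorted hash positions on the ring
        self._owner = {}           # hash position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        # MD5 is used here only to spread keys evenly, not for security.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            if point in self._owner:
                continue           # extremely unlikely collision; keep first owner
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            if self._owner.get(point) == node:
                del self._owner[point]
                self._points.remove(point)

    def get(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # Walk clockwise to the first point past the key's hash, wrapping around.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {f"user:{i}": ring.get(f"user:{i}") for i in range(1000)}
ring.remove("cache-b")
moved = sum(1 for k, node in before.items() if ring.get(k) != node)
# Only keys that were on cache-b should have moved.
```

With naive `hash(key) % number_of_nodes` placement, by contrast, dropping one node would remap most keys and empty most of the cache at once.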

The role of software in spacecraft accidents

[…] “Success is ironically one of the progenitors of accidents when it leads to overconfidence and cutting corners or making tradeoffs that increase risk.” […]

Won’t Get Fooled Again

Gremlin had an incident that was caused by filled disks. Because they’re Gremlin, they now purposefully fill a disk on a random server every day just to make sure their systems deal with it gracefully, a practice they call “continuous chaos”.

Fearless shared postmortems — CRE life lessons

Google’s CRE team (Customer Reliability Engineering) discusses when to post public followups and how to write them. I love their idea of investigating where they got lucky during an incident, catching cases where things could have been much worse if not for serendipity. I’m going to start using that.
