Well, that was a fun week. I hope all of you have had a chance for a rest after any hectic patching you might have been involved in.
Local Rationale: the reasoning and context behind a decision that an operator made. Here’s Todd Conklin reminding us to find out what was really going on when the benefit of hindsight makes a decision seem irrational.
In part two of the series I linked to last week, Tyler Treat introduces data replication strategies, including waiting for all replicas to acknowledge a write before returning versus waiting for just a quorum.
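If the quorum idea is new to you, here's a rough sketch (mine, not from the article) of what a quorum write looks like: the write goes to every replica, but the client returns success as soon as a majority has acknowledged it. The replica names and the `send_to_replica` stub are placeholders.

```python
# Hypothetical quorum-write sketch: fan the write out to all replicas,
# return once a majority (the quorum) has acked.
from concurrent.futures import ThreadPoolExecutor, as_completed

REPLICAS = ["replica-1", "replica-2", "replica-3"]
QUORUM = len(REPLICAS) // 2 + 1  # majority

def send_to_replica(replica, key, value):
    # Stand-in for a real network call; pretend every replica acks.
    return True

def quorum_write(key, value):
    acks = 0
    with ThreadPoolExecutor(max_workers=len(REPLICAS)) as pool:
        futures = [pool.submit(send_to_replica, r, key, value) for r in REPLICAS]
        for future in as_completed(futures):
            if future.result():
                acks += 1
            if acks >= QUORUM:
                return True  # quorum reached; stragglers can ack later
    return False  # not enough replicas acknowledged

print(quorum_write("user:42", "hello"))
```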
Here’s something I wasn’t aware of: hospitals have their own version of the ICS (Incident Command System).
In this blogpost, we discuss why you should create a policy on how SREs and devs respond to SLO violations, and provide some ideas for the structure and components of that policy.
Now this is neat. This research team pings basically the entire internet all the time and can track outages across the globe. They can see things like Egypt shutting down Internet access for all of its citizens and the effects of hurricanes.
This is a summary of a couple of talks from Influx Days. I especially like the bit about Baron Schwartz’s talk on the pitfalls of anomaly detection.
Meltdown is especially scary because the fix has the potential to significantly impact performance.
My phone died this week, and I discovered the hard way that my backups hadn’t been functioning properly. SRE Weekly is served out of a single t2.micro, too. Sometimes it’s hard to practice what I preach outside of work. ;) Anyway, bit of a light issue this week, but still some great stuff.
I’ve lost count of the number of incidents I’ve witnessed that were caused by TCP connections in which one end disappeared. This cautionary tale has a pretty interesting cause as well.
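One common mitigation, as a rough illustration (not from the linked tale): don't let a TCP connection sit forever waiting on a peer that silently vanished. Enable keepalive and set a read timeout so a dead peer surfaces as an error instead of a hang. The host, port, and timeout here are placeholders.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)  # probe idle connections
sock.settimeout(30)  # recv()/connect() raise socket.timeout instead of blocking forever

try:
    sock.connect(("example.com", 80))
    sock.sendall(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")
    data = sock.recv(4096)
except socket.timeout:
    print("peer didn't respond in time -- treat the connection as dead")
finally:
    sock.close()
```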
In many ways, what you do with an incident post-mortem can be as important as the post-mortem itself. Below, I explain why and offer tips on what to do after the post-mortem is complete.
If the title of this article doesn’t make sense to you, then you may well have been interpreting traceroute results incorrectly. Definitely worth a read.
Gremlin Inc. is live! Here’s the official “coming out” post for this chaos engineering startup.
There’s so much to delve into in this long article about Netflix’s infrastructure. It’s part of the book Explain the Cloud Like I’m 10, but I didn’t really find the explanations watered-down or over-simplified.
A great description of booking.com’s incident response and followup process.
Incidents are like presents: You love them as long as you don’t get the same present twice.
Whoa, it’s issue #100! Thank you all so much for reading.
Richard Cook wrote this short, incredibly insightful essay on how we can use incidents to improve our mental model of the system.
An incident is the message from the underlying system about where the people who made and who operate that system are uncalibrated.
A nifty trip through a debugging session that shows the importance of being able to dig into high-cardinality fields in your monitoring system.
Various sources list a handful of key metrics to keep an eye on, including request rate, error rate, latency, and others. This 6-part series defines the golden signals and shows how to monitor them in several popular systems.
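As a toy illustration (not taken from the series), here's what instrumenting rate, errors, and latency might look like with the Python prometheus_client. The metric names and the `handle_request` stub are made up for the example.

```python
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total requests", ["status"])
LATENCY = Histogram("http_request_latency_seconds", "Request latency")

def handle_request():
    start = time.monotonic()
    time.sleep(random.uniform(0.005, 0.05))      # simulate doing some work
    status = "500" if random.random() < 0.01 else "200"  # pretend 1% of requests fail
    REQUESTS.labels(status=status).inc()          # request rate + error rate
    LATENCY.observe(time.monotonic() - start)     # latency

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for scraping
    while True:
        handle_request()
```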
This article explains some downsides of Thrift and introduces the author’s solution: Frugal, a Thrift wrapper.
re:Invent 2017 is over (whew) and now we have a raft of new products and features to play with. I’m going to leave the detailed analysis for Last Week in AWS and just point out a few bits of special interest to SREs:
- Hibernation for spot instances
- T2 unlimited
- EC2 spread placement groups
- Aurora DB multi-master support (preview)
- DynamoDB global tables
Etsy details their caching setup and explains the importance of consistent hashing in cache cluster design. I hadn’t heard of their practice of “cache smearing” before, and I like it.
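For anyone who hasn't run into consistent hashing before, here's a bare-bones sketch of the idea the article builds on: keys and hosts both map to points on a ring, so adding or removing a cache host only moves a small fraction of keys. The virtual-node count and host names below are arbitrary.

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, hosts, vnodes=100):
        self.ring = []  # sorted list of (point, host)
        for host in hosts:
            for i in range(vnodes):
                point = int(hashlib.md5(f"{host}-{i}".encode()).hexdigest(), 16)
                self.ring.append((point, host))
        self.ring.sort()

    def get_host(self, key):
        point = int(hashlib.md5(key.encode()).hexdigest(), 16)
        # first ring point at or after the key's point, wrapping around
        idx = bisect.bisect_left(self.ring, (point, "")) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["memc-a", "memc-b", "memc-c"])
print(ring.get_host("user:42:profile"))
```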
[…] “Success is ironically one of the progenitors of accidents when it leads to overconfidence and cutting corners or making tradeoffs that increase risk.” […]
Gremlin had an incident that was caused by filled disks. Because they’re Gremlin, they now purposefully fill a disk on a random server every day just to make sure their systems deal with it gracefully, a practice they call “continuous chaos”.
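This isn't Gremlin's actual tooling, but as a hypothetical sketch of the idea: fill a disk on purpose, watch how the service copes, then clean up. The path and size are placeholders; run something like this only on a host you're prepared to break.

```python
import os

FILL_PATH = "/var/tmp/chaos_fill.bin"
CHUNK = b"\0" * (1024 * 1024)  # 1 MiB

def fill_disk(max_mib=1024):
    written = 0
    try:
        with open(FILL_PATH, "wb") as f:
            for _ in range(max_mib):
                f.write(CHUNK)
                written += 1
    except OSError:
        pass  # disk is full -- which is the point of the experiment
    return written

def cleanup():
    if os.path.exists(FILL_PATH):
        os.remove(FILL_PATH)

if __name__ == "__main__":
    print(f"wrote {fill_disk()} MiB before stopping")
    # ...observe how the service behaves, then:
    cleanup()
```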
Google’s CRE team (Customer Reliability Engineering) discusses when to post public followups and how to write them. I love their idea of investigating where they got lucky during an incident, catching cases where things could have been much worse if not for serendipity. I’m going to start using that.