SRE Weekly Issue #104

Well, that was a fun week. I hope all of you have had a chance to rest after any hectic patching you might have been involved in.


Curious about the state of on-call, but don’t have a ton of time to do the research? VictorOps has gathered the most important stats in one place for you to skim.


Local Rationale: the reasoning and context behind a decision that an operator made. Here’s Todd Conklin reminding us to find out what was really going on when the benefit of hindsight makes a decision seem irrational.

In part two of the series I linked to last week, Tyler Treat introduces data replication strategies, including waiting for every replica to acknowledge a write before returning versus waiting for just a quorum.
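
To make the all-replicas versus quorum trade-off concrete, here is a minimal sketch of my own, not code from Tyler's article. It assumes "quorum" means a simple majority of replicas; the function names `required_acks` and `write_committed` are hypothetical and exist only for illustration.

```python
def required_acks(num_replicas: int, strategy: str) -> int:
    """Return how many replica acknowledgements a write needs before it is
    reported as committed. Assumes a quorum is a simple majority."""
    if num_replicas < 1:
        raise ValueError("need at least one replica")
    if strategy == "all":
        # Every replica must acknowledge: strongest durability, but the
        # slowest replica sets the write latency.
        return num_replicas
    if strategy == "quorum":
        # A majority must acknowledge: tolerates a minority of slow or
        # failed replicas.
        return num_replicas // 2 + 1
    raise ValueError(f"unknown strategy: {strategy!r}")


def write_committed(acks_received: int, num_replicas: int, strategy: str) -> bool:
    """True once enough acknowledgements have arrived for the strategy."""
    return acks_received >= required_acks(num_replicas, strategy)
```

With five replicas and three acknowledgements in hand, `write_committed(3, 5, "quorum")` is true while `write_committed(3, 5, "all")` is not, which is the latency-versus-durability tension in a nutshell.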

Here’s something I wasn’t aware of: hospitals have their own version of the ICS.

In this blog post, we discuss why you should create a policy on how SREs and devs respond to SLO violations, and provide some ideas for the structure and components of that policy.

Now this is neat. This research team pings basically the entire internet all the time and can track outages across the globe. They can see things like Egypt shutting down Internet access for all of its citizens and the effects of hurricanes.

This is a summary of a couple of talks from Influx Days. I especially like the bit about Baron Schwartz’s talk on the pitfalls of anomaly detection.

Meltdown is especially scary because the fix has the potential to significantly impact performance.


Updated: January 7, 2018 — 9:41 pm
A production of Tinker Tinker Tinker, LLC