SRE Weekly Issue #85

SPONSOR MESSAGE

Being on-call sucks – but is it getting better? See what 800+ professionals have to say about being on-call in VictorOps’ annual “State of On-Call” report.
http://try.victorops.com/StateofOnCall/SREWeekly

Articles

Here’s Charity Majors with another gem about how ops looks in the era of distributed systems.

You simply can’t develop quality software for distributed systems without constant attention to its operability, maintainability, and debuggability.

I hope most of you have been reading up on the infamous “Googler manifesto”, and if so, maybe you’ve already seen this article. What caught my eye is the emphasis on people-oriented engineering, because these are the skills that have become increasingly important to me as an SRE.

A key metric goes through the roof and pages you. Why? Answering that can be really easy if you can quickly see the changes deployed to your system around the same time. This article is about a specific product that solves this problem and is thus a bit advertisey, but it’s still a good read.

Here’s a good argument for anomaly detection. Great, but I have yet to see anomaly detection that I trust! That said, this was still an interesting read due to the real-world story about a glitch Wal-Mart faced.

For the Java crowd, here’s a primer on Resilience4j, a framework that makes it easier to write code that can recover from errors.

I like the description of their pager rotation, “The Watch”, in which developers periodically serve.

Grab engineers talk about migrating from Redis to ElastiCache veeeery carefully.

In a nutshell, we planned to switch the datasource for the 20k QPS system, without any user experience impact, while in a live running mode.

Outages

  • Paragon (game)
    • Epic Games released version 42 of Paragon, and the new version unexpectedly overloaded their servers. To get back to a good state, they were forced into developing novel code and upgrading a DB on the fly.
  • FedEx
  • SYNQ
    • As mentioned here previously, SYNQ has committed to posting their incident RCAs publicly. In this one, they identified a need for better regression testing.
Updated: August 13, 2017 — 10:47 pm
A production of Tinker Tinker Tinker, LLC