SRE Weekly Issue #85

Articles

Here’s Charity Majors with another gem about how ops looks in the era of distributed systems.

You simply can’t develop quality software for distributed systems without constant attention to its operability, maintainability, and debuggability.

So, about this Googler’s manifesto.

I hope most of you have been reading up on the infamous “Googler manifesto”, and if so, maybe you’ve already seen this article. What caught my eye is the emphasis on people-oriented engineering, because these are the skills that have become increasingly important to me as an SRE.

Beyond Observing Behavior

A key metric goes through the roof and pages you. Why? Answering that can be really easy if you can quickly see the changes deployed to your system around the same time. This article is about a specific product that solves this problem and is thus a bit advertisey, but it’s still a good read.
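
As a rough illustration of that idea (not the product's actual approach), here is a minimal Python sketch that lists the deploys landing within a time window around an alert, closest first. The function name `deploys_near`, the 30-minute window, and the sample service names and timestamps are all hypothetical, made up for this example.

```python
from datetime import datetime, timedelta

def deploys_near(alert_time, deploys, window=timedelta(minutes=30)):
    """Return (timestamp, name) deploys within `window` of the alert, closest first."""
    nearby = [d for d in deploys if abs(d[0] - alert_time) <= window]
    return sorted(nearby, key=lambda d: abs(d[0] - alert_time))

# Hypothetical deploy log: (timestamp, description) pairs.
deploys = [
    (datetime(2017, 8, 13, 9, 0), "billing-service v101"),
    (datetime(2017, 8, 13, 13, 50), "checkout-service v57"),
    (datetime(2017, 8, 13, 14, 5), "search-service v12"),
]

# Hypothetical time at which the key metric paged someone.
alert = datetime(2017, 8, 13, 14, 0)

for ts, name in deploys_near(alert, deploys):
    print(ts.isoformat(), name)
```

Looking on both sides of the alert is a deliberate choice in this sketch: alert timestamps often lag or lead the change that caused them, so a deploy recorded slightly after the page can still be the culprit.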

Glitches are inevitable, and consumers know it: how anomaly detection can safely corral a glitch stampede

Here’s a good argument for anomaly detection. Great, but I have yet to see anomaly detection that I trust! That said, this was still an interesting read due to the real-world story about a glitch Wal-Mart faced.

Achieving Fault Tolerance With Resilience4j

For the Java crowd, here’s a primer on Resilience4j, a framework that makes it easier to write code that can recover from errors.

Beyond Google SRE: What is Site Reliability Engineering like at Medium?

I like the description of their “The Watch” pager rotation in which developers periodically serve.

Migrating Existing Datastores

Grab engineers talk about migrating from Redis to ElastiCache veeeery carefully.

In a nutshell, we planned to switch the datasource for the 20k QPS system, without any user experience impact, while in a live running mode.

Outages

Paragon (game)
- Epic Games released version 42 of Paragon, and the new version unexpectedly overloaded their servers. To get back to a good state, they were forced into developing novel code and upgrading a DB on the fly.
FedEx
SYNQ
- As mentioned here previously, SYNQ has committed to posting their incident RCAs publicly. In this one, they identified a need for better regression testing.
