SRE Weekly Issue #188

A message from our sponsor, VictorOps:

[Free Webinar] Last chance to register for this week’s live webinar – How to Succeed in Machine Learning Without Really Trying. See how IT and engineering leaders are implementing ML to build more robust systems and improve on-call incident response.


Two of the hardest problems of testing in production are curtailing blast radius and dealing with state. In this post, I aim to explore the topic of curtailing blast radius in more detail.

Cindy Sridharan

This team was getting paged constantly to fix failed Kafka nodes, and their outlook for the future was even worse. They responded by developing an auto-remediation system.

Andrey Falko — Lyft

As last week’s Boeing-related article explained, Boeing and Airbus have significantly different philosophies regarding the role of pilots vs aircraft in aviation safety. This new NTSB report strikes at the heart of that dichotomy.

Alex Davies – Wired

This is an especially interesting read because the team in question was a network operations team, and the members largely had no software engineering experience. Part of the transformation involved essentially training them for a new career.

Tom Wright — Google

My favorite part is the explanation of why observability is critical in microservice architectures.

The system is no longer in one of two states but more like one of n-factorial states.

Tyler Treat

Given that Lambda et al. auto-scale, is caching still relevant? This article explains why it is.

Yan Cui


  • GitHub
    • Repository forking operations were delayed.
  • Slack
    • Some customers are seeing an error code (“1AE32E16D91F”) when connecting to Slack.

      Now I really want to know what 1AE32E16D91F is…

  • Twitter
Updated: October 6, 2019 — 9:26 pm
A production of Tinker Tinker Tinker, LLC