SRE Weekly Issue #188

Articles

Two of the hardest problems of testing in production are curtailing blast radius and dealing with state. In this post, I aim to explore the topic of curtailing blast radius in more detail.

Cindy Sridharan

Operating Apache Kafka Clusters 24/7 Without A Global Ops Team

This team was getting paged constantly to fix failed Kafka nodes, and their outlook for the future was looking even worse. They responded by developing an auto-remediation system.

Andrey Falko – Lyft

Feds Say Boeing 737s Need to Be Better Designed for Humans

As last week’s Boeing-related article explained, Boeing and Airbus have significantly different philosophies regarding the role of pilots vs aircraft in aviation safety. This new NTSB report strikes at the heart of that dichotomy.

Alex Davies – Wired

Transitioning a typical engineering ops team into an SRE powerhouse

This is an especially interesting read because the team in question was a network operations team, and the members largely had no software engineering experience. Part of the transformation involved essentially training them for a new career.

Tom Wright – Google

Microservice Observability, Part 1: Disambiguating Observability and Monitoring

My favorite part is the explanation of why observability is critical in microservice architectures.

The system is no longer in one of two states but more like one of n-factorial states.

Tyler Treat

All you need to know about caching for serverless applications

Given that Lambda et al. auto-scale, is caching still relevant? This article explains why it is.

Yan Cui

Outages

GitHub
- Repository forking operations were delayed.
Statuspage.io
Slack
- Some customers are seeing an error code (“1AE32E16D91F”) when connecting to Slack.
  
  Now I really want to know what 1AE32E16D91F is…
Twitter
