SRE Weekly Issue #132

Articles

How to Lead a Disaster Recovery Exercise For Your On-Call Team

In this blog post I will show you what a disaster recovery exercise is, how it can diagnose weak points in your infrastructure, and how it can be a learning experience for your on-call team.

Alexandra Johnson — SigOpt

Exploring Spring Boot resiliency on AWS EKS

This article showcases the Chaos Toolkit experiments these folks wrote to test their system’s resiliency.

Sylvain Hellegouarc — chaosiq

Location-Aware Distribution: Configuring servers at scale

With millions of servers and thousands of configuration changes per day, distribution of configuration information becomes a huge scaling challenge. Here’s some insight (and pretty architecture diagrams) explaining how Facebook does it.

Ali Haider Zaveri — Facebook [NOTE: originally miscredited, sorry!]

Introducing Liftbridge: Lightweight, Fault-Tolerant Message Streams

Liftbridge is a system for lightweight, fault-tolerant (LIFT) message streams built on NATS and gRPC. Fundamentally, it extends NATS with a Kafka-like publish-subscribe log API that is highly available and horizontally scalable.

Tyler Treat

Transparent SLIs: See Google Cloud the way your application experiences it

This pretty neat: Google Cloud Platform now exposes their SLIs directly to you, as they pertain to the requests you make of the platform. For example, if a given API call has increased latency, you’ll see it on their graph. This can be great for those “is it us or is it them?” incidents.

Jay Judkowitz — Google

Safety Moment – Predicting the FUTURE!!!!

What can I do to make sure that, when this system fails, it fails as effectively as possible?

Todd Conklin — Pre-Accident Podcast

Google’s New Book: The Site Reliability Workbook

Here’s a review of Google’s new SRE book. I’m a little miffed that now I have to say that, instead of just “Google’s SRE book” or just “the SRE book”. Ah well. This one appears to be more about practical use cases than theory.

Todd Hoff — High Scalability

Great GameDays: Thinking About Failure Holistically | LaunchDarkly Blog

Chaos engineering isn’t just for SREs.

everyone benefits from observing a failure. Even UI engineers, people from a UX background, product managers.

Patrick Higgins — Gremlin

Outages

MoviePass
- Interestingly, the company reported in their SEC filing that the outage was the result of their running out of cash and being unable to pay vendors.
BBC website

SRE Weekly Issue #132

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

SPONSOR MESSAGE

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues