Articles
In this blog post I will show you what a disaster recovery exercise is, how it can diagnose weak points in your infrastructure, and how it can be a learning experience for your on-call team.
Alexandra Johnson — SigOpt
This article showcases the Chaos Toolkit experiments these folks wrote to test their system’s resiliency.
Sylvain Hellegouarc — chaosiq
With millions of servers and thousands of configuration changes per day, distribution of configuration information becomes a huge scaling challenge. Here’s some insight (and pretty architecture diagrams) explaining how Facebook does it.
Ali Haider Zaveri — Facebook [NOTE: originally miscredited, sorry!]
Liftbridge is a system for lightweight, fault-tolerant (LIFT) message streams built on NATS and gRPC. Fundamentally, it extends NATS with a Kafka-like publish-subscribe log API that is highly available and horizontally scalable.
Tyler Treat
This is pretty neat: Google Cloud Platform now exposes their SLIs directly to you, as they pertain to the requests you make of the platform. For example, if a given API call has increased latency, you’ll see it on their graph. This can be great for those “is it us or is it them?” incidents.
Jay Judkowitz — Google
What can I do to make sure that, when this system fails, it fails as effectively as possible?
Todd Conklin — Pre-Accident Podcast
Here’s a review of Google’s new SRE book. I’m a little miffed that now I have to say that, instead of just “Google’s SRE book” or just “the SRE book”. Ah well. This one appears to be more about practical use cases than theory.
Todd Hoff — High Scalability
Chaos engineering isn’t just for SREs.
Everyone benefits from observing a failure. Even UI engineers, people from a UX background, product managers.
Patrick Higgins — Gremlin
Outages
- MoviePass
  - Interestingly, the company reported in their SEC filing that the outage was the result of their running out of cash and being unable to pay vendors.
- BBC website