SRE Weekly Issue #132

SPONSOR MESSAGE

Build reliability and optimize application performance for your complete infrastructure with effective monitoring. See how we used metrics to uncover issues in our own mobile application’s performance:

http://try.victorops.com/sreweekly/mobile-monitoring-sre

Articles

In this blog post I will show you what a disaster recovery exercise is, how it can diagnose weak points in your infrastructure, and how it can be a learning experience for your on-call team.

Alexandra Johnson — SigOpt

This article showcases the Chaos Toolkit experiments these folks wrote to test their system’s resiliency.

Sylvain Hellegouarc — chaosiq

With millions of servers and thousands of configuration changes per day, distribution of configuration information becomes a huge scaling challenge. Here’s some insight (and pretty architecture diagrams) explaining how Facebook does it.

Ali Haider Zaveri — Facebook [NOTE: originally miscredited, sorry!]

Liftbridge is a system for lightweight, fault-tolerant (LIFT) message streams built on NATS and gRPC. Fundamentally, it extends NATS with a Kafka-like publish-subscribe log API that is highly available and horizontally scalable.

Tyler Treat

This is pretty neat: Google Cloud Platform now exposes their SLIs directly to you, as they pertain to the requests you make of the platform. For example, if a given API call has increased latency, you’ll see it on their graph. This can be great for those “is it us or is it them?” incidents.

Jay Judkowitz — Google

What can I do to make sure that, when this system fails, it fails as effectively as possible?

Todd Conklin — Pre-Accident Podcast

Here’s a review of Google’s new SRE book. I’m a little miffed that now I have to say that, instead of just “Google’s SRE book” or just “the SRE book”. Ah well. This one appears to be more about practical use cases than theory.

Todd Hoff — High Scalability

Chaos engineering isn’t just for SREs.

“everyone benefits from observing a failure. Even UI engineers, people from a UX background, product managers.”

Patrick Higgins — Gremlin

Outages

  • MoviePass
    • Interestingly, the company reported in their SEC filing that the outage was the result of their running out of cash and being unable to pay vendors.
  • BBC website
Updated: July 29, 2018 — 8:52 pm
A production of Tinker Tinker Tinker, LLC