SRE Weekly Issue #136

Articles

The 18 ghosts in your infrastructure stack that can cause failure (and how to avoid them)

This infographic shows how Ably’s client library and backend infrastructure are designed to work around many common failure modes. My favorite: they have redundant TLS certificates from distinct issuers.

Matthew O’Riordan — Ably

QA Instability Implies Production Instability

This article argues that spending a little time to fix staging can make production significantly more stable.

Michael Nygard

Through a Dashboard Darkly

This is a story of a flawed development process on top of a flawed infrastructure, without the necessary data to drive decision-making. It’s also a story of waking up to these problems and charting a way out.

[…]

As it turns out, pure reasoning cannot solve the kind of problems you see in the production environment of a complex application. These problems are almost always more difficult, since they have survived all of the testing you could throw at them.

John Casey

Simple/hard metrics that help reduce MTTR when looking for a root cause

A story of a somewhat rare failure case (a datacenter heat buildup event) and how to monitor for such a thing without contributing to metrics overload.

Pavel Trukhanov — okmeter

Shipping Software Should Not Be Scary

On twitter this week, @srhtcn noted that “Many incidents happen during or right after release” and asked for advice on ways to fix this.

Great advice, useful for managers and individual contributors.

Charity Majors

Outages

Apple CloudKit
- There appears to be some prolonged issues with Apple’s CloudKit service today, which Apple offers to developers as a way to store user data and sync across devices. Several developers have reported to us that they have seen data for their apps temporarily wiped in the last 24 hours as the CloudKit service experiences some form of outage.
Heroku
Commonwealth Bank (AU)
Coles (Supermarket chain)
Sydney, AU train system
reddit
- And another one.
