SRE Weekly Issue #180

Articles

This reads like a mini list of war stories from a grizzled veteran reliability engineer… because that’s exactly what it is. Don’t forget to click the link at the bottom for the follow-up post!

rachelbythebay

Top Seven Myths of Robust Systems

The myths:

Add Redundancy

Simplify

Avoid Risk

Enforce Procedures

Defend against Prior Root Causes

Document Best Practices and Runbooks

Remove the People Who Cause Accidents

If that doesn’t make you want to read this, I don’t know what will.

Casey Rosenthal — Verica

Treading in Haunted Graveyards

The graveyard that no one dared tread in was the Terraform code. Once they got CI/CD set up, deploys became much easier — and less scary.

Liz Fong-Jones — Honeycomb

Resilience Roundup – Illusions of explanation: A critical essay on error classification

My favorite idea in this article is that the absence of “errors” is not the same thing as safety.

Thai Wood (summary)

Sidney Dekker (original paper)

Increasing resilience in Kubernetes

High availability and resilience are key features of Kubernetes. But what do you do when your Kubernetes cluster starts to become unstable and it looks like your ship is starting to sink?

Tim Little — Kudos

Outages

We had issues with Monzo on 29th July. Here’s what happened, and what we did to fix it.
- At this point, we’ve confirmed that something we thought was impossible, had in fact happened.
  
  I know the feeling, folks.
Heroku Incident #1819 follow-up
- Heroku’s API service degraded when its external error logging provider suffered an outage.
Slack
Halifax and Lloyds (bank)
Facebook, Instagram, and WhatsApp
Google search indexing
British Airways
