SRE Weekly Issue #144

Articles

GitLab is incredibly open with their policies, and incident management is no exception.

GitLab

Ooh, new newsletter! This one focuses specifically on resiliency. It seems to have just a few articles each week with in-depth summaries.

Thai Wood

Mo’ developers, mo’ problems: How serverless has trouble with teams

This article starts with a fictitious(?) account of the kind of failure that can occur when teams step on each other’s toes in a serverless environment. It goes on to discuss techniques for dealing with this class of problems, including careful permission management.

Toby Fee — jaxenter

REPT: reverse debugging of failures in deployed software

Sometimes fixing a rarely-occurring bug can be especially difficult. Recording enough information all the time to debug those rare failures would be too resource-intensive. Check out this fascinating technique for working backward from a memory dump to infer the prior contents of memory in the time leading up to a failure.

Adrian Colyer — The Morning Paper (summary)
Cui et al. (original paper)

To the brave new world of reactive systems and back

An introduction to the concept of reactive systems, including a description of their high-level architectural features.

Sinkevich Uladzimir — The Server Side

Why do things go right?

Initially, you can improve reliability by studying incidents to find out what went wrong. This article explains why that strategy will only get you so far.

Thanks to Thomas Depierre for this one.
Sidney Dekker — Safety Differently

Chaos Monkey Guide for Engineers – Tips, Tutorials, and Training

Gremlin released this huge guide on Chaos Monkey, covering theory, practice, further reading, and lots of other resources.

Gremlin, Inc.

Outages

YouTube
- YouTube had a major outage this past week, and a popular adult site saw a simultaneous uptick in traffic.
Fastly
- And this one too. Full disclosure: Fastly is my employer.
Amazon S3
Amazon.com
Amazon Prime Music
HSBC (Bank)
Yale (smart home products)
- Home security company Yale has denied that a server outage caused anyone to be locked out of their house, after an app used to remotely set and turn off one of its smart alarm products went down late last week.
  
  Check out that first tweet in the article.
Instacart
