SRE Weekly Issue #144

A message from our sponsor, VictorOps:

Customers expect reliability, even in today’s era of CI/CD and Agile software development. That’s why SRE is more important than ever. Learn about the importance of getting buy-in from your entire team when taking on SRE:

http://try.victorops.com/sreweekly/organizational-sre-support

Articles

GitLab is incredibly open with their policies, and incident management is no exception.

GitLab

Ooh, new newsletter! This one focuses specifically on resiliency. It seems to have just a few articles each week with in-depth summaries.

Thai Wood

This article starts with a fictitious(?) account of the kind of failure that can occur when teams step on each other’s toes in a serverless environment. It goes on to discuss techniques for dealing with this class of problems, including careful permission management.

Toby Fee — jaxenter

Fixing a rarely occurring bug can be especially difficult, because recording enough information all the time to debug those rare failures would be too resource-intensive. Check out this fascinating technique for working backward from a memory dump to infer the prior contents of memory in the time leading up to a failure.

Adrian Colyer — The Morning Paper (summary)
Cui et al. (original paper)

An introduction to the concept of reactive systems, including a description of their high-level architectural features.

Sinkevich Uladzimir — The Server Side

Initially, you can improve reliability by studying incidents to find out what went wrong. This article explains why that strategy will only get you so far.

Thanks to Thomas Depierre for this one.
Sidney Dekker — Safety Differently

Chaos Monkey Guide for Engineers – Tips, Tutorials, and Training

Gremlin released this huge guide on Chaos Monkey, covering theory, practice, further reading, and lots of other resources.

Gremlin, Inc.

Outages

Updated: October 21, 2018 — 8:43 pm
A production of Tinker Tinker Tinker, LLC