GitLab is incredibly open with their policies, and incident management is no exception.
Ooh, new newsletter! This one focuses specifically on resiliency. It seems to have just a few articles each week with in-depth summaries.
This article starts with a fictitious(?) account of the kind of failure that can occur when teams step on each other’s toes in a serverless environment. It goes on to discuss techniques for dealing with this class of problems, including careful permission management.
Toby Fee — jaxenter
Sometimes fixing a rarely occurring bug can be especially difficult. Recording enough information all the time to debug those rare failures would be too resource-intensive. Check out this fascinating technique for working backward from a memory dump to infer the prior contents of memory in the time leading up to a failure.
Adrian Colyer — The Morning Paper (summary)
Cui et al. (original paper)
An introduction to the concept of reactive systems including a description of the high-level architectural features.
Sinkevich Uladzimir — The Server Side
Initially, you can improve reliability by studying incidents to find out what went wrong. This article explains why that strategy will only get you so far.
Thanks to Thomas Depierre for this one.
Sidney Dekker — Safety Differently
Gremlin released this huge guide on Chaos Monkey, covering theory, practice, further reading, and lots of other resources.
- YouTube had a major outage this past week, and a popular adult site saw a simultaneous uptick in traffic.
- And this one too. Full disclosure: Fastly is my employer.
- Amazon S3
- Amazon Prime Music
- HSBC (Bank)
- Yale (smart home products)
Home security company Yale has denied that a server outage caused anyone to be locked out of their house, after an app used to remotely set and turn off one of its smart alarm products went down late last week.
Check out that first tweet in the article.