A nice juicy post-incident report from the archives. Remember the first time you took down production?
Mads Hartmann — Glitch
During testing, a new power transmission link was accidentally overloaded by a factor of ~14x, with far-reaching but ultimately well-managed effects.
Thanks to Jesper Lundkvist for this one.
As Facebook moved from a static to an auto-scaled web pool, they had to try to predict their expected demand as accurately as possible.
Daniel Boeve, Kiryong Ha, and Anca Agape — Facebook
The key lesson: make sure your migrations do not depend on parts of the production code, since later changes to that code could inadvertently change what the migrations do.
Frank Lin — Octopus Deploy
Cloudflare uses an interesting multi-layered approach to mitigating attacks.
Omer Yoachimik — Cloudflare
The availability/reliability distinction in this article is thought-provoking.
Emily Arnott — Blameless
2020 has shown the value of adaptive capacity. 2021 will show whether or not adaptive capacity can be sustained.
This article (not a video or podcast despite the name) also focuses on the increasing importance of learning from incidents.
Dr. Richard Cook — Adaptive Capacity Labs
What is resilience engineering? What does a resilience engineer do? Are there principles of resilience engineering? If so, what are they? What makes it possible to engineer resilience?
This academic paper uses a case study to show how a company engineered the resilience of their system in response to a series of incidents.
Richard I. Cook and Beth Adele Long — Applied Ergonomics
- Google Drive
- This is a post-analysis for two outages, one from this past week and the other from the week before.
Postmortem regarding the Network Incident from September 15, 2020 on IAAS and PAAS FR-SD3, FR-SD5, and FR-SD6
A layer 2 network loop was accidentally introduced, on two separate occasions.
Sébastien Dupas — Gandi
- This was an outage on Sept. 14 in the UK South region. A cooling system was shut off in error during a maintenance procedure.