SRE Weekly Issue #236

A message from our sponsor, StackHawk:

Add application security checks with GitHub Actions. Check out the video to see how.
https://www.stackhawk.com/blog/application-security-with-github-actions?utm_source=SREWeekly

Articles

A nice juicy post-incident report from the archives. Remember the first time you took down production?

Mads Hartmann — Glitch

A new power transmission link was accidentally overloaded by a factor of ~14 during testing, with far-reaching but ultimately well-managed effects.

Thanks to Jesper Lundkvist for this one.

As Facebook moved from a static to an auto-scaled web pool, they had to try to predict their expected demand as accurately as possible.

Daniel Boeve, Kiryong Ha, and Anca Agape — Facebook

The key lesson: make sure your migrations don't depend on parts of the production code, since later changes to that code could inadvertently change what the migrations do.

Frank Lin — Octopus Deploy

Cloudflare uses an interesting multi-layered approach to mitigating attacks.

Omer Yoachimik — Cloudflare

The availability/reliability distinction in this article is thought-provoking.

Emily Arnott — Blameless

2020 has shown the value of adaptive capacity. 2021 will show whether or not adaptive capacity can be sustained.

This article (not a video or podcast despite the name) also focuses on the increasing importance of learning from incidents.

Dr. Richard Cook — Adaptive Capacity Labs

What is resilience engineering? What does a resilience engineer do? Are there principles of resilience engineering? If so, what are they? What makes it possible to engineer resilience?

This academic paper uses a case study to show how a company engineered the resilience of their system in response to a series of incidents.

Richard I. Cook and Beth Adele Long — Applied Ergonomics

Outages

  • Google Drive
    • This is a post-analysis for two outages, one from this past week and the other from the week before.
  • Instagram
  • Facebook
  • Discord
  • Fastly
  • Gandi
    • Postmortem regarding the Network Incident from September 15, 2020 on IAAS and PAAS FR-SD3, FR-SD5, and FR-SD6

      A layer 2 network loop was accidentally introduced, on two separate occasions.

      Sébastien Dupas — Gandi

  • Azure
    • This was an outage on Sept. 14 in the UK South region. A cooling system was shut off in error during a maintenance procedure.
Updated: September 20, 2020 — 9:20 pm
A production of Tinker Tinker Tinker, LLC