SRE Weekly Issue #240

A message from our sponsor, StackHawk:

Be sure to register for SnykCon to learn about the latest DevSecOps trends. And while you are there, check out the StackHawk booth for our Nintendo Switch giveaway.
http://bit.ly/SnykConStackHawk

Articles

This interesting post-incident analysis is marked as “Google Customer Confidential – Not for publication or distribution”, but Google linked it directly from their public status page. I normally would not include a seemingly “leaked” incident report like this, but in this case I think the “confidential” label is erroneous.

Google

I keep re-learning and re-forgetting about TCP_NODELAY.

Rachel By the Bay
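
As a quick refresher (my own sketch, not from the article): TCP_NODELAY disables Nagle's algorithm, so small writes go out immediately instead of sitting in the kernel buffer waiting for outstanding ACKs. In Python it's a one-line socket option:

    import socket

    # Hypothetical example: open a TCP connection and turn off Nagle's
    # algorithm so small writes aren't held back waiting for ACKs.
    sock = socket.create_connection(("example.com", 80))
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    # Without TCP_NODELAY, a burst of small sends like this may be delayed
    # and coalesced; with it, each write is pushed to the wire right away.
    sock.sendall(b"HEAD / HTTP/1.1\r\nHost: example.com\r\n\r\n")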

The distinction between the two is a lot more nuanced than it may seem. What are we really trying to say with those words?

Michael Nygard

This incident from the week before last involved a Let’s Encrypt API rate limit.

Don’t you hate when you’re minding your own business upgrading your OS, and you run smack into a kernel bug in the ext4fs code?

…ext4 performance on kernel versions above 4.5 and below 5.6 suffers severely in the presence of concurrent sequential I/O on rotating disks.

Ryan Underwood — LinkedIn

Google discusses DDoS attacks and how they deal with them, including a 2.5Tbps attack in 2017.

Damian Menscher — Google

I love these first-hand incident stories. This one is from an engineer at Heroku whose actions were a contributing factor in an incident last month.

Damien Mathieu — Heroku (Salesforce)

Outages

SRE Weekly Issue #239

A message from our sponsor, StackHawk:

CI/CD has changed software engineering. Application security, however, has been left behind. Why doesn’t your CI pipeline have AppSec checks?
https://www.stackhawk.com/blog/ci-pipeline-security-bug-testing?utm_source=SREWeekly

Articles

Don’t scale up further than you need to! If you won’t ever see more than 100 RPS, don’t architect for 100,000.

Ayende Rahien

This one covers several common patterns of SRE practice and then offers insight on what to look for as you design your own SRE team.

Rob Cummings — Slalom Build

Abstractions make us more productive, and, indeed, we humans can’t build complex systems without them. But we need to be able to peel away the abstraction layers when things go wrong, so we can discover the implicit precondition that’s been violated.

Lorin Hochstein

Coordination between nodes in a distributed system can kill performance. What kinds of problems require coordination? The CALM theorem can tell us.

Joseph M. Hellerstein and Peter Alvaro — Communications of the ACM
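
A rough illustration of the intuition (my own sketch, not from the paper): a monotone operation, like growing a set, only ever adds information, so replicas can apply updates in any order and merge by union, with no coordination needed. A non-monotone question, like whether an item is definitely absent, can be invalidated by a later update, and that is where coordination creeps in.

    # Hypothetical sketch of a coordination-free, monotone structure:
    # a grow-only set (G-Set). Merging is set union, which is commutative,
    # associative, and idempotent, so replicas converge regardless of the
    # order in which updates arrive.
    class GSet:
        def __init__(self):
            self.items = set()

        def add(self, item):          # monotone: only ever adds information
            self.items.add(item)

        def merge(self, other):       # order-insensitive merge
            self.items |= other.items

        def __contains__(self, item):
            return item in self.items

    replica_a, replica_b = GSet(), GSet()
    replica_a.add("x")
    replica_b.add("y")
    replica_a.merge(replica_b)        # replica_a now holds {"x", "y"};
    replica_b.merge(replica_a)        # merging back converges replica_b too

    # By contrast, answering "is 'z' definitely NOT in the set?" is not
    # monotone: a later add() can change the answer, so a consistent "no"
    # requires coordination between replicas.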

Here’s another good post-incident analysis document template that you can use as inspiration for your own.

Hannah Culver — Blameless

As your product ages, it transitions from “cool new thing” to “tool everyone uses and expects to Just Work”. Your reliability needs will change accordingly.

Lyon Wong — Blameless

Outages

SRE Weekly Issue #238

My daughters asked earlier today what I do at work, and I explained all about SRE, reliability, and the importance of work-life balance.  They said to tell you they say hi!

A message from our sponsor, StackHawk:

CI/CD has changed software engineering. Application security, however, has been left behind. Why doesn’t your CI pipeline have AppSec checks?
https://www.stackhawk.com/blog/ci-pipeline-security-bug-testing?utm_source=SREWeekly

Articles

Lots of really great advice in here. And really, with a title like that, I couldn’t resist reading it!

Charity Majors

Last week, I mentioned a Google Cloud Platform outage that affected multiple services. Here’s the detailed post-analysis by Google.

Google

This one is along the lines of the classic Ironies of Automation paper by Bainbridge. How can automation be a team player, and what happens when it isn’t?

Nadine Sarter and David Woods (original paper)

Thai Wood — Resilience Roundup (summary)

How can you use chaos engineering when failures in the system can be critical and even life-threatening?

Carl Chesser — InfoQ

In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.

Emily Arnot — Blameless

This article suggests using chaos engineering to tell if your microservice-based architecture is secretly a monolith in disguise.

Andre Newman — Gremlin

Outages

SRE Weekly Issue #237

A message from our sponsor, StackHawk:

CI/CD has changed software engineering. Application security, however, has been left behind. Why doesn’t your CI pipeline have AppSec checks?
https://www.stackhawk.com/blog/ci-pipeline-security-bug-testing?utm_source=SREWeekly

Articles

They fully expected their deep-discount sale to drive traffic, but they didn’t expect their system to handle the increase in the way that it did.

Michał Kosmulski — Allegro

Pre-stop hooks, liveness probes, and readiness probes were key to smoothly transitioning their services from a home-grown container system to Kubernetes.

Oliver Leaver-Smith — Sky Betting & Gaming
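
For context (a minimal sketch of the application side, assuming HTTP probes; not from the article): the liveness probe asks “is this process healthy enough to keep running?”, the readiness probe asks “should traffic be routed here right now?”, and a pre-stop hook (or SIGTERM handling) gives the service a chance to drain before it’s killed.

    import signal
    from http.server import BaseHTTPRequestHandler, HTTPServer

    ready = True  # flipped to False while draining so the readiness probe fails

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz":           # liveness: process is alive
                self.send_response(200)
            elif self.path == "/readyz":          # readiness: OK to receive traffic
                self.send_response(200 if ready else 503)
            else:
                self.send_response(404)
            self.end_headers()

    def handle_sigterm(signum, frame):
        # Roughly what a pre-stop hook buys you: stop advertising readiness,
        # then let in-flight work finish before the process exits.
        global ready
        ready = False

    signal.signal(signal.SIGTERM, handle_sigterm)
    HTTPServer(("", 8080), ProbeHandler).serve_forever()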

The experience of responding to an incident can evoke emotions that run the gamut.

Mads Hartmann

Google has released course materials for the first in a series of classes on NALSD (“non-abstract large systems design”). This first one is about a distributed Pub-Sub system.

Jenny Liao and Salim Virji — Google

Usually, doing a post-analysis on an incident you were in is an anti-pattern because you’re likely to introduce bias. But sometimes, it can lead you to learn more than you would have otherwise.

Lorin Hochstein

Outages

SRE Weekly Issue #236

A message from our sponsor, StackHawk:

Add application security checks with GitHub actions. Check out the video on how.
https://www.stackhawk.com/blog/application-security-with-github-actions?utm_source=SREWeekly

Articles

A nice juicy post-incident report from the archives. Remember the first time you took down production?

Mads Hartmann — Glitch

While a new power transmission link was being tested, it was accidentally overloaded by a factor of ~14x, with far-reaching but ultimately well-managed effects.

Thanks to Jesper Lundkvist for this one.

As Facebook moved from a static to an auto-scaled web pool, they had to try to predict their expected demand as accurately as possible.

Daniel Boeve, Kiryong Ha, and Anca Agape — Facebook

The key lesson involves ensuring that your migrations avoid depending on production code, which could inadvertently change what the migration does down the line.

Frank Lin — Octopus Deploy
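
A hypothetical sketch of the trap (my own example, not from the article): if a migration imports logic from the live codebase, later changes to that logic silently change what the migration does when it is run against a fresh environment. Freezing a copy of the logic inside the migration keeps it stable:

    # Anti-pattern: the migration borrows logic from production code.
    # If normalize_email() changes next year, this migration no longer
    # does what it did when it originally shipped.
    # from myapp.users import normalize_email

    # Safer: freeze a snapshot of the logic inside the migration itself.
    def normalize_email_as_of_this_migration(email: str) -> str:
        return email.strip().lower()

    def migrate(rows):
        for row in rows:
            row["email"] = normalize_email_as_of_this_migration(row["email"])
        return rows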

Cloudflare uses an interesting multi-layered approach to mitigating attacks.

Omer Yoachimik — Cloudflare

The availability/reliability distinction in this article is thought-provoking.

Emily Arnott — Blameless

2020 has shown the value of adaptive capacity. 2021 will show whether or not adaptive capacity can be sustained.

This article (not a video or podcast despite the name) also focuses on the increasing importance of learning from incidents.

Dr. Richard Cook — Adaptive Capacity Labs

What is resilience engineering? What does a resilience engineer do? Are there principles of resilience engineering? If so, what are they? What makes it possible to engineer resilience?

This academic paper uses a case study to show how a company engineered the resilience of their system in response to a series of incidents.

Richard I. Cook and Beth Adele Long — Applied Ergonomics

Outages

  • Google Drive
    • This is a post-analysis for two outages, one from this past week and the other from the week before.
  • Instagram
  • Facebook
  • Discord
  • Fastly
  • Gandi
    • Postmortem regarding the Network Incident from September 15, 2020 on IAAS and PAAS FR-SD3, FR-SD5, and FR-SD6

      A layer 2 network loop was accidentally introduced, on two separate occasions.

      Sébastien Dupas — Gandi

  • Azure
    • This was an outage on Sept. 14 in the UK South region.  A cooling system was shut off in error during a maintenance procedure.
A production of Tinker Tinker Tinker, LLC