General

SRE Weekly Issue #237

A message from our sponsor, StackHawk:

CI/CD has changed software engineering. Application security, however, has been left behind. Why doesn’t your CI pipeline have AppSec checks?
https://www.stackhawk.com/blog/ci-pipeline-security-bug-testing?utm_source=SREWeekly

Articles

They fully expected their deep-discount sale to drive traffic, but they didn’t expect their system to handle the increase in the way that it did.

Michał Kosmulski — Allegro

Pre-stop hooks, liveness probes, and readiness probes were key to smoothly transitioning their services from a home-grown container system to Kubernetes.

Oliver Leaver-Smith — Sky Betting & Gaming
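
If you haven't used these Kubernetes features, here's a minimal sketch of what they look like on a container spec (expressed as a Python dict; the image, port, and endpoint paths are hypothetical, not Sky Betting & Gaming's config):

    import json

    # Hypothetical container spec showing the three mechanisms mentioned above.
    container = {
        "name": "web",
        "image": "example/web:1.0",
        "readinessProbe": {   # gate traffic until the app reports ready
            "httpGet": {"path": "/ready", "port": 8080},
            "initialDelaySeconds": 5,
            "periodSeconds": 5,
        },
        "livenessProbe": {    # restart the container if it wedges
            "httpGet": {"path": "/healthz", "port": 8080},
            "initialDelaySeconds": 15,
            "periodSeconds": 10,
        },
        "lifecycle": {        # give the app time to drain before SIGTERM
            "preStop": {"exec": {"command": ["sh", "-c", "sleep 15"]}},
        },
    }

    print(json.dumps({"spec": {"containers": [container]}}, indent=2))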

The experience of responding to an incident can evoke emotions that run the gamut.

Mads Hartmann

Google has released course materials for the first in a series of classes on NALSD (“non-abstract large systems design”). This first one is about a distributed Pub/Sub system.

Jenny Liao and Salim Virji — Google

Usually, doing a post-analysis on an incident you were in is an anti-pattern because you’re likely to introduce bias. But sometimes, it can lead you to learn more than you would have otherwise.

Lorin Hochstein

Outages

SRE Weekly Issue #236

A message from our sponsor, StackHawk:

Add application security checks with GitHub actions. Check out the video on how.
https://www.stackhawk.com/blog/application-security-with-github-actions?utm_source=SREWeekly

Articles

A nice juicy post-incident report from the archives. Remember the first time you took down production?

Mads Hartmann — Glitch

While testing a new power transmission link, it was accidentally overloaded by a factor of ~14x, with far-reaching but ultimately well-managed effects.

Thanks to Jesper Lundkvist for this one.

As Facebook moved from a static to an auto-scaled web pool, they had to try to predict their expected demand as accurately as possible.

Daniel Boeve, Kiryong Ha, and Anca Agape — Facebook

The key lesson: make sure your migrations don’t call into production code, since later changes to that code could inadvertently change what the migration does (sketched below).

Frank Lin — Octopus Deploy
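
A hedged illustration of that point in Python (the migration and field names are made up, not from the article): freeze the transformation inside the migration instead of importing it from the application.

    # Risky: importing production code means the migration's behavior changes
    # whenever that code is edited later, e.g.:
    #
    #     from myapp.users import normalize_email
    #     row["email"] = normalize_email(row["email"])

    # Safer: inline the logic as it existed when the migration was written.
    def migrate_up(rows):
        for row in rows:
            row["email"] = row["email"].strip().lower()   # frozen copy of the rule
        return rows

    if __name__ == "__main__":
        print(migrate_up([{"email": "  Alice@Example.COM "}]))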

Cloudflare uses an interesting multi-layered approach to mitigating attacks.

Omer Yoachimik — Cloudflare

The availability/reliability distinction in this article is thought-provoking.

Emily Arnott — Blameless

2020 has shown the value of adaptive capacity. 2021 will show whether or not adaptive capacity can be sustained.

This article (not a video or podcast despite the name) also focuses on the increasing importance of learning from incidents.

Dr. Richard Cook — Adaptive Capacity Labs

What is resilience engineering? What does a resilience engineer do? Are there principles of resilience engineering? If so, what are they? What makes it possible to engineer resilience?

This academic paper uses a case study to show how a company engineered the resilience of their system in response to a series of incidents.

Richard I. Cook and Beth Adele Long — Applied Ergonomics

Outages

  • Google Drive
    • This is a post-analysis for two outages, one from this past week and the other from the week before.
  • Instagram
  • Facebook
  • Discord
  • Fastly
  • Gandi
    • Postmortem regarding the Network Incident from September 15, 2020 on IAAS and PAAS FR-SD3, FR-SD5, and FR-SD6

      A layer 2 network loop was accidentally introduced, on two separate occasions.

      Sébastien Dupas — Gandi

  • Azure
    • This was an outage on Sept. 14 in the UK South region.  A cooling system was shut off in error during a maintenance procedure.

SRE Weekly Issue #235

A message from our sponsor, StackHawk:

Adding application security tests to your CI pipeline is simple. It typically takes <30 minutes to set up automated testing so you can be confident of your application’s security. Check out our onboarding guide to see how to get started.
https://www.stackhawk.com/blog/onboarding-guide?SREWeekly

Articles

This isn’t just another boring article about SLOs. There’s a ton of good stuff in here about why they moved to SLO-based alerts, too.

we’re hoping that by implementing SLOs – and alerting on them – we’ll be able to improve communication during incidents, reduce the toil on on-callers, and help improve our reliability in a way that’s meaningful to our users.

Mads Hartmann
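
If you want a feel for the arithmetic behind SLO-based alerts, here's a small Python sketch of the common multi-window burn-rate check (thresholds follow the usual SRE Workbook guidance; this is not Mads's implementation):

    SLO_TARGET = 0.999                     # e.g. 99.9% of requests succeed
    ERROR_BUDGET = 1.0 - SLO_TARGET        # fraction of requests allowed to fail

    def burn_rate(failed, total):
        """How fast the error budget is being spent, relative to plan."""
        return (failed / total) / ERROR_BUDGET if total else 0.0

    def should_page(failed_1h, total_1h, failed_5m, total_5m):
        # Require both a long and a short window to burn fast, so a brief
        # blip doesn't page anyone but a sustained burn does.
        return (burn_rate(failed_1h, total_1h) > 14.4
                and burn_rate(failed_5m, total_5m) > 14.4)

    print(should_page(failed_1h=2_000, total_1h=100_000,
                      failed_5m=200, total_5m=8_000))     # True: page someone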

Often, serendipity gets us out of an incident or makes it less severe.

Unless we treat this sort of activity as first class when looking at incidents, we won’t really understand how it can be that some incidents get resolved so quickly and some take much longer.

Lorin Hochstein

It’s your classic “replace the engines on a jet while flying it” story. My favorite part is how they recorded real traffic and played it at the old and new backend API to compare the JSON responses.

Rohan Dhruva and Ed Ballot — Netflix
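
Here's a rough sketch of that record-and-replay comparison in Python (placeholder URLs and requests; not Netflix's actual tooling):

    import json
    import requests

    OLD_API = "https://old-backend.example.com"    # hypothetical endpoints
    NEW_API = "https://new-backend.example.com"

    def replay_and_compare(recorded):
        """Send one recorded request to both backends and diff the JSON."""
        old = requests.get(OLD_API + recorded["path"], params=recorded["params"], timeout=5).json()
        new = requests.get(NEW_API + recorded["path"], params=recorded["params"], timeout=5).json()
        # Normalize key order so only real differences show up.
        if json.dumps(old, sort_keys=True) != json.dumps(new, sort_keys=True):
            print(f"MISMATCH on {recorded['path']}")
            return False
        return True

    recorded_traffic = [{"path": "/v1/profile", "params": {"id": "123"}}]
    results = [replay_and_compare(r) for r in recorded_traffic]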

Feature flags can help with load shedding and throttling, and feature flag activity can even be useful data that points to contributing factors.

Dawn Parzych — LaunchDarkly
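
To make the load-shedding idea concrete, here's a tiny Python sketch using a hypothetical in-memory flag store (deliberately not LaunchDarkly's SDK):

    FLAGS = {"shed-recommendations": False}    # hypothetical flag store

    def flag_enabled(name):
        return FLAGS.get(name, False)

    def handle_request(user_id):
        response = {"user": user_id, "items": ["order history"]}
        if not flag_enabled("shed-recommendations"):
            # Expensive personalization work, skipped while shedding load.
            response["recommendations"] = ["result of a slow ML call"]
        return response

    FLAGS["shed-recommendations"] = True       # operator flips the flag mid-incident
    print(handle_request("u42"))               # degraded but fast response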

Unimog uses a lot of really interesting techniques to balance layer 4 traffic, which this article goes into in great detail.

David Wragg — Cloudflare

I like this idea: it’s like a normal canary, except that you only send it a copy of traffic and discard the result, so as to avoid impacting users.

David Hoa — LinkedIn
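
Conceptually it looks something like this Python sketch (hypothetical hostnames; the real thing lives in the serving infrastructure rather than application code):

    import threading
    import requests

    PROD = "https://prod.example.com"
    DARK_CANARY = "https://dark-canary.example.com"
    MIRROR_EVERY = 10                            # mirror 1 in 10 requests
    _seen = 0

    def _mirror(path):
        try:
            requests.get(DARK_CANARY + path, timeout=2)   # response discarded
        except requests.RequestException:
            pass                                          # canary failures never reach users

    def handle(path):
        global _seen
        _seen += 1
        if _seen % MIRROR_EVERY == 0:
            threading.Thread(target=_mirror, args=(path,), daemon=True).start()
        return requests.get(PROD + path, timeout=5)       # only this result is returned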

Outages

SRE Weekly Issue #234

Last Sunday, there was a major backbone Internet provider outage after I finished putting SRE Weekly together.  There were so many outages that I’m not even going to bother listing all of them in the Outages section.

A message from our sponsor, StackHawk:

Everyone talks about shifting security left, but in many cases, it isn’t happening. There is a better way with developer-centric application security testing.
https://www.stackhawk.com/blog/align-engineering-security-appsec-tests-in-ci?utm_source=SREWeekly

Articles

I love the way this article portrays SRE by placing less emphasis on specific skills and more on a holistic approach to reliability.

Emily Arnott — Blameless

Incident review is an important part of the organizational learning process, but it can be practiced in a way where the focus shifts away from learning to fixing.

John Carroll (original paper)

Thai Wood — Resilience Roundup (summary)

My latest adventures in (negligently) running sreweekly.com. It started with a surprise AWS bill, and then it got kinda weird…

Lex Neva

Deep technical details on a series of recent incidents involving Basecamp.

Troy Toman — Basecamp

Here’s why eyes-on-glass constant monitoring won’t help and can be actively harmful.

Charity Majors

In August, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will dive into updates to the GitHub Status Page and provide follow-up details on how we’ve addressed the incident mentioned in July’s report.

Keith Ballinger — GitHub

Here are Cloudflare’s thoughts on what happened with Sunday’s Internet trouble.

Matthew Prince — Cloudflare

This is ThousandEyes’s analysis of the outage, which goes along similar lines to Cloudflare’s and includes a lot more detail.

Angelique Medina and Archana Kesavan — ThousandEyes

Outages

SRE Weekly Issue #233

A message from our sponsor, StackHawk:

Did you catch the GitLab Commit keynote by StackHawk Founder Joni Klippert? View on demand now to learn about how security got left behind, and what can be done to improve.
https://about.gitlab.com/events/commit/

Articles

In this post, I’ll share how we ensured that Meet’s available service capacity was ahead of its 30x COVID-19 usage growth, and how we made that growth technically and operationally sustainable by leveraging a number of site reliability engineering (SRE) best practices.

Samantha Schaevitz — Google

I love the concept of “battleshorts” just as much as I’ve been enjoying this series of articles analyzing STAMP.

Lorin Hochstein

Honeycomb had 5 incidents in just over a week, prompting not only their normal incident investigation process, but a meta-analysis of all five together.

Emily Nakashima — Honeycomb

Why is Chromium responsible for half of the DNS queries to the root nameservers? And why do they all return NXDOMAIN?

Matthew Thomas — APNIC

“That Moment” when your fire suppression system triggers and the fire department shows up. This is part war story and part description of incident response practices.

Ariel Pisetzky — Taboola

An overload in an internal blob storage system impacted many dependent services.

Google

Sharding as a service: now there’s an interesting idea.

Gerald Guo and Thawan Kooburat — Facebook

In Kubernetes Operators: Automating the Container Orchestration Platform, authors Jason Dobies and Joshua Wood describe an Operator as “an automated Site Reliability Engineer for its application.” Given an SRE’s multifaceted experience and diverse workload, this is a bold statement. So what exactly can the Operator do?

Emily Arnott — Blameless
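
For context, the heart of an Operator is a reconcile loop; here's a deliberately simplified Python sketch (generic, not tied to any real framework or to the book's examples):

    def desired_replicas():
        return 3            # in reality, read from a custom resource

    def actual_replicas():
        return 2            # in reality, read from the cluster API

    def scale_to(n):
        print(f"scaling to {n} replicas")   # in reality, call the cluster API

    def reconcile():
        # Repair drift instead of just alerting on it; that's the "automated SRE" part.
        if actual_replicas() != desired_replicas():
            scale_to(desired_replicas())

    reconcile()             # a real operator watches for changes and loops forever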

Outages

A production of Tinker Tinker Tinker, LLC