SRE WEEKLY – Page 56 – scalability, availability, incident response, automation

SRE Weekly Issue #238

lex

October 4, 2020

My daughters asked earlier today what I do at work, and I explained all about SRE, reliability, and the importance of work-life balance. They said to tell you they say hi!

Articles

On Call Shouldn’t Suck: A Guide For Managers

Lots of really great advice in here. And really, with a title like that, I couldn’t resist reading it!

Charity Majors

Follow-up for Google Cloud Infrastructure Components Incident #20010

Last week, I mentioned a Google Cloud Platform outage that affected multiple services. Here’s the detailed post-analysis by Google.

Google

Team Play with a Powerful and Independent Agent: A Full-Mission Simulation Study

This one is along the lines of the classic Ironies of Automation paper by Bainbridge. How can automation be a team player, and what happens when it isn’t?

Nadine Sarter and David Woods (original paper)

Thai Wood — Resilience Roundup (summary)

Applying Chaos Engineering in Healthcare: Getting Started with Sensitive Workloads

How can you use chaos engineering when failures in the system can be critical and even life-threatening?

Carl Chesser — InfoQ

This is your Guide for Implementing SRE in NOCs

In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.

Emily Arnott — Blameless

Is your microservice a distributed monolith?

This article suggests using chaos engineering to tell if your microservice-based architecture is secretly a monolith in disguise.

Andre Newman — Gremlin

Outages

Slack
Radware
- An accidental BGP hijack by Telstra took down Radware.
Twitter
Tokyo Stock Exchange
- The Tokyo Stock Exchange was down for an entire day, the first time that’s ever happened.
Fastly
Squarespace
Google Search Indexing
Microsoft Azure outage #SM79-F88
- A problem with Azure Active Directory caused trouble for Office365 and other Microsoft services. Click through for their detailed follow-up.

SRE Weekly Issue #237

lex

September 27, 2020

Articles

Postmortem — why Allegro went down

They fully expected their deep-discount sale to drive traffic, but they didn’t expect their system to handle the increase in the way that it did.

Michał Kosmulski — Allegro

Zero-Downtime Kubernetes Deployments

Pre-stop hooks, liveness probes, and readiness probes were key to smoothly transitioning their services from a home-grown container system to Kubernetes.

Oliver Leaver-Smith — Sky Betting & Gaming

Feelings during incident response

The experience of responding to an incident can evoke emotions that run the gamut.

Mads Hartmann

Join SRE Classroom NALSD workshops

Google has released course materials for the first of a series of classes on NALSD (“non-abstract large systems design”). This first one is about a distributed Pub-Sub system.

Jenny Liao and Salim Virji — Google

Why you should write up your own incident

Usually, doing a post-analysis on an incident you were in is an anti-pattern because you’re likely to introduce bias. But sometimes, it can lead you to learn more than you would have otherwise.

Lorin Hochstein

Outages

Datadog
G Suite
Google Cloud Platform
Let’s Encrypt
- Google CT logs had an issue, impairing Let’s Encrypt’s ability to issue.
Tesla
Apple
Reddit
Heroku
Connectivity Issues
Crypto.com (cryptocurrency exchange)
- The CEO says a database issue (nearly) opened up the possibility for arbitrage.

SRE Weekly Issue #236

lex

September 20, 2020

Articles

My first outage

A nice juicy post-incident report from the archives. Remember the first time you took down production?

Mads Hartmann — Glitch

Fault during testing of NordLink

While testing a new power transmission link, it was accidentally overloaded by a factor of ~14x, with far-reaching but ultimately well-managed effects.

Thanks to Jesper Lundkvist for this one.

Throughput autoscaling: Dynamic sizing for Facebook.com

As Facebook moved from a static to an auto-scaled web pool, they had to try to predict their expected demand as accurately as possible.

Daniel Boeve, Kiryong Ha, and Anca Agape — Facebook

Database migrations lessons learned

The key lesson involves ensuring that your migrations avoid using parts of the production code, which could cause their action to change down the line inadvertently.

Frank Lin — Octopus Deploy

Moobot vs. Gatebot: Cloudflare Automatically Blocks Botnet DDoS Attack Topping At 654 Gbps

Cloudflare uses an interesting multi-layered approach to mitigating attacks.

Omer Yoachimik — Cloudflare

Availability, Maintainability, Reliability: What’s the Difference?

The availability/reliability distinction in this article is thought-provoking.

Emily Arnott — Blameless

Troubled Times: Episode 3

2020 has shown the value of adaptive capacity. 2021 will show whether or not adaptive capacity can be sustained.

This article (not a video or podcast despite the name) also focuses on the increasing importance of learning from incidents.

Dr. Richard Cook — Adaptive Capacity Labs

Building and revising adaptive capacity sharing for technical incident response: A case of resilience engineering

What is resilience engineering? What does a resilience engineer do? Are there principles of resilience engineering? If so, what are they? What makes it possible to engineer resilience?

This academic paper uses a case study to show how a company engineered the resilience of their system in response to a series of incidents.

Richard I. Cook and Beth Adele Long — Applied Ergonomics

Outages

Google Drive
- This is a post-analysis for two outages, one from this past week and the other from the week before.
Instagram
Facebook
Discord
Fastly
Gandi
- Postmortem regarding the Network Incident from September 15, 2020 on IAAS and PAAS FR-SD3, FR-SD5, and FR-SD6
  
  A layer 2 network loop was accidentally introduced, on two separate occasions.
  
  Sébastien Dupas — Gandi
Azure
- This was an outage on Sept. 14 in the UK South region. A cooling system was shut off in error during a maintenance procedure.

SRE Weekly Issue #235

lex

September 13, 2020

Articles

Alerting on SLOs

This isn’t just another boring article about SLOs. There’s a ton of good stuff in here about why they moved to SLO-based alerts, too.

we’re hoping that by implementing SLOs – and alerting on them – we’ll be able to improve communication during incidents, reduce the toil on on-callers, and help improve our reliability in a way that’s meaningful to our users.

Mads Hartmann

A nudge in the right direction

Often, serendipity gets us out of an incident or makes it less severe.

Unless we treat this sort of activity as first class when looking at incidents, we won’t really understand how it can be that some incidents get resolved so quickly and some take much longer.

Lorin Hochstein

Seamlessly Swapping the API backend of the Netflix Android app

It’s your classic “replace the engines on a jet while flying it” story. My favorite part is how they recorded real traffic and played it at the old and new backend API to compare the JSON responses.

Rohan Dhruva and Ed Ballot — Netflix
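
The record-and-compare idea can be sketched in a few lines. This is only an illustration, not Netflix's tooling: `diff_json`, `replay`, and the two backend callables are hypothetical names, and the backends are assumed to return JSON text for a recorded request.

```python
import json
from typing import Any, Callable, Dict, Iterable, List


def diff_json(old: Any, new: Any, path: str = "$") -> List[str]:
    """Return human-readable differences between two decoded JSON values."""
    if isinstance(old, dict) and isinstance(new, dict):
        diffs: List[str] = []
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}"
            if key not in new:
                diffs.append(f"{child}: missing in new")
            elif key not in old:
                diffs.append(f"{child}: missing in old")
            else:
                diffs.extend(diff_json(old[key], new[key], child))
        return diffs
    if isinstance(old, list) and isinstance(new, list):
        if len(old) != len(new):
            return [f"{path}: length {len(old)} != {len(new)}"]
        diffs = []
        for i, (a, b) in enumerate(zip(old, new)):
            diffs.extend(diff_json(a, b, f"{path}[{i}]"))
        return diffs
    if old != new:
        return [f"{path}: {old!r} != {new!r}"]
    return []


def replay(
    recorded: Iterable[str],
    old_backend: Callable[[str], str],  # hypothetical: request -> JSON text
    new_backend: Callable[[str], str],  # hypothetical: request -> JSON text
) -> Dict[str, List[str]]:
    """Send each recorded request to both backends; keep only mismatches."""
    report: Dict[str, List[str]] = {}
    for request in recorded:
        diffs = diff_json(
            json.loads(old_backend(request)),
            json.loads(new_backend(request)),
        )
        if diffs:
            report[request] = diffs
    return report
```

An empty report means the two backends agreed on every recorded request. In practice, fields that legitimately differ (timestamps, request IDs) would need to be filtered out before comparing.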

Using feature flags during incident management

Feature flags can help with load shedding and throttling, and feature flag activity can even be useful data that points to contributing factors.

Dawn Parzych — LaunchDarkly

Unimog – Cloudflare’s edge load balancer

Unimog uses a lot of really interesting techniques to balance layer 4 traffic, which this article covers in great detail.

David Wragg — Cloudflare

Production testing with dark canaries

I like this idea: it’s like a normal canary, except that you only send it a copy of traffic and discard the result, so as to avoid impacting users.

David Hoa — LinkedIn
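
A minimal sketch of the dark canary pattern, assuming a simple in-process handler model. `DarkCanaryMirror` and its callables are hypothetical names, not LinkedIn's implementation, and a real system would mirror traffic asynchronously rather than in the request path.

```python
from typing import Callable, List


class DarkCanaryMirror:
    """Serve users from the primary; send the canary a copy and drop its reply."""

    def __init__(
        self,
        primary: Callable[[str], str],  # hypothetical production handler
        canary: Callable[[str], str],   # hypothetical handler under test
    ) -> None:
        self.primary = primary
        self.canary = canary
        self.canary_errors: List[str] = []

    def handle(self, request: str) -> str:
        # The user-visible response always comes from the primary.
        response = self.primary(request)
        try:
            # The canary sees the same request, but its result is discarded.
            self.canary(request)
        except Exception as exc:
            # Canary failures are recorded for later review, never surfaced.
            self.canary_errors.append(f"{request}: {exc!r}")
        return response
```

Because the canary's output never reaches the caller, a crash or a wrong answer in the new build shows up only in `canary_errors`, not in user-facing behavior.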

Outages

SRE Weekly Issue #234

lex

September 6, 2020

Last Sunday, there was a major backbone Internet provider outage after I finished putting SRE Weekly together. There were so many outages that I’m not even going to bother listing all of them in the Outages section.

Articles

How to Build Your SRE Team

I love the way this article portrays SRE by placing less emphasis on specific skills and more on a holistic approach to reliability.

Emily Arnott — Blameless

Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability

Incident review is an important part of the organizational learning process, but it can be practiced in a way where the focus shifts away from learning to fixing.

John Carroll (original paper)

Thai Wood — Resilience Roundup (summary)

AD 0001

My latest adventures in (negligently) running sreweekly.com. It started with a surprise AWS bill, and then it got kinda weird…

Lex Neva

Inside a CODE RED: Network Edition

Deep technical details on a series of recent incidents involving Basecamp.

Troy Toman — Basecamp

Questionable Advice: War Rooms? Really?!?

Here’s why eyes-on-glass constant monitoring won’t help and can be actively harmful.

Charity Majors

GitHub Availability Report: August 2020

In August, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will dive into updates to the GitHub Status Page and provide follow-up details on how we’ve addressed the incident mentioned in July’s report.

Keith Ballinger — GitHub

Analysis of Today’s CenturyLink/Level(3) Outage

Here are Cloudflare’s thoughts on what happened with Sunday’s Internet trouble.

Matthew Prince — Cloudflare

CenturyLink / Level 3 Outage Analysis

This is ThousandEyes’s analysis of the outage, which goes along similar lines to Cloudflare’s and includes a lot more detail.

Angelique Medina and Archana Kesavan — ThousandEyes
