SRE Weekly Issue #184

A message from our sponsor, VictorOps:

Do you dream of reducing MTTA from four hours to two minutes? Learn how you can improve incident detection, alerting, real-time incident collaboration and cross-functional transparency to make on-call suck less and build more reliable services:

http://try.victorops.com/sreweekly/improved-incident-response

Articles

This article relates to Donella H. Meadows’s book, Thinking in Systems.

What follows is Meadows’ list of leverage points outfitted with my ideas of where or how they can be applied to software development and web operations.

Ryan Frantz

D:

I know it's past an hour but… we got ~600 Nagios emails a day. Boss forbade us from muting any of them. In weekly status meeting, he’d often quiz on-call on a random alert. If on-call didn't know about it, boss would often scream at us…

Jason Antman (@j_antman)

Find out how the Couchbase folks use Jepsen to test their database offering.

Korrigan Clark

A supportive on-call environment is critical to ensuring reliability and resiliency.

Deirdre Mahon — Honeycomb

This is a follow-on to an article I linked to a while back.

It’s really simpler to call it Tech Risk.

I love the idea of tracking the decisions an organization makes and the risks they entail.

Sarah Baker

Outages

SRE Weekly Issue #183

A message from our sponsor, VictorOps:

Incident management and response don’t need to suck. See how you can build a collaborative incident management plan with shared transparency into developer availability and on-call schedules for IT operations:

http://try.victorops.com/sreweekly/incident-management-plan

Articles

Another issue of Increment, on a topic integral to SRE: testing.

It doesn’t matter if you’ve already read everything Charity Majors has written; in this article she’s still managed to find new and even more compelling ways to argue that we should embrace testing in production.

My two other favorite articles from this issue:

Charity Majors — Honeycomb

That’s exactly what we hoped for.

They rewrote this critical service and carefully deployed it to avoid user impact, using a technique I love: run the new code alongside the old for a while to verify that it returns the same result.

Jeremy Gayed, Said Ketchman, Oleksii Khliupin, Ronny Wang and Sergey Zheznyakovskiy — New York Times
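
A minimal sketch of that dual-run idea (not the Times’s actual implementation; legacy_handler and rewritten_handler are hypothetical callables): always serve the legacy result, shadow-run the rewrite on the same input, and log any mismatch.

```python
import logging

log = logging.getLogger("shadow-compare")

def handle_request(request, legacy_handler, rewritten_handler):
    """Serve traffic from the legacy path while shadow-running the rewrite.

    The legacy result is always returned to the caller; the rewritten path
    is only observed, so a bug in it cannot cause user-facing impact.
    """
    legacy_result = legacy_handler(request)

    try:
        new_result = rewritten_handler(request)
        if new_result != legacy_result:
            # A mismatch is a signal to investigate before cutting over.
            log.warning("shadow mismatch for %r: legacy=%r new=%r",
                        request, legacy_result, new_result)
    except Exception:
        # Failures in the new path are logged, never surfaced to users.
        log.exception("rewritten handler raised for %r", request)

    return legacy_result
```

In practice you’d likely sample only a fraction of traffic and emit mismatch counts as a metric rather than logging every request.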

This is aimed at Certification Authorities dealing with TLS certificate misissuance issues and the like, but it very much applies to any kind of incident.

BONUS CONTENT: An incident report from Let’s Encrypt just a few days later included this gem, exactly in line with what Ryan wrote:

After initially confirming the report we reached out to multiple other CAs that we believed would also be affected.

Ryan Sleevi

Whose? Hosted Graphite’s. Definitely worth a read.

Fran Garcia — Hosted Graphite

Which brings me to this unpopular opinion: All code is technical debt.

However, debt itself isn’t bad. It can be risky, especially if misunderstood, but debt itself is not inherently good or bad. It’s a tool.

Dormain Drewitz — Pivotal

Blameless is running a free workshop on writing post-incident reports.

In this talk we will discuss the elements of an effective postmortem and the challenges faced while defining the process. We will introduce concrete methodologies that alleviate the cognitive overhead and emotional burden of doing postmortems.

Blameless

Outages

  • Heroku Status
    • Heroku experienced 8+ hours of instability. This status page posting is really worth a read because it has:
      • meticulously detailed customer impact
      • no sugar-coating
      • detailed workarounds when they were available

      Hats off to you, folks.

  • Slack
  • Reddit
  • Sling TV
  • Disney Plus
    • Increased traffic from a sale caused instability.
  • Fastly

SRE Weekly Issue #182

A message from our sponsor, VictorOps:

Collaborate with the right teammates, find the right information and resolve system outages in minutes. Play the VictorOps on-call game to test your skillz and compete against your friends and coworkers.

http://try.victorops.com/sreweekly/on-call-game

Articles

Friday deploys are going to be necessary occasionally, even if we try to ban them. Banning them will only mean that we’re less experienced at executing Friday deploys successfully.

Will Gallego

Jet engines are Complicated. The system of jet engine maintenance (including the technicians, policies, schedules, etc.) is Complex. Understanding the difference is key to managing complex systems.

Adam Johns

In this issue, we have articles from the front line, as well as from safety, legal, leadership, human factors and psychology specialists.

Hindsight is a magazine targeted at air traffic controllers. An example article title from this issue:

Mode-Switching in Air Traffic Control

Thanks to Greg Burek for this one.

The US Federal Communications Commission released its report on an outage last December that took down 911 (emergency services) across a large swathe of the US.

This outage was caused by an equipment failure catastrophically exacerbated by a network configuration error.

They’re two separate concepts, but they’re often presented together, blurring the line between them.

Daniel Abadi

I love the idea of applying the ideas of resilience engineering to child welfare services. This article quotes from Hollnagel and Dekker.

Tom Morton and Jess McDonald

Outages

SRE Weekly Issue #181

A message from our sponsor, VictorOps:

Think you’ve got what it takes to quickly resolve a system outage? Test your on-call skillz with the new VictorOps on-call adventure game.

http://try.victorops.com/sreweekly/on-call-game

Articles

Root Cause Analysis is a flawed concept, and seeking it almost inevitably results in treating people unfairly. I like the concept of “Least Effort to Remediate” introduced in this article.

Casey Rosenthal — Verica

Slack developed a load simulation tool and used it to verify a new feature, Enterprise Key Management.

Serry Park, Arka Ganguli, and Joe Smith
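
Slack’s tool is far more sophisticated, but the general shape of a load simulator boils down to spawning many synthetic clients, driving a target endpoint concurrently, and reporting error counts and latency percentiles. The URL and concurrency numbers below are illustrative placeholders, not Slack’s setup.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Illustrative placeholders -- point this at a test environment, never at
# production without the kind of safeguards the article describes.
TARGET_URL = "https://staging.example.com/health"
WORKERS = 20              # concurrent synthetic clients
REQUESTS_PER_WORKER = 50  # requests each client sends

def synthetic_client(_):
    """One simulated client: hammer the endpoint and record latency/errors."""
    latencies, errors = [], 0
    for _ in range(REQUESTS_PER_WORKER):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(TARGET_URL, timeout=5) as resp:
                resp.read()
        except Exception:
            errors += 1
        latencies.append(time.monotonic() - start)
    return latencies, errors

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        results = list(pool.map(synthetic_client, range(WORKERS)))

    all_latencies = sorted(l for lats, _ in results for l in lats)
    total_errors = sum(e for _, e in results)
    p95 = all_latencies[int(len(all_latencies) * 0.95)]
    print(f"requests={len(all_latencies)} errors={total_errors} p95={p95:.3f}s")
```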

After reviewing the history of the term “antifragility”, this article explains why it is a flawed concept and contrasts it with Chaos Engineering.

This is where the concept of antifragility veers from a truism into bad advice.

Casey Rosenthal

Outages

SRE Weekly Issue #180

A message from our sponsor, VictorOps:

Endorsing a culture of blameless transparency around post-incident reviews can lead to continuous improvement and more resilient services. Check out an interesting technique that SRE teams are using to improve post-incident analysis and learn more from failure:

http://try.victorops.com/sreweekly/ishikawas-fishbone-diagram

Articles

This reads like a mini list of war stories from a grizzled veteran reliability engineer… because that’s exactly what it is. Don’t forget to click the link at the bottom for the followup post!

rachelbythebay

The myths:

  1. Add Redundancy
  2. Simplify
  3. Avoid Risk
  4. Enforce Procedures
  5. Defend against Prior Root Causes
  6. Document Best Practices and Runbooks
  7. Remove the People Who Cause Accidents

If that doesn’t make you want to read this, I don’t know what will.

Casey Rosenthal — Verica

The graveyard that no one dared tread in was the Terraform code. Once they got CI/CD set up, deploys became much easier — and less scary.

Liz Fong-Jones — Honeycomb
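
The article covers Honeycomb’s specific journey; as a generic sketch of the underlying pattern (every change goes through an automated plan, and only a reviewed plan gets applied), a minimal CI step for Terraform might look like the following. The gating flag here is an assumption for illustration, not Honeycomb’s pipeline.

```python
import subprocess
import sys

def run(cmd):
    """Echo and run a command, returning its exit code."""
    print("+", " ".join(cmd), flush=True)
    return subprocess.call(cmd)

def main():
    # terraform plan -detailed-exitcode: 0 = no changes, 1 = error, 2 = diff
    plan_rc = run(["terraform", "plan", "-detailed-exitcode", "-out=tfplan"])
    if plan_rc == 1:
        sys.exit("terraform plan failed")
    if plan_rc == 0:
        print("no infrastructure changes")
        return
    # Apply only the exact plan file produced (and reviewed) above.
    # The --apply flag stands in for whatever gating your CI system provides.
    if "--apply" in sys.argv:
        if run(["terraform", "apply", "-input=false", "tfplan"]) != 0:
            sys.exit("terraform apply failed")

if __name__ == "__main__":
    main()
```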

My favorite idea in this article is that the absence of “errors” is not the same thing as safety.

Thai Woods (summary)

Sidney Dekker (original paper)

High availability and resilience are key features of Kubernetes. But what do you do when your Kubernetes cluster starts to become unstable and it looks like your ship is starting to sink?

Tim Little — Kudos

Outages

A production of Tinker Tinker Tinker, LLC