SRE WEEKLY – Page 57 – scalability, availability, incident response, automation

SRE Weekly Issue #213

lex

March 29, 2020

Articles

COVID-19: Why We Should All Wear Masks — There Is New Scientific Rationale

This is important, and well worth a read. Where’s the SRE connection? The article explains that the U.S. Surgeon General’s comment that masks are “not effective” led to a stigma against those that wear them here. That kind of unintended sociological effect is uncovered commonly in incident post-analysis.

Sui Huang

Keeping the Internet “Always On” — the Pressure of COVID-19 on Incident Response Teams

Pagerduty ran the numbers and discovered an increase in incidents recently, especially in certain companies.

Rachel Obstler — PagerDuty

February service disruptions post-incident analysis

Here’s the scoop on all those GitHub incidents in February.

Keith Ballinger — GitHub

Embrace Resilience for Business Continuity in Times of Uncertainty

No, it won’t be possible to continue operating business-as-usual. For the unforeseeable future, teams across the world will be dealing with cutbacks, infrastructure instability, and more. However, with SRE best practices, your team can embrace resilience and adapt through this difficult time.

Hannah Culver — Blameless

Remote incident management

5 tips for incident management when you’re suddenly remote

I love the concept of “ephemeral information”, that is, discussions that happen out-of-band, making it much harder to analyze the incident after the fact.

Blake Thorne — Atlassian

Elastic Cloud January 18, 2019 Incident Report

Grey failure turned a seemingly reasonable auto-recovery mechanism into a DoS caused by a thundering herd.

Panagiotis Moustafellos, Uri Cohen, and Sylvain Wallez — Elastic

Outages

G Suite
Google Cloud Platform
- GCP had a major incident that caused the G Suite outage.GCP also had an (apparently) unrelated outage later in the day.
BitBay (cryptocurrency exchange)
Netflix
Uber
WhatsApp
Fastly
- Also this one.Full disclosure: Fastly is my employer.
Reddit
Discord
Brightcove
Zoom
DoorDash
Nest
Canvas (remote learning tool)

SRE Weekly Issue #212

lex

March 22, 2020

General

Comments

View on sreweekly.com

Articles

Meaningful availability

This very clearly written paper describes the Google G Suite team’s search for a meaningful availability metric: one that accurately reflected what their end users experienced, and that could be used by engineers to pinpoint issues and guide improvements.

Hauer et al. — NSDI’20 (original paper)

Adrian Colyer — The Morning Paper (summary)

Our Top 5 On-Call Practices – Blameless: Better Reliability Through SRE

Their top 5 are:

Use Meaningful Severity Levels
Create Detailed Runbooks
Load Balance Through Qualitative Metrics
Get Ahead of Incidents
Cultivate a Culture of On-Call Empathy

Emily Arnott — Blameless

NTP: Building a more accurate time service at Facebook scale

Synchronizing clocks can be critical in an HA system, and Facebook went to great lengths to ensure clock accuracy.

Zoe Talamantes and Oleg Obleukhov — Facebook

The Fallacy of Move Fast and Break Things

You might end up just breaking things.

Dawn Parzych — LaunchDarkly

InSearch: LinkedIn’s new message search platform

LinkedIn’s message search system takes advantage of the fact that relatively few users actually search their message. It only builds a search index the first time a user performs a search.

Suruchi Shah and Hari Shankar — LinkedIn

Destiny 2 Outage and Rollback

This followup post from Bungie covers two related incidents in February that caused loss of user data.

Bungie

Involving Engineers in Incident Management: QCon London Q&A

An interview about how one company got their developers to join the on-call rotation. It covers how they trained them to help them build confidence and what benefits they got by joining.

Ben Linders — InfoQ

Outages

Statuspage.io
- The text of this incident originally mentioned Heroku, and it lines up with the Heroku outage below.
- They also had this unrelated outage.
Heroku
- Heroku suffered two short bouts of 85% request failure to applications hosted on their platform.Separately, they recently posted a couple of followup reports for previous incidents:
  - Incident #1961: logging outage
  - Incident #1968: EU application errors
Zoom
MacStadium
Hulu
Bumble
Microsoft Teams and Office 365
Discord
- Discord posted this gem of a followup analysis just a few days after their outage last week.
GoToMeeting
Google Nest
DoorDash

SRE Weekly Issue #211

lex

March 15, 2020

General

Comments

View on sreweekly.com

Articles

SREcon20 Asia/Pacific

SRECon20 Asia/Pacific is rescheduled to September 7–9, 2020.

Business continuity at Slack: Keeping our customers up and running during COVID-19

This article has a definite marketing slant. It’s nonetheless interesting to see how Slack is handling the situation.

Cal Henderson and Robby Kwok, Slack

Journey into Observability: Glitch’s journey

I love this gem:

I’m not surprised companies that are far into their observability journey start advocating for testing in production – once you have the data and you can slice & dice it as you see fit, testing in production seems like a totally reasonable thing to do.

Mads Hartmann

Lessons in Distributed Communication From Incident Response

With many companies suddenly shifting into figuring out how to become distributed organizations overnight, we can learn many lessons by looking at incident response patterns.

George Miranda — PagerDuty

When correlation (or lack of it) can be causation

Today’s post is a double header. I’ve chosen two papers from NSDI’20 that are both about correlation.

Paper #1 is a tool that helps identify when files A and B are often changed at the same time, and warns you if you forgot B.

Paper #2 is a tool for finding correlated failure risks that threaten reliability.

Mehta et al. — NSDI’20 (original paper #1)
Zhai et al. — NSDI’20 (original paper #2)
Adrian Colyer — The Morning Paper (summaries)

Great Incident Response Requires 3 Major Components

The components from the article are:

Ability to recognize how bad the situation really is, and prioritize it
Effective communication skills
Compassionate responses to mistakes and a learning mindset

Hannah Culver — Blameless

Announcing Failover Conf

We’re pleased to announce Failover Conf, a conference focused on building resilient systems. The conference will be held online on April 21 and session submissions will be accepted through March 23.

CFP open through March 23.

Gremlin

Grow your blame-free culture with these postmortem best practices | FireHydrant

There are some good tips in here, especially if you’re new to this.

Mandy Mak

How network automation helps Fastly support the world’s biggest live-streaming moments

Fastly’s APS tool (Auto Peer Slasher) detects when a link is nearing saturation and automatically reroutes traffic through a different interface.

Ryan Landry — Fastly

Full disclosure: Fastly is my employer.

Outages

SRE Weekly Issue #210

lex

March 8, 2020

General

Comments

View on sreweekly.com

Articles

Introducing Dispatch

Netflix open sourced their incident management system.

Put simply, Dispatch is:

All of the ad-hoc things you’re doing to manage incidents today, done for you, and a bunch of other things you should’ve been doing, but have not had the time!

Kevin Glisson, Marc Vilanova, Forest Monsen — Netflix

Reading /proc/pid/cmdline can hang forever

I wasn’t aware of this little pitfall of memory cgroups.

rachelbythebay

In space, no one can hear you kernel panic

Your failover DB instance is cute. Try 4x+ redundancy. That’s the kind of engineering required when designing systems to operate in space.

Glenn Fleishman — Increment

A single person on-call “rotation” is a critical vulnerability

This post enumerates some of the risks introduced when a single person carries 100% of the on-call duties of a team, and shows why those risks are not simply eliminated by increasing the number of people in the rotation.

Daniel Condomitti — FireHydrant

Experimental study on the effect of procedure under unexpected situations

This is a pretty nifty experiment showing the importance of letting folks use their judgement to handle unexpected situations rather than relying on adherence to procedures.

Thai Wood — Resilience Roundup (summary)

Makoto Takahashi, Daisuke Karikawa, Genta Sawasato and Yoshitaka Hoshii — Tohoku University (original paper)

Coronavirus/COVID-19 and USENIX Conferences

FYI: SRECon Americas West has been rescheduled to June 2-4.

Millions of tiny databases

This week, we have another summary of the Physalia paper. I especially like the bit about poison pills.

Adrian Colyer — The Morning Paper (summary)

Brooker et al. — NSDI’20 (original paper)

How did software get so reliable without proof?

In this case, “proof” means “formal proof”.

It’s not that software got so reliable without proof: it’s that systems that include software got so reliable without proof.

Lorin Hochstein

Outages

Let’s Encrypt Status
- Let’s Encrypt purposefully suspended certificate issuance to investigate a bug around validating CAA DNS records. See their initial report and subsequent full report for details.
  Subsequently, they decided to revoke 3 million certificates with a pretty short warning. Both actions (the revocations and taking down issuance initially) were likely warranted and mandated under the compliance guidelines that CAs are subjected to.
  
  I’ve found two third-party incidents so far that seem to stem from the revocations:
  * statuspage.io
  * Heroku
  
  Got any more? Please do send them my way.
Robinhood (stock trading platform)
- Thanks to Daniel Lucas for this and a couple other recent ones. for this one.
G Suite Status Dashboard
PagerDuty
Uber
Interactive Brokers (Stock Broker)
Binion’s and Four Queens (Las Vegas casinos)
- Slot machines stopped working, and an eerie quiet descended.
crates.io incident report for 2020-02-20
- On 2020-02-20 at 21:28 UTC we received a report from a user of crates.io that their crate was not available on the index even after 10 minutes since the upload. This was a bug in the crates.io webapp exposed by a GitHub outage.
  
  crates.io is the Rust language package registry.
  
  Pietro Albini — crates.io
Discord

SRE Weekly Issue #209

lex

March 2, 2020

General

Comments

View on sreweekly.com

Articles

Gandalf: an intelligent, end-to-end analytics service for safe deployment in cloud-scale infrastructure

Azure developed this tool to sniff out production problems caused by deploys and guess which deploy might have been the culprit. Its accuracy is impressive.

Adrian Colyer — The Morning Paper (summary)

Li et al. — NSDI’20 (original paper)

fork() can fail: this is important

This one made me laugh out loud. Better check those system call return codes, people.

rachelbythebay

Managing the Hidden Costs of Coordination

This caught my eye:

In addition, what is seen as the IC maintaining organizational discipline during a response can actually be undermining the sources of resilient practice that help incident responders cope with poorly matched coordination strategies and the cognitive demands of the incident.

Laura M.D. Maguire — ACM Queue Volume 17, Issue 6

How much money do SREs make?

A guide on salary expectations for various levels of SRE, especially useful if you’re changing jobs.

Gremlin

3 microservices resiliency patterns for better reliability

The flipside of microservices agility is the resiliency you can lose from service distribution. Here are some microservices resiliency patterns that can keep your services available and reliable.

Joydip Kanjilal

It’s time for smart home devices to have local failover options during cloud outages

There have been several recent failures of consumer devices based on a cloud service outage, and this author argues for change.

Kevin C. Tofel — Stacey on IoT

Human error, miscommunication and lack of training behind false alarm at Pickering nuclear station

This sounds familiar…

Durham Radio News

Friday deploys: comfort, not pressure

Essentially, you’re taking that risk of the Friday afternoon deployment, and spreading it thinly across many deployments throughout the week.

Ben New

Outages

Fidelity
- This one was especially problematic because it happened on Monday, a day of huge losses for the US stock market.
GitHub
- This one too. GitHub posted a short note on the recent outages.
TechCrunch
- TechCrunch was serving an expired TLS certificate. The strange thing is that the certificate had only been valid for 12 hours.
Petnet pet feeders
Google Nest

SRE Weekly Issue #213

Articles

Outages

SRE Weekly Issue #212

Articles

Outages

SRE Weekly Issue #211

Articles

Outages

SRE Weekly Issue #210

Articles

Outages

SRE Weekly Issue #209

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues