SRE Weekly Issue #178

A message from our sponsor, VictorOps:

Containers and microservices can improve development speed and service flexibility. But more complex systems have a higher potential for incidents. Learn how SRE teams are building more reliable services and adding context to microservices and containerized environments:

http://try.victorops.com/sreweekly/container-monitoring-and-alerting-best-practices

Articles

Imagine a database that promises consistency except in the case of a network partition, in which case it favors availability. That’s conditional consistency, and it’s effectively the same as no consistency.

Daniel Abadi
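To make the point concrete, here's a minimal sketch (hypothetical classes and names, not from Abadi's article) of why a "consistent unless partitioned" guarantee gives the client nothing to rely on: the availability-favoring fallback is invisible to the reader, so every read has to be treated as possibly stale.

    import random

    class Replica:
        """Toy replica: serves a quorum read when it can, a local read when it can't."""

        def __init__(self):
            self.local_value = "stale"
            self.quorum_value = "latest"

        def reachable_quorum(self) -> bool:
            # Whether a partition is in progress is outside the client's knowledge;
            # modeled here as a coin flip.
            return random.random() > 0.1

        def read(self) -> str:
            if self.reachable_quorum():
                return self.quorum_value   # the "consistent" path
            return self.local_value        # the availability-favoring fallback

    # The caller gets no signal about which path was taken, so it can never
    # assume the strong guarantee held for this particular read.
    value = Replica().read()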

This is a story about distributed coordination, the TCP API, and how we debugged and fixed a bug in Puma that only shows up at scale.

Richard Schneeman — Heroku

Here’s more on the Australian Tax Office outage earlier this month.

Max Smolaks — The Register

Ever experience a total outage while your cloud provider still reports 99.999% availability? This one’s for you.

rachelbythebay
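For a sense of how that happens, here's a toy calculation (invented numbers) showing a provider-wide success-rate metric staying at five nines while one customer's traffic fails entirely:

    total_requests = 10_000_000        # provider-wide, one month
    failed_requests = 80               # every failure belongs to one customer
    provider_availability = 1 - failed_requests / total_requests
    print(f"provider-reported: {provider_availability:.5%}")   # 99.99920%

    customer_requests = 80             # that customer's entire traffic
    customer_failures = 80
    customer_availability = 1 - customer_failures / customer_requests
    print(f"that customer:     {customer_availability:.0%}")   # 0%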

What’s good or bad to do in production? And how do you transfer knowledge when new team members want to release production services or take ownership of existing services?

Jaana B. Dogan (JBD)

The internet is a series of tubes — the kind that transmit light. Favorite thing I learned: fiber optic cables are sheathed in copper that powers repeaters along their length.

James Griffiths — CNN

How do you build a reliable network when faced with highly skilled and motivated adversaries?

Alex Wawro — DARKReading

Outages

SRE Weekly Issue #177

A message from our sponsor, VictorOps:

[Free Webinar] VictorOps partnered with Catchpoint to put death to downtime with actionable monitoring and incident response practices. See how SRE teams are being more proactive toward service reliability:

http://try.victorops.com/sreweekly/death-to-downtime

Articles

The point of this thread is to bring attention to the notion that our reactions to surprising events are the fuel that effectively dictates what we learn from them.

John Allspaw — Adaptive Capacity Labs

This article is an attempt to classify the causes of major outages at the big three cloud providers (AWS, Azure, and GCP).

David Mytton

It was, wasn’t it? Here’s a nice summary of the recent spate of unrelated major incidents.

Zack Whittaker — TechCrunch

Calculating CIRT (Critical Incident Response Time) involves ignoring various types of incidents to try to get a number that is more representative of the performance of an operations team.

Julie Gunderson, Justin Kearns, and Ophir Ronen — PagerDuty
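The exclusion rules below are illustrative assumptions, not PagerDuty's exact criteria, but they sketch the filtering idea: measure response time only over incidents that plausibly reflect responder performance.

    from statistics import mean

    incidents = [
        # (urgency, auto_resolved, response_minutes)
        ("high", False, 12),
        ("high", False, 7),
        ("low",  False, 95),    # excluded: low urgency
        ("high", True,  1),     # excluded: resolved automatically, no human response
        ("high", False, 480),   # excluded: extreme outlier
    ]

    def counts_toward_cirt(urgency, auto_resolved, minutes):
        return urgency == "high" and not auto_resolved and minutes < 240

    kept = [m for (u, a, m) in incidents if counts_toward_cirt(u, a, m)]
    print(f"CIRT over {len(kept)} qualifying incidents: {mean(kept):.1f} minutes")
    # -> CIRT over 2 qualifying incidents: 9.5 minutes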

There is so much great detail in this followup article about Cloudflare’s global outage earlier this month. Thanks, folks!

John Graham-Cumming — Cloudflare

Outages

  • Statuspage.io
  • NS1
  • PagerDuty
  • Nordstrom
    • Nordstrom’s site went down at the start of a major sale.
  • Twitter
  • Heroku
  • Honeycomb
    • Honeycomb had an 8-minute outage preceded by 4 minutes of degradation. Click through to find out how their CI pipeline surprised them and what they did about it.
  • LinkedIn
  • Australian Tax Office
  • Reddit
  • Stripe
    • […] two different database bugs and a configuration change interacted in an unforeseen way, causing a cascading failure across several critical services.

      Click through for Stripe’s full analysis.

  • Discord

SRE Weekly Issue #176

A message from our sponsor, VictorOps:

[Free Guide] VictorOps partnered with Catchpoint and came up with six actionable ways to transform your monitoring and incident response practices. See how SRE teams are being more proactive toward service reliability.

http://try.victorops.com/sreweekly/transform-monitoring-and-incident-response

Articles

[…] spans are too low-level to meaningfully be able to unearth the most valuable insights from trace data.

Find out why current distributed tracing tools fall short and the author’s vision of the future of distributed tracing.

Cindy Sridharan

If I wanted to introduce the concept of blameless culture to execs, this article would be a great starting point.

Rui Su — Blameless

When we look closely at post-incident artifacts, we find that they can serve a number of different purposes for different audiences.

John Allspaw — Adaptive Capacity Labs

When you meant to type /127 but entered /12 instead

Oops?
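If you're wondering how bad that typo is: for IPv6, where /127 point-to-point prefixes are common, the difference is two addresses versus about 8.3 × 10^34. A quick check with Python's ipaddress module (documentation prefix, purely for illustration):

    import ipaddress

    intended = ipaddress.ip_network("2001:db8::/127")
    typo = ipaddress.ip_network("2001:db8::/12", strict=False)  # masks host bits to 2000::/12

    print(intended.num_addresses)   # 2
    print(typo.num_addresses)       # 2 ** 116, roughly 8.3e34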

The early failure injection testing mechanisms from Chaos Monkey and friends were like acts of random vandalism. Monocle is more like intelligent probing, seeking out any weakness a service may have.

There’s a great example of Monocle discovering a mismatched timeout between client and server and targeting it for a test.

Adrian Colyer (summary)

Basiri et al., ICSE 2019 (original paper)
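As a rough illustration of that timeout-mismatch case (names and numbers invented, not from the paper): the caller gives up after 1 second while the callee is allowed 5, so any injected latency between the two turns into client-side failures even though the server "succeeds".

    import time

    CLIENT_TIMEOUT_S = 1.0
    SERVER_BUDGET_S = 5.0   # server-side timeout; note it exceeds the client's

    def server_handle(injected_latency_s: float) -> str:
        if injected_latency_s > SERVER_BUDGET_S:
            raise TimeoutError("server gave up")
        time.sleep(injected_latency_s)   # chaos-injected delay
        return "ok"

    def client_call(injected_latency_s: float) -> str:
        start = time.monotonic()
        result = server_handle(injected_latency_s)
        if time.monotonic() - start > CLIENT_TIMEOUT_S:
            # The server did the work, but the caller already treats it as failed.
            raise TimeoutError("client gave up first")
        return result

    try:
        client_call(2.0)    # 2s of injected latency sits squarely in the mismatch window
    except TimeoutError as exc:
        print(f"latency injection exposed the mismatch: {exc}")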

Take the axiom of “don’t hardcode values” to an extreme, and you end up right back where you started.

Mike Hadlow

Outages

SRE Weekly Issue #175

A message from our sponsor, VictorOps:

Looking to go serverless? Beau Christensen, VictorOps Director of Platform Engineering, and Tom McLaughlin, Founder of ServerlessOps, sat down to talk about when VictorOps decided to venture into AWS:

http://try.victorops.com/SREWeekly/going-serverless

Articles

This and other enlightened reflections on incident reviews can be found in this article:

Many organizations have driven post-incident reviews to become pallid, vapid, mechanical exercises whose value is limited to producing a defensible argument that management is occurring.

Richard Cook — Adaptive Capacity Labs

In this post, I’ll describe how we monitor our DNS systems and how we used an array of tools to investigate and fix an unexpected spike in DNS errors that we encountered recently.

Jeff Jo — Stripe
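Not Stripe's actual tooling, but the shape of the simplest possible probe behind a "DNS error rate" metric looks something like this (placeholder targets):

    import socket
    import time

    NAMES = ["example.com", "example.org"]   # placeholder targets

    def probe(name: str):
        start = time.monotonic()
        try:
            socket.getaddrinfo(name, 443)
            return ("ok", time.monotonic() - start)
        except socket.gaierror:
            return ("error", time.monotonic() - start)

    results = [probe(n) for n in NAMES]
    errors = sum(1 for status, _ in results if status == "error")
    print(f"{errors}/{len(results)} lookups failed")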

“Multi-cloud” never really lived up to its hype, did it? This article argues that a multi-cloud strategy is only useful in specific, constrained situations.

Disco Posse

I love how they used idempotency to avoid downtime and missed or repeated transactions during the cutover.

Miguel Carranza — RevenueCat
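The general idea, sketched with a toy ledger (not RevenueCat's actual code): every transaction carries a stable idempotency key, so writing it to both the old and new stores during the cutover, or retrying after an ambiguous failure, can't double-apply it.

    class Ledger:
        def __init__(self):
            self.applied = {}              # idempotency_key -> amount

        def record(self, idempotency_key: str, amount: int) -> bool:
            if idempotency_key in self.applied:
                return False               # duplicate delivery; safely ignored
            self.applied[idempotency_key] = amount
            return True

    old, new = Ledger(), Ledger()
    txn = ("txn-123", 499)

    # During cutover the same transaction may be written to both systems, and
    # resent after a timeout; the key guarantees it lands exactly once in each.
    for ledger in (old, new, new):
        ledger.record(*txn)

    assert sum(new.applied.values()) == 499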

This is either really clever or just unsporting.

Tonya Garcia — MarketWatch

This article discusses six kinds of SRE team (“kitchen sink”, infrastructure, tools, product/application, embedded, and consulting) and the pros and cons of each.

Gustavo Franco and Matt Brown — Google

If you see half the incidents this quarter compared to last, does it actually mean anything, statistically speaking? The math in this article applies equally well to SRE, and casts a shadow on the idea of tracking “metrics” like MTTR.

Marloes Nitert — Safety Differently
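A back-of-the-envelope version of the question: if last quarter had 8 incidents and this one had 4, and the underlying rate hadn't changed at all, how unusual would that split be? One simple check (my numbers, not the article's) treats the split of 12 incidents across two equal quarters as Binomial(12, 0.5):

    from math import comb

    last_q, this_q = 8, 4
    n = last_q + this_q

    def binom_pmf(k, n, p=0.5):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    # Two-sided: probability of a split at least this lopsided in either direction.
    p_value = sum(binom_pmf(k, n) for k in range(n + 1)
                  if abs(k - n / 2) >= abs(this_q - n / 2))
    print(f"p-value ~ {p_value:.2f}")   # ~0.39 -- halving could easily be noise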

This field guide to debugging is the synthesis of a bunch of contributions by folks on Twitter, forged into an article by the inimitable Julia Evans. Maybe a zine is in the works?

Julia Evans

Outages

A production of Tinker Tinker Tinker, LLC