SRE WEEKLY – Page 54 – scalability, availability, incident response, automation

SRE Weekly Issue #248

lex

December 13, 2020

General

Articles

SLOs That Lie – SRE Journal

It’s really easy to get an “uptime” SLO wrong, and a lying SLO can give you a false sense of security.

Piyush Verma — Last9

It’s Just a Monitoring Change

I love this quote. I feel like this is the “root cause” of every incident:

As for the underlying cause of the incident (or the “root cause” if you insist on using such language), that has to be the fact that our assumptions as teams or individuals are ultimately formed by our past experiences.

Oliver Leaver-Smith — Sky Betting & Gaming

Complexity Has to Live Somewhere

I really love the concept of requisite complexity. This article has me thinking about a big project I’m working on in a new light.

Fred Hebert

The Boring Option

They expected to max out an integer primary key column sometime in 2021. Then the pandemic hit and their timetable suddenly accelerated along with their traffic.

Jeff Pollard — Strava

Scary sysadmin Halloween stories

I shouldn’t enjoy reading these so much… got any of your own to share?

Dean Wilson

Borrow Expertise With Runbook Automation

The idea of borrowing expertise makes me think of Bainbridge’s Ironies of Automation.

Mandi Walls — PagerDuty

Heroku Incident #2127 Follow-Up: Issues with starting new dynos

Heroku’s report explains how their service was impacted as a result of the big Amazon Kinesis outage a couple weeks back.

Heroku

Setting Business Goals with SLOs

This primer focuses on ensuring that your SLOs actually match up with business objectives.

Irving Popovetsky — Honeycomb

Outages

AT&T
- An interesting Twitter thread about a router near San Francisco, California, USA that was flipping bits in packets for weeks. Folks took to Twitter to try to get AT&T’s attention, and they finally fixed it.
Robinhood
Facebook Messenger & Instagram
Microsoft stuff
- Office 365
- Teams
- SharePoint
- OneDrive
Reddit

SRE Weekly Issue #247

lex

December 6, 2020

General

Articles

2020 09 25 Incident: Infrastructure connectivity issue impacting multiple systems

This incident report from a September Datadog outage has an interesting tidbit about scaling external incident response in tandem with internal response.

Alexis Lê-Quôc — Datadog

Google Cloud Issue Summary — Google Drive — 2020-11-16

This is Google’s write-up for an interesting issue that involved repeated re-sending of invitations to edit a Google Drive document.

Google

What I Wish I Knew About Incident Management

I basically want to immediately absorb any article with this title, unless it’s just clickbait spam. This one definitely isn’t.

Ronak Nathani

Scaling Datastores at Slack with Vitess

Lots of juicy details in this one about the difficulty Slack has had in scaling their DB layer and how Vitess solved their problems.

Arka Ganguli, Guido Iaquinti, Maggie Zhou, and Rafael Chacón — Slack

Mitigate Connection Leaks in Production via Proxies

Hitting file descriptor limits is such an annoying kind of outage. Some good tips here, clearly coming from hard-won experience.

Utsav Shah

Improving the Resiliency of Our Infrastructure DNS Zone

They used two providers synced with OctoDNS.

Ryan Timken and Kiran Naidoo — Cloudflare

Root Cause Analysis For Reliability: A Case Study

This is all about understanding the whole system (people and technology) and building learning, rather than finding a superficial “root cause”.

Piyush Verma — Last9

Outages

Solana
Poloniex
New Zealand Reserve Bank
OneDayOnly
- Local e-commerce site OneDayOnly is running Black Friday discount deals again today, after the shopping site was down for a few hours last Friday.
Infura
MobileCause
- This outage occurred on Giving Tuesday, a very important day for nonprofits to raise funds.

SRE Weekly Issue #246

lex

November 29, 2020

General

Articles

One Year of Load Balancing

DNS-based load balancing is a nice simple solution, but unfortunately it doesn’t work well in certain circumstances. Read to find out how Algolia evolved their load balancing system in response.

Paul Berthaux — Algolia

Your Percentiles are incorrect P99 of the times.

We use percentiles all the time, so it’s really important to actually understand what they say (and what they don’t).

Piyush Verma — Last9

Thanks to an anonymous reader for this one.

My journey to SRE into 2020 and beyond

The author started out as an embedded systems developer and moved into SRE. Here’s what they learned.

Eric Uriostigue — effx

How to apologize for server outages and keep users happy

Some great tips here. It’s hard to sound sincere in a public incident report, especially if you post a lot of them.

Adam Fowler

Democratizing Fare Storage at scale using Event Sourcing

In this blog, we discuss how we built Fare Storage, Grab’s single source of truth fare data store, and how we overcame the challenges to make it more reliable and scalable to support our expanding features.

Sourabh Suman — Grab

Simple streaming telemetry

This article covers Netflix’s gnmi-gateway, their open source tool for collecting metrics from network devices in a highly available and fault-tolerant manner.

Colin McIntosh and Michael Costello — Netflix

A guide to the reliability talks at AWS re:Invent

This year, re:Invent is online only, so you still have a chance to attend if you’re interested.

Ana M Medina — Gremlin

A Byzantine failure in the real world

Cloudflare’s API service was impaired early this month. This is their incident report that describes a grey failure in a switch and downstream impact to etcd and their database system.

Tom Lianza and Chris Snook — Cloudflare

Outages

Slack
Giphy
Spotify
Currys PC World
DoorDash
Amazon Prime Video
AWS
- This link points to Amazon’s detailed report on the outage.

SRE Weekly Issue #245

lex

November 22, 2020

General

Articles

Trust Asia 2021 has produced inconsistent STHs

A Certificate Transparency (CT) log failed, resulting in its permanent retirement. The incident involved unintended effects from load testing being performed in a staging environment. I have a huge amount of admiration and respect for the transparency of certification authorities (CAs) when things go wrong.

Trust Asia

Knowing your systems and how they can fail: Twilio and AWS talk at Chaos Conf 2020

I like the idea that adding the ability to fail over to your system makes it much more complicated and thus more likely to fail.

Andre Newman — Gremlin

Building for reliability at HelloSign

This one introduces some interesting concepts: the error kernel and property testing.

Kenneth Cross — HelloSign

Tech Startup Dilemmas: Resilient Deployment vs. Exhaustive Tests

[…] to be resilient, we must test everything, which consumes time that we don’t spend innovating. A good trade-off is to test in production.

Xavier Grand — Algolia

8 Tips to Create an Accurate and Helpful Post-Mortem Incident Report

More useful tips as you develop your post-incident analysis process. I like their definition of “blameless”.

Zachary Flower — Splunk

Achieving exactly-once message processing with Ably

Exactly once delivery is hard to implement and requires explicit coordination at all levels, including the client. Ably explains how their flavor works.

Paddy Byers — Ably

Why you should frequently turn down ~30% of canary instances

The most effective (if scary) way to understand how your stateless service operates under load.

Utsav Shah — Software at Scale

The Engineer’s Guide to Preparing for Black Friday 2020

Some good tips here — and a reminder that we may see even more traffic than normal due to social distancing.

Outages

ASX (Australian Securities Exchange)
Coinbase
GoDaddy
- GoDaddy’s statement took care to explicitly state that the outage was not a security incident. This may be because they appear to have had an unrelated security incident around the same time, and some customer domains were taken over.
Nest

SRE Weekly Issue #244

lex

November 15, 2020

General

Articles

Type in the exact number of machines to proceed

If you’re gonna operate on a pile of computers all at once that numbers 6+ figures, making you type that number in is a way to make you pause and think about what you’re doing.

Rachel by the bay
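
For a rough idea of what a "type the number" guard could look like, here is a minimal sketch. It is not taken from the linked article: the function names and the prompt wording are hypothetical, and a real tool would likely add dry-run output and audit logging.

```python
def confirm_fleet_operation(machine_count: int, typed: str) -> bool:
    """Return True only if the operator typed the exact machine count."""
    return typed.strip() == str(machine_count)


def run_with_confirmation(machines, action, read_input=input):
    """Apply `action` to every machine, but only after the operator
    types the exact number of machines that will be affected.

    `read_input` is injectable so the prompt can be tested without a TTY.
    """
    count = len(machines)
    typed = read_input(
        f"This will affect {count} machines. "
        "Type the exact number to proceed: "
    )
    if not confirm_fleet_operation(count, typed):
        # A bare "y" or a near-miss number does nothing at all.
        return False
    for machine in machines:
        action(machine)
    return True
```

Comparing against the exact count, rather than accepting "y", is the point: the operator has to read the number before they can reproduce it.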

IT metrics: Why the five 9s must go

Find out why they decided to focus less on nines, and what they did instead.

Robert Sullivan

Rule 1: It’s ALWAYS DNS

Reminds me of the classic:

It’s not DNS
There’s no way it’s DNS
It was DNS

— (ssbroski on reddit)
Mike S.

Moving OkCupid from REST to GraphQL

Their front-end made duplicate calls to the new API to test load and response time prior to cutting over.

Michael P. Geraci — OkCupid

New Arctic Air Crash Aftermath Role-Play Simulation Orchestrating a Fundamental Surprise

This is really cool. The researchers created a role-play scenario based on a real plane crash. They tried to get participants to blame “human error”, so that they could then surprise them with all of the (many) contributing factors that were involved.

Emily S. Patterson, Richard I. Cook, David D. Woods, Marta L. Render

From Sysadmin to SRE

Tips from one Sysadmin’s journey to becoming an SRE.

Josh Duffney — Octopus Deploy

Outages

YouTube
Macs
- Mac users had issues launching applications, owing to an outage of ocsp.apple.com. Apple confirmed the issue.
PrometheusKube
- The link points to their awesome writeup of what went wrong and the on-the-fly reworking they had to do to fix it.
Instagram
Hotmail
Various stock trading platforms
- There’s some speculation that this was a result of increased trading volume following Pfizer’s announcement about vaccine trial results.
Robinhood
Increased Error Rates
