SRE WEEKLY – Page 48 – scalability, availability, incident response, automation

SRE Weekly Issue #278

lex

July 11, 2021

Articles

That Sinking Feeling (The #HugOps Song)

Whoa. This is the best thing ever. I feel like I want to make this the official theme song of SRE Weekly.

Forrest Brazeal

r/WallStreetBets Incident Anthology (What Worked Edition): Autoscaler

Their auto-scaling algorithm needed a tweak. Before: scale up by N instances. After: scale up by an amount proportional to the current number of instances.

Fran Garcia — Reddit
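The before-and-after autoscaler change described above can be sketched roughly as follows. This is an illustration only, not Reddit's actual code: the function names, the 5-instance step, and the 20% growth figure are all assumptions made for the example.

```python
def fixed_step_target(current: int, step: int) -> int:
    """'Before' policy: always add the same number of instances."""
    return current + step


def proportional_target(current: int, percent: int, min_step: int = 1) -> int:
    """'After' policy: add a share of the current fleet size.

    `percent` and `min_step` are hypothetical knobs; integer ceiling
    division avoids floating-point rounding surprises.
    """
    growth = -(-(current * percent) // 100)  # ceil(current * percent / 100)
    return current + max(min_step, growth)


if __name__ == "__main__":
    for fleet in (10, 100, 1000):
        print(fleet, fixed_step_target(fleet, 5), proportional_target(fleet, 20))
```

With these made-up numbers, a fleet of 1,000 instances grows by 5 under the fixed policy but by 200 under the proportional one, which is why the proportional version keeps pace when a large fleet suddenly needs more capacity.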

The Incident Review: 4 Incidents in Outer Space

Here’s a look at incidents and reliability challenges that have occurred in outer space, and what SREs stand to learn from them.

JJ Tang — Rootly

Prepare for overnight success — with the right load testing approach

This one includes 3 key things to remember while load testing. My favorite: test the whole system, not just parts.

Cortex

4 ways to improve your influence as an SRE

SRE is as much about building consensus and earning buy-in as it is about actual engineering.

Cortex

NoOps: What Does the Future Hold for DevOps Engineers?

The definition of NoOps in this article is clearer than others I’ve seen. It’s not about firing your operations team; their skill set is still necessary.

Kentaro Wakayama

Systems Observability

Even though I know what observability is, I got a lot out of this article. It has some excellent examples of questions that are hard to answer with traditional dashboards, and includes my new favorite term:

The industrial term for this problem is Watermelon Metrics; A situation where individual dashboards look green, but the overall performance is broken and red inside.

Nishant Modak and Piyush Verma — Last9

Controlling a process we don’t understand

Instead, we should consider the fields where practitioners are responsible for controlling a dynamic process that’s too complex for humans to fully understand.

Lorin Hochstein

Troubleshooting: A journey into the unknown

In this epic troubleshooting story, a weird curl bug coupled with Linux memory tuning parameters led to unexpected CPU consumption in an unrelated process.

Pavlos Parissis — Booking.com

How Back Market SREs prepared for Black Friday

Learning a lesson from a rough Black Friday in 2019, these folks used load testing to gather hard data on how they would likely fare in 2020.

Mathieu Garstecki — Back Market

Outages

SRE Weekly Issue #277

lex

July 4, 2021

Articles

FINRA Orders Record Financial Penalties Against Robinhood Financial LLC

Remember all those Robinhood outages? The US financial regulatory agency is making Robinhood repay folks for the losses they sustained as a result and also fining them for other reasons.

Michelle Ong, Ray Pellecchia, Angelita Plemmer Williams, and Andrew DeSouza — FINRA

r/WallStreetBets Incident Anthology: More Data, More Problems

This is brilliant and I wish I’d thought of it years ago:

One of the things we’ve previously seen during database incidents is that a set of impacted tables can provide a unique fingerprint to identify a feature that’s triggering issues.

Courtney Wang — Reddit

The Deeper Root Cause of the Fastly and Akamai Outages

The suggested root cause involves consolidation in cloud providers and the importance of DNS.

Alban Kwan — CircleID

Full disclosure: Fastly, my employer, is mentioned.

The normalization of deviance in healthcare delivery

This paper is about recognizing normalization of deviance and techniques for dealing with it. This tidbit really made me think:

[…] they might have been taught a system deviation without realizing that it was so […]

Bus Horiz

Elephant in the Blameless War Room: Accountability

Blameless incident analysis is often at odds with a desire to “hold people accountable”. This article explores that conflict and techniques for managing the needs involved.

Christina Tan and Emily Arnott — Blameless

Shipping on a Spent Error Budget

What can you do if you’re out of error budget but you still want to deliver new features? Get creative.

Paul Osman — Honeycomb

The SRE Incident Response game

I am going to go through the variation we use to up skill our on-call engineers we called “The Kobayashi Maru”, the name we borrowed from the Star Trek training exercise to test the character of Starfleet cadets.

Bruce Dominguez

Outages

SRE Weekly Issue #276

lex

June 27, 2021

Articles

@GergelyOrosz on blaming the intern

HBO accidentally sent an email to a bunch of people, and they tweeted (jokingly?) blaming their intern. This is a link to a short, thoughtful response thread.

Gergely Orosz

The stack overflow of death. How we lost DNS and what we’re doing to prevent this in the future.

This is the story of the Bunny CDN outage linked below. Great read, thanks folks!

Dejan Grofelnik Pelzel — Bunny

Navigating the 8 fallacies of distributed computing

There’s never a bad time to review the fallacies of distributed computing. This article introduces them with examples and discussion of each.

Alex Diaconu — Ably

7 Essential Tools for SREs

These aren’t specific tools, but rather 7 classes of tools (with examples). They are:

Chaos engineering

Monitoring and alerting

Observability

Paging tools

SLO management

Infrastructure-as-Code (and everything-as-code)

Automated incident response

Quentin Rousseau — Rootly

Designing like a joint cognitive system

Design is interpretive. We have to find common ground before we can even start to create a design, but finding that common ground is part of the design.

For example, we think of building codes as being precise, but when applied to new situations, they are ambiguous, and the engineers must make a judgment about how to apply them.

Lorin Hochstein

Resilience in Action E8: Vanessa Yiu on Crafting Enterprise Architecture

This starts with a really neat moment in which the interviewer asks Yiu to talk about lessons from her jewelry-making hobby that she applies to SRE.

Kurt Andersen

r/WallStreetBets Incident Anthology: Reddit’s Open Systems

When Gamestop’s stock shot through the roof earlier this year, Reddit’s traffic did too. This is the first article in a short series by Reddit’s SRE team on how they handled the influx.

This article is about the ways that user actions affected their systems in unexpected ways, and how they responded.

Courtney Wang — Reddit

SRE Cultural Values

Recently in our Site Reliability Engineering organization in Azure, we established a set of cultural values that we hold ourselves and each other accountable to.

Bill Johnson — Microsoft

Outages

Western Digital “My Book Live” hard drives
Amazon Prime Video and Alexa
PharmOutcomes
- PharmOutcomes is a SaaS used by pharmacies.
Commonwealth Bank
Medium
- I’ve gotten a few 500s from Medium while trying to review articles last week and this week. Maybe it’s this incident on their status page?
Bunny (CDN)
Reddit
- This post on their status site says “API errors”, but I saw rumblings that suggested that Reddit itself was down.

SRE Weekly Issue #275

lex

June 21, 2021

Articles

Practical Guide to SRE: Incident Severity Levels

Here’s a take on incident severity levels. I enjoy learning what criteria folks use for this, so please send similar articles my way (or maybe write your own?).

Nancy Chauhan — Rootly

Counterfactuals are not Causality

Counterfactuals (“should haves”) stifle incident retrospectives by tempting us to stop digging deeper. This article points out that there are unending possible counterfactuals for any incident.

Michael Nygard

Don’t count your incidents, make your incidents count

Read to find out how counting incidents (or “# days since an outage”) won’t help and will cause more problems than it’s worth. Also included: options for what to count instead.

incident.io

SLOs should be easy, say hi to Sloth

Sloth is a tool for generating SLOs as Prometheus metrics, claiming to support “any kind of service”.

Xabier Larrakoetxea

Evaluating where your team lies on the SRE spectrum

If you’re looking for a way to evaluate your SRE process, this might help.

Alex Bramley — Google

The Cost of 100% Reliability

This article tries to put an actual number on the cost of adding more nines of reliability.

Jack Shirazi — Expedia
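For a sense of what each extra nine buys, here is a small back-of-the-envelope calculation. It does not reproduce the article's cost figures; it only converts an availability target into allowed downtime per year, assuming a 365-day year. The function name is made up for this sketch.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines.

    For example, nines=3 means 99.9% availability, so 0.1% of the
    year may be spent down.
    """
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Each additional nine cuts the downtime budget by a factor of ten, from roughly 5,256 minutes a year at two nines to about 525.6 at three.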

2021 SRE Report

It’s time for Catchpoint’s yearly SRE report, downloadable in PDF form through this link. Note: you have to give them your email address.

Catchpoint

Outages

Akamai
- This outage impacted banks and airlines, among other Akamai customers.

SRE Weekly Issue #274

lex

June 13, 2021

Articles

Chicken Soup for the SLO

The last section suggests selling SLOs to executives by likening them to OKRs or KPIs.

Austin Parker — Devops.com

How Lowe’s leverages Google SRE practices

Lowe’s is a home improvement retailer in North America. I often find it fascinating when I learn that a company that’s not seen as being in the tech sector has a robust SRE practice.

Vivek Balivada and Rahul Mohan Kola Kandy — Lowe’s

Incident writeup as sociological storytelling

The hallmark of sociological storytelling is if it can encourage us to put ourselves in the place of any character, not just the main hero/heroine, and imagine ourselves making similar choices.

Lorin Hochstein

DevOps & Autism Care

This is brilliant: they apply DevOps and SRE practices to the challenging work of raising two autistic children.

Zac Nickens — USENIX ;login:

Implementing ChatOps into our Incident Management Procedure

I especially like how their bot automatically pages reinforcements after folks have been on an incident for long enough to become fatigued.

Daniella Niyonkuru

The MTTR that matters

Rather than measuring Mean Time To Recovery for incidents, let’s track our Mean Time To Retrospective.

Robert Ross — FireHydrant

Outages

Fastly
- Fastly had a global outage of their CDN service, with many 5xx errors for around 40 minutes and diminished cache hit ratios afterward. Many Fastly customers experienced degradation, notably including Amazon, Reddit, and GitHub.
  Fastly posted a summary shortly after the incident, describing a latent bug that was triggered by a customer’s (valid) configuration change.
  
  Full disclosure: Fastly is my employer.
Salesforce
Facebook, Instagram, and WhatsApp
