SRE Weekly Issue #278

A message from our sponsor, StackHawk:

Learn how our team at StackHawk tests external cookie authentication using Ktor, and check out some of the helper functions we wrote to make the tests easy to write, read, and maintain.
https://sthwk.com/ktor

Articles

Whoa.  This is the best thing ever.  I feel like I want to make this the official theme song of SRE Weekly.

Forrest Brazeal

Their auto-scaling algorithm needed a tweak. Before: scale up by N instances. After: scale up by an amount proportional to the current number of instances.

Fran Garcia — Reddit
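
To make the before/after concrete, here's a minimal sketch of the two scale-up policies described above; the step size and growth factor are made-up values, not Reddit's actual configuration.

    SCALE_UP_STEP = 5        # before: always add a fixed number of instances
    SCALE_UP_FACTOR = 0.25   # after: grow by a fraction of the current fleet size

    def scale_up_fixed(current_instances: int) -> int:
        """Old policy: add the same number of instances regardless of fleet size."""
        return current_instances + SCALE_UP_STEP

    def scale_up_proportional(current_instances: int) -> int:
        """New policy: add instances in proportion to the current fleet size,
        so a large fleet under heavy load grows much faster than a small one."""
        return current_instances + max(1, int(current_instances * SCALE_UP_FACTOR))

    for fleet in (10, 100, 1000):
        print(fleet, "->", scale_up_fixed(fleet), "vs", scale_up_proportional(fleet))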

Here’s a look at incidents and reliability challenges that have occurred in outer space, and what SREs stand to learn from them.

JJ Tang — Rootly

This one includes 3 key things to remember while load testing. My favorite: test the whole system, not just parts.

Cortex

SRE is as much about building consensus and earning buy-in as it is about actual engineering.

Cortex

The definition of NoOps in this article is clearer than others I’ve seen. It’s not about firing your operations team — their skill set is still necessary.

Kentaro Wakayama

Even though I know what observability is, I got a lot out of this article. It has some excellent examples of questions that are hard to answer with traditional dashboards, and includes my new favorite term:

The industrial term for this problem is Watermelon Metrics; A situation where individual dashboards look green, but the overall performance is broken and red inside.

Nishant Modak and Piyush Verma — Last9
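
As a rough illustration of the watermelon effect (with invented numbers), three services can each clear a per-component 99% success target while the end-to-end request, which traverses all of them, misses a 99% user-facing objective:

    component_success = {"edge": 0.995, "api": 0.992, "database": 0.993}
    PER_COMPONENT_TARGET = 0.99
    END_TO_END_TARGET = 0.99

    end_to_end = 1.0
    for name, rate in component_success.items():
        print(f"{name}: {rate:.3f} ({'green' if rate >= PER_COMPONENT_TARGET else 'red'})")
        end_to_end *= rate

    # Each component reads green on its own dashboard, but the product of the
    # three success rates is about 0.980, so the user-facing view is red.
    print(f"end-to-end: {end_to_end:.3f} "
          f"({'green' if end_to_end >= END_TO_END_TARGET else 'red'})")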

Instead, we should consider the fields where practitioners are responsible for controlling a dynamic process that’s too complex for humans to fully understand.

Lorin Hochstein

In this epic troubleshooting story, a weird curl bug coupled with Linux memory tuning parameters led to unexpected CPU consumption in an unrelated process.

Pavlos Parissis — Booking.com

Learning a lesson from a rough Black Friday in 2019, these folks used load testing to gather hard data on how they would likely fare in 2020.

Mathieu Garstecki — Back Market

Outages

SRE Weekly Issue #277

A message from our sponsor, StackHawk:

Planetly saved weeks of work by implementing StackHawk instead of building an internal ZAP service. See how:
https://sthwk.com/planetly-stackhawk

Articles

Remember all those Robinhood outages? FINRA, the US financial industry’s regulator, is making Robinhood repay folks for the losses they sustained as a result and also fining them for other reasons.

Michelle Ong, Ray Pellecchia, Angelita Plemmer Williams, and Andrew DeSouza — FINRA

This is brilliant and I wish I’d thought of it years ago:

One of the things we’ve previously seen during database incidents is that a set of impacted tables can provide a unique fingerprint to identify a feature that’s triggering issues.

Courtney Wang — Reddit
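
A toy sketch of that fingerprinting idea, with invented feature names and table sets (not Reddit’s actual schema): map each feature to the tables it touches, then rank features by how much of the impacted-table set they explain.

    FEATURE_TABLES = {
        "comments": {"comment", "comment_vote", "thing"},
        "chat": {"chat_message", "chat_channel"},
        "awards": {"award", "award_purchase", "thing"},
    }

    def candidate_features(impacted_tables: set[str]) -> list[str]:
        """Return features ordered by how many impacted tables they account for."""
        scored = [
            (len(tables & impacted_tables), feature)
            for feature, tables in FEATURE_TABLES.items()
            if tables & impacted_tables
        ]
        return [feature for _, feature in sorted(scored, reverse=True)]

    # During an incident, feed in the tables showing errors or elevated latency:
    print(candidate_features({"comment", "comment_vote"}))  # ['comments']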

The suggested root cause involves consolidation in cloud providers and the importance of DNS.

Alban Kwan — CircleID

Full disclosure: Fastly, my employer, is mentioned.

This paper is about recognizing normalization of deviance and techniques for dealing with it. This tidbit really made me think:

[…] they might have been taught a system deviation without realizing that it was so […]

Business Horizons

Blameless incident analysis is often at odds with a desire to “hold people accountable”. This article explores that conflict and techniques for managing the needs involved.

Christina Tan and Emily Arnott — Blameless

What can you do if you’re out of error budget but you still want to deliver new features? Get creative.

Paul Osman — Honeycomb

I am going to go through the variation we use to upskill our on-call engineers, which we call “The Kobayashi Maru”, a name borrowed from the Star Trek training exercise used to test the character of Starfleet cadets.

Bruce Dominguez

Outages

SRE Weekly Issue #276

A message from our sponsor, StackHawk:

Get ready for some GraphQL! Tune in this Tuesday, June 29 at 9 AM PT for an automated GraphQL security testing learning lab. Register:
http://sthwk.com/graphql-learning-lab

Articles

HBO accidentally sent an email to a bunch of people, and they tweeted (jokingly?) blaming their intern. This is a link to a short, thoughtful response thread.

Gergely Orosz

This is the story of the Bunny CDN outage linked below. Great read, thanks folks!

Dejan Grofelnik Pelzel — Bunny

There’s never a bad time to review the fallacies of distributed computing. This article introduces them with examples and discussion of each.

Alex Diaconu — Ably

These aren’t specific tools, but rather 7 classes of tools (with examples). They are:

  • Chaos engineering
  • Monitoring and alerting
  • Observability
  • Paging tools
  • SLO management
  • Infrastructure-as-Code (and everything-as-code)
  • Automated incident response

Quentin Rousseau — Rootly

Design is interpretive. We have to find common ground before we can even start to create a design, but finding that common ground is part of the design.

For example, we think of building codes as being precise, but when applied to new situations, they are ambiguous, and the engineers must make a judgment about how to apply them.

Lorin Hochstein

This starts with a really neat moment in which the interviewer asks Yiu to talk about lessons from her jewelry-making hobby that she applies to SRE.

Kurt Andersen

When GameStop’s stock shot through the roof earlier this year, Reddit’s traffic did too. This is the first article in a short series by Reddit’s SRE team on how they handled the influx.

This article is about the ways that user actions affected their systems in unexpected ways, and how they responded.

Courtney Wang — Reddit

Recently in our Site Reliability Engineering organization in Azure, we established a set of cultural values that we hold ourselves and each other accountable to.

Bill Johnson — Microsoft

Outages

SRE Weekly Issue #275

A message from our sponsor, StackHawk:

Join ZAP Founder & Project Lead Simon Bennetts on June 30 for a live AMA where he will be answering questions on all things open source and AppSec. Register:
http://sthwk.com/Simon-AMA

Articles

Here’s a take on incident severity levels. I enjoy learning what criteria folks use for this, so please send similar articles my way (or maybe write your own?).

Nancy Chauhan — Rootly

Counterfactuals (“should haves”) stifle incident retrospectives by tempting us to stop digging deeper. This article points out that there are unending possible counterfactuals for any incident.

Michael Nygard

Read to find out why counting incidents (or “# days since an outage”) won’t help and will cause more problems than it solves. Also included: options for what to count instead.

incident.io

Sloth is a tool for generating SLOs as Prometheus metrics, claiming to support “any kind of service”.

Xabier Larrakoetxea

If you’re looking for a way to evaluate your SRE process, this might help.

Alex Bramley — Google

This article tries to put an actual number on the cost of adding more nines of reliability.

Jack Shirazi — Expedia
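
The underlying arithmetic is worth keeping in mind: each extra nine cuts the allowed downtime by a factor of ten, which is one way to see why each additional nine tends to cost far more than the last. A quick back-of-the-envelope calculation:

    MINUTES_PER_YEAR = 365 * 24 * 60

    for availability in (0.99, 0.999, 0.9999, 0.99999):
        allowed_downtime = MINUTES_PER_YEAR * (1 - availability)
        print(f"{availability:.3%} availability -> "
              f"{allowed_downtime:8.1f} minutes of downtime per year")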

It’s time for Catchpoint’s yearly SRE report, downloadable in PDF form through this link. Note: you have to give them your email address.

Catchpoint

Outages

  • Akamai
    • This outage impacted banks and airlines, among other Akamai customers.

SRE Weekly Issue #274

A message from our sponsor, StackHawk:

Join the GraphQL Security Testing Learning Lab on June 29 at 9 AM PT. Learn how to run automated security testing against your GraphQL APIs so you can find and fix vulnerabilities fast.
http://sthwk.com/graphql-learning-lab

Articles

The last section suggests selling SLOs to executives by likening them to OKRs or KPIs.

Austin Parker — Devops.com

Lowe’s is a home improvement retailer in North America. I often find it fascinating when I learn that a company that’s not seen as being in the tech-sector has a robust SRE practice.

Vivek Balivada and Rahul Mohan Kola Kandy — Lowe’s

The hallmark of sociological storytelling is if it can encourage us to put ourselves in the place of any character, not just the main hero/heroine, and imagine ourselves making similar choices.

Lorin Hochstein

This is brilliant: they apply DevOps and SRE practices to the challenging work of raising two autistic children.

Zac Nickens — USENIX ;login:

I especially like how their bot automatically pages reinforcements after folks have been on an incident for long enough to become fatigued.

Daniella Niyonkuru

Rather than measuring Mean Time To Recovery for incidents, let’s track our Mean Time To Retrospective.

Robert Ross — FireHydrant

Outages

  • Fastly
    • Fastly had a global outage of their CDN service, with many 5xx errors for around 40 minutes and diminished cache hit ratios afterward. Many Fastly customers experienced degradation, notably including Amazon, Reddit, and GitHub, among others.

      Fastly posted a summary shortly after the incident, describing a latent bug that was triggered by a customer’s (valid) configuration change.

      Full disclosure: Fastly is my employer.

  • Salesforce
  • Facebook, Instagram, and WhatsApp
A production of Tinker Tinker Tinker, LLC