SRE Weekly Issue #394

A warm welcome to my new sponsor, FireHydrant!

A message from our sponsor, FireHydrant:

The 2023 DORA report has two conclusions with big impacts on incident management: incremental steps matter, and good culture contributes to performance. Dig into both topics and explore ideas for how to start making incremental improvements of your own.
https://firehydrant.com/ebook/dora-2023-incident-management/

This article gives an example checklist for a database version upgrade in RDS and explains why checklists can be so useful for changes like this.

  Nick Janetakis

The distinction in this article is between responding at all and responding correctly. Different techniques solve for availability vs reliability.

  incident.io

Latency and throughput are inextricably linked in TCP, and this article explains why with a primer on congestion windows and handshakes.

  Roberto Vitillo
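
A quick back-of-the-envelope sketch of that relationship (the numbers here are mine, not the article's): a single connection's throughput is bounded by its congestion window divided by the round-trip time, which is why latency caps throughput no matter how much bandwidth the link has.

    # Rough illustration: a single TCP connection's throughput is capped by cwnd / RTT.
    def max_throughput_bytes_per_sec(cwnd_bytes: int, rtt_seconds: float) -> float:
        """Upper bound on throughput for one TCP connection."""
        return cwnd_bytes / rtt_seconds

    cwnd = 64 * 1024   # a 64 KiB congestion window
    rtt = 0.100        # a 100 ms round trip
    # ~640 KiB/s, regardless of how fast the underlying link is
    print(max_throughput_bytes_per_sec(cwnd, rtt) / 1024)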

Tail latency has a huge impact on throughput and on the overall user experience. Measuring average latency just won’t cut it.

  Roberto Vitillo
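
To make the point concrete with a made-up latency sample (not data from the article): the mean can look perfectly healthy while the tail is awful.

    import statistics

    # Hypothetical sample: 95% of requests are fast, 5% hit a slow path.
    latencies_ms = [10] * 950 + [1500] * 50

    mean = statistics.mean(latencies_ms)                            # 84.5 ms -- looks tolerable
    p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms)) - 1]   # nearest-rank p99 = 1500 ms

    print(f"mean={mean:.1f} ms  p99={p99} ms")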

Is it really wrong though? Is it?

  Adam Gordon Bell — Earthly

I’ve shared the FAA’s infographic of the Dirty Dozen here previously, but here’s a more in-depth look at the first six items.

  Dr. Omar Memon — Simple Flying

It’s often necessary to go through far more than five whys to understand what’s really going on in a sociotechnical system.

  rachelbythebay

I found the bit about the AWS Incident/Communication Manager on-call role pretty interesting.

  Prathamesh Sonpatki — SRE Stories

SRE Weekly Issue #393

A message from our sponsor, Rootly:

Rootly is proud to have been recognized by G2 as a High Performer and Enterprise Leader in Incident Management for the sixth consecutive quarter! In total, we received nine G2 awards in the Summer Report. As a thank-you to our community, we’re giving away some awesome Rootly swag. Read our CEO’s blog post and pick up some free swag here:
https://rootly.com/blog/celebrating-our-nine-new-g2-awards

This repo contains a path to learn SRE, in the form of a list of concepts to familiarize oneself with.

  Teiva Harsanyi

How can we justify the (sometimes significant) expense of instilling observability into our systems?

  Nočnica Mellifera — SigNoz

It was DNS. Cloudflare’s 1.1.1.1 recursive DNS service failed this week, stemming from failure to parse the new ZONEMD record type.

  Ólafur Guðmundsson — Cloudflare

Rather than just dry theory, this article helps you understand what the CAP theorem means in practice as you choose a data store.

Note: this link was 504ing at time of publishing, so here’s the archive.org copy.

  Bala Kalavala — Open Source For U

A “blameless” culture can get in the way if it means you’re not allowed to make any mention of who was at the pointy end of your system when things blew up.

  incident.io

In this post, we will share how we formalized the LinkedIn Business Continuity & Resilience Program, how this new program helped increase our customers’ confidence in our operations, and the lessons that we learned as we attained ISO 22301 certification.

  Chau Vu — LinkedIn

This is the start of a 6-article series, with each article covering one week of a plan to prepare for SRE interviews.

We’ll spend each week focusing on building up your expertise in the key areas SREs need to know, like automation, monitoring, incident response, etc.

  Code Reliant

Beyond the CAP theorem, what actually happens during a partition?

“If there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency and consistency (L and C)?” [Daniel J. Abadi]

  Lohith Chittineni

SRE Weekly Issue #392

A message from our sponsor, Rootly:

Rootly is proud to have been recognized by G2 as a High Performer and Enterprise Leader in Incident Management for the sixth consecutive quarter! In total, we received nine G2 awards in the Summer Report. As a thank-you to our community, we’re giving away some awesome Rootly swag. Read our CEO’s blog post and pick up some free swag here:
https://rootly.com/blog/celebrating-our-nine-new-g2-awards

In the midst of industry discussions about productivity and automation, it’s all too easy to overlook the importance of properly reckoning with complexity.

There’s a cool bit in there about redistributing complexity rather than simply getting rid of it, using microservices as an example.

  Ken Mugrage — Thoughtworks — MIT Technology Review

Interesting idea: if we go too far toward making incident investigations comfortable and routine, we can make learning less likely.

  Dane Hillard — Jeli

A problem with P99 is that 1% of your customers have a worse experience, and P99 doesn’t capture how much worse.

   Cynthia Dunlop — The New Stack
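
A small sketch of what the article is getting at, with hypothetical numbers: two services can report the exact same p99 while treating their slowest 1% of users very differently.

    import statistics

    def p99_and_tail_mean(latencies_ms):
        """Nearest-rank p99, plus the mean latency of the slowest 1% of requests."""
        s = sorted(latencies_ms)
        cutoff = int(0.99 * len(s))
        return s[cutoff - 1], statistics.mean(s[cutoff:])

    # Same p99, very different tails.
    service_a = [50] * 985 + [200] * 15
    service_b = [50] * 985 + [200] * 5 + [5000] * 10

    print(p99_and_tail_mean(service_a))  # (200, 200.0)  -- the slowest 1% is fine
    print(p99_and_tail_mean(service_b))  # (200, 5000.0) -- the slowest 1% suffers badly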

Lambda isn’t “NoOps”, it’s just a different flavor of ops.

  Ernesto Marquez — Concurrency Labs

Salesforce had a major outage earlier this month, and now they’ve posted this followup analysis.

  Salesforce

This sysadmin story is a lesson in understanding the full context before passing judgement.

  rachelbythebay

Things get interesting toward the end, where they warn that focusing too narrowly on learning from incidents can cause problems.

  Luis Gonzalez — incident.io

The fail fast pattern is highly relevant for building reliable distributed systems. Rapid error detection and failure propagation prevent localized issues from cascading across system components.

  Code Reliant
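
As a rough sketch of the pattern (the function and numbers are my own, not from the article): failing fast means surfacing a sick dependency in milliseconds instead of letting every caller hang behind it.

    import socket

    class DependencyUnavailable(Exception):
        """Raised immediately so callers can degrade gracefully instead of hanging."""

    def check_dependency(host: str, port: int, timeout_s: float = 0.5) -> bool:
        # A short connect timeout surfaces the problem in half a second rather than
        # letting requests queue up behind an unresponsive dependency.
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError as exc:
            raise DependencyUnavailable(f"{host}:{port} unreachable: {exc}") from exc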

SRE Weekly Issue #391

A message from our sponsor, Rootly:

Rootly is proud to have been recognized by G2 as a High Performer and Enterprise Leader in Incident Management for the sixth consecutive quarter! In total, we received nine G2 awards in the Summer Report. As a thank-you to our community, we’re giving away some awesome Rootly swag. Read our CEO’s blog post and pick up some free swag here:
https://rootly.com/blog/celebrating-our-nine-new-g2-awards

Articles

Operating complex systems is about creating accurate mental models, and abstractions are a key ingredient.

   Code Reliant

Why is it hard to get an organization to focus on LFI (learning from incidents) rather than RCA (root cause analysis)? Here’s a really great explanation.

  Lorin Hochstein

It’s about more than just money — like engineer morale, slowed innovation, and lost customers.

  Aaron Lober — Blameless

A great primer on the CAP theorem with a real-world example scenario.

  Lohith Chittineni

It’s really interesting to see how they handled distributed queuing and throttling across a highly distributed cache network without sacrificing speed.

  George Thomas — Cloudflare

[…] LLMs are black boxes that produce nondeterministic outputs and cannot be debugged or tested using traditional software engineering techniques. Hooking these black boxes up to production introduces reliability and predictability problems that can be terrifying.

  Charity Majors — Honeycomb
  Full disclosure: Honeycomb is my employer.

Dig into and understand how enough things work, and eventually you’ll look like a wizard.

  Rachel By the Bay

As a rule of thumb, always set timeouts when making network calls. And if you build libraries, always set reasonable default timeouts and make them configurable for your clients.

  Roberto Vitillo
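
Here’s a minimal sketch of that rule of thumb using only the Python standard library; the class name and default values are illustrative, not from the article. The default keeps careless callers bounded, and the per-call override keeps the library flexible.

    import urllib.request

    DEFAULT_TIMEOUT_S = 5.0  # a bounded default for callers who never think about timeouts

    class HttpClient:
        """Tiny wrapper that always applies a timeout but lets callers override it."""

        def __init__(self, timeout_s: float = DEFAULT_TIMEOUT_S):
            self.timeout_s = timeout_s

        def get(self, url: str, timeout_s: float | None = None) -> bytes:
            # A per-call override wins; otherwise fall back to the client-wide default.
            effective = timeout_s if timeout_s is not None else self.timeout_s
            with urllib.request.urlopen(url, timeout=effective) as resp:
                return resp.read()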

SRE Weekly Issue #390

Many apologies to my email subscribers, who have seen two accidental re-sends of old issues recently due to a weird glitch in my automation. I think I’ve gotten a handle on it, and I’ll run an internal retrospective of this incident, of course.

A message from our sponsor, Rootly:

When incidents impact your customers, failing to communicate with them effectively can erode trust even further and compound an already difficult situation. Learn the essentials of customer-facing incident communication in Rootly’s latest blog post:
https://rootly.com/blog/the-medium-is-the-message-how-to-master-the-most-essential-incident-communication-channels

Articles

Is it really SRE vs platform engineer? Or is there a way platforms can take site reliability to the next level?

  Jennifer Riggins — The New Stack

A surgeon delves into the key component that allows a group of skilled individuals to work effectively and safely together, using the term “heed” to describe this special interaction.

Sidenote: in a hilarious coincidence this article managed to spoil me on a movie I was in the middle of watching (Arrival) — but it also put me in a really cool mindset to watch the rest of the film.

  Dr. Rob Poston

More details on Square’s outage from a couple weeks ago (it was DNS).

  Square

Azure had an interesting outage in its Australia East region involving a power failure and the order in which cooling units were restored.

  Microsoft Azure

Asking this question is how you unlock the hidden essence of an incident. This talk compares two public incident reports to show what it looks like when you dig into this question and when you don’t.

  Jacob Scott — InfoQ

In this air accident, the pilots made a seemingly inexplicable mistake.

This sentence really stood out to me, especially after reading the “How Did It Make Sense at the Time?” article:

When we inexplicably grab the wrong utensil when cooking or accidentally start taking our dirty dishes to the bathroom instead of the kitchen, we should be thankful that we aren’t responsible for a plane full of people.

  Admiral Cloudberg

There’s an interesting failure mode in this one that might stand out for the Kafka admins among us:

The Kafka consumer ended up stuck in a loop, unable to stabilize fast enough before timing out and restarting the coordination process.

  Jakub Oleksy — GitHub

After explaining the difference between the ITIL terms “incident management” and “problem management”, this article goes into a discussion of recent trends and whether it still makes sense to draw a distinction between the two.

  Luis Gonzalez — incident.io

A production of Tinker Tinker Tinker, LLC