SRE Weekly Issue #392

A message from our sponsor, Rootly:

Rootly is proud to have been recognized by G2 as a High Performer and Enterprise Leader in Incident Management for the sixth consecutive quarter! In total, we received nine G2 awards in the Summer Report. As a thank-you to our community, we’re giving away some awesome Rootly swag. Read our CEO’s blog post and pick up some free swag here:
https://rootly.com/blog/celebrating-our-nine-new-g2-awards

Articles

In the midst of industry discussions about productivity and automation, it’s all too easy to overlook the importance of properly reckoning with complexity.

There’s a cool bit in there about redistributing complexity rather than simply getting rid of it, using microservices as an example.

  Ken Mugrage — Thoughtworks — MIT Technology Review

Interesting idea: if we go too far toward making incident investigations comfortable and routine, we can make learning less likely.

  Dane Hillard — Jeli

A problem with P99 is that 1% of your customers have a worse experience, and P99 doesn’t capture how much worse.

   Cynthia Dunlop — The New Stack
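
To make the P99 point concrete, here is a minimal sketch with made-up numbers. The two latency samples and the `percentile` helper are hypothetical and not taken from the linked article; the helper uses the simple nearest-rank definition of a percentile.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Two hypothetical services, 1000 requests each (latencies in milliseconds).
# 99% of requests are identical; only the slowest 1% differs.
mild_tail = [100] * 990 + [150] * 10
severe_tail = [100] * 990 + [30_000] * 10

for name, samples in [("mild", mild_tail), ("severe", severe_tail)]:
    print(name, "p99 =", percentile(samples, 99), "ms, max =", max(samples), "ms")
```

Both samples report the same P99, yet the slowest 1% of requests wait 150 ms in one case and 30 seconds in the other. Looking at the maximum or at higher percentiles alongside P99 is one way to see that difference.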

Lambda isn’t “NoOps”, it’s just a different flavor of ops.

  Ernesto Marquez — Concurrency Labs

Salesforce had a major outage earlier this month, and now they’ve posted this followup analysis.

  Salesforce

This sysadmin story is a lesson in understanding the full context before passing judgement.

  rachelbythebay

Things get interesting toward the end, where they warn that focusing too narrowly on learning from incidents can cause problems.

  Luis Gonzalez — incident.io

The fail fast pattern is highly relevant for building reliable distributed systems. Rapid error detection and failure propagation prevent localized issues from cascading across system components.

  Code Reliant
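
As a rough illustration of failing fast (not code from the linked article), the sketch below rejects a request as soon as a problem is detectable, before doing any further work. The names `handle_request`, `DependencyUnavailable`, and the `dependency_healthy` callback are all hypothetical.

```python
class DependencyUnavailable(Exception):
    """Raised when a downstream dependency is already known to be unhealthy."""


def handle_request(payload, dependency_healthy):
    # Check the cheapest failure conditions first and raise immediately,
    # so that the caller learns about the problem without waiting.
    if "user_id" not in payload:
        raise ValueError("missing user_id")
    if not dependency_healthy():
        raise DependencyUnavailable("downstream is unhealthy; not attempting the call")
    # Only reached when the preconditions hold.
    return {"user_id": payload["user_id"], "status": "ok"}


if __name__ == "__main__":
    print(handle_request({"user_id": 42}, lambda: True))
    try:
        handle_request({"user_id": 42}, lambda: False)
    except DependencyUnavailable as exc:
        print("failed fast:", exc)
```

The caller gets a specific error right away instead of a slow timeout, which gives it the chance to shed load or fall back before the problem spreads.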

SRE Weekly Issue #391

Articles

Operating complex systems is about creating accurate mental models, and abstractions are a key ingredient.

  Code Reliant

Why is it hard to get an organization to focus on LFI (learning from incidents) rather than RCA (root cause analysis)? Here’s a really great explanation.

  Lorin Hochstein

It’s about more than just money — like engineer morale, slowed innovation, and lost customers.

  Aaron Lober — Blameless

A great primer on the CAP theorem with a real-world example scenario.

  Lohith Chittineni

It’s really interesting to see how they handled distributed queuing and throttling across a highly distributed cache network without sacrificing speed.

  George Thomas — Cloudflare

[…] LLMs are black boxes that produce nondeterministic outputs and cannot be debugged or tested using traditional software engineering techniques. Hooking these black boxes up to production introduces reliability and predictability problems that can be terrifying.

  Charity Majors — Honeycomb
  Full disclosure: Honeycomb is my employer.

Dig into and understand how enough things work, and eventually you’ll look like a wizard.

  Rachel By the Bay

As a rule of thumb, always set timeouts when making network calls. And if you build libraries, always set reasonable default timeouts and make them configurable for your clients.

  Roberto Vitillo
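
The rule of thumb above could look something like the following in a small client library. This is only a sketch: the `Client` class, its 5-second default, and the example URL are assumptions for illustration, while `urllib.request.urlopen` and its `timeout` parameter come from the Python standard library.

```python
import urllib.request

# Hypothetical default; a real library would choose this based on its use case.
DEFAULT_TIMEOUT_SECONDS = 5.0


class Client:
    """Tiny HTTP client wrapper that always applies a timeout."""

    def __init__(self, base_url, timeout=DEFAULT_TIMEOUT_SECONDS):
        # Refuse "no timeout" so that a call can never block forever.
        if timeout is None or timeout <= 0:
            raise ValueError("timeout must be a positive number of seconds")
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout

    def get(self, path):
        url = f"{self.base_url}/{path.lstrip('/')}"
        # The timeout applies to blocking socket operations such as
        # connecting and reading, not to the request as a whole.
        with urllib.request.urlopen(url, timeout=self.timeout) as response:
            return response.read()


# Callers get a sensible default but can override it per client.
default_client = Client("https://example.invalid")
patient_client = Client("https://example.invalid", timeout=30)
print(default_client.timeout, patient_client.timeout)
```

Because the default lives in the constructor, callers who never think about timeouts still get one, and callers with unusual latency requirements can tune it.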

SRE Weekly Issue #390

Many apologies to my email subscribers, who have seen two accidental re-sends of old issues recently due to a weird glitch in my automation. I think I’ve gotten a handle on it, and I’ll run an internal retrospective of this incident, of course.

A message from our sponsor, Rootly:

When incidents impact your customers, failing to communicate with them effectively can erode trust even further and compound an already difficult situation. Learn the essentials of customer-facing incident communication in Rootly’s latest blog post:
https://rootly.com/blog/the-medium-is-the-message-how-to-master-the-most-essential-incident-communication-channels

Articles

Is it really SRE vs platform engineer? Or is there a way platforms can take site reliability to the next level?

  Jennifer Riggins — The New Stack

A surgeon delves into the key component that allows a group of skilled individuals to work effectively and safely together, using the term “heed” to describe this special interaction.

Sidenote: in a hilarious coincidence this article managed to spoil me on a movie I was in the middle of watching (Arrival) — but it also put me in a really cool mindset to watch the rest of the film.

  Dr. Rob Poston

More details on Square’s outage from a couple weeks ago (it was DNS).

  Square

Azure had an interesting outage in its Australia East region involving a power failure and the order in which cooling units were restored.

  Microsoft Azure

Asking this question is how you unlock the hidden essence of an incident. This talk compares two public incident reports to show what it looks like when you dig into this question and when you don’t.

  Jacob Scott — InfoQ

In this air accident, the pilots made a seemingly inexplicable mistake.

This sentence really stood out to me, especially after reading the “How Did It Make Sense at the Time?” article:

When we inexplicably grab the wrong utensil when cooking or accidentally start taking our dirty dishes to the bathroom instead of the kitchen, we should be thankful that we aren’t responsible for a plane full of people.

  Admiral Cloudberg

There’s an interesting failure mode in this one that might stand out for the Kafka admins among us:

The Kafka consumer ended up stuck in a loop, unable to stabilize fast enough before timing out and restarting the coordination process.

  Jakub Oleksy — GitHub

After explaining the difference between the ITIL terms “incident management” and “problem management”, this article goes into a discussion of recent trends and whether it still makes sense to draw a distinction between the two.

  Luis Gonzalez — incident.io

SRE Weekly Issue #389

Articles

Here are four of the lessons I learned that should help you build a successful SRE organization.

  1. Focus on Developer Training
  2. Focus on the Right Abstractions
  3. Focus on Self Service
  4. Automate Yourself Out of a Job

  Sven Hans Knecht

In this blog post, we’ll talk about two incident management structure models — distributed and centralized, including the pros and cons of each, and examples of what each structure looks like in our community.

  Robert Ross — FireHydrant

The Rasmussen model conceptualizes the limits of a system along 3 boundaries: Cost, System Performance, and Human Capacity.

  Nishant Modak — Last9

Wow, this is a really interesting incident. It has all the hallmarks of a nightmare sev1: time pressure, unknown problem, inventing new procedures on the spot, multiple different teams/specialties having to work together, etc.

  Jorg Wenninger — CERN

What do you do when many engineers all need to take the same day off each week for religious reasons?

  TimeWeSp

Toyota recently halted production in their factories due to a problem in their order system, about which they shared some interesting details.

  Toyota

Here’s a guidebook on how to handle being the first SRE at a company.

  Sven Hans Knecht

SRE Weekly Issue #388

Articles

This article makes a cool analogy between designing systems to operate well under unexpected load and designing socio-technical systems that operate well when the people are surprised by what the system is doing.

  Lorin Hochstein

If you need to create SLAs, this article has some solid advice on how to go about it — and what to avoid.

  incident.io

If Prometheus can’t scrape your service, an alert can get resolved incorrectly — and that can happen exactly when your service is failing!

  Chris Siebenmann

A really nifty three-part exploration of action items in the aftermath of an incident. Rather than consider cost/benefit, this article series proposes that we think about the likelihood of an action item being completed.

  J. Paul Reed

Yes, as it turns out — and these folks have the receipts (along with some theories as to why).

  Colin Bartlett

The “wow” moment in this article is under the heading, “What can we learn from creative desperation?”

  Eric Dobbs — Learning From Incidents

Before explaining how they set up their on-call, these folks share why they avoided it in the early stages of their startup, and what made them finally take the plunge.

  Dustin Brown — DoltHub

For the good of the profession, the SRE community still needs to coalesce around more consistent job ladders, expectations, and competencies.

  Code Reliant

Honeycomb had their worst incident ever at the end of July, and in their characteristic style, they’ve posted an incredibly detailed analysis of what happened — and that’s just the blog post. Then you can click through for a 17-page PDF with lots more detail.

  Fred Hebert — Honeycomb
  Full disclosure: Honeycomb is my employer.
