SRE Weekly Issue #387

A message from our sponsor, Rootly:

When incidents impact your customers, failing to communicate with them effectively can erode trust even further and compound an already difficult situation. Learn the essentials of customer-facing incident communication in Rootly’s latest blog post:
https://rootly.com/blog/the-medium-is-the-message-how-to-master-the-most-essential-incident-communication-channels

Articles

In this post, we’ll explore 10 areas that are key to designing highly scalable architectures.

The 10 areas they cover in depth are:

  1. Horizontal vs. Vertical Scaling
  2. Load Balancing
  3. Database Scaling
  4. Asynchronous Processing
  5. Stateless Systems
  6. Caching
  7. Network Bandwidth Optimization
  8. Progressive Enhancement
  9. Graceful Degradation
  10. Code Scalability

  Code Reliant

Are you looking at the number of requests that were served successfully out of the total number of requests? Or the percentage of time the system was up and working properly?

  Alex Ewerlöf
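The two measures in that question can diverge sharply for the same system. Here's a minimal sketch of each (all numbers hypothetical):

```python
# Two common ways to compute "availability" for an SLO. A system can
# score very differently on the two: a 30-minute outage during peak
# traffic hurts the request-based number far more than the time-based one.

def request_availability(successful: int, total: int) -> float:
    """Request-based: fraction of requests served successfully."""
    return successful / total

def time_availability(uptime_seconds: float, window_seconds: float) -> float:
    """Time-based: fraction of the window the system was up."""
    return uptime_seconds / window_seconds

# Hypothetical month: 10M requests, a 30-minute peak-hour outage.
req = request_availability(successful=9_940_000, total=10_000_000)
uptime = time_availability(uptime_seconds=2_590_200, window_seconds=2_592_000)

print(f"request-based: {req:.4%}")   # 99.4000%
print(f"time-based:    {uptime:.4%}")  # 99.9306%
```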

This is my personal take on something that is considered standard that I just don’t understand. So here we go — the Apdex, what it is, and why I don’t use it!

  Boris Cherkasky
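For readers unfamiliar with it, the standard Apdex score the article critiques is (satisfied + tolerating / 2) / total, where "satisfied" means latency at or under a chosen threshold T and "tolerating" means between T and 4T. A minimal sketch:

```python
# Standard Apdex formula. T (the "satisfied" threshold) is the one
# magic number you must pick, which is a big part of the critique.

def apdex(latencies_ms, threshold_ms):
    satisfied = sum(1 for t in latencies_ms if t <= threshold_ms)
    tolerating = sum(1 for t in latencies_ms if threshold_ms < t <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)

# With T = 500ms: 2 satisfied, 1 tolerating, 1 frustrated
# -> (2 + 0.5) / 4 = 0.625
print(apdex([120, 480, 900, 3000], threshold_ms=500))
```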

Here’s a great explanation of three common cognitive biases we should try to avoid while analyzing incidents.

  Randy Horwitz — Learning From Incidents

A horrifying tale of gitops gone wrong and backups that didn’t back up, leading to catastrophic data loss. This, this is what hugops is for. I’m so sorry, Lily!

  Lily Cohen

Here’s a follow-up analysis from Duo for an incident they had last week.

The first SRE hire at incident.io shares what they learned as they became familiar with the infrastructure and figured out what to do with it.

  Ben Wheatley — The New Stack

This is a story of building a new on-call rotation in a company that didn’t have one. They started out with a pretty awesome list of principles that we could all aspire to.

  Felix Lopez — The New Stack

Why should we test in production? This article gives a really spot-on argument and goes on to explain how to do it.

  Sven Hans Knecht

SRE Weekly Issue #386

This issue was delayed a day while I was enjoying a much-needed vacation with my family. While I’m on the subject, it’s hot take time: vacations are important for the reliability of our sociotechnical systems, so good SREs should take vacations regularly and encourage others to as well.


Articles

If “you build it, you run it” requires mandate, knowledge, and responsibility, what happens when one of those is missing?

  Alex Ewerlöf

Slack developed an all-encompassing metric for the user experience that goes beyond a simple SLO.

  Matthew McKeen and Ryan Katkov

This whitepaper delves deep into the ways a microservice architecture changes how transactions work. It presents a method of dealing with microservice transaction failures through application-specific compensation logic.

  Frank Leymann — WSO2
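The compensation idea can be sketched roughly like this (a generic saga-style illustration, not code from the whitepaper; all step names are made up):

```python
# Each step in a distributed transaction pairs a forward action with
# application-specific compensation logic that undoes it. If any step
# fails, compensations for already-completed steps run in reverse order.

def run_saga(steps):
    """steps: list of (action, compensation) callable pairs."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True

# Hypothetical order-placement transaction where the final step fails.
log = []

def fail_shipping():
    raise RuntimeError("shipping service unavailable")

ok = run_saga([
    (lambda: log.append("reserve inventory"), lambda: log.append("release inventory")),
    (lambda: log.append("charge card"),       lambda: log.append("refund card")),
    (fail_shipping,                           lambda: None),
])
print(ok)   # → False
print(log)  # → ['reserve inventory', 'charge card', 'refund card', 'release inventory']
```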

Bambu is a brand of 3D printers that are primarily cloud-based. A problem in their cloud system resulted in printers running jobs unexpectedly, causing significant damage to some customers’ printers.

  Bambu Lab

An interesting confluence of fiber optic line failures resulted in loss of connectivity on what should have been a redundant link.

  Google

I know the title looks like click-bait, but this article delivers with 7 well thought-out critiques of SLOs.

  Code Reliant

This latest entry into the awesome-* arena is a curated list of runbooks and related resources for popular software.

  Runbear

You shift from asking “what was the abnormal work?” to “how did this incident happen even though everyone was doing normal work?”

This article immediately made me think of the latest Mentour Pilot accident investigation in which everyone acted nearly perfectly and yet still only narrowly avoided a mid-air collision.

  Lorin Hochstein

SRE Weekly Issue #385

Many apologies to Matt Cooper at GitHub, who is the actual author of the article Scaling Merge-ort Across GitHub from last week. Sorry for the mis-credit, Matt!


Articles

This article will really come in handy next time you need to explain SRE to your execs.

  Kit Merker — DevOps.com

By mapping the Westrum Model of organizational cultures to SRE, we can understand SRE culture adoption.

  Vladyslav Ukis and Ben Linders — InfoQ

Disney’s SRE teams have ensured that the magic keeps happening, even as experiences and their underlying technology become more and more complex.

  Ash Patel — SREPath

There’s so much to learn from this tragedy that I might read this one again. A mid-air collision these days should be effectively impossible due to TCAS. In this case, many factors conspired to bring about disaster.

  Admiral Cloudberg

Here they are, out in the open:

  • SLOs create a common understanding in the organization about reliability
  • SLOs require investment into improved observability
  • SLOs prompt decisions about risk management… and risk-taking

  Amin Astaneh — Certo Modo

The “five standard models” are actually more like a 5-stage workflow:

  • Triage
  • Examine
  • Diagnose
  • Test
  • Cure

  Saheed Oladosu

This blog post will share broadly-applicable techniques (beyond GraphQL) we used to perform this migration. The three strategies we will discuss today are AB Testing, Replay Testing, and Sticky Canaries.

  Jennifer Shin, Tejas Shikhare, Will Emmanuel — Netflix
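Of the three, replay testing is the simplest to sketch: send the same captured production requests to the legacy and migrated systems and diff the responses. The handlers below are stand-ins, not Netflix’s actual services:

```python
# Replay testing sketch: any request where the legacy and migrated
# implementations disagree is surfaced for investigation.

def replay_test(requests, legacy_handler, new_handler):
    mismatches = []
    for req in requests:
        old, new = legacy_handler(req), new_handler(req)
        if old != new:
            mismatches.append((req, old, new))
    return mismatches

# Hypothetical handlers: the migrated version mishandles odd inputs.
legacy = lambda n: n * 2
migrated = lambda n: n * 2 if n % 2 == 0 else n * 2 + 1

diffs = replay_test([1, 2, 3, 4], legacy, migrated)
print(diffs)  # → [(1, 2, 3), (3, 6, 7)]
```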

Building from a review of traditional rate limiting techniques, this article then explains adaptive rate limiting and its benefits.

  Sudhanshu Prajapati — FluxNinja

SRE Weekly Issue #384


Articles

They tested this new git merge strategy by using Scientist, a framework that runs both the old and new implementation and compares the results.

  Jesse Toth — GitHub
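The pattern Scientist implements can be sketched in a few lines (an illustrative Python re-implementation, not GitHub’s Ruby library): run both code paths, record whether they agree and how long each took, and always return the control’s result so users never see the candidate’s bugs.

```python
import time

def experiment(control, candidate, publish):
    """Wrap two implementations; publish comparison data on every call."""
    def run(*args, **kwargs):
        start = time.perf_counter()
        control_result = control(*args, **kwargs)
        control_ms = (time.perf_counter() - start) * 1000

        try:
            start = time.perf_counter()
            candidate_result = candidate(*args, **kwargs)
            candidate_ms = (time.perf_counter() - start) * 1000
            matched = candidate_result == control_result
        except Exception:
            matched, candidate_ms = False, None

        publish({"matched": matched,
                 "control_ms": control_ms, "candidate_ms": candidate_ms})
        return control_result  # callers always get the proven code path
    return run

# Toy usage: the new implementation disagrees on purpose.
results = []
guarded = experiment(lambda s: sorted(s),
                     lambda s: sorted(s, reverse=True),
                     results.append)
print(guarded([3, 1, 2]))        # → [1, 2, 3]  (control result, always)
print(results[0]["matched"])     # → False
```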

DNS is simple (kinda) but it can be really difficult to fully wrap your head around it. This article explains why, and in the process gives a blueprint for designing more understandable tools in general.

  Julia Evans

Fallback is different from Failover for a number of reasons. This article describes how they differ, how fallback works, and why you might choose it over failover.

  Alex Ewerlöf
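One way to see the difference in code (a toy sketch with made-up handlers): failover retries the same question against a standby replica for the full answer, while fallback settles for a degraded but useful answer when the primary fails.

```python
def failover(replicas):
    """Failover: try each replica in turn for the *same* full answer."""
    def call(request):
        last_exc = None
        for replica in replicas:
            try:
                return replica(request)
            except Exception as exc:
                last_exc = exc
        raise last_exc
    return call

def fallback(primary, degraded):
    """Fallback: if the primary fails, return a degraded answer instead."""
    def call(request):
        try:
            return primary(request)
        except Exception:
            return degraded(request)
    return call

# Hypothetical handlers for illustration.
def flaky(request):
    raise TimeoutError("primary down")

serve = fallback(flaky, lambda request: "stale cached answer")
print(serve("GET /recommendations"))     # → stale cached answer

replicas = failover([flaky, lambda request: "fresh answer from standby"])
print(replicas("GET /recommendations"))  # → fresh answer from standby
```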

Repository Purpose: Provide teams and individuals an idea on what to take into consideration and what to aspire for in the SRE field and work

Note: these checklists are opinionated.

  Arie Bregman

A thought-provoking article on trying to change people’s behavior in incidents through incentives (positive or negative) without also changing the context in which they act.

  Fred Hebert — Learning From Incidents

Cloudflare shares what they learned as they transitioned their KV service to a new architecture which resulted in multiple unexpected problems.

  Matt Silverlock, Charles Burnett, Rob Sutter, and Kris Evans — Cloudflare

In this article, learn about two interesting strategies for getting an organization to prioritize technical debt work: using a more specific name for the work, and referencing the work’s impact on an SLO — and the impact of not doing the work.

  Emily Nakashima — Honeycomb
  Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #383

A message from our sponsor, Rootly:

Eliminate the anxiety around declaring an incident for nebulous problems by introducing a triage phase into your incident management process. Our latest blog post dives into why the triage phase is so important, and how you can automate yours with Rootly.

Read more on the Rootly blog:
https://rootly.com/blog/improve-visibility-and-capture-more-data-with-triage-incidents

Articles

This delightful talk explores what SRE can look like in practical terms by learning about the sociotechnical situation at a fictitious company. To do that, Amy Tobey plays a game she created, walking through a town and talking to NPCs.

  Amy Tobey — InfoQ

Honeycomb had a major outage last Tuesday, and they posted this interim outage report on their status page.

Note: Honeycomb is my employer, and I proofread this article.

  Honeycomb

The system resiliency pyramid provides a holistic framework for thinking about reliability across five key layers.

I like the way this system of layers breaks down the multiple different aspects of reliability.

  Code Reliant

This article explores system overload using a traffic congestion analogy. I especially like the note about failover as a cause of an overload condition.

  Tanveer Gill — FluxNinja

In this article, I’ll dive into this vital DORA metric, detail its benchmarks, and provide practical insights to help you drive more frequent successful changes.

  incident.io

This article explains four different rate limiting algorithms and includes code snippets in Java.

  Code Reliant
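The article’s snippets are in Java; as a taste of the same material, here’s a Python sketch of one classic algorithm it covers, the token bucket (timestamps are passed explicitly to keep the example deterministic):

```python
# Token bucket: tokens refill at a fixed rate up to a burst capacity,
# and each allowed request spends one token.

class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full: allows an initial burst
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 1 request/sec sustained, bursts of 2: the third rapid request is shed.
bucket = TokenBucket(rate_per_sec=1, capacity=2)
print([bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.5)])
# → [True, True, False, True]
```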

PostgreSQL vacuuming can be a total pain — and a serious threat to performance and reliability. This new database engine sounds pretty interesting.

  Oriole

Current IaC tools are like plain HTML, says this author, and we should have something like CSS to avoid repeating ourselves.

  Nathan Peck

PagerDuty looks back on a decade of weekly chaos experiments and shares advice on starting your own similar program.

  Cristina Dias — PagerDuty

A production of Tinker Tinker Tinker, LLC