SRE Weekly Issue #383

A message from our sponsor, Rootly:

Eliminate the anxiety around declaring an incident for nebulous problems by introducing a triage phase into your incident management process. Our latest blog post dives into why the triage phase is so important, and how you can automate yours with Rootly.

Read more on the Rootly blog:
https://rootly.com/blog/improve-visibility-and-capture-more-data-with-triage-incidents

Articles

This delightful talk explores what SRE can look like in practical terms by learning about the sociotechnical situation at a fictitious company. To do that, Amy Tobey plays a game she created, walking through a town and talking to NPCs.

  Amy Tobey — InfoQ

Honeycomb had a major outage last Tuesday, and they posted this interim outage report on their status page.

Note: Honeycomb is my employer, and I proofread this article.

  Honeycomb

The system resiliency pyramid provides a holistic framework for thinking about reliability across five key layers.

I like the way this system of layers breaks down the multiple different aspects of reliability.

  Code Reliant

This article explores system overload using a traffic congestion analogy. I especially like the note about failover as a cause of an overload condition.

  Tanveer Gill — FluxNinja

In this article, I’ll dive into this vital DORA metric, detail its benchmarks, and provide practical insights to help you drive more frequent successful changes.

  incident.io

This article explains four different rate limiting algorithms and includes code snippets in Java.

  Code Reliant
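
For a concrete feel for what a rate limiter does, here is a minimal token bucket sketch in Python. Token bucket is one commonly described rate limiting algorithm; whether it matches the article's four, and how closely this resembles its Java snippets, is an assumption. The class name, the parameters, and the injectable clock are all illustrative.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable so the limiter can be tested without sleeping
        self.tokens = capacity    # start with a full bucket
        self.last = clock()

    def allow(self, cost=1):
        """Return True and spend `cost` tokens if enough are available, else False."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=1` and `capacity=2`, two back-to-back requests pass, a third is rejected, and one more is allowed after a second has elapsed.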

PostgreSQL vacuuming can be a total pain — and a serious threat to performance and reliability. This new database engine sounds pretty interesting.

  Oriole

Current IaC tools are like plain HTML, says this author, and we should have something like CSS to avoid repeating ourselves.

  Nathan Peck

PagerDuty looks back on a decade of weekly chaos experiments and shares advice on starting your own similar program.

  Cristina Dias — PagerDuty

SRE Weekly Issue #382

A message from our sponsor, Rootly:

Eliminate the anxiety around declaring an incident for nebulous problems by introducing a triage phase into your incident management process. Our latest blog post dives into why the triage phase is so important, and how you can automate yours with Rootly.

Read more on the Rootly blog:
https://rootly.com/blog/improve-visibility-and-capture-more-data-with-triage-incidents

Articles

The Linux OOM killer can already be a bugbear, and things only get more complicated when you add containers to the mix.

  Rafał Korepta — RedPanda

This post explores how to align platform and product engineering teams by implementing business value proxy metrics and using incidents to inform them.

The same metrics that we use to measure other initiatives against business priorities may be able to show us whether our incident response process is effective.

  Gonzalo Maldonado — FireHydrant

Here’s another take on DevOps vs. SRE, using a metaphor of organizing a party.

  Diogo Souza

How do you balance taking advantage of the acceleration and innovation of AI while not compromising reliability and losing users?

  Jim Gochee — The New Stack

My favorite part is the bit about the risks of automation and keeping humans in the loop.

  Dr. Mica Endsley — Business News This Week

It’s about reliability: IaC changes carry just as much risk to reliability as product code changes, if not more. How can we bring feature flags to IaC?

  Josephine E. Justin, Srikanth Murali, and Norton Stanley S A — DZone

Oh, the tangled web we weave when we send automated emails.

  Amin Astaneh — Certo Modo

Here are four things we learned while scaling up Presto to Meta scale, and some advice if you’re interested in running your own queries at scale.

  High Scalability

SRE Weekly Issue #381

A message from our sponsor, Rootly:

Curious how companies like Elastic, Tripadvisor, and 100s of others leverage Rootly to manage incidents in Slack and unlock instant best practices? Check out this lightning demo:
https://www.loom.com/share/051c4be0425a436e888dc0c3690855ad

Articles

The pyramid introduced in this article has three levels of monitoring: Operational, Data Validation, and Business Assumptions. These roughly correspond to questions like: Is the system up? Is the right amount of data flowing through it? Is that data correct?

  Karel Vanden Bussche — DEV

Extremely powerful tools, Terraform among them, can become extremely powerful footguns.

  Dave Smith — GitLab

Sure, you know what latency is, but do you really know what a percentile is? A histogram? A heatmap?

  igor
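
To make the percentile question concrete, here is a minimal sketch using the nearest-rank definition, which is only one of several definitions in use. The function name and the sample latencies are made up for illustration and are not taken from the linked article.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 14]
p50 = percentile(latencies_ms, 50)  # 13
p99 = percentile(latencies_ms, 99)  # 250
```

The mean of these samples is 37 ms, which describes none of the requests well: the median shows the typical request at 13 ms, and the 99th percentile surfaces the 250 ms outlier.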

If you’re using a CDN, you need to keep an eye on it. Here’s a primer on what to watch for.

  Or Hillel — DZone

This article series covers 12 aspects important in the design of reliable systems. Some of the aspects, such as modularity, loose coupling, graceful degradation, and redundancy, are covered in depth.

  Code Reliant

A couple weeks back, GitHub was hard down, even including its status page at times. This report goes into that in detail, and the cause is pretty interesting.

  Jakub Oleksy – GitHub

An in-depth look at different kinds of failover, including each kind’s methodology and purposes.

  Alex Ewerlöf

This one is especially interesting for the controversial and baseless conclusions popularized in the media about a supposed cause rooted in Korean culture. It’s a good reminder that we need to be careful to ensure the validity of the lessons we learn from incidents.

  Admiral Cloudberg

SRE Weekly Issue #380

A message from our sponsor, Rootly:

Curious how companies like Elastic, Tripadvisor, and 100s of others leverage Rootly to manage incidents in Slack and unlock instant best practices? Check out this lightning demo:
https://www.loom.com/share/051c4be0425a436e888dc0c3690855ad

Articles

Well, that cleared things up. (It didn’t, but the debate is interesting.)

  Scott M. Fulton III — The New Stack

This article has five tips for great incident communication, along with a section on why this matters.

  Luis Gonzalez — incident.io

Beyond just a list of ways SREs interface with other teams, this article also compares them and gives advantages and disadvantages of each.

  Amin Astaneh — Certo Modo

Building every system to be strong enough to handle peak load can be very expensive. Can we instead build our systems to take excess load from each other cooperatively?

  Lorin Hochstein — Surfing Complexity

Another useful “how we do SRE” post, including an incident report template.

  Pavel Pritchin — Dodo Engineering

Here’s an interesting twist on the usual “incident severity 101” article: in a company where “anyone can declare an incident”, how do you make sure incident severity gets set consistently in every incident?

  Mike Lacsamana — FireHydrant

How can we work to improve reliability when folks perceive our efforts to be counter to velocity?

  Code Reliant

In a blameless culture without consequences, what’s the incentive for learning to make the system more reliable? This is an incredibly thought-provoking article and I’m still not sure how I feel about it.

  Robert Poston MD

SRE Weekly Issue #379

A message from our sponsor, Rootly:

Curious how companies like Figma, Tripadvisor, and 100s of others leverage Rootly to manage incidents in Slack and unlock instant best practices? Check out this lightning demo:
https://www.loom.com/share/051c4be0425a436e888dc0c3690855ad

Articles

In case, like me, you weren’t familiar with the Saga pattern, it’s basically a pseudo-transaction across multiple microservices. Here’s why it might not be a great idea.

  Sergiy Yevtushenko

During a rolling deploy, for a very brief period of time, different parts of the infrastructure had old or new code running, with unexpected results.

  Andrew Ayer

On its face, we have a simple requirement:

  • Generate sequential numbers
  • Ensure that there can be no gaps
  • Do that in a distributed manner

It’s never simple with distributed systems.

In classic Cloudflare style, here’s an ultra-deep dive into the kernel to find the source of trouble-making packet loss.

  Terin Stock — Cloudflare

Even with a “duplicate” incident, there’s always at least one thing that’s different: the fact that it’s happened before. That changes things. In practice, a lot more will be different too.

  Fred Hebert — Honeycomb
  Full disclosure: Honeycomb is my employer.

There are definitely pros and cons to being in the most popular (and most oft-maligned) AWS region.

  Jeff Martens — Metrist

Changes are frequent causes of incidents, but what exactly counts as a change? This article delves into that with examples.

  Boris Cherkasky

This crash is a great reminder that we have to look past “human error” to the systems around the humans that set them up for failure (or don’t set them up for success).

  Admiral Cloudberg

A production of Tinker Tinker Tinker, LLC