SRE Weekly Issue #411

A message from our sponsor, FireHydrant:

“To be honest, when can we switch?” The first impressions are in. Check out what people are saying after seeing Signals, the new standard in alerting and on-call from FireHydrant, for the first time. https://firehydrant.com/signals/

Software engineers and SREs should share a single on-call rotation as part of a single team, as this is where empathy for each other is built.

  Jamie Allen

I was pretty fuzzy on what HTTP/3 was all about, but this article set me straight.

  Roopa Kushtagi

An overview of the modulith pattern including reasons to choose modulith over microservices.

  Pier-Jean Malandrino

This article explores feedback loops formed out of various ways of responding to incidents that in turn increase the likelihood of more incidents. It took me a couple tries to get into this one, but it was well worth my effort.

  Steven Shorrock

Here, we’re going to outline some practical things you should consider when visiting on-call compensation and the incentives you create around it. We’ll also share how we approach this conversation here at incident.io.

  incident.io

This link-aggregation repo isn’t just about interviewing for SRE roles. It also links to resources on a ton of topics relevant to those starting out in SRE.

  @mxssl on GitHub

Cool trick: this paper uses counterfactual “should have” statements for good as a way of surfacing what incident investigators wish auditing was looking for. Click through for Fred Hebert’s synopsis of the paper.

  Fred Hebert (summary)
  Ben Hutchinson, Sidney Dekker, and Andrew Rae (original authors) — Process Safety Progress

This article (part one in a series) follows the author’s journey to learn and improve incident management at their company.

  Vladimirs Romanovskis — Dyninno

SRE Weekly Issue #410

A message from our sponsor, FireHydrant:

How many seats are you paying for in your legacy alerting tool that rarely get paged? With Signals’ bucket pricing, you only pay for what you use. Join the beta for a better tool at a better price.
https://firehydrant.com/blog/signals-beta-live/

In this blog post, we describe the journey DoorDash took using a service mesh to realize data transfer cost savings without sacrificing service quality.

  Hochuen Wong and Levon Stepanian — DoorDash

When just a few “regulars” are called in to handle every incident, you’ve got a knowledge gap to fill in your organization.

  David Ridge — PagerDuty

Dropbox expands into new datacenters often, so they have a streamlined and detailed process for choosing datacenter vendors.

  Edward del Rio — Dropbox

This is either nine things that could derail your SRE program, or a list of things to do with “not” in front of them — either way, it’s a good list.

  Shyam Venkat

We need enough alerting in our systems that we can detect lurking anomalies, but not so much that we get alert fatigue.

  Dennis Henry

A post about the importance of product in SRE, and how to make product and SRE first-class citizens in your Software Development Lifecycle.

  Jamie Allen

A relatively minor incident took a turn for the worse after the pilots tried a close fly-by to resolve it. I swear I’ve been in this kind of incident before, where I took risks significantly out of proportion to the problem I was trying to solve.

  Kyra Dempsey (Admiral Cloudberg)

SRE Weekly Issue #409

A message from our sponsor, FireHydrant:

It’s time for a new world of alerting tools that prioritize engineer well-being and efficiency. The future lies in intelligent systems that are compatible with real life and use conditional rules to adapt and refine thresholds, reducing alert fatigue.
https://firehydrant.com/blog/the-alert-fatigue-dilemma-a-call-for-change-in-how-we-manage-on-call/

I’ve occasionally wondered what’s behind Slack’s /remind or “clear my away status after my vacation ends”. Now I know!

  Claire Adams

This article is an exploration of consistency and coordination in distributed systems, with lots of really interesting examples.

  Lorin Hochstein

Lots of good stuff in here, including infrastructure, monitoring, and incident management tools.

  Saifeddine Rajhi

my first conference

Whew, way to dive into the deep end!

  Mike [surname unknown] — SREZone

This article explains why circuit breakers are especially useful in microservice architectures based on Lambda. It explains how to implement circuit breakers using Step Functions.

  Satrajit Basu — DZone

Definitely some interesting (and spicy!) takes in this one.

  Code Reliant

When you’re at LinkedIn’s scale, building automated abuse mitigation means designing for high throughput. The answer: lots of caching.

  Amit Mathapati — LinkedIn

A short but thought-provoking article about where SREs belong in the management hierarchy, and why.

  Jamie Allen

SRE Weekly Issue #408

A message from our sponsor, FireHydrant:

It’s time for a new world of alerting tools that prioritize engineer well-being and efficiency. The future lies in intelligent systems that are compatible with real life and use conditional rules to adapt and refine thresholds, reducing alert fatigue.
https://firehydrant.com/blog/the-alert-fatigue-dilemma-a-call-for-change-in-how-we-manage-on-call/

This is either a set of SRE interview topics or the squares for the SRE bingo card.

  Lorin Hochstein

Blame awareness only works if you work towards blame awareness with all incidents, not just the ones that affect you.

  Will Gallego

A brief history of our pipeline and the platforms, why the rebuilding was necessary, what these new services look like, and how they are being used for Netflix businesses.

  Liwei Guo, Anush Moorthy, Li-Heng Chen, Vinicius Carvalho, Aditya Mavlankar, Agata Opalach, Adithya Prakash, Kyle Swanson, Jessica Tweneboah, Subbu Venkatrav, Lishan Zhu — Netflix

Here are five concrete tips to fix your alerts and reduce alert fatigue.

  Candace Shamieh, Daljeet Sandu, and Nicolas Narbais — Datadog

This article contains guidelines for many kinds of reviews and activities SRE can do to improve reliability, such as SLO reviews, dependency reviews, and more.

  Jamie Allen

However, the reality of alerting in a socio-technical system must cater not only to the mess around the signal, but also to the longer term interpretation of alerts by people and automation acting on them. This post will expand on this messiness and why Honeycomb favors an iterative approach to setting our alerts.

  Fred Hebert — Honeycomb
  Full disclosure: Honeycomb is my employer.

This far-ranging conversation covers many aspects of developing a reliable platform for engineering. There’s a text summary if audio’s not your thing.

  Ash Patel — SREPath

Spurred by a single-AZ outage that took down their service, Slack set out to break their system into isolated segments so that an AZ can be drained of traffic quickly and without impacting customers.

  Cooper Bethea — Slack

SRE Weekly Issue #407

A message from our sponsor, FireHydrant:

Signals is now available in beta. Sign up to experience alerting for modern DevOps teams: Page teams, not services. Ingest inputs from any source. Bucket pricing based on usage. And one platform — ring to retro — finally. https://firehydrant.com/blog/signals-beta-live/

If you really want to understand how complex systems fail, you need to think in terms of webs rather than chains.

  Lorin Hochstein

We asked members of the PagerDuty Community what they do to remove the fear of being on-call, and to share a piece of advice for those starting out on the on-call rotation. Here are some of their insightful tips!

  Xenda Amici

There’s some interesting advice in here that I haven’t heard before, like rerunning the incident review meeting if you don’t get enough out of it the first time. Have any of you ever done this?

  Jonathan Word

Catchpoint’s annual SRE report is out, and you can download the PDF without even having to fill out a form.

  Catchpoint

The cool thing about this article is the discussion of anti-patterns to avoid, sprinkled throughout.

  Vanessa Huerta Granda — InfoQ

I cover GCP and AWS here a lot, so now it’s Azure’s turn, with this detailed guide on load balancing.

  Shivaprasad Sankesha Narayana — DZone

Read this one to learn how Cloudflare implemented a reliable logging pipeline with 1 million log lines per second.

  Colin Douch — Cloudflare

A production of Tinker Tinker Tinker, LLC