SRE Weekly Issue #408

A message from our sponsor, FireHydrant:

It’s time for a new world of alerting tools that prioritize engineer well-being and efficiency. The future lies in intelligent systems that are compatible with real life and use conditional rules to adapt and refine thresholds, reducing alert fatigue.
https://firehydrant.com/blog/the-alert-fatigue-dilemma-a-call-for-change-in-how-we-manage-on-call/

This is either a set of SRE interview topics or the squares for the SRE bingo card.

  Lorin Hochstein

Blame awareness only works if you work towards blame awareness with all incidents, not just the ones that affect you.

  Will Gallego

a brief history of our pipeline and the platforms, why the rebuilding was necessary, what these new services look like, and how they are being used for Netflix businesses.

  Liwei Guo, Anush Moorthy, Li-Heng Chen, Vinicius Carvalho, Aditya Mavlankar, Agata Opalach, Adithya Prakash, Kyle Swanson, Jessica Tweneboah, Subbu Venkatrav, Lishan Zhu — Netflix

Here are five concrete tips to fix your alerts and improve alert fatigue.

  Candace Shamieh, Daljeet Sandu, and Nicolas Narbais — Datadog

This article contains guidelines for many kinds of reviews and activities SRE can do to improve reliability, such as SLO reviews, dependency reviews, and more.

  Jamie Allen

However, the reality of alerting in a socio-technical system must cater not only to the mess around the signal, but also to the longer term interpretation of alerts by people and automation acting on them. This post will expand on this messiness and why Honeycomb favors an iterative approach to setting our alerts.

  Fred Hebert — Honeycomb
  Full disclosure: Honeycomb is my employer.

This far-ranging conversation covers many aspects of developing a reliable platform for engineering. There’s a text summary if audio’s not your thing.

  Ash Patel — SREPath

Spurred by a single-AZ outage that took down their service, Slack set out to break their system into isolated segments so that an AZ can be drained of traffic quickly and without impacting customers.

  Cooper Bethea — Slack

SRE Weekly Issue #407

A message from our sponsor, FireHydrant:

Signals is now available in beta. Sign up to experience alerting for modern DevOps teams: Page teams, not services. Ingest inputs from any source. Bucket pricing based on usage. And one platform — ring to retro — finally. https://firehydrant.com/blog/signals-beta-live/

If you really want to understand how complex systems fail, you need to think in terms of webs rather than chains.

  Lorin Hochstein

We asked members of the PagerDuty Community what they do to remove the fear of being on-call, and asked them to share a piece of advice for those starting out on the on-call rotation. Here are some of their insightful tips!

  Xenda Amici

There’s some interesting advice in here that I haven’t heard before, like rerunning the incident review meeting if you don’t get enough out of it the first time. Have any of you ever done this?

  Jonathan Word

Catchpoint’s annual SRE report is out, and you can download the PDF without even having to fill out a form.

  Catchpoint

The cool thing about this article is the discussion of anti-patterns to avoid, sprinkled throughout.

  Vanessa Huerta Granda — InfoQ

I cover GCP and AWS here a lot, so now it’s Azure’s turn, with this detailed guide on load balancing.

  Shivaprasad Sankesha Narayana — DZone

Read this one to learn how Cloudflare implemented a reliable logging pipeline with 1 million log lines per second.

  Colin Douch — Cloudflare

SRE Weekly Issue #406

A message from our sponsor, FireHydrant:

Signals is now available in beta. Sign up to experience alerting for modern DevOps teams: Page teams, not services. Ingest inputs from any source. Bucket pricing based on usage. And one platform — ring to retro — finally. https://firehydrant.com/blog/signals-beta-live/

This article describes how to clearly show your value delivered to a tech company as someone who focuses on non-functional requirements such as operability, performance, or reliability.

  Amin Astaneh — Certo Modo

Doggedly preventing a recurrence of an incident may not be the best way to protect our systems — and may in fact make things worse.

  Lorin Hochstein

Should your SLO cover a rolling 30 days? 7 days? A calendar month?

  Alex Ewerlöf

Threads was built in five months and had over 100 million users in its first week.

  Laine Campbell and Chunqiang (CQ) Tang — Meta

This article is full of advice on setting up an on-call process that’s livable and less likely to burn folks out.

  incident.io

A pilot violated a major aviation principle, and it was the right move. It’s very interesting to me that pilots are trained on the principle but not on the exceptions, with the expectation that they will react well in exceptional circumstances.

  Admiral Cloudberg

Integer IDs or UUIDs as your DB primary key? I can’t count the number of incidents I’ve been involved in where integer primary keys played a part.

  Bertrand Florat

SRE Weekly Issue #405

A message from our sponsor, FireHydrant:

In this episode of FireHydrant’s Gimme 5 video series, Asaf Gaon, Director of Technical Support for automated grocery fulfillment solution Takeoff Technologies, talks about how to handle third-party downtime in a collaborative – and automated – way. https://firehydrant.com/blog/gimme-5-with-takeoff-technologies-asaf-gaon/

Using the Swedish word “Lagom” as a jumping-off point, this article explains the importance of choosing an SLO that is just right: not too lax and not too strict.

  Alex Ewerlöf

A simple security change like ceasing to use IMDSv1 can involve profound risk and necessitate a major migration process.

  Archie Gunasekara — Slack

It can be all too easy to let a subset of your IT organization “handle” resiliency. If resilience is about an ability to adapt and respond to change, then it needs broad buy-in.

  Richard Gall — The New Stack

If any seemingly innocuous change can break our systems, what should we do?

  Lorin Hochstein

What exactly is “human error”?

  Steven Shorrock — Humanistic Systems

We recently upgraded from Postgres 11.9 to 15.3 with zero downtime by using logical replication, a suite of support scripts, and tools in Elixir & Erlang’s BEAM virtual machine.

They share a ton of details about how they did it.

  Brent Anderson — Knock

Why do doctors still use antiquated pagers? There’s a lot here that speaks to what it’s really like to operate in an on-call environment, and how to evaluate new tools.

  Fred Hebert

This article riffs on Murphy’s law, exploring various aspects of how things go wrong using anecdotes.

  Bertrand Florat

SRE Weekly Issue #404

A message from our sponsor, FireHydrant:

Looking to cozy up with a good read this week? Check out “Your guide to better status pages.” It’s a mini masterclass on how to better communicate on your status pages. https://firehydrant.com/blog/your-guide-to-better-incident-status-pages/

For every 9 you add to your SLO, you’re making the system 10x more reliable but also 10x more expensive.

  Alex Ewerlöf
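
To put rough numbers on the "10x per nine" idea, here is a minimal sketch of the error-budget arithmetic. It assumes an availability SLO measured over a 30-day window; the function name `allowed_downtime_minutes` and the window length are illustrative choices, not taken from the linked article. It only covers the reliability side of the claim and makes no attempt to model cost.

```python
def allowed_downtime_minutes(nines: int, window_days: int = 30) -> float:
    """Error budget, in minutes, for an availability SLO with `nines` nines.

    Hypothetical helper for illustration: 2 nines means 99%, 3 nines means 99.9%.
    """
    unavailability = 10 ** -nines  # fraction of the window allowed to fail
    return window_days * 24 * 60 * unavailability


if __name__ == "__main__":
    for n in range(1, 6):
        slo_percent = 100 * (1 - 10 ** -n)
        budget = allowed_downtime_minutes(n)
        print(f"{slo_percent:.3f}% -> {budget:.2f} minutes of downtime per 30 days")
```

Each added nine divides the allowed downtime by ten, which is the sense in which the system has to be "10x more reliable".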

In this incident story, the feature flags were served by the main application server. When a new feature caused the server to crash, there was no way to flip the flag back off to stop the crashes.

  rachelbythebay

The author of a classification system for human error reflects 20 years later on the harm that such systems can cause by using deficit-based language.

  Dr. Steven Shorrock

Here’s Fred Hebert’s analysis of Cloudflare’s write-up of their incident on November 2.

I’m hoping they’re going to do a more in-depth review.

  Fred Hebert — VOID

In this post, we introduce a hybrid approach that seamlessly combines the precision of manual instrumentation with the comfort, efficiency, and performance of automatic instrumentation.

  Ron Federman — Odigos

Change is not the problem. It’s unaddressed risk.

  Bruce Johnston — High Scalability

A shell script with a loop running a DB client can fill up your ephemeral ports in a hurry.

  Oren Eini — RavenDB
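
As a back-of-the-envelope illustration of how quickly this can happen, the sketch below estimates the connection rate a single client can sustain before running out of ephemeral ports. The numbers are assumptions, not details from the linked article: Linux's default ephemeral port range (32768 to 60999), a 60-second TIME_WAIT, every connection going to the same server address and port, the client closing each connection first, and no port reuse. The function name is hypothetical.

```python
def max_sustained_connections_per_second(
    port_low: int = 32768,        # assumed: Linux default ephemeral range start
    port_high: int = 60999,       # assumed: Linux default ephemeral range end
    time_wait_seconds: int = 60,  # assumed: how long a closed socket holds its port
) -> float:
    """Rough upper bound on new connections per second to one destination.

    Each short-lived connection occupies one ephemeral port until TIME_WAIT
    expires, so the pool is exhausted once the rate exceeds ports / TIME_WAIT.
    """
    available_ports = port_high - port_low + 1
    return available_ports / time_wait_seconds


if __name__ == "__main__":
    rate = max_sustained_connections_per_second()
    print(f"About {rate:.0f} new connections per second before ports run out")
```

Under these assumptions the limit is a few hundred connections per second, a rate that a tight shell loop spawning a fresh DB client each iteration can plausibly reach.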

When you get right down to it, it’s all human communication, even assembly code. It’s human factors all the way down.

  Michael Hart

A production of Tinker Tinker Tinker, LLC