SRE Weekly Issue #368

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

This article uses a simulation to demonstrate the power of shuffle sharding to limit the blast radius of overload conditions.

  Eugene Retunsky — DZone
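
As a rough illustration of why shuffle sharding limits blast radius, here is a minimal sketch that is not taken from the linked article. It assumes each customer is assigned a random subset of `shard_size` workers out of `workers`; the function name and the numbers 8 and 2 are hypothetical.

```python
from math import comb


def shard_overlap(workers: int, shard_size: int) -> dict:
    """Overlap odds between one customer's shard and another random shard."""
    # Number of distinct worker subsets a customer could be assigned.
    total = comb(workers, shard_size)
    # Subsets that share no worker at all with a given customer's shard.
    disjoint = comb(workers - shard_size, shard_size)
    return {
        "distinct_shards": total,
        # Chance that another customer lands on exactly the same workers.
        "full_overlap": 1 / total,
        # Chance that another customer shares at least one worker.
        "any_overlap": 1 - disjoint / total,
    }


if __name__ == "__main__":
    # Illustrative numbers only: 8 workers, 2 workers per customer.
    print(shard_overlap(8, 2))
```

Under these assumptions, an overloading customer can fully take out only the small fraction of other customers who share its exact shard, while the rest keep at least one healthy worker.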

A comprehensive look at stress testing, how it differs from load testing, how to implement it, and how to analyze the results.

  LambdaTest

Retries and high availability are great, but for critical dependencies, we can go a step further and define an alternative in case a dependency is down.

  Leart Gjoni — DoorDash

From the archives, here’s an incident report from a major outage at DoorDash in 2022.

  Ryan Sokol — DoorDash

Amazon’s old internal “retrospective” process sounds painful and scary. Fortunately the author took the good parts and learned some valuable lessons from the rest.

  Lee Atchison — Container Journal

Instead of asking PMs to “speak SRE,” span the communication gap by using the common language of user stories to build business-cogent SLOs.

  Kit Merker — DevOps.com

Amazon gives its own service offerings like RDS an advantage by making the (normally pricey) cross-availability-zone data transfer free.

  Corey Quinn — Last Week In AWS

It’s easy to think of reasons to run a retrospective on an incident. What about the reverse? Which incidents should we skip over?

  Lex Neva — The New Stack
  Full disclosure: Honeycomb, my employer, is mentioned.

SRE Weekly Issue #367

Articles

This article teaches the math you need to build alerting with a low false-positive rate, and explains why this is trickier than it may seem.

  Dan Slimmon
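
One way to see why a low false-positive rate is hard to achieve is the base-rate effect. The sketch below is an assumption about the kind of math involved, not a reproduction of the linked article: it applies Bayes' rule to compute the share of fired alerts that correspond to a real problem. The function name and the example figures are hypothetical.

```python
def alert_precision(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Probability that a fired alert reflects a real problem (Bayes' rule)."""
    # Alerts that fire while something really is wrong.
    true_positives = sensitivity * prevalence
    # Alerts that fire while everything is fine.
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative numbers only: a check that is right 99% of the time in
    # both directions, watching a system that is broken 0.1% of the time.
    print(round(alert_precision(0.99, 0.99, 0.001), 4))
```

With these illustrative inputs, only about 9% of fired alerts point at a real problem, even though the check itself looks 99% accurate.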

Cloudflare adapted a technique shared by PagerDuty to detect failed Kafka consumers and restart them.

  Chris Shepherd and Andrea Medda — Cloudflare

Gracefully shutting down is important; otherwise, every deploy will result in client-facing errors.

  Srinavas — eightnoteight

There’s a wealth of lessons learned in this article. My favorite: idempotency was never part of the contract, but consumers nevertheless depended on it.

  Gergely Orosz — Pragmatic Engineer

Making our companies into High Reliability Organizations (HROs) rarely makes sense, but we can still learn useful skills and techniques from them. This article gives a good overview and analysis of HROs.

  Dominic Cooper — Safety & Health Practitioner

By “tiered”, this article means having discussions about reliability at three levels: the engineering team level, the director level, and the executive level.

  Ross Brodbeck

After explaining why deploys aren’t the right approach, this article proposes feature flags as a safer approach.

  Charity Majors — Honeycomb
  Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #366

Articles

In incident management as in so many areas, there’s the shiny work and the unglamorous but critical parts, and the latter often fall to women. This article seeks to reverse that trend by reminding us of the incredibly important glue work women have been doing since the dawn of computing.

  Emily Arnott — Blameless

I love stories about applying IT incident response processes to non-IT incidents.

  Robert Ross — FireHydrant

Dear reader, perhaps you would enjoy reading this article on the many benefits of engineering blogs… then go write more great content and send me a link. :D

  New York Times — Jordan News

Okay, this isn’t exactly an SRE story, but it sounds really familiar. It’s a story of “user error” that’s really about designing systems to help users catch errors.

  Jakub Roztocil — httpie

nginx has a pretty nifty zero-downtime restart system, but it didn’t quite fit Cloudflare’s needs.

  Maciej Lechowski — Cloudflare

This article does a great job of summarizing SRECon Americas by pulling out five major themes that ran through multiple talks.

  Gavin Cahill — Gremlin

Building buy-in is everything.

[…] the key function of SRE being to help shape engineering’s perception of reality rather than act as a gatekeeper.

  Ross Brodbeck

By “FinOps”, they mean a team in your company dedicated to reducing cloud computing costs. Does that really help?

  Lydia Leong

[…] it is also possible to create incident writeups that engineers choose to read, that clearly describe and highlight difficult and poorly-understood aspects of our systems, and that become part of the organisation’s collective understanding.

  Laura Nolan — Container Solutions

Years after we both started doing the newsletter thing, I finally sat down with Corey Quinn for an episode of his podcast. We talked about running newsletters, my other side project, and of course, reliability.

  Corey Quinn — Last Week In AWS

SRE Weekly Issue #365

Articles

They take us from the requirements analysis all the way through implementation of a high-throughput data store based on CockroachDB.

  Chuanpin Zhu and Debalin Das — DoorDash

On March 14th, Reddit engineers upgraded a Kubernetes cluster from 1.23 to 1.24, and all hell broke loose. I admire their precision in being down for 100π minutes.

  Jayme Howard — Reddit

With a huge user base of students and teachers, these folks upped their incident response game, and they share how.

  Nadinastiti and Estu Fardani — GovTech Edu

A lurking bug in redis-py allowed users to see one another’s data, and OpenAI took ChatGPT down to limit the damage.

  OpenAI

In Linux, source port allocation can be complex. This article shows why with a ton of code and tracing examples.

  Jakub Sitnicki — Cloudflare

The gap between “paying for peak” and “earning on average” is critical to understand how the economics of large-scale cloud systems differ from traditional single-tenant systems.

  Marc Brooker

A configuration error was masked because the app automatically fell back to the original configuration. The problem only surfaced when the service was redeployed.

  Heroku

SRE Weekly Issue #364

Articles

Heresy! This article provides a counterpoint to many of the benefits of IaC. While IaC may still be the right answer, it’s not a slam dunk.

  Luke Shaughnessy

Short but sweet, this article outlines three focus areas that the author argues should be a part of any SRE role.

  Kyle Robertson

Way beyond just an intro to Aperture, this article also covers microservice architecture failure modes, techniques used to avoid failures, and the weaknesses in those techniques.

  Cong Ma and Matt Ranney — DoorDash

I’m including this here not just for the staff+ SREs out there. Many of these skills are important for SREs to develop much earlier than the Staff level, since our role can be so collaborative.

  Ryn Daniels — GitHub

I love that fully half of this article is about mentoring developing SREs in identifying and managing risk.

  Ross Brodbeck

Learn how the Honeycomb SRE team has structured its work, including a full copy of the team charter.

  Fred Hebert — Honeycomb
  Full disclosure: Honeycomb is my employer and I am a member of the SRE team described in this article.

An intriguing approach: define technical debt as a risk, and manage it in much the same way that we handle reliability-related risks, with a “threat budget”.

  Jason Bloomberg — Intellyx

Instead, because our time and attention is limited, we have to get good at identifying cues to indicate that our models have gotten stale or are incorrect.

  Lorin Hochstein

Using a simulation, this article comes to the conclusion that a hybrid between FIFO and LIFO is better than picking just one.

  Eugene Retunsky — DZone

A production of Tinker Tinker Tinker, LLC