SRE Weekly Issue #278

Articles

Whoa. This is the best thing ever. I feel like I want to make this the official theme song of SRE Weekly.

Forrest Brazeal

r/WallStreetBets Incident Anthology (What Worked Edition): Autoscaler

Their auto-scaling algorithm needed a tweak. Before: scale up by N instances. After: scale up by an amount proportional to the current number of instances.

Fran Garcia — Reddit
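
To make the before/after concrete, here is a minimal sketch of the two scale-up policies. It is not Reddit's actual autoscaler: the function names, the fixed step of 5, and the 20% growth rate are all hypothetical values chosen for illustration.

```python
def scale_up_fixed(current: int, step: int = 5) -> int:
    """Before: add a fixed number of instances, regardless of fleet size.

    The step of 5 is a hypothetical value for illustration.
    """
    return current + step


def scale_up_proportional(current: int, percent: int = 20, min_step: int = 1) -> int:
    """After: add an amount proportional to the current number of instances.

    The 20% growth rate and the minimum step of 1 are hypothetical values.
    Integer ceiling division avoids floating-point rounding surprises.
    """
    step = (current * percent + 99) // 100  # ceil(current * percent / 100)
    return current + max(min_step, step)


if __name__ == "__main__":
    for fleet in (10, 100, 1000):
        print(fleet, scale_up_fixed(fleet), scale_up_proportional(fleet))
```

With a fixed step, a large fleet grows by a shrinking fraction of its size on each scale-up, so it can fall behind a sudden surge. The proportional version keeps the relative growth constant as the fleet gets bigger.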

The Incident Review: 4 Incidents in Outer Space

Here’s a look at incidents and reliability challenges that have occurred in outer space, and what SREs stand to learn from them.

JJ Tang — Rootly

Prepare for overnight success — with the right load testing approach

This one includes 3 key things to remember while load testing. My favorite: test the whole system, not just parts.

Cortex

4 ways to improve your influence as an SRE

SRE is as much about building consensus and earning buy-in as it is about actual engineering.

Cortex

NoOps: What Does the Future Hold for DevOps Engineers?

The definition of NoOps in this article is clearer than others I’ve seen. It’s not about firing your operations team; their skill set is still necessary.

Kentaro Wakayama

Systems Observability

Even though I know what observability is, I got a lot out of this article. It has some excellent examples of questions that are hard to answer with traditional dashboards, and includes my new favorite term:

The industrial term for this problem is Watermelon Metrics; A situation where individual dashboards look green, but the overall performance is broken and red inside.

Nishant Modak and Piyush Verma — Last9

Controlling a process we don’t understand

Instead, we should consider the fields where practitioners are responsible for controlling a dynamic process that’s too complex for humans to fully understand.

Lorin Hochstein

Troubleshooting: A journey into the unknown

In this epic troubleshooting story, a weird curl bug coupled with Linux memory tuning parameters led to unexpected CPU consumption in an unrelated process.

Pavlos Parissis — Booking.com

How Back Market SREs prepared for Black Friday

Learning a lesson from a rough Black Friday in 2019, these folks used load testing to gather hard data on how they would likely fare in 2020.

Mathieu Garstecki — Back Market
