General

SRE Weekly Issue #325

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call; paging and adding responders; building the postmortem timeline; setting up reminders; and more. Book a demo (+ get a snazzy Rootly Lego set):
https://rootly.com/demo/

Articles

This article really upends the concept of “human error” in an intriguing way.

  Lorin Hochstein

A key part of building reliable systems is often overlooked: continuous learning.

In the highly dynamic CI/CD environment, engineers with stale or outdated knowledge of the system are less able to detect, diagnose or repair anomalous behavior in their systems […]

  Laura Maguire (jeli.io) — The New Stack

This is the story of how an organization transitioned from a single NOC-like on-call team to individualized alerts routed to the relevant team.

  Boris Cherkasky

This guide has a set of key factors you should consider when building a new SRE team in order to increase the likelihood of success.

  Ash P — SREPath

My favorite pitfall discussed in this article: avoid committing to every possible remediation action from every incident.

  incident.io

This article, written by one of the authors of the Google SRE book, is a critical look at the state of SRE and what the future holds.

Today, I believe we cannot successfully answer several key questions about SRE.

  Niall Murphy

This interview goes into the thorny challenges around building a reliable app based on third-party services. It delves into the lack of reliable reporting we commonly see from cloud service providers and what ideal reporting would look like.

  Jeff Martens (interviewing Jeff Smith) — Metrist

Outages

SRE Weekly Issue #324

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call; paging and adding responders; building the postmortem timeline; setting up reminders; and more. Book a demo (+ get a snazzy Rootly Lego set):
https://rootly.com/demo/

Articles

We’ll start off this week with a recap of a KubeCon talk that urges leaving the concept of “human error” behind.

  Jennifer Riggins — The New Stack
  Talk by Silvia Pina

Just to be clear, they’re saying the tips are written by Instacart’s first SRE — they’re not tips aimed oddly specifically at the second Instacart SRE. Good tips, too.

  Quentin Rousseau — Rootly
This article is published by my sponsor, Rootly, but their sponsorship did not influence its inclusion in this issue.

This is a really good point, and well argued. Then there’s an amusing bit at the end about alerting on the number of WARNING-level log messages generated by the system as a proxy for overall health.

  Chris Siebenmann
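
As a rough illustration of that last idea (my own minimal sketch in Python, not code from the article; the log path and thresholds are made up): count recent WARNING lines and alert when the rate looks abnormal.

    # Minimal sketch: alert when a log file accumulates too many WARNING
    # lines per sampling interval. Path and thresholds are illustrative,
    # and this ignores log rotation, which a real check would handle.
    import time

    LOG_PATH = "/var/log/myapp.log"   # hypothetical log file
    THRESHOLD = 50                    # max WARNINGs tolerated per interval
    INTERVAL = 300                    # seconds between checks

    def warning_count(path: str) -> int:
        with open(path) as f:
            return sum(1 for line in f if "WARNING" in line)

    last = warning_count(LOG_PATH)
    while True:
        time.sleep(INTERVAL)
        current = warning_count(LOG_PATH)
        if current - last > THRESHOLD:
            print(f"ALERT: {current - last} WARNINGs in the last {INTERVAL}s")
        last = current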

In this post, I’m going to expand on the values we’re currently using at Honeycomb to monitor on-call health, why we think they’re good, and some of the challenges we’re still encountering.

  Fred Hebert — Honeycomb

Internal and external communication are critical in an incident, second (perhaps) only to actually resolving the problem. Read this article to learn about who you need to communicate with, how to talk to them, and how to prepare in advance.

  Hannah Culver — PagerDuty

If you’re playing the hero role at your organization, you might be unintentionally masking the need for better incident management practices.

  Malcolm Preston — FireHydrant

Outages

SRE Weekly Issue #323

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call; paging and adding responders; building the postmortem timeline; setting up reminders; and more. Book a demo (+ get a snazzy Rootly Lego set):
https://rootly.com/demo/

Articles

I chatted with Emily Arnott of Blameless for a solid hour about everything from the origins of this newsletter and how I make it, to my thoughts on SRE and where it’s going. Somehow she managed to fit it all into this article. Thanks, Emily!

  Emily Arnott — Blameless

The section on TTR (Time To Recovery) really caught my eye, both by confirming that MTTR is generally not a useful metric and by finding one case where TTR does seem to be predictive.

The Spotify engineering blog seems to be down as of publication time, so here’s the archive.org version.

  Clint Byrum — Spotify

SRE concepts apply wonderfully well to compliance and governance. Each field has a lot to learn from the other.

  Jennifer Riggins — The New Stack

More than ever, we should all be focused on shipping great products, retaining high-demand engineers, and building trust with customers. And investing in a thoughtful incident management strategy is one way to get there. Let’s explore how.

  Robert Ross — FireHydrant

At this week’s DevOps Enterprise Summit (DOES) Europe, Vanguard talked about how they moved from a traditional architecture to running mostly in the cloud, adopted site reliability engineering, and even built their own customer-facing SaaS.

  Jennifer Riggins — The New Stack

This article has a great discussion of the risks of larger, less frequent deploys. It goes on to explain how they transitioned to smaller and more frequent deploys while focusing on safety.

  Will Sewell — Monzo

What makes this article special is its focus on addressing the common concerns that people have when you try to get them to own their code for its full lifecycle. It offers practical advice to win folks over.

  Martha Lambert — incident.io

Sounds like there were some pretty great talks at SRECon. I gotta admit, I’m kinda having some FOMO.

  Emily Arnott — Blameless

Outages

SRE Weekly Issue #322

Bit of a short issue this week. This morning, I stepped on my phone, crushing it mightily beneath my bootheel. Unfortunately a lot of my automation for reviewing articles is on there… thank goodness I have functioning backups.

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call; paging and adding responders; building the postmortem timeline; setting up reminders; and more. Book a demo (+ get a snazzy Rootly Lego set):
https://rootly.com/demo/

Articles

What? Actually, it’s a pretty good analogy.

  Emily Arnott — Blameless

Mercari posted this update to their previous article on their embedded SRE team, with more details on how the embedding model works.

  Taichi Nakashima — Mercari

Interesting things happen when you combine tail latency with a microservice architecture.

  Marc Brooker
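
The effect is easy to see with a little arithmetic. If each downstream service independently has a 1% chance of responding at p99-tail latency (an illustrative assumption, not a figure from the article), a request that fans out to many services hits at least one slow tail surprisingly often:

    # Probability that a request touching n services sees at least one
    # p99-tail response, assuming independent latencies (a simplification).
    def p_hit_tail(n_services: int, tail_prob: float = 0.01) -> float:
        return 1 - (1 - tail_prob) ** n_services

    for n in (1, 10, 50, 100):
        print(f"{n:3d} services -> {p_hit_tail(n):.0%} chance of a tail hit")
    # prints roughly 1%, 10%, 39%, and 63%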

Their starting point was paging for every single exception raised by their application. Here’s how they tempered that a bit to get a handle on their paging volume.

  Lisa Karlin Curtis — incident.io
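
One common way to temper exception-based paging (my illustration of the general technique, not necessarily what incident.io did): fingerprint exceptions and page only when a fingerprint crosses a rate threshold, rather than on every occurrence.

    # Illustrative sketch: page only when one exception fingerprint
    # exceeds a rate threshold, instead of paging on every exception.
    # Threshold and window values are made up.
    import time
    from collections import defaultdict

    PAGE_THRESHOLD = 10   # occurrences per window before we page
    WINDOW = 600          # 10-minute window

    counts = defaultdict(list)

    def should_page(exc: BaseException) -> bool:
        fingerprint = f"{type(exc).__name__}:{exc.args[:1]}"
        now = time.time()
        recent = [t for t in counts[fingerprint] if now - t < WINDOW]
        recent.append(now)
        counts[fingerprint] = recent
        return len(recent) >= PAGE_THRESHOLD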

This article draws from the “SRE Hierarchy” in Google’s SRE book (which itself is a reference to Maslow’s hierarchy of needs). It recasts the SRE hierarchy as a path to maturity.

  Ash P. — SREPath

Google posted this summary of an incident from late April. A configuration change had the unintended effect of causing livestream view requests to fail.

  Google

Outages

  • Xbox
    • I don’t normally bother with game outages, but this one caught my eye. During the 4-day outage, customers were unable to play Xbox games that they had already purchased.

  • Twitter
  • Coinbase

SRE Weekly Issue #321

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom call; paging and adding responders; building the postmortem timeline; setting up reminders; and more. Book a demo (+ get a snazzy Rootly Lego set):
https://rootly.com/demo/

Articles

A researcher explains how they implemented their microservice failure testing tool at DoorDash. The tool, Filibuster, automatically discovers microservice dependencies and injects faults, avoiding the need to design specific individual failure scenarios.

  Christopher Meiklejohn — DoorDash
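
The core idea is to systematically inject each fault type at each discovered call site instead of hand-writing scenarios. A toy sketch of that loop (the call sites and fault types are made-up examples; this is not Filibuster's API, and Filibuster itself discovers dependencies automatically by instrumenting services):

    # Toy sketch of exhaustive fault injection across discovered call
    # sites. All names here are invented for illustration.
    import itertools

    call_sites = ["user-service", "cart-service", "payment-service"]
    faults = [ConnectionError, TimeoutError]

    def run_test(failing_site, fault):
        """Stand-in for running one test request with a fault injected."""
        print(f"injecting {fault.__name__} at {failing_site}")
        # ...run the request and assert the system degrades gracefully...

    for site, fault in itertools.product(call_sites, faults):
        run_test(site, fault)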

Last week, I shared Atlassian’s outage write-up. This link is a Twitter thread with a critique.

I feel like it is perhaps not a “good look” to repeatedly try to sell your product in your writeup about your product’s catastrophic outage

  @ReinH

“Error” serves a number of functions for an organization: as a defense against entanglement, the illusion of control, as a means for distancing, and as a marker for a failed investigation.

  Eric Dobbs

This is a write-up posted in January for an incident that occurred during an infrastructure migration. I feel like I can relate to every one of the learnings.

  Enom (Tucows)

In the past two years, I’ve been participating in on-call rotations as a Site Reliability Engineer at Vinted. Here are some of the practical lessons I’ve learned about the process.

  Ernestas Narmontas

This article is all about finding out what risks exist that may impact your ability to meet your SLOs. Once you’ve done that, you can determine whether your SLOs are realistic.

  Ayelet Sachto — Google
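
One concrete way to run that kind of analysis (a sketch of the general technique, with made-up numbers; not necessarily the article's method): estimate each risk's expected annual downtime and compare the total against the error budget your SLO allows.

    # Sketch: compare estimated annual downtime from known risks against
    # the error budget implied by an SLO. All figures are invented.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    slo = 0.999                                   # 99.9% availability target
    error_budget = (1 - slo) * MINUTES_PER_YEAR   # ~526 minutes/year

    # risk name: (incidents per year, minutes of downtime per incident)
    risks = {
        "bad deploy": (4, 30),
        "database failover": (2, 45),
        "upstream provider outage": (1, 120),
    }

    expected = sum(freq * mins for freq, mins in risks.values())
    print(f"budget {error_budget:.0f} min/yr, expected {expected:.0f} min/yr")
    print("SLO looks realistic" if expected <= error_budget else "SLO is at risk")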

When your organization chooses to implement SLOs, how do you get everyone on board? This two-part series has an in-depth look at how Klarna did it.

  Andrew Cartine — Klarna

Subtitle: And why do SRE teams need PMs?

After laying out the reasons why SREs need PMs, this article goes into detail about what a PM can bring to an SRE team.

  António Araújo — detech.ai

BellJar helps users find cyclic dependencies in their services by running totally isolated VMs and requiring users to explicitly enable every external dependency needed to bootstrap each service. It has a really neat feature: automatically generating runbooks based on these test cases.

  Christopher Bunn and Jie Huang — Meta
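
The detection step itself is classic graph territory. Here's a minimal cycle-finding sketch over a made-up service dependency graph (illustrative only, not BellJar's code):

    # Minimal DFS cycle detection over a service dependency graph.
    # The graph is an invented example with a deliberate cycle.
    deps = {
        "web": ["auth", "db"],
        "auth": ["db", "cache"],
        "cache": ["auth"],   # auth <-> cache form a cycle
        "db": [],
    }

    def find_cycle(graph):
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {node: WHITE for node in graph}

        def visit(node, path):
            color[node] = GRAY
            for dep in graph.get(node, []):
                if color[dep] == GRAY:   # back edge: cycle found
                    return path[path.index(dep):] + [dep]
                if color[dep] == WHITE:
                    cycle = visit(dep, path + [dep])
                    if cycle:
                        return cycle
            color[node] = BLACK
            return None

        for node in graph:
            if color[node] == WHITE:
                cycle = visit(node, [node])
                if cycle:
                    return cycle
        return None

    print(find_cycle(deps))   # -> ['auth', 'cache', 'auth']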

This week, I watched Netflix’s Meltdown: Three Mile Island, a documentary about the nuclear accident in the US in 1979. It’s not exactly a post-incident write-up, but there’s a lot in there about normalization of deviance, situational awareness, and risk-taking (both in and out of incidents).

  Netflix

Outages

  • Slack
  • Heroku
    • Heroku’s been dealing with a security incident since April 13. They performed a mass password reset of all accounts, and their GitHub integration has been disabled for days.

  • Roblox