SRE Weekly Issue #355

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

I’m trying something new: I’m looking for input from you, dear readers!

This link is a Google Form where I’m asking for ideas that I might turn into a blog post or conference talk. If you’re game, I’d love to hear what you think.

Here’s the panel for this webinar:

  • Vanessa Huerta Granda (Jeli)
  • Emily Ruppe (Jeli)
  • Liz Fong-Jones (Honeycomb)
  • Fred Hebert (Honeycomb)

Honestly, with that set of names, I’d listen even if they were just discussing the weather.
  Full disclosure: Honeycomb, my employer, is mentioned.

This week saw an outage of the NOTAM system which disseminates important information to aircraft pilots in the US. As a result, all flights in the US were grounded.

There’s not much in the way of interesting detail available yet, but I did see a mention of this air incident in which NOTAMs played a significant part. Mentour Pilot also covered this one.

  Admiral Cloudberg

In essence, this new reliability is:

  1. The health of your system
  2. Weighed based on customer expectations and happiness
  3. Prioritized based on your current capabilities

This article focuses on the sociotechnical aspects of reliability.

  Jim Gochee - The New Stack

Here are some guidelines for what kind of alerting works best for services at various stages of maturity.

  Ali Sattari

The actions we take to avert a potential problem can introduce their own risks.

  Will Gallego

This one’s from the incident.io folks.

  incident.io

I often meet with skepticism when I say that server monitoring systems should only page when a service stops doing its work.

Read on to find out why.

  Dan Slimmon

SRE Weekly Issue #354

Articles

This episode of DisasterCast discusses what happens when attempts to make things safer backfire.

by trying to suppress small problems, we create a reservoir of danger waiting to burst out

  Drew Rae

These images offer a glimpse into the visual patterns that appear in our variables and time-series, and the beauty that emerges from chaos. Some of the images in these galleries appeared during difficult rollouts, and some even during production incidents. All come from graphs generated by Google’s monitoring systems.

  Google

The popular slogan says “test in production”, but what if your business simply doesn’t allow it?

For any scenario where I expect to be causing client impact, I’d rather test in non-production than not test at all, since production is clearly off the table.

  Christina Yakomin - InfoQ

There’s been a trend toward narrating our engineering work on company blogs, without which this newsletter probably wouldn’t exist.

  Jordan Teicher - New York Times

My team recently moved databases from local files in the codebase to an online Database.

It didn’t go quite as planned, but they got there in the end.

  Kaustubh Hiware - Mercari

In Product Analytics we wanted to support our colleagues in SRE, so we created a model to predict the monetary costs of incidents affecting our conversion funnel.

  Enrique Hernani Ros - HelloFresh

There’s some interesting detail here about multiple failed UPSes and an accidental voltage mismatch exacerbating the situation.

  Laura Dobberstein - The Register

SRE Weekly Issue #353

Articles

This article contains:

two reasons why site reliability engineers may be part of IT teams for years to come, and two reasons why site reliability engineering may turn out just to be a fad.

  Christopher Tozzi - ITPro Today

This article proposes an interesting method for incident investigation: constantly try to disprove your hypotheses to avoid confirmation bias.

  Ivan Merill - Fiberplane

How I’ve managed to run this newsletter for almost 7 years without a single mention of the Therac-25 incidents is beyond me. Therac-25 is an important lesson for all of us as we design systems and analyze incidents.

  Adam Fabio - Hackaday

Even though this happened 14 years ago, the cause is very much still relevant today. If two bit-flips in the same TCP packet happen to cancel each other out, the packet will still pass the checksum.

  Poppy Linden - Linden Lab
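
To illustrate the checksum point in the item above, here is a minimal sketch of how two bit-flips can cancel out. It assumes the 16-bit one's-complement sum described in RFC 1071 and applies it to a bare 4-byte payload; a real TCP checksum also covers the header and a pseudo-header, and the specific flips in the Linden Lab incident may have looked different. The helper names and the sample bytes are made up for this example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with one bit inverted (hypothetical helper)."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)


# Illustrative payload: two 16-bit words, 0x0001 and 0x0000.
original = bytes([0x00, 0x01, 0x00, 0x00])

# One flip alone changes the sum, so the checksum catches it.
single = flip_bit(original, 1, 0)

# A second flip in the same bit position of the other word, in the
# opposite direction (1 -> 0 in one word, 0 -> 1 in the other),
# leaves the sum unchanged.
double = flip_bit(single, 3, 0)

print(internet_checksum(original) == internet_checksum(single))
print(internet_checksum(original) == internet_checksum(double))
```

The cancellation only happens for complementary flips in the same bit position of different 16-bit words, so most double flips are still detected. That is why application-level integrity checks remain worthwhile on top of TCP.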

This article proposes two criteria: Actionability and Investigability.

  Dan Slimmon

This incident write-up chronicles an incident in which a poison pill message repeatedly crashed their Heroku app.

  Lawrence Jones - incident.io

Take this one with a grain of salt since there’s a fair bit of counterfactual reasoning in the description. Nevertheless there’s a lot to learn from this and Wikipedia’s article on the same accident.

  Admiral Cloudberg

SRE Weekly Issue #352

Articles

Incident duration and severity are not related, and we have the in-depth data to prove it.

It’s time for another VOID report! I’m glad this project is still going strong.

  Courtney Nash - Verica

I haven’t been paying attention to the recent attempts to legislate cloud provider reliability, and this article was a great catch-up. There’s a lot going on here.

  Jeff Martens - Metrist

I’m still trying to figure out how I feel about this one, but I’m definitely glad I read it.

  Fred Hebert

FireHydrant published this report with statistics from over 50,000 incidents experienced by their customers.

  FireHydrant

Want to get a solid understanding of how the Linux shells work, including file descriptors, process management, and sessions? This one goes really deep with lots of example programs.

  Viacheslav Biriukov

Check it out, Google search finally has a proper status page!

  Google

It’s one of those “awesome ___” repos on GitHub, this time for resources about writing SLOs.

  Steve Azzopardi (@steveazz)

If you’re going to classify incidents by “root cause”, try these on for size: “production pressure”, “goal conflicts”, and more in this article.

  Lorin Hochstein

Sure, the pilots were engaging in an activity that could be considered dubious. But what’s really worth digging into in this air accident is how surprise may have led them to forget their training on how to recover stable flight.

More on the same accident:

Β Β Admiral Cloudberg
