SRE Weekly Issue #290

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒. Automate manual admin tasks like creating the incident channel, Jira ticket, and Zoom meeting, paging the right team, building the postmortem timeline, setting up reminders, and more. Book a demo:
https://rootly.io/?utm_source=sreweekly

Articles

Despite carefully testing how they would handle this week’s expiration of the root CA that cross-signed Let’s Encrypt’s CA certificate, they had an outage. The reason? Poor behavior in OpenSSL. See the next article for a deeper explanation of what went wrong with OpenSSL.

Oren Eini — RavenDB

This article explains why some versions of OpenSSL are unable to validate certificates issued by Let’s Encrypt now, even though the certificates should be considered valid.

Ryan Sleevi

This says it all:

It turns out that the path to safety isn’t increased complexity.

Matt Asay — TechRepublic

The thrust of this article is that reliability applies to and should matter to the entire company, not just engineering. I really like the term “pitchfork alerting”.

Robert Ross — FireHydrant

Lesson learned: always make your application server’s timeout longer than your reverse proxy’s.

Ivan Velichko
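
The lesson above can be turned into a small configuration check. This is only a sketch: it assumes the timeout in question is an idle-connection timeout, where a proxy may otherwise reuse a connection the application server has just closed. The function name, parameters, and safety margin are hypothetical and do not come from the article.

```python
def timeouts_are_safe(proxy_timeout_s: float,
                      app_timeout_s: float,
                      margin_s: float = 1.0) -> bool:
    """Return True when the app server's timeout exceeds the proxy's.

    margin_s is an assumed safety buffer, not a value from the article.
    """
    return app_timeout_s >= proxy_timeout_s + margin_s


# Hypothetical values, for illustration only.
print(timeouts_are_safe(proxy_timeout_s=60, app_timeout_s=75))
print(timeouts_are_safe(proxy_timeout_s=60, app_timeout_s=5))
```

A check like this could run in CI against whatever files hold the two settings, so a later change to either value cannot silently reverse the ordering.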

Who deploys the deploy tool? The deploy tool, obviously — unless it’s down.

Lorin Hochstein

Their approach: group tables into “schema domains”, make sure that queries don’t span schema domains, and then move a schema domain to its own separate database cluster.

Thomas Maurer — GitHub
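
To make the "queries don't span schema domains" rule concrete, here is a minimal sketch of a check that flags a query touching tables from more than one domain. The table names, domain names, and helper functions are all hypothetical, and this is not a description of GitHub's actual tooling.

```python
from typing import Dict, Iterable, Set

# Hypothetical table-to-domain mapping, for illustration only.
SCHEMA_DOMAINS: Dict[str, str] = {
    "users": "identity",
    "sessions": "identity",
    "repositories": "repos",
    "commits": "repos",
}


def domains_touched(tables: Iterable[str],
                    domains: Dict[str, str] = SCHEMA_DOMAINS) -> Set[str]:
    """Return the set of schema domains that the given tables belong to."""
    tables = list(tables)
    unknown = sorted(t for t in tables if t not in domains)
    if unknown:
        # Every table must be assigned to a domain before it can be checked.
        raise KeyError(f"tables with no schema domain: {unknown}")
    return {domains[t] for t in tables}


def spans_domains(tables: Iterable[str],
                  domains: Dict[str, str] = SCHEMA_DOMAINS) -> bool:
    """True if a query over these tables would cross a domain boundary."""
    return len(domains_touched(tables, domains)) > 1


print(spans_domains(["users", "sessions"]))  # same domain
print(spans_domains(["users", "commits"]))   # crosses domains
```

Once no query crosses a boundary, the tables in one domain no longer depend on being co-located with the others, which is what makes moving that domain to its own cluster feasible.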

Groot is about helping figure out what’s wrong during an incident, not about analyzing an incident after the fact. I totally get why they need this tool, since they have over 5000 microservices!

Hanzhang Wang — eBay

SRE is a broad, overarching responsibility that needs a multitude of role considerations to pull off properly.

Ash P — Cruform

Outages

  • Heroku
    • (also this one) Heroku had a major outage that coincided with an Amazon EBS failure in a single availability zone in us-east-1. Heroku customers such as Dead Man’s Snitch were impacted.
  • Slack
    • Slack had a big disruption related to DNSSEC. Here’s an interesting analysis of what may have gone wrong (link).
  • Let’s Encrypt
    • Let’s Encrypt saw heavy traffic as everyone clamored to renew their certificates, causing certificate issuance to slow down.
  • Microsoft 365
  • Apple’s “Find My” service
  • Signal
  • Xero
    • This one coincided with the same Amazon EBS outage mentioned above. Xero also had another outage on October 1.
Updated: June 1, 2022 — 9:43 pm
A production of Tinker Tinker Tinker, LLC