SRE Weekly Issue #422

A message from our sponsor, FireHydrant:

FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates.

The PIOSEE model is taught to pilots as a rubric for coming to a decision in a difficult aviation situation. As this article explains, we can also use it during IT incidents.

  Francisco Melo Jr.

What is high cardinality in monitoring systems? Here’s an excellent explanation that includes tips on how to manage cardinality.

  Ash P — SREPath
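
As a rough illustration of why cardinality grows so quickly: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below assumes hypothetical label names and counts that are not taken from the linked article.

```python
from math import prod


def series_upper_bound(label_value_counts):
    """Worst-case number of distinct time series for a single metric.

    label_value_counts maps each label name to how many distinct values
    that label can take. Every combination may become its own series.
    """
    return prod(label_value_counts.values())


# Hypothetical labels on an HTTP request counter.
labels = {"method": 5, "status": 10, "endpoint": 50}
print(series_upper_bound(labels))  # 2500

# Adding one unbounded label such as a user ID multiplies the total.
with_user = {**labels, "user_id": 10_000}
print(series_upper_bound(with_user))  # 25000000
```

Real systems usually land below this bound because not every combination occurs, but the multiplication explains why a single unbounded label tends to dominate.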

As Xero transitioned to a standard of “you build it, you run it”, more engineering teams suddenly became responsible for understanding and implementing observability. They designed this maturity model to help teams understand what they were aiming for and how to get there.

  Andrew Macdonald — Xero

With around 200 undersea fiber cuts worldwide per year, a fleet of ships is at the ready to pull up the cables and repair them.

  Josh Dzieza — The Verge

Last year, Cloudflare suffered a control plane outage when one of their datacenters lost power. They have since done significant work to improve their resilience to power outages, and that work was put to the test when the same datacenter lost power again.

  Matthew Prince, John Graham-Cumming, and Jeremy Hartman — Cloudflare

Going from non-remote to remote was challenging, but here’s how our team changed as we began working from home.

  Stefan Mikolajczyk — WeTransfer

Platform teams have a hugely important role to fill in the engineering organization. They are often the teams that enable other teams to move with more speed and safety. They can also quickly become disconnected from their customers.

  Ross Brodbeck

When your system successfully serves a degraded response to the customer, how should you count that toward your SLO? Is it success? Failure? Something in between?

  Niall Murphy
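
One way to make the three options concrete is to give degraded responses a weight between 0 and 1 when computing the SLI. This is a minimal sketch under assumed numbers. The `degraded_weight` parameter and the request counts are hypothetical and are not taken from the linked article.

```python
def availability(good, degraded, failed, degraded_weight):
    """Fraction of requests counted as successful.

    degraded_weight is the credit given to each degraded response:
    1.0 counts it as a success, 0.0 as a failure, and anything in
    between as partial credit.
    """
    if not 0.0 <= degraded_weight <= 1.0:
        raise ValueError("degraded_weight must be between 0 and 1")
    total = good + degraded + failed
    if total == 0:
        raise ValueError("no requests to evaluate")
    return (good + degraded_weight * degraded) / total


# Hypothetical request counts for one SLO window.
counts = {"good": 9_800, "degraded": 150, "failed": 50}

for label, weight in [("success", 1.0), ("failure", 0.0), ("partial", 0.5)]:
    print(label, availability(degraded_weight=weight, **counts))
```

With these counts the same window reports 0.995, 0.98, or 0.9875 depending on the policy, so the choice can decide whether a 99% target was met.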

Updated: April 28, 2024 — 8:28 pm
A production of Tinker Tinker Tinker, LLC