SRE Weekly Issue #419

A message from our sponsor, FireHydrant:

FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates. https://firehydrant.com/blog/ai-for-incident-management-is-here/

Our nine-month journey to horizontally shard Figma’s Postgres stack, and the key to unlocking (nearly) infinite scalability.

Retrofitting sharding is a huge undertaking.

  Sammy Steele — Figma

Ride along as this company evolves from constantly shipping directly to production to a robust staging and internal canary deployment system.

  Greg Foster — Graphite

A lighthearted but still detail-filled take on a post-incident analysis for a short production outage.

  Greg Foster — Graphite

This one has an interesting discussion of the nature of reliability and the impact of multiple services on overall reliability, including possible mathematical models to use.

  Fitz — Temporal
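
For a sense of the kind of model involved, here is a minimal sketch. It assumes services fail independently, which real systems often violate, and it is not necessarily one of the models the article uses. The function names and the sample availability figures are made up for illustration.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request that needs every listed service to be up.

    Assumes independent failures (an assumption for this sketch).
    """
    return prod(availabilities)


def redundant_availability(availabilities):
    """Availability when any one of several independent replicas is enough."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figures: three 99.9% services in a chain, two 99% replicas.
chain = serial_availability([0.999, 0.999, 0.999])
pair = redundant_availability([0.99, 0.99])

print(f"three 99.9% services in series: {chain:.4%}")
print(f"two 99% replicas in parallel:   {pair:.4%}")
```

Under these assumptions, chaining dependencies pushes overall availability below that of any single service (roughly 99.7% for the chain here), while redundancy raises it.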

This episode of the SREPath Podcast covers a variety of themes around observability and SLOs. There’s a great text-based summary if that’s your preference.

  Ash Patel — SREPath

This piece argues that you should install system debugging tools on your production systems now, because it’s going to be really hard to do it live when you need them.

  Brendan Gregg

Following on from a previous article about the squiggliness of availability numbers, this article evaluates SLAs from four major companies to try to divine what they actually mean.

  Ross Brodbeck

I want to present real-life examples of both availability and latency SLOs, as they are more nuanced than they may initially appear. Also, I find it worthwhile sharing a detailed guide as it showcases uncommon uses of PromQL and demonstrates the language’s versatility.

  Michał Kaźmierczak

Updated: April 7, 2024 — 9:58 pm
A production of Tinker Tinker Tinker, LLC