SRE Weekly Issue #440

Continually testing our product with smoke tests

As part of designing their new paging product, incident.io created a set of end-to-end tests to exercise the system and alert on failures. Click through for details on how they designed the tests and lessons learned.

Rory Malcolm — incident.io

Unified Grid: How We Re-Architected Slack for Our Largest Customers

As Slack rolled out their new experience for large, multi-workspace customers, they had to re-work fundamental parts of their infrastructure, including database sharding.

Ian Hoffman and Mike Demmer — Slack

Heroku incident 2678 Followup: Issues with Essential Tier Databases in EU region

A third-party vendor’s Support Engineer […] acknowledged that the root cause for both outages was a monitoring agent consuming all available resources.

Heroku

Prepare to Be Unprepared: Investing in Capacity to Adapt to Surprises in Software-Reliant Businesses

Resilience engineering is about focusing on making your organization better able to handle the unexpected, rather than preventing repetition of the same incident. This article gives a thought-provoking overview of the difference.

John Allspaw — InfoQ

3 reasons traces are better than metrics for debugging

Metrics are great for many other things, but they can’t compete with traces for investigating problems.

Jean-Mark Wright

Good Retry, Bad Retry: An Incident Story

Through a fictional incident story, this article explains not just the benefits of retries but also how they can go wrong.

Denis Isaev — Yandex

Just use Postgres

Hot take? Sure, but they back it up with a well-reasoned argument.

Ethan McCue

Dealing with rejection (in distributed systems)

A detailed look at the importance of backpressure and how to use it to reduce load effectively, as implemented in WarpStream.

Richard Artoul — WarpStream


