As part of designing their new paging product, incident.io created a set of end-to-end tests to exercise the system and alert on failures. Click through for details on how they designed the tests and the lessons they learned.
Rory Malcolm — incident.io
As Slack rolled out their new experience for large, multi-workspace customers, they had to rework fundamental parts of their infrastructure, including database sharding.
Ian Hoffman and Mike Demmer — Slack
A third-party vendor’s Support Engineer […] acknowledged that the root cause for both outages was a monitoring agent consuming all available resources.
Heroku
Resilience engineering focuses on making your organization better able to handle the unexpected, rather than on preventing a repeat of the same incident. This article gives a thought-provoking overview of the difference.
John Allspaw — InfoQ
Metrics are great for many other things, but they can’t compete with traces for investigating problems.
Jean-Mark Wright
Through fictional storytelling, this article explains not just the benefits of retries, but how they can go wrong.
Denis Isaev — Yandex
Hot take? Sure, but they back it up with a well-reasoned argument.
Ethan McCue
A detailed look at the importance of backpressure and how to use it to reduce load effectively, as implemented in WarpStream.
Richard Artoul — WarpStream