A Certificate Transparency (CT) log failed, resulting in its permanent retirement. The incident involved unintended effects from load testing being performed in a staging environment. I have a huge amount of admiration and respect for the transparency of certification authorities (CAs) when things go wrong.
I like the idea that adding failover capability to your system makes it much more complicated and thus more likely to fail.
Andre Newman — Gremlin
This one introduces some interesting concepts: the error kernel and property testing.
Kenneth Cross — HelloSign
[…] to be resilient, we must test everything, which consumes time that we don’t spend innovating. A good trade-off is to test in production.
Xavier Grand — Algolia
More useful tips as you develop your post-incident analysis process. I like their definition of “blameless”.
Zachary Flower — Splunk
Exactly once delivery is hard to implement and requires explicit coordination at all levels, including the client. Ably explains how their flavor works.
Paddy Byers — Ably
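To make the "coordination at all levels, including the client" point concrete, here is a minimal sketch of one common ingredient: the client attaches a stable ID to each message and reuses it on retry, so the receiving side can discard duplicates. This is not Ably's implementation; `Broker`, `FlakyConnection`, and `publish_with_retry` are hypothetical names invented for illustration, and the in-memory set stands in for whatever durable deduplication store a real system would need.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Hypothetical in-memory broker that deduplicates by client-supplied ID."""
    seen_ids: set = field(default_factory=set)
    delivered: list = field(default_factory=list)

    def publish(self, message_id: str, payload: str) -> bool:
        # Deliver the payload only the first time this ID is seen.
        if message_id not in self.seen_ids:
            self.seen_ids.add(message_id)
            self.delivered.append(payload)
        # Acknowledge both new and duplicate publishes.
        return True


class FlakyConnection:
    """Hypothetical transport that loses the first acknowledgement."""

    def __init__(self, broker: Broker):
        self.broker = broker
        self.lose_next_ack = True

    def send(self, message_id: str, payload: str) -> bool:
        acked = self.broker.publish(message_id, payload)
        if self.lose_next_ack:
            # The broker processed the message, but the client never hears back.
            self.lose_next_ack = False
            return False
        return acked


def publish_with_retry(conn, message_id, payload, max_attempts=3):
    """Retry with the SAME message ID until an acknowledgement arrives."""
    for attempt in range(1, max_attempts + 1):
        if conn.send(message_id, payload):
            return attempt
    raise RuntimeError("no acknowledgement received")


broker = Broker()
conn = FlakyConnection(broker)
attempts = publish_with_retry(conn, "order-42", "charge card")
print(attempts)          # 2: the first ack was lost, so the client retried
print(broker.delivered)  # ['charge card']: delivered once despite two sends
```

The retry alone would give at-least-once delivery. Only the combination of a client-chosen ID and broker-side deduplication keeps the payload from being delivered twice, which is why the client has to take part.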
The most effective (if scary) way to understand how your stateless service operates under load.
Utsav Shah — Software at Scale
Some good tips here — and a reminder that we may see even more traffic than normal due to social distancing.