Articles
A Certificate Transparency (CT) log failed, resulting in its permanent retirement. The incident involved unintended effects from load testing being performed in a staging environment. I have a huge amount of admiration and respect for the transparency of certification authorities (CAs) when things go wrong.
Trust Asia
I like the idea that adding failover capability to your system makes it much more complicated and thus more likely to fail.
Andre Newman — Gremlin
This one introduces some interesting concepts: the error kernel and property testing.
Kenneth Cross — HelloSign
[…] to be resilient, we must test everything, which consumes time that we don’t spend innovating. A good trade-off is to test in production.
Xavier Grand — Algolia
More useful tips as you develop your post-incident analysis process. I like their definition of “blameless”.
Zachary Flower — Splunk
Exactly-once delivery is hard to implement and requires explicit coordination at all levels, including the client. Ably explains how their flavor works.
Paddy Byers — Ably
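To make the "coordination at all levels" point concrete, here is a minimal sketch of one common building block: a consumer that deduplicates on a client-assigned message ID, so that retries from an at-least-once transport take effect only once. This is not Ably's implementation; the `IdempotentConsumer` class, its fields, and the message IDs are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentConsumer:
    """Hypothetical consumer that applies each message ID at most once."""
    seen_ids: set = field(default_factory=set)
    total: int = 0

    def handle(self, message_id: str, amount: int) -> bool:
        # A redelivered message carries the same ID, so it is skipped.
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.total += amount
        return True


consumer = IdempotentConsumer()
# Assume the sender retried "m1" after a lost acknowledgement.
deliveries = [("m1", 10), ("m1", 10), ("m2", 5)]
results = [consumer.handle(mid, amt) for mid, amt in deliveries]
```

The sketch only works because the client assigns the ID and reuses it on retry, which is the kind of client-side cooperation the blurb refers to. A real system would also need to persist the seen IDs and bound how long they are kept.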
The most effective (if scary) way to understand how your stateless service operates under load.
Utsav Shah — Software at Scale
Some good tips here, and a reminder that we may see even more traffic than normal due to social distancing.
Outages
- ASX (Australian Securities Exchange)
- Coinbase
- GoDaddy
  - GoDaddy’s statement took care to explicitly state that the outage was not a security incident. This may be because they appear to have had an unrelated security incident around the same time, and some customer domains were taken over.
- Nest