SRE Weekly Issue #245

Articles

Trust Asia 2021 has produced inconsistent STHs

A Certificate Transparency (CT) log failed, resulting in its permanent retirement. The incident involved unintended effects from load testing being performed in a staging environment. I have a huge amount of admiration and respect for the transparency of certification authorities (CAs) when things go wrong.

Trust Asia

Knowing your systems and how they can fail: Twilio and AWS talk at Chaos Conf 2020

I like the idea that adding the ability to fail over to your system makes it much more complicated and thus more likely to fail.

Andre Newman — Gremlin

Building for reliability at HelloSign

This one introduces some interesting concepts: the error kernel and property testing.

Kenneth Cross — HelloSign

Tech Startup Dilemmas: Resilient Deployment vs. Exhaustive Tests

[…] to be resilient, we must test everything, which consumes time that we don’t spend innovating. A good trade-off is to test in production.

Xavier Grand — Algolia

8 Tips to Create an Accurate and Helpful Post-Mortem Incident Report

More useful tips as you develop your post-incident analysis process. I like their definition of “blameless”.

Zachary Flower — Splunk

Achieving exactly-once message processing with Ably

Exactly once delivery is hard to implement and requires explicit coordination at all levels, including the client. Ably explains how their flavor works.

Paddy Byers — Ably

Why you should frequently turn down ~30% of canary instances

The most effective (if scary) way to understand how your stateless service operates under load

Utsav Shah — Software at Scale

The Engineer’s Guide to Preparing for Black Friday 2020

Some good tips here — and a reminder that we may see even more traffic than normal due to social distancing.

Outages

ASX (Australian Stock Exchange)
Coinbase
GoDaddy
- GoDaddy’s statement took care to explicitly state that the outage was not a security incident. This may be because they appear to have had an unrelated security incident around the same time, and some customer domains were taken over.
Nest

SRE Weekly Issue #245

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, StackHawk:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues