SRE Weekly Issue #245

A message from our sponsor, StackHawk:

Check out how we have built our microservices in Kubernetes here at StackHawk.
https://sthwk.com/kube-services

Articles

A Certificate Transparency (CT) log failed, resulting in its permanent retirement. The incident involved unintended effects from load testing being performed in a staging environment. I have a huge amount of admiration and respect for the transparency of certification authorities (CAs) when things go wrong.

Trust Asia

I like the idea that adding the ability to fail over makes your system much more complicated and thus more likely to fail.

Andre Newman — Gremlin

This one introduces some interesting concepts: the error kernel and property testing.

Kenneth Cross — HelloSign
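For readers new to property testing, here is a minimal sketch of the idea, not taken from the linked article. Rather than asserting on a few hand-picked inputs, it generates many random inputs and checks that one property holds for all of them. The run-length encoder and the helper names are hypothetical, chosen only for illustration.

```python
import random


def run_length_encode(text):
    """Collapse runs of repeated characters into (char, count) pairs."""
    pairs = []
    for ch in text:
        if pairs and pairs[-1][0] == ch:
            pairs[-1] = (ch, pairs[-1][1] + 1)
        else:
            pairs.append((ch, 1))
    return pairs


def run_length_decode(pairs):
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)


def check_round_trip(trials=200, seed=0):
    """Property: decoding an encoded string always returns the original."""
    rng = random.Random(seed)  # fixed seed so a failure is reproducible
    for _ in range(trials):
        length = rng.randint(0, 20)
        text = "".join(rng.choice("ab") for _ in range(length))
        assert run_length_decode(run_length_encode(text)) == text
    return trials


check_round_trip()
```

In practice a dedicated library would also shrink failing inputs to a minimal case; this sketch only shows the core loop.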

[…] to be resilient, we must test everything, which consumes time that we don’t spend innovating. A good trade-off is to test in production.

Xavier Grand — Algolia

More useful tips as you develop your post-incident analysis process. I like their definition of “blameless”.

Zachary Flower — Splunk

Exactly-once delivery is hard to implement and requires explicit coordination at all levels, including the client. Ably explains how their flavor works.

Paddy Byers — Ably
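To make the point about client coordination concrete, here is a minimal sketch under my own assumptions, not Ably's implementation. The client picks a stable message ID and reuses it on every retry, and the receiver remembers which IDs it has already applied. The class name, ID format, and return values are all hypothetical.

```python
class DedupingReceiver:
    """Applies each message at most once, keyed by a client-chosen ID."""

    def __init__(self):
        self.seen_ids = set()
        self.applied = []

    def deliver(self, message_id, payload):
        # A retry of an already-applied message is acknowledged
        # but its payload is not applied a second time.
        if message_id in self.seen_ids:
            return "duplicate"
        self.seen_ids.add(message_id)
        self.applied.append(payload)
        return "applied"


receiver = DedupingReceiver()

# The client sends a message, loses the acknowledgement, and retries
# with the SAME ID. Reusing the ID is the client's side of the bargain.
first = receiver.deliver("client-1:42", "charge $5")
retry = receiver.deliver("client-1:42", "charge $5")
```

A real system would also need to persist the seen IDs and bound how long they are kept, which is where much of the difficulty lies.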

The most effective (if scary) way to understand how your stateless service operates under load

Utsav Shah — Software at Scale

Some good tips here — and a reminder that we may see even more traffic than normal due to social distancing.

Outages

Updated: November 22, 2020 — 8:47 pm
A production of Tinker Tinker Tinker, LLC