SRE Weekly Issue #160

A message from our sponsor, VictorOps:

Establishing an effective post-incident review process and taking the time to execute on it makes a world of difference in software reliability. See this example of a post-incident review process that’s already helping SRE teams continuously improve:

http://try.victorops.com/sreweekly/post-incident-review-process

Articles

This is a long one, but trust me, it’s worth the read. My favorite part is where the author gets into mental models, hearkening back to the Stella Report.

Fred Hebert

When CDN outages occur, it becomes immediately clear who is using multiple CDNs and who is not.

A multi-CDN approach can be tricky to pull off, but as these folks explain, it can be critical for reliability and performance.

Scott Kidder — mUX

Full disclosure: Fastly, my employer, is mentioned.

This article explains five different phenomena that people mean when they say “technical debt”, and advocates understanding the full context rather than just assuming the folks that came before were fools.

/thanks Greg Burek

Kellan Elliott-McCrea

The work we did to get our teams aligned and our systems in good shape meant that we were able to scale, even with some services getting 40 times the normal traffic.

Kriton Dolias and Vinessa Wan — The New York Times

How does one resolve the emerging consensus for alerting exclusively on user-visible outages, with the undeniable need to learn about and react to things +before* users notice? Like a high cache eviction rate?

There’s a real gem in here, definitely worth a read.

Charity Majors (and Liz Fong-Jones in reply)

Being on-call will always involve getting woken up occasionally. But when that does happen, it should be for something that matters, and that the on-call person can make progress toward fixing.

Rachel Perkins — Honeycomb

Delayed replication can be used as a first resort to recover from accidental data loss and lends itself perfectly to situations where the loss-inducing event is noticed within the configured delay.

Andreas Brandl — GitLab

Outages

Updated: February 17, 2019 — 8:30 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme