SRE Weekly Issue #230

Happy BTW: Wear a mask.

A message from our sponsor, StackHawk:

Add security testing to your CI pipelines with GitHub Actions. Check out this webinar recording (no email required) to learn how.
https://www.youtube.com/watch?v=W_7BxFgMYHs&time_continue=8

Articles

LaunchDarkly started off with a polling-based architecture and ultimately migrated to pushing deltas out to clients.

Dawn Parzych — LaunchDarkly

A brief overview of some problems with distributed tracing, along with a suggested alternative approach that uses AI.

Larry Lancaster — Zebrium

This is Google’s post-incident report for their Google Classroom incident on July 7.

Uber has long been a champion of microservices. Now, with several years of experience, they share the lessons they’ve learned and how they deal with some of the pitfalls.

Adam Gluck — Uber

This article opens with an interesting description of what the Cloudflare outage looked like from PagerDuty’s perspective.

Dave Bresci — PagerDuty

This post reflects on two distinct philosophies of safety: one holds that the engineering design should ensure that the system is safe, and the other holds that design alone cannot ensure that the system is safe.

Lorin Hochstein

You can’t use availability metrics to inform you about whether your system is reliable enough, because they can only tell you if you have a problem.

Lorin Hochstein

Outages

Updated: August 2, 2020 — 8:35 pm
A production of Tinker Tinker Tinker, LLC