SRE Weekly Issue #234

Last Sunday, a major Internet backbone provider had an outage after I finished putting SRE Weekly together. There were so many outages that I’m not even going to bother listing all of them in the Outages section.

A message from our sponsor, StackHawk:

Everyone talks about shifting security left, but in many cases, it isn’t happening. There is a better way with developer-centric application security testing.
https://www.stackhawk.com/blog/align-engineering-security-appsec-tests-in-ci?utm_source=SREWeekly

Articles

I love the way this article portrays SRE by placing less emphasis on specific skills and more on a holistic approach to reliability.

Emily Arnott — Blameless

Incident review is an important part of the organizational learning process, but it can be practiced in a way that shifts the focus from learning to fixing.

John Carroll (original paper)

Thai Wood — Resilience Roundup (summary)

My latest adventures in (negligently) running sreweekly.com. It started with a surprise AWS bill, and then it got kinda weird…

Lex Neva

Deep technical details on a series of recent incidents involving Basecamp.

Troy Toman — Basecamp

Here’s why eyes-on-glass constant monitoring won’t help and can be actively harmful.

Charity Majors

In August, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will dive into updates to the GitHub Status Page and provide follow-up details on how we’ve addressed the incident mentioned in July’s report.

Keith Ballinger — GitHub

Here are Cloudflare’s thoughts on what happened with Sunday’s Internet trouble.

Matthew Prince — Cloudflare

This is ThousandEyes’s analysis of the outage, which follows similar lines to Cloudflare’s but includes a lot more detail.

Angelique Medina and Archana Kesavan — ThousandEyes

Outages

Updated: September 6, 2020 — 8:41 pm
A production of Tinker Tinker Tinker, LLC