SRE Weekly Issue #234


Last Sunday, after I finished putting SRE Weekly together, a major Internet backbone provider had an outage. There were so many resulting outages that I’m not even going to bother listing all of them in the Outages section.

Articles

How to Build Your SRE Team

I love the way this article portrays SRE by placing less emphasis on specific skills and more on a holistic approach to reliability.

Emily Arnott — Blameless

Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability

Incident review is an important part of the organizational learning process, but it can be practiced in a way where the focus shifts away from learning to fixing.

John Carroll (original paper)

Thai Wood — Resilience Roundup (summary)

AD 0001

My latest adventures in (negligently) running sreweekly.com. It started with a surprise AWS bill, and then it got kinda weird…

Lex Neva

Inside a CODE RED: Network Edition

Deep technical details on a series of recent incidents involving Basecamp.

Troy Toman — Basecamp

Questionable Advice: War Rooms? Really?!?

Here’s why eyes-on-glass constant monitoring won’t help and can be actively harmful.

Charity Majors

GitHub Availability Report: August 2020

In August, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will dive into updates to the GitHub Status Page and provide follow-up details on how we’ve addressed the incident mentioned in July’s report.

Keith Ballinger — GitHub

Analysis of Today’s CenturyLink/Level(3) Outage

Here are Cloudflare’s thoughts on what happened with Sunday’s Internet trouble.

Matthew Prince — Cloudflare

CenturyLink / Level 3 Outage Analysis

This is ThousandEyes’s analysis of the outage, which goes along similar lines to Cloudflare’s and includes a lot more detail.

Angelique Medina and Archana Kesavan — ThousandEyes
