SRE Weekly Issue #213

A message from our sponsor, VictorOps:

Major incidents lead to more alerts, more downtime and unhappy customers. See how modern DevOps-minded teams are building virtual war rooms to quickly mobilize cross-functional engineering and IT teams around major incidents – improving incident remediation while reducing burnout:

https://go.victorops.com/sreweekly-war-rooms-for-major-incidents

Articles

This is important, and well worth a read. Where’s the SRE connection? The article explains that the U.S. Surgeon General’s comment that masks are “not effective” led to a stigma against those that wear them here. That kind of unintended sociological effect is uncovered commonly in incident post-analysis.

Sui Huang

Pagerduty ran the numbers and discovered an increase in incidents recently, especially in certain companies.

Rachel Obstler — PagerDuty

Here’s the scoop on all those GitHub incidents in February.

Keith Ballinger — GitHub

No, it won’t be possible to continue operating business-as-usual. For the unforeseeable future, teams across the world will be dealing with cutbacks, infrastructure instability, and more. However, with SRE best practices, your team can embrace resilience and adapt through this difficult time.

Hannah Culver — Blameless

5 tips for incident management when you’re suddenly remote

I love the concept of “ephemeral information”, that is, discussions that happen out-of-band, making it much harder to analyze the incident after the fact.

Blake Thorne — Atlassian

Grey failure turned a seemingly reasonable auto-recovery mechanism into a DoS caused by a thundering herd.

Panagiotis Moustafellos, Uri Cohen, and Sylvain Wallez — Elastic

Outages

Updated: March 29, 2020 — 9:05 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme