SRE Weekly Issue #238


My daughters asked earlier today what I do at work, and I explained all about SRE, reliability, and the importance of work-life balance. They said to tell you they say hi!

Articles

On Call Shouldn’t Suck: A Guide For Managers

Lots of really great advice in here. And really, with a title like that, I couldn’t resist reading it!

Charity Majors

Follow-up for Google Cloud Infrastructure Components Incident #20010

Last week, I mentioned a Google Cloud Platform outage that affected multiple services. Here’s the detailed post-analysis by Google.

Google

Team Play with a Powerful and Independent Agent: A Full-Mission Simulation Study

This one is along the lines of the classic Ironies of Automation paper by Bainbridge. How can automation be a team player, and what happens when it isn’t?

Nadine Sarter and David Woods (original paper)

Thai Wood — Resilience Roundup (summary)

Applying Chaos Engineering in Healthcare: Getting Started with Sensitive Workloads

How can you use chaos engineering when failures in the system can be critical and even life-threatening?

Carl Chesser — InfoQ

This is your Guide for Implementing SRE in NOCs

In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.

Emily Arnot — Blameless

Is your microservice a distributed monolith?

This article suggests using chaos engineering to tell if your microservice-based architecture is secretly a monolith in disguise.

Andre Newman — Gremlin

Outages

Slack
Radware
- An accidental BGP hijack by Telstra took down Radware.
Twitter
Tokyo Stock Exchange
- The Tokyo Stock Exchange was down for an entire day, the first time that’s ever happened.
Fastly
Squarespace
Google Search Indexing
Microsoft Azure outage #SM79-F88
- A problem with Azure Active Directory caused trouble for Office 365 and other Microsoft services. Click through for their detailed follow-up.

