SRE Weekly Issue #101

Articles

It’s Sysadvent season again! This article is a great introduction to the idea that there is never just one root cause in an incident.

4 Chaos Experiments to Start With

Want to try out chaos engineering? Here are four kinds of terrible things you can do to your infrastructure, from the folks at Gremlin.

Load Balancing Strategies for HashiCorp Consul

To be clear, this is about using Consul as part of load balancing another service, not load-balancing Consul itself. Several methods are discussed, along with the pros and cons of each.

Is Root Cause Analysis Dead or Are We Just Getting Started?

This article has some interesting ideas, including automated root cause discovery or at least computer-assisted analysis. It also contains this week’s second(!) Challenger shuttle accident reference.

sysadvent: Day 6 – sysadmins – the evolution of a role amidst revolutionary hype.

As job titles change, this author argues that the same basic operations skills are still applicable.

Black Friday & Cyber Monday Performance Report 2017

Here’s Catchpoint’s yearly round-up of how various sites fared over the recent US holiday period.

Monitoring, Analytics, Diagnostics, Observability, and Root Cause Analysis

These terms mean similar things, and sometimes some of them are used interchangeably. Baron Schwartz sets the record straight, defining each term and explaining the distinctions between them.

PostMortems and Proactive Learning From Events

If you have a moment, please consider filling out this survey by John Allspaw:

[…] I’m looking to understand what engineers in software-reliant companies need in learning better from post-incident reviews.

Google Cloud Platform Blog: Getting the most out of shared postmortems

In a continuation of last week’s article, Google’s CRE team discusses sharing a postmortem with customers. “Sharing” here means not only giving it to them, but actually working on the postmortem process together with customers, including assigning them follow-up actions(!).

sysadvent: Day 8 – Breaking in a New Company as an SRE

SRE Amy Tobey approached a new SRE gig with a beginner’s mind and took notes. The result is a useful set of lessons learned and observations that may come in handy next time you change jobs.

Outages

Facebook
Coinbase (Bitcoin exchange)
Zimbabwe
- Zimbabwe suffered two coincidental fiber cuts. Here’s a related article on the growing concern of undersea fiber cuts: (link)
Nationwide (bank)
Gemini (Bitcoin exchange)
NiceHash (Bitcoin exchange)
- Not an outage per se, but I had to include this since it’s the third Bitcoin-related incident this week. Thieves broke into NiceHash’s systems and stole $78 million (USD) worth of bitcoins. Of course, the actual value of the theft changes almost by the minute…
