SRE Weekly Issue #6

Articles

A discussion of failing fast, degrading gracefully, and applying back-pressure to avoid cascading failure in a service-oriented architecture.

Many times, it’s our own internal services which cause the biggest DoS attacks on ourselves.
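
To make the fail-fast and graceful-degradation ideas concrete, here is a minimal sketch, not taken from the article. The names `ConcurrencyLimiter`, `Overloaded`, and `fetch_recommendations` are hypothetical, and the single-process semaphore stands in for whatever load-shedding mechanism a real service would use.

```python
import threading


class Overloaded(Exception):
    """Raised when the limiter rejects a call instead of queueing it."""


class ConcurrencyLimiter:
    """Caps in-flight calls to a dependency (hypothetical helper)."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, fn, *args):
        # Fail fast: refuse immediately rather than waiting for a slot,
        # so a slow dependency cannot pile up work behind it.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("too many requests in flight")
        try:
            return fn(*args)
        finally:
            self._slots.release()


def fetch_recommendations(limiter, backend, user_id, fallback=()):
    # Degrade gracefully: when the limiter pushes back, serve a cheap
    # fallback instead of propagating the failure to the caller.
    try:
        return limiter.call(backend, user_id)
    except Overloaded:
        return list(fallback)
```

The rejection is the back-pressure signal: the caller learns right away that the dependency is saturated and can shed or simplify the work instead of retrying blindly.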

Kernel Patching 101: How to Make Repairs Without System Downtime

A SUSE developer introduces kGraft, SUSE’s system for live kernel patching. Anyone who survived the AWS reboot-a-thon is probably a big fan of live kernel patching solutions.

Not Everything Critical is Urgent. Learn the Difference.

One thing that is critical is avoiding burnout in on-call. This article is a description of the “urgency” feature in PagerDuty, but they make a generally applicable point: don’t wake someone for something just because it’s critical; only wake them if it needs immediate action.

Fallacies of Distributed Computing Explained

This is a review/update of the 1994 article. The fallacies still hold true, and anyone designing a large-scale service should heed them. The fallacies:

The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn’t change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.

As I get into SRE Weekly, I repeatedly run across articles that I probably should have read long since in my career. Hopefully they’re new to some of you, too.

Delivering safer cars faster through automation and continuous delivery

Every position I’ve held has involved supporting reliability in a 24/7 service, but let’s be realistic: it’s unlikely someone would have died as a result of an outage. In cars, reliability takes on a whole new meaning. I first got interested in MISRA and the other standards surrounding the code running in cars when I read some technical write-ups of the investigation surrounding the “unintended acceleration” incidents a few years back. This article discusses how DevOps practices are being applied in the development of vehicle code.

Security experts confirm Ukraine power grid blackout a ‘coordinated intentional attack’

Evidence has come out that the recent major power outage in Ukraine was a network-based attack (I can’t make myself say “cyber-” anything).

PS4 porn viewers rocket during PSN outage

I should have seen this coming.

Verizon grounds JetBlue — how could that happen? Another plan B gone bad

One blogger’s take on the JetBlue outage.

It’s very hard to create an entirely duplicate universe where you can test plan B. And it’s even hard to keep on testing it regularly and make sure it actually works. To wit: Your snow plow often doesn’t start after the first snow because it’s been sitting idle all summer.

SRECon16 Call for Participation

The SRECon call for participation is now open!

LostPass

Sean Cassidy has discovered an easy phishing method for LastPass in Chrome that is indistinguishable from the real thing, with a slightly less simple and effective method for Firefox. This one’s important for availability because many organizations rely heavily on LastPass. Compromising the right employee’s vault could spell big trouble and possibly downtime.

Outages

GTA Online
EE (phone network)
Amplitude
- A truly heinous multi-day outage for Amplitude. The root cause appears to be inadvertent deletion of data in DynamoDB. Thanks to the folks at Amplitude for the extremely detailed status and analysis. Get some sleep, folks.
PlayStation Network
Xbox Live Down
JetBlue
- This one was all over the news. JetBlue points the finger at a Verizon datacenter outage.
TalkTalk
Yahoo Mail
