A discussion of failing fast, degrading gracefully, and applying back-pressure to avoid cascading failure in a service-oriented architecture.
Many times, it’s our own internal services which cause the biggest DoS attacks on ourselves.
A SUSE developer introduces kGraft, SUSE’s system for live kernel patching. Anyone who survived the AWS reboot-a-thon is probably a big fan of live kernel patching solutions.
One thing that is critical is avoiding burnout in on-call. This article is a description of the “urgency” feature in Pagerduty, but they make a generally applicable point: don’t wake someone for something just because it’s critical; only wake them if it needs immediate action.
This is a review/update of the 1994 article. The fallacies still hold true, and anyone designing a large-scale service should heed them. The fallacies:
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn’t change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
As I get into SRE Weekly, I repeatedly run across articles that I probably should have read long since in my career. Hopefully they’re new to some of you, too.
Every position I’ve held has involved supporting reliability in a 24/7 service, but let’s be realistic: it’s unlikely someone would have died as a result of an outage. In cars, reliability takes a whole new meaning. I first got interested in MISRA and the other standards surrounding the code running in cars when I read some technical write-ups of the investigation surrounding the “unintended acceleration” incidents a few years back. This article discusses how devops practices are being applied in the development of vehicle code.
Evidence has come out that the recent major power outage in Ukraine was a network-based attack (I can’t make myself say “cyber-” anything).
I should have seen this coming.
One blogger’s take on the JetBlue outage.
It’s very hard to create an entirely duplicate universe where you can test plan B. And it’s even hard to keep on testing it regularly and make sure it actually works. To wit: Your snow plow often doesn’t start after the first snow because it’s been sitting idle all summer.
The SRECon call for participation is now open!
Sean Cassidy has discovered an easy and indistinguishable phishing method for LastPass in Chrome, with a slightly less simple and effective method for Firefox. This one’s important for availability because many organizations rely heavily on LastPass. Compromising the right Employee’s vault could spell big trouble and possibly downtime.
- GTA Online
- EE (phone network)
- A truly heinous multi-day outage for Amplitude. The root cause appears to be inadvertent deletion of data in DynamoDB. Thanks to the folks at Amplitude for the extremely detailed status and analysis. Get some sleep, folks.
- PlayStation Network
- Xbox Live Down
- This one was all over the news. JetBlue points the finger at a Verizon datacenter outage.
- Yahoo Mail