SRE Weekly Issue #83

Articles

Decision fatigue is the diminishment of certain mental faculties after making many decisions. It can cause incidents, and just as importantly, it can make incident response more difficult. After reading this article, I’m wondering if I should be asking incident responders to stop and drink a glass of orange juice before making a tough call during an incident.

The mystery of the hanging S3 downloads

Here’s an interesting debugging session that plumbs some of the more obscure depths of TCP.

Serverless computing has landed: How IT Ops can adapt

What does DR look like if your system is serverless? How do you manage performance if you don’t control the thing that loads (and hopefully pre-caches) your code?

Incident Management for Operations (book)

The new book on incident response from the folks at Blackrock3 has arrived! They draw on their years of fire incident response experience to teach us how to resolve outages. I had the privilege of attending one of Blackrock3’s 2-day training sessions last week and I highly recommend it.

Support Driven Development: Listen now so you don’t hear it later

I like the idea of focusing on reducing customer pain points, even if they’re not directly due to bugs. After all, reliability is all about the customer experience.

ChAP: Chaos Automation Platform

Netflix’s ChAP tests a target microservice by creating experimental and control clusters and routing a small portion of traffic to them.

Starting the Avalanche

Microservice-based architecture is great, right? The problem is that the fan-out of backend requests can create an amplification vector for a DDoS attack. A small, carefully constructed API call from an attacker can result in a massive number of requests to services in the backend, taking them down.
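
To get a feel for the amplification, here is a rough back-of-the-envelope sketch. The layer structure and fan-out numbers are made up for illustration and are not taken from the article; the point is only that per-layer fan-out multiplies.

```python
def backend_requests(fanout_per_layer):
    """Count backend requests triggered by a single incoming API call.

    fanout_per_layer[i] is the (hypothetical) number of downstream calls
    that each request at layer i makes to layer i + 1.
    """
    total = 0
    requests_at_layer = 1  # the one API call arriving at the edge
    for fanout in fanout_per_layer:
        requests_at_layer *= fanout  # every request fans out again
        total += requests_at_layer
    return total


# Assumed example: the edge calls 10 services, each of those calls 5 more,
# and each of those calls 3 more.
amplification = backend_requests([10, 5, 3])
print(f"1 API call -> {amplification} backend requests")
```

With those assumed numbers, an attacker sending a modest rate of requests at the edge generates a couple of orders of magnitude more load in the backend.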

Learning From Failure and Success – Production Ready

The latest from Mathias Lafeldt is this article about post-hoc learning. He draws on Zwieback and Cook, reminding us that both success and failure are normal circumstances in complex systems.

It’s important to understand that every outcome, successful or not, is the result of a gamble.

GitHub – dastergon/awesome-chaos-engineering: A curated list of awesome Chaos Engineering resources

Remember Awesome SRE? The same author, Pavlos Ratis, has pulled together a ton of links on Chaos Engineering. Thanks, Pavlos!

GitHub – dastergon/postmortem-templates: A collection of postmortem templates

He’s also compiled this set of postmortem templates, drawn from various sources. He’s unstoppable!

Pingdom’s Live Map Shows You The State Of The Internet As It Happens

What a great idea, and I wish I’d known about it earlier! Pingdom uses their aggregate monitoring data to create a live map of the internet. Might be useful for those big events like the Dyn DDoS or the S3 outage.

Verizon points finger at Niantic for problems at Pokémon Go event

Last week, I reported on the disaster that was Niantic’s Pokémon Go live event. Verizon wants to assure us that it wasn’t a capacity issue on their part.

Outages

EC2 (us-east-1)
- Between 6:47 AM and 7:10 AM PDT we experienced increased launch failures for EC2 Instances, degraded EBS volume performance and connectivity issues for some instances in a single Availability Zone in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.
  
  This one seems to have affected several companies including Heroku and Rollbar.
Marketo
- Marketo failed to renew their domain name registration, reportedly due to a failure in their automated tooling.
Instagram
Report on July 7, 2017 incident | Gandi News
- Here’s one I missed from earlier this month.
  
  In all, 751 domains were affected by this incident, which involved an unauthorized modification of the name servers [NS] assigned to the affected domains that then forwarded traffic to a malicious site exploiting security flaws in several browsers.
  
  Thanks to an anonymous reader for this one.
Threat Stack Status – Config Audit Database Maintenance
- Another one I missed. This one appears to be a maintenance that went wrong.
  
  Thanks to an anonymous reader for this one.