SRE Weekly Issue #191

A message from our sponsor, VictorOps:

Need a new SRE podcast? Then check out episode one of the new VictorOps podcast, Ship Happens. Engineering Manager Benton Rochester sits down with Bethany Abbott, TechOps Manager at NS1, to discuss on-call and the gender gap in tech.

http://try.victorops.com/sreweekly/ship-happens-episode-one

Articles

Check it out! A new zine dedicated to post-incident reviews. This first issue includes reprints of four real gems from the past month, plus one original article about disseminating lessons learned from incidents.

Emil Stolarsky and Jaime Woo

I swear, it’s like they heard me talking about anomaly detection last week. Anyone used this thing? I’d love to hear your experience. Better still, perhaps you’d like to write a blog post or article?

I know this isn’t Security Weekly, but this vulnerability has the potential to cause reliability issues, and it’s dreadfully simple to understand and exploit.

Hoai Viet Nguyen and Luigi Lo Iacono

In this incident followup from the archives, read the saga of a deploy gone horribly wrong. It took them hours and several experiments to figure out how to right the ship.

CCP Goliath — EVE Online

The best practices:

  1. Create a culture of experimentation
  2. Define what success looks like as a team
  3. Statistical significance (see the sketch after this list)
  4. Proper segmentation
  5. Recognize your biases
  6. Conduct a retro
  7. Consider experiments during the planning phase
  8. Empower others
  9. Avoid technical debt

Dawn Parzych — LaunchDarkly
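
Number 3 is the one I see teams hand-wave most often. Just as a rough illustration (this isn’t from the article), here’s a minimal two-proportion z-test you might run on an experiment’s conversion counts; the numbers below are made up, and the usual caveat applies: pick your significance threshold before you look at the results.

    # Minimal two-proportion z-test for an A/B experiment (illustrative only).
    # The counts below are made up; substitute your own experiment's results.
    from math import sqrt, erf

    def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
        """Return (z, two-sided p-value) for H0: both variants convert equally."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        # Two-sided p-value from the standard normal CDF.
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
        return z, p_value

    # Example: control converted 480/10,000 sessions, treatment 540/10,000.
    z, p = two_proportion_z_test(480, 10_000, 540, 10_000)
    print(f"z = {z:.2f}, p = {p:.4f}")  # call it a win only if p < your pre-chosen alpha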

Mantis uses an interesting stream processing / subscriber model for observability tooling.

From the beginning, we’ve built Mantis with this exact guiding principle in mind: Let’s make sure we minimize the costs of observing and operating our systems without compromising on required and opportunistic insights.

Cody Rioux, Daniel Jacobson, Jeff Chao, Neeraj Joshi, Nick Mahilani, Piyush Goyal, Prashanth Ramdas, and Zhenzhong Xu — Netflix
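
I haven’t dug into Mantis’s internals, but the cost-minimizing idea is roughly “don’t compute or ship anything nobody has asked for.” Here’s a toy sketch of that on-demand subscriber model (my own illustration, not the Mantis API):

    # Toy sketch of an on-demand observability stream: events are only produced
    # and delivered while at least one subscriber is attached. Illustration of
    # the general idea, not the Mantis API.
    from typing import Callable, Dict

    class OnDemandStream:
        def __init__(self) -> None:
            self._subscribers: Dict[int, Callable[[dict], None]] = {}
            self._next_id = 0

        def subscribe(self, handler: Callable[[dict], None]) -> int:
            sub_id = self._next_id
            self._subscribers[sub_id] = handler
            self._next_id += 1
            return sub_id

        def unsubscribe(self, sub_id: int) -> None:
            self._subscribers.pop(sub_id, None)

        def emit(self, make_event: Callable[[], dict]) -> None:
            # Key cost-saving trick: if nobody is listening, the event is never
            # even constructed, so instrumentation overhead stays near zero.
            if not self._subscribers:
                return
            event = make_event()
            for handler in self._subscribers.values():
                handler(event)

    # Usage: the expensive event is only built while a subscriber exists.
    stream = OnDemandStream()
    stream.emit(lambda: {"latency_ms": 42})            # dropped for free: no subscribers
    sid = stream.subscribe(lambda e: print("got", e))
    stream.emit(lambda: {"latency_ms": 42})            # delivered to the subscriber
    stream.unsubscribe(sid)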

choosing not to deploy on Fridays is very different than having the capability to deploy on Fridays. You should have the capability to deploy at any time.

We can’t ever be sure a deploy will be safe, but we can be sure that folks have plans for their weekend.

David Mangot — Mangoteque

Outages

  • Amazon Route 53
    • Route 53 had significant DNS resolution impairment.

      Their status site still doesn’t allow deep linking or browsing the archive in any kind of manageable way, so here’s the full text of their followup post (there’s a quick resolver-check sketch after the outage list):

      On October 22, 2019, we detected and then mitigated a DDoS (Distributed Denial of Service) attack against Route 53. Due to the way that DNS queries are processed, this attack was first experienced by many other DNS server operators as the queries made their way through DNS resolvers on the internet to Route 53. The attack targeted specific DNS names and paths, notably those used to access the global names for S3 buckets. Because this attack was widely distributed, a small number of ISPs operating affected DNS resolvers implemented mitigation strategies of their own in an attempt to control the traffic. This is causing DNS lookups through these resolvers for a small number of AWS names to fail. We are doing our best to identify and contact these operators, as quickly as possible, and working with them to enhance their mitigations so that they do not cause impact to valid requests. If you are experiencing issues, please contact us so we can work with your operator to help resolve.

  • Heroku
    • I’m guessing this stemmed from the Route 53 incident.

      Our infrastructure provider is currently reporting intermittent DNS resolution errors. This may result in issues resolving domains to our services.

  • Twitter
  • Yahoo Mail
  • Hosted Graphite
  • Discord
  • Google Cloud Platform
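
Back to the Route 53 impairment: if you suspect your upstream resolver is one of the ones applying an overly aggressive mitigation, one quick check is to compare it against a known public resolver. Here’s a rough sketch using dnspython; the bucket hostname and the 1.1.1.1 fallback are arbitrary examples I picked, not anything AWS recommends.

    # Rough sketch: compare how the system resolver and a public resolver answer
    # for an S3-style global name, to spot an upstream resolver that is dropping
    # queries as part of a DDoS mitigation. Requires dnspython; the bucket name
    # and the 1.1.1.1 fallback are arbitrary examples.
    import dns.resolver

    NAME = "example-bucket.s3.amazonaws.com"  # hypothetical bucket hostname

    def try_resolve(label: str, resolver: dns.resolver.Resolver) -> None:
        try:
            answers = resolver.resolve(NAME, "A")
            print(f"{label}: {[a.to_text() for a in answers]}")
        except Exception as exc:  # SERVFAIL, timeout, NXDOMAIN, ...
            print(f"{label}: lookup failed ({exc.__class__.__name__}: {exc})")

    system_resolver = dns.resolver.Resolver()            # uses the OS resolver config
    public_resolver = dns.resolver.Resolver(configure=False)
    public_resolver.nameservers = ["1.1.1.1"]

    try_resolve("system resolver", system_resolver)
    try_resolve("public resolver", public_resolver)
    # If the system resolver fails while the public one succeeds, your upstream
    # resolver may be the one applying the problematic mitigation.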