SRE Weekly Issue #191

A message from our sponsor, VictorOps:

Need a new SRE podcast? Then check out episode one of the new VictorOps podcast, Ship Happens. Engineering Manager Benton Rochester sits down with Bethany Abbott, TechOps Manager at NS1, to discuss on-call and the gender gap in tech.

http://try.victorops.com/sreweekly/ship-happens-episode-one

Articles

Check it out! A new zine dedicated to post-incident reviews. This first issue includes reprints of four real gems from the past month, plus one original article about disseminating lessons learned from incidents.

Emil Stolarsky and Jaime Woo

I swear, it’s like they heard me talking about anomaly detection last week. Anyone used this thing? I’d love to hear your experience. Better still, perhaps you’d like to write a blog post or article?

I know this isn’t Security Weekly, but this vulnerability has the potential to cause reliability issues, and it’s dreadfully simple to understand and exploit.

Hoai Viet Nguyen and Luigi Lo Iacono

In this incident followup from the archives, read the saga of a deploy gone horribly wrong. It took them hours and several experiments to figure out how to right the ship.

CCP Goliath — EVE Online

The best practices:

  1. Create a culture of experimentation
  2. Define what success looks like as a team
  3. Statistical significance (see the sketch after this list)
  4. Proper segmentation
  5. Recognize your biases
  6. Conduct a retro
  7. Consider experiments during the planning phase
  8. Empower others
  9. Avoid technical debt

Dawn Parzych — LaunchDarkly
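
Number 3 is the one I see teams hand-wave most often. Just as a rough illustration (this isn’t from the article), here’s a minimal two-proportion z-test you might run on an experiment’s conversion counts; the numbers below are made up, and the usual caveat applies: pick your significance threshold before you look at the results.

    # Minimal two-proportion z-test for an A/B experiment (illustrative only).
    # The counts below are made up; substitute your own experiment's results.
    from math import sqrt, erf

    def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
        """Return (z, two-sided p-value) for H0: both variants convert equally."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        # Two-sided p-value from the standard normal CDF.
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
        return z, p_value

    # Example: control converted 480/10,000 sessions, treatment 540/10,000.
    z, p = two_proportion_z_test(480, 10_000, 540, 10_000)
    print(f"z = {z:.2f}, p = {p:.4f}")  # call it a win only if p < your pre-chosen alpha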

Mantis uses an interesting stream processing / subscriber model for observability tooling.

From the beginning, we’ve built Mantis with this exact guiding principle in mind: Let’s make sure we minimize the costs of observing and operating our systems without compromising on required and opportunistic insights.

Cody Rioux, Daniel Jacobson, Jeff Chao, Neeraj Joshi, Nick Mahilani, Piyush Goyal, Prashanth Ramdas, and Zhenzhong Xu — Netflix
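
I haven’t dug into Mantis’s internals, but the cost-minimizing idea is roughly “don’t compute or ship anything nobody has asked for.” Here’s a toy sketch of that on-demand subscriber model (my own illustration, not the Mantis API):

    # Toy sketch of an on-demand observability stream: events are only produced
    # and delivered while at least one subscriber is attached. Illustration of
    # the general idea, not the Mantis API.
    from typing import Callable, Dict

    class OnDemandStream:
        def __init__(self) -> None:
            self._subscribers: Dict[int, Callable[[dict], None]] = {}
            self._next_id = 0

        def subscribe(self, handler: Callable[[dict], None]) -> int:
            sub_id = self._next_id
            self._subscribers[sub_id] = handler
            self._next_id += 1
            return sub_id

        def unsubscribe(self, sub_id: int) -> None:
            self._subscribers.pop(sub_id, None)

        def emit(self, make_event: Callable[[], dict]) -> None:
            # Key cost-saving trick: if nobody is listening, the event is never
            # even constructed, so instrumentation overhead stays near zero.
            if not self._subscribers:
                return
            event = make_event()
            for handler in self._subscribers.values():
                handler(event)

    # Usage: the expensive event is only built while a subscriber exists.
    stream = OnDemandStream()
    stream.emit(lambda: {"latency_ms": 42})            # dropped for free: no subscribers
    sid = stream.subscribe(lambda e: print("got", e))
    stream.emit(lambda: {"latency_ms": 42})            # delivered to the subscriber
    stream.unsubscribe(sid)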

choosing not to deploy on Fridays is very different than having the capability to deploy on Fridays. You should have the capability to deploy at any time.

We can’t ever be sure a deploy will be safe, but we can be sure that folks have plans for their weekend.

David Mangot — Mangoteque

Outages

  • Amazon Route 53
    • Route 53 had significant DNS resolution impairment.

      Their status site still doesn’t allow deep linking or browsing the archive in any kind of manageable way, so here’s the full text of their followup post (there’s a quick resolver-check sketch after the outage list):

      On October 22, 2019, we detected and then mitigated a DDoS (Distributed Denial of Service) attack against Route 53. Due to the way that DNS queries are processed, this attack was first experienced by many other DNS server operators as the queries made their way through DNS resolvers on the internet to Route 53. The attack targeted specific DNS names and paths, notably those used to access the global names for S3 buckets. Because this attack was widely distributed, a small number of ISPs operating affected DNS resolvers implemented mitigation strategies of their own in an attempt to control the traffic. This is causing DNS lookups through these resolvers for a small number of AWS names to fail. We are doing our best to identify and contact these operators, as quickly as possible, and working with them to enhance their mitigations so that they do not cause impact to valid requests. If you are experiencing issues, please contact us so we can work with your operator to help resolve.

  • Heroku
    • I’m guessing this stemmed from the Route 53 incident.

      Our infrastructure provider is currently reporting intermittent DNS resolution errors. This may result in issues resolving domains to our services.

  • Twitter
  • Yahoo Mail
  • Hosted Graphite
  • Discord
  • Google Cloud Platform
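
Back to the Route 53 impairment: if you suspect your upstream resolver is one of the ones applying an overly aggressive mitigation, one quick check is to compare it against a known public resolver. Here’s a rough sketch using dnspython; the bucket hostname and the 1.1.1.1 fallback are arbitrary examples I picked, not anything AWS recommends.

    # Rough sketch: compare how the system resolver and a public resolver answer
    # for an S3-style global name, to spot an upstream resolver that is dropping
    # queries as part of a DDoS mitigation. Requires dnspython; the bucket name
    # and the 1.1.1.1 fallback are arbitrary examples.
    import dns.resolver

    NAME = "example-bucket.s3.amazonaws.com"  # hypothetical bucket hostname

    def try_resolve(label: str, resolver: dns.resolver.Resolver) -> None:
        try:
            answers = resolver.resolve(NAME, "A")
            print(f"{label}: {[a.to_text() for a in answers]}")
        except Exception as exc:  # SERVFAIL, timeout, NXDOMAIN, ...
            print(f"{label}: lookup failed ({exc.__class__.__name__}: {exc})")

    system_resolver = dns.resolver.Resolver()            # uses the OS resolver config
    public_resolver = dns.resolver.Resolver(configure=False)
    public_resolver.nameservers = ["1.1.1.1"]

    try_resolve("system resolver", system_resolver)
    try_resolve("public resolver", public_resolver)
    # If the system resolver fails while the public one succeeds, your upstream
    # resolver may be the one applying the problematic mitigation.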