General

SRE Weekly Issue #194

A message from our sponsor, VictorOps:

As DevOps and IT teams ingest more alerts and respond to more incidents, they collect more information and historical context. Today, teams are using this data to optimize incident response through constant automation and machine learning.

https://go.victorops.com/sreweekly-incident-response-automation-and-machine-learning

Articles

Last week, I mistakenly listed an outage as “Connectivity Issues”, when it should have been attributed to Squarespace. Sorry about that!

From the authors of the new Post-Incident Review Zine comes this summary of Niall Murphy’s SRECon talk. It’s way more than a talk recap, tying together related blog posts and talks from other authors.

Jaime Woo and Emil Stolarsky

They didn’t trust the datacenter’s backup power, so they added rack UPSes. Little did they realize that a single UPS failure could take down all of the rest.

Richard Speed — The Register

Taiji chooses which datacenter to route a Facebook user’s traffic to. It identifies clusters of users that have friended each other and routes them to the same place, on the theory that they’re likely to be interested in the same content.

Adrian Colyer (summary)

Xu et al., SOSP’19 (original paper)

<3 detailed debugging stories. TIL: Google Compute Engine’s network drops connections from its state table after 10 minutes with no packets.

Stan Hu — GitHub

Vortex is DropBox’s custom-built metrics system, designed for horizontal scalability. Find out why they rolled their own and learn how it works in this article that includes shiny diagrams.

Dave Zbarsky — DropBox

How do we come up with our SLOs, anyway? This one puts me in mind of Will Gallego’s post on error budgets.

Dean Wilson (@unixdaemon)

A network stack in userland as an alternative to TCP/IP? Yup, that seems like a pretty Google thing to do.

Adrian Colyer (summary)

Marty et al., SOSP’19 (original paper)

Outages

SRE Weekly Issue #193

A message from our sponsor, VictorOps:

Episode two of Ship Happens, a DevOps podcast, is now live! VictorOps Engineering Manager, Benton Rochester sits down with Raygun’s Head of DevOps, Patrick Croot to learn about his journey into DevOps and how they’ve tightened their internal feedback loops:

http://try.victorops.com/sreweekly/ship-happens-episode-two

Articles

Ever had a Sev 1 non-impacting incident? This team’s Consul cluster was balanced on a razor’s edge: one false move and quorum would be lost. Read about their incident response and learn how they avoided customer impact.

Devin Sylva — GitLab

This SRECon EMEA highlight reel is giving me serious FOMO.

Will Sewell — Pusher

This week we’re taking a look at how teams in high consequence domains perform handoffs between shifts.

Emily Patterson, Emilie Roth, David Woods, and Renee Chow (original paper)

Thai Wood (summary)

This is an interesting essay on handling errors in complex systems.

In other words, the trick to scaling in the presence of errors is building software around the notion of recovery. Automated recovery.

tef

To be clear: this is about assisting incident responders in gaining an understanding of an incident in the moment, not about finding a “root cause” to present in an after-action report.

I’m not going to pretend to understand the math, but the concept is intriguing.

Nikolay Pavlovich Laptev, Fred Lin, Keyur Muzumdar, Mihai-Valentin Curelea, Seunghak Lee, and Sriram Sankar — Facebook

This one’s about assisting humans in debugging, when they have a reproduction case for a bug but can’t see what’s actually going wrong.

That’s two different uses of “root cause” this week, and neither one is the troublesome variety that John Allspaw has debunked repeatedly.

Zhang et al. (original paper)

Adrian Colyer (summary)

Outages

  • Honeycomb
    • Here‘s an unroll of an interesting Twitter thread by Honeycomb’s Liz Fong-Jones during and after the incident.
  • GitHub
  • Amazon Prime Video
  • Google Compute Engine
    • Network administration functions were impacted. Click for their post-incident analysis.
  • Squarespace
    • On Wednesday November 6th, many Squarespace websites were unavailable for 102 minutes between 14:13 and 15:55 ET.

      Click through for their post-incident analysis.

SRE Weekly Issue #192

A message from our sponsor, VictorOps:

Keeping your local repository in sync with an open-source GitHub repo can cause headaches. But, it can also lead to more flexible, resilient services. See how these techniques can help you maintain consistency between both environments:

http://try.victorops.com/sreweekly/keeping-github-and-local-repos-in-sync

Articles

This is a reply/follow-on/not-rebuttal to the article I linked to last week, Deploy on Fridays, or Don’t. I really love the vigorous discussion!

Charity Majors

And this is a reply to Charity’s earlier article, Friday Deploy Freezes Are Exactly Like Murdering Puppies. Keep it coming, folks!

Marko Bjelac

In this story from the archives, a well-meaning compiler optimizes away a NULL pointer check, yielding an exploitable kernel bug. I love complex systems (kinda).

Jonathan Corbet — LWN

A new report has been released about a major telecommunications outage last winter. This summary paints the picture of a classic complex systems failure.

Ronald Lewis

Making engineers responsible for their code and services in production offers multiple advantages—for the engineer as well as the code.

Julie Gunderson — PagerDuty

Outages

SRE Weekly Issue #191

A message from our sponsor, VictorOps:

Need a new SRE podcast? Then check out episode one of the new VictorOps podcast, Ship Happens. Engineering Manager, Benton Rochester sits down with Bethany Abbott, TechOps Manager at NS1 to discuss on-call and the gender gap in tech.

http://try.victorops.com/sreweekly/ship-happens-episode-one

Articles

Check it out! A new zine dedicated to post-incident reviews. This first issue includes a reprint of 4 real gems from the past month plus one original article about disseminating lessons learned from incidents.

Emil Stolarsky and Jaime Woo

I swear, it’s like they heard me talking about anomaly detection last week. Anyone used this thing? I’d love to hear your experience. Better still, perhaps you’d like to write a blog post or article?

I know this isn’t Security Weekly, but this vulnerability has the potential to cause reliability issues, and it’s dreadfully simple to understand and exploit.

Hoai Viet Nguyen and Luigi Lo Iacono

In this incident followup from the archives, read the saga of a deploy gone horribly wrong. It took them hours and several experiments to figure out how to right the ship.

CCP Goliath — EVE Online

The best practices:

  1. Create a culture of experimentation
  2. Define what success looks like as a team
  3. Statistical significance
  4. Proper segmentation
  5. Recognize your biases
  6. Conduct a retro
  7. Consider experiments during the planning phase
  8. Empower others
  9. Avoid technical debt

Dawn Parzych — LaunchDarkly

Mantis uses an interesting stream processing / subscriber model for observability tooling.

From the beginning, we’ve built Mantis with this exact guiding principle in mind: Let’s make sure we minimize the costs of observing and operating our systems without compromising on required and opportunistic insights.

Cody Rioux, Daniel Jacobson, Jeff Chao, Neeraj Joshi, Nick Mahilani, Piyush Goyal, Prashanth Ramdas, and Zhenzhong Xu — Netflix

choosing not to deploy on Fridays is very different than having the capability to deploy on Fridays.  You should have the capability to deploy at any time.

We can’t ever be sure deploy will be safe, but we can be sure that folks have plans for their weekend.

David Mangot — Mangoteque

Outages

  • Amazon Route 53
    • Route 53 had significant DNS resolution impairment.

      Their status site still doesn’t allow deep linking or browsing the archive in any kind of manageable way, so here’s the full text of their followup post:

      On October 22, 2019, we detected and then mitigated a DDoS (Distributed Denial of Service) attack against Route 53. Due to the way that DNS queries are processed, this attack was first experienced by many other DNS server operators as the queries made their way through DNS resolvers on the internet to Route 53. The attack targeted specific DNS names and paths, notably those used to access the global names for S3 buckets. Because this attack was widely distributed, a small number of ISPs operating affected DNS resolvers implemented mitigation strategies of their own in an attempt to control the traffic. This is causing DNS lookups through these resolvers for a small number of AWS names to fail. We are doing our best to identify and contact these operators, as quickly as possible, and working with them to enhance their mitigations so that they do not cause impact to valid requests. If you are experiencing issues, please contact us so we can work with your operator to help resolve.

  • Heroku
    • I’m guessing this stemmed from the Route 53 incident.

      Our infrastructure provider is currently reporting intermittent DNS resolution errors. This may result in issues resolving domains to our services.

  • Twitter
  • Yahoo Mail
  • Hosted Graphite
  • Discord
  • Google Cloud Platform

SRE Weekly Issue #190

A message from our sponsor, VictorOps:

In the latest guide, Resilience First, you’ll learn about the origin of SRE, how it’s evolved over the last few years, and the future of its impact on building highly observable, resilient applications and infrastructure.

http://try.victorops.com/sreweekly/sre-golden-signals-guide

Articles

This company had a really challenging on-call situation to fix. Monolithic codebase, and a huge team with so many people in the on-call rotation that folks were out of practice by the time it was their turn.

Molly Struve

This article includes charts, observations, and conclusions from the author’s by-hand analysis and categorization of several hundred incidents.

Subbu Allamaraju

Charity Majors replied to a suggestion to write alerts for everything with her ideas for a better way.

Charity Majors (@mipsytipsy)

Where many databases use threading to handle concurrent clients, PostgreSQL forks one child process per client. This has ramifications that an operator must take into consideration.

Kristi Anderson — High Scalability

This article is about attributes, but it doesn’t mention a specific system. I have yet to find an anomaly detection system that doesn’t produce so many false positives that it’s useless.

Hive mind: if you’re using an anomaly detection system that actually works and doesn’t drown you with false positives, I want to hear about it. Bonus points if you want to write an article about it!

Amit Levi

Outages

SRE WEEKLY © 2015 Frontier Theme