General

SRE Weekly Issue #196

A message from our sponsor, VictorOps:

From everyone at VictorOps, we wanted to wish you a happy holiday season and give thanks for this SRE community. So, we put together this fun post to highlight the highs and lows of being on-call during the holidays.

https://go.victorops.com/sreweekly-on-call-holidays

Articles

My favorite:

Don’t wait until the post-mortem; consider doing a “pre-mortem” brainstorm to identify potential issues.

John Agger — Fastly

Full disclosure: Fastly is my employer.

Let’s Encrypt deals with pretty heavy traffic. This post goes into what it takes for them to run a successful Certificate Transparency log.

Phil Porada — Let’s Encrypt

In this air traffic radio recording from Las Vegas (2018), the air traffic controller becomes impaired and starts issuing confusing and dangerously incorrect directives. The pilots work together to correct the situation and no accidents occur. This is a classic example of a resilient system.

I don’t normally link to posts that heavily cover product offerings, but this one has some real gems. I especially like the discussion toward the end of the importance of analyzing an incident shortly after it happens.

John Allspaw — Adaptive Capacity Labs

This is a striking analogue for an infrastructure with many unactionable alerts.

The commission has estimated that of the thousands of alarms going off throughout a hospital every day, an estimated 85 to 99 percent do not require clinical intervention.

Melissa Bailey — The Washington Post

A fascinating look at the early days of Etsy, in which a system is rewritten, the rewrite blows up, the rewrite is rewritten, and finally that is rewritten again. Ouch.

Dan McKinley (@mcfunley)

If your DR test involves carefully contrived circumstances that don’t match the real world, then it’s not a real test. Point your upper management at this article if you need to argue for true DR testing.

Ivan Pepelnjak

Outages

SRE Weekly Issue #195

A message from our sponsor, VictorOps:

Understanding the incident lifecycle can guide DevOps and IT engineers into a future where on-call sucks less. See how you can break down the stages of the incident lifecycle and use automation, transparency, and collaboration to improve each stage:

https://go.victorops.com/sreweekly-incident-lifecycle-guide

Articles

An entertaining take on defining Observability.

Joshua Biggley

There are some really great tips in here, wrapped up in a handy mnemonic, the Five As:

  • actionable
  • accessible
  • accurate
  • authoritative
  • adaptable

Dan Moore — Transposit

“The Internet routes around damage”, right? Not always, and when it does, the rerouting is often too slow. Fastly has a pretty interesting solution to that problem.

Lorenzo Saino and Raul Landa — Fastly

Full disclosure: Fastly is my employer.

The stalls were caused by a gnarly kernel performance issue. They had to use bcc and perf to dig into the kernel in order to figure out what was wrong.

Theo Julienne — GitHub
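
The post is worth reading for the full methodology; as a very rough illustration of the kind of tooling involved (a generic sketch of mine, not code from the GitHub investigation), bcc’s Python frontend makes it easy to hook a kprobe and count kernel events such as TCP retransmits:

    #!/usr/bin/env python3
    # Generic bcc sketch (not from the GitHub investigation): count TCP
    # retransmits by attaching a kprobe to tcp_retransmit_skb.
    # Requires the bcc package and root privileges.
    from time import sleep
    from bcc import BPF

    prog = r"""
    #include <uapi/linux/ptrace.h>

    BPF_HASH(counts, u64, u64);

    int trace_retransmit(struct pt_regs *ctx) {
        u64 key = 0;
        counts.increment(key);
        return 0;
    }
    """

    b = BPF(text=prog)
    b.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit")

    print("Tracing tcp_retransmit_skb... Ctrl-C to stop.")
    try:
        while True:
            sleep(5)
            for _, count in b["counts"].items():
                print(f"retransmits so far: {count.value}")
    except KeyboardInterrupt:
        pass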

Heading to Las Vegas for re:Invent? Here’s a handy guide of talks you might want to check out.

Rui Su — Blameless

How can you tell when folks are learning effectively from incident reviews? Hint: not by measuring MTTR and the like.

John Allspaw — Adaptive Capacity Labs

Outages

SRE Weekly Issue #194

A message from our sponsor, VictorOps:

As DevOps and IT teams ingest more alerts and respond to more incidents, they collect more information and historical context. Today, teams are using this data to optimize incident response through constant automation and machine learning.

https://go.victorops.com/sreweekly-incident-response-automation-and-machine-learning

Articles

Last week, I mistakenly listed an outage as “Connectivity Issues”, when it should have been attributed to Squarespace. Sorry about that!

From the authors of the new Post-Incident Review Zine comes this summary of Niall Murphy’s SRECon talk. It’s way more than a talk recap, tying together related blog posts and talks from other authors.

Jaime Woo and Emil Stolarsky

They didn’t trust the datacenter’s backup power, so they added rack UPSes. Little did they realize that a single UPS failure could take down all of the rest.

Richard Speed — The Register

Taiji chooses which datacenter to route a Facebook user’s traffic to. It identifies clusters of users that have friended each other and routes them to the same place, on the theory that they’re likely to be interested in the same content.

Adrian Colyer (summary)

Xu et al., SOSP’19 (original paper)

<3 detailed debugging stories. TIL: Google Compute Engine’s network drops connections from its state table after 10 minutes with no packets.

Stan Hu — GitLab
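
If you keep long-lived, mostly idle connections to or from GCE, the usual mitigation is to send TCP keepalives well inside that 10-minute window. A minimal sketch (the endpoint is a placeholder, and the fix described in the post may differ):

    import socket

    # Probe a long-lived, mostly idle connection well before the ~10-minute
    # idle timeout so the network's state table keeps the entry alive.
    # The keepalive knobs below are Linux-specific; values are in seconds.
    sock = socket.create_connection(("db.internal.example", 5432))  # placeholder endpoint
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle time before the first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)  # interval between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before giving up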

Vortex is Dropbox’s custom-built metrics system, designed for horizontal scalability. Find out why they rolled their own and how it works in this article, complete with shiny diagrams.

Dave Zbarsky — Dropbox

How do we come up with our SLOs, anyway? This one puts me in mind of Will Gallego’s post on error budgets.

Dean Wilson (@unixdaemon)
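
For reference, the arithmetic behind an availability error budget is simple; here’s a quick back-of-the-envelope sketch (mine, not from the post):

    # Minutes of allowed unavailability for an availability SLO over a window.
    def error_budget_minutes(slo: float, window_days: int = 30) -> float:
        return (1 - slo) * window_days * 24 * 60

    for slo in (0.99, 0.999, 0.9999):
        print(f"{slo:.2%} SLO -> {error_budget_minutes(slo):.1f} minutes of budget per 30 days")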

A network stack in userland as an alternative to TCP/IP? Yup, that seems like a pretty Google thing to do.

Adrian Colyer (summary)

Marty et al., SOSP’19 (original paper)

Outages

SRE Weekly Issue #193

A message from our sponsor, VictorOps:

Episode two of Ship Happens, a DevOps podcast, is now live! VictorOps Engineering Manager Benton Rochester sits down with Raygun’s Head of DevOps, Patrick Croot, to learn about his journey into DevOps and how they’ve tightened their internal feedback loops:

http://try.victorops.com/sreweekly/ship-happens-episode-two

Articles

Ever had a Sev 1 non-impacting incident? This team’s Consul cluster was balanced on a razor’s edge: one false move and quorum would be lost. Read about their incident response and learn how they avoided customer impact.

Devin Sylva — GitLab
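
For context, Consul’s Raft-based consensus needs a majority of voting servers to stay healthy; the standard quorum arithmetic (not specific to this incident) looks like this:

    # Raft quorum: a cluster of n voting servers needs floor(n/2) + 1 of them
    # in agreement, so it tolerates n - quorum simultaneous failures.
    def quorum(n: int) -> int:
        return n // 2 + 1

    for n in (3, 5, 7):
        q = quorum(n)
        print(f"{n} servers: quorum {q}, tolerates {n - q} failure(s)")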

This SRECon EMEA highlight reel is giving me serious FOMO.

Will Sewell — Pusher

This week we’re taking a look at how teams in high-consequence domains perform handoffs between shifts.

Emily Patterson, Emilie Roth, David Woods, and Renee Chow (original paper)

Thai Wood (summary)

This is an interesting essay on handling errors in complex systems.

In other words, the trick to scaling in the presence of errors is building software around the notion of recovery. Automated recovery.

tef
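
To make that concrete, one common form of automated recovery is retrying failed operations with capped, jittered exponential backoff (my illustration, not code from the essay):

    import random
    import time

    # Retry an operation with capped exponential backoff and jitter,
    # surfacing the error only once the retry budget is exhausted.
    def with_retries(op, attempts=5, base_delay=0.5, max_delay=30.0):
        for attempt in range(1, attempts + 1):
            try:
                return op()
            except Exception:
                if attempt == attempts:
                    raise  # out of retries; hand off to a higher-level recovery path
                delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids synchronized retry storms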

To be clear: this is about assisting incident responders in gaining an understanding of an incident in the moment, not about finding a “root cause” to present in an after-action report.

I’m not going to pretend to understand the math, but the concept is intriguing.

Nikolay Pavlovich Laptev, Fred Lin, Keyur Muzumdar, Mihai-Valentin Curelea, Seunghak Lee, and Sriram Sankar — Facebook

This one’s about assisting humans in debugging, when they have a reproduction case for a bug but can’t see what’s actually going wrong.

That’s two different uses of “root cause” this week, and neither one is the troublesome variety that John Allspaw has debunked repeatedly.

Zhang et al. (original paper)

Adrian Colyer (summary)

Outages

  • Honeycomb
    • Here’s an unroll of an interesting Twitter thread by Honeycomb’s Liz Fong-Jones during and after the incident.
  • GitHub
  • Amazon Prime Video
  • Google Compute Engine
    • Network administration functions were impacted. Click for their post-incident analysis.
  • Squarespace
    • On Wednesday November 6th, many Squarespace websites were unavailable for 102 minutes between 14:13 and 15:55 ET.

      Click through for their post-incident analysis.

SRE Weekly Issue #192

A message from our sponsor, VictorOps:

Keeping your local repository in sync with an open-source GitHub repo can cause headaches. But it can also lead to more flexible, resilient services. See how these techniques can help you maintain consistency between both environments:

http://try.victorops.com/sreweekly/keeping-github-and-local-repos-in-sync

Articles

This is a reply/follow-on/not-rebuttal to the article I linked to last week, Deploy on Fridays, or Don’t. I really love the vigorous discussion!

Charity Majors

And this is a reply to Charity’s earlier article, Friday Deploy Freezes Are Exactly Like Murdering Puppies. Keep it coming, folks!

Marko Bjelac

In this story from the archives, a well-meaning compiler optimizes away a NULL pointer check, yielding an exploitable kernel bug. I love complex systems (kinda).

Jonathan Corbet — LWN

A new report has been released about a major telecommunications outage last winter. This summary paints the picture of a classic complex systems failure.

Ronald Lewis

Making engineers responsible for their code and services in production offers multiple advantages—for the engineer as well as the code.

Julie Gunderson — PagerDuty

Outages

A production of Tinker Tinker Tinker, LLC