SRE Weekly Issue #178

A message from our sponsor, VictorOps:

Containers and microservices can improve development speed and service flexibility. But more complex systems have a higher potential for incidents. Learn how SRE teams are building more reliable services and adding context to microservices and containerized environments:

http://try.victorops.com/sreweekly/container-monitoring-and-alerting-best-practices

Articles

Imagine a database that promises consistency except in the case of a network partition, in which case it favors availability. That’s conditional consistency, and it’s effectively the same as no consistency.

Daniel Abadi
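To make the point concrete, here's a minimal sketch (hypothetical classes and names, not from Abadi's article) of why a "consistent unless partitioned" guarantee gives the client nothing to rely on: the availability-favoring fallback is invisible to the reader, so every read has to be treated as possibly stale.

    import random

    class Replica:
        """Toy replica: serves a quorum read when it can, a local read when it can't."""

        def __init__(self):
            self.local_value = "stale"
            self.quorum_value = "latest"

        def reachable_quorum(self) -> bool:
            # Whether a partition is in progress is outside the client's knowledge;
            # modeled here as a coin flip.
            return random.random() > 0.1

        def read(self) -> str:
            if self.reachable_quorum():
                return self.quorum_value   # the "consistent" path
            return self.local_value        # the availability-favoring fallback

    # The caller gets no signal about which path was taken, so it can never
    # assume the strong guarantee held for this particular read.
    value = Replica().read()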

This is a story about distributed coordination, the TCP API, and how we debugged and fixed a bug in Puma that only shows up at scale.

Richard Schneeman — Heroku

Here’s more on the Australian Tax Office outage earlier this month.

Max Smolaks — The Register

Ever experience a total outage while your cloud provider still reports 99.999% availability? This one’s for you.

rachelbythebay
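For a sense of how that happens, here's a toy calculation (invented numbers) showing a provider-wide success-rate metric staying at five nines while one customer's traffic fails entirely:

    total_requests = 10_000_000        # provider-wide, one month
    failed_requests = 80               # every failure belongs to one customer
    provider_availability = 1 - failed_requests / total_requests
    print(f"provider-reported: {provider_availability:.5%}")   # 99.99920%

    customer_requests = 80             # that customer's entire traffic
    customer_failures = 80
    customer_availability = 1 - customer_failures / customer_requests
    print(f"that customer:     {customer_availability:.0%}")   # 0%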

What’s good or bad to do in production? And how do you transfer knowledge when new team members want to release production services or take ownership of existing services?

Jaana B. Dogan (JBD)

The internet is a series of tubes — the kind that transmit light. Favorite thing I learned: fiber optic cables are sheathed in copper that powers repeaters along their length.

James Griffiths — CNN

How do you build a reliable network when faced with highly skilled and motivated adversaries?

Alex Wawro — DARKReading

Outages

SRE Weekly Issue #177

A message from our sponsor, VictorOps:

[Free Webinar] VictorOps partnered with Catchpoint to put death to downtime with actionable monitoring and incident response practices. See how SRE teams are being more proactive toward service reliability:

http://try.victorops.com/sreweekly/death-to-downtime

Articles

The point of this thread is to bring attention to the notion that our reactions to surprising events are the fuel that effectively dictates what we learn from them.

John Allspaw — Adaptive Capacity Labs

This article is an attempt to classify the causes of major outages at the big three cloud providers (AWS, Azure, and GCP).

David Mytton

It was, wasn’t it? Here’s a nice summary of the recent spate of unrelated major incidents.

Zack Whittaker — TechCrunch

Calculating CIRT (Critical Incident Response Time) involves ignoring various types of incidents to try to get a number that is more representative of the performance of an operations team.

Julie Gunderson, Justin Kearns, and Ophir Ronen — PagerDuty
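The exclusion rules below are illustrative assumptions, not PagerDuty's exact criteria, but they sketch the filtering idea: measure response time only over incidents that plausibly reflect responder performance.

    from statistics import mean

    incidents = [
        # (urgency, auto_resolved, response_minutes)
        ("high", False, 12),
        ("high", False, 7),
        ("low",  False, 95),    # excluded: low urgency
        ("high", True,  1),     # excluded: resolved automatically, no human response
        ("high", False, 480),   # excluded: extreme outlier
    ]

    def counts_toward_cirt(urgency, auto_resolved, minutes):
        return urgency == "high" and not auto_resolved and minutes < 240

    kept = [m for (u, a, m) in incidents if counts_toward_cirt(u, a, m)]
    print(f"CIRT over {len(kept)} qualifying incidents: {mean(kept):.1f} minutes")
    # -> CIRT over 2 qualifying incidents: 9.5 minutes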

There is so much great detail in this followup article about Cloudflare’s global outage earlier this month. Thanks, folks!

John Graham-Cumming — Cloudflare

Outages

  • Statuspage.io
  • NS1
  • PagerDuty
  • Nordstrom
    • Nordstrom’s site went down at the start of a major sale.
  • Twitter
  • Heroku
  • Honeycomb
    • Honeycomb had an 8-minute outage preceded by 4 minutes of degradation. Click through to find out how their CI pipeline surprised them and what they did about it.
  • LinkedIn
  • Australian Tax Office
  • Reddit
  • Stripe
    • […] two different database bugs and a configuration change interacted in an unforeseen way, causing a cascading failure across several critical services.

      Click through for Stripe’s full analysis.

  • Discord

SRE Weekly Issue #176

A message from our sponsor, VictorOps:

[Free Guide] VictorOps partnered with Catchpoint and came up with six actionable ways to transform your monitoring and incident response practices. See how SRE teams are being more proactive toward service reliability.

http://try.victorops.com/sreweekly/transform-monitoring-and-incident-response

Articles

[…] spans are too low-level to meaningfully be able to unearth the most valuable insights from trace data.

Find out why current distributed tracing tools fall short and the author’s vision of the future of distributed tracing.

Cindy Sridharan

If I wanted to introduce the concept of blameless culture to execs, this article would be a great starting point.

Rui Su — Blameless

When we look closely at post-incident artifacts, we find that they can serve a number of different purposes for different audiences.

John Allspaw — Adaptive Capacity Labs

When you meant to type /127 but entered /12 instead

Oops?
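If you're wondering how bad that typo is: for IPv6, where /127 point-to-point prefixes are common, the difference is two addresses versus about 8.3 × 10^34. A quick check with Python's ipaddress module (documentation prefix, purely for illustration):

    import ipaddress

    intended = ipaddress.ip_network("2001:db8::/127")
    typo = ipaddress.ip_network("2001:db8::/12", strict=False)  # masks host bits to 2000::/12

    print(intended.num_addresses)   # 2
    print(typo.num_addresses)       # 2 ** 116, roughly 8.3e34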

The early failure injection testing mechanisms from Chaos Monkey and friends were like acts of random vandalism. Monocle is more like intelligent probing, seeking out any weakness a service may have.

There’s a great example of Monocle discovering a mismatched timeout between client and server and targeting it for a test.

Adrian Colyer (summary)

Basiri et al., ICSE 2019 (original paper)
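As a rough illustration of that timeout-mismatch case (names and numbers invented, not from the paper): the caller gives up after 1 second while the callee is allowed 5, so any injected latency between the two turns into client-side failures even though the server "succeeds".

    import time

    CLIENT_TIMEOUT_S = 1.0
    SERVER_BUDGET_S = 5.0   # server-side timeout; note it exceeds the client's

    def server_handle(injected_latency_s: float) -> str:
        if injected_latency_s > SERVER_BUDGET_S:
            raise TimeoutError("server gave up")
        time.sleep(injected_latency_s)   # chaos-injected delay
        return "ok"

    def client_call(injected_latency_s: float) -> str:
        start = time.monotonic()
        result = server_handle(injected_latency_s)
        if time.monotonic() - start > CLIENT_TIMEOUT_S:
            # The server did the work, but the caller already treats it as failed.
            raise TimeoutError("client gave up first")
        return result

    try:
        client_call(2.0)    # 2s of injected latency sits squarely in the mismatch window
    except TimeoutError as exc:
        print(f"latency injection exposed the mismatch: {exc}")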

Take the axiom of “don’t hardcode values” to an extreme, and you end up right back where you started.

Mike Hadlow

Outages

SRE Weekly Issue #175

A message from our sponsor, VictorOps:

Looking to go serverless? Beau Christensen, VictorOps Director of Platform Engineering, and Tom McLaughlin, Founder of ServerlessOps, sat down to talk about when VictorOps decided to venture into AWS:

http://try.victorops.com/SREWeekly/going-serverless

Articles

This and other enlightened reflections on incident reviews can be found in this article:

Many organizations have driven post-incident reviews to become pallid, vapid, mechanical exercises whose value is limited to producing a defensible argument that management is occurring.

Richard Cook — Adaptive Capacity Labs

In this post, I’ll describe how we monitor our DNS systems and how we used an array of tools to investigate and fix an unexpected spike in DNS errors that we encountered recently.

Jeff Jo — Stripe
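Not Stripe's actual tooling, but the shape of the simplest possible probe behind a "DNS error rate" metric looks something like this (placeholder targets):

    import socket
    import time

    NAMES = ["example.com", "example.org"]   # placeholder targets

    def probe(name: str):
        start = time.monotonic()
        try:
            socket.getaddrinfo(name, 443)
            return ("ok", time.monotonic() - start)
        except socket.gaierror:
            return ("error", time.monotonic() - start)

    results = [probe(n) for n in NAMES]
    errors = sum(1 for status, _ in results if status == "error")
    print(f"{errors}/{len(results)} lookups failed")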

“Multi-cloud” never really lived up to its hype, did it? This article argues that a multi-cloud strategy is only useful in specific, constrained situations.

Disco Posse

I love how they used idempotency to avoid downtime and missed or repeated transactions during the cutover.

Miguel Carranza — RevenueCat
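The general idea, sketched with a toy ledger (not RevenueCat's actual code): every transaction carries a stable idempotency key, so writing it to both the old and new stores during the cutover, or retrying after an ambiguous failure, can't double-apply it.

    class Ledger:
        def __init__(self):
            self.applied = {}              # idempotency_key -> amount

        def record(self, idempotency_key: str, amount: int) -> bool:
            if idempotency_key in self.applied:
                return False               # duplicate delivery; safely ignored
            self.applied[idempotency_key] = amount
            return True

    old, new = Ledger(), Ledger()
    txn = ("txn-123", 499)

    # During cutover the same transaction may be written to both systems, and
    # resent after a timeout; the key guarantees it lands exactly once in each.
    for ledger in (old, new, new):
        ledger.record(*txn)

    assert sum(new.applied.values()) == 499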

This is either really clever or just unsporting.

Tonya Garcia — MarketWatch

This article discusses six kinds of SRE team (“kitchen sink”, infrastructure, tools, product/application, embedded, and consulting) and the pros and cons of each.

Gustavo Franco and Matt Brown — Google

If you see half the incidents this quarter compared to last, does it actually mean anything, statistically speaking? The math in this article applies equally well to SRE, and casts a shadow on the idea of tracking “metrics” like MTTR.

Marloes Nitert — Safety Differently
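A back-of-the-envelope version of the question: if last quarter had 8 incidents and this one had 4, and the underlying rate hadn't changed at all, how unusual would that split be? One simple check (my numbers, not the article's) treats the split of 12 incidents across two equal quarters as Binomial(12, 0.5):

    from math import comb

    last_q, this_q = 8, 4
    n = last_q + this_q

    def binom_pmf(k, n, p=0.5):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    # Two-sided: probability of a split at least this lopsided in either direction.
    p_value = sum(binom_pmf(k, n) for k in range(n + 1)
                  if abs(k - n / 2) >= abs(this_q - n / 2))
    print(f"p-value ~ {p_value:.2f}")   # ~0.39 -- halving could easily be noise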

This field guide to debugging is the synthesis of a bunch of contributions by folks on Twitter, forged into an article by the inimitable Julia Evans. Maybe a zine is in the works?

Julia Evans

Outages

A production of Tinker Tinker Tinker, LLC