SRE Weekly Issue #175

Articles

Some Observations On the Messy Realities of Incident Reviews

This and other enlightened reflections on incident reviews can be found in this article:

Many organizations have driven post-incident reviews to become pallid, vapid, mechanical exercises whose value is limited to producing a defensible argument that management is occurring.

Richard Cook — Adaptive Capacity Labs

The secret life of DNS packets: investigating complex networks

In this post, I’ll describe how we monitor our DNS systems and how we used an array of tools to investigate and fix an unexpected spike in DNS errors that we encountered recently.

Jeff Jo — Stripe

Multi-Cloud: You keep using that word…

“Multi-cloud” never really lived up to its hype, did it? This article argues that a multi-cloud strategy is only useful in specific, constrained situations.

Disco Posse

How we migrated our database to Amazon Aurora with zero downtime

I love how they used idempotency to avoid downtime and missed or repeated transactions during the cutover.

Miguel Carranza — RevenueCat
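
For readers unfamiliar with the term, here is a minimal sketch of the general idea behind idempotent writes. It is not RevenueCat's code: the `Ledger` class, its methods, and the transaction IDs are all hypothetical. The sketch assumes every transaction carries a unique ID, so a write that is retried or replayed during a cutover is applied at most once.

```python
class Ledger:
    """Hypothetical in-memory store that applies each transaction ID at most once."""

    def __init__(self):
        # Maps transaction ID -> amount for every transaction already applied.
        self.applied = {}

    def apply(self, tx_id, amount):
        """Apply a transaction; return False if this ID was already applied."""
        if tx_id in self.applied:
            return False  # Replay or retry: safe to ignore.
        self.applied[tx_id] = amount
        return True

    def balance(self):
        return sum(self.applied.values())


ledger = Ledger()
ledger.apply("tx-1", 500)
ledger.apply("tx-2", 250)
# The same write arrives again, e.g. a retry sent while traffic was switching over.
ledger.apply("tx-1", 500)
print(ledger.balance())  # prints 750
```

Because replays are harmless, a migration can err on the side of sending a write twice rather than risk dropping it.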

Ebay to hold ‘Crash Sale’ on July 15 in case Amazon’s site goes down

This is either really clever or just unsporting.

Tonya Garcia — MarketWatch

How SRE teams are organized, and how to get started

This article discusses six kinds of SRE team (“kitchen sink”, infrastructure, tools, product/application, embedded, and consulting) and the pros and cons of each.

Gustavo Franco and Matt Brown — Google

When does a reduction in injury numbers become statistically significant?

If you see half the incidents this quarter compared to last, does it actually mean anything, statistically speaking? The math in this article applies equally well to SRE, and casts a shadow on the idea of tracking “metrics” like MTTR.

Marloes Nitert — Safety Differently
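
As a rough illustration of the question above, here is one standard way to check whether a drop in incident counts could be chance. This is not necessarily the method used in the linked article. The sketch assumes incidents arrive independently (Poisson counts) and that both quarters have equal exposure. Under those assumptions, if the underlying rate did not change, then given the total, each incident is equally likely to land in either quarter, so the split follows a Binomial(n, 0.5) distribution. The function name is hypothetical.

```python
from math import comb


def two_sided_p_value(previous, current):
    """Exact conditional test that two Poisson counts share the same rate.

    Assumes equal-length periods. Given n = previous + current, the smaller
    count is compared against a Binomial(n, 0.5) distribution.
    """
    n = previous + current
    if n == 0:
        return 1.0  # No incidents at all: no evidence of any change.
    k = min(previous, current)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double it for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Incidents halved from 10 to 5: well within what chance alone produces.
print(f"{two_sided_p_value(10, 5):.2f}")  # prints 0.30

# The same halving at ten times the volume is a much stronger signal.
print(two_sided_p_value(100, 50) < 0.001)  # prints True
```

With counts as small as most teams' quarterly incident totals, even a 50% drop is consistent with no real change.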

What does debugging a program look like?

This field guide to debugging is the synthesis of a bunch of contributions by folks on Twitter, forged into an article by the inimitable Julia Evans. Maybe a zine is in the works?

Julia Evans

Outages

How Verizon and a BGP Optimizer Knocked Large Parts of the Internet Offline Today
- The big outage this week happened when a small ISP accidentally told the Internet that it was the best place to send all their packets. Tom Strickx — Cloudflare
Statuspage.io
Slack
Hulu
- Hulu suffered an outage during their live stream of an important US political debate.
