SRE Weekly Issue #153

Articles

110: Human Incident Response with Courtney Eckhardt – Greater Than Code

In this podcast episode, Courtney Eckhardt and the panel cover a lot of bases related to incident response, retrospectives, defensiveness, blamelessness, social justice, and tons more engrossing stuff. Well worth a listen.

Mandy Moore (summary); John K. Sawers, Sam Livingston-Gray, Jamey Hampton, and Coraline Ada Ehmke (panelists); Courtney Eckhardt (guest)

DBMS Musings: Partitioned consensus and its impact on Spanner’s latency

Do you wonder what effect partitioned versus unified consensus might have on latency? Do you want to know what those terms mean? Read on.

Daniel Abadi

Cape Technical Deep Dive

Cape is Dropbox’s real-time event processing system. The design bits in this article have a ton of interesting detail, and I also love the part where they go into their motivations behind not just using an existing queuing system.

Peng Kang — Dropbox

Designing resilient systems: Circuit Breakers or Retries? (Part 1)

This is a great intro to the circuit breaker pattern if you’re unfamiliar with it, and it’s also got a lot of meaty content for folks who already have experience with it.

Corey Scott — Grab

Don’t Choose Dashboards Over Analysis

Though it sounds counterintuitive, more dashboards often make people less informed and less aligned.

Having a few good dashboards is important, but if you have too many, it’ll get in the way of your ability to do dynamic analysis.

Benn Stancil — Mode

Site Reliability Engineering is Operations

What activities count as SRE work, versus “just” Operations?

Site Reliability Engineering do Operations but are not an Operations Team.

Stephen Thorne

Outages

Twitch
Google Cloud Platform (europe-west1-b)
- A pair of redundant switches were erroneously taken down simultaneously for maintenance, causing a major outage. Google published a follow-up post.
Xero
Spotify
