SRE Weekly Issue #239

Articles

Don’t scale up farther than you need to! If you won’t ever see more than 100 RPS, don’t architect for 100,000.

Ayende Rahien

The Many Shapes of Site Reliability Engineering

This one covers several common patterns of SRE practice and then offers insight on what to look for as you design your own SRE team.

Rob Cummings — Slalom Build

Abstractions and implicit preconditions

Abstractions make us more productive, and, indeed, we humans can’t build complex systems without them. But we need to be able to peel away the abstraction layers when things go wrong, so we can discover the implicit precondition that’s been violated.

Lorin Hochstein

Keeping CALM: When Distributed Consistency Is Easy

Coordination between nodes in a distributed system can kill performance. What kinds of problems require coordination? The CALM theorem can tell us.

Joseph M. Hellerstein and Peter Alvaro — Communications of the ACM

The Ultimate, Free Incident Retrospective Template

Here’s another good post-incident analysis document template that you can use as inspiration for your own.

Hannah Culver — Blameless

4 Signs Software Reliability Should be Your Top Priority

As your product ages, it transitions from “cool new thing” to “tool everyone uses and expects to Just Work”. Your reliability needs will change accordingly.

Lyon Wong — Blameless

Outages

PagerDuty
- 95% of event submissions (your systems telling PagerDuty to trigger an alert) failed for about an hour. They posted some detail about what went wrong.
Slack
- Their latest update on this outage contains some detail about what went wrong.
Telegram
Microsoft Office 365
Coles Supermarkets
Adobe Creative Cloud
GitHub
