SRE Weekly Issue #183

Articles

Another issue of Increment, on a topic integral to SRE: testing.

It doesn’t matter if you’ve already read everything Charity Majors has written; in this article she’s still managed to find new and even more compelling ways to argue that we should embrace testing in production.

My two other favorite articles from this issue:

What Broke the Bank (Chris Stokel-Walker)
Tests from the Crypt (Tammy Butow)

Charity Majors — Honeycomb

We Re-Launched The New York Times Paywall and No One Noticed

That’s exactly what we hoped for.

They rewrote this critical service and carefully deployed it to avoid user impact, using a technique I love: run the new code alongside the old for a while to verify that it returns the same results.

Jeremy Gayed, Said Ketchman, Oleksii Khliupin, Ronny Wang and Sergey Zheznyakovskiy — New York Times
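
The run-alongside technique mentioned for the paywall relaunch can be sketched roughly as follows. This is a minimal illustration and not the Times' actual implementation: `parallel_run`, `legacy_can_read`, and `rewritten_can_read` are hypothetical names, and the access rule is a made-up stand-in. The idea is that callers only ever see the old result, while the new code runs on the side and any disagreement is recorded for later review.

```python
import logging
from typing import Any, Callable, List, Tuple

logger = logging.getLogger("parallel_run")


def parallel_run(
    old_impl: Callable[[Any], Any],
    new_impl: Callable[[Any], Any],
    request: Any,
    mismatches: List[Tuple[Any, Any, Any]],
) -> Any:
    """Return the old result; run the new code on the side and record differences."""
    old_result = old_impl(request)
    try:
        new_result = new_impl(request)
        if new_result != old_result:
            mismatches.append((request, old_result, new_result))
            logger.warning(
                "mismatch for %r: old=%r new=%r", request, old_result, new_result
            )
    except Exception as exc:
        # A crash in the new code must never affect what the user sees.
        mismatches.append((request, old_result, exc))
        logger.exception("new implementation failed for %r", request)
    return old_result


# Hypothetical stand-ins for the old and new services (assumed for illustration).
def legacy_can_read(articles_read: int) -> bool:
    return articles_read < 10


def rewritten_can_read(articles_read: int) -> bool:
    return articles_read <= 10  # deliberate off-by-one so the comparison fires


if __name__ == "__main__":
    found: List[Tuple[Any, Any, Any]] = []
    for count in (5, 10, 11):
        parallel_run(legacy_can_read, rewritten_can_read, count, found)
    print(found)
```

In a real service the comparison would more likely run asynchronously or on a sample of traffic so that the new code cannot add latency, and the switch to the new path would happen only after the mismatch log stays empty for a while.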

For CAs: What makes a Good Incident Response?

This is aimed at Certification Authorities dealing with TLS certificate misissuance issues and the like, but it very much applies to any kind of incident.

BONUS CONTENT: An incident report from Let’s Encrypt just a few days later included this gem, exactly in line with what Ryan wrote:

After initially confirming the report we reached out to multiple other CAs that we believed would also be affected.

Ryan Sleevi

Our incident postmortem template

Whose? Hosted Graphite’s. Definitely worth a read.

Fran Garcia — Hosted Graphite

Understanding the risk profile of your technical debt

Which brings me to this unpopular opinion: All code is technical debt.

However, debt itself isn’t bad. It can be risky, especially if misunderstood, but debt itself is not inherently good or bad. It’s a tool.

Dormain Drewitz — Pivotal

Building a Culture of Continuous Improvement

Blameless is running a free workshop on writing post-incident reports.

In this talk we will discuss the elements of an effective postmortem and the challenges faced while defining the process. We will introduce concrete methodologies that alleviate the cognitive overhead and emotional burden of doing postmortems.

Blameless

Outages

Heroku Status
- Heroku experienced 8+ hours of instability. This status page posting is really worth a read because it has:
  - meticulously detailed customer impact
  - no sugar-coating
  - detailed workarounds when they were available
  Hats off to you, folks.
Slack
Reddit
Sling TV
Disney Plus
- Increased traffic from a sale caused instability.
Fastly
