SRE Weekly Issue #78

Articles

This Master’s thesis by Crista Vesel seeks to answer the question, “How does the language used in the U.S. Forest Service’s Serious Accident Investigation Guide bias accident investigation analysis?” It’s an awe-inspiring analysis, drawing on Dekker, Woods, Cook, and other authors I’ve linked here repeatedly.

The most exciting part for me was the confirmation of some vague thoughts I’ve had around the use of passive versus active voice in retrospectives. By using passive voice, we can seek to reduce the kind of blaming that is inherent in active/agentive language.

What can developers learn from being on call? – Julia Evans

It’s by Julia Evans. Just read it.

Being responsible for my programs’ operations makes me a better developer

Determining Alert Urgency

PagerDuty again draws on ITIL, this time to outline an example system for classifying incident impact and urgency in order to determine priority.
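To make the impact/urgency idea concrete, here is a minimal sketch of an ITIL-style priority matrix. The three-level scale, the additive formula, and the paging threshold are illustrative assumptions, not the scheme from PagerDuty's article.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale, used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> int:
    """Combine impact and urgency into a priority from 1 (highest) to 5 (lowest).

    Assumption: a simple additive matrix, so HIGH/HIGH maps to 1
    and LOW/LOW maps to 5.
    """
    return impact + urgency - 1


def should_page(impact: Level, urgency: Level, threshold: int = 2) -> bool:
    """Page a human only when the priority is at or above the threshold.

    The threshold of 2 is an arbitrary choice for this example.
    """
    return priority(impact, urgency) <= threshold
```

With these assumptions, a high-impact, medium-urgency incident pages someone, while a low-impact, low-urgency one waits for business hours.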

ChaosCat: Automating Fault Injection at PagerDuty

PagerDuty’s take on automating chaos runs periodically, of course, and it also includes a chat-bot that lets folks trigger one-off host failures.

Unfortunately, ChaosCat is significantly tied into our internal infrastructure tooling. For the moment this means we won’t be open-sourcing it.

Reduce downtime with Azure Site Recovery service

This article is an overview of Microsoft’s DRaaS offering, Azure Site Recovery. Protip: you can just scroll past the signup-gate if you don’t feel like entering your email address.

How We Scaled Our Cache and Got a Good Night’s Sleep

Grab evaluated a couple of existing solutions but went with a simple custom sharding layer as a method to scale out their Redis usage.
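For readers who haven't built one, a client-side sharding layer can be quite small. The sketch below is a generic hash-modulo router, not Grab's implementation; the `ShardRouter` class and the shard names are hypothetical.

```python
import zlib


class ShardRouter:
    """Map each cache key to one of a fixed list of shards.

    Assumption: plain hash-modulo placement. Changing the number of
    shards remaps most keys, which consistent hashing would mitigate.
    """

    def __init__(self, shards):
        if not shards:
            raise ValueError("at least one shard is required")
        self.shards = list(shards)

    def shard_for(self, key: str) -> str:
        # CRC32 is stable across processes, unlike Python's built-in
        # hash(), which is randomized per run for strings.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]


# Hypothetical shard names; in practice each would be a Redis endpoint.
router = ShardRouter(["redis-0:6379", "redis-1:6379", "redis-2:6379"])
target = router.shard_for("user:1234")
```

The caller then sends the command for that key to whichever connection `target` names, so every client that shares the same shard list agrees on where a key lives.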

Outages

Rollbar
LinkedIn
Skype
- Suspected DDoS.
ATO (Australian Tax Office)
Dyn
- Dyn suffered a long outage, and they posted an amazing 28 detailed updates to their status site before all was said and done. That’s something to aspire to.
Heroku
- Heroku posted a followup for their series of incidents early this month. Sorry for not posting those outages when they happened! Full disclosure: Heroku is my employer.
