SRE Weekly Issue #26

Articles

Operational Best Practices #serverless – charity.wtf

Here’s Charity Majors being awesome as always. There’s a reason this article is first this week. In this first of two articles, Charity recaps her recent talk at serverlessconf, in which she argues that you can never get away from operations, no matter how “serverless” you go.

[…] no matter how pretty the abstractions are, you’re still dealing with dusty old concepts like “persistent state” and “queries” and “unavailability” and so forth […]

I’m still laughing about #NoDevs. Thought-leadering through trolling FTW.

Making Facebook Self-Healing

This is an older article (2011), but it’s still well worth reading. Facebook began automating remediation of standard hardware failure, and then they reinvested the time saved into improving the automation.

Today, the FBAR service is run by two full time engineers, but according to the most recent metrics, it’s doing the work of 200 full time system administrators.

Autoscaling on Complex Telemetry

A system that doesn’t auto-scale to meet demand can be unreliable in the face of demand spikes. But auto-scaling adds complexity to a system, and increasing complexity can also decrease reliability. This article outlines a method to attempt to reason about auto-scaling based on multiple metrics. Bonus TIL: Erlang threads busy-wait for work.

You deleted the customer what, now? Human error – deal with it

A run-down of basic techniques for avoiding and dealing with human error. I like this article for a couple of choice quotes, such as “human error scales up”: as your infrastructure grows bigger, the scope of potential damage from a single error grows too.

Simplicity: A Prerequisite for Reliability

The latest in Mathias Lafeldt’s Production Ready series is this article about complexity.

The more complex a system, the more difficult it is to build a mental model of the system, and the harder it becomes to operate and debug it.

Outages

J.F.K International Airport (New York)
Epyx 1link
Mailgun
- Mailgun had a series of outages in May, and they’ve released this postmortem.
PlayStation Network
Apple
Amazon.com search
TeamViewer
- Rampant rumors circulated suggesting that the outage was a security breach and that many users’ computers had been hijacked. TeamViewer denies this and states that it was a DoS attack.
Cricket Wireless (US telecom)
Amazon Web Services (Sydney, AU)
- The outage took many Australian services down with it.
