Articles
Here’s Charity Majors being awesome as always. There’s a reason this article is first this week. In this first of two articles, Charity recaps her recent talk at ServerlessConf, in which she argues that you can never get away from operations, no matter how “serverless” you go.
[…] no matter how pretty the abstractions are, you’re still dealing with dusty old concepts like “persistent state” and “queries” and “unavailability” and so forth […]
I’m still laughing about #NoDevs. Thought-leadering through trolling FTW.
This is an older article (2011), but it’s still well worth reading. Facebook began automating the remediation of common hardware failures, then reinvested the time saved into improving the automation.
Today, the FBAR service is run by two full time engineers, but according to the most recent metrics, it’s doing the work of 200 full time system administrators.
A system that doesn’t auto-scale to meet demand can be unreliable in the face of demand spikes. But auto-scaling adds complexity to a system, and increased complexity can also decrease reliability. This article outlines a method for reasoning about auto-scaling based on multiple metrics. Bonus TIL: Erlang threads busy-wait for work.
A run-down of basic techniques for avoiding and dealing with human error. I like this article for a couple of choice quotes, such as “human error scales up”: as your infrastructure grows bigger, the scope of potential damage from a single error also grows bigger.
The latest in Mathias Lafeldt’s Production Ready series is this article about complexity.
The more complex a system, the more difficult it is to build a mental model of the system, and the harder it becomes to operate and debug it.
Outages
- J.F.K. International Airport (New York)
- Epyx 1link
- Mailgun
  Mailgun had a series of outages in May, and they’ve released this postmortem.
- PlayStation Network
- Apple
- Amazon.com search
- TeamViewer
  Rampant rumors circulated suggesting that the outage was a security breach and that many users’ computers had been hijacked. TeamViewer denies this and states that it was a DoS attack.
- Cricket Wireless (US telecom)
- Amazon Web Services (Sydney, AU)
  The outage took many Australian services down with it.