SRE Weekly Issue #11

Articles

The big scary news this week was the buffer overflow vulnerability in glibc’s getaddrinfo function (CVE-2015-7547). Aside from the obvious availability impact of intrusions, this bug requires us to roll virtually every production service in our fleets. If we don’t have good procedures in place for that, now is when we find out, in the form of downtime.
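
One operational wrinkle worth noting: upgrading the glibc package doesn’t fix processes that are already running, since they keep the old library mapped until they restart. Here’s a rough sketch (Python, assuming a Linux host and enough privileges to read other processes’ /proc entries) of one way to flag processes still holding a deleted libc mapping:

```python
#!/usr/bin/env python3
"""Rough sketch: after a glibc upgrade, any process started before the
upgrade still maps the old, vulnerable library. Scanning
/proc/<pid>/maps for libc entries marked "(deleted)" flags the
processes that still need a restart. Run as root to see everything."""
import glob

for maps_path in glob.glob("/proc/[0-9]*/maps"):
    pid = maps_path.split("/")[2]
    try:
        with open(maps_path) as f:
            for line in f:
                if "libc" in line and "(deleted)" in line:
                    print(f"pid {pid} still maps old libc: {line.split()[-2]}")
                    break
    except (PermissionError, FileNotFoundError):
        # Process exited mid-scan or we lack permission; skip it.
        continue
```

In practice you’d feed a list like this into whatever orderly restart procedure your fleet tooling supports, rather than restarting everything at once.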

Bryan Cantrill, CTO of Joyent, really crystallized the gross feelings that have been rolling around in my mind with regard to unikernels. I would point colleagues to this article if they suggested we should deploy unikernel applications. He makes lots of good points, especially this one:

Unikernels are entirely undebuggable. There are no processes, so of course there is no ps, no htop, no strace — but there is also no netstat, no tcpdump, no ping!

I find the implicit denial of debugging production systems to be galling, and symptomatic of a deeper malaise among unikernel proponents: total lack of operational empathy.

Atlassian dissects their response to a recent outage and in the process shares a lot of excellent detail on their incident response and SRE process. I love that they’re using the Incident Commander system (though under a different name). This could have (and probably has) come out of my mouth:

The primary goal of the incident team is to restore service. This is not the same as fixing the problem completely – remember that this is a race against the clock and we want to focus first and foremost on restoring customer experience and mitigating the problem. A quick and dirty workaround is often good enough for now – the emphasis is on “now”!

My heart goes out to the passengers who were hurt or killed and to their families, but also to the controller who made the error. There’s a lot to investigate here about how a single human was put in a position where a single error could cause such devastation. Hopefully there are ways the system can be remediated to prevent such catastrophes in the future.

As in medicine, we can learn a lot about how to prevent and deal with errors from the hard lessons of aviation.

You’d think technically advanced aircraft would be safer with all that information and fancy displays. Why they’re not has a lot to do with how our brains work.

When I saw Telstra offer a day of free data to its customers to make up for last week’s outage, I cringed. I’m impressed that they survived last Sunday, when Australians used 1.8 petabytes of data.

In this article, the author describes discovering that a service he had previously ignored, assuming it saw very little traffic, was actually serving a million requests per day.

If ignorance isn’t an operational strategy, what is? Paranoia. You should code and run your systems like a large group of Internet lunatics are out to abuse the hell out of them.
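
As a tiny, concrete instance of that paranoia, here’s a sketch of a per-client token-bucket rate limiter; the class, limits, and client key below are illustrative, not from the article:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client token bucket: a simple guard against abusive traffic."""

    def __init__(self, rate, burst):
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum bucket size
        self.tokens = defaultdict(lambda: burst)
        self.last = defaultdict(time.monotonic)

    def allow(self, client):
        now = time.monotonic()
        elapsed = now - self.last[client]
        self.last[client] = now
        self.tokens[client] = min(self.burst, self.tokens[client] + elapsed * self.rate)
        if self.tokens[client] >= 1:
            self.tokens[client] -= 1
            return True
        return False

# Example: allow roughly 5 requests/second per client, with bursts of 10.
limiter = TokenBucket(rate=5, burst=10)
if not limiter.allow("203.0.113.7"):
    pass  # reject the request (e.g., return HTTP 429)
```

A guard like this in front of every endpoint at least turns abusive traffic into a throttling problem instead of an outage.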

This is a great intro to Chaos Engineering, a field I didn’t know existed that grew out of Netflix’s Chaos Monkey (there’s a small sketch of the idea below). It’s the first article in what the author promises will be a biweekly series.

Thanks to DevOps Weekly for this one.
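
To give a flavor of the idea, here’s a toy chaos experiment in the spirit of Chaos Monkey. It is not Netflix’s implementation, and the instance names and terminate call are hypothetical stand-ins for your infrastructure API:

```python
import datetime
import random

def terminate(instance_id):
    # Hypothetical stand-in: swap in your cloud provider's terminate call.
    print(f"terminating {instance_id}")

def run_chaos_experiment(instances):
    """Kill one random instance, but only during business hours, so
    engineers are around to watch how the service degrades."""
    now = datetime.datetime.now()
    if now.weekday() < 5 and 9 <= now.hour < 17:
        terminate(random.choice(instances))

run_chaos_experiment(["web-1", "web-2", "web-3"])
```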

Outages
