SRE Weekly Issue #1

Articles

An excellent discussion of the need to look at human error in a broader context.
A cogent argument that code freezes increase risk rather than reducing it.
An interesting outline of a hardware platform with duplicate everything (CPU, RAM, etc.) claiming 7+ nines of availability. I’m not convinced of its utility outside a few niche areas, but it’s a neat concept.
A short discussion of how Netflix prepares for the holidays.
A new tool that checks connectivity through real requests. Is it enough to monitor internal services from a central monitoring machine? What if service A is unreachable only to hosts in cluster B, but Nagios can see it just fine? I’ve seen that before, and it made me wonder whether I need to monitor everything from everywhere.
Instagram posted an account of growing to double the traffic and multiple regions.
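
To make the "monitor everything from everywhere" question above concrete, here is a minimal sketch of comparing what each vantage point sees against what the central monitor sees. It is not the tool mentioned above. The names `can_connect`, `blind_spots`, the host names, and the report format are all hypothetical, and it assumes each host runs its own probes and ships the results somewhere central.

```python
import socket
from typing import Dict, List, Tuple


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds from this machine."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def blind_spots(
    reports: Dict[str, Dict[str, bool]], central: str
) -> List[Tuple[str, str]]:
    """Return (vantage, service) pairs that fail while the central monitor sees the service as up."""
    central_view = reports.get(central, {})
    missed = []
    for vantage, view in sorted(reports.items()):
        if vantage == central:
            continue
        for service, ok in sorted(view.items()):
            # A failure the central monitor cannot see is the interesting case.
            if not ok and central_view.get(service, False):
                missed.append((vantage, service))
    return missed


# Hypothetical reports: each vantage point maps service name -> reachable?
# In practice each entry would be filled in by running can_connect() on that host.
reports = {
    "nagios": {"service-a": True, "service-b": True},
    "cluster-b-host1": {"service-a": False, "service-b": True},
    "cluster-c-host1": {"service-a": True, "service-b": True},
}

print(blind_spots(reports, central="nagios"))
```

With the sample data, the only pair reported is service A as seen from the cluster B host, which is exactly the failure a central Nagios check would miss.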

An example of a retrospective analysis process. I especially like this quote:

We also ask what led a person to believe that what they did was the right choice. Rarely does someone intend to do the wrong thing.

What happens when an operator who knows all of the secrets is suddenly unavailable? How do you make their secrets available without compromising security?

Outages

Black Friday, Cyber Monday, and the weekend in between are a critical time for sites to remain available. This year, some notable companies had a hard time.

An older but nice postmortem analysis posted by Slack in October.
PSN was down over Black Friday weekend.
Neiman Marcus lost out on much of Black Friday.
Newegg also had issues on Black Friday.
Argos’s Black Friday Deals page was down.
eBay suffered an outage the day before Thanksgiving.
PayPal suffered a major outage during cyber weekend.
Google Compute Engine saw issues with some of its transit traffic stemming from a new BGP peer accidentally announcing far more routes than it should have. They posted a nicely detailed analysis.
Target had downtime on Cyber Monday.
Time Warner Cable customers were frustrated by “intermittent availability” (read: outages) on Cyber Monday, hampering their ability to get in on all the deals.
Updated: December 19, 2015 — 3:28 pm
A production of Tinker Tinker Tinker, LLC