SRE Weekly Issue #33

Articles

Here’s another great article urging caution when adopting new tools. Codeship’s Jessica Kerr places technologies on a continuum of risk, from single-developer tools all the way up to new databases. She goes into excellent detail, with examples of how adopting a new technology can come back to bite you.

UN council: Seriously, nations, stop switching off the damn internet

After several recent incidents of nations cutting off or severely curtailing internet connectivity, the UN took a stand, as reported in this Register article:

The United Nations officially condemned the practice of countries shutting down access to the internet at a meeting of the Human Rights Council on Friday.

Man who spurred Citibank outage after work review sentenced to prison

Is it possible to design an infrastructure and/or security environment in which a rogue employee cannot take down the service?

How Complex Web Systems Fail — Part 1

Mathias Lafeldt is back in this latest issue of Production Ready. In part 1 of 2, he reviews Richard Cook’s classic How Complex Systems Fail, with an eye toward applying it to web systems.

Each necessary, but only jointly sufficient

And with a nod to Lafeldt for the link, here’s another classic from John Allspaw on the complexity of failures.

In the same way that you shouldn’t ever settle on “human error” as a root cause, if you only have a single root cause, you haven’t dug deep enough.

SGX provides further details on market disruption

SGX released a postmortem for their mid-July outage in the form of a press release. Just as Allspaw tells us, the theoretically simple root cause (disk failure) was exacerbated by a set of complicating factors.

Ending Alert Fatigue: Threat Stack and VictorOps On Modern-Day Security and Incident Management

In this recap of a joint webinar, Threat Stack and VictorOps share 7 methods to avoid and reduce alert fatigue.

Outages

Sprint
Zen (UK ISP)
Petnet
- Petnet suffered a server outage that prevented their smart feeders from feeding customers’ pets for hours.
IKEA
- Last week’s British Telecom outage resulted in the cards of 102 customers at IKEA brick-and-mortar stores being double-charged.
Instagram
Pokemon GO
Tarsnap
- Thanks to Jonathan Rudenberg for this one.
Twitter
Amazon.com
- This one slipped through my normal news collection. Fortunately(?) I caught it while trying to make a purchase on Amazon.
Netflix
Amazon Prime Instant Video
