SRE Weekly Issue #80

Articles

Linux tracing systems & how they fit together

I had no idea there were so many tracing systems in Linux! Fortunately Julia Evans did, and she learned all about them so that she could explain them to us.

There’s strace, and ltrace, kprobes, and tracepoints, and uprobes, and ftrace, and perf, and eBPF, and how does it all fit together and what does it all MEAN?

So you want to be an SRE?

What do you get when a high school teacher switches careers, goes to boot camp, and becomes an SRE? In this case, we get Krishelle Hardson-Hurley, who wrote this really great intro to the SRE field. She also included a set of links to other SRE materials. Thanks for the link to SRE Weekly, Krishelle!

Embracing Failure in a Container World – Production Ready

This issue of Production Ready is a transcript (with slides) of Mathias’s talk at ContainerDays on doing chaos engineering in a container-based infrastructure. I really like the idea of attaching a side-car container to inject latency using tc.

Why is Redfin running its site from a single data center without a backup facility?

Here’s an interesting side-effect from an IPO: Redfin was obliged to mention the fact that its website runs out of a single datacenter.

Event Foo: Designing for Results

This article, part of a series from Honeycomb.io on structured event logging, contains some tips on structuring your events well to get the most out of your logs.

The Peculiarities of High-Availability Data Center Design on a Cruise Ship

I’d never thought about what IT systems must exist on a cruise ship before. This article left me wanting to know more, so I found this ZDNet article with pictures and descriptions of another cruise ship datacenter layout.

Outages

Chase Bank
Data glitch sets tech company stock prices at $123.47
- Here’s an interesting one. Vendors that consume and distribute price information from Nasdaq incorrectly interpreted “normal test data” from Nasdaq as if it were real. It looked like a bunch of companies’ stock prices had crashed or jumped by huge amounts.
Alphabay

Summary of the December 24, 2012 Amazon ELB Service Event in the US-East Region
- Here’s a classic postmortem from Amazon, in which a developer inadvertently deleted the production ELB state information.
Slack: This was not normal. Really.
- I’d forgotten about this superlative example of an incident followup posting from Slack after a pair of outages in 2014. What reminded me was a commit to Dan Luu’s post-mortems repo on GitHub that mentioned it.
Heroku Incident 372: HTTP Routing Errors
- Here’s another classic incident followup posting. Heroku spills the details on a major outage that cut off access to all applications for 30 minutes in 2012. Full disclosure: Heroku is my employer.
