SRE Weekly Issue #48

This is the first issue of SRE Weekly going out to over 1000 email subscribers! Thanks to all of you for continuing to make my little side project so rewarding and fulfilling. I can’t believe I’m almost at a year.

Speaking of which, there won’t be an issue next week while my family and I are vacationing at Disney World. See you in two weeks!

Articles

When downtime is not an option

A detailed description of Disaster Recovery as a Service (DRaaS), including a discussion of the cost versus creating a DR site oneself. This is the part I always wonder about:

However, for larger enterprises with complex infrastructures and larger data volumes spread across disparate systems, DRaaS has often been too complicated and expensive to implement.

The Prime Directive

This one’s so short I can almost quote the whole thing here. I love its succinctness:

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

The Netflix Tech Blog: Post-mortem of October 22,2012 AWS degradation

Just over four years ago, Amazon had a major outage in Elastic Block Store (EBS). Did you see impact? I sure did. Here’s Netflix’s account of how they survived the outage mostly unscathed.

Serverless promises and the persistent need for critical alerting

I’m glad to see more people writing that Serverless != #NoOps. This article is well-argued, even though it turns into an OnPage ad three paragraphs from the end.

Episode 004: Charity Majors – Greater Than Code

What else can we expect from Greater Than Code + Charity Majors? This podcast is 50 minutes of awesome, and there’s a transcription, too! Listen/read for awesome phrases like “stamping out chaos”, find out why Charity says, “I personally hate [the term ‘SRE’] (but I hate a lot of things)”, and hear Conway’s law applied to microservices, #NoOps debunking, and a poignant ending about misogyny and equality.

Microsoft Announces Azure DNS General Availability

Microsoft released its Route 53 competitor in late September. They say:

Azure DNS has the scale and redundancy built-in to ensure high availability for your domains. As a global service, Azure DNS is resilient to multiple Azure region failures and network partitioning for both its control plane and DNS serving plane.

Communication Breakdown Leads to Patient Burn

This issue of BWH Safety Matters details an incident in which a communication issue between teams that don’t normally work together resulted in a patient injury. This is exactly the kind of pitfall that becomes more prevalent with the move toward microservices, as siloed teams sometimes come into contact only during an incident.

New systems will fail: A site outage case study from Envato Market

A detailed postmortem from an outage last month. Lots of takeaways, including one that kept coming up: test your emergency tooling before you need to use it.

Outages

Canada’s immigration site
- I’m sure this is indicative of something.
Office 365
Twitter
- Twitter stopped announcing their AS in BGP worldwide, resulting in a 30-minute outage on Monday.
Google BigQuery
- Google writes really great postmortems! Here’s one for a 4-hour outage in BigQuery on November 8, posted on November 11. Fast turnaround and an excellent analysis. Thanks, Google — we appreciate your hard work and transparency!
Pingdom
- Normally I wouldn’t include such a minor outage, but I love the phrase “unintended human error” that they used. Much better than the intended kind.
WikiLeaks
eBay
