SRE Weekly Issue #96

Articles

The Phone Book Is On Fire: Lessons From the Dyn DNS DDoS — Velocity NYC 2017

Here’s the recording of my Velocity 2017 talk, posted on YouTube with permission from O’Reilly (thanks!). Want to learn about some gnarly DNS details?

Log20: Fully automated optimal placement of log printing statements under specified overhead threshold

I fell in love with this after reading just the title, and it only got better from there. Why add debug statements haphazardly when an algorithm can automatically figure out where they’ll be most effective? I especially love the analysis of commit histories to build stats on when debug statements were added to various open source projects.

Operating a Kubernetes network

Julia Evans is back with another article about Kubernetes. Along with explaining how it all fits together, she describes a few things that can go wrong and how to fix them.

How can we apply the principles of chaos engineering to AWS Lambda?

In this introductory post of a four-part series, we learn why chaos testing a Lambda-based infrastructure is especially challenging.

Google Vizier: A service for black-box optimization

I love the idea of a service that automatically optimizes things even without knowing anything about their internals. Mmm, cookies.

Lyft’s Envoy dashboards

What we are releasing is unfortunately not going to be readily consumable. It is also not an OSS project that will be maintained in any way. The goal is to provide a snapshot of what Lyft does internally (what is on each dashboard, what stats do we look at, etc.). Our hope is having that as a reference will be useful in developing new dashboards for your organization.

Microsoft has built a secret network emulator it says can prevent most cloud outages

It’s not a secret since they published a paper about it. This is an intriguing idea, but I’m wondering whether it’s really more effective than staging environments tend to be in practice.

The Rise of Site Reliability Engineers

A history of the SRE profession and a description of how New Relic does SRE.

Full disclosure: Heroku, my employer, is mentioned.

Outages

Collision with buffer stops at King’s Cross station, London, 15 August 2017
- This is the Rail Accident Investigation Branch’s report on a minor accident involving a driver who suffered a “microsleep” due to fatigue.
LearnVest
Slack
