SRE Weekly Issue #130

Articles

Goodbye Microservices: From 100s of problem children to 1 superstar

Segment discovered the hard way that their move to a microservice architecture had brought far more problems than benefits. Here’s why they transitioned back and how they pulled it off. Awesome article!

Alexandra Noonan — Segment

Establishing Resilience: The Challenges and Opportunities of Complexity

Drawing on the work of Dr. David Woods and the rest of the SNAFU Catchers, this article discusses the concepts behind resiliency and how to measure and achieve it.

Beth Long — New Relic

Solving for serverless: How do you manage something that’s not there?

Serverless is not the magical gateway to the land of NoOps. You still have to operate your system even if you’re not directly running the servers. This article does a great job of explaining why.

Bhanu Singh — Network World

How I use Wireshark

New to me: Wireshark’s statistics view and how it can be useful.

Julia Evans

Health and availability in computer systems

How do you define whether your system is available and healthy? This article uses an analogy to medical health.

Claiming that our system is doing well means nothing if users can perceive an outage.

José Carlos Chávez — Typeform

On the AWS Application Load Balancer HTTP/2 Support

These folks are experiencing mysterious latency with HTTP/2 traffic to their ALB and published this report on their investigation. There’s no happy ending here — ultimately they disabled HTTP/2 support. I hope they update if they do discover the culprit.

Peter Forsberg — ShopGun

relp 100% cpu – rsyslog stop after start · Issue #13 · rsyslog/librelp · GitHub

I had some fun this week unearthing the cause for the chronic lockups in Rsyslog that we’ve experienced at work. I found the cause (overeager retries of socket writes) and put together a bug report and a pull request.

Full disclosure: Fastly, my employer, is mentioned.

Building Grab’s Experimentation Platform

I love science! Grab wrote a nifty tool to help them select cohorts of users and perform experiments on them.

Abeesh Thomas and Roman Atachiants — Grab

Auto Scaling Production Services on Titus

Titus is the container orchestration system that Netflix created and open sourced. Rather than building a new auto-scaling feature for Titus, they instead got Amazon to productize EC2’s auto-scaling engine as a generalized auto-scaling tool, which Netflix then integrated with Titus. Neat!

See Amazon’s Application Auto Scaling announcement, published this past week.

Andrew Leung, Amit Joshi, and the rest of the Titus team — Netflix

Outages

Gmail
Google Docs, Sheets, et al.
YouTube TV
- During the World Cup match.
Discord
- Discord had a couple of outages this week.
Instagram
Mastercard
Facebook Messenger
Snapchat
99acres (real estate site)
Heroku
Disney blames 4-hour tech woes on network maintenance
- Here’s an update on the Disney system outage I linked to last week.
  Gabrielle Russon — Orlando Sentinel
