SRE Weekly Issue #233

Articles

Keeping Google Meet ahead of usage demand during COVID-19

In this post, I’ll share how we ensured that Meet’s available service capacity was ahead of its 30x COVID-19 usage growth, and how we made that growth technically and operationally sustainable by leveraging a number of site reliability engineering (SRE) best practices.

Samantha Schaevitz — Google

Battleshorts, exaptations, and the limits of STAMP

I love the concept of “battleshorts” just as much as I’ve been enjoying this series of articles analyzing STAMP.

Lorin Hochstein

Incident Review: Meta-Review, August 2020

Honeycomb had five incidents in just over a week, prompting not only their normal incident investigation process but also a meta-analysis of all five together.

Emily Nakashima — Honeycomb

Chromium’s impact on root DNS traffic

Why is Chromium responsible for half of the DNS queries to the root nameservers? And why do they all return NXDOMAIN?

Matthew Thomas — APNIC

That Moment

“That Moment” when your fire suppression system triggers and the fire department shows up. This is part war story and part description of incident response practices.

Ariel Pisetzky — Taboola

Google Cloud Issue Summary Multiple Products – 2020-08-19

An overload in an internal blob storage system impacted many dependent services.

Google

Scaling services with Shard Manager

Sharding as a service: now there’s an interesting idea.

Gerald Guo, Thawan Kooburat — Facebook

What is a Kubernetes Operator and Why it Matters for SRE

In Kubernetes Operators: Automating the Container Orchestration Platform, authors Jason Dobies and Joshua Wood describe an Operator as “an automated Site Reliability Engineer for its application.” Given an SRE’s multifaceted experience and diverse workload, this is a bold statement. So what exactly can the Operator do?

Emily Arnot — Blameless

Outages

Zoom
Slack
Let’s Encrypt
NZX (New Zealand Stock Exchange)
eBay
Garmin
Heroku
Fastly
- Also this one.
  Full disclosure: Fastly is my employer.
Cloudflare
