SRE Weekly Issue #358

Articles

Seamless critical traffic migration with CoreDNS request rewrite feature

A new spin on changing the engines on a jet in flight: using DNS request/response rewriting to transition an application over without modification.

lainra — Mercari

Putting a number on scalability

How much additional capacity can you get for a dollar?

Dan Slimmon

How We Manage Incident Response at Honeycomb

Dealing with the unknown, limited cognitive bandwidth, coordination patterns, psychological safety and feeding information back into the organization.

Fred Hebert — The New Stack
Full disclosure: Honeycomb is my employer.

SRE Transformation: our thoughts

How do you enable adoption of SRE principles at a large, mature company that has little capacity for innovation?

the value proposition of “SRE” is the idea that you can handle an exponentially growing business with a logarithmically growing payroll.

Layer Aleph

How to Setup Multi-burn rate Windows Alert on Service Level Objectives

Read this one to learn about four attributes of good alerting and how to ensure your SLO burn rate alerts are effective.

Saheed Oladosu

Bad Observability

There’s plenty of content out there telling you how to implement observability, or what good looks like. But what about bad observability? What are some anti-patterns to watch out for?

Stephen Townshend — SquaredUp

On-call with Dave O’Connor

This is an interview about on-call with Twilio’s VP of SRE, who also spent 17 years as an SRE at Google.

Elena Boroda

Adding Zonal Resiliency to Etsy’s Kafka Cluster: Part 1

They started with a (mostly) single-availability-zone Kafka deployment. Here’s how they transitioned to a multi-zone architecture that can survive a single AZ failure.

Andrey Polyakov and Kamya Shethia — Etsy
