SRE Weekly Issue #400


A guide to Managing the First Fallacy of Distributed Computing

The network is not reliable. What are the implications and what can we do about it?

Anadi Misra

Incident severity levels for online platforms

Beyond a run-of-the-mill severity levels article, this one goes into a couple of common pitfalls.

Jonathan Word

Status Pages 101: How to Create a Status Page You and Your Customers Will Actually Want to Use

Some good tips in here, especially the one about brevity.

Ashley Sawatsky — Rootly

Lessons learned from two decades of Site Reliability Engineering

Subtitle: Or, Eleven things we have learned as Site Reliability Engineers at Google

Adrienne Walcer, Kavita Guliani, Mikel Ward, Sunny Hsiao, and Vrai Stacey — Google

Don’t name your EKS Managed NodeGroups (unless you want to trigger an incident)

Good lessons to learn here that apply more broadly than just EKS.

Christian Alexánder Polanco Valdez — Adevinta

Three reasons a liberal arts degree helped me succeed in tech

This article is about project management, but a lot of the skills discussed apply to aspects of SRE at Staff+ levels.

Sannie Lee — Thoughtworks (via martinfowler.com)

How Does Generative AI Work with Devops and Incident Response?

Now this is more like it: there’s a healthy dose of skepticism woven through this article, including things genAI probably won’t be good for, and potential pitfalls.

Jesse Robbins — Heavybit

From Oops to Ops: SLOs Get Budget Rate Alerts

There are two different ways of alerting on SLOs, for two very different audiences, as explained in this article. Ostensibly this is a product feature announcement, but you don’t need to be using the product to get a lot out of this.

Fred Hebert — Honeycomb
Full disclosure: Honeycomb is my employer.

