SRE Weekly Issue #370

Articles

Improving Incident Recovery By using SLI Pyramid

[…] although “getting the system back up” should be our first priority, to do so safely, we first need to very carefully define what “up” means.

What functionality is critical? Should we sacrifice feature A to save feature B? It’s important to plan ahead.

Boris Cherkasky

Slack Said It Had 100% Uptime. Did It Really?

It turns out that it depends on how you define “uptime”. Does claiming “100%” actually benefit you?

Ellen Steinke — Metrist

The importance of right-sizing your retro

Skipping the retro shouldn’t be an option. Ditch the one-size-fits-all process to ensure that this important step is held at the end of every incident.

Jouhné Scott — FireHydrant

Site Reliability Engineering 101

Another good one to have in your back pocket for those “What would you say… you do here?” moments.

Ash Patel — SREPath

The True Cost of Building Your Own IMS

Build versus buy for incident management systems: what is the true cost of rolling your own?

Biju Chacko and Nir Sharma — Squadcast

Deploy AWS Resources Seamlessly With ChatGPT

A plugin to give ChatGPT the ability to run AWS API calls. I’m not sure how I feel about this.

Banjo Obayomi — DZone

Improved Alerting with Atlas Streaming Eval

They solved a cardinality explosion by switching from query-based alerting to stream data processing.

Ruchir Jha, Brian Harrington, and Yingwu Zhao — Netflix


