SRE Weekly Issue #375

Articles

How can you land 5 kilometers above the Moon?

An in-depth analysis of the crash of a recent lunar lander. It’s really interesting that a feature designed specifically to improve robustness to failures instead made the system less reliable in unforeseen circumstances.

Robert Barron — IBM

Cloud Dependencies Need to Stop F—ing Us When They Go Down

Every external cloud service you deploy adds its own unreliability, however small, to that of your product.

Jeff Martens — The New Stack
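
To put rough numbers on that point, here is a minimal sketch of how availability compounds. It assumes every dependency is a hard requirement and that failures are independent, which is a simplification. The function name and all availability figures are made up for illustration; none of them come from the article.

```python
from math import prod


def composite_availability(own, dependencies):
    """Estimate availability when the service and every dependency must all be up.

    Assumes independent failures and no fallbacks (illustrative model only).
    """
    return own * prod(dependencies)


# Hypothetical figures: our own service plus three external cloud services.
own = 0.999
deps = [0.9995, 0.999, 0.9999]

total = composite_availability(own, deps)
minutes_per_30_days = 30 * 24 * 60

print(f"Composite availability: {total:.4%}")
print(f"Expected downtime per 30 days: {(1 - total) * minutes_per_30_days:.0f} minutes")
```

Under this model the composite figure can never exceed the weakest component, so each added hard dependency can only lower it.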

How to Get an SRE Role

Are you a software engineer or an IT professional interested in transitioning to an SRE role? You’ve come to the right place! This article provides guidance on the skills and behaviors needed to successfully apply for an SRE position at medium-to-large tech companies.

Amin Astaneh — Certo Modo

Incident vs. bug: How to distinguish between these two (seemingly) related concepts

While it can seem pretty insignificant, properly distinguishing between an incident and a bug is worthwhile. Why? Because it will ultimately help dictate your response to it.

Luis Gonzalez — incident.io

An educational side project

This is impressive: an engineer built an entire model of a ride-share system, complete with simulated riders and drivers, metrics, containerization, the works, all to gain a better understanding of how these kinds of systems work.

Gergely Orosz — Pragmatic Engineer

Why bother with SLI and SLO?

This article answers the most important questions:
* How is using service levels any different than “regular” alarms?
* What’s in it for the company and the teams?
* Why bother? Don’t we already have enough work to do?

Alex Ewerlöf

eBay’s Common Automation Solution for Platform Evolution

Here at eBay, we’ve crafted a brand new approach to automate platform evolution for all applications — one that provides a repeatable and reusable infrastructure to streamline evolution.

Paul Zhang and Tao Jin

How Traceloop Leverages Honeycomb and LLMs to Generate E2E Tests

Interesting idea: feeding trace data into an LLM and asking it to build an end-to-end (E2E) test for the entire system. This article is a good description of what they’re doing, but I’d be interested to hear more about the results.

Nir Gazit — Honeycomb
Full disclosure: Honeycomb is my employer.

Reflections on Amazon Prime Video’s Monolith Move

What conclusions can we draw from the recent announcement that Amazon Prime Video is moving from serverless to a monolith?

The supposed difference between the two methods is not based on the technology itself, but the context in which you’re working.

Ian Miell
