SRE Weekly Issue #439

Client-Side Monitoring Is a Must for Mobile Apps

Read on to learn why client-side network monitoring is so important and what you are missing if your only visibility into network performance is from a backend perspective.

Fredric Newberg — The New Stack

Piloting through the Fog: A Tale of Migrating to a New Kubernetes Platform

An engineer with no Kubernetes experience migrates an app to Kubernetes — with a bit of help from StackOverflow and Copilot, of course.

Jacob Brandt — Klaviyo

How our data team handles incidents

As data teams become increasingly critical, problems in their systems become incidents. Here’s an overview of how one data team has designed their incident response process.

Navo Das — incident.io

Avoiding downtime: modern alternatives to outdated certificate pinning practices

Certificate pinning can be a useful practice, but it’s also fraught with pitfalls and outage risks, especially with the modern tendency toward shorter-lived certificates and multiple intermediates. What can we do instead?

Dina Kozlov — Cloudflare

What is an SLA?

A super-thorough overview of SLAs with a helpful section on how to choose the level for an SLA.

Diana Bocco — UptimeRobot

Optimizing global message transit latency: a journey through TCP configuration

This debugging story focuses on a Linux TCP option I wasn’t familiar with: tcp_slow_start_after_idle.

Amnon Cohen — Ably

When Publicity Gets in the Way of Scalability: Dreamport Case

This is the story of a company that got an unexpectedly huge rush of interest in their platform—and traffic too. They made a number of changes to quickly scale to meet the demand.

Jekaterina Petrova — Dyninno

[Honeycomb] UI and API unavailable

This Honeycomb incident follow-up seems to be related to their post that I shared last week.

Honeycomb

Full disclosure: Honeycomb is my employer.
