SRE Weekly Issue #186

Articles

DBMS Musings: Introduction to Transaction Isolation Levels

This article is highly technical without being overwhelmingly detailed.

It is very important that a database user is aware of the isolation level guaranteed by the database system, and what concurrency bugs may emerge as a result.

Daniel Abadi

How to Avoid the 5 SRE Implementation Traps that Catch Even the Best Teams

The traps are:

You don’t have enough cross-team usage or buy-in.

Your difficult and dense process is slowing down incident response.

Postmortems are underutilized and don’t encompass in-depth learnings.

You wait for incidents to happen.

You stop at incident management without SLOs.

Lyon Wong — Blameless

Distributed Tracing: Impact on Engineering Organizations

Need to argue the benefits of implementing distributed tracing in your organization? This article will help you get started.

dm03514

Love (and Alerting) in the Time of Cholera (and Observability)

The question is: what is the proper role of alerting in the modern era of distributed systems? Have alerting best practices changed with the shift from monitoring and known-unknowns to observability and unknown-unknowns?

Charity Majors

Enhancing Bandaid load balancing at Dropbox by leveraging real-time backend server load information

Round-robin load balancing often isn’t good enough; it’s necessary to intelligently route requests to nodes that aren’t overloaded. How do you get information about backend health to distributed load balancer nodes efficiently? A: add a response header.

Haowei Yuan, Yi-Shu Tai, and Dmitry Kopytkov — Dropbox
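To make the response-header idea concrete, here is a minimal sketch, not Dropbox's implementation. It assumes each backend reports a numeric load value in a header on every response, and that the balancer sends the next request to whichever backend last reported the lowest value. The header name `X-Backend-Load`, the `LoadAwareBalancer` class, and the least-loaded selection rule are all hypothetical choices made for illustration.

```python
from typing import Dict, Iterable, Mapping

# Hypothetical header name, chosen for this sketch only.
LOAD_HEADER = "X-Backend-Load"


class LoadAwareBalancer:
    """Routes to the backend that most recently reported the lowest load."""

    def __init__(self, backends: Iterable[str]) -> None:
        # Assume zero load for every backend until a response says otherwise.
        self.loads: Dict[str, float] = {name: 0.0 for name in backends}

    def pick(self) -> str:
        # Least-loaded selection; ties go to the backend listed first.
        return min(self.loads, key=lambda name: self.loads[name])

    def record_response(self, backend: str, headers: Mapping[str, str]) -> None:
        # Update our view of a backend from the header it piggybacked on a response.
        raw = headers.get(LOAD_HEADER)
        if raw is None:
            return
        try:
            self.loads[backend] = float(raw)
        except ValueError:
            # Ignore malformed values and keep the previous reading.
            pass
```

Because the load figure rides along on responses the balancer already receives, no separate health-reporting channel is needed in this sketch.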

Splash the cache: how caching improved our reliability

By adding in-memory caching with a mere 3-second TTL, these folks achieved a ~75% cache hit rate, allowing them to withstand request spikes without an outage.

Mina Gyimah — Pusher
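For a sense of how little code a short-TTL cache needs, here is a minimal sketch, not Pusher's implementation. The `TTLCache` class, its `get_or_load` method, and the injectable `clock` parameter are hypothetical names introduced for illustration. Only the 3-second TTL comes from the article summary above.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        # Serve from memory while the stored expiry time is still in the future.
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Otherwise call the backing store and remember the result briefly.
        self.misses += 1
        value = loader(key)
        self._entries[key] = (now + self.ttl_seconds, value)
        return value
```

Under a request spike for the same key, only the first call in each 3-second window reaches the backing store in this sketch; the rest are answered from memory. It is not thread-safe and never evicts old keys, which a production cache would need to handle.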

Outages

Tokbox
- Thanks to Aos Dabbagh for this one.
Chef (system administration tool)
- Many of us experienced failures in our Chef runs after a former Chef employee removed their code. Chef posted a follow-up explaining its position on the matter.
Fastly
Reddit
Net4 (hosting provider)
Salesforce
LinkedIn
Google Search
Heroku
Squarespace
- Also this one.
