SRE Weekly Issue #186

A message from our sponsor, VictorOps:

See why DevOps teams are more collaborative and transparent than traditional IT operations – helping them build highly efficient incident management and response systems:

http://try.victorops.com/sreweekly/devops-incident-management-guide

Articles

This article is highly technical without being overwhelmingly detailed.

It is very important that a database user is aware of the isolation level guaranteed by the database system, and what concurrency bugs may emerge as a result.

Daniel Abadi

The traps are:

  1. You don’t have enough cross-team usage or buy-in.
  2. Your difficult and dense process is slowing down incident response.
  3. Postmortems are underutilized and don’t encompass in-depth learnings.
  4. You wait for incidents to happen.
  5. You stop at incident management without SLOs.

Lyon Wong — Blameless

Need to argue the benefits of implementing distributed tracing in your organization? This article will help you get started.

dm03514

The question is: what is the proper role of alerting in the modern era of distributed systems? Have alerting best practices changed with the shift from monitoring and known-unknowns to observability and unknown-unknowns?

Charity Majors

Round-robin load balancing often isn’t good enough; it’s necessary to intelligently route requests to nodes that aren’t overloaded. How do you get information about backend health to distributed load balancer nodes efficiently? Answer: add a response header.

Haowei Yuan, Yi-Shu Tai, and Dmitry Kopytkov — Dropbox
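
To make the response-header idea concrete, here is a minimal sketch of a balancer that learns backend load from a header on each response and routes to the least-loaded node. The header name `X-Backend-Load`, the `LoadAwareBalancer` class, and the least-loaded selection rule are all assumptions for illustration; they are not taken from the Dropbox article, which may use a different signal and selection strategy.

```python
from typing import Dict, Iterable, Mapping

# Hypothetical header name; the real system may use a different one.
LOAD_HEADER = "X-Backend-Load"


class LoadAwareBalancer:
    """Routes to the backend with the lowest most-recently reported load."""

    def __init__(self, backends: Iterable[str]) -> None:
        # Until a backend reports in, assume it is idle (load 0.0).
        self.load: Dict[str, float] = {b: 0.0 for b in backends}

    def pick(self) -> str:
        # Ties go to the backend listed first.
        return min(self.load, key=lambda b: self.load[b])

    def record_response(self, backend: str, headers: Mapping[str, str]) -> None:
        """Update the load estimate from the header piggybacked on a response."""
        if backend not in self.load:
            return
        raw = headers.get(LOAD_HEADER)
        if raw is None:
            return
        try:
            self.load[backend] = float(raw)
        except ValueError:
            # Ignore malformed values and keep the previous estimate.
            pass
```

Because the load report rides on responses the balancer already receives, no separate health-reporting channel is needed; the trade-off is that a backend's estimate only refreshes when traffic is sent to it.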

By adding in-memory caching with a mere 3-second TTL, these folks achieved a ~75% cache hit rate, allowing them to withstand request spikes without an outage.

Mina Gyimah — Pusher
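
As a rough illustration of why even a very short TTL helps, the sketch below caches loaded values in memory for 3 seconds and counts hits and misses. The `TTLCache` class, its `get_or_load` method, and the injectable `clock` parameter are hypothetical names chosen for this example; this is not Pusher's implementation.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire ttl_seconds after loading."""

    def __init__(
        self,
        ttl_seconds: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so tests can control time
        self.store: Dict[Hashable, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and entry[0] > now:
            # Entry is still fresh: serve it without touching the backend.
            self.hits += 1
            return entry[1]
        # Missing or expired: load from the backend and cache with a new expiry.
        self.misses += 1
        value = loader(key)
        self.store[key] = (now + self.ttl, value)
        return value
```

During a spike, many requests for the same key arrive within one TTL window, so only the first of them reaches the backend; the hit rate actually achieved depends on how concentrated the traffic is.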

Outages

Updated: September 22, 2019 — 9:03 pm
A production of Tinker Tinker Tinker, LLC