SRE Weekly Issue #136

SPONSOR MESSAGE

Define goals, set agendas, and build SRE like a boss. SRE team lead, Jonathan Schwietert, discusses how to organize effective SRE meetings and cultivate a collaborative culture of resiliency:

http://try.victorops.com/sreweekly/organized-sre

Articles

This infographic shows how Ably’s client library and backend infrastructure are designed to work around many common failure modes. My favorite: they have redundant TLS certificates from distinct issuers.

Matthew O’Riordan — Ably

This article argues that spending a little time to fix staging can make production significantly more stable.

Michael Nygard

This is a story of a flawed development process on top of a flawed infrastructure, without the necessary data to drive decision-making. It’s also a story of waking up to these problems and charting a way out.

[…]

As it turns out, pure reasoning cannot solve the kind of problems you see in the production environment of a complex application. These problems are almost always more difficult, since they have survived all of the testing you could throw at them.

John Casey

A story of a somewhat rare failure case (a datacenter heat buildup event) and how to monitor for such a thing without contributing to metrics overload.

Pavel Trukhanov — okmeter

On Twitter this week, @srhtcn noted that “Many incidents happen during or right after release” and asked for advice on ways to fix this.

Great advice, useful for managers and individual contributors.

Charity Majors

Outages

Updated: August 26, 2018 — 9:00 pm
A production of Tinker Tinker Tinker, LLC