SRE Weekly Issue #190

A message from our sponsor, VictorOps:

In the latest guide, Resilience First, you’ll learn about the origin of SRE, how it’s evolved over the last few years, and the future of its impact on building highly observable, resilient applications and infrastructure.

http://try.victorops.com/sreweekly/sre-golden-signals-guide

Articles

This company had a really challenging on-call situation to fix. Monolithic codebase, and a huge team with so many people in the on-call rotation that folks were out of practice by the time it was their turn.

Molly Struve

This article includes charts, observations, and conclusions from the author’s by-hand analysis and categorization of several hundred incidents.

Subbu Allamaraju

Charity Majors replied to a suggestion to write alerts for everything with her ideas for a better way.

Charity Majors (@mipsytipsy)

Where many databases use threading to handle concurrent clients, PostgreSQL forks one child process per client. This has ramifications that an operator must take into consideration.

Kristi Anderson — High Scalability

This article is about attributes, but it doesn’t mention a specific system. I have yet to find an anomaly detection system that doesn’t produce so many false positives that it’s useless.

Hive mind: if you’re using an anomaly detection system that actually works and doesn’t drown you with false positives, I want to hear about it. Bonus points if you want to write an article about it!

Amit Levi

Outages

Updated: October 20, 2019 — 9:00 pm
A production of Tinker Tinker Tinker, LLC Frontier Theme