SRE Weekly Issue #72

Articles

Designing robust and predictable APIs with idempotency

Idempotence is a critically important tool in building a reliable system. Stripe explains the concept and shows how they wrap theoretically non-idempotent actions like charging a credit card into safely idempotent API calls.
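
For readers new to the idea, here's a minimal sketch of the idempotency-key pattern. It is an illustration only, not Stripe's implementation: `PaymentService`, `charge`, and the in-memory dictionary are all hypothetical names, and a real service would persist keys durably and handle concurrent requests.

```python
import uuid


class PaymentService:
    """Hypothetical in-memory service used only to illustrate the pattern."""

    def __init__(self):
        self._results = {}     # idempotency key -> stored response
        self.charges_made = 0  # counts how often the real side effect ran

    def charge(self, idempotency_key, amount_cents):
        # If this key was already processed, replay the stored response
        # instead of charging the card a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        # The non-idempotent side effect happens at most once per key.
        self.charges_made += 1
        response = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = response
        return response


service = PaymentService()
key = str(uuid.uuid4())           # the client generates one key per logical operation
first = service.charge(key, 500)
retry = service.charge(key, 500)  # e.g. a retry after a network timeout
```

Because the retry reuses the same key, the client can safely resend the request when it doesn't know whether the first attempt went through.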

What’s not Actionable & Business Critical Shouldn’t Ring: Building the Right Alerting System

Here’s an account of an effort to move from server-based paging (this server is down) to function-based alerting (this user action isn’t working), resulting in an impressive reduction in out-of-hours paging.

CPU Utilization is Wrong

It pays to study up and deeply understand what a simple metric like “cpu utilization” really means.

AWS Service Health Dashboard

Why am I linking to AWS’s status site? Look closely, and you’ll see that the “green checkmark i” symbol has been replaced with a far more noticeable blue circle with a white diamond. Check out the old icon here for comparison. End of an era, or just another way of presenting the same information?

Circuit breaker and monitoring of a gRPC service in Ruby (Part 1)

The author introduces a new Ruby gem, grpc-commons, that makes it easy to add circuit breaker and statsd support to a gRPC client.
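
If the circuit breaker pattern is unfamiliar, here's a rough sketch of the general idea. It is written in Python for illustration and is not the gem's API: `CircuitBreaker`, `CircuitOpenError`, and the parameters `max_failures`, `reset_after`, and `clock` are hypothetical names chosen for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # failures tolerated before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable clock, handy for tests
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of hammering a struggling backend.
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure from here re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        return result
```

A client would wrap each remote call, for example `breaker.call(fetch_user, user_id)` where `fetch_user` stands in for any function that talks to the remote service, and treat `CircuitOpenError` as a signal to fall back or return an error quickly.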

Introducing distributed tracing in your Python application via Zipkin

Along with being a tutorial on setting up Zipkin with Python, this article also explains some basic Zipkin concepts.

Announcing the Modern Incident Resolution Lifecycle

PagerDuty is apparently trying to position itself as more than just a paging service, with a few new features around the entire incident lifecycle. I’m especially interested in checking out the new postmortem tooling.

How we Upgraded a 22TB MySQL Cluster from 5.6 to 5.7 (in 9 months)

I included this article last week, but my link was outdated and returned a 404. Here’s the corrected link — sorry about that!

A first look at Elastic’s new Machine Learning Technology

I put a call out for a review of Elastic’s new beta anomaly detection feature last week, and here one is! Thanks to an Elastic employee for forwarding this link to me.

IT Outages, Who’s Really at Fault?

This article cautions us to look past an obvious root cause, because a deeper systemic or policy problem may be lurking behind it.

Watch out for serverless computing’s blind spot

Serverless and FaaS platforms abstract away traditional provisioning, which makes it really easy to skip planning for resource usage.

Safety Moment – Are Accidents a Failure of Imagination? | PreAccident Investigation Podcast

Wow, what a concept:

you can think of […] reliable systems […] as successfully imagining all of the potential things that could go wrong

This 2.5-minute podcast from Todd Conklin poses a really great question: to achieve reliability, do we have to try to imagine in advance all of the possible ways our systems could fail?

“The Scariest Moment of My Life” – BWH Safety Matters

A patient was given an incorrect syringe, resulting in a 5x insulin overdose. Brigham and Women’s Hospital reports on the accident and what they’re doing to prevent mistakes of this sort in the future.

PagerDuty’s 2017 State of Digital Operations Report

Consumers today have increasingly high expectations for digital applications and service performance, but do IT personnel feel equipped to rise to the occasion? In this survey, we uncover the extent of the digital services expectation gap between consumers and IT teams as well as top strategies teams are using to solve digital disruption challenges.

Outages

Our First Kubernetes Outage – Saltside Engineering
- Kudos to the Saltside folks for sharing a public postmortem for an internal, non-customer-impacting outage!
  
  This is public postmortem for an a complete shutdown of our internal Kubernetes cluster. It’s shared with you all so everyone may learn.
“Re-experience the fun of customizing your Place Page!” A Tale of Oops from Ops
- Ouch. Linden Lab’s ops team discovered the hard way that they didn’t have a working backup copy of some customer data. The best part of this article is the discussion of the “Shrek Ears” tradition at Linden. It’s one of the things I remember most fondly from my time there, and having worn the ears a few times in my day, I can attest to the fact that it’s a great way to handle the psychological impact of making a mistake.
Chase (bank)
Facebook
