General

Happy new year!

lex

December 30, 2018

Happy new year, SRE Weekly readers! No issue this week as I attempt to recover from the holidays.

Thank you all so much for reading. The past three years have been awesome, and I love all the great comments and contributions I receive from you folks.

See you next week!

SRE Weekly Issue #153

lex

December 23, 2018

General

Articles

110: Human Incident Response with Courtney Eckhardt – Greater Than Code

In this podcast episode, Courtney Eckhardt and the panel cover a lot of bases related to incident response, retrospectives, defensiveness, blamelessness, social justice, and tons more engrossing stuff. Well worth a listen.

Mandy Moore (summary); John K. Sawers, Sam Livingston-Gray, Jamey Hampton, and Coraline Ada Ehmke (panelists); Courtney Eckhardt (guest)

DBMS Musings: Partitioned consensus and its impact on Spanner’s latency

Do you wonder what effect partitioned versus unified consensus might have on latency? Do you want to know what those terms mean? Read on.

Daniel Abadi

Cape Technical Deep Dive

Cape is Dropbox’s real-time event processing system. The design bits in this article have a ton of interesting detail, and I also love the part where they go into their motivations behind not just using an existing queuing system.

Peng Kang — Dropbox

Designing resilient systems: Circuit Breakers or Retries? (Part 1)

This is a great intro to the circuit breaker pattern if you’re unfamiliar with it, and it’s also got a lot of meaty content for folks already experienced with circuit breakers.

Corey Scott — Grab

Don’t Choose Dashboards Over Analysis

Though it sounds counterintuitive, more dashboards often make people less informed and less aligned.

Having a few good dashboards is important, but if you have too many, it’ll get in the way of your ability to do dynamic analysis.

Benn Stancil — Mode

Site Reliability Engineering is Operations

What activities count as SRE work, versus “just” Operations?

Site Reliability Engineering do Operations but are not an Operations Team.

Stephen Thorne

Outages

Twitch
Google Cloud Platform (europe-west1-b)
- A pair of redundant switches were erroneously taken down simultaneously for maintenance, causing a major outage. Click for Google’s followup post.
Xero
Spotify

SRE Weekly Issue #152

lex

December 16, 2018

General

Articles

Support Driven Engineering (SDE)

It’s hard to summarize all the awesome here, but it boils down to empathy, collaboration, and asking, “How can I help?” These pay dividends all over an organization, especially in reliability.

Note: Will Gallego is my coworker, although I came across this post on my own.

Will Gallego

Temporary outage of Google CT Logs

This followup post for a Google Groups outage was (fittingly) hidden away in a Google Group.

Thanks to Jonathan Rudenberg for this one.

Introducing the new GitHub Status Site

Now I can link directly to specific incidents! I miss the graphs, though.

Jamie Hannaford — GitHub

@amyngyn on Twitter: root cause

I laughed so hard I scared my cats:

COWORKER: we need to find the root cause asap
ME: *takes long drag* the root cause is that our processes are not robust enough to prevent a person from making this mistake
COWORKER: amy please not right now

Great discussion in the thread!

Amy Nguyen

When ATC Says ‘Unable’

In Air Traffic Control parlance, if a pilot or controller can’t comply with a request, they should state that they are “unable”. It can be difficult to decide in the moment what one is truly “unable” to do. There are a lot of great lessons here that apply equally well to IT incident response.

Tarrance Kramer — AVweb

Enterprise SREs guide devs through Kubernetes in production

The common theme at KubeCon is that SRE teams at many companies produce reliable, reusable patterns for their developers to build with.

Beth Pariseau — TechTarget

Postmortem: Beating the NATS race

This is the story of a tenacious fight to find out what went wrong during an incident. If you read nothing else, the Conclusion section has a lot of great tidbits.

Tony Meehan — Endgame

Restorative Just Culture Checklist

Here’s a new guide on how to apply Restorative Just Culture. This made me laugh:

They also fail to address the systemic issues that gave rise to the harms caused, since they reduce an incident to an individual who needs to be ‘just cultured’.

Sidney Dekker — Safety Differently

Outages

SRE Weekly Issue #151

lex

December 9, 2018

General

Articles

A victim of its own popularity: Scaling our CloudWatch integration

They used feature flags to safely transition from a single-host service to a horizontally-scaled distributed system.

Ciaran Egan and Cian Synnott — Hosted Graphite

Working with AWS Limits

Limits and quotas can really ruin your day, and it can be very difficult to predict limit exhaustion before a change reaches production, as we learn in this incident story from RealSelf.

Bakha Nurzhanov — RealSelf

Defending Against Abuse at LinkedIn’s Scale

The challenge: you have to defend against abuse to keep your service running, but the abuse detection also must not adversely impact the user experience.

Sahil Handa — LinkedIn

Answer to the Ultimate Question of (On-Call) Life, the Universe, and Everything: 71

PagerDuty has developed a system for measuring on-call health, factoring in quantity of pages, time of each page, frequency, clustering of pages, etc. I love what they’re doing and I hope we see more of this in our industry.

Lisa Yang — PagerDuty

Spooky Tales of Testing In Production: A Recap and Lessons Learned

A summary of three outage stories from Honeycomb’s recent event. My favorite is the third:

While Google engineers had put in place procedures for ensuring bad code did not take down their servers, they hadn’t taken the same precautions with data pushes.

Alaina Valenzuela — Honeycomb

Reasons to Scale Horizontally

Looking at that title, I thought to myself, “Uh, because it’s better?” It’s worth a read though, because it so eloquently explains horizontal versus vertical scaling, why you’d do one or the other, and why horizontal scaling is hard.

Sean T. Allen — Wallaroo Labs

Cache warming: Agility for a stateful service

Netflix has some truly massive cache systems at a scale of hundreds of terabytes. Find out what they do to warm up new cache nodes before putting them in production.

Deva Jayaraman, Shashi Madappa, Sridhar Enugula, and Ioannis Papapanagiotou — Netflix

Software Sprawl, The Golden Path, and Scaling Teams With Agency

This article lays out a promising plan for reducing the number of technologies your engineering department is using while still giving engineers the freedom to choose the right tool for the job.

Charity Majors

Outages

Nest
GitHub
O2 (UK) and SoftBank (Japan)
- I normally don’t bother mentioning mobile phone service outages, but this one has an interesting cause: an expired TLS certificate in Ericsson’s systems.
Google Allo and Duo
Facebook

SRE Weekly Issue #150

lex

December 2, 2018

General

Articles

5 Lessons Learned From Writing Over 300,000 Lines of Infrastructure Code

This article is a condensed version of a talk, but it stands firmly on its own. Their Production-Grade Infrastructure Checklist is well worth a read.

Yevgeniy Brikman — Gruntwork

OVMC, EORH Hope To Have Emergency Rooms Back Online

More and more, the reliability of our infrastructure is moving into the realm of life-critical.

Thanks to Richard Cook for this one.

Linda Comins — The Intelligencer

SREcon EMEA 2018 conference notes

Detailed notes on lots of talks from SREcon, with a great sum-up at the top discussing the major themes of the conference.

Max Timchenko

Developers On Call

Drawing from an @mipsytipsy Twitter thread from back in February, this article is a great analysis of why it’s right to put developers on call and how to make it humane. I especially like the part about paying extra for on-call, a practice I’ve been hearing more mentions of recently.

John Barton

AWS Says It’s Never Seen a Whole Data Center Go Down

Really? Never? I could have sworn I remembered reading about power outages…

Yevgeniy Sverdlik — DataCenter Knowledge

Confusion over medicine names threatens lives

Lots of good stuff in this one about preventing mistakes and analyzing failures.

Rachel Bryan — Swansea University
