General

SRE Weekly Issue #160

lex

February 17, 2019

Articles

This is a long one, but trust me, it’s worth the read. My favorite part is where the author gets into mental models, hearkening back to the Stella Report.

Fred Hebert

Multi-CDN support in Mux for improved performance and reliability

When CDN outages occur, it becomes immediately clear who is using multiple CDNs and who is not.

A multi-CDN approach can be tricky to pull off, but as these folks explain, it can be critical for reliability and performance.

Scott Kidder — mUX

Full disclosure: Fastly, my employer, is mentioned.

Towards an understanding of technical debt

This article explains five different phenomena that people mean when they say “technical debt”, and advocates understanding the full context rather than just assuming the folks that came before were fools.

/thanks Greg Burek

Kellan Elliott-McCrea

How We Prepared New York Times Engineering for the Midterm Elections

The work we did to get our teams aligned and our systems in good shape meant that we were able to scale, even with some services getting 40 times the normal traffic.

Kriton Dolias and Vinessa Wan — The New York Times

@mipsytipsy on Twitter: what to alert on

How does one resolve the emerging consensus for alerting exclusively on user-visible outages, with the undeniable need to learn about and react to things +before* users notice? Like a high cache eviction rate?

There’s a real gem in here, definitely worth a read.

Charity Majors (and Liz Fong-Jones in reply)

Notes from On-call Adjacency – Honeycomb

Being on-call will always involve getting woken up occasionally. But when that does happen, it should be for something that matters, and that the on-call person can make progress toward fixing.

Rachel Perkins — Honeycomb

How we used delayed replication for disaster recovery with PostgreSQL

Delayed replication can be used as a first resort to recover from accidental data loss and lends itself perfectly to situations where the loss-inducing event is noticed within the configured delay.

Andreas Brandl — GitLab

Outages

Azure Kubernetes Service (US East)
- There’s a pretty interesting incident description in their history page.
VFEmail
- Via Twitter:
  
  At this time, the attacker has formatted all the disks on every server. Every VM is lost. Every file server is lost, every backup server is lost. NL was 100% hosted with a vastly smaller dataset. NL backups by the provideer were intact, and service should be up there.
  
  My sympathies, folks.
Slack
- Emails into slack were failing due to an expired TLS certificate.
Squarespace
- Linked is their followup post explaining more about the incident.
JPMorgan Chase
Instagram
Strava and Garmin Connect
Microsoft Windows Update
Snapchat
Sydney, AU Train Network
Lloyds Bank

SRE Weekly Issue #159

lex

February 10, 2019

General

Comments

View on sreweekly.com

A huge thanks to my awesome former coworker Greg Burek whose helpful link contributions make up fully half of this issue. Thanks, Greg!

Articles

Ironies of Automation

This paper discusses the ways in which automation of industrial processes may expand rather than eliminate problems with the human operator.

My favorite bit of irony: presenting data to the user in the manner most readily understood results in lower likelihood of remembering the data, so perhaps the most easily grasped display is not actually the best!

Lisanne Bainbridge

Laziness Does Not Exist

Like malice and incompetence, laziness should be far off our radar when we investigate an incident. I hope that reading this article opens minds about the true scope of blamelessness.

Devon Price

The New Systems Engineer

Whether or not you agree with this particular attempt at defining what a Systems Engineer (or SRE or anything related) is, it’s worth thinking about and discussing. Our field is evolving quickly, and titles are a moving target.

Matt Ouille

Behind the Lion Air Crash, a Trail of Decisions Kept Pilots in the Dark

Driven by a desire to update their 737 without causing airlines to have to retrain pilots, Boeing seemingly kept pilots in the dark about what may have been an important little detail of how the new 737 Max operates, with a tragic result.

James Glanz, Julie Creswell, Thomas Kaplan and Zach Wichter — New York Times

Questions for a new technology.

An experienced SRE will develop an innate skepticism of new technologies, even if they don’t realize it. This article provides an excellent list of questions to help articulate that skepticism when evaluating a potential design.

Kellan Elliott-McCrea

When AWS Autoscale Doesn’t

Auto-scaling isn’t all roses. Like any tool, you have to understand how it works in order to avoid the pitfalls. Read this article to learn what these folks learned the hard way.

Tyson Mote — Segment

Postmortems Part 2: How to Adopt a Learning Culture

Transitioning to a blameless culture can be difficult, especially as folks might blame each other for forgetting to be blameless!

Rachael Byrne — PagerDuty

Logs vs Structured Events

Many of the old arguments for not instrumenting code (mostly about performance) no longer apply, and a host of new arguments push toward structured events.

Charity Majors

Outages

QuadrigaCX
- Bloomberg’s title for the above-linked article says it all:
  
  Crypto CEO Dies Holding Only Passwords That Can Unlock Millions in Customer Coins
  
  QuadrigaCX ceased trading and posted a note on their front page.
Gmail
Mailchimp Mandrill
- A PostgreSQL transaction ID wraparound in a central database caused this prolonged outage on Superbowl Sunday.
Wells Fargo (bank)
Crunchyroll
Hosted Graphite
reddit

SRE Weekly Issue #158

lex

February 3, 2019

General

Comments

View on sreweekly.com

Articles

Accident Case Study: Traffic Pattern Tragedy

This air traffic accident analysis is chilling to listen to, and also incredibly educational. As you listen through the conversation, it becomes more and more clear that the pilot is suffering from information overload. An Incident Commander would be wise to remember the lessons learned here.

After listening to the above recording, I got hooked and kept listening to more and more case studies. Here’s another enlightening one: Real Pilot Story: From Miscue to Rescue

US Air Safety Institute

Introducing the PagerDuty Postmortem Guide

PagerDuty is quickly approaching Etsy’s level of awesome incident-related articles and guides.

Rachael Byrne — PagerDuty

Executing a Sunset

Retiring features and products can often be harder to do safely than deploying them in the first place.

Rachana Kumar– Etsy

Tune up your SLI metrics: CRE life lessons

Do your SLIs measure what really matters to your customers? This article discusses how to find out and what to do if they don’t.

Adrian Hilton and Yaniv Aknin — Google

Outages

Google G Suite
- All services exerienced an outage, most notably Gmail.
Microsoft Azure
- Parts of Azure were dependent on a third-party DNS provider, and an outage in that provider caused widespread issues in Azure. See Microsoft’s followup post in their status history.
WhatsApp
Reddit
- And a second one the same day.
Microsoft Office 365
Gmail
Instagram

SRE Weekly Issue #157

lex

January 27, 2019

General

Comments

View on sreweekly.com

Articles

When Does an Investigation End?

Best article about post-incident investigations that I’ve seen in awhile. My favorite part is the recommendation not to use a template for the retrospective, as it will artificially narrow the scope of the investigation.

Ryan Frantz

[Survey] On-call compensation in tech/IT

These folks have set up a survey to gather information on whether and how folks are compensated for on-call in IT. This topic has been gaining traction over the past couple of years, and I can’t wait to see the results of the survey. Please take a moment to fill it out.

Chris Evans and Spike Lindsey

[Upcoming Talk] Running Excellent Retrospectives: Talking for Humans

I’ll be speaking at SRECon19 Americas this March with my former coworker, Courtney Eckhardt. The talk lineup looks incredible and I’m really excited to go!

If you’re going to be there, drop me an email (I’m terrible at Twitter) and let me know. I’ll have lots of swag available, made with 100% open source software (Ink/Stitch and inkscape-silhouette).

Surviving On-Call: Tips from a Hosted Graphite SRE

Especially useful for folks new to on-call.

If you only take one thing away from this post, it’s that you need to put your own well-being first, and once you do that other aspects of on-call will become easier.

Dave Fennell — Hosted Graphite

DBMS Musings: It’s Time to Move on from Two Phase Commit

I have to admit I wasn’t clear on two-phase commit before I read this. Now I know what it’s all about — and its drawbacks.

Daniel Abadi

Do you have an SRE team yet? How to start and assess your journey

This guide from Google describes the qualities and practices of SRE teams of various levels from beginner to advanced.

Gustavo Franco — Google

What Does a Site Reliability Engineer Do?

A good intro if you’re new around here.

Sylvia Fronczak — Scalyr

Outages

Slack
Greenhouse.io
Microsoft Office 365
CenturyLink explains end-of-year outage
- Here are some details on the CenturyLink outage that took down 911 emergency services across portions of the US in late December.
Passport Canada
- Canada was unable to process passports during the outage.
WhatsApp

SRE Weekly Issue #156

lex

January 20, 2019

General

Comments

View on sreweekly.com

Articles

Introducing the Redesigned Twilio Status Page

Lots of companies seem to be redesigning their status pages lately. I love learning what was wrong with the old one and what they’ve changed to try to fix it.

Benjamin Stein — Twilio

On Infrastructure at Scale: A Cascading Failure of Distributed Systems

A cringe-worthy story of a system failure (thankfully not production!) along with some ideas on preventing such failures.

Dan Woods

SRE Survey 2019

Just like last year, Catchpoint will donate $5 to charity if you take their survey!

This year we are back with a focus on outages and incidents. What impact do incidents have on the organization and the people responding to the incidents? How does this change across industry and organization?

Catchpoint

The Myth of the Server’s Terrible, Horrible, No Good, Very Bad Day

You can do a lot better than “the server is unhappy.” Be on the lookout for language like that. It’s usually a good learning opportunity or at the very least a good time to fill some gaps in instrumentation.

Arya Asemanfar — LightStep

Outages

Sling TV
UK’s Criminal Justice Secure eMail system (CJSM)
Amazon.com
Amazon Alexa
Fastly Status – [Retrospective] Elevated Errors in Ashburn (IAD/BWI/DCA)
- Also this one.Full disclosure: Fastly is my employer.

SRE Weekly Issue #160

Articles

Outages

SRE Weekly Issue #159

Articles

Outages

SRE Weekly Issue #158

Articles

Outages

SRE Weekly Issue #157

Articles

Outages

SRE Weekly Issue #156

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

General

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues