SRE WEEKLY – Page 72 – scalability, availability, incident response, automation

SRE Weekly Issue #158

lex

February 3, 2019

Articles

Accident Case Study: Traffic Pattern Tragedy

This air traffic accident analysis is chilling to listen to, and also incredibly educational. As you listen through the conversation, it becomes more and more clear that the pilot is suffering from information overload. An Incident Commander would be wise to remember the lessons learned here.

After listening to the above recording, I got hooked and kept listening to more and more case studies. Here’s another enlightening one: Real Pilot Story: From Miscue to Rescue

US Air Safety Institute

Introducing the PagerDuty Postmortem Guide

PagerDuty is quickly approaching Etsy’s level of awesome incident-related articles and guides.

Rachael Byrne — PagerDuty

Executing a Sunset

Retiring features and products can often be harder to do safely than deploying them in the first place.

Rachana Kumar– Etsy

Tune up your SLI metrics: CRE life lessons

Do your SLIs measure what really matters to your customers? This article discusses how to find out and what to do if they don’t.

Adrian Hilton and Yaniv Aknin — Google

Outages

Google G Suite
- All services exerienced an outage, most notably Gmail.
Microsoft Azure
- Parts of Azure were dependent on a third-party DNS provider, and an outage in that provider caused widespread issues in Azure. See Microsoft’s followup post in their status history.
WhatsApp
Reddit
- And a second one the same day.
Microsoft Office 365
Gmail
Instagram

SRE Weekly Issue #157

lex

January 27, 2019

General

Comments

View on sreweekly.com

Articles

When Does an Investigation End?

Best article about post-incident investigations that I’ve seen in awhile. My favorite part is the recommendation not to use a template for the retrospective, as it will artificially narrow the scope of the investigation.

Ryan Frantz

[Survey] On-call compensation in tech/IT

These folks have set up a survey to gather information on whether and how folks are compensated for on-call in IT. This topic has been gaining traction over the past couple of years, and I can’t wait to see the results of the survey. Please take a moment to fill it out.

Chris Evans and Spike Lindsey

[Upcoming Talk] Running Excellent Retrospectives: Talking for Humans

I’ll be speaking at SRECon19 Americas this March with my former coworker, Courtney Eckhardt. The talk lineup looks incredible and I’m really excited to go!

If you’re going to be there, drop me an email (I’m terrible at Twitter) and let me know. I’ll have lots of swag available, made with 100% open source software (Ink/Stitch and inkscape-silhouette).

Surviving On-Call: Tips from a Hosted Graphite SRE

Especially useful for folks new to on-call.

If you only take one thing away from this post, it’s that you need to put your own well-being first, and once you do that other aspects of on-call will become easier.

Dave Fennell — Hosted Graphite

DBMS Musings: It’s Time to Move on from Two Phase Commit

I have to admit I wasn’t clear on two-phase commit before I read this. Now I know what it’s all about — and its drawbacks.

Daniel Abadi

Do you have an SRE team yet? How to start and assess your journey

This guide from Google describes the qualities and practices of SRE teams of various levels from beginner to advanced.

Gustavo Franco — Google

What Does a Site Reliability Engineer Do?

A good intro if you’re new around here.

Sylvia Fronczak — Scalyr

Outages

Slack
Greenhouse.io
Microsoft Office 365
CenturyLink explains end-of-year outage
- Here are some details on the CenturyLink outage that took down 911 emergency services across portions of the US in late December.
Passport Canada
- Canada was unable to process passports during the outage.
WhatsApp

SRE Weekly Issue #156

lex

January 20, 2019

General

Comments

View on sreweekly.com

Articles

Introducing the Redesigned Twilio Status Page

Lots of companies seem to be redesigning their status pages lately. I love learning what was wrong with the old one and what they’ve changed to try to fix it.

Benjamin Stein — Twilio

On Infrastructure at Scale: A Cascading Failure of Distributed Systems

A cringe-worthy story of a system failure (thankfully not production!) along with some ideas on preventing such failures.

Dan Woods

SRE Survey 2019

Just like last year, Catchpoint will donate $5 to charity if you take their survey!

This year we are back with a focus on outages and incidents. What impact do incidents have on the organization and the people responding to the incidents? How does this change across industry and organization?

Catchpoint

The Myth of the Server’s Terrible, Horrible, No Good, Very Bad Day

You can do a lot better than “the server is unhappy.” Be on the lookout for language like that. It’s usually a good learning opportunity or at the very least a good time to fill some gaps in instrumentation.

Arya Asemanfar — LightStep

Outages

Sling TV
UK’s Criminal Justice Secure eMail system (CJSM)
Amazon.com
Amazon Alexa
Fastly Status – [Retrospective] Elevated Errors in Ashburn (IAD/BWI/DCA)
- Also this one.Full disclosure: Fastly is my employer.

SRE Weekly Issue #155

lex

January 13, 2019

General

Comments

View on sreweekly.com

Articles

Developer On Call

A developer’s perspective on why being on call is important and how to structure it fairly (hint: compensation).

Henrik Warne

Interpreting Kafka’s Exactly-Once Semantics

The Conclusion section sums it up nicely:

In this post, we talked about various delivery guarantee semantics such as at-least-once, at-most-once, and exactly-once. We also talked about why exactly-once is important, the issues in the way of achieving exactly-once, and how Kafka supports it out-of-the-box with a simple configuration and minimal coding.

Rahul Agarwal — DZone

DevOps Discussions: Postmortem Chat – Part 1 (YouTube)

This is a riveting discussion about retrospective analysis of incidents, hosted by Microsoft. Throughout the discussion, there’s an emphasis on learning from incidents as opposed to simply coming up with action items.

Note: one of the panelists is my fellow employee at Fastly.

Jessica DeVita — Microsoft, with Duck Lawn (Pushpay), Tom Griffin (Pushpay), Sue Allspaw Pomeroy (Fastly), John Allspaw (Adaptive Capatacity Labs) and Dr. Richard Cook (Adaptive Capacity Labs)

An Agile SRE Meeting Plan

If you’re looking for a blueprint of how to structure your SRE organization’s meetings, this is a great resource.

Dave Mangot

Designing resilient systems: Circuit Breakers or Retries? (Part 2)

This post is the second part of the series on Designing Resilient Systems. In Part 1, we looked at use cases for implementing circuit breakers. In this second part, we will do a deep dive on retries and its use cases, followed by a technical comparison of both approaches.

This article is really thorough and includes a section on combining retries with circuit breakers.

Corey Scott — Grab

Towards Successful Resilient Software Design

The problem is that most advice how to “get design right” only applies to design inside a process boundary. Most of those advices do not work well if applied to distributed systems.

What I have learnt over time is that we basically need to re-learn how to design systems, i.e., how to spread the functionality in a distributed environment.

Uwe Friedrichsen — InfoQ

Courier: Dropbox migration to gRPC

This really stood out to me:

In practice, we have fixed whole classes of reliability problems by forcing engineers to define deadlines in their service definitions.

Ruslan Nigmatullin and Alexey Ivanov — Dropbox

Outages

Fastly
- Fastly had the above issue in its MDW PoP and also a repeat.
  Full disclosure: Fastly is my employer.
Zoom
Slack
- Android notifications were busted.
Gov Availability
- This site shows a live-updated availability percentage for the US Government. As of now, the “This Year” percentage is stuck at infinite zeroes (due to our current government shutdown). On a less tongue-in-cheek note, lots of US Government sites have expired TLS certificates because employees aren’t there to renew them.
Duo Security
Azure Storage (UK south region)
GitHub
Reddit
YouTube
Tinder
Google Cloud Platform (various API functions)
- Google engineers began rolling out a new feature designed to improve the fault-tolerance of the metadata store.
  
  Ironically, that rollout took down the metadata store.

SRE Weekly Issue #154

lex

January 6, 2019

General

Comments

View on sreweekly.com

Articles

STAMPing on event-stream

Hands-down the best thing I’ve read in awhile! The author draws on the work of Nancy Leveson, applying her STAMP theory to a recent incident involving a rogue NPM package that stole bitcoin wallets.

Hillel Wayne

Applying STAMP in Accident Analysis

For more on STAMP theory (Systems-Theoretic Accident Modeling and Processes), check out this academic paper by Leveson et al. It centers around a chilling case study of the e. coli poisoning of a community in Canada. While starts off looking to be a clear case of negligence, it quickly becomes apparent that an accident of this sort was nearly guaranteed to happen.

Nancy Leveson, Mirna Daouk, Nicolas Dulac, and Karen Marais

I’m John Allspaw, Ask Me Anything about incident analysis and postmortems

It’s pretty much as awesome as you’d expect given that title. I originally thought this was a video or audio AMA and was waiting for a recording to be posted. Instead, he answered the excellent questions in the comments, and each answer is like its own polished article.

John Allspaw (and many commenters)

Sarah Mei on Twitter: compensation for on-call

My fundamental issue with being on call is that I care more about my personal life & health than I do about whether my employer’s website is operational.

I assume everyone does! So…why do we put up with on-call at all?

Required reading for anyone who’s on call or manages folks that are on call.

Sarah Mei

Why SRE Documents Matter

If you manage an SRE team or intend to start one, this article will help you understand the types of documents your team needs to write and why each type is needed, allowing you to plan for and prioritize documentation work along with other team projects.

Shylaja Nukala and Vivek Rau — ACM Qeueue

Outages

Amazon Alexa
Discord
Google Cloud
- Followup post for an incident that occurred on December 21:
  
  The additional load was created by a partially-deployed new feature. A routine maintenance operation in combination with this new feature resulted in an unexpected increase in the load on the metadata store.
911 emergency service in communities across the US
- While visiting the library with my kids, my phone (and those of others around me) blew up with an emergency alert telling me that 911 service was down and that I should dial the local police directly. CenturyLink provides the infrastructure that runs emergency phone services for various areas in the US, and they had an extended outage.
Snapchat
Banner Health electronic health records

SRE Weekly Issue #158

Articles

Outages

SRE Weekly Issue #157

Articles

Outages

SRE Weekly Issue #156

Articles

Outages

SRE Weekly Issue #155

Articles

Outages

SRE Weekly Issue #154

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

A message from our sponsor, VictorOps:

Articles

Outages

Subscribe

RSS

Mastodon

Search Issues