SRE WEEKLY – Page 60 – scalability, availability, incident response, automation

SRE Weekly Issue #218

lex

May 10, 2020

Articles

Checklists and Runbooks

An airplane pilot’s take on runbooks, by way of comparison to aviation checklists.

Bill Duncan

Old box, dumb code, few thousand connections, no big deal

This article demonstrates that we don’t need to be afraid of spinning up a new thread per connection, and Linux is very good at what it does. This seems to have been a surprisingly controversial point of view, judging by the follow-up article.

Rachel by the bay
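For readers who haven't seen the pattern spelled out, here is a minimal sketch of thread-per-connection, written as a tiny echo server. It is not the code from the article; the function names (`handle`, `serve`, `start_echo_server`) and the echo behavior are assumptions made purely for illustration.

```python
import socket
import threading


def handle(conn: socket.socket) -> None:
    """Echo everything received on one connection, then close it."""
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:  # peer closed the connection
                break
            conn.sendall(data)


def serve(listener: socket.socket) -> None:
    """Accept connections in a loop, spawning one thread per connection."""
    while True:
        try:
            conn, _addr = listener.accept()
        except OSError:
            break  # the listening socket was closed
        threading.Thread(target=handle, args=(conn,), daemon=True).start()


def start_echo_server(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Start the accept loop in the background and return the listening socket.

    Port 0 asks the OS for any free port (a hypothetical default for this sketch).
    """
    listener = socket.create_server((host, port))
    threading.Thread(target=serve, args=(listener,), daemon=True).start()
    return listener
```

Each connection gets its own blocking loop, which keeps the per-connection logic simple. Whether that scales far enough for a given workload is exactly the question the article tests.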

Avoid rolling your own leader election algorithm

It’s not as easy as you think… even if you think it’s not easy.

Oren Eini — RavenDB

How companies are operating always-on services in the COVID-19 era

Atlassian shows us what’s changed in operations, based on their State of Incident Management survey.

A little over half of survey respondents – 51 percent – reported that their incident response time has been slower since beginning to work remotely.

Patrick Hill — Atlassian

How Learning is Different Than Fixing

A key idea here is that by developing a richer understanding of the event, rather than simply identifying fixes for the parts involved in it, the effort yields a much greater ROI, and that will include more effective “fixes” and more.

John Allspaw

Managing Burnout During COVID-19 for People in Tech

The part about pandemic-induced decision fatigue was revelatory for me.

Hannah Culver — Blameless

Looking back at Failover Conf

Gremlin talks about Failover Conf, and I love that it pretty much reads like a retrospective.

Kimbre Lancaster — Gremlin

Outages

SRE Weekly Issue #217

lex

May 3, 2020

Articles

Pre-requisites to Practicing Reliability?

Reliability is something you do, not something you buy.

When discussing SRE, I love to pose the question, “What does it mean to engineer reliability?”. That’s what this article is all about.

Russ Miles — ChaosIQ

Thought Leadership Panel: What is a ‘Real’ SRE?

Blameless recently had the privilege of hosting SRE leaders Craig Sebenik, David Blank-Edelman, and Kurt Andersen to discuss how SREs can approach work as done vs work as imagined, how to define SRE and DevOps and the complementary nature of the two, the ethics of purchasing packaged versions of open source software, and more.

Amy Tobey, with guests Craig Sebenik, David Blank-Edelman, and Kurt Andersen — Blameless

The inevitable double bind

Whenever an agent is under pressure to simultaneously act quickly and carefully, they are faced with a double-bind. If they proceed quickly and something goes wrong, they will be faulted for not being careful enough. If they proceed carefully and something goes wrong, they will be faulted for not moving quickly enough.

Lorin Hochstein

The Post-Incident Review Issue #3

It’s time for another issue already! This one contains a really great essay by Jaime Woo entitled “What Does Fairness Mean for On-call Rotations?”, about how not all on-call shifts are equal.

Jaime Woo and Emil Stolarsky — Incident Labs

The Tail at Scale

If your frontend has a hard dependency on multiple microservices, their failure rates are compounded. This article fills in the math behind the paper The Tail at Scale and shows that your backends’ SLOs may have to be significantly tighter than the frontend’s.

Bill Duncan
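A rough sketch of the compounding arithmetic, under simplifying assumptions that are mine rather than the article's: every request needs all of its backends, backends fail independently, and each has the same availability. The function names are hypothetical.

```python
def composite_availability(backend_availability: float, n_backends: int) -> float:
    """Availability of a frontend that needs every one of n independent backends."""
    return backend_availability ** n_backends


def required_backend_availability(frontend_target: float, n_backends: int) -> float:
    """Per-backend availability needed for the frontend to meet its target."""
    return frontend_target ** (1 / n_backends)


if __name__ == "__main__":
    # Ten backends at 99.9% each leave the frontend at roughly 99.0%.
    print(f"{composite_availability(0.999, 10):.5f}")
    # To hold the frontend at 99.9%, each backend must do noticeably better.
    print(f"{required_backend_availability(0.999, 10):.6f}")
```

Under these assumptions, ten 99.9% backends give a frontend of roughly 99.0%, which is why the backend SLOs end up tighter than the frontend's.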

Heroku Incident #2021 Follow-up

This post-incident analysis details a case of a hard dependency that needn’t be hard, taking down the Heroku API, along with a fall-back that didn’t work as intended.

Why strace doesn’t work in Docker

I love Julia Evans’s ability to teach me something new that I didn’t realize I didn’t know.

Julia Evans

Outages

GitHub
Let’s Encrypt
- But you’ve automated your renewals, so this totally doesn’t matter, right?
Hulu
Uber Eats
Reddit
Discord

SRE Weekly Issue #216

lex

April 26, 2020

Articles

How to create an incident response playbook

Awesome resource! In each section, they explain what to include, why to include it, and an example from their playbook.

Blake Thorne — Atlassian

Failover Conf Wrapup

I didn’t make it to Failover Conf, and it sounds like I missed a great time, so I’m especially grateful for this writeup.

Rich Burroughs — FireHydrant

Failover Conf, a Recap of Gremlin’s Epic Virtual Event

And this one!

Hannah Culver — Blameless

COVID-19 Oncall Survey

I’m a little late with this one, sorry folks! Survey ends tomorrow, April 27.

This is an anonymous survey to look at the impact that COVID-19 has had on oncall teams in tech.

FireHydrant

Incident Analysis: How *Learning* is Different Than *Fixing*

Most post-incident review documents are written to be filed, not written to be read.

This slide deck is awesome and well worth the read.

John Allspaw — Adaptive Capacity Labs

How to build robust anomaly detectors with machine learning

A deep dive into the math behind anomaly detection.

Nikita Butakov — Ericsson

Advice for On-call Teams During COVID-19

This article brings together thoughts on on-call work during the pandemic from folks at different companies.

Rich Burroughs — FireHydrant

Shadowing a Site Reliability Engineer

A frontend engineer shares their key takeaways from their time shadowing.

Laura Montemayor — GitLab

Outages

GitHub
Datadog
Poloniex
DigitalOcean
Apple Pay
ShipStation
Sendy
Sharp online store and IoT devices
- Sharp retooled one of its factories to produce masks and started selling them commercially. The increased load caused problems with their online store and existing consumer IoT devices.
Discord
Fastly
- Also a control plane issue earlier the same day. Full disclosure: Fastly is my employer.
Reddit

SRE Weekly Issue #215

lex

April 19, 2020

I missed last week to set up a new swing set for my kids (gotta give ’em something to do while they’re stuck at home). I’m still a bit behind on articles and I’ll catch up over the next couple weeks.

Articles

Embracing the beautiful mess

The “messy” details of our human/computer systems are their hidden strength.

Lorin Hochstein

Accident Case Study: Just a Short Flight

In this accident report, learn how two pilots lost situational awareness, with disastrous consequences.

Air Safety Institute

Succeeding With Service Level Objectives

Without a structured strategy, and careful consideration of the full SLO lifecycle, SLOs risk partial implementation. This can result in low ROI and, in many cases, a complete failure.

Danny Mican — Squadcast

Back to Basics: Why Global Infrastructure Matters

The cloud’s multiple availability zones and regions can be powerful, but it’s hard to get a multi-region architecture correct.

Serhat Can — OpsGenie

SLA Uptime calculator

A useful little JavaScript tool: plug in an availability percentage (e.g. 99.99%), and get back the number of minutes you can be down in a day, month, quarter, or year.

Hexadecimal
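The arithmetic behind a calculator like this is short enough to sketch. This is not the tool's own code; the period lengths (a 30-day month, a 90-day quarter, a 365-day year) and the function name are assumptions chosen for illustration.

```python
# Period lengths in minutes. The month, quarter, and year lengths are assumptions.
PERIOD_MINUTES = {
    "day": 24 * 60,
    "month": 30 * 24 * 60,
    "quarter": 90 * 24 * 60,
    "year": 365 * 24 * 60,
}


def allowed_downtime_minutes(availability_percent: float) -> dict[str, float]:
    """Return the downtime budget in minutes per period for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return {
        period: minutes * unavailable_fraction
        for period, minutes in PERIOD_MINUTES.items()
    }


if __name__ == "__main__":
    for period, minutes in allowed_downtime_minutes(99.99).items():
        print(f"{period}: {minutes:.2f} minutes")
```

With these period lengths, 99.99% works out to a budget of a little under an hour per year.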

Hosted Pools Availability Degradation

Azure Pipelines had an incident of delayed builds at the end of March. Find out more in this post-incident analysis.

Chad Kimes – Microsoft

Free Google Book: Building Secure and Reliable Systems

Google published another book in their SRE series. This short summary gives an overview of what’s inside along with an explanation of the motivation for another book. See also: Google’s announcement

Todd Hoff — High Scalability

One Team at Uber is Moving from Microservices to Macroservices

The pendulum is swinging back, and folks are starting to see the downsides of a plethora of microservices, including early champion Uber.

Todd Hoff — High Scalability

Outages

Quibi
- Quibi had issues on their launch day.
Deliveroo
Google Cloud Platform IAM
- Click through for their interesting post-incident analysis.
Cloudflare
- Here’s their post-incident analysis that details a remote hands request gone awry.
Chef
Hulu
Lots of Banks in the US
- Banks went down around the time when customers were checking to see if their economic stimulus payments had arrived.
Petnet (smart pet feeder)
Snapchat
Twitter
Fastly
Reddit
DoorDash
StackPath

SRE Weekly Issue #214

lex

April 5, 2020

Articles

Trying to be too (io)nice created a “killer” directory

A nifty little pitfall in which an ioniced process can block non-ioniced processes.

rachelbythebay

Technical Writing

Google published this free set of courses on technical writing. As an SRE, I have the constant need to write effectively to justify and document my designs.

Every engineer is also a writer.

This collection of courses and learning resources aims to improve your technical documentation. Learn how to plan and author technical documents.

Google

Message from ACM Regarding Open Access to ACM Digital Library during Coronavirus

The ACM has made their ACM Digital Library free to the public for the next 3 months. Many of their articles have been featured here previously.

The Post-Incident Review Issue 2: March 2020

Includes a great article by Jaime Woo, entitled Imagining Your Post-Incident Report As A Documentary.

Emil Stolarsky and Jaime Woo — The Post-Incident Review

SRE Thought Leader Panel about Embracing Resilience during Crises

Blameless recently had the privilege of hosting SRE leaders Liz Fong-Jones, Dave Rensin, and Alex Hidalgo to discuss how SREs can embrace resilience during the pandemic, and how the principles of SRE intersect with global trends.

I especially liked the discussion of pent-up demand that may cause problems when we eventually get to relax social distancing.

Amy Tobey (moderator), Alex Hidalgo, Liz Fong-Jones, Dave Rensin

Incidents: What Is Often Missed & What Can Be Done About That

This is a talk that John Allspaw gave for Spotify.

Learning is not the same as fixing

John Allspaw — Adaptive Capacity Labs

Outages

Google Cloud Platform
- This is an update to the outage included in last week’s issue, giving details on what went wrong. A problem with Cloud IAM affected many other GCP services.
Let’s Encrypt
GitHub
Apple News
Facebook, Instagram, WhatsApp
Twitch
GameStop
Discord
- Includes a short description of what went wrong. Take it easy on yourselves, Discord folks, it happens to all of us. ♥
