SRE WEEKLY – Page 47 – scalability, availability, incident response, automation

SRE Weekly Issue #263

lex

March 28, 2021

Articles

[Increment: Reliability] Tracing a path to observability

They make a really clear case for why traditional metrics and monitoring couldn’t help them solve their problems.

Mads Hartmann

Glynn Lunney — SRE Leadership

This article commemorates the death of NASA flight director Glynn Lunney by showing the SRE lessons we can learn from him.

Robert Barron

7 top Site Reliability Engineer (SRE) job interview questions

I like that this focuses on human factors.

Kevin Casey

How to Scale for Reliability and Trust

Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency.

Blameless

Engineering Failover Handling in Uber’s Mobile Networking Infrastructure

Uber’s customers are especially likely to be moving around and going in and out of tunnels, losing connectivity along the way. That means it’s difficult to tell when the client should fail over to a different server.

Sivabalan Narayanan, Rajesh Mahindra, and Christopher Francis — Uber

Incident review: Service outage on 25 October 2020

Here’s one I missed from last November. There’s some good stuff to learn from, especially if you run Vault on Kubernetes.

This outage was caused by a cascading failure stemming from our secrets management engine, which is a dependency of almost all of the production GoCardless services.

Ben Wheatley — GoCardless

Outages

Gmail and a ton of other Android apps
- This one’s kind of weird. Google presented it as a Gmail outage, but it was actually a problem with the Android System WebView component. Tons of apps were crashing.
MangaDex
Canvas
Instagram

SRE Weekly Issue #262

lex

March 21, 2021

Articles

The Prerequisites for Chaos Engineering

Chaos Engineering isn’t adding chaos to your systems—it’s seeing the chaos that already exists in your systems.

Along with four prerequisites, this article also includes three myths about chaos engineering that might be making you feel hesitant about starting.

Courtney Nash — Verica

Managing On-Call in a Pandemic

This one’s from May of last year. Almost a year on, it’s interesting to see which of these we’ve already implemented.

Ashley Roof — Transposit

Being Just Reliable Enough

An amusing parable illustrating why you shouldn’t try to be too reliable.

Andrew Ford — Indeed

Google debunks Russian claims that fire was connected to service outage

In the Outages section of last week’s issue, you’ll find two unrelated events referenced in this article: one about Russian internet censorship gone awry and another about a major datacenter fire.

Eric Johansson — Verdict

How to Analyze Contributing Factors Blamelessly

Along with what’s in the title, this article also covers the difference between an RCA and a contributing factors analysis.

Emily Arnott — Blameless

Rethinking site capacity projections with Capacity Analyzer

Lots of detail on how LinkedIn is improving their traffic forecasts. Warning/enticement: math contained within.

Deepanshu Mehndiratta — LinkedIn

Testing in Production for Safety and Sanity

Everyone is testing in production; some organizations admit it and plan for it.

How to do it right, what can happen if it goes wrong, and how to limit the blast radius.

Heidi Waterhouse — LaunchDarkly

How we found and fixed a rare race condition in our session handling

Remember when GitHub logged you out? Ah, I remember it like it was last week. I mean, the week before. Here’s GitHub’s troubleshooting story about what went wrong.

Dirkjan Bussink — GitHub

Outages

Google Cloud Platform
- GCP had a major multi-region networking issue due to a routing glitch. Click through for their follow-up post.
US National Oceanic and Atmospheric Administration (NOAA)
- This outage impaired NOAA’s tsunami early warning system.
Facebook, Instagram, and WhatsApp
TikTok
Elevated error rates
Microsoft Teams and other services
- Click through for a highly detailed description of what went wrong. I can’t link directly to the incident in question, so you’ll have to scroll down to 3/15.

SRE Weekly Issue #261

lex

March 14, 2021

Articles

What Do Fighter Pilots and Incident Management Have in Common?

I find it really refreshing that fighter pilots have a retrospective about every single mission, successful or not. There’s always something to learn.

Jessica Abelson — Transposit

Incident Response at Heroku

Heroku applies the Incident Management System, designating an Incident Commander who keeps the incident on track and oversees communications, both external and internal.

Guillaume Winter — Heroku

How Khan Academy Successfully Handled 2.5x Traffic in a Week

This story is becoming common: Khan Academy had a sudden influx of traffic when pandemic lockdowns began. Their strategy involved the use of the cloud and a CDN.

Marta Kosarchyn — Khan Academy

Full disclosure: Fastly, my employer, is mentioned.

Under the Hood: Ensuring Site Reliability

Here’s a great summary of how Squarespace does SRE.

Franklin Angulo — Squarespace

[Increment: Reliability] Reliability at scale

Leaders at Deliveroo, DigitalOcean, Fastly, and Headspace share how their organizations think about reliability and resiliency and their advice to engineering orgs embarking on reliability journeys.

The leaders each answer a series of questions about how their organization handles reliability, giving an interesting compare-and-contrast overview.

Increment

Full disclosure: Fastly is my employer.

[Increment: Reliability] Case study: Resilience as adaptability at Freshworks

Using a disaster plan created after a devastating hurricane, Freshworks survived and thrived during the pandemic, delivering a major new product by its pre-pandemic deadline.

Ipsita Agarwal — Increment

What Is a Canary Deployment?

This one explains what a canary deployment is, how it can help you, and how canary deployments differ from blue/green deployments.

LaunchDarkly
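
To make the idea concrete, here is a minimal sketch of percentage-based canary routing. It is not taken from the linked article or from any LaunchDarkly product: the function name `routes_to_canary`, the 100-bucket scheme, and the use of a hashed user ID are all assumptions made for illustration.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Return True if this user should be served by the canary release.

    Hypothetical helper: hashes the user ID into one of 100 stable buckets
    and sends the lowest `canary_percent` buckets to the canary.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    return bucket < canary_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    in_canary = sum(routes_to_canary(u, 5) for u in users)
    print(f"{in_canary} of {len(users)} users would see the canary at 5%")
```

Because the bucket depends only on the user ID, a given user stays on the same side as the percentage is raised, which keeps their experience consistent while the rollout widens. A blue/green switch, by contrast, moves all traffic at once.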

How to Build an SRE Team with a Growth Mindset

This article explains the meaning of a growth mindset and shows how it applies to SRE.

Emily Arnott — Blameless

Outages

Fastly
- Full disclosure: Fastly is my employer.
OVH Cloud
- This week, there was a major fire at an OVH Cloud datacenter. As a result, Rust (an MMOG) permanently lost data, according to its creators.
All domains containing “t.co” in Russia
- It appears that Russia tried to impair access to Twitter’s URL-shortening domain t.co, but their pattern-matching was overzealous and affected any domain that contained “t.co” (think reddit.com, microsoft.com, and many others).
Dyn
- Dyn had a DNS outage. I noted impact to Heroku, but I didn’t see any other related outage postings.
Chef
GitHub
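
To see why matching on “t.co” can catch so many unrelated domains, here is a minimal sketch comparing a substring match with a match on domain boundaries. How the actual filter was implemented is not known from the reports; both functions below are hypothetical illustrations.

```python
def naive_match(hostname: str, blocked: str = "t.co") -> bool:
    """Overzealous: true for any hostname that merely contains the string."""
    return blocked in hostname.lower()


def domain_match(hostname: str, blocked: str = "t.co") -> bool:
    """Stricter: true only for the blocked domain itself or its subdomains."""
    hostname = hostname.lower().rstrip(".")
    return hostname == blocked or hostname.endswith("." + blocked)


if __name__ == "__main__":
    for host in ["t.co", "reddit.com", "microsoft.com", "example.org"]:
        print(f"{host}: naive={naive_match(host)} strict={domain_match(host)}")
```

The substring version flags reddit.com and microsoft.com because both happen to contain the characters “t.co”; the boundary-aware version does not.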

SRE Weekly Issue #260

lex

March 7, 2021

Articles

[Increment: Reliability] Interview: Dr. David D. Woods

People throw around “resiliency” quite often when they mean “reliability” or “high availability”. Dr. Woods sets the record straight.

Ipsita Agarwal — Increment

[Increment: Reliability] The process: Implementing Yelp’s failover strategy

A key part of their strategy is to keep their service running at 50% capacity or less, allowing them to lose a datacenter without overloading the remaining datacenter.

Mathieu Frappier, Dorothy Jung, and Qui Nguyen — Increment
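
The arithmetic behind that 50% figure can be sketched as follows. This assumes identical datacenters with load spread evenly across them, which is a simplification and not something the article is quoted as saying here; the function name `max_safe_utilization` is made up for illustration.

```python
def max_safe_utilization(datacenters: int, failures_tolerated: int = 1) -> float:
    """Highest average utilization that still fits on the surviving sites.

    Assumes identical datacenters and evenly balanced load (an assumption
    made for this sketch).
    """
    if not 0 <= failures_tolerated < datacenters:
        raise ValueError("failures_tolerated must be in [0, datacenters)")
    return (datacenters - failures_tolerated) / datacenters


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(f"{n} datacenters: run at or below {max_safe_utilization(n):.0%}")
```

With two datacenters the result is 50%: each site must have enough headroom to absorb the other’s entire load. Adding sites raises the ceiling, since the load from a lost site is shared among more survivors.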

[Increment: Reliability] On adaptive capacity in incident response

In issue #236, I linked to an excellent paper by Dr. Richard Cook and Beth Long about engineering resilience in incident response. Now they’re back, teaming up with John Allspaw to summarize and expand on that paper!

John Allspaw, Beth Adele Long, and Dr. Richard Cook — Increment

Security Chaos Engineering: How to Security Differently

A quick s/security/reliability/g and this is an SRE article; the same principles apply to both fields.

Aaron Rinehart — Verica

SRE2AUX: How Flight Controllers were the first SREs

How can we apply the tenets and principles of NASA mission controllers to our SRE work?

Geoff White — Blameless

SRE as Organizational Transformation: Lessons from Activist Organizers

Genius idea: we can take our lead from activists as we try to win over our organization to adopt SRE principles.

Chris Hendrix — Blameless

Atlas: Our journey from a Python monolith to a managed platform

This insightful observation caught my eye:

It’s unnecessary overhead for a product team to plan capacity, set up good alerts and multihoming (automatically running in multiple data centers) for small, simple functionality.

Naphat Sanguansin and Utsav Shah — Dropbox

Outages

Fitbit
Netflix
Disney+
- This week was the WandaVision finale.
Fastly
- Full disclosure: Fastly is my employer.

SRE Weekly Issue #259

lex

February 28, 2021

Articles

Increment: Reliability

This quarter’s Increment issue is about Reliability, and I haven’t had this much fun since their first issue about on-call. I’ll include a few of the articles here and more in later issues as I have a chance to review them.

Stripe

[Increment: Reliability] Everything is broken, and it’s okay

Accepting that imperfect things still work is fundamental to preventing failures from becoming catastrophes.

Understanding that no system is without errors is critical to building resilient systems.

Heidi Waterhouse

[Increment: Reliability] How to build organizational resilience

The very first sentence sets the tone, and I love it:

Resilience is a process: something you must actively perform, not something you check off a list once.

Ryn Daniels

[Increment: Reliability] Embrace your inner incident commander

Most of all, having an incident commander only works if everyone believes in the role. Someone stepping in to address a crisis and saying “I’m Batman” doesn’t help unless people have bought into the idea of Batman.

The next time I’m incident commander, I am totally going to jump in and say, “I’m Batman!”.

This article is a great primer on what an IC is and how to adopt incident command at your organization.

Tanya Reilly

Retry pattern in microservices

After reading this blog post, you will have an understanding of the retry pattern used in microservices architecture, why it should be used, a few considerations while using the retry pattern, and how to use it in Python.

I love the W. C. Fields quote.

Anand Prashant
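
For a rough idea of what the pattern looks like, here is a minimal Python sketch of retrying with capped exponential backoff and jitter. It is not the code from the linked post; the `retry` helper, its parameter names, and its default values are assumptions chosen for illustration.

```python
import random
import time


def retry(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying transient failures with backoff and jitter.

    Hypothetical helper: only exceptions listed in `retry_on` are retried;
    anything else propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("simulated transient failure")
        return "ok"

    print(retry(flaky), "after", state["calls"], "calls")
```

Two considerations worth keeping in mind: retry only operations that are safe to repeat, and cap both the number of attempts and the delay so that retries do not pile extra load onto a service that is already struggling.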

2021 Site Reliability Engineering (SRE) Survey Now Open

It’s that time again! Be sure to fill out the survey, not only so they can gather useful data, but also because Catchpoint will donate $5 to charity.

DevOps Institute, Catchpoint, and VMware Tanzu

QA Engineers, This is How SRE will Transform your Role

When considering the value of a QA test, SLIs can provide very valuable context.

SRE and QA can work hand in hand.

Emily Arnott — Blameless

Silent data corruption: Mitigating effects at scale

This kind of thing keeps me up at night. Silent data corruption can destroy your reliability just as quickly as a backhoe on a non-redundant link.

Harish Dattatraya Dixit — Facebook

How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020

Etsy experienced years of growth practically overnight in 2020 as quarantines set in. Here’s how they handled it.

Mike Adler — Etsy

Outages

Let’s Encrypt
Google Voice
- This is Google’s analysis for the incident on February 16, caused by a TLS certificate management mishap.
India’s National Stock Exchange (NSE)
LinkedIn
US Federal Reserve
- The US Fed’s computer system was down, preventing transfers between banks from going through.
Venmo
Facebook and Instagram
Reddit
Discord
