SRE Weekly Issue #262

A message from our sponsor, StackHawk:

Join the Secure Coding Summit to hear from industry-leading AppSec and DevSecOps practitioners, analysts, and visionaries as they share their best pro tips to level up your code security.
http://sthwk.com/secure-code-summit

Articles

Chaos Engineering isn’t adding chaos to your systems—it’s seeing the chaos that already exists in your systems.

Along with four prerequisites, this article also includes three myths about chaos engineering that might be making you feel hesitant about starting.

Courtney Nash — Verica

This one’s from May of last year. Almost a year on, it’s interesting to see which of these we’ve already implemented.

Ashley Roof — Transposit

An amusing parable illustrating why not to try to be too reliable.

Andrew Ford — Indeed

In the Outages section of last week’s issue, you’ll find two unrelated events referenced in this article: one about Russian internet censorship gone awry and another about a major datacenter fire.

Eric Johansson — Verdict

Along with what’s in the title, this article also covers the difference between an RCA and a contributing factors analysis.

Emily Arnott — Blameless

Lots of detail on how LinkedIn is improving their traffic forecasts. Warning/enticement: math contained within.

Deepanshu Mehndiratta — LinkedIn

Everyone is testing in production; some organizations admit it and plan for it.

How to do it right, what can happen if it goes wrong, and how to limit the blast radius.

Heidi Waterhouse — LaunchDarkly

Remember when GitHub logged you out? Ah, I remember it like it was last week. I mean, the week before. Here’s GitHub’s troubleshooting story about what went wrong.

Dirkjan Bussink — GitHub

Outages

SRE Weekly Issue #261

A message from our sponsor, StackHawk:

Join Snyk and StackHawk on March 18 as they walk through how to use Software Composition Analysis (SCA) and Dynamic Application Security Testing (DAST) in CI/CD to ship more secure applications.
http://sthwk.com/snyk-stackhawk-webinar

Articles

I find it really refreshing that fighter pilots have a retrospective about every single mission, successful or not. There’s always something to learn.

Jessica Abelson — Transposit

Heroku applies the Incident Management System, designating an Incident Commander who keeps the incident on track and oversees communications, both external and internal.

Guillaume Winter — Heroku

This story is becoming common: Khan Academy had a sudden influx of traffic when pandemic lockdowns began. Their strategy involved the use of the cloud and a CDN.

Marta Kosarchyn — Khan Academy

Full disclosure: Fastly, my employer, is mentioned.

Here’s a great summary of how Squarespace does SRE.

Franklin Angulo — Squarespace

Leaders at Deliveroo, DigitalOcean, Fastly, and Headspace share how their organizations think about reliability and resiliency and their advice to engineering orgs embarking on reliability journeys.

The leaders each answer a series of questions about how their organization handles reliability, giving an interesting compare-and-contrast overview.

Increment

Full disclosure: Fastly is my employer.

Using a disaster plan created after a devastating hurricane, Freshworks survived and thrived during the pandemic, delivering a major new product by its pre-pandemic deadline.

Ipsita Agarwal — Increment

This one explains what a canary deployment is, how it can help you, and how canary deployments differ from blue/green deployments.

LaunchDarkly
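
To make the canary vs. blue/green distinction concrete, here is a minimal sketch of percentage-based routing. It is not taken from the article: the `bucket` and `route` helpers are hypothetical names, and hashing the user ID is just one common way to pick a stable subset of traffic.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99 (hypothetical helper)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the new version, the rest to stable."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for percent in (0, 5, 25, 100):
        on_canary = sum(route(u, percent) == "canary" for u in users)
        print(f"{percent:3d}% rollout -> {on_canary} of {len(users)} users on canary")
```

In this sketch a canary rollout raises `canary_percent` gradually, and a user who lands on the canary at 5% stays there at 25%. A blue/green cutover is the degenerate case of jumping straight from 0 to 100.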

This article explains the meaning of a growth mindset and shows how it applies to SRE.

Emily Arnott — Blameless

Outages

  • Fastly
    • Full disclosure: Fastly is my employer.
  • OVH Cloud
  • All domains containing “t.co” in Russia
    • It appears that Russia tried to impair access to Twitter’s URL-shortening domain t.co, but their pattern-matching was overzealous and affected any domain that contained “t.co” (think reddit.com, microsoft.com, and many others).
  • Dyn
    • Dyn had a DNS outage. I noted impact to Heroku, but I didn’t see any other related outage postings.
  • Chef
  • GitHub
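
The t.co item above is a nice illustration of substring matching versus matching on domain labels. I don't know how the actual filter was implemented, so treat the following as a sketch of the general pitfall only; both function names are made up for illustration.

```python
def naive_match(hostname: str, blocked: str = "t.co") -> bool:
    """Overzealous: matches any hostname that merely contains the blocked string."""
    return blocked in hostname.lower()


def domain_match(hostname: str, blocked: str = "t.co") -> bool:
    """Stricter: matches only the blocked domain itself and its subdomains."""
    hostname = hostname.lower().rstrip(".")
    return hostname == blocked or hostname.endswith("." + blocked)


if __name__ == "__main__":
    for host in ("t.co", "reddit.com", "microsoft.com", "example.org"):
        print(f"{host:15} naive={naive_match(host)!s:5} strict={domain_match(host)}")
```

The naive version flags reddit.com and microsoft.com because both happen to contain the characters "t.co"; the stricter version compares whole labels and leaves them alone.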

SRE Weekly Issue #260

A message from our sponsor, StackHawk:

Check out this guide to modern dynamic application security testing to learn how it works and what to look for in tooling.
http://sthwk.com/dynamic-appsec-overview

Articles

People throw around “resiliency” quite often when they mean “reliability” or “high availability”. Dr. Woods sets the record straight.

Ipsita Agarwal — Increment

A key part of their strategy is to keep their service running at 50% capacity or less, allowing them to lose a datacenter without overloading the remaining datacenter.

Mathieu Frappier, Dorothy Jung, and Qui Nguyen — Increment
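
The 50% figure falls out of simple arithmetic if you assume two equally sized datacenters whose load shifts entirely to the survivor. Here is a rough sketch of that calculation, generalized to more sites; the function name and the even-load assumption are mine, not the article's.

```python
def max_safe_utilization(datacenters: int, tolerated_failures: int = 1) -> float:
    """Highest per-datacenter utilization that still fits after losing some sites.

    Assumes equally sized datacenters and load that redistributes evenly
    across the survivors (an illustrative simplification).
    """
    if not 0 <= tolerated_failures < datacenters:
        raise ValueError("tolerated_failures must be in [0, datacenters)")
    return (datacenters - tolerated_failures) / datacenters


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(f"{n} datacenters: stay at or below {max_safe_utilization(n):.0%}")
```

With two datacenters the ceiling is 50%, which matches the strategy described above. Adding sites raises the ceiling, at the cost of running more of them.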

In issue #236, I linked to an excellent paper by Dr. Richard Cook and Beth Long about engineering resilience in incident response. Now they’re back, teaming up with John Allspaw to summarize and expand on that paper!

John Allspaw, Beth Adele Long, and Dr. Richard Cook — Increment

A quick s/security/reliability/g and this is an SRE article; the same principles apply to both fields.

Aaron Rinehart — Verica

How can we apply the tenets and principles of NASA mission controllers to our SRE work?

Geoff White — Blameless

Genius idea: we can take our lead from activists as we try to win over our organization to adopt SRE principles.

Chris Hendrix — Blameless

This insightful observation caught my eye:

It’s unnecessary overhead for a product team to plan capacity, set up good alerts and multihoming (automatically running in multiple data centers) for small, simple functionality.

Naphat Sanguansin and Utsav Shah — Dropbox

Outages

SRE Weekly Issue #259

A message from our sponsor, StackHawk:

Mark your calendars! The first conference for OWASP ZAP users is taking place March 9. Get your free ticket to connect with other ZAP users and learn about the project’s roadmap.
http://sthwk.com/zapcon-sreweekly

Articles

This quarter’s Increment issue is about Reliability, and I haven’t had this much fun since their first issue about on-call. I’ll include a few of the articles here and more in later issues as I have a chance to review them.

Stripe

Accepting that imperfect things still work is fundamental to preventing failures from becoming catastrophes.

Understanding that no system is without errors is critical to building resilient systems.

Heidi Waterhouse

The very first sentence sets the tone, and I love it:

Resilience is a process: something you must actively perform, not something you check off a list once.

Ryn Daniels

Most of all, having an incident commander only works if everyone believes in the role. Someone stepping in to address a crisis and saying “I’m Batman” doesn’t help unless people have bought into the idea of Batman.

The next time I’m incident commander, I am totally going to jump in and say, “I’m Batman!”.

This article is a great primer on what an IC is and how to adopt incident command at your organization.

Tanya Reilly

After reading this blog post, you will have an understanding of the retry pattern used in microservices architecture, why it should be used, a few considerations while using the retry pattern, and how to use it in Python.

I love the W. C. Fields quote.

Anand Prashant
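
For a feel of what the retry pattern looks like in Python, here is a minimal sketch using only the standard library. It is not the code from the blog post; the `retry` helper, its parameters, and the backoff-with-jitter policy are assumptions chosen for illustration.

```python
import random
import time


def retry(func, attempts=3, base_delay=0.1, retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Back off exponentially and add jitter so clients don't retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Hypothetical operation that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(retry(flaky, attempts=5, retry_on=(ConnectionError,)))
```

Two of the usual considerations show up even in a sketch this small: retry only on errors that are plausibly transient (`retry_on`), and cap the number of attempts so a hard failure is reported rather than retried forever.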

It’s that time again! Be sure to fill out the survey, not only so they can gather useful data, but also because Catchpoint will donate $5 to charity.

DevOps Institute, Catchpoint, and VMware Tanzu

When considering the value of a QA test, SLIs can provide very valuable context.

SRE and QA can work hand in hand.

Emily Arnott — Blameless

This kind of thing keeps me up at night. Silent data corruption can destroy your reliability just as quickly as a backhoe on a non-redundant link.

Harish Dattatraya Dixit — Facebook

Etsy experienced years of growth practically overnight in 2020 as quarantines set in. Here’s how they handled it.

Mike Adler — Etsy

Outages

SRE Weekly Issue #258

A message from our sponsor, StackHawk:

On February 25 at 10 am PT we are going to show you how easy it is to add application security testing to a #GitLab pipeline. Save your spot for our live session.
http://sthwk.com/gitlab-stackhawk-automation

Articles

When acting as a retrospective facilitator, there’s a huge potential to color the discussion with our words and actions.

You’re there to position other folks to learn, not wear the badge.

Will Gallego

upgundecha/howtheysre: A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)

A huge thanks to the curator for the many awesome links in this repo! Some have been featured here in previous issues, and some are new to me. As I go through those, I’ll share my favorites here and tell you why I think you should read them.

Unmesh Gundecha

In this article, we discuss the concepts of dependability and fault tolerance in detail and explain how the Ably platform is designed with fault tolerant approaches to uphold its dependability guarantees.

Paddy Byers — Ably

More details on the Notion outage mentioned here last week. Complaints of phishing by a Notion user resulted in their registrar pulling their domain name out of DNS.

Peter Judge — Datacenter Dynamics

Google has three guiding principles for improving resiliency:

  • Create maximum observability of the overall system
  • Design for effectiveness, not perfection
  • Learn and iterate as you go

Will Grannis — Google

This is an awesome guide to writing a production-ready checklist — and why you’d want one.

Emily Arnott — Blameless

Facebook found that the later a regression is discovered, the longer it takes to deploy a fix. With a combination of heuristics and machine learning, they’re detecting regressions earlier and bringing them to the attention of folks who can fix them.

Jian Zhang and Brian Keller — Facebook

Outages

A production of Tinker Tinker Tinker, LLC