General

SRE Weekly Issue #361

I’m having some serious FOMO from having missed out on the Learning From Incidents conference. If you post or see any write-ups, please send them my way!

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating incident channels, Jira tickets, and Zoom rooms; inviting responders; and creating status page updates, postmortem timelines, and more. Want to see why companies like Canva and Grammarly love us?

https://rootly.com/demo/

Articles

An in-depth explanation of health checking, including the importance of failing open to avoid a metastable cascading failure.

  Srinivas — eightnoteight
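The "fail open" idea can be sketched roughly like this (a minimal illustration, not code from the article; the class name and thresholds are invented):

```python
class HealthChecker:
    """Tracks per-backend health, but refuses to ever mark *every*
    backend unhealthy at once (failing open)."""

    def __init__(self, backends, max_failures=3):
        self.backends = set(backends)
        self.max_failures = max_failures
        self.failures = {b: 0 for b in backends}

    def report(self, backend, ok):
        # Consecutive failures count against a backend; a success resets it.
        self.failures[backend] = 0 if ok else self.failures[backend] + 1

    def healthy_backends(self):
        healthy = sorted(b for b in self.backends
                         if self.failures[b] < self.max_failures)
        # Fail open: if the checks would eject every backend (e.g. the
        # health-check path itself is broken), keep sending traffic to
        # all of them rather than none, avoiding a metastable collapse
        # where ejections cause overload, which causes more ejections.
        return healthy if healthy else sorted(self.backends)
```

The key line is the final fallback: an empty "healthy" set is treated as evidence that the checker, not the fleet, is wrong.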

SQS (Amazon’s Simple Queue Service) is hugely scalable, but you must design your system with its limitations and behaviors in mind.

   Satrajit Basu — DZone

What if your SSO provider is down? This article describes a scheme for falling back to HTTP Basic Authentication in an emergency.

  Chris Siebenmann

Etsy scaled their database by transitioning to a sharding strategy using Vitess. The journey was long and involved some tricky gotchas, as explained in this 3-part series.

  River Rainne and Amy Ciavolino — Etsy

An in-depth explanation of consistent hashing with a special focus on building a case for why other sharding mechanisms fall short.

  Nk — High Scalability
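For readers who want the mechanics, a minimal consistent-hash ring looks something like this (an illustrative sketch, not code from the article; the virtual-node count is arbitrary):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable 64-bit hash; md5 is fine here since this isn't security-sensitive.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node gets many virtual points on the ring to smooth the
        # key distribution across nodes.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash (wrapping to 0).
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Removing a node only remaps the keys that lived on it; with naive modulo sharding, nearly every key would move.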

LinkedIn chronicles their recent improvements to HODOR (Holistic Overload Detection and Overload Remediation), including new kinds of overload detectors.

  Abhishek Gilra, Nizar Mankulangara, Salil Kanitkar, and Vivek Deshpande — LinkedIn

An airline that gave monetary rewards for early arrivals and a steep cockpit authority gradient were just two of the factors that contributed to this crash.

  Admiral Cloudberg

SRE Weekly Issue #360


Articles

Another case of “pilot error” vs “systemic problems”. It’s interesting to me how the organizational pressures the pilots were facing mirror many stories I’ve seen in tech firms, especially startups.

  Admiral Cloudberg

This article recommends improving MTTA (mean time to assemble) by modeling our dispatch systems on the emergency services for a large city.

  Robert Ross

Lots of great stuff to aspire to, with a big emphasis on observability.

   Adriana Villela and Ana Margarita Medina — The New Stack
  Full disclosure: Honeycomb, my employer, is mentioned.

I really love the concept of “incident legalism” introduced in this article. I’ve definitely been there.

Anyone who has coordinated over Slack during an incident has felt the pain of ambiguous Slack messages.

But communicating with specificity has a cost.

  Lorin Hochstein

I remember this one! I was trying to listen to music at the time. Turns out it was DNS (and a git repo).

  Erik Lindblad — Spotify

If you’re gonna group your incidents, use tags, not exclusive groups.

  Lorin Hochstein

SRE Weekly Issue #359


Articles

the Data Reliability Engineering team is here to monitor, automate and manage pipelines to enable our partner USDE teams to have the ease of mind to tackle projects to help Mercari move forward.

  Daniel Lameyer and Takako Ohshima — Mercari

Hiring in the Site Reliability Engineering (SRE) space is notoriously difficult. So it makes sense to figure out how to expand the hiring pool beyond existing SREs.

  Ash Patel — SREpath

SREs end up writing a lot of YAML. I mean, a lot. Fortunately it’s a really simple language with no hidden gotchas, right? Right?!

  Ruud van Asseldonk

Two Terraform changes that were developed and tested individually went out to production simultaneously, with unexpected results.

  Jan David Nose — Rust

Code search is a different beast from normal English-language search. Regexes, punctuation, the lack of word stemming, and GitHub’s scale made this a challenging design.

  Timothy Clem — GitHub

This article argues that folks outside of engineering are doing incident response, whether they call it that or not.

  incident.io

In incidents, we’re concentrating on resolving impact as quickly as possible, and this can impair our ability to gather the information we need after the fact in order to actually figure out what happened.

  Jake Cohen — PagerDuty

SRE Weekly Issue #358


Articles

A new spin on changing the engines on a jet in flight: using DNS request/response rewriting to transition an application over without modification.

  lainra — Mercari

How much additional capacity can you get for a dollar?

  Dan Slimmon

Dealing with the unknown, limited cognitive bandwidth, coordination patterns, psychological safety, and feeding information back into the organization.

  Fred Hebert — The New Stack
  Full disclosure: Honeycomb is my employer.

How do you enable adoption of SRE principles at a large, mature company that has little capacity for innovation?

the value proposition of “SRE” is the idea that you can handle an exponentially growing business with a logarithmically growing payroll.

  Layer Aleph

Read this one to learn about four attributes of good alerting and how to ensure your SLO burn rate alerts are effective.

  Saheed Oladosu
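For reference, the math behind burn-rate alerts is simple; the sketch below follows the standard multiwindow approach popularized by the Google SRE Workbook (the threshold and function names here are illustrative, not from the article):

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """How fast the error budget is being consumed.
    1.0 means burning at exactly the budgeted rate; 10.0 means the
    whole budget would be gone in a tenth of the SLO window."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo          # e.g. 0.001 for a 99.9% SLO
    return (errors / total) / error_budget

def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    # Multiwindow alert: require both a long window (the burn is
    # sustained, not a blip) and a short window (it is still burning
    # right now) to exceed the threshold before paging anyone.
    return long_window_rate >= threshold and short_window_rate >= threshold
```

A burn rate of 1.0 over a 30-day window consumes exactly one error budget in 30 days; the commonly cited 14.4 threshold corresponds to spending 2% of a 30-day budget in one hour.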

There’s plenty of content out there telling you how to implement observability, or what good looks like. But what about bad observability? What are some anti-patterns to watch out for?

  Stephen Townshend — SquaredUp

This is an interview about on-call with Twilio’s VP of SRE who also spent 17 years as an SRE at Google.

  Elena Boroda

They started with a (mostly) single-availability-zone Kafka deployment. Here’s how they transitioned to a multi-zone architecture that can survive a single AZ failure.

  Andrey Polyakov and Kamya Shethia — Etsy

SRE Weekly Issue #357


Articles

Panic takes time and energy away from swift incident response, leading to second-guessing, a higher likelihood of mistakes, and analysis paralysis. Here are three tips to minimize it.

  Malcolm Preston — incident.io

A great explanation of why we need to wait for more details on the FAA NOTAM outage. My favorite part is the list of clues to whether an incident report might be useful: Time, Artifacts, Jargon, and Narrative.

  Thai Wood — Resilience Roundup

Lots of juicy details about a large SRE organization and how they work.

  Ash Patel — SREPath

A deploy accidentally wiped authentication tokens for some internal Cloudflare services, causing an outage for those services.

   Kenny Johnson and Sam Rhea — Cloudflare

eBay thought about adopting “test in production” and eliminating staging, but they determined that their use case really does require a staging environment. They carefully selected and anonymized real production data to use as test cases in staging.

   Senthil Padmanabhan — eBay

This article has a really great section explaining the pitfalls of full system dashboards.

  Boris Cherkasky

The first one is my favorite:

Economic factors will force companies to look for more efficient ways of managing reliability

I’m not sure if that will happen, but it’s an interesting theory.

  Emily Arnott

This author shares what they learned in adapting to running incidents remotely once the pandemic hit.

  Emily Ruppe — Jeli

A production of Tinker Tinker Tinker, LLC