SRE Weekly Issue #365

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

They take us from the requirements analysis all the way through implementation of a high-throughput data store based on CockroachDB.

  Chuanpin Zhu and Debalin Das — DoorDash

On March 14th, Reddit engineers upgraded a Kubernetes cluster from 1.23 to 1.24, and all hell broke loose. I admire their precision in being down for 100π minutes.

  Jayme Howard — Reddit

With a huge user base of students and teachers, these folks upped their incident response game, and they share how.

  Nadinastiti and Estu Fardani — GovTech Edu

A lurking bug in redis-py allowed users to see one another’s data, and OpenAI took ChatGPT down to limit the damage.

  OpenAI

In Linux, source port allocation can be complex. This article shows why with a ton of code and tracing examples.

  Jakub Sitnicki — Cloudflare

The gap between “paying for peak” and “earning on average” is critical to understand how the economics of large-scale cloud systems differ from traditional single-tenant systems.

  Marc Brooker

A configuration error was masked because the app automatically fell back to the original configuration. The problem only surfaced when the service was redeployed.

  Heroku

SRE Weekly Issue #364

Articles

Heresy! This article provides a counterpoint to many of the benefits of IaC. While IaC may still be the right answer, it’s not a slam dunk.

  Luke Shaughnessy

Short but sweet, this article outlines three focus areas that the author argues should be a part of any SRE role.

  Kyle Robertson

Way beyond just an intro to Aperture, this article also covers microservice architecture failure modes, techniques used to avoid failures, and the weaknesses in those techniques.

  Cong Ma and Matt Ranney — DoorDash

I’m including this here not just for the staff+ SREs out there. Many of these skills are important for SREs to develop much earlier than the Staff level, since our role can be so collaborative.

  Ryn Daniels — GitHub

I love that fully half of this article is about mentoring developing SREs in identifying and managing risk.

  Ross Brodbeck

Learn how the Honeycomb SRE team has structured its work, including a full copy of the team charter.

  Fred Hebert — Honeycomb
  Full disclosure: Honeycomb is my employer and I am a member of the SRE team described in this article.

An intriguing approach: define technical debt as a risk, and manage it in much the same way that we handle reliability-related risks, with a “threat budget”.

  Jason Bloomberg — Intellyx

Instead, because our time and attention is limited, we have to get good at identifying cues to indicate that our models have gotten stale or are incorrect.

  Lorin Hochstein

Using a simulation, this article comes to the conclusion that a hybrid between FIFO and LIFO is better than picking just one.

  Eugene Retunsky — DZone

SRE Weekly Issue #363

Articles

A super in-depth look at on-call compensation strategies. Includes a sampling of companies and how much they pay (if anything).

  Gergely Orosz — The Pragmatic Engineer

Husky uses a nifty sharding strategy where a customer’s shard allocation changes over time automatically based on load.

  Daniel Intskirveli — Datadog

This analogy goes far enough to even include rules. Anyone up for a round?

  Robert Ross

[…] in order to be truly great at being an SRE you will constantly need to understand how to work with people in the organization, how to set expectations and how to move the needle on people’s understanding of reliability.

  Ross Brodbeck

MongoDB -> Cassandra -> ScyllaDB. Storing a ton of stuff is hard.

  Bo Ingram — Discord

When designing complex technical systems, you should ask yourself, “how does the human operator fit into the picture”.

  Cursed Quail

It sounds like it was a great conference!

  Paige Cruz — Chronosphere

[…] complex systems don’t yield to analysis. We have to add another skill: sense-making.

  Jessica Kerr — Honeycomb
  Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #362

Articles

You might wonder why I have given almost zero coverage to “AIOps” here, and why my coverage of “anomaly detection” has included heavy skepticism. The reason: I simply haven’t seen any proof that it works.

The FTC’s recent stance on AI sums up my position nicely. If you want your AIOps product covered here, don’t just tell me it works, prove to me that it works.

  Michael Atleson — Federal Trade Commission

How? With a safe and repeatable procedure for database migrations involving double-writing.

  Lisa Karlin Curtis — incident.io
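For readers new to the idea, here is a minimal sketch of double-writing, using two in-memory dicts as stand-ins for the old and new data stores. The class name, method names, and the `read_from_new` flag are hypothetical and are not taken from the article; a real migration would also need to handle deletes, partial failures, and concurrent writers.

```python
class DoubleWriter:
    """Writes go to both stores; reads come from whichever store is primary.

    Illustrative only: plain dicts stand in for the old and new databases.
    """

    def __init__(self, old: dict, new: dict):
        self.old = old
        self.new = new
        self.read_from_new = False  # flip only after the stores are verified equal

    def write(self, key, value):
        # Every write lands in both stores while the migration is in progress.
        self.old[key] = value
        self.new[key] = value

    def read(self, key):
        store = self.new if self.read_from_new else self.old
        return store[key]

    def backfill(self):
        # Copy rows that predate double-writing, without clobbering newer writes.
        for key, value in self.old.items():
            self.new.setdefault(key, value)

    def in_sync(self) -> bool:
        return self.old == self.new
```

The usual order is: start double-writing, backfill, compare the two stores, and only then switch reads over to the new store.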

Push to main on a new microservice repo and it deploys to production, spins up a Slack channel for alerts, invites the CODEOWNERS, creates an on-call rotation, and puts them in it. Wow!

  Kiselev Ivan — Better Programming

A routing issue caused widespread packet loss with worldwide impact across many services.

  Google

This month’s report had a couple of fascinating incidents, especially the one about source code archive hashes.

  Jakub Oleksy — GitHub

Folks from the New York Times used chaos engineering to prepare for the surge of traffic during the US presidential election. They share 5 guidelines for effective chaos engineering for big data systems.

  Shane Murray — Monte Carlo

Here’s that LFI Conf recap I wanted!

  Vanessa Huerta Granda — Jeli

Former Google folks published this guide to help recently laid-off Google SREs integrate with the way SRE is done in the rest of the tech world. There’s an interesting hint about Google’s on-call compensation that I’m going to have to look into.

  Murali Suriar and Niall Murphy

A normally conscientious airline captain made a decision he otherwise would not have, likely owing to severe sleep deprivation.

  Admiral Cloudberg

SRE Weekly Issue #361

I’m having some serious FOMO from having missed out on the Learning From Incidents conference. If you post or see any write-ups, please send them my way!

Articles

An in-depth explanation of health checking, including the importance of failing open to avoid a metastable cascading failure.

  Srinivas — eightnoteight
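As a rough illustration of failing open, here is a sketch of how a load balancer might choose where to send traffic. The function name and the 50% threshold are assumptions made for this example, not details from the article.

```python
def routable_backends(health: dict[str, bool], min_healthy_fraction: float = 0.5) -> list[str]:
    """Return the backends that should receive traffic.

    `health` maps backend name to its latest health check result.
    The threshold is a hypothetical tuning knob.
    """
    if not health:
        return []
    healthy = [name for name, ok in health.items() if ok]
    if len(healthy) / len(health) < min_healthy_fraction:
        # Fail open: when most backends look unhealthy, the health check itself
        # is suspect (or the few survivors would be crushed by the full load),
        # so ignore the check results and route to every backend.
        return list(health)
    return healthy
```

With one of three backends failing, traffic goes to the other two. With two of three failing, the check is ignored and all three stay in rotation, rather than sending the whole load to the single survivor.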

SQS (Amazon’s Simple Queue Service) is hugely scalable, but you must design your system with its limitations and behaviors in mind.

  Satrajit Basu — DZone

What if your SSO provider is down? This article describes a scheme for falling back to HTTP Basic Authentication in an emergency.

  Chris Siebenmann

Etsy scaled their database by transitioning to a sharding strategy using Vitess. The journey was long and involved some tricky gotchas, as explained in this 3-part series.

  River Rainne and Amy Ciavolino — Etsy

An in-depth explanation of consistent hashing with a special focus on building a case for why other sharding mechanisms fall short.

  Nk — High Scalability
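If you have not seen consistent hashing before, this is a minimal sketch of a hash ring with virtual nodes. It is not the article's code: the class name, the `vnodes` default of 100, and the use of SHA-256 are all choices made for this example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash. Python's built-in hash() is salted per process,
    # so it is unsuitable for placing keys consistently across runs.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        self._vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owners = {}  # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node is placed at several positions to even out the load.
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owners[point] = node

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owners[p] != node]
        self._owners = {p: n for p, n in self._owners.items() if n != node}

    def lookup(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # The owner is the first ring position clockwise from the key's hash,
        # wrapping around to the start of the ring.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owners[self._points[idx]]
```

The property that matters: removing a node only moves the keys that node owned. Every other key keeps its owner, which is not the case with a plain `hash(key) % node_count` scheme.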

LinkedIn chronicles their recent improvements to HODOR (Holistic Overload Detection and Overload Remediation), including new kinds of overload detectors.

  Abhishek Gilra, Nizar Mankulangara, Salil Kanitkar, and Vivek Deshpande — LinkedIn

An airline that gave monetary rewards for early arrivals and a steep cockpit authority gradient were just two of the factors that contributed to this crash.

  Admiral Cloudberg

A production of Tinker Tinker Tinker, LLC