SRE Weekly Issue #363

A message from our sponsor, Rootly:

Manage incidents directly from Slack with Rootly 🚒.

Rootly automates manual tasks like creating an incident channel, Jira ticket and Zoom rooms, inviting responders, creating statuspage updates, postmortem timelines and more. Want to see why companies like Canva and Grammarly love us?:

https://rootly.com/demo/

Articles

A super in-depth look at on-call compensation strategies. Includes a sampling of companies and how much they pay (if anything).

  Gergely Orosz - The Pragmatic Engineer

Husky uses a nifty sharding strategy where a customer’s shard allocation changes over time automatically based on load.

  Daniel Intskirveli - Datadog

This analogy goes far enough to even include rules. Anyone up for a round?

  Robert Ross

[…] in order to be truly great at being an SRE you will constantly need to understand how to work with people in the organization, how to set expectations and how to move the needle on people’s understanding of reliability.

  Ross Brodbeck

MongoDB -> Cassandra -> ScyllaDB. Storing a ton of stuff is hard.

  Bo Ingram - Discord

When designing complex technical systems, you should ask yourself, “how does the human operator fit into the picture”.

  Cursed Quail

It sounds like it was a great conference!

  Paige Cruz - Chronosphere

[…] complex systems don’t yield to analysis. We have to add another skill: sense-making.

  Jessica Kerr - Honeycomb
  Full disclosure: Honeycomb is my employer.

SRE Weekly Issue #362

Articles

You might wonder why I have given almost zero coverage to “AIOps” here, and why my coverage of “anomaly detection” has included heavy skepticism. The reason: I simply haven’t seen any proof that it works.

The FTC’s recent stance on AI sums up my position nicely. If you want your AIOps product covered here, don’t just tell me it works, prove to me that it works.

  Michael Atleson - Federal Trade Commission

How? With a safe and repeatable procedure for database migrations involving double-writing.

  Lisa Karlin Curtis - incident.io

Push to main on a new microservice repo and it deploys to production, spins up a Slack channel for alerts, invites the CODEOWNERS, creates an on-call rotation, and puts them in it. Wow!

  Kiselev Ivan - Better Programming

A routing issue caused widespread packet loss with worldwide impact across many services.

  Google

This month’s report had a couple of fascinating incidents, especially the one about source code archive hashes.

  Jakub Oleksy - GitHub

Folks from the New York Times used chaos engineering to prepare for the surge of traffic during the US’s presidential election. They share 5 guidelines for effective chaos engineering for big data systems.

  Shane Murray - Monte Carlo

Here’s that LFI Conf recap I wanted!

  Vanessa Huerta Granda - Jeli

Former Google folks published this guide to help recently laid-off Google SREs integrate with the way SRE is done in the rest of the tech world. There’s an interesting hint about Google’s on-call compensation that I’m going to have to look into.

  Murali Suriar and Niall Murphy

A normally conscientious airline captain made a decision he normally would not have, likely owing to severe sleep deprivation.

  Admiral Cloudberg

SRE Weekly Issue #361

I’m having some serious FOMO from having missed out on the Learning From Incidents conference. If you post or see any write-ups, please send them my way!

Articles

An in-depth explanation of health checking, including the importance of failing open to avoid a metastable cascading failure.

  Srinavas - eightnoteight

SQS (Amazon’s Simple Queue Service) is hugely scalable, but you must design your system with its limitations and behaviors in mind.

  Satrajit Basu - DZone

What if your SSO provider is down? This article describes a scheme for falling back to HTTP Basic Authentication in an emergency.

  Chris Siebenmann

Etsy scaled their database by transitioning to a sharding strategy using Vitess. The journey was long and involved some tricky gotchas, as explained in this 3-part series.

  River Rainne and Amy Ciavolino - Etsy

An in-depth explanation of consistent hashing with a special focus on building a case for why other sharding mechanisms fall short.

  Nk - High Scalability

LinkedIn chronicles their recent improvements to HODOR (the Holistic Overload Detection and Overload Remediation) including new kinds of overload detectors.

  Abhishek Gilra, Nizar Mankulangara, Salil Kanitkar, and Vivek Deshpande - LinkedIn

An airline that gave monetary rewards for early arrivals and a steep cockpit authority gradient were just two of the factors that contributed to this crash.

  Admiral Cloudberg

SRE Weekly Issue #360

Articles

Another case of “pilot error” vs “systemic problems”. It’s interesting to me how the organizational pressures the pilots were facing mirror many stories I’ve seen in tech firms, especially startups.

  Admiral Cloudberg

This article recommends improving MTTA (mean time to assemble) by modeling our dispatch systems on the emergency services for a large city.

  Robert Ross

Lots of great stuff to aspire to, with a big emphasis on observability.

  Adriana Villela and Ana Margarita Medina - The New Stack
  Full disclosure: Honeycomb, my employer, is mentioned.

I really love the concept of “incident legalism” introduced in this article. I’ve definitely been there.

Anyone who has coordinated over Slack during the incident has felt the pain of the ambiguity of Slack messages.

But communicating with specificity has a cost.

  Lorin Hochstein

I remember this one! I was trying to listen to music at the time. Turns out it was DNS (and a git repo).

  Erik Lindblad - Spotify

If you’re gonna group your incidents, use tags, not exclusive groups.

  Lorin Hochstein

SRE Weekly Issue #359

Articles

the Data Reliability Engineering team is here to monitor, automate and manage pipelines to enable our partner USDE teams to have the ease of mind to tackle projects to help Mercari move forward.

  LameyerDaniel and OhshimaTakako - Mercari

Hiring in the Site Reliability Engineering (SRE) space is notoriously difficult. So it makes sense to figure out how to expand the hiring pool beyond existing SREs.

  Ash Patel - SREpath

SREs end up writing a lot of YAML. I mean, a lot. Fortunately it’s a really simple language with no hidden gotchas, right? Right?!

  Ruud van Asseldonk

Two Terraform changes that were developed and tested individually went out to production simultaneously, with unexpected results.

  Jan David Nose - Rust

Code search is a different beast from normal English-language searching. Regexes, punctuation, no word stemming, and GitHub’s scale made this a challenging design.

  Timothy Clem - GitHub

This article argues that folks outside of engineering are doing incident response, whether they call it that or not.

  incident.io

In incidents, we’re concentrating on resolving impact as quickly as possible, and this can impair our ability to gather the information we need after the fact in order to actually figure out what happened.

  Jake Cohen - PagerDuty

A production of Tinker Tinker Tinker, LLC