SRE Weekly Issue #426

Got any burning questions to ask an experienced SRE? I’m gathering your questions in this Google Form, and I’d love to hear from you. I’m hoping to use your questions to inspire authors looking to write more great SRE-related content.

A message from our sponsor, FireHydrant:

FireHydrant is now AI-powered for faster, smarter incidents! Power up your incidents with auto-generated real-time summaries, retrospectives, and status page updates.

https://firehydrant.com/blog/ai-for-incident-management-is-here/

If your overall request volume is low, single errors can have a big impact on your metrics — a phenomenon I’ve experienced at work recently.

  Ross Brodbeck
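
To make the low-volume effect concrete, here is a minimal sketch of the arithmetic. The request counts and the 99.9% target are hypothetical numbers chosen for illustration, not figures from the linked article.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


# Hypothetical numbers: the same single failed request at two traffic levels.
low_volume = error_rate(1, 50)
high_volume = error_rate(1, 100_000)
slo_target = 0.999  # assumed availability target, for illustration only

for name, rate in [("low volume", low_volume), ("high volume", high_volume)]:
    availability = 1 - rate
    verdict = "meets" if availability >= slo_target else "misses"
    print(f"{name}: availability {availability:.3%} {verdict} the target")
```

With 50 requests, one error is enough to miss the assumed target, while the same error is barely visible at 100,000 requests.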

This article outlines five facets of microservice architectures that can have implications for reliability.

  Andre Newman — Gremlin

If this title sounds familiar, that’s because I’ve linked to an article about the Children of the Magenta concept before. In this accident report, the pilots became confused about their location and course, and ultimately, their trust in the Flight Management System contributed to the disaster.

  Kyra Dempsey (Admiral Cloudberg)

A Center of Production Excellence can be a powerful means for an organization to initiate transformations which foster resilience as it matures and its environment changes.

  Nick Travaglini — Honeycomb

  Full disclosure: Honeycomb is my employer.

Last week, I shared a story about an outage at UniSuper that was caused by Google Cloud. This week, Google shared more details about what went wrong, and it’s well worth a read.

  Google

This incident is intriguing because exponential backoff made the problem harder to detect.

  Heroku
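
As a rough illustration of why backoff can mask a problem, the sketch below computes a capped exponential backoff schedule. The base delay, cap, and attempt count are assumptions for the example and are not taken from the Heroku report.

```python
def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Delay before each retry: base * 2**n, limited to cap seconds."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


# Hypothetical client retrying a failing call eight times.
delays = backoff_delays(8)
total_wait = sum(delays)
print(delays)
print(f"total time spent waiting: {total_wait:.0f}s")
```

As the delays grow, a failing client makes fewer and fewer attempts per minute, so an error-count alert may see less signal even though nothing has recovered.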

A discussion of what might get in the way of an organization implementing SLI/SLO/SLAs.

Note that the second half of the article (overcoming those obstacles) is behind a paywall. I don’t often recommend pay-only content, but it’s worth considering a subscription, because Alex is an excellent author whose work I’ve featured here many times.

  Alex Ewerlöf

if we look at a distribution of incidents by contributor (or cause, or component), we’re unlikely to see any one of these stand out as being the source of a large number of incidents.

  Lorin Hochstein

SRE Weekly Issue #425

Great practical advice for how to present reliability problems (and your proposed solutions) to e-staff.

  Ross Brodbeck

It’s when things aren’t always on fire that it can be very difficult to assess whether we need to allocate additional resources to reduce risk.

  Lorin Hochstein

The three kinds of roles covered in this article relate to Standards, Operations, and Leadership.

  Gavin Cahill — Gremlin

Nagle’s algorithm considered harmful? It’s important to be aware of it because it can trip you up.

  Marc Brooker
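
For readers who want to see where Nagle’s algorithm is controlled, here is a minimal sketch using the standard TCP_NODELAY socket option, which disables it. Whether to do so depends on your workload; the helper name is hypothetical.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled via TCP_NODELAY."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # With TCP_NODELAY set, small writes are sent without waiting to be coalesced.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_nodelay_socket()
# Read the option back to confirm the kernel accepted it.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
print("TCP_NODELAY enabled:", nodelay_enabled)
```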

In issue #423, I linked to a story about Amazon charging for unauthenticated and failed requests to S3 buckets. Thankfully, they’re no longer charging for that.

  Amazon

A little low on details, but interesting nonetheless: Google Cloud did something weird and accidentally deleted a customer’s account out from under them.

  UniSuper

What is a “service” in the context of service levels (SLI/SLO)?

  Alex Ewerlöf

My favorite part of this one is the description of techniques for improving psychological safety at your company.

  Incident.io

SRE Weekly Issue #424

Here’s an ultra-practical guide to pushing for reliability investments at your company, formatted as a runbook with a set of specific steps.

  Ross Brodbeck

A neat dive into how Amazon’s MemoryDB composes multiple systems to create a redundant Redis-compatible data store.

  Marc Brooker

This article looks into the economic and psychological impact of a culture of blame.

  Lee Atchison — Blameless

It took me two read-throughs to fully get this one, and I’m really glad I did it.

If we only examine people’s actions in the wake of an incident, and not when things go well, then we fall into the trap of selecting on the dependent variable.

  Lorin Hochstein

To prevent dangerous deploy collisions, these folks wrote an open source tool to mediate who gets to deploy when.

  Andrew Kannan — Klaviyo

if you’ve never worked at a startup before, you may be over-estimating how much you need to learn and how quickly.

When all you have is early adopters, you’re in a more forgiving environment, including for reliability.

  Nicholas Yan — Graphite

Structured logging is great, but there can be pitfalls and gotchas.

  Oakley Hall

An intro to SLOs with useful formulas, from the creator of the SLO Calculator featured here a while back.

  Alex Ewerlöf
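
As a taste of the kind of formula involved, here is a sketch of the common error budget calculation. It is a generic illustration, not necessarily one of the formulas in the linked article, and the 99.9% target and 30-day window are assumed values.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO target."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return 1 - downtime_minutes / budget


# Assumed example: a 99.9% target over 30 days with 10 minutes of downtime so far.
budget = error_budget_minutes(0.999)
remaining = budget_remaining(0.999, downtime_minutes=10)
print(f"budget: {budget:.1f} min, remaining: {remaining:.1%}")
```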

SRE Weekly Issue #423

This one’s full of great advice about making sure alerts are actionable, including alerting on flows that actually matter to customers.

  Nočnica Mellifera — Checkly

Here is a collection of things I learned after getting back into Magic: the Gathering over the past 10 years or so. They are things that apply to both the MTG scene and your friendly neighborhood incident response process.

  Ross Brodbeck

It was a classic application of technical debt: they chose to focus on customer-facing features and let k8s updates slide. Here’s how they caught back up safely.

  Jeff Wolski

This article presents an interesting hypothesis, and from it draws some nifty conclusions about reasoning about failure in systems.

we cannot know for sure whether or not software is going to be incident-free. It might well be, but we can’t ever prove it.

  Niall Murphy

For teams to solve incidents quickly and effectively, responders need to be able to trust each other and stakeholders have to trust the responders. This level of trust is hard to cultivate if your organization doesn’t have a significant amount of psychological safety.

  Mandi Walls — PagerDuty

More than just an interview, this article outlines a multi-year transformation from disorganized haphazard incident investigation to a smooth and efficient incident response process.

  Eric Silberstein — Klaviyo

In this article, you will learn how to prevent broken connections when a Pod starts or shuts down. You will also learn how to shut down long-running tasks and connections gracefully.

  Daniele Polencic — Learnk8s

It turns out that an S3 bucket owner pays for failed requests to that bucket, even if they’re unauthenticated, so anyone can run up your AWS bill if they know your bucket’s name. Oops.

Oh, and they can get the bucket name from CT logs (thanks, Corey Quinn!)

  Maciej Pocwierz

SRE Weekly Issue #422

The PIOSEE model is taught to pilots as a rubric for coming to a decision in a difficult aviation situation. As this article explains, we can also use it during IT incidents.

  Francisco Melo Jr.

What is high cardinality in monitoring systems? Here’s an excellent explanation that includes tips on how to manage cardinality.

  Ash P — SREPath
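
To show why cardinality grows so quickly, the sketch below multiplies the number of distinct values per label to get a worst-case count of time series for one metric. The label names and counts are hypothetical, and real systems only store combinations that actually occur.

```python
import math


def max_series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return math.prod(label_cardinalities.values())


# Hypothetical labels on a single request-duration metric.
labels = {"method": 5, "status": 10, "endpoint": 50}
before = max_series_count(labels)

# Adding one unbounded label, such as a user ID, multiplies everything.
after = max_series_count({**labels, "user_id": 10_000})
print(f"before: {before:,} series, after: {after:,} series")
```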

As Xero transitioned to a standard of “you build it you run it”, suddenly more engineering teams were responsible for knowing about and implementing observability. They designed this maturity model to help teams understand what they were aiming for and how to get there.

  Andrew Macdonald — Xero

With around 200 undersea fiber cuts worldwide per year, a fleet of ships is at the ready to pull up the cables and repair them.

  Josh Dzieza — The Verge

Last year, Cloudflare suffered a control plane outage when one of their datacenters lost power. They have since done significant work to improve their resilience to power outages, and that work was put to the test when the same datacenter lost power again.

  Matthew Prince, John Graham-Cumming, and Jeremy Hartman — Cloudflare

Going from non-remote to remote was challenging, but here’s how our team changed as we began working from home.

  Stefan Mikolajczyk — WeTransfer

Platform teams have a hugely important role to fill in the engineering organization. They are often the teams that enable other teams to move with more speed and safety. They can also quickly become disconnected from their customers.

  Ross Brodbeck

When your system successfully serves a degraded response to the customer, how should you count that toward your SLO? Is it success? Failure? Something in between?

  Niall Murphy
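
One possible answer to “something in between” is to give degraded responses partial credit. The sketch below is only an illustration of that idea, not the approach the article recommends, and the outcome names and weights are assumptions.

```python
from collections import Counter

# Assumed weights: a degraded response counts as half a success.
WEIGHTS = {"ok": 1.0, "degraded": 0.5, "failed": 0.0}


def weighted_sli(outcomes: list[str]) -> float:
    """Weighted fraction of good responses; 1.0 when there was no traffic."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return 1.0
    return sum(WEIGHTS[outcome] * n for outcome, n in counts.items()) / total


# Hypothetical traffic: 90 full successes, 8 degraded responses, 2 failures.
outcomes = ["ok"] * 90 + ["degraded"] * 8 + ["failed"] * 2
sli = weighted_sli(outcomes)
print(f"weighted SLI: {sli:.2%}")
```

Setting the degraded weight to 1.0 or 0.0 recovers the two simpler policies of counting degraded responses as plain successes or plain failures.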

A production of Tinker Tinker Tinker, LLC