Articles
This article upends the concept of “human error” in an intriguing way.
  Lorin Hochstein
A key part of building reliable systems is often overlooked: continuously learning.
In the highly dynamic CI/CD environment, engineers with stale or outdated knowledge of the system are less able to detect, diagnose or repair anomalous behavior in their systems […]
  Laura Maguire (jeli.io) — The New Stack
This is the story of how an organization transitioned from a single NOC-like on-call team to individualized alerts routed to the relevant team.
Boris Cherkasky
This guide has a set of key factors you should consider when building a new SRE team in order to increase the likelihood of success.
Ash P — SREPath
My favorite pitfall discussed in this article: avoid committing to every possible remediation action from every incident.
incident.io
This article, written by one of the authors of the Google SRE book, is a critical look at the state of SRE and what the future holds.
Today, I believe we cannot successfully answer several key questions about SRE.
Niall Murphy
This interview goes into the thorny challenges around building a reliable app based on third-party services. It delves into the lack of reliable reporting we commonly see from cloud service providers and what ideal reporting would look like.
Jeff Martens (interviewing Jeff Smith) — Metrist
Outages
- Spotify
- TLS certificate expiration.
- Solana
- Square
- Australian Taxation Office
- Google Cloud Platform us-east1