This repo contains a path to learn SRE, in the form of a list of concepts to familiarize oneself with.
Teiva Harsanyi
How can we justify the (sometimes significant) expense of instilling observability into our systems?
Nočnica Mellifera — SigNoz
It was DNS. Cloudflare’s 1.1.1.1 recursive DNS service failed this week, stemming from a failure to parse the new ZONEMD record type.
Ólafur Guðmundsson — Cloudflare
Rather than just dry theory, this article helps you understand what the CAP theorem means in practice as you choose a data store.
Note: this link was 504ing at time of publishing, so here’s the archive.org copy.
Bala Kalavala — Open Source For U
A “blameless” culture can get in the way if it means you’re not allowed to mention who was at the pointy end of your system when things blew up.
incident.io
In this post, we will share how we formalized the LinkedIn Business Continuity & Resilience Program, how this new program helped increase our customers’ confidence in our operations, and the lessons that we learned as we attained ISO 22301 certification.
Chau Vu — LinkedIn
This is the start of a 6-article series, with each going through one week along a path to prepare for SRE interviews.
We’ll spend each week focusing on building up your expertise in the key areas SREs need to know, like automation, monitoring, incident response, etc.
Code Reliant
Beyond the CAP theorem, what actually happens during a partition?
“if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency and consistency (L and C)” [Daniel J. Abadi]
Lohith Chittineni
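To make the quoted if/else concrete, here is a minimal Python sketch of the two decisions it describes. The names (PacelcProfile, PartitionChoice, NormalChoice) are hypothetical and chosen for illustration only; they do not come from the linked article or from any library.

```python
from dataclasses import dataclass
from enum import Enum


class PartitionChoice(Enum):
    """What the system favors while a partition (P) is in effect."""
    AVAILABILITY = "A"
    CONSISTENCY = "C"


class NormalChoice(Enum):
    """What the system favors otherwise (E), with no partition."""
    LATENCY = "L"
    CONSISTENCY = "C"


@dataclass(frozen=True)
class PacelcProfile:
    """Hypothetical description of a data store's two trade-offs."""
    on_partition: PartitionChoice
    otherwise: NormalChoice

    def label(self) -> str:
        # Conventional shorthand such as "PA/EL" or "PC/EC".
        return f"P{self.on_partition.value}/E{self.otherwise.value}"

    def tradeoff(self, partitioned: bool) -> str:
        # Mirrors the quote: if partition, pick A or C; else pick L or C.
        if partitioned:
            return self.on_partition.name.lower()
        return self.otherwise.name.lower()


# Illustrative profile only, not a claim about any specific product.
profile = PacelcProfile(PartitionChoice.AVAILABILITY, NormalChoice.LATENCY)
print(profile.label())
print(profile.tradeoff(partitioned=True))
print(profile.tradeoff(partitioned=False))
```

The point of keeping two separate fields is that the choice made during a partition says nothing about the choice made in normal operation, which is exactly the gap the quote is pointing at.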