In this debugging story, an engineer wielded SystemTap to figure out why a Kafka broker was doing a ridiculous number of reads.
Terra Field — Honeycomb
Full disclosure: Honeycomb is my employer.
A concise breakdown of the math involved in getting that extra nine of reliability.
It all boils down to creating the SLOs and requirements to keep your users happy, but nothing more. Unnecessary reliability comes at a high cost.
Thomas Stringer
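For a rough feel of that math, here is a minimal sketch of my own (not taken from the article) that turns an availability target into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration.

```python
# Yearly downtime budget implied by an availability target.
# Assumption: a 365-day year; leap years are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Return the minutes of downtime per year allowed by `availability` (e.g. 0.999)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Each extra nine shrinks the budget by a factor of ten.
    for target in (0.99, 0.999, 0.9999, 0.99999):
        budget = downtime_budget_minutes(target)
        print(f"{target:.3%} -> {budget:.1f} minutes/year")
```

Every added nine leaves a tenth of the previous budget, which is why each one tends to cost much more than the last.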
If you’re looking to advance in SRE, this article has some examples of the skills and experience you should aim for.
Prabesh
Will Gallego shows us a way of thinking that helps turn “should haves” into deeper understanding of our sociotechnical systems.
Will Gallego
Some words of wisdom I came across this week about why startups might choose not to work on scalability too early.
Vassil Popovski
Some commenters in this reddit thread are saying it’s easier to be called an SRE, but what does it mean? Some say SRE has gotten easier, and some say it’s gotten harder. What do you think?
u/sreiously and others — reddit
The full report isn’t available yet (and may never be?), but this executive summary has a lot of juicy bits about the major 2022 Rogers internet and emergency service outage in Canada.
Xona Partners, Inc.
The Rogers report executive summary includes some blamey and blame-adjacent language, and this analysis does a good job of calling it out and suggesting ways to recast it.
Lorin Hochstein
The Rogers outage report executive summary indicates that truly out-of-band network management access may have made recovery easier. What exactly is involved in setting that up?
Chris Siebenmann