There are way too many gorgeous, mind-blowing ways for incidents to occur without a single change to code being deployed.
That last hot take is the kicker: even if you don’t do a code freeze in December (in the US), you’ll still see a lot of the same pitfalls as you would have if you did.
Emily Ruppe — Jeli
Ah, IaC, the tool we use to machine-gun our feet in a highly-available manner at scale. This analysis of an incident from back in August covers what happened and what the team learned.
Stuart Davidson — Skyscanner
By establishing a set of core principles (Response, Observability, Availability and Delivery), aka our “ROAD to SRE”, we now have clarity on which areas we expect our SRE team to focus on, and we avoid the common pitfall of becoming another platform or Ops team.
In this blog post, we’ll look at:
- The advantages of an SRE team where each member is a specialist.
- Some SRE specialist roles and how they help.
Emily Arnott — The New Stack
I love these “predictions for $YEAR” posts. What are your predictions?
Emily Arnott — Blameless
Deployment Decision-Making during the holidays amid the COVID19 Pandemic
A sneak peek into my forthcoming MSc. thesis in Human Factors and Systems Safety, Lund University.
Jessica DeVita (edited by Jennifer Davis) — SysAdvent
This article covers what to do as an incident commander, how to handle long-running incidents, and how to do a post-incident review.
Joshua Timberman — SysAdvent
So in this post I’m going to go over what makes a good metric, why data aggregation on its own loses resolution and the messy details that are often critical to improvements, and how good uses of metrics are visible by their ability to assist changes and adjustments.
Here’s a great tutorial to get started with eBPF through a (somewhat convoluted) “Hello World” exercise.
Ania Kapuścińska (edited by Shaun Mouton) — SysAdvent
The concept of engineering work being about resolving ambiguity really resonates with me.
This appears to have caused a problem with Microsoft Exchange servers. Maybe this belongs in the Outages section…