Articles
Here’s a great look into how LinkedIn’s embedded SREs work.
[…] the mission for Product SRE is to “engineer and drive product reliability by influencing architecture, providing tools, and enhancing observability.”
Zaina Afoulki and Lakshmi Namboori — LinkedIn
It’s all just other people’s caches.
Ruurtjan Pul
Recently there was a Reddit post asking for advice about moving from Site Reliability Engineering to Backend Eng. I started writing a response to it, the response got long, and so I turned it into a blog post.
Charles Cary — Shoreline
This is the first in a series about lessons SREs can learn from the space shuttle program. The author likens earlier spacecraft to microservices and the Shuttle to a monolith.
Robert Barron
This article is ostensibly about Emergency Medical Services (EMS), but as is so often the case, it’s directly applicable to SRE. The 5 characteristics are enlightening, and so is the fictitious anecdote about an EMT rattled from a previous incident.
EMS1
Simple solution meets reality. I like how we get to see what they did when things didn’t quite work out as they were hoping.
Robert Mosolgo — GitHub
They did the work to convert a database column to a 64-bit integer before it was too late. Unfortunately, one of their library dependencies didn’t use 64-bit integers.
Keith Ballinger — GitHub
In this post, I’ll walk you through one of our first ever Sidekiq incidents and how we improved our Sidekiq implementation as a result of this incident.
Nakul Pathak — Scribd
Outages
- Let’s Encrypt
- Uber
- Multiple Airlines’ Online Booking Sites
- An error in Google’s flight information service caused problems at multiple sites that consume it.
- Tinder
- BBC Website
- Facebook, Instagram, and WhatsApp
- Stellar.org (cryptocurrency)
- WazirX (cryptocurrency exchange)
- Microsoft Azure and other services
- Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure.