
SRE Weekly Issue #106


See how AlienVault focuses their incident management on collaboration and shared responsibility while relying on the rules engine of the VictorOps Transmogrifier.


Chaos engineering is extremely useful, and Mathias Lafeldt has written plenty about its virtues. But as with everything, it’s important to be aware of its pitfalls and shortcomings too.

There’s been a lot of talk of firing (or worse) the person whose actions led to the false alarm in Hawaii. That’s why I’m especially glad to see this excellent analysis by Don Norman (author of The Design of Everyday Things and others). Bonus content: another article in the same vein with some more interesting tidbits.

Think twice before you disable swap, says Chris Down, an author of the upcoming cgroup v2 in the Linux kernel.

Catchpoint is running a survey of SREs and SRE-like folks, and I’d really appreciate it if you’d take a moment to fill it out. Not only will the resulting data be very interesting, but Catchpoint is donating $5 to charity for every survey completed. Let’s stuff that ballot box and get them to hit their cap of $3000!

The awesome continues this week with a discussion of the importance of simplicity in the design of a reliable system.

This article from Heidi Waterhouse at LaunchDarkly starts off with a really interesting take on the Y2K bug and continues on to discuss risk management in operations.

This short article has an extremely cogent point: design your system to be flexible enough to allow the user to do something seemingly incorrect, because they might need to while responding to an incident!

LinkedIn had a problem: their on-call system was so dysfunctional that they had to scramble to find coverage for an engineer who had been scheduled to be on call while on vacation. They explain how they identified the problem, came up with a solution, and implemented it, including automation and cultural fixes.

If the phrase “a DevOps World” makes you feel ill, don’t dismiss this article from ACM Queue out of hand. It’s got some great points about designing effective monitoring, and I like the introduction of the “Real Systems Monitoring” concept (akin to “Real User Monitoring” or RUM).


  • Heroku
    • Heroku had a 29-hour impairment to their application log routing platform.

SRE Weekly Issue #105

A quick note: Friday was my last day at Heroku/Salesforce, so don’t be surprised if you see my “full disclosure” notices change.


See how CloudBees Jenkins Solutions & VictorOps work together to bridge the on-call gap for CI/CD in this webinar. Register today.


PagerDuty put a call out on Twitter, asking what folks are doing to improve the on-call experience at their companies.

Here’s part three in the series. This one’s about sharding, horizontal scaling, and client versus server complexity.

Here’s how Azure’s new availability zones change the way highly available apps can be designed on Azure.

The Meltdown patch seems to be having a disproportionate impact on Redis performance. Here’s Grab’s story of how they figured out what was up and what they did to deal with it.

I don’t often do the Twitter thing, but this chain by Charity Majors is worth reading. Is that what they call it? A chain?

Google on the advantages of Cloud Spanner’s strong consistency and why to use it. I’m still looking out for an explanation of what the downside to Spanner is…

Just to be clear, this is about how critical it is that Facebook keep their machine learning applications running, rather than using machine learning to design disaster recovery solutions.

This article is about useful error messages, which are important both for the customer experience and for operations. I’m not sure what really qualifies as a “mainframe” these days, though….

LinkedIn is open-sourcing two tools that they use for troubleshooting during incidents. Fossor automates data gathering, and Ascii Etch displays graphs using ASCII art.


  • LastPass
  • Slack
  • Spotify
  • Bitbucket
    • Bitbucket has had severe performance problems due to a failure in their storage layer.
  • Kraken (cryptocurrency exchange)
    • This appears to have been a scheduled upgrade that blew up in complexity, preventing Kraken from coming back up for two days. From the article:

      Most astonishing of all, about 36 hours after the upgrade began, Kraken apparently sent their engineers home to take a nap!

      Not that astonishing! Tired engineers make mistakes, after all.

  • Missile threat alert for Hawaii a false alarm
    • There’s so much more to this story than we’ve been told, and I really wish I could be a fly on the wall during the retrospective.

SRE Weekly Issue #104

Well, that was a fun week. I hope all of you have had a chance for a rest after any hectic patching you might have been involved in.


Curious about the state of on-call, but don’t have a ton of time to do the research? VictorOps has gathered the most important stats in one place for you to skim.


Local Rationale: the reasoning and context behind a decision that an operator made. Here’s Todd Conklin reminding us to find out what was really going on when the benefit of hindsight makes a decision seem irrational.

In part two of the series I linked to last week, Tyler Treat introduces data replication strategies, including replicating data to all replicas before returning versus waiting for just a quorum.
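
To make the difference between those two strategies concrete, here is a minimal sketch. It is not taken from the linked article: the `Replica` class, `replicate`, and `quorum_size` are hypothetical names made up for illustration, the replicas live in memory rather than across a network, and a "quorum" is assumed to mean a simple majority.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory replica; `healthy` simulates reachability."""
    name: str
    healthy: bool = True
    log: list = field(default_factory=list)

    def append(self, entry: str) -> bool:
        # A real replica would persist the entry and reply over the network.
        if not self.healthy:
            return False
        self.log.append(entry)
        return True


def quorum_size(n: int) -> int:
    # Assumption: a quorum is a simple majority of the n replicas.
    return n // 2 + 1


def replicate(replicas, entry, required_acks) -> bool:
    """Send the entry to every replica and report whether enough acknowledged."""
    acks = sum(1 for replica in replicas if replica.append(entry))
    return acks >= required_acks


replicas = [Replica("a"), Replica("b", healthy=False), Replica("c")]

# Waiting for every replica: one unreachable replica fails the write.
wrote_all = replicate(replicas, "event-1", required_acks=len(replicas))

# Waiting for a majority: two of three acknowledgments are enough.
wrote_quorum = replicate(replicas, "event-2", required_acks=quorum_size(len(replicas)))
```

With one of three replicas down, the wait-for-all write reports failure while the quorum write succeeds, which is the availability trade-off in a nutshell. Note that even the failed write has already landed on the healthy replicas, so a real system needs some way to reconcile that.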

Here’s something I wasn’t aware of: hospitals have their own version of the ICS (Incident Command System).

In this blogpost, we discuss why you should create a policy on how SREs and devs respond to SLO violations, and provide some ideas for the structure and components of that policy.

Now this is neat. This research team pings basically the entire internet all the time and can track outages across the globe. They can see things like Egypt shutting down Internet access for all of its citizens and the effects of hurricanes.

This is a summary of a couple of talks from Influx Days. I especially like the bit about Baron Schwartz’s talk on the pitfalls of anomaly detection.

Meltdown is especially scary because the fix has the potential to significantly impact performance.


SRE Weekly Issue #103


Looking for light reading for the new year? Dive into a VictorOps favorite: Scala Unified Logging to Full System Observability.


Gremlin Inc. helps folks simulate failure, but what happens when they turn their tools on their own infrastructure? In this article, they share all sorts of juicy details about how they set up their experiments, what they hoped to prove and thought might happen, and then what actually happened, including an unexpected failure mode.

This article series isn’t actually about writing your own new distributed log from scratch — probably not a good idea. It’s about learning the fundamental principles involved in designing such systems so that we can better understand them while operating and using them.

What do you do about the scary system that nobody touches and everyone is afraid will fall over some day? This article shows you a concrete plan for digging in and dealing with the skeleton in the closet.

It’s Julia Evans, writing at Stripe!

In this post, we’ll explain why we chose to build on top of Kubernetes. We’ll examine how we integrated Kubernetes into our existing infrastructure, our approach to building confidence in (and improving) our Kubernetes cluster’s reliability, and the abstractions we’ve built on top of Kubernetes.

AppOptics’s take on alerting, including this gem:

More often, our metric choices and threshold values are guided by our preexisting tools. Hence, if our tools cannot measure latency, we do not alert on latency.

How many times have you seen a migration or transition reach 90% completion and stall? This SysAdvent author urges caution in engaging a “hybrid cloud” vendor solution.

Juniper discusses the evolution of the Network Engineer role into Network Reliability Engineer (NRE).

Just like sysadmins have graduated from technicians to technologists as SREs, the NRE title is a declaration of a new culture and serves as the zenith for all that we do and have as engineers of network invincibility.

A primer on setting up load testing for WebDAV using Apache JMeter.

An interesting debugging story involving a tricky data corruption bug in RavenDB.


SRE Weekly Issue #102

My phone died this week, and I discovered the hard way that my backups hadn’t been functioning properly. SRE Weekly is served out of a single t2.micro, too. Sometimes it’s hard to practice what I preach outside of work. ;) Anyway, bit of a light issue this week, but still some great stuff.


A robust mobile app is essential for on-call. See why VictorOps updated both native iOS and Android apps.


I’ve lost count of the number of incidents I’ve witnessed that were caused by TCP connections in which one end disappeared. This cautionary tale has a pretty interesting cause as well.

In many ways, what you do with an incident post-mortem can be as important as the post-mortem itself. Below, I explain why and offer tips on what to do after the post-mortem is complete.

If the title of this article doesn’t make sense to you, then you may well have been interpreting traceroute results incorrectly. Definitely worth a read.

Gremlin Inc. is live! Here’s the official “coming out” post for this chaos engineering startup.

There’s so much to delve into in this long article about Netflix’s infrastructure. It’s part of the book, Explain the Cloud Like I’m 10, but I didn’t really find the explanations watered-down or over-simplified.

A great description of‘s incident response and followup process.

Incidents are like presents: You love them as long as you don’t get the same present twice.


A production of Tinker Tinker Tinker, LLC