
SRE Weekly Issue #108

Wow, I have a lot of great content to share with you this week!  Sometimes it seems like awesome articles come in waves… not sure what that’s about.

SPONSOR MESSAGE

ChatOps continues to gain momentum in all industries. See what Jason Hand had to say about the progression. http://try.victorops.com/SREWeekly/ChatOpsUpdate

Articles

This is the first in a series where New York Times CTO, Nick Rockwell, talks to leaders in the technology world about their work.

There’s so incredibly much awesome in this conversation, and I’ve already seen the internet alight with people quoting it. Charity says so many insightful things that I’m going to have to reread this a couple of times to absorb it all. It’s a must-read!

Xero SRE is back, this time with an article about their incident response process and an overview of their chatbot, Multivac. The bot assists with paging and information tracking and, crucially, guides incident responders through a checklist of actions such as determining severity.

Here’s a fun little distributed system debugging story from the founder of RavenDB.

This CNN article goes into a little more detail about what happened. To my eye, there’s not enough in those details to warrant firing, so there must be more than has been shared publicly.

LinkedIn’s growth from a single datacenter to multiple “hyperscale” locations was accompanied by a cultural shift. They transitioned from “‘Site-Up’ is priority #1” to “taking intelligent risks” as their overall reliability improved.

The program is nominally aimed toward “a variety of industries, including the aerospace, automotive, maritime, manufacturing, oil, chemical, power transmission, medical device, infrastructure planning and extreme event response sectors”, though I can’t help but wonder if it might be applicable to IT.

“Well I’d cut out the pizza and beer and instead pay for Splunk.”

This author pushes us to resist the urge to write something in-house and instead look for external services or software, when the tool is not key to delivering customer value.

Here’s a very well-articulated argument for using a third-party feature-flag service rather than writing your own. I’ve seen every pitfall they mention and more. This article is by Rollout.io, a feature-flag service, but they notably don’t mention their product even once, and they don’t need to. Nicely done, folks.

I think there’s another layer we get out of the postmortem process itself that hasn’t usually been part of the discussion: communicating about your service’s long-term stability.

We should look beyond merely preventing the same kind of incident in the future and improving our incident response process, says this article from PagerDuty.

How many times have you been paged for a server at 95% disk usage, only to find that it’s still months away from full? This article by SignalFX is about a feature on their platform, but its concepts are generally applicable to other tools.
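
The underlying idea is simple enough to sketch. Here's a purely hypothetical Python example (not SignalFX's actual feature) that alerts on projected time-to-full instead of a raw usage percentage:

    from datetime import datetime

    def hours_until_full(samples, capacity_bytes):
        # samples: list of (timestamp, used_bytes) pairs, oldest first
        (t0, u0), (t1, u1) = samples[0], samples[-1]
        elapsed_hours = (t1 - t0).total_seconds() / 3600
        growth_per_hour = (u1 - u0) / elapsed_hours
        if growth_per_hour <= 0:
            return float("inf")  # flat or shrinking usage: nothing to page about
        return (capacity_bytes - u1) / growth_per_hour

    # 95% full, but growing only ~1 GB/week on a 1 TB volume:
    samples = [
        (datetime(2018, 1, 1), 950_000_000_000),
        (datetime(2018, 1, 8), 951_000_000_000),
    ]
    print(hours_until_full(samples, 1_000_000_000_000) / 24)  # roughly 340 days

A proper fit over more samples would be less noisy, but even this naive version would have saved you that 3am page.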

A primer on testing failover in a MongoDB Atlas cluster.

Large numbers of SREs went scrambling last month when we realized that we may suddenly run out of resources on our NoSQL workloads. Here are some concrete numbers on how things actually turned out.

Outages

SRE Weekly Issue #107

SPONSOR MESSAGE

Reactive, tactical, integrated, or holistic—where does your incident management fall? Read about incident management maturity to find out. http://try.victorops.com/SREWeekly/IncidentManagementMaturity

Articles

Here, “escalation policy” refers to ongoing work by SRE to get a service back into its SLO, rather than an escalation policy definition in PagerDuty (for example). This article describes the tactics a hypothetical Google SRE team has at their disposal to deal with an ailing service. It’s especially striking to me how this policy comes across as almost punitive in nature.

In this post, we’ll provide a technical walk-through of how we used the Play Framework and the Akka Actor Model to build the massive infrastructure that keeps track of the online status of millions of members at any given moment. We’ll describe how it distributes thousands of changes per second in the online status of these members to millions of other connected members in real time. You will also learn how to apply these techniques to your own applications.
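
The post itself is all Play and Akka; purely to show the shape of the problem, here's a deliberately tiny (and entirely hypothetical) Python sketch of heartbeat-based presence tracking with fan-out to subscribers:

    import time

    HEARTBEAT_TIMEOUT = 30  # seconds without a heartbeat before a member counts as offline

    class PresenceTracker:
        def __init__(self):
            self.last_seen = {}    # member_id -> timestamp of last heartbeat
            self.subscribers = {}  # member_id -> callbacks interested in that member

        def heartbeat(self, member_id):
            came_online = not self.is_online(member_id)
            self.last_seen[member_id] = time.time()
            if came_online:
                self._notify(member_id, online=True)

        def is_online(self, member_id):
            ts = self.last_seen.get(member_id)
            return ts is not None and time.time() - ts < HEARTBEAT_TIMEOUT

        def subscribe(self, member_id, callback):
            self.subscribers.setdefault(member_id, set()).add(callback)

        def _notify(self, member_id, online):
            for callback in self.subscribers.get(member_id, set()):
                callback(member_id, online)

    # A real system would also sweep for expired heartbeats and push offline
    # transitions; the article covers how LinkedIn handles this at scale.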

This article from LaunchDarkly is about assuming failure and mitigating harm, through the lens of feature-flag-based deployment.

New Relic shares this list of the categories of tools that SREs use to standardize the systems they support.

As Liz [Fong-Jones] told Matthew Flaming, New Relic vice president of software engineering, “One SRE team is going to have a really difficult time supporting 50 different software engineering teams if they’re each doing their own separate thing, and they’re each using separate tooling.”

In the final article of this series, Tyler Treat lays out a design for a new distributed log based on NSQ.

While perhaps not strictly SRE-related, hiring is still critically important for SRE teams. I really love Honeycomb’s approach to hiring as laid out in this blog post.

Why indeed? This issue of The Morning Paper discusses a paper on the effectiveness of random testing in distributed systems. More specifically, it goes over the mathematics behind why randomized testing in Jepsen is actually useful, despite classical theories that it ought not be.
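
A much cruder back-of-the-envelope version of the intuition: if a single randomized run trips a given bug with probability p, the chance of never seeing it falls off geometrically with the number of runs.

    # Illustrative only; the paper's analysis is far more interesting than this.
    def prob_bug_missed(p, n):
        """Probability that n independent random runs all miss a bug
        that any single run hits with probability p."""
        return (1 - p) ** n

    # Even a bug hit on only 1% of runs is very likely to surface
    # after a few hundred randomized iterations.
    print(prob_bug_missed(0.01, 500))  # ~0.0066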

Outages

  • Pinterest
  • Google Cloud Storage
    • This one’s worth a read. Google’s original status posting stated 100% impact to cloud storage in its US region, but their follow-up post retroactively reduced that to 2.0% average and 3.6% peak.
  • Netflix
    • This one happened seemingly at the same time as the Google Cloud Storage outage, but that may be a spurious correlation. This is the first time that I learned that Netflix does have a status page of sorts: it’s an article in their help center entitled “Is Netflix Down?” and they update it live. Who knew?
  • Facebook/Instagram
  • National Health Service (UK)

SRE Weekly Issue #106

SPONSOR MESSAGE

See how AlienVault focuses their incident management on collaboration and shared responsibility while relying on the rules engine of the VictorOps Transmogrifier. http://try.victorops.com/SREWeekly/AlienVault

Articles

Chaos engineering is extremely useful, and Mathias Lafeldt has written plenty about its virtues. But as with everything, it’s important to be aware of its pitfalls and shortcomings too.

There’s been a lot of talk of firing (or worse) the person whose actions led to the false alarm in Hawaii. That’s why I’m especially glad to see this excellent analysis by Don Norman (author of The Design of Everyday Things, among other books). Bonus content: another article in the same vein with some more interesting tidbits.

Think twice before you disable swap, says Chris Down, an author of the upcoming cgroup v2 in the Linux kernel.

Catchpoint is running a survey of SREs and SRE-like folks, and I’d really appreciate it if you’d take a moment to fill it out. Not only will the resulting data be very interesting, but Catchpoint is donating $5 to charity for every survey completed. Let’s stuff that ballot box and get them to hit their cap of $3000!

The awesome continues this week with a discussion of the importance of simplicity in the design of a reliable system.

This article from Heidi Waterhouse at LaunchDarkly starts off with a really interesting take on the Y2K bug and continues on to discuss risk management in operations.

This short article has an extremely cogent point: design your system to be flexible enough to allow the user to do something seemingly incorrect, because they might need to while responding to an incident!

LinkedIn had a problem: their on-call system was so dysfunctional that they had to scramble to find coverage for an engineer who had been scheduled to be on call while on vacation. They explain how they identified the problem, came up with a solution, and implemented it, including automation and cultural fixes.

If the phrase “a DevOps World” makes you feel ill, don’t dismiss this article from ACM Queue out of hand. It’s got some great points about designing effective monitoring, and I like the introduction of the “Real Systems Monitoring” concept (akin to “Real User Monitoring” or RUM).

Outages

  • Heroku
    • Heroku had a 29-hour impairment to their application log routing platform.

SRE Weekly Issue #105


A quick note: Friday was my last day at Heroku/Salesforce, so don’t be surprised if you see my “full disclosure” notices change.

SPONSOR MESSAGE

See how CloudBees Jenkins Solutions & VictorOps work together to bridge the on-call gap for CI/CD in this webinar. Register today. http://join.cloudbees.com/l/272242/2018-01-09/739hy

Articles

PagerDuty put a call out on Twitter, asking what folks are doing to improve the on-call experience at their companies.

Here’s part three in the series. This one’s about sharding, horizontal scaling, and client versus server complexity.

Here’s how Azure’s new availability zones change the way highly available apps can be designed on Azure.

The Meltdown patch seems to be having a disproportionate impact on Redis performance. Here’s Grab’s story of how they figured out what was up and what they did to deal with it.

I don’t often do the Twitter thing, but this chain by Charity Majors is worth reading. Is that what they call it? A chain?

Google on the advantages of Cloud Spanner’s strong consistency and why to use it. I’m still looking out for an explanation of what the downside to Spanner is…

Just to be clear, this is about how critical it is that Facebook keep their machine learning applications running, rather than using machine learning to design disaster recovery solutions.

This article is about useful error messages, which are important both for the customer experience and for operations. I’m not sure what really qualifies as a “mainframe” these days, though….

LinkedIn is open-sourcing two tools that they use for troubleshooting during incidents. Fossor automates running data-gathering tools, and Ascii Etch displays graphs using ASCII art.

Outages

  • LastPass
  • Slack
  • Spotify
  • Bitbucket
    • Bitbucket has had severe performance problems due to a failure in their storage layer.
  • Kraken (cryptocurrency exchange)
    • This appears to have been a scheduled upgrade that blew up in complexity, preventing Kraken from coming back up for two days. From the article:

      Most astonishing of all, about 36 hours after the upgrade began, Kraken apparently sent their engineers home to take a nap!

      Not that astonishing! Tired engineers make mistakes, after all.

  • Missile threat alert for Hawaii a false alarm
    • There’s so much more to this story than we’ve been told, and I really wish I could be a fly on the wall during the retrospective.

SRE Weekly Issue #104

Well, that was a fun week.  I hope all of you have had a chance for a rest after any hectic patching you might have been involved in.

SPONSOR MESSAGE

Curious about the state of on-call, but don’t have a ton of time to do the research? VictorOps has gathered the most important stats in one place for you to skim. http://try.victorops.com/SREWeekly/OnCallStats

Articles

Local Rationale: the reasoning and context behind a decision that an operator made. Here’s Todd Conklin reminding us to find out what was really going on when the benefit of hindsight makes a decision seem irrational.

In part two of the series I linked to last week, Tyler Treat introduces data replication strategies, including replicating data to all replicas before returning versus just to a quorum.
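
In case you need a refresher on the quorum variant (this is textbook material, not something specific to Tyler’s design): with N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to overlap on at least one replica whenever R + W > N.

    # Toy illustration of the quorum overlap rule.
    def quorum_overlaps(n, w, r):
        return r + w > n

    print(quorum_overlaps(n=3, w=2, r=2))  # True: every read shares a replica with every write
    print(quorum_overlaps(n=3, w=1, r=1))  # False: a read can miss the latest write entirely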

Here’s something I wasn’t aware of: hospitals have their own version of the ICS.

In this blogpost, we discuss why you should create a policy on how SREs and devs respond to SLO violations, and provide some ideas for the structure and components of that policy.
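
For a sense of the kind of number such a policy tends to revolve around, here’s a hypothetical error budget calculation over a 30-day window:

    # The unavailability an SLO permits over a window, in minutes.
    def error_budget_minutes(slo, window_days=30):
        return (1 - slo) * window_days * 24 * 60

    print(error_budget_minutes(0.999))   # ~43.2 minutes
    print(error_budget_minutes(0.9999))  # ~4.3 minutes

Whether burning through that budget triggers a feature freeze, extra SRE staffing, or just a conversation is exactly what the policy spells out.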

Now this is neat. This research team pings basically the entire internet all the time and can track outages across the globe. They can see things like Egypt shutting down Internet access for all of its citizens and the effects of hurricanes.

This is a summary of a couple of talks from Influx Days. I especially like the bit about Baron Schwartz’s talk on the pitfalls of anomaly detection.

Meltdown is especially scary because the fix has the potential to significantly impact performance.

Outages
