SRE Weekly Issue #88

SPONSOR MESSAGE

Acknowledge and resolve IT & DevOps alerts directly from Slack with the new native integration with VictorOps. Learn all about it here:
http://try.victorops.com/slack/SREWeekly

Articles

From Caitie McCaffrey:

I’m often asked how to get started with Distributed Systems, so this post documents my path and some of the resources I found most helpful. It is by no means meant to be an exhaustive list.

Julia Evans just blew my mind (once again). In this article, among other things, she links to a tool that tells you which function in the kernel dropped a packet. I’ve been wishing for such a tool for years!

I love that companies are starting to publish lessons learned from game days and other chaos experiments. Just like a post-incident followup, there’s so much we can learn by following along.

It’s an absolute must for any disaster recovery plan worth its name to include power supply as a crucial factor – because, without power, you simply can’t do business.

Here’s the last installment of Jason Hand’s digest version of his new eBook, Post-Incident Reviews.

If I leave you with one take-away from this guide, it should be that every incident provides an opportunity for your team to be more prepared for the next one.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

How can you prevent a colo failure? Obviously, colo customers can’t, but we can at least prepare. This article has advice for understanding a provider’s history, policies, and procedures related to outages.

Just click through.

In this analysis of the factors leading to a plane crash, we see another example of the critical role that human/computer interfaces play in allowing humans to recover from a system failure (or preventing them from doing so).

Move over, backhoes: water is the other natural enemy of the fiber optic network.

The New York Times has a Kafka installation containing everything they’ve published in their entire history, and it powers the front page, search, suggestions, and everything else.
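To make the “log as source of truth” idea concrete, here’s a rough sketch (not the Times’ actual pipeline; the topic and field names are made up) of rebuilding a derived view by replaying a Kafka topic from the beginning. The point is that any consumer (search, suggestions, the front page) can be rebuilt from scratch just by re-reading the log.

# Rough sketch, not the Times' actual code: rebuild a derived view (say,
# a search index) by replaying a Kafka topic from the earliest offset.
# Topic and field names are hypothetical.
import json
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    "published-assets",               # hypothetical topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",     # start from the first record ever written
    enable_auto_commit=False,
    consumer_timeout_ms=10000,        # stop iterating once we're caught up
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

search_index = {}
for record in consumer:
    asset = record.value
    # Later records for the same asset overwrite earlier ones, so the view
    # converges on the latest version of everything ever published.
    search_index[asset["id"]] = asset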

Outages

  • AbeBooks.com
    • AbeBooks is the place to go for out-of-print books and old editions. The site going down meant that many used booksellers lost a major sales outlet.
  • Gmail
  • Apple developer portal
  • Google Drive
  • iCloud Mail
  • Heroku
    • Heroku posted a pile of public followups this past week:
      • Incidents 1251 and 1254 – In both of these incidents, applications failed due to missing Debian packages normally provided by the Heroku platform.
      • Incident 1257 – For a few minutes, 10% of requests to Heroku applications hosted in Europe failed.
      • Incident 1270 – Applications last deployed over 3 years ago spontaneously stopped working.

      Full disclosure: Heroku is my employer.

SRE Weekly Issue #87

SPONSOR MESSAGE

More efficiently reach on-call teams and incident responders with a new way to deploy Live Call Routing using Twilio Functions and VictorOps. Check it out:
http://try.victorops.com/LiveCallRouting/SREWeekly

Articles

John Allspaw describes the Architecture Review Working Group at Etsy. I like the idea of an open discussion with peers before creating a novel system that will add significant operational burden.

Here’s part two of Jason Hand’s series of posts with key takeaways from his new eBook, “Post-Incident Reviews”. In the next three chapters, he shows why a traditional RCA process misses the mark.

[…] problems stem — not from one primary cause — but from the complex interplay of our systems and the teams tasked with managing them.

Honeycomb.io eschews plain monitoring in favor of “observability”, which they define as the ability to “ask any arbitrary question” about a system.

But here’s the thing: in distributed systems, or in any mature, complex application of scale built by good engineers… the majority of your questions trend towards the unknown-unknown.
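A toy illustration of what that looks like in practice (this is not Honeycomb’s SDK; every field name here is made up): instead of pre-aggregating a handful of metrics, you emit one wide, structured event per request and keep all the fields, so questions you didn’t anticipate can still be answered later by slicing on any combination of them.

# Toy sketch, not Honeycomb's SDK: emit one wide, structured event per
# request. All names here are hypothetical.
import json
import sys
import time

def do_work(request):
    # Hypothetical application logic, stubbed for the example.
    return {"ok": True}

def handle_request(request, customer_id, build_id):
    start = time.time()
    status = 200
    try:
        return do_work(request)
    except Exception:
        status = 500
        raise
    finally:
        event = {
            "timestamp": start,
            "duration_ms": (time.time() - start) * 1000,
            "endpoint": request.get("path"),
            "status": status,
            "customer_id": customer_id,
            "build_id": build_id,
        }
        # One line of JSON per request; any field becomes a dimension you
        # can filter or group by after the fact.
        json.dump(event, sys.stdout)
        sys.stdout.write("\n")

handle_request({"path": "/checkout"}, customer_id="c42", build_id="1523")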

Here’s another primer on microservices. It has a nice “caveats” section, which is exactly where operations and reliability come into the picture.

Facebook shared a lot of detail about how they evolved from 3 daily pushes to quasi-continuous releases. They’ve got a well-defined canary system, reminding me of Charity’s article on testing in production last week.

AppDynamics presents their list in shiny PDF form. You’ll have to hand over your contact info (a spam-bucket email address is recommended) to download it.

PagerDuty is hosting a “breakathon”: small teams will compete to resolve a series of infrastructure issues. Sounds like a bunch of fun!

Outages

  • Japan
    • Google accidentally announced some BGP prefixes it shouldn’t have, taking Japan offline for a couple of hours. Linked above is a really neat in-depth analysis from BGPmon, for all you BGP geeks out there.

      Since Google essentially leaked a full table towards Verizon, we get to peek into what Google’s peering relationships look like and how their peers traffic engineer towards Google.

  • Heroku
  • AWS
    • EC2’s Ireland region suffered an outage in VPC peering on August 23. Their status site doesn’t allow for deep links, so here’s an excerpt:

      11:32 AM PDT We are investigating network connectivity issues for some instances in the EU-WEST-1 Region.

      11:55 AM PDT We have identified root cause of the network connectivity issues in the EU-WEST-1 Region. Connectivity between peered VPCs is affected by this issue. Connectivity between instances within a VPC or between instances and the Internet or AWS services is not affected. We continue to work towards full recovery.

      12:51 PM PDT Between 10:32 AM and 12:44 PM PDT we experienced connectivity issues when using VPC peering in the EU-WEST-1 Region. Connectivity between instances in the same VPC and from instances to the Internet or AWS services was not affected. The issue has been resolved and the service is operating normally.

  • Google Cloud
    • Google Cloud suffered a massive 30-hour worldwide outage in some cloud load-balancers. In their impressive style, they posted frequent updates during the incident and issued a followup analysis just 2 days after resolution.

      In order to prevent the issue, Google engineers are working to enhance automated canary testing that simulates live-migration events, detection of load balancing packets loss, and enforce more restrictions on new configuration changes deployment for internal representation changes.

  • WhatsApp
  • Twitch (video streaming service)

SRE Weekly Issue #86

SPONSOR MESSAGE

More efficiently reach on-call teams and incident responders with a new way to deploy Live Call Routing using Twilio Functions and VictorOps. Check it out:
http://try.victorops.com/LiveCallRouting/SREWeekly

Articles

Charity Majors knocks one out of the park with this article on the importance of testing (safely) in production.

Why does testing in production get such a bad rap when we all do it? The key is to do it right.

And speaking of baseball metaphors, here’s a PagerDuty engineer’s first-person account of shadowing on-call during an incident and the lessons she learned.

If you have time, please consider filling out this short survey on post-incident reviews (a.k.a. “retrospectives”) as part of a master’s thesis.

Mathias Lafeldt of Gremlin Inc. gives us this tutorial on moving from hand-run chaos experiments to a fully automated chaos system.

Recently, Jason Hand’s new ebook, Post-Incident Reviews, was published. Here’s his summary of the key points in the first three chapters.

This article is published by my sponsor, VictorOps, but their sponsorship did not influence its inclusion in this issue.

This article describes metrics in three main categories and explains how (and whether) to set up alerts for each kind.

Good output metrics are a close proxy for dollars earned or saved by the system per minute.

Like the author of the previous article, Ilan Rabinovitch of Datadog advocates for symptom-based monitoring and alerting. I like his concept of the improved “durability” of symptom-based alerting (as opposed to cause-based):

[…] you don’t have to update your alert definitions every time your underlying system architectures change.
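To make the contrast concrete, here’s a toy sketch (the metric names and thresholds are made up, not Datadog’s): the symptom-based alert keys off what users actually experience and survives re-architecting, while the cause-based alert is married to one component of today’s architecture.

# Toy sketch of the distinction; thresholds and metric names are invented.
def symptom_alert(error_rate, p99_latency_ms):
    """Fires on user-visible pain: errors or slow responses.
    Still valid no matter which hosts or services sit behind the endpoint."""
    return error_rate > 0.01 or p99_latency_ms > 500

def cause_alert(db_primary_cpu_pct):
    """Fires on one suspected cause; must be rewritten if the database is
    replaced, sharded, or moved to a managed service."""
    return db_primary_cpu_pct > 90

# Example evaluation against hypothetical metric samples:
print(symptom_alert(error_rate=0.002, p99_latency_ms=620))  # True: users are hurting
print(cause_alert(db_primary_cpu_pct=75))                   # False: the "cause" metric looks fine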

Our systems are always in flux, and this sometimes leads to failure. Mathias expands on this line of thinking, urging us to seek out the many conditions that led to a failure rather than a single root cause.

Hosted Graphite had a gnarly problem to solve: how to get information about overload conditions from the backend to the frontend, where throttling could be enacted.
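The article covers their actual solution; purely to illustrate the general shape of the problem, here’s one naive way a backend might signal overload so the frontend can shed load at the edge. The class and names below are mine, not theirs.

# Hypothetical sketch of backend-to-frontend backpressure (not Hosted
# Graphite's implementation): the backend flags overloaded tenants in a
# shared store, and the frontend checks the flag before accepting writes.
import time

class OverloadSignals:
    """Tiny in-process stand-in for a shared store such as Redis."""
    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self._flags = {}  # tenant -> expiry timestamp

    def mark_overloaded(self, tenant):   # called by the backend
        self._flags[tenant] = time.time() + self.ttl

    def is_overloaded(self, tenant):     # called by the frontend
        expiry = self._flags.get(tenant)
        return expiry is not None and expiry > time.time()

signals = OverloadSignals()

def enqueue(tenant, datapoint):
    # Hypothetical hand-off to the backend ingestion pipeline.
    return "accepted"

def frontend_accept(tenant, datapoint):
    if signals.is_overloaded(tenant):
        return "throttled"   # shed load at the edge instead of piling up work
    return enqueue(tenant, datapoint)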

Outages

SRE Weekly Issue #85

SPONSOR MESSAGE

Being on-call sucks – but is it getting better? See what 800+ professionals have to say about being on-call in VictorOps’ annual “State of On-Call” report.
http://try.victorops.com/StateofOnCall/SREWeekly

Articles

Here’s Charity Majors with another gem about how ops looks in the era of distributed systems.

You simply can’t develop quality software for distributed systems without constant attention to its operability, maintainability, and debuggability.

I hope most of you have been reading up on the infamous “Googler manifesto”, and if so, maybe you’ve already seen this article. What caught my eye is the emphasis on people-oriented engineering, because these are the skills that have become increasingly important to me as an SRE.

A key metric goes through the roof and pages you. Why? Answering that can be really easy if you can quickly see the changes deployed to your system around the same time. This article is about a specific product that solves this problem and is thus a bit advertisey, but it’s still a good read.

Here’s a good argument for anomaly detection. Great, but I have yet to see anomaly detection that I trust! That said, it was still an interesting read thanks to the real-world story about a glitch Wal-Mart faced.

For the Java crowd, here’s a primer on Resilience4j, a framework that makes it easier to write code that can recover from errors.

I like the description of their “The Watch” pager rotation in which developers periodically serve.

Grab engineers talk about migrating from Redis to ElastiCache veeeery carefully.

In a nutshell, we planned to switch the datasource for the 20k QPS system, without any user experience impact, while in a live running mode.

Outages

  • Paragon (game)
    • Epic Games released version 42 of Paragon, and the new version unexpectedly overloaded their servers. To get back to a good state, they were forced to develop new code and upgrade a DB on the fly.
  • FedEx
  • SYNQ
    • As mentioned here previously, SYNQ has committed to posting their incident RCAs publicly. In this one, they identified a need for better regression testing.

SRE Weekly Issue #84

SPONSOR MESSAGE

Being on-call sucks – but is it getting better? See what 800+ professionals have to say about being on-call in VictorOps’ annual “State of On-Call” report.
http://try.victorops.com/StateofOnCall/SREWeekly

Articles

How many minutes per month is 99.95% availability? What about 99.957%? Here’s a tool that’ll give you a quick answer, by the author of awesome-sre.
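The underlying arithmetic is simple enough to sanity-check by hand: allowed downtime is (1 - availability) times the length of the window. For example, assuming a 30-day month:

# Allowed downtime = (1 - availability) * window length, here a 30-day month.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes

def downtime_minutes(availability_pct, window_minutes=MINUTES_PER_30_DAY_MONTH):
    return (1 - availability_pct / 100) * window_minutes

print(round(downtime_minutes(99.95), 2))   # ~21.6 minutes per month
print(round(downtime_minutes(99.957), 2))  # ~18.58 minutes per month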

This article is a partial transcript of Catchpoint’s Chaos Engineering and DiRT AMA.

In chaos engineering, we’re saying, “Look, these systems are just too complicated at this point to reasonably expect an engineer to know all of the properties that a system has.”

Somewhat intro-level, but I like this little gem:

[…] we have to put some requests at risk for the sake of protecting our overall availability. We want to keep that risk to a minimum. This raises the question: What is the smallest experiment we can run that still gives us confidence in the result?

This article chronicles New Relic’s attempt to test a new system to prove that it was ready for production.

SQS, Kafka, and others tout features like “exactly once” and “FIFO”, but there are necessarily some pretty big caveats and edge cases to those features that really can’t be ignored.
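Whatever the broker promises, the usual defensive posture is to make consumers idempotent anyway. Here’s a minimal sketch of consumer-side deduplication (the names are made up, and a real system would persist the seen-ID set durably rather than in memory):

# Minimal sketch of consumer-side idempotency; names are illustrative.
# Even with "exactly once" or FIFO features, redeliveries can happen across
# failures and retries, so the consumer deduplicates on a message ID.
processed_ids = set()  # in production: durable storage, not process memory

def apply_side_effect(message):
    # Hypothetical business logic; must be safe to retry if we crash mid-way.
    pass

def handle(message):
    msg_id = message["id"]
    if msg_id in processed_ids:
        return "skipped duplicate"
    apply_side_effect(message)
    processed_ids.add(msg_id)  # recorded only after the effect succeeds
    return "processed"

print(handle({"id": "m-1", "body": "hello"}))  # processed
print(handle({"id": "m-1", "body": "hello"}))  # skipped duplicate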

Really, the title should be "The Google SRE Model". This article discusses Google's philosophy that the SRE team is optional for any given system, but if SRE isn't around, the owning team should be doing what SRE would otherwise be doing.

SYNQ pushes for transparency in incident response and commits to publishing their RCAs publicly (like this one). They also include a simple template for RCAs at the end of the article.

Outages

  • AWS
    • us-east-1 had another one-AZ network outage.
  • Poloniex (altcoin exchange)
  • Skype
  • British Airways
  • Canada
    • A large portion of Canada had a major mobile phone and internet outage due to a fiber cut.
  • Heroku
    • Heroku has had a string of major outages, marked as red on their status page. Apologies for not linking to them individually as they happened; here’s a link to their historical list instead. No public statement has been posted yet.

      Full disclosure: Heroku is my employer.
