SRE Weekly Issue #130

SPONSOR MESSAGE

SRE is only as important as your customers. Building a culture of reliability with your customers in mind is essential to building robust, user-friendly systems. Learn about the costs of unreliability and why customers care:

http://try.victorops.com/sreweekly/customer-focused-SRE

Articles

Segment discovered the hard way that their move to a microservice architecture had brought far more problems than benefits. Here’s why they transitioned back and how they pulled it off. Awesome article!

Alexandra Noonan — Segment

Drawing on the work of Dr. David Woods and the rest of the SNAFU Catchers, this article discusses the concepts behind resiliency and how to measure and achieve it.

Beth Long — New Relic

Serverless is not the magical gateway to the land of NoOps. You still have to operate your system even if you’re not directly running the servers. This article does a great job of explaining why.

Bhanu Singh — Network World

New to me: Wireshark’s statistics view and how it can be useful.

Julia Evans

How do you define whether your system is available and healthy? This article draws an analogy to medical health.

Claiming that our system is doing well means nothing if users can perceive an outage.

José Carlos Chávez — Typeform

These folks are experiencing mysterious latency with HTTP/2 traffic to their ALB and published this report on their investigation. There’s no happy ending here — ultimately they disabled HTTP/2 support. I hope they update if they do discover the culprit.

Peter Forsberg — ShopGun

I had some fun this week unearthing the cause for the chronic lockups in Rsyslog that we’ve experienced at work. I found the cause (overeager retries of socket writes) and put together a bug report and a pull request.

Full disclosure: Fastly, my employer, is mentioned.

I love science! Grab wrote a nifty tool to help them select cohorts of users and perform experiments on them.

Abeesh Thomas and Roman Atachiants — Grab

Titus is the container orchestration system that Netflix created and open sourced. Rather than building a new auto-scaling feature for Titus, they instead got Amazon to productize EC2’s auto-scaling engine as a generalized auto-scaling tool, which Netflix then integrated with Titus. Neat!

See Amazon’s Application Auto Scaling announcement, published this past week.

Andrew Leung, Amit Joshi, and the rest of the Titus team — Netflix

Outages

SRE Weekly Issue #129

SPONSOR MESSAGE

Aggregate monitoring techniques alongside time series data can improve overall system visibility and reliability. Take SRE to the next level with these aggregate monitoring methods:

http://try.victorops.com/SREWeekly/Aggregate-Monitoring

Articles

What do you do when your hosts have kernel crashes at random every day? It turns out that you don’t need to be a seasoned kernel programmer to find a solution.

Pavlos Parissis — Booking.com

This is my first introduction to tcpconnect (part of BCC). Pretty nifty!

fREW Schmidt

At Facebook, […] It is simply too difficult to rewrite caching/admission/eviction policies and other manually tuned heuristics by hand. We have to fundamentally change how we think about software maintenance.

Vladimir Bychkovsky, Jim Cipar, Alvin Wen, Lili Hu, and Saurav Mohapatra — Facebook

A couple weeks back, I linked to a postmortem template. Here’s a gameday report template from the same author.

Michael Kehoe

I had a really hard time choosing whether to include this one. On the one hand, it’s a really interesting article about service discovery in franchises that has to work right every time. On the other hand, Chick-fil-A has a terrible track record on GLBT rights, and I can’t overlook that.

Ultimately, I’m choosing to link to this article for its educational content, but I urge you to join me as I continue to boycott Chick-fil-A.

Brian Chambers, Caleb Hurd, and Alex Crane — Chick-fil-A

At 9 years old, this may be the oldest article I’ve linked to, but it’s worth it. The analogy to a home mortgage is spot on.

Eric Lee

Click through to read about an interesting monitoring challenge and an account of how they solved it. I appreciate the emphasis on the importance of educating engineers to spread the knowledge of how the new system works among more people.

Joy Zheng and Jeeyoung Kim — Plaid

Another chaos engineering introduction. Why should you read it? If nothing else, the architecture diagram with the skull and cobwebs on it is pretty great. It’s also well worth reading if you’re looking to create a chaos engineering game plan.

Benjamin Wilms — Codecentric

Sometimes, a reliability risk can come in the form of a bunch of angry customers.

Ben Kuchera — Ars Technica

Outages

SRE Weekly Issue #128

SPONSOR MESSAGE

Looking to go serverless? Beau Christensen, VictorOps Director of Platform Engineering, and Tom McLaughlin, Founder of ServerlessOps, sat down to talk about when VictorOps decided to venture into AWS:

http://try.victorops.com/SREWeekly/going-serverless

Articles

Humor for SREs! This is the most hilarious thing I’ve read all week.

James Mickens — USENIX ;login:logout

This focuses on various ways that Linux systems can fail to boot.

Chris Siebenmann

A (raw) transcript of a chat about Bloomberg’s adoption of SRE practices. It might be worth dropping it in a text editor and removing all occurrences of the phrase “sort of”. The real meat is in the discussion of what Bloomberg has learned (text search: “lessons learned”) and how to sell SRE as necessary in a company (text search: “ROI”).

Alan Shimel — devops.com

Channels employs three time-honored techniques to deliver these messages at low latency: fan-out, sharding, and load balancing. Let’s look inside the box!

Jim Fisher — Pusher

An in-depth explanation of how consistent hashing works. Love the hand-drawn diagrams!

Srushtika Neelakantam — Ably
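
For readers who want something to poke at alongside the article, here is a minimal sketch of a consistent hash ring. It is not taken from the linked post: the class name HashRing, the use of MD5 for placement, and the default of 100 virtual points per node are all assumptions made for illustration.

```python
import bisect
import hashlib


def _position(key: str) -> int:
    # Use a stable hash; Python's built-in hash() is randomized per process.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent hash ring with virtual points per node."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points per node (assumed default)
        self._points = []         # sorted positions on the ring
        self._owner = {}          # position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Place several virtual points for the node around the ring.
        for i in range(self.replicas):
            point = _position(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node):
        # Drop only this node's points; every other point stays put.
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # The first point clockwise from the key's position owns the key.
        idx = bisect.bisect_right(self._points, _position(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

The property worth noticing is that removing a node only reassigns the keys that node owned; keys on the remaining nodes do not move.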

Have you ever needed to generate a random number in code? whether it’s for rolling a dice, or shuffling a set, this tweet thread is here for you! There’s no reason that it should be easy or obvious, very experienced programmers repeat common mistakes. I did, before I learned …

Not strictly SRE-related, but then again it’s by Colm MacCárthaigh, who is SRE-related.

Colm MacCárthaigh

What should you do if you blow your error budget? Depends on whether you leaked it like a dripping faucet or splurged it all on big outages. Either way, you’ll need to investigate and make a plan.

Adrian Hilton, Alec Warner and Alex Bramley — Google
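
To make the faucet-versus-splurge distinction concrete, here is a rough sketch of my own, not the method from the article. The function names and the big_share cutoff (one incident accounting for at least half of the downtime counts as a "big outage" pattern) are hypothetical choices.

```python
def error_budget_minutes(slo: float, window_minutes: float) -> float:
    # Allowed downtime in the window, e.g. slo=0.999 over 30 days.
    return (1.0 - slo) * window_minutes


def classify_burn(incident_minutes, slo, window_minutes, big_share=0.5):
    """Label how the budget was spent; big_share is an assumed cutoff."""
    budget = error_budget_minutes(slo, window_minutes)
    spent = sum(incident_minutes)
    if spent <= budget:
        return "within budget"
    # Over budget: did one incident dominate, or did many small ones add up?
    largest = max(incident_minutes)
    return "big outages" if largest / spent >= big_share else "slow leak"
```

With a 99.9% SLO over a 30-day window (43,200 minutes), the budget is about 43 minutes: a single 50-minute outage lands in "big outages", while thirty 2-minute blips land in "slow leak".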

I love the two-method approach: a simple migration path for users that aren’t active all the time, and a more careful (and more complex) path for very busy users.

Xiang Li and Thomas Georgiou — Facebook

If you haven’t implemented alerts on support page views yet, do it now!! and thank me later. Here’s a view of how our dashboard looked as of a few minutes ago – a clear demonstration of user impact that supplements existing monitors and alerts.…

Click through for the graph. Monitor status and support page views… do we actually need any other monitoring? Only half-kidding.

Sri Harsha Kalavala

Outages

  • Google BigQuery
    • Google posted a followup analysis of the BigQuery outage on June 22.

      A new release of the BigQuery API introduced a software defect that caused the API component to return larger-than-normal responses to the BigQuery router server.

  • Fastly
    • Full disclosure: Fastly is my employer.
  • G Suite Status Dashboard
  • Slack
    • This week, Slack had a ~3-hour, near-total outage. Click through for their followup post.

      The network problems were caused by a bug included in an offline batch process of data, which resulted in unexpected network spikes and led all of our customers to become disconnected and unable to reconnect.

  • Google Home and Chromecast

SRE Weekly Issue #127

It’s a jam-packed issue this week!  After a few light issues, suddenly everyone decided to publish awesome SRE-related content all at once.  Nice work, folks!

SPONSOR MESSAGE

Creating on-call schedules for your SRE team(s) can be challenging. We’ve put together a short list of tips, tricks, and tools you can use to better organize your on-call rotations and help your SRE efforts:

http://try.victorops.com/SREWeekly/SRE-On-Call-Tips

Articles

Visa wrote a letter to the Chair of the Treasury Committee of the UK House of Commons, explaining their outage from a few weeks ago and answering the questions they posed. The good bits are in the first few pages, and the question answers mostly reiterate them. The last question about steps to prevent recurrence has some additional detail.

[…] a component within a switch in our primary data centre suffered a very rare partial failure which prevented the backup switch from activating.

Visa

This is really nifty!

The website has two sections: Country Statistics and Traffic Shifts.

Such an awesome idea:

@eanakashima: Alerting on spikes in status page views: so wrong, or so right?

Emily Nakashima
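
If you want to try the idea, one simple way to define a "spike" is to compare the latest count against a trailing baseline. This is only a sketch under my own assumptions: the function name is_spike, the three-sigma threshold, and the min_views floor are illustrative and do not come from the tweet.

```python
from statistics import mean, pstdev


def is_spike(history, current, sigmas=3.0, min_views=20):
    """Return True if `current` views exceed the trailing baseline.

    history: recent per-interval view counts (for example, per minute).
    sigmas and min_views are assumed tuning knobs, not recommendations.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    # The floor keeps a very quiet page from alerting on a handful of views.
    threshold = max(baseline + sigmas * pstdev(history), min_views)
    return current > threshold
```

A real alert would probably also want to account for daily and weekly traffic patterns, which this sketch ignores.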

How (and why) should an SRE team communicate with Dev and the rest of the organization? I especially enjoy the section on how communicating outwardly helps SRE.

HostedGraphite

o11ycon has posted a Call for Failures:

Send us a slide or two, including a graph or other visual artifact of observability that represents the worst day of your (professional) life. Or a graph that drives home some important, deeply unexpected, or just plain interesting point about your systems.

o11ycon

There’s a great description of their current setup, but what really makes this article awesome is the explanation of what was wrong with their old system and why they replaced it.

Shlomi Noach — GitHub

Highlights of this article:

  • description of the pros and cons of two techniques for automating database migrations
  • a surprising number of instances of the word “tentacle”

Hen Peretz — BlazeMeter

Rather than firing the driver that caused a rear-end collision, this company looked deeper and found an underlying flaw in their procedures.

The organization had unknowingly created a system that was risk-promoting, rather than risk-averse.

Larry Boxman and Paul LeSage — Journal of Emergency Medical Services

Outages

  • NPM (nodeJS package manager)
    • This status posting is minimal, but there’s a deeper story at play here. There’s this article:

      Twitter bought an anti-harassment startup and immediately shut it down

      And this tweet by Laurie Voss (npmjs COO):

      @seldo: A vendor notified us of their acquisition at 6am this morning and shut down their APIs 30 minutes later, creating a production outage for npm (package publishes and user registrations). The sheer unprofessionalism of this is blowing my mind.

      Ouch.

  • Datadog
    • These delays may result in “no data” alert conditions for Metric Monitors, to avoid spurious alerts we’ve temporarily disabled these alert types.

  • DIRECTV NOW
    • In the midst of suffering a major outage to their DIRECTV NOW OTT service, AT&T announced the official launch of AT&T WatchTV […]

  • Algeria
    • Algeria switched off its internet on Wednesday in an attempt to prevent cheating on exams.

      Algeria’s blackout can be seen in Oracle’s Internet Intelligence project, which maps web access globally.

      Rory Smith — CNN

  • Atlassian Statuspage (statuspage.io)
    • We have identified the issue as errant traffic from a single customer and have taken action to mitigate the issue, which appears to only affect status pages. The Management Portal is working as normal.

  • New Relic
  • GCP Networking in us-east1
  • Azure North Europe region
    • An environment control system failure caused a huge rise in humidity, taking down some equipment. Huge shout-out to the Microsoft employee who reached out to me to let me know that they saw my call for help last week and forwarded it on to the folks responsible for the status page!

SRE Weekly Issue #126

SPONSOR MESSAGE

Alert fatigue will kill team morale. Take a look at some great ways to avoid alert fatigue and why it’s important for employee health and incident resolution speed:

http://try.victorops.com/SREWeekly/Avoiding-Alert-Fatigue

Articles

Our friends in the GrabFood team now save up to 70% development time on creating a new service. We have also recorded improvements in stability and availability of our services.

Karen Kue and Michael Cartmell — Grab

Some tips on surviving peak traffic as we head into World Cup season. I like the discussion in #10 (load testing): accurately testing your CDN is all but impossible.

Hadar Weiss — Peer 5 (CDN)

This is a video recording of a talk by Charity Majors at Monkigras 2018. She has a lot of awesome stuff to say about making on-call enjoyable and owning your code, including this gem:

Babies, by the way, are engineered by evolution to be too cute for you to want to kill them. Your code is not.

Charity Majors — Honeycomb

A power disruption occurred at our service provider resulting in a number of instances going offline. Heroku databases running on these instances were impacted.

Presumably this was the us-east-1 power issue I reported on in Issue 124.

The first article in this new series is about the evolution of the Network Engineer into a Network Reliability Engineer. It’s part of the broader breakdown of silos with the goal of understanding holistic reliability.

Michael Kehoe

I hadn’t realized that GDPR has provisions related to site/service reliability.

Theresa Abbamondi — Netscout

To shamelessly steal a line from this recorded talk, it’s very rarely the right thing for your observability system’s scale to match that of the system it’s observing. To avoid that, you need to throw away some event data rather than storing and indexing everything. How do you do that while still achieving functioning observability?

Ben Hartshorne — Honeycomb

I’m looking forward to seeing where this article series goes. Database changes can be a huge reliability risk, and getting them right is critical.

Bob Walker — Octopus Deploy

Outages

  • Azure south-central US region
    • A load spike in a backend storage system caused impact across a range of Azure services, according to the RCA linked above.

      Actually, I’ve linked to their generic “status history” page, since that seems to be as specific as I can get. Readers from Microsoft, perhaps you could ask the folks that run the Azure status page to create dedicated permalinks for each incident, or at least for each RCA? Even an anchor link in the status history page would be super-awesome!

  • New Relic infrastructure alerting
  • Travis CI
  • WhatsApp
  • American Airlines
  • Instagram
  • Google Compute Engine
    • While instances were stopped (shut down), newly-launched instances were allowed to take their IPs. The stopped instances then failed on startup due to the IP conflicts. The situation lasted for around 20 hours.
  • Optus Sport
    • World Cup fans had issues watching through Optus. World Cup streaming traffic is massive this time around.
  • Apple Maps
  • Netflix
  • .my TLD