Tracking down a memory leak

May 6, 2024

Tl;Dr: I followed this guide to debug a memory leak in a Sidekiq job. In my case, the leak was caused by a thread local array buffer in lograge-sql that never got cleared.

With PlainAPM infra moved away from Heroku, one of the first monitors I implemented were for the app’s memory, as previously Heroku metrics showed a persistent growth for the Sidekiq process:

Growing memory usage of a Ruby Sidekiq process, Heroku metrics

Initially, I thought it was a memory fragmentation issue, and that the process is allocating objects in a way that doesn’t allow releasing the memory back to the OS. I tried switching to jemalloc, which is supposedly better at handling fragmentation than the standard malloc, and also has a built-in support for leak checking.

But it didn’t help:

Linearly growing memory usage of a Ruby Sidekiq process, a Grafana chart

Luckily, I stumbled upon the above mentioned guide and from there, things were relatively straight-forward.

I had it a bit simpler than Jacopo in the original post, as I was able to reproduce the issue locally, and so did not need to be careful around production. I installed rbtrace so I could capture heap dumps from a running process, enabled trace_object_allocations:

# in bin/sidekiq binstub generated by Bundler

if ENV["TRACE_OBJECT_ALLOCATIONS"]
  require "objspace"
  ObjectSpace.trace_object_allocations_start
end

triggered the Sidekiq workload in between the three captures, and used heapy diff on the heap dumps to get the retained objects:

$ heapy diff /tmp/0.json /tmp/1.json /tmp/2.json  | head -10
Retained STRING 14994 objects of size 9596160/9611080 (in bytes) at: /home/plainapm/.gem/ruby/3.3.1/gems/lograge-sql-2.3.2/lib/lograge/sql.rb:53
8< --- snipped for brevity ---

This probably should have been enough to point me to the right direction, but I continued to follow the blog post, and obtained the address info

$ grep "file.*lib/lograge/sql.rb.*line.*53" /tmp/2.json | head |  grep -o "address\":\"[^\"]\+\""
address":"0x7fa51f9c0c90"
address":"0x7fa51f9c5510"
address":"0x7fa51f9c6910"
%< --- snipped for brevity ---

and the path between the objects with sheap:

$ sheap /tmp/0.json /tmp/2.json
$before: #<Sheap::Heap (269834 objects)>, $after: #<Sheap::Heap (289990 objects)>, $diff: #<Sheap::Diff (20338 objects)>
irb#1():001> $diff.after.find_path($diff.after.at("0x7fa51f9c0c90"))
=>
[#<ROOT vm  (2757 refs)>,
 #<DATA 0x7fa5218baa10 ractor (15 refs)>,
 #<DATA 0x7fa51fbadf58 VM/thread (10 refs)>,
 #<DATA 0x7fa51f92f600 fiber (220 refs)>,
 #<HASH 0x7fa51f8ec6c0  (1 refs)>,
 #<ARRAY 0x7fa51f950288  (4990 refs)>,
 #<STRING 0x7fa51f9c0c90 "<some string>">]

This, and looking into the code finally made me realise what’s up:

I’m a big fan of canonical log lines, so in this case, I was using lograge to get a single line per request that can be easily parsed later on.

To be able to log Rails’ SQL queries this way, I also added a lograge-sql gem.

Under the hood, this library uses an ActiveRecord event subscriber to capture SQL queries into a thread local buffer.

In the usual case, e.g. while processing a request, the buffer would get passed over to lograge through a hook into the extract_request method. This would emit the log line, but also clear the store.

However, grepping and rummaging through the lograge code revealed that it only activates for ActionController and optionally, ActionCable.

For ActiveJob, or a class including Sidekiq::Job, extract_request would never get called, and the buffer would grow indefinitely with every executed SQL query.

For now, I fixed the leak by disabling lograge whenever Sidekiq.server? is not nil. Now, this looks much better:

Memory usage chart of a Ruby process staying flat after fix was deployed