Getting down with Stackprof: how we added N+1 query detection to Scout

Let's get the confusing part out of the way: we're going to get a little "meta". This post is about using one performance tool (Stackprof) to improve another performance tool (Scout's Rails Agent) so Scout can diagnose a common performance pitfall (N+1 queries).

Did I just put "performance" in a sentence three times? You can see I'm a little obsessed.

There's plenty written about why N+1 queries are bad. Is there anything good about an N+1?

Yes! Finding an N+1 is like finding a $20 bill in your couch. It's a performance-enhancing gift and often an easy fix - assuming you can find the spot in your code that is emitting the query.

...and therein lies the problem. It is difficult to trace a problematic N+1 query in a complex app running in production back to a line of code. In fact, I don't know of a tool you can run in production today that does this. Why is it difficult? Grabbing backtraces is not cheap - you can't call Kernel#caller all willy-nilly in your instrumentation. That means you're left with dead ends like this when trying to track down an N+1:

[Screenshot: an N+1 query in Scout with no backtrace to the code that triggered it]

In the above screenshot, each ArDescriptor#find query executes quickly (22 ms), but there are 23(!) queries with the same SQL, adding up to nearly 500 ms.
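If it's been a while since you've stared one down, the classic shape of an N+1 looks something like this (hypothetical models - not the app in the screenshot above):

# One query loads the orders, then each pass through the loop fires another
# query for the associated customer: 1 + N queries in total.
Order.limit(25).each do |order|
  puts order.customer.name
end

# The usual fix is to eager-load the association, collapsing it to two queries.
Order.includes(:customer).limit(25).each do |order|
  puts order.customer.name
end

The hard part in a production app isn't fixing a loop like that - it's figuring out which of your many loops is the one emitting those 23 identical queries.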

Now, I don't know about you, but I find this incredibly frustrating. Armed with Stackprof, we set out to see: could we make it performant to find the source of N+1s with Scout?

Why finding the source of an N+1 is hard

From an instrumentation perspective, it's easy to fetch a backtrace from a single slow call in a web request: just see if the call took longer than some threshold and grab the caller info. Here's how we did this in Scout prior to this experiment.
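In rough strokes, that logic looks something like this (a simplified sketch - the method and attribute names are illustrative, not Scout's exact API):

# Only pay for a backtrace when a layer was slow enough to justify the cost.
BACKTRACE_THRESHOLD = 0.5 # seconds - the 500 ms threshold mentioned below

def stop_layer(layer)
  layer.record_stop_time!
  if layer.total_call_time > BACKTRACE_THRESHOLD
    layer.store_backtrace(caller)
  end
end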

Collecting caller information isn't cheap. That's why the 500 ms threshold is in place in the linked code above. For giggles, I modified our agent to collect caller information on EVERY instrumented layer and measured against our representative endpoint benchmark. Here's how the numbers lined up:

Benchmark Name       Response Time (Mean)   Response Time (95th Percentile)   Response Time (Max)   Overhead
Existing Logic       38.9 ms                63.5 ms                           88.1 ms               -
N+1 Experiment #1    65.8 ms                100.83 ms                         131.7 ms              69%

Blindly grabbing the caller from every layer adds almost 70% overhead. While that would let us trace N+1s to their source, the medicine is worse than the disease.

Enter Stackprof.

Initial Stackprof Results

When I see poor results in a benchmark, I'll re-run it with Stackprof. Haven't heard of Stackprof? It's a sampling call-stack profiler for Ruby 2.1+. What's magical about including Stackprof in your performance workflow? If you rely solely on benchmarking to improve performance, you'll often limit the scope of your work to one change at a time, comparing before and after. You don't have to run those controlled experiments with Stackprof: it's trivial to see which methods - even which LOC within those methods - are consuming the most time.
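If you haven't used it before, wrapping a chunk of work in Stackprof's block API looks roughly like this (the output path and the profiled method are just placeholders):

require 'stackprof'

# Capture a wall-clock profile of whatever runs inside the block.
StackProf.run(mode: :wall, interval: 1000, out: 'tmp/stackprof-wall.dump') do
  run_representative_requests # placeholder for your benchmark workload
end

The resulting dump is then inspected with the stackprof command-line tool, which is what you'll see throughout the rest of this post.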

Here's a snippet of the Stackprof results from the above experiment, filtered to just our agent (ScoutApm) from experiment #1:

stackprof tmp/stackprof_scout_memory/stackprof-wall-4*
==================================
  Mode: wall(1000)
  Samples: 3176013 (21.44% miss rate)
  GC: 307766 (9.69%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
    731697  (23.0%)      731697  (23.0%)     ScoutApm::Layer#store_backtrace
    ...
    778228  (24.5%)         192   (0.0%)     ScoutApm::TrackedRequest#stop_layer

I've cut the frames outside of my modifications. Now, there's a little dancing with Stackprof...a little juking. It's not always a straight line to a problem: my change was in #stop_layer, not in #store_backtrace.

So, why isn't the extra overhead showing in #stop_layer? Let's zoom into #store_backtrace and see where all the samples are coming from:

stackprof tmp/stackprof_scout_memory/s* --method store_backtrace
ScoutApm::Layer#store_backtrace (/home/vagrant/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/bundler/gems/scout_apm_ruby-74935344e8e3/lib/scout_apm/layer.rb:69)
  samples:  731697 self (23.0%)  /   731697 total (23.0%)
  callers:
    731697  (  100.0%)  ScoutApm::TrackedRequest#stop_layer
  code:
                                      |    69  |     def store_backtrace(bt)
  729999  (23.0%) /  729999  (23.0%)  |    70  |       return unless bt.is_a? Array
     876   (0.0%) /     876   (0.0%)  |    71  |       return unless bt.length > 0
     822   (0.0%) /     822   (0.0%)  |    72  |       @backtrace = bt
                                      |    73  |     end

return unless bt.is_a? Array seems pretty trivial, right? Why is that line responsible for 23% of all samples?

Here's why: #stop_layer passes the caller to #store_backtrace as a parameter, but it isn't actually fetched until the LOC above. Stackprof can isolate samples to a single line, but it can't divide that line any further - lines are the lowest common denominator in Stackprof. So, rather than passing Kernel#caller straight into #store_backtrace, let's grab the result first, then pass it through. More like this:

[Screenshot: diff moving the Kernel#caller fetch onto its own line in #stop_layer]
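In spirit, the change looks something like this (a rough sketch, not the exact diff in the screenshot):

# Before: the Kernel#caller call hides inside the argument list, so its cost
# shows up against a line inside #store_backtrace.
layer.store_backtrace(caller)

# After: fetch the backtrace on its own line, then pass the result through.
# Stackprof now attributes those samples to the line doing the actual work.
bt = caller
layer.store_backtrace(bt)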

Here's what Stackprof gives us for ScoutApm::TrackedRequest#stop_layer now:

[Stackprof output: the samples now land on the Kernel#caller line inside #stop_layer]

Now, that clarifies things. Our overhead is clearly in the caller collection.

Stackprofin' Lesson: if Stackprof points you to a LOC doing several things, divide that work into multiple lines.

Lookup Hash?

It's clear that blindly grabbing the caller for every layer is too expensive. However, the great thing about an N+1 query is that there's no reason to fetch caller information on repeated, identical calls. We really only need to do it once.

How about a lookup hash keyed by layer name (ex: "User find" for an ActiveRecord query) that counts the calls to that layer over the life-cycle of the request? We'll then only fetch caller information if there are 5 or more of the same SQL query and the time between the start of the first call and the current call exceeds 300 ms. Why 300 ms? It's a reasonable threshold for an N+1 query with ample potential to be optimized.
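Here's a minimal sketch of that idea (class and method names are illustrative - this isn't Scout's actual CallSet):

class NPlusOneTracker
  CALL_THRESHOLD = 5     # same SQL seen at least this many times
  TIME_THRESHOLD = 0.300 # seconds elapsed since the first call (300 ms)

  def initialize
    @calls = Hash.new { |hash, key| hash[key] = { count: 0, first_seen_at: nil } }
  end

  # Record one call for this layer name (ex: "User find") and return true once
  # it looks like an N+1 worth paying for a backtrace.
  def record(layer_name)
    entry = @calls[layer_name]
    entry[:first_seen_at] ||= now
    entry[:count] += 1

    entry[:count] >= CALL_THRESHOLD && (now - entry[:first_seen_at]) >= TIME_THRESHOLD
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

Each request would get its own tracker, and only when record returns true would the agent pay the Kernel#caller tax.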

Here are the results of experiment #2:

Benchmark Name       Response Time (Mean)   Response Time (95th Percentile)   Response Time (Max)   Additional Overhead
Existing Logic       38.9 ms                63.5 ms                           88.1 ms               -
N+1 Experiment #2    49.5 ms                132.0 ms                          2,129 ms              27%

Well shoot - that's still 27% slower. How can that be? Stackprof anyone?

==================================
  Mode: wall(1000)
  Samples: 847204 (24.38% miss rate)
  GC: 55050 (6.50%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
     53959   (6.4%)       53959   (6.4%)     ThreadSafe::NonConcurrentCacheBackend#[]
     47493   (5.6%)       47491   (5.6%)     ActiveSupport::PerThreadRegistry#instance
     38160   (4.5%)       36597   (4.3%)     ScoutApm::Utils::SqlSanitizer#sql
     ...

Well, that's odd...I didn't change anything in SqlSanitizer. So, what's calling this method? I'll run stackprof ... --method ScoutApm::Utils::SqlSanitizer#sql, and inspect the callers:

 callers:
    234994  (  100.0%)  ScoutApm::Utils::SqlSanitizer#to_s_postgres
      15  (    0.0%)  ScoutApm::Utils::SqlSanitizer#to_s

I'll repeat on #to_s_postgres:

  callers:
    242865  (  100.0%)  ScoutApm::Utils::SqlSanitizer#to_s

And I'll keep going, this time with #to_s:

callers:
    247727  (   99.9%)  ScoutApm::CallSet#update!

Boom! That is interesting - CallSet is the new class I created to track N+1 activity. Next, I'll run stackprof ... --method ScoutApm::CallSet#update! and look at the line-by-line breakdown.

Sidenote: why did I manually work through the callers instead of using Stackprof's graphviz output? I find the generated call tree too large to read and slow to generate.

It's all coming from a line where I'm logging at the debug level, even though the app is logging at the info level. What gives? The contents of that string are still being evaluated. I jump into the Logger docs and see a helpful note: Logger can take a block, and the block is only evaluated when the message's severity meets the current log level.

Back to the slow logging: I'll just pass a block through instead of a String:

ScoutApm::Agent.instance.logger.debug { "Updating call set w/ [#{item}]" }
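To make the difference concrete, here's a minimal illustration (expensive_call is a stand-in for the string interpolation that was eating those samples):

require 'logger'

# A stand-in for work that's only worth doing if the message will be written.
def expensive_call
  sleep(0.05)
  "call set contents"
end

logger = Logger.new($stdout)
logger.level = Logger::INFO

# String argument: the interpolation runs (and expensive_call is invoked)
# even though the DEBUG message is ultimately discarded.
logger.debug("Updating call set w/ [#{expensive_call}]")

# Block argument: Logger never calls the block because DEBUG is below the
# current level, so expensive_call never runs.
logger.debug { "Updating call set w/ [#{expensive_call}]" }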

And re-run the benchmarks:

Benchmark Name       Response Time (Mean)   Response Time (95th Percentile)   Response Time (Max)   Additional Overhead
Existing Logic       38.9 ms                63.5 ms                           88.1 ms               -
N+1 Experiment #3    40.6 ms                64.8 ms                           82.3 ms               4%

That's less than 2 ms of overhead - a tradeoff I'll happily take to easily track down expensive N+1 queries.

Fast-forward to the end

I'll be honest: there were some side roads we had to take along the way that I didn't cover above, but in the end, we've gotten to the LOC triggering N+1 calls:

[Screenshot: an N+1 query in Scout traced back to the triggering line of code]

Try Scout's N+1 query detection

N+1 detection is available in version 1.5.0 of our gem. Try Scout for free and let us know how we're doing finding those N+1s. I think you'll be quite happy.

PS - learn more about Stackprof and Rails performance by subscribing to our finely curated newsletter via the form on the right.