Mike Pall | 10 Sep 19:26 2013

LuaJIT 2.1 Profiler released

The git repo for LuaJIT 2.1 now features an integrated statistical
profiler with very low overhead.

There are plenty of options to generate flat lists, two-level
views, combined sampling, raw output and annotated source code.
The profiler is able to sample stacks, VM states and zones.

Here are some simple examples (for more see the docs):

$ luajit -jp spectralnorm.lua 4000
53%  A
24%  Av
23%  Atv

$ luajit -jp=s spectralnorm.lua 4000
53%  A
  <- 50%  Av
  <- 50%  Atv
24%  Atv
  <- 100%  AtAv
23%  Av
  <- 100%  AtAv

$ luajit -jp=l2 spectralnorm.lua 4000
33%  spectralnorm.lua:21 < spectralnorm.lua:28
33%  spectralnorm.lua:13 < spectralnorm.lua:27
13%  spectralnorm.lua:7 < spectralnorm.lua:13
12%  spectralnorm.lua:7 < spectralnorm.lua:21
 5%  spectralnorm.lua:6 < spectralnorm.lua:21
 4%  spectralnorm.lua:6 < spectralnorm.lua:13

$ luajit -jp=a spectralnorm.lua 4000
====== spectralnorm.lua ======
 <at>  <at>  3  <at>  <at> 
      | -- contributed by Mike Pall
      | local function A(i, j)
  10% |   local ij = i+j-1
  25% |   return 1.0 / (ij * (ij-1) * 0.5 + i)
      | end
      | local function Av(x, y, N)
      |   for i=1,N do
      |     local a = 0
  32% |     for j=1,N do a = a + x[j] * A(i, j) end
      |     y[i] = a
      |   end
      | end
 <at>  <at>  18  <at>  <at> 
      | local function Atv(x, y, N)
      |   for i=1,N do
      |     local a = 0
  32% |     for j=1,N do a = a + x[j] * A(j, i) end
      |     y[i] = a
      |   end
      | end

Stack sampling works for both interpreted and JIT-compiled code.
The results for JIT-compiled code may sometimes be surprising
since LuaJIT heavily optimizes and inlines Lua code.

Please note that it's not a goal to add every possible advanced
functionality to the bundled high-level profiler. The low-level
Lua and C APIs for the profiler are fully documented. Third-party
authors are welcome to use these building blocks and offer their
own extensions, such as graphical profilers or IDE integration.

Documentation is available from the 2.1 branch of the git
repository in the file: doc/ext_profiler.html

Instructions on how to access the git repo are here:

Feedback welcome!


As you might have noticed, I had to change my plans compared to
the original approach presented in June. The main problem with the
instrumenting profiler was finding high-precision and high-speed
timing sources for all platforms.

Even on x86/x64, where RDTSC is readily available, the results
were disappointing. The necessary pipeline flushes shadowed the
actual timings up to the point where the measurements were less
accurate than with a sampling profiler! Other platforms offered
only inaccurate timing sources or none that are accessible from
user mode. And to top it off, the instrumentation added
considerable overhead. Much less than traditional solutions using
Lua hooks, but much more than I was willing to accept.

Ok, so I had to scrap that work and decided to go with a sampling
profiler. I've previously mentioned that multi-platform support
for a sampling profiler would be challenging. But it seemed to be
more promising than pouring even more time into a dead end. I've
implemented three timing/sampling mechanisms: setitimer+SIGPROF,
POSIX threads and Windows threads. This should cover nearly all
platforms (even the consoles).

Here's how it works: the asynchronous signal or thread only sets a
flag and patches the dispatch table for the interpreter. Careful
arrangement of accesses to the shared state avoids explicit
synchronization -- eventual consistency is sufficient for a
profiler that has to cope with timer inaccuracies, anyway.

Whenever the interpreter executes the next instruction, the
profiler callback is invoked synchronously (Lua callbacks are
invoked on a separate coroutine). The callback samples the stack
with a fast stack dump helper and aggregates all samples. The
stored samples are analyzed after the profiling run is finished.

The JIT-compiler adds extra checks to the generated code at the
desired granularity (function or line) when profiling is active.
The machine code exits to the interpreter when the flag is set.
This will eventually invoke the profiler callback.

VM states are sampled when the timer expires, not when the
callback is later invoked. This allows accurately reporting VM
states: interpreted vs. compiled vs. running in C code vs. garbage
collector overhead.

Note: I've dropped the originally planned basic block mode since
the results were confusing and line profiling turned out to be
cheap enough.

And a final word: One key point often overlooked when discussing
instrumenting vs. statistical profilers: sampling a million times
a second vs. hundred times a second does not significantly improve
the ability to pinpoint (and fix) the hotspots of a program. It
doesn't matter if a piece of code that takes exactly 20.7% of
the runtime is shown as taking 18% or 23% -- you'll have to
investigate, anyway. A profiler is not a benchmarking tool.