
Down the Rabbit Hole that is Inlining

August 29, 2013

Inlining is one of the most profitable optimizations a compiler can do. It’s profitable because it eliminates call and return overhead, but also because it can expose new optimization opportunities that were previously hidden behind an opaque call. Higgs implements all JavaScript primitives as calls to runtime library functions, which means that inlining was very much necessary in order to expose code specialization opportunities and obtain performance gains. Unfortunately, inlining is a little trickier to do in a Just-In-Time (JIT) compiler than it is in an Ahead-Of-Time (AOT) compiler. This is because inlining modifies functions by substituting the bodies of other functions into them, and, well, in a JIT compiler, the functions being modified might be executing when this happens. That’s a little bit like performing repairs on a car in the middle of a race.
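
To make this concrete, here’s a rough sketch of the idea (the name $rt_add and its exact shape are illustrative, not Higgs’s actual runtime source): a JavaScript addition compiles down to a call into a runtime library function that dispatches on the operand types.

// Illustrative sketch of a runtime primitive: roughly what "a + b"
// lowers to. The real runtime bottoms out in lower-level operations;
// this only shows the shape of the dispatch.
function $rt_add(a, b)
{
    if (typeof a === 'number' && typeof b === 'number')
        return a + b;                    // fast path: numeric add
    if (typeof a === 'string' || typeof b === 'string')
        return String(a) + String(b);    // string concatenation
    // ... remaining cases: objects, valueOf coercion, etc.
}

Once such a call is inlined at a site where both operands turn out to be integers, the type tests and dead branches can be specialized away, leaving little more than a machine add. That’s exactly the kind of opportunity that stays hidden behind an opaque call.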

In order to inline code into running functions, their execution has to be momentarily suspended, the code modified, and execution resumed. In the case of Higgs, suspending and resuming execution is done simply: functions run in the interpreter until a counter (either at the function entry or at a loop header) hits a threshold, the JIT compiler then kicks in and does its job, and execution resumes in compiled x86 code. The complex bit is that the compiled code is going to be very different from the original function. It will have new instructions from inlined functions, new SSA phi nodes, some instructions potentially removed, and most importantly: the stack frame layout likely won’t match what we had to begin with. The technique to replace stack frames of running functions is called On-Stack-Replacement (OSR), and it is notoriously difficult to implement.
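
As a minimal sketch of that trigger mechanism (all names, and the threshold value, are illustrative rather than Higgs’s actual code):

// Illustrative sketch of counter-triggered compilation. The interpreter
// bumps a counter at each function entry and loop header; once it
// crosses a threshold, the JIT compiles the function and execution
// resumes in the generated x86 code.
var JIT_THRESHOLD = 1000;  // assumed value, for illustration only

// Stubs standing in for the real compiler and OSR machinery
function compile(fun) { return { fun: fun }; }
function osrEnter(code, unit, frame) { /* rebuild the frame, jump into x86 code */ }

function onCounterBump(unit, frame)  // unit: function entry or loop header
{
    if (++unit.execCount < JIT_THRESHOLD)
        return;                       // keep interpreting

    var code = compile(unit.fun);     // compile, possibly inlining callees
    osrEnter(code, unit, frame);      // resume execution in compiled code
}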

I decided to simplify things a little bit by restricting Higgs to only do OSR for functions that are on the very top of the stack. This prevents inlining from happening inside recursive functions, which isn’t too much of an issue with the benchmarks I’ve been working with. Things are still not simple, however. Liveness information needs to be taken into account when doing OSR, because multiple values may map to the same stack slot, and just looking at the code doesn’t tell us which values need to be written into updated stack frames: we have to know what was live before inlining, at the time the function was interrupted. Another issue is that new phi nodes may be introduced as a result of inlining; these phi nodes have no direct analog in the original function. Fortunately, one can logically conclude that they must take the value of the inlined call site they correspond to. I probably spent about 12 hours debugging this, but I eventually got it all to work.
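
Here’s a sketch of the frame rebuilding step under those constraints (the data structures and names are hypothetical, for illustration only):

// Illustrative sketch of rebuilding a stack frame during OSR.
// Only values live at the interruption point are copied over, since
// several values can share a stack slot, and liveness tells us which
// one actually occupies a given slot right now.
function buildNewFrame(oldFrame, liveBefore, newSlotOf)
{
    var newFrame = {};

    for (var val of liveBefore)
    {
        if (val.isInlinePhi)
            // Phi nodes introduced by inlining have no analog in the
            // old frame; they take the value of the call they replaced
            newFrame[newSlotOf(val)] = oldFrame[val.callSite.slot];
        else
            newFrame[newSlotOf(val)] = oldFrame[val.slot];
    }

    return newFrame;
}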

My quest did not stop once OSR was working. I wanted to be able to run a peephole optimization pass on the intermediate representation of functions after inlining was done. This helps remove phi nodes, eliminate dead code, simplify control flow, and do simple constant propagation. The complexity lies in the fact that since the function we’re modifying is currently executing, we can’t just make any transformation we want. It’s not valid, for example, to replace a value that’s currently live by another value that’s no longer live at the moment where execution was stopped. A static compiler wouldn’t have to worry about such execution-time constraints, but we do. Fortunately, I found that in practice such limitations don’t really seem to affect the peephole optimizer much; it remains very effective.
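
Concretely, the extra legality check ends up looking something like this sketch (hypothetical names, not the actual Higgs pass):

// Illustrative sketch of the extra legality check the peephole pass
// needs because the function it rewrites is currently executing.
// Replacing uses of oldVal by newVal is only safe if newVal still
// exists (is live) at the point where execution was suspended.
function canSubstitute(oldVal, newVal, pausePoint, liveness)
{
    if (liveness.isLiveAt(oldVal, pausePoint) &&
        !liveness.isLiveAt(newVal, pausePoint))
        return false;  // newVal's value is gone from the paused frame

    return true;  // the usual static legality checks still apply too
}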

I decided to spend some time and try to improve the code generated for a very simple microbenchmark: a for-loop that increments an integer counter from zero to 2 billion. The results are very encouraging: it’s over 10 times faster with inlining than without, and the code generated for the loop is fairly tight (ignoring suboptimal register allocation):

mov r8, 0;                ; counter = 0
IF_TRUE(2EF9)(7):         ; loop header
cmp r8d, 2000000000;      ; compare counter to the limit
jge FOR_EXIT(2DFB)(219);  ; exit once the counter reaches 2 billion
mov ebx, r8d;             ; copy counter into a temporary
add ebx, 1;               ; increment
jo ADD_OVER(169);         ; overflow check on the add
mov r8, rbx;              ; commit the incremented counter
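
For reference, the benchmark loop is essentially the following (a minimal reconstruction, since the exact source isn’t shown here):

function loop()
{
    // Increment an integer counter from zero to 2 billion
    for (var i = 0; i < 2000000000; ++i)
    {
    }
}
loop();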

On my Core 2 Quad 2.4GHz, the loop executes in 3.2s, which puts it at around 625M iterations per second, or less than 4 CPU cycles per iteration (2.4GHz / 625M iterations per second ≈ 3.84 cycles), not bad! This microbenchmark is obviously a near-ideal test case, and I can’t claim performance gains anywhere near 1000% on more realistic benchmarks, but every benchmark I’ve tested so far benefits from inlining. There are still many simple things I can do to improve the speed of the generated code. For one, the heuristics I use to choose what to inline are fairly naive at the moment. Still, all of this is very good progress towards the paper I’m working on.

From → Assembly, Compilers, Higgs

One Comment
  1. Ben Reale

    Good work on the inline JIT optimization. I always thought it would have unique challenges, but this article has really put the problem into perspective.

    One thing though, I’m sure you meant to type ‘Core 2’ after the Code block, instead of ‘Code 2’ so just a minor typo in what is a good article.
