
Circumventing the D Garbage Collector

I recently wrote about performance issues I ran into with Higgs. Compilation times are currently unacceptably slow, and I tracked the cause down to performance problems in D’s garbage collector. Someone familiar with the GC internals explained that the GC scales poorly, with total collection time increasing quadratically as a function of the amount of memory allocated. Despite this poor scaling being an obvious flaw, I received criticism from people telling me that my compiler was simply allocating “too many objects”, that I was doing something wrong, and that I should be implementing my own custom memory allocation strategies because garbage collectors were always going to be slow.

Despite the naysayers, Andrei Alexandrescu, co-creator of D, announced that he would be rewriting D’s GC to fix these performance problems, and his announcement was well received in the D community. Unfortunately for me, this GC rewrite doesn’t seem to be a top priority, and I’m not sure how many months it will take before a new collector replaces the current one. It seems conceivable that it could take over a year, or that it might happen after I finish my PhD studies. In a sense, the naysayers are right: with the current GC situation, I have no choice but to circumvent the GC to get acceptable performance.

I decided to do some research into alternative memory management strategies. I first started by looking at the way it’s done in Google’s V8 JIT. Their strategy is a system of memory “zones”: large memory arrays in which a pointer is bumped to allocate memory, which is very fast. They generate a function’s Abstract Syntax Tree (AST) and its Intermediate Representation (IR), allocating all node objects into a zone, and then discard all of these objects at once when they are no longer needed. The cost of deallocating an entire zone is constant. This strategy is clever, but it isn’t sufficient for Higgs. I could see myself using something like this for my AST and IR (and maybe I will), but there are other objects in the system which need more fine-grained deallocation.
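To make the idea concrete, here is a minimal bump-pointer zone sketched in D. This is my own illustration of the concept, not V8’s actual implementation (V8’s zones are C++ and considerably more sophisticated):

import core.stdc.stdlib : malloc, free;

/// Minimal bump-pointer zone: allocation just bumps a pointer,
/// and the entire zone is discarded at once in constant time
struct Zone
{
    ubyte* mem;
    size_t used;
    size_t capacity;

    this(size_t capacity)
    {
        this.mem = cast(ubyte*) malloc(capacity);
        this.capacity = capacity;
    }

    void* alloc(size_t size)
    {
        // Round up to 16 bytes to keep allocations aligned
        size = (size + 15) & ~cast(size_t) 15;
        assert(used + size <= capacity, "zone exhausted");
        auto ptr = mem + used;
        used += size;
        return ptr;
    }

    /// Free every object in the zone with a single call
    void freeAll()
    {
        free(mem);
        mem = null;
        used = 0;
    }
}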

Quite possibly, I’ll end up implementing object pools with free lists. This wouldn’t be quite as blazingly fast as zones, but it would result in less memory waste, and it would be usable for all of the often-allocated kinds of objects in Higgs. The folks on the D forum were kind enough to point me to useful resources on how to manually allocate class instances in D.
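Here’s a rough sketch of what such a pool might look like (my own simplified version, not final Higgs code; actual class instances would be constructed into pool memory using the manual allocation techniques from those resources):

import core.stdc.stdlib : malloc;
import std.algorithm : max;

/// Minimal fixed-size object pool: released slots are threaded
/// onto a singly-linked free list and reused by later allocations
struct Pool(T)
{
    static struct Slot { Slot* next; }

    Slot* freeList;

    T* alloc()
    {
        if (freeList !is null)
        {
            // Reuse a previously released slot, O(1)
            auto slot = freeList;
            freeList = slot.next;
            return cast(T*) slot;
        }

        // No recycled slot available: fall back to fresh memory
        // (a real pool would carve many slots out of large blocks)
        return cast(T*) malloc(max(T.sizeof, Slot.sizeof));
    }

    void release(T* obj)
    {
        // Push the slot back onto the free list, O(1)
        auto slot = cast(Slot*) obj;
        slot.next = freeList;
        freeList = slot;
    }
}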

Out of curiosity, before seriously considering implementing an object pool, I decided to try and benchmark the D garbage collector. I knew there was a performance issue, but I hadn’t precisely quantified it. I was curious to see how much time would be needed to allocate a few million objects, and so I wrote the following test program:

[membench test program listing]
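In spirit, membench does something like the following: allocate millions of small node objects through the GC and time how long it takes. The exact class layout and allocation counts below are my stand-ins for the original listing:

import core.time : MonoTime;
import std.stdio : writefln;

class Node
{
    Node next;
    long val;
}

void main()
{
    foreach (n; [1_000_000, 2_000_000, 4_000_000, 8_000_000])
    {
        auto start = MonoTime.currTime;

        // Build a linked list so every node stays reachable,
        // forcing the GC to scan a steadily growing heap
        Node head = null;
        foreach (i; 0 .. n)
        {
            auto node = new Node;
            node.next = head;
            head = node;
        }

        auto elapsed = MonoTime.currTime - start;
        writefln("%s allocations: %s ms", n, elapsed.total!"msecs");
    }
}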

The results, timed on my Core i7 3.4GHz with 16GB of RAM, were as follows:

[Plot: total time as a function of the number of node allocations]

The curve above is not perfectly quadratic, but if anyone doubted that the D GC has a performance problem, I think that this should be proof enough, with 8 million node allocations taking over a minute on a very recent CPU. In comparison, Higgs itself is able to execute a similar JavaScript benchmark with 10 million objects allocated and initialized in less than 10 seconds.

Audiobooks are Awesome

It’s a sad admission to make, but I’ve always found reading books tedious and uncomfortable. Paper books cause me eye strain and I can never seem to find a comfortable position to hold them in. Computer screens aren’t much better: I have a hard time reading multiple pages in one sitting, and reading a book one page at a time is rather ineffective. I’m not sure if I have some kind of low-grade attention deficit, or if it’s a side-effect of programming and using the internet for so many years, but I prefer for written information to come in bite-sized pieces.

I’ve always loved science fiction, but until recently, my only exposure to it came from movies and TV shows. This changed after I acquired a smartphone (my beloved Nexus 4). I found out that there were several free audiobook playback apps in the Google Play store and that many of the greatest science fiction classics were available in audiobook form. I decided to give this a try and listened to Neuromancer, the ultimate cyberpunk classic. I was instantly hooked.

Since discovering audiobooks just a few months ago, I’ve “read”:

  • Neuromancer
  • Burning Chrome
  • Dune
  • Dune Messiah
  • Children of Dune
  • Foundation
  • The Positronic Man
  • The Last Theorem
  • I, Robot
  • Buddha’s Brain
  • Robot Dreams
  • Hyperion

Listening to audiobooks is a relaxing experience, a way to take my mind off of things while having a story narrated to me, a way to take a break. What’s even better is that this entertainment doesn’t take time away from productive activities at all. I can listen to audiobooks while I’m walking somewhere, on the subway or on the bus, while I’m cooking alone, running errands, in a waiting room or even while brushing my teeth. I just have to whip out my phone and earbuds and press play.

The audiobook app I use (Smart Audiobook Player) is nice enough to automatically stop playback if the earbuds get unplugged, and to go back some number of seconds when playback is resumed. This means that if I run into someone I want to talk to, I can simply unplug the earbuds and stop playback without missing anything. The earbuds I bought ($10 Panasonics) provide good enough sound isolation to use on the subway or in other loud environments.

I implied earlier that I have a hard time staying focused on paper books. I’m one of those people who will be reading a book and eventually realize they’re rereading the same sentence for the 5th time. You might think that staying focused on audiobooks is just as hard, but I find it a little easier, and I’ve been getting better at it with practice.

In a way, listening to audiobooks can be seen as a form of mindfulness practice (a form of meditation). If I let my mind run away in thoughts, I lose track of the story and I might want to rewind a bit to try and pay more attention, but I find myself needing to do this less and less. What maybe makes this less tedious than with paper books is that there seems to be less effort involved in listening to something than in reading.

I’m currently listening to The Fall of Hyperion and enjoying it very much so far. I plan to listen to many more audiobooks. It’s come to the point where when I’m between books, I miss having something to listen to. Fortunately, there are many great science-fiction classics out there which have been highly acclaimed and are available in audiobook format.

If you’re a computer geek and you’re looking for something that doesn’t take itself too seriously, I would highly recommend collections of short stories by Isaac Asimov, such as I, Robot. I found them amusing and relaxing to listen to. For rich and complex fictional universes, you might want to check out Dune, Neuromancer or Hyperion. Scott Brick, narrator of the Dune audiobook I listened to, is an excellent voice actor who even makes convincing female voices. His Wikipedia page lists several other books he has narrated.

Faster than V8 *


There’s an asterisk next to the title because in most cases, Higgs doesn’t beat the performance of Google’s V8, quite the opposite. I’ve mentioned in my post about D’s garbage collector performance that Higgs suffers from very slow compilation times. Higgs also lacks the sophisticated inlining capabilities of V8, along with a plethora of micro-optimizations needed to perform well on widely-used benchmarks. Hence, I was both surprised and pleased when I found out that Higgs performs better than V8 on a microbenchmark.

The microbenchmark in question is a simple loop incrementing a global counter variable from 0 to one billion:

// Loop to 1B
// Note: i is a global variable
for (i = 0; i < 1000000000; ++i)
{
}

On my Core i7 laptop running 64-bit linux, I get the following performance results (also replicated on an i7 desktop):

time d8 loop-global-incr.js

real 0m2.686s
user 0m2.672s
sys 0m0.008s

time ./higgs loop-global-incr.js

real 0m2.415s
user 0m2.368s
sys 0m0.040s

This might seem quite pointless as it’s a trivial piece of code, not at all representative of the complexities of a real program. As I said earlier, on real benchmarks, V8 usually wins by a wide margin. Still, I do think that this is interesting. Why? Because in this simple instance, I’m able to generate better quality machine code. Despite its complex multi-tiered architecture, type-feedback and type analysis capabilities, V8 actually does a fairly poor job of optimizing this loop, whereas the machine code generated by Higgs is fairly tight.

Optimizing this simple loop is actually nontrivial. In JavaScript, global variables are properties of the global object. As such, these properties could turn out to be accessors (getter and setter methods). Global properties can also be deleted, in which case reading them should throw an exception. There is also the issue that the addition and less-than operators can operate on any type without throwing an error. Hence, to generate effective machine code, you have to prove that the global variable “i” is not an accessor, that it won’t be deleted and that it will remain an integer. If you can’t prove those facts, then the machine code you generate will contain redundant type checks.

The optimized machine code generated by V8 is quite bulky and cryptic:

— Optimized code —
optimization_id = 0
source_position = 0
kind = OPTIMIZED_FUNCTION
stack_slots = 2
Instructions (size = 346)
0xc579e565ae0 0 55 push rbp
0xc579e565ae1 1 4889e5 REX.W movq rbp,rsp
0xc579e565ae4 4 56 push rsi
0xc579e565ae5 5 57 push rdi
0xc579e565ae6 6 4883ec10 REX.W subq rsp,0x10
0xc579e565aea 10 488b45f8 REX.W movq rax,[rbp-0x8]
0xc579e565aee 14 488945e0 REX.W movq [rbp-0x20],rax
0xc579e565af2 18 488bf0 REX.W movq rsi,rax
0xc579e565af5 21 493ba540070000 REX.W cmpq rsp,[r13+0x740]
0xc579e565afc 28 7305 jnc 35 (0xc579e565b03)
0xc579e565afe 30 e8bdd5fcff call StackCheck (0xc579e5330c0)
0xc579e565b03 35 33c0 xorl rax,rax
0xc579e565b05 37 48bb681c51f000110000 REX.W movq rbx,0x1100f0511c68 ;; property cell
0xc579e565b0f 47 4d8b55b0 REX.W movq r10,[r13-0x50]
0xc579e565b13 51 4c3913 REX.W cmpq [rbx],r10
0xc579e565b16 54 0f84f6000000 jz 306 (0xc579e565c12)
0xc579e565b1c 60 488903 REX.W movq [rbx],rax
0xc579e565b1f 63 e911000000 jmp 85 (0xc579e565b35)
0xc579e565b24 68 4883ec08 REX.W subq rsp,0x8
0xc579e565b28 72 488b45f8 REX.W movq rax,[rbp-0x8]
0xc579e565b2c 76 488b5d10 REX.W movq rbx,[rbp+0x10]
0xc579e565b30 80 e908000000 jmp 93 (0xc579e565b3d)
0xc579e565b35 85 488b5d10 REX.W movq rbx,[rbp+0x10]
0xc579e565b39 89 488b45e0 REX.W movq rax,[rbp-0x20]
0xc579e565b3d 93 48ba681c51f000110000 REX.W movq rdx,0x1100f0511c68 ;; property cell ptr
0xc579e565b47 103 488b12 REX.W movq rdx,[rdx] ;; read counter from global
0xc579e565b4a 106 f6c201 testb rdx,0x1 ;; type tag test
0xc579e565b4d 109 0f8553000000 jnz 198 (0xc579e565ba6)
0xc579e565b53 115 48c1ea20 REX.W shrq rdx,32 ;; unboxing counter
0xc579e565b57 119 81fa00ca9a3b cmpl rdx,0x3b9aca00
0xc579e565b5d 125 0f8d32000000 jge 181 (0xc579e565b95)
0xc579e565b63 131 493ba540070000 REX.W cmpq rsp,[r13+0x740] ;; ???
0xc579e565b6a 138 0f8261000000 jc 241 (0xc579e565bd1); jump on carry
0xc579e565b70 144 83c201 addl rdx,0x1 ;; counter increment
0xc579e565b73 147 8bca movl rcx,rdx ;; redundant move
0xc579e565b75 149 48c1e120 REX.W shlq rcx,32 ;; boxing counter
0xc579e565b79 153 48ba681c51f000110000 REX.W movq rdx,0x1100f0511c68 ;; property cell ptr
0xc579e565b83 163 4d8b55b0 REX.W movq r10,[r13-0x50] ;; ???
0xc579e565b87 167 4c3912 REX.W cmpq [rdx],r10 ;; ???
0xc579e565b8a 170 0f8487000000 jz 311 (0xc579e565c17) ;; ???
0xc579e565b90 176 48890a REX.W movq [rdx],rcx ;; write counter to global
0xc579e565b93 179 eba8 jmp 93 (0xc579e565b3d) ;; jump to loop header
0xc579e565b95 181 48b82141d0c48b060000 REX.W movq rax,0x68bc4d04121
0xc579e565b9f 191 488be5 REX.W movq rsp,rbp
0xc579e565ba2 194 5d pop rbp
0xc579e565ba3 195 c20800 ret 0x8
0xc579e565ba6 198 4d8b5500 REX.W movq r10,[r13+0x0]
0xc579e565baa 202 4c3952ff REX.W cmpq [rdx-0x1],r10
0xc579e565bae 206 751a jnz 234 (0xc579e565bca)
0xc579e565bb0 208 f20f104207 movsd xmm0,[rdx+0x7]
0xc579e565bb5 213 f20f2cd0 cvttsd2sil rdx,xmm0
0xc579e565bb9 217 0f57c9 xorps xmm1,xmm1
0xc579e565bbc 220 f20f2aca cvtsi2sd xmm1,rdx
0xc579e565bc0 224 660f2ec1 ucomisd xmm0,xmm1
0xc579e565bc4 228 7504 jnz 234 (0xc579e565bca)
0xc579e565bc6 230 7a02 jpe 234 (0xc579e565bca)
0xc579e565bc8 232 eb8d jmp 119 (0xc579e565b57)
0xc579e565bca 234 e86304caff call 0xc579e206032 ;; deoptimization bailout
0xc579e565bcf 239 eb86 jmp 119 (0xc579e565b57)
0xc579e565bd1 241 50 push rax
0xc579e565bd2 242 51 push rcx
0xc579e565bd3 243 52 push rdx
0xc579e565bd4 244 53 push rbx
0xc579e565bd5 245 56 push rsi
0xc579e565bd6 246 57 push rdi
0xc579e565bd7 247 4150 push r8
0xc579e565bd9 249 4151 push r9
0xc579e565bdb 251 4153 push r11
0xc579e565bdd 253 4156 push r14
0xc579e565bdf 255 4157 push r15
0xc579e565be1 257 488d6424d8 REX.W leaq rsp,[rsp-0x28]
0xc579e565be6 262 488b75f8 REX.W movq rsi,[rbp-0x8]
0xc579e565bea 266 33c0 xorl rax,rax
0xc579e565bec 268 498d9d451eb1fe REX.W leaq rbx,[r13-0x14ee1bb]
0xc579e565bf3 275 e82806faff call 0xc579e506220 ;; CEntryStub
0xc579e565bf8 280 488d642428 REX.W leaq rsp,[rsp+0x28]
0xc579e565bfd 285 415f pop r15
0xc579e565bff 287 415e pop r14
0xc579e565c01 289 415b pop r11
0xc579e565c03 291 4159 pop r9
0xc579e565c05 293 4158 pop r8
0xc579e565c07 295 5f pop rdi
0xc579e565c08 296 5e pop rsi
0xc579e565c09 297 5b pop rbx
0xc579e565c0a 298 5a pop rdx
0xc579e565c0b 299 59 pop rcx
0xc579e565c0c 300 58 pop rax
0xc579e565c0d 301 e95effffff jmp 144 (0xc579e565b70)
0xc579e565c12 306 e8f303caff call 0xc579e20600a ;; deoptimization bailout
0xc579e565c17 311 e80c04caff call 0xc579e206028 ;; deoptimization bailout

At the start, there’s a long sequence of instructions which seems to be a function entry prelude. This is not part of the loop body. There are many instructions in this machine code which are never executed. You can see, for example, that there is floating-point code. That code was generated in case an overflow might occur, but this can’t happen here. There are also deoptimization bailouts (calls into the JIT compiler) which are never needed. This code is unnecessary, but won’t affect performance. The instructions inside the loop body which do affect performance, but could potentially be eliminated, are the boxing, unboxing and type-test instructions: in each loop iteration, the counter variable is unboxed (using a shift instruction), its type tag is tested, and then it is boxed again.

In contrast, Higgs, using basic block versioning, is able to eliminate the type test from the loop body. Higgs also uses an unusual scheme where type tags are stored separately from value words. Because of this, Higgs does not need to box and unbox object property values. Since the type of the counter variable is known to always be an integer in this case, type tags don’t need to be touched at all inside the loop body. Furthermore, in Higgs, code is lazily generated, so there is no floating-point or bailout code. There are no long sequences of instructions that are never executed.
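As a rough illustration of this tag/word separation (a simplified sketch, not Higgs’s actual definitions):

// Type tags live in slots parallel to the value words, instead of
// being packed into the values themselves
enum Type : ubyte { INT32, FLOAT64, REFPTR /* ... */ }

ulong[] wordSlots; // raw value words
Type[]  typeSlots; // one type tag per slot

void incrSlot(size_t idx)
{
    // Where versioned code has proven typeSlots[idx] == Type.INT32,
    // the increment touches only the value word: no shift to unbox,
    // no tag test, no re-boxing, and the tag slot is never written
    wordSlots[idx] += 1;
}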

The machine code produced by Higgs for the loop body is as follows:

for_test(19337):
; $3 = shape_get_def $1, "i" => capture_cont(19386)
mov rdx, 140451224576896; rdx = shape pointer, unused

capture_cont(19386):
; $8 = shape_get_prop $1, $3
mov r10, [qword rsi + 2616];

; $0 = lt_i32 $6, 1000000000
cmp r10d, 1000000000;
jge branch_for_exit(1933A);
nop5;

for_body(19338):
; $4 = shape_get_def $1, "i" => capture_cont(193A5)
mov r8, 140451224576896; r8 = shape pointer, unused

; $11 = shape_get_prop $1, $4
mov rdi, [qword rsi + 2616];

; $17 = add_i32_ovf $9, 1 => call_merge(193FF), if_false(19410)
add edi, 1;
mov eax, 9795; eax = block idx, for eventual bailout
jo branch_if_false(19410); overflow test

; $14 = shape_get_def $1, "i" => capture_cont(19429)
mov r9, 140451224576896; r9 = shape pointer, unused

; shape_set_prop $1, "i", $13
mov [qword rsi + 2616], rdi;

; jump => for_test(19337)
jmp for_test(19337);

The machine code above is far from optimal. It contains redundant moves and overflow checks which are not yet optimized away. I’m also fully aware that V8 is a moving target in terms of performance: the V8 team is already working on a better optimizing compiler, and they will no doubt improve upon the current situation. Still, it’s nice to see that all the work I’ve put into basic block versioning is paying off. In the general case, Higgs doesn’t beat V8, and things are likely to remain that way, but in this specific instance, Higgs is able to generate better quality machine code than a commercial compiler which has tens of man-years of engineering behind it. It’s a small victory, but I find it very encouraging nevertheless.

The tests I performed were done with V8 version 3.29.66 (current SVN trunk revision), and the current version of my development branch. If someone has a recent version of the SpiderMonkey shell built, I’d be curious to see the generated code and performance results for this microbenchmark. I wouldn’t be surprised if SpiderMonkey did better than V8 and Higgs, considering it sports a powerful hybrid type inference engine.

Making Waves: D’s GC Problem Followup

Front Page News


Two days ago, late in the evening, I wrote a blog post about performance issues I’d attributed to D’s garbage collector. I didn’t expect this blog post to draw much attention. I was hoping that some of the people working on the D runtime library might see it, but I didn’t publicize it any further than tweeting the link. My goal was only to draw some amount of attention to the problem; to provide some numbers so I could quantify the GC overhead a bit.

Someone decided the post was interesting enough to submit it to reddit’s r/programming as well as Hacker News. It’s also been discussed on the D forums. In total, about 18000 people ended up viewing it, and over 160 comments were posted. My main conclusion is that controversial topics are a good way to draw attention. I succeeded in my goal of drawing attention to the problem, on a much bigger scale than I’d intended: now everyone knows about it. I sincerely apologize for the bad publicity I may have caused D, but I’m glad that the problem has been recognized and that something will be done about it.

Andrei Alexandrescu, co-creator of D, posted on reddit to say that he wants to rewrite the GC based on his own allocator design. Dicebot of Sociomantic has also said that they are working on porting their concurrent garbage collector to the current version of D. I don’t know how these two efforts will combine, but I’m sure that something good will come of it. A high performance language needs a high performance runtime, and with a brand new GC, the D programming language will be much more competitive.

Insightful Details

People with knowledge of the D GC internals ended up posting on the reddit thread and providing insight into why the D garbage collector is currently inefficient.

thedeemon wrote the following:

The main issue from the post is quite simple: if you allocate a lot of small objects, D runtime will place them inside chunks (called pools) of fixed size (like 4 MB) that it requested from the OS. Each time a pool is filled up it will do a GC cycle and only if there’s still no room for new object it will allocate a new pool. That means number of GC cycles in this situation is proportional to number of objects you allocate, while each GC cycle time is proportional to the managed heap size. It leads to quadratic complexity and awful slowness. If runtime allocated pools not of fixed size but adapted their size to the allocation rate, this scenario could be significantly faster.
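To spell out why this behavior is quadratic: suppose pools hold a fixed P bytes and each object takes s bytes. Allocating n objects then triggers roughly ns/P collections, and the k-th collection scans a heap of roughly kP bytes. Summing over all the collections gives total GC work on the order of P + 2P + … + (ns/P)·P ≈ (ns)² / (2P), which grows quadratically with n.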

deadalnix added this:

There are many other problems. Amongst others:

  • D’s GC rely on a global lock in which it goes through every time.
  • D’s GC scan many location even when they cannot contain pointers, which do not make any sense.
  • D’s GC do not take advantage of D’s type system, when other experiment like OCaml’s GC have shown that there is great benefice going down that road.

To summarize the issues outlined, the D GC:

  1. Has potentially quadratic complexity with respect to the number of allocated objects.
  2. Uses a global lock while traversing objects.
  3. Can only use one thread to perform collections.
  4. Does not take advantage of the D type system, and unnecessarily scans non-pointer data.
  5. Does not use lazy sweeping, a common mark & sweep GC optimization.

Fortunately, the worst issue (quadratic growth) can be fixed without much effort.

For those who are curious, or may want to help, here is a link to the source code of the D GC. If you’re interested in contributing and helping to fix this problem, you should head to the D forums and get in touch with the community, or try contacting Andrei directly. Even if you don’t have time to contribute, your feedback may be useful in the design of a better GC.

Dismissive Criticism

Unfortunately, not everyone was welcoming of the discussion I brought forward. Criticism mostly seemed to come from developers who are fundamentally opposed to the use of garbage collection, and view the whole concept in a very negative light. These people put forth the same dismissive and uninformed line of reasoning, which can be summarized as follows:

  1. Garbage collectors are inherently slow.
  2. Since Higgs suffers from GC overhead, Higgs must be doing “too many” allocations.
  3. Because of #1, it’s pointless to improve the D GC. It will always be slow.
  4. It’s the responsibility of the application writer to avoid using the GC.

I’ll say this: one of the main reasons I chose D over C++ was that D has a garbage collector. I had written a JIT compiler in C++ previously and I found it tedious to manage memory allocations manually. Compilers are inherently allocation-heavy, because they must represent and transform symbolic expression trees, graphs and a large amount of metadata. Having to take memory allocation and release into account every time you wish to transform such a tree or graph is highly error-prone.

To avoid using the GC in Higgs, I’d basically have to implement my own specialized memory managers (my own allocation and garbage collection) for each of the data structures used in the compiler. Saying that I shouldn’t use the GC is basically denying any responsibility the D GC might have, and shifting the burden onto me, the application developer, to implement my own specialized memory management for each of the many use cases where I need dynamic memory allocation.

In my previous blog post, I demonstrated two things:

  1. In a normal execution of Higgs, the GC may take more than 75% of the total execution time.
  2. Simple tweaking of GC parameters can reduce the GC overhead tremendously, producing a 3x speedup.

What’s clear is that the D garbage collector scales very poorly. In fact, I’d say that the scaling is often pathological. Twice the allocations may mean the total collection time quadruples: a quadratic growth. This means that there’s a ceiling on the number of objects the GC can be expected to reasonably manage. There’s a ceiling on the scale of applications using the GC. If you go beyond this ceiling, your application will slow down to a crawl. You want to write a multithreaded web server in D? You’re probably going to have to manage memory manually. Java doesn’t suffer from this weakness, so why should D?

Higgs is a program written in what I believe to be a reasonably efficient style. There aren’t many layers of abstraction in the implementation. It doesn’t create what I would deem to be an unreasonable number of allocations for what it does. I’m sure that if I put forth the engineering effort, I could reduce my reliance on the GC and improve performance. In fact, I already have. However, improving the D GC is something that will clearly benefit all users of the language. A future compiler update might improve the performance of some D programs by an order of magnitude. Working towards fixing poor GC performance is much more helpful than blaming those who are affected by it.

D’s Garbage Collector Problem

One of the biggest issues with Higgs, the JIT compiler I’m working on, is that the compilation times are too slow. The performance of the compiled machine code that Higgs produces is now fairly good. Still below that of V8 and SpiderMonkey, but getting closer, and orders of magnitude above what you would get from an interpreter. Unfortunately, in most benchmarks, the total execution time is now dominated by the time it takes for Higgs to compile JavaScript all the way down to x86 machine code.

This is a problem that I haven’t really tackled. For one, my PhD research is hugely focused on the performance of the compiled code, and my advisor is always pushing me to get better results in that area. Another issue is that, quite frankly, I don’t really know why Higgs is so slow to compile code. I’ve been relatively careful to avoid unnecessary overhead when implementing it, and my attempts at profiling have been rather unfruitful. There is no obvious low-hanging fruit. From what I’m seeing, every step in the compilation pipeline is fairly slow.

I did find one thing that stood out. My implementation of a liveness analysis scales poorly with function size. It allocates a huge bit matrix (vector of bit sets) for liveness computation. This grows quadratically with the function size, which means that in some cases, obscene amounts of memory are required, and this data structure doesn’t fit in the cache. Seeing this, I did the obvious thing and rewrote the liveness analysis to use simple lists as sets of live values. The results were rather perplexing: the liveness analysis was now faster, but the rest of the compilation process had slowed down so that overall, the total compilation time was slower than before. No fair.
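To put rough numbers on the bit matrix (my own back-of-the-envelope figures, not measurements from Higgs):

// One bit per (instruction, value) pair means memory grows
// quadratically with function size
size_t bitMatrixBytes(size_t numInstrs, size_t numValues)
{
    return (numInstrs * numValues + 7) / 8;
}

// e.g. a function with 20,000 instructions and 20,000 values needs
// bitMatrixBytes(20_000, 20_000) == 50,000,000 bytes (~50 MB),
// far larger than any CPU cache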

I had an inkling of what might be wrong. Turns out it’s a D garbage collector issue I’ve already run into and talked about in my DConf 2014 talk.

To assess performance, I wrote a test.js program which assigns to a global variable and then reads from it 60000 times:

a = 0;
a;
a;
// ... (the read repeated 60000 times in total)
a;

The sad news is, compiling this program is unbelievably slow:

./higgs --perf_stats --nostdlib test.js

comp time (ms): 4504
exec time (ms): 2
total time (ms): 4507
code size (bytes): 633036
peak mem usage (KB): 443608

But now, if I add the following line to the beginning of the main() function:

// Reserve 1GB for the D GC
GC.reserve(1024 * 1024 * 1024);

The results become very different. Preallocating a large chunk of memory for the D heap reduces the compilation time by 66%:

./higgs --perf_stats --nostdlib test.js

comp time (ms): 1570
exec time (ms): 2
total time (ms): 1572
code size (bytes): 633036
peak mem usage (KB): 760828

And now if we both preallocate memory and disable the GC with the following:

// Reserve 512MB for the D GC and disable collections
GC.reserve(512 * 1024 * 1024);
GC.disable();

Then we get:

./higgs --perf_stats --nostdlib test.js

comp time (ms): 968
exec time (ms): 2
total time (ms): 971
code size (bytes): 633036
peak mem usage (KB): 798372

This is both interesting and perplexing. It tells us that originally, the D GC overhead took over 75% of the total execution time. By reserving memory, we can reduce the number of garbage collections and eliminate most of this overhead. What’s probably happening is that my test program almost fills the heap, but doesn’t cause it to expand. Then, we have a situation where the GC is “thrashing”, and getting repeatedly triggered every few allocations.

So, always preallocate memory then? Unfortunately, I found out through experimentation that choosing the amount of memory to preallocate requires magic voodoo. Sometimes preallocating less memory yields better performance. What seems clear is that the D GC needs better heuristics to decide when to expand the heap to avoid thrashing. This would be much more effective than me trying to guess how much memory to reserve to get around the problem.

There might be another issue here as well. I suspect that D memory allocation itself is quite slow and could be further optimized. It would also be beneficial to optimize the GC algorithm (potentially using multithreading) to make collections faster. In any case, it seems very clear to me that something needs to be done about GC performance in D. As it is, this performance issue stands out like a sore thumb.

Systems programmers like myself care very much about performance, but in the current situation, it seems I can’t improve Higgs compilation times without refactoring my code to avoid memory allocations at all costs. Doing so would take quite a bit of my time, and invariably make the code much less maintainable. The point of this post isn’t to point fingers and yell at people, but to highlight a very real and fundamental problem which, in my opinion, will scare away many potential D programming language adopters.


Update: September 10, 7:39 PM

This post gathered much more attention than I expected. It made the front page on Hacker News and r/programming. There have been over 11000 views at the time of this writing. Andrei Alexandrescu, co-creator of D, posted on reddit to say that he’s decided to rewrite the GC. I apologize for the bad publicity I may have caused D, but I am glad that the problem has been recognized and that something will be done about it. With a brand new GC, the D programming language will be much more competitive.

For those who are curious, or may want to help, here is a link to the source code of the D GC. If you’re interested in contributing and helping to fix this problem, you should head to the D forums and get in touch with the community, or try contacting Andrei directly. Even if you don’t have time to contribute, your feedback may be useful in the design of a better GC.

Dogfooding Higgs with the Turing Turtle

[Turing Turtle drawing]

A while back, I wrote Turing Drawings, a JavaScript web app which uses randomly generated Turing machines to produce animated drawings on an HTML5 canvas element. The web app, due to its addictive nature, ended up being featured a few times on Reddit and Hacker News, and was even forked by developers who added interesting new features to it. Since then, I’ve been pretty busy working on Higgs and my PhD thesis, and I haven’t taken the time to work on fun little hobby projects of the sort. That is, until last night.

Thanks to the fine work of Tommy Everett, Higgs now has a 2D drawing library based on X11 with functionality resembling that of the canvas element. I decided to dogfood Higgs and its drawing library by writing a new program inspired by Turing Drawings. This program, which I’ve named Turing Turtle, uses randomly generated Turing machines to control a turtle-style plotter. The machine has an internal tape used for storage, and can perform a number of output actions, such as moving forward, turning left or right, and changing the current drawing color. This can produce a variety of interesting drawings, as shown below:

[Three example Turing Turtle drawings]

I’ve uploaded an album with many of these drawings on imgur. I find that randomly generated Turing machines are a simple and effective way to drive generative processes in which you’d like to see repeating patterns emerge. If you’d like to try Turing Turtle for yourself, I’ve added the program to the examples directory of the Higgs repository. You can use the arrow keys to generate new drawings, restart the current drawing, and increase or decrease the plotter speed. It’s a very fun little program to play with.

My Paper was Rejected… Again

A month ago, I submitted an updated version of my basic block versioning paper to an academic conference (the previous version can be viewed here). We have an improved system, more working benchmarks and better results across all metrics. I was quite proud of this paper, and thought our odds of getting into the conference were quite high. Unfortunately, the improved paper was again rejected. Comments from the reviewers were mixed: some quite positive, others extremely negative and harsh. Notably, the paper was shot down by a famous industry researcher who has done extensive work on trace compilation, a competing approach.

Below are some excerpts of the comments from four different reviewers:

The paper is well written and the examples help to understand the tradeoffs of the approach. Overall the approach seems to be sound and deserves further discussion/publication.

 

I fail to see the difference to trace compilation (and the predecessor of trace compilation, dynamic binary translation) [...] Constant propagation and conditional elimination in a trace compiler lead to the same type check elimination that you present.

 

I have had a peek at the code and it very elegant. The paper is well written too. The paper has some weaknesses but I don’t believe that they are show stoppers. [...] Overall this is a nice idea well worth looking at in more detail.

 

The one big problem I have with the paper is that it does not motivate and put into context lazy block versioning properly. The paper needs to do a better job at explaining which precise problems of current Javascript JIT approaches that are used in production are solved by lazy basic block versioning.

In this improved paper, we’ve managed to show that basic block versioning reduced the number of type tests much more effectively than a static analysis could, and that our approach was able to yield significant speedups. We’ve also addressed one of the most important criticisms of such an approach by showing that it does not result in a code size explosion in practice, but only a moderate code size increase. This, however, wasn’t sufficient for some reviewers. The main criticism they raised is that basic block versioning has similarities with tracing, and that as such, we should have compared its performance against that of a tracing JIT.

I have no issue with writing a detailed comparison of both approaches on paper, but it seems the only way to compare the performance of basic block versioning against that of tracing in a fair manner would be for me to implement a tracing JIT compiler in Higgs. Not only that, I’d probably be expected to implement a tracing JIT with all the latest developments in tracing technology. This could well require an entire year of work. I feel like there’s an unfortunate disregard for the amount of infrastructure work required to do compiler research. It’s becoming increasingly difficult to do academic research on compiler architecture. There’s a high amount of pressure to publish often, but at the same time, conferences expect us to pour several man-years into each publication and to immediately provide an absolute proof that ideas brought forward are better than everything else out there. Curious exploration of new ideas is systematically discouraged.

I was understandably feeling quite frustrated and discouraged after this second rejection. This blog post by Daniel Lemire lifted my spirits a little. It’s clear in my mind that basic block versioning has potential. We have something clearly novel, and it works! No matter what, I can at least take some comfort in the knowledge that basic block versioning has already been presented at several industry conferences, such as DConf, mloc.js and Strange Loop.

I may not be doing very well at the publication game thus far, but thousands of people have watched my talks online. The idea has already been spread around, and I’ve gotten positive responses. I haven’t put this much work into Higgs and my thesis to just give up. I’ve already found two conferences where I can submit further updated versions of the paper, and the deadlines are luckily coming up soon. In addition to this, I’ve already started work towards implementing typed shapes in Higgs, which should lead to another publication next year.
