
D’s Garbage Collector Problem

September 9, 2014

One of the biggest issues with Higgs, the JIT compiler I’m working on, is that compilation is too slow. The performance of the compiled machine code Higgs produces is now fairly good: still below that of V8 and SpiderMonkey, but getting closer, and orders of magnitude above what you would get from an interpreter. Unfortunately, in most benchmarks, the total execution time is now dominated by the time it takes Higgs to compile JavaScript all the way down to x86 machine code.

This is a problem that I haven’t really tackled. For one, my PhD research is hugely focused on the performance of the compiled code, and my advisor is always pushing me to get better results in that area. Another issue is that, quite frankly, I don’t really know why Higgs is so slow to compile code. I’ve been relatively careful to avoid unnecessary overhead when implementing it, and my attempts at profiling have been rather unfruitful. There are no obvious low-hanging fruits. From what I’m seeing, every step in the compilation pipeline is fairly slow.

I did find one thing that stood out. My implementation of a liveness analysis scales poorly with function size. It allocates a huge bit matrix (vector of bit sets) for liveness computation. This grows quadratically with the function size, which means that in some cases, obscene amounts of memory are required, and this data structure doesn’t fit in the cache. Seeing this, I did the obvious thing and rewrote the liveness analysis to use simple lists as sets of live values. The results were rather perplexing: the liveness analysis was now faster, but the rest of the compilation process had slowed down so that overall, the total compilation time was slower than before. No fair.
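To illustrate the trade-off (a hypothetical sketch, not Higgs’s actual data structures), here is roughly what the two representations look like in D:

```d
import std.stdio;

// Two ways to represent per-instruction live sets. A bit matrix uses
// numInstrs * numValues entries (quadratic in function size), while a
// list-of-ids representation only pays for values that are actually live.
struct BitMatrix
{
    size_t numValues;
    bool[] bits; // simplified: one byte per entry; a real bit set packs these

    this(size_t numInstrs, size_t numValues)
    {
        this.numValues = numValues;
        this.bits = new bool[numInstrs * numValues];
    }

    void setLive(size_t instr, size_t val) { bits[instr * numValues + val] = true; }
    bool isLive(size_t instr, size_t val) { return bits[instr * numValues + val]; }
}

void main()
{
    // For a 1000-instruction function, the matrix holds a million entries...
    auto m = BitMatrix(1000, 1000);
    m.setLive(42, 7);
    writeln(m.isLive(42, 7)); // true

    // ...while per-instruction lists only store what is actually live
    auto lists = new size_t[][](1000);
    lists[42] ~= 7;
    writeln(lists[42]);
}
```

The list version allocates far less when live sets are sparse, but it trades the matrix’s one big allocation for many small ones, which is exactly the kind of allocation pattern that, as shown below, the D GC handles poorly.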

I had an inkling of what might be wrong. Turns out it’s a D garbage collector issue I’ve already run into and talked about in my DConf 2014 talk.

To assess performance, I wrote a test.js program which assigns to a global variable and then reads from it 60000 times:

a = 0;
a;
a;
// ... (60000 reads in total)
a;

The sad news is, compiling this program is unbelievably slow:

./higgs --perf_stats --nostdlib test.js

comp time (ms): 4504
exec time (ms): 2
total time (ms): 4507
code size (bytes): 633036
peak mem usage (KB): 443608

But now, if I add the following line to the beginning of the main() function:

// Reserve 1GB for the D GC
import core.memory : GC;
GC.reserve(1024 * 1024 * 1024);

The results become very different. Preallocating a large chunk of memory for the D heap reduces the compilation time by 66%:

./higgs --perf_stats --nostdlib test.js

comp time (ms): 1570
exec time (ms): 2
total time (ms): 1572
code size (bytes): 633036
peak mem usage (KB): 760828

And now if we both preallocate memory and disable the GC with the following:

import core.memory : GC;
GC.reserve(512 * 1024 * 1024);
GC.disable();

Then we get:

./higgs --perf_stats --nostdlib test.js

comp time (ms): 968
exec time (ms): 2
total time (ms): 971
code size (bytes): 633036
peak mem usage (KB): 798372

This is both interesting and perplexing. It tells us that originally, the D GC overhead took over 75% of the total execution time. By reserving memory, we can reduce the number of garbage collections and eliminate most of this overhead. What’s probably happening is that my test program almost fills the heap, but doesn’t cause it to expand. Then, we have a situation where the GC is “thrashing”, and getting repeatedly triggered every few allocations.
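The workaround can be sketched as follows. This is an illustrative pattern, not Higgs code; the function name, allocation count, and reserve size are all made up, and it assumes the allocation-heavy phase keeps everything reachable (so mid-phase collections would be futile anyway):

```d
import core.memory : GC;
import std.stdio;

// Allocation-heavy phase, similar in spirit to a compiler front-end:
// many small objects, none of which die until the phase ends.
int[][] allocPhase(size_t count)
{
    GC.reserve(256 * 1024 * 1024); // give the GC headroom up front
    GC.disable();                  // and stop it collecting mid-phase
    scope (exit) GC.enable();      // re-enable once the phase is done

    int[][] chunks;
    foreach (i; 0 .. count)
        chunks ~= new int[](64);   // small allocations, all still live
    return chunks;
}

void main()
{
    auto chunks = allocPhase(100_000);
    writeln("allocated ", chunks.length, " chunks");
}
```

With the GC disabled, allocations that would otherwise trigger a collection simply grow the heap, which is the right behavior when nothing is garbage yet.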

So, always preallocate memory then? Unfortunately, I found out through experimentation that choosing the amount of memory to preallocate requires magic voodoo. Sometimes preallocating less memory yields better performance. What seems clear is that the D GC needs better heuristics to decide when to expand the heap to avoid thrashing. This would be much more effective than me trying to guess how much memory to reserve to get around the problem.

There might be another issue here as well. I suspect that D memory allocation itself is quite slow and could be further optimized. It would also be beneficial to optimize the GC algorithm itself (potentially using multithreading) to make collections faster. In any case, it seems very clear to me that something needs to be done about GC performance in D. As it is, this performance issue stands out like a sore thumb.

System programmers like myself care very much about performance, but in the current situation, it seems I can’t improve Higgs compilation times without refactoring my code to avoid memory allocations at all costs. Doing so would take quite a bit of my time, and inevitably make the code much less maintainable. The point of this post isn’t to point fingers and yell at people, but to highlight a very real and fundamental problem which, in my opinion, will scare away many potential D programming language adopters.


Update: September 10, 7:39 PM

This post gathered much more attention than I expected. It made the front page on Hacker News and r/programming. There have been over 11000 views at the time of this writing. Andrei Alexandrescu, co-creator of D, posted on reddit to say that he’s decided to rewrite the GC. I apologize for the bad publicity I may have caused D, but I am glad that the problem has been recognized and that something will be done about it. With a brand new GC, the D programming language will be much more competitive.

For those who are curious, or may want to help, here is a link to the source code of the D GC. If you’re interested in contributing and helping to fix this problem, you should head to the D forums and get in touch with the community, or try contacting Andrei directly. Even if you don’t have time to contribute, your feedback may be useful in the design of a better GC.

21 Comments
  1. Have you brought this up on dlang.org? Any feedback from Andrei on the issue?

    • Not on dlang.org, but I’ve brought it up at DConf 2014. I haven’t gotten Andrei’s take on it. Walter seemed to be of the opinion that “GCs are going to be slow no matter what”, which I think is wrong in this case. I’m sure the D GC could be made to perform much better.

      • Yes, D GC could be made to perform better, but it still does not solve your problem. If you want maximum performance you should avoid GC completely.

        • You can’t really avoid the GC for a JIT compiler though, that’s the problem. You’re basically telling me that I should write my own special-purpose GC and memory management for my own application, throwing that burden on me. I’m telling you the D GC could be several times faster, and that this would be a very big improvement for very many people.

  2. Bill Hart permalink

    I sympathise, as I have some code that takes 2s to run in another language I’ve been working with, which takes nearly 6s even if I just add an empty garbage collection function. The actual cleanup itself takes almost no time in comparison. It’s just adding lots of objects to the gc workload that causes the issue (when the gc next runs).

    In my application we can multiply high degree multivariate polynomials three times faster than we can clean them up with the gc. In the past, these sorts of things didn’t matter, because people were using GC in slow interpreted languages where the gc time was negligible.

    But now with much faster languages, gc can be a disaster.

    One can minimise gc if one can use stack based structs rather than heap allocated objects as much as possible. Of course this means you need to pass them by reference to functions and make deep copies of them in various places where you’d otherwise just copy a pointer. But paradoxically, this is faster than using gc (see my comment about multiplying high degree polynomials being faster than cleaning them up).
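    The trade-off described above can be sketched in D (hypothetical types, not the commenter’s actual polynomial code): a stack-allocated struct incurs no GC work, while a class instance is a GC heap allocation.

```d
import std.stdio;

struct PolyS { int[8] coeffs; }   // value type, lives on the stack
class  PolyC { int[8] coeffs; }   // reference type, GC-allocated

// Passed by reference, as the comment suggests: no copies, no GC work
void addInto(ref PolyS dst, const ref PolyS src)
{
    foreach (i, c; src.coeffs)
        dst.coeffs[i] += c;
}

void main()
{
    PolyS a, b;
    b.coeffs[0] = 3;
    addInto(a, b);          // zero heap allocations
    auto c = new PolyC;     // this one the GC must eventually scan
    writeln(a.coeffs[0]);   // 3
}
```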

    • It’s great news that Andrei is picking this up though. The D GC can be made much better, and now that the problem has been recognized, something is finally going to be done about it. With a better performing GC, D will be that much more competitive.

  3. realloc’ing JIT buffers is a big bottleneck, especially when they need to be in executable pages. In one of my JITs I tried a two-pass compilation: first calculate the size, then emit the code. The code is the same, just a pass argument is passed through, and the code emitter only runs in pass 2. Could be an easy thing to try.

    • Right now, for the executable code, I have a fixed size buffer in Higgs that’s large and never needs to be reallocated. I plan on eventually implementing a code compactor to collect the dead code and compact the remaining code, reordering it so as to minimize branches and such. Haven’t needed it, or gotten around to that yet.

      The big bottleneck seems to be that there are a fair number of objects allocated to represent the IR instructions, to store liveness analysis data, and by the backend to keep track of code generation state. DMD doesn’t even provide detailed allocation statistics, though, so I’m not sure what the breakdown of allocated objects is; I’d basically have to roll my own allocation profiling.

  4. complexmath permalink

    GC.reserve() tells the GC to allocate a single large pool of the specified size. This is then available as free space for allocations.

    What normally happens in the GC if you try to allocate memory is it will see if it has the space available. If not, it will collect and check again. If there’s still not sufficient memory available then it allocates a new pool to hold the data. If the data is a block of some small size, this pool will be some small multiple of the system’s page size, so let’s say it will be 16K (4 pages).

    The problem ends up being that these check/collect/check/map-in-more-memory cycles are very slow. For a program like a compiler, which progressively allocates more and more data and never releases anything, you end up paying a huge cost in what amounts to tons of futile checks for available memory and allocations of small chunks to meet the current request. With the current GC, I would absolutely recommend doing exactly what you’ve done: reserve a large chunk of memory and disable the GC. The first gives you some headroom before the GC tries to collect, and the second tells it to forget about collecting and just ask for more memory from the OS immediately.

    GC.reserve() was created specifically for programs like compilers where you have a decent idea of how much memory you might need. You can allocate a huge chunk up front and sidestep this cycle problem for a good long time. But… ultimately it’s kind of a hack. What really needs to happen is for the GC to be smarter about how often it collects and how it maps in new memory from the OS (ie. how much extra, etc).
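    A toy model of the cycle described above (a deliberate simplification with made-up bookkeeping; the real druntime GC is far more involved) shows why a never-freeing workload is pathological: every pool fill triggers a futile collection.

```d
import std.stdio;

size_t freeBytes = 0;    // free space currently available in the pools
size_t collections = 0;  // how many (futile) collections we ran
size_t poolsMapped = 0;  // how many pools we mapped in from the OS
enum poolSize = 16 * 1024; // "some small multiple of the page size"

bool tryAlloc(size_t size)
{
    if (size <= freeBytes) { freeBytes -= size; return true; }
    return false;
}

void gcAlloc(size_t size)
{
    if (tryAlloc(size)) return; // 1. check for available space
    collections++;              // 2. collect (frees nothing here)...
    if (tryAlloc(size)) return; //    ...and check again
    freeBytes += poolSize;      // 3. map in one more small pool
    poolsMapped++;
    tryAlloc(size);
}

void main()
{
    // A compiler-like workload: keeps allocating, never frees anything
    foreach (i; 0 .. 10_000)
        gcAlloc(64);
    writeln("collections: ", collections, ", pools: ", poolsMapped);
    // prints "collections: 40, pools: 40"
}
```

    One collection per 16K pool; a larger up-front reserve, or a heap-growth heuristic that maps in bigger pools as the heap grows, would eliminate almost all of them.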

  5. Any good book to learn about GC algorithms?

    • Not sure, but you can find introductory material from many university courses with a Google search. On Google Scholar and CiteSeer you can also find many academic papers describing GC techniques. For the simpler algorithms like mark & sweep or stop & copy, an online tutorial is all the explanation you will need.

    • Kyle Hayes permalink

      Look at Jones & Lins’ book on GC. I am not sure I can embed a link, but Dr. Jones’ home page is:

      http://www.cs.kent.ac.uk/people/staff/rej/gc.html

      The original book came out in about 1996, so it does not cover some of the very latest developments (particularly in the Java world) with multi-core, multi-GB VMs. There is a later book from 2011 that covers more of this.

      I found the older book to be extremely approachable. It gives algorithms and code for all kinds of types of GC. Very useful IMHO.

  6. Bad Joker permalink

    Maybe I’m just too stupid to see it, but to me it seems obvious: *Everything* slows down in the Higgs field! (Bada-bish!) I’ll be here… never again.

  7. John permalink

    I’m glad that they’re thinking about fixing this issue.

