Great news! The Higgs GitHub repository has 204 stars and 22 forks at the time of this writing. Even better news, however, is that I’ve finally reached a point where Basic Block Versioning (BBV), the compiler optimization technique I’ve been implementing in Higgs as part of my PhD thesis, delivers measurable speedups on benchmark programs. Although I was able to demonstrate a few months back that BBV eliminates dynamic type checks quite effectively, one important problem remained: this didn’t immediately translate into improvements in running time. Many optimizations are required to make dynamic languages fast, and the quality of the code generated by Higgs was just too poor: removing 50% of type tests or more didn’t make a noticeable performance impact. It was like removing one drop from a bucket of inefficiency.
Recently, after measuring that property access took more than 30% of the execution time in some benchmarks, I decided to implement inline caching in Higgs. This provided noticeable speedups, but still didn’t make BBV win out in terms of running time. In fact, the results on most benchmarks were worse with versioning enabled. I concluded that the issue was probably largely one of code size. By generating multiple versions of some code paths, versioning increases code size, which can hurt performance through instruction cache misses. Modern CPUs have large L2 and L3 caches multiple megabytes in size, but the instruction cache of the Core i7 is still just a puny 32 kilobytes. This cache is easily filled up, especially if the code you’re generating is rather bloated. Hence, I decided that my next focus should be to optimize the size of the machine code generated by Higgs.
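The post doesn’t show how Higgs implements its inline caches, but the core trick is easy to illustrate. Here’s a minimal sketch in Python (purely illustrative, with made-up `Shape`/`Obj` names, not Higgs internals): each property-access site caches the last object layout it saw, so repeated accesses to objects of the same shape skip the dictionary lookup.

```python
# Illustrative sketch (not Higgs code): a monomorphic inline cache.
# Each object carries a "shape" mapping property names to slot indices;
# the cache remembers the last shape seen at a property-access site so
# repeated accesses skip the dictionary lookup entirely.

class Shape:
    def __init__(self, slots):
        self.slots = slots  # property name -> slot index

class Obj:
    def __init__(self, shape, values):
        self.shape = shape
        self.values = values

class InlineCache:
    """One cache per property-access site in the generated code."""
    def __init__(self, prop):
        self.prop = prop
        self.cached_shape = None
        self.cached_slot = None

    def get(self, obj):
        if obj.shape is self.cached_shape:      # fast path: shape check only
            return obj.values[self.cached_slot]
        # slow path: full lookup, then memoize for next time
        slot = obj.shape.slots[self.prop]
        self.cached_shape, self.cached_slot = obj.shape, slot
        return obj.values[slot]

point_shape = Shape({"x": 0, "y": 1})
cache = InlineCache("x")
p = Obj(point_shape, [3, 4])
print(cache.get(p))  # slow path, fills the cache; prints 3
print(cache.get(p))  # fast path; prints 3 again
```

In generated machine code, the fast path compiles down to a single compare-and-branch, which is why inline caching makes such a difference for property-access-heavy benchmarks.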
I started by implementing a very naive on-the-fly register allocator. The first version spilled values very conservatively and produced many redundant writes and spills whose only purpose was to keep the Garbage Collector (GC) from ever seeing false pointers (a flaw which has since been fixed). My main concern was to get Higgs to pass all its tests with the register allocator, and only then to optimize it. After many hours of debugging, I got it all working, and to my surprise, despite all the redundant operations, the unoptimized register allocator produced much smaller code and much improved performance. The performance was improved enough, in fact, that BBV finally won out in terms of running time on most benchmarks. I was extremely surprised to find that on loop microbenchmarks, the naive register allocator produced code that ran more than 10 times faster than similar code loading and storing values on the stack. My conclusion is that when it comes to memory accesses, it’s really the reads that kill performance, since the CPU doesn’t have to stall execution for redundant writes.
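For illustration only, here is a toy sketch (in Python, not the D code in Higgs) of what a greedy on-the-fly allocator of the kind described might look like: values get a free register on first use, and when none is free, some live value is conservatively spilled to a stack slot.

```python
# Toy sketch (not the Higgs allocator): a greedy on-the-fly allocator.
# Values are assigned a free register on first use; when no register is
# free, an arbitrary live value is spilled to a stack slot.

class Allocator:
    def __init__(self, regs):
        self.free = list(regs)       # available registers
        self.loc = {}                # value -> register or stack slot
        self.in_reg = {}             # register -> value currently held
        self.code = []               # emitted pseudo-instructions
        self.n_spills = 0

    def reg_for(self, val):
        loc = self.loc.get(val)
        if loc in self.in_reg:       # already in a register
            return loc
        if not self.free:            # spill someone to free a register
            spill_reg, spill_val = next(iter(self.in_reg.items()))
            slot = f"[sp+{8 * self.n_spills}]"
            self.n_spills += 1
            self.code.append(f"store {spill_reg} -> {slot}")
            self.loc[spill_val] = slot
            del self.in_reg[spill_reg]
            self.free.append(spill_reg)
        reg = self.free.pop()
        if loc is not None:          # reload a previously spilled value
            self.code.append(f"load {loc} -> {reg}")
        self.loc[val] = reg
        self.in_reg[reg] = val
        return reg

alloc = Allocator(["rax", "rbx"])
r1 = alloc.reg_for("a")
r2 = alloc.reg_for("b")
r3 = alloc.reg_for("c")   # no free register left: forces a spill
```

Even a scheme this naive already avoids the per-operation stack reads that dominate when every value lives in memory.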
At this point, I’ve completed a few optimizations to the register allocator. Most of the redundant moves have been eliminated, but spilling still remains to be made less conservative. I’ve also stumbled upon a rather interesting finding: in some cases, versioning doesn’t increase code size, but actually decreases it. Sometimes, the code it produces is more optimized and more compact than the original. So far, the results are extremely encouraging, but I still believe I can improve on them with additional tweaks. These results are coming in just in time for my presentation in May at DConf 2014. I’ll also be submitting a new publication to a conference in early June.
The performance of the machine code generated by Higgs is slowly but surely becoming more competitive. As I see it, the main flaw Higgs has in terms of usability at this point is its slow compilation: compilation time now largely dominates execution time on most benchmarks. This is something I haven’t had much time to look into. The codebase has become complex enough that I can’t exactly pinpoint where the major inefficiencies might be without some digging. If you’re interested in contributing to this project, profiling and optimizing parsing, analysis and code generation is something we could definitely use help with.
Working out is good for your mood, your motivation and energy levels. It’s good for staying thin and looking good, too. Still, going to the gym is impractical, especially in a city with a harsh winter like Montreal. I just haven’t been able to motivate myself to make that kind of time commitment. I found a nice compromise, though: I bought myself a Schwinn A10 stationary bike and set it up right next to my desk at home. That way, I can hop on it whenever I want, right in the middle of a coding session. Hit an annoying bug? Need some time to think? Why not take a break and watch a few minutes of an interesting talk while biking? This works pretty well for me; I now do cardio almost every day.
I like my stationary bike: the mechanism is quiet and sturdy, and it offers a good range of resistance levels. Unfortunately, it has a slight design flaw. After 5 minutes of inactivity, it goes into sleep mode, and then, if you leave it in sleep mode for more than a minute or so, all of your workout stats get erased. It seems that when it goes into sleep mode, the microcontroller driving it gets powered off completely, and its RAM only survives so long without a refresh. This is problematic since, as I said earlier, I often like to work out in short sessions of a few minutes each, stretching over several hours. Being the tinkerer that I am, I started thinking about ways to work around this problem. I don’t have awesome soldering skills and I didn’t think I could reverse engineer the stationary bike’s control board, but I did think of something.
I designed and built a small board with a microcontroller and a reed relay (low power relay) to trigger a button on the console every 3 minutes. I chose the very basic and inexpensive Atmel ATtiny85 for this. The reed relay is connected to the “enter” button pins on the bike’s control board. Some of you are probably thinking a microcontroller is overkill for this, that I could have used an NE555 timer. I could have, but that would have made for a more complex electronic design. The microcontroller also allows me to easily trigger the “enter” button exactly 3 times in a row, which changes a setting and then restores it to its original value, essentially performing a noop which only serves to keep the bike’s control board from going into sleep mode.
The circuit was tested on a breadboard before I soldered it on a prototype board. Hot glue was used to install the clandestine board inside the bike’s console. The device draws power directly from the bike’s batteries (4 x D cells). It can be turned on and off with a switch I installed on the front console, and there’s a green panel mount LED to indicate its status. My Schwinn A10’s warranty is no doubt voided now, but the annoying flaw is fixed, and this was a fun little hack to work on. Honestly, I was just curious to see if I could make this work ;)
Upcoming talks about Higgs
I’ll be giving a talk about my research at mloc.js in Budapest on February 13th. This is going to be the first of my conference talks that is entirely focused on Higgs and my current research. The talk will be filmed and should appear online on InfoQ and Ustream. I’ll also be putting the slides online after the talk. In other news, I received an invitation from Andrei Alexandrescu to present at DConf this year again. I’m quite looking forward to this as I had a very positive experience at DConf 2013, where I met many very knowledgeable systems engineers (I may also have a crush on the SF bay area).
For those who would like to see some results of my current research right away, I’ve made available a paper which was recently submitted to CC 2014, but unfortunately rejected. This paper explains in depth the technique of basic block versioning, and gives results comparing it to a traditional type propagation analysis. The main reason stated for rejection is that we did not provide comparisons in terms of overall running time (due to limitations of Higgs). I still believe the results are quite encouraging and make a convincing case for basic block versioning. My PhD advisor and I will be submitting an improved version of this paper in a few months.
New JIT online
The lazy incremental JIT I’ve been working on is now in the master branch. It’s far from complete, but it’s reached the crucial milestone where it passes all of the Higgs tests and runs all of the benchmarks. There is currently no type propagation and no register allocation, but I’ve already started implementing incremental inlining of primitive calls. The inlining is currently done in a rather unsophisticated manner which doesn’t eliminate all call overhead, but this will be fine-tuned.
Performance-wise, the new JIT is comparable to the old one. It’s faster on most of the larger benchmarks, probably because there is no more interpreter and the new JIT generates machine code faster. Thus, there is less warmup time required before reaching full speed. I will be working on reintegrating the full power of basic block versioning and improving the performance within the next few months. I expect to exceed the performance of the old JIT on all benchmarks within a few weeks.
Following our recent announcement that we were looking for contributors, we’ve started to build a Higgs wiki and created a #higgsjs channel on freenode IRC. It seems that this work has paid off as we’ve gotten several pull requests for bug fixes and improvements.
Notably, zimbabao has sent us bug fixes to the runtime library, the regex engine and the standard library. Thanks to him, the SunSpider crypto-aes benchmark finally works properly. He’s fixed bugs I wasn’t even aware of! Óscar Toledo has helped us find and track down two parser bugs and two semantic bugs using his nanochess JS1K program. I’ve also taken the initiative of fixing several minor semantic bugs to make Higgs respect the ES5 specification more closely.
My number one, Tom Brasington, has completed a new test framework which makes it easy to add new functionality and regression tests by including a JS file in the source/tests directory and sending us a pull request. He’s also set up the Travis build system to notify us if a build fails. Tom is currently working on an improved, more convenient version of our FFI library. We have plans to build bindings to SDL and network APIs, among other things.
Should you wish to help, we’re always looking to make Higgs more JS compliant, add new useful features and grow our collection of tests. If you find a bug in Higgs, please open an issue and/or send us a pull request with a failing test. You’re also welcome to write useful libraries to be included with Higgs. You can help us make Higgs insanely great.
We’ve been told that real-time ray tracing is coming soon. It can produce superb renderings with unmatched levels of realism. It works by simulating the physics of light, and it leads to much more elegant renderer architectures. It’s the future, but unfortunately, it’s been coming soon for several years now. Like thorium reactors, it’s one of those technologies that just seems to make so much sense, but that isn’t getting adopted, probably because we already have something that works “well enough”, and we aren’t yet willing to invest the effort required to convert to a radically different technology. There’s too much inertia; we haven’t yet reached that breaking point. Still, the potential is there, and with the advent of GPGPU, real-time global illumination is very close to becoming a reality in the world of gaming, as demonstrated by the Brigade renderer.
I should point out that ray tracing, path tracing, and global illumination are not the same things. Global illumination is a simulation of the entire light transport within a scene, which accounts for the ways in which light can illuminate objects indirectly, through several reflections between multiple surfaces. It produces superior renderings because it accounts for phenomena such as soft shadows, caustics, and color bleeding. Path tracing is an extended form of ray tracing which achieves global illumination, usually by tracing inverted light rays from the eye to the light sources in a scene. This is done through nondeterministic, approximate sampling, because there is an infinite number of paths light rays could take through a scene.
The way images are currently generated with path tracing is to send many rays through each pixel of the image. These rays then randomly scatter in the scene in ways that obey the properties of the materials they encounter, and eventually reach light sources. The problem with this is that the generated images have significant visual noise (grain), and it takes a large number of rays per pixel (easily 256 or more) to eliminate this noise. As you can imagine, this takes a tremendous amount of computational power. Generating a 4 megapixel image could easily require tracing a billion rays, with each ray bouncing several times through the scene. This is very difficult to achieve at real-time framerates, even with today’s GPUs.
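The reason so many rays are needed falls out of Monte Carlo statistics: the standard error of an N-sample average shrinks as 1/sqrt(N), so halving the visible noise costs four times as many rays. This toy Python simulation (a stand-in light function, not a real path tracer) shows the effect:

```python
import random, statistics

# Illustrative: a pixel's value as a Monte Carlo estimate. Each "ray"
# returns the light carried by one randomly sampled path; the pixel
# averages them. The noise (standard deviation of the estimate)
# shrinks as 1/sqrt(N), which is why cutting visible grain in half
# costs 4x the rays.

def sample_ray(rng):
    # stand-in for a real path trace: a bright light hit 10% of the time
    return 1.0 if rng.random() < 0.1 else 0.0

def pixel_estimate(n_rays, rng):
    return sum(sample_ray(rng) for _ in range(n_rays)) / n_rays

def noise(n_rays, trials=2000):
    rng = random.Random(42)
    return statistics.pstdev(pixel_estimate(n_rays, rng)
                             for _ in range(trials))

# noise(256) should be roughly half of noise(64): sqrt(64/256) = 0.5
print(noise(64), noise(256))
```

The diminishing returns of extra samples are exactly what makes the idea below, getting more out of each sample, so appealing.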
Lately, I started thinking that maybe there’s a more efficient way to do this. Tracing hundreds of rays per pixel seems enormously wasteful. If there’s one thing I learned from my computer science education, it’s that algorithmic optimizations can pay off orders of magnitude more than clever little code tweaks. We’re rendering each pixel independently, when in fact, we know that the colors of pixels are rarely independent variables, because most images contain structured patterns. Maybe path tracing ought to take a hint from the world of image compression, and try to generate images using less information, fewer samples. It’s not just about the amount of data or raw computational power you have; using it intelligently matters as well. More concretely, I’ve been wondering if it might be possible to take direct inspiration from JPEG compression, and trace fewer rays to estimate DCT coefficients for multi-pixel patches, instead of individual pixels. I’m speculating here, but I believe that such an approach might be able to produce less noisy images from far fewer samples (say, 20 to 50 times fewer).
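To make the intuition concrete, here is an illustrative Python sketch (a toy example of my own, not anything from the experiments above) computing the 2D DCT of a smooth 8×8 patch: almost all of the energy lands in a handful of low-frequency coefficients, which is why estimating a few coefficients per patch could, in principle, require far fewer samples than 64 independent pixels.

```python
import math

# Illustrative only: for a smooth 8x8 patch, the 2D DCT-II packs almost
# all of the signal energy into a few low-frequency coefficients.

N = 8

def dct2(patch):
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (patch[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            cu = math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)
            cv = math.sqrt(1 / N) if v == 0 else math.sqrt(2 / N)
            out[u][v] = cu * cv * s
    return out

# a smooth horizontal gradient patch
patch = [[x / (N - 1) for y in range(N)] for x in range(N)]
coeffs = dct2(patch)

total = sum(c * c for row in coeffs for c in row)
low = sum(coeffs[u][v] ** 2 for u in range(2) for v in range(2))
print(f"{low / total:.4f}")  # nearly all energy in the 4 lowest coefficients
```

For a patch like this, over 95% of the energy sits in the 2×2 lowest-frequency corner, so a sampler that estimated only those coefficients would discard very little.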
The Higgs Roadmap
How You Can Help
There are a number of areas where help is needed on the Higgs project, for example:
- Trying out Higgs, finding and reporting bugs
- Finding new benchmarks for Higgs
- Writing useful libraries
- Repackaging existing JS libraries
- Profiling to improve compilation time
- Implementing support for ES6 features
We’ve opened several issues on GitHub detailing specific tasks that you could contribute to. Some of these are tagged as “easy” as they don’t require much knowledge of the existing Higgs system, and are a good starting point for newcomers. If you’re interested in contributing, we encourage you to take a look at the issues list, as well as the How to Contribute page. I’d also like to say that if you have a specific idea you’d like to implement in Higgs, a specific feature you’ve thought of, we’re open to that too. You’re welcome to pitch us your own ideas.
More about Higgs
We’ve started to document Higgs on the Higgs wiki. For instructions on how to install Higgs, see the Quickstart page. Also, we now have an IRC channel (#higgsjs on freenode) in which you can reach us on most days.
I will be giving a talk at mloc.js in mid-February. This talk will detail some of the novel compilation techniques I’m implementing in Higgs, such as basic block versioning and incremental compilation. This talk will be filmed and made available on InfoQ. In the meantime, you can watch my talks from Strange Loop, DConf and Air Mozilla.
I’m not any kind of data compression expert, but I find the topic quite interesting. There’s something fascinating about the idea of data compression and its relationship to machine learning. What I mean to say is that to compress a data stream efficiently, you should ideally capture something fundamental about the structure of the data being encoded. The ideal compression algorithm should somehow be able to “learn” and “understand” the core nature of the data being compressed, so that it can encode it in a very compact way, with as little redundancy as possible. The compression algorithms in use today do this at a rather superficial level, seeking to shorten the encoding of repeated patterns of bits. LZW, for example, will build a dictionary (a simple grammar) as it encodes a binary data stream, and reuse pieces from this dictionary, when useful, to shorten the encoding. This simple yet clever strategy is remarkably effective, and quite difficult to improve upon.
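The dictionary-building behavior described above can be sketched in a few lines. This is a minimal LZW encoder in Python (illustrative only, without the fixed-width bit packing a real implementation would use):

```python
# A minimal LZW encoder: the codebook starts with all single bytes and
# grows a new entry (current match + next symbol) every time a match
# ends, so repeated patterns collapse into single dictionary codes.

def lzw_compress(data: bytes):
    codebook = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    out = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in codebook:
            current = candidate              # keep extending the match
        else:
            out.append(codebook[current])
            codebook[candidate] = next_code  # learn the new pattern
            next_code += 1
            current = bytes([byte])
    if current:
        out.append(codebook[current])
    return out

codes = lzw_compress(b"abababab")
print(codes)  # repeated "ab" pairs collapse into dictionary references
```

Eight input bytes come out as five codes here, and the ratio keeps improving as the dictionary accumulates longer repeated patterns.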
Image compression is something I find particularly interesting. Images we typically see around, photos in particular, are not at all random. There’s structure: edges, shapes, visible patterns and repetitions. Real-world objects, letters, people, curves, textures, all of these things could in theory be mathematically approximated in a way that’s much more compact than an array of millions of pixel samples. Our brain is somehow able to recover the underlying structure of the world from images, and transform millions of samples into high-level semantic entities. I’ve seen people use genetic algorithms to approximate images using colored triangles. This got me thinking: what if you had something more expressive than triangles? Could you use genetic algorithms to evolve boolean functions to encode the pixels of an image as a function of their x/y coordinates?
I did a number of experiments trying to come up with boolean functions to approximate one-bit-per-pixel black and white images using mutations and hill climbing, but got surprisingly poor results. There are a number of issues. For one, evaluating complex boolean functions on millions of pixels is very time consuming. Another problem is that there are many complex functions to evaluate, and vanishingly few of them actually resemble the output we want. It’s like looking for a needle in a galaxy. An approach using neural networks working with floating-point values would probably have a better chance of panning out. There are training algorithms for neural networks, which would be a better starting point than completely random mutations. Furthermore, floating-point outputs are probably better for learning than boolean outputs, as there are multiple gradations of how wrong a given output is, rather than all or nothing, but this is a topic for another blog post.
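The mutate-and-keep-if-better loop itself is simple to sketch. Here is a toy Python version in which the “image” is just 64 bits and the candidate is a raw bit vector rather than a boolean function; in this degenerate setting every useful mutation is one bit flip away, which is precisely what makes it easy, whereas with a real function representation almost no mutations help:

```python
import random

# A minimal hill climber of the kind described: mutate a candidate,
# keep the mutation only if it matches at least as many target pixels.

def hill_climb(target, steps, rng):
    cand = [0] * len(target)
    score = sum(c == t for c, t in zip(cand, target))
    for _ in range(steps):
        i = rng.randrange(len(cand))
        cand[i] ^= 1                       # flip one random bit
        new_score = sum(c == t for c, t in zip(cand, target))
        if new_score >= score:
            score = new_score              # keep the improvement
        else:
            cand[i] ^= 1                   # revert the mutation
    return score

rng = random.Random(0)
target = [rng.randrange(2) for _ in range(64)]
result = hill_climb(target, 500, rng)
print(result)  # approaches a perfect 64 matches
```

The pathology with boolean function candidates is that the score landscape is mostly flat: a mutation rarely changes the output in the right direction, so the climber stalls.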
After the horrible failure of my boolean image approximation attempt, I sobbed while sipping on one ounce of whisky, but then, I had an idea. Simple approaches that exploit fundamental properties of a problem space tend to do quite well in practice. Maybe something resembling the LZW algorithm would work better. Maybe I could exploit the self-similarity of images (recurring patterns) in a very simple way. I started thinking about an algorithm to encode images using tiles. Split the image into NxN tiles (e.g.: 4×4 or 8×8) and encode each tile either as a direct listing of pixel values, or as a reference to another part of the image. The idea is to save space by reusing image information when a similar enough tile already exists. I used the Mean Squared Error (MSE) of tile pixel values to decide when tiles are similar enough to allow reuse. Someone has probably already come up with a similar image compression algorithm before, but my google-fu has not revealed anything obvious.
The tile-based algorithm, which Tom has codenamed übersmüsh (usm for short), performs surprisingly well. It’s not comparable to JPEG, which is a very sophisticated format layering many levels of clever tricks to get an encoding as compact as possible, but it does much better than I would have expected. Using 4×4 tiles to compress a 512×512 version the classic Lenna image, I was able to generate something visually quite similar to the original by supplying only 610 tiles of source pixel data, out of 16384. This means that only 3.7% of the tiles in the compressed image come from the source image, the rest are all backreferences to x/y regions of the compressed image. I allowed backreferences to rescale the color values of the tiles being referenced for additional flexibility, which makes for a much smaller tile use.
There are some issues, such as the fact that we probably use too few tiles for the background and too many in some detailed regions. This can probably be improved by tweaking or replacing the heuristic deciding when to use source tiles or backreferences. I can think of several other ideas to improve on the algorithm as a whole:
- Pre-filtering of random noise to make tiles more reusable. The Lenna image has quite a bit of high-frequency noise, as most photographs probably do, due to the nature of image sensors in cameras.
- Reducing the size of source pixel data using chroma subsampling. This may also have the benefit of making tiles more similar to each other, and thus more reusable.
- Avoiding supplying whole source tiles and instead supplying only enough source pixels to get an acceptable error threshold. The rest of the pixels would always come from a backreference.
- Using bigger tiles when possible (e.g.: 16×16 or 8×8) and resorting to a recursive subdivision scheme only when necessary. This would reduce the number of tiles we need to encode.
Some important weaknesses right now might be the difficulty of encoding the tile reference table compactly. The code I wrote is also unoptimized and very slow. It’s a brute-force search that looks at all possible previous pixel coordinates to find the most-resembling 4×4 tile in what has yet been compressed. This currently takes over an hour on my Code 2 Quad. The algorithm could probably be made much faster with a few simple tweaks to reduce the candidate set size. For example, the tiles could be sorted or hashed based on some metric, and only a few candidate tiles examined, instead of hundreds of thousands. The source code is written in D. It’s not the cleanest, but if there’s interest I might just clean it up and make it available.
Happy new year to all :)
Edit: Challenge Accepted
A friend suggested I try übersmüsh on a different kind of image: a screenshot of Facebook containing text and large flat areas. I was curious to see how well it would do, and so I accepted his challenge. I quickly hacked the algorithm to look for matches in restricted size windows (close in position to the current tile). This probably reduces its effectiveness somewhat, but was needed to get a reasonable execution time (minutes instead of multiple hours).
A total of 3354 source tiles was used out of 49000 tiles total, which is about 6.8% of the source image data. Predictably, the algorithm does very well in flat areas. There is some amount of reuse in text, but this could be better if I hadn’t reduced the search window size, and instead implemented a better search heuristic. Some of the ad pictures, namely the ones in the bottom-right corner, show very poor reuse, probably because my mean-squared-error heuristic is a poor approximation of what error is visually acceptable.
It’s been about a month since I started refactoring the Higgs JIT to implement a lazy/incremental compilation strategy. The new JIT will combine the basic block versioning scheme I previously implemented with lazy compilation of basic block versions and eventually incremental inlining of function calls. Higgs will also no longer have an interpreter, everything will be JIT-compiled on the first execution. This is a natural extension of basic block versioning which I believe will bring many interesting possibilities, such as the ability to recompile or invalidate code at the level of individual basic blocks. This means, for example, that function call sites, if they are executed often enough, could be recompiled so as to inline the callee, but without recompiling the entire caller function. Other possible uses include making optimistic assumptions about the types of values, such as global variables, and being able to invalidate only the blocks making these assumptions if the said assumptions are violated.
This is a significant refactoring, but it’s not the first time I rebuild the JIT, and I’ve gotten fairly efficient at it. I already have lazy code generation, most of the runtime library, and the Higgs REPL working. Many of the basic unit tests are working, the crucial bits are in place, but there are still many important parts missing (register allocation, garbage collection, exception hanling, and more). Since I want Higgs to eventually be considered a serious and usable compiler, I decided to create a dev-maxime development branch and avoid feature regressions on the master branch. The new JIT will make it to the master branch once all tests and benchmarks work, which should hopefully happen within two months or so.
The most difficult aspect of this work so far has been to come up with a coherent and efficient design for the incremental code generation. I decided to sketch things up and begin experimenting with the parts of the implementation that I thought would be most problematic right away, so that I could know about the problems that might arise early on. Much exploratory programming was involved, but the key design decisions have been made. Machine code is generated in one pass and directly appended to an executable heap. It’s generated in multiple small fragments which all contain position-independent code. The fragments may end with branches which can refer to other fragments. Each fragment keeps a closure that knows how to write its final end branch, so that these branches can be rewritten when the fragments are relocated.
In other news, I’m still waiting to know if my paper about basic block versioning is accepted at Compiler Construction 2014, I should have the answer by December 20th. I did get some very positive news since my last blog post, however. I’ve received (and accepted) an invitation to present Higgs and my research work at the mloc.js conference, which takes place mid-February in Budapest. Like Strange Loop, this conference will be filmed and the videos will be made available online. I’ve also received an invitation to present at Web Directions Code 2014. I’m very much looking forward to sharing my latest research findings.
My thesis advisor has recently shown more openness to the idea that I could do an industry internship before the end of my PhD, which seems like it would be a good move career wise. This would probably occur around December 2014 or January 2015, after the next round of conference deadlines. I’ve been approached by people from Apple (LLVM team) and IBM (JIT and dynamic language runtimes). Back at DConf 2013, I also had the opportunity to meet Andrei Alexandrescu and other members of the HipHop VM team. They seemed very interested in my work on basic block versioning and in getting me to come do an internship at Facebook. Most likely, I’ll try approaching people at Mozilla and Facebook to see if I could possibly work on either IonMonkey or HipHop VM as part of an internship.
When I started my PhD, I originally wanted to do research in machine learning. Unfortunately, after only a few months, I discovered that field wasn’t really for me. I found it to be overly theoretical, mathematical and dry for my taste. My creativity wasn’t really there. Instead, I found myself thinking back to my MSc project, and having several ideas about how to improve upon the design of McVM. I decided to follow my passion, change advisors and do a PhD in compiler design instead. Compilers are not really a glamorous CS subfield, but I think this may have turned out to be a good career choice. There are clearly more jobs in compilers than there are compiler experts.