There aren't many good resources on the Internet for optimizing compute-bound JavaScript, aside from one article from Google Developers that discusses string building and touches very briefly on some of the problems with functions. I'd bet this is mostly because many developers say "we'll leave optimization up to the JIT compiler," write code the idiomatic way, and assume that this translates into high performance, while the rest focus on non-compute workloads, whose performance is generally limited by networked resources. However, as I've found, the v8 optimizer, as good as it is, still struggles with a number of situations that are handled easily in C. I'm obviously not the first to notice this: projects like asm.js exist to identify a subset of standard JavaScript that can achieve high performance by allowing better optimization and, especially, by avoiding garbage collection.
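For illustration, asm.js code looks roughly like this (a minimal sketch of the style; AsmExample and add are my own illustrative names):

function AsmExample(stdlib, foreign, heap) {
    "use asm";
    // the |0 coercions mark every value as a 32-bit integer, so the
    // compiler can use unboxed machine arithmetic and never touch the GC
    function add(a, b) {
        a = a|0;
        b = b|0;
        return (a + b)|0;
    }
    return { add: add };
}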

However, I think there are a number of things that can be done to increase performance without resorting to asm.js. I've assembled four test cases/demonstrations that illustrate problematic areas for the JIT optimizer and show how I've hand-optimized them. All of the test cases will be discussed in terms of simple benchmark results, using Node.js's high-resolution timer for better precision. One important thing to note is that the relative speed increase sometimes appears huge (for instance, 10x faster), but we are looking at very simple, pared-down test cases. It is much more meaningful to consider the absolute speedup (5 microseconds vs. 10 microseconds), because in real code the function bodies will generally have far more meat on them, while the same amount of overhead will be eliminated.
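All of the timings below follow the same pattern, based on process.hrtime (a minimal sketch; bench is a hypothetical helper of my own, and the actual tests simply inline this pattern):

function bench(label, fn) {
    var start = process.hrtime();         // [seconds, nanoseconds]
    fn();
    var stop = process.hrtime(start);     // elapsed time since start
    var ms = (stop[0]*1e9 + stop[1])/1e6; // convert to milliseconds
    console.log(label + ': ' + ms + ' ms');
}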

My attempts at optimizing a graph processing library I have been working on identified function calls as the biggest source of overhead, because the optimizer doesn't appear to do much with them. I haven't tried to look at tracing or disassembly, because I want to keep this simple for now. Since functions appear to be the problem, let's start with a simple test: does v8 inline functions?

// Test 1: Function calls vs. hand-inlining
var iter = 10000;
var i, j, start, stop, time;

function inc(a) { return a+1; }

var x = 0;
start = process.hrtime();
for (i = 0; i < iter; ++i) {
    x = inc(x);
}
stop = process.hrtime(start);

time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T1//with function: '+time+' ms');

x = 0;
start = process.hrtime();
for (i = 0; i < iter; ++i) {
    x = x + 1;
}
stop = process.hrtime(start);

time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T1//hand inlined: '+time+' ms');

For me, the version calling inc() generally takes about 0.467 milliseconds to complete, while the hand-inlined version takes only 0.0365 ms. The answer, then, is that either v8 doesn't inline, or, if it does, it cannot match a hand-inlined version of the same code. I am not knowledgeable enough about the engine internals or the specification requirements of JavaScript to speculate on why this happens, but for now I am satisfied to note the problem and work around it in performance-critical code.

However, this leads to a serious problem from a design perspective. You can either write well-partitioned code that is easy to read and follow, with good compartmentalization for function reuse, or you can stuff everything into a single function and eliminate a significant amount of call overhead. I think this could be (easily?) rolled into a tool like UglifyJS to inline functions automatically, but at that point we're introducing a build step for an interpreted language. For scripts that will be sent to the browser, this is perfectly acceptable, as we'd generally want to pass them through UglifyJS anyway, but for server-side scripts it is extremely tedious. Yes, you can automate it with grunt or another task runner, but that just reintroduces exactly the type of process we are trying to eliminate by using JavaScript instead of C.
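To make that concrete, an inlining pass would perform a source-to-source rewrite along these lines (hypothetical output; UglifyJS does not actually do this today):

// before: the loop from Test 1
for (i = 0; i < iter; ++i) {
    x = inc(x);
}

// after: the body of inc() substituted at the call site
for (i = 0; i < iter; ++i) {
    x = x + 1;
}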

For now, I'll just say that I think some amount of function inlining should be possible for the JIT optimizer to do, but there are many instances (member functions, for example, due to the completely dynamic way in which methods are called) where this is not possible. Therefore, let's move on and examine the performance of JavaScript's iteration constructs.

One of the things that really attracted me to Node.js was Underscore.js, a functional programming library that provides the kinds of tools I would normally find in Scheme. The best thing about these tools is that they make it simple to write clean, easy-to-maintain code, and they help enforce good design principles by encouraging programmers to write small functions that do one task and then build larger processes out of those functions. It might be obvious in hindsight, but given that function optimization is not very effective, using functions to do array iteration is probably slower than using a for loop to do the same thing. Let's look at two variations: one testing Array.reduce, and one testing Array.forEach in an O(n^2) situation to see how the function overhead scales with nested loops.

// Test 2: Iteration methods vs. for loops
iter = 1000;
var elems = 100, sum;
var xa = [];

for (i = 0; i < elems; ++i) {
    xa[i] = Math.random();
}

// 2a: array sum with reduce
start = process.hrtime();
sum = xa.reduce(function(memo, v) {
    return memo + v;
}, 0);
stop = process.hrtime(start);

time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2A//using reduce: '+time+' ms');

start = process.hrtime();
sum = 0;
for (i = 0; i < xa.length; ++i) {
    sum += xa[i];
}
stop = process.hrtime(start);

time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2A//for loop: '+time+' ms');
console.log('');

// On my machine, times fluctuate a lot, but 0.10-0.16 ms vs 0.0054-0.0078 ms
// so the for loop is faster again

// 2b: nested iteration, calling a member function for each element of an array
function t2b_class() {
    this.val = Math.random();
}
var t2b_sum = 0;
t2b_class.prototype.work = function() {
    t2b_sum += this.val;
};

for (i = 0; i < elems; ++i) {
    xa[i] = [];
    for (j = 0; j < elems; ++j) {
        xa[i][j] = new t2b_class();
    }
}

start = process.hrtime();
xa.forEach(function(v) {
    v.forEach(function(v2) {
        v2.work();
    });
});
stop = process.hrtime(start);

time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2B//nested iterators: '+time+' ms');

function inner_iter(v2) {
    v2.work();
}
start = process.hrtime();
xa.forEach(function(v) {
    v.forEach(inner_iter);
});
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2B//nested iterators, avoids redeclaration: '+time+' ms');

start = process.hrtime();
for (i = 0; i < xa.length; ++i) {
    for (j = 0; j < xa[i].length; ++j) {
        xa[i][j].work();
    }
}
stop = process.hrtime(start);

time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2B//for loop: '+time+' ms');

For Test 2A, I find that using Array.reduce takes between 0.10 and 0.16 milliseconds, while implementing the reduce operation "the long way" with a for loop takes between 0.0045 and 0.0075 ms. These really aren't huge differences, but in principle they shouldn't exist at all, because from a semantic point of view the code does exactly the same thing in both implementations. For Test 2B, however, there is a much more significant difference. For the nested forEaches, I've tested two possible implementations: the "obvious" approach, which uses an anonymous inner function for the second loop, and the "observant" approach, which realizes that this function gets re-created on every iteration of the outer loop and therefore moves the declaration outside it. My results are that the obvious approach is the slowest, coming in at around 1.08-1.32 ms; then the observant approach, at around 0.87-0.95 ms; and then the explicit nested for loop, blowing both out of the water with a time of 0.22-0.35 ms.

The results for 2A are underwhelming, but 2B illustrates a much more significant problem with the use of functional techniques. It is really easy to write code in a functional style that ends up being really slow, because v8 does not optimize it effectively. For me, this is a disaster. Functional-style code is faster to write, easier to maintain, and generally cleaner than its imperative counterpart. Plus, implementing a map, reduce, or even just a for-each loop using the for construct and explicit iteration means repeating a lot of glue code that the array methods handle internally. Even C++11 now has language features to minimize this kind of repetition. As with function inlining, I maintain that for both 2A and 2B these transformations could easily be done at the source level by a tool like UglifyJS, but really, I feel the JIT optimizer should be capable of doing them as well.
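As an example, here is what such a source-level rewrite would look like for a map over an array (hypothetical before/after in the spirit of the tests above; ys and doubled are illustrative names):

// before: functional style, one callback invocation per element
var doubled = ys.map(function(v) { return v*2; });

// after: the callback body fused into an explicit loop
var doubled = new Array(ys.length);
for (var k = 0; k < ys.length; ++k) {
    doubled[k] = ys[k]*2;
}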

Now, let's look at another type of optimization that is important for performance when using recursive functions: tail-call optimization. I've written the canonical factorial function in three styles: first, a basic recursive implementation that works its way back up the stack to compute its result; second, a version that uses a tail call and could therefore be optimized without any source transformation; and third, a completely iterative version, as a baseline for what maximally effective tail-call optimization should achieve.

// Test 3: Tail calls vs. stack walking vs. iterative method
function fact_tail(n, accumulator) {
    if (n < 2)
        return accumulator;
    return fact_tail(n-1, accumulator*n);
}
function fact_stack(n) {
    if (n < 2)
        return 1;
    return n * fact_stack(n-1);
}
function fact_iter(n) {
    var accumulator = n;
    for (n = n-1; n > 1; --n) {
        accumulator = accumulator*n;
    }
    return accumulator;
}

iter = 1000;
var n = 30, result;

start = process.hrtime();
for (i = 0; i < iter; ++i) {
    result = fact_tail(n-1, n);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T3//Tail Factorial: '+time+' ms');

start = process.hrtime();
for (i = 0; i < iter; ++i) {
    result = fact_stack(n);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T3//Stack Factorial: '+time+' ms');

start = process.hrtime();
for (i = 0; i < iter; ++i) {
    result = fact_iter(n);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T3//Iterative Factorial: '+time+' ms');

The results are approximately what you'd expect. The tail call is not fully optimized, but it is faster than the version that walks the stack. I haven't done enough research to determine whether the speed boost is due to (some) actual optimization being performed or whether it is just a by-product of how temporary variables are used implicitly in the local scope. For me, the numbers come in at 0.515 ms for the basic stack version, 0.277 ms for the tail-call version, and 0.181 ms for the iterative version.

Again we see that a source transformation from "good, easy-to-follow code" to "awkward, but still makes sense if you stop to think about it" code results in improved performance. This is the same type of trade-off that we faced in the 90s, when C compilers did not optimize effectively. And, again, I would posit that this is the type of operation that should be done by the JIT optimizer. If only I had the necessary knowledge to actually try to modify v8 myself (I've looked at the code, and for all my complaints about its shortcomings, it is both incredibly impressive and far too complicated for me to "pick up" over a weekend and change).
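Tail-call optimization itself is a mechanical transformation: the recursive call is replaced by reassigning the parameters and jumping back to the top of the function. Applied to fact_tail, an optimizer would effectively produce something like this (my own sketch of what the generated code would be equivalent to):

function fact_tail_optimized(n, accumulator) {
    while (true) {
        if (n < 2)
            return accumulator;
        // instead of recursing with fact_tail(n-1, accumulator*n),
        // reuse the current stack frame
        accumulator = accumulator*n;
        n = n - 1;
    }
}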

Finally, I have one last optimization case. This one, however, is not something a JIT optimizer can do, because the result would no longer meet the language specification. For those of you familiar with C++, there are two basic kinds of methods in a class: "normal" methods, which have a fixed address known at compile time, and "virtual" methods, which are resolved at run time through the virtual function table stored within the object on which the method is called (I almost said "dynamically determined by the type," but that would be inaccurate, as virtual methods do not require RTTI). In JavaScript, all methods are implicitly virtual. When you call a method, the property must be looked up on the object every time, because it can be changed at any time. This is an incredibly useful feature of the language because of the flexibility it provides: polymorphism is "easy," objects can be monkey-patched to modify behavior, and some of the more subtle aspects of C++ go away (for instance, do you know off the top of your head which method is called if you dereference a base-class-type pointer to an instance of a derived class and then call a virtual method?). But this power comes at a cost. Even when a method is never used in this manner, we still pay the price of the lookup, and because the optimizer cannot know that we will never replace a method definition, it cannot substitute in the address of the specific function we want.
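To see why the optimizer's hands are tied, consider that any code can rebind a method at any time (a contrived sketch; Greeter is an illustrative name):

function Greeter() {}
Greeter.prototype.hello = function() { return 'hello'; };

var g = new Greeter();
g.hello();  // 'hello'

// rebinding on the prototype changes behavior for every instance...
Greeter.prototype.hello = function() { return 'goodbye'; };
g.hello();  // 'goodbye'

// ...and an own property on one instance shadows the prototype
g.hello = function() { return 'hi'; };
g.hello();  // 'hi'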

There is actually a simple, clever way to obtain static references to functions. When a method is called through the object as a property lookup, the prototype chain must be traversed to find the first matching function, which is then used. If we skip that step and name the specific function, on the specific prototype that we want, it should be slightly faster. We still pay the price of looking up a property on the prototype object, though, so we can take this a step further and save a variable that points precisely to the function we want to call. Because JavaScript lets us call any function with any "this" context, we can take that static function pointer and call it as though it were a member function, without going through the dynamic lookup phase. Let's look at a test that compares all three.

// Test 4: Function table overhead
function TestClass(a, b, c) {
    this.a = a;
    this.b = b;
    this.c = c;
}
TestClass.prototype.hypot = function() {
    return Math.sqrt(this.a*this.a + this.b*this.b + this.c*this.c);
};

iter = 10000;
var fn = TestClass.prototype.hypot;
xa = new TestClass(Math.random(), Math.random(), Math.random());

start = process.hrtime();
for (i = 0; i < iter; ++i) {
    result = xa.hypot();
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T4//Object method lookup: '+time+' ms');

start = process.hrtime();
for (i = 0; i < iter; ++i) {
    result = TestClass.prototype.hypot.call(xa);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T4//Prototype method lookup: '+time+' ms');

start = process.hrtime();
for (i = 0; i < iter; ++i) {
    result = fn.call(xa);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T4//Pre-saved method pointer: '+time+' ms');

For me, the results are that full dynamic lookup takes 0.194 ms, retrieving the method through the prototype takes 0.172 ms, and using a pre-saved pointer to the function takes only 0.135 ms. These absolute differences really aren't that significant, but this trick can be applied to every single non-polymorphic function call in your project, which makes this perhaps the one test case where relative speedup (30% faster with static function pointers) matters more than absolute. If you still think I'm crazy for even mentioning it, consider the extremely common task of converting an arguments object into a proper array by calling Array.prototype.slice on it. Many people save a reference to Array.prototype.slice at the top of their module and use it throughout their code; luckily, it's both shorter to type AND faster, which is great for libraries that provide a lot of functions accepting variable numbers of arguments.
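That idiom looks like this (a minimal sketch; takesVarargs is an illustrative name):

// save a static reference once, at module scope
var slice = Array.prototype.slice;

function takesVarargs() {
    // converts the arguments object into a real array without
    // a dynamic property lookup on every call
    var args = slice.call(arguments);
    return args;
}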

Member-function lookup optimization requires far more work than the other three, and it cannot be applied transparently by a JIT optimizer or by UglifyJS, because the language specification prevents them from having enough information to do it correctly. It also makes your code much, much more annoying to read. I could suggest writing an AltJS language that lets you differentiate between the two types of member functions, so that when it compiles to JS it emits static references where possible, but I'm pretty sure @horse_js would make fun of me for it (rightly so). Still, if you're in a situation where computational performance is just that important, I'd recommend at least considering static function references.

As for my graph library, by carefully rewriting code matching the four cases above, I was able to get the execution time for a single pass of the evaluator down from 25.5 usec to 7.4 usec, which increases the number of users my application can support on a given amount of hardware by about 3.3x, immediately translating into lower operating costs. I wish I could share the code (maybe one day, but for now it's in the "secret project" category), but hopefully an unsubstantiated anecdote like this helps convince you that these optimizations benefit real code that performs real tasks, not just overly simple test cases. There is an important trade-off among readability, maintainability, and performance, so I wouldn't recommend always writing code that looks like this. But if you're working on a compute-bound application (mostly JavaScript-based games, of which we will see many as WebGL and HTML5 improve), consider writing your code in three steps: first, do it the easy/fast/right way, producing good, clean code. Then, write a very good set of test cases. Finally, optimize your code to death, using your test cases to verify that you haven't broken anything.

The complete benchmark file used to generate my numbers is available as a gist in case you don't feel like cobbling together the snippets given above.