There aren't very many good resources on the Internet for optimizing compute-bound JavaScript, aside from this article from Google Developers discussing string building and touching very briefly on some of the problems with functions. For the most part, I'd bet this is because a lot of developers say "we'll leave it up to the JIT compiler to do optimizations" and then write code the idiomatic way, assuming that this will translate into high performance, while the rest focus on non-compute usage, whose performance is generally limited by networked resources. However, as I've found, the v8 optimizer, as good as it is, still struggles with a number of situations that are easily handled in C. Obviously, I'm not the first to notice this: projects like asm.js exist to identify a subset of standard JavaScript that can achieve high performance by allowing for better optimization and, especially, by avoiding garbage collection.
However, I think there are a number of things that can be done to increase performance without resorting to asm.js. I've assembled four test cases/demonstrations to illustrate some problematic areas for the JIT optimizer and to show how I've hand-optimized them. All of the test cases will be discussed in terms of simple benchmark results, using Node.js's high-resolution timer for extra precision. One important thing to note is that the relative speed increase sometimes appears huge (for instance, 10x faster), but we are really looking at very simple, pared-down test cases. Here it is much more meaningful to consider the absolute speedup (5 microseconds vs. 10 microseconds), because in real code the function bodies will generally have far more meat on them, while the same amount of overhead will be eliminated.
My attempts at optimizing a graph processing library that I have been working on identified function calls as the biggest source of overhead, because the optimizer doesn't appear to do much with them. I haven't tried to look at traces or disassembly, because I want to keep this simple for now. Since functions appear to be the problem, let's start with a simple test: does v8 inline functions?
// Test 1: does v8 inline small functions?
var iter = 10000;
var i, j, start, stop, time;
function inc(a) { return a+1; }
var x = 0;
start = process.hrtime();
for (i = 0; i < iter; ++i) {
  x = inc(x);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6; // [seconds, nanoseconds] -> milliseconds
console.log('T1//with function: '+time+' ms');
x = 0; // reset for the hand-inlined measurement
start = process.hrtime();
for (i = 0; i < iter; ++i) {
  x = x + 1;
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T1//hand inlined: '+time+' ms');
For me, the version calling inc()
generally takes about 0.467 milliseconds to
complete, while the hand-inlined version takes only 0.0365 ms. The answer,
then, is that either v8 doesn't inline, or if it does, it cannot match a
hand-inlined version of the same code. I am not knowledgeable enough about the
engine internals or about the specification requirements for JavaScript to
speculate on why this happens, but for now, I am satisfied to notice the
problem and try to work around it in performance-critical code.
However, this actually leads to a serious problem from a design perspective. You can either write well-partitioned code that is easy to read and follow, with good compartmentalization for function reuse, or you can stuff it all into a single function and eliminate a significant amount of call overhead. I think this could be (easily?) rolled into a tool like UglifyJS to inline functions automatically, but at that point we're introducing a build step for an interpreted language. For scripts that will be sent to the browser, this is perfectly acceptable, as we generally want to pass these through UglifyJS anyway (although ideally not as a separate build step), but for server-side scripts it is extremely tedious. Yes, you can automate it with grunt or another task runner, but that just reintroduces the exact type of process we are trying to eliminate by using JavaScript instead of C.
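To make that concrete, here is the kind of rewrite such an inlining tool would perform, sketched by hand on the inc() example from Test 1 (hypothetical output, not something any particular minifier is promised to produce):
// Hand-sketched inlining pass: the call site is replaced by the function
// body, with the argument substituted for the parameter.
// Before:
for (i = 0; i < iter; ++i) {
  x = inc(x);
}
// After:
for (i = 0; i < iter; ++i) {
  x = x + 1; // body of inc() substituted in place
}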
For now, I'll just say that I think some amount of function inlining should be possible for the JIT optimizer to do, but there are many cases (member functions, for instance, because of the completely dynamic way in which methods are resolved) where it is not. Therefore, let's move on and examine the performance of JavaScript's iteration constructs.
One of the things that really attracted me to Node.js was
Underscore.js, a functional programming library
that provides the types of things I would normally find in Scheme. The best
thing about these tools is that they make it simple to write clean, easy to
maintain code, and they help enforce good design principles by encouraging
programmers to write small functions for one task, and then to build larger
processes out of those functions. It might be obvious in hindsight, but
considering that function optimization is not very effective, using functions
to do array iteration is probably slower than using a for loop to do the same
thing. Let's look at two variations: one testing Array.reduce, and one testing Array.forEach in an O(n^2) situation, to see how the function overhead scales with nested loops.
// Test 2: Iteration methods vs. for loops
iter = 1000;
var elems = 100, sum;
var xa = [];
for (i = 0; i < elems; ++i) {
  xa[i] = Math.random();
}
// 2a: array sum with reduce
start = process.hrtime();
sum = xa.reduce(function(memo, v) {
  return memo + v;
}, 0);
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2A//using reduce: '+time+' ms');
start = process.hrtime();
sum = 0;
for (i = 0; i < xa.length; ++i) {
  sum += xa[i];
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2A//for loop: '+time+' ms');
console.log('');
// On my machine, times fluctuate a lot, but 0.10-0.16 ms vs 0.0054-0.0078 ms,
// so the for loop is faster again
// 2b: nested iteration, calling a member function for each element of an array
function t2b_class() {
  this.val = Math.random();
}
var t2b_sum = 0;
t2b_class.prototype.work = function() {
  t2b_sum += this.val;
};
for (i = 0; i < elems; ++i) {
  xa[i] = [];
  for (j = 0; j < elems; ++j) {
    xa[i][j] = new t2b_class();
  }
}
start = process.hrtime();
xa.forEach(function(v) {
  v.forEach(function(v2) {
    v2.work();
  });
});
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2B//nested iterators: '+time+' ms');
function inner_iter(v2) {
  v2.work();
}
start = process.hrtime();
xa.forEach(function(v) {
  v.forEach(inner_iter);
});
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2B//nested iterators, avoids redeclaration: '+time+' ms');
start = process.hrtime();
for (i = 0; i < xa.length; ++i) {
  for (j = 0; j < xa[i].length; ++j) {
    xa[i][j].work();
  }
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2B//for loop: '+time+' ms');
For Test 2A, I find that using Array.reduce takes between 0.10 and 0.16 milliseconds, while using a for loop to implement the reduce operation "the long way" takes between 0.0045 and 0.0075 ms. These really aren't huge differences, but in principle they shouldn't exist at all, since, semantically, both implementations do exactly the same thing. For Test 2B, however, there is a much more significant difference. For the nested forEaches, I've tested two possible implementations: the "obvious" approach, which uses an anonymous inner function for the second loop, and the "observant" approach, which recognizes that this inner function gets re-created on every iteration of the outer loop and therefore moves its declaration outside. My results are that the obvious approach is the slowest, coming in at around 1.08-1.32 ms; then the observant approach, at around 0.87-0.95 ms; and then the explicit nested for loop, blowing both out of the water with a time of 0.22-0.35 ms.
The results for 2A are underwhelming, but 2B illustrates a much more significant problem with the use of functional techniques. It's really easy to write code in a functional style that ends up being really slow, because v8 does not optimize it effectively. For me, this is a disaster. Functional-style code is faster to write, easier to maintain, and generally cleaner than its imperative counterpart. Plus, implementing a map, reduce, or even just a for-each loop using the for construct and explicit iteration means repeating a lot of glue code that the iteration methods handle internally. Even C++11 now has language features (lambdas and range-based for loops) to minimize this kind of repetition. As with function inlining, I maintain that for both 2A and 2B these types of transformations could easily be done by a tool like UglifyJS at the source level, but really, I feel that the JIT optimizer should be capable of doing them as well.
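As a concrete sketch, here is what that source-level rewrite would look like for Test 2A (hand-written, not the output of an actual tool; the transformation is only safe when the callback ignores the extra index/array arguments that reduce supplies and the array has no holes):
// Functional form:
sum = xa.reduce(function(memo, v) { return memo + v; }, 0);
// Mechanically derived imperative form:
sum = 0;
for (i = 0; i < xa.length; ++i) {
  sum = sum + xa[i];
}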
Now, let's look at another type of optimization that is important for the performance of recursive functions: tail call optimization. I've written the canonical factorial function in three styles: first, a basic recursive implementation that works its way back up the stack to compute its result; second, a version that uses tail calls and could therefore be optimized without any source transformation; and third, a completely iterative version, to determine whether v8 performs maximally effective tail-call optimization.
// Test 3: Tail calls vs. stack walking vs. iterative method
function fact_tail(n, accumulator) {
  if (n < 2)
    return accumulator;
  return fact_tail(n-1, accumulator*n);
}
function fact_stack(n) {
  if (n < 2)
    return 1;
  return n * fact_stack(n-1);
}
function fact_iter(n) {
  var accumulator = n;
  for (n = n-1; n > 1; --n) {
    accumulator = accumulator*n;
  }
  return accumulator;
}
iter = 1000;
var n = 30, result;
start = process.hrtime();
for (i = 0; i < iter; ++i) {
  result = fact_tail(n-1, n);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T3//Tail Factorial: '+time+' ms');
start = process.hrtime();
for (i = 0; i < iter; ++i) {
  result = fact_stack(n);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T3//Stack Factorial: '+time+' ms');
start = process.hrtime();
for (i = 0; i < iter; ++i) {
  result = fact_iter(n);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T3//Iterative Factorial: '+time+' ms');
The results are approximately what you'd expect. The tail call is not fully
optimized, but it is
faster than the version using the stack. I haven't
done enough research to determine whether the speed boost is due to (some)
actual optimization being performed or if it's just a by-product of how
temporary variables are used implicitly in the local scope. The numbers for me
come in at 0.515 ms for the basic stack version, 0.277 ms for the tail-call
version, and 0.181 ms for the iterative version.
Again we see that a source transformation from "good, easy-to-follow code" to "awkward, but still makes sense if you stop to think about it" code results in improved performance. This is the same type of trade-off that we faced in the 90s, when C compilers did not optimize effectively. And, again, I would posit that this is the type of operation that should be done by the JIT optimizer. If only I had the necessary knowledge to actually try to modify v8 myself (I've looked at the code, and for all my complaints about shortcomings, it is both incredibly impressive and far too complicated for me to "pick up" over a weekend and change).
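For reference, the transformation a tail-call optimizer performs is mechanical: since nothing remains to be done after the recursive call, the call can be replaced by rebinding the parameters and jumping back to the top of the function. Here is that rewrite applied by hand to fact_tail (a sketch of what the optimizer would generate, not part of the benchmark):
// fact_tail with the tail call replaced by a loop:
function fact_tail_opt(n, accumulator) {
  while (n >= 2) {
    accumulator = accumulator * n; // compute the new argument values...
    n = n - 1;                     // ...and "call" again by looping
  }
  return accumulator;
}
// fact_tail_opt(n-1, n) returns the same result as fact_tail(n-1, n)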
Finally, I have one last optimization case. This one, however, is not something that a JIT optimizer can do, because the result would no longer meet the language specification. For those of you familiar with C++, there are two basic types of methods in a class: "normal" methods, whose addresses are fixed and known at compile time, and "virtual" methods, which are resolved at run time through the virtual function table referenced by the object on which the method is being called (I almost said "dynamically determined by the type," but that would be inaccurate, as virtual methods do not require RTTI). In JavaScript, all methods are implicitly virtual. When you call one, the corresponding property on the object must be looked up every time, because it can be changed at any time. This is an incredibly useful feature of the language because of the flexibility it provides: polymorphism is "easy," objects can be monkey-patched to modify behavior, and some of the more subtle aspects of C++ go away (for instance, do you know off the top of your head which method is called if you dereference a base-class-type pointer to an instance of a derived class and then call a virtual method?). But this power comes at a cost: even when a method isn't used in this manner, we still pay the price of the lookups, and because the optimizer cannot know that we will never replace a method's definition, it can't substitute in the address of the specific function we want.
There is actually a simple, clever way to obtain static references to functions. When a method is called through the object as a property lookup, the prototype chain must be traversed to find the first matching function, which is then used. If we skip that step and name the specific function on the specific prototype that we want, it should be slightly faster. We still pay the price of looking up a property on the prototype object, though, so we can take this a step further and save a variable that points directly to the function we want to call. Because JavaScript lets us call any function with any "this" context, we can take that static function pointer and invoke it as though it were a member function, without going through the dynamic lookup phase. Let's look at a test comparing all three.
// Test 4: Function table overhead
function TestClass(a, b, c) {
  this.a = a;
  this.b = b;
  this.c = c;
}
TestClass.prototype.hypot = function() {
  return Math.sqrt(this.a*this.a + this.b*this.b + this.c*this.c);
};
iter = 10000;
var fn = TestClass.prototype.hypot;
xa = new TestClass(Math.random(), Math.random(), Math.random());
start = process.hrtime();
for (i = 0; i < iter; ++i) {
  result = xa.hypot();
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T4//Object method lookup: '+time+' ms');
start = process.hrtime();
for (i = 0; i < iter; ++i) {
  result = TestClass.prototype.hypot.call(xa);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T4//Prototype method lookup: '+time+' ms');
start = process.hrtime();
for (i = 0; i < iter; ++i) {
  result = fn.call(xa);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T4//Pre-saved method pointer: '+time+' ms');
For me, the results are that full dynamic lookup takes 0.194 ms, retrieving the method through the prototype takes 0.172 ms, and using a pre-saved pointer to the function takes only 0.135 ms. These absolute differences really aren't that significant, but the technique can be applied to every single non-polymorphic function call in your project, which makes this the test case where the relative speedup (30% faster with static function pointers) matters more than the absolute one.
If you still think I'm crazy for even mentioning this, consider the extremely common task of converting an arguments object into a proper array by calling Array.prototype.slice on it. Many people save a reference to Array.prototype.slice at the top of their module and use it throughout their code; luckily, it's both shorter to type AND faster, which is great for libraries that provide a lot of functions accepting variable numbers of arguments.
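A minimal sketch of that pattern (logAll is just a hypothetical variadic function for illustration):
// Save a static reference to slice once, at module load time.
var slice = Array.prototype.slice;
function logAll() {
  // Convert the arguments object into a real array without re-resolving
  // Array.prototype.slice through the prototype chain on every call.
  var args = slice.call(arguments);
  for (var k = 0; k < args.length; ++k) {
    console.log(args[k]);
  }
}
logAll(1, 'two', 3);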
Member-function lookup optimization requires far more work than the other three, and it cannot be applied transparently by a JIT optimizer or by UglifyJS, because the language specification prevents them from having enough information to do so correctly. It also makes your code much, much more annoying to read. I could suggest writing an AltJS language that lets you differentiate between the two types of member functions, so that when it compiles to JS it spits out static references where possible, but I'm pretty sure @horse_js would make fun of me for it (rightly so). Still, if you're in a situation where computational performance is just that important, I'd recommend at least considering static function references.
As for my graph library, by carefully rewriting code matching the four cases above, I was able to get the execution time for a single pass of the evaluator down from 25.5 usec to 7.4 usec, which increases the number of users my application can support on a given amount of hardware by 3.3x, translating immediately into lower operating costs. I wish I could share the code (maybe one day, but for now it's under the "secret project" category), but hopefully even an unsubstantiated anecdote like this will help convince you that these optimizations benefit real code performing real tasks, not just overly simple test cases. There is often an important trade-off to consider among readability, maintainability, and performance, so I wouldn't recommend always writing code that looks like this. But if you're working on a compute-bound application (mostly JavaScript-based games, of which we will see many more as WebGL and HTML5 mature), consider writing your code in three steps: first, do it the easy/fast/right way, producing good, clean code. Then, write a very good set of test cases. Finally, optimize your code to death, and use the test cases to verify that you haven't broken anything.
The complete benchmark file used to generate my numbers is available as a
gist in case you don't feel like
cobbling together the snippets given above.