Since its release last week, I’ve been playing quite a bit of Fallout 4. There’s an interesting mini-game (which was in previous iterations as well) for “hacking” computer terminals, where you must guess the passcode on a list of possibilities with a limited number of guesses. Each failed guess provides the number of correct letters (in both value and position) in that particular word, but not which letters were correct, allowing you to deduce the correct passcode similarly to the game “Mastermind.” A natural question is, “what is the best strategy for identifying the correct passcode?” We’ll ignore the possibility of dud removal and guess resets (which exist to simplify it a bit in game) for the analysis.

Reformulating this as a probability question offers a framework to design the best strategy. First, some definitions: $N$ denotes the number of words, $z$ denotes the correct word, and $x_i$ denotes a word on the list (in some consistent order). A simple approach suggests that we want to use the maximum likelihood (ML) estimate of $z$ to choose the next word based on all the words guessed so far and their results:

$\hat{z} = \underset{x_i}{\mathrm{argmax}}~~\mathrm{Pr}(z=x_i)$

However, for the first word, the probability prior is uniform—each word is equally likely. This might seem like the end of the line, so just pick the first word randomly (or always pick the first one on the list, for instance). However, future guesses depend on what this first guess tells us, so we’d be better off with an estimate which maximizes the mutual information between the guess and the unknown password. Using the concept of entropy (which I’ve discussed briefly before), we can formalize the notion of “mutual information” into a mathematical definition: $I(z, x) = H(z) - H(z|x)$. In this sense, “information” is what you gain by making an observation, and it is measured by how it affects the possible states for a latent variable to take. For more compact notation, let’s define $F_i=f(x_i)$ as the “result” random variable for a particular word, telling us how many letters matched, taking values $\{0,1,...,M\}$, where $M$ is the length of words in the current puzzle. Then, we can change our selection criteria to pick the maximum mutual information:

$\hat{z} = \underset{x_i}{\mathrm{argmin}}~~H(z|F_i)$

But, we haven’t talked about what “conditional entropy” might mean, so it’s not yet clear how to calculate $H(z | F_i)$, apart from it being the entropy after observing $F_i$‘s value. Conditional entropy is distinct from conditional probability in a subtle way: conditional probability is based on a specific observation, such as $F_i=1$, but conditional entropy is based on all possible observations and reflects how many possible system configurations there are after making an observation, regardless of what its value is. It’s a sum of the resulting entropy after each possible observation, weighted by the probability of that observation happening:

$H(Z | X) = \sum_{x\in X} p(x)H(Z | X = x)$

As an example, let’s consider a puzzle with $M=5$ and $N=10$. We know that $\forall x_i,\mathrm{Pr}(F_i=5)=p_{F_i}(5)=0.1$. If we define the similarity function $L(x_i, x_j)$ to be the number of letters that match in place and value for two words, and we define the group of sets $S^{k}_{i}=\{x_j:L(x_i,x_j)=k\}$ as the candidate sets, then we can find the probability distribution for $F_i$ by counting,

$p_{F_i}(k)=\frac{\vert{S^k_i}\vert}{N}$

As a sanity check, we know that $\vert{S^5_i}\vert=1$ because there are no duplicates, and therefore this equation matches our intuition for the probability of each word being an exact match. With the definition of $p_{F_i}(k)$ in hand, all that remains is finding $H(z | F_i=k)$, but luckily our definition for $S^k_i$ has already solved this problem! If $F_i=k$, then we know that the true solution is uniformly distributed in $S^k_i$, so

$H(z | F_i=k) = \log_2\vert{S^k_i}\vert$.

Finding the best guess is as simple as enumerating $S^k_i$ and then finding the $x_i$ which produces the minimum conditional entropy. For subsequent guesses, we simply augment the definition for the candidate sets by further stipulating that set members $x_j$ must also be in the observed set for all previous iterations. This is equivalent to taking the set intersection, but the notation gets even messier than we have so far, so I won’t list all the details here.

All that said, this is more of an interesting theoretical observation than a practical one. Counting all of the sets by hand generally takes longer than a simpler strategy, so it is not well suited for human use (I believe it is $O(n^2)$ operations for each guess), although a computer can do it effectively. Personally, I just go through and find all the emoticons to remove duds and then find a word that has one or two overlaps with others for my first guess, and the field narrows down very quickly.

Beyond its appearance in a Fallout 4 mini-game, the concept of “maximum mutual information” estimation has broad scientific applications. The most notable in my mind is in machine learning, where MMI is used for training classifiers, in particular, Hidden Markov Models (HMMs) such as those used in speech recognition. Given reasonable probability distributions, MMI estimates are able to handle situations where ML estimates appear ambiguous, and as such they are able to be used for “discriminative training.” Typically, an HMM training algorithm would receive labeled examples of each case and learn their statistics only. However, a discriminative trainer can also consider the labeled examples of other cases in order to improve classification when categories are very similar but semantically distinct.

One of the benefits of an object oriented programming language is that functionality can be built into the object hierarchy in layers, deferring some details to a descendant while providing common functionality in a base class and avoiding repetition. However, when a group of classes all need to implement nearly the same method but with minor, class specific differences (more than just a constant), a curious anti-pattern tends to arise. In this situation, the base class implementation contains different behaviors based on the type identification of the true object even though we desire to push those implementations to the descendants.

In the Star Wars Combine’s (SWC) codebase there are several examples of this, but today we’ll use one from the actions system to illustrate the point. In this game, there are many different forms of travel, each of which works mostly the same way. Each works in steps, where the system establishes a deadline for the next travel update and then changes an entity’s position after it expires. There are also common behaviors that need to be included, such as awarding experience points and sending events when travel expires.

The core functionality is captured in the abstract class StepTravelAction, which defines several key methods:

abstract class StepTravelAction extends BaseAction {

public function setup(Entity $leader,$x, $y) { // ... } public function timerInterrupt(TimerInterrupt$payload) {
// ...
}

public function abort() {
// ...
}

public function finishTravelling($reason = null) {$leader = $this->leader;$travelType = $this->getTravelType();$party = $this->getParty(); XPUtil::giveTravelXP($this->leader, $party,$this->XP, 'Finished ' . $this->getEventText()); RewardUtil::testUniqueRewards($this->leader, enumRewardType::GROUND_TRAVEL);

foreach ($this->getParty() as$member) {
$member->travelID = null; } EntityLocationUtil::saveEntityLocation($leader);

$this->delete(); if ($reason == null) {
$reason = TravelEndInterrupt::FINISH_NORMAL; } InterruptUtil::fire($leader, new TravelEndInterrupt($leader,$travelType, $reason)); } abstract public function setupTravel(); abstract public function applyStep(Entity$leader, Point $next); abstract public function getTravelType(); // ... }  Although the function bodies for some functions have been removed in the interest of space, we can see the core functionality, like aborting or finishing travel, is handled here in the base class. The concrete, travel type-specific details of things, like updating the entity’s position, are farmed out to the derived classes. However, at some point, it was decided that ground travel was awarding too much experience and could be abused by players (mostly to power level NPCs who can be set to patrol a location indefinitely, which takes months otherwise). Experience is awarded inside finishTravelling() as we can see, which is “common” to all of the travel action classes. At this point, there are several options for modifying the implementation, and the “simplest” in a language with dynamic type identification produces a design antipattern. To reduce the XP awarded in ground travel only, an earlier programmer elected to add three lines to StepTravelAction::finishTravelling(), quickly resolving the issue:  public function finishTravelling($reason = null) {
// ...

if ($this instanceof GroundTravelAction) {$this->XP = ceil($this->XP / 5); } // ... }  This ignores the benefits of object inheritance, produces less elegant code, and reduces the sustainability of the code in the future. Behavior specific to GroundTravelAction is now no longer contained within GroundTravelAction itself, so if we wanted to further modify the XP for this case, a lot of code walking would be needed to figure out where to do it. If multiple such exceptions are added, you might as well not have support for polymorphism at all and do everything in a single struct that stores type IDs and uses switch statements, taking us back to the early 80s. Exceptions like this were added to several other methods (abort(), etc.) for the same change as well. The correct approach here is to refactor this method into three different components: 1. a concrete version of finishTravelling that follows the original implementation 2. a virtual method implemented in StepTravelAction that simply forwards to the concrete implementation 3. a virtual method in the descending class that layers additional functionality (such as changing the XP to be awarded) and then calls the concrete version from its parent We need all three components because the default implementation is correct in most cases, so it should be available as the default behavior, but when we want to modify it we still need to have the original version available to avoid copying and pasting it into the derived class. I think it might be even worse if someone were to simply use a virtual method and copy it entirely into the derived class in order to add three lines. For completeness, a preferable implementation would look like this and preserve the same behavior for all other derived classes: abstract class StepTravelAction extends BaseAction { protected function _finishTravelling($reason = null) {
$leader =$this->leader;
$travelType =$this->getTravelType();
$party =$this->getParty();

XPUtil::giveTravelXP($this->leader,$party, $this->XP, 'Finished ' .$this->getEventText());
RewardUtil::testUniqueRewards($this->leader, enumRewardType::GROUND_TRAVEL); foreach ($this->getParty() as $member) {$member->travelID = null;
}
EntityLocationUtil::saveEntityLocation($leader);$this->delete();

if ($reason == null) {$reason = TravelEndInterrupt::FINISH_NORMAL;
}

InterruptUtil::fire($leader, new TravelEndInterrupt($leader, $travelType,$reason));
}

public function finishTravelling($reason = null) {$this->_finishTravelling($reason); } } class GroundTravelAction extends StepTravelAction { public function finishTravelling($reason = null) {
$this->XP = ceil($this->XP / 5);
$this->_finishTravelling($reason);
}

}


The reason I want to highlight this is that the problem arose despite the developer having a familiarity with inheritance and deferring implementation details to derived classes where meaningful. There are abstract methods here and they are used to plug into common functionality implemented within StepTravelAction, but while targeting a change in behavior it is easy to lose sight of the overall design and simply insert a change where the implementation is visible. This kind of polymorphism antipattern is one of the most common implementation issues in SWC, likely due to the availability of the instanceof operator in PHP. In C++ it is a lot harder to do this (although RTTI does exist, I never use it), and people are often forced to learn the correct approach as a result.

Everything that happens in the world can be described in some way. Our descriptions range from informal and causal to precise and scientific, yet ultimately they all share one underlying characteristic: they carry an abstract idea known as “information” about what is being described. In building complex systems, whether out of people or machines, information sharing is central for building cooperative solutions. However, in any system, the rate at which information can be shared is limited. For example, on Twitter, you’re limited to 140 characters per message. With 802.11g you’re limited to 54 Mbps in ideal conditions. In mobile devices, the constraints go even further: transmitting data on the network requires some of our limited bandwidth and some of our limited energy from the battery.

Obviously this means that we want to transmit our information as efficiently as possible, or, in other words, we want to transmit a representation of the information that consumes the smallest amount of resources, such that the recipient can convert this representation back into a useful or meaningful form without losing any of the information. Luckily, the problem has been studied pretty extensively over the past 60-70 years and the solution is well known.

First, it’s important to realize that compression only matters if we don’t know exactly what we’re sending or receiving beforehand. If I knew exactly what was going to be broadcast on the news, I wouldn’t need to watch it to find out what happened around the world today, so nothing would need to be transmitted in the first place. This means that in some sense, information is a property of things we don’t know or can’t predict fully, and it represents the portion that is unknown. In order to quantify it, we’re going to need some math.

Let’s say I want to tell you what color my car is, in a world where there are only four colors: red, blue, yellow, and green. I could send you the color as an English word with one byte per letter, which would require 3, 4, 5, or 6 bytes, or we could be cleverer about it. Using a pre-arranged scheme for all situations where colors need to be shared, we agree on the convention that the binary values 00, 01, 10, and 11 map to our four possible colors. Suddenly, I can use only two bits (0.25 bytes, far more efficient) to tell you what color my car is, a huge improvement. Generalizing, this suggests that for any set $\chi$ of abstract symbols (colors, names, numbers, whatever), by assigning each a unique binary value, we can transmit a description of some value from the set using $\log_2(|\chi|)$ bits on average, if we have a pre-shared mapping. As long as we use the mapping multiple times it amortizes the initial cost of sharing the mapping, so we’re going to ignore it from here out. It’s also worthwhile to keep this limit in mind as a max threshold for “reasonable;” we could easily create an encoding that is worse than this, which means that we’ve failed quite spectacularly at our job.

But, if there are additional constraints on which symbols appear, we should probably be able to do better. Consider the extreme situation where 95% of cars produced are red, 3% blue, and only 1% each for yellow and green. If I needed to transmit color descriptions for my factory’s production of 10,000 vehicles, using the earlier scheme I’d need exactly 20,000 bits to do so by stringing together all of the colors in a single sequence. But, given that by the law of large numbers, I can expect roughly 9,500 cars to be red, so what if I use a different code, where red is assigned the bit string 0, blue is assigned 10, yellow is assigned 110, and green 111? Even though the representation for two of the colors is a bit longer in this scheme, the total average encoding length for a lot of 10,000 cars decreases to 10,700 bits (1*9500 + 2*300 + 3*100 + 3*100), almost an improvement of 50%! This suggests that the probabilities for each symbol should impact the compression mapping, because if some symbols are more common than others, we can make them shorter in exchange for making less common symbols longer and expect the average length of a message made from many symbols to decrease.

So, with that in mind, the next logical question is, how well can we do by adapting our compression scheme to the probability distribution for our set of symbols? And how do we find the mapping that achieves this best amount of compression? Consider a sequence of $n$ independent, identically distributed symbols taken from some source with known probability mass function $p(X=x)$, with $S$ total symbols for which the PMF is nonzero. If $n_i$ is the number of times that the $i$th symbol in the alphabet appears in the sequence, then by the law of large numbers we know that for large $n$ it converges almost surely to a specific value: $\Pr(n_i=np_i)\xrightarrow{n\to \infty}1$.

In order to obtain an estimate of the best possible compression rate, we will use the threshold for reasonable compression identified earlier: it should, on average, take no more than approximately $\log_2(|\chi|)$ bits to represent a value from a set $\chi$, so by finding the number of possible sequences, we can bound how many bits it would take to describe them. A further consequence of the law of large numbers is that because $\Pr(n_i=np_i)\xrightarrow{n\to \infty}1$ we also have $\Pr(n_i\neq np_i)\xrightarrow{n\to \infty}0$. This means that we can expect the set of possible sequences to contain only the possible permutations of a sequence containing $n_i$ realizations of each symbol. The probability of a specific sequence $X^n=x_1 x_2 \ldots x_{n-1} x_n$ can be expanded using the independence of each position and simplified by grouping like symbols in the resulting product:

$P(x^n)=\prod_{k=1}^{n}p(x_k)=\prod_{i=1}^{S} p_i^{n_i}=\prod_{i=1}^{S} p_i^{np_i}$

We still need to find the size of the set $\chi$ in order to find out how many bits we need. However, the probability we found above doesn’t depend on the specific permutation, so it is the same for every element of the set and thus the distribution of sequences within the set is uniform. For a uniform distribution over a set of size $|\chi|$, the probability of a specific element is $\frac{1}{|\chi|}$, so we can substitute the above probability for any element and expand in order to find out how many bits we need for a string of length $n$:

$B(n)=-\log_2(\prod_{i=1}^Sp_i^{np_i})=-n\sum_{i=1}^Sp_i\log_2(p_i)$

Frequently, we’re concerned with the number of bits required per symbol in the source sequence, so we divide $B(n)$ by $n$ to find $H(X)$, a quantity known as the entropy of the source $X$, which has PMF $P(X=x_i)=p_i$:

$H(X) = -\sum_{i=1}^Sp_i\log_2(p_i)$

The entropy, $H(X)$, is important because it establishes the lower bound on the number of bits that is required, on average, to accurately represent a symbol taken from the corresponding source $X$ when encoding a large number of symbols. $H(X)$ is non-negative, but it is not restricted to integers only; however, achieving less than one bit per symbol requires multiple neighboring symbols to be combined and encoded in groups, similarly to the method used above to obtain the expected bit rate. Unfortunately, that process cannot be used in practice for compression, because it requires enumerating an exponential number of strings (as a function of a variable tending towards infinity) in order to assign each sequence to a bit representation. Luckily, two very common, practical methods exist, Huffman Coding and Arithmetic Coding, that are guaranteed to achieve optimal performance.

For the car example mentioned earlier, the entropy works out to about 0.35 bits, which means there is significant room for improvement over the symbol-wise mapping I suggested, which only achieved a rate of 1.07 bits per symbol, but it would require grouping multiple car colors into a compound symbol, which quickly becomes tedious when working by hand. It is kind of amazing that using only ~3,500 bits, we could communicate the car colors that naively required 246,400 bits (=30,800 bytes) by encoding each letter of the English word with a single byte.

$H(X)$ also has other applications, including gambling, investing, lossy compression, communications systems, and signal processing, where it is generally used to establish the bounds for best- or worst-case performance. If you’re interested in a more rigorous definition of entropy and a more formal derivation of the bounds on lossless compression, plus some applications, I’d recommend reading Claude Shannon’s original paper on the subject, which effectively created the field of information theory.

For a client’s website, I needed to enumerate the 12 months preceding a given date to create links to archived content. The site uses a javascript templating engine to create HTML, offloading the process from the server, so generating the list of months on the client side in javascript seemed like a reasonable choice. For the past week, everything looked great, but suddenly today I noticed that it was repeating the current month.

The code itself is pretty simple. It’s written as a dustjs helper function that uses the Date.setMonth function to handle wrapping from January of one year to December of the previous year.

dust.helpers.month_list = function(chunk, ctx, bodies, params) {
var count = dust.helpers.tap(params.count, chunk, ctx) | 0;
var curDate = new Date();

for (var i = 0; i < count; ++i) {
var dateStr = (1900 + curDate.getYear()) + '-' + (curDate.getMonth()+1);
chunk = chunk.render(bodies.block, ctx.push({date : dateStr}));
curDate.setMonth(curDate.getMonth()-1);
}

return chunk;
}


Some quick printf debugging, adding console.log(curDate) at the start of the loop, shows this surprising result:

Tue Dec 31 2013 20:47:14 GMT-0800 (Pacific Standard Time)
Sun Dec 01 2013 20:47:14 GMT-0800 (Pacific Standard Time)
Fri Nov 01 2013 20:47:14 GMT-0700 (Pacific Daylight Time)
Tue Oct 01 2013 20:47:14 GMT-0700 (Pacific Daylight Time)


Apparently, on the 31st of the month, subtracting one month from the current date does not have the desired effect, in Chrome (on Windows 8.1). I ran the test again in IE 11 and observed the same behavior, as well as tried by manually setting the date to the 31st of October and subtracting a month, again seeing the same behavior. I'm not sure if that means this is somehow part of the specification, or if it's a bug caused by a library used underneath the hood in both browsers, but the end result is the same. My "clean" code to use setMonth(getMonth()-1) instead of writing out an if statement to detect the start of the year and wrap correctly now contains a cryptic loop that detects if the month didn't actually change and subtracts it again, all to deal with a bug that only happens once a month.

There aren’t very many good resources for optimizing compute-bound JavaScript on the Internet, aside from this article from Google Developers discussing string building and touching very briefly on some of the problems with functions. For the most part, I’d bet this is because a lot of developers say “we’ll leave it up to the JIT compiler to do optimizations,” and then they write code the idiomatic way, assuming that this will translate into high performance, and the rest focus on non-compute usage, which is generally limited in performance by networked resources. However, as I’ve found, the v8 optimizer, as good as it is, still struggles with a number of situations that are easily handled in C. Obviously, I’m not the first to notice this, with projects like asm.js that exist to identify a subset of standard JavaScript that can achieve high performance by allowing for better optimization and especially by avoiding garbage collection.

However, I think that there are a number of things that can be done without resorting to asm.js, to increase performance. I’ve assembled four test cases/demonstrations to illustrate some problematic areas for the JIT optimizer and identify how I’ve hand-optimized them. All of the test cases will be discussed in terms of simple benchmark results, using Node.js’s high resolution timer to achieve a higher precision. One thing that is important to note is that sometimes the relative speed increase appears huge (for instance 10x faster), but really we are looking at a very simple, pared down test case. In this instance it is much more meaningful to consider the absolute speedup (5 microseconds vs. 10 microseconds), because when used in real code, the function bodies will generally have far more meat on them, but the same amount of overhead will be eliminated.

My attempts at optimizing a graph processing library that I have been working on identified function calls to be the biggest source of overhead, because the optimizer doesn’t appear to do much with them. I haven’t tried to look at tracing or disassembly, because I want to keep this simple for now. Since functions appear to be the problem, let’s start with a simple test: does v8 inline functions?

var iter = 10000;
var i, j, start, stop;

function inc(a) { return a+1; }

var x = 0;
start = process.hrtime();
for (i = 0; i < iter; ++i) {
x = inc(x);
}
stop = process.hrtime(start);

time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T1//with function: '+time+' ms');

var x = 0;
start = process.hrtime();
for (i= 0; i < iter; ++i) {
x = x + 1;
}
stop = process.hrtime(start);

time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T1//hand inlined: '+time+' ms');


For me, the version calling inc() generally takes about 0.467 milliseconds to complete, while the hand-inlined version takes only 0.0365 ms. The answer, then, is that either v8 doesn't inline, or if it does, it cannot match a hand-inlined version of the same code. I am not knowledgeable enough about the engine internals or about the specification requirements for JavaScript to speculate on why this happens, but for now, I am satisfied to notice the problem and try to work around it in performance-critical code.

However, this actually leads to a serious problem from a design perspective. You can either write well-partitioned code that is easy to read and follow, with good compartmentalization for function reuse, or you can stuff it all into a single function and expect a significant amount of call overhead to be eliminated. I think this could be (easily?) rolled into a tool like UglifyJS, to automatically inline functions, but at that point, we're introducing a build step for an interpreted language. For scripts that will be sent to the browser, this is perfectly acceptable, as we generally would want to pass these through UglifyJS (although you shouldn't use a separate build step for this anyway), but for server-side scripts, this is extremely tedious. Yes, you can automate it with grunt or another task runner, but it just reintroduces the exact type of process we are trying to eliminate by using JavaScript instead of C.

For now, I'll just say that I think that some amount of function inlining should be possible for the JIT optimizer to do, but there are a lot of instances (member functions, for instance, due to the completely dynamic way in which methods are called) where this is not possible. Therefore, let's move on, and examine the performance of JavaScript's iteration constructs.

One of the things that really attracted me to Node.js was Underscore.js, a functional programming library that provides the types of things I would normally find in Scheme. The best thing about these tools is that they make it simple to write clean, easy to maintain code, and they help enforce good design principles by encouraging programmers to write small functions for one task, and then to build larger processes out of those functions. It might be obvious in hindsight, but considering that function optimization is not very effective, using functions to do array iteration is probably slower than using a for loop to do the same thing. Let's look at two variations, one testing Array.reduce and one testing Array.forEach in an O(n^2) situation to see how the function overhead scales with nested loops.

// Test 2: Iteration methods vs. for loops
iter = 1000;
var elems = 100, subelems = 10, sum;
var xa = [];

for (i = 0; i < elems; ++i) {
xa[i] = Math.random();
}

// 2a: array sum with reduce
start = process.hrtime();
sum = xa.reduce(function(memo, v) {
return memo + v;
}, 0);
stop = process.hrtime(start);

time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2A//using reduce: '+time+' ms');

start = process.hrtime();
sum = 0;
for (i = 0; i < xa.length; ++i) {
sum += xa[i];
}
stop = process.hrtime(start);

time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2A//for loop: '+time+' ms');
console.log('');

// On my machine, times fluctuate a lot, but 0.10-0.16 ms vs 0.0054-0.0078 ms
// so the for loop is faster again

// 2b: nested iteration, calling a member function for each element of an array
function t2b_class() {
this.val = Math.random();
}
var t2b_sum = 0;
t2b_class.prototype.work = function() {
t2b_sum += this.val;
};

for (i = 0; i < elems; ++i) {
xa[i] = [];
for (j = 0; j < elems; ++j) {
xa[i][j] = new t2b_class();
}
}

start = process.hrtime();
xa.forEach(function(v) {
v.forEach(function(v2) {
v2.work();
});
});
stop = process.hrtime(start);

time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2B//nested iterators: '+time+' ms');

function inner_iter(v2) {
v2.work();
}
start = process.hrtime();
xa.forEach(function(v) {
v.forEach(inner_iter);
});
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2B//nested iterators, avoids redeclaration: '+time+' ms');

start = process.hrtime();
for (i = 0; i < xa.length; ++i) {
for (j = 0; j < xa[i].length; ++j) {
xa[i][j].work();
}
}
stop = process.hrtime(start);

time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T2B//for loop: '+time+' ms');


For Test 2A, I find that using Array.reduce takes between 0.10 and 0.16 milliseconds, while using the for loop to implement a reduce operation "the long way" takes between 0.0045 and 0.0075 ms. These really aren't huge differences, but in principle they shouldn't exist at all, as, from a semantic point of view, the code does exactly the same thing in both implementations. For test 2B, however, there is a much more significant difference. For the nested forEaches, I've tested two possible implementations: the "obvious" approach, which uses an anonymous inner function for the second loop, and the "observant" approach which realizes that this function gets redeclared every time the loop iterates, so the declaration is moved outside of the outer loop. My results are that the obvious approach is the slowest, coming in at around 1.08-1.32 ms, then the observant approach, coming in around 0.87-0.95ms, and then the explicit, nested for loop blowing both out of the water with a time of 0.22-0.35 ms.

The results for 2A are underwhelming, but 2B illustrates a much more significant problem with the use of functional techniques. It's really easy to write code in a functional style that ends up being really slow, because v8 does not effectively optimize it. For me, this is a disaster. Functional-style code is faster to write, easier to maintain, and generally cleaner than its imperative counterpart. Plus, implementing a map, reduce, or even just a for-each loop using the for construct and explicit iteration requires repeating a lot of glue code that is handled internally. Even C++11 has language features to minimize this kind of repetition, now. As with function inlining, I maintain that for both 2A and 2B, these types of transformations can easily be done by a tool like UglifyJS at the source level, but really, I feel that the JIT optimizer should be capable of doing them as well.

Now, let's look at another type of optimization important for performance when using recursive functions: tail call optimization. I've written the canonical factorial function in three styles: first, a basic recursive implementation that works its way back up the stack to compute its result; second, a version that uses tail-calls and could therefore be optimized without any source transformation; and third, a complete iterative version, to determine whether v8 performs maximally effective tail optimization.

// Test 3: Tail calls vs. stack walking vs. iterative method
function fact_tail(n, accumulator) {
if (n < 2)
return accumulator;
return fact_tail(n-1, accumulator*n);
}
function fact_stack(n) {
if (n < 2)
return 1;
return n * fact_stack(n-1);
}
function fact_iter(n) {
var accumulator = n;
for (n = n-1; n > 1; --n) {
accumulator = accumulator*n;
}
return accumulator;
}

iter = 1000;
var n = 30, result;

start = process.hrtime();
for (i = 0; i < iter; ++i) {
result = fact_tail(n-1, n);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T3//Tail Factorial: '+time+' ms');

start = process.hrtime();
for (i = 0; i < iter; ++i) {
result = fact_stack(n);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T3//Stack Factorial: '+time+' ms');

start = process.hrtime();
for (i = 0; i < iter; ++i) {
result = fact_iter(n);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T3//Iterative Factorial: '+time+' ms');


The results are approximately what you'd expect. The tail call is not fully optimized, but it is faster than the version using the stack. I haven't done enough research to determine whether the speed boost is due to (some) actual optimization being performed or if it's just a by-product of how temporary variables are used implicitly in the local scope. The numbers for me come in at 0.515 ms for the basic stack version, 0.277 ms for the tail-call version, and 0.181 ms for the iterative version.

Again we see that a source transformation from "good, easy-to-follow code" to "awkward, but still makes sense if you stop to think about it code" results in improved performance. This is the same type of trade-off that we faced in the 90s when C compilers did not optimize effectively. And, again, I would posit that this is the type of operation that should be done by the JIT optimizer. If only I had the necessary knowledge to actually try to modify v8 myself--I've looked at the code, and for all my complaints about shortcomings, it is both incredibly impressive and far too complicated for me to "pick up" over the weekend and change).

Finally, I have one last optimization case. This one, however, is not something that can be done by a JIT optimizer, because the result would no longer meet the language specification. For those of you familiar with C++, there are two basic types of methods in a class: "normal" methods, which have a fixed address known at compile time, and "virtual" methods, which are resolved at run time based on the virtual function table stored within the object on which the method is being called (I almost said "dynamically determined by the type" but that would be inaccurate, as virtual methods do not require RTTI). In JavaScript all class methods are implicitly of a virtual type. When you call a function, that field on the object must be looked up every time, because it can be changed at any time. This is an incredibly useful feature of the language, because of the flexibility it provides: polymorphism is "easy," objects can be monkey-patched to modify behavior, and some of the more subtle aspects of C++ go away (for instance, do you know off the top of your head which method is called if you dereference a base class-type pointer to an instance of a derived class and then call a virtual method?). But this power comes at a cost. When a method isn't used in this manner, we still pay the price of doing lookups, and because the optimizer cannot know that we will not replace a method definition, it can't substitute in the address of the specific function we want.

There is actually a simple, clever way to obtain static references to functions. When the method is called through the object as a property lookup, the prototype chain must be traversed to find the first matching function and then it is used. If we skip that step and use the specific function, on the specific prototype that we want, it should be slightly faster. We still pay the price of looking up a property on the prototype object, though, so we can even take this a step further and save a variable that points precisely to the function we want to call. Because of JavaScript's ability to call any function with any "this" context, we can take the static function pointer and call it as though it were a member function without going through the dynamic lookup phase. Let's look at a test to compare this.

// Test 4: Function table overhead
function TestClass(a, b, c) {
this.a = a;
this.b = b;
this.c = c;
}
TestClass.prototype.hypot = function() {
return Math.sqrt(this.a*this.a + this.b*this.b + this.c*this.c);
}

iter = 10000;
var fn = TestClass.prototype.hypot;
xa = new TestClass(Math.random(), Math.random(), Math.random());

start = process.hrtime();
for (i = 0; i < iter; ++i) {
result = xa.hypot();
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T4//Object method lookup: '+time+' ms');

start = process.hrtime();
for (i = 0; i < iter; ++i) {
result = TestClass.prototype.hypot.call(xa);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T4//Prototype method lookup: '+time+' ms');

start = process.hrtime();
for (i = 0; i < iter; ++i) {
result = fn.call(xa);
}
stop = process.hrtime(start);
time = (stop[0]*1e9 + stop[1])/1e6;
console.log('T4//Pre-saved method pointer: '+time+' ms');


For me, the results are that full dynamic lookup takes 0.194 ms, retrieving the method through the prototype takes 0.172 ms, and using a pre-saved pointer to the function takes only 0.135 ms. These absolute differences really aren't that significant, but this can be applied to every single non-polymorphic function call in your project, which means this might be the test case where relative speedup (30% faster with static function pointers) matters more than absolute. If you still think I'm crazy for even mentioning this, consider the extremely common problem of converting an arguments variable into a proper array by calling Array.prototype.slice on it. Many people will save a reference to Array.prototype.slice at the top of their module and use it throughout their code; luckily it's both shorter to type AND faster, which is great for libraries that provide a lot of functions accepting variable numbers of arguments.

Member function lookup optimization requires far more work than the other three, and it cannot be applied transparently by a JIT optimizer or by UglifyJS as the language specification prevents it from having enough information to correctly do so. It also makes your code much, much more annoying to read. I could suggest writing an AltJS language that allows you to differentiate between the two types of member functions, so that when it compiles to JS it spits out static references where possible, but I'm pretty sure @horse_js would make fun of me for it (rightly so). But, if you're in a situation where performance for computation is just that important, I'd recommend at least considering static function references.

As for my graph library, by carefully rewriting code matching the four above cases, I was able to get the execution time for a single pass of the evaluator down from 25.5 usec to 7.4 usec, which increases the number of users that my application can support for a given amount of hardware by 3.3x, immediately translating into lower operating costs. I wish I could share the code (maybe one day, but for now it's under the "secret project" category), but hopefully an unsubstantiated anecdote like this will help convince you that these optimizations still benefit real code that performs real tasks, and not just overly-simple test cases. There is often an important tradeoff to consider between readability, maintainability, and performance, so I wouldn't recommend always writing code that looks like this, but if you're working on a compute-bound application (mostly, JavaScript-based games, which we will see many of as WebGL and HTML5 are improved), maybe think about writing your code in three steps: first, do it the easy/fast/right way, producing good, clean code. Then, write a very good set of test cases. Finally, optimize your code to death, and use your test cases to verify that you haven't broken anything.

The complete benchmark file used to generate my numbers is available as a gist in case you don't feel like cobbling together the snippets given above.