Since its release last week, I’ve been playing quite a bit of Fallout 4. There’s an interesting mini-game (which was in previous iterations as well) for “hacking” computer terminals, where you must pick the passcode from a list of possibilities with a limited number of guesses. Each failed guess provides the number of correct letters (in both value and position) in that particular word, but not which letters were correct, allowing you to deduce the correct passcode similarly to the game “Mastermind.” A natural question is, “what is the best strategy for identifying the correct passcode?” We’ll ignore the possibility of dud removal and guess resets (which exist to simplify things a bit in game) for the analysis.

Reformulating this as a probability question offers a framework to design the best strategy. First, some definitions: $N$ denotes the number of words, $z$ denotes the correct word, and $x_i$ denotes a word on the list (in some consistent order). A simple approach suggests that we want to use the maximum likelihood (ML) estimate of $z$ to choose the next word based on all the words guessed so far and their results:

$\hat{z} = \underset{x_i}{\mathrm{argmax}}~~\mathrm{Pr}(z=x_i)$

However, for the first word, the prior is uniform: each word is equally likely. This might seem like the end of the line, in which case we would just pick the first word randomly (or always pick the first one on the list, for instance). However, future guesses depend on what this first guess tells us, so we’d be better off with an estimate which maximizes the mutual information between the guess and the unknown password. Using the concept of entropy (which I’ve discussed briefly before), we can formalize the notion of “mutual information” into a mathematical definition: $I(z, x) = H(z) - H(z|x)$. In this sense, “information” is what you gain by making an observation, and it is measured by how much the observation narrows down the possible states of a latent variable. For more compact notation, let’s define $F_i=f(x_i)$ as the “result” random variable for a particular word, telling us how many letters matched, taking values $\{0,1,...,M\}$, where $M$ is the length of words in the current puzzle. Because $H(z)$ is the same no matter which word we guess, maximizing $I(z, F_i) = H(z) - H(z|F_i)$ is equivalent to minimizing the conditional entropy $H(z|F_i)$, so we can change our selection criterion to:

$\hat{z} = \underset{x_i}{\mathrm{argmin}}~~H(z|F_i)$

But, we haven’t talked about what “conditional entropy” might mean, so it’s not yet clear how to calculate $H(z | F_i)$, apart from it being the entropy after observing $F_i$’s value. Conditional entropy is distinct from conditional probability in a subtle way: conditional probability is based on a specific observation, such as $F_i=1$, but conditional entropy is based on all possible observations and reflects how many possible system configurations there are after making an observation, regardless of what its value is. It’s a sum of the resulting entropy after each possible observation, weighted by the probability of that observation happening:

$H(Z | X) = \sum_{x\in X} p(x)H(Z | X = x)$

As an example, let’s consider a puzzle with $M=5$ and $N=10$. We know that $\forall x_i,\mathrm{Pr}(F_i=5)=p_{F_i}(5)=0.1$. If we define the similarity function $L(x_i, x_j)$ to be the number of letters that match in place and value for two words, and we define the group of sets $S^{k}_{i}=\{x_j:L(x_i,x_j)=k\}$ as the candidate sets, then we can find the probability distribution for $F_i$ by counting,

$p_{F_i}(k)=\frac{\vert{S^k_i}\vert}{N}$

As a sanity check, we know that $\vert{S^5_i}\vert=1$ because there are no duplicates, and therefore this equation matches our intuition for the probability of each word being an exact match. With the definition of $p_{F_i}(k)$ in hand, all that remains is finding $H(z | F_i=k)$, but luckily our definition for $S^k_i$ has already solved this problem! If $F_i=k$, then we know that the true solution is uniformly distributed in $S^k_i$, so

$H(z | F_i=k) = \log_2\vert{S^k_i}\vert$.
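
Putting the two pieces together (a step the text leaves implicit), the quantity to minimize for each candidate guess $x_i$ is

$H(z | F_i) = \sum_{k=0}^{M} p_{F_i}(k)\,H(z | F_i=k) = \sum_{k=0}^{M} \frac{\vert{S^k_i}\vert}{N}\log_2\vert{S^k_i}\vert$

where empty candidate sets contribute nothing to the sum. As a purely hypothetical illustration with $N=10$: a guess that splits the other nine words into three sets of three gives $3\cdot\frac{3}{10}\log_2 3\approx 1.43$ bits of remaining uncertainty, while a guess that lumps all nine into a single set gives $\frac{9}{10}\log_2 9\approx 2.85$ bits, so the first guess is the better one.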

Finding the best guess is as simple as enumerating $S^k_i$ and then finding the $x_i$ which produces the minimum conditional entropy. For subsequent guesses, we simply augment the definition for the candidate sets by further stipulating that set members $x_j$ must also be in the observed set for all previous iterations. This is equivalent to taking the set intersection, but the notation gets even messier than we have so far, so I won’t list all the details here.
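
The procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the true passcode is uniformly distributed over the remaining candidates; the function names (`likeness`, `conditional_entropy`, `best_guess`, `prune`) and the four-word list are made up for the example and are not taken from the game.

```python
from collections import Counter
from math import log2

def likeness(a, b):
    """L(x_i, x_j): number of positions where both words have the same letter."""
    return sum(ca == cb for ca, cb in zip(a, b))

def conditional_entropy(guess, candidates):
    """H(z | F_i) for `guess`, assuming z is uniform over `candidates`."""
    # sizes[k] is |S_i^k|: how many candidates have likeness k with the guess.
    sizes = Counter(likeness(guess, word) for word in candidates)
    n = len(candidates)
    return sum(size / n * log2(size) for size in sizes.values())

def best_guess(candidates):
    """Candidate with the smallest conditional entropy (first one wins ties)."""
    return min(candidates, key=lambda w: conditional_entropy(w, candidates))

def prune(candidates, guess, result):
    """Keep only the words consistent with the observed result for `guess`."""
    return [w for w in candidates if likeness(guess, w) == result]

# Hypothetical word list, chosen only to keep the example small.
words = ["care", "core", "cart", "bold"]
guess = best_guess(words)
print(guess, conditional_entropy(guess, words))
# If the terminal reported a likeness of 3 for that guess:
print(prune(words, guess, 3))
```

Calling `prune` after each failed guess and then `best_guess` on the shortened list is the set-intersection step mentioned above: a word survives only if it is consistent with every result seen so far.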

All that said, this is more of an interesting theoretical observation than a practical one. Counting all of the sets by hand generally takes longer than a simpler strategy, so it is not well suited for human use (I believe it is $O(N^2)$ word comparisons for each guess, since each of the $N$ candidate guesses has to be compared against every other word), although a computer can do it effectively. Personally, I just go through and find all the emoticons to remove duds and then find a word that has one or two overlaps with others for my first guess, and the field narrows down very quickly.

Beyond its appearance in a Fallout 4 mini-game, the concept of “maximum mutual information” estimation has broad scientific applications. The most notable in my mind is in machine learning, where MMI is used for training classifiers, in particular, Hidden Markov Models (HMMs) such as those used in speech recognition. Given reasonable probability distributions, MMI estimates are able to handle situations where ML estimates appear ambiguous, and as such they are able to be used for “discriminative training.” Typically, an HMM training algorithm would receive labeled examples of each case and learn their statistics only. However, a discriminative trainer can also consider the labeled examples of other cases in order to improve classification when categories are very similar but semantically distinct.

Everything that happens in the world can be described in some way. Our descriptions range from informal and casual to precise and scientific, yet ultimately they all share one underlying characteristic: they carry an abstract idea known as “information” about what is being described. In building complex systems, whether out of people or machines, information sharing is central for building cooperative solutions. However, in any system, the rate at which information can be shared is limited. For example, on Twitter, you’re limited to 140 characters per message. With 802.11g you’re limited to 54 Mbps in ideal conditions. In mobile devices, the constraints go even further: transmitting data on the network requires some of our limited bandwidth and some of our limited energy from the battery.

Obviously this means that we want to transmit our information as efficiently as possible, or, in other words, we want to transmit a representation of the information that consumes the smallest amount of resources, such that the recipient can convert this representation back into a useful or meaningful form without losing any of the information. Luckily, the problem has been studied pretty extensively over the past 60-70 years and the solution is well known.

First, it’s important to realize that compression only matters if we don’t know exactly what we’re sending or receiving beforehand. If I knew exactly what was going to be broadcast on the news, I wouldn’t need to watch it to find out what happened around the world today, so nothing would need to be transmitted in the first place. This means that in some sense, information is a property of things we don’t know or can’t predict fully, and it represents the portion that is unknown. In order to quantify it, we’re going to need some math.

Let’s say I want to tell you what color my car is, in a world where there are only four colors: red, blue, yellow, and green. I could send you the color as an English word with one byte per letter, which would require 3, 4, 5, or 6 bytes, or we could be cleverer about it. Using a pre-arranged scheme for all situations where colors need to be shared, we agree on the convention that the binary values 00, 01, 10, and 11 map to our four possible colors. Suddenly, I can use only two bits (0.25 bytes, far more efficient) to tell you what color my car is, a huge improvement. Generalizing, this suggests that for any set $\chi$ of abstract symbols (colors, names, numbers, whatever), by assigning each a unique binary value, we can transmit a description of some value from the set using $\log_2(|\chi|)$ bits on average, if we have a pre-shared mapping. As long as we use the mapping multiple times it amortizes the initial cost of sharing the mapping, so we’re going to ignore it from here out. It’s also worthwhile to keep this limit in mind as a max threshold for “reasonable;” we could easily create an encoding that is worse than this, which means that we’ve failed quite spectacularly at our job.

But if there are additional constraints on which symbols appear, we should be able to do better. Consider the extreme situation where 95% of cars produced are red, 3% blue, and only 1% each for yellow and green. If I needed to transmit color descriptions for my factory’s production of 10,000 vehicles, using the earlier scheme I’d need exactly 20,000 bits to do so by stringing together all of the colors in a single sequence. But by the law of large numbers, I can expect roughly 9,500 of those cars to be red, so what if I use a different code, where red is assigned the bit string 0, blue is assigned 10, yellow is assigned 110, and green 111? Even though the representation for two of the colors is a bit longer in this scheme, the expected total encoding length for a lot of 10,000 cars decreases to 10,700 bits (1*9500 + 2*300 + 3*100 + 3*100), an improvement of almost 50%! This suggests that the probabilities for each symbol should impact the compression mapping, because if some symbols are more common than others, we can make them shorter in exchange for making less common symbols longer and expect the average length of a message made from many symbols to decrease.
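
The arithmetic for the two schemes can be checked with a short Python sketch. The counts are the expected values from the percentages above; which two-bit pattern goes with which color in the fixed-length code is an arbitrary assumption, and the helper names `total_bits` and `decode` are made up for the illustration.

```python
# Expected number of cars of each color in a lot of 10,000.
counts = {"red": 9500, "blue": 300, "yellow": 100, "green": 100}

# Fixed-length code: the assignment of patterns to colors is an arbitrary choice.
fixed_code = {"red": "00", "blue": "01", "yellow": "10", "green": "11"}
# Variable-length code from the text: frequent colors get shorter strings.
variable_code = {"red": "0", "blue": "10", "yellow": "110", "green": "111"}

def total_bits(code, counts):
    """Total message length when each color is sent counts[color] times."""
    return sum(len(code[color]) * n for color, n in counts.items())

def decode(bits, code):
    """Decode a bit string; works because no codeword is a prefix of another."""
    inverse = {word: color for color, word in code.items()}
    colors, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            colors.append(inverse[current])
            current = ""
    return colors

print(total_bits(fixed_code, counts))     # 20000
print(total_bits(variable_code, counts))  # 10700

message = "".join(variable_code[c] for c in ["red", "green", "blue"])
print(message, decode(message, variable_code))
```

The `decode` helper is there to show why the particular strings 0, 10, 110, and 111 work: since none of them is the beginning of another, the receiver can split the stream back into colors without any separators.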

So, with that in mind, the next logical question is, how well can we do by adapting our compression scheme to the probability distribution for our set of symbols? And how do we find the mapping that achieves this best amount of compression? Consider a sequence of $n$ independent, identically distributed symbols taken from some source with known probability mass function $p(X=x)$, with $S$ total symbols for which the PMF is nonzero. If $n_i$ is the number of times that the $i$th symbol in the alphabet appears in the sequence, then by the law of large numbers we know that for large $n$ it converges almost surely to a specific value: $\Pr(n_i=np_i)\xrightarrow{n\to \infty}1$.

In order to obtain an estimate of the best possible compression rate, we will use the threshold for reasonable compression identified earlier: it should, on average, take no more than approximately $\log_2(|\chi|)$ bits to represent a value from a set $\chi$, so by finding the number of possible sequences, we can bound how many bits it would take to describe them. A further consequence of the law of large numbers is that because $\Pr(n_i=np_i)\xrightarrow{n\to \infty}1$ we also have $\Pr(n_i\neq np_i)\xrightarrow{n\to \infty}0$. This means that we can expect the set of possible sequences to contain only the possible permutations of a sequence containing $n_i$ realizations of each symbol. The probability of a specific sequence $X^n=x_1 x_2 \ldots x_{n-1} x_n$ can be expanded using the independence of each position and simplified by grouping like symbols in the resulting product:

$P(x^n)=\prod_{k=1}^{n}p(x_k)=\prod_{i=1}^{S} p_i^{n_i}=\prod_{i=1}^{S} p_i^{np_i}$

We still need to find the size of the set $\chi$ in order to find out how many bits we need. However, the probability we found above doesn’t depend on the specific permutation, so it is the same for every element of the set and thus the distribution of sequences within the set is uniform. For a uniform distribution over a set of size $|\chi|$, the probability of a specific element is $\frac{1}{|\chi|}$, so we can substitute the above probability for any element and expand in order to find out how many bits we need for a string of length $n$:

$B(n)=-\log_2(\prod_{i=1}^Sp_i^{np_i})=-n\sum_{i=1}^Sp_i\log_2(p_i)$

Frequently, we’re concerned with the number of bits required per symbol in the source sequence, so we divide $B(n)$ by $n$ to find $H(X)$, a quantity known as the entropy of the source $X$, which has PMF $P(X=x_i)=p_i$:

$H(X) = -\sum_{i=1}^Sp_i\log_2(p_i)$
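
As a quick sanity check on the formula, here is a minimal Python sketch; the function name `entropy` is just a label for this example. It assumes the PMF is given as a list of probabilities and skips zero-probability symbols, matching the restriction to the $S$ symbols with nonzero probability.

```python
from math import log2

def entropy(pmf):
    """H(X) in bits per symbol for a list of probabilities that sums to 1."""
    # Symbols with zero probability never occur, so they are skipped.
    return -sum(p * log2(p) for p in pmf if p > 0)

# Four equally likely colors: the two-bit fixed-length code is already optimal.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
# A fair coin flip carries exactly one bit.
print(entropy([0.5, 0.5]))  # 1.0
```

For a uniform distribution over $|\chi|$ symbols the result reduces to $\log_2(|\chi|)$, which is the "reasonable" threshold from earlier.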

The entropy, $H(X)$, is important because it establishes the lower bound on the number of bits that is required, on average, to accurately represent a symbol taken from the corresponding source $X$ when encoding a large number of symbols. $H(X)$ is non-negative, but it is not restricted to integers only; however, achieving less than one bit per symbol requires multiple neighboring symbols to be combined and encoded in groups, similarly to the method used above to obtain the expected bit rate. Unfortunately, that process cannot be used in practice for compression, because it requires enumerating an exponential number of strings (as a function of a variable tending towards infinity) in order to assign each sequence to a bit representation. Luckily, two very common, practical methods exist: Huffman coding, which is optimal among codes that assign each symbol its own whole-number-of-bits codeword, and arithmetic coding, which can come very close to the entropy bound on long messages.

For the car example mentioned earlier, the entropy works out to about 0.35 bits per car, which means there is significant room for improvement over the symbol-wise mapping I suggested, which only achieved a rate of 1.07 bits per symbol. Getting closer to the bound would require grouping multiple car colors into a compound symbol, which quickly becomes tedious when working by hand. Still, it is kind of amazing that using only ~3,500 bits, we could communicate the car colors that naively required 246,400 bits (=30,800 bytes) by encoding each letter of the English word with a single byte.
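
Grouping colors by hand is tedious, but a computer can do it easily. The sketch below is one way to illustrate the effect, not a production compressor: it builds a Huffman code over blocks of 1, 2, and 4 consecutive car colors and reports the average number of bits per car. The helper names (`huffman_lengths`, `bits_per_color`) are made up for the example, and it assumes the colors of successive cars are independent.

```python
import heapq
from itertools import product
from math import log2, prod

def huffman_lengths(probs):
    """Codeword length of each symbol in a Huffman code for `probs`."""
    # Heap entries: (probability, unique tie-breaker, symbols in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol inside them.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, next_id, s1 + s2))
        next_id += 1
    return lengths

def bits_per_color(pmf, block):
    """Average Huffman-coded bits per color when coding `block` colors at once."""
    # Independence assumption: a block's probability is the product of its parts.
    block_probs = [prod(combo) for combo in product(pmf, repeat=block)]
    lengths = huffman_lengths(block_probs)
    return sum(p * n for p, n in zip(block_probs, lengths)) / block

pmf = [0.95, 0.03, 0.01, 0.01]  # red, blue, yellow, green
h = -sum(p * log2(p) for p in pmf)
print(f"entropy: {h:.3f} bits per car")
for block in (1, 2, 4):
    print(f"block of {block}: {bits_per_color(pmf, block):.3f} bits per car")
```

With a block size of 1 this reproduces the 1.07 bits per symbol of the hand-built code. Larger blocks bring the rate down toward the entropy, at the cost of a code table that grows exponentially with the block size (256 entries for blocks of 4).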

$H(X)$ also has other applications, including gambling, investing, lossy compression, communications systems, and signal processing, where it is generally used to establish the bounds for best- or worst-case performance. If you’re interested in a more rigorous definition of entropy and a more formal derivation of the bounds on lossless compression, plus some applications, I’d recommend reading Claude Shannon’s original paper on the subject, which effectively created the field of information theory.