This post necessarily has spoilers about arcosphere balancing in Factorio's Space Exploration mod. If you're playing this mod and want to work it out yourself then please come back after you've finished it.

The Space Exploration mod for Factorio is well known for introducing new logistics challenges. While most mods introduce complicated recipes with many stages, those primarily require building more stuff. Although SE introduces new recipes and several traditional "more complicated" chains including byproducts and recycling, it also introduces genuinely new logistics problems: not all resources are available on every planet, sunlight isn't available in the same amount everywhere, and not all production facilities can be built in all locations, making the real challenge obtaining and moving the materials around so that they can be combined.

The final logistics challenge, however, is by far the most interesting. It introduces a new item called an arcosphere, which comes in eight flavors named for Greek letters: epsilon \epsilon, lambda \lambda, gamma \gamma, xi \xi, zeta \zeta, theta \theta, phi \phi, and omega \omega. Unlike other ingredients, arcospheres are not consumed in production; rather, they are converted between flavors in order to produce the desired products. For example, a particular tier 3 deep space science card requires regular materials (naquium plates and significant data) plus a pair of arcospheres, and it produces the science card plus a different pair of arcospheres. To complicate things, the number of arcospheres is limited, as they must be collected with a process that has diminishing returns, so to produce large amounts of science they must be converted back to the original input arcospheres through auxiliary recipes called folding and inversion. Finally, to avoid a static solution, each production recipe comes in two variants which accept the same inputs but produce different arcospheres, with the specific variant selected by RNG when production finishes.

The Recipes

To start off we need a consistent way to refer to the various recipes, so we define the following names for the folds and inversions:


& A: & \{\lambda,\omega\} & \rightarrow \{\xi,\theta\} \\
& B: & \{\lambda,\theta\} & \rightarrow \{\epsilon,\zeta\} \\
& C: & \{\gamma,\xi\} & \rightarrow \{\lambda,\zeta\} \\
& D: & \{\xi,\zeta\} & \rightarrow \{\theta,\phi\} \\
& E: & \{\epsilon,\theta\} & \rightarrow \{\phi,\omega\} \\
& F: & \{\zeta,\phi\} & \rightarrow \{\epsilon,\gamma\} \\
& G: & \{\gamma,\phi\} & \rightarrow \{\xi,\omega\} \\
& H: & \{\epsilon,\omega\} & \rightarrow \{\lambda,\gamma\} \\
& I: & \{\epsilon,\lambda,\xi,\phi\} & \rightarrow \{\gamma,\zeta,\theta,\omega\} \\
& J: & \{\gamma,\zeta,\theta,\omega\} & \rightarrow \{\epsilon,\lambda,\xi,\phi\}

These names will be used throughout to refer to specific folds and inversions, in order to simplify communicating which steps are required to convert one set of arcospheres into another. The names are arbitrary, but it is a nice coincidence that I, the next letter after H, is also the first letter of "inversion", and since electrical engineers use j to represent i, J is a fitting name for the inversion complementary to I.

While we need names to refer to the folds and inversions, we don't need to worry about any of the production recipes themselves, because they will produce a non-deterministic imbalance in the arcospheres. Instead we will be trying to take an arbitrary imbalance and correct it using only the above tools.

The Basic Circuit Solution

We won't be analyzing this one, but the most basic solution is to observe a pattern in the folding recipes, specifically that four folds can be combined to convert one arcosphere flavor into another specific flavor, with the help of two catalyst arcospheres which are returned unmodified. For example, by combining folds A, C, G, and D, we can convert two \gamma arcospheres into two \theta arcospheres. There is also a complementary combination of B, E, F, and H, which converts in the opposite direction. There are four such pairs: \lambda \leftrightarrow \phi, \gamma \leftrightarrow \theta, \zeta \leftrightarrow \omega, and \xi \leftrightarrow \epsilon. Using the inversion recipes, two of these pairs can be converted into the other two pairs as well.
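
As a quick check of the A, C, G, D combination, we can sum the four recipes and cancel the arcospheres that appear on both sides:


\{\lambda,\omega\} + \{\gamma,\xi\} + \{\gamma,\phi\} + \{\xi,\zeta\} & \rightarrow \{\xi,\theta\} + \{\lambda,\zeta\} + \{\xi,\omega\} + \{\theta,\phi\} \\
\{\gamma,\gamma\} & \rightarrow \{\theta,\theta\}

The \lambda, \omega, \zeta, \phi, and two \xi spheres appear on both sides and act as catalysts, leaving a net conversion of two \gamma into two \theta.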

The solution is to measure the total number of each arcosphere and do two things. First, determine the balance within each pair: if \lambda > \phi run the forward operation, else if \phi > \lambda run the reverse operation, allowing us to maintain roughly the same number on each side of a pair. Second, sum the two pairs on one side of an inversion and compare with the sum on the other side, converting the larger sum into the smaller one. This solution is dynamic and adapts to the arbitrary conversions that happen while running arcosphere-based recipes, and it is effective for finishing the game, but it does have a high static cost: the catalyst arcospheres used in balancing each pair cannot be used elsewhere in a naive solution, resulting in 16 unavailable arcospheres. A requester-based solution also naively results in buffering a number of input arcospheres for each direction of each pair, at least one but usually two arcospheres of each type, raising the cost to 32 total unavailable arcospheres. It is relatively inexpensive to obtain about 100 total arcospheres, so this is not prohibitive, but it is still unsatisfying. Smarter implementations of this approach can reduce this static cost, but we won't be looking into it any further than that.
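
To make the decision logic concrete, here is a pseudo-C sketch of it (in game this is a handful of circuit conditions, and the run_* helpers are hypothetical stand-ins for triggering the corresponding recipe combinations):

enum flavor { EPSILON, LAMBDA, GAMMA, XI, ZETA, THETA, PHI, OMEGA };

extern void run_lambda_to_phi(void);   /* four-fold combination */
extern void run_phi_to_lambda(void);   /* complementary combination */
extern void run_inversion_I(void);
extern void run_inversion_J(void);

void balance(const int counts[8]) {
    /* Within one pair (lambda <-> phi shown): convert the surplus side. */
    if (counts[LAMBDA] > counts[PHI])
        run_lambda_to_phi();
    else if (counts[PHI] > counts[LAMBDA])
        run_phi_to_lambda();

    /* Across the inversion: convert the larger side into the smaller. */
    int left  = counts[EPSILON] + counts[LAMBDA] + counts[XI] + counts[PHI];
    int right = counts[GAMMA] + counts[ZETA] + counts[THETA] + counts[OMEGA];
    if (left > right)
        run_inversion_I();  /* {eps,lam,xi,phi} -> {gam,zeta,theta,omega} */
    else if (right > left)
        run_inversion_J();
}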

Mathematical Representation

Replacing the game mechanics with an abstraction, we could represent our arcospheres as an 8-dimensional vector, with each coordinate storing the total number of arcospheres of that type. Initially we might think to restrict the coordinates to the non-negative integers, but when we think about what a fold does (for example, subtract one \lambda, subtract one \omega, add one \xi and add one \theta), it is going to be more useful to have the coordinates be restricted only to integers so that we can interpret a positive coordinate as a surplus and a negative coordinate as a deficit of the corresponding arcosphere. The vector's coordinates map to arcosphere flavors like this:


\vec{x} = \begin{bmatrix}\epsilon & \lambda & \gamma & \xi & \zeta & \theta
&\phi & \omega\end{bmatrix}.

Note that the order of the coordinates, much like the order of the folds earlier, is arbitrary and just corresponds to the order in which we first wrote them down (they're not even in the same order as the game shows them, which was definitely a mistake, but it is too late to change things now).

Because a single fold adds and subtracts a static amount--independent of the current numbers of arcospheres--we could represent it as an affine transform, but as all of the interesting information is in the translation vector, it is much easier to characterize each fold by this translation vector only. As a result, the folds and inversions commute with one another, because vector addition is commutative. So we can define vectors corresponding to each fold and inversion like so:


\vec{a} &= \begin{bmatrix}0 & -1 & 0 & 1 & 0 & 1 & 0 & -1\end{bmatrix} \\
\vec{b} &= \begin{bmatrix}1 & -1 & 0 & 0 & 1 & -1 & 0 & 0\end{bmatrix} \\
\vec{c} &= \begin{bmatrix}0 & 1 & -1 & -1 & 1 & 0 & 0 & 0\end{bmatrix} \\
\vec{d} &= \begin{bmatrix}0 & 0 & 0 & -1 & -1 & 1 & 1 & 0\end{bmatrix} \\
\vec{e} &= \begin{bmatrix}-1 & 0 & 0 & 0 & 0 & -1 & 1 & 1\end{bmatrix} \\
\vec{f} &= \begin{bmatrix}1 & 0 & 1 & 0 & -1 & 0 & -1 & 0\end{bmatrix} \\
\vec{g} &= \begin{bmatrix}0 & 0 & -1 & 1 & 0 & 0 & -1 & 1\end{bmatrix} \\
\vec{h} &= \begin{bmatrix}-1 & 1 & 1 & 0 & 0 & 0 & 0 & -1\end{bmatrix} \\
\vec{i} &= \begin{bmatrix}-1 & -1 & 1 & -1 & 1 & 1 & -1 & 1\end{bmatrix} \\
\vec{j} &= \begin{bmatrix}1 & 1 & -1 & 1 & -1 & -1 & 1 & -1\end{bmatrix}

Written in this way there are a lot of observations we can make immediately:

  • The entries of each fold or inversion vector sum to zero
  • \vec{i} + \vec{j} = \vec{0}
  • \vec{a} + \vec{c} + \vec{f} + \vec{e} = \vec{0}
  • \vec{b} + \vec{d} + \vec{g} + \vec{h} = \vec{0}

These observations are important because we learn that although we have 8 coordinates, the dimension of our vector space is only 7, as the final entry is determined entirely by the earlier entries. We can also conclude that all of the folds, inversions, and production recipes which do not create or destroy arcospheres lie in the hyperplane defined by \sum\vec{x}_i = 0, albeit only at points in the hyperplane with integer coordinates. We also confirm that the two inversion recipes are complementary, and we learn that the folding recipes can be broken into two sets of four, where each set sums to zero as well. The dimension of the operations contained in each of these sets is therefore 3 (not 4). This also allows us to construct negated versions of each vector, for example: -\vec{a} = \vec{c} + \vec{f} + \vec{e}. We will give those two sets names so that we can refer to them later:


X = \{\vec{a},\vec{c},\vec{f},\vec{e}\} \\
Y = \{\vec{b},\vec{d},\vec{g},\vec{h}\}

Now we can form a complete plan for solving the balancing problem: we will run the production recipes, then obtain an error vector representing the imbalance between arcospheres, with positive coordinates indicating a surplus and negative coordinates representing a deficit. We will use the inverse of our fold matrix to find a representation of the error vector in fold coordinates, negate that vector, and run each fold the number of times indicated by the corresponding coordinate to balance the total number of arcospheres.

Naturally, there are some complications and the above steps require significant modification to actually find a solution.

Problem 1: Coordinates Must be Integers

This will be more of a recurring issue that requires each step to be modified rather than something we can address up front: both the numbers of arcospheres and the numbers of folds must be integers, so we have two different lattices to work with. We can allow negative values for either as we're representing the difference between current number of arcospheres and a balanced set, and we learned earlier how to represent a negative fold, so for the purposes of algebra we can permit negative numbers of folds, and when it comes time to apply them we can convert those values into an equivalent positive number of different folds. For the most part we will begin with a strategy that would be applicable in a regular vector space and then adapt it to work on the lattice. At the end we will need to analyze the effectiveness of our approach.

Problem 2: The Folds Are Not Orthogonal

The fold vectors are not all orthogonal, so we can't use them directly as an effective set of basis elements for our fold lattice. We can use the Gram-Schmidt process to generate orthogonal vectors, but because we require integer coordinates we cannot normalize them. Instead we will have basis vectors of different lengths, and we will have to account for this in later steps. There are many possible starting points, but we can make several observations first to end up with simpler basis definitions:

  • \langle\vec{c},\vec{e}\rangle = 0
  • \langle\vec{a},\vec{f}\rangle = 0
  • \langle\vec{b},\vec{g}\rangle = 0
  • \langle\vec{d},\vec{h}\rangle = 0
  • \forall\vec{x}_i\in X,\forall\vec{y}_j\in Y: \langle\vec{x}_i,\vec{y}_j\rangle = 0
  • \forall\vec{x}_i\in X\cup Y: \langle\vec{x}_i,\vec{i}\rangle = 0
  • \forall\vec{x}_i\in X\cup Y: \langle\vec{x}_i,\vec{j}\rangle = 0

With this in mind, we know that our basis elements will be broken into three groups: one will be a basis over X, one will be a basis over Y, and one will be a basis over \{\vec{i},\vec{j}\}, which we will choose to be \vec{i}.

The Gram-Schmidt process (ignoring normalization) in a regular vector space, given an input set of vectors \vec{v}_i, proceeds like this:


\vec{u}_1 & = \vec{v}_1 \\
\vec{u}_2 & = \vec{v}_2 -
\frac{\langle\vec{v}_2,\vec{u}_1\rangle}{\langle\vec{u}_1,\vec{u}_1\rangle}\vec{u}_1
\\
\vec{u}_3 & = \vec{v}_3 -
\frac{\langle\vec{v}_3,\vec{u}_1\rangle}{\langle\vec{u}_1,\vec{u}_1\rangle}\vec{u}_1
- \frac{\langle\vec{v}_3,\vec{u}_2\rangle}{\langle\vec{u}_2,\vec{u}_2\rangle}\vec{u}_2

and so on, constructing each orthogonal vector by projecting the new vector onto all previous orthogonal elements and subtracting those projections out to obtain an orthogonal residual. However, the ratio of dot products may be a rational number rather than an integer, so this approach will not work directly for calculating our orthogonal vectors either. We must modify it, so to improve readability we introduce two new expressions: T_i(\vec{x})=\langle\vec{x},\vec{u}_i\rangle, the dot product of a vector with the ith basis vector, and S_i=T_i(\vec{u}_i), the dot product of the ith basis vector with itself. We can then write the vector-space expression for each basis vector like so:


\vec{u}_k = \vec{v}_k - \sum_{i=1}^{k-1} \frac{T_i(\vec{v}_k)}{S_i} \vec{u}_i.

Because the basis vectors can be any length, we can scale the expression by a value q chosen to ensure that all coefficients are integers:


q =
\frac{\mathrm{lcm}(S_1,...,S_{k-1})}{\mathrm{gcd}(T_1(\vec{v}_k),...,T_{k-1}(\vec{v}_k))}
\\
\vec{u}_k = q\vec{v}_k - \sum_{i=1}^{k-1} q\frac{T_i(\vec{v}_k)}{S_i} \vec{u}_i.

This choice of q has the advantage of producing the minimum length \vec{u}_k as well.

Problem 3: Negative Folds

Some of the coefficients above may be negative. We've already discovered how to represent the negative/additive inverse of any fold during our earlier observations, but we will restate it here. Because each set of folds sums to zero, the negative of a given fold is the other elements in its set: -\vec{a} = \vec{c}+\vec{f}+\vec{e}. For a combination of folds from both sets, to negate the sum we take the complement of each vector with respect to its set and add them together: -(\vec{b}+\vec{a}) = \vec{c}+\vec{f}+\vec{e}+\vec{g}+\vec{d}+\vec{h}. We won't write down all the negative vectors here, but we will write down the negatives of the basis vectors once we have found them all.

Finding an Orthogonal Basis

Using the above we can finally find our orthogonal basis. You can pick any vectors and work through it, but we can simplify the math by choosing as many already orthogonal vectors as possible to start. We saw that \vec{c} and \vec{e} form an orthogonal pair, so they are our choice for the first two basis vectors:


\vec{u}_1 = \vec{c} \\
\vec{u}_2 = \vec{e} \\

Next we can pick either of the two remaining vectors in X, so let's choose \vec{a}:


\vec{u}_3 & = q\vec{a} - q\frac{T_1(\vec{a})}{S_1}\vec{c} -
q\frac{T_2(\vec{a})}{S_2}\vec{e} \\
\vec{u}_3 & = q\vec{a} - q\frac{-2}{4}\vec{c} - q\frac{-2}{4}\vec{e} \\
q & = \frac{\mathrm{lcm}(4, 4)}{\mathrm{gcd}(2,2)} = \frac{4}{2} = 2 \\
\vec{u}_3 & = 2\vec{a} + 2\frac{2}{4}\vec{c} + 2\frac{2}{4}\vec{e} \\
\vec{u}_3 & = 2\vec{a} + \vec{c} + \vec{e}
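
As a sanity check, \vec{u}_3 is orthogonal to the first two basis vectors, and written out in arcosphere coordinates it is:


\langle\vec{u}_3,\vec{c}\rangle & = 2\langle\vec{a},\vec{c}\rangle + \langle\vec{c},\vec{c}\rangle + \langle\vec{e},\vec{c}\rangle = -4 + 4 + 0 = 0 \\
\langle\vec{u}_3,\vec{e}\rangle & = 2\langle\vec{a},\vec{e}\rangle + \langle\vec{c},\vec{e}\rangle + \langle\vec{e},\vec{e}\rangle = -4 + 0 + 4 = 0 \\
\vec{u}_3 & = \begin{bmatrix}-1 & -1 & -1 & 1 & 1 & 1 & 1 & -1\end{bmatrix}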

Analogously, for Y we can compute three more basis vectors:


\vec{u}_4 & = \vec{b} \\
\vec{u}_5 & = \vec{g} \\
\vec{u}_6 & = 2\vec{d} + \vec{b} + \vec{g}

And finally, we can add the inversion basis vector we chose earlier to fill out the set of 7 that we need:

\vec{u}_7 = \vec{i}

As we determined earlier that the hyperplane containing all of our vectors is 7-dimensional, at this point we know that we have found a complete orthogonal basis. Since we already know we will need negatives of these basis vectors, it is helpful to compute them now:


-\vec{u}_1 & = \vec{a}+\vec{f}+\vec{e} \\
-\vec{u}_2 & = \vec{a}+\vec{f}+\vec{c} \\
-\vec{u}_3 & = 2\vec{f}+\vec{c}+\vec{e} \\
-\vec{u}_4 & = \vec{g}+\vec{d}+\vec{h} \\
-\vec{u}_5 & = \vec{b}+\vec{d}+\vec{h} \\
-\vec{u}_6 & = 2\vec{h}+\vec{b}+\vec{g} \\
-\vec{u}_7 & = \vec{j}

Problem 4: The Inverse Change of Basis

Armed with an orthogonal set of basis vectors for the fold space arranged into a matrix U where each column is \vec{u}_k, we can convert a vector whose coordinates are numbers of folds \vec{n}_f into an arcosphere error vector \vec{z}_0 through a simple matrix multiplication:

\vec{z}_0 = U\vec{n}_f

However, in this problem we wish to compute \vec{n}_f for an arbitrary \vec{z}_0, and U is not a square matrix. Because the columns of U are orthogonal, we know U^{T}U is invertible, so we can use the pseudoinverse to find


(U^{T}U)^{-1}U^{T}\vec{z}_0 = \vec{n}_f

We also know that because the columns of U are orthogonal, U^{T}U is a diagonal matrix whose nonzero entries are the S_i from earlier, the dot product of each basis vector with itself, and its inverse is the diagonal matrix of their reciprocals. We can therefore simplify the expression to obtain the components of \vec{n}_f as


\vec{n}_{f,i} = \frac{T_i(\vec{z}_0)}{S_i}
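
For the basis we found above, this diagonal can be written out explicitly:


U^{T}U = \mathrm{diag}(S_1,\ldots,S_7) = \mathrm{diag}(4, 4, 8, 4, 4, 8, 8)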

As always, we are restricted to integer numbers of folds, so we must decide how best to handle this. In this case, we will choose to round towards zero (corresponding to integer division in most programming languages and in game), but other reasonable choices include:

  • round to nearest integer,
  • take the ceiling, or
  • scale the error vector to obtain integers similar to how we did in finding a basis.

We will consider the difference between integer division and rounding while bounding the residual error. Scaling the error vector seems promising, as it is generated by a production recipe, so scaling it amounts to running the production recipe multiple times. However, each recipe always has two forms and swaps between arcosphere flavors in the output randomly. So at best a scaling approach could be used to do average case balancing, but the error vector amounts to a random walk, as the error for each round of production accumulates. Analyzing and bounding the walk's distance from the origin with a fixed inversion based on the average case is beyond our scope here, but we note that in early trials this often caused the error vector to diverge rather than return to the origin or remain within the unit cell of the folding lattice, as is expected for what appears to be a higher dimensional random walk.

Balancing the Spheres

Combining everything above, we can finally construct an algorithm for dynamically balancing the arcospheres (a small code sketch of the core computation follows the list):

  1. We will run the production recipes for some number of steps (to be determined)
  2. We will measure the arcosphere error vector as the difference between the number of arcospheres of one flavor and the total number divided by 8 (the number of flavors)
  3. We will compute \vec{n}_f as above, truncating each coordinate toward zero (integer division)
  4. We will express -\vec{n}_f as a number of primitive folds and inversions based on each coordinate, rewriting negative coordinates as positive numbers of different folds using the negative expressions for each basis vector noted earlier
  5. We will run each fold and inversion the specified number of times, producing a new error vector of \vec{z}_1 = \vec{z}_0-U\vec{n}_f
  6. Ideally this new error vector has a smaller L1 norm than the starting vector
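
Here is a minimal C sketch of steps 2 and 3 (the in-game version is built from circuit combinators rather than code; the U and S arrays are just the basis vectors we found and their squared lengths written out):

/* Basis vectors u_1..u_7 in arcosphere coordinates
 * (eps, lam, gam, xi, zeta, theta, phi, omega). */
static const int U[7][8] = {
    { 0,  1, -1, -1,  1,  0,  0,  0},  /* u1 = c */
    {-1,  0,  0,  0,  0, -1,  1,  1},  /* u2 = e */
    {-1, -1, -1,  1,  1,  1,  1, -1},  /* u3 = 2a + c + e */
    { 1, -1,  0,  0,  1, -1,  0,  0},  /* u4 = b */
    { 0,  0, -1,  1,  0,  0, -1,  1},  /* u5 = g */
    { 1, -1, -1, -1, -1,  1,  1,  1},  /* u6 = 2d + b + g */
    {-1, -1,  1, -1,  1,  1, -1,  1},  /* u7 = i */
};

/* S_i = <u_i, u_i>, the squared length of each basis vector. */
static const int S[7] = {4, 4, 8, 4, 4, 8, 8};

/* Steps 2 and 3: compute the error vector from the measured counts,
 * then project it onto each basis vector and truncate. */
void fold_counts(const int counts[8], int nf[7])
{
    int total = 0, z0[8];

    for (int k = 0; k < 8; k++)
        total += counts[k];
    for (int k = 0; k < 8; k++)
        z0[k] = counts[k] - total / 8;      /* step 2: error vector */

    for (int i = 0; i < 7; i++) {
        int t = 0;                          /* t = T_i(z0) */
        for (int k = 0; k < 8; k++)
            t += z0[k] * U[i][k];
        nf[i] = t / S[i];                   /* C division truncates toward zero */
    }
}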

Bounding the Error

A number of times we have taken the engineer's approach of solving a problem in a vector space with real-valued coordinates and just hamfisting it into the lattice. Let's do that again as a quick sanity check on \vec{z}_1. If we remove the floor function and pretend everything can be a real number we find:


\vec{z}_1 & = \vec{z}_0 - U\vec{n}_f \\
\vec{z}_1 & = \vec{z}_0 - U(U^{T}U)^{-1}U^{T}\vec{z}_0 \\
\vec{z}_1 & = \vec{z}_0 - \vec{z}_0 = \vec{0}

The second equation contains an intimidating expression, but luckily the definition of the pseudoinverse can help us obtain the third equation. One property is that for a matrix U with orthogonal columns and pseudoinverse U^{+}, we have UU^{+}U = U. Furthermore, any linear combination of the columns of U will also map to itself, as scalar multiplication commutes with matrix multiplication. Therefore, as \vec{z}_0 is a linear combination of the basis vectors, U(U^{T}U)^{-1}U^{T}\vec{z}_0=\vec{z}_0.
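
Spelled out, writing \vec{z}_0 = U\vec{w} for some coefficient vector \vec{w}:


U(U^{T}U)^{-1}U^{T}\vec{z}_0 = U(U^{T}U)^{-1}(U^{T}U)\vec{w} = U\vec{w} = \vec{z}_0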

Note that these steps and properties are not true for the lattice we're actually working with. At this stage we're just checking that if we were in a friendlier system, our work would yield the behavior we wanted. Instead, in our situation we will define our lattice vector \vec{z}_L as


\vec{z}_L = \sum_{i=1}^{7} R(T_i(\vec{z}), S_i) \vec{u}_i

with the R function used here corresponding to our choice of strategy for converting the values of T_i and S_i to an integer. In game it is currently done with an integer division that rounds towards zero, but earlier we also suggested a division that rounds to the nearest integer. Our lattice basis vectors define a unit cell with which we can tile the arcosphere error lattice, and we can define a residual region that contains all error vectors for which \vec{z}_L = \vec{0}, that is, vectors for which our algorithm does not identify a corresponding lattice vector. Ideally the residual region is exactly the unit cell, but that may be hard to determine and will depend on the choice of R, so instead we will do all of our calculations considering only the residual region.

So to bound our error, we must find the vector of maximum L1 norm, subject to the constraint that it is inside the residual region. This is an integer programming problem and could be expressed and solved as such, but such problems are NP-hard in general. However, because our basis vectors are quite short and we have a small number of dimensions, we can instead iterate over all of the points inside the residual region, calculate their L1 norms, and find the maximum. Even though this has exponential scaling, for this specific case we can write the program in an hour or two and brute force it in under a second, which is fast enough. We can also find all of the vectors achieving this norm and the largest possible imbalance for any one arcosphere, as these will help us determine how many arcospheres are necessary to operate our system. By inspection, I expect the maximum absolute value of any coordinate inside the residual region to be no more than 3, because any vector with a 4 and -4 pair will surely have |T_i| \geq S_i for at least one i, as all the basis vectors have only -1, 0, and 1 as coordinates and have S_i of 4 or 8. To be conservative we can iterate over a region including -4 and 4, for a total of 9^7 = 4782969 vectors, or even expand to -5 and 5 for 11^7 = 19487171 points. The full source for the program written to do this search is available in a gist on github, because it ended up over a hundred lines, which is a bit too much to include here inline.
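
A condensed sketch of the search, reusing the U and S arrays from the earlier sketch (the full program in the gist does more bookkeeping and reporting):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long inside_count = 0;
    int max_l1 = 0;

    /* Enumerate the 7 free coordinates over [-4, 4]; the 8th coordinate
     * is determined by the sum-to-zero constraint. */
    for (long n = 0; n < 4782969L /* 9^7 */; n++) {
        int z[8], sum = 0, l1 = 0, inside = 1;
        long m = n;

        for (int k = 0; k < 7; k++) {
            z[k] = (int)(m % 9) - 4;
            sum += z[k];
            m /= 9;
        }
        z[7] = -sum;

        /* Residual region for truncation: |T_i(z)| < S_i for every i,
         * so every fold count truncates to zero. */
        for (int i = 0; i < 7 && inside; i++) {
            int t = 0;
            for (int k = 0; k < 8; k++)
                t += z[k] * U[i][k];
            if (abs(t) >= S[i])
                inside = 0;
        }
        if (!inside)
            continue;

        inside_count++;
        for (int k = 0; k < 8; k++)
            l1 += abs(z[k]);
        if (l1 > max_l1)
            max_l1 = l1;
    }

    printf("points in residual region: %ld\n", inside_count);
    printf("maximum L1 norm: %d\n", max_l1);
    return 0;
}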

After running the program we find a lot of useful answers:

  • The maximum error vector L1 norm is 12
  • 64 vectors achieve this maximum imbalance, as a combination of the values -2, -2, -1, -1, 1, 1, 2, 2 (but not every ordering)
  • There are only 6893 integer points with |T_i| < S_i for all i
  • The maximum absolute value of any coordinate inside the residual region is 3 as expected

So for any set of production recipes that requires N_k arcospheres of flavor k, having N_k+3 total arcospheres of that flavor is sufficient to ensure that the above algorithm can always balance the arcospheres enough to run the recipe set again. For example, a system with 80 total arcospheres will ideally have 10 of each flavor and will have no fewer than 7 of any flavor after balancing, so it can be used to run a set of recipes that consumes up to 7 of any one flavor, without needing to reuse the result arcospheres, before balancing must be performed on the complete set of 80 arcospheres again, at which point we can run the same set of recipes (using 7 arcospheres of that type) again. In practice, the game also has a limitation on circuit logic that excludes signals with zero value from "each" operations in arithmetic combinators (for good reason), so we must ensure that no count ever hits zero, and thus cannot consume more than 6 of any one flavor in this system. Some production recipes can be grouped together, such as the tier 3 deep science cards, which are all consumed together in the same quantities, but others like naquium tesseracts must run independently. This should be used to decide how many production steps to run in step 1 of our balancing algorithm earlier.

Earlier we also mentioned that we wanted to consider the impact of using rounding rather than truncation. If we modify the program to use residual_round instead of residual_floor as get_residual then we find improvements:

  • The maximum error vector L1 norm becomes 8
  • Only 24 vectors have this much imbalance, as a combination of only 1s and -1s
  • The maximum error for any one arcosphere is reduced to 2

This corresponds to one more recipe per arcosphere flavor, which would improve the amount of time spent producing rather than balancing. That may justify the effort of implementing correct rounding using circuits in game for this calculation.
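
One way to express round-to-nearest using only the integer operations the circuit network offers is the following sketch (ties round away from zero; s would always be one of the positive S_i values):

/* Round t/s to the nearest integer using only integer arithmetic,
 * assuming s > 0. Ties round away from zero. */
int div_round_nearest(int t, int s)
{
    if (t >= 0)
        return (t + s / 2) / s;
    return -((-t + s / 2) / s);
}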

Final Design

In game, I built a machine to employ the above algorithm:

arcosphere balancer screenshot

We'll break down the details of the design in part 2 (a link will appear here once it's posted), since the actual implementation is really only concerned with game mechanics. But as a quick preview: it has both synchronous and asynchronous logic sections and uses three separate state machines working together to coordinate a balancing phase, a production phase, and a measurement phase. It has exactly one gravimetrics facility for each folding and inversion recipe, triggered the correct number of times to perform balancing, and it has no static arcosphere cost: there are no arcospheres trapped as catalysts in any loops or kept in a buffer chest or machine input waiting for their balancer to be enabled. It's a cool design and very satisfying to have built in game, with this analysis supporting its correctness, but it is quite slow, especially compared to a naive, bot-based ratio balancer like the one described at the start. Maybe by the time part 2 is finished and posted I will have optimized or rebuilt it enough that it compares favorably.

What does a linked list implementation tell you about a C programmer? Why is it a historically common interview question? Is it still worth asking someone in 2024? A couple years ago this post by Hillel Wayne discussed the possible history of the question and how it became an interview staple for C programmers. I agree with essentially all of the analysis there, so you should read it first before getting into why I think this question still has relevance: especially when interviewing candidates for senior positions, it identifies programmers who are familiar with C idioms and not just C syntax.

Imaginary Interview

Let's imagine I'm interviewing someone and ask them to implement a singly linked list that supports appending, inserting into the list in an arbitrary spot, and removing elements from the list. I ask them to go for something basic so that we can start talking about how they've implemented it and how it might be improved. They quickly produce code like this, after a little bit of the usual C testing and debugging:

#include <stdlib.h>

struct slist {
    struct slist *next;
    void *data;
};

#define foreach_slist(_list, _data) \
    for (struct slist *_iter = _list; _iter && ((_data = _iter->data)); _iter = _iter->next)

struct slist *slist_alloc(void *data) {
    struct slist *node = calloc(1, sizeof(*node));
    node->data = data;
    return node;
}

void slist_insert_after(struct slist *head, void *data) {
    struct slist *node = slist_alloc(data);
    node->next = head->next;
    head->next = node;
}

void slist_append(struct slist *head, void *data) {
    while (head->next)
        head = head->next;
    slist_insert_after(head, data);
}

void slist_remove_after(struct slist *head) {
    if (head->next) {
        struct slist *node = head->next;
        head->next = node->next;
        free(node);
    }
}

struct slist *slist_remove(struct slist *head, void *data) {
    if (!head)
        return NULL;
    if (head->data == data) {
        struct slist *rtn = head->next;
        free(head);
        return rtn;
    }
    head->next = slist_remove(head->next, data);
    return head;
}

Here's the matching test/user, storing some ints in our list to simplify the test implementation, with the understanding that in a real program it'd be storing pointers instead:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *data;
    struct slist *head = slist_alloc((void *)(intptr_t)5);
    slist_append(head, (void *)(intptr_t)6);
    slist_append(head, (void *)(intptr_t)7);

    printf("Old list: ");
    foreach_slist(head, data) {
        int val = (int)(intptr_t)data;
        printf("%d ", val);
    }
    printf("\n");

    head = slist_remove(head, (void *)(intptr_t)6);
    printf("New list: ");
    foreach_slist(head, data) {
        int val = (int)(intptr_t)data;
        printf("%d ", val);
    }
    printf("\n");

    return 0;
}

We dive into it. There's a softball problem: slist_remove is rebuilding the entire list, so I ask them why they've done it that way and what could be done to fix it? My interviewee immediately answers, "I realized it as I was writing the function out, but to fix it properly I want a list container so that the list head being passed to slist_remove never needs to be deleted--it'd take the container instead and the head pointer in the container could be deleted without confusing the caller or complicating list management. The remove code would then be able to traverse the list and remove just one node without reassigning all of the next pointers. It'd also simplify working with an empty list, because in the current version the user has to handle an empty list as a null pointer, which creates a lot of special cases in the user code that should instead be inside the library code."

The first version would never have been submitted for review, but during an interview one might be nervous or feel pressured to produce something as fast as possible, so the first implementation is going to have problems like this. But they already understand them and have a good plan for fixing them, so there's no real reason to dwell on them, and I skip another small problem in the foreach macro (what happens if you store a 0 integer? can you work around this?).

No Allocations

We can, instead, dive into the real question I wanted to ask all along: "This implementation performs too many allocations. I would like a linked list that does no allocations at all. How would you do that?"

The interviewee says "Well, that is pretty simple but the code is not very reusable and you'd have to re-implement all the linked list operations for each struct that was to be stored in a list. We just locally implement a linked list as part of some other struct foo, as it is quite easy:"

They type out some pseudo-C to illustrate their point:

struct foo {
    struct foo *next;
    int bar;
};

void foo_append(struct foo * restrict head, struct foo * restrict new) {
    ...
}

struct foo *foo_remove(struct foo *head, struct foo *rem) {
    ...
}

They explain further, "so these append and remove functions would be essentially the same as the generic ones from before but they'd be specific to struct foo, and we'd need to re-implement them for each struct. Maybe we could stick in some kind of macro to generate it, but really we should just use C++ and implement it as a template to make the implementation easier to understand, maybe something along these lines:"

template<typename T>
class ListNode {
public:
    ListNode<T> *next;

    void append(ListNode<T> *node) { ... }
    ListNode<T> *remove(ListNode<T> *rem) { ... }
};

class Foo : public ListNode<Foo> {
    ...
};

"Otherwise, we need to stick to the generic version in C with the node allocations. It would be a poor design to have a dozen linked list implementations in our code base in order to avoid allocations."

Do you agree? Is it possible to satisfy both objectives at once and produce a no-allocation linked list implementation that is generic and reusable in C only, or is another language required?

The C Version

The C standard specifies that the first member of a struct has the same address as the struct itself. That is, with a struct defined like so,

struct slist_node {
    struct slist_node *next;
};

struct foo {
    struct slist_node node;
};

struct foo x;

The values of &x and &x.node are the same. So if a pointer to struct slist_node is known to be contained within a struct foo, we can convert between the two pointers with a typecast and have valid behavior. By embedding the list node struct inside our structure at the start, we can freely convert between the two, so we can write a generic linked list in terms of struct slist_node and embed it in an arbitrary structure to be able to use it with the linked list implementation. It is both a generic implementation and it requires no allocations to do list operations! At first, however, this seems dangerous: what if someone unfamiliar with it inserts another value in front of the list node in the struct definition? Can we prevent that? We actually can with static_assert:

static_assert(offsetof(struct foo, node) == 0, "node must come first in foo");

Now if another programmer changes our layout they'll see a compiler error and we have achieved some level of safety, as our assumptions about the struct layout are encoded and verified by the compiler. There's an entirely separate question of ensuring that we only encounter struct slist_nodes that are contained within the right type of struct, but that issue is addressed in a different part of the design dealing with how code is broken into translation units.

This raises even more questions: what if I want to put struct foo into two linked lists at the same time, as this requires two list nodes? What if I adopt this more broadly and I want struct foo to be a linked list node and also be something else, like an event, which might have a similar requirement to place struct event at offset zero in order to do conversion?

Hopefully it is clear that the answer is pointer arithmetic. The first pass took advantage of the location of the first element of a struct being known--there is never padding at the start of a struct--so a list node pointer could be converted to a foo pointer by doing nothing. We can generalize this to a conversion between a member and its container by subtracting the offset of that member within the container from a given pointer:

struct foo *to_foo(struct slist_node *node) {
    return (struct foo *)((char *)node - offsetof(struct foo, node));
}

In this way, regardless of how the layout of struct foo changes, the correct pointer conversion will be performed to obtain the containing structure address based on the address of the node member. This is so common in the Linux kernel that there's an associated macro to further improve the robustness by adding some type checking to ensure the casts aren't covering up a mistake:

#define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
#define container_of(ptr, type, member) ({                \
    void *__mptr = (void *)(ptr);                   \
    static_assert(__same_type(*(ptr), ((type *)0)->member) ||   \
              __same_type(*(ptr), void),            \
              "pointer type mismatch in container_of()");   \
    ((type *)(__mptr - offsetof(type, member))); })

Armed with the implementation of container_of from the kernel we can provide our most general linked list implementation, which supports putting a single struct into multiple lists, does not allocate memory, and can be combined with any struct using a small amount of glue code to do the necessary type conversions:

struct slist_node {
    struct slist_node *next;
};

struct slist {
    struct slist_node head;
};

#define foreach_slist(_list, _iter) \
    for (_iter = (_list)->head.next; _iter; _iter = _iter->next)

void slist_append(struct slist *list, struct slist_node *node) {
    struct slist_node *head = &list->head;

    while (head->next)
        head = head->next;

    head->next = node;
}

void slist_insert_after(struct slist_node * restrict head,
    struct slist_node * restrict node)
{
    node->next = head->next;
    head->next = node;
}

void slist_remove_after(struct slist_node *node) {
    if (node->next)
        node->next = node->next->next;
}

void slist_remove(struct slist *list, struct slist_node *node) {
    struct slist_node *prev = &list->head;
    struct slist_node *next = list->head.next;

    while (next) {
        if (next == node) {
            slist_remove_after(prev);
            return;
        }

        prev = next;
        next = next->next;
    }
}

The matching user code looks like this:

#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
#define container_of(ptr, type, member) ({                \
    void *__mptr = (void *)(ptr);                   \
    static_assert(__same_type(*(ptr), ((type *)0)->member) ||   \
              __same_type(*(ptr), void),            \
              "pointer type mismatch in container_of()");   \
    ((type *)(__mptr - offsetof(type, member))); })

struct foo {
    struct slist_node node;
    int value;
};

struct foo *to_foo(struct slist_node *node) {
    return container_of(node, struct foo, node);
}

int main(void) {
    struct foo a = { .value = 1 };
    struct foo b = { .value = 2 };
    struct foo c = { .value = 3 };
    struct slist list = {0};
    struct slist_node *node;

    slist_append(&list, &a.node);
    slist_append(&list, &b.node);
    slist_append(&list, &c.node);

    printf("Old: ");
    foreach_slist(&list, node) {
        struct foo *f = to_foo(node);
        printf("%d ", f->value);
    }
    printf("\n");

    slist_remove(&list, &b.node);
    printf("New: ");
    foreach_slist(&list, node) {
        struct foo *f = to_foo(node);
        printf("%d ", f->value);
    }
    printf("\n");

    return 0;
}

Is This a Useful Insight?

I started off by asserting that the point of asking this question is to help identify people who are familiar with C-specific idioms, not just with the language syntax or the notion of storing a few pointers or swapping them, as ultimately the default linked list implementation in C, Java, and other imperative languages looks structurally similar. Although the goal can be accomplished in C++ and possibly in other languages, doing it in C requires not just pointer management but pointer arithmetic and a level of comfort with the memory layout of a structure that is rarely relevant in another language.

In an embedded system, the resource constraints generally require this kind of thinking. Dynamic memory allocation may be disallowed entirely, as it introduces additional bug potential and complicates system analysis: malloc has unbounded runtime, and memory fragmentation increases the worst case bound on overhead, which could drive up cost if larger amounts of external memory are required as a result. In contrast, this approach has less per-element memory usage (one pointer rather than two) and very consistent, predictable performance for list operations. Outside of embedded, any system that implements some kind of object-like or polymorphic functionality is going to require one of two main implementation strategies; this is one of them, and the other involves unions, type identifiers, and large switch statements (sketched below).
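
For contrast, here is a minimal sketch of that other strategy: a tagged union with a type identifier and a switch. The types and names here are illustrative, not from any particular code base.

enum shape_kind { SHAPE_CIRCLE, SHAPE_RECT };

struct shape {
    enum shape_kind kind;   /* type identifier */
    union {
        struct { double r; } circle;
        struct { double w, h; } rect;
    } u;
};

/* "Polymorphic" behavior dispatched with a switch on the tag. */
double shape_area(const struct shape *s)
{
    switch (s->kind) {
    case SHAPE_CIRCLE:
        return 3.14159265358979 * s->u.circle.r * s->u.circle.r;
    case SHAPE_RECT:
        return s->u.rect.w * s->u.rect.h;
    }
    return 0.0;
}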

It's not a perfect question, but it's a reasonable way to start a discussion of all of these topics, grounded in simple code free of algorithmic challenges.

In C, functions are generally not considered first-class objects. Although pointers to functions exist and can be used to build some routines with rudimentary higher-order behavior, there is no language syntax for dynamically creating a long-lived function at run time, partially evaluating a function, creating a closure, etc. But since we can build interpreters for higher level languages that do support these features, surely there must be a way to achieve this?

Basic callbacks

A basic use case for passing one function to another in C is a situation where we have asynchronous events that require notification. To solve this, we typically create a function called a callback and register it with some kind of event-generating framework to notify us of these asynchronous events. Furthermore, we likely want some kind of context to go along with the function, so that the same function definition can be used with multiple different context objects to handle different events.

If we have control of both sides of the API, then we could define a callback interface where we register a function which takes an additional parameter representing this context information. This is a common C idiom: in the real world it is used in the Linux kernel to register interrupt handlers so that they can be called with a pointer to the struct device that they need to actually handle the interrupt (interrupt handlers have two parameters, however, not just one, because they also include the interrupt number). An example API and usage for a one argument event system might look something like this:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_EVENTS 4

typedef int (*callback)(void *ctx);

struct handler {
    callback fn;
    void *ctx;
};

static struct handler handlers[NUM_EVENTS] = {0};

int register_callback(uint32_t id, callback fn, void *ctx) {
    if (id >= NUM_EVENTS)
        return -EINVAL;

    if (handlers[id].fn)
        return -EBUSY;

    handlers[id].fn = fn;
    handlers[id].ctx = ctx;
    return 0;
}

int raise_event(uint32_t id) {
    if (id >= NUM_EVENTS)
        return -EINVAL;

    if (handlers[id].fn)
        return handlers[id].fn(handlers[id].ctx);

    return -ENOSYS;
}

int print_data(void *ctx) {
    const char *name = ctx;
    printf("got event name = %s\n", name);
    return 0;
}

static const char event0[] = "event 0";
static const char event1[] = "event 1";

void test_events(void) {
    register_callback(0, print_data, (void *)event0);
    register_callback(1, print_data, (void *)event1);

    raise_event(0);
    raise_event(1);
}

which produces this output when we call test_events() from another function, giving us our desired behavior:

got event name = event 0
got event name = event 1

However, this approach requires that the callback API was designed to support this use case. When we're in charge we can guarantee that it does, but if we're designing a library we can't guarantee that we will support all use cases and will likely opt for a simple, one parameter version. What if the user needs two pieces of context? They could of course pack any number of contexts into a wrapper context and then manually introduce an unpacking step, but this introduces additional complexity. Now, to use our two parameter function with an API that only offers a single void pointer parameter, we will need to:

  1. Allocate a context wrapper and populate it with our parameters and function pointer
  2. Register a specific unwrapping function (not the function we actually wanted to call)
  3. Deallocate the context wrapper after we're done, in addition to everything else, which creates one more opportunity to leak memory
  4. Potentially implement a new wrapping function and struct for each two parameter function signature we need, or embrace void * and lose more type safety with one generic two parameter wrapping version

Here's a basic example of how this might look.

struct ctx_wrap {
    int (*fn)(const char *, int);
    const char *name;
    int id;
};

int unwrapping_thunk(void *ctx) {
    struct ctx_wrap *wrap = ctx;
    return wrap->fn(wrap->name, wrap->id);
}

struct ctx_wrap *alloc_wrap(int (*fn)(const char *, int), const char *name, int id) {
    struct ctx_wrap *ret;

    ret = calloc(1, sizeof(*ret));
    if (!ret)
        return NULL;

    ret->fn = fn;
    ret->name = name;
    ret->id = id;
    return ret;
}

void free_wrap(struct ctx_wrap *wrap) {
    free(wrap);
}

static const char event2[] = "event 2";

int print_two_data(const char *name, int id) {
    printf("got event name %s, id %d\n", name, id);
    return 0;
}

void test_events2(void) {
    struct ctx_wrap *wrap = alloc_wrap(print_two_data, event2, 100);
    register_callback(2, unwrapping_thunk, wrap);

    raise_event(2);
    free_wrap(wrap);
}

When executed, test_events2 illustrates the behavior we wanted, calling our function print_two_data with the two arguments we've specified, while using the single-context-parameter register_callback and raise_event functions from earlier:

got event name event 2, id 100

This approach provides a relatively general solution. We can pass any number of fixed parameters to our callback function using only one context pointer, and it can easily be integrated with systems where the callback also receives dynamic, event-specific parameters.
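
As a quick illustration of that last point, if the framework's callback type also passed an event-specific payload, the unwrapping thunk would simply forward it along with the stored parameters. This is a hypothetical sketch; the wrapper struct gains a matching function pointer field:

/* Hypothetical variant where callbacks also receive a dynamic payload. */
struct ctx_wrap2 {
    int (*fn)(const char *, int, int);  /* fixed args plus payload */
    const char *name;
    int id;
};

int unwrapping_thunk2(void *ctx, int payload) {
    struct ctx_wrap2 *wrap = ctx;
    /* Fixed parameters come from the wrapper; payload is per-event. */
    return wrap->fn(wrap->name, wrap->id, payload);
}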

Fixing parameters to functions, "partial evaluation"

However, what if we need to provide context when working with an API where we are not allowed parameters at all? What if we really want to create a self-contained, callable function that has no externally visible context data to manage separately? Ideally, the end result in this scenario is a function that enables us to do something like this:

int foo(int x) {
    return x + 1;
}

void demo(void) {
    int (*bar)(void);
    bar = alloc_partial(foo, 1);
    printf("%d\n", bar());
    free_partial(bar);
}

and receive an output of 2.

To do this we should start by building off of the idea of a wrapper context and an entry thunk that will decode the wrapper context and then call our target function with the saved parameters. But we also need to overcome these challenges:

  1. We need to actually create a new function, somewhere in memory, that can be called just like any function that was compiled into the program through a function pointer
  2. We need to store data that is specific to each instance of that function as part of the memory region used to store the function itself so they can be deallocated together
  3. The thunk needs to refer to variables that are not declared on the stack, are not passed as arguments, and do not have static storage, without a separate pointer available to find them
  4. We will not be using any external tools (like asking clang to compile a string and then copying the binary into our memory space or loading a shared object it produces) to do this
  5. There's no standards-compliant way to express this in C

So, we will need to carefully design our thunk. A quick disclaimer, as there is a bit of assembly in the upcoming bits: day to day I may read armv7 or armv8 assembly, but I generally do not write much, and my x86 assembly is based mostly on how I would expect it to be done on ARM, so I don't do a very good job of taking advantage of x86 addressing options (but the compiler does, as we will see later). I set out at first assuming that the C compiler could not or would not help me, and hand wrote an assembly-only implementation, wrapped in a naked C function for ease of integration with the rest of my program:

__attribute__((naked))
void __thunk() {
    asm volatile("lea -0x17(%rip), %rax");
    asm volatile("mov 8(%rax), %rdi");
    asm volatile("mov 0(%rax), %rax");
    asm volatile("jmp *%rax");
}

This function is declared with the naked attribute so that no additional assembly related to stack frame management is generated. It is interesting to note that GCC inserts an intentional undefined instruction after the function body, presumably to catch mistakes and trigger an exception immediately. I think this is a very nice touch: it would catch a mistake such as making a tail call to another function using the call opcode rather than jumping to it, because ud2 would execute after that function returned here (instead of to my caller, as intended) and trigger an exception rather than falling through into whatever is located after __thunk:

0000000000001600 <__thunk>:
    1600:       48 8d 05 e9 ff ff ff    lea    -0x17(%rip),%rax        # 15f0 <__libc_csu_fini>
    1607:       48 8b 78 08             mov    0x8(%rax),%rdi
    160b:       48 8b 00                mov    (%rax),%rax
    160e:       ff e0                   jmp    *%rax
    1610:       0f 0b                   ud2

So how does this thunk work? It uses the current instruction pointer, available in the %rip register, to create a pointer to memory that resides immediately before the function body by adding the constant -0x17. We use -0x17 because this encoding of lea is 7 bytes long, and we wish to further subtract space for two 8 byte (= 0x10 total) values that are stored before lea in memory.

From there we read the function argument at ptr+8, as a uint64 sized value, into %rdi, which is where the one argument to a function of type int (*fn)(int) is expected according to the calling convention being used. Then we read the function pointer itself into %rax and jump to %rax to forward to the function we wanted to call originally.

To go along with this thunk, we need another function that will copy the function pointer, function argument, and thunk body into the right places in memory. Furthermore, we need to allocate this memory specially, as memory returned by malloc is typically not executable. To allocate executable memory, we can use mmap directly to request a fresh page of virtual memory with read, write, and execute permissions set:

buf = mmap(NULL, size, PROT_EXEC | PROT_READ | PROT_WRITE,
    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

Further furthermore, we will need to find a way to access the compiled version of __thunk including its length from within the source code. Luckily the linker can help us out. If we place the original thunk into its own section and add two external symbols (which will be filled in by the linker), we can calculate the length of the section at run time:

__attribute__((naked,section("thunk")))
void __thunk() {
    asm volatile("lea -0x17(%rip), %rax");
    asm volatile("mov 8(%rax), %rdi");
    asm volatile("mov 0(%rax), %rax");
    asm volatile("jmp *%rax");
}
extern uint8_t __start_thunk;
extern uint8_t __stop_thunk;

The way __start_thunk and __stop_thunk work is that the symbols are overlaid with the start and end of the section, and the value of each would be a single byte from the compiled representation of __thunk. So to compute the size of the section (which contains only our function, and thus the section size equals the function size), we subtract the addresses of the two symbols: size = &__stop_thunk - &__start_thunk.

We can combine all of this to create a very rough function that accomplishes our goal and creates a callable object:

typedef int (*retint)(void);

retint partial(int (*fn)(int), int x) {
    uint64_t *buf;
    retint result;
    size_t tsize;
    size_t size;

    tsize = &__stop_thunk - &__start_thunk;
    size = tsize + 2*sizeof(uint64_t);

    buf = mmap(NULL, size, PROT_EXEC | PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == (void *)-1) {
        perror("mapping failed");
        return NULL;
    }

    buf[0] = (uint64_t) fn;
    buf[1] = x;

    memcpy(buf+2, __thunk, tsize);
    return (retint)(buf+2);
}

int main(int argc, char **argv) {
    retint foo_partial;
    foo_partial = partial(foo, 7);
    printf("%d\n", foo_partial());
    return 0;
}

This will print out 8 when it runs, just as we expected. We can now create function objects that store another function pointer and a single argument internally and otherwise work just like regular zero argument functions that we can call directly from anywhere in our code, forwarding the call to the stored pointer along with the stored argument!

A little C as a treat

We could be happy here, but it is really unfortunate that the entire thunk is implemented in assembly. Things seem kind of brittle, and if I wanted to have a function with a different argument type, or a function with two arguments, I'd need to re-create it by hand and manually check that everything lines up correctly, plus I would need to handle more of the calling convention and I have no desire to figure out all of the details. Ideally, we get the C compiler to cooperate with us as much as possible so that it can take care of these things, so let's start on version 2.

The first thing we want to do is define a C struct to handle our memory layout:

struct partial_fn {
    int (*fn)(int);
    uint64_t a0;
    char body[];
};

This will store our function pointer, one argument, and a flexible array member to represent the body of the thunk that will be copied around. From there, we can define a new thunk that is much less brittle, using lea to get a pointer to this struct and then locating the function we wish to call and its argument with the rest of the thunk written in C:

#define LEA_SIZE 0x07
__attribute__((section("t2")))
int __t2() {
    struct partial_fn *ctx;
    asm("lea %c1(%%rip), %0" : "=r" (ctx) : "i" (-LEA_SIZE - offsetof(struct partial_fn, body)));
    return ctx->fn(ctx->a0);
}
extern uint8_t __start_t2;
extern uint8_t __stop_t2;

Since we need to use the gcc extended assembly syntax in order to reference the local variable ctx, we can also introduce a computed constant to replace the hand-coded -0x17 from before. It will account for changes to the layout of struct partial_fn, but it will not adapt to the size of any prologue instructions inserted before the lea. Unfortunately, this version doesn't work out of the box: since the function is not declared naked, gcc will insert a prologue and epilogue by default, resulting in disassembly like this:

00000000000016c4 <__t2>:
    16c4:       55                      push   %rbp
    16c5:       48 89 e5                mov    %rsp,%rbp
    16c8:       48 83 ec 20             sub    $0x20,%rsp
    16cc:       48 8d 05 e9 ff ff ff    lea    -0x17(%rip),%rax        # 16bc <__thunk+0xb>
    16d3:       48 89 45 e8             mov    %rax,-0x18(%rbp)
    16d7:       48 8b 45 e8             mov    -0x18(%rbp),%rax
    16db:       48 89 45 f0             mov    %rax,-0x10(%rbp)
    16df:       48 8b 45 e8             mov    -0x18(%rbp),%rax
    16e3:       48 83 c0 08             add    $0x8,%rax
    16e7:       48 89 45 f8             mov    %rax,-0x8(%rbp)
    16eb:       48 8b 45 f0             mov    -0x10(%rbp),%rax
    16ef:       48 8b 10                mov    (%rax),%rdx
    16f2:       48 8b 45 f8             mov    -0x8(%rbp),%rax
    16f6:       8b 00                   mov    (%rax),%eax
    16f8:       89 c7                   mov    %eax,%edi
    16fa:       ff d2                   call   *%rdx
    16fc:       c9                      leave
    16fd:       c3                      ret

Note that I've left the offset at -0x17 and compiled it just to grab this disassembly; this version won't work because that offset is incorrect for locating the function pointer and argument. However, we implemented something that was logically equivalent using only four instructions, so this seems wasteful--maybe we've failed at getting the compiler to understand what we want? Not really, we just need to turn on optimization. If I build with -O2, this output is created for __t2 instead, which is exactly what we were hoping to see:

0000000000001600 <__t2>:
    1600:       48 8d 05 e9 ff ff ff    lea    -0x17(%rip),%rax        # 15f0 <__thunk+0x10>
    1607:       8b 78 08                mov    0x8(%rax),%edi
    160a:       ff 20                   jmp    *(%rax)

Even better, the compiler has outdone me: it knows about an indirect jump addressing mode that I didn't know about at the time, so it is able to combine the two-instruction sequence I used into a single instruction that jumps through the stored function pointer. So, based on our source code, gcc is able to infer that we do not need a stack frame, do not need a prologue or epilogue, and can optimize this as a tail call using jmp rather than call and ret.

With the memory layout codified as a struct definition, we can also improve the definition of partial in order to avoid manually computing the offsets or locations of things. This gives us a much more robust implementation:

retint partial2(int (*fn)(int), int x) {
    struct partial_fn *result;
    size_t tsize;
    size_t size;

    tsize = &__stop_t2 - &__start_t2;
    size = tsize + sizeof(*result);

    result = mmap(NULL, size, PROT_EXEC | PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (result == (void *)-1) {
        perror("mapping failed");
        return NULL;
    }

    result->fn = fn;
    result->a0 = x;

    memcpy(result->body, __t2, tsize);
    return (retint) result->body;
}

Testing this version gives the expected behavior:

int main(int argc, char **argv) {
    retint foo_partial;
    foo_partial = partial2(foo, 3);
    printf("%d\n", foo_partial());
    return 0;
}

prints 4 to the terminal. Now we have an implementation that is robust against changes to the memory layout and respects the calling convention; the only magic left is the single line of inline assembly and the assumption that it becomes the first instruction in the function body.
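One way to soften that last assumption would be to check the first bytes of the thunk at runtime. Here is a sketch of my own (not part of the code above) that hard-codes the x86-64 encoding of a %rip-relative lea:

#include <stdint.h>

/* Hypothetical sanity check: verify the thunk body really begins with
 * "lea disp32(%rip), %reg" before trusting the hard-coded LEA_SIZE.
 * The encoding is a REX.W prefix (0x48, or 0x4c when the destination
 * is r8-r15), the lea opcode 0x8d, and a ModRM byte with mod=00 and
 * r/m=101 selecting %rip-relative addressing. */
static int thunk_starts_with_rip_lea(const uint8_t *body) {
    return (body[0] & 0xfb) == 0x48   /* REX.W, ignoring the REX.R bit */
        && body[1] == 0x8d            /* lea */
        && (body[2] & 0xc7) == 0x05;  /* mod=00, r/m=101: disp32(%rip) */
}

partial2 could call thunk_starts_with_rip_lea((const uint8_t *)__t2) and refuse to build the closure when a new compiler or optimization level moves the lea, failing loudly instead of jumping into garbage.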

Do it without inline assembly?

Finally, I decided to try something I had ruled out earlier. This goes a little outside the defined behavior of the C language and relies on implementation and platform details that are known to differ between architectures, but it yields a solution that mostly appears to work on x86 (though not on armv8, for example). I decided to see where gcc places function-level static variables on x86: are they located near the function, and are they accessed with %rip-relative addressing? It turns out they're placed in the data section (largely for security reasons--overflows of buffers allocated in data cannot inject code into the read-only executable pages), but they are accessed with a relative load. Unfortunately, the offset between the code and the data is large and changes with the amount of code, the number of variables, the link order, etc., so it is too hard to predict, and we cannot make the partial function store data at that offset relative to the new copy of the thunk without a lot of extra work to read the offset out of the compiled thunk. But function pointers are accessed relative to the current instruction pointer as well. So this snippet accomplishes what we want and gives us a pointer to the partial_fn struct allocated by the partial function, though the behavior is not guaranteed by the standard:

void *x = &__t2;
struct partial_fn *fn = x - sizeof(struct partial_fn);

This version works with optimization enabled or disabled, and at -O2 it produces the shortest possible implementation by not materializing the struct pointer at all; instead it uses just two instructions: one to load the argument relative to %rip, and one to jump through the stored function pointer:

0000000000001690 <__t3>:
    1690:       8b 3d f2 ff ff ff       mov    -0xe(%rip),%edi        # 1688 <__t2+0x8>
    1696:       ff 25 e4 ff ff ff       jmp    *-0x1c(%rip)        # 1680 <__t2>

This optimization was previously impossible because gcc doesn't optimize inline assembly beyond some dead code elimination, so the optimizer was not aware of what the lea did and could not fuse it with the following mov and jmp instructions. My understanding is that this is unlikely to change, and it is not something I would consider particularly important as a feature anyway.

This enables us to have version 3 implemented entirely in C:

extern uint8_t __start_t3;
extern uint8_t __stop_t3;

__attribute__((section("t3")))
int __t3(void) {
    void *x = &__t3;
    struct partial_fn *args = x - sizeof(struct partial_fn);
    return args->fn(args->a0);
}

retint partial3(int (*fn)(int), int x) {
    struct partial_fn *result;
    size_t tsize;
    size_t size;

    tsize = &__stop_t3 - &__start_t3;
    size = tsize + sizeof(*result);

    result = mmap(NULL, size, PROT_EXEC | PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (result == (void *)-1) {
        perror("mapping failed");
        return NULL;
    }

    result->fn = fn;
    result->a0 = x;

    memcpy(result->body, __t3, tsize);
    return (retint) result->body;
}

What happens on ARM?

This version, however, is no more portable than the versions with inline assembly. If I compile it for armv8 using gcc 9.3, it disassembles into this (even with -O2 enabled):

0000000000000de4 <__t3>:
 de4:   a9be7bfd        stp     x29, x30, [sp, #-32]!
 de8:   910003fd        mov     x29, sp
 dec:   90000000        adrp    x0, 0 <_init-0x748>
 df0:   91379000        add     x0, x0, #0xde4
 df4:   f9000be0        str     x0, [sp, #16]
 df8:   f9400be0        ldr     x0, [sp, #16]
 dfc:   d1004000        sub     x0, x0, #0x10
 e00:   f9000fe0        str     x0, [sp, #24]
 e04:   f9400fe0        ldr     x0, [sp, #24]
 e08:   f9400001        ldr     x1, [x0]
 e0c:   f9400fe0        ldr     x0, [sp, #24]
 e10:   f9400400        ldr     x0, [x0, #8]
 e14:   d63f0020        blr     x1
 e18:   a8c27bfd        ldp     x29, x30, [sp], #32
 e1c:   d65f03c0        ret

This is interesting for a number of reasons. When we run a program containing this function, it crashes; I get an Illegal Instruction, though other errors are possible, because it jumps to an unexpected spot in memory instead of to the function pointer we wanted. The cause can be seen in the two-instruction sequence used to calculate x:

 dec:   90000000        adrp    x0, 0 <_init-0x748>
 df0:   91379000        add     x0, x0, #0xde4

The adrp instruction is used to compute a PC-relative address (good), but does so in a roundabout way: it zeros the 12 lowest bits of the program counter and then adds the offset encoded in the instruction, which in this case is zero, so it loads the address of the start of the page containing the instruction. The next instruction then adds the constant 0xde4, which is the offset of __t3 from the start of the executable image, on the assumption that the image is loaded at a page boundary. The computed address is therefore only correct if the function sits at offset 0xde4 within its page--an offset that changes with the amount of code in the program, and that is certainly wrong once the body has been copied somewhere else.
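To make the failure concrete, here is a worked example with a made-up address. Suppose the mmap call in partial3 returns the page-aligned address 0x7f8000001000, so the 16-byte struct header sits there and the copied code begins at 0x7f8000001010. Condensing the pointer computation from the disassembly above:

adrp x0, 0            // x0 = PC & ~0xfff = 0x7f8000001000
add  x0, x0, #0xde4   // x0 = x0 + 0xde4  = 0x7f8000001de4
...
sub  x0, x0, #0x10    // x0 = x0 - 16     = 0x7f8000001dd4

The thunk then loads its "function pointer" from 0x7f8000001dd4, far past the 16-byte header it was supposed to read at 0x7f8000001000, and blr jumps through whatever happens to be there (zeros, in a fresh anonymous mapping), crashing one way or another.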

The other interesting thing about this assembly is that gcc is doing some very confusing things. It uses adrp and add to compute in two steps what could have been done in one; it writes the pointer to the stack and then immediately reads it back (instructions 0xdf4 and 0xdf8); it subtracts 16 from x0 without using the earlier address for anything else; it repeats the store-load pair at instructions 0xe00 and 0xe04; and it reloads x0 with the value it already contains at 0xe0c. The load/store pairs could be eliminated and the three remaining pointer-calculation instructions combined into one. I believe the same code could have been generated like this, though I haven't double-checked the calling convention; this is also the armv8 translation of the x86 instruction sequence I started with earlier:

adrp      x0, #0xdd4
ldr       x1, [x0]
ldr       x0, [x0, #8]
br        x1

To actually achieve what we wanted, adrp shouldn't be used at all; instead, the PC-relative ldr with a literal argument should be used to directly load the values stored just before our thunk in memory. This form is normally used for loading 32- or 64-bit constants stored directly in the .text section, but it works here too. I haven't investigated the easiest way to get gcc to emit this instruction, but the ideal assembly would look something like this instead:

ldr      x1, #-16
ldr      w0, #-12
br       x1

My syntax isn't entirely correct here, as this form normally needs to be written with a named literal as the second argument, which causes the assembler to encode it the ldr (literal) way (note also that each load's offset is relative to its own address, and that the int argument wants a 32-bit destination register). Accounting for that, this three-instruction sequence would accomplish the same behavior as the x86 version. I believe it would be reasonable for gcc to produce this sequence for version 3 above, since it is a simple interpretation of the source code, so I am surprised that it does what I wanted on x86 but not on armv8.
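For completeness, here is roughly what the named-literal form could look like in GNU assembler syntax; this is hand-written and untested, with the labels standing in for the struct fields that partial copies in front of the body:

        .align 3
fn:     .quad 0          // filled in at runtime: the saved function pointer
a0:     .word 0          // the saved int argument
        .word 0          // padding, so the body starts 16 bytes in
thunk:
        ldr x1, fn       // ldr (literal): PC-relative load of the pointer
        ldr w0, a0       // 32-bit PC-relative load of the argument
        br  x1           // tail call, like jmp on x86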

Conclusions

This is a pretty brittle way of dynamically creating function objects with saved parameters, but it has been fun to explore. It works well on x86 and can even be expressed in C without inline assembly, though that relies on under-specified corners of the language to obtain the address of the executing function via instruction-relative addressing. As the ARM disassembly showed, other architectures support instruction-relative addressing differently, and the technique may not work as expected; fundamentally, though, it is possible on both platforms with hand-written assembly. Maybe a future version of the C specification will include syntax and semantics for expressing this using only well-defined behavior, or will directly include facilities to create functions dynamically, though of course that would open the door to more complicated dynamic functions such as closures.

It's also been fun to compare the x86 and armv8 output and see what gcc is doing. I am surprised by the issues in the armv8 output from gcc 9.3, but perhaps that has improved by now (9.3 is almost two years old). If not, maybe I can file a bug report or try to fix it myself.

Do I plan to actually use this technique? I'm not sure yet, but I am tempted because it may simplify some implementation in a project I am working on.

Full code as a single gist.

Once a year, I have to pay my renter's insurance bill. I get an email from Progressive, go to the website and find that I can't remember the password. I pull up the password manager and find Progressive-Homesite in the list, but the password is incorrect. Oh, I forgot, 4 months ago I needed to print proof of insurance while I wasn't home and had to reset the password to log in on the office computer. I reset the password again and pay my bill, saving the new password. The same basic scenario plays out every year for every infrequent account I have: insurance, internet, electricity.

So, if I never know what my password is, how am I really being authenticated? It comes down entirely to having access to my email address, but the process is tedious:

  • Click lost password/forgot password/trouble logging in

  • Enter email address

  • Wait for email to arrive, check spam, etc.

  • Click a link, where I create a new password, sometimes with confusing results:

    [Screenshot: Homesite's password reset form]

  • Log in with the newly created password

These are all effectively low-security websites. Contrast this with my bank, where if I were to forget my password, I would need to make a phone call and verify my identity another way, or visit a branch in person. That said, security questions, common personal details, and the like are all pretty bad ways to verify someone's identity, because they get reused and cannot be changed, so the actual security of those processes may still be quite low. For the sake of argument, though, let's call a website low security when its password reset is fully automated, offering no opportunity for another human to notice something amiss, in contrast with a process where a person might stop and ask why someone requesting access to my account can't pronounce my last name consistently in conversation.

This process sucks. We've already established that if I have access to the email account, I can access anything that uses it as a lost-password address, so we should cut out the middleman and use email directly to authenticate. Google already has a solution for this with the Google Identity Platform, which is much faster to use, doesn't require me to store additional credentials, and is easier to build and use correctly than a homemade authentication system (cf. Homesite requiring special characters in the password and tripping over the ones my random generator produced above). And Google isn't the only one doing this: OpenID has been developed and deployed widely for authenticating users through third parties. If you don't like Google, you could use Facebook, or Twitter, or Steam. Just stop creating password-based logins. For the reboot of my blog I've used Google Identity Platform, and for the Dota 2 league I run, players sign in through Steam. In both cases the login security is handled by companies with better security teams than mine, and a data breach on my end reveals no private information--the account names and IDs are public in both cases, after all, and I store no passwords or security questions.

At the end of the day, though, OpenID is inconsistently implemented, so most stable providers--the big names we can reasonably expect to still exist in five years--have unique-ish APIs for it, or in some cases (like Steam) don't really support OpenID and instead use OAuth with a basic permission to read account details. So you currently end up supporting each authentication provider separately, and the identities they provide are not linked: you cannot programmatically tell that the same "person" operates both a Twitter and a Facebook account, even if they wish to make that information public, without having them log in to both during the same session on your site.

Ideally, users could register themselves with a public key that they provide. Both RSA and ECC signatures are standardized, have robust library support, and are already available on every client and server. A digital signature lets a user prove that they control the private key associated with a public key. I know TLS supports client certificate authentication, but I don't remember whether it permits self-signed certificates or whether that varies by implementation. In this case, because the user provides their public key at registration, the issuer doesn't need to be verified: we just need to see the same key again, plus proof of ownership, to authenticate.
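To make the proof-of-ownership step concrete, here is a minimal challenge-response sketch using libsodium's Ed25519 signatures. The function names are libsodium's real API, but the framing (a raw 32-byte challenge, both parties in one process) is simplified for illustration:

#include <sodium.h>
#include <stdio.h>

int main(void) {
    if (sodium_init() < 0)
        return 1;

    /* Registration: the client generates a keypair and sends pk to
     * the server; sk never leaves the client. */
    unsigned char pk[crypto_sign_PUBLICKEYBYTES];
    unsigned char sk[crypto_sign_SECRETKEYBYTES];
    crypto_sign_keypair(pk, sk);

    /* Login, step 1: the server issues a random challenge. */
    unsigned char challenge[32];
    randombytes_buf(challenge, sizeof challenge);

    /* Login, step 2: the client signs the challenge. */
    unsigned char sig[crypto_sign_BYTES];
    crypto_sign_detached(sig, NULL, challenge, sizeof challenge, sk);

    /* Login, step 3: the server verifies against the stored pk. */
    if (crypto_sign_verify_detached(sig, challenge, sizeof challenge, pk) == 0)
        puts("authenticated");
    else
        puts("rejected");
    return 0;
}

The server stores only the public key, so a breach of its database leaks nothing a user needs to change.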

Specialized applications, like the DoD's Common Access Card, have already deployed this strategy and proven very successful, but widespread deployment of a hardware security token faces challenges, because people are very good at losing their tokens. For the CAC and other truly sensitive situations, the recovery procedure makes sense: you have to go in person and answer a lot of questions. For my reddit account, though, there's no need to travel to California if my key is lost or stolen. Nonetheless, it'd be nice if some keys could be recovered or replaced, which means borrowing ideas from existing public key infrastructure--but without requiring a central authority.

In particular, users should be able to federate their keys: use a master key to sign certificates for secondary keys (a sketch follows the list below). Authentication can then be done by presenting a signed certificate, even though no external certificate authority has been involved: I sign my own secondary key for use on a mobile device, so that the master key for an identity does not need to be risked. Key revocation is more complicated, but a few options come to mind:

  • Since we are the masters of our own destiny, signed registration certificates could include a URL to check for revocation status whenever the matching key is presented in the future. However, an attacker could disable this singular service in order to keep using a stolen key along with its identity chain.

  • Use a distributed revocation network that allows a master key to publish both the creation and (later, if necessary) the revocation of a secondary key. With a blockchain, the history of a key is immutably documented, easy for identity consumers to consult, and not controlled by a central authority. There are, however, a lot of technical details to figure out there.

  • Perhaps the best option: require users to log in with a "more authoritative" key of some kind (in a three-tiered scheme, the intermediate key between the master and the daily key could serve this purpose) to discredit a lost or stolen key at the affected services, with no networked or third-party checks at all.
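Continuing the libsodium sketch above, a minimal version of the federation idea might look like this. The certificate layout is my own invention, not a standard: the "certificate" is just the secondary public key signed by the master secret key, and real metadata (an expiry date, a revocation URL) is omitted:

#include <sodium.h>

/* Issue: the master key signs the secondary ("mobile") public key. */
void issue_cert(unsigned char cert_sig[crypto_sign_BYTES],
                const unsigned char secondary_pk[crypto_sign_PUBLICKEYBYTES],
                const unsigned char master_sk[crypto_sign_SECRETKEYBYTES]) {
    crypto_sign_detached(cert_sig, NULL, secondary_pk,
                         crypto_sign_PUBLICKEYBYTES, master_sk);
}

/* Verify: a service that already trusts master_pk accepts a login that
 * presents (secondary_pk, cert_sig) and then proves possession of the
 * secondary secret key with the same challenge-response as above. */
int cert_is_valid(const unsigned char cert_sig[crypto_sign_BYTES],
                  const unsigned char secondary_pk[crypto_sign_PUBLICKEYBYTES],
                  const unsigned char master_pk[crypto_sign_PUBLICKEYBYTES]) {
    return crypto_sign_verify_detached(cert_sig, secondary_pk,
                                       crypto_sign_PUBLICKEYBYTES,
                                       master_pk) == 0;
}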

Something like that is pretty far away, but for now, I'd just be happier if most services switched to identity providers rather than storing usernames and passwords. It's easier for end users, increases security, and reduces the damage of data breaches.

Since its release last week, I've been playing quite a bit of Fallout 4. There's an interesting mini-game (present in previous iterations as well) for "hacking" computer terminals, where you must pick the passcode from a list of possibilities with a limited number of guesses. Each failed guess reveals the number of correct letters (correct in both value and position) in that particular word, but not which letters were correct, allowing you to deduce the passcode much as in the game "Mastermind." A natural question is, "what is the best strategy for identifying the correct passcode?" We'll ignore the possibility of dud removal and guess resets (which exist to simplify the game a bit) in the analysis.

Reformulating this as a probability question offers a framework to design the best strategy. First, some definitions: N denotes the number of words, z denotes the correct word, and x_i denotes a word on the list (in some consistent order). A simple approach suggests that we want to use the maximum likelihood (ML) estimate of z to choose the next word based on all the words guessed so far and their results:

 \hat{z} = \underset{x_i}{\mathrm{argmax}}~~\mathrm{Pr}(z=x_i)

However, for the first word, the prior is uniform--each word is equally likely--so the ML estimate tells us nothing. This might seem like the end of the line: just pick a word at random (or always pick the first one on the list). But future guesses depend on what this first guess tells us, so we'd be better off with a guess that maximizes the mutual information between the guess's result and the unknown password. Using the concept of entropy (which I've discussed briefly before), we can formalize "mutual information" into a mathematical definition: I(z, x) = H(z) - H(z|x). In this sense, "information" is what you gain by making an observation, measured by how it constrains the possible states of a latent variable. For more compact notation, let's define F_i=f(x_i) as the "result" random variable for a particular word, telling us how many letters matched, taking values  \{0,1,\ldots,M\} , where M is the length of the words in the current puzzle. Since H(z) is fixed before we choose, maximizing the mutual information I(z, F_i) is the same as minimizing the conditional entropy H(z|F_i), so our selection criterion becomes:

 \hat{z} = \underset{x_i}{\mathrm{argmin}}~~H(z|F_i)

But we haven't talked about what "conditional entropy" means, so it's not yet clear how to calculate H(z | F_i), apart from it being the entropy after observing F_i's value. Conditional entropy differs from conditional probability in a subtle way: conditional probability is based on a specific observation, such as F_i=1, while conditional entropy averages over all possible observations, reflecting how many system configurations remain after making an observation, regardless of its value. It's the sum of the entropy remaining after each possible observation, weighted by the probability of that observation:

 H(Z | X) = \sum_{x\in X} p(x)H(Z | X = x)

As an example, let's consider a puzzle with M=5 and N=10. We know that  \forall x_i,\mathrm{Pr}(F_i=5)=p_{F_i}(5)=0.1 . If we define the similarity function L(x_i, x_j) to be the number of letters that match in place and value for two words, and we define the group of sets  S^{k}_{i}=\{x_j:L(x_i,x_j)=k\} as the candidate sets, then we can find the probability distribution for F_i by counting,

 p_{F_i}(k)=\frac{\vert{S^k_i}\vert}{N}

As a sanity check, we know that  \vert{S^5_i}\vert=1 because there are no duplicates, and therefore this equation matches our intuition for the probability of each word being an exact match. With the definition of p_{F_i}(k) in hand, all that remains is finding H(z | F_i=k), but luckily our definition for S^k_i has already solved this problem! If F_i=k, then we know that the true solution is uniformly distributed in S^k_i, so

 H(z | F_i=k) = \log_2\vert{S^k_i}\vert
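Putting the two pieces together (and adopting the convention that empty sets contribute nothing, since 0\log 0 = 0), the quantity to minimize for each candidate guess is:

 H(z | F_i) = \sum_{k=0}^{M} p_{F_i}(k)\log_2\vert{S^k_i}\vert = \sum_{k=0}^{M}\frac{\vert{S^k_i}\vert}{N}\log_2\vert{S^k_i}\vert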

Finding the best guess is then as simple as enumerating the S^k_i and finding the x_i that produces the minimum conditional entropy. For subsequent guesses, we simply augment the definition of the candidate sets by stipulating that members x_j must also lie in the observed set from every previous guess. This is equivalent to taking set intersections, but the notation gets even messier than what we have so far, so I won't list all the details here.

All that said, this is more of an interesting theoretical observation than a practical one. Counting all of the sets by hand generally takes longer than a simpler strategy, so it is not well suited for human use (I believe it is O(N^2) operations for each guess), although a computer can do it effectively--see the sketch below. Personally, I just go through and find all the emoticons to remove duds, then pick a word with one or two overlaps with the others as my first guess, and the field narrows very quickly.
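Here is a small C sketch of the computer version; the word list is made up, and everything else follows the formulas above. Each call is O(N^2 M): for each candidate it buckets every word by similarity and evaluates the conditional entropy.

/* build: cc -O2 mmi.c -lm */
#include <math.h>
#include <stdio.h>

/* L(x_i, x_j): how many positions match in both place and value. */
static int similarity(const char *a, const char *b) {
    int k = 0;
    for (int p = 0; a[p] != '\0' && b[p] != '\0'; p++)
        if (a[p] == b[p])
            k++;
    return k;
}

/* Index of the guess minimizing H(z|F_i) = sum_k (n_k/N) log2(n_k). */
static int best_guess(const char *words[], int n, int word_len) {
    int best = 0;
    double best_h = INFINITY;
    for (int i = 0; i < n; i++) {
        int counts[64] = {0};   /* counts[k] = |S^k_i|; assumes word_len < 64 */
        for (int j = 0; j < n; j++)
            counts[similarity(words[i], words[j])]++;
        double h = 0.0;
        for (int k = 0; k <= word_len; k++)
            if (counts[k] > 1)  /* empty and singleton sets contribute 0 */
                h += (double)counts[k] / n * log2(counts[k]);
        if (h < best_h) {
            best_h = h;
            best = i;
        }
    }
    return best;
}

int main(void) {
    /* Made-up 5-letter word list; a real one comes from the terminal. */
    const char *words[] = {"WASTE", "TASTE", "THERE", "ALERT", "SHADE",
                           "CRANE", "PLUMB", "FIGHT", "GRIME", "STOCK"};
    int n = (int)(sizeof words / sizeof words[0]);
    printf("best first guess: %s\n", words[best_guess(words, n, 5)]);
    return 0;
}

Note that counting a word against itself is deliberate: L(x_i, x_i) = M, so S^M_i always contains exactly one element, matching the sanity check above.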

Beyond its appearance in a Fallout 4 mini-game, the concept of maximum mutual information (MMI) estimation has broad scientific applications. The most notable, to my mind, is in machine learning, where MMI is used to train classifiers, in particular Hidden Markov Models (HMMs) such as those used in speech recognition. Given reasonable probability distributions, MMI estimates can handle situations where ML estimates appear ambiguous, which makes them suitable for "discriminative training." Typically, an HMM training algorithm receives labeled examples of each class and learns their statistics in isolation; a discriminative trainer also considers the labeled examples of the other classes, improving classification when categories are very similar but semantically distinct.