In C functions are generally not considered first-class objects. Although pointers to functions exist and can be used to build some routines that have rudimentary higher-order behavior, there is no language syntax for dynamically creating a long-lived function at run time, partially evaluating a function, creating a closure, etc. But as we can build interpreters for higher level languages that do support this, surely there must be a way to achieve this?
Basic callbacks
A basic use-case for passing a function to another in C is a situation where we have asynchronous events that require notification. To solve this, we typically create a function called a callback to register with some kind of event-generating framework to notify us of these asynchronous events. Furthermore we likely want some kind of context to go along with the function, so that the same function definition could be used with multiple different context objects to handle different events.
If we have control of both sides of the API, then we could define a callback interface where we
register a function which takes an additional parameter representing this context information. This
is a common C idiom: in the real world it is used in the Linux kernel to register interrupt handlers
so that they can be called with a pointer to the struct device
that they need to actually handle
the interrupt (interrupt handlers have two parameters, however, not just one, because they also
include the interrupt number). An example API and usage for a one argument event system might look
something like this:
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#define NUM_EVENTS 4
typedef int (*callback)(void *ctx);
struct handler {
callback fn;
void *ctx;
};
static struct handler handlers[NUM_EVENTS] = {0};
int register_callback(uint32_t id, callback fn, void *ctx) {
if (id >= NUM_EVENTS)
return -EINVAL;
if (handlers[id].fn)
return -EBUSY;
handlers[id].fn = fn;
handlers[id].ctx = ctx;
return 0;
}
int raise_event(uint32_t id) {
if (id >= NUM_EVENTS)
return -EINVAL;
if (handlers[id].fn)
return handlers[id].fn(handlers[id].ctx);
return -ENOSYS;
}
int print_data(void *ctx) {
const char *name = ctx;
printf("got event name = %s\n", name);
return 0;
}
static const char event0[] = "event 0";
static const char event1[] = "event 1";
void test_events(void) {
register_callback(0, print_data, event0);
register_callback(1, print_data, event1);
raise_event(0);
raise_event(1);
}
which produces this output when we call test_events()
from another function, giving us our desired
behavior:
got event name = event 0
got event name = event 1
However, this approach requires that the callback API was designed to support this use case. When we're in charge we can guarantee that it does, but if we're designing a library we can't guarantee that we will support all use cases and will likely opt for a simple, one parameter version. What if the user needs two pieces of context? They could of course pack any number of contexts into a wrapper context and then manually introduce an unpacking step, but this introduces additional complexity. Now, to use our two parameter function with an API that only offers a single void pointer parameter, we will need to:
- Allocate a context wrapper and populate it with our parameters and function pointer
- Register a specific unwrapping function (not the function we actually wanted to call)
- Deallocate the context wrapper after we're done, in addition to everything else, which creates one more opportunity to leak memory
- Potentially implement a new wrapping function and struct for each two parameter function
signature we need, or embrace
void *
and lose more type safety with one generic two parameter wrapping version
Here's a basic example of how this might look.
struct ctx_wrap {
int (*fn)(const char *, int);
const char *name;
int id;
};
int unwrapping_thunk(void *ctx) {
struct ctx_wrap *wrap = ctx;
return wrap->fn(wrap->name, wrap->id);
}
struct ctx_wrap *alloc_wrap(int (*fn)(const char *, int), const char *name, int id) {
struct ctx_wrap *ret;
ret = calloc(sizeof(*ret), 1);
if (!ret)
return NULL;
ret->fn = fn;
ret->name = name;
ret->id = id;
return ret;
}
void free_wrap(struct ctx_wrap *wrap) {
free(wrap);
}
void test_events(void) {
register_callback(0, print_data, event0);
register_callback(1, print_data, event1);
raise_event(0);
raise_event(1);
}
static const char event2[] = "event 2";
int print_two_data(const char *name, int id) {
printf("got event name %s, id %d\n", name, id);
return 0;
}
void test_events2(void) {
struct ctx_wrap *wrap = alloc_wrap(print_two_data, event2, 100);
register_callback(2, unwrapping_thunk, wrap);
raise_event(2);
free_wrap(wrap);
}
When executed, test_events2
illustrates the behavior we wanted, calling our function
print_two_data
with the two arguments we've specified while using the one context parameter
register_callback
and raise_event
functions from earlier:
got event name event 2, id 100
This approach provides a relatively general solution. We can pass any number of fixed parameters to our callback function using only one context pointer, and it can easily be integrated with systems where the callback also receives dynamic, event-specific parameters.
Fixing parameters to functions, "partial evaluation"
However, what if we need to provide context when working with an API where we are not allowed parameters at all? What if we really want to create a self-contained, callable function that has no externally visible context data to manage separately? Ideally, the end result in this scenario is a function that enables us to do something like this:
int foo(int x) {
return x + 1;
}
void demo(void) {
int (*bar)(void);
bar = alloc_partial(foo, 1);
printf("%d\n", bar());
free_partial(bar);
}
and receive an ouptut of 2
.
To do this we should start by building off of the idea of a wrapper context and an entry thunk that will decode the wrapper context and then call our target function with the saved parameters. But we also need to overcome these challenges:
- We need to actually create a new function, somewhere in memory, that can be called just like any function that was compiled into the program through a function pointer
- We need to store data that is specific to each instance of that function as part of the memory region used to store the function itself so they can be deallocated together
- The thunk needs to refer to variables that are not declared on the stack, are not passed as arguments, and do not have static storage, without a separate pointer available to find them
- We will not be using any external tools (like asking clang to compile a string and then copying the binary into our memory space or loading a shared object it produces) to do this
- There's no standards-compliant way to express this in C
So, we will need to carefully design our thunk. A quick disclaimer as there is a bit of assembly in the upcoming bits: day to day, I may read armv7 or armv8 assembly, but I generally do not write much. My x86 assembly is based mostly on how I would expect it to be done on ARM, so I don't do a very good job of taking advantage of x86 addressing options (but the compiler does, as we will see later). I set out to do this at first assuming that the C compiler could not or would not help me and hand wrote an assembly-only implementation, wrapped in a naked C function for ease of integration with the rest of my program:
__attribute__((naked))
void __thunk() {
asm volatile("lea -0x17(%rip), %rax");
asm volatile("mov 8(%rax), %rdi");
asm volatile("mov 0(%rax), %rax");
asm volatile("jmp *%rax");
}
This function is declared with the naked
attribute, so that there is no additional assembly
introduced related to stack frame management, except it is interesting to note that GCC inserts an
intentional undefined instruction after the function body, presumably to catch mistakes and trigger
an exception immediately. I think this is a very nice touch and would catch a mistake, such as if I
tried to make a tail call to another function using the call
opcode rather than jumping to it,
because ud2
would be executed after that function returns here (instead of to my caller as was
intended) and trigger an exception rather than falling through into whatever is located after
__thunk
:
0000000000001600 <__thunk>:
1600: 48 8d 05 e9 ff ff ff lea -0x17(%rip),%rax # 15f0 <__libc_csu_fini>
1607: 48 8b 78 08 mov 0x8(%rax),%rdi
160b: 48 8b 00 mov (%rax),%rax
160e: ff e0 jmp *%rax
1610: 0f 0b ud2
So how does this thunk work? It uses the current instruction pointer, available in the %rip
register, to create a pointer to memory that resides immediately before the function body by adding
the constant -0x17
. We use -0x17
because this encoding of lea
is 7 bytes long, and we wish to
further subtract space for two 8 byte (= 0x10 total) values that are stored before lea
in memory.
From there we read the function argument at ptr+8
, as a uint64 sized value, into %rdi
, which is
where the one argument to a function of type int (*fn)(int)
is expected according to the calling
convention being used. Then we read the function pointer itself into %rax
and jump to %rax
to forward to the function we wanted to call originally.
To go along with this thunk, we need another function that will copy the function pointer, function
argument, and thunk body into the right places in memory. Furthermore, we need to allocate this
memory specially, as memory returned by malloc is typically not executable. To allocate executable
memory, we can use mmap
directly to request a fresh page of virtual memory with read, write, and
execute permissions set:
buf = mmap(NULL, size, PROT_EXEC | PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
Further furthermore, we will need to find a way to access the compiled version of __thunk
including its length from within the source code. Luckily the linker can help us out. If we place
the original thunk into its own section and add two external symbols (which will be filled in by the
linker), we can calculate the length of the section at run time:
__attribute__((naked,section("thunk")))
void __thunk() {
asm volatile("lea -0x17(%rip), %rax");
asm volatile("mov 8(%rax), %rdi");
asm volatile("mov 0(%rax), %rax");
asm volatile("jmp *%rax");
}
extern uint8_t __start_thunk;
extern uint8_t __stop_thunk;
The way that __start_thunk
and __stop_thunk
work are that the symbols are overlaid with the
start and end of the section, and the value of each would be a single byte from the compiled
representation of __thunk
. So to compute the size of the section (which contains only our
function, and thus the section size == the function size), we subtract the addresses of the two
symbols: size = &__stop_thunk - &start_thunk
.
We can combine all of this to create a very rough function that accomplishes our goal and creates a callable object:
typedef int (*retint)(void);
retint partial(int (*fn)(int), int x) {
uint64_t *buf;
retint result;
size_t tsize;
size_t size;
tsize = &__stop_thunk - &__start_thunk;
size = tsize + 2*sizeof(uint64_t);
buf = mmap(NULL, size, PROT_EXEC | PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (buf == -1) {
perror("mapping failed");
return NULL;
}
buf[0] = (uint64_t) fn;
buf[1] = x;
memcpy(buf+2, __thunk, tsize);
return (retint)(buf+2);
}
int main(int argc, char **argv) {
retint foo_partial;
foo_partial = partial(foo, 7);
printf("%d\n", foo_partial());
return 0;
}
This will print out 8
when it runs, just as we expected. We can now create function objects that
store another function pointer and a single argument internally and otherwise work just like regular
zero argument functions that we can call directly from anywhere in our code, forwarding the call to
the stored pointer along with the stored argument!
A little C as a treat
We could be happy here, but it is really unfortunate that the entire thunk is implemented in assembly. Things seem kind of brittle, and if I wanted to have a function with a different argument type, or a function with two arguments, I'd need to re-create it by hand and manually check that everything lines up correctly, plus I would need to handle more of the calling convention and I have no desire to figure out all of the details. Ideally, we get the C compiler to cooperate with us as much as possible so that it can take care of these things, so let's start on version 2.
The first thing we want to do is define a C struct to handle our memory layout:
struct partial_fn {
int (*fn)(int);
uint64_t a0;
char body[];
};
This will store our function pointer, one argument, and a flexible array member to represent the
body of the thunk that will be copied around. From there, we can define a new thunk that is much
less brittle, using lea
to get a pointer to this struct and then locating the function we wish to
call and its argument with the rest of the thunk written in C:
#define LEA_SIZE 0x07
__attribute__((section("t2")))
int __t2() {
struct partial_fn *ctx;
asm("lea %c1(%%rip), %0" : "=r" (ctx) : "i" (-LEA_SIZE - offsetof(struct partial_fn, body)));
return ctx->fn(ctx->a0);
}
extern uint8_t __start_t2;
extern uint8_t __stop_t2;
Since we need to use the gcc extended assembly syntax in order to reference the local variable
ctx
, we can also introduce a dynamically computed constant to replace the -0x17
from before,
which will account for changes to the layout of struct partial_fn
, but which will not adapt to the
size of any prologue instructions inserted before lea
. Unfortunately, this version doesn't work
out of the box. Since the function is not declared naked
, it will insert a prologue and epilogue
by default, resulting in disassembly like this:
00000000000016c4 <__t2>:
16c4: 55 push %rbp
16c5: 48 89 e5 mov %rsp,%rbp
16c8: 48 83 ec 20 sub $0x20,%rsp
16cc: 48 8d 05 e9 ff ff ff lea -0x17(%rip),%rax # 16bc <__thunk+0xb>
16d3: 48 89 45 e8 mov %rax,-0x18(%rbp)
16d7: 48 8b 45 e8 mov -0x18(%rbp),%rax
16db: 48 89 45 f0 mov %rax,-0x10(%rbp)
16df: 48 8b 45 e8 mov -0x18(%rbp),%rax
16e3: 48 83 c0 08 add $0x8,%rax
16e7: 48 89 45 f8 mov %rax,-0x8(%rbp)
16eb: 48 8b 45 f0 mov -0x10(%rbp),%rax
16ef: 48 8b 10 mov (%rax),%rdx
16f2: 48 8b 45 f8 mov -0x8(%rbp),%rax
16f6: 8b 00 mov (%rax),%eax
16f8: 89 c7 mov %eax,%edi
16fa: ff d2 call *%rdx
16fc: c9 leave
16fd: c3 ret
Note that I've left the offset at -0x17 and compiled it just to grab this disassembly; this version
won't work because that offset is incorrect for locating the function pointer and argument. However,
we implemented something that was logically equivalent using only four instructions, so this seems
wasteful--maybe we've failed at getting the compiler to understand what we want? Not really, we just
need to turn on optimization. If I build with -O2
however, this output is created for __t2
instead, which is exactly what we were hoping to see:
0000000000001600 <__t2>:
1600: 48 8d 05 e9 ff ff ff lea -0x17(%rip),%rax # 15f0 <__thunk+0x10>
1607: 8b 78 08 mov 0x8(%rax),%edi
160a: ff 20 jmp *(%rax)
Even better, the compiler has outdone me, because it knows about an indirect jump addressing mode
that I didn't know about at the time, so it is able to combine the two instruction sequence I used
into a single instruction for using the stored function pointer. So, based on our source code, gcc
is able to infer that we do not need a stack frame, do not need a prologue or epilogue, and can
optimize this as a tail call using jmp
rather than call
and ret
.
With the memory layout codified as a struct definition, we can also improve the definition of
partial
in order to avoid manually computing the offsets or locations of things. This gives us a
much more robust implementation:
retint partial2(int (*fn)(int), int x) {
struct partial_fn *result;
size_t tsize;
size_t size;
tsize = &__stop_t2 - &__start_t2;
size = tsize + sizeof(*result);
result = mmap(NULL, size, PROT_EXEC | PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (result == (void *)-1) {
perror("mapping failed");
return NULL;
}
result->fn = fn;
result->a0 = x;
memcpy(result->body, __t2, tsize);
return (retint) result->body;
}
Testing this version gives the expected behavior:
int main(int argc, char **argv) {
retint foo_partial;
foo_partial = partial2(foo, 3);
printf("%d\n", foo_partial());
return 0;
}
prints 4
to the terminal. Now we have an implementation that is robust against changes to the
memory layout and respects the calling convention; the only bit of magic left is the bit of inline
assembly and the assumption that said inline assembly is the first instruction in the function body.
Do it without inline assembly?
Finally, I decided to try something I had ruled out earlier. This goes a little bit outside the
defined behavior of the C language and relies on some implementation and platform details that are
known to differ between architectures, but it yields a solution that mostly appears to work on x86
(but not on armv8, for example). I decided to see where gcc would place function-level static
variables on x86 to see if they were located near the function and if they were accessed with
%rip
-relative addressing. It turns out they're placed in the data section (largely for security
reasons--overflows on buffers allocated in data would not allow code injection into the read-only
executable pages), but they are accessed with a relative load. Unfortunately, the offset between
the code and data is large and changes with the amount of code, number of variables, link order etc.
so it is too hard to predict, and we cannot adapt the partial
function to store data at that
offset relative to the new copy of the thunk function without a lot of extra work to read the offset
from the compiled thunk. But, function pointers are accessed relative to the current instruction
pointer as well. So, this snippet of code accomplishes what we want and gives us a pointer to the
partial_fn
struct allocated by the partial
function, but the behavior is not guaranteed by the
standard:
void *x = &__t2;
struct partial_fn *fn = x - sizeof(struct partial_fn);
This version works with optimization enabled or disabled and provides the shortest possible
assembly implementation in -O2 by not calculating the pointer at all; instead it uses two
instructions reads to load the argument relative to %rip
and then jump to the function pointer:
0000000000001690 <__t3>:
1690: 8b 3d f2 ff ff ff mov -0xe(%rip),%edi # 1688 <__t2+0x8>
1696: ff 25 e4 ff ff ff jmp *-0x1c(%rip) # 1680 <__t2>
This optimization was previously impossible because gcc doesn't optimize inline assembly beyond
doing some dead code elimination, so the optimizer was not aware of what lea
did and could not
fuse it with the following mov
and jmp
instructions. My understanding is this is unlikely to
change, and it is not something I would consider particularly important as a feature anyway.
This enables us to have version 3 implemented entirely in C:
extern uint8_t __start_t3;
extern uint8_t __stop_t3;
__attribute__((section("t3")))
int __t3(void) {
void *x = &__t3;
struct partial_fn *args = x - sizeof(struct partial_fn);
return args->fn(args->a0);
}
retint partial3(int (*fn)(int), int x) {
struct partial_fn *result;
size_t tsize;
size_t size;
tsize = &__stop_t3 - &__start_t3;
size = tsize + sizeof(*result);
result = mmap(NULL, size, PROT_EXEC | PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (result == (void *)-1) {
perror("mapping failed");
return NULL;
}
result->fn = fn;
result->a0 = x;
memcpy(result->body, __t3, tsize);
return (retint) result->body;
}
What happens on ARM?
This version however, is no more portable than the versions with inline assembly. If I compile this for armv8 using gcc 9.3, it disassembles into this (even with -O2 enabled):
0000000000000de4 <__t3>:
de4: a9be7bfd stp x29, x30, [sp, #-32]!
de8: 910003fd mov x29, sp
dec: 90000000 adrp x0, 0 <_init-0x748>
df0: 91379000 add x0, x0, #0xde4
df4: f9000be0 str x0, [sp, #16]
df8: f9400be0 ldr x0, [sp, #16]
dfc: d1004000 sub x0, x0, #0x10
e00: f9000fe0 str x0, [sp, #24]
e04: f9400fe0 ldr x0, [sp, #24]
e08: f9400001 ldr x1, [x0]
e0c: f9400fe0 ldr x0, [sp, #24]
e10: f9400400 ldr x0, [x0, #8]
e14: d63f0020 blr x1
e18: a8c27bfd ldp x29, x30, [sp], #32
e1c: d65f03c0 ret
This is interesting for a number of reasons. When we run a program containing this function it
crashes; I get an Illegal Instruction, but it could produce other errors as well, because it jumps
to an unexpected spot in memory instead of to the function pointer we wanted. The cause can be seen
from the two instruction sequence used to calulate x
:
dec: 90000000 adrp x0, 0 <_init-0x748>
df0: 91379000 add x0, x0, #0xde4
The adrp
instruction is used to compute a PC-relative address (good), but does so in a roundabout
way: it zeros the 12 lowest bits of the program counter and then adds the offset included in the
instruction, which in this case is zero, so it loads the address of the start of the page containing
the instruction. Then, the next instruction adds the constant 0xde4
, which is the offset of __t3
relative to the start of the program in the executable, which was assumed to be loaded at a page
boundary. Thus in order to compute the correct address, the function must be located at 0xde4
relative to the page it is part of, an offset that will change with the size of code in the program.
The other interesting thing about this assembly is that gcc is doing some very confusing things. It
uses adrp
and add
to compute in two steps what could have been done with one, it writes to the
stack and then immediately reads back the pointer (instructions 0xdf4 and 0xdf8), and then subtracts
16 bytes from x0 after this without using the earlier address for anything. The load/store could be
eliminated and the three remaining pointer-calculation instructions could be combined into a single
instruction. It repeats the store-load pair at instructions 0xe00 and 0xe04, and reloads x0 with the
value it already contains at 0xe0c. I believe the same code should have been generated like this,
but I haven't double-checked the calling convention. This is also the armv8 translation of the x86
instruction sequence I started with earlier.
adrp x0, #0xdd4
ldr x1, [x0]
ldr x0, [x0, #8]
br x1
To actually achieve what we wanted, adrp
shouldn't be used at all; instead the PC-relative ldr
with a literal argument should be used to directly load the the values stored before our thunk in
memory. This is normally used for loading 32 or 64 bit constants that are stored directly in the
.text section, but it can be used here. I haven't investigated how to get gcc to emit this
instruction most easily, but the ideal assembly would look something like this instead:
ldr x1, #-16
ldr x0, #-16
br x1
My syntax isn't entirely correct here, as this form normally needs to be written with a named
literal as the second argument, which causes the assembler to encode it in the ldr (literal)
way,
but accounting for that, this three instruction sequence would accomplish the same behavior as the
x86 version. I believe it should be reasonable for gcc to produce this same three instruction
sequence for version 3 above because it is a simple interpretation of the source code, so I am
surprised that it does what I wanted on x86 but not on armv8.
Conclusions
This is a pretty brittle way of dynamically creating function objects with saved parameters, but it has been fun to explore. It works well on x86 and can even be represented in C without inline assembly, but it relies on under-specified parts of the language specification to obtain the address of the function being executed using instruction-relative addressing. As we saw by looking at the ARM disassembly, other architectures may support instruction-relative addressing differently and this technique may not work as expected. However, fundamentally, it is possible to do using hand-written assembly on both platforms. Maybe a future version of the C language specification will include syntax and semantics for expressing this using well defined behavior only or directly include facilities to create functions dynamically, but of course this will open the door to more complicated dynamic functions such as closures.
It's also been fun to compare the x86 and armv8 products and see what gcc is doing. I am surprised by the issues in the output for armv8 on 9.3, but perhaps that has been improved by now (9.3 is almost two years old). If not maybe I can file a bug report or try to fix it myself.
Do I plan to actually use this technique? I'm not sure yet, but I am tempted because it may simplify some implementation in a project I am working on.