Sunday, July 12, 2015

mov-only code generation

I stumbled on the M/o/Vfuscator compiler that only use mov instructions in the generated code1 using ideas from the "mov is Turing-complete" paper.

It proves to be surprisingly easy to generate such code. This blog post shows how it is done using a somewhat idealized version of the CPU, but it should be easy to map it on the real X86 ISA. The examples use registers r0, r1, …, and a mov instruction that can be in one of the three forms
mov rdest, rsrc
mov rdest, [rsrc + roffset]      ; Load indexed
mov [rdest + roffset], rsrc      ; Store indexed
It is possible to use an immediate constant instead of rsrc, or for the rdest and roffset in the addresses. The width of the move can be 8 bits (movb) or 32 bits (movl). We'll only use 8-bit values for the computations — wider widths can be handled by working on one byte at a time. Pointers are assumed to be 32 bits wide.


We'll begin with if-statements. Comparing two registers r1 == r2 and placing the result in r0 can be done as
movb [&scratch + r1], 0
movb [&scratch + r2], 1
movb r0, [&scratch + r1]
where scratch is an array
uint8_t scratch[256];
We cannot have any flow control when only using mov instructions, so the idea is to handle if-statements by executing both the true and false parts, and ignoring the result from the false part. E.g. code of the form
if (r1 == r2)
    r0 = r3;
    r0 = r4;
is compiled as if written
r0 = (r1 == r2) ? r3 : r4;
which can be done as
movb [&scratch + r1], 0
movb [&scratch + r2], 1
movb r0, [&scratch + r1]
movb [&scratch + 1], r3
movb [&scratch + 0], r4
movb r0, [&scratch + r0]
This works fine as long as everything is in registers, or both the true and false parts write to the same memory. The case where they write different memory need to be handled by writing to a dummy location for the cases where no value should be stored. For example,
if (r1 == r2)
    a = r3;
is compiled to
movb [&scratch + r1], 0
movb [&scratch + r2], 4
movb r0, [&scratch + r1]
movl [&tmp + 4], &a
movl [&tmp + 0], &dummy
movl r0, [&tmp + r0]
movb [r0 + 0], r3
where tmp and dummy are defined as
uint32_t tmp[2];
uint8_t dummy;


Addition by a constant can be implemented by leveraging the addition done within the addressing mode by indexing in a constant array
const uint8_t constants[512] = {
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
    0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
    0xf8, 0xf9, 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff,
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
    0xf8, 0xf9, 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff
so e.g. adding 23 to r0 is done as
movb r0, [&(constants+23) + r0]

Loops etc.

We now have nearly everything needed to implement a Turing machine where the tape is implemented as an array that is indexed by the current position. The only problem is that we only can do straight line code... So we cheat by adding a jmp instruction at the end of our program, jumping back to the first instruction. This works fine for implementing a Turing machine, as it is typically written as
while (1) {
    if (current_state == 0) {
        /* Do the work for this state */
        current_state = /* State transition */
    } else if (current_state == 1) {
        /* Do the work for this state */
        current_state = /* State transition */
    } else if ...
But the same approach works for normal languages too, by treating each basic block as a state, and to do a state transition for each branch.

Program termination

Finally, we need a way to terminate the program. This is done by writing to address 0
movb [0 + 0], 0
which terminates the process with a segmentation fault. Conditional termination is done with the same dummy-variable trick as for the conditional store above.

1. This is not completely true – it need exactly one jmp instruction in addition to the mov instructions.