Sunday, January 24, 2016

More Turing-completeness in surprising places

My twitter feed was full of “printf is Turing-complete” tweets after the 32c3 talk "New memory corruption attacks: why can't we have nice things?" (see printbf for the paper and the implementation discussed in the video), and that inspired me to read some more articles about Turing-complete1 mechanisms hiding in surprising places...

C++ exception stack unwinding is Turing-complete

The DWARF debugging information format is complex, which is needed in order to describe optimized code. For example, an expression such as i+1 may be placed in a register reg, and i is not available anywhere in the registers or memory in the optimized program. But the debugger should be able to display the value of i, so the debug information must be able to express that i has the value reg-1. And optimized code can be transformed in much more complex ways than this, so DWARF does (surprisingly) define a stack machine that can do arbitrary computation in order to be able to represent the information.

Exception handling need to unwind the stack and restore registers, which is functionality that can be represented by the debug information, so DWARF was adopted (with some minor modifications) for the exception handling. The stack unwinding code can thus execute programs "hidden" in the DWARF information while unwinding the stack.

The Turing-completeness of this is not strange in the same way as the others I have written about previously; DWARF is designed to be general! But it is probably surprising to most developers that this functionality exists... The paper "Exploiting the hard-working DWARF: Trojan and Exploit Techniques With No Native Executable Code" looks at what can be done with this. The stack machine is rather easy to work with — here is an example from the paper that calculates the length of a string that is located just below the stack frame:
  # Value at −0x8(%rbp) on stack
  DW_OP_breg6 −8
  DW_OP_lit0  # initial strlen
  DW_OP_deref_size 1
  # Branch if top of stack nonzero
  # Increment the counted length
  # Add length to char pointer
  DW_OP_skip LOOP
  # Finally put the character count on the top of the stack
  # as return value

ELF metadata is Turing-complete

The Runtime Linker/Loader (RTLD) is responsible for loading shared libraries into a process' address space and updating the symbol references in the object code using the relocation and symbol tables. The RTLD can be viewed as a machine that is executing the relocation tables, and the paper "“Weird Machines” in ELF: A Spotlight on the Underappreciated Metadata" shows that this machine is Turing-complete by constructing an assembly-like  language  composed  of  three  instructions
  • mov — move
  • add — addition
  • jnz — jump if not zero
where each instruction is built from a sequence of relocations.

There are two kinds of structures involved in the relocation:
  • The symbol tables contain information about the symbols, where each symbol is described by the ELF64_Sym structure
    typedef struct
      Elf64_Word    st_name;      /* Symbol name (string table index) */
      unsigned char st_info;      /* Symbol type and binding */
      unsigned char st_other;     /* Unused */
      Elf64_Section st_shndx;     /* Section index */
      Elf64_Addr    st_value;     /* Symbol value */
      Elf64_Xword   st_size;      /* Symbol size */
    } Elf64_Sym;
  • The relocation tables specify how to modify the binary, where each relocation is described by the Elf64_Rela structure
    typedef struct
      Elf64_Addr    r_offset;     /* Address to patch */
      Elf64_Xword   r_info;       /* Relocation type and symbol index */
      Elf64_Sxword  r_addend;     /* Addend */
    } Elf64_Rela;
The construct of the machine is using three kinds of relocations, R_X86_64_COPY, R_X86_64_64, and R_X86_64_RELATIVE, which have the functionality
R_X86_64_COPY         memcpy(base + r.r_offset, s.st_value, s.st_size)
R_X86_64_64           *(base + r.r_offset) = s.st_value + r.addend + base
R_X86_64_RELATIVE     *(base + r.r_offset) = r.r_addend + base
where r is the relocation, s is the symbol referenced by the relocation, and base is the ELF object’s base address (that is typically 0 for an executable). These relocations directly correspond to move and addition instructions of the form
mov *0xbeef0000, $0x02        # *(uint64_t*)0xbeef0000 = 2;
mov *0xbeef0000, [%foo]       # *(uint64_t*)0xbeef0000 = *(uint64_t*)foo;
add *0xbeef0000, %foo, $0x02  # *(uint64_t*)0xbeef0000 = foo + 2;
For example, an addition instruction
add *0xbeef0000, %foo, $0x02
is encoded as a relocation entry with r_offset=0xbeef0000, r_addend=2, and r_info specifying relocation type R_X86_64_64 and the symbol foo.

Branch instructions are implemented by modifying the data RTLD is using while processing the relocations. The RTLD is maintaining a doubly linked list of link_map structures containing information for each ELF object, and it iterates over this list in reverse order when relocating the binary
while (lm != NULL) {
  r = lm->dyn[DT_RELA];
  end = r + lm->dyn[DT_RELASZ];
  for (r ; r < end; r++)
    relocate(lm, r, &dyn[DT_SYM]);
  lm = lm->prev;
A branch instruction can be implemented by relocations modifying lm->prev, lm->dyn[DT_RELASZ], or lm->dyn[DT_RELA] so that an lm structure is skipped, handled partially, or handled multiple times. There are some complications such as finding the address to modify, and to save/restore the original value to be able to exit the program. I refer to the paper for the details.

The only thing remaining is a conditional branch instruction. The paper constructs the jnz instruction by noting that symbols of the type STT_GNU_IFUNC are treated differently depending on if the section index is 0 or not, and the symbol can be initialized so that an add instruction of the form
add *0xbeef0000, %foo, $0x00
has the effect
*(uint64_t*)0xbeef0000 = (s.st_shindx != 0) ? CONSTANT : 0;
where s is the Elf64_Sym structure describing foo. This can now be used when rewriting the end marker lm->dyn[DT_RELASZ] so that RTLD does not enter the for-loop for the link map when the condition is 0 (and the machine is thus not jumping to the part of the program defined by those relocations), by relocations first writing the condition to test to s.st_shindx of a temporary symbol s, and then writing the symbol (by adding 0 to it) to lm->dyn[DT_RELASZ].

1. I'm playing fast and loose with the concept “Turing-complete”, as does most papers I have read. I had expected computer scientists to be more careful, and use terms like "linear bounded automaton" or similar...