Sunday, April 10, 2016

Why volatile is hard to specify and implement

Volatile in C

The main use case for volatile in C is memory mapped hardware where reads and writes may change the hardware state (even reads may change the state — it is common that e.g. a read of an interrupt register resets the value). This means that the compiler must not eliminate, introduce, or reorder memory accesses to such memory.

Real-world hardware has restrictions that make it hard to specify access rules in a way that works on all hardware. A few examples:
  • The original DEC Alpha processor could only do 32-bit load/store operations, so access to an 8-bit variable must be done by reading 32 bits, modifying the value, and writing back the result. This means that writing one element in
    volatile char a[4];
    
    will read and write all the elements.
  • Unaligned load/store operations are split into two operations on the bus.
  • The hardware may have restrictions on how read-modify-write instructions are used for memory mapped hardware, so e.g. incrementing a volatile variable as a++ may need to be done as a load, an addition, and a store, even if the CPU has an instruction that increments a value in memory.
The standard solves these problems by saying that accesses to volatile objects need to be done as in the source code, but it is up to the implementation to define what constitutes an "access". The text of this is in C11 6.7.3 "Type qualifiers":
An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine, as described in 5.1.2.3. Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine, except as modified by the unknown factors mentioned previously. What constitutes an access to an object that has volatile-qualified type is implementation-defined.
This means that an implementation can do essentially anything for volatile as long as it can cover this by a definition of "access", but even a reasonable definition may produce surprises. For example, the text above talks about "at every sequence point". This means that the compiler is allowed to optimize
volatile int *p = somevalue;
int s = *p + *p;
as if it said
volatile int *p = somevalue;
int tmp = *p;
int s = tmp + tmp;
as both reads of *p are between sequence points. A similar case is when the value of an assignment expression is used, such as
volatile int a, b;
a = b = 0;
The standard does not say whether b is read, i.e. the compiler is allowed to compile this as
b = 0; a = 0;
or
b = 0; a = b;

The examples above are probably not that common, but even trivial cases may be surprising. Consider the load *p in
volatile int *p = somevalue;
*p;
C++ compilers must generate a load for this case (because of the C++ rules regarding lvalue-to-rvalue conversion). Many developers argue that C compilers must generate a load too, but I'm not completely convinced. And, for example, the GCC 5.3.0 manual describes this as if the compiler has a choice:
According to the C standard, such an expression is an rvalue whose type is the unqualified version of its original type, i.e. int. Whether GCC interprets this as a read of the volatile object being pointed to or only as a request to evaluate the expression for its side-effects depends on this type.
If it is a scalar type, or on most targets an aggregate type whose only member object is of a scalar type, or a union type whose member objects are of scalar types, the expression is interpreted by GCC as a read of the volatile object; in the other cases, the expression is only evaluated for its side-effects.
Anyway, many C compilers do not generate the load, so you should not rely on this construct, regardless of which interpretation of the standard is right.

Using volatile to limit optimization

The C standard specifies that volatile accesses will not be reordered or removed, but CPUs have things like write buffers that reorder and remove writes. That functionality is usually disabled for the memory regions that contain memory mapped hardware, but it is enabled for the stack. This always confuses me when I see local variables declared volatile (when not needed by longjmp) — why is it OK for the CPU to reorder memory operations when the compiler must not do it? What did the developer expect to achieve by adding volatile?

The idea is usually to use volatile to disable compiler optimizations. A typical example is cryptographic code that wants to clear sensitive data from memory. The "obvious" way of clearing memory does not work
void getPassword(void) {
    char pwd[64];
    if (GetPassword(pwd, sizeof(pwd))) {
        /* Checking of password, secure operations, etc. */
    }
    memset(pwd, 0, sizeof(pwd));
}
as the compiler may optimize away the memset, and the usual suggestion is to change the memset to something like
volatile char *p = pwd;
for (size_t i = 0; i < sizeof(pwd); ++i)
    p[i] = 0;
This may generate store instructions, but it is not guaranteed to clear the memory! Consider for example a hypothetical architecture that invalidates the cache for the stack frame when popping the stack, instead of wasting power and memory bandwidth by writing the now useless data to the memory — the cache will be invalidated before the newly written zeros have reached main memory.

One other problem is that the compiler may have placed sensitive data in "anonymous" temporary space allocated on the stack (e.g. by spilling), in registers etc., as is discussed in "Zeroing buffers is insufficient":
I am absolutely certain that there is software out there which inadvertantly keeps an AES key sitting in an XMM register long after it has been wiped from memory.
It may still be prudent to try to clear the memory (and it does not hurt), but be aware that it is not guaranteed to help either...

Volatile and real world compilers

Compilers tend to have a lot of code handling special cases. There are several reasons for this, such as
  • Different operations can be optimized in different ways
  • General optimizations are often expensive, so it makes sense to handle cheap cases separately in order to reduce the load on the general optimization pass, or to be able to do the cheaper cases more often
  • The compiler may need different optimization strategies depending on surrounding code (e.g. depending on register pressure)
  • Different target CPUs need different optimization strategies
It is easy to forget to check for "volatile" in some obscure corner case, and it is hard to test in a way that finds all such problems, so it is not that uncommon to find cases where volatile accesses are reordered/removed.

One example of this is discussed in OpenSSL pull request #102:
#include <stddef.h>

static void memset_s(void * v, size_t n) {
    volatile unsigned char * p = (volatile unsigned char *)v;
    for (size_t i = 0; i < n; ++i) {
        p[i] = 0;
    }
}

void f1() {
    unsigned char x[4];
    memset_s(x, sizeof x);
}

void f2() {
    unsigned char x[7];
    memset_s(x, sizeof x);
}
Current GCC optimizes away the memset_s in f1, but keeps it in f2.

Both functions are optimized the same by the high-level optimizations — the memset_s is inlined, and the loop unrolled:
f1 ()
{
  unsigned char x[4];

  <bb 2>:
  MEM[(volatile unsigned char *)&x] ={v} 0;
  MEM[(volatile unsigned char *)&x + 1B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 2B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 3B] ={v} 0;
  x ={v} {CLOBBER};
  return;
}

f2 ()
{
  unsigned char x[7];

  <bb 2>:
  MEM[(volatile unsigned char *)&x] ={v} 0;
  MEM[(volatile unsigned char *)&x + 1B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 2B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 3B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 4B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 5B] ={v} 0;
  MEM[(volatile unsigned char *)&x + 6B] ={v} 0;
  x ={v} {CLOBBER};
  return;
}
The difference is introduced when the GIMPLE IR above is lowered to the representation used for instruction selection: the compiler sees that x in f1 is an int-sized non-volatile array on the stack where all accesses use known indices, and this is a special case where it is profitable to place the array in a register (which is then eliminated, as its value is never used). The problem is that this is not a valid transformation unless all accesses are non-volatile, and the code forgot to check for this...

Sunday, March 20, 2016

Pointer casts in C

The C standard has more restrictions on pointers than what was discussed in the previous blog post. This post covers pointer casts.

Pointer casts and alignment

Casting a pointer invokes undefined behavior if the resulting pointer is not correctly aligned. The standard is written in this way in order to support architectures that use different formats for different kinds of pointers, and such architectures do exist — see for example this mail to the GCC development list about a mainframe architecture that was recently commercially supported with a GCC 4.3 port.

The compiler may use this to optimize memory accesses for processors with strict alignment requirements (such as old ARM processors). Consider for example
void foo(void *p)
{
    memset(p, 0, 4);
}
that clears 32 bits of data. This could be generated as a 32-bit store, but the alignment of p is unknown, so the compiler must generate this as four byte operations if it wants to inline the memset
mov     r3, #0
strb    r3, [r0, #0]
strb    r3, [r0, #1]
strb    r3, [r0, #2]
strb    r3, [r0, #3]
Consider now
void bar(int *);

void foo(void *p)
{
    memset(p, 0, 4);
    bar(p);
}
The call to bar will convert p to an int* pointer, and this invokes undefined behavior if p is not 32-bit aligned. So the compiler may assume p is aligned, and the memset can now be generated as a 32-bit store
mov     r3, #0
str     r3, [r0, #0]
This example illustrates two things that often cause confusion with regard to undefined behavior
  • The effect of undefined behavior may go back in time — the undefined behavior is invoked when calling bar, but the compiler may use this to optimize code executed before bar in a way that would make it misbehave if p was not correctly aligned.
  • The compiler just uses the undefined behavior of the misaligned conversion to determine that p must be aligned, which then affects each use of p in the same way as if the alignment had been known by some other means — i.e. the compiler developers do not go out of their way to implement evil algorithms doing obscure transformations back in time based on undefined behavior.

Function pointers

It is not allowed to cast between a function pointer and a pointer to object type (i.e. a "normal" pointer). The reason is that they may be very different on a hardware level, and it may be impossible to represent data pointers as function pointers, or vice versa. One trivial example is when function pointers and data pointers have different width, such as the MS-DOS "medium" memory model that uses 32-bit pointers for code but only 16-bit pointers for data.

Casting between integers and pointers

Casting between integers and pointers is implementation-defined, so the compiler may choose to handle this in any way it wants (although the standard has a footnote saying that the intention is that the mapping between pointers and integers should "be consistent with the addressing structure of the execution environment").

The only thing that is guaranteed to work on any implementation is casting between the integer value 0 and pointers:
An integer constant expression with the value 0, or such an expression cast to type void *, is called a null pointer constant. If a null pointer constant is converted to a pointer type, the resulting pointer, called a null pointer, is guaranteed to compare unequal to a pointer to any object or function.
I have read far too many confused discussions about whether the value of NULL is guaranteed to be 0. In some sense it is, as (void*)0 == NULL evaluates to true, but the value of (void*)NULL does not need to be 0 when stored in memory — the implementation may for example choose to implement the cast operator as flipping the most significant bit,1 which means that the code
union {
    uintptr_t u;
    void *p;
} u;
u.p = 0;
printf("0x%" PRIxPTR "\n", u.u);
prints 0x80000000 on a 32-bit machine.

One other thing to note is that NULL is defined as being a "null pointer constant", which does not need to be of a pointer type per the quoted text above! This may cause problems when passing NULL to a variadic function that reads the argument with va_arg.


1. This is not a completely stupid example. Consider a small embedded processor that has some hardware mapped at address 0. The platform does not have much memory, so 0x80000000 is guaranteed not to be a valid address, and the implementation use this for NULL. There are in general better ways of handling this, but I have seen this done for real hardware...

Sunday, March 6, 2016

C pointers are not hardware pointers

Pointers in the C language are more abstract than pointers in the hardware, and the compiler may surprise developers that think that pointers in C work in the same way as pointers in the CPU.

A C pointer points into what the standard calls an object, points to the byte following the object, or has the value NULL. The concept of "object" here is very different from the C++ object — an object in the C standard is a range of bytes allocated as a unit,1 so x and y are objects in
int x;
struct foo y[10];
and the memory returned by a call to malloc is an object.

Dangling pointers are dangerous, and may invoke undefined behavior even when they are not dereferenced. The reason is that the value of a pointer becomes indeterminate when the object it points to (or just past) reaches the end of its lifetime. That is, a dangling pointer is treated the same way as an uninitialized pointer, and all use of uninitialized values invokes undefined behavior.
void foo(int *p, int *q)
{
    free(p);
    if (p == q)  // Undefined behavior!
        bar();
}
The pointer p has indeterminate value after free(p), so the comparison invokes undefined behavior.

Comparing pointers using the relational operators (<, >, <=, and >=) requires them to point into the same object (or the byte following the object). That is
int *p = malloc(64 * sizeof(int));
int *q = p + i;
if (p < q)
   foo();
is fine (provided that 0 ≤ i ≤ 64), but
int *p = malloc(64 * sizeof(int));
int *q = malloc(64 * sizeof(int));
if (p < q)  // Undefined behavior!
   foo();
invokes undefined behavior. Similarly, subtraction of pointers is only allowed for array objects, and both pointers must point into the same array object (or the byte following the object).

Arithmetic on a pointer cannot make it point outside the object (other than to the byte following the object). In particular, arithmetic on a pointer cannot make it point into another object. This is useful for compilers, as they can use this to track memory accesses and trivially determine that reads and writes through pointers derived from different objects do not conflict. Actually, it would be very hard for the compiler to place variables in registers if writes through pointers could modify arbitrary objects!

There is however one special case where a pointer can point to the address of another object — two objects may be placed next to each other in memory, which is typical for cases such as
int x, y;
and p and q can now have the same value after
int *p = &x + 1;
int *q = &y;
But the pointer p does not really point to y — it points to the address following x. Dereferencing p invokes undefined behavior, so you cannot really use the fact that p and q have the same value.

GCC has a somewhat aggressive interpretation of the standard, so it compiles p == q to false if it determines that p and q are derived from different objects (see GCC bug 61502 for details). This has the fun effect that it is possible to get pointers p and q that point at the same memory address, but for which p == q evaluates to false, as in
#include <stdio.h>

int main(void)
{
    int x, y;
    int *p = &x + 1;
    int *q = &y;
    printf("%p %p %d\n", (void*)p, (void*)q, p == q);
    return 0;
}
that prints
0x7f7fffffdafc 0x7f7fffffdafc 0
on my development machine when compiled with a recent GCC.


1. The C standard's definition of object is a bit more involved, as it also involves the type of the data within the range of bytes, but this does not affect the discussion in this blog post. I'll come back to the type related part in a future blog post on aliasing.

This blog post was updated 2016-05-19:
  • Added casts in last example
  • Changed 256 to 64*sizeof(int) in examples

Sunday, February 21, 2016

How undefined signed overflow enables optimizations in GCC

Signed integers are not allowed to overflow in C and C++, and this helps compilers generate better code. I was interested in how GCC is taking advantage of this, and here are my findings.1

Signed integer expression simplification

The nice property of overflow being undefined is that signed integer operations work as in normal mathematics — you can cancel out values so that (x*10)/5 simplifies to x*2, or (x+1)<(y+3) simplifies to x<(y+2). Increasing a value always makes it larger, so x<(x+1) is always true.

GCC iterates over the IR (the compiler's Internal Representation of the program), and does the following transformations (x and y are signed integers, c, c1, and c2 are positive constants, and cmp is a comparison operator. I have only listed the transformations for positive constants, but GCC handles negative constants too in the obvious way)
  • Eliminate multiplication in comparison with 0
    (x * c) cmp 0   ->   x cmp 0 
    
  • Eliminate division after multiplication
    (x * c1) / c2   ->   x * (c1 / c2) if c1 is divisible by c2
    
  • Eliminate negation
    (-x) / (-y)     ->   x / y
    
  • Simplify comparisons that are always true or false
    x + c < x       ->   false
    x + c <= x      ->   false
    x + c > x       ->   true
    x + c >= x      ->   true
    
  • Eliminate negation in comparisons
    (-x) cmp (-y)   ->   y cmp x
    
  • Reduce magnitude of constants
    x + c > y       ->   x + (c - 1) >= y
    x + c <= y      ->   x + (c - 1) < y
    
  • Eliminate constants in comparisons
    (x + c1) cmp c2         ->   x cmp (c2 - c1)
    (x + c1) cmp (y + c2)   ->   x cmp (y + (c2 - c1)) if c1 <= c2
    
    The second transformation is only valid if c1 <= c2, as it would otherwise introduce an overflow when y has the value INT_MIN.

Pointer arithmetic and type promotion

If an operation does not overflow, then we will get the same result if we do the operation in a wider type. This is often useful when doing things like array indexing on 64-bit architectures — the index calculations are typically done using 32-bit int, but the pointers are 64-bit, and the compiler may generate more efficient code when signed overflow is undefined by promoting the 32-bit integers to 64-bit operations instead of generating type extensions.

One other aspect of this is that undefined overflow ensures that a[i] and a[i+1] are adjacent. This improves analysis of memory accesses for vectorization etc.

Value range calculations

The compiler keeps track of the variables' range of possible values at each point in the program, i.e. for code such as
int x = foo();
if (x > 0) {
  int y = x + 5;
  int z = y / 4;
it determines that x has the range [1, INT_MAX] inside the if statement, and can thus determine that y has the range [6, INT_MAX] as overflow is not allowed. And the next line can be optimized to
  int z = y >> 2;
as the compiler knows that y is non-negative.

The undefined overflow helps optimizations that need to compare two values (as the wrapping case would give possible values of the form [INT_MIN, (INT_MIN+4)] or [6, INT_MAX] that prevents all useful comparisons with < or >), such as
  • Changing comparisons x<y to true or false if the ranges for x and y does not overlap
  • Changing min(x,y) or max(x,y) to x or y if the ranges do not overlap
  • Changing abs(x) to x or -x if the range does not cross 0
  • Changing x/c to x>>log2(c) if x>0 and the constant c is a power of 2
  • Changing x%c to x&(c-1) if x>0 and the constant c is a power of 2

Loop analysis and optimization

The canonical example of why undefined signed overflow helps loop optimizations is that loops like
for (int i = 0; i <= m; i++)
are guaranteed to terminate when signed overflow is undefined. This helps architectures that have specific loop instructions, as these in general do not handle infinite loops.

But undefined signed overflow helps many more loop optimizations. All analyses, such as determining the number of iterations, transforming induction variables, and keeping track of memory accesses, use everything in the previous sections in order to do their work. In particular, the set of loops that can be vectorized is severely reduced when signed overflow is allowed.


1. Using GCC trunk r233186,  "gcc version 6.0.0 20160205 (experimental)"

This blog post was updated 2016-02-23:
  • Corrected first sentence to say "overflow" instead of "wrap".
  • Removed one incorrect transformation from "Eliminate Negation" where I had misread the GCC source code.
  • Corrected the ranges in the "Value range calculations" example.

Sunday, January 24, 2016

More Turing-completeness in surprising places

My twitter feed was full of “printf is Turing-complete” tweets after the 32c3 talk "New memory corruption attacks: why can't we have nice things?" (see printbf for the paper and the implementation discussed in the video), and that inspired me to read some more articles about Turing-complete1 mechanisms hiding in surprising places...

C++ exception stack unwinding is Turing-complete

The DWARF debugging information format is complex, which is needed in order to describe optimized code. For example, an expression such as i+1 may be placed in a register reg, and i is not available anywhere in the registers or memory in the optimized program. But the debugger should be able to display the value of i, so the debug information must be able to express that i has the value reg-1. And optimized code can be transformed in much more complex ways than this, so DWARF does (surprisingly) define a stack machine that can do arbitrary computation in order to be able to represent the information.

Exception handling needs to unwind the stack and restore registers, which is functionality that can be represented by the debug information, so DWARF was adopted (with some minor modifications) for the exception handling. The stack unwinding code can thus execute programs "hidden" in the DWARF information while unwinding the stack.

The Turing-completeness of this is not strange in the same way as the others I have written about previously; DWARF is designed to be general! But it is probably surprising to most developers that this functionality exists... The paper "Exploiting the hard-working DWARF: Trojan and Exploit Techniques With No Native Executable Code" looks at what can be done with this. The stack machine is rather easy to work with — here is an example from the paper that calculates the length of a string that is located just below the stack frame:
  # Value at −0x8(%rbp) on stack
  DW_OP_breg6 −8
  DW_OP_lit0  # initial strlen
  DW_OP_swap
  DW_OP_dup
LOOP:
  DW_OP_deref_size 1
  # Branch if top of stack nonzero
  DW_OP_bra NOT_DONE
  DW_OP_skip FINISH
NOT_DONE:
  # Increment the counted length
  DW_OP_swap
  DW_OP_lit1
  DW_OP_plus
  DW_OP_swap
  # Add length to char pointer
  DW_OP_plus
  DW_OP_skip LOOP
FINISH:
  # Finally put the character count on the top of the stack
  # as return value
  DW_OP_swap

ELF metadata is Turing-complete

The Runtime Linker/Loader (RTLD) is responsible for loading shared libraries into a process' address space and updating the symbol references in the object code using the relocation and symbol tables. The RTLD can be viewed as a machine that executes the relocation tables, and the paper "“Weird Machines” in ELF: A Spotlight on the Underappreciated Metadata" shows that this machine is Turing-complete by constructing an assembly-like language composed of three instructions
  • mov — move
  • add — addition
  • jnz — jump if not zero
where each instruction is built from a sequence of relocations.

There are two kinds of structures involved in the relocation:
  • The symbol tables contain information about the symbols, where each symbol is described by the ELF64_Sym structure
    typedef struct
    {
      Elf64_Word    st_name;      /* Symbol name (string table index) */
      unsigned char st_info;      /* Symbol type and binding */
      unsigned char st_other;     /* Unused */
      Elf64_Section st_shndx;     /* Section index */
      Elf64_Addr    st_value;     /* Symbol value */
      Elf64_Xword   st_size;      /* Symbol size */
    } Elf64_Sym;
    
  • The relocation tables specify how to modify the binary, where each relocation is described by the Elf64_Rela structure
    typedef struct
    {
      Elf64_Addr    r_offset;     /* Address to patch */
      Elf64_Xword   r_info;       /* Relocation type and symbol index */
      Elf64_Sxword  r_addend;     /* Addend */
    } Elf64_Rela;
    
The machine is constructed using three kinds of relocations, R_X86_64_COPY, R_X86_64_64, and R_X86_64_RELATIVE, which have the functionality
R_X86_64_COPY         memcpy(base + r.r_offset, s.st_value, s.st_size)
R_X86_64_64           *(base + r.r_offset) = s.st_value + r.r_addend + base
R_X86_64_RELATIVE     *(base + r.r_offset) = r.r_addend + base
where r is the relocation, s is the symbol referenced by the relocation, and base is the ELF object’s base address (that is typically 0 for an executable). These relocations directly correspond to move and addition instructions of the form
mov *0xbeef0000, $0x02        # *(uint64_t*)0xbeef0000 = 2;
mov *0xbeef0000, [%foo]       # *(uint64_t*)0xbeef0000 = *(uint64_t*)foo;
add *0xbeef0000, %foo, $0x02  # *(uint64_t*)0xbeef0000 = foo + 2;
For example, an addition instruction
add *0xbeef0000, %foo, $0x02
is encoded as a relocation entry with r_offset=0xbeef0000, r_addend=2, and r_info specifying relocation type R_X86_64_64 and the symbol foo.

Branch instructions are implemented by modifying the data RTLD is using while processing the relocations. RTLD maintains a doubly linked list of link_map structures containing information for each ELF object, and it iterates over this list in reverse order when relocating the binary
while (lm != NULL) {
  r = lm->dyn[DT_RELA];
  end = r + lm->dyn[DT_RELASZ];
  for (; r < end; r++)
    relocate(lm, r, &lm->dyn[DT_SYM]);
  lm = lm->prev;
}
A branch instruction can be implemented by relocations modifying lm->prev, lm->dyn[DT_RELASZ], or lm->dyn[DT_RELA] so that an lm structure is skipped, handled partially, or handled multiple times. There are some complications, such as finding the address to modify, and saving/restoring the original value in order to be able to exit the program. I refer to the paper for the details.

The only thing remaining is a conditional branch instruction. The paper constructs the jnz instruction by noting that symbols of the type STT_GNU_IFUNC are treated differently depending on whether the section index is 0 or not, and the symbol can be initialized so that an add instruction of the form
add *0xbeef0000, %foo, $0x00
has the effect
*(uint64_t*)0xbeef0000 = (s.st_shndx != 0) ? CONSTANT : 0;
where s is the Elf64_Sym structure describing foo. This can now be used when rewriting the end marker lm->dyn[DT_RELASZ] so that RTLD does not enter the for-loop for the link map when the condition is 0 (and the machine is thus not jumping to the part of the program defined by those relocations), by relocations first writing the condition to test to s.st_shndx of a temporary symbol s, and then writing the symbol (by adding 0 to it) to lm->dyn[DT_RELASZ].


1. I'm playing fast and loose with the concept “Turing-complete”, as do most of the papers I have read. I had expected computer scientists to be more careful, and use terms like "linear bounded automaton" or similar...

Sunday, October 4, 2015

spirv-tools — status, problems, and plans

I have done some work on spirv-tools since the posts about the human friendly SPIR-V representation and Python API, so it makes sense to do a status update now.

Most of the development has been to actually make the code work. And I think it does now. For example, one of my tests is to disassemble all the shaders from the Mesa shader-db (containing shaders from some old games), assemble the result, and verify that the binaries are identical.

API

IR improvements

The major user-visible change in the API is that the IR has been modified so that
  • An ID is represented by an ID object. The ir.Id class contains a reference to the instruction defining the ID, so there is no need to use module.id_to_inst[id] each time you want to get the instruction (which you usually want each time you have an ID). The instruction is now accessed as id.inst.
  • A literal number is represented as an integer.
  • A literal string is represented as a string.
  • An enumerated value is represented as a string of the enumeration name. I chose this representation in order to make it easy to see what the code does. For example
    if inst.operands[0] == 'FPFastMathMode':
    
    checks if a decoration is specifying the floating-point fast math flags.
  • A mask is represented as a list of strings of the enumeration names, and the empty list is used when no value is set in the mask. Checking if a bit is set is done as
    if 'NotNaN' in inst.operands[1]:
    

Optimizations

I have also added optimization passes corresponding to the LLVM instcombine, constprop, die (Dead Instruction Elimination), and simplifycfg passes. And a mem2reg pass will be available soon.

I'm mostly working on the optimizations just to verify that the API makes sense, and some of the passes (constprop and instcombine) are essentially placeholders right now, but I will finish up the code when the final SPIR-V specification is released.

Plans

My short-term plan for the API:
  1. Make this a real Python package, installable with pip. And make it work for Python 3.x too.
  2. Add a mem2reg pass (including infrastructure for simple data flow analysis).
  3. Implement a better API for handling constants, types, and global instructions.
  4. Clean up the API. There are some things that need to be renamed and tweaked a little bit (for example some functions having "uses" in their name treat decorations as a usage, and some do not).
  5. Document the API. Add examples/tutorials.

Assembler / disassembler

The biggest user-visible change in the assembler/disassembler is that global symbols now use normal ID tokens (such as %main) instead of prefixing the name with @ (such as @main). The original implementation used @ in order to simplify parsing of a more convenient syntax for declaring global variables
@gl_FragColor = Output <4 x f32> BuiltIn(FragColor)
but decorations are appended to normal instructions, so this is not much more convenient than using an OpVariable instruction
%gl_FragColor = OpVariable %44 BuiltIn(FragColor) Output
The only real difference is that the type must be specified as a pointer type for OpVariable, so it is not pretty-printed (The reason is that the pointer type contains a storage class, and I have not found a good way to include it in a pretty-printed type. The storage class is an operand to OpVariable too, so this could be written as
%gl_FragColor = OpVariable *<4 x f32> BuiltIn(FragColor) Output
if the assembler is updated to infer the storage class from the instruction. But I'm not sure if that is a good idea or not...).


The assembler/disassembler are mostly done, but two things need to be implemented:
  1. Floating-point literals
  2. Correct name handling
And there are lots of minor things that could be improved...

Floating-point literals

The assembler is supposed to allow
%52 = OpFSub <4 x f32> %49, (0.5, 0.5, 0.5, 0.5)
instead of
%50 = OpConstant f32 0x3f000000
%51 = OpConstantComposite <4 x f32> %50, %50, %50, %50
%52 = OpFSub <4 x f32> %49, %51
but the current implementation only handles integer/Boolean literals.

Name handling

The assembler accepts named IDs (such as %main), and the disassembler uses the names from OpName debug instructions to create named IDs. But there are some problems with the implementation:
  • Name mangling needs to be implemented in order to handle polymorphism (the disassembler currently uses the numerical value for IDs if it finds several identical names in the binary, and the assembler returns errors for re-declared names).
  • ID names declared within a function should be local to the function.
  • How should the tools handle multiple names for the same ID? For example, if the name in OpEntryPoint differs from the name in an OpName debug instruction for the function. Or if one instruction is decorated with multiple OpName names.

Minor tweaks

SPIR-V spreads out some information over the file (e.g. decorations are placed in the beginning of the file, far from the instructions they refer to), and the goal of the textual representation is to collect it in a way that makes it easy to see all relevant details affecting each instruction, as well as suppressing irrelevant details. But it is a bit unclear how to make this as useful as possible...

Some examples of things to consider:
  • Is it better to be consistent or compact? Decorations are written after the instruction name, and an FPFastMathMode decoration is currently written as
    %52 = OpFSub <4 x f32> FPFastMathMode(NotNaN | NotInf) %49, %50
    
    The values are unique, so it could be written more compactly, without the FPFastMathMode keyword
    %52 = OpFSub <4 x f32> NotNaN | NotInf %49, %50
    
    But there may exist better ways of improving this...
  • Pointers are heavily used in Kernels, so it makes sense to pretty-print them. But how should the storage class be handled?
  • Should structures be pretty printed?
  • Are OpLoopMerge and OpSelectionMerge instructions necessary, or should the assembler insert them automatically when needed?
I need to look at lots of real world shaders in order to get an idea of what makes sense, but that needs to wait for the SPIR-V specification to be released and shader compilers to become available. And I need to find relevant modern shaders to look at...

Plans

My short-term plan for the assembler/disassembler:
  1. Implement floating-point literals
  2. Document the assembler syntax

Saturday, August 29, 2015

Instruction-less computation – technical details

This blog post continues the previous post by describing the technical details.

Configuring x86 hardware

The IA-32 hardware is configured by a combination of architectural registers and descriptors in memory. The construction of the movdbz machine uses four different kinds of descriptors:
  • memory translation tables
    The memory translation is done through a two-level page table, where the first level is called a "page directory". The CR3 register contains the physical address of the page directory to use.
  • TSS – Task-State Segment
    The TSS is a structure where the processor state (registers, status flags, etc.) is stored when the task is not running.
  • GDT – Global Descriptor Table
    The GDT is an array of "segment descriptors" containing a pointer to a segment (such as a TSS) and some flags. The task switching mechanism accesses the TSS by indexing the GDT. For example,
    ljmp $24, $0
    switches to a new task whose TSS is described at offset 24 in the GDT.
  • IDT – Interrupt Descriptor Table
    The IDT defines how each interrupt is handled by providing an interrupt descriptor for each interrupt type. We configure it to switch to a new task for #PF and #DF exceptions, and each descriptor contains an offset into the GDT corresponding to the target TSS.

Interrupts, task switching, and memory translation tables

The branch targets in the instructions are specified by the IDT, so it needs to be changed when stepping into a new instruction (i.e. when switching to a new task). The task switch cannot change it directly, but the IDT is placed in virtual memory, and the TSS contains CR3 that points to the page directory, so the new task may map a new page for the IDT, and thus modify the branch targets for the new instruction.

This does, however, give us new problems... The TSS contains the stack pointer which is used to encode the register value, so the TSS is now used to represent both a register and an instruction, and we need to separate them.

The TSS is defined as:


We can place this structure in memory so that it crosses a page boundary, where the stack pointer (ESP) is in one page, and CR3 in the other. That is, the TSS is mapped as a combination of a "register page" and an "instruction page".

The implementation places the TSS so that the page boundary is between ECX and EDX (ESP is thus placed 8 bytes from the top of the page).

Handling source and destination registers

The execution of one instruction in the movdbz machine is done in three steps:
  1. Read the TSS for the new instruction.
  2. Generate exception (as the program counter is invalid). If the stack pointer is not zero, write status and update the stack pointer.
  3. Store the TSS.
The registers are represented by the stack pointer stored in the TSS, but the instructions use two registers (source and destination), so the read and write of the stack pointer should be to different TSS structures.

The TSS is stored in virtual memory, and we may play some tricks with the memory translation. The "read TSS" step reads the state (including the CR3) from the TSS, which updates the memory translation. We may use this to map a new register page for the TSS as a side effect of reading the TSS, and the "store TSS" will now write to this new register page (i.e. the destination register). So we will in some sense have two TSS structures; one "source TSS" and one "destination TSS".

Task switching to itself

There is a hardware restriction that task switching must be done to a new task – it is not possible to switch to the currently running task. This means that instructions of the form
loop:    movdbz r0, r0, loop, done
are invalid. But this is not much of a problem as it is easily fixed by inserting a NOP instruction
loop:    movdbz r0, r0, nop, done
nop:     movdbz discard, 0, loop, loop

TSS segment descriptor busy flag

The IA-32 tries to prevent recursive task switching; it has a "busy flag" that is set in the TSS segment descriptor when a task is entered from an interrupt, and the flag must be cleared before the task can be run a second time.


We need to clear this flag, and we do this by mapping the GDT page containing the instruction's segment descriptor as the instruction page in the destination TSS. Writing the TSS will now overwrite the last six segment descriptors in that GDT page. The descriptor at offset 0x0ff8 is overwritten by the values of EAX and ECX, so the instruction page initializes those registers with the original descriptor value, which clears the flag when they are written to the TSS.

This means that we can only have one TSS segment descriptor per page, and the GDT has a maximum size of 16 pages, so we can only have 16 segment descriptors (and thus only 16 TSS). But each instruction sets up the page tables, so we may remap these to correspond to different instructions over time. There are restrictions; each instruction needs to have its own and its destinations' TSS mapped, so their TSS segment descriptors must be different. The trapcc implementation handles this by graph coloring, but there are simpler ways (see "Generating assembler instructions" below).

Encoding of the instructions

We now have everything we need in order to describe the implementation.

instless_comp uses three TSS segment descriptors for the instructions, at GDT offsets 0x1ff8, 0x2ff8, and 0x3ff8, and has the corresponding TSSs mapped at addresses 0x40ffd0, 0x41ffd0, and 0x42ffd0 in the virtual address space.

The instructions are encoded in four pages:
  • Page Directory
    The page directory sets up five virtual address ranges:
    • STACK – the stack for the movdbz machine
    • INST – mappings for the movdbz instruction (where the TSS and IDT are placed)
    • GDT – the GDT
    • X86 – the code and data for the CPU (only needed to access the TSS when exiting the movdbz machine)
    • PROG – the movdbz program
    It is only the INST address range that is specific to the instruction – the other ranges have identical mappings in all instructions.
  • Page Table
    This maps the IDT page, the destination TSS (destination register page and GDT page) for the current instruction, and the source TSS (source register page and instruction page) for the successor instructions, into the address space.
  • Instruction page
    The instruction page of the TSS is set up so that
    • CR3 points to the page directory.
    • The program counter (EIP) is set to an invalid address.
    • EAX and ECX have the value of the corresponding TSS segment descriptor.
    • EFLAGS, that contains miscellaneous flags for the system state, is set to the value used when running the CPU.
  • IDT page
    The IDT page contains an IDT that handles #PF and #DF exceptions by switching to a task corresponding to the destination instruction.
The movdbz machine is started by switching to the task corresponding to the first instruction, and it is possible to exit from the machine by letting it switch to the task identified by GDT index 0x18, which is the task for the original program running on the CPU.

Generating assembler instructions

The restrictions on the instructions make it hard to write assembly code, and are annoying when doing code generation. The assembler in instless_comp solves this by generating each assembler instruction as three "real" instructions, so that an instruction of the form
label:    movdbz rdest, rsrc, label_nonzero, label_zero
is generated as
label:    movdbz discard, 1, label+2, label+2
label+1:  movdbz discard, 1, label+2, label+2
label+2:  movdbz rdest, rsrc, label_nonzero, label_zero+1
The first instruction is always encoded with the TSS descriptor at offset 0x1ff8, the second at offset 0x2ff8, and the third at offset 0x3ff8, so this always satisfies the constraints without needing graph coloring. And it also handles the assembler instruction branching to itself, as that will branch from the last instruction to one of the two preceding NOP instructions.

We also know that the target in an assembler instruction is always a NOP, so its input register is always the constant 1, and we do not need to propagate the input registers to its predecessors.

Sunday, August 23, 2015

Instruction-less computation

The paper "The Page-Fault Weird Machine: Lessons in Instruction-less Computation" shows that you can run programs on the x86 fault handling mechanism, i.e. without executing any instructions on the CPU!

The construction of this weird machine is using two properties of the IA-32 architecture:
  • Task switching
    The IA-32 has hardware support for task switching — storing the CPU state (registers and status flags), and resuming execution with a new memory map and previously stored CPU state. The interrupt controller may be configured so that a task switch happens when an exception is raised.
  • Interrupts write an error code to the stack
    The Page-Fault Exception (#PF) writes a 32-bit error code to the stack and decrements the stack pointer before handling the exception, which lets us modify the stack pointer without executing any CPU instructions. The stack pointer cannot wrap, so a Double Fault Exception (#DF) is raised instead of #PF if the stack pointer is less than four.
This can be used to create a machine with registers r0, r1, ..., and a move-branch-if-zero-or-decrement instruction
movdbz rdest, rsrc, label_nonzero, label_zero
that does the work corresponding to
if (rsrc == 0) {
    rdest = rsrc;
    goto label_zero;
} else {
    rdest = rsrc - 1;
    goto label_nonzero;
}
The movdbz instruction is Turing-complete (see e.g. Wikipedia "One instruction set computer"1, or this blog post that describes a BF compiler that only generates movdbz instructions). The paper comes with an implementation in trapcc, and I have made a somewhat simpler and more user-friendly implementation in instless_comp.2

A simplified description of the construction is that each register is represented by a task where the stack pointer contains the register's value multiplied by four. Each instruction is represented by an interrupt vector that switches to new tasks (including a new interrupt vector), where #DF switches to the task corresponding to label_zero and #PF switches to label_nonzero. All tasks have invalid program counters, so the new task will raise a new #PF exception (or #DF if the stack pointer is zero), and the machine steps into the next instruction. The implementation is however much more involved, and I describe the construction in detail in a separate blog post.

The stack pointer needs to point to valid memory (or we will get other interrupts), so the registers can only have values that are represented as valid stack addresses, and we cannot represent 32-bit values even if we map all memory for the stack (the decrement is done by subtracting four from the stack pointer, so we lose the two least significant bits...). It is natural to just map page 0 for the stack to give us registers that are 10 bits wide.

It is possible to make the registers wider by adding more pages to the stack, but it does not help much; the machine can only do subtraction, so e.g. addition
a = b + c;
needs to be done by subtracting from a maximum value
a = MAX_VAL - (MAX_VAL - b - c);
And the machine can only decrement by one, so the subtractions need to be done by looping
tmp = MAX_VAL;
while (b--)
    tmp--;
while (c--)
    tmp--;
a = MAX_VAL;
while (tmp--)
    a--;
That is, the running time of addition is exponential in the bit width of the registers, so wide registers are not that useful.

The blog post is continued in part 2 that describes the technical details.



1. movdbz reduces trivially to the subtract-and-branch-if-nonzero instruction.
2. trapcc needs to do global analysis (graph coloring etc.) when generating the instructions in order to satisfy some annoying hardware constraints. My implementation uses a less efficient instruction encoding that makes it possible to handle each instruction independently (and makes it easier to write assembly manually).