Saturday, March 25, 2017

pre-decrement vs. post-decrement, etc.

A recent talk at the OpenIoT Summit NA 2017, “Optimizing C for Microcontrollers — Best Practices”, had three examples illustrating the effect of different code constructs:
  • Array subscript vs. pointer access
  • Loops (increment vs. decrement)
  • Loops (post-decrement vs. pre-decrement)
as compiled with GCC 6.x for ARM at the -Os optimization level. This blog post will look a bit closer at those examples and discuss why the conclusions are not always valid.

Array subscript vs. pointer access

The first example is meant to illustrate the difference between array subscripts and pointer access with the two functions
int a[5];

int foo1(void)
{
    int i;
    int res = 0;
    for (i = 0; i < 5; i++)
        res += a[i];
    return res;
}
and
int a[5];

int foo2(void)
{
    int *p;
    int i;
    int res = 0;
    for (p = a, i = 0; i < 5; i++, p++)
        res += *p;
    return res;
}
The first function is generated in a natural way
foo1:
    movs  r0, #0                  @ res = 0
    mov   r3, r0                  @ i = 0
    ldr   r1, .L5                 @ r1 = &a
.L3:
    ldr   r2, [r1, r3, lsl #2]    @ r2 = a[i]
    adds  r3, r3, #1              @ i++
    cmp   r3, #5
    add   r0, r0, r2              @ res += r2
    bne   .L3
    bx    lr
while the second function has its loop unrolled
foo2:
    ldr   r3, .L3                 @ r3 = &a
    ldm   r3, {r0, r2}            @ r0 = a[0], r2 = a[1]
    add   r0, r0, r2
    ldr   r2, [r3, #8]            @ a[2]
    add   r0, r0, r2
    ldr   r2, [r3, #12]           @ a[3]
    ldr   r3, [r3, #16]           @ a[4]
    add   r0, r0, r2
    add   r0, r0, r3
    bx    lr
The reason for this difference is that -Os is not supposed to unroll loops if unrolling increases the code size. But it is hard to estimate the resulting code size, as later optimization passes should be able to take advantage of the unrolling and remove redundant code, so the compiler uses a rather imprecise heuristic. These loops are very close to the threshold (unrolling increases the code size by 4 bytes), and the minor difference in how the loops look when they reach the unroller makes the heuristic estimate that unrolling foo1 will increase the size by one instruction, while foo2 is estimated to keep the same size after unrolling.
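The threshold can be explored by adjusting GCC's size limit for complete unrolling. A minimal sketch, assuming an arm-none-eabi cross-compiler is installed (the --param name is a GCC-internal knob and may differ between versions):
# Compile at -Os and inspect the generated assembly.
arm-none-eabi-gcc -Os -S foo.c -o foo.s

# Lower the size limit for complete unrolling; this should keep
# both loops rolled.
arm-none-eabi-gcc -Os --param max-completely-peeled-insns=0 -S foo.c -o foo.s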

This does not, however, illustrate any fundamental difference in the compiler’s understanding of array subscripts compared to pointer accesses — any difference in the code can affect a heuristic and have a similar effect (I have worked on compilers that generate different code if you rename variables or even add a comment!).1

Loops (increment vs. decrement)

The second example uses the two functions
void foo1(void)
{
    int x = 0;
    do {
        printk("X = %d\n", x);
        x++;
    } while (x < 100);
}
and
void foo2(void)
{
    int x = 100;
    do {
        printk("X = %d\n", x);
        x--;
    } while (x);
}
to illustrate that it is better to write loops that decrement the iteration variable, as the CPU then gets the end-of-loop check for free as
subs  r4, r4, #1
bne   .L3 
instead of
adds  r4, r4, #1
cmp   r4, #100
bne   .L3
That is true, but the compiler can in many cases transform the loop to change the iteration order, so the iteration order in the generated program depends more on what the loop does than on how it iterates in the source code.

Note that the two functions do not do the same thing — foo1 outputs the numbers in increasing order and foo2 outputs them in decreasing order. Modifying foo2 to do the same thing as foo1, by changing the function call to
printk("X = %d\n", 100 - x);
makes it generate code identical to foo1 (the compiler decides that it is better to iterate using increments in order to eliminate the subtraction), even though the function was written using decrements.
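For reference, the modified function is
void foo2(void)
{
    int x = 100;
    do {
        printk("X = %d\n", 100 - x);
        x--;
    } while (x);
}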

Loops (post-decrement vs. pre-decrement)

The third example considers pre- vs. post-decrement using the two functions
void foo1(void)
{
    unsigned int x = 10;
    do {
        if (--x) {
            printk("X = %d\n", x);
        } else {
            printk("X = %d\n", x);
            x = 10;
        }
    } while (1);
}
and
void foo2(void)
{
    unsigned int x = 9;
    do {
        if (x--) {
            printk("X = %d\n", x);
        } else {
            printk("X = %d\n", x);
            x = 9;
        }
    } while (1);
}
The example is meant to illustrate that --x is better, as the comparison comes for free as a side effect of the subtraction, just as in the previous example
subs  r4, r4, #1
bne   .L3 
but whether this is beneficial depends very much on the microarchitecture. Many microarchitectures can do compare and branch efficiently,2 so a compare followed by a branch is not necessarily slower than branching on the status flags from the subtraction. The problem with --x is that it adds a data dependency — the subtraction must be done before the if-statement can be evaluated. With x-- the if-statement and the subtraction can be evaluated in parallel, with the result that
if (--x)
needs one extra cycle to execute compared to
if (x--)
on superscalar CPUs with efficient compare and branch.
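The difference is easy to see in a hand-written sketch of the two conditions (illustrative instructions, not actual compiler output):
@ if (--x): the branch must wait for the result of the subtraction
    subs  r4, r4, #1
    bne   .L3

@ if (x--): the branch tests the old value, so the compare and the
@ decrement are independent and can execute in the same cycle
    cmp   r4, #0
    sub   r4, r4, #1
    bne   .L3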


1. This typically happens when the compiler has different equivalent choices (for example, should it spill variable a or b to the stack), and it just chooses the first alternative. The first alternative is found by iterating over some kind of container, and this container may be an associative array using pointer values as the key...
2. For example, x86 CPUs tend to fuse cmp and jne so that they execute as one instruction.

Sunday, March 5, 2017

The cost of conditional moves and branches

The previous blog post contained an example where branching was much more expensive than using a conditional move, but it is easy to find cases where conditional moves reduce performance noticeably. One such case is discussed in this Stack Overflow question (and GCC bug 56309) about the performance of a function implementing naive bignum multiplication
static void inline
single_mult(const std::vector<ull>::iterator& data,
            const std::vector<ull>::const_iterator& rbegin,
            const std::vector<ull>::const_iterator& rend,
            const ull x)
{
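    // Note: ull, imax, and numbits are defined elsewhere in the
    // original question; ull is an unsigned integer type, and imax
    // appears to be 1 << numbits.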
    ull tmp=0, carry=0, i=0;
    for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it)
    {
        tmp = x * (*rhs_it) + data[i] + carry;
        if (tmp >= imax) {
            carry = tmp >> numbits;
            tmp &= imax - 1;
        } else { 
            carry = 0;
        }
        data[i++] = tmp;
    }
    data[i] += carry;
}

void
naive(std::vector<ull>::iterator data, 
      std::vector<ull>::const_iterator cbegin,
      std::vector<ull>::const_iterator cend,
      std::vector<ull>::const_iterator rbegin,
      std::vector<ull>::const_iterator rend)
{
    for (auto data_it = cbegin; data_it != cend; ++data_it)
    {
        if (*data_it != 0) {
            single_mult(data, rbegin, rend, *data_it);
        }
        ++data;
    }
}
Minor changes to the source code made the compiler use conditional moves instead of a branch, and this reduced the performance by 25%.
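One way to experiment with this behavior (a suggestion assuming GCC; the file name is illustrative) is to disable the RTL if-conversion passes, which are responsible for much of the conditional-move generation, and benchmark both versions:
# Default: the compiler may use cmov for the carry computation.
g++ -O2 -S bignum.cpp

# Disable if-conversion, making branches more likely.
g++ -O2 -fno-if-conversion -fno-if-conversion2 -S bignum.cpp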

The difference between branches and conditional moves can be illustrated by
a = a + b;
if (c > 0)
    a = -a;
a = a + 1;
It is not possible to calculate the exact number of clock cycles for a code segment on a reasonably complex CPU, but it is often easy to get a good estimate (see e.g. this example for how to use such estimates when optimizing assembly code). The CPU converts the original instructions to micro-ops, and it can dispatch several micro-ops per cycle (e.g. 8 for Broadwell). The details are somewhat complicated,1 but most instructions in this blog post translate to one micro-op that can execute without any restrictions.

An assembly version using a branch looks like (assuming that the variables are placed in registers)
    addl    %edx, %eax
    testl   %ecx, %ecx
    jle     .L2
    negl    %eax
.L2:
    addl    $1, %eax
The CPU combines the testl and jle instructions into one micro-op through what is called “macro-fusion”, so both the addition and the test/branch can be dispatched in the first cycle. It takes a while for the compare and branch to execute, but branch prediction lets the CPU speculatively start executing the next instruction in the following cycle, so the final addl or the negl can be dispatched in the second cycle (depending on whether the branch is predicted taken or not). The result is that the code segment is done in 2 or 3 cycles, provided that the branch prediction was correct — a mispredict must discard the speculated instructions and restart execution, which typically adds 15–20 cycles.
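A rough dispatch timeline for the not-taken case (a sketch, not a measured trace):
cycle 1:  addl %edx, %eax       testl/jle  (macro-fused)
cycle 2:  negl %eax             (speculative, predicted not-taken)
cycle 3:  addl $1, %eax
If the branch is instead predicted taken, the negl is skipped and the final addl dispatches in cycle 2.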

Generating a version using a conditional move produces something like
    addl    %edx, %eax
    movl    %eax, %edx
    negl    %edx
    testl   %ecx, %ecx
    cmovg   %edx, %eax
    addl    $1, %eax
The first cycle will execute the first addition and the test instruction, but the following cycles can only execute one instruction at a time, as each instruction depends on the result of the previous one. The result is that this needs 5 cycles to execute.2
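The corresponding timeline (again a sketch, assuming as in footnote 2 that cmovg is a single micro-op) makes the serialized dependency chain visible:
cycle 1:  addl %edx, %eax       testl %ecx, %ecx
cycle 2:  movl %eax, %edx       (waits for addl)
cycle 3:  negl %edx             (waits for movl)
cycle 4:  cmovg %edx, %eax      (waits for negl and testl)
cycle 5:  addl $1, %eax         (waits for cmovg)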

So the version with conditional moves takes twice as long to execute as the version using a branch, which is noticeable in short loops such as the one in single_mult. In addition, pipeline restrictions on how instructions can be dispatched (for example, only one division instruction can be dispatched per cycle) make it hard for the CPU to schedule long dependency chains efficiently, which may be a problem for more complex code.


1. See “Intel 64 and IA-32 Architectures Optimization Reference Manual” and Agner Fog’s optimization manuals for the details.
2. This assumes that the cmovg instruction is one micro-op. That is true for some CPUs such as Broadwell, while others split it into two micro-ops.