A recent talk at the OpenIoT Summit NA 2017, “Optimizing C for Microcontrollers — Best Practices”, had three examples illustrating the effect of different code constructs:
- Array subscript vs. pointer access
- Loops (increment vs. decrement)
- Loops (post-decrement vs. pre-decrement)
The examples were compiled at the -Os optimization level. This blog post will look a bit closer at those examples, and discuss why the conclusions are not always valid.

Array subscript vs. pointer access
The first example is meant to illustrate the difference between array subscripts and pointer access with the two functions

int a[5];

int foo1(void)
{
    int i;
    int res = 0;
    for (i = 0; i < 5; i++)
        res += a[i];
    return res;
}

and
int a[5];

int foo2(void)
{
    int *p;
    int i;
    int res = 0;
    for (p = a, i = 0; i < 5; i++, p++)
        res += *p;
    return res;
}

The first function is generated in a natural way
foo1:
    movs    r0, #0
    mov     r3, r0
    ldr     r1, .L5
.L3:
    ldr     r2, [r1, r3, lsl #2]
    adds    r3, r3, #1
    cmp     r3, #5
    add     r0, r0, r2
    bne     .L3
    bx      lr

while the second function has its loop unrolled
foo2:
    ldr     r3, .L3
    ldm     r3, {r0, r2}
    add     r0, r0, r2
    ldr     r2, [r3, #8]
    add     r0, r0, r2
    ldr     r2, [r3, #12]
    ldr     r3, [r3, #16]
    add     r0, r0, r2
    add     r0, r0, r3
    bx      lr

The reason for this difference is that compiling with -Os should not unroll loops if unrolling increases the code size. But it is hard to estimate the resulting code size, as later optimization passes should be able to take advantage of the unrolling and remove redundant code, so the compiler uses a rather imprecise heuristic. These loops are really close to the threshold (unrolling increases the code size by 4 bytes), and the minor difference in how the loops look when passed to the unroller makes the heuristic estimate that unrolling foo1 will increase the size by one instruction, while foo2 will get the same size after unrolling.

This does not, however, illustrate any fundamental difference in the compiler’s understanding of array subscripts compared to pointer access — any difference in the code could affect a heuristic and have a similar effect (I have worked on compilers that generate different code if you rename variables or even add a comment!).1
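One way to see that the unrolling heuristic, rather than the compiler's understanding of the two constructs, is responsible is to move the loops away from the threshold. A minimal sketch (the names bar1 and bar2 and the trip count of 100 are mine, not from the talk): with the longer loop, the unroller leaves both functions alone, and both forms typically compile to the same loop.

int b[100];

int bar1(void)
{
    int res = 0;
    for (int i = 0; i < 100; i++)
        res += b[i];      /* array subscript */
    return res;
}

int bar2(void)
{
    int res = 0;
    for (int *p = b; p != b + 100; p++)
        res += *p;        /* pointer access */
    return res;
}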
Loops (increment vs. decrement)
The second example uses the two functions

void foo1(void)
{
    int x = 0;
    do {
        printk("X = %d\n", x);
        x++;
    } while (x < 100);
}

and
void foo2(void)
{
    int x = 100;
    do {
        printk("X = %d\n", x);
        x--;
    } while (x);
}

to illustrate that it is better to write loops decrementing the iteration variable, as the CPU can do the end-of-loop check for free as
    subs    r4, r4, #1
    bne     .L3

instead of
    adds    r4, r4, #1
    cmp     r4, #100
    bne     .L3

That is true, but the compiler can in many cases transform the loop to change the iteration order, so the iteration order in the generated program depends more on what the loop does than on how it iterates in the source code.
Note that the two functions do not do the same thing — foo1 outputs the numbers in increasing order and foo2 outputs them in decreasing order. Modifying foo2 to do the same thing as foo1, by changing the function call to printk("X = %d\n", 100 - x);, makes it generate code identical to foo1 (as the compiler decides that it is better to iterate using increments in order to eliminate the subtraction), even though the function was written using decrements.
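For concreteness, here is foo2 with that change applied (the same function as above, with only the printk argument modified):

void foo2(void)
{
    int x = 100;
    do {
        printk("X = %d\n", 100 - x);
        x--;
    } while (x);
}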
Loops (post-decrement vs. pre-decrement)
The third example considers pre- vs. post-decrement using the examples

void foo1(void)
{
    unsigned int x = 10;
    do {
        if (--x) {
            printk("X = %d\n", x);
        } else {
            printk("X = %d\n", x);
            x = 10;
        }
    } while (1);
}

and
void foo2(void)
{
    unsigned int x = 9;
    do {
        if (x--) {
            printk("X = %d\n", x);
        } else {
            printk("X = %d\n", x);
            x = 9;
        }
    } while (1);
}

The example is meant to illustrate that
--x is better, as it can get the comparison as a side effect of the subtraction, in the same way as the previous example

    subs    r4, r4, #1
    bne     .L3

but whether this is beneficial depends very much on the microarchitecture. Many microarchitectures can do compare and branch efficiently,2 so a compare and a branch are not necessarily slower than branching on the status code from the subtraction. The problem with --x is that it adds a data dependency — you must do the subtraction before you can evaluate the if-statement. With x-- you can evaluate the if-statement and the subtraction in parallel, with the result that if (--x) needs one extra cycle to execute compared to if (x--) on superscalar CPUs having efficient compare and branch.
1. This typically happens when the compiler has different equivalent choices (for example, should it spill variable a or b to the stack), and it just chooses the first alternative. The first alternative is found by iterating over some kind of container, and this container may be an associative array using pointer values as the key...

2. For example, x86 CPUs tend to fuse cmp and jne so that they execute as one instruction.
Comments

In some cases where a program attempts to do a 5 or 7 byte copy operation, e.g.

void push2(char *p)
{
    for (int i = 0; i != 7; i++)
        p[i] = p[i + 1];
}
gcc 7 x86-64 with -O3 will replace the loop with an actual call to memmove, which seems silly given a size of 7. Interestingly, if the loop is split in two parts, one counting from 0 to 3 and an overlapping one counting from 3 to 6, it does much better.
I don't know how the cost of two randomly-aligned 32-bit loads and two randomly-aligned 32-bit stores would compare with the cost of rep movsb, but calling memmove seems downright crazy. Do you know why gcc would be favoring memmove there?
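For reference, the split version described above might look like this (a sketch based on my reading of the comment; the name push2_split and the exact loop bounds are mine):

void push2_split(char *p)
{
    /* Copy the first four bytes... */
    for (int i = 0; i != 4; i++)
        p[i] = p[i + 1];
    /* ...and the last four, overlapping at i == 3. */
    for (int i = 3; i != 7; i++)
        p[i] = p[i + 1];
}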
I do not know the details of the heuristics...
The reason why it works better when you split the loop is that the two loops are handled individually (GCC does not have a pass to merge the loops or the resulting memmoves).

I ran a quick benchmark, and the call to memmove performs identically to the expanded code on my Intel i7 6850K, but the memmove is much slower on an AMD A10-7850K.
In the two-loop case, performance is better because gcc knows how to optimize moves of 2, 4, or 8 bytes, but it doesn't know how to split a 7-byte move into two 4-byte moves.
It is actually better to do the 7-byte memmove as three moves (4, 2, and 1 byte) instead of two 4-byte moves — the second 4-byte store overlaps the first, so the first store needs to finish before the second store can be executed. The three smaller stores can be done in parallel.
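For concreteness, a 4+2+1 decomposition along these lines might look as follows (a sketch; the name move7 and the use of memcpy to express the unaligned accesses are my choices, not code from the discussion):

#include <string.h>

void move7(char *dst, const char *src)
{
    char tmp[7];
    memcpy(tmp, src, 7);          /* read all bytes first, giving memmove semantics */
    memcpy(dst, tmp, 4);          /* 4-byte store */
    memcpy(dst + 4, tmp + 4, 2);  /* 2-byte store */
    memcpy(dst + 6, tmp + 6, 1);  /* 1-byte store */
}

Compilers typically expand such small fixed-size memcpy calls inline, so no library call is generated.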
It is possible for the CPU to detect and handle the overlap efficiently, but that is somewhat complicated, and many CPUs do not do this — for example, my AMD A10 is slower with two 4-byte stores than with the three 4-, 2-, and 1-byte stores.
Interesting, though the advantage of the three smaller moves might be diminished if the compiler found something useful to do between the two four-byte moves. How would the efficiency of any of those approaches compare with using a 64-bit shift-by-8 instruction followed by a byte write? In any case, using a function call seems rather odd, since even if a compiler made no effort to ensure things were properly placed in registers beforehand, I would think code to put things into RSI/RDI/RCX, perform a REP MOVSB, and then restore the registers would be faster than a call to any plausible external implementation of memmove.