Episode 43 of “C++ Weekly” talks about evaluating and eliminating code at compile time, and the example is fun as it triggers a few different deficiencies in the GCC optimization passes (using the -O3 optimization level with GCC trunk r243987 built for x86_64-linux-gnu).
The example
#include <type_traits>
#include <numeric>
#include <iterator>

template<typename ... T>
int sum(T ... t)
{
  std::common_type_t<T...> array[sizeof...(T)]{ t... };
  return std::accumulate(std::begin(array), std::end(array), 0);
}

int main()
{
  return sum(5,4,3,2,1);
}
is meant to be optimized to return a constant value, and it does
main:
movl $15, %eax
ret
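For reference, the instantiation of sum for these five arguments is roughly equivalent to the following plain function (a sketch of the pack expansion; the name sum_5_ints is made up here, and std::common_type_t<int, int, int, int, int> is simply int):

#include <numeric>
#include <iterator>

// Roughly what sum(5, 4, 3, 2, 1) instantiates to once the parameter pack
// has been expanded (a sketch, not the compiler's actual output).
int sum_5_ints(int t0, int t1, int t2, int t3, int t4)
{
  int array[5]{ t0, t1, t2, t3, t4 };
  return std::accumulate(std::begin(array), std::end(array), 0);
}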
But the behavior varies a lot depending on how many arguments are passed to sum; 1–5 arguments work as expected, while calling sum with 6 or 7 arguments is generated as
main:
movdqa .LC0(%rip), %xmm0
movaps %xmm0, -40(%rsp)
movl -32(%rsp), %edx
movl -28(%rsp), %eax
leal 14(%rdx,%rax), %eax
ret
where .LC0 is an array consisting of four constants. 9–12 arguments are similar (but with more code). 13–28 arguments are generated as a constant again
main:
movl $91, %eax
ret
29–64 arguments are optimized to a constant, but with some redundant stack adjustments when the number of arguments is not divisible by four
main:
subq $16, %rsp
movl $435, %eax
addq $16, %rsp
ret
Finally, 65 or more arguments are generated as a vectorized monstrosity in a separate function, called from main by pushing all the arguments to the stack, one at a time
main:
subq $16, %rsp
movl $60, %r9d
movl $61, %r8d
pushq $1
pushq $2
movl $62, %ecx
pushq $3
pushq $4
...
This is essentially as far from generating a constant as you can get. 😀

The rest of the blog post will look at how GCC reasons when trying to optimize this, by examining GCC's internal representation of the program at different points in the optimization pipeline. The IR works essentially as a restricted version of the C language, and you can get GCC to write the IR to a file after each pass by using the command-line option -fdump-tree-all.
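For reference, a typical way to produce these dumps looks like the following (assuming the example is saved as example.cpp; the numeric prefixes in the dump file names vary between GCC versions):

g++ -O3 -fdump-tree-all example.cpp -o example
ls example.cpp.*t.fre3   # IR after the fre3 pass discussed below
ls example.cpp.*t.dom3   # IR after the dom3 pass discussed below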
1–5 arguments
The use of std::accumulate and iterators expands to five functions, and the compiler starts by inlining and simplifying this to
int main() ()
{
common_type_t array[5];
int __init;
int * __first;
int _3;
<bb 2> [16.67%]:
array[0] = 5;
array[1] = 4;
array[2] = 3;
array[3] = 2;
array[4] = 1;
<bb 3> [100.00%]:
# __first_2 = PHI <&array(2), __first_6(4)>
# __init_4 = PHI <0(2), __init_5(4)>
if (&MEM[(void *)&array + 20B] == __first_2)
goto <bb 5>; [16.67%]
else
goto <bb 4>; [83.33%]
<bb 4> [83.33%]:
_3 = *__first_2;
__init_5 = _3 + __init_4;
__first_6 = __first_2 + 4;
goto <bb 3>; [100.00%]
<bb 5> [16.67%]:
array ={v} {CLOBBER};
return __init_4;
}
The loop is immediately unrolled and simplified to
int main() ()
{
common_type_t array[5];
int __init;
int * __first;
int _15;
int _20;
int _25;
int _30;
int _35;
<bb 2> [16.70%]:
array[0] = 5;
array[1] = 4;
array[2] = 3;
array[3] = 2;
array[4] = 1;
_15 = MEM[(int *)&array];
__init_16 = _15;
_20 = MEM[(int *)&array + 4B];
__init_21 = _15 + _20;
_25 = MEM[(int *)&array + 8B];
__init_26 = __init_21 + _25;
_30 = MEM[(int *)&array + 12B];
__init_31 = __init_26 + _30;
_35 = MEM[(int *)&array + 16B];
__init_36 = __init_31 + _35;
array ={v} {CLOBBER};
return __init_36;
}
that is then optimized to a constant by the fre3 (“Full Redundancy Elimination”) pass
int main() ()
{
<bb 2> [16.70%]:
return 15;
}
6–12 arguments
The early optimizations that handled the previous case are there mostly to get rid of noise before the heavier optimizations (such as loop optimizations) kick in, and the loop, now doing 6 iterations, is considered too big to be unrolled by the early unroller.
The “real” loop optimizers determine that the loop is not iterating enough times for vectorization to be profitable, so it is just unrolled. The “SLP vectorizer”, which vectorizes straight-line code, is run right after the loop optimizations; it sees that we are copying constants into consecutive addresses, so it combines four of them into a vector assignment
MEM[(int *)&array] = { 6, 5, 4, 3 };
array[4] = 2;
array[5] = 1;
This is now simplified by the dom3 pass that does SSA dominator optimizations (jump threading, redundancy elimination, and const/copy propagation), but it does not understand that a scalar initialized by a constant vector is a constant, so it only propagates the constants for
array[4] and array[5] that were initialized as scalars, and the code passed to the backend looks like
int main() ()
{
common_type_t array[6];
int __init;
int _15;
int _22;
int _27;
int _32;
<bb 2> [14.31%]:
MEM[(int *)&array] = { 6, 5, 4, 3 };
_15 = MEM[(int *)&array];
_22 = MEM[(int *)&array + 4B];
__init_23 = _15 + _22;
_27 = MEM[(int *)&array + 8B];
__init_28 = __init_23 + _27;
_32 = MEM[(int *)&array + 12B];
__init_33 = __init_28 + _32;
__init_43 = __init_33 + 3;
array ={v} {CLOBBER};
return __init_43;
}
13–28 arguments
The loop is now iterated enough times that the compiler determines that vectorization is profitable. The idea behind the vectorization is to end up with something like
tmp = { array[0], array[1], array[2], array[3] }
+ { array[4], array[5], array[6], array[7] }
+ { array[8], array[9], array[10], array[11] };
sum = tmp[0] + tmp[1] + tmp[2] + tmp[3] + array[12];
and the vectorizer generates two loops: one that consumes four elements at a time as long as possible, and one that consumes the remaining elements one at a time. The rest of the loop optimizers know how many times the loops are iterating, so the loops can then be unrolled etc. as appropriate.
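In C++ terms, the shape the vectorizer is aiming for looks roughly like this (a sketch using GCC's vector extension; the function and variable names are invented for illustration and do not come from the actual dumps):

typedef int v4si __attribute__((vector_size(16)));  // GCC vector extension

// Rough shape of the vectorized reduction: a main loop consuming four
// elements per iteration, followed by a scalar epilogue for the leftovers.
int sum_vectorized(const int *array, int n)
{
  v4si acc = {0, 0, 0, 0};
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    v4si tmp;
    __builtin_memcpy(&tmp, &array[i], sizeof tmp);  // vector load
    acc += tmp;
  }
  int init = acc[0] + acc[1] + acc[2] + acc[3];  // horizontal add of the lanes
  for (; i < n; ++i)                             // scalar epilogue
    init += array[i];
  return init;
}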
The vectorizer is, unfortunately, generating somewhat strange code for the checks that there are enough elements
_35 = (unsigned long) &MEM[(void *)&array + 52B];
_36 = &array + 4;
_37 = (unsigned long) _36;
_38 = _35 - _37;
_39 = _38 /[ex] 4;
_40 = _39 & 4611686018427387903;
if (_40 <= 4)
goto ; [10.00%]
else
goto ; [90.00%]
that confuse the rest of the loop optimizations, with the result that the IR contains lots of conditional code of this form. This is not the first time I have seen GCC having problems with the pointer arithmetic coming from iterators (see bug 78847), and I believe this is the same problem (as the “bitwise and” should be optimized away when the pointer arithmetic has been evaluated to a constant).

The subsequent passes mostly manage to clean up these conditionals, and dom3 optimizes the vector operations to a constant. But it does not understand the expression used to decide how many scalar elements need to be handled if the iteration count is not a multiple of four (that check is eliminated by the value range propagation pass after dom3 is run), so the scalar additions are kept in the code given to the backend
int main() ()
{
common_type_t array[13];
int __init;
int _23;
<bb 2> [7.14%]:
array[12] = 1;
_23 = MEM[(int *)&array + 48B];
__init_3 = _23 + 90;
array ={v} {CLOBBER};
return __init_3;
}
This is, however, not much of a problem for this program, as the backend manages to optimize this to
main:
movl $91, %eax
ret
when generating the code.
29–64 arguments
The backend eliminates the memory accesses to the array in the previous case, but the array seems to be kept on the stack. That makes sense: the code is supposed to be optimized before being passed to the backend, so the backend should not need to eliminate variables, and there is no need to implement code doing this.
Leaf functions do not need to adjust the stack, but GCC does some stack adjustment on leaf functions too when more than 112 bytes are placed on the stack. You can see this for the meaningless function
void foo()
{
  volatile char a[113];
  a[0] = 0;
}
where the stack is adjusted when the array size is larger than 112.
foo:
subq $16, %rsp
movb $0, -120(%rsp)
addq $16, %rsp
ret
I do not understand what GCC is trying to do here... Anyway, passing 29 arguments to sum makes the array large enough that GCC adds the stack adjustments.

65– arguments
The sequence of assignments initializing the array is now large enough that sum is not inlined into main.
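If one wanted to poke at this further, a possible experiment (just a sketch, not something verified in this post) is to force inlining with GCC's always_inline attribute and see whether the constant comes back for large argument counts:

#include <type_traits>
#include <numeric>
#include <iterator>

// Forcing the inliner's hand with always_inline; whether this actually makes
// the 65-argument case fold to a constant has not been verified here.
template<typename ... T>
__attribute__((always_inline)) inline int sum(T ... t)
{
  std::common_type_t<T...> array[sizeof...(T)]{ t... };
  return std::accumulate(std::begin(array), std::end(array), 0);
}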