Sunday, January 8, 2017

GCC code generation for C++ Weekly Ep 43 example

Episode 43 of “C++ Weekly” talks about evaluating and eliminating code at compile time, and the example is fun as it triggers a few different deficiencies in the GCC optimization passes (using the -O3 optimization level with GCC trunk r243987 built for x86_64-linux-gnu).

The example
#include <type_traits>
#include <numeric>
#include <iterator>

template<typename ... T>
int sum(T ... t)
{
  std::common_type_t<T...> array[sizeof...(T)]{ t... };

  return std::accumulate(std::begin(array), std::end(array), 0);
}

int main()
{
  return sum(5,4,3,2,1);
}
is meant to be optimized to return a constant value, and it does
main:
    movl    $15, %eax
    ret
But the behavior varies a lot depending on how many arguments are passed to sum. 1–5 arguments work as expected, while calling sum with 6 or 7 arguments is generated as
main:
 movdqa .LC0(%rip), %xmm0
 movaps %xmm0, -40(%rsp)
 movl -32(%rsp), %edx
 movl -28(%rsp), %eax
 leal 14(%rdx,%rax), %eax
 ret
where .LC0 is an array consisting of four constants. 8–12 arguments are similar (but with more code). 

13–28 arguments are generated as a constant again
main:
    movl    $91, %eax
    ret

29–64 arguments are optimized to a constant, but with some redundant stack adjustments when the number of arguments is not divisible by four
main:
    subq    $16, %rsp
    movl    $435, %eax
    addq    $16, %rsp
    ret

Finally, with 65 or more arguments, sum is generated as a vectorized monstrosity in a separate function, which main calls after pushing all the arguments onto the stack, one at a time
main:
    subq    $16, %rsp
    movl    $60, %r9d
    movl    $61, %r8d
    pushq   $1
    pushq   $2
    movl    $62, %ecx
    pushq   $3
    pushq   $4
    ...
This is essentially as far from generating a constant as you can get. 😀

The rest of the blog post will look at how GCC is reasoning when trying to optimize this, by examining GCC's internal representation of the program at different points in the optimization pipeline. The IR works essentially as a restricted version of the C language, and you can get GCC to write the IR to a file after each pass by using the command-line option -fdump-tree-all.

1–5 arguments

The use of std::accumulate and iterators expands to five functions, and the compiler starts by inlining and simplifying this to
int main() ()
{
  common_type_t array[5];
  int __init;
  int * __first;
  int _3;

  <bb 2> [16.67%]:
  array[0] = 5;
  array[1] = 4;
  array[2] = 3;
  array[3] = 2;
  array[4] = 1;

  <bb 3> [100.00%]:
  # __first_2 = PHI <&array(2), __first_6(4)>
  # __init_4 = PHI <0(2), __init_5(4)>
  if (&MEM[(void *)&array + 20B] == __first_2)
    goto <bb 5>; [16.67%]
  else
    goto <bb 4>; [83.33%]

  <bb 4> [83.33%]:
  _3 = *__first_2;
  __init_5 = _3 + __init_4;
  __first_6 = __first_2 + 4;
  goto <bb 3>; [100.00%]

  <bb 5> [16.67%]:
  array ={v} {CLOBBER};
  return __init_4;
}
The loop is immediately unrolled and simplified to
int main() ()
{
  common_type_t array[5];
  int __init;
  int * __first;
  int _15;
  int _20;
  int _25;
  int _30;
  int _35;

  <bb 2> [16.70%]:
  array[0] = 5;
  array[1] = 4;
  array[2] = 3;
  array[3] = 2;
  array[4] = 1;
  _15 = MEM[(int *)&array];
  __init_16 = _15;
  _20 = MEM[(int *)&array + 4B];
  __init_21 = _15 + _20;
  _25 = MEM[(int *)&array + 8B];
  __init_26 = __init_21 + _25;
  _30 = MEM[(int *)&array + 12B];
  __init_31 = __init_26 + _30;
  _35 = MEM[(int *)&array + 16B];
  __init_36 = __init_31 + _35;
  array ={v} {CLOBBER};
  return __init_36;
}
that is then optimized to a constant by the fre3 (“Full Redundancy Elimination”) pass
int main() ()
{
  <bb 2> [16.70%]:
  return 15;
}

6–12 arguments

The early optimizations that handled the previous case are there mostly to get rid of noise before the heavier optimizations (such as loop optimizations) kick in, and the loop doing 6 iterations is considered too big to be unrolled by the early unroller.

The “real” loop optimizers determine that the loop does not iterate enough times for vectorization to be profitable, so it is just unrolled. The “SLP vectorizer”, which vectorizes straight-line code, is run right after the loop optimizations, and it sees that we are copying constants into consecutive addresses, so it combines four of them into a vector assignment
MEM[(int *)&array] = { 6, 5, 4, 3 };
array[4] = 2;
array[5] = 1;
This is now simplified by the dom3 pass, which does SSA dominator optimizations (jump threading, redundancy elimination, and const/copy propagation). But dom3 does not understand that a scalar read from memory initialized by a constant vector is a constant, so it only propagates the constants for array[4] and array[5], which were initialized as scalars, and the code passed to the backend looks like
int main() ()
{
  common_type_t array[6];
  int __init;
  int _15;
  int _22;
  int _27;
  int _32;

  <bb 2> [14.31%]:
  MEM[(int *)&array] = { 6, 5, 4, 3 };
  _15 = MEM[(int *)&array];
  _22 = MEM[(int *)&array + 4B];
  __init_23 = _15 + _22;
  _27 = MEM[(int *)&array + 8B];
  __init_28 = __init_23 + _27;
  _32 = MEM[(int *)&array + 12B];
  __init_33 = __init_28 + _32;
  __init_43 = __init_33 + 3;
  array ={v} {CLOBBER};
  return __init_43;
}

13–28 arguments

The loop is now iterated enough times that the compiler determines that vectorization is profitable. The idea behind the vectorization is to end up with something like
tmp = { array[0], array[1], array[2], array[3] }
    + { array[4], array[5], array[6], array[7] }
    + { array[8], array[9], array[10], array[11] };
sum = tmp[0] + tmp[1] + tmp[2] + tmp[3] + array[12];
and the vectorizer generates two loops — one that consumes four elements at a time as long as possible, and one that consumes the remaining elements one at a time. The rest of the loop optimizers know how many times the loops are iterating, so the loops can then be unrolled etc. as appropriate.

The vectorizer is, unfortunately, generating somewhat strange code for the checks that there are enough elements
_35 = (unsigned long) &MEM[(void *)&array + 52B];
_36 = &array + 4;
_37 = (unsigned long) _36;
_38 = _35 - _37;
_39 = _38 /[ex] 4;
_40 = _39 & 4611686018427387903;
if (_40 <= 4)
  goto ; [10.00%]
else
  goto ; [90.00%]
that confuse the rest of the loop optimizations, with the result that the IR contains lots of conditional code of this form. This is not the first time I have seen GCC having problems with the pointer arithmetic coming from iterators (see bug 78847), and I believe this is the same problem (the bitwise AND should be optimized away once the pointer arithmetic has been evaluated to a constant).

The subsequent passes mostly manage to clean up these conditionals, and dom3 optimizes the vector operations to a constant. But it does not understand the expression used to decide how many scalar elements need to be handled if the iteration count is not a multiple of four (that check is eliminated by the value range propagation pass after dom3 is run), so the scalar additions are kept in the code given to the backend
int main() ()
{
  common_type_t array[13];
  int __init;
  int _23;

   <bb 2> [7.14%]:
  array[12] = 1;
  _23 = MEM[(int *)&array + 48B];
  __init_3 = _23 + 90;
  array ={v} {CLOBBER};
  return __init_3;
}
This is, however, not much of a problem for this program, as the backend manages to optimize this to
main:
    movl    $91, %eax
    ret
when generating the code.

29–64 arguments

The backend eliminates the memory accesses to the array in the previous case, but the array itself seems to be kept on the stack. That makes sense — the code is supposed to be optimized before being passed to the backend, so the backend should not need to eliminate variables, and there is no reason to implement code doing this.

Leaf functions do not need to adjust the stack, but GCC does some stack adjustment on leaf functions too when more than 112 bytes are placed on the stack. You can see this for the meaningless function
void foo()
{
    volatile char a[113];
    a[0] = 0;
}
where the stack is adjusted when the array size is larger than 112.
foo:
    subq    $16, %rsp
    movb    $0, -120(%rsp)
    addq    $16, %rsp
    ret
I do not understand what GCC is trying to do here...

Anyway, passing 29 arguments to sum makes the array large enough that GCC adds the stack adjustments.

65– arguments

The sequence of assignments initializing the array is now large enough that sum is not inlined into main.

9 comments:

  1. Very interesting, thanks for writing/sharing.
    Did you try the same kind of investigation with clang++?

    1. Thanks! I have not looked much into clang optimizations — the blog posts are mostly observations from my attempts to improve gcc...

    2. clang does better
      https://godbolt.org/g/9BO2b3

    3. This comment has been removed by the author.

    4. This comment has been removed by the author.

    5. Third time lucky! Clang falls over in other ways:
      https://godbolt.org/g/V1FMMW

  2. "I do not understand what GCC is trying to do here..."

    If you pass -mno-red-zone to the compilation, the stack is also adjusted for 112 bytes. So I guess for stack sizes <= 112 GCC doesn't have to adjust the stack because it still fits in the red zone (it has a size of 128 on x86_64).

    I am not sure why it stops at 112 and not at 128, though.

    1. Thanks! I had totally forgotten the red zone (I'm usually working on embedded devices, and x86 is alien to me...)

  3. I tested how clang 3.9 behaves (on godbolt.org). Clang folds to a constant for up to 45 arguments; for a longer list it generates a function. The function, depending on the argument count, is either scalar or uses SIMD instructions.

