Wednesday, November 30, 2016

Loop optimization, and microbenchmarking

I read a blog post, "Reducing the Performance Impact of Debug Code in C Libraries", that describes a way to handle the overhead of debug code (compile two versions of the library, with and without debug code enabled, and load the right version dynamically). The blog post's approach is a reasonable way to solve the problem, but the example illustrating the overhead of debug code surprised me:
#include <stdio.h>
#include <stdlib.h>

#define N 100000000 // Adjust for your machine

int main(void)
{
    int debug_on = !!getenv("DEBUG");
    unsigned long sum = 0;

    for (int i = 0 ; i < N ; i++) {
        for (int j = 0 ; j < N ; j++) {
#ifdef WITH_DEBUG
            if (debug_on)
                printf("Debug print.\n");
#endif
            sum += i;
        }
    }
    printf("Sum: %lu\n", sum);

    return 0;
}
The blog post claims that the version compiled with WITH_DEBUG defined (introducing the overhead of "if (false)") is 50% slower than the one compiled without it — I had thought that both versions would have the same running time, so I tested the program compiled with "gcc -O3", and the difference was even bigger on my computer...

The reason I thought both versions would have identical performance is that GCC has an optimization, -funswitch-loops (enabled at -O3), that optimizes loops containing if-statements whose condition does not change within the loop, such as
for (int j = 0 ; j < N ; j++) {
    if (debug_on)
        printf("Debug print.\n");
    sum += i;
}
The optimization duplicates the loop, optimizes one as if the condition is true and the other as if it is false, and selects the correct version at runtime. That is, the loop above is transformed to
if (debug_on) {
    for (int j = 0 ; j < N ; j++) {
        printf("Debug print.\n");
        sum += i;
    }
} else {
    for (int j = 0 ; j < N ; j++) {
        sum += i;
    }
}
so the performance should be the same with/without WITH_DEBUG defined — both should end up running code resulting from the loop without the if-statement!

I looked into some of the details of what is happening. Compiling without WITH_DEBUG defined makes GCC determine that the inner loop
for (int j = 0 ; j < N ; j++) {
    sum += i;
}
calculates
sum += (unsigned long)i * N;
This, in turn, makes the outer loop calculate the sum of an arithmetic sequence, which the compiler knows how to turn into a constant. The result is that the whole program is transformed to the equivalent of
int main(void)
{
    getenv("DEBUG");
    unsigned long sum = 996882102603448320ul;
    printf("Sum: %lu\n", sum);

    return 0;
}
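As a sanity check, the constant can be reproduced by evaluating the closed form directly: each iteration of the outer loop contributes i * N, so the total is N * (N * (N - 1) / 2), wrapped modulo 2^64. This small verification program of mine assumes a 64-bit unsigned long:
#include <stdio.h>

int main(void)
{
    unsigned long n = 100000000ul;
    /* N * (N*(N-1)/2), where the final multiplication wraps modulo 2^64. */
    unsigned long sum = n * (n * (n - 1) / 2);
    printf("%lu\n", sum); /* Prints 996882102603448320. */
    return 0;
}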
When compiling with WITH_DEBUG defined, GCC determines that the inner loop does not change debug_on, and the loop is "unswitched". The sums are determined to be "i * N" as in the previous case, and the compiler sees that both branches do the same calculation, which can be moved out of the if-statement and combined. The result is that the inner loop is transformed to
if (debug_on) {
    for (int j = 0 ; j < N ; j++)
        printf("Debug print.\n");
}
sum += (unsigned long)i * N;
The outer loop could now be unswitched, but that does not happen, so the compiler continues by noticing that the sum can be calculated directly as in the previous case, and the resulting optimized program is equivalent to
int main(void)
{
    int debug_on = !!getenv("DEBUG");
    for (int i = 0 ; i < N ; i++)
        if (debug_on)
            for (int j = 0 ; j < N ; j++)
                printf("Debug print.\n");

    unsigned long sum = 996882102603448320ul;
    printf("Sum: %lu\n", sum);

    return 0;
}
The WITH_DEBUG version thus still executes the outer loop's 100,000,000 iterations (each testing debug_on), which is where the slowdown comes from, even though no debug output is ever produced.

The reason the program is not fully optimized is that the optimization passes are not run first on the inner loop as in my description above — they are (mostly) run over the whole program, one optimization pass at a time, and each pass requires the code to be simple enough for its optimizations to trigger. In this case, the outer loop would have been unswitched if sum had been optimized before the unswitching pass was run on the outer loop. Dealing with this is one of the "fun" parts of compiler development — you implement an amazing optimization pass that can do wonders, but the messy real-world code needs to be simplified by other passes before your pass can do its magic. It is, of course, easy to fix this by adding additional optimization passes before (and in the middle of) your new pass, but then users will complain that the compiler is too slow and switch to a faster, less optimizing compiler... So one important part of compiler development is to ensure that the pass order is good enough to handle most reasonable cases with a limited number of optimization passes.

This also means that simple micro-benchmarking may not give a true and fair view of how code is optimized in reality — I have seen many cases where micro-benchmarks in the compiler's test suite are optimized as expected, while the optimization is almost guaranteed not to trigger for real-world code due to pass-ordering issues. So your micro-benchmark may show that a code construct X is faster than code construct Y (or that compiler Z is better than compiler W), while the behavior in more complex real-world usage may be very different...

Thursday, November 24, 2016

Tentative variable definitions, and -fno-common

A comment in a Reddit thread on my previous blog post claims that the code is optimized if the variables are declared in the same translation unit as the usage. That is, the claim is that
int a;
int b;

int foo(void)
{
    return &a != &b;
}
compiles to code returning a constant value. This is true for C++
_Z3foov:
    movl    $1, %eax
    ret
but compiling as C still generates
foo:
    movl    $a, %eax
    cmpq    $b, %rax
    setne   %al
    movzbl  %al, %eax
    ret
The reason is that a file-scope declaration without an initializer (and without extern) is what the C standard calls a tentative definition. A tentative definition works in essentially the same way as if the variable were declared extern (with the difference that the linker creates the variable if it is not defined in any translation unit), so GCC must assume that the linker may assign the same address to a and b in the same way as for normal extern-declared variables. It is possible to make the variables "real" definitions by initializing them with a value, so changing the declarations to
int a = 0;
int b = 0;
enables the compiler to optimize the function in C too.
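
The linker behavior is easiest to see with a hypothetical two-file example:
/* file1.c */
int a;            /* Tentative definition. */

/* file2.c */
int a = 5;        /* Real definition. */

int main(void)
{
    return a;
}
Compiling this as "gcc file1.c file2.c" links fine, as the tentative definition in file1.c is emitted as a common symbol that the linker merges with the real definition in file2.c. With "gcc -fno-common file1.c file2.c", the tentative definition becomes a normal definition placed in the BSS segment, and the link fails with a multiple-definition error.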

One other way to make GCC optimize this is to specify the command-line option -fno-common, which disables the generation of the "common symbols" used for emitting tentative definitions. This has the additional benefit of improving code generation for some CPU architectures, such as ARM. As an example, consider the function
int a;
int b;

int foo(void)
{
    return a + b;
}
The ARM architecture needs the variables' addresses placed in registers in order to access the values, so the compiler will generate code similar to
foo:
    ldr     r2, .L2         /* Load address of a */
    ldr     r3, .L2+4       /* Load address of b */
    ldr     r0, [r2]        /* Load a */
    ldr     r3, [r3]        /* Load b */
    add     r0, r0, r3
    bx      lr
.L2:
    .word   a
    .word   b
The compiler could do a better job if it knew how the variables are laid out in memory, as it then only needs to load one address and can use relative addressing for the other. But that is not possible when a and b are tentative definitions, as the linker may place them wherever it chooses. Using -fno-common will, however, place a and b in the data (or BSS) segment exactly as they are laid out by the compiler, and the code can be optimized based on the knowledge that b is placed right after a in memory¹
foo:
    ldr     r2, .L2         /* Load address of a */
    ldr     r0, [r2]        /* Load a */
    ldr     r3, [r2, #4]    /* Load b */
    add     r0, r0, r3
    bx      lr
.L2:
    .word   a


¹ A real ARM compiler would use an ldmia instruction in this case, which is even better. But I chose to use this naive code generation in the example in order to show the general principle, without discussing too many ARM-specific details.

Thursday, November 10, 2016

"missing" optimizations — constant address comparison

Sometimes the compiler intentionally fails to optimize certain constructs for arguably good reasons. For example, compiling the following with GCC
extern int a;
extern int b;

int foo(void)
{
    return &a != &b;
}
generates code doing a comparison
foo:
    movl    $a, %eax
    cmpq    $b, %rax
    setne   %al
    movzbl  %al, %eax
    ret
even though the C standard ensures the addresses of a and b are different.

It seems to be a bit unclear why GCC keeps this comparison, but the discussion in GCC bug 78035 mentions the C defect report DR #078 and the expressiveness of the ELF format. DR #078 notes that
unsigned int h(void)
{
    return memcpy != memmove;
}
may return 0, which happens on implementations where the C standard library uses the same code for memcpy and memmove (the C language cannot do that, but the standard library does not need to be written in C). This does not mean that the compiler must be able to handle different symbols mapping to the same address — it only says that C programs must not assume too much about the standard library. But ELF does support exporting multiple symbols for the same address, and GCC tries to honor what ELF makes possible (such as the symbol interposition that limits optimizations for shared libraries).
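
It is easy to create such an object file with GCC's alias attribute. A minimal sketch (reusing the names from the example above):
int a = 0;
extern int b __attribute__((alias("a")));

int foo(void)
{
    return &a != &b;    /* Returns 0: both symbols have the same address. */
}
Folding the comparison to "not equal" would give the wrong result here, which illustrates why GCC is careful with this optimization.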

I'm not convinced it makes sense for GCC to keep these comparisons in the generated code — other optimizations, such as alias analysis, treat global symbols as having different addresses, so other optimizations are likely to make the code fail anyway if two symbols share an address. For example,
extern int a;
extern int b;

int foo(void)
{
    a = 1;
    b = 5;
    a++;
    return &a != &b;
}
optimizes the accesses to a and b as if they have different addresses, even though the comparison is emitted:
foo:
    movl    $a, %eax
    movl    $5, b(%rip)
    movl    $2, a(%rip)
    cmpq    $b, %rax
    setne   %al
    movzbl  %al, %eax
    ret
This missing optimization probably does not make any difference in reality (although I could imagine some macro or template relying on this being optimized), but this inconsistency in what is optimized annoys me...

Sunday, November 6, 2016

Inlining — shared libraries are special

The way shared libraries work affects how the code can be optimized, so GCC must be more conservative with inlining when building shared libraries (i.e. when compiling with -fpic or -fPIC).

Consider the functions
int foo(void)
{
    return 23;
}

int bar(void)
{
    return 19 + foo();
}
Compiling this with "gcc -O3" inlines foo into bar
foo:
    movl    $23, %eax
    ret
bar:
    movl    $42, %eax
    ret
but that is not the case when compiling using "gcc -O3 -fPIC"
foo:
    movl    $23, %eax
    ret
bar:
    subq    $8, %rsp
    call    foo@PLT
    addq    $8, %rsp
    addl    $19, %eax
    ret
The reason is that ELF permits symbols in shared libraries to be overridden by the dynamic linker — a typical use case is to use LD_PRELOAD to load a debug library containing logging versions of some functions. This has the effect that GCC cannot know that the foo above is the one that is really called by bar, and thus cannot inline it. Only exported symbols can be overridden, so functions in anonymous namespaces and static functions are optimized as usual, as are functions declared "extern inline" (the compiler is told to inline, so it may assume the function will not be overridden).
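
As a sketch of how such an override could look (the file and library names are made up):
/* log_foo.c -- a logging version of foo.  Build it with
   "gcc -fPIC -shared -o liblog.so log_foo.c" and start the program
   as "LD_PRELOAD=./liblog.so ./prog", and the dynamic linker will
   resolve all calls to foo to this version instead. */
#include <stdio.h>

int foo(void)
{
    fprintf(stderr, "foo called\n");
    return 23;
}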

The missed optimizations from this are especially noticeable when doing link-time optimization — the benefit of LTO is that the compiler can see the whole library and inline between files, but this is not possible if those functions may be replaced. This problem affects all interprocedural optimizations (such as devirtualization), not only inlining.

There are two ways to get GCC to optimize shared libraries in the same way as normal code:
  • Use the command-line option -fno-semantic-interposition
  • Avoid exporting unnecessary symbols by passing -fvisibility=hidden to the compiler, and manually export the needed symbols using a linker export map or by decorating the functions with
    __attribute__((__visibility__("default")))
    as in the sketch below
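
A minimal sketch of the second approach (the file name and build command are only for illustration):
/* lib.c -- build with
   "gcc -O3 -fPIC -fvisibility=hidden -shared -o libfoo.so lib.c" */
int foo(void)          /* Hidden, so it cannot be overridden. */
{
    return 23;
}

__attribute__((__visibility__("default")))
int bar(void)          /* Explicitly exported. */
{
    return 19 + foo(); /* foo can now be inlined, so this becomes "return 42". */
}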
    

Monday, October 17, 2016

Inlining — main is special

I wrote in my previous post that I assumed Jason compiled using the -O2 optimization level for the example in his CppCon 2016 talk, as it worked for me when I used -O3. But I was wrong — Jason did use -O3, but his example was slightly different from mine. I used the function
int bar()
{
    return std::string("a").size() + std::string("b").size();
}
that optimizes to just returning a constant, while Jason used
int main()
{
    return std::string("a").size() + std::string("b").size();
}
which does not optimize. The only difference is the name of the function...

The reason for the difference is that GCC knows that main is special — it has the property that it is called only once. That is, it is a cold function, which makes the inlining less aggressive.
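
As far as I know, a similar effect can be had for other functions with GCC's cold attribute, which tells the compiler that the function is rarely executed. A hypothetical sketch (whether this prevents the string code from being inlined may depend on the GCC version):
#include <string>

__attribute__((cold))
int bar()
{
    return std::string("a").size() + std::string("b").size();
}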

GCC propagates the "called only once" information where possible, so functions for which the compiler can see all callers (static functions, functions in an anonymous namespace, or all functions when compiling with -fwhole-program) are also marked as "called only once" if they are called exactly once from such a function. For example, the compiler can see that bar is called only once in
static int bar()
{
   return std::string("a").size() + std::string("b").size();
}

int main()
{
    return bar();
}
and the string code is not inlined. Removing the static prevents GCC from marking bar as called only once (as it could be called from outside the translation unit), and the string code will now be inlined.

On the other hand, if the "called only once" function contains loops, then GCC can infer that it makes sense to optimize the content of the loop as it is executed multiple times. So for example
int main()
{
    int k = 0;
    for (int i = 0; i < 10000; i++)
        k += std::string("a").size() + std::string("b").size();
    return k;
}
gets the strings inlined, and it optimizes to
main:
    mov     eax, 20000
    ret

Sunday, October 16, 2016

C++ and code inlining

Jason Turner mentions in his CppCon 2016 talk that GCC optimizes the function
int foo()
{
    return std::string("a").size();
}
to just a return of a constant value
_Z3foov:
    movl    $1, %eax
    ret
while
int bar()
{
    return std::string("a").size() + std::string("b").size();
}
generates a less optimized result involving calls to constructors, destructors, etc. The difference between the two cases comes from how GCC handles inlining.

The basic idea behind the GCC inliner is to inline greedily as long as it estimates that the code size does not increase too much (where the growth limit depends on the optimization level, whether the function is hot or cold, etc.). One important special case is functions that are called exactly once — they do not increase code size when inlined, as the inlining just moves the code from the callee into the caller.

STL usage often expands to a large amount of code, for example, the string constructor used in the examples above calls
template<typename _CharT, typename _Traits, typename _Alloc>
  template<typename _InIterator>
    void
    basic_string<_CharT, _Traits, _Alloc>::
    _M_construct(_InIterator __beg, _InIterator __end,
                 std::forward_iterator_tag)
    {
      // NB: Not required, but considered best practice.
      if (__gnu_cxx::__is_null_pointer(__beg) && __beg != __end)
        std::__throw_logic_error(__N("basic_string::"
                                     "_M_construct null not valid"));

      size_type __dnew =
        static_cast<size_type>(std::distance(__beg, __end));

      if (__dnew > size_type(_S_local_capacity))
        {
          _M_data(_M_create(__dnew, size_type(0)));
          _M_capacity(__dnew);
        }

      // Check for out_of_range and length_error exceptions.
      __try
        { this->_S_copy_chars(_M_data(), __beg, __end); }
      __catch(...)
        {
          _M_dispose();
          __throw_exception_again;
        }

      _M_set_length(__dnew);
    }
This in turn calls several other functions, and more than 50 function calls need to be inlined in order to fully optimize foo. The temporary code increase from this inlining, before the code gets optimized, is over the limit allowed by -O2, but foo constructs exactly one string, so the string functions are called exactly once, and the compiler can inline them without increasing code size and eventually eliminate everything. But the bar function does not get the constructors inlined, as it constructs two strings...

This can have surprising effects when trying to understand how well code optimizes. For example, we saw that foo optimizes to return a constant value, so let us add one more identical function
int foo()
{
    return std::string("a").size();
}

int foo2()
{
    return std::string("b").size();
}
The translation unit now constructs two strings, so the constructor is called twice and the "called exactly once" reasoning no longer applies, which prevents the inlining. Neither foo nor foo2 will therefore be fully optimized when compiled with -O2.

The -O3 optimization level allows more inlining, so all examples in this blog post optimize fully when compiled using -O3 (so I assume Jason used -O2 for his examples). In general, -O2 does optimizations that do not involve a space-speed tradeoff, while -O3 allows the code size to increase if the resulting code is faster (which is good if you have enough cache).
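
If compiling everything with -O3 is not wanted, GCC can raise the optimization level for individual functions through the optimize attribute. A hypothetical sketch (note that the GCC documentation says this attribute is mainly intended for debugging, so I would not rely on it in production code):
#include <string>

__attribute__((optimize("O3")))
int foo()
{
    return std::string("a").size();
}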

To conclude:
  • It is not enough to look at trivial code snippets if you want to verify that your complex templated code is being optimized as you expect.
  • You may want to use -O3 rather than -O2 when compiling code for modern architectures.

Sunday, September 25, 2016

The cost of clearing memory

The technical report "The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V" looks at how the RISC-V ISA compares to ARM and Intel by analyzing the results from running SPEC CINT2006. One thing that surprised me was that 30% of the instructions executed when running the 403.gcc benchmark (which compiles a few files using a modified GCC 3.2) come from doing memset!

The RISC-V memset loop writes 16 bytes in 4 instructions
// RV64G, 4 instructions to move 16 bytes
4a3814:   sd    a1, 0(a4)
4a3818:   sd    a1, 8(a4)
4a381c:   addi  a4, a4, 16
4a3820:   bltu  a4, a3, 4a3814
which is somewhat less efficient than ARM and Intel, which write more data per instruction
// armv8, 6 instructions to move 64 bytes
6f0928:   stp   x7, x7, [x8,#16]
6f092c:   stp   x7, x7, [x8,#32]
6f0930:   stp   x7, x7, [x8,#48]
6f0934:   stp   x7, x7, [x8,#64]!
6f0938:   subs  x2, x2, #0x40
6f093c:   b.ge  6f0928
so, scaling by the number of instructions needed per byte (6/64 for ARM against 4/16 for RISC-V), this should translate to roughly 10% of the executed ARM instructions doing memset. But that is still much more than I would have guessed.

I do not have access to SPEC, and I have been too lazy to try to replicate this with other data, but a quick literature search indicates that this is not as insane as I thought. The papers I have found look at the cost of clearing data in garbage collection implementations, and they seem to get a similar result for the cost. For example, "Why Nothing Matters: The Impact of Zeroing" says
We show that existing approaches of zero initialization are surprisingly expensive. On three modern IA32 architectures, the direct cost is around 2.7-4.5% on average and as much as 12.7% of all cycles, in a high-performance Java Virtual Machine (JVM), without accounting for indirect costs due to cache displacement and memory bandwidth consumption.