Saturday, December 29, 2018

Commodore 64 programming

I have done some Commodore 64 programming over the holidays. The C64 is old, but I think there are things that can be learned from dealing with all the hardware limitations (although there may be better ways of getting this experience with more recent platforms...)

C64 Hardware

The 6502 instruction set consists of 56 very simple instructions, and the CPU has been completely reversed engineered so you can follow how it does its work – Visual 6502 even let you visualize transistor-level Simulation of the CPU in the browser! The details of the CPU are described in the talk “Reverse Engineering the MOS 6502 CPU”:


The rest of the C64 hardware is also well described by a talk – “The Ultimate Commodore 64 Talk”:


There are lots of information about C64 available on the net, but I have not found any convenient place that collects the relevant information... But everything I needed (including examples/tutorials) was available at C64 Wiki and Codebase 64.  The details of the Video Interface Chip is documented in “The MOS 6567/6569 video controller (VIC-II) and its application in the Commodore 64”, and Andre Weissflog’s chip emulators contains very readable code if you want clarifications on what the various hardware registers do.

Getting started – raster bars

My first test was to implement two raster bars and some sprites moving around, just to verify that I understood the basics.


Much of C64 programming centers around getting around hardware limitations by timing the code carefully. For example, the raster bars are drawn in the border of the screen, but the border can only have one color. It is, however, possible to modify the border color while the hardware is drawing the screen (the C64 hardware is designed for old CRT monitors that draw the picture one line at a time using an electron beam), so we can draw the raster bars by changing the border color exactly when a new line starts!

I had thought the raster bars should be trivial (just enable an interrupt on a line, and change the colors) but the C64 is not fast enough for this – it takes 9-16 cycles to enter the IRQ, so we are already a bit into the screen when we can change the color. And there are only 63 CPU cycles for each line, so we don’t have the time to set up a new interrupt for the next line anyway. We, therefore, need to first synchronize our code with the start of a line, and then write the code (using NOPs etc. to pad it out) so that we change the colors exactly every 63 cycles.

But there are more complications – some lines have less than 63 cycles available to do the work. The reason is that the VIC-II chip steals cycles from the CPU for lines where it must fetch extra data. There are two cases:
  • The first raster line of each text line must fetch the characters, which steals one cycle per character. These lines are usually called “bad lines”.
  • Each sprite that is enabled on the raster line steals 2 cycles.
There is, in addition, an overhead to this cycle stealing mechanism that may waste up to 3 cycles per raster line (depending on what the CPU is doing when the cycle stealing starts).

I was lazy and ensured that my raster bars were not on a bad line and that there were no sprites on the same raster line, but careful programming can handle this kind of things too. The talk “Behind the scenes of a C64 demo” mentions some insane techniques, such as treating the cycle counter as an address and jump to it (this jumps to different addresses depending on how many cycles have executed, and careful layout of the code can make this compensate for differences in execution time).


Source code

The source code for my test program is available below, and you can build it using the ACME Cross-Assembler as
acme -o test.prg test.asm
The resulting program test.prg can be run in the VICE emulator by loading it using the “Smart attach Disk/Tape” option.

Sunday, October 21, 2018

Optimization and performance measurement

The performance of programs depend surprisingly much on things that “should not” matter (such as the order files are passed to the linker), and modifying one part of the program may change the performance of other parts.

An example

We will look at the libquantum library as an example. Configuring and compiling this as
./configure
make demos
produces two demo applications. Running the shor application as
shor 456 427
takes 11.0 seconds1 on my PC (Ubuntu 16.04, Intel Broadwell CPU).

The performance of the shor application is relatively sensitive to changes, and we can see surprising performance differences when modifying the code. For example, adding this function
void foo(int *a, int *b, int *c, int n)
{
  for (int i = 0; i < n; i++)
    *a++ = *b++ + *c++;
}
at the top of measure.c makes the program need 11.8 seconds to run – it has become 7% slower by just adding an unused function!

Why we get the performance difference

The only effect of adding the unused function is that some other functions are moved to different addresses2, so it is natural to attribute the performance difference to “cache effects“. But there are several other hardware bottlenecks that can cause this kind of performance variability – this particular case is due to branch prediction.

Branch predictors keep track of the histories of branches by building various tables in hardware (see Dan Luu’s great blog post). But the branch prediction hardware looks only at a few bits of the addresses, so several branches may map to the same slot, with the result that the branch predictor cannot differentiate between them. Changing the size of one code segment makes other parts of the program move to different addresses, and some branches may be unlucky in that they now alias with branches behaving in a different way, which cause the performance to regress due to the excessive misprediction.

Impact of this when optimizing

All books and conference talks about optimization tell you to always measure your change to verify that it actually improves the performance before committing. This is good advice, but just measuring the running time is not enough! For example, if our change to libquantum was a real improvement that made the code 5% faster, then we would still have seen a 2% regression due to the (unrelated) branch prediction issues. But this regression is just us being unlucky with code placement, and further changes may make us lucky again, so it would probably have been better long-term to do the change. Similarly, I have seen too many bad changes done because “we measured it, and this made the program faster (but we do not understand why...)“.

So it is often useful to dig deeper and understand why we get the performance result so that we can attack the root cause and not rely on being lucky – for the libquantum example we could ensure the branches in the sensitive part of the code are located near each other in memory (for example, by forcing the code to be inlined) or try to eliminate some branches by algorithmic changes.

Further reading



1. There are some randomized algorithms in libquantum, so I hardcoded a seed in order to measure the same work in each run:
--- shor.c.orig 2007-09-03 08:47:55.000000000 +0200
+++ shor.c 2018-10-20 21:18:45.026076220 +0200
@@ -37,7 +37,7 @@
   int N;
   int c,q,a,b, factor;
 
-  srand(time(0));
+  srand(123456);
 
   if(argc == 1)
     {
The running time of this is stable (11.0 seconds with a standard deviation of 0.014).
2. It could also have made the compiler make different inlining decisions etc., but I have verified that the only difference for this program is that some code is placed on different addresses compared to the original version.

Saturday, August 11, 2018

Inlining and microbenchmarking

Inlining is important for C++ performance, but the compiler must be careful not to increase the code size too much. GCC’s inlining process is basically inlining functions sorted by priority until a growth percentage limit is hit (or until all relevant functions have been inlined), which works fine for large translation units where the parts of the code not helped by inlining can compensate for the code increase in the parts where inlining helps. But it works less well for small translating units that need much inlining, which is common when doing microbenchmarks.

Take for example this quick-bench benchmark from Bartlomiej Filipek’s blog post “Speeding Up string_view String Split Implementation” that measures the performance for different ways of splitting a string.

We can see that StringSplitStd is about 9% faster than StringViewSplit (they have the score 7829 and 7189), but the reason for this difference is that GCC has inlined everything for StringSplitStd, but not for StringViewSplit.

We get a different result if we run the benchmark with a different set of functions. For example, removing StringViewSplitStd and StringViewSplitPtr from the benchmark makes the compiler make different inlining decisions, and we get the same performance for both StringSplitStd and StringViewSplit (quick-bench).

It is a good idea when doing microbenchmarking to check that the compiler makes the same inlining decisions as when the code is used in a real use case (in a realistically sized translation unit).

Saturday, July 28, 2018

Don’t trust quick-bench results you see on the internet

It is easy to use quick-bench to benchmark small C++ functions, but it is hard to ensure they are measuring what is intended. Take this benchmark as an example. It is measuring different ways of transforming a string to lower case using code of the form
void by_index(char* s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        s[i] = tolower(s[i]);
}

static void LowerByIndex(benchmark::State& state)
{
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        char upper_string[] = "UPPER TEXT";
        by_index(upper_string, 10);

        // Make sure the variable is not optimized away by compiler
        benchmark::DoNotOptimize(upper_string);
    }
}

// Register the function as a benchmark
BENCHMARK(LowerByIndex);
We are going to look at how compilers may optimize it in ways that were probably not intended.

How a compiler may optimize the code

The following corresponds to how GCC optimizes this on Linux when using the libc++ header files.1

The tolower function is defined in /user/include/ctype.h as
extern const __int32_t **__ctype_tolower_loc (void)
    throw () __attribute__ ((__const__));

extern __inline __attribute__ ((__gnu_inline__)) int
tolower (int __c) throw ()
{
    return __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
}
This is inlined into by_index, which in turn is inlined into LowerByIndex
static void LowerByIndex(benchmark::State&amp; state)
{
    // Code inside this loop is measured repeatedly
    for (auto _ : state) {
        char upper_string[] = "UPPER TEXT";
        for (size_t i = 0; i < 10; ++i) {
            int __c = upper_string[i]);
            upper_string[i] = __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
        }

        benchmark::DoNotOptimize(upper_string);
    }
}

A char has values in the range -128 to 127 when compiling for the x86 architecture, so the compiler determines that the comparisons in
upper_string[i] = __c >= -128 && __c < 256 ? (*__ctype_tolower_loc())[__c] : __c;
are always true (as __c is assigned from a char), and the line is simplified to
upper_string[i] = (*__ctype_tolower_loc())[__c];

The __ctype_tolower_loc function is decorated with the const attribute so the function call can be moved out of the loop, provided the loop body is always executed. Compilers typically represent for-loops as an if-statement followed by a do-while loop – for example
for (i = 0; i < n; ++i) {
    do_something();
}
is represented as
i = 0;
if (i < n) {
    do {
        do_something();
        ++i;
    } while (i < n);
}
This representation simplifies the work for the compiler as the pre-condition is separated from the actual loop, and constant expressions can now trivially be moved out of the loop. The if-statement is simplified to always true (or always false) when the iteration count is known at compile time. Rewriting the loops, and moving __ctype_tolower_loc out of the loops gives us the result
static void LowerByIndex(benchmark::State&amp; state)
{
    auto it = state.begin()
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[] = "UPPER TEXT";
            size_t i = 0;
            do {
                int __c = upper_string[i]);
                upper_string[i] = p[__c];
                ++i;
            } while (i < 10);

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
Note that the call to __ctype_tolower_loc is now outside of the code segment being measured!

The inner loop is small, so it is fully unrolled
static void LowerByIndex(benchmark::State&amp;amp; state) {
    auto it = state.begin()
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[] = "UPPER TEXT";
            int __c0 = upper_string[0]);
            upper_string[0] = p[__c0];
            int __c1 = upper_string[1]);
            upper_string[1] = p[__c1];
            int __c2 = upper_string[2]);
            upper_string[2] = p[__c2];
            int __c3 = upper_string[3]);
            upper_string[3] = p[__c3];
            int __c4 = upper_string[4]);
            upper_string[4] = p[__c4];
            int __c5 = upper_string[5]);
            upper_string[5] = p[__c5];
            int __c6 = upper_string[6]);
            upper_string[6] = p[__c6];
            int __c7 = upper_string[7]);
            upper_string[7] = p[__c7];
            int __c8 = upper_string[8]);
            upper_string[8] = p[__c8];
            int __c9 = upper_string[9]);
            upper_string[9] = p[__c9];

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
All accesses to upper_string are now to known positions, so the compiler can easily forward the values written to where they are read, and the generated code does not need to initialize or read from upper_string
static void LowerByIndex(benchmark::State&amp;amp; state) {
    auto it = state.begin()
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[10];
            upper_string[0] = p['U'];
            upper_string[1] = p['P'];
            upper_string[2] = p['P'];
            upper_string[3] = p['E'];
            upper_string[4] = p['R'];
            upper_string[5] = p[' '];
            upper_string[6] = p['T'];
            upper_string[7] = p['E'];
            upper_string[8] = p['X'];
            upper_string[9] = p['T'];

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
Finally, several characters are the same, so the compiler can CSE (Common Subexpression Elimination) them to only load these values once from p
static void LowerByIndex(benchmark::State&amp;amp; state)
{
    auto it = state.begin()
    if (it != state.end()) {
        int *p = *__ctype_tolower_loc();

        // Code inside this loop is measured repeatedly
        do {
            char upper_string[10];
            upper_string[0] = p['U'];
            upper_string[1] = upper_string[2] = p['P'];
            upper_string[3] = upper_string[7] = p['E'];
            upper_string[4] = p['R'];
            upper_string[5] = p[' '];
            upper_string[6] = upper_string[9] = p['T'];
            upper_string[8] = p['X'];

            benchmark::DoNotOptimize(upper_string);

            ++it;
        } while (it != state.end());
    }
}
That is, the compiler has used its knowledge of the input to basically hard code the result (this may be OK, depending on what the benchmark is trying to measure). And it has moved some code out of the code segment being measured (which is probably not OK for the benchmark).

How to fix the benchmark

I usually prefer keeping the benchmarked function in a separate translation unit in order to guarantee that the compiler cannot take advantage of the code setting up the benchmark, but that does not work in quick-bench. One way to get a similar effect is to mark the function as noinline, but that only solves part of the problem – compilers do various interprocedural optimizations, and for GCC you should specify at least noclone too. Other compilers may need to be restricted in different ways.

It may also be possible to hide information from the compiler by using volatile or functionality from the benchmarking framework (such as benchmark::DoNotOptimize and benchmark::ClobberMemory), but this may also introduce unintended behavior. For example, these workarounds make the code look “unusual” to the compiler, which may make various heuristics make different optimization decisions compare to normal usage.

In general, you need to spend some time analyzing the benchmark in order to determine what the result means (for example, are we measuring the difference in how fast different methods can transform a string, or are we only measuring the difference for the string “UPPER TEXT”?), or as Fabian Giesen says in “A whirlwind introduction to dataflow graphs
With microbenchmarks, like a trial lawyer during cross-examination, you should never ask a question you don’t know the answer to (or at least have a pretty good idea of what it is). Real-world systems are generally too complex and intertwined to understand from surface measurements alone. If you have no idea how a system works at all, you don’t know what the right questions are, nor how to ask them, and any answers you get will be opaque at best, if not outright garbage. Microbenchmarks are a useful tool to confirm that an existing model is a good approximation to reality, but not very helpful in building these models to begin with.


1. The libstdc++ headers use a normal function call for tolower instead of using the inlined version from ctype.h. You can see the optimization from this blog post using libstdc++ too by including ctype.h before any C++ header (but this is not possible in quick-bench, as it adds its own headers before the user code).

Monday, June 25, 2018

Useful GCC address sanitizer checks not enabled by default

Some useful address sanitizer checks are disabled by default because they are relatively expensive (or, as for the std::vector checking, need to be enabled for all translation units).

Use after return

The address sanitizer warns when a variable is used after it has gone out of scope in a function, but it does not warn when the variable is used after the function return. That can, however, be enabled by adding detect_stack_use_after_return=1 to the ASAN_OPTIONS environment variable.

Example

int *ptr;

__attribute__((noinline))
void foo(void)
{
  int a;
  ptr = &a;
}

int main(void)
{
  foo();
  return *ptr;  // Error
}
Compile as
gcc -O -fsanitize=address file.c
and add detect_stack_use_after_return=1 to the ASAN_OPTIONS environment variable before running the program
env ASAN_OPTIONS="detect_stack_use_after_return=1" ./a.out

Pointer comparison

It is not valid to compare two pointers from different objects using the relational operators <, <=, >, and >=. This can be detected by compiling with -fsanitize=address,pointer-compare and adding detect_invalid_pointer_pairs=1 to the ASAN_OPTIONS environment variable.

Note: -fsanitize=pointer-compare was added in GCC 8.

Example

#include <stdlib.h>

int main(void)
{
  char *p = malloc(42);
  char *q = malloc(42);

  int tmp = p < q;  // Error

  free(p);
  free(q);

  return tmp;
}
Compile as
gcc -fsanitize=address,pointer-compare file.c
and add detect_invalid_pointer_pairs=1 to the ASAN_OPTIONS environment variable before running the program
env ASAN_OPTIONS="detect_invalid_pointer_pairs=1" ./a.out

Pointer subtraction

It is not valid to subtract pointers that point into different objects. This can be detected by compiling with -fsanitize=address,pointer-subtract and adding detect_invalid_pointer_pairs=1 to the ASAN_OPTIONS environment variable.

Note: -fsanitize=pointer-subtract was added in GCC 8.

Example

#include <stdlib.h>

int main(void)
{
  char *p = malloc(42);
  char *q = malloc(42);

  int tmp = p - q;  // Error

  free(p);
  free(q);

  return tmp;
}
Compile as
gcc -O -fsanitize=address,pointer-subtract file.c
and add detect_invalid_pointer_pairs=1 to the ASAN_OPTIONS environment variable before running the program
env ASAN_OPTIONS="detect_invalid_pointer_pairs=1" ./a.out

std::vector checking

The address sanitizer does not detect out-of-bounds accesses to the unused capacity of a vector, such as
std::vector<int> v(2);
int* p = v.data();
v.pop_back();
return p[1];  // Error
because the memory is valid, even though it is an error to use it. It is possible to make the address sanitizer warn for this by compiling with -D_GLIBCXX_SANITIZE_VECTOR which makes libstdc++ annotate the memory so that the validity can be tracked. The annotations must be present on all vector operations or none, so this macro must be defined to the same value for all translation units that create, destroy or modify vectors.

Note: _GLIBCXX_SANITIZE_VECTOR was added in the GCC 8 libstdc++.

Example

#include <vector>

int main()
{
  std::vector<int> v(2);
  int* p = v.data();
  v.pop_back();
  return p[1];  // Error
}
Compile as
g++ -O -fsanitize=address -D_GLIBCXX_SANITIZE_VECTOR file.cpp

Sunday, June 10, 2018

On an example from “What Else Has My Compiler Done For Me Lately?”

One of the examples in Matt Godbolt’s C++Now 2018 talk “What Else Has My Compiler Done For Me Lately?” is the function
void maxArray(double * __restrict x, double * __restrict y)
{
  for (int i = 0; i < 65536; i++) {
    if (y[i] > x[i])
      x[i] = y[i];
  }
}
The compiler generates vectorized code that processes four elements at a time – it reads the elements from x and y, compares the elements, and uses the result of the comparison as a mask in a masked move to write the elements from y that are larger than the corresponding element from x
vmovupd ymm0, ymmword ptr [rsi + rax]
vmovupd ymm4, ymmword ptr [rdi + rax]
vcmpltpd ymm4, ymm4, ymm0
vmaskmovpd ymmword ptr [rdi + rax], ymm4, ymm0
Modifying maxArray to use a more max-like construct (or std::max) as in
void maxArray2(double * __restrict x, double * __restrict y)
{
  for (int i = 0; i < 65536; i++) {
    x[i] = (y[i] > x[i]) ? y[i] : x[i];
  }
}
makes the compiler generate this using a “max” instruction instead of the compare and masked move
vmovupd ymm0, ymmword ptr [rsi + rax]
vmaxpd ymm0, ymm0, ymmword ptr [rdi + rax]
vmovupd ymmword ptr [rdi + rax], ymm0

Matt says he is a bit surprised that the compiler cannot see that the first version too can be generated in this way, but the compiler is doing the right thing – it is not allowed to change maxArray in this way! The reason is that maxArray only writes to x when the value changes while maxArray2 always writes to x, and the compiler would introduce problems if the generated code contain stores that are not in the original source code. Consider for example the program
const double a1[65536] = {0.0};
double a2[65536] = {0.0};

int main(void)
{
  maxArray((double*)a1, a2);
  return 0;
}
that is passing a constant array to maxArray. It is valid to cast away const as long as the object is not written to through the pointer, so this program is correct – y[i] is never bigger than x[i] for any i, so maxArray will never write to (the mask in the vectorized code is never set, so the vmaskmovpd instruction is essentially a nop). The code from maxArray2 does, however, always write to x so it would crash on this input as the compiler places a1 in read-only memory.

Thursday, April 12, 2018

Compilation time – -std=c++98 vs. -std=c++11

I saw on twitter that it takes more than twice as much time compiling the Ogre graphics engine using GCC than when using Clang. My experience is that GCC and Clang usually compile with similar speed, so I decided to look into why compiling Ogre is different.

It turned out that a big part of the difference comes from which C++ version the compilers use per default – clang-5.0 defaults to C++98, while gcc-7 defaults to a newer version. Forcing the compilers to use C++98 by passing -std=c++98 makes GCC compile Ogre in about half the time (668s vs. 1135s), while passing -std=c++11 nearly doubles the time Clang needs to compile it!

One reason for this difference is that some of the standard include files are more expensive in C++11 mode as they suck in more dependencies. For example, compiling a file containing just the line
#include <memory>
takes 0.16 seconds on my computer when using C++11
> time -p g++ -O2 -c test.cpp -std=c++11
real 0.16
user 0.14
sys 0.01
while compiling it as C++98 is faster
> time -p g++ -O2 -c test.cpp -std=c++98
real 0.02
user 0.01
sys 0.01
The 0.14-second difference may not seem that big, but it makes a difference when, as for Ogre, you are compiling more than 500 files, each taking about one second. The increased cost of including standard header files for C++11 compared to C++98 adds about 20% to the Ogre build time.

It is a bit unclear to me exactly where the rest of the slowdown comes from, but it seems to be spread all over the code (I tried to remove various classes in the Ogre code base, and removing 10% of the source code seems to affect both the fast and slow version by about 10%) so I assume this is just because the templates in the C++11 STL are more complex and the compiler needs to work a bit harder each time they are used...

Anyway, the difference in compilation time between -std=c++98 and -std=c++11 was much bigger than I had guessed, and I’ll now ensure I use -std=c++98 when building C++98 code.


Updated: The original blog post said that gcc-7 uses C++11 per default. That was wrong, it defaults to C++14.

Saturday, March 3, 2018

Detecting incorrect C++ STL usage

The GCC and LLVM sanitizers are great for finding problems in C++ code, but they do not detect problems with incorrect STL usage. For example, std::list::merge merges two sorted lists, so the following program is incorrect (as list2 is not sorted)
#include <iostream>
#include <list>

int main()
{
  std::list<int> list1 = {1, 2, 3, 4};
  std::list<int> list2 = {5, 3, 4, 2};
  list1.merge(list2);
  for (auto x : list1)
    std::cout << x << '\n';
}
but this cannot be detected by the sanitizers. The libstdc++ library does, however, have a “debug mode” that can be used to detect this kind of problems. The debug mode is enabled by passing -D_GLIBCXX_DEBUG to the compiler
g++ -O2 -D_GLIBCXX_DEBUG example.cpp
which enables assertions checking the preconditions, and the program fails at runtime
/scratch/gcc-7.2.0/install/include/c++/7.2.0/debug/list:716:
Error: elements in iterator range [__x.begin().base(), __x.end().base())
are not sorted.

Objects involved in the operation:
    iterator "__x.begin().base()" @ 0x0x7fff1754ea10 {
      type = std::__cxx1998::_List_iterator;
    }
    iterator "__x.end().base()" @ 0x0x7fff1754ea40 {
      type = std::__cxx1998::_List_iterator;
    }
Abort (core dumped)

There are a few constructs that are invalid according to the C++ standard but that libstdc++ handles as an extension. One such example is inserting a range of a list into the list the range points to, as in
#include <iostream>
#include <list>

int main()
{
  std::list<int> list1 = {1, 2, 3, 4};
  list1.insert(list1.begin(), list1.begin(), list1.end());
  for (auto x : list1)
    std::cout << x << '\n';
}
The debug mode does not report errors for these extensions when using -D_GLIBCXX_DEBUG, but the debug mode can be made more pedantic by adding -D_GLIBCXX_DEBUG_PEDANTIC
g++ -O2 -D_GLIBCXX_DEBUG -D_GLIBCXX_DEBUG_PEDANTIC example.cpp
which reports errors for these too.

One annoyance with the debug mode is that it changes the size of some standard class templates, so you cannot pass containers between translation units compiled with and without debug mode – this often means that you need to build the whole application with debug mode enabled.

The libstdc++ debug mode was introduced in GCC 3.4.

Sunday, February 4, 2018

GCC command options for debugging – -Og and -g3

I listened to a recent CppCast episode where Balázs Török talked about game development and mentioned that C++ abstractions make the code unusable slow in debug builds, and that he is skeptical of debuggability of meta-classes as today's debuggers cannot even handle macros. I think both problems are solvable, and I would argue that GCC is already handling at least the first issue, provided the right command options are used.

Debugging optimized code

GCC can generate debug information when optimizing, so it is possible to run fully optimized code in the debugger. But this is not too useful in reality – many optimizations change the structure of the code, so it is often impossible to single-step in the resulting binary as instructions from different parts of the program are interleaved...

The GCC developers have traditionally tried to limit the damage done by the optimizers for the -O1 optimization level, but -O1 is often used for release builds too, so there is a limit to how much optimizations can be disabled without annoying too many developers – a new optimization level, -Og, were therefore introduced in GCC 4.8. The -Og optimization level enables the optimizations that do not interfere with debugging, and it may even result in a better debugging experience than when compiling without optimizations, as some optimization passes collect information useful for generating better debug information.

The difference between -O1 and -Og is that
  • -Og disables some optimizations, such as if-conversion, that simplifies control flow, so the structure of the generated code is roughly the same as in the source code.
  • -Og disables some passes, such as -ftree-pta and -ftree-sra, that help other optimization passes by propagating information about memory accesses throughout the functions. The effect of this is that those passes now optimize with only the information available locally within each basic block.
  • -Og is less aggressive in the back end peephole optimizations, so each generated instruction is less likely to execute functionality from several statements in the source code.
How much this affects the performance depends a lot on coding style etc., but modern CPUs are great at hiding inefficiencies using branch-prediction, speculative execution, and store to load forwarding, so the difference between -Og and fully optimized code is often surprisingly small, even when using the STL.

Debug information for macros

GCC does not put information about macros in the debug information per default, but it is possible to add it by passing -g3 to the compiler. This makes GDB know of the macros and enables some macro-related commands. But the debugging experience is still not that great. I do not understand why a better support has not been implemented – possibly because inline functions should be used instead of macros...

Sunday, January 21, 2018

GCC back end performance tuning

This is part six of a series “Writing a GCC back end”.

Cost model – TARGET_RTX_COSTS

The compiler often has different options for how it can optimize and emit the code. For example, dividing a 32-bit integer by the constant value 3
i = i / 3;
can be generated as a division instruction, but division instructions are slow, so it may be better to generate this as the equivalent of
i = (((int64_t)i * 0x55555556) >> 32) - (i >> 31);
which gives the same result. GCC decides which to generate by preparing RTL for the different alternatives, and queries the target’s cost function (if implemented) to determine which is the cheapest. Which alternatives are tried depends on which insns are defined in the target description, and what constraints the insns have, but the compiler will, in this case, ask for the costs of subexpressions of the form
(truncate:SI (lshiftrt:DI (mult:DI (sign_extend:DI (reg:SI 88))
                                   (const_int 0x55555556))
                          (const_int 32)))
and compares their combined costs with
(div:SI (reg:SI 88) (const_int 3))

Implementing this cost function is relatively easy for simple architectures – it consists of a switch case for each operation returning the cost expressed as the number of nop instructions  (which usually means the number of cycles)
static bool
machine_rtx_costs (rtx x, machine_mode mode, int outer_code, int opno,
                   int *total, bool speed)
{
  switch (GET_CODE (x))
    {
    case CONST_INT:
      *total = 0;
      return true;

    case AND:
    case IOR:
    case XOR:
      *total = COSTS_N_INSNS (GET_MODE_SIZE (mode) > UNITS_PER_WORD ? 2 : 1);
      return false;

    case ABS:
      *total = COSTS_N_INSNS (FLOAT_MODE_P (mode) ? 1 : 3);
      return false;

    // ...

    default:
      return false;
    }
}

#undef TARGET_RTX_COSTS
#define TARGET_RTX_COSTS machine_rtx_costs
Returning true from cost function means that it has written the cost of the whole RTL expression x to *total, and returning false means just for the first operation in the expression (in which case the cost function will be called separately on the arguments).

The cost function gets complicated fast as the CPU gets more complex with different costs depending on how the operations combine with other operations. For example, an addition may have different cost if it can be done as part of the addressing mode in a memory operation
(set (reg:SI 90)
        (mem:SI (plus:SI (reg:SI 88) (reg:SI 89))))
compared to if it is a normal addition
(set (reg:SI 90)
        (plus:SI (reg:SI 88) (reg:SI 89)))
But the cost function does not need to be too exact – there are many optimizations running after the optimization passes calling the cost function (and it is not obvious what the cost even mean for superscalar out-of-order CPUs anyway...).

Cost model – more configuration options

There are about 30 additional macros guiding performance-related decisions during compilation. These cover various properties such as relative cost of different addressing modes and registers, how expensive branches are compared to arithmetic operations, and when the compiler should unroll/inline memcpy instead of calling the function in the library.

See “Describing Relative Costs of Operations” in “GNU Compiler Collection Internals” for a list of these macros.

Peephole optimizations

The define_peephole2 definition in the target description takes a sequence of insns and transforms them to a new sequence of insns, working in essentially the same way as define_expand. This is used to take advantage of target-specific instructions that the generic peep-hole optimizations cannot do.

But there is in general not much need to write peephole optimizations – the insns describes exactly what they do in the RTL pattern, so GCC can reason about the insns and combine them when possible. So missing peephole optimizations are in general deficiencies in the machine description, such as missing define_insn, too conservative constraints (so that GCC does not believe the transformation is allowed), incorrect cost model (so it seems to be slower), etc.

Tuning optimization passes for the target architecture

GCC lets the user enable or disable optimization passes (using -f-options) and change different thresholds (using --param) when compiling. The default value for all of these can be set by macros in the target-specific configuration file gcc/common/config/machine/machine-common.c.

TARGET_OPTION_OPTIMIZATION_TABLE is used to enable or disable optimization passes at the various optimization levels. For example, this code snippet from the i386 backend enables -free for -O2 and higher optimization levels, and disables -fschedule-insns for all optimization levels
static const struct default_options machine_option_optimization_table[] =
  {
    /* Enable redundant extension instructions removal at -O2 and higher.  */
    { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 },
    /* Turn off -fschedule-insns by default.  It tends to make the
       problem with not enough registers even worse.  */
    { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },

    { OPT_LEVELS_NONE, 0, NULL, 0 }
  };

#undef TARGET_OPTION_OPTIMIZATION_TABLE
#define TARGET_OPTION_OPTIMIZATION_TABLE machine_option_optimization_table

TARGET_OPTION_DEFAULT_PARAMS is used to set the default value for --param parameters. For example, the default value of the parameter l1_cache_line_size is modified as
static void
machine_option_default_params (void)
{
  set_default_param_value (PARAM_L1_CACHE_LINE_SIZE, 16);
}

#undef TARGET_OPTION_DEFAULT_PARAMS
#define TARGET_OPTION_DEFAULT_PARAMS machine_option_default_params

The backend may need to change the default values for other options affecting how the compiler works. For example, it makes sense to make -fno-delete-null-pointer-checks the default for a microcontroller where address 0 is a valid address. This can be done by using TARGET_OPTION_INIT_STRUCT
static void
machine_option_init_struct (struct gcc_options *opts)
{
  opts->x_flag_delete_null_pointer_checks = 0;
}

#undef TARGET_OPTION_INIT_STRUCT
#define TARGET_OPTION_INIT_STRUCT machine_option_init_struct

Further reading

All functionality is described in “GNU Compiler Collection Internals”: