Sunday, February 4, 2018

GCC command options for debugging – -Og and -g3

I listened to a recent CppCast episode where Balázs Török talked about game development and mentioned that C++ abstractions make the code unusable slow in debug builds, and that he is skeptical of debuggability of meta-classes as today's debuggers cannot even handle macros. I think both problems are solvable, and I would argue that GCC is already handling at least the first issue, provided the right command options are used.

Debugging optimized code

GCC can generate debug information when optimizing, so it is possible to run fully optimized code in the debugger. But this is not too useful in reality – many optimizations change the structure of the code, so it is often impossible to single-step in the resulting binary as instructions from different parts of the program are interleaved...

The GCC developers have traditionally tried to limit the damage done by the optimizers for the -O1 optimization level, but -O1 is often used for release builds too, so there is a limit to how much optimizations can be disabled without annoying too many developers – a new optimization level, -Og, were therefore introduced in GCC 4.8. The -Og optimization level enables the optimizations that do not interfere with debugging, and it may even result in a better debugging experience than when compiling without optimizations, as some optimization passes collect information useful for generating better debug information.

The difference between -O1 and -Og is that
  • -Og disables some optimizations, such as if-conversion, that simplifies control flow, so the structure of the generated code is roughly the same as in the source code.
  • -Og disables some passes, such as -ftree-pta and -ftree-sra, that help other optimization passes by propagating information about memory accesses throughout the functions. The effect of this is that those passes now optimize with only the information available locally within each basic block.
  • -Og is less aggressive in the back end peephole optimizations, so each generated instruction is less likely to execute functionality from several statements in the source code.
How much this affects the performance depends a lot on coding style etc., but modern CPUs are great at hiding inefficiencies using branch-prediction, speculative execution, and store to load forwarding, so the difference between -Og and fully optimized code is often surprisingly small, even when using the STL.

Debug information for macros

GCC does not put information about macros in the debug information per default, but it is possible to add it by passing -g3 to the compiler. This makes GDB know of the macros and enables some macro-related commands. But the debugging experience is still not that great. I do not understand why a better support has not been implemented – possibly because inline functions should be used instead of macros...

Sunday, January 21, 2018

GCC back end performance tuning

This is part six of a series “Writing a GCC back end”.


The compiler often has different options for how it can optimize and emit the code. For example, dividing a 32-bit integer by the constant value 3
i = i / 3;
can be generated as a division instruction, but division instructions are slow, so it may be better to generate this as the equivalent of
i = (((int64_t)i * 0x55555556) >> 32) - (i >> 31);
which gives the same result. GCC decides which to generate by preparing RTL for the different alternatives, and queries the target’s cost function (if implemented) to determine which is the cheapest. Which alternatives are tried depends on which insns are defined in the target description, and what constraints the insns have, but the compiler will, in this case, ask for the costs of subexpressions of the form
(truncate:SI (lshiftrt:DI (mult:DI (sign_extend:DI (reg:SI 88))
                                   (const_int 0x55555556))
                          (const_int 32)))
and compares their combined costs with
(div:SI (reg:SI 88) (const_int 3))

Implementing this cost function is relatively easy for simple architectures – it consists of a switch case for each operation returning the cost expressed as the number of nop instructions  (which usually means the number of cycles)
static bool
machine_rtx_costs (rtx x, machine_mode mode, int outer_code, int opno,
                   int *total, bool speed)
  switch (GET_CODE (x))
    case CONST_INT:
      *total = 0;
      return true;

    case AND:
    case IOR:
    case XOR:
      *total = COSTS_N_INSNS (GET_MODE_SIZE (mode) > UNITS_PER_WORD ? 2 : 1);
      return false;

    case ABS:
      *total = COSTS_N_INSNS (FLOAT_MODE_P (mode) ? 1 : 3);
      return false;

    // ...

      return false;

#define TARGET_RTX_COSTS machine_rtx_costs
Returning true from cost function means that it has written the cost of the whole RTL expression x to *total, and returning false means just for the first operation in the expression (in which case the cost function will be called separately on the arguments).

The cost function gets complicated fast as the CPU gets more complex with different costs depending on how the operations combine with other operations. For example, an addition may have different cost if it can be done as part of the addressing mode in a memory operation
(set (reg:SI 90)
        (mem:SI (plus:SI (reg:SI 88) (reg:SI 89))))
compared to if it is a normal addition
(set (reg:SI 90)
        (plus:SI (reg:SI 88) (reg:SI 89)))
But the cost function does not need to be too exact – there are many optimizations running after the optimization passes calling the cost function (and it is not obvious what the cost even mean for superscalar out-of-order CPUs anyway...).

Cost model – more configuration options

There are about 30 additional macros guiding performance-related decisions during compilation. These cover various properties such as relative cost of different addressing modes and registers, how expensive branches are compared to arithmetic operations, and when the compiler should unroll/inline memcpy instead of calling the function in the library.

See “Describing Relative Costs of Operations” in “GNU Compiler Collection Internals” for a list of these macros.

Peephole optimizations

The define_peephole2 definition in the target description takes a sequence of insns and transforms them to a new sequence of insns, working in essentially the same way as define_expand. This is used to take advantage of target-specific instructions that the generic peep-hole optimizations cannot do.

But there is in general not much need to write peephole optimizations – the insns describes exactly what they do in the RTL pattern, so GCC can reason about the insns and combine them when possible. So missing peephole optimizations are in general deficiencies in the machine description, such as missing define_insn, too conservative constraints (so that GCC does not believe the transformation is allowed), incorrect cost model (so it seems to be slower), etc.

Tuning optimization passes for the target architecture

GCC lets the user enable or disable optimization passes (using -f-options) and change different thresholds (using --param) when compiling. The default value for all of these can be set by macros in the target-specific configuration file gcc/common/config/machine/machine-common.c.

TARGET_OPTION_OPTIMIZATION_TABLE is used to enable or disable optimization passes at the various optimization levels. For example, this code snippet from the i386 backend enables -free for -O2 and higher optimization levels, and disables -fschedule-insns for all optimization levels
static const struct default_options machine_option_optimization_table[] =
    /* Enable redundant extension instructions removal at -O2 and higher.  */
    { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 },
    /* Turn off -fschedule-insns by default.  It tends to make the
       problem with not enough registers even worse.  */
    { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },

    { OPT_LEVELS_NONE, 0, NULL, 0 }

#define TARGET_OPTION_OPTIMIZATION_TABLE machine_option_optimization_table

TARGET_OPTION_DEFAULT_PARAMS is used to set the default value for --param parameters. For example, the default value of the parameter l1_cache_line_size is modified as
static void
machine_option_default_params (void)
  set_default_param_value (PARAM_L1_CACHE_LINE_SIZE, 16);

#define TARGET_OPTION_DEFAULT_PARAMS machine_option_default_params

The backend may need to change the default values for other options affecting how the compiler works. For example, it makes sense to make -fno-delete-null-pointer-checks the default for a microcontroller where address 0 is a valid address. This can be done by using TARGET_OPTION_INIT_STRUCT
static void
machine_option_init_struct (struct gcc_options *opts)
  opts->x_flag_delete_null_pointer_checks = 0;

#define TARGET_OPTION_INIT_STRUCT machine_option_init_struct

Further reading

All functionality is described in “GNU Compiler Collection Internals”: