Tuesday, March 17, 2015

Proving correctness of the GCC match.pd — Part 1

The idea behind ALIVe is that you can eliminate a large class of compiler bugs in LLVM by generating peephole optimizations from a specification that is proved correct using an SMT solver. GCC is already generating some such optimizations from a specification, so the "only" thing remaining is to prove that the transformations in the specification are correct.

The transformations are specified in a file called match.pd, and work on the GCC middle-end GENERIC and GIMPLE IR that is a somewhat higher level representation than the LLVM IR. GENERIC is the representation generated by the front end, and is almost immediately transformed to GIMPLE that is used for the majority of the middle-end optimizations. Both use mostly the same operations, but GENERIC has some higher level constructs, and it is not in SSA form.

GCC does not have an external format for the IR, but you can get an idea of what the GIMPLE IR looks like from the debug dumps generated with the -fdump-tree-all command-line option. For example, the C function
int foo(int a, int b, int c, int d)
  int tmp;
  if (a)
    tmp = b + c + d;
    tmp = 23;
  return tmp;
is generated in GIMPLE as
foo (int a, int b, int c, int d)
  int tmp;
  int D.2721;
  int D.2719;

<bb 2>:
  if (a_2(D) != 0)
    goto <bb 3>;
    goto <bb 4>;

<bb 3>:
  D.2719_5 = b_3(D) + c_4(D);
  tmp_7 = D.2719_5 + d_6(D);
  goto <bb 5>;

<bb 4>:
  tmp_8 = 23;

<bb 5>:
  # tmp_1 = PHI <tmp_7(3), tmp_8(4)>
  D.2721_9 = tmp_1;
  return D.2721_9;
This IR looks very different from the LLVM IR, but it is mostly because the debug dumps tries to be helpful and writes e.g. addition as "a + b" instead of using the operation name followed by the operands, such as "plus a, b", which would have been more convenient, as this is how we will need to write the matching rules in match.pd...

One thing I like about this IR is that it may contain information about parentheses from the original language. This is used by the Fortran compiler, as the Fortran standard allows the compiler to optimize floating point calculations (reassociation, etc.), but it requires the compiler to respect the order of evaluation specified by parentheses. I think this is the best way of handling floating point, and I wish more languages had chosen this semantic... As an example of how parentheses looks like in the IR, consider
a = (b + 1.0) - 1.0;
that is generated as
_3 = *b_2(D);
_4 = _3 + 1.0e+0;
_5 = ((_4));
_6 = _5 - 1.0e+0;
*a_7(D) = _6;
As mentioned above, I think the syntax would have been clearer if the parenthesis was written as
_5 = paren _4;
as this is how it is represented within the compiler — the right parenthesis is implied by the evaluation order.

We are mostly isolated from the details of the IR when doing the peephole optimizations, as they only work on small sequences of instructions, and both LLVM IR and GENRIC/GIMPLE looks similar on that level. The transformation specifications in match.pd are written in a similar way as for ALIVe, although using a "lispy" syntax. As trivial example, for eliminating double bitwise negation, that for ALIVe looks like
; ~~x -> x
%1 = xor %x, -1
%2 = xor %1, -1
%2 = %x
is written as
/* ~~x -> x */
  (bit_not (bit_not @0))
in the GCC match.pd. The major difference between LLVM and GCC that will affect us is that GCC uses information from the type system to determine wrapping behavior (unsigned integers may wrap, but the behavior is undefined if signed integers overflow), so the transformation rules need to look at the type in order to determine the wrapping semantics. GCC also have types such as complex numbers that is not a basic type in LLVM, and those extra types complicates some transformations.

The next blog post will look at the details of how the transformations are specified in match.pd.

1 comment:

  1. Thanks. This post helps me a lot to understand match.pd.