Sunday, February 21, 2016

How undefined signed overflow enables optimizations in GCC

Signed integers are not allowed to overflow in C and C++, and this helps compilers generate better code. I was interested in how GCC is taking advantage of this, and here are my findings.1

Signed integer expression simplification

The nice property of overflow being undefined is that signed integer operations works as in normal mathematics — you can cancel out values so that (x*10)/5 simplifies to x*2, or (x+1)<(y+3) simplifies to x<(y+2). Increasing a value always makes it larger, so x<(x+1) is always true.

GCC iterates over the IR (the compiler's Internal Representation of the program), and does the following transformations (x, and y are signed integers, c, c1, and c2 are positive constants, and cmp is a comparison operator. I have only listed the transformations for positive constants, but GCC handles negative constants too in the obvious way)
  • Eliminate multiplication in comparison with 0
    (x * c) cmp 0   ->   x cmp 0 
    
  • Eliminate division after multiplication
    (x * c1) / c2   ->   x * (c1 / c2) if c1 is divisible by c2
    
  • Eliminate negation
    (-x) / (-y)     ->   x / y
    
  • Simplify comparisons that are always true or false
    x + c < x       ->   false
    x + c <= x      ->   false
    x + c > x       ->   true
    x + c >= x      ->   true
    
  • Eliminate negation in comparisons
    (-x) cmp (-y)   ->   y cmp x
    
  • Reduce magnitude of constants
    x + c > y       ->   x + (c - 1) >= y
    x + c <= y      ->   x + (c - 1) < y
    
  • Eliminate constants in comparisons
    (x + c1) cmp c2         ->   x cmp (c2 - c1)
    (x + c1) cmp (y + c2)   ->   x cmp (y + (c2 - c1)) if c1 <= c2
    
    The second transformation is only valid if c1 <= c2, as it would otherwise introduce an overflow when y has the value INT_MIN.

Pointer arithmetic and type promotion

If an operation does not overflow, then we will get the same result if we do the operation in a wider type. This is often useful when doing things like array indexing on 64-bit architectures — the index calculations are typically done using 32-bit int, but the pointers are 64-bit, and the compiler may generate more efficient code when signed overflow is undefined by promoting the 32-bit integers to 64-bit operations instead of generating type extensions.

One other aspect of this is that undefined overflow ensures that a[i] and a[i+1] are adjacent. This improves analysis of memory accesses for vectorization etc.

Value range calculations

The compiler keeps track of the variables' range of possible values at each point in the program, i.e. for code such as
int x = foo();
if (x > 0) {
  int y = x + 5;
  int z = y / 4;
it determines that x has the range [1, INT_MAX] after the if-statement, and can thus determine that y has the range [6, INT_MAX] as overflow is not allowed. And the next line can be optimized to
  int z = y >> 2;
as the compiler knows that y is non-negative.

The undefined overflow helps optimizations that need to compare two values (as the wrapping case would give possible values of the form [INT_MIN, (INT_MIN+4)] or [6, INT_MAX] that prevents all useful comparisons with < or >), such as
  • Changing comparisons x<y to true or false if the ranges for x and y does not overlap
  • Changing min(x,y) or max(x,y) to x or y if the ranges do not overlap
  • Changing abs(x) to x or -x if the range does not cross 0
  • Changing x/c to x>>log2(c) if x>0 and the constant c is a power of 2
  • Changing x%c to x&(c-1) if x>0 and the constant c is a power of 2

Loop analysis and optimization

The canonical example of why undefined signed overflow helps loop optimizations is that loops like
for (int i = 0; i <= m; i++)
are guaranteed to terminate for undefined overflow. This helps architectures that have specific loop instructions, as they do in general not handle infinite loops.

But undefined signed overflow helps many more loop optimizations. All analysis such as determining number of iteration, transforming induction variables, and keeping track of memory accesses are using everything in the previous sections in order to do its work. In particular, the set of loops that can be vectorized are severely reduced when signed overflow is allowed.


1. Using GCC trunk r233186,  "gcc version 6.0.0 20160205 (experimental)"

This blog post was updated 2016-02-23:
  • Corrected first sentence to say "overflow"instead of "wrap".
  • Removed one incorrect transformation from "Eliminate Negation" where I had misread the GCC source code.
  • Corrected the ranges in the "Value range calculations" example.

16 comments:

  1. Nice writeup.
    These optimizations also mean UNsigned integers should only be used when really needed. E.g. when you need the overflow wrap around.

    ReplyDelete
  2. In "Value range calculations" shouldn't it be

    (0, INT_MAX]

    instead of

    [0, INT_MAX]

    because the if-statement checks to make sure x > 0, not x >= 0?

    Great write-up!

    ReplyDelete
  3. Thanks! I have updated the ranges in the blog post.

    ReplyDelete
  4. I'm afraid that as regards the C++ language you've fallen victim to some purely associative propaganda, I'm sorry: very few of the opportunities you list, are examples where the UB contributes to optimization of actual C++ code.

    [Snipped concrete discussion of first 7 cases, because that exceeded the 4096 character limit on postings. Conclusion: only 6 and 7 are real. Silly limit! Gag the critics!]

    ***

    Okay there are more cases, and loops are especially interesting. They offer some valid good optimizations, and some that are more in the class of perverse compiler behavior (which you can just take as my personal opinion, or investigate, sorry I don't have more time here). But in conclusion, the optimization possibilities offered by signed arithmetic UB, are somewhat hyped up: there are some such possibilities, even some good ones, but many alleged ones are not really optimizations at all.

    Still there are good reasons for preferentially using signed integers for numbers, and those good reasons include both clarity (less complexity, more concise code), and less opportunities for e.g. promotion bugs to creep in.

    Cheers!

    ReplyDelete
  5. Hi Alf,

    signed overflow is undefined behavior in C++ too, and GCC does all of the described transformations for both C and C++.

    We may argue about the benefit of these transformations, but I think that is a topic for a separate blog post... :-)

    ReplyDelete
  6. Oh. You misunderstood. I mentioned the language because a comment argument for optimization of sillycodes like `x*Constant > x` (in one of your examples) is that such code can be generated by macros. Those are at best a C-specific arguments; there is no reason a sane C++ programmer would do this, or indeed, need to have it optimized.

    Cheers & hth.,

    - Alf

    ReplyDelete
  7. @Krister, nice article!

    @Alf, I think you are underestimating the amount of C++ code written by non-humans. Many code generators produce idioms like this. Convoluted template shenanigans can also lead to such expressions appearing in a translation unit.

    @Koen, yes and no. Unsigned types are much easier to reason about in many situations. It is true that there are cases like the one Krister identified where an optimisation is only enabled when operating on signed types. However, there are so many cases of undefined behaviour relating to signed but not unsigned types that it can be a minefield for the insufficiently cautious programmer. E.g. `1 << 31`, `INT_MIN / -1`, ...

    ReplyDelete
  8. @Koen: I second what Matthew is saying. The standard says that signed overflow leads to undefined behavior. This is because the IBM 360 and successors trapped on signed integer overflow.

    Since undefined behavior can be anything, programmers must avoid creating such code.

    Thus, modern instances of GCC and other compilers assume that programmers are infallible and have ensured that their code does not cause overflow, hence giving the compilers license to perform these optimizations.

    If however, the code was written with the assumption that signed overflow is treated exactly like unsigned arithmetic, as was the case for previous instances of these compilers on most machines, these optimization can cause the code to fail.

    Although I can understand compiler writers' interest in implementing these optimizations, one hast to wonder if a programmer that creates code with such obvious optimization opportunities has really thought through his code, e.g., in regards to integer overflow.

    And honestly, I think programmers would be better served by having code generated that traps on overflow, at least for debug builds.

    ReplyDelete
  9. @Konz In some cases of undefined behavior gcc and clang may use __builtin_trap which let's say on x86 may generate a ud2 instruction which will trap. They both currently do this when dereferencing a null pointer see this godbolt example: https://goo.gl/VoV7zH

    The compiler may not always be able to prove undefined behavior, on a stackoverflow question I answered a while ago: http://stackoverflow.com/q/32506643/1708801

    A simple cout of the variable was sufficient to confuse the compiler and prevent a warning of undefined behavior. Although the optimizer still managed to turn a finite loop into an infinite one in this case.

    ReplyDelete
  10. This is wrong for the case y = -1:
    (-x) / (-y) -> x / y

    since with y = -1 the optimized code would crash when x = INT_MAX (the CPU traps).

    ReplyDelete
  11. Is it common to see loop like

    for (int i = 0; i <= m; i++)

    ?

    "Most"[1] loops that I've seen are written as

    for (int i = 0; i < m; i++)

    in which case you don't need UB-on-signed-overflow (you can prove no-overflow directly, even if spec is to wrap).

    [1]: In quotes since I have no way at the moment to quantify this

    ReplyDelete
  12. When the when (m is unsigned) or (m is larger bit size than int), and m is greater than INT_MAX, then the loop will never terminate. (IOW the loop is not countable).

    ReplyDelete
  13. There are a few cases where signed overflow is the desired behavior, although they aren't too common:

    - Generating -1 when input is negative without a branch:
    (int32_t)val >> 31; // equivalent to val < 0 ? -1 : 0;

    Fast random number generator in -128 to 127 range:
    int32_t randMem;
    randMem *= 1103515245;
    randMem += 12345;
    randOut = randMem >> 24;

    What's the proper way to get GCC to generate the proper code in this kind of case? Cast to uint32_t to force limited range, then apply wrapping arithmetic shift right? Or is there an intrinsic as in the case of stuff like 32bit * 32bit -> 64bit multiply on 32bit architectures?

    ReplyDelete
    Replies
    1. There are no relevant intrinsics, but the compiler is in general smart enough to handle many interesting cases. For example, your generation of -1 if \(\verb!val!\) is negative can be done as \(\verb!(val < 0) ? -1 : 0!\) — GCC generates a shift for this!

      Or you can make your operations safe by casting to larger types/unsigned etc. For example, if you have
      \(\verb!int32_t a, b, c;!\)
      then you can implement a wrapping addition as
      \(\verb!c = (int32_t)((int64_t)a + (int64_t)b);!\)
      GCC generates a 32-bit wrapping addition for this.

      Or you can use the \(\verb!-fwrapv!\) command-line option, that treats wrapping integers as being defined (but which of course disables the optimizations described in the blog post)

      Delete
  14. How many genuinely-useful optimizations are enabled by treating signed overflow as UB, rather than saying that it will yield a value that behaves as any mathematical integer--not necessarily within the range of the type--whose value is congruent to the correct value mod 2³² (or 2⁶⁴, etc. based on type). Such a guarantee would so far as I can tell very seldom interfere in any way with any of the optimizations you listed, but would allow programmers to write shorter and more efficient code in cases where the above behavior would satisfy application requirements, than would be possible without such a guarantee.

    ReplyDelete
    Replies
    1. I'd say this is true for essentially all of the optimizations mentioned in the blog post — e.g. it uses the fact that wrapping is UB to be allowed to simplify e.g. \(\verb!x+1 < x!\) to \(\verb!1 < 0!\) by subtracting \(\verb!x!\) from both sides, but the rule could be expressed as if the operation was done in a wider type.

      The only optimization that cannot be expressed in this way is the loop-related optimizations (that calculates iteration count based on that the values do not wrap. And that \(\verb!a[i]!\) and \(\verb!a[i+1]!\) are adjacent, which is important for vectorization).

      Delete