Sunday, October 16, 2016

C++ and code inlining

Jason Turner mentions in his CppCon 2016 talk that GCC optimizes the function
int foo()
{
    return std::string("a").size();
}
to just a return of a constant value
_Z3foov:
    movl    $1, %eax
    ret
while
int bar()
{
    return std::string("a").size() + std::string("b").size();
}
generates a less optimized result involving calls to constructors, destructors, etc. The difference between the two cases comes from how GCC handles inlining.

The basic idea behind the GCC inliner is to inline greedily as long as it estimates that the code size does not increase too much (where the growth limit depends on the optimization level, whether the function is hot or cold, etc.). One important special case is functions that are called exactly once. They do not increase code size if they are inlined, as the inlining just moves the code from the callee into the caller.
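A minimal sketch of the two situations may help. All function names below are hypothetical and chosen only for illustration. I have made the helpers static on the assumption that this lets the compiler drop the out-of-line copy once every call is inlined. Whether GCC actually inlines either helper depends on the compiler version and flags.

```cpp
// Hypothetical helpers for illustration; none of these names come from GCC.

// Called from exactly one place. Inlining it only moves the loop into
// single_caller(), so the total code size does not grow.
static int sum_of_squares(int n)
{
    int total = 0;
    for (int i = 1; i <= n; ++i)
        total += i * i;
    return total;
}

int single_caller(int n)
{
    return sum_of_squares(n);
}

// Called from two places. Inlining it duplicates the loop at each call
// site, so the inliner has to weigh the estimated growth against its limit.
static int sum_of_cubes(int n)
{
    int total = 0;
    for (int i = 1; i <= n; ++i)
        total += i * i * i;
    return total;
}

int first_caller(int n)
{
    return sum_of_cubes(n);
}

int second_caller(int n)
{
    return sum_of_cubes(n) + 1;
}
```

These helpers are small enough that GCC will probably inline both of them anyway. The difference only starts to matter when the callee is large, as in the string constructor discussed next.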

STL usage often expands to a large amount of code. For example, the string constructor used in the examples above calls
template<typename _CharT, typename _Traits, typename _Alloc>
  template<typename _InIterator>
    void
    basic_string<_CharT, _Traits, _Alloc>::
    _M_construct(_InIterator __beg, _InIterator __end,
                 std::forward_iterator_tag)
    {
      // NB: Not required, but considered best practice.
      if (__gnu_cxx::__is_null_pointer(__beg) && __beg != __end)
        std::__throw_logic_error(__N("basic_string::"
                                     "_M_construct null not valid"));

      size_type __dnew =
        static_cast<size_type>(std::distance(__beg, __end));

      if (__dnew > size_type(_S_local_capacity))
        {
          _M_data(_M_create(__dnew, size_type(0)));
          _M_capacity(__dnew);
        }

      // Check for out_of_range and length_error exceptions.
      __try
        { this->_S_copy_chars(_M_data(), __beg, __end); }
      __catch(...)
        {
          _M_dispose();
          __throw_exception_again;
        }

      _M_set_length(__dnew);
    }
This in turn calls several other functions, and more than 50 function calls need to be inlined in order to fully optimize foo. The temporary code size increase from inlining, before the code gets optimized, is over the limit allowed by -O2. But foo constructs exactly one string, so the constructor is called exactly once and the compiler can inline it without increasing code size, and the compiler can eventually eliminate everything. The bar function, on the other hand, does not get the constructors inlined, as it constructs two strings.

This can have surprising effects when trying to understand how well code optimizes. For example, we saw that foo optimizes to return a constant value, so let us add a second, nearly identical function
int foo()
{
    return std::string("a").size();
}

int foo2()
{
    return std::string("b").size();
}
The file now constructs two strings, so the constructor is no longer called exactly once, which prevents inlining. So neither foo nor foo2 will be fully optimized when compiled with -O2.

The -O3 optimization level allows more inlining, so all examples in this blog post optimize fully when compiled using -O3 (so I assume Jason used -O2 for his examples). In general, -O2 does optimizations that do not involve a space-speed tradeoff, while -O3 allows the code size to increase if the resulting code is faster (which is good if you have enough cache).
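To check this with your own compiler, one option is to put the functions from this post in a single file and compare the assembly at the two optimization levels. The sketch below assumes a hypothetical file name, inline_test.cpp, and uses the standard GCC options -S (emit assembly) and -o - (write to standard output). The exact output will vary with the GCC and libstdc++ versions, so it may not match what is described above.

```cpp
// inline_test.cpp (hypothetical file name)
//
// Compare the generated assembly at the two optimization levels:
//   g++ -O2 -S -o - inline_test.cpp
//   g++ -O3 -S -o - inline_test.cpp
#include <string>

// One string per function, but two strings in the file as a whole.
int foo()
{
    return std::string("a").size();
}

int foo2()
{
    return std::string("b").size();
}

// Two strings in a single function.
int bar()
{
    return std::string("a").size() + std::string("b").size();
}
```

A fully optimized function shows up as a single move of a constant into the return register followed by ret, like the _Z3foov listing at the top of this post.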

To conclude:
  • It is not enough to look at trivial code snippets if you want to verify that your complex templated code is being optimized as you expect.
  • You may want to use -O3 rather than -O2 when compiling code for modern architectures.
