Back to Home page of KG
Tuning the G++/GCC Inliner
Table of contents
The problem
Inliner thoughts
Inliner shortcomings
Fix the heuristics
gcc-3.1
gcc-2.95.3 work
Download
The problem
Developing the TBCI library,
a high performance templated C++ library, I use C++ compilers a lot,
especially the GNU Compiler Collection and
therefore monitored its performance.
I came across two problems
- For some C++ code, the compiler would consume tremendous amounts of
memory (>500MB). This affected egcs-1.1.x and gcc-2.95.3. With
gcc-2.96/7, this was even worse, but in April 2001, this problem
was
solved by throttling recursive inlining in the new C++ tree inliner,
so gcc-3.0 does not suffer it any more. EDG based compilers (Kai, DEC
CXX) also did not suffer it.
- gcc-3.0.1 improved on the compile times even much more, but
unlike 3.0, performance was
catastrophic for some of my benchmarks (more than 2.5x slower than
3.0) , but also
reported by Gerald Pfeifer, also using C++ code.
For the latter problem, some tuning of the inliner also seemed to be
responsible, so I used the command line parameter -finline-limit-X
to play. This value used to be at 10000 and was set to the much more
reasonable value of 600 between 3.0.0 and 3.0.1. I needed to increase this
value
to get reasonable performance with my double tests
and I needed to go much higher for complex numbers.
(Up)
Inliner thoughts
One has to think of some aspects of inlining:
- It increases runtime performance, as you save time for storing the
arguments (on the stack for most ABIs), doing the function call, returning
and cleaning up the stack.
- You pay by compile time resources, as the compiler has to deal with
larger code blocks and has to generate the same code multiple times.
- The larger blocks may be optimized better, due to constant propagation
etc. resulting in better performance and eventually smaller code.
- Unless you only inline functions that are smaller than the instructions
needed for argument setup/cleanup and function call, or unless you call
the inlined function only from one place, you increase the size of your
executable.
- If this happens a lot, you may lose performance, because more
instructions have to be loaded from memory and the likelihood to find them
in the faster instruction-cache of your CPU is smaller.
So you want to inline the right functions, i.e. the ones which are
- Fast, so calling overhead would be significant
- Small, so code bloat is small
- Called often, but only from a few places
Unfortunately, the only information that is straightforwardly available for
the compiler is the size of the function, and even that one is not known
exactly, as some code may be eliminated later (constant propagation etc.).
(Up)
Inliner shortcomings
Back to the gcc-3.0.1: I was wondering why such high limits are needed and
suspected that the
g++ inliner does limit the inlining at the leaf functions of the call tree
instead of cutting at the trunk. While this mail was unanswered,
Daniel Berlin sort
of confirmed my suspicion.
Now, inlining large functions at the trunk and then running against the
recursive inlining limit at the leaves of the call tree is about the worst
thing that can happen, when you have inlining heuristics, and can
very well explain bad runtime performance. This is because your leaf
functions tend to be small or be called from within loops. For typical
C++ code, those are often just accessors. Anyway, you always have more
leaves than branches in a tree.
The dependence on the inline-limit indicated that this was the reason
indeed.
A good fix would be to reverse the inlining logistics and cut at the trunk.
It seems such logic
is present in the AST optimizer branch.
(Up)
Fixing the inlining heuristics (g++-3.0.1)
For the 3.0 branch, the ast optimizer is probably no option, so the
performance gain should be reached by tuning the inliner heuristics.
The first idea to fix inlining heuristics is to not allow a single function
to eat up all the size we've reserved. So let's allow only half of the total
inlining budget to a single function.
Second question is what to do when we reach the recursive inlining limit.
The idea is to not disallow inlining then completely, but just to become
more restrictive, this way still allowing small (and potentially very fast)
functions to be inlined. On the other hand, we hopefully still limit
recursive inlining enough to prevent excessive compile time resource
requirements.
A first patch to
do exactly this, using a limit of 30 tree tokens (corresp. to an
estimated 300 RTL tokens) for single functions and 60 tree tokens as limit
for recursive inlining was proposed and showed rather promising results in
both my own
and also in
Gerald Pfeifer'stests.
This patch has been merged between 3.0.2 and 3.0.3.
Some more ideas to further improve came to my mind:
- Instead of dropping off abruptly to half, just use a linear function
with a small negative slope, after we reached the recursive inlining
limit.
- Stop decreasing the limit at some low value which seems to really only
cover very small functions, so we hardly take any risk of slowdown
of compilation or code bloat (which is bad for runtime performance as
well, due to the memory latency and the limited size of the CPU's
instruction cache). This value should be determined experimentally.
- Even stop this after we are way beyond this to put a limit to infinite
recursions.
The slope was determined to be 1/16 and the minimum inline size to be 13
tree tokens and the
second version of the patch was proposed. My results were slightly
better with this approach and so were the ones from
Gerald Pfeifer
and from a SPEC
2000 run of 252.eon performed by Andreas Jaeger (SuSE) and using the
patch for the mainline compiler (3.1 CVS).
Of course some
more ideas are possible:
- Fine tune those limits with respect to the limits in the RTL inliner
(used for C code), so the default can be reasonable for both.
- Use a lower slope (thus dropping off less fast) after the recursive
inlining limit has been reached.
- Do more tests to determine the base reached then (which is 13 now).
- Give functions declared inline an advantage over normal
functions even for the -O3 case to fix the -O3 performance. (For code
that has been tuned for performance and therefore be equipped with
the inline keyword at the right places, -finline-functions aka. -O3
does introduce a pessimization, normally!)
Update: A patch to add
this is available. Please report back aboput it!
- Give functions at the leaves of the call tree an advantage.
- Gather and use information about whether the function is called in a
loop and from how many different places it's called.
It seems the heuristics even could use a bit more tuning, as Olaf Petzold
does report
some shortcomings when using expression templates (Blitz++). I hope that we
can fix it with some
tuning.
OK, here it comes v3: I dropped the slope of the inlining limit (as function
of the already inlined + the to be inlined code) from -1/16 to -1/32. So even
after inlining a lot, we don't get picky about inlining functions again,
because just now it might hurt most. Also, when accounting the already
inlined code, substract one tree statement, as we save the call.
With my own measurements, compile times and runtime performance stayed the
same within the precision of my benchmarks as compared to v2. Executable
size did differ in both directions by half a percent. But according to my
theory, it should yield a bit better results for some cases.
More ideas to be implemented in v4 ...
(Up)
gcc 3.1 and newer
I was expecting the AST inlining approach to be merged before gcc 3.1.
However, this is not the case. Instead the 3.0.3 C++ tree inlining
heuristics (my patch v1) is still used (now as well for other languages).
I still think that inlining is one of the bottlenecks for the compiler's
performance with C++ in gcc-3.1, and
Gerald's
benchmarks seem to confirm this.
On the long run, I believe, we would need to reverse the inlining (start at
the leaves as in AST) heuristics and/or gather more information to take
inlining decisions. On short term, the v3 patch and the function inline
accounting (-O3 fix) may help gcc-3.1 to perform a bit better. Note that the
benefits from v3 (compared to v1 which is in gcc-3.1) are not huge,
though
considerable.
But the -O3 fix patch does prevent exaggerated inlining with
-finline-functions (-O3) for me.
Find below slightly improved versions of the inliner patch (v3.6) and the
function inline accounting patch (v1.2). The v3 patch has been made tunable
by making the magic constants parameters (max-inline-insns-single
(300), max-inline-insns (600), max-inline-slope (32),
min-inline-insns (130)), which are also documented.
For the function inlining accounting, it turned out to adversely affect
performance if the RTL inliner (which is run later than the tree inliner)
also cuts down inlining of automatically inlined functions. So this has been
removed. As the RTL inliner does not account for repeated inlining, this
will not result in important leaf functions from being cut off from inlining
and therefore should not cause runtime performance problems. Also the effect
on compile-time resources of repeated inlining in the RTL inliner is less
severe than in the tree inliner.
Meanwhile some more improvements have been made (patch v4).
Both patches have been merged and the parameters are all settable via
--param. The max-inline-insns-auto limits the size
of functions inlined by -finline-functions, whereas
--max-inline-insns-rtl and --max-inline-insns-rtl-leaf do
limit the inliner for those languages still using the RTL inliner.
The -finline-insns=N now sets max-inline-insns-single to
N/2, max-inline-insns-auto to N/3, max-inline-insns to N and
max-inline-insns-rtl to N and max-inline-insns-rtl-leaf to
N*3/2.
In gcc-3.3, the patch v3.x patch has been merged, but unfortunately a preliminary
version without documentation and proper initialization of the parameters.
This has been reported as bug PR/8387.
This is fixed by the v41 patch.
The patch that puts a different limit on the compiler-flag inline functions
has also been recreated, but with the same default limit as for
inline-declared functions.
This should probably be changed.
Both patches have been merged into gcc-3.4 CVS as of 2003-03-02.
A patch to use a higher max-inline-insns and initialize
max-inline-insns-auto to only a third of it (thus lowering it)
can be found below as well. You should see slightly more inlining when
compiling with -O2 and your source code makes use of the inline keyword. You
should see slightly less inlining if you compile with -O3, but there are no
inline keywords (or in class decl implementation for C++) in your code.
If you have code where inline is used at the right places, an -O3 compile
should yield about the same amount of inlining that you got previously,
but it will favour the inline declared functions. It will lower the
likelihood that compiler-flag elected functions prevent your inline-declared
functions from being inlined due to the recursive throttling.
So, if the keyword has been used at the right places, it should improve
code quality.
(Up)
Fixing 2.95.3 excessive compilation time
Of course I felt confident now and thought I would just back port the changes to
2.95.3, so also the distributions out there that still use this good old
compiler could profit. However, 2.95.x does not have the C++ tree inliner,
but just the RTL inliner (gcc/integrate.c), so it's not just a matter of
applying the same patch.
The RTL inliner does not have the accounting of recursive inlining, so
together with the huge inline-limit of 10000, we do understand the excessive
compile time and memory requirements rather easily. (The RTL inliner is
just more picky about what can be inlined, so it's not triggered that often.)
So, there remains only three things to do, AFAICS:
- Prefer leaf functions as they don't incur the risk of limited inlining
and thus causing compile resource problems and because they are more
important to inline (may be called from loops).
- Lowering the inline-limit to less huge values.
- Give inline functions still an advantage for
-finline-functions
I implemented the first two and sent a
patch with a default
inline-limit of 400. Results are a bit mixed, though; some code runs a bit
faster, some code runs a bit slower (up to 5%).
Here are some results.
If you don not want to risk to have slight performance degradations for some
code, you may want to use the patch with a limit of 2000 instead of 400, so
you only prevent the worst case C++ recursive inlining abuse from happening.
The one that I described at the beginning.
Look at the compilation times.
(Up)
Download
- 3.0.1 patches
- patch v1 (30,60,15)
- patch v1.5v1 with ChangeLog
- patch v1.6v1 with smaller ChangeLog
- patch v2 (30,60,13,0,-1/16)
- patch v2.5v2 with texi docu
- patch v3(30//60/13,0,-1/32)
- -O3 fix v1 Be more
restrictive for functions inlined by -finline-functions (aka -O3).
Patch also includes the 2.95.3 stuff (leaf bonus)
- 3.1 patches
- patch v3 (30/60/13,0,-1/32)
- patch v3.1
(30/60/13,0,-1/32), v3 with better compliance to GNU style.
- patch v3.6
(30/60/13,0,-1/32), v3.1 with tunable (and documented) parameters.
- -O3 fix v1 Be more
restrictive for functions inlined by -finline-functions (aka -O3).
Patch also includes the 2.95.3 stuff (leaf bonus, -Os tuning)
- -O3 fix v1.2
The inline-functions accounting (-O3 fix) v1 with original
RTL inlining limits. Fixes some performance regression
in v1.
- patch v4
This is the merge of both patches. All the parameters are
exposed via --param.
- 3.2 (CVS) patches (against 2002-05-28 CVS)
- patch v4
This is the merge of both patches. All the parmeters are
exposed via --param. As the v3.x was already in 3.2 CVS,
this is a little clean up and the -finline-functions patch.
- 3.3 (CVS) patches (against 2003-03-01 CVS)
- Patch v41, inline parameters
Add docu and param initialization that was missed when merging v4
into gcc-3.3 CVS. Fixes PR/8387.
- Patch v41, limit auto inlining
Patch that accounts compiler-flag inlined functions differently than
the one declared inline. More strict limit (240 instead of 300).
- 3.4 (CVS) patches (against 2003-03-05 CVS)
- Lower auto inlining limit
Use a higher default for single function and recursive inlining
limits, but a slightly lower one for the ones that are not declared
inline but elected by -finline-functions/-O3. Ratio is 3/2 now,
defaults are recursive/single/auto/min = 840/420/280/130.
- 2.95.3 patches
- aggressive patch (400)
- conservative patch (2000)
Note that I also changed fine tuning of the optimize_size (-Os)
a bit to yield better results (both smaller and faster executables).
Note that you can always tune the inlining limit manually with
-finline-limit-X, where X may be reasonably set to a value between
100 and 5000.
Please send feedback to me and/or to the
gcc mailing list.
Note that I used a lot of references in this document and although it may
seem so on the first sight, those are not part of this site and I just
linked them for your convenience. Read
the text from Tim
Berners-Lee, the inventor of the WWW, about what a link means. I agree
with him.
Disclaimers are these awful unreadable and stupid sentences
those lawyer people managed to impose on us.
Browse directory
(w) by KG, last changed 2002-05-28