Left Arrow Back to Home page of KG

Tuning the G++/GCC Inliner

Table of contents

The problem
Inliner thoughts
Inliner shortcomings
Fix the heuristics
gcc-3.1
gcc-2.95.3 work
Download

The problem

Developing the TBCI library, a high performance templated C++ library, I use C++ compilers a lot, especially the GNU Compiler Collection and therefore monitored its performance.
I came across two problems For the latter problem, some tuning of the inliner also seemed to be responsible, so I used the command line parameter -finline-limit-X to play. This value used to be at 10000 and was set to the much more reasonable value of 600 between 3.0.0 and 3.0.1. I needed to increase this value to get reasonable performance with my double tests and I needed to go much higher for complex numbers.

(Up)

Inliner thoughts

One has to think of some aspects of inlining: So you want to inline the right functions, i.e. the ones which are Unfortunately, the only information that is straightforwardly available for the compiler is the size of the function, and even that one is not known exactly, as some code may be eliminated later (constant propagation etc.).

(Up)

Inliner shortcomings

Back to the gcc-3.0.1: I was wondering why such high limits are needed and suspected that the g++ inliner does limit the inlining at the leaf functions of the call tree instead of cutting at the trunk. While this mail was unanswered, Daniel Berlin sort of confirmed my suspicion.
Now, inlining large functions at the trunk and then running against the recursive inlining limit at the leaves of the call tree is about the worst thing that can happen, when you have inlining heuristics, and can very well explain bad runtime performance. This is because your leaf functions tend to be small or be called from within loops. For typical C++ code, those are often just accessors. Anyway, you always have more leaves than branches in a tree. The dependence on the inline-limit indicated that this was the reason indeed.
A good fix would be to reverse the inlining logistics and cut at the trunk. It seems such logic is present in the AST optimizer branch.

(Up)

Fixing the inlining heuristics (g++-3.0.1)

For the 3.0 branch, the ast optimizer is probably no option, so the performance gain should be reached by tuning the inliner heuristics.
The first idea to fix inlining heuristics is to not allow a single function to eat up all the size we've reserved. So let's allow only half of the total inlining budget to a single function.
Second question is what to do when we reach the recursive inlining limit. The idea is to not disallow inlining then completely, but just to become more restrictive, this way still allowing small (and potentially very fast) functions to be inlined. On the other hand, we hopefully still limit recursive inlining enough to prevent excessive compile time resource requirements.
A first patch to do exactly this, using a limit of 30 tree tokens (corresp. to an estimated 300 RTL tokens) for single functions and 60 tree tokens as limit for recursive inlining was proposed and showed rather promising results in both my own and also in Gerald Pfeifer'stests.
This patch has been merged between 3.0.2 and 3.0.3.

Some more ideas to further improve came to my mind:

The slope was determined to be 1/16 and the minimum inline size to be 13 tree tokens and the second version of the patch was proposed. My results were slightly better with this approach and so were the ones from Gerald Pfeifer and from a SPEC 2000 run of 252.eon performed by Andreas Jaeger (SuSE) and using the patch for the mainline compiler (3.1 CVS).

Of course some more ideas are possible:

It seems the heuristics even could use a bit more tuning, as Olaf Petzold does report some shortcomings when using expression templates (Blitz++). I hope that we can fix it with some tuning.

OK, here it comes v3: I dropped the slope of the inlining limit (as function of the already inlined + the to be inlined code) from -1/16 to -1/32. So even after inlining a lot, we don't get picky about inlining functions again, because just now it might hurt most. Also, when accounting the already inlined code, substract one tree statement, as we save the call.
With my own measurements, compile times and runtime performance stayed the same within the precision of my benchmarks as compared to v2. Executable size did differ in both directions by half a percent. But according to my theory, it should yield a bit better results for some cases.

More ideas to be implemented in v4 ...

(Up)

gcc 3.1 and newer

I was expecting the AST inlining approach to be merged before gcc 3.1. However, this is not the case. Instead the 3.0.3 C++ tree inlining heuristics (my patch v1) is still used (now as well for other languages).

I still think that inlining is one of the bottlenecks for the compiler's performance with C++ in gcc-3.1, and Gerald's benchmarks seem to confirm this.
On the long run, I believe, we would need to reverse the inlining (start at the leaves as in AST) heuristics and/or gather more information to take inlining decisions. On short term, the v3 patch and the function inline accounting (-O3 fix) may help gcc-3.1 to perform a bit better. Note that the benefits from v3 (compared to v1 which is in gcc-3.1) are not huge, though considerable.
But the -O3 fix patch does prevent exaggerated inlining with -finline-functions (-O3) for me.

Find below slightly improved versions of the inliner patch (v3.6) and the function inline accounting patch (v1.2). The v3 patch has been made tunable by making the magic constants parameters (max-inline-insns-single (300), max-inline-insns (600), max-inline-slope (32), min-inline-insns (130)), which are also documented.
For the function inlining accounting, it turned out to adversely affect performance if the RTL inliner (which is run later than the tree inliner) also cuts down inlining of automatically inlined functions. So this has been removed. As the RTL inliner does not account for repeated inlining, this will not result in important leaf functions from being cut off from inlining and therefore should not cause runtime performance problems. Also the effect on compile-time resources of repeated inlining in the RTL inliner is less severe than in the tree inliner.

Meanwhile some more improvements have been made (patch v4). Both patches have been merged and the parameters are all settable via --param. The max-inline-insns-auto limits the size of functions inlined by -finline-functions, whereas --max-inline-insns-rtl and --max-inline-insns-rtl-leaf do limit the inliner for those languages still using the RTL inliner. The -finline-insns=N now sets max-inline-insns-single to N/2, max-inline-insns-auto to N/3, max-inline-insns to N and max-inline-insns-rtl to N and max-inline-insns-rtl-leaf to N*3/2.

In gcc-3.3, the patch v3.x patch has been merged, but unfortunately a preliminary version without documentation and proper initialization of the parameters. This has been reported as bug PR/8387. This is fixed by the v41 patch.
The patch that puts a different limit on the compiler-flag inline functions has also been recreated, but with the same default limit as for inline-declared functions. This should probably be changed. Both patches have been merged into gcc-3.4 CVS as of 2003-03-02.
A patch to use a higher max-inline-insns and initialize max-inline-insns-auto to only a third of it (thus lowering it) can be found below as well. You should see slightly more inlining when compiling with -O2 and your source code makes use of the inline keyword. You should see slightly less inlining if you compile with -O3, but there are no inline keywords (or in class decl implementation for C++) in your code. If you have code where inline is used at the right places, an -O3 compile should yield about the same amount of inlining that you got previously, but it will favour the inline declared functions. It will lower the likelihood that compiler-flag elected functions prevent your inline-declared functions from being inlined due to the recursive throttling. So, if the keyword has been used at the right places, it should improve code quality.

(Up)

Fixing 2.95.3 excessive compilation time

Of course I felt confident now and thought I would just back port the changes to 2.95.3, so also the distributions out there that still use this good old compiler could profit. However, 2.95.x does not have the C++ tree inliner, but just the RTL inliner (gcc/integrate.c), so it's not just a matter of applying the same patch.
The RTL inliner does not have the accounting of recursive inlining, so together with the huge inline-limit of 10000, we do understand the excessive compile time and memory requirements rather easily. (The RTL inliner is just more picky about what can be inlined, so it's not triggered that often.)
So, there remains only three things to do, AFAICS: I implemented the first two and sent a patch with a default inline-limit of 400. Results are a bit mixed, though; some code runs a bit faster, some code runs a bit slower (up to 5%). Here are some results.
If you don not want to risk to have slight performance degradations for some code, you may want to use the patch with a limit of 2000 instead of 400, so you only prevent the worst case C++ recursive inlining abuse from happening. The one that I described at the beginning.
Look at the compilation times.

(Up)

Download

3.0.1 patches
patch v1 (30,60,15)
patch v1.5v1 with ChangeLog
patch v1.6v1 with smaller ChangeLog
patch v2 (30,60,13,0,-1/16)
patch v2.5v2 with texi docu
patch v3(30//60/13,0,-1/32)
-O3 fix v1 Be more restrictive for functions inlined by -finline-functions (aka -O3).
Patch also includes the 2.95.3 stuff (leaf bonus)
3.1 patches
patch v3 (30/60/13,0,-1/32)
patch v3.1 (30/60/13,0,-1/32), v3 with better compliance to GNU style.
patch v3.6 (30/60/13,0,-1/32), v3.1 with tunable (and documented) parameters.
-O3 fix v1 Be more restrictive for functions inlined by -finline-functions (aka -O3).
Patch also includes the 2.95.3 stuff (leaf bonus, -Os tuning)
-O3 fix v1.2 The inline-functions accounting (-O3 fix) v1 with original RTL inlining limits. Fixes some performance regression in v1.
patch v4 This is the merge of both patches. All the parameters are exposed via --param.
3.2 (CVS) patches (against 2002-05-28 CVS)
patch v4 This is the merge of both patches. All the parmeters are exposed via --param. As the v3.x was already in 3.2 CVS, this is a little clean up and the -finline-functions patch.
3.3 (CVS) patches (against 2003-03-01 CVS)
Patch v41, inline parameters Add docu and param initialization that was missed when merging v4 into gcc-3.3 CVS. Fixes PR/8387.
Patch v41, limit auto inlining Patch that accounts compiler-flag inlined functions differently than the one declared inline. More strict limit (240 instead of 300).
3.4 (CVS) patches (against 2003-03-05 CVS)
Lower auto inlining limit Use a higher default for single function and recursive inlining limits, but a slightly lower one for the ones that are not declared inline but elected by -finline-functions/-O3. Ratio is 3/2 now, defaults are recursive/single/auto/min = 840/420/280/130.
2.95.3 patches
aggressive patch (400)
conservative patch (2000)
Note that I also changed fine tuning of the optimize_size (-Os) a bit to yield better results (both smaller and faster executables).
Note that you can always tune the inlining limit manually with -finline-limit-X, where X may be reasonably set to a value between 100 and 5000.

Please send feedback to me and/or to the gcc mailing list.


Note that I used a lot of references in this document and although it may seem so on the first sight, those are not part of this site and I just linked them for your convenience. Read the text from Tim Berners-Lee, the inventor of the WWW, about what a link means. I agree with him.
Disclaimers are these awful unreadable and stupid sentences those lawyer people managed to impose on us.
Browse directory
(w) by KG, last changed 2002-05-28