The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Inlining Functions

Does anyone have experience with this?  I've repeatedly read that modern compilers are better at this than humans, but I'm finding the opposite.

I'm currently writing a performance-critical app in C#, and simply cutting and pasting the function code in place of the calls takes my execution time down from 13 seconds to 8 seconds for 100 million iterations. 

The only variables passed are a uint64 and an int32.  The function code consists of about 12 bitwise operations, a few additions, a switch statement and 3 table lookups.

Is this performance difference .NET specific or is it found in other languages too?
Philip Connors
Saturday, January 13, 2007
 
 
Make sure your functions are inlinable.  For example, a virtual member function generally won't be inlined, because the call may be dispatched to an override in a subclass at run time, so the JIT has to leave a real call in place rather than baking one particular implementation into the caller.
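For what it's worth, a minimal sketch of that distinction (the class and method names are invented here): the JIT can consider inlining the non-virtual call, but the virtual one is dispatched at run time, so it normally stays a real call.

class Hasher
{
    // virtual: may be overridden in a subclass, so the call site can't
    // safely be replaced with this particular body
    public virtual uint MixVirtual(uint v)
    {
        return (v ^ (v >> 16)) * 2654435761u;
    }

    // non-virtual and small: a normal candidate for JIT inlining
    public uint MixDirect(uint v)
    {
        return (v ^ (v >> 16)) * 2654435761u;
    }
}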
JSmith
Saturday, January 13, 2007
 
 
That seems like a large function to me (I'm not surprised that it's not inlined).

I'm also pretty surprised that it makes such a big difference to the execution time: I can't explain that.

The C# optimizer optimizes for size and speed ... I don't notice a compiler option to tell it to optimize for speed at the expense of size.

> Is this performance difference .NET specific or is it found in other languages too?

The native C++ compiler lets you optimize for speed *or* size: inlining makes it faster, at the expense of making it bigger. It gives you other knobs too, like whether to omit stack-frame bookkeeping and exception-handling support, plus various other optimizations.

> I've repeatedly read that modern compilers are better at this than humans

Perhaps that's true at a more micro level: the sequence of opcodes it emits is counter-intuitive to me, so I'm guessing the compiler does that to optimize for the Pentium's pipelining.
Christopher Wells
Sunday, January 14, 2007
 
 
> Is this performance difference .NET specific
> or is it found in other languages too?

This happens sometimes in C++ as well. It's a basic problem: without some kind of profile-guided optimization, the compiler simply doesn't know that one section of code is going to be iterated such a tremendously large number of times, so it doesn't treat that section as a special case and apply things, like very aggressive inlining, that wouldn't pay off everywhere else.

But in Visual C++ you can use the __forceinline keyword to make the compiler inline a particular function; you don't have to cut and paste.
Michael G
Sunday, January 14, 2007
 
 
I can't speak for C#, but in C++ it can definitely make a difference.

Be aware that it may also be a negative difference if you overdo it: over-inlining bloats the code, which can hurt instruction-cache behaviour.

So you need to test it.

As always with optimization:

1. look for places where the same thing is repeatedly done (don't just inline everything willy-nilly and hope for an improvement - that is probably the textbook pathological case)

2. test and measure
Sunil Tanna
Sunday, January 14, 2007
 
 
I wouldn't have thought that adding push+call+ret to such a large function would make such a big difference. Perhaps calling a subroutine in the CLR adds a lot more than just push+call+ret opcodes, but I don't know what (in C++ it might be setting up the stack frame for the debugger's stack walking, or for catch handling), nor whether it's possible to remove that overhead somehow.
Christopher Wells
Sunday, January 14, 2007
 
 
"Does anyone have experience with this?  I've repeatedly read that modern compilers are better at this than humans, but I'm finding the opposite."

You know, it's like the decision to go to assembler, or to get nasty with loop unrolling, etc.  You're just going to have to look at the assembler output, maybe stick in a high-resolution clock (it's a lot harder to calculate cycles these days), and try stuff.  Wash, rinse, repeat.

In general, a lot of things I've done have tended to depend on the CPU and/or compiler for efficient use of the machine.  My default assumption is that x86 compilers, especially Microsoft's, are pretty bad.  OTOH, superscalar RISC stuff (typically GNU tools) is pretty hard to beat.  If you are down to counting cycles in a loop and you have a really good optimizing compiler, it sometimes becomes a matter of just trying stuff and seeing how the tools deal with it.

In any case, if you build a bunch of highly efficient, weirdly written functions/methods, I would recommend making them peers of easy-to-read, slow, straightforward implementations.  Map them in and out with an #ifdef or something.
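A minimal C# sketch of that idea, with invented names (C# has #if/#else/#endif and a /define compiler switch, so the same trick carries over):

static class HotPath
{
#if FAST_PATH
    // hand-tuned, manually inlined version of the hot step
    static uint Step(uint v)
    {
        return (v ^ (v >> 16)) * 2654435761u;
    }
#else
    // slow, obvious reference version, kept as a readable peer
    static uint Step(uint v)
    {
        uint mixed = XorShiftRight(v, 16);
        return Scale(mixed);
    }

    static uint XorShiftRight(uint v, int bits) { return v ^ (v >> bits); }
    static uint Scale(uint v) { return v * 2654435761u; }
#endif
}

Compile with csc /define:FAST_PATH (or without it) to map the fast version in and out.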
old.fart
Sunday, January 14, 2007
 
 
Your function doesn't sound extremely simple, so I'm not really surprised that it isn't inlined.
Make sure (I assume you do this already, but just in case) that you are measuring a Release build (csc /o+) and not running under the debugger, so that inlining at JIT time actually has a chance to happen.

You can try looking for articles by the CLR performance folks (e.g. Rico Mariani, http://blogs.msdn.com/ricom/ ) to see if they cover inlining.
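If it helps, here is a small, hypothetical guard I'd drop into a timing harness to catch the two usual mistakes (assumption: you just want a warning printed, not a hard failure):

using System;
using System.Diagnostics;
using System.Reflection;

static class BenchGuard
{
    // Warn if the JIT is likely to produce unoptimized (non-inlined) code:
    // either a debugger is attached, or the assembly was built without /o+.
    public static void WarnIfUnoptimized(Assembly assembly)
    {
        if (Debugger.IsAttached)
            Console.WriteLine("Warning: debugger attached - the JIT typically disables optimizations.");

        DebuggableAttribute attr = (DebuggableAttribute)Attribute.GetCustomAttribute(
            assembly, typeof(DebuggableAttribute));
        if (attr != null && attr.IsJITOptimizerDisabled)
            Console.WriteLine("Warning: assembly built without /o+ - JIT optimizer disabled.");
    }
}

Call it with BenchGuard.WarnIfUnoptimized(typeof(BenchGuard).Assembly) at the top of Main, before timing anything.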
WildTiger
Monday, January 15, 2007
 
 
Sanity check.  The two steps I'd take first would be: one, re-verify your original measurements, and two, determine whether the difference is really significant to the application.

The measurement stated an improvement from 13 seconds to 8 seconds for 100 million operations.  That corresponds to a change from 0.13 microseconds to 0.08 microseconds per operation, and it also implies that the function call and return are consuming over 38% of your running time.  Given the description of what the function does, that seems like a pretty high ratio.  I would re-verify that the function is really doing all of the operations described and that no extraneous overhead is creeping into the measurement of the non-inlined version.

From a higher level, I would suggest looking at the performance requirements of the overall application.  You are far closer to the work than I am, but it has been my experience that tuning to remove function calls is rarely necessary; there are far too many things that can have a bigger effect on running times.  If the function call and return really do significantly affect the operation of the application, I would suggest that the application is running outside the capabilities of .NET on the target hardware.  Either get faster target hardware (next year's processor speeds will more than make up the difference in execution time), or consider rewriting some or all of your application in another language (pure C++, C, or assembler).

I usually limit my use of inlining to C++, and then only for stylistic reasons, and I would be concerned about an application that is so close to the edge of its performance requirements that switching between inline and non-inline makes a significant difference.

Just my two cents, and as I said, this may not apply to the specific constraints of your particular application.  You are closer to the problem than I am.
Wayne M.
Tuesday, January 16, 2007
 
 
Thanks for all the helpful advice (and link).

The portion I'm optimizing is a calculator using a Monte Carlo method.  The user will use it many times in a session and has to wait for the result, so I'm looking to get it under 3 seconds.

Regarding the function overhead - I'm surprised too.  With all compiler optimizations on, simply calling this test function:

private ulong TestFunction(ulong x, uint y)
{ ulong z = x + 1; return z; }

adds 3.7 seconds of overhead over 100 million iterations vs the inlined version.  The return value and both input values are used in the calling method, so nothing is being optimized away.  This is purely function-call overhead.
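In case it's useful, here is a minimal sketch of the kind of side-by-side measurement I mean (a hypothetical harness, not my real code), using Stopwatch so timer resolution isn't a factor.  Build with csc /o+ and run outside the debugger; depending on what the JIT decides to inline, the two loops may or may not end up identical.

using System;
using System.Diagnostics;

class InlineBench
{
    // same shape as the test function above: a trivial body behind a call
    private ulong TestFunction(ulong x, uint y)
    {
        ulong z = x + 1;
        return z;
    }

    static void Main()
    {
        const int iterations = 100000000;
        InlineBench bench = new InlineBench();
        ulong x = 0;
        uint y = 3;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; ++i)
        {
            x = bench.TestFunction(x, y);    // through the call
        }
        sw.Stop();
        Console.WriteLine("Called : {0} ms (x={1})", sw.ElapsedMilliseconds, x);

        x = 0;
        sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; ++i)
        {
            ulong z = x + 1;                 // body pasted in by hand
            x = z;
        }
        sw.Stop();
        Console.WriteLine("Inlined: {0} ms (x={1})", sw.ElapsedMilliseconds, x);
    }
}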

I'm using a lot of precalculated lookup tables in the calling method, which are small enough to fit in the L1 cache, so my only guess is that the function calls are somehow displacing that cached data and causing cache misses that don't happen with the inlined code.  I don't know how.

Anyway it seems that Microsoft's promises about C# being faster than native C++ aren't entirely accurate.  Time for some interop. :)
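In case anyone's curious, the interop route I have in mind looks roughly like the sketch below (the DLL name, export and signature are invented): the idea is to push the whole inner loop across the managed/native boundary once per run, rather than pay the P/Invoke transition cost 100 million times.

using System;
using System.Runtime.InteropServices;

class NativeKernel
{
    // hypothetical native export: runs the whole Monte Carlo loop in C/C++
    [DllImport("mckernel.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern double RunSimulation(uint iterations, int seed);

    static void Main()
    {
        double result = RunSimulation(100000000, 42);
        Console.WriteLine("Result = {0}", result);
    }
}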
Philip Connors
Tuesday, January 16, 2007
 
 
Chris Nahr posted http://discuss.joelonsoftware.com/default.asp?dotnet.12.439456.9 recently about C++/CLI being faster which might interest you.
Christopher Wells
Tuesday, January 16, 2007
 
 
TestFunction compiles to the following on my machine ...

        private  ulong TestFunction(ulong x, uint y)
        {
            ulong z = x + 1;
00000000  push        edi 
00000001  push        esi 
00000002  sub        esp,8
00000005  mov        dword ptr [esp],ecx
00000008  mov        dword ptr [esp+4],edx
0000000c  cmp        dword ptr ds:[01160978h],0
00000013  je          0000001A
00000015  call        7941FC3E
0000001a  xor        esi,esi
0000001c  xor        edi,edi
0000001e  mov        eax,dword ptr [esp+14h]
00000022  mov        edx,dword ptr [esp+18h]
00000026  add        eax,1
00000029  adc        edx,0
0000002c  mov        esi,eax
0000002e  mov        edi,edx
            return z;
00000030  mov        eax,esi
00000032  mov        edx,edi
00000034  add        esp,8
00000037  pop        esi 
00000038  pop        edi 
00000039  ret        8   

... there's nothing remarkable; 'je' jumps over the 'call' (the 'call' isn't taken).

The inside of the for loop which calls it is simple too, like this ...

            {
                ulong x = TestFunction(2, 3);
00000284  push        0   
00000286  push        2   
00000288  mov        ecx,ebx
0000028a  mov        edx,3
0000028f  call        004F10A0
00000294  nop             
            for (int i = 0; i < 500000000; ++i)
00000295  inc        dword ptr [ebp-18h]
00000298  cmp        dword ptr [ebp-18h],1DCD6500h
0000029f  jl          00000284
            }

On my machine it takes 6.5 seconds to call it 500,000,000 times; or 4 seconds if I declare it as static.

The assembly with /optimize enabled is pretty simple, but not great: for example it's using edi and esi unnecessarily; and even this little function isn't inlined.

> With all compiler optimizations on

There is only one C# compiler optimization option, isn't there?
Christopher Wells
Tuesday, January 16, 2007
 
 
Christopher, that resulting code looks rather bad. Here is what I get with cl /Ox /Oy (full optimization + omit frame pointers):

?TestFunction@@YA_K_KI@Z PROC              ; TestFunction

; 7    :            unsigned __int64 z = x + 1;

    mov eax, DWORD PTR _x$[esp-4]
    mov edx, DWORD PTR _x$[esp]
    add eax, 1
    adc edx, 0

; 8    :            return z;
; 9    : }

    ret 0
?TestFunction@@YA_K_KI@Z ENDP              ; TestFunction
x
Wednesday, January 17, 2007
 
 
Yes, the C# version still has the full function prolog and epilog. It doesn't have any magic extra subroutine calls in the prolog/epilog though, which is what I'd originally feared when I heard "3 seconds". Thinking about it, even just an extra dozen opcodes adds up to something when they're executed a hundred million times.
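As a rough back-of-the-envelope: 3.7 seconds of overhead over 100 million calls is about 37 ns per call, which is somewhere around a hundred clock cycles on current hardware - so it's presumably not just the dozen-odd extra opcodes themselves but also the stack stores and loads (and whatever stalls they cause) that add up.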
Christopher Wells
Wednesday, January 17, 2007
 
 
I just tried the following code ...

class Program
{
    private static ulong TestFunction(ulong x, uint y)
    {
        ulong z = x + y;
        return z;
    }
    static void Main(string[] args)
    {
        ulong x = 0;
        uint y = 1;
        DateTime now = DateTime.Now;
        for (int i = 0; i < 500000000; ++i)
        {
            x = TestFunction(x, y);
        }
        TimeSpan elapsed = DateTime.Now.Subtract(now);
        Console.WriteLine(string.Format("Elapsed={0}, Result={1}", elapsed, x));
    }
}

The elapsed time is 1.68 seconds when run from the command line, or 5.47 seconds when run under the debugger.

I think that when it's run under the debugger, compiler optimizations are disabled, so you can't run from the debugger to see what the optimized assembly looks like. Sorry: I didn't know that about the JIT compiler.

Anyway, if I launch it without a debugger and then break in after it's running, then the assembly looks like this ...

0000004f  xor        esi,esi
00000051  mov        eax,dword ptr [esp+8]
00000055  mov        edx,dword ptr [esp+0Ch]
00000059  add        eax,1
0000005c  adc        edx,0
0000005f  mov        dword ptr [esp+28h],eax
00000063  mov        dword ptr [esp+2Ch],edx
00000067  mov        dword ptr [esp+8],eax
0000006b  mov        dword ptr [esp+0Ch],edx
0000006f  add        esi,1
00000072  cmp        esi,1DCD6500h
00000078  jl          00000051

... i.e. a) it has been inlined, b) there's no prolog, and c) it has replaced 'y' with a hard-coded 1 ... so it's much more optimized than the version above (though it still fails to move the stack-to-register and register-to-stack moves out of the inner loop).
Christopher Wells
Thursday, January 25, 2007
 
 

This topic is archived. No further replies will be accepted.
