primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Levente Uzonyi
 
Hi Nicolas,

It's been a while since I optimized C programs, but I'm pretty sure function
calls cost a lot compared to a few direct instructions (e.g.
isIntegerObject).
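
A minimal sketch of the kind of "few direct instructions" in question, assuming a tagging scheme where the low bit of an oop marks an immediate SmallInteger. The type name and the functions with the Sketch suffix are hypothetical stand-ins, not the VM's actual definitions:

```c
#include <stdint.h>

/* Hypothetical oop type: a machine word that is either a tagged
 * integer or a pointer (assumption, not the VM's real typedef). */
typedef intptr_t oopSketch;

/* Assumption: low bit set means "immediate SmallInteger".
 * When inlined, this is a single AND/test instruction. */
static inline int isIntegerObjectSketch(oopSketch oop)
{
    return (int)(oop & 1);
}

/* Tag a C integer: shift left by one and set the tag bit.
 * The shift is done on the unsigned type to avoid undefined
 * behaviour for negative values. */
static inline oopSketch integerObjectOfSketch(intptr_t value)
{
    return (oopSketch)(((uintptr_t)value << 1) | 1);
}

/* Untag: relies on arithmetic right shift of a negative signed value,
 * which is implementation-defined in C but is what mainstream
 * compilers do. */
static inline intptr_t integerValueOfSketch(oopSketch oop)
{
    return oop >> 1;
}
```

If the same test has to go through a function pointer in another compilation unit, the call, return and argument passing easily outweigh the one or two instructions doing the real work.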

Levente

Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Nicolas Cellier
 
AFAICT, cDigitCompare:with:len: is inlined by clang.
The interpreterProxy messages are not inlined, but that's not really surprising, since they live in different compilation units.
The procedure I used for checking was:

vim ../common/Makefile.flags (add option -S to CFLAGS)
touch ../../src/plugins/LargeIntegers/LargeIntegers.c
mvm -f
less build/LargeIntegers/LargeIntegers.o

And you are certainly right: compared to a few bit ops, the function call/return/stack handling is expensive.
So inlining should make a measurable difference for "small" large integers.
I was biased by giant integers, which are more what I'm after (tight loops).

Nicolas


Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Nicolas Cellier
 


2016-04-17 22:43 GMT+02:00 Nicolas Cellier <[hidden email]>:
AFAICT, cDigitCompare:with:len: is inlined by clang.
But the interpreterProxy messages are not inlined, but it's not really amazing, these are different units of compilation.

Ah, and this is the purpose of -flto (link-time optimization). Sorry for being slow myself ;)

Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Nicolas Cellier
 


2016-04-17 22:51 GMT+02:00 Nicolas Cellier <[hidden email]>:


2016-04-17 22:43 GMT+02:00 Nicolas Cellier <[hidden email]>:
AFAICT, cDigitCompare:with:len: is inlined by clang.
But the interpreterProxy messages are not inlined, but it's not really amazing, these are different units of compilation.

Ah, and this is the purpose of -flto, link-time-optimization, sorry for being slow myself ;)
 

Hmm, I'm slow, but is -flto stable?
If I add it to CFLAGS and to LDFLAGS, then I cannot compile Spur on Mac, and I cannot easily decode the error message:

0  0x108ee297e  __assert_rtn + 144
1  0x108f709d8  ld::tool::HeaderAndLoadCommandsAtom<x86>::sectionFlags(ld::Internal::FinalSection*) const + 782
2  0x108f703a5  ld::tool::HeaderAndLoadCommandsAtom<x86>::copySegmentLoadCommands(unsigned char*) const + 955
3  0x108f6f58c  ld::tool::HeaderAndLoadCommandsAtom<x86>::copyRawContent(unsigned char*) const + 146
4  0x108f46e0f  ld::tool::OutputFile::writeAtoms(ld::Internal&, unsigned char*) + 465
5  0x108f3fdda  ld::tool::OutputFile::writeOutputFile(ld::Internal&) + 822
6  0x108f39e94  ld::tool::OutputFile::write(ld::Internal&) + 178
7  0x108ee38fa  main + 1311
A linker snapshot was created at:
    /tmp/Squeak-2016-03-17-230136.ld-snapshot
ld: Assertion failed: (0 && "typeTempLTO should not make it to final linked image"), function sectionFlags, file /Library/Caches/com.apple.xbs/Sources/ld64/ld64-264.3.101/src/ld/HeaderAndLoadCommands.hpp, line 780.
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [build/vm/Squeak] Error 1



Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Levente Uzonyi
 
You have to add -flto to both CFLAGS and LDFLAGS.

I tried to compile it on Ubuntu 14.04, but there's some problem with
autoconf. The -flto flag probably optimizes something away, so the
script that detects libdl fails. Instead, it sets the HAVE_DYLD flag,
which is Mac only, and sqUnixExternalPrims.c won't compile.

Levente

Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Levente Uzonyi
 
On Mon, 18 Apr 2016, Levente Uzonyi wrote:

>
> You have to add -flto to both CFLAGS and LDFLAGS.

Never mind. You did.

Levente


Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Levente Uzonyi
In reply to this post by Levente Uzonyi
 
With a bit of manual tweaking I managed to build a VM using the -flto
flag (from the 3666 sources). gcc managed to inline the functions stackValue,
failed, isIntegerObject and integerObjectOf, but did not inline success,
isKindOf, popThenPush, integerValueOf and digitCompareLargewith.

The overall speedup was ~10% for SmallIntegers and ~4-5% for
equal 64-bit LargeIntegers, and ~13% for unequal ones (no bytes match).
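
For illustration, a minimal sketch of a digit comparison loop, assuming the digits are stored as bytes with the least significant byte first and that both operands have the same length. The function name is hypothetical, and this is not the plugin's generated code:

```c
#include <stddef.h>

/* Compare two equal-length byte-digit sequences, most significant
 * digit first. Returns 1 if a > b, -1 if a < b, 0 if they are equal.
 * Assumption: digits are little-endian (index len-1 is most significant). */
static int digitCompareSketch(const unsigned char *a,
                              const unsigned char *b,
                              size_t len)
{
    while (len-- > 0) {
        if (a[len] != b[len]) {
            /* First differing digit decides the result. */
            return a[len] > b[len] ? 1 : -1;
        }
    }
    return 0; /* every digit matched */
}
```

With a loop of this shape, operands whose top bytes already differ return after one iteration, so the fixed call overhead around the loop makes up a larger share of the total time than it does when all eight bytes have to be compared.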

I added -flto to CFLAGS and -Wl,-O1 -Wl,-flto to LDFLAGS.

Levente


Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Eliot Miranda-2
 
Hi Levente,

On Mon, Apr 18, 2016 at 8:39 AM, Levente Uzonyi <[hidden email]> wrote:

With a bit of manual tweaking I managed to build a vm using the -flto flag (from the 3666 sources). gcc managed to inline stackValue, failed, isIntegerObject, integerObjectOf, but failed to do the same for success, isKindOf, popThenPush, integerValueOf and digitCompareLargewith.

The overall speedup was ~10% for SmallIntegers and ~4-5% for equal 64-bit LargeIntegers, and ~13% for unequal ones (no bytes match).

I added -flto to CFLAGS and -Wl,-O1 -Wl,-flto to LDFLAGS.

Excuse me for being lazy and asking you rather than looking, but is there a way to specify the set of functions considered?  It would be nice to only apply link-time optimisation to the plain primitive API.  For debugging I'd rather /not/ apply link-time optimisation to the interface between the CoInterpreter and the Cogit.  But the plugin API is performance-critical enough that I'll put up with reduced debuggability for that part of the system.


--
_,,,^..^,,,_
best, Eliot

Re: primitiveDigitCompare is slow (was: Re: [squeak-dev] The Inbox: Kernel-dtl.1015.mcz)

Levente Uzonyi
 
Hi Eliot,

Unfortunately that's not easily doable, but files compiled without the
-flto flag will not be subject to link-time optimization. So, one way to
achieve what you want is to compile the functions you want optimized in a
separate file using the -flto flag and the rest in another file without it.
This rule applies to every function you want to be subject to link-time
optimization, whether it is the one you want to inline into another function
or the one you want to inline another function into.
Also, the -flto flag doesn't work well with -g, so it shouldn't be a
problem during debugging, if that was what you meant by debugging.
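
A different, per-function technique (not the per-file approach described above) is the noinline function attribute that GCC and Clang provide. The sketch below is only an illustration under that assumption; the macro and all function names are hypothetical and do not come from the VM sources:

```c
/* GCC and Clang support __attribute__((noinline)); on other
 * compilers this hypothetical macro expands to nothing. */
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_NOINLINE __attribute__((noinline))
#else
#  define SKETCH_NOINLINE
#endif

/* Hypothetical stand-in for a function on a boundary that should
 * stay a real call so it remains easy to step through in a debugger. */
SKETCH_NOINLINE int debugBoundarySketch(int x)
{
    return x + 1;
}

/* Hypothetical stand-in for a small, hot helper that is left
 * eligible for inlining. */
static inline int hotHelperSketch(int x)
{
    return x * 2;
}

/* The helper may be folded into this caller, while the call to
 * debugBoundarySketch is kept as a call. */
int callerSketch(int x)
{
    return debugBoundarySketch(hotHelperSketch(x));
}
```

This marks individual functions rather than whole files, at the cost of compiler-specific annotations in the source.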

You can find the details here:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-flto-882

Levente
