Here is the Slang code:

size := self sizeBitsOf: op1.
size = (self sizeBitsOf: op2) ifFalse: [
    ^ false ].

And here is the translated code:

/* begin sizeBitsOf: */
header = longAt(op1);
size = ((header & TypeMask) == HeaderTypeSizeAndClass
    ? (longAt(op1 - (BytesPerWord * 2))) & LongSizeMask
    : header & SizeMask);
if (!(size == (sizeBitsOf(op2)))) {
    return 0;
}

As you can see, it inlines the first call but refuses to inline the second one.

--
Best regards,
Igor Stasenko AKA sig. |
If instead I do it like this:

size := self sizeBitsOf: op1.
sz2 := self sizeBitsOf: op2.
size = sz2 ifFalse: [
    ^ false ].

then it inlines both calls.

On 31 July 2011 05:49, Igor Stasenko <[hidden email]> wrote:
> Here is the Slang code:
>
> size := self sizeBitsOf: op1.
> size = (self sizeBitsOf: op2) ifFalse: [
>     ^ false ].
>
> And here is the translated code:
>
> /* begin sizeBitsOf: */
> header = longAt(op1);
> size = ((header & TypeMask) == HeaderTypeSizeAndClass
>     ? (longAt(op1 - (BytesPerWord * 2))) & LongSizeMask
>     : header & SizeMask);
> if (!(size == (sizeBitsOf(op2)))) {
>     return 0;
> }
>
> As you can see, it inlines the first call but refuses to inline the second one.
>
> --
> Best regards,
> Igor Stasenko AKA sig.

--
Best regards,
Igor Stasenko AKA sig. |
In reply to this post by Igor Stasenko
Hi Igor:

On 31/07/11 05:49, Igor Stasenko wrote:
> Here is the Slang code:
>
> size := self sizeBitsOf: op1.
> size = (self sizeBitsOf: op2) ifFalse: [
>     ^ false ].
>
> And here is the translated code:
>
> /* begin sizeBitsOf: */
> header = longAt(op1);
> size = ((header & TypeMask) == HeaderTypeSizeAndClass
>     ? (longAt(op1 - (BytesPerWord * 2))) & LongSizeMask
>     : header & SizeMask);
> if (!(size == (sizeBitsOf(op2)))) {
>     return 0;
> }
>
> As you can see, it inlines the first call but refuses to inline the second one.

Just out of curiosity, in which kinds of use cases is the inlining behavior of the C compiler not sufficient to rely on, instead of manually inlining such code?

Especially since people like Mike Pall of LuaJIT2 claim that GCC with -O3 inlines too aggressively anyway, which leads to code bloat that does not fit into typical CPU instruction caches and thus slows things down. But since that is just third-hand knowledge, I would like to hear about real experiences.

Best regards
Stefan |
In reply to this post by Igor Stasenko
On Sun, Jul 31, 2011 at 05:49:40AM +0200, Igor Stasenko wrote:
> Here is the Slang code:
>
> size := self sizeBitsOf: op1.
> size = (self sizeBitsOf: op2) ifFalse: [
>     ^ false ].
>
> And here is the translated code:
>
> /* begin sizeBitsOf: */
> header = longAt(op1);
> size = ((header & TypeMask) == HeaderTypeSizeAndClass
>     ? (longAt(op1 - (BytesPerWord * 2))) & LongSizeMask
>     : header & SizeMask);
> if (!(size == (sizeBitsOf(op2)))) {
>     return 0;
> }
>
> As you can see, it inlines the first call but refuses to inline the second one.

It looks like a bug, but it is not related to Cog. I get the same results using VMMaker trunk.

Dave |
In reply to this post by Stefan Marr
On Sun, Jul 31, 2011 at 12:59:21PM +0200, Stefan Marr wrote:
> Just out of curiosity, in which kinds of use cases is the inlining behavior
> of the C compiler not sufficient to rely on, instead of manually
> inlining such code?
>
> Especially since people like Mike Pall of LuaJIT2 claim that GCC
> with -O3 inlines too aggressively anyway, which leads to code bloat that
> does not fit into typical CPU instruction caches and thus slows things down.
> But since that is just third-hand knowledge, I would like to hear about
> real experiences.

The Slang inliner is amazingly effective. The most obvious use case is of course the interpreter itself. Try turning off the Slang inlining, apply all the GCC optimization you want, and you will end up with a painfully slow VM.

As a second use case, which to me was even more convincing, consider the memory access macros in sqMemoryAccess.h. These are written to be as efficient as possible for speed. Then look at the Slang code in the MemoryAccess package on SqueakSource/VMMaker. This is a Slang replacement for the memory access macros. When this package is used, the macros are not used at all, and the memory access code is all Smalltalk down to the lowest possible level.

I found that using the Slang memory access methods, which are fully inlined by the Slang inliner, results in a VM with performance identical to that of the VM with C macros (to the best of my ability to measure it with #tinyBenchmarks). I was extremely surprised by this result, and it tells me that the Slang inliner is really very effective indeed.

Dave |
Hi Dave:

On 31/07/11 15:35, David T. Lewis wrote:
> On Sun, Jul 31, 2011 at 12:59:21PM +0200, Stefan Marr wrote:
>> Just out of curiosity, in which kinds of use cases is the inlining behavior
>> of the C compiler not sufficient to rely on, instead of manually
>> inlining such code?
>>
>> Especially since people like Mike Pall of LuaJIT2 claim that GCC
>> with -O3 inlines too aggressively anyway, which leads to code bloat that
>> does not fit into typical CPU instruction caches and thus slows things down.
>> But since that is just third-hand knowledge, I would like to hear about
>> real experiences.
>
> The Slang inliner is amazingly effective. The most obvious use case
> is of course the interpreter itself. Try turning off the Slang inlining,
> apply all the GCC optimization you want, and you will end up with a
> painfully slow VM.
>
> As a second use case, which to me was even more convincing, consider
> the memory access macros in sqMemoryAccess.h. These are written to be
> as efficient as possible for speed. Then look at the Slang code in
> the MemoryAccess package on SqueakSource/VMMaker. This is a Slang
> replacement for the memory access macros. When this package is used,
> the macros are not used at all, and the memory access code is all
> Smalltalk down to the lowest possible level.
>
> I found that using the Slang memory access methods, which are fully
> inlined by the Slang inliner, results in a VM with performance
> identical to that of the VM with C macros (to the best of my ability
> to measure it with #tinyBenchmarks). I was extremely surprised by this
> result, and it tells me that the Slang inliner is really very effective
> indeed.
>
> Dave

Interesting.

When the inlining isn't done, do the generated C functions have compiler hints like `inline` and `__attribute__ ((always_inline))`, and are they generated into the same compilation unit? I guess the last one is true, since interp.c is probably the most relevant thing here.

Any guess what the reason could be why the C compiler fails to do proper inlining?

Thanks
Stefan |
On Sun, Jul 31, 2011 at 04:38:39PM +0200, Stefan Marr wrote: > > Hi Dave: > > On 31/07/11 15:35, David T. Lewis wrote: > > > >On Sun, Jul 31, 2011 at 12:59:21PM +0200, Stefan Marr wrote: > >>Just out of curiosity, in which kind of use cases is the inline-behavior > >>of the used C compiler not sufficient to rely on, instead of manually > >>inline such C code? > >> > >>Especially, since people like Mike Pall of the LuaJIT2 claim that GCC > >>with -O3 inlines to aggressively anyway which leads to code bloat that > >>does not fit into typical CPU instruction caches and thus slows things > >>down. > >>But since that is just 3rd-hand knowledge, I would like to hear about > >>real experiences. > >The slang inliner is amazingly effective. The most obvious use case > >is of course the interpreter itself. Try turning off the slang inlining, > >apply all the GCC optimization you want, and you will end up with a > >painfully slow VM. > > > >As a second use case, which to me was even more convincing, consider > >the memory access macros in sqMemoryAccess.h. These are written to be > >as efficient as possible for speed. Then look at the slang code in > >the MemoryAccess package on SqueakSource/VMMaker. This is a slang > >replacement for the memory access macros. When this package is used, > >the macros are not used at all, and the memory access code is all > >Smalltalk down to the lowest possible level. > > > >I found that using the slang memory access methods, which are fully > >inlined by the slang inliner, results in a VM with performance > >identical to that of the VM with C macros (to the best of my ability > >to measure it with #tinyBenchmarks). I was extremely surprised by this > >result, and it tells me that the slang inliner is really very effective > >indeed. > Interesting. 
Just for completeness:

> When the inlining isn't done, do the generated C functions have compiler
> hints like `inline` and `__attribute__ ((always_inline))`, and are they
> generated into the same compilation unit? I guess the last one is true,
> since interp.c is probably the most relevant thing here.

No, there would have been no extra compiler hints generated in either case. Yes, for the interpreter proper (and hence the main interpreter loop) there is only a single compilation unit (interp.c).

> Any guess what the reason could be why the C compiler fails to do proper
> inlining?

I cannot really say, and I am not much of an expert on C compilers. For the most part I was just concerned with the Slang inliner itself when I was doing this (it needed some tweaks and fixes before I could get MemoryAccess to work properly). I was very impressed with how well the Slang inliner actually worked in practice, though I cannot say too much about what it might take to get a C compiler to achieve similar results. To be honest, I would not much care about it, given how well the Slang inlining already works, and given that it is generating C code that will work well on almost any compiler. I also like the fact that it is 100% Smalltalk and does not rely on any hidden magic in the external compiler.

Dave |
In reply to this post by Stefan Marr
On 31 July 2011 12:59, Stefan Marr <[hidden email]> wrote:
> Hi Igor:
>
> On 31/07/11 05:49, Igor Stasenko wrote:
>> Here is the Slang code:
>>
>> size := self sizeBitsOf: op1.
>> size = (self sizeBitsOf: op2) ifFalse: [
>>     ^ false ].
>>
>> And here is the translated code:
>>
>> /* begin sizeBitsOf: */
>> header = longAt(op1);
>> size = ((header & TypeMask) == HeaderTypeSizeAndClass
>>     ? (longAt(op1 - (BytesPerWord * 2))) & LongSizeMask
>>     : header & SizeMask);
>> if (!(size == (sizeBitsOf(op2)))) {
>>     return 0;
>> }
>>
>> As you can see, it inlines the first call but refuses to inline the second one.
>
> Just out of curiosity, in which kinds of use cases is the inlining behavior of
> the C compiler not sufficient to rely on, instead of manually inlining
> such code?
>
> Especially since people like Mike Pall of LuaJIT2 claim that GCC with
> -O3 inlines too aggressively anyway, which leads to code bloat that does not
> fit into typical CPU instruction caches and thus slows things down.
> But since that is just third-hand knowledge, I would like to hear about real
> experiences.

I don't have much to say about it. Maybe it is like that because VMMaker has more than 10 years of history, and at the time the inliner was introduced, C compilers' inlining was not as good. As of today, I see little sense in doing it manually, except inside the interpret() function.

> Best regards
> Stefan

--
Best regards,
Igor Stasenko AKA sig. |
In reply to this post by Igor Stasenko
Hi Igor,

this is simply because the inliner can only inline an expression in an expression context. Since sizeBitsOf: is written as multiple statements, it can only be inlined in a statement context. Here's NewObjectMemory>>sizeBitsOf: with the statements numbered:

sizeBitsOf: oop
    "Answer the number of bytes in the given object, including its base header, rounded up to an integral number of words."
    "Note: byte indexable objects need to have low bits subtracted from this size."
    <inline: true>
    | header |
1.  header := self baseHeader: oop.
2.  ^(header bitAnd: TypeMask) = HeaderTypeSizeAndClass
        ifTrue: [(self sizeHeader: oop) bitAnd: LongSizeMask]
        ifFalse: [header bitAnd: SizeMask]

Here's StackInterpreter>>printNameOfClass:count:, with the occurrence of sizeBitsOf: in an expression context:

printNameOfClass: classOop count: cnt
    "Details: The count argument is used to avoid a possible infinite recursion if classOop is a corrupted object."
    <inline: false>
    (classOop = 0 or: [cnt <= 0]) ifTrue: [^self print: 'bad class'].
    ((objectMemory sizeBitsOf: classOop) = metaclassSizeBytes
     and: [metaclassSizeBytes > (thisClassIndex * BytesPerWord)]) "(Metaclass instSize * 4)"
        ifTrue: [self printNameOfClass: (objectMemory fetchPointer: thisClassIndex ofObject: classOop) count: cnt - 1.
                self print: ' class']
        ifFalse: [self printStringOf: (objectMemory fetchPointer: classNameIndex ofObject: classOop)]

which translates to

static void printNameOfClasscount(sqInt classOop, sqInt cnt)
{
    if ((classOop == 0)
     || (cnt <= 0)) {
        print("bad class");
        return;
    }
    if (((sizeBitsOf(classOop)) == GIV(metaclassSizeBytes))
     && (GIV(metaclassSizeBytes) > (GIV(thisClassIndex) * BytesPerWord))) {
        printNameOfClasscount(longAt((classOop + BaseHeaderSize) + (GIV(thisClassIndex) << ShiftForWord)), cnt - 1);
        print(" class");
    }
    else {
        printStringOf(longAt((classOop + BaseHeaderSize) + (GIV(classNameIndex) << ShiftForWord)));
    }
}

But if one changes the definition of sizeBitsOf: to an expression, necessitating two references through oop, to:

sizeBitsOf: oop
    "Answer the number of bytes in the given object, including its base header, rounded up to an integral number of words."
    "Note: byte indexable objects need to have low bits subtracted from this size."
    <inline: true>
    ^((self baseHeader: oop) bitAnd: TypeMask) = HeaderTypeSizeAndClass
        ifTrue: [(self sizeHeader: oop) bitAnd: LongSizeMask]
        ifFalse: [(self baseHeader: oop) bitAnd: SizeMask]

then StackInterpreter>>printNameOfClass:count: translates to

static void printNameOfClasscount(sqInt classOop, sqInt cnt)
{
    if ((classOop == 0)
     || (cnt <= 0)) {
        print("bad class");
        return;
    }
    if ((((((longAt(classOop)) & TypeMask) == HeaderTypeSizeAndClass
     ? (longAt(classOop - (BytesPerWord * 2))) & LongSizeMask
     : (longAt(classOop)) & SizeMask)) == GIV(metaclassSizeBytes))
     && (GIV(metaclassSizeBytes) > (GIV(thisClassIndex) * BytesPerWord))) {
        printNameOfClasscount(longAt((classOop + BaseHeaderSize) + (GIV(thisClassIndex) << ShiftForWord)), cnt - 1);
        print(" class");
    }
    else {
        printStringOf(longAt((classOop + BaseHeaderSize) + (GIV(classNameIndex) << ShiftForWord)));
    }
}

So (David) this isn't so much a bug as a limitation of the inliner. Of course, in an expression context the two-statement version could be translated using a comma expression, as in

static void printNameOfClasscount(sqInt classOop, sqInt cnt)
{
    sqInt header;
    if ((classOop == 0)
     || (cnt <= 0)) {
        print("bad class");
        return;
    }
    if (((header = longAt(classOop),
          ((header & TypeMask) == HeaderTypeSizeAndClass
           ? (longAt(classOop - (BytesPerWord * 2))) & LongSizeMask
           : header & SizeMask)) == GIV(metaclassSizeBytes))
     && (GIV(metaclassSizeBytes) > (GIV(thisClassIndex) * BytesPerWord))) {
        printNameOfClasscount(longAt((classOop + BaseHeaderSize) + (GIV(thisClassIndex) << ShiftForWord)), cnt - 1);
        print(" class");
    }
    else {
        printStringOf(longAt((classOop + BaseHeaderSize) + (GIV(classNameIndex) << ShiftForWord)));
    }
}

and I think the changes I made recently to TStmtListNode>>emitCCodeAsArgumentOn:level:generator: are what's required. You simply have to track down where in the inliner it makes the determination as to whether the expression can be inlined.
HTH
Eliot

On Sat, Jul 30, 2011 at 8:53 PM, Igor Stasenko <[hidden email]> wrote:
-- best, Eliot |
In reply to this post by David T. Lewis
On 31/07/11 17:25, David T. Lewis wrote:
> On Sun, Jul 31, 2011 at 04:38:39PM +0200, Stefan Marr wrote:
>> Any guess what the reason could be why the C compiler fails to do proper
>> inlining?
>
> I cannot really say, and I am not much of an expert on C compilers.
> For the most part I was just concerned with the Slang inliner itself
> when I was doing this (it needed some tweaks and fixes before I could
> get MemoryAccess to work properly). I was very impressed with how well
> the Slang inliner actually worked in practice, though I cannot say too
> much about what it might take to get a C compiler to achieve similar
> results. To be honest, I would not much care about it, given how well
> the Slang inlining already works, and given that it is generating C
> code that will work well on almost any compiler. I also like the fact
> that it is 100% Smalltalk and does not rely on any hidden magic in
> the external compiler.

I wonder whether I would get some percent speedup if I walked over the RoarVM codebase and added some more inline/force_inline hints here and there. Perhaps I should just try it; the only question then is where to start, and where to stop ;)

Thanks
Stefan |
On Sun, Jul 31, 2011 at 11:31 AM, Stefan Marr <[hidden email]> wrote:
The experiment to do is to modify Slang so that it does no inlining itself but adds the inline keyword to all methods marked <inline: true>, and then compare the performance of the C-compiler-inlined code against the Slang-inlined code. However, given that Roar has architectural overheads, I'd also do this experiment on the base interpreter.
-- best, Eliot |