Here is the Slang code:

size := self sizeBitsOf: op1.
size = (self sizeBitsOf: op2) ifFalse: [
    ^ false ].

And here is the translated code:

/* begin sizeBitsOf: */
header = longAt(op1);
size = ((header & TypeMask) == HeaderTypeSizeAndClass
    ? (longAt(op1 - (BytesPerWord * 2))) & LongSizeMask
    : header & SizeMask);
if (!(size == (sizeBitsOf(op2)))) {
    return 0;
}

As you can see, it inlines the first call but refuses to inline the second one.

--
Best regards,
Igor Stasenko AKA sig. |
If instead I do it like this:

size := self sizeBitsOf: op1.
sz2 := self sizeBitsOf: op2.
size = sz2 ifFalse: [
    ^ false ].

then it inlines both calls.

On 31 July 2011 05:49, Igor Stasenko <[hidden email]> wrote:
> Here is the Slang code:
>
> size := self sizeBitsOf: op1.
> size = (self sizeBitsOf: op2) ifFalse: [
>     ^ false ].
>
> And here is the translated code:
>
> /* begin sizeBitsOf: */
> header = longAt(op1);
> size = ((header & TypeMask) == HeaderTypeSizeAndClass
>     ? (longAt(op1 - (BytesPerWord * 2))) & LongSizeMask
>     : header & SizeMask);
> if (!(size == (sizeBitsOf(op2)))) {
>     return 0;
> }
>
> As you can see, it inlines the first call but refuses to inline the second one.
>
> --
> Best regards,
> Igor Stasenko AKA sig.

--
Best regards,
Igor Stasenko AKA sig. |
In reply to this post by Igor Stasenko
Hi Igor:

On 31/07/11 05:49, Igor Stasenko wrote:
> Here is the Slang code:
>
> size := self sizeBitsOf: op1.
> size = (self sizeBitsOf: op2) ifFalse: [
>     ^ false ].
>
> And here is the translated code:
>
> /* begin sizeBitsOf: */
> header = longAt(op1);
> size = ((header & TypeMask) == HeaderTypeSizeAndClass
>     ? (longAt(op1 - (BytesPerWord * 2))) & LongSizeMask
>     : header & SizeMask);
> if (!(size == (sizeBitsOf(op2)))) {
>     return 0;
> }
>
> As you can see, it inlines the first call but refuses to inline the second one.

Just out of curiosity, in which kinds of use cases is the inlining behavior of the C compiler not sufficient to rely on, instead of manually inlining such code?

Especially since people like Mike Pall of LuaJIT2 claim that GCC with -O3 inlines too aggressively anyway, which leads to code bloat that does not fit into typical CPU instruction caches and thus slows things down. But since that is just third-hand knowledge, I would like to hear about real experiences.

Best regards
Stefan |
In reply to this post by Igor Stasenko
On Sun, Jul 31, 2011 at 05:49:40AM +0200, Igor Stasenko wrote:
> Here is the Slang code:
>
> size := self sizeBitsOf: op1.
> size = (self sizeBitsOf: op2) ifFalse: [
>     ^ false ].
>
> And here is the translated code:
>
> /* begin sizeBitsOf: */
> header = longAt(op1);
> size = ((header & TypeMask) == HeaderTypeSizeAndClass
>     ? (longAt(op1 - (BytesPerWord * 2))) & LongSizeMask
>     : header & SizeMask);
> if (!(size == (sizeBitsOf(op2)))) {
>     return 0;
> }
>
> As you can see, it inlines the first call but refuses to inline the second one.

It looks like a bug, but it is not related to Cog. I get the same results using VMMaker trunk.

Dave |
In reply to this post by Stefan Marr
On Sun, Jul 31, 2011 at 12:59:21PM +0200, Stefan Marr wrote:
> Just out of curiosity, in which kinds of use cases is the inlining behavior
> of the C compiler not sufficient to rely on, instead of manually
> inlining such code?
>
> Especially since people like Mike Pall of LuaJIT2 claim that GCC
> with -O3 inlines too aggressively anyway, which leads to code bloat that
> does not fit into typical CPU instruction caches and thus slows things down.
> But since that is just third-hand knowledge, I would like to hear about
> real experiences.

The Slang inliner is amazingly effective. The most obvious use case is of course the interpreter itself. Try turning off the Slang inlining, apply all the GCC optimization you want, and you will end up with a painfully slow VM.

As a second use case, which to me was even more convincing, consider the memory access macros in sqMemoryAccess.h. These are written to be as efficient as possible for speed. Then look at the Slang code in the MemoryAccess package on SqueakSource/VMMaker. This is a Slang replacement for the memory access macros. When this package is used, the macros are not used at all, and the memory access code is all Smalltalk down to the lowest possible level.

I found that using the Slang memory access methods, which are fully inlined by the Slang inliner, results in a VM with performance identical to that of the VM with C macros (to the best of my ability to measure it with #tinyBenchmarks). I was extremely surprised by this result, and it tells me that the Slang inliner is really very effective indeed.

Dave |
Hi Dave:

On 31/07/11 15:35, David T. Lewis wrote:
> On Sun, Jul 31, 2011 at 12:59:21PM +0200, Stefan Marr wrote:
>> Just out of curiosity, in which kinds of use cases is the inlining behavior
>> of the C compiler not sufficient to rely on, instead of manually
>> inlining such code?
>>
>> Especially since people like Mike Pall of LuaJIT2 claim that GCC
>> with -O3 inlines too aggressively anyway, which leads to code bloat that
>> does not fit into typical CPU instruction caches and thus slows things down.
>> But since that is just third-hand knowledge, I would like to hear about
>> real experiences.
>
> The Slang inliner is amazingly effective. The most obvious use case
> is of course the interpreter itself. Try turning off the Slang inlining,
> apply all the GCC optimization you want, and you will end up with a
> painfully slow VM.
>
> As a second use case, which to me was even more convincing, consider
> the memory access macros in sqMemoryAccess.h. These are written to be
> as efficient as possible for speed. Then look at the Slang code in
> the MemoryAccess package on SqueakSource/VMMaker. This is a Slang
> replacement for the memory access macros. When this package is used,
> the macros are not used at all, and the memory access code is all
> Smalltalk down to the lowest possible level.
>
> I found that using the Slang memory access methods, which are fully
> inlined by the Slang inliner, results in a VM with performance
> identical to that of the VM with C macros (to the best of my ability
> to measure it with #tinyBenchmarks). I was extremely surprised by this
> result, and it tells me that the Slang inliner is really very effective
> indeed.
>
> Dave

Interesting.

When the inlining isn't done, do the generated C functions have compiler hints like `inline` and `__attribute__ ((always_inline))`, and are they generated into the same compilation unit? I guess the last one is true, since interp.c is probably the most relevant thing here.

Any guess what the reason could be why the C compiler fails to do proper inlining?

Thanks
Stefan |
On Sun, Jul 31, 2011 at 04:38:39PM +0200, Stefan Marr wrote: > > Hi Dave: > > On 31/07/11 15:35, David T. Lewis wrote: > > > >On Sun, Jul 31, 2011 at 12:59:21PM +0200, Stefan Marr wrote: > >>Just out of curiosity, in which kind of use cases is the inline-behavior > >>of the used C compiler not sufficient to rely on, instead of manually > >>inline such C code? > >> > >>Especially, since people like Mike Pall of the LuaJIT2 claim that GCC > >>with -O3 inlines to aggressively anyway which leads to code bloat that > >>does not fit into typical CPU instruction caches and thus slows things > >>down. > >>But since that is just 3rd-hand knowledge, I would like to hear about > >>real experiences. > >The slang inliner is amazingly effective. The most obvious use case > >is of course the interpreter itself. Try turning off the slang inlining, > >apply all the GCC optimization you want, and you will end up with a > >painfully slow VM. > > > >As a second use case, which to me was even more convincing, consider > >the memory access macros in sqMemoryAccess.h. These are written to be > >as efficient as possible for speed. Then look at the slang code in > >the MemoryAccess package on SqueakSource/VMMaker. This is a slang > >replacement for the memory access macros. When this package is used, > >the macros are not used at all, and the memory access code is all > >Smalltalk down to the lowest possible level. > > > >I found that using the slang memory access methods, which are fully > >inlined by the slang inliner, results in a VM with performance > >identical to that of the VM with C macros (to the best of my ability > >to measure it with #tinyBenchmarks). I was extremely surprised by this > >result, and it tells me that the slang inliner is really very effective > >indeed. > Interesting. 
Just for completeness:

> When the inlining isn't done, do the generated C functions have compiler
> hints like `inline` and `__attribute__ ((always_inline))`, and are they
> generated into the same compilation unit? I guess the last one is true,
> since interp.c is probably the most relevant thing here.

No, there would have been no extra compiler hints generated in either case. Yes, for the interpreter proper (and hence the main interpreter loop) there is only a single compilation unit (interp.c).

> Any guess what the reason could be why the C compiler fails to do proper
> inlining?

I cannot really say, and I am not much of an expert on C compilers. For the most part I was just concerned with the Slang inliner itself when I was doing this (it needed some tweaks and fixes before I could get MemoryAccess to work properly). I was very impressed with how well the Slang inliner actually worked in practice, though I cannot say too much about what it might take to get a C compiler to achieve similar results. To be honest, I would not much care about it, given how well the Slang inlining already works, and given that it is generating C code that will work well on almost any compiler. I also like the fact that it is 100% Smalltalk and does not rely on any hidden magic in the external compiler.

Dave |
In reply to this post by Stefan Marr
On 31 July 2011 12:59, Stefan Marr <[hidden email]> wrote:
> Hi Igor:
>
> On 31/07/11 05:49, Igor Stasenko wrote:
>> Here is the Slang code:
>>
>> size := self sizeBitsOf: op1.
>> size = (self sizeBitsOf: op2) ifFalse: [
>>     ^ false ].
>>
>> And here is the translated code:
>>
>> /* begin sizeBitsOf: */
>> header = longAt(op1);
>> size = ((header & TypeMask) == HeaderTypeSizeAndClass
>>     ? (longAt(op1 - (BytesPerWord * 2))) & LongSizeMask
>>     : header & SizeMask);
>> if (!(size == (sizeBitsOf(op2)))) {
>>     return 0;
>> }
>>
>> As you can see, it inlines the first call but refuses to inline the second one.
>
> Just out of curiosity, in which kinds of use cases is the inlining behavior of
> the C compiler not sufficient to rely on, instead of manually inlining
> such code?
>
> Especially since people like Mike Pall of LuaJIT2 claim that GCC with
> -O3 inlines too aggressively anyway, which leads to code bloat that does not
> fit into typical CPU instruction caches and thus slows things down.
> But since that is just third-hand knowledge, I would like to hear about real
> experiences.

I don't have much to say about it. Maybe it is like that because VMMaker has more than 10 years of history, and at the time the inliner was introduced, C compilers' inlining was not as good. As of today, I see little sense in doing it manually, except inside the interpret() function.

> Best regards
> Stefan

--
Best regards,
Igor Stasenko AKA sig. |
In reply to this post by Igor Stasenko
Hi Igor,

this is simply because the inliner can only inline an expression in an expression context. Since sizeBitsOf: is written as multiple statements, it can only be inlined in a statement context. Here's NewObjectMemory>>sizeBitsOf: with the statements numbered:

sizeBitsOf: oop
    "Answer the number of bytes in the given object, including its base header, rounded up to an integral number of words."
    "Note: byte indexable objects need to have low bits subtracted from this size."
    <inline: true>
    | header |
1.  header := self baseHeader: oop.
2.  ^(header bitAnd: TypeMask) = HeaderTypeSizeAndClass
        ifTrue: [(self sizeHeader: oop) bitAnd: LongSizeMask]
        ifFalse: [header bitAnd: SizeMask]

Here's StackInterpreter>>printNameOfClass:count:, with the occurrence of sizeBitsOf: in an expression context:

printNameOfClass: classOop count: cnt
    "Details: The count argument is used to avoid a possible infinite recursion if classOop is a corrupted object."
    <inline: false>
    (classOop = 0 or: [cnt <= 0]) ifTrue: [^self print: 'bad class'].
    ((objectMemory sizeBitsOf: classOop) = metaclassSizeBytes
     and: [metaclassSizeBytes > (thisClassIndex * BytesPerWord)]) "(Metaclass instSize * 4)"
        ifTrue: [self printNameOfClass: (objectMemory fetchPointer: thisClassIndex ofObject: classOop) count: cnt - 1.
                self print: ' class']
        ifFalse: [self printStringOf: (objectMemory fetchPointer: classNameIndex ofObject: classOop)]

which translates to

static void printNameOfClasscount(sqInt classOop, sqInt cnt)
{
    if ((classOop == 0)
     || (cnt <= 0)) {
        print("bad class");
        return;
    }
    if (((sizeBitsOf(classOop)) == GIV(metaclassSizeBytes))
     && (GIV(metaclassSizeBytes) > (GIV(thisClassIndex) * BytesPerWord))) {
        printNameOfClasscount(longAt((classOop + BaseHeaderSize) + (GIV(thisClassIndex) << ShiftForWord)), cnt - 1);
        print(" class");
    }
    else {
        printStringOf(longAt((classOop + BaseHeaderSize) + (GIV(classNameIndex) << ShiftForWord)));
    }
}

But if one changes the definition of sizeBitsOf: to an expression, necessitating two references through oop, to:

sizeBitsOf: oop
    "Answer the number of bytes in the given object, including its base header, rounded up to an integral number of words."
    "Note: byte indexable objects need to have low bits subtracted from this size."
    <inline: true>
    ^((self baseHeader: oop) bitAnd: TypeMask) = HeaderTypeSizeAndClass
        ifTrue: [(self sizeHeader: oop) bitAnd: LongSizeMask]
        ifFalse: [(self baseHeader: oop) bitAnd: SizeMask]

then StackInterpreter>>printNameOfClass:count: translates to

static void printNameOfClasscount(sqInt classOop, sqInt cnt)
{
    if ((classOop == 0)
     || (cnt <= 0)) {
        print("bad class");
        return;
    }
    if ((((((longAt(classOop)) & TypeMask) == HeaderTypeSizeAndClass
     ? (longAt(classOop - (BytesPerWord * 2))) & LongSizeMask
     : (longAt(classOop)) & SizeMask)) == GIV(metaclassSizeBytes))
     && (GIV(metaclassSizeBytes) > (GIV(thisClassIndex) * BytesPerWord))) {
        printNameOfClasscount(longAt((classOop + BaseHeaderSize) + (GIV(thisClassIndex) << ShiftForWord)), cnt - 1);
        print(" class");
    }
    else {
        printStringOf(longAt((classOop + BaseHeaderSize) + (GIV(classNameIndex) << ShiftForWord)));
    }
}

So (David) this isn't so much a bug as a limitation of the inliner. Of course, in an expression context the two-statement version could be translated using a comma expression, as in

static void printNameOfClasscount(sqInt classOop, sqInt cnt)
{
    sqInt header;
    if ((classOop == 0)
     || (cnt <= 0)) {
        print("bad class");
        return;
    }
    if (((header = longAt(classOop),
          ((header & TypeMask) == HeaderTypeSizeAndClass
           ? (longAt(classOop - (BytesPerWord * 2))) & LongSizeMask
           : header & SizeMask)) == GIV(metaclassSizeBytes))
     && (GIV(metaclassSizeBytes) > (GIV(thisClassIndex) * BytesPerWord))) {
        printNameOfClasscount(longAt((classOop + BaseHeaderSize) + (GIV(thisClassIndex) << ShiftForWord)), cnt - 1);
        print(" class");
    }
    else {
        printStringOf(longAt((classOop + BaseHeaderSize) + (GIV(classNameIndex) << ShiftForWord)));
    }
}

and I think the changes I made recently to TStmtListNode>>emitCCodeAsArgumentOn:level:generator: are what's required. You simply have to track down where in the inliner it makes the determination as to whether the expression can be inlined.
HTH
Eliot

On Sat, Jul 30, 2011 at 8:53 PM, Igor Stasenko <[hidden email]> wrote:
-- best, Eliot |
In reply to this post by David T. Lewis
On 31/07/11 17:25, David T. Lewis wrote:
> On Sun, Jul 31, 2011 at 04:38:39PM +0200, Stefan Marr wrote:
>> Any guess what the reason could be why the C compiler fails to do proper
>> inlining?
>
> I cannot really say, and I am not much of an expert on C compilers.
> For the most part I was just concerned with the Slang inliner itself
> when I was doing this (it needed some tweaks and fixes before I could
> get MemoryAccess to work properly). I was very impressed with how well
> the Slang inliner actually worked in practice, though I cannot say too
> much about what it might take to get a C compiler to achieve similar
> results. To be honest, I would not much care about it, given how well
> the Slang inlining already works, and given that it is generating C
> code that will work well on almost any compiler. I also like the fact
> that it is 100% Smalltalk and does not rely on any hidden magic in
> the external compiler.

I wonder whether I would get some percent speedup if I walked over the RoarVM codebase and added some more inline/force_inline hints here and there. Perhaps I should just try it; the only question then is where to start, and where to stop ;)

Thanks
Stefan |
On Sun, Jul 31, 2011 at 11:31 AM, Stefan Marr <[hidden email]> wrote:
The experiment to do is to modify Slang so that it does no inlining itself but adds the inline keyword to all methods marked <inline: true>, and then compare the performance of the C-compiler-inlined code against the Slang-inlined code. However, given that Roar has architectural overheads, I'd also do this experiment on the base interpreter.
-- best, Eliot |