Performance of primitiveFailFor: and use of primFailCode


Performance of primitiveFailFor: and use of primFailCode

David T. Lewis
 
I have been trying to gradually update trunk VMMaker to better align
with oscog VMMaker (an admittedly slow process, but hopefully still
worthwhile).  I have gotten the interpreter primitives moved into class
InterpreterPrimitives and verified no changes to generated code. This
greatly reduces the clutter in class Interpreter, so it's a nice change
I think.

My next step was to update all of the primitives to use the #primitiveFailFor:
idiom, in which the successFlag variable is replaced with primFailCode
(integer value, 0 for success, 1, 2, 3... for failure codes). This would
get us closer to the point where the standard interpreter and stack/cog
would use a common set of primitives. A lot of changes were required for
this, but the resulting VM works fine ... except for performance.

On a standard interpreter, use of primFailCode seems to result in a
nearly 12% reduction in bytecode performance as measured by tinyBenchmarks:

Standard interpreter (using successFlag):
  0 tinyBenchmarks. '439108061 bytecodes/sec; 15264622 sends/sec'
  0 tinyBenchmarks. '433164128 bytecodes/sec; 14740358 sends/sec'
  0 tinyBenchmarks. '445993031 bytecodes/sec; 15040691 sends/sec'
  0 tinyBenchmarks. '440999138 bytecodes/sec; 15052960 sends/sec'
  0 tinyBenchmarks. '445993031 bytecodes/sec; 14485815 sends/sec'

After updating the standard interpreter (using primFailCode):
  0 tinyBenchmarks. '393241167 bytecodes/sec; 14066256 sends/sec'
  0 tinyBenchmarks. '392036753 bytecodes/sec; 15040691 sends/sec'
  0 tinyBenchmarks. '393846153 bytecodes/sec; 14272953 sends/sec'
  0 tinyBenchmarks. '400625978 bytecodes/sec; 14991818 sends/sec'
  0 tinyBenchmarks. '393846153 bytecodes/sec; 15176750 sends/sec'

This is a much larger performance difference than I expected to see.
Actually I expected no measurable difference at all, and I was just
testing to verify this. But 12% is a lot, so I want to ask if I'm
missing something?

The changes to generated code generally take the form of:

Testing success status, original:
        if (successFlag) { ... }

Testing success status, new:
        if (foo->primFailCode == 0) { ... }

Setting failure status, original:
        successFlag = 0;

Setting failure status, new:
        if (foo->primFailCode == 0) {
                foo->primFailCode = 1;
        }

My approach to doing the updates was as follows:
- Replace all occurrences of "successFlag := true" with "self initPrimCall",
  which initializes primFailCode to 0.
- Replace all "successFlag := false" with "self primitiveFail".
- Replace all "successFlag ifTrue: [] ifFalse: []" with
  "self successful ifTrue: [] ifFalse: []".
- Update #primitiveFail, #failed and #success: to use primFailCode rather
  than successFlag.
- Remove successFlag variable.
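For concreteness, the two status-handling schemes can be written out as a small compilable C sketch (the struct and function names here are illustrative, not the ones in the generated interp.c):

```c
#include <assert.h>

/* Illustrative stand-in for the generated interpreter globals; the real
   struct in interp.c has many more fields. */
struct vmGlobals {
    int successFlag;   /* old scheme: nonzero means success */
    int primFailCode;  /* new scheme: 0 = success, 1, 2, 3... = failure codes */
};

/* Old scheme: failing a primitive is a single unconditional store. */
static void failOld(struct vmGlobals *foo) {
    foo->successFlag = 0;
}

/* New scheme: only the first failure code is recorded, so failing is a
   load, a compare, a branch, and (possibly) a store. */
static void failNew(struct vmGlobals *foo) {
    if (foo->primFailCode == 0) {
        foo->primFailCode = 1;
    }
}

static int succeededOld(struct vmGlobals *foo) { return foo->successFlag; }
static int succeededNew(struct vmGlobals *foo) { return foo->primFailCode == 0; }
```

Note that the old scheme sets failure with a plain store while the new one is a read-compare-branch sequence, so some extra cost in hot paths is at least plausible.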

Obviously I don't want to publish the code on SqS/VMMaker, but I can mail
an interp.c if anyone wants to see the gory details (It is too large to
post on this mailing list though).

Any advice appreciated. I suspect I'm missing something basic here.

Thanks,
Dave


Re: Performance of primitiveFailFor: and use of primFailCode

stephane ducasse-2

Thanks A LOT for your effort!!!

Stef



Re: Performance of primitiveFailFor: and use of primFailCode

David T. Lewis
In reply to this post by David T. Lewis
 
A correction to the code that I quoted in my original message: the
generated code before and after the change looks like this (sorry, I
forgot a "foo"):
 
  Testing success status, original:
  if (foo->successFlag) { ... }
 
  Testing success status, new:
  if (foo->primFailCode == 0) { ... }
 
  Setting failure status, original:
  foo->successFlag = 0;
 
  Setting failure status, new:
  if (foo->primFailCode == 0) {
          foo->primFailCode = 1;
  }

Dave



Re: Performance of primitiveFailFor: and use of primFailCode

Eliot Miranda-2
In reply to this post by David T. Lewis
 
Hi David,

    the difference looks to me to be due to the fact that successFlag is flat and primErrorCode is in the VM struct.  Try generating a VM where either primFailCode is also flat or, better still, all variables are flat.  In my experience the flat form is faster on x86 (and faster with both the Intel and gcc compilers; not tested with llvm yet).  BTW, if you use the Cog generator it will generate accesses to variables that might be in the VM struct as GIV(theVariableInQuestion), where GIV stands for "global interpreter variable".  This allows one to choose whether these variables are kept in a struct or as separate variables at compile time rather than at generation time, as controlled by the USE_GLOBAL_STRUCT compile-time constant, e.g. gcc -DUSE_GLOBAL_STRUCT=0 gcc3x-interp.c.
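The GIV() indirection can be sketched roughly as follows (a toy illustration, not the real generated header; primitiveHasFailed and the struct layout are made-up names):

```c
#include <assert.h>

/* Toy sketch of the Cog-style GIV() indirection: the same source line
   compiles to either a struct access or a plain global depending on
   USE_GLOBAL_STRUCT, chosen at compile time. */
#define USE_GLOBAL_STRUCT 1

#if USE_GLOBAL_STRUCT
static struct { int primFailCode; } gStruct;
# define GIV(var) (gStruct.var)
#else
static int primFailCode;
# define GIV(var) (var)
#endif

/* Hypothetical accessor; expands correctly under either setting. */
static int primitiveHasFailed(void) {
    return GIV(primFailCode) != 0;
}
```

Flipping USE_GLOBAL_STRUCT to 0 changes every GIV() access from an indirect struct reference to a flat global without touching the generated source.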

HTH
Eliot




Re: Performance of primitiveFailFor: and use of primFailCode

David T. Lewis
 
On Mon, May 23, 2011 at 01:44:48PM -0700, Eliot Miranda wrote:

>  
> Hi David,
>
>     the difference looks to me to do with the fact that successFlag is flat
> and primErrorCode is in the VM struct.  Try generating a VM where either
> primFailCode is also flat or, better still, all variables are flat.  In my
> experience the flat form is faster on x86 (and faster with both the intel
> and gcc compilers; not tested with llvm yet).  BTW, if you use the Cog
> generator it'll generate accesses to variables which might be in the VM
> struct as GIV(theVariableInQuestion) (where GIV stands for global
> interpreter variable), and this allows one to choose whether these variables
> are kept in a struct or kept as separate variables at compile-time instead
> of generation time, as controlled by the USE_GLOBAL_STRUCT compile-time
> constant, e.g. gcc -DUSE_GLOBAL_STRUCT=0 gcc3x-interp.c.

Eliot,

Thanks, and I have to apologize because I quoted the code incorrectly
in my original message. The generated code before and after the change
actually looks like this (sorry I forgot the "foo"):
 
  Testing success status, original:
        if (foo->successFlag) { ... }
 
  Testing success status, new:
        if (foo->primFailCode == 0) { ... }
 
  Setting failure status, original:
        foo->successFlag = 0;
 
  Setting failure status, new:
        if (foo->primFailCode == 0) {
                foo->primFailCode = 1;
        }

So in each case the global struct is being used, both for successFlag
and primFailCode. Sorry for the confusion. In any case, I'm still left
scratching my head over the size of the performance difference.

Dave



Re: Performance of primitiveFailFor: and use of primFailCode

Eliot Miranda-2
 


On Mon, May 23, 2011 at 2:08 PM, David T. Lewis <[hidden email]> wrote:

> Thanks, and I have to apologize because I quoted the code incorrectly
> in my original message. The generated code before and after the change
> actually looks like this (sorry I forgot the "foo"):

Ah, ok.

> So in each case the global struct is being used, both for successFlag
> and primFailCode. Sorry for the confusion. In any case, I'm still left
> scratching my head over the size of the performance difference.


One thought: where are successFlag and primFailCode in the struct?  Perhaps the size of the offset needed to access them makes a difference?
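One way to check that, given the generated struct definition, is to assert the field offsets directly with offsetof (the layout below is hypothetical, for illustration only):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout; the real field order is whatever the code
   generator emits into interp.c. */
struct vmGlobals {
    int  primFailCode;   /* offset 0: addressable through the base pointer alone */
    long stackPointer;
    long framePointer;
};
```

Fields at offset 0 can be addressed through the base pointer with no displacement; fields further into the struct need a displacement in the x86 addressing mode, which can cost extra encoding bytes.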


Dave




Re: Performance of primitiveFailFor: and use of primFailCode

David T. Lewis
 
On Mon, May 23, 2011 at 02:33:52PM -0700, Eliot Miranda wrote:

> One thought, where are successFlag and primFailCode in the struct?  Perhaps
> the size of the offset needed to access them makes a difference?

In both cases they are the first element of the struct, so that
cannot be it.

I think I had better circle back and redo my tests. Maybe I made
a mistake somewhere.

Thanks,
Dave


Re: Performance of primitiveFailFor: and use of primFailCode

David T. Lewis
 
On Mon, May 23, 2011 at 07:30:09PM -0400, David T. Lewis wrote:

>
> In both cases they are the first element of the struct, so that
> cannot be it.
>
> I think I had better circle back and redo my tests. Maybe I made
> a mistake somewhere.
>
No mistake, the performance problem was real.

Good news - I found the cause. Better news - this may be good for a
performance boost on StackVM and possibly Cog also.

The performance hit was due almost entirely to InterpreterPrimitives>>failed,
and perhaps a little bit to #successful and #success: also.

The issue with #failed is the expression "^primFailCode ~= 0", which, for
purposes of C translation, can be recoded as "^primFailCode", with an
override in the simulator as "^primFailCode ~= 0". This produces a
significant speed improvement, making the VM at least as fast as the
original interpreter implementation using successFlag.
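In generated-C terms, the change amounts to something like this sketch (gStruct and the function names are illustrative, not the generated ones):

```c
#include <assert.h>

static struct { int primFailCode; } gStruct;
#define foo (&gStruct)

/* Before the fix: #failed as "^primFailCode ~= 0" translates with an
   explicit comparison at every call site. */
static int failedBefore(void) { return foo->primFailCode != 0; }

/* After the fix: "^primFailCode" translates to a bare load, relying on
   C's nonzero-is-true convention; the "~= 0" comparison survives only
   in the simulator override. */
static int failedAfter(void) { return foo->primFailCode; }
```

Both forms behave identically in a boolean context; the point of the change is the shape of the code the C compiler sees on the hot paths.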

I expect that the same change applied to StackInterpreter may give a similar
10% improvement (though I have not tried it). I don't know what to expect
with Cog, but it may give a boost there as well.

Changes attached, also included in VMMaker-dtl.237 on SqueakSource.

Dave


Attachment: InterpreterPlugins-primitiveFailFor-speedup-dtl.1.cs (3K)

Re: Performance of primitiveFailFor: and use of primFailCode

Igor Stasenko
 


added to http://code.google.com/p/cog/issues/detail?id=45

it is strange that such a small detail could make so much difference in speed.

--
Best regards,
Igor Stasenko AKA sig.

Re: Performance of primitiveFailFor: and use of primFailCode

David T. Lewis
 
On Tue, May 24, 2011 at 10:46:02AM +0200, Igor Stasenko wrote:

>  
> >
> > No mistake, the performance problem was real.
> >
> > Good news - I found the cause. Better news - this may be good for a
> > performance boost on StackVM and possibly Cog also.
> >
> > The performance hit was due almost entirely to InterpreterPrimitives>>failed,
> > and perhaps a little bit to #successful and #success: also.
> >
> > This issue with #failed is due to "^primFailCode ~= 0" which, for purposes
> > of C translation, can be recoded as "^primFailCode" with an override in
> > the simulator as "^primFailCode ~= 0". This produces a significant speed
> > improvement, at least as fast as for the original interpreter implementation
> > using successFlag.
> >
> > I expect that the same change applied to StackInterpreter may give a similar
> > 10% improvement (though I have not tried it). I don't know what to expect
> > with Cog, but it may give a boost there as well.
> >
> > Changes attached, also included in VMMaker-dtl.237 on SqueakSource.
> >
> > Dave
> >
> >
> >
>
> added to http://code.google.com/p/cog/issues/detail?id=45

Thanks Igor.

>
> it is strange that such a small detail could make so much difference in speed.

Yes, I was very surprised to see it also. It will be interesting to see
if it has a similar effect for StackInterpreter. I probably will not have
time to check this for a while, so if you try it please let us know
what you find.

Dave
 

Re: Performance of primitiveFailFor: and use of primFailCode

Igor Stasenko

On 24 May 2011 14:00, David T. Lewis <[hidden email]> wrote:

>
> On Tue, May 24, 2011 at 10:46:02AM +0200, Igor Stasenko wrote:
>>
>> >
>> > No mistake, the performance problem was real.
>> >
>> > Good news - I found the cause. Better news - this may be good for a
>> > performance boost on StackVM and possibly Cog also.
>> >
>> > The performance hit was due almost entirely to InterpreterPrimitives>>failed,
>> > and perhaps a little bit to #successful and #success: also.
>> >
>> > This issue with #failed is due to "^primFailCode ~= 0" which, for purposes
>> > of C translation, can be recoded as "^primFailCode" with an override in
>> > the simulator as "^primFailCode ~= 0". This produces a significant speed
>> > improvement, at least as fast as for the original interpreter implementation
>> > using successFlag.
>> >
>> > I expect that the same change applied to StackInterpreter may give a similar
>> > 10% improvement (though I have not tried it). I don't know what to expect
>> > with Cog, but it may give a boost there as well.
>> >
>> > Changes attached, also included in VMMaker-dtl.237 on SqueakSource.
>> >
>> > Dave
>> >
>> >
>> >
>>
>> added to http://code.google.com/p/cog/issues/detail?id=45
>
> Thanks Igor.
>
>>
>> it is strange that such a small detail could make so much difference in speed.
>
> Yes, I was very surprised to see it also. It will be interesting to see
> if it has a similar effect for StackInterpreter. I probably will not have
> time to check this for a while, so if you try it please let us know
> what you find.
>

What are you using to measure the difference in speed?

> Dave
>
>



--
Best regards,
Igor Stasenko AKA sig.

Re: Performance of primitiveFailFor: and use of primFailCode

David T. Lewis
 
On Tue, May 24, 2011 at 02:16:05PM +0200, Igor Stasenko wrote:

>
> On 24 May 2011 14:00, David T. Lewis <[hidden email]> wrote:
> >
> > On Tue, May 24, 2011 at 10:46:02AM +0200, Igor Stasenko wrote:
> >>
> >> >
> >> > No mistake, the performance problem was real.
> >> >
> >> > Good news - I found the cause. Better news - this may be good for a
> >> > performance boost on StackVM and possibly Cog also.
> >> >
> >> > The performance hit was due almost entirely to InterpreterPrimitives>>failed,
> >> > and perhaps a little bit to #successful and #success: also.
> >> >
> >> > This issue with #failed is due to "^primFailCode ~= 0" which, for purposes
> >> > of C translation, can be recoded as "^primFailCode" with an override in
> >> > the simulator as "^primFailCode ~= 0". This produces a significant speed
> >> > improvement, at least as fast as for the original interpreter implementation
> >> > using successFlag.
> >> >
> >> > I expect that the same change applied to StackInterpreter may give a similar
> >> > 10% improvement (though I have not tried it). I don't know what to expect
> >> > with Cog, but it may give a boost there as well.
> >> >
> >> > Changes attached, also included in VMMaker-dtl.237 on SqueakSource.
> >> >
> >> > Dave
> >> >
> >> >
> >> >
> >>
> >> added to http://code.google.com/p/cog/issues/detail?id=45
> >
> > Thanks Igor.
> >
> >>
> >> it is strange that such a small detail could make so much difference in speed.
> >
> > Yes, I was very surprised to see it also. It will be interesting to see
> > if it has a similar effect for StackInterpreter. I probably will not have
> > time to check this for a while, so if you try it please let us know
> > what you find.
> >
>
> What are you using to measure the difference in speed?
>

I just use tinyBenchmarks as a smoke test to make sure that
changes in the Slang code do not affect performance. So I am looking
at different variants of the code, running each one five times
to get an average. Obviously this does not reflect real performance,
but it is useful for spotting problems. Examples on my system:

  Standard interpreter VM with successFlag
  0 tinyBenchmarks '444444444 bytecodes/sec; 14317245 sends/sec'
  0 tinyBenchmarks '435374149 bytecodes/sec; 14012854 sends/sec'
  0 tinyBenchmarks '437606837 bytecodes/sec; 15277259 sends/sec'
  0 tinyBenchmarks '437981180 bytecodes/sec; 15252007 sends/sec'
  0 tinyBenchmarks '443674176 bytecodes/sec; 14406658 sends/sec'
 
  Interpreter VM with primFailCode
  0 tinyBenchmarks '398133748 bytecodes/sec; 14895019 sends/sec'
  0 tinyBenchmarks '393241167 bytecodes/sec; 14228935 sends/sec'
  0 tinyBenchmarks '396284829 bytecodes/sec; 14250910 sends/sec'
  0 tinyBenchmarks '396591789 bytecodes/sec; 14907050 sends/sec'
  0 tinyBenchmarks '401883830 bytecodes/sec; 14520007 sends/sec'
 
  Interpreter VM with primFailCode after optimizing #failed, #success:, and #successful
  0 tinyBenchmarks '447161572 bytecodes/sec; 14979650 sends/sec'
  0 tinyBenchmarks '442523768 bytecodes/sec; 14955371 sends/sec'
  0 tinyBenchmarks '447161572 bytecodes/sec; 14991818 sends/sec'
  0 tinyBenchmarks '443290043 bytecodes/sec; 14350644 sends/sec'
  0 tinyBenchmarks '445604873 bytecodes/sec; 15114601 sends/sec'

Similar tests showed that the differences were almost entirely
associated with #failed.

I have to say that I am still uncomfortable about this, because
I cannot really explain why the change has such a large effect.
The #failed method is used only in a few places in the interpreter
itself. So if you are able to independently verify (or refute)
any of these results, that would be great.

Thanks,
Dave


Re: Performance of primitiveFailFor: and use of primFailCode

Eliot Miranda-2
In reply to this post by David T. Lewis
 


On Mon, May 23, 2011 at 8:42 PM, David T. Lewis <[hidden email]> wrote:
 
On Mon, May 23, 2011 at 07:30:09PM -0400, David T. Lewis wrote:
> On Mon, May 23, 2011 at 02:33:52PM -0700, Eliot Miranda wrote:
> >
> > On Mon, May 23, 2011 at 2:08 PM, David T. Lewis <[hidden email]> wrote:
> > >
> > >  Testing success status, original:
> > >        if (foo->successFlag) { ... }
> > >
> > >  Testing success status, new:
> > >        if (foo->primFailCode == 0) { ... }
> > >
> > >  Setting failure status, original:
> > >         foo->successFlag = 0;
> > >
> > >  Setting failure status, new:
> > >        if (foo->primFailCode == 0) {
> > >                foo->primFailCode = 1;
> > >        }
> > >
> > > So in each case the global struct is being used, both for successFlag
> > > and primFailCode. Sorry for the confusion. In any case, I'm still left
> > > scratching my head over the size of the performance difference.
> > >
> >
> > One thought, where are successFlag and primFailCode in the struct?  Perhaps
> > the size of the offset needed to access them makes a difference?
>
> In both cases they are the first element of the struct, so that
> cannot be it.
>
> I think I had better circle back and redo my tests. Maybe I made
> a mistake somewhere.
>

No mistake, the performance problem was real.

Good news - I found the cause. Better news - this may be good for a
performance boost on StackVM and possibly Cog also.

thanks!
 

The performance hit was due almost entirely to InterpreterPrimitives>>failed,
and perhaps a little bit to #successful and #success: also.

This issue with #failed is due to "^primFailCode ~= 0" which, for purposes
of C translation, can be recoded as "^primFailCode" with an override in
the simulator as "^primFailCode ~= 0". This produces a significant speed
improvement, at least as fast as for the original interpreter implementation
using successFlag.

Note that with the Cog code generator and for the purposes of the simulator this can read

failed
<api>
^self cCode: [primFailCode] inSmalltalk: [primFailCode ~= 0]

The Cog inliner maps self cCode: aCBlock inSmalltalk: anStBlock to aCBlock at TMethod creation time, hence avoiding the inability to inline cCode:inSmalltalk:.  See MessageNode>>asTranslatorNode: in the Cog VMMaker.  I'll integrate it as such in Cog.


I expect that the same change applied to StackInterpreter may give a similar
10% improvement (though I have not tried it). I don't know what to expect
with Cog, but it may give a boost there as well.

Changes attached, also included in VMMaker-dtl.237 on SqueakSource.

Dave




Re: Performance of primitiveFailFor: and use of primFailCode

David T. Lewis
 
On Tue, May 24, 2011 at 09:07:30AM -0700, Eliot Miranda wrote:

>  
> On Mon, May 23, 2011 at 8:42 PM, David T. Lewis <[hidden email]> wrote:
>
> >
> > The performance hit was due almost entirely to
> > InterpreterPrimitives>>failed,
> > and perhaps a little bit to #successful and #success: also.
> >
> > This issue with #failed is due to "^primFailCode ~= 0" which, for purposes
> > of C translation, can be recoded as "^primFailCode" with an override in
> > the simulator as "^primFailCode ~= 0". This produces a significant speed
> > improvement, at least as fast as for the original interpreter
> > implementation
> > using successFlag.
> >
>
> Note that with the Cog code generator and for the purposes of the simulator
> this can read
>
> failed
> <api>
> ^self cCode: [primFailCode] inSmalltalk: [primFailCode ~= 0]
>
> The Cog inliner maps self cCode: aCBlock inSmalltalk: anStBlock to aCBlock
> at TMethod creation time, hence avoiding the inability to inline
> cCode:inSmalltalk:.  See MessageNode>>asTranslatorNode: in the Cog VMMaker.
>  I'll integrate as such in Cog.

Thanks. I had some problems with inlining when I wrote it that way, so I
had to back off to just using an override in the simulator. I'll look to
pick up the appropriate fixes for this from Cog as merging proceeds.

BTW, I did not actually test the simulator after doing this; I hope I did
not break anything for Craig's Spoon work ;)

Dave


Re: Performance of primitiveFailFor: and use of primFailCode

ccrraaiigg
 

> ...I did not actually test the simulator after doing this, hope I did
> not break anything for Craig's Spoon work ;)

     Oh, if you did I'm sure I'll let you know about it a few days after
you've fixed it. :)  Actually, now I'm looking forward to ephemeron
conflicts. :)


-C

--
Craig Latta
www.netjam.org/resume
+31  06 2757 7177
+ 1 415  287 3547




Re: Performance of primitiveFailFor: and use of primFailCode

Igor Stasenko

On 24 May 2011 20:19, Craig Latta <[hidden email]> wrote:
>
>
>> ...I did not actually test the simulator after doing this, hope I did
>> not break anything for Craig's Spoon work ;)
>
>     Oh, if you did I'm sure I'll let you know about it a few days after
> you've fixed it. :)  Actually, now I'm looking forward to ephemeron
> conflicts. :)
>
What conflicts?

Please elaborate :)

--
Best regards,
Igor Stasenko AKA sig.

Re: Performance of primitiveFailFor: and use of primFailCode

ccrraaiigg
 

Hi Igor--

> > ...now I'm looking forward to ephemeron conflicts. :)
> >
>
> What conflicts?
>
> Please elaborate :)

     Oh, none yet, I just suspect there will be some with the stuff I
wrote to GC stale methods.


-C

--
Craig Latta
www.netjam.org/resume
+31  06 2757 7177
+ 1 415  287 3547




Re: Performance of primitiveFailFor: and use of primFailCode

Igor Stasenko

On 25 May 2011 00:51, Craig Latta <[hidden email]> wrote:

>
>
> Hi Igor--
>
>> > ...now I'm looking forward to ephemeron conflicts. :)
>> >
>>
>> What conflicts?
>>
>> Please elaborate :)
>
>     Oh, none yet, I just suspect there will be some with the stuff I
> wrote to GC stale methods.

Yes. This could be a problem. Consider the following:

MyClass>>someMethod
  ^ #( 'abc' 'def' )


ephemeron := Ephemeron new
   key: self someMethod first
   value: somethingElse.

So we have created an ephemeron whose key is an object that came from
the method's literals. And even worse, as shown above, it could be
not a direct literal but a nested object.

Now, if you GC this stale #someMethod, it will apparently turn the
ephemeron's value into a weak reference,
and its key will be lost and replaced by nil.

To circumvent that, you have to make sure that all literals kept by
a method are still reachable from the roots by other means.
Another approach is to detect and do something about such problematic
ephemerons, but as the example shows, this could be tricky.

Btw this will happen with other weak refs as well.

array := WeakArray with: self someMethod first.

Do you have a solution for that?

Of course it depends on your intents.

--
Best regards,
Igor Stasenko AKA sig.

Re: Performance of primitiveFailFor: and use of primFailCode

Eliot Miranda-2
 


On Tue, May 24, 2011 at 5:34 PM, Igor Stasenko <[hidden email]> wrote:

On 25 May 2011 00:51, Craig Latta <[hidden email]> wrote:
>
>
> Hi Igor--
>
>> > ...now I'm looking forward to ephemeron conflicts. :)
>> >
>>
>> What conflicts?
>>
>> Please elaborate :)
>
>     Oh, none yet, I just suspect there will be some with the stuff I
> wrote to GC stale methods.

Yes. This could be a problem. Consider the following:

MyClass>>someMethod
 ^ #( 'abc' 'def' )


ephemeron := Ephemeron new
  key: self someMethod first
  value: somethingElse.

So we have created an ephemeron whose key is an object that came from
the method's literals. And even worse, as shown above, it could be
not a direct literal but a nested object.

Now, if you GC this stale #someMethod, it will apparently turn the
ephemeron's value into a weak reference,
and its key will be lost and replaced by nil.

Uh, no.  The ephemeron refers to the string 'abc' that happened to be referenced by the method.  But that string won't be garbage collected until there are no references to it in the system, including from the ephemeron.  I.e. the ephemeron will either need to nil its key or itself be collected before the 'abc' string can be collected.  There is no magic here with references to objects from methods.  In Smalltalk, methods are just objects.  [And in the Cog VM there is a little bit of chicanery to preserve the illusion that there are no machine-code methods involved, but that's what it does: hide the machine code.]
 

To circumvent that, you have to make sure that all literals kept by
a method are still reachable from the roots by other means.
Another approach is to detect and do something about such problematic
ephemerons, but as the example shows, this could be tricky.

Btw this will happen with other weak refs as well.

array := WeakArray with: self someMethod first.

do you have a solution for that?

Of course it depends on your intents.

--
Best regards,
Igor Stasenko AKA sig.
