[vwnc] VM optimization: code inlining?


Re: [vwnc] VM optimization: code inlining?

Mark Plas

Hi Eliot,

 

You wrote “it would be able to prove that the result of findKeyOrNil: would be an index in range”; could you explain a bit how this can be done? The only thing I can think of is that #findKeyOrNil: always sends ‘self basicAt: location’ before returning the value of location. So if that message doesn’t fail, the sender of #findKeyOrNil: wouldn’t fail either if it just used the result as an argument to #basicAt:. Would this be the reasoning of the compiler/VM?

 

Thanks,

Mark

 


From: [hidden email] [mailto:[hidden email]] On Behalf Of Eliot Miranda
Sent: Saturday 11 October 2008 0:39
To: Martin McClure
Cc: [hidden email]
Subject: Re: [vwnc] VM optimization: code inlining?

 

 

On Fri, Oct 10, 2008 at 1:03 PM, Martin McClure <[hidden email]> wrote:

 

[excellent analysis snipped]

 


That's a 105% speedup. Test2Dictionary>>at: is more than twice as fast
as Dictionary>>at: in the non-failure case.


A Self-style VM could do this kind of inlining automatically.

 

Not only that, it would use an at: that would avoid the bounds check and isInteger check because it would be able to prove that the result of findKeyOrNil: would be an index in range.  So the speed ups would be even higher than achieved by this experiment.

 

The
bytecode-level inlining that Terry's talking about could also do it,
though without the frequency feedback it's a bit harder to know where to
do this, and a mechanism would be needed to keep track of all of the
methods that had inlined a particular method in order to automatically
recompile all inlining methods whenever an inlined method is recompiled.

 

A Self-style VM can still be implemented to target bytecode and have the VM generate the machine code.  The VM needs additional bytecodes to express things like a non-bounds-checking at:, etc.

 

But you and I know this.  We've been talking about it for a while now.  I'm looking forward to working on this in Cog.

 

Eliot

Qwaq, Inc.



Re: [vwnc] findKeyOrNil pass?

Henrik Sperre Johansen
Sorry for going off-topic again, but speaking of findKeyOrNil: (and its
cousin findElementOrNil: in Set),
I have a hard time seeing the reasoning behind this part:
pass := pass + 1.
pass > 2 ifTrue: [^self grow findElementOrNil: anObject]
 
As far as I can tell, this will only trigger if:
1. anObject is not part of the Set/Dictionary.
2. There are no nil slots in the Set/Dictionary (i.e. it is entirely full).
It prevents an infinite loop if both conditions were ever met, but I
can't see how that could ever be the case.
- The basic add methods all have a fullCheck, which grows the collection
if an add would otherwise leave it entirely full.
- The senders of noCheckAdd: I could find are only called from methods
that ensure the size stays bigger than the number of elements added with
noCheckAdd:.
Is there another way to add elements/keys that would necessitate this
step?

I could somewhat understand the reasoning behind it if the check were
pass > 1 (the pass counter wouldn't really be needed at all then). In
that case the underlying assumption could be that if you reach the last
slot of the set while probing for an object whose initialIndex belongs
to another object, and all slots from there to the end of the set are
occupied, the most likely cause is a high collision rate, so a grow
would reduce the number of collisions and speed up subsequent
operations.
With the new hash functions, which as far as I understand have a very
low collision rate, even that is an extremely rare case.

Cheers,
Henry


Re: [vwnc] findKeyOrNil pass?

Reinout Heeck
Henrik Johansen wrote:

> but speaking of findKeyOrNil: (and it's
> cousin findElementOrNil: in Set),
> I have a hard time seeing the reasoning behind the part:
> pass := pass + 1.
> pass > 2 ifTrue: [^self grow findElementOrNil: anObject]


In pass one the search starts somewhere in 'the middle' of the
dictionary and scans towards the end of the dict.

In pass two the search wraps around to the start and then scans the
whole dict again (which is more than strictly needed).

When both passes fail, the dict is grown and the search is retried
recursively.
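
(To make the two-pass scheme concrete, here is a rough sketch of the shape
such a probing method has. This is an illustration only, not the actual
VisualWorks source; initialIndexFor:boundedBy: is an invented helper standing
in for the hash-to-index computation:)

findElementOrNil: anObject
	"Sketch: linear probing with wraparound and a grow-and-retry fallback."
	| index length pass |
	length := self basicSize.
	index := self initialIndexFor: anObject hash boundedBy: length.
	pass := 1.
	[true] whileTrue:
		[| probe |
		probe := self basicAt: index.
		(probe isNil or: [probe = anObject]) ifTrue: [^index].
		(index := index + 1) > length
			ifTrue:
				[index := 1.	"wrap around: start of pass two"
				pass := pass + 1.
				pass > 2 ifTrue: [^self grow findElementOrNil: anObject]]]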



>  
> As far as I can tell, this will only trigger if:
> 1. anObject is not part of the Set/Dictionary.
> 2. There are no nil slots in the Set/Dictionary (ie: it's entirely full).

Correct

> It prevents an infinite loop if the 2 conditions were ever met, but I
> can't see how that can ever be the case.

> - The basic add methods all have a fullCheck, which increases space if
> the dictionary after an add would lead to an entirely full dictionary.
> - The noCheckAdd: senders I could find only gets called by methods whom
> ensure the size is bigger than the number of elements added with
> noCheckAdd:.
> Is there another way to add elements/keys which would necessitate this
> step?

See Dictionary>>add: for an example.

Browse senders of #findKeyOrNil: to find others.


>
> I could somewhat understand a reasoning behind it if the check was pass > 1
> (wouldn't really need the pass at all then though), in which case
> the underlaying assumption could be that if you somehow reached the last
> element of the set for an object whose initialIndex was another object,
> and all subsequent elements untill the end of the set were full,

The dictionary treats its content as a circular buffer (if the search for
a hash hits the end of the dict it wraps around to the beginning to find
a slot for that hash 'bucket').

What you say only seems to hold if the dictionary did not do this
wrapping trick but grew instead.




More generally:

I find it quite typical for Smalltalk programs to do a lot of the kind of
'double' work you point out.
In Smalltalk the emphasis is on code readability and correctness, but it
seems that as a result a lot of this duplicated work gets hidden so
deeply that it is hard to find (and hence remove).

In my opinion the VM should remove those inefficiencies (but it seems
I'll have to wait several decades before programming tools reach
that level)...



R
-

Re: [vwnc] findKeyOrNil pass?

Henrik Sperre Johansen
My point exactly.
The second scan never scans to the end of the dict, since the set of
keys is never entirely full: add: performs a fullCheck, and the callers
of noCheckAdd: never fill the Set entirely.
Hence the second pass finds the key in one of the slots before the
original index, or finds a nil if the key isn't in the dictionary.
Thus the whole pass > 2 test is really not needed... (unless, as I
said, there's some way of adding keys I'm not aware of).


Re: [vwnc] VM optimization: code inlining?

Michel Tilman
In reply to this post by Joachim Geidel
Hello,

I tried to simulate what would happen if the copied values were not
stored in the copiedValues field, but as indexed instance variables.
Class "TestA" corresponds to the current BlockClosure class, with three
instance variables. In the case of one copied value we store it directly
in field "b" of "TestA"; in the case of multiple values we store them in
an array. Class "TestB" has two instance variables and stores the
values in indexed variables. The timings are, of course, only indicative,
as this is Smalltalk code and not VM code. Except for the case of one
copied value (similar results), the numbers suggest that the
indexed-variables variant may be the better alternative. Note that this
is also relevant for full-copying blocks.
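
(To make the setup concrete, the two test classes could be defined roughly as
follows. This is a sketch using the classic Smalltalk-80 class-creation
messages; the instance variable names other than "b" and the category are
assumptions, and the exact class-definition syntax in VisualWorks differs
slightly:)

Object subclass: #TestA
	instanceVariableNames: 'a b c'	"three named slots, like BlockClosure; copied value(s) go in b"
	classVariableNames: ''
	poolDictionaries: ''
	category: 'ClosureExperiments'.

Object variableSubclass: #TestB
	instanceVariableNames: 'a b'	"two named slots; copied values live in the indexed part"
	classVariableNames: ''
	poolDictionaries: ''
	category: 'ClosureExperiments'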

Note that today we should not only focus on performance, but also on  
energy efficiency, ability to run in limited space (embedded  
applications), ... so I consider any gains worthwhile.


Regards,

michel

===

a) no copied value

Time millisecondsToRun: [ 400000000 timesRepeat: [
        TestA basicNew ]] 9381

Time millisecondsToRun: [ 400000000 timesRepeat: [
        TestB basicNew: 0 ]] 8941


AllocationProfiler profile: [ 400000000 timesRepeat: [
        TestA basicNew ]] 36258 scavenges, 1 incGCs, 9599999400 bytes

AllocationProfiler profile: [ 400000000 timesRepeat: [
        TestB basicNew: 0 ]] 30212 scavenges, 1 incGCs, 7999999280 bytes


b) 1 copied value

Time millisecondsToRun: [ 400000000 timesRepeat: [
        TestA basicNew b: #a ]] 13329

Time millisecondsToRun: [ 400000000 timesRepeat: [
        (TestB basicNew: 1) at: 1 put: #a ]] 12742


AllocationProfiler profile: [ 400000000 timesRepeat: [
        TestA basicNew b: #a ]] 36258 scavenges, 1 incGCs, 9599999400 bytes

AllocationProfiler profile: [ 400000000 timesRepeat: [
        (TestB basicNew: 1) at: 1 put: #a ]] 36258 scavenges, 1 incGCs,  
9599999400 bytes


c) 2 copied values

Time millisecondsToRun: [ 400000000 timesRepeat: [
        | a |

        a := Array basicNew: 2.
        a at: 1 put: #a; at: 2 put: #b.
        TestA basicNew b: a ]] 23409

Time millisecondsToRun: [ 400000000 timesRepeat: [
        (TestB basicNew: 2) at: 1 put: #a; at: 2 put: #b ]] 14228


AllocationProfiler profile: [ 400000000 timesRepeat: [
        | a |

        a := Array basicNew: 2.
        a at: 1 put: #a; at: 2 put: #b.
        TestA basicNew b: a ]] 66324 scavenges, 1 incGCs, 17599999384 bytes

AllocationProfiler profile: [ 400000000 timesRepeat: [
        (TestB basicNew: 2) at: 1 put: #a; at: 2 put: #b ]] 42253 scavenges,  
0 incGCs, 11199999160 bytes


d) 4 copied values

Time millisecondsToRun: [ 400000000 timesRepeat: [
        | a |

        a := Array basicNew: 4.
        a at: 1 put: #a; at: 2 put: #b; at: 3 put: #c; at: 4 put: #d.
        TestA basicNew b: a ]]  31158

Time millisecondsToRun: [ 400000000 timesRepeat: [
        (TestB basicNew: 4) at: 1 put: #a; at: 2 put: #b; at: 3 put: #c; at:  
4 put: #d ]]  22184


AllocationProfiler profile: [ 400000000 timesRepeat: [
        | a |

        a := Array basicNew: 4.
        a at: 1 put: #a; at: 2 put: #b; at: 3 put: #c; at: 4 put: #d.
        TestA basicNew b: a ]] 78298 scavenges, 1 incGCs, 20800000000 bytes

AllocationProfiler profile: [ 400000000 timesRepeat: [
        (TestB basicNew: 4) at: 1 put: #a; at: 2 put: #b ]] 54229 scavenges,  
1 incGCs, 14399999532 bytes



Re: [vwnc] VM optimization: code inlining?

Eliot Miranda-2
In reply to this post by Mark Plas


On Mon, Oct 13, 2008 at 12:28 AM, Mark Plas <[hidden email]> wrote:

Hi Eliot,

 

You wrote "it would be able to prove that the result of findKeyOrNil: would be an index in range" , could you explain a bit how this can be done? The only thing I can think of would be because #findKeyOrNil: always sends 'self basicAt: location' before returning the value of location. So if that message doesn't fail, the sender of #findKeyOrNil: wouldn't fail either if it would just use the result as an argument to #basicAt:. Would this be the reasoning of the compiler/vm?

Look at your inlined at:

...

      length := self basicSize.

...

      index := (length > 4095 and: [aHashValue <= 16383])

            ifTrue: [aHashValue * (length // 4095 + 1) \\ length + 1]

            ifFalse: [aHashValue \\ length + 1].

...

                  [(index := index + 1) > length


From these statements an optimizing compiler can infer that index is always within 1 to self basicSize.

i.e. 

     r := exp \\ length => r >= 0 && r < length or failure

hence

     index >= 1 && index <= length && length == self basicSize

and hence

     index >= 1 && index <= self basicSize

or some such chain of reasoning.  An SSA-style optimizer would accommodate such variable range analysis relatively easily.
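
(As an illustration of what that proof buys, the specialized access could then
look something like the sketch below; basicAtUnchecked: is a made-up selector
standing for the non-bounds-checking access mentioned earlier in the thread,
not an existing VisualWorks primitive:)

valueAtProvenIndex: index
	"Sketch only: index has been proven to lie in 1..self basicSize,
	 so neither the isInteger check nor the bounds check needs to be emitted."
	^self basicAtUnchecked: index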

 

 


Re: [vwnc] VM optimization: code inlining?

Mark Plas

Hi Eliot,

 

I had never heard of “SSA-style compilers”. I thought that only humans would be able to do this kind of code analysis, but it seems the basic ideas for it already existed in the ’80s (according to Wikipedia).

 

Thanks for your answer,

 

Mark

 



Re: [vwnc] VM optimization: code inlining?

Mark Plas
In reply to this post by Michel Tilman
Hi Michel,

An interesting experiment that sounds like it would not be too hard to implement.

The way I see it, these are the steps to be taken to get to the next level of performance for the VM:

1) inlining methods
2) more efficient block closures using your technique
3) adding extra code analysis to avoid type checks and in-bounds checks

Now, if Cincom could put this into their 7.7 release...

Mark


Re: [vwnc] VM optimization: code inlining?

Michel Tilman
Thanks Mark,

That was indeed one of my goals: it seemed to me that this would be
fairly easy to implement. In fact, you can get rid of more data
(though not the overhead of the closure object itself). For instance,
the "outerContext" field is not needed in many cases and could be made
dynamic, and if we really set our minds to it, we could even get rid of
the "method" field by making each closure an instance of a
block-specific subclass of BlockClosure, though I suspect this would
impact more than one or two things.

The code analysis is interesting (a while ago I had intended to work
on SSA-type representations, though more in the context of
parallelization, but I have not found the time yet). It seems to me
that it is not without its own problems (type info / inference, fixed(?)
semantics of arithmetic operators, ...), but that also makes it
interesting...

Regards,

michel



Re: [vwnc] VM optimization: code inlining?

Martin McClure
In reply to this post by Mark Plas
Mark Plas wrote:
>  
>
> To Martin: You wrote “since shared variable access is quite slow
> compared to instvar access.”, so I changed my method to use an instvar,
> but it was a bit slower than the version with a shared variable. What
> would be the reason for this?
>

Interesting. I'm not sure why you would have seen a slowdown. In
general, instvar access is quite a bit faster than shared variable access:

test
        ^Time
                microsecondsToRun:
                        [10000000
                                timesRepeat:
                                        [| a |
                                        a := instVar]]

runs in about 38 milliseconds, whereas

test
        ^Time
                microsecondsToRun:
                        [10000000
                                timesRepeat:
                                        [| a |
                                        a := SharedVar]]

takes about 90 milliseconds. If one takes into account that an empty
loop takes about 30 milliseconds, this is 8 milliseconds vs. 60
milliseconds, almost a full order of magnitude difference. If I
accounted for the common overhead of the assignment (bytecode <4D> store
local 1; pop) the difference for just accessing the variable would
probably be more than an order of magnitude.

Within the past year, I have been able to make measurable speed-ups in
real-world programs by caching the value of frequently-accessed shared
variables in instvars.
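
(A minimal sketch of that caching pattern, with invented names — cachedTable
is assumed to be an instance variable and SharedTable a shared variable:)

initializeCaches
	"Read the shared variable once and keep it in an instvar;
	 hot code paths then pay only the cheaper instvar access."
	cachedTable := SharedTable

lookup: aKey
	^cachedTable at: aKey ifAbsent: [nil]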

Regards,

-Martin

Re: [vwnc] VM optimization: code inlining?

Mark Plas
Hi Martin,

I've found what the problem is. This is the way I implemented the #at: method using an inst var:

at: key

        | result |
        result := self at: key ifAbsent: [undefinedValue].
        result == undefinedValue ifTrue: [self keyNotFoundErrorFor: #at: index: key].
        ^result


When you inspect the bytecodes of the method, you see that it needs a copying block for [undefinedValue], because the block reads an instance variable (and therefore references self). When using a shared variable the copying block isn't required and a clean block is used instead.

When I change the method to this:

at: key

        | result |
        result := self at: key ifAbsent: undefinedValue.
        result == undefinedValue ifTrue: [self keyNotFoundErrorFor: #at: index: key].
        ^result

(i.e. with the block for ifAbsent: removed), it is clearly faster. I suppose you implemented it this way when you did your experiments?

The results then are:

1) #at: with shared var --> 4.2sec
2) #at: with inst var & copying block --> 4.3sec
3) #at: with inst var & no block --> 3.9sec

Mystery solved!
Mark


[vwnc] FW: VM optimization: code inlining?

Mark Plas
In reply to this post by Mark Plas
Hmmm. It seems my emails don't reach the vwnc list anymore?

-----Original Message-----
From: Mark Plas
Sent: Tuesday 14 October 2008 10:12
To: 'Michel Tilman'; [hidden email]
Subject: RE: [vwnc] VM optimization: code inlining?

Hi Michel,

An interesting experiment that sounds like it would not be too hard to implement.

The way I see it, these are the steps to be taken to get to the next level of performance for the VM:

1) inlining methods
2) more efficient block closures using your technique
3) adding extra code analysis to avoid type checks and in-bounds checks

Now, if Cincom could put this into their 7.7 release...

Mark


Re: [vwnc] VM optimization: code inlining?

Martin McClure
In reply to this post by Mark Plas
Mark Plas wrote:

> Hi Martin,
>
> I've found what the problem is. This is the way I implemented the #at: method using an inst var:
>
> at: key
>
>         | result |
>         result := self at: key ifAbsent: [undefinedValue].
>         result == undefinedValue ifTrue: [self keyNotFoundErrorFor: #at: index: key].
>         ^result
>
>
> When you inspect the bytecodes of the method, it requires a copying block for the [undefinedValue]. When using a shared variable the copying block isn't required and a clean block is used instead.
>
> When I change the method to this:
>
> at: key
>
>         | result |
>         result := self at: key ifAbsent: undefinedValue.
>         result == undefinedValue ifTrue: [self keyNotFoundErrorFor: #at: index: key].
>         ^result
>
> (removed the block for ifAbsent:) then it is clearly faster.

I hadn't thought about this case, in which you must use a copying block
referencing "self" before you can access an instvar from the block.

> I suppose you implemented it this way when you did your experiments?

No, I actually did it the way you did for the intermediate experiment,
then inlined at:ifAbsent: for the second improved version. I didn't
compare instvar to shared var, I just used an instvar because "I know
it's faster" (one of those dangerous things to "know").

That probably explains why my first improved version was still creating
objects and doing (fewer) scavenges -- the block was still copying, but
must have been creating smaller closures.

Thanks for pointing this out.

Regards,

-Martin