Tuning timesRepeat:


Paul Baumann

This shows that the response of the inline-compiled #timesRepeat: is different from that of the actual message send:

5 timesRepeat: [false]
 nil

| block |
block := [false].
5 timesRepeat: block
 5

 

This shows that both the inline form and the message send form are affected by large integers:

| ms response |
ms := (Time millisecondsToRun: [
	response := 2 * SmallInteger maxVal timesRepeat: []
]) // 2.
Array with: ms with: response.
 #(69431 nil)

| ms response block |
block := [].
ms := (Time millisecondsToRun: [
	response := 2 * SmallInteger maxVal timesRepeat: block
]) // 2.
Array with: ms with: response.
 #(69539 1073741822)

 

Compare these averaged ms times with the ones above to see how much can be gained by tuning #timesRepeat: to avoid large integers:

(Time millisecondsToRun: [
	2 timesRepeat: [SmallInteger maxVal timesRepeat: []]
]) // 2
 1349

| block |
block := [].
(Time millisecondsToRun: [
	2 timesRepeat: [SmallInteger maxVal timesRepeat: block]
]) // 2
 3745

 

Add this method to optimize the #timesRepeat: message send:

LargeInteger>>timesRepeat: aBlock
	"Optimization proposed to Cincom by ICE. -plb 2011.06.08"
	self // SmallInteger maxVal timesRepeat: [SmallInteger maxVal timesRepeat: [aBlock value]].
	self \\ SmallInteger maxVal timesRepeat: [aBlock value].
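The decomposition behind this method is plain quotient/remainder arithmetic: n iterations are split into n // maxVal full chunks of maxVal iterations plus n \\ maxVal leftovers, so every loop count the VM sees stays a SmallInteger. A minimal sketch of that accounting, in Python only so it can be checked with small values (the `max_val` parameter is an illustration aid; 536870911 is SmallInteger maxVal on a 32-bit image, as used later in this thread):

```python
MAX_VAL = 536870911  # SmallInteger maxVal on a 32-bit VisualWorks image

def times_repeat(n, block, max_val=MAX_VAL):
    """Evaluate block() exactly n times without any loop count above max_val."""
    for _ in range(n // max_val):    # self // SmallInteger maxVal timesRepeat: [...]
        for _ in range(max_val):     # SmallInteger maxVal timesRepeat: [aBlock value]
            block()
    for _ in range(n % max_val):     # self \\ SmallInteger maxVal timesRepeat: [aBlock value]
        block()
```

Every count handed to a loop is at most max_val, which in the Smalltalk original means every nested timesRepeat: send has a SmallInteger receiver.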

 

This shows that the #timesRepeat: message send is now much faster than the inline #timesRepeat: for large integers:

| ms response block |
block := [].
ms := (Time millisecondsToRun: [
	response := 2 * SmallInteger maxVal timesRepeat: block
]) // 2.
Array with: ms with: response.
 #(3645 1073741822)

 

The inline form can also be made faster by punting to a LargeInteger implementation when the receiver is not a SmallInteger. That optimization would be made in the VM and is likely an easy change. The return value should be fixed too. These tests were done with a VW 7.5 image and VW 7.6 VM. The performance difference between integer types skews the results of typical performance comparison tools; more efficient code sometimes profiles slower as a side effect of higher estimated repeat multiples.

 

Paul Baumann

 

 

 



This message may contain confidential information and is intended for specific recipients unless explicitly noted otherwise. If you have reason to believe you are not an intended recipient of this message, please delete it and notify the sender. This message may not represent the opinion of IntercontinentalExchange, Inc. (ICE), its subsidiaries or affiliates, and does not constitute a contract or guarantee. Unencrypted electronic mail is not secure and the recipient of this message is expected to provide safeguards from viruses and pursue alternate means of communication where privacy or a binding message is desired.

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

Re: Tuning timesRepeat:

Andres Valloud-6
At some point I considered the timesRepeat: tuning, but I decided
against it.  There are two factors driving this decision.  One is what
happens in the workspace.  Consider the following:

Time millisecondsToRun: [100000000 timesRepeat: [1 * 1 * 1 * 1 * 1]] 1323

"Comes with a warning against unoptimized blocks"
| block |
block := [1 * 1 * 1 * 1 * 1].
Time millisecondsToRun: [100000000 timesRepeat: block] 2191

"Using the [block value] idiom so the outer block is optimized"
| block |
block := [1 * 1 * 1 * 1 * 1].
Time millisecondsToRun: [100000000 timesRepeat: [block value]] 1717


So, as you can see, if the block you are measuring is clean, then it can
be optimized more than if the block is not clean.  This is particularly
important when you have huge repetition counts, because any evaluation
overhead will be amplified by the repetition count as well.

The issue with the large integer optimization is that, well... how
should it be implemented?  Like this?

LargePositiveInteger>>actualTimesRepeat: aBlock

   self // SmallInteger maxVal timesRepeat:
     [SmallInteger maxVal timesRepeat: [aBlock value]].
   self \\ SmallInteger maxVal timesRepeat: [aBlock value]


(note we have to avoid using the special timesRepeat: selector)

But this implementation defeats the purpose of large repetition counts
because it cannot use the more optimized block evaluation mechanism.
Consequently, measurements will be deceptive.  Consider for example the
above [1 * 1 * 1 * 1 * 1] block.

Repetitions              Milliseconds
100000000                ~1325
200000000                ~2550
300000000                ~3675
400000000                ~5150
500000000                ~6350

Clearly, for every 10^8 repetitions you get about 1300 milliseconds'
worth of execution time.  But what happens when we use the above method
and run the code some more?

Repetitions              Milliseconds
SmallInteger maxVal      ~6875
SmallInteger maxVal + 1  ~9600


So, yeah, the provided code is faster than using the inlined
timesRepeat: implementation we have, but what happened there?  Now we
have a huge measurement skew induced by adding a single repetition.  So,
even with the "optimization" in LargePositiveInteger, measurements are
still significantly different depending on whether the receiver is a
small integer or a large integer.

This brings us to the second factor, which also affects measurement
tools, which arguably may be forced not to use clean blocks.  So... do
applications actually use code like this?

largePositiveInteger timesRepeat: [business stuff]


Of course it's hard to tell from where I am sitting, but I've yet to see
such a thing in the field.  Thus, I am tempted to say that the above
case corresponds to measurement code.  If that's the case, then what
matters is the evaluation efficiency of [business stuff].  And if
[business stuff] takes very little time, then there are grounds for
using large repetition counts, and in turn that implies the need to
minimize measurement overhead for the sake of measurement precision, and
the requirement to reduce measurement overhead contradicts the large
positive integer version of actualTimesRepeat: because we cannot use the
more efficient block evaluation mechanism to implement it.

Moreover, measuring [business stuff] with huge multipliers does increase
the measured overhead no matter how small.  Recall the table of
evaluations of [1 * 1 * 1 * 1 * 1]:

Repetitions              Milliseconds
100000000                ~1325
200000000                ~2550
300000000                ~3675
400000000                ~5150
500000000                ~6350


Here is the table corresponding to a single loop unroll, i.e., we are
measuring the evaluation of

   [1 * 1 * 1 * 1 * 1.  1 * 1 * 1 * 1 * 1]

but with half the repetition count each time (the repetitions column in
the table below lists how many times the 4 multiplications were done,
but in reality the repetition counts passed to timesRepeat: are half of
what is listed).  We get these times:

Repetitions (total)      Milliseconds
100000000                ~1100
200000000                ~2225
300000000                ~3350
400000000                ~4450
500000000                ~5550
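The unrolling skew is purely loop-overhead accounting: each unroll multiplies the work per iteration and divides the iteration count, so the loop machinery itself runs proportionally less often while the measured expressions stay the same. A sketch of that accounting in Python (counts only, not a timing benchmark; the function name and `unroll` parameter are illustrative):

```python
def run(n, unroll, work):
    """Evaluate work() n times total, `unroll` copies per loop iteration;
    returns how many times the loop machinery itself ran."""
    iterations = 0
    for _ in range(n // unroll):
        for _ in range(unroll):   # stands for textually repeated expressions
            work()
        iterations += 1
    return iterations
```

With unroll factor 2 the same total number of work() evaluations is paid for with half the loop iterations, which is exactly the source of the ~20% measurement difference above.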

So, each set of 10^8 multiplication expressions takes ~1100ms instead of
~1300ms, just because we changed the measurement methodology.  That's
about 20% worth of skew!  So what happens if we do 10 expressions per
block and decimate the repetition count?

Repetitions (total)      Milliseconds
100000000                ~1000
200000000                ~2025
300000000                ~2975
400000000                ~3975
500000000                ~5025

So that's another 10% of skew, for a total 30% of skew, compared to no
loop unrolling at all.  Of course decimating further will provide
diminishing returns, so I will stop here.  The bottom line for me is
that I might ignore a few percentage points pretending they are noise,
but 30% is *a lot*.

Thus, because of the implementation issues with clean blocks in the
workspace, and because of the induced overhead skew noted above, which
is bound to affect every measurement (benchmark tool or not), I am not
sure increasing the repetition count into the large integers is a good
idea.  Consequently, I am not convinced this apparently unnecessary
and/or counterproductive case needs optimizing.

Finally, note that this issue becomes basically invisible on 64 bit images.


On 6/8/2011 8:18 AM, Paul Baumann wrote:


Re: Tuning timesRepeat:

Paul Baumann

>> Finally, note that this issue becomes basically invisible on 64 bit images.

Andres,

Yes, and I agree with your other comments. I could show how to reduce the hit, but there is no need; it is easy enough to avoid LPI repeats in the profiling code where I found this. It isn't a big deal. Computers will become fast enough that the LPI case would be hit more often in code like this, but installations will also grow into 64-bit images. I mentioned it because I suspected it might be easy to optimize in the VM; if it isn't, then no problem. Thanks for the thorough review.

BTW, the optimized (inline) form accepts floats and doubles even though they don't implement #timesRepeat:. It is good that it does, because I've encountered code that would break if it suddenly didn't. There is a performance hit for using the wrong type, though. Had it been possible to make the LPIs fast, this would have been a similar issue.

Time millisecondsToRun: [536870911 timesRepeat: []].
 1245

Time millisecondsToRun: [536870911.0 timesRepeat: []].
 16801

Time millisecondsToRun: [536870911d timesRepeat: []].
 16140

Time millisecondsToRun: [536870911d asInteger timesRepeat: []].
 1403

Hint:

Time millisecondsToRun: [aLargePositiveInteger forRepeatLoop timesRepeat: []].

Regards,

Paul Baumann


-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Andres Valloud
Sent: Wednesday, June 08, 2011 15:36
To: [hidden email]
Subject: Re: [vwnc] Tuning timesRepeat:


Re: Tuning timesRepeat:

Andres Valloud-6

At first sight, I think it's a matter of compiler optimization rather
than VM optimization.  If you look at the bytecodes produced for
timesRepeat:, then you can see timesRepeat: is rewritten in terms of a
[...] whileTrue: [...] construct.  There's a new temp for a counter, and
before each iteration there's a <= test against the timesRepeat: "receiver".

Now, interestingly, floating point values and what not know how to
compare themselves to small integers directly.  For example,

Time millisecondsToRun: [10000000 timesRepeat: [1 < 2.0d]] 673
Time millisecondsToRun: [10000000 timesRepeat: [2.0d > 1]] 86

So, in your case, it might be useful to reverse the timesRepeat: check
at the head of the loop so that the comparison receiver is the
timesRepeat: receiver.  In other words, if we look at

1000d timesRepeat: [nil]

and debug the expression to find the decompiled bytecodes, we find:

unboundMethod
        | t1 |
        t1 := 1.
        [t1 <= 1000.0d]
                whileTrue: [t1 := t1 + 1].
        ^nil


So if instead of that we had

unboundMethod
        | t1 |
        t1 := 1.
        [1000.0d >= t1]
                whileTrue: [t1 := t1 + 1].
        ^nil


then your cases would run faster.  For example,

Time millisecondsToRun:
        [
                | t1 |
                t1 := 1.
                [t1 <= 10000000.0d]
                        whileTrue: [t1 := t1 + 1].
                nil
        ] 366

Time millisecondsToRun:
        [
                | t1 |
                t1 := 1.
                [10000000.0d >= t1]
                        whileTrue: [t1 := t1 + 1].
                nil
        ] 95


The integer case seems to run just as fast regardless of the comparison
order.  Now, of course, if the limit was a (small) integer instead of
floating point, then the code would be even faster.

Time millisecondsToRun:
        [
                | t1 |
                t1 := 1.
                [10000000 >= t1]
                        whileTrue: [t1 := t1 + 1].
                nil
        ] 31

What code is "sending" timesRepeat: to floating point values?

Andres.

Re: Tuning timesRepeat:

Paul Baumann
Andres wrote:
>> What code is "sending" timesRepeat: to floating point values?

The repeat was the result of a calculation that included a float.

Below is the fastest form the compiler could optimize to while avoiding performance inconsistencies. Performance in the SmallInteger range is about the same as the form you showed. It doubles the efficient integer range while also efficiently handling other kinds of numbers. The code being repeated can fit neatly into the compiler-optimized whileTrue: argument.

Time millisecondsToRun: [
        | times endAt t1 |
        times := SmallInteger maxVal * 2d.
        endAt := SmallInteger minVal.
        t1 := (times asInteger max: 0) + 1 + endAt.
        [endAt ~~ (t1 := t1 - 1)]
                whileTrue: ["compiler in-lines code here"].
        nil
].
 2947
 2957
 2926

Here is the range-adjusted performance of the form you showed (which slows down outside SmallInteger):

(Time millisecondsToRun: [
        | t1 times |
        times := SmallInteger maxVal.
        t1 := 1.
        [times >= t1]
                whileTrue: [t1 := t1 + 1].
        nil
]) * 2
 2948
 2760
 2802


When that same no-operation pattern is put into a method, it looks like this:

ArithmeticValue>>otherTimesRepeat0: aBlock
        "Note that this doesn't actually do any work; it is to profile counting only."
        | endAt t1 |
        endAt := SmallInteger minVal.
        t1 := (self asInteger max: 0) + 1 + endAt.
        [endAt ~~ (t1 := t1 - 1)]
                whileTrue: [].
        ^nil


(Time millisecondsToRun: [
        SmallInteger maxVal otherTimesRepeat0: [].
]) * 2.
 3014

This is what performance looks like when a block closure gets involved:

ArithmeticValue>>otherTimesRepeat: aBlock
        | endAt t1 |
        endAt := SmallInteger minVal.
        t1 := (self asInteger max: 0) + 1 + endAt.
        [endAt ~~ (t1 := t1 - 1)]
                whileTrue: [aBlock value].
        ^nil

(Time millisecondsToRun: [
        SmallInteger maxVal otherTimesRepeat: [].
]) * 2.
 6260

This shows that performance scales fine into large integers:

Time millisecondsToRun: [
        SmallInteger maxVal * 2 otherTimesRepeat: [].
].
 6356

Having the compiler optimize the code to avoid the block closure is (of course) a large savings in this context. A fast and general way to enumerate a block closure without a point of degrading performance is with these methods:

ArithmeticValue>>myTimesRepeat: aBlock
        "Optimization proposed to Cincom by ICE. -plb 2011.06.09"
        ^aBlock repeat: self.

BlockClosure>>repeat: times
        "Optimization proposed to Cincom by ICE. -plb 2011.06.09"
        "Evaluate the receiver multiple times. The receiver is expected to be a zero argument block."

        | t1 resets |
        t1 := times asInteger max: 0.
        resets := t1 // 536870911.
        t1 := (t1 \\ 536870911) + 1.
        [0 ~~ (t1 := t1 - 1) or: [t1 := 536870911. (resets := resets - 1) ~~ -1]]
                whileTrue: [self value].
        ^nil
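The control flow above is a countdown that periodically resets: t1 runs down from (times \\ 536870911) + 1, and each time it reaches 0 it is refilled with 536870911 and one reset is consumed, until resets reaches -1. Here is a transliteration into Python, a sketch of the loop logic only (not its speed), with max_val as a parameter so the iteration count can be checked with small values:

```python
MAX_VAL = 536870911  # SmallInteger maxVal on a 32-bit VisualWorks image

def repeat(times, block, max_val=MAX_VAL):
    """Evaluate block() exactly `times` times; no counter ever leaves
    the small-integer range.  Mirrors BlockClosure>>repeat: above."""
    t1 = max(int(times), 0)
    resets = t1 // max_val        # how many full max_val-sized rounds remain
    t1 = (t1 % max_val) + 1       # countdown for the partial round
    while True:
        t1 -= 1
        if t1 != 0:               # 0 ~~ (t1 := t1 - 1)
            block()
            continue
        t1 = max_val              # refill the countdown ...
        resets -= 1               # ... and consume one reset
        if resets == -1:          # (resets := resets - 1) ~~ -1
            break
        block()
```

Each comparison and decrement involves only values up to max_val, which is why the Smalltalk version never touches large integer arithmetic inside the loop.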

This is how performance compares to the current implementation using small integers:

(Time millisecondsToRun: [
        | block |
        block := [].
        SmallInteger maxVal timesRepeat: block.
]) * 2
 8304
 8252

(Time millisecondsToRun: [
        | block |
        block := [].
        SmallInteger maxVal myTimesRepeat: block.
]) * 2.
 6794
 7246

This is how performance compares into large integers:

Time millisecondsToRun: [
        | block |
        block := [].
        SmallInteger maxVal * 2 timesRepeat: block.
].
 144143

Time millisecondsToRun: [
        | block |
        block := [].
        SmallInteger maxVal * 2 myTimesRepeat: block.
].
 6902
 6855

My point is that the problem can be fixed for both the compiler optimized code and for the message send form. The fix makes performance consistent without making it any slower. 64-bit will not fix performance of repeats on floats and fractions. Granted, those are application bugs, but they have been known to happen.

Here is a little extra to think about with regard to return values and the simple-block costs that you mentioned...

Consider how useful it can be to repeat with injection:

BlockClosure>>repeat: times inject: initialValue
        "Optimization proposed to Cincom by ICE. -plb 2011.06.09"
        "Evaluate the receiver multiple times. The receiver is expected to be a one argument block.
         The argument is the result of the previous execution of the receiver."

        | t1 resets last |
        t1 := times asInteger max: 0.
        resets := t1 // 536870911.
        t1 := (t1 \\ 536870911) + 1.
        last := initialValue.
        [0 ~~ (t1 := t1 - 1) or: [t1 := 536870911. (resets := resets - 1) ~~ -1]]
                whileTrue: [last := self value: last].
        ^last
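The injecting variant threads a value through the loop: each evaluation receives the previous result, and the final result is returned, so the repeat doubles as a fold. A Python transliteration of the same reset-based loop (a sketch for checking the logic, with a small max_val):

```python
def repeat_inject(times, block, initial, max_val=536870911):
    """Evaluate block(last) exactly `times` times, feeding each result
    back in; returns the final result.  Mirrors repeat:inject: above."""
    t1 = max(int(times), 0)
    resets = t1 // max_val
    t1 = (t1 % max_val) + 1
    last = initial
    while True:
        t1 -= 1
        if t1 != 0:
            last = block(last)
            continue
        t1 = max_val
        resets -= 1
        if resets == -1:
            return last           # ^last
        last = block(last)
```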

Consider how passing context in as an argument to the block can keep the performance of a simple closure:

BlockClosure>>repeat: times including: inc1 inject: initialValue
        "Optimization proposed to Cincom by ICE. -plb 2011.06.09"
        "Evaluate the receiver multiple times. The receiver is expected to be a two argument block.
         The first argument to the block can be used to keep the block simple.
         The last argument is the result of the previous execution of the receiver."

        | t1 resets last |
        t1 := times asInteger max: 0.
        resets := t1 // 536870911.
        t1 := (t1 \\ 536870911) + 1.
        last := initialValue.
        [0 ~~ (t1 := t1 - 1) or: [t1 := 536870911. (resets := resets - 1) ~~ -1]]
                whileTrue: [last := self value: inc1 value: last].
        ^last

Performance is consistent no matter what kind of number is used (as long as it can convert to an integer), and it stays consistent regardless of the number of repeats. Here is a closure that is passed context so that it remains simple:

Time millisecondsToRun: [
        [:myself :last | myself ]
                repeat: SmallInteger maxVal * 2
                including: self
                inject: nil.
].
 6773
 7218

Here is the same test but the closure is not simple:

Time millisecondsToRun: [
        [:myself :last | self ]
                repeat: SmallInteger maxVal * 2
                including: self
                inject: nil.
].
 10654
 10698

The cost that the closure adds to the repeat equals the cost of a #yourself message send:

Time millisecondsToRun: [
        [:myself :last | myself yourself ]
                repeat: SmallInteger maxVal * 2
                including: self
                inject: nil.
].
 10631
 10756

Regards,

Paul Baumann

