[Challenge] Tuning the large factorial

[Challenge] Tuning the large factorial

Paul Baumann

Here is a fun challenge for those that enjoy tuning code. Respect is the only award for the winner.

 

The challenge is to implement a #factorial method that executes fastest for the range of 1 to 3000.  Performance is measured relative to the current #factorial execution time. Final judgment will be done on a single machine using the same configuration for all tests. I will announce the entry results  and winner once entries are judged. It is recommended that you prefix your #factorial implementation with your initials (like #plbFactorial).

 

The rules are:

                1) Results must be computed (no pre-test lookup tables).

                2) Tuning must not affect performance of the standard #factorial method.

                3) Results must be valid; for example:

 

                                (1 to: 3000) do: [:i | i factorial = i plbFactorial ifFalse: [self error]].

 

                4) Entries can be submitted by reply to this vwnc forum post or by email:

 

                                email1: paul DOT baumann AT theice DOT com

                                email2: spamsnare AT gmail DOT com

 

                5) Entry deadline is 5pm EST on June 24, 2011.

                6) Performance will be compared using a test like this:

 

                                | t0 t1 |

                                ObjectMemory garbageCollect.

                                t0 := Time millisecondsToRun: [1 to: 3000 do: [:i | i factorial]].

                                ObjectMemory garbageCollect.

                                t1 := Time millisecondsToRun: [1 to: 3000 do: [:i | i plbFactorial]].

                                ^Array with: t0 with: t1 with: 1 - (t1 / t0) * -100.0

 

                7) The best relative percentage difference in execution time (by an average of three runs) will be the winner. A note on how the score expression in rule 6 evaluates follows the rules.

 

                                (#(

                                #(16296 6576 -59.6465)

                                #(16250 6660 -59.0154)

                                #(16016 6556 -59.0659)

                                ) inject: 0 into: [:sum :results | sum + results last]) / 3

 

                                -59.2426

 

                8) I will not be entering the contest because I will judge it.

                9) If you wish to keep your code confidential then make that clear when you submit your entry.
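
                A note on the score expression in rule 6: Smalltalk evaluates binary operators left to right, so "1 - (t1 / t0) * -100.0" means "(1 - (t1 / t0)) * -100.0", which is the percentage reduction in execution time written as a negative number. A quick worked example (illustrative numbers only):

                                | t0 t1 |
                                t0 := 16000.
                                t1 := 6400.
                                1 - (t1 / t0) * -100.0   "(1 - 0.4) * -100.0 = -60.0, a 60% reduction"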

 

Paul Baumann

 

 




Re: [Challenge] Tuning the large factorial

Steven Kelly

Nice idea, Paul! I think you ought to state the VW version, 64 or 32-bit VM, platform version, and hardware the final test will run on. Oh, and I win:

 

Integer>>smkFactorial

   ^(FactorialObject basicNew) argument: self; yourself

 

FactorialObject>>argument: anInteger

   argument := anInteger

 

FactorialObject>>equalFromInteger: anObject

   ^argument factorial = anObject

 

FactorialObject>>printOn: aStream

   argument factorial printOn: aStream

 

Time millisecondsToRun: [1 to: 3000 do: [:i | i smkFactorial]].
"0"

 

Lazy evaluation rules! :-)

Steve

 


Re: [Challenge] Tuning the large factorial

Paul Baumann

Steven,

 

It must be a 32-bit version because 64-bit is less of a tuning challenge and few people actually use 64-bit. I was going to leave VW version open in case some submissions are version-specific, but all entries would be compared with whichever is used. If an entry is VW-version specific then I'll try to accommodate that as long as it is not unfair to other entries. The tests will be run on a Windows 7 machine (in case that matters to anyone).

 

This is an algorithm test to judge tuning of actual work performed. Check rule #1. Deferring work during the timing is not computing the results. I should also clarify that no cached results may be used between factorial sends ("pre-test" was just an example in the rule). If someone were to use an approach that forks work into another process, then the work is expected to be fully complete for the result and not reused for subsequent sends.

 

I got a kick out of your approach though.

 

Paul Baumann

 

 


Re: [Challenge] Tuning the large factorial

John Brant-2

Here's my algorithm:

Integer>>jmbFactorial
        | value |
        self < 13
                ifTrue:
                        ["If we are small enough, just compute it the standard way."
                        value := 1.
                        2 to: self do: [:i | value := value * i].
                        ^value].
        value := self jmbFactorialHelper.
        ^(value at: 1) * (value at: 2)

Integer>>jmbFactorialHelper
        "Return a two element array with the first element containing the
factorial for (self // 2)
        and the second element being (self factorial / (self // 2) factorial)."

        | half halfValues newValues oddTotal oddValue i end |
        self == 1 ifTrue: [^#(1 1)].
        half := self bitShift: -1.
        halfValues := half jmbFactorialHelper.
        newValues := Array with: (halfValues at: 1) * (halfValues at: 2) with: nil.

        "Multiply all of the odd numbers in the upper range -- only do a few at
a time to try to limit to SmallIntegers"
        oddTotal := 1.
        i := half odd ifTrue: [half + 2] ifFalse: [half + 1].
        [i <= self] whileTrue:
                        [oddValue := 1.
                        end := i + 10 min: self.
                        [i <= end] whileTrue:
                                        [oddValue := oddValue * i.
                                        i := i + 2].
                        oddTotal := oddTotal * oddValue].

        "Multiply the even part of the upper range by the odd part of the upper
range. The even part can be
        calculated from the recursive part bit shifted by the size of the upper
from the recursive part."
        newValues at: 2 put: ((halfValues at: 2) bitShift: half - (half
bitShift: -1)) * oddTotal.
        ^newValues
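
A quick workspace check of the invariant stated in the helper's comment (an
illustrative check, not from the original post):

| parts |
parts := 20 jmbFactorialHelper.
(parts at: 1) = (20 // 2) factorial.            "true"
(parts at: 1) * (parts at: 2) = 20 factorial.   "true"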


Using your performance test code on my machine, I get:
 #(15972 3073 -80.7601)
 #(15840 3063 -80.6629)
 #(15831 3019 -80.9298)


However, if I really wanted a faster factorial, the best solution is to
simply search for one. For example, I implemented the split recursive
algorithm by directly translating some C# code, and got these numbers:

 #(16023 2699 -83.1555)
 #(15950 2702 -83.0596)
 #(15929 2693 -83.0937)

which are faster than my algorithm. There are even faster algorithms but
those appear to be based on prime factorization, and I wasn't sure that
caching of some primes was permitted.


John Brant

Re: [Challenge] Tuning the large factorial

Andres Valloud-6
Hey, this is a lot of fun...

First I submitted the implementation in AT Integer Extensions (which I
wrote).  Let's call this avFactorial for the sake of illustration.  It
uses a binary recursion to calculate a product between two integers.
The idea is to keep products between integers of similar size, because
the efficiency of multiplication x*y is highest in the y=x => x^2 case
(so assuming complexity(x,y) is more or less continuous, then the
efficiency of multiplication is highest when x is close to y).  This is
also why 1 * 2 * 3... is about the worst possible way to calculate a
factorial.  Of course the recursion has some overhead as written, so
avFactorial only uses it after 120!.  avFactorial ranks at about -80 in
Paul's scale.
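
The real avFactorial is the one in AT Integer Extensions; as a rough illustration of
the balanced-recursion idea only (made-up selectors, not Andres's code):

Integer>>sketchBalancedFactorial
        "Factorial via a balanced product, so the operands of each * are of similar size."
        self < 2 ifTrue: [^1].
        ^2 sketchProductTo: self

Integer>>sketchProductTo: anInteger
        "Answer self * (self + 1) * ... * anInteger, splitting the range at its midpoint."
        | mid |
        self > anInteger ifTrue: [^1].
        self = anInteger ifTrue: [^self].
        mid := self + anInteger bitShift: -1.
        ^(self sketchProductTo: mid) * (mid + 1 sketchProductTo: anInteger)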

Then, I reused what I had previously done for Karatsuba multiplication
of large integers, because the VM implements the naive multiplication
algorithm (see http://blogten.blogspot.com/2008/07/mr-karatsuba.html).
So, avFactorial2 is basically the same code as avFactorial but using
karatsubaTimes:.  avFactorial2 ranks at about -81.3 in Paul's scale.
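
For reference, the textbook Karatsuba split looks roughly like this; it is only a
sketch for positive integers with an arbitrary cutoff, not Andres's actual
karatsubaTimes: implementation:

Integer>>sketchKaratsubaTimes: anInteger
        "Split both operands at half the larger bit length, so one big multiplication
         becomes three multiplications of numbers roughly half the size."
        | n half mask xLow xHigh yLow yHigh low high middle |
        (self < 2 or: [anInteger < 2]) ifTrue: [^self * anInteger].
        n := self highBit max: anInteger highBit.
        n < 512 ifTrue: [^self * anInteger].    "below the cutoff the VM primitive wins"
        half := n bitShift: -1.
        mask := (1 bitShift: half) - 1.
        xLow := self bitAnd: mask.
        xHigh := self bitShift: half negated.
        yLow := anInteger bitAnd: mask.
        yHigh := anInteger bitShift: half negated.
        low := xLow sketchKaratsubaTimes: yLow.
        high := xHigh sketchKaratsubaTimes: yHigh.
        middle := (xLow + xHigh sketchKaratsubaTimes: yLow + yHigh) - low - high.
        ^(high bitShift: 2 * half) + (middle bitShift: half) + low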

Then, I thought I should accumulate 2^n separately from the factorial,
so I rewrote avFactorial2 into avFactorial3 which uses associations of
the form odd -> bitShift.  At the end, the result is finalAnswer key
bitShift: finalAnswer value.  avFactorial3 ranks at about -81 in Paul's
scale.  Obviously the overhead is becoming a problem for small values.
In fact, avFactorial2 is faster than avFactorial3 until about 2500!, so
the benefit is not obvious until later.  Nevertheless, since for larger
factorials the execution time is going to be dominated by
multiplication, the overhead becomes less significant.  For example,

Time millisecondsToRun: [100000 avFactorial3] 1361
Time millisecondsToRun: [100000 avFactorial2] 1412
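
The shape of that idea in a deliberately naive sketch (made-up selector, not
avFactorial3): keep only the odd part of the running product, count the factors of
two, and apply a single bitShift: at the end.

Integer>>sketchFactorialSplittingTwos
        "n! as (product of the odd parts of 2..n) bitShift: (total count of 2 factors)."
        | odd shift k |
        odd := 1.
        shift := 0.
        2 to: self do: [:i |
                k := i.
                [k even] whileTrue: [k := k bitShift: -1. shift := shift + 1].
                odd := odd * k].
        ^odd bitShift: shift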

Then, I saw John's post.  I thought I could do better than split
recursive with the pieces I had, so I adapted John's algorithm to use
Karatsuba.  This results in avFactorial4, which ranks at about -84.7 in
Paul's scale.

I also wrote a couple of variants.  avFactorial3a always uses power of two
accumulation regardless of size, and jmbFactorial2 does the bitShift: last in
jmbFactorialHelper.

Now, I should note that the closer we get to -100, the more sensitive
the performance measurement is to small value changes in Paul's scale.
So, for the sake of easier comparison, here are run times with a
standardized workload and testing conditions for all the algorithms:

ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial]]]    9052

ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial2]]]   8221

ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial3]]]   7953

"Always use power of two accumulation regardless"
ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial3a]]]  7970

ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial4]]]   6331

ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x jmbFactorial]]]   7682

"Reorder final multiplication to do bitShift: last"
ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x jmbFactorial2]]]  7708

For comparison, the normal factorial performs like this (note only one
workload, not 3 as above):

ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [1 to: 3000 do: [:x | x factorial]] 13108



Attachment: avFactorial.zip (7K)

Re: [Challenge] Tuning the large factorial

John Brant-2
On 6/10/2011 7:51 PM, Andres Valloud wrote:
> Then, I saw John's post below.  I thought I could do better than split
> recursive with the pieces I had, so I adapted John's algorithm to use
> Karatsuba.  This results in avFactorial4, which ranks at about -84.7 in
> Paul's scale.

Of course, if I use your Karatsuba methods with the split recursive, it
is faster. However, I can refactor my version so that it performs the
bitShift for the even numbers only when you want the factorial and not
for the intermediate steps. Here's the version that uses your Karatsuba
methods with my bitShift change:

jmbFactorial
        | value |
        self < 13
                ifTrue:
                        [value := 1.
                        2 to: self do: [:i | value := value * i].
                        ^value].
        value := #(1 0 1 0) copy.
        self jmbFactorialHelper: value.
        ^((value at: 1) karatsubaTimes: (value at: 3)) bitShift: (value at: 2)
+ (value at: 4)

jmbFactorialHelper: anArray
        | half oddTotal i |
        self == 1 ifTrue: [^self].
        half := self bitShift: -1.
        half jmbFactorialHelper: anArray.
        anArray at: 1 put: ((anArray at: 1) karatsubaTimes: (anArray at: 3)).
        anArray at: 2 put: (anArray at: 2) + (anArray at: 4).
        i := half odd ifTrue: [half + 2] ifFalse: [half + 1].
        oddTotal := i karatsubaOddProductTo: self.
        anArray at: 3 put: ((anArray at: 3) karatsubaTimes: oddTotal).
        anArray at: 4 put: (anArray at: 4) + half - (half bitShift: -1)
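
The helper #karatsubaOddProductTo: isn't shown in this message (John's later post in
the thread defines the equivalent #karatsubaProductTo:by: with a step of 2); a
stand-in could look like this sketch, assuming Andres's karatsubaTimes: is loaded:

Integer>>karatsubaOddProductTo: anInteger
        "Product of the odd numbers self, self + 2, ..., up to anInteger, recursing on
         a balanced split so both operands of each karatsubaTimes: have similar sizes."
        | count mid |
        self > anInteger ifTrue: [^1].
        count := anInteger - self // 2 + 1.
        count = 1 ifTrue: [^self].
        mid := self + (count // 2 * 2).
        ^(self karatsubaOddProductTo: mid - 2) karatsubaTimes:
                (mid karatsubaOddProductTo: anInteger)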

Using this version, I got these results:
 #(16159 2357 -85.4137)
 #(15953 2367 -85.1627)
 #(15713 2324 -85.2097)

The updated split recursive algorithm numbers are:
 #(16048 2346 -85.3814)
 #(15955 2342 -85.3212)
 #(16006 2350 -85.318)

When I compute 100000 factorial, both my algorithm and the split
recursive take 1289 milliseconds vs. 28533 for the default implementation.


John Brant

Re: [Challenge] Tuning the large factorial

Paul Baumann
Wow guys. 80% reduction in execution time is very impressive. If you openly share your results and code like that then it will make it easier for others to beat your improvements by combining strategies and making minor improvements. Hopefully others will still participate now that the bar has been set so high.

The closer you get to a 100% reduction in execution time, the more fractions of a percentage point matter. All the execution times will be compared with a baseline value for the standard #factorial. I'll use the lowest of several measurements for the baseline. The tests will be repeated multiple times if that helps to get more precision.

John,

It would be OK to use a table of prime numbers to optimize an algorithm but not to replace a calculation:

        #(3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 ...)

I wanted to avoid tables of intermediate or direct answers like this:

        plbFactorial2
                ^#(1 2 6 24 120 720 5040 40320 362880 3628800 39916800 479001600 ...) at: self

That is the absolute fastest approach, but it is a cache and the work is done outside of the timing:

 #(16415 0 -100.0)

Paul Baumann




Re: [Challenge] Tuning the large factorial

Paul Baumann
I'm removing one of the restrictions so that tuning can be easily expanded into new areas. Disregard rule #2. If your tuning brings you into tuning the large integer implementation, then go for it. The use of the standard #factorial baseline measurement makes this OK.

                Disregard this rule:
                2) Tuning must not affect performance of the standard #factorial method.

If you post your achievements then it would be useful to include your baseline measurement and show the percentage reduction in execution time from your baseline.

Paul Baumann




Re: [Challenge] Tuning the large factorial

Nicolas Cellier

Hi Paul,

With an implementation of the prime swing divide and conquer algorithm (see
http://www.luschny.de/math/factorial/FastFactorialFunctions.htm) combined with
a Karatsuba multiplication, I get a score of -88%.
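
The recurrence that prime swing is built on is n! = ((n // 2)!)^2 * swing(n), with
swing(n) = n! / ((n // 2)!)^2; the real algorithm computes swing(n) from a prime
sieve, but the skeleton can be sketched naively like this (illustrative only, not
the code in the Store package):

Integer>>sketchSwingFactorial
        "Divide and conquer on n! = ((n // 2)!)^2 * swing(n)."
        self < 2 ifTrue: [^1].
        ^(self // 2) sketchSwingFactorial squared * self sketchSwing

Integer>>sketchSwing
        "Naive O(n) swing = self! / ((self // 2)!)^2, shown only to make the recurrence
         concrete; prime swing builds this value from the primes <= self instead."
        | s |
        s := 1.
        1 to: self do: [:k | s := k odd ifTrue: [s * k] ifFalse: [s * 4 / k]].
        ^s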

For 10000 factorial, the speedup ratio reaches 14:

[FactorialPrimeSwing new factorialOf: 10000] timeToRun.   "41.542 milliseconds"
[10000 factorial] timeToRun.                              "587.689 milliseconds"
587.689 / 41.542                                          "14.1469"

The code is at FactorialPrimeSwingContest in public store with MIT license.
Feel free to improve...

Nicolas


Re: [Challenge] Tuning the large factorial

John Brant-2
On 6/12/2011 7:18 PM, nicolas cellier wrote:
> with an implementation of prime swing divide and conquer algorithm - see
> http://www.luschny.de/math/factorial/FastFactorialFunctions.htm - combined with
> a karatsuba multiplication, i get a score of -88%.

What machine/vm are you running on? When I run your code on my 64 bit
Windows 7 with 32 bit VW vm, I'm getting:

 #(16056 2372 -85.2267)
 #(15925 2372 -85.1052)
 #(16034 2376 -85.1815)


> The code is at FactorialPrimeSwingContest in public store with MIT license.
> Feel free to improve...

Your version of the karatsubaTimes: method is a little faster than
Andres'. When I use it with a slightly different version of my algorithm
(code below), I get:

 #(15940 2205 -86.1669)
 #(16020 2202 -86.2547)
 #(15937 2203 -86.1768)

I'm surprised that the prime swing isn't faster. At 100,000 factorial,
it is a few milliseconds faster:

Time millisecondsToRun: [FactorialPrimeSwing new factorialOf: 100000]
        1047 1048 1049
Time millisecondsToRun: [100000 jmbFactorial]
        1056 1053 1053

However, when I raised it to 150,000, mine was a little faster:

Time millisecondsToRun: [FactorialPrimeSwing new factorialOf: 150000]
        2171 2147 2147
Time millisecondsToRun: [150000 jmbFactorial]
        2141 2127 2128

At 200,000 they were roughly the same:

Time millisecondsToRun: [FactorialPrimeSwing new factorialOf: 200000]
        3541 3541 3528
Time millisecondsToRun: [200000 jmbFactorial]
        3526 3529 3541


John Brant

----------------------

jmbFactorial
        | value |
        self < 13
                ifTrue:
                        [value := 1.
                        2 to: self do: [:i | value := value * i].
                        ^value].
        value := #(1 0 1 0) copy.
        self jmbFactorialHelper: value.
        ^((value at: 1) karatsubaTimes: (value at: 3)) bitShift: (value at: 2)
+ (value at: 4)


jmbFactorialHelper: anArray
        | half oddTotal i |
        self == 1 ifTrue: [^self].
        half := self bitShift: -1.
        half jmbFactorialHelper: anArray.
        anArray at: 1 put: ((anArray at: 1) karatsubaTimes: (anArray at: 3)).
        anArray at: 2 put: (anArray at: 2) + (anArray at: 4).
        i := half odd ifTrue: [half + 2] ifFalse: [half + 1].
        oddTotal := i karatsubaProductTo: self by: 2.
        anArray at: 3 put: ((anArray at: 3) karatsubaTimes: oddTotal).
        anArray at: 4 put: (anArray at: 4) + half - (half bitShift: -1)


karatsubaProductTo: anInteger by: stepInteger
        | next doubleStep |
        self > anInteger ifTrue: [^1].
        next := self + stepInteger.
        next <= anInteger
                ifTrue:
                        [doubleStep := stepInteger bitShift: 1.
                        ^(self karatsubaProductTo: anInteger by: doubleStep) karatsubaTimes:
(next karatsubaProductTo: anInteger by: doubleStep)]
                ifFalse: [^self]
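
An illustrative check of how the by: doubling interleaves the factors (not from the
original post; it assumes karatsubaTimes: multiplies like *):

(11 karatsubaProductTo: 19 by: 2) = (11 * 13 * 15 * 17 * 19)
        "true -- evaluated as (11 * 15 * 19) karatsubaTimes: (13 * 17),
         which keeps the operands of each multiplication at similar sizes"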

Re: [Challenge] Tuning the large factorial

Nicolas Cellier
John Brant <brant <at> refactoryworkers.com> writes:

> What machine/vm are you running on?

VisualWorks® NonCommercial, 7.7 of 15 December 2009
Mac OS X


Here jmbFactorial is a bit better than primeSwingFactorial v1.2 using the same
karatsubaTimes:

Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x jmbFactorial]]]         12598
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x primeSwingFactorial]]]  12737


[ObjectMemory garbageCollect. [100000 jmbFactorial] timeToRun] value asSeconds asFloat
  / [ObjectMemory garbageCollect. [100000 primeSwingFactorial] timeToRun] value asSeconds
 0.96685 0.968505

[ObjectMemory garbageCollect. [150000 jmbFactorial] timeToRun] value asSeconds asFloat
  / [ObjectMemory garbageCollect. [150000 primeSwingFactorial] timeToRun] value asSeconds
 0.940478 0.941528

[ObjectMemory garbageCollect. [200000 jmbFactorial] timeToRun] value asSeconds asFloat
  / [ObjectMemory garbageCollect. [200000 primeSwingFactorial] timeToRun] value asSeconds
 0.947829 0.947372

| t0 t1 |
ObjectMemory garbageCollect.
t0 := Time millisecondsToRun: [1 to: 3000 do: [:i | i factorial]].
ObjectMemory garbageCollect.
t1 := Time millisecondsToRun: [1 to: 3000 do: [:i | i primeSwingFactorial]].
^Array with: t0 with: t1 with: 1 - (t1 / t0) * -100.0
 #(40805 4301 -89.4596)
 #(40362 4300 -89.3464)
 #(41264 4363 -89.4266)


| t0 t1 |
ObjectMemory garbageCollect.
t0 := Time millisecondsToRun: [1 to: 3000 do: [:i | i factorial]].
ObjectMemory garbageCollect.
t1 := Time millisecondsToRun: [1 to: 3000 do: [:i | i jmbFactorial]].
^Array with: t0 with: t1 with: 1 - (t1 / t0) * -100.0
 #(40664 4146 -89.8042)
 #(39925 4086 -89.7658)
 #(40983 4145 -89.8861)

Nicolas


Re: [Challenge] Tuning the large factorial

John Brant-2

Here's the numbers for 64 bit VW 7.7 running on 64 bit Linux in a VMWare
session (on my Windows 7 machine):

jmbFactorial:
 #(17952 895 -95.0145)
 #(17982 895 -95.0228)
 #(18071 903 -95.003)

primeSwingFactorial (v1.2):
 #(17826 1343 -92.4661)
 #(17816 1335 -92.5067)
 #(17902 1338 -92.526)

I also ran the GMP factorial function:
--------------
#include <gmp.h>

int main(int argc, char **argv) {
        mpz_t v;
        int i;
        int j;
        for (j = 0; j < 100; j++)
                for (i = 1; i <= 3000; i++) {
                        mpz_init_set_si(v, 0);
                        mpz_fac_ui(v, i);
                }
}
--------------
This takes ~180ms per iteration compared to the fastest VW time of
895ms. While I haven't looked at the code for the GMP factorial
algorithm, its documentation appears to be describing what jmbFactorial
is doing. The big performance gain for GMP comes in when you do really
large factorials. For example, 1,000,000 factorial using GMP takes ~1.2
seconds vs. 48.4 seconds using jmbFactorial in VW. It appears that a lot
of the VW time is spent doing GC. I'm guessing that it does a GC of old space
when it really only needs a GC of large space.


John Brant

Re: [Challenge] Tuning the large factorial

Andres Valloud-6
Assuming of course that large space is large enough to hold all that
stuff...

On 6/13/2011 6:01 AM, John Brant wrote:

> [...] I'm guessing that it does a GC of old space when
> it really only needs a GC of large space.

Re: [Challenge] Tuning the large factorial

Andres Valloud-6
Try liberally increasing the size of large space via ObjectMemory
class>>sizesAtStartup: (or, in newer VMs, via one of the -m switches).
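
For anyone who wants to try that, a hedged sketch (as I understand it, the
argument is an array of per-space size multipliers that take effect at the
next image startup; how many slots there are and which one is large space
depend on the VM build, so the literal array below is a placeholder, not a
recommendation -- check the ObjectMemory class>>sizesAtStartup: method
comment first):

    "Placeholder values only: the idea is to bump the multiplier for large
     space while leaving the other spaces at their defaults."
    ObjectMemory sizesAtStartup: #(1.0 1.0 1.0 1.0 4.0 1.0 1.0 1.0)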

On 6/13/2011 10:08 AM, Andres Valloud wrote:

> Assuming of course that large space is large enough to hold all that
> stuff...

Re: [Challenge] Tuning the large factorial

Andres Valloud-6
In reply to this post by Andres Valloud-6
Also, FYI, we just improved large integer performance on most big-endian
platforms by something between 10% and 25%.  More speedups are possible
(basically so that large integers will be as fast on big-endian platforms
as they are on little-endian platforms), but we need to do some other prep
work before we do that.

On 6/13/2011 10:08 AM, Andres Valloud wrote:

> Assuming of course that large space is large enough to hold all that
> stuff...

Re: [Challenge] Tuning the large factorial

Andres Valloud-6
In reply to this post by Andres Valloud-6
And finally I'm doing some experiments to speed up the transition from
small integers into large integers.  At first glance, it looks like we
might get anywhere from 3x to 10x depending on the operation... we'll see.
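
(The transition in question is the point where arithmetic overflows the
SmallInteger range and has to answer a boxed LargePositiveInteger instead.
A rough illustration, assuming the usual class-side SmallInteger maxVal
accessor is present; the boundary value itself differs between 32- and
64-bit images:)

    SmallInteger maxVal class.        "SmallInteger"
    (SmallInteger maxVal + 1) class.  "LargePositiveInteger"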

On 6/13/2011 10:08 AM, Andres Valloud wrote:

> Assuming of course that large space is large enough to hold all that
> stuff...