Here is a fun challenge for those that enjoy tuning code. Respect is the only award for the winner.

The challenge is to implement a #factorial method that executes fastest for the range of 1 to 3000. Performance is measured relative to the current #factorial execution time. Final judgment will be done on a single machine using the same configuration for all tests. I will announce the entry results and winner once entries are judged. It is recommended that you prefix your #factorial implementation with your initials (like #plbFactorial).

The rules are:

1) Results must be computed (no pre-test lookup tables).

2) Tuning must not affect performance of the standard #factorial method.

3) Results must be valid; for example:

	(1 to: 3000) do: [:i | i factorial = i plbFactorial ifFalse: [self error]].

4) Entries can be submitted by reply to this vwnc forum post or by email:
	email1: paul DOT baumann AT theice DOT com
	email2: spamsnare AT gmail DOT com

5) Entry deadline is 5pm EST on June 24, 2011.

6) Performance will be compared using a test like this:

	| t0 t1 |
	ObjectMemory garbageCollect.
	t0 := Time millisecondsToRun: [1 to: 3000 do: [:i | i factorial]].
	ObjectMemory garbageCollect.
	t1 := Time millisecondsToRun: [1 to: 3000 do: [:i | i plbFactorial]].
	^Array with: t0 with: t1 with: 1 - (t1 / t0) * -100.0

7) The best relative percentage difference in execution time (by an average of three runs) will be the winner.

	(#( #(16296 6576 -59.6465)
	    #(16250 6660 -59.0154)
	    #(16016 6556 -59.0659) )
		inject: 0 into: [:sum :results | sum + results last]) / 3
	-59.2426

8) I will not be entering the contest because I will judge it.
9) If you wish to keep your code confidential then make that clear when you submit your entry.

Paul Baumann

This message may contain confidential information and is intended for specific recipients unless explicitly noted otherwise. If you have reason to believe you are not an intended recipient of this message, please delete it and notify the sender. This message may not represent the opinion of IntercontinentalExchange, Inc. (ICE), its subsidiaries or affiliates, and does not constitute a contract or guarantee. Unencrypted electronic mail is not secure and the recipient of this message is expected to provide safeguards from viruses and pursue alternate means of communication where privacy or a binding message is desired.

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
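For readers less used to Smalltalk, the scoring expression in rule 6 evaluates binary operators strictly left to right, i.e. as (1 - (t1 / t0)) * -100.0. The following Python sketch (my own illustration, not part of the contest code) reproduces the sample numbers shown under rule 7:

```python
def relative_change(t0, t1):
    """Percentage change in execution time, as computed in rule 6.

    Negative values mean the tuned method is faster; -100 would be
    an instantaneous tuned run.
    """
    return (1 - t1 / t0) * -100.0

# The three sample runs listed under rule 7: (baseline ms, tuned ms).
runs = [(16296, 6576), (16250, 6660), (16016, 6556)]
scores = [relative_change(t0, t1) for t0, t1 in runs]

print([round(s, 4) for s in scores])  # [-59.6465, -59.0154, -59.0659]
print(round(sum(scores) / 3, 4))      # -59.2426
```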
Nice idea, Paul! I think you ought to state the VW version, 64 or 32-bit VM, platform version, and hardware the final test will run on.

Oh, and I win:

Integer>>smkFactorial
	^(FactorialObject basicNew)
		argument: self;
		yourself

FactorialObject>>argument: anInteger
	argument := anInteger

FactorialObject>>equalFromInteger: anObject
	^argument factorial = anObject

FactorialObject>>printOn: aStream
	argument factorial printOn: aStream

Time millisecondsToRun: [1 to: 3000 do: [:i | i smkFactorial]].

Lazy evaluation rules! :-)

Steve
Steven, It must be a 32-bit version because 64-bit is less of a tuning challenge and few people actually use 64-bit. I was going to leave VW version open in case some submissions are version-specific, but all entries
would be compared with whichever is used. If an entry is VW-version specific then I'll try to accommodate that as long as it is not unfair to other entries. The tests will be run on a Windows 7 machine (in case that matters to anyone).
This is an algorithm test to judge tuning of actual work performed. Check rule #1. Deferring work during the timing is not computing the results. I should also clarify that no cached results may be used between factorial sends ("pre-test" was an example of the rule). If someone were to use an approach related to forking work into another process, then the work is expected to be fully complete for the result and not used for subsequent sends.

I got a kick out of your approach though.

Paul Baumann
On 6/10/2011 10:06 AM, Paul Baumann wrote:
> Here is a fun challenge for those that enjoy tuning code. Respect is the
> only award for the winner.
>
> The challenge is to implement a #factorial method that executes fastest
> for the range of 1 to 3000.

Here's my algorithm:

Integer>>jmbFactorial
	| value |
	self < 13
		ifTrue:
			["If we are small enough, just compute it the standard way."
			value := 1.
			2 to: self do: [:i | value := value * i].
			^value].
	value := self jmbFactorialHelper.
	^(value at: 1) * (value at: 2)

Integer>>jmbFactorialHelper
	"Return a two element array with the first element containing the factorial for (self // 2) and the second element being (self factorial / (self // 2) factorial)."

	| half halfValues newValues oddTotal oddValue i end |
	self == 1 ifTrue: [^#(1 1)].
	half := self bitShift: -1.
	halfValues := half jmbFactorialHelper.
	newValues := Array with: (halfValues at: 1) * (halfValues at: 2) with: nil.

	"Multiply all of the odd numbers in the upper range -- only do a few at a time to try to limit to SmallIntegers"
	oddTotal := 1.
	i := half odd ifTrue: [half + 2] ifFalse: [half + 1].
	[i <= self] whileTrue:
		[oddValue := 1.
		end := i + 10 min: self.
		[i <= end] whileTrue:
			[oddValue := oddValue * i.
			i := i + 2].
		oddTotal := oddTotal * oddValue].

	"Multiply the even part of the upper range by the odd part of the upper range. The even part can be calculated from the recursive part bit shifted by the size of the upper from the recursive part."
	newValues at: 2 put: ((halfValues at: 2) bitShift: half - (half bitShift: -1)) * oddTotal.
	^newValues

Using your performance test code on my machine, I get:
#(15972 3073 -80.7601)
#(15840 3063 -80.6629)
#(15831 3019 -80.9298)

However, if I really wanted a faster factorial, the best solution is to simply search for one. For example, I implemented the split recursive algorithm by directly translating some C# code, and got these numbers:

#(16023 2699 -83.1555)
#(15950 2702 -83.0596)
#(15929 2693 -83.0937)

which are faster than my algorithm. There are even faster algorithms, but those appear to be based on prime factorization, and I wasn't sure that caching of some primes was permitted.

John Brant
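A line-for-line Python transcription of John's halving scheme may make the recursion easier to follow. It is my own illustrative sketch (variable names are mine, and the original's trick of chunking odd products to stay within SmallIntegers is omitted), not contest code:

```python
import math

def jmb_helper(n):
    """Return ((n//2)!, n! // (n//2)!), mirroring jmbFactorialHelper."""
    if n == 1:
        return 1, 1
    half = n >> 1
    lo, hi = jmb_helper(half)
    lower = lo * hi  # this is half!
    # Product of the odd numbers in (half, n].
    i = half + 2 if half & 1 else half + 1
    odd_total = 1
    while i <= n:
        odd_total *= i
        i += 2
    # The even numbers in (half, n] are 2*j for j in (half//2, half],
    # whose product of j's is exactly hi; the shift supplies the
    # factors of two, one per even number in the range.
    upper = (hi << (half - (half >> 1))) * odd_total
    return lower, upper

def jmb_factorial(n):
    if n < 13:
        # Small enough: compute it the standard way.
        result = 1
        for i in range(2, n + 1):
            result *= i
        return result
    lo, hi = jmb_helper(n)
    return lo * hi

print(jmb_factorial(13))  # 6227020800
```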
Hey, this is a lot of fun...
First I submitted the implementation in AT Integer Extensions (which I wrote). Let's call this avFactorial for the sake of illustration. It uses a binary recursion to calculate a product between two integers. The idea is to keep products between integers of similar size, because the efficiency of multiplication x*y is highest in the y=x => x^2 case (so, assuming complexity(x,y) is more or less continuous, the efficiency of multiplication is highest when x is close to y). This is also why 1 * 2 * 3... is about the worst possible way to calculate a factorial. Of course the recursion has some overhead as written, so avFactorial only uses it after 120!. avFactorial ranks at about -80 on Paul's scale.

Then, I reused what I had previously done for Karatsuba multiplication of large integers, because the VM implements the naive multiplication algorithm (see http://blogten.blogspot.com/2008/07/mr-karatsuba.html). So, avFactorial2 is basically the same code as avFactorial but using karatsubaTimes:. avFactorial2 ranks at about -81.3 on Paul's scale.

Then, I thought I should accumulate 2^n separately from the factorial, so I rewrote avFactorial2 into avFactorial3, which uses associations of the form odd -> bitShift. At the end, the result is finalAnswer key bitShift: finalAnswer value. avFactorial3 ranks at about -81 on Paul's scale. Obviously the overhead is becoming a problem for small values. In fact, avFactorial2 is faster than avFactorial3 until about 2500!, so the benefit is not obvious until later. Nevertheless, since for larger factorials the execution time is going to be dominated by multiplication, the overhead becomes less significant. For example,

Time millisecondsToRun: [100000 avFactorial3]	1361
Time millisecondsToRun: [100000 avFactorial2]	1412

Then, I saw John's post below. I thought I could do better than split recursive with the pieces I had, so I adapted John's algorithm to use Karatsuba. This results in avFactorial4, which ranks at about -84.7 on Paul's scale.

I also wrote a couple of variants: avFactorial3a always uses power of two accumulation, and jmbFactorial2 does the bitShift: last in jmbFactorialHelper.

Now, I should note that the closer we get to -100, the more sensitive the performance measurement is to small value changes on Paul's scale. So, for the sake of easier comparison, here are run times with a standardized workload and testing conditions for all the algorithms:

ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial]]]	9052

ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial2]]]	8221

ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial3]]]	7953

"Always use power of two accumulation regardless"
ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial3a]]]	7970

ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial4]]]	6331

ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x jmbFactorial]]]	7682

"Reorder final multiplication to do bitShift: last"
ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x jmbFactorial2]]]	7708

For comparison, the normal factorial performs like this (note only one workload, not 3 as above):

ObjectMemory flushNewSpace; garbageCollect.
Time millisecondsToRun: [1 to: 3000 do: [:x | x factorial]]	13108

On 6/10/2011 1:29 PM, John Brant wrote:
> On 6/10/2011 10:06 AM, Paul Baumann wrote:
>> Here is a fun challenge for those that enjoy tuning code. Respect is the
>> only award for the winner.
avFactorial.zip (7K)
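The core of Andres's argument, that big-integer multiplication is cheapest when both operands are of similar size, is easiest to see in a balanced binary-split product. The following Python sketch is my own illustration of that idea (it is not the avFactorial source, and the 120 cutoff is simply taken from Andres's description):

```python
import math

def range_product(lo, hi):
    """Product of the integers in [lo, hi], splitting the range in half
    so each multiplication pairs operands of roughly equal size."""
    if lo > hi:
        return 1
    if hi - lo < 8:
        # Small ranges: a plain loop is cheaper than recursing further.
        result = lo
        for i in range(lo + 1, hi + 1):
            result *= i
        return result
    mid = (lo + hi) >> 1
    return range_product(lo, mid) * range_product(mid + 1, hi)

def balanced_factorial(n):
    # Like avFactorial, fall back to the naive loop for small arguments,
    # where the recursion overhead dominates.
    if n < 120:
        result = 1
        for i in range(2, n + 1):
            result *= i
        return result
    return range_product(2, n)

print(balanced_factorial(10))  # 3628800
```

Compare this with the left-to-right loop 1 * 2 * 3 * ...: there, every multiplication pairs a huge running product with a tiny factor, which is the worst case for schoolbook multiplication.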
On 6/10/2011 7:51 PM, Andres Valloud wrote:
> Then, I saw John's post below. I thought I could do better than split
> recursive with the pieces I had, so I adapted John's algorithm to use
> Karatsuba. This results in avFactorial4, which ranks at about -84.7 in
> Paul's scale.

Of course, if I use your Karatsuba methods with the split recursive, it is faster. However, I can refactor my version so that it performs the bitShift for the even numbers only when you want the factorial and not for the intermediate steps. Here's the version that uses your Karatsuba methods with my bitShift change:

jmbFactorial
	| value |
	self < 13
		ifTrue:
			[value := 1.
			2 to: self do: [:i | value := value * i].
			^value].
	value := #(1 0 1 0) copy.
	self jmbFactorialHelper: value.
	^((value at: 1) karatsubaTimes: (value at: 3)) bitShift: (value at: 2) + (value at: 4)

jmbFactorialHelper: anArray
	| half oddTotal i |
	self == 1 ifTrue: [^self].
	half := self bitShift: -1.
	half jmbFactorialHelper: anArray.
	anArray at: 1 put: ((anArray at: 1) karatsubaTimes: (anArray at: 3)).
	anArray at: 2 put: (anArray at: 2) + (anArray at: 4).
	i := half odd ifTrue: [half + 2] ifFalse: [half + 1].
	oddTotal := i karatsubaOddProductTo: self.
	anArray at: 3 put: ((anArray at: 3) karatsubaTimes: oddTotal).
	anArray at: 4 put: (anArray at: 4) + half - (half bitShift: -1)

Using this version, I got these results:
#(16159 2357 -85.4137)
#(15953 2367 -85.1627)
#(15713 2324 -85.2097)

The updated split recursive algorithm numbers are:
#(16048 2346 -85.3814)
#(15955 2342 -85.3212)
#(16006 2350 -85.318)

When I compute 100000 factorial, both my algorithm and the split recursive take 1289 milliseconds vs. 28533 for the default implementation.

John Brant
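The bookkeeping in John's version, carrying the power of two in separate array slots and applying one big bitShift: only at the very end, works because the exponent of 2 in n! has a closed form (Legendre's formula; for base 2 it reduces to n minus the number of one bits in n). A quick Python check of that identity, my own and not from the thread:

```python
import math

def twos_in_factorial(n):
    """Exponent of 2 in n!.

    Legendre's formula gives sum(n // 2**k for k >= 1), which for the
    prime 2 simplifies to n - popcount(n).
    """
    return n - bin(n).count("1")

for n in (5, 13, 100, 3000):
    f = math.factorial(n)
    e = twos_in_factorial(n)
    assert f % (1 << e) == 0     # 2^e divides n! ...
    assert (f >> e) & 1 == 1     # ... and the remaining cofactor is odd

print(twos_in_factorial(3000))  # 2993
```

So an implementation can multiply only odd cofactors during the recursion and pay for all the trailing zero bits with a single shift at the end.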
Wow guys. 80% reduction in execution time is very impressive. If you openly share your results and code like that then it will make it easier for others to beat your improvements by combining strategies and making minor improvements. Hopefully others will still participate now that the bar has been set so high.
The closer you get to a 100% reduction in execution time, the more fractions of a percentage difference matter. All the execution times will be compared with a baseline value for the standard #factorial. I'll use the lowest of several measurements for the baseline. The tests will be repeated multiple times if that helps to get more precision.

John,

It would be OK to use a table of prime numbers to optimize an algorithm but not to replace a calculation:

#(3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 ...)

I wanted to avoid tables of intermediate or direct answers like this:

plbFactorial2
	^#(1 2 6 24 120 720 5040 40320 362880 3628800 39916800 479001600 ...) at: self

That is the absolute fastest approach, but it is a cache and the work is done outside of the timing:

#(16415 0 -100.0)

Paul Baumann
I'm removing one of the restrictions so that tuning can be easily expanded into new areas. Disregard rule #2. If your tuning brings you into tuning large integer implementation then go for it. The use of the standard #factorial baseline measurement makes this OK.
Disregard this rule:

2) Tuning must not affect performance of the standard #factorial method.

If you post your achievements then it would be useful to include your baseline measurement and show the percentage reduction in execution time from your baseline.

Paul Baumann
First I submitted the implementation in AT Integer Extensions (which I wrote). Let's call it avFactorial for the sake of illustration. It uses a binary recursion to calculate the product over a range of integers. The idea is to keep products between integers of similar size, because multiplication x*y is most efficient when y is close to x (the x^2 case) — so, assuming the cost of multiplying is more or less continuous in its operands, balanced operands win. This is also why 1 * 2 * 3 * ... is about the worst possible way to calculate a factorial. Of course the recursion has some overhead as written, so avFactorial only uses it after 120!. avFactorial ranks at about -80 on Paul's scale.

Then I reused what I had previously done for Karatsuba multiplication of large integers, because the VM implements the naive multiplication algorithm (see http://blogten.blogspot.com/2008/07/mr-karatsuba.html). So avFactorial2 is basically the same code as avFactorial but using karatsubaTimes:. avFactorial2 ranks at about -81.3 on Paul's scale.

Then I thought I should accumulate 2^n separately from the factorial, so I rewrote avFactorial2 into avFactorial3, which uses associations of the form odd -> bitShift. At the end, the result is finalAnswer key bitShift: finalAnswer value. avFactorial3 ranks at about -81 on Paul's scale. Obviously the overhead is becoming a problem for small values. In fact, avFactorial2 is faster than avFactorial3 until about 2500!, so the benefit is not obvious until later. Nevertheless, since for larger factorials the execution time is dominated by multiplication, the overhead becomes less significant. For example:

    Time millisecondsToRun: [100000 avFactorial3]    1361
    Time millisecondsToRun: [100000 avFactorial2]    1412

Then I saw John's post below. I thought I could do better than split recursive with the pieces I had, so I adapted John's algorithm to use Karatsuba.
This results in avFactorial4, which ranks at about -84.7 on Paul's scale. I also wrote a couple of variants: avFactorial3a always uses power-of-two accumulation, and jmbFactorial2 does the bitShift: last in jmbFactorialHelper.

Now, I should note that the closer we get to -100, the more sensitive the performance measurement is to small value changes on Paul's scale. So, for the sake of easier comparison, here are run times with a standardized workload and testing conditions for all the algorithms:

    ObjectMemory flushNewSpace; garbageCollect.
    Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial]]]      9052

    ObjectMemory flushNewSpace; garbageCollect.
    Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial2]]]     8221

    ObjectMemory flushNewSpace; garbageCollect.
    Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial3]]]     7953

    "Always use power of two accumulation regardless"
    ObjectMemory flushNewSpace; garbageCollect.
    Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial3a]]]    7970

    ObjectMemory flushNewSpace; garbageCollect.
    Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x avFactorial4]]]     6331

    ObjectMemory flushNewSpace; garbageCollect.
    Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x jmbFactorial]]]     7682

    "Reorder final multiplication to do bitShift: last"
    ObjectMemory flushNewSpace; garbageCollect.
    Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x jmbFactorial2]]]    7708

For comparison, the normal factorial performs like this (note only one workload, not 3 as above):

    ObjectMemory flushNewSpace; garbageCollect.
    Time millisecondsToRun: [1 to: 3000 do: [:x | x factorial]]    13108

On 6/10/2011 1:29 PM, John Brant wrote:
> On 6/10/2011 10:06 AM, Paul Baumann wrote:
>> Here is a fun challenge for those that enjoy tuning code. Respect is
>> the only award for the winner.
>>
>> The challenge is to implement a #factorial method that executes
>> fastest for the range of 1 to 3000. Performance is measured relative
>> to the current #factorial execution time. Final judgment will be done
>> on a single machine using the same configuration for all tests. I
>> will announce the entry results and winner once entries are judged.
>> It is recommended that you prefix your #factorial implementation with
>> your initials (like #plbFactorial).
>
> Here's my algorithm:
>
> Integer>>jmbFactorial
>     | value |
>     self < 13
>         ifTrue:
>             ["If we are small enough, just compute it the standard way."
>             value := 1.
>             2 to: self do: [:i | value := value * i].
>             ^value].
>     value := self jmbFactorialHelper.
>     ^(value at: 1) * (value at: 2)
>
> Integer>>jmbFactorialHelper
>     "Return a two element array with the first element containing the
>     factorial for (self // 2) and the second element being
>     (self factorial / (self // 2) factorial)."
>
>     | half halfValues newValues oddTotal oddValue i end |
>     self == 1 ifTrue: [^#(1 1)].
>     half := self bitShift: -1.
>     halfValues := half jmbFactorialHelper.
>     newValues := Array with: (halfValues at: 1) * (halfValues at: 2) with: nil.
>
>     "Multiply all of the odd numbers in the upper range -- only do a few
>     at a time to try to limit to SmallIntegers"
>     oddTotal := 1.
>     i := half odd ifTrue: [half + 2] ifFalse: [half + 1].
>     [i <= self] whileTrue:
>         [oddValue := 1.
>         end := i + 10 min: self.
>         [i <= end] whileTrue:
>             [oddValue := oddValue * i.
>             i := i + 2].
>         oddTotal := oddTotal * oddValue].
>
>     "Multiply the even part of the upper range by the odd part of the
>     upper range. The even part can be calculated from the recursive part
>     bit shifted by the size of the upper range from the recursive part."
>     newValues at: 2 put: ((halfValues at: 2) bitShift: half - (half bitShift: -1)) * oddTotal.
>     ^newValues
>
> Using your performance test code on my machine, I get:
>
> #(15972 3073 -80.7601)
> #(15840 3063 -80.6629)
> #(15831 3019 -80.9298)
>
> However, if I really wanted a faster factorial, the best solution is
> to simply search for one. For example, I implemented the split
> recursive algorithm by directly translating some C# code, and got these numbers:
>
> #(16023 2699 -83.1555)
> #(15950 2702 -83.0596)
> #(15929 2693 -83.0937)
>
> which are faster than my algorithm. There are even faster algorithms,
> but those appear to be based on prime factorization, and I wasn't sure
> that caching of some primes was permitted.
>
> John Brant

This message may contain confidential information and is intended for specific recipients unless explicitly noted otherwise. If you have reason to believe you are not an intended recipient of this message, please delete it and notify the sender. This message may not represent the opinion of IntercontinentalExchange, Inc. (ICE), its subsidiaries or affiliates, and does not constitute a contract or guarantee. Unencrypted electronic mail is not secure and the recipient of this message is expected to provide safeguards from viruses and pursue alternate means of communication where privacy or a binding message is desired.

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Paul Baumann <paul.baumann <at> theice.com> writes:
>
> I'm removing one of the restrictions so that tuning can be easily expanded
> into new areas. Disregard rule #2. If your tuning brings you into tuning
> large integer implementation then go for it. The use of the standard
> #factorial baseline measurement makes this OK.
>
> Disregard this rule:
> 2) Tuning must not affect performance of the standard #factorial method.
>
> If you post your achievements then it would be useful to include your
> baseline measurement and show the percentage reduction in execution time
> from your baseline.
>
> Paul Baumann

Hi Paul,

With an implementation of the prime swing divide and conquer algorithm - see http://www.luschny.de/math/factorial/FastFactorialFunctions.htm - combined with a Karatsuba multiplication, I get a score of -88%. For 10000 factorial, the speed-up ratio reaches 14:

    [FactorialPrimeSwing new factorialOf: 10000] timeToRun.     41.542 milliseconds
    [10000 factorial] timeToRun.                               587.689 milliseconds
    587.689 / 41.542                                            14.1469

The code is at FactorialPrimeSwingContest in public store with MIT license. Feel free to improve...

Nicolas

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
On 6/12/2011 7:18 PM, nicolas cellier wrote:
> with an implementation of prime swing divide and conquer algorithm - see
> http://www.luschny.de/math/factorial/FastFactorialFunctions.htm - combined with
> a karatsuba multiplication, i get a score of -88%.

What machine/vm are you running on? When I run your code on my 64 bit Windows 7 with 32 bit VW vm, I'm getting:

#(16056 2372 -85.2267)
#(15925 2372 -85.1052)
#(16034 2376 -85.1815)

> The code is at FactorialPrimeSwingContest in public store with MIT license.
> Feel free to improve...

Your version of the karatsubaTimes: method is a little faster than Andres'. When I use it with a slightly different version of my algorithm (code below), I get:

#(15940 2205 -86.1669)
#(16020 2202 -86.2547)
#(15937 2203 -86.1768)

I'm surprised that the prime swing isn't faster. At 100,000 factorial, it is a few milliseconds faster:

Time millisecondsToRun: [FactorialPrimeSwing new factorialOf: 100000]    1047 1048 1049
Time millisecondsToRun: [100000 jmbFactorial]                            1056 1053 1053

However, when I raised it to 150,000, mine was a little faster:

Time millisecondsToRun: [FactorialPrimeSwing new factorialOf: 150000]    2171 2147 2147
Time millisecondsToRun: [150000 jmbFactorial]                            2141 2127 2128

At 200,000 they were roughly the same:

Time millisecondsToRun: [FactorialPrimeSwing new factorialOf: 200000]    3541 3541 3528
Time millisecondsToRun: [200000 jmbFactorial]                            3526 3529 3541

John Brant

----------------------

jmbFactorial
    | value |
    self < 13
        ifTrue: [value := 1.
            2 to: self do: [:i | value := value * i].
            ^value].
    value := #(1 0 1 0) copy.
    self jmbFactorialHelper: value.
    ^((value at: 1) karatsubaTimes: (value at: 3)) bitShift: (value at: 2) + (value at: 4)

jmbFactorialHelper: anArray
    | half oddTotal i |
    self == 1 ifTrue: [^self].
    half := self bitShift: -1.
    half jmbFactorialHelper: anArray.
    anArray at: 1 put: ((anArray at: 1) karatsubaTimes: (anArray at: 3)).
    anArray at: 2 put: (anArray at: 2) + (anArray at: 4).
    i := half odd ifTrue: [half + 2] ifFalse: [half + 1].
    oddTotal := i karatsubaProductTo: self by: 2.
    anArray at: 3 put: ((anArray at: 3) karatsubaTimes: oddTotal).
    anArray at: 4 put: (anArray at: 4) + half - (half bitShift: -1)

karatsubaProductTo: anInteger by: stepInteger
    | next doubleStep |
    self > anInteger ifTrue: [^1].
    next := self + stepInteger.
    next <= anInteger
        ifTrue: [doubleStep := stepInteger bitShift: 1.
            ^(self karatsubaProductTo: anInteger by: doubleStep)
                karatsubaTimes: (next karatsubaProductTo: anInteger by: doubleStep)]
        ifFalse: [^self]

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
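[Editorial illustration: the karatsubaTimes: message used throughout this thread is an image-side replacement for the VM's naive (schoolbook) multiplication. Karatsuba splits each operand at bit position n, x = a*2^n + b and y = c*2^n + d, and uses x*y = ac*2^(2n) + ((a+b)(c+d) - ac - bd)*2^n + bd, which needs three recursive multiplications instead of four. A Python sketch of the algorithm for nonnegative integers follows — Python's own big integers already use Karatsuba internally, so this is purely to show the scheme; the 64-bit cutoff is an arbitrary choice of mine:]

```python
def karatsuba(x, y):
    # Below the cutoff, the native multiply beats the recursion overhead.
    if x < (1 << 64) or y < (1 << 64):
        return x * y
    # Split both operands at bit position n: x = a*2^n + b, y = c*2^n + d.
    n = min(x.bit_length(), y.bit_length()) // 2
    a, b = x >> n, x & ((1 << n) - 1)
    c, d = y >> n, y & ((1 << n) - 1)
    ac = karatsuba(a, c)
    bd = karatsuba(b, d)
    # (a+b)(c+d) - ac - bd == ad + bc, obtained with a single multiply.
    mid = karatsuba(a + b, c + d) - ac - bd
    return (ac << (2 * n)) + (mid << n) + bd
```

This is also why balanced operands matter so much in the factorial algorithms above: Karatsuba's subquadratic cost only pays off when both operands are large and of similar size.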
John Brant <brant <at> refactoryworkers.com> writes:
>
> On 6/12/2011 7:18 PM, nicolas cellier wrote:
>> with an implementation of prime swing divide and conquer algorithm - see
>> http://www.luschny.de/math/factorial/FastFactorialFunctions.htm - combined with
>> a karatsuba multiplication, i get a score of -88%.
>
> What machine/vm are you running on? When I run your code on my 64 bit
> Windows 7 with 32 bit VW vm, I'm getting:
>
> #(16056 2372 -85.2267)
> #(15925 2372 -85.1052)
> #(16034 2376 -85.1815)

VisualWorks® NonCommercial, 7.7 of 15 décembre 2009
MacOSX

> I'm surprised that the prime swing isn't faster. [...]
>
> John Brant

Here jmbFactorial is a bit better than primeSwingFactorial v1.2 using the same karatsubaTimes: method:

Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x jmbFactorial]]]           12598
Time millisecondsToRun: [3 timesRepeat: [1 to: 3000 do: [:x | x primeSwingFactorial]]]    12737

[ObjectMemory garbageCollect. [100000 jmbFactorial] timeToRun] value asSeconds asFloat /
[ObjectMemory garbageCollect. [100000 primeSwingFactorial] timeToRun] value asSeconds
    0.96685 0.968505

[ObjectMemory garbageCollect. [150000 jmbFactorial] timeToRun] value asSeconds asFloat /
[ObjectMemory garbageCollect. [150000 primeSwingFactorial] timeToRun] value asSeconds
    0.940478 0.941528

[ObjectMemory garbageCollect. [200000 jmbFactorial] timeToRun] value asSeconds asFloat /
[ObjectMemory garbageCollect. [200000 primeSwingFactorial] timeToRun] value asSeconds
    0.947829 0.947372

| t0 t1 |
ObjectMemory garbageCollect.
t0 := Time millisecondsToRun: [1 to: 3000 do: [:i | i factorial]].
ObjectMemory garbageCollect.
t1 := Time millisecondsToRun: [1 to: 3000 do: [:i | i primeSwingFactorial]].
^Array with: t0 with: t1 with: 1 - (t1 / t0) * -100.0

#(40805 4301 -89.4596)
#(40362 4300 -89.3464)
#(41264 4363 -89.4266)

| t0 t1 |
ObjectMemory garbageCollect.
t0 := Time millisecondsToRun: [1 to: 3000 do: [:i | i factorial]].
ObjectMemory garbageCollect.
t1 := Time millisecondsToRun: [1 to: 3000 do: [:i | i jmbFactorial]].
^Array with: t0 with: t1 with: 1 - (t1 / t0) * -100.0

#(40664 4146 -89.8042)
#(39925 4086 -89.7658)
#(40983 4145 -89.8861)

Nicolas

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
On 6/12/2011 9:13 PM, nicolas cellier wrote:
>> On 6/12/2011 7:18 PM, nicolas cellier wrote:
>>> with an implementation of prime swing divide and conquer algorithm - see
>>> http://www.luschny.de/math/factorial/FastFactorialFunctions.htm - combined with
>>> a karatsuba multiplication, i get a score of -88%.
>>
>> What machine/vm are you running on? When I run your code on my 64 bit
>> Windows 7 with 32 bit VW vm, I'm getting:
>>
>> #(16056 2372 -85.2267)
>> #(15925 2372 -85.1052)
>> #(16034 2376 -85.1815)
>
> VisualWorks® NonCommercial, 7.7 of 15 décembre 2009
> MacOSX

Here are the numbers for 64 bit VW 7.7 running on 64 bit Linux in a VMWare session (on my Windows 7 machine):

jmbFactorial:
#(17952 895 -95.0145)
#(17982 895 -95.0228)
#(18071 903 -95.003)

primeSwingFactorial (v1.2):
#(17826 1343 -92.4661)
#(17816 1335 -92.5067)
#(17902 1338 -92.526)

I also ran the GMP factorial function:

--------------
#include <gmp.h>

int main(int argc, char **argv) {
    mpz_t v;
    int i;
    int j;
    for (j = 0; j < 100; j++)
        for (i = 1; i <= 3000; i++) {
            mpz_init_set_si(v, 0);
            mpz_fac_ui(v, i);
        }
}
--------------

This takes ~180ms per iteration compared to the fastest VW time of 895ms. While I haven't looked at the code for the GMP factorial algorithm, its documentation appears to be describing what jmbFactorial is doing. The big performance gain for GMP comes in when you do really large factorials. For example, 1,000,000 factorial using GMP takes ~1.2 seconds vs. 48.4 seconds using jmbFactorial in VW. It appears that a lot of the VW time is spent doing GC. I'm guessing that it does a GC of old space when it really only needs a GC of large space.

John Brant

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Assuming of course that large space is large enough to hold all that
stuff...

On 6/13/2011 6:01 AM, John Brant wrote:
> [...]
> It appears that a lot of the VW time is spent doing GC. I'm guessing
> that it does a GC of old space when it really only needs a GC of
> large space.
>
> John Brant

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Try liberally increasing the size of large space via ObjectMemory
class>>sizesAtStartup: (or, in newer VMs, via one of the -m switches).

On 6/13/2011 10:08 AM, Andres Valloud wrote:
> Assuming of course that large space is large enough to hold all that
> stuff...
> [...]

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Also, FYI, we just improved large integer performance on most big endian
platforms by something between 10% and 25%. More speed ups are possible (basically so large integers will be as fast on big endian platforms as they are on little endian platforms), but we need to do other prep work before we do that.

On 6/13/2011 10:08 AM, Andres Valloud wrote:
> Assuming of course that large space is large enough to hold all that
> stuff...
> [...]

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
And finally I'm doing some experiments to speed up the transition from
small integers into large integers. At first glance, it looks like we might get anywhere from 3x to 10x depending on the operation... we'll see.

On 6/13/2011 10:08 AM, Andres Valloud wrote:
> Assuming of course that large space is large enough to hold all that
> stuff...
> [...]

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc