Why limit the size of hash values? And why use a bitshift of only the last digit of a large integer?
Integer>>hash
	^(self lastDigit bitShift: 8) + (self digitAt: 1)

SmallInteger>>hash
	^self

We just fixed a problem whereby integers used for oops crossed into VW large integer range (>536870911). The VW hash algorithm for large integers became an issue for GBS cache dictionaries and caused serious performance degradation. GBS code can easily be fixed to work around the problem by avoiding the #hash send for the GsToSt cache dictionary; however, the VW implementation of Integer>>hash just seems wrong.
I've never seen a hash value used in a way that requires it to be limited in size. Usually the remainder of a division of a hash value is used for collection indexing, so having a large number isn't a problem and actually helps provide better distribution. Why not define Integer>>hash to just return "self" like SmallInteger does now? It doesn't make sense to have size limits for #hash other than to constrain it to be a positive number--which is not the case now.
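Paul's indexing argument can be modeled in a few lines (Python here, purely for illustration; TABLE_SIZE is an arbitrary hypothetical bucket count, not anything from VW):

```python
# A hash table reduces any hash value with a remainder before indexing,
# so the magnitude of the hash itself never matters -- only how many
# distinct values it takes on.
TABLE_SIZE = 127  # hypothetical bucket count

def bucket_for(hash_value):
    # Remainder-based indexing, as Paul describes for collection lookup.
    return hash_value % TABLE_SIZE

# Arbitrarily large hash values still index within bounds:
hashes = [10**e for e in range(1, 60)]
buckets = [bucket_for(h) for h in hashes]
print(all(0 <= b < TABLE_SIZE for b in buckets))  # True
```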
Paul Baumann
IntercontinentalExchange | ICE
Paul Baumann wrote:
> Why limit the size of hash values? And, why use a bitshift of only the
> last digit of a large integer?
>
> Integer>>hash
> 	^(self lastDigit bitShift: 8) + (self digitAt: 1)
>
> SmallInteger>>hash
> 	^self

I'm kind of in favor of keeping hashes within the SmallInteger range, primarily for performance reasons, but I agree that the existing Integer>>hash implementation is an especially bad one. You can easily have a very large collection of LargeIntegers that all have the same high-order byte, so only 256 hash values.

It seems like a much better hash that stays within the SmallInteger bounds would be to use the low-order 24 bits of the LargeInteger:

LargeInteger>>hash
	^((self digitAt: 3) bitShift: 16)
	+ ((self digitAt: 2) bitShift: 8)
	+ (self digitAt: 1)

All intermediate values are SmallIntegers, digitAt: is a primitive, and bitShift: for SmallIntegers is pretty fast, so this should perform pretty well (I haven't timed it).

This hash would not give a good distribution if you happened to have a bunch of LargeIntegers whose values differed by exact multiples of 2^24. This should be a much rarer case than the cases in which the current hash fails, but if you really want to guard against it you could xor the higher-order bytes into the hash.

Regards,

-Martin
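Martin's 256-value collapse is easy to reproduce in a quick model (Python here, purely to illustrate the arithmetic; `digit_at` mimics Smalltalk's 1-based, least-significant-first `digitAt:`):

```python
def digit_at(n, i):
    """Byte i (1-based, little-endian) of the magnitude of n, like digitAt:."""
    return (abs(n) >> (8 * (i - 1))) & 0xFF

def digit_length(n):
    """Bytes needed for the magnitude of n, like digitLength."""
    n = abs(n)
    length = 1
    while n > 0xFF:
        n >>= 8
        length += 1
    return length

def current_hash(n):
    """VW's Integer>>hash: top byte shifted left 8, plus the low byte."""
    return (digit_at(n, digit_length(n)) << 8) + digit_at(n, 1)

def low24_hash(n):
    """Martin's proposal: the low-order 24 bits of the large integer."""
    return (digit_at(n, 3) << 16) + (digit_at(n, 2) << 8) + digit_at(n, 1)

# 10000 consecutive large integers, all sharing one high-order byte:
ints = range(536870912, 536870912 + 10000)
print(len({current_hash(n) for n in ints}))  # 256 distinct values
print(len({low24_hash(n) for n in ints}))    # 10000 distinct values
```

The low byte can only take 256 values, so the current hash collapses any run of consecutive large integers to at most 256 hashes, while the low-24-bit hash keeps them all distinct.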
In reply to this post by Paul Baumann
Hi Martin,
Constraining #hash to return [unsigned] small integers seems OK. That is the best option if there is a performance hit when computing the modulo of VERY large integers (per Robert's response), and if the cost of computing a small integer from a large integer is small. I haven't measured performance to see if such constraints are of any likely benefit, but it seems possible.

I hope the hash range will grow in future releases along with the range of SmallInteger. If it doesn't then this issue will resurface in terabyte images fifteen years from now. :)

Is anyone from Cincom following this discussion? Will you fix this?

Paul Baumann
IntercontinentalExchange | ICE
2100 RiverEdge Pkwy | 5th Floor | Atlanta, GA 30328
Tel: 770.738.2137 | Fax: 770.951.1307 | Cel: 505.780.1470
Paul Baumann wrote:
> I hope the hash range will grow in future releases along with the range
> of SmallInteger. If it doesn't then this issue will resurface in
> terabyte images fifteen years from now. :)

Yes. It would be nice to have LargeInteger hashes that are 7 bytes long in the 64-bit version. This might require some special thought when converting 32-bit images to 64-bit ones. I haven't looked at the current Integer conversion strategy -- does it make "small large integers" out of the existing large integers in the image that could be SmallIntegers?

-Martin
In reply to this post by Paul Baumann
Martin McClure <[hidden email]> wrote:
| Paul Baumann wrote:
| > I hope the hash range will grow in future releases along with the range
| > of SmallInteger. If it doesn't then this issue will resurface in
| > terabyte images fifteen years from now. :)
| Yes. It would be nice to have LargeInteger hashes that are 7 bytes long
| in the 64-bit version. This might require some special thought when
| converting 32-bit images to 64-bit ones.

I *think* the current strategy "just works" for vanilla images. See below.

| I haven't looked at the current
| Integer conversion strategy -- does it make "small large integers" out
| of the existing large integers in the image that could be SmallIntegers?

Yes. The image conversion maps all large positive and negative integers that are representable by the 64-bit system's 61-bit SmallIntegers, and does an analogous thing for Doubles that are representable as SmallDoubles (0 and the doubles with exponents in the range 894 through 1152, or approx +/- 5.9e-38 to +/- 6.8e+38).

As far as hashes go, IMO (there is some disagreement here) the system must rehash every hashed collection on start-up when bringing up an image freshly converted from 32 bits to 64 bits. The conversion from 32 to 64 bits necessarily changes identity hashes because

a) of the implementation of class references (the 64-bit system has relatively compact object headers that contain 20-bit class indices, allowing for only 2^20 classes). A class's identity hash is its class index.

b) the identity hashes of 61-bit SmallIntegers and SmallDoubles differ unavoidably from their non-immediate LargeInteger and Double duals in the 32-bit world.

c) it is generally desirable to have more identity hash values if possible. The 64-bit system supports 20-bit id hashes, up from 14 bits in the 32-bit system. New better-distributed id hashes are given to objects during conversion.

If identity hashes change then all hashed collections must be rehashed, because the default implementations of #hash and #= in Object are in terms of #identityHash and #==. But in general this rehashing cannot be done by the image converter, since hash functions are user-defined and don't necessarily default to identityHash. The converter can't compute the values of user-definable hash functions without implementing a full Smalltalk evaluator, which I consider unnecessary duplication. So the converter rehashes those collections it must for the system to boot, i.e. general instances of MethodDictionary and HandleRegistry. Very early in system startup the image checks ObjectMemory maximumIdentityHashValue (see ObjectMemory class>>#hashInstall) and rehashes all hashable general instances of Collection if it has changed.

So as far as I can see (not that far, usually) there's no problem having different implementations of Integer>>hash in the 32-bit and 64-bit system as far as basic Smalltalk execution goes. How about something like

Integer methods for comparing
hash
	^(self bitAnd: (SmallInteger maxVal bitShift: -1))
	+ ((self digitAt: self digitLength) bitShift: 8)

where LargeInteger digitLength is reimplemented as

Integer methods for accessing
digitLength
	"Answer the number of bytes needed to represent
	 the absolute value of the receiver."
	"Class is byte-indexed, so just answer the size."
	<primitive: 62>
	^self basicSize

---
Eliot Miranda ,,,^..^,,, mailto:[hidden email]
VisualWorks Engineering, Cincom Smalltalk: scene not herd
Tel +1 408 216 4581  3350 Scott Blvd, Bldg 36 Suite B, Santa Clara, CA 95054 USA  Fax +1 408 216 4500
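The arithmetic of the proposed hash can be sanity-checked outside Smalltalk (a Python sketch, for illustration only; SMALL_MAX stands in for 32-bit VW's SmallInteger maxVal of 536870911, the figure quoted at the top of the thread). The bitAnd with maxVal shifted right one bit is what keeps the subsequent add from overflowing the SmallInteger range:

```python
SMALL_MAX = 536870911  # 2**29 - 1, 32-bit VW SmallInteger maxVal

def last_digit(n):
    """Most significant byte of n, like (self digitAt: self digitLength)."""
    while n > 0xFF:
        n >>= 8
    return n

def proposed_hash(n):
    """Model of: (self bitAnd: (SmallInteger maxVal bitShift: -1))
                 + ((self digitAt: self digitLength) bitShift: 8)"""
    return (n & (SMALL_MAX >> 1)) + (last_digit(n) << 8)

# 10000 consecutive integers just above SmallInteger maxVal:
ints = range(SMALL_MAX + 1, SMALL_MAX + 10001)
hashes = {proposed_hash(n) for n in ints}
print(len(hashes))               # 10000: no collisions on this run
print(max(hashes) <= SMALL_MAX)  # True: the sum stays in SmallInteger range
```

The mask contributes up to 2^28 - 1 and the shifted top byte at most 255 * 256, so the sum cannot exceed maxVal.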
In reply to this post by Paul Baumann
Hi Eliot,
Wow. It is wonderful to see a discussion of increasing the maximum identityHash. The range in 32-bit VW has been a pain. With 64-bit VW, it is unlikely that GS oop values would ever exceed the new range of SmallInteger maxVal. 64-bit will make problems like this simply go away.

You could maintain different implementations of Integer>>hash in 32-bit and 64-bit, but I wonder if it is really necessary. I mean, is it really likely that a 32-bit application would have a hash-based collection containing LPI instances that greatly exceed SmallInteger maxVal? Is a modulo of those values really inefficient enough that Cincom would have to maintain two versions of Integer>>hash? I'm OK with returning a LPI hash within the range of SmallInteger, but it still doesn't seem worth the effort. Just return "self" for all integers for both 32-bit and 64-bit.

Consider these measurements:

| ints |
ints := (1 to: 170) collect: [:e | 10 raisedTo: e].
Core.Time millisecondsToRun: [10000 timesRepeat: [ints asSet]].

1584
2689

Set is a hashing collection. The second result shown (2689 ms) was with the current Integer>>hash behavior. The first result shown (1584 ms) was with LargePositiveInteger>>hash added to return "self". If there is a cost to doing the modulo of large integers then it certainly didn't show up in these results. But hey, I'm happy with any fix that gets a better hash value range for integers in the upper range of possible GS oop values.

Regards,

Paul Baumann
In reply to this post by Paul Baumann
I looked at using self bitAnd: (SmallInteger maxVal bitShift: -1) + (self lastDigit bitShift: 8) as a hash function and it looks ok. Here are some timings.

hash is the current Integer>>hash. speedOldHash is the current Integer hash implemented as fast as I can see in vanilla Smalltalk; essentially digitLength is made a primitive and lastDigit is inlined. newHash is (SmallInteger maxVal bitShift: -1) + (self lastDigit bitShift: 8) with lastDigit inlined, i.e.

	(self bitAnd: (SmallInteger maxVal bitShift: -1)) + ((self digitAt: self basicSize) bitShift: 8)

The timing script collects all large integers, sees how many different values are produced, and times the various hash functions. A send of #class is used as the overhead measurement case and its time is subtracted when ratios are computed.

(| insts r times |
insts := LargeInteger allGeneralInstances asArray.
times := Dictionary with: #class -> 0 with: #hash -> 0 with: #speedOldHash -> 0 with: #newHash -> 0.
r := 1000.
(1 to: 10) do:
	[:n|
	times at: #class put: (times at: #class) + (Time microsecondsToRun: [r timesRepeat: [1 to: insts size do: [:i| (insts at: i) class]]]).
	times at: #hash put: (times at: #hash) + (Time microsecondsToRun: [r timesRepeat: [1 to: insts size do: [:i| (insts at: i) hash]]]).
	times at: #speedOldHash put: (times at: #speedOldHash) + (Time microsecondsToRun: [r timesRepeat: [1 to: insts size do: [:i| (insts at: i) speedOldHash]]]).
	times at: #newHash put: (times at: #newHash) + (Time microsecondsToRun: [r timesRepeat: [1 to: insts size do: [:i| (insts at: i) newHash]]])].
insts size
	-> times
	-> (#(hash speedOldHash newHash) collect: [:sym| (times at: sym) asDouble - (times at: #class) / ((times at: #hash) - (times at: #class))])
	-> (insts collect: [:li| li hash]) asSet size
	-> (insts collect: [:li| li speedOldHash]) asSet size
	-> (insts collect: [:li| li newHash]) asSet size
	-> ((insts collect: [:li| li newHash]) asSet size asDouble / (insts collect: [:li| li hash]) asSet size))

The results in my image on our mondo Opteron server are:

14164
->Dictionary (#class->2880200 #speedOldHash->7130494 #hash->11650808 #newHash->25181745 )
->#(1.0d 0.48460654039036d 2.5427592933124d)
->1203
->1203
->14055
->11.683291770574d

i.e. the base case of looping sending #class takes 2.88 seconds. The original hash takes 11.65 seconds, the new hash 25.18 seconds. The original hash produces only 1203 values from 14164 integers whereas the new hash provides 14055.

The bottom line is that the new hash is two and a half times slower (5 times slower than a sped-up implementation) but (albeit with a small sample size) it collapses the space of hash values by only (14164-14055)/14164 = 0.77%, whereas the current hash function collapses it by (14164-1203)/14164 = 91.51%.

Paul, what was your new hash function and what realistic measurements do you have? If yours differs markedly from mine could you compare the two in a realistic setting?

I'll reply to your message tomorrow. The missus is banging on the ceiling; it's time for me to go upstairs and help with the kids :)

---
Eliot Miranda ,,,^..^,,,

Integer-newHash.st (884 bytes)
In reply to this post by Paul Baumann
Hi Eliot,
The "newHash" had a low distribution in a dispersed GS oop range to 1.5M:

VW 7.4.1 default:
	10356 large integers. Rate 37 per ms
	Added to growing bag of 39 per ms.
	Matching add of 931 per ms.
	Hash ranges: 10356 hash values ranging 8192 to 23038 with 0 repeats.

LPI>>hash ^(SmallInteger maxVal bitShift: -1) + (self lastDigit bitShift: 8)
	10356 large integers. Rate 19 per ms
	Added to growing bag of 32 per ms.
	Matching add of 49 per ms.
	Hash ranges: 58 hash values ranging 268443647 to 268458239 with 10298 repeats.

Martin's LPI>>hash
	10356 large integers. Rate 54 per ms
	Added to growing bag of 57 per ms. "Why so low?"
	Matching add of 1235 per ms.
	Hash ranges: 10356 hash values ranging 0 to 16775947 with 0 repeats.

LPI>>hash ^self
	10356 large integers. Rate 326 per ms
	Added to growing bag of 693 per ms.
	Matching add of 617 per ms.
	Hash ranges: 10356 hash values ranging 536870912 to 1499916977 with 0 repeats.

Returning "self" is the winner here, but now look at the first 10000 values above maxValue:

VW 7.4.1 default:
	10000 large integers. Rate 0 per ms
	Added to growing bag of 1 per ms.
	Matching add of 1 per ms.
	Hash ranges: 256 hash values ranging 8192 to 8447 with 9744 repeats.

LPI>>hash ^(SmallInteger maxVal bitShift: -1) + (self lastDigit bitShift: 8)
	10000 large integers. Rate 0 per ms
	Added to growing bag of 0 per ms.
	Matching add of 0 per ms.
	Hash ranges: 1 hash values ranging 268443647 to 268443647 with 9999 repeats.

Martin's LPI>>hash
	10000 large integers. Rate 531 per ms
	Added to growing bag of 1150 per ms.
	Matching add of 989 per ms.
	Hash ranges: 10000 hash values ranging 0 to 9999 with 0 repeats.

LPI>>hash ^self
	10000 large integers. Rate 329 per ms
	Added to growing bag of 672 per ms.
	Matching add of 645 per ms.
	Hash ranges: 10000 hash values ranging 536870912 to 536880911 with 0 repeats.

Martin's implementation performs best in the lower range of LargePositiveInteger. Returning the LPI itself performs best in a broader GS range of LPI.

So far: Martin's implementation is best if there is a need to constrain the value returned from #hash. Returning the LPI itself is the best general approach if no reason can be found to constrain the value.

It comes down to these questions: Is there a need to constrain the value returned from #hash? Are there even more efficient ways to return constrained values?

Paul Baumann

hashPerformanceMetrics.st (1K)
In reply to this post by Paul Baumann
"Paul Baumann" <[hidden email]> wrote:
| Hi Eliot,
| The "newHash" had a low distribution in a dispersed GS oop range to
| 1.5M:

Paul, you made a booboo. My hash isn't

	^(SmallInteger maxVal bitShift: -1) + (self lastDigit bitShift: 8)

it's

	^(self bitAnd: (SmallInteger maxVal bitShift: -1)) + (self lastDigit bitShift: 8)

i.e. bitAnd by one bit less than the maximum positive SmallInteger to avoid overflow on the subsequent add, and add in the most significant byte to spread the hashes more. Can you try again with the correct definition? TIA.

---
Eliot Miranda ,,,^..^,,,
VisualWorks Engineering, Cincom Smalltalk: scene not herd
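The effect of the missing bitAnd can be reproduced in a quick model (Python here, purely for illustration; SMALL_MAX is 32-bit VW's SmallInteger maxVal, 536870911, and the sample range mirrors Paul's "dispersed GS oop range" script quoted later in the thread). The mistaken variant adds a constant to the shifted top byte, so only the top byte contributes:

```python
SMALL_MAX = 536870911  # 2**29 - 1

def last_digit(n):
    """Most significant byte of n, like LargeInteger>>lastDigit."""
    while n > 0xFF:
        n >>= 8
    return n

def mistaken_hash(n):
    """The variant Paul measured: a constant plus the shifted top byte."""
    return (SMALL_MAX >> 1) + (last_digit(n) << 8)

def corrected_hash(n):
    """Eliot's intended hash: mask the integer, then add the top byte."""
    return (n & (SMALL_MAX >> 1)) + (last_digit(n) << 8)

# Paul's dispersed sample: SmallInteger maxVal + 1 to 1.5e9 by 93003.
ints = range(SMALL_MAX + 1, 1500000000, 93003)
print(len(ints))                                # 10356 integers
print(len({mistaken_hash(n) for n in ints}))    # 58 distinct hashes
print(len({corrected_hash(n) for n in ints}))   # 10356 distinct hashes
```

The model reproduces the figures reported in the thread: 58 distinct values (268443647 through 268458239) for the mistaken form, versus one distinct hash per integer for the corrected one.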
In reply to this post by Paul Baumann
Oops. I read "newHash is" and started copying code.
The distribution is good now. The new numbers are much better in the range of LPI below 1.5M. In that range the performance was 270/ms compared to 326/ms returning "self". In the 10K range above SmallInteger the performance was 30/ms compared to Martin's 531/ms. "ints := (Core.SmallInteger maxVal + 1 to: 1500000000 by: 93003) asArray." | LPI>>hash ^(SmallInteger maxVal bitShift: -1) + (self lastDigit | bitShift: 8) | 10356 large integers. Rate 19 per ms | Added to growing bag of 32 per ms. | Matching add of 49 per ms. | Hash ranges: 58 hash values ranging 268443647 to 268458239 with | 10298 repeats. Eliot's LPI>>hash ^(self bitAnd: (SmallInteger maxVal bitShift: -1)) + (self lastDigit bitShift: 8) 10356 large integers. Rate 270 per ms Added to growing bag of 538 per ms. Matching add of 542 per ms. Hash ranges: 10356 hash values ranging 8192 to 268418690 with 0 repeats. "ints := (Core.SmallInteger maxVal + 1 to: Core.SmallInteger maxVal + 10000) asArray. " | LPI>>hash ^(SmallInteger maxVal bitShift: -1) + (self lastDigit | bitShift: 8) | 10000 large integers. Rate 0 per ms | Added to growing bag of 0 per ms. | Matching add of 0 per ms. | Hash ranges: 1 hash values ranging 268443647 to 268443647 with | 9999 repeats. Eliot's LPI>>hash ^(self bitAnd: (SmallInteger maxVal bitShift: -1)) + (self lastDigit bitShift: 8) 10000 large integers. Rate 30 per ms Added to growing bag of 70 per ms. Matching add of 55 per ms. Hash ranges: 10000 hash values ranging 8192 to 18191 with 0 repeats. 
Paul Baumann
IntercontinentalExchange | ICE
2100 RiverEdge Pkwy | 5th Floor | Atlanta, GA 30328
Tel: 770.738.2137 | Fax: 770.951.1307 | Cel: 505.780.1470
[hidden email]
24-hour ice helpdesk 770.738.2101
www.theice.com

-----Original Message-----
From: [hidden email] [mailto:[hidden email]]
Sent: Friday, September 15, 2006 12:48 PM
To: Paul Baumann
Cc: [hidden email]; [hidden email]
Subject: Re: [bug] Integer>>hash

"Paul Baumann" <[hidden email]> wrote:
| Hi Eliot,
| The "newHash" had a low distribution in a dispersed GS oop range to
| 1.5M:

Paul, you made a booboo. My hash isn't
    ^(SmallInteger maxVal bitShift: -1) + (self lastDigit bitShift: 8)
it's
    ^(self bitAnd: (SmallInteger maxVal bitShift: -1)) + (self lastDigit bitShift: 8)
i.e. bitAnd by one bit less than the maximum positive SmallInteger to avoid overflow on the subsequent add. Add in the most significant byte to spread the hashes more. Can you try again with the correct definition? TIA

| VW 7.4.1 default:
| 10356 large integers. Rate 37 per ms
| Added to growing bag of 39 per ms.
| Matching add of 931 per ms.
| Hash ranges: 10356 hash values ranging 8192 to 23038 with 0 repeats.
|
| LPI>>hash ^(SmallInteger maxVal bitShift: -1) + (self lastDigit
| bitShift: 8)
| 10356 large integers. Rate 19 per ms
| Added to growing bag of 32 per ms.
| Matching add of 49 per ms.
| Hash ranges: 58 hash values ranging 268443647 to 268458239 with
| 10298 repeats.
|
| Martin's LPI>>hash
| 10356 large integers. Rate 54 per ms
| Added to growing bag of 57 per ms. "Why so low?"
| Matching add of 1235 per ms.
| Hash ranges: 10356 hash values ranging 0 to 16775947 with 0 repeats.
|
| LPI>>hash ^self
| 10356 large integers. Rate 326 per ms
| Added to growing bag of 693 per ms.
| Matching add of 617 per ms.
| Hash ranges: 10356 hash values ranging 536870912 to 1499916977 with 0
| repeats.
|
| Returning "self" is winner here, but now look at the first 10000
| values above maxValue:
|
| VW 7.4.1 default:
| 10000 large integers.
Rate 0 per ms
| Added to growing bag of 1 per ms.
| Matching add of 1 per ms.
| Hash ranges: 256 hash values ranging 8192 to 8447 with 9744 repeats.
|
| LPI>>hash ^(SmallInteger maxVal bitShift: -1) + (self lastDigit
| bitShift: 8)
| 10000 large integers. Rate 0 per ms
| Added to growing bag of 0 per ms.
| Matching add of 0 per ms.
| Hash ranges: 1 hash values ranging 268443647 to 268443647 with
| 9999 repeats.
|
| Martin's LPI>>hash
| 10000 large integers. Rate 531 per ms
| Added to growing bag of 1150 per ms.
| Matching add of 989 per ms.
| Hash ranges: 10000 hash values ranging 0 to 9999 with 0 repeats.
|
| LPI>>hash ^self
| 10000 large integers. Rate 329 per ms
| Added to growing bag of 672 per ms.
| Matching add of 645 per ms.
| Hash ranges: 10000 hash values ranging 536870912 to 536880911 with 0
| repeats.
|
| Martin's implementation performs best in the lower range of
| LargePositiveInteger. Returning the LPI itself performs best in a
| broader GS range of LPI.
|
| So far:
| Martin's implementation is best so far if there is a need to constrain
| the value returned from #hash. Returning the LPI itself is the best
| general approach if no reason can be found to constrain the value.
|
| It comes down to the questions:
| Is there a need to constrain the value returned from #hash?
| Are there even more efficient ways to return constrained values?
|
| Paul Baumann
| IntercontinentalExchange | ICE
| 2100 RiverEdge Pkwy | 5th Floor | Atlanta, GA 30328
| Tel: 770.738.2137 | Fax: 770.951.1307 | Cel: 505.780.1470
| [hidden email]
|
| -----Original Message-----
| From: [hidden email] [mailto:[hidden email]]
| Sent: Thursday, September 14, 2006 8:46 PM
| To: [hidden email]
| Cc: [hidden email]; Paul Baumann
| Subject: Re: [bug] Integer>>hash
|
| I looked at using self bitAnd: (SmallInteger maxVal bitShift: -1) +
| (self lastDigit bitShift: 8) as a hash function and it looks ok. here
| are some timings.
|
| hash is the current Integer>>hash.
| speedOldHash is the current Integer hash implemented as fast as I can
| see in vanilla Smalltalk; essentially digitLength is made a primitive
| and lastDigit is inlined.
|
| newHash is (SmallInteger maxVal bitShift: -1) + (self lastDigit
| bitShift: 8) with lastDigit inlined, i.e.
|     (self bitAnd: (SmallInteger maxVal bitShift: -1)) + ((self
| digitAt: self basicSize) bitShift: 8)
|
| [timing script and results, quoted in full earlier in the thread, snipped]
|
| The bottom line is that the new hash is two and a half times slower (5
| times slower than a sped-up implementation) but (albeit with a small
| sample size) it collapses the space of hash values by only 0.77%
| whereas the current hash function collapses it by 91.51%.
|
| Paul, what was your new hash function and what realistic measurements
| do you have? If yours differs markedly from mine could you compare
| the two in a realistic setting?
|
| I'll reply to your message tomorrow. The missus is banging on the
| ceiling; it's time for me to go upstairs and help with the kids :) |
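The difference between the mistyped expression and the one Eliot intended is easy to check outside Smalltalk. The sketch below is an illustrative Python translation (not VW code; `SMALL_MAX` stands in for VW's `SmallInteger maxVal` and `last_digit` mimics `LargePositiveInteger>>lastDigit`), and it reproduces the distinct-value counts from the measurements quoted above:

```python
# Illustrative Python translation of the two hash expressions under
# discussion -- not VW code.  SMALL_MAX mirrors SmallInteger maxVal.
SMALL_MAX = 536870911  # 2**29 - 1

def last_digit(n):
    """High-order byte of n, like LargePositiveInteger>>lastDigit."""
    return n >> (8 * ((n.bit_length() + 7) // 8 - 1))

def mistyped_hash(n):
    # The transcription error: the mask lands on maxVal, not on self,
    # so the first term is a constant and only lastDigit varies.
    return (SMALL_MAX >> 1) + (last_digit(n) << 8)

def corrected_hash(n):
    # ^(self bitAnd: (SmallInteger maxVal bitShift: -1)) + (self lastDigit bitShift: 8)
    return (n & (SMALL_MAX >> 1)) + (last_digit(n) << 8)

dense = range(SMALL_MAX + 1, SMALL_MAX + 10001)   # first 10000 LPIs
sparse = range(SMALL_MAX + 1, 1500000000, 93003)  # Paul's dispersed oop range

print(len({mistyped_hash(n) for n in dense}))     # 1     -- total collapse
print(len({corrected_hash(n) for n in dense}))    # 10000 -- no repeats
print(len({mistyped_hash(n) for n in sparse}))    # 58    -- one per high byte
print(len({corrected_hash(n) for n in sparse}))   # 10356 -- no repeats
```

The counts line up with Paul's figures: the mistyped form yields 58 hash values with 10298 repeats over the dispersed range (and a single value over the dense range), while the corrected form yields all 10356 and 10000 values with 0 repeats.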
In reply to this post by Paul Baumann
I've figured out why Martin's hash did only 54/ms on one test but was
fastest with 531/ms on the other. The hash computes a sequential value range--it is like a division remainder of SmallInteger maxVal. The bag became inefficient when it grew beyond the hash value range of 9999 (for those integers) to accommodate 10356 keys. Eliot's hash has a widely varying distribution and that makes for a better hash value. Eliot's hash may not be better than the full precision and efficiency of returning "self", but it keeps values in the range of SmallInteger just in case that is somehow a requirement.

Thanks, guys. I think it has been narrowed down well enough. Eliot, please release the one you prefer. I'm happy this will be fixed.

Paul Baumann
IntercontinentalExchange | ICE |
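Paul's diagnosis can be seen in miniature: Martin's hash is just the integer's low-order 24 bits, so a run of consecutive large integers hashes to a dense, strictly sequential block. A rough Python sketch (an assumed translation of the Smalltalk above, not VW code):

```python
# Rough Python sketch of Martin's proposed LargeInteger>>hash:
# the low-order 24 bits of the receiver.
SMALL_MAX = 536870911  # VW SmallInteger maxVal, 2**29 - 1

def martin_hash(n):
    # equivalent to ((digitAt: 3) << 16) + ((digitAt: 2) << 8) + (digitAt: 1)
    return n & 0xFFFFFF

# The first 10000 integers above SmallInteger maxVal hash to the
# sequential block 0..9999 -- exactly the range Paul measured.
hashes = [martin_hash(n) for n in range(SMALL_MAX + 1, SMALL_MAX + 10001)]
print(min(hashes), max(hashes))  # 0 9999
```

Sequential hashes are perfectly distinct, but they crowd into one narrow band; Paul's numbers suggest the Bag degraded once it had to hold more keys than that band is wide.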
> Eliot's
> hash may not be better than the full precision and efficiency of
> returning "self", but it keeps values in the range of SmallInteger
> just
> in case that is somehow a requirement.

This requirement is there to make it easy to create hash functions for application classes. Since such a hash is typically built by sending #xor: to the hashes of several of the object's ivars, intermediate LargeIntegers would be created if any of those inner objects' hash implementations returned large integers. For easy creation of #hash implementations we should be able to trust that all other #hash implementations return SmallIntegers only.

R
- |
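Reinout's constraint can be illustrated with a toy composite hash (hypothetical code, not from VW's libraries): since xor never sets a bit above its operands' highest set bit, components that stay within the SmallInteger range keep the combined hash there too.

```python
# Hypothetical illustration: an application-class hash built by xor-ing
# component (ivar) hashes.  If every component stays <= SMALL_MAX, the
# combined hash does too -- no intermediate LargeInteger arithmetic.
SMALL_MAX = 536870911  # VW SmallInteger maxVal, 2**29 - 1

def combined_hash(*ivar_hashes):
    h = 0
    for ih in ivar_hashes:
        assert ih <= SMALL_MAX  # trust each component #hash to stay small
        h ^= ih                 # xor cannot exceed the larger operand's width
    return h

print(combined_hash(123456789, 87654321, 42) <= SMALL_MAX)  # True
```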
In reply to this post by eliot-2
I pulled in the attached patch -- should there not be an override of
"newHash" in SmallInteger (if anyone is going to use the message "newHash") -- so it can just return "self"?? (for speed).

[hidden email] wrote:
> I looked at using self bitAnd: (SmallInteger maxVal bitShift: -1) + (self lastDigit bitShift: 8) as a hash function and it looks ok. here are some timings.
> [timing script, results, and discussion quoted in full earlier in the thread snipped]
> I'll reply to your message tomorrow. The missus is banging on the ceiling; it's time for me to go upstairs and help with the kids :)
> ---
> Eliot Miranda ,,,^..^,,, mailto:[hidden email]
> VisualWorks Engineering, Cincom Smalltalk: scene not herd Tel +1 408 216 4581
> 3350 Scott Blvd, Bldg 36 Suite B, Santa Clara, CA 95054 USA Fax +1 408 216 4500

-- Dennis Smith [hidden email] Cherniak Software Development Corporation +1 905.771.7011 400-10 Commerce Valley Dr E Fax: +1 905.771.6288 Thornhill, ON Canada L3T 7N7 http://www.CherniakSoftware.com |
In reply to this post by Paul Baumann
Dennis Smith <[hidden email]> wrote:
| I pulled in the attached patch -- should there not be an override of
| "newHash"
| in SmallInteger (if anyone is going to use the message "newHash") -- so
| it can just return "self"?? (for speed).

No, Dennis. newHash is simply for testing purposes. One would use newHash to compare the performance of the new algorithm with the current one. Once convinced one wanted to migrate, one would rename newHash to hash, replacing the old definition. You'll also need to rehash the system. So you could use the attached, which is an as-yet-unreviewed patch and hence unofficial and not guaranteed to get into the product at all. Use at your own risk...

| [hidden email] wrote:
| > I looked at using self bitAnd: (SmallInteger maxVal bitShift: -1) + (self lastDigit bitShift: 8) as a hash function and it looks ok. here are some timings.
| > [timing script, results, and discussion quoted in full earlier in the thread snipped]
| -- 
| Dennis Smith [hidden email]
| Cherniak Software Development Corporation +1 905.771.7011
| 400-10 Commerce Valley Dr E Fax: +1 905.771.6288
| Thornhill, ON Canada L3T 7N7 http://www.CherniakSoftware.com

---
Eliot Miranda ,,,^..^,,, mailto:[hidden email]
VisualWorks Engineering, Cincom Smalltalk: scene not herd Tel +1 408 216 4581
3350 Scott Blvd, Bldg 36 Suite B, Santa Clara, CA 95054 USA Fax +1 408 216 4500

../codeFixes/51203/51203.st (1K) Download Attachment |
OK -- at my own risk -- I will use it now just for gemstone, and under a
different name :)

[hidden email] wrote:
> Dennis Smith <[hidden email]> wrote:
>
> | I pulled in the attached patch -- should there not be an override of
> | "newHash"
> | in SmallInteger (if anyone is going to use the message "newHash") -- so
> | it can just return "self"?? (for speed).
>
> No, Dennis. newHash is simply for testing purposes. [...] So you could
> use the attached, which is an as-yet-unreviewed patch and hence
> unofficial and not guaranteed to get into the product at all. Use at
> your own risk...
> [remainder of quoted message snipped]

-- Dennis Smith [hidden email] Cherniak Software Development Corporation +1 905.771.7011 400-10 Commerce Valley Dr E Fax: +1 905.771.6288 Thornhill, ON Canada L3T 7N7 http://www.CherniakSoftware.com |
Dennis Smith wrote:
> OK -- at my own risk -- I will use it now just for gemstone, and under a > different name :) Hi Dennis, I don't recall which version of GBS you're currently using, but as I commented recently on the GemStone customer forum, GBS 7.0 and later no longer use large integers in the caches, so the performance of the hash function should no longer make much (if any) difference. -Martin |
I read that -- and it did not register as to what it really meant.
I will pull the latest -- we are using 6.something. Thanks, Martin.

Martin McClure wrote:
> Dennis Smith wrote:
>> OK -- at my own risk -- I will use it now just for gemstone, and
>> under a different name :)
>
> Hi Dennis,
>
> I don't recall which version of GBS you're currently using, but as I
> commented recently on the GemStone customer forum, GBS 7.0 and later
> no longer use large integers in the caches, so the performance of the
> hash function should no longer make much (if any) difference.
>
> -Martin

-- Dennis Smith [hidden email] Cherniak Software Development Corporation +1 905.771.7011 400-10 Commerce Valley Dr E Fax: +1 905.771.6288 Thornhill, ON Canada L3T 7N7 http://www.CherniakSoftware.com |