[bug] Integer>>hash


[bug] Integer>>hash

Paul Baumann
Why limit the size of hash values? And, why use a bitshift of only the last digit of a large integer?
 
Integer>>hash
 ^(self lastDigit bitShift: 8) + (self digitAt: 1)
 
SmallInteger>>hash
 ^self
 
We just fixed a problem whereby integers used for oops crossed into VW large integer range (>536870911). The VW hash algorithm for large integers became an issue for GBS cache dictionaries and caused serious performance degradation. GBS code can easily be fixed to work around the problem by avoiding the #hash send for the GsToSt cache dictionary; however, the VW implementation of Integer>>hash just seems wrong.
 
I've never seen a hash value used in a way that requires it to be limited in size. Usually the remainder of a division of the hash value is used for collection indexing, so a large number isn't a problem and actually helps provide better distribution. Why not define Integer>>hash to just return "self", as SmallInteger does now? It doesn't make sense to impose size limits on #hash other than constraining it to be a positive number--which is not the case now.
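For illustration, here is a quick Python model of the quoted Smalltalk (the helper name is mine; digitAt: is modeled as indexing the magnitude's little-endian bytes). It shows both points: the current hash looks only at the first and last bytes, so distinct integers sharing those bytes collide, and a large hash value is no obstacle to the usual remainder-based indexing:

```python
def vw_integer_hash(n):
    """Model of VW Integer>>hash:
    ^(self lastDigit bitShift: 8) + (self digitAt: 1)"""
    digits = n.to_bytes((n.bit_length() + 7) // 8, "little")
    return (digits[-1] << 8) + digits[0]

# Two distinct large integers with the same first and last byte:
a = (1 << 56) + 1
b = (1 << 56) + (1 << 16) + 1
assert a != b and vw_integer_hash(a) == vw_integer_hash(b) == 257

# Collections index by remainder of the hash, so an unbounded
# hash value still works fine for bucket selection:
table_size = 4096
assert a % table_size == 1
```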
 
Paul Baumann 
IntercontinentalExchange | ICE

 

 

 


This message may contain confidential information and is intended for specific recipients unless explicitly noted otherwise. If you have reason to believe you are not an intended recipient of this message, please delete it and notify the sender. This message may not represent the opinion of IntercontinentalExchange, Inc. (ICE), its subsidiaries or affiliates, and does not constitute a contract or guarantee. Unencrypted electronic mail is not secure and the recipient of this message is expected to provide safeguards from viruses and pursue alternate means of communication where privacy or a binding message is desired.


Re: [bug] Integer>>hash

Martin McClure
Paul Baumann wrote:
> Why limit the size of hash values? And, why use a bitshift of only the
> last digit of a large integer?
>  
> Integer>>hash
>  ^(self lastDigit bitShift: 8) + (self digitAt: 1)
>  
> SmallInteger>>hash
>  ^self

I'm kind of in favor of keeping hashes within the SmallInteger range,
primarily for performance reasons, but I agree that the existing
Integer>>hash implementation is an especially bad one. You can easily
have a very large collection of LargeIntegers that all have the same
high-order byte, so only 256 hash values.

It seems like a much better hash that stays within the SmallInteger
bounds would be to use the low-order 24 bits of the LargeInteger:

LargeInteger>>hash
^((self digitAt: 3) bitShift: 16)
    + ((self digitAt: 2) bitShift: 8)
    + (self digitAt: 1)

All intermediate values are SmallIntegers, digitAt: is a primitive, and
bitShift: for SmallIntegers is pretty fast, so this should perform
pretty well (I haven't timed it).

This hash would not give a good distribution if you happened to have a
bunch of LargeIntegers whose values differed by exact multiples of 2^24.
This should be a much rarer case than the cases in which the current
hash fails, but if you really want to guard against it you could xor the
higher-order bytes into the hash.
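A Python model of this (function names are mine; digitAt: modeled as little-endian bytes, missing digits treated as zero) demonstrates both the proposal and the 2^24 caveat, along with the suggested xor of the higher-order bytes:

```python
def low24_hash(n):
    """Model of the proposed LargeInteger>>hash: the low-order 24 bits,
    i.e. (digit3 << 16) + (digit2 << 8) + digit1."""
    return n & 0xFFFFFF

def low24_hash_folded(n):
    """Variant that also xors the higher-order bytes into the hash."""
    h = n & 0xFFFFFF
    n >>= 24
    while n:
        h ^= n & 0xFFFFFF
        n >>= 24
    return h

a = 1 << 30
b = a + (5 << 24)  # differs from a by an exact multiple of 2^24
assert low24_hash(a) == low24_hash(b)                # collision
assert low24_hash_folded(a) != low24_hash_folded(b)  # separated
```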

Regards,

-Martin


RE: [bug] Integer>>hash

Paul Baumann
In reply to this post by Paul Baumann
Hi Martin,

Constraining #hash to return [unsigned] small integers seems OK. That is
the best option if there is a performance hit when computing the modulo
of VERY large integers (per Robert's response), and if the cost of
computing a small integer from a large integer is small. I haven't
measured performance to see if such constraints are of any likely
benefit, but it seems possible.

I hope the hash range will grow in future releases along with the range
of SmallInteger. If it doesn't then this issue will resurface in
terabyte images fifteen years from now. :)

Is anyone from Cincom following this discussion? Will you fix this?

Paul Baumann
IntercontinentalExchange | ICE
2100 RiverEdge Pkwy | 5th Floor | Atlanta, GA 30328
Tel: 770.738.2137 | Fax: 770.951.1307 | Cel: 505.780.1470
[hidden email]

24-hour ice helpdesk 770.738.2101
www.theice.com



Re: [bug] Integer>>hash

Martin McClure
Paul Baumann wrote:

>
> I hope the hash range will grow in future releases along with the range
> of SmallInteger. If it doesn't then this issue will resurface in
> terabyte images fifteen years from now. :)

Yes. It would be nice to have LargeInteger hashes that are 7 bytes long
in the 64-bit version. This might require some special thought when
converting 32-bit images to 64-bit ones. I haven't looked at the current
Integer conversion strategy -- does it make "small large integers" out
of the existing large integers in the image that could be SmallIntegers?

-Martin


Re: [bug] Integer>>hash

eliot-2
In reply to this post by Paul Baumann
Martin McClure <[hidden email]> wrote:

| Paul Baumann wrote:

| >
| > I hope the hash range will grow in future releases along with the range
| > of SmallInteger. If it doesn't then this issue will resurface in
| > terabyte images fifteen years from now. :)

| Yes. It would be nice to have LargeInteger hashes that are 7 bytes long
| in the 64-bit version. This might require some special thought when
| converting 32-bit images to 64-bit ones.

I *think* the current strategy "just works" for vanilla images.  See below.

| I haven't looked at the current
| Integer conversion strategy -- does it make "small large integers" out
| of the existing large integers in the image that could be SmallIntegers?

Yes.  The image conversion maps to SmallIntegers all large positive and negative integers that are representable by the 64-bit system's 61-bit SmallIntegers, and does an analogous thing for Doubles that are representable as SmallDoubles (0 and the doubles with exponents in the range 894 through 1152, or approx +/- 5.9e-38 to +/- 6.8e+38).

As far as hashes go, IMO (there is some disagreement here) the system must rehash every hashed collection on start-up when bringing up an image freshly converted from 32 bits to 64 bits.  The conversion from 32 to 64 bits necessarily changes identity hashes because

a) the implementation of class references changes (the 64-bit system has relatively compact object headers that contain 20-bit class indices, allowing for only 2^20 classes).  A class's identity hash is its class index.

b) the identity hashes of 61-bit SmallIntegers and SmallDoubles differ unavoidably from those of their non-immediate LargeInteger and Double duals in the 32-bit world.

c) it is generally desirable to have more identity hash values if possible.  The 64-bit system supports 20-bit id hashes, up from 14 bits in the 32-bit system.  New, better-distributed id hashes are given to objects during conversion.

If identity hashes change then all hashed collections must be rehashed, because the default implementations of #hash and #= in Object are in terms of #identityHash and #==.

But in general this rehashing cannot be done by the image converter since hash functions are user-defined and don't necessarily default to identityHash.  The converter can't compute the values of user-definable hash functions without implementing a full Smalltalk evaluator, which I consider unnecessary duplication.

So the converter rehashes those collections it must for the system to boot, i.e. general instances of MethodDictionary and HandleRegistry.  Very early in system startup the image checks ObjectMemory maximumIdentityHashValue (see ObjectMemory class>>#hashInstall) and rehashes all hashable general instances of Collection if it has changed.

So as far as I can see (not that far usually) there's no problem having different implementations of Integer>>hash in the 32-bit and 64-bit system as far as basic Smalltalk execution goes.  How about something like

Integer methods for comparing
hash
        ^(self bitAnd: (SmallInteger maxVal bitShift: -1)) + ((self digitAt: self digitLength) bitShift: 8)

where LargeInteger digitLength is reimplemented as
Integer methods for accessing
digitLength
        "Answer the number of bytes needed to represent the absolute value
        of the receiver."
        "Class is byte-indexed, so just answer the size."
        <primitive: 62>
        ^self basicSize
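As a sanity check, here is a Python model of this proposal (my own helper name; 536870911 is the 32-bit VW SmallInteger maxVal mentioned earlier in the thread). The result always stays in SmallInteger range: the bitAnd: keeps the low part at or below maxVal // 2, and the shifted top byte adds at most 255 * 256:

```python
SMALLINT_MAXVAL = 536870911  # 32-bit VW SmallInteger maxVal, per the thread

def proposed_hash(n):
    """Model of the proposed Integer>>hash:
    (self bitAnd: (SmallInteger maxVal bitShift: -1))
        + ((self digitAt: self digitLength) bitShift: 8)"""
    top_byte = n.to_bytes((n.bit_length() + 7) // 8, "little")[-1]
    return (n & (SMALLINT_MAXVAL >> 1)) + (top_byte << 8)

# Bounded by maxVal//2 + 255*256, well inside SmallInteger range:
for n in (10**9, 10**30, (1 << 200) - 1):
    assert 0 <= proposed_hash(n) <= SMALLINT_MAXVAL
```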

---
Eliot Miranda                 ,,,^..^,,,                mailto:[hidden email]
VisualWorks Engineering, Cincom  Smalltalk: scene not herd  Tel +1 408 216 4581
3350 Scott Blvd, Bldg 36 Suite B, Santa Clara, CA 95054 USA Fax +1 408 216 4500



RE: [bug] Integer>>hash

Paul Baumann
In reply to this post by Paul Baumann
Hi Eliot,

Wow. It is wonderful to see a discussion of increasing the maximum
identityHash. The range in 32-bit VW has been a pain. With 64-bit VW, it
is unlikely that GS oop values would ever exceed the new range of
SmallInteger maxVal. 64-bit will make problems like this simply go away.


You could maintain different implementations of Integer>>hash in 32-bit
and 64-bit, but I wonder if it is really necessary. I mean, is it really
likely that a 32-bit application would have a hash-based collection
containing LPI instances that greatly exceed SmallInteger maxVal? Is a
modulo of those values really inefficient enough that Cincom would have
to maintain two versions of Integer>>hash? I'm OK with returning a LPI
hash within the range of SmallInteger, but it still doesn't seem worth
the effort. Just return "self" for all integers for both 32-bit and
64-bit.

Consider these measurements:

| ints |
ints := (1 to: 170) collect: [:e | 10 raisedTo: e].
Core.Time millisecondsToRun: [10000 timesRepeat: [ints asSet]].
 1584
 2689

Set is a hashing collection. The second result shown (2689 ms) was with
the current Integer>>hash behavior. The first result shown (1584 ms) was
with LargePositiveInteger>>hash added to return "self". If there is a
cost to doing the modulo of large integers then it certainly didn't show
up in these results.

But hey, I'm happy with any fix that gets a better hash value range for
integers in the upper range of possible GS oop values.

Regards,

Paul Baumann




Re: [bug] Integer>>hash

eliot-2
In reply to this post by Paul Baumann
I looked at using (self bitAnd: (SmallInteger maxVal bitShift: -1)) + (self lastDigit bitShift: 8) as a hash function and it looks OK.  Here are some timings.

hash is the current Integer>>hash.

speedOldHash is the current Integer hash implemented as fast as I can in vanilla Smalltalk; essentially digitLength is made a primitive and lastDigit is inlined.

newHash is (self bitAnd: (SmallInteger maxVal bitShift: -1)) + (self lastDigit bitShift: 8) with lastDigit inlined, i.e.
        (self bitAnd: (SmallInteger maxVal bitShift: -1)) + ((self digitAt: self basicSize) bitShift: 8)

The timing script collects all large integers, sees how many different values are produced and times the various hash functions.  A send of #class is used as the overhead measurement case and its time is subtracted when ratios are computed.

(| insts r times |
insts := LargeInteger allGeneralInstances asArray.
times := Dictionary with: #class -> 0 with: #hash -> 0 with: #speedOldHash -> 0 with: #newHash -> 0.
r := 1000.
(1 to: 10) do:
        [:n|
        times at: #class put: (times at: #class) + (Time microsecondsToRun: [r timesRepeat: [1 to: insts size do: [:i| (insts at: i) class]]]).
        times at: #hash put: (times at: #hash) + (Time microsecondsToRun: [r timesRepeat: [1 to: insts size do: [:i| (insts at: i) hash]]]).
        times at: #speedOldHash put: (times at: #speedOldHash) + (Time microsecondsToRun: [r timesRepeat: [1 to: insts size do: [:i| (insts at: i) speedOldHash]]]).
        times at: #newHash put: (times at: #newHash) + (Time microsecondsToRun: [r timesRepeat: [1 to: insts size do: [:i| (insts at: i) newHash]]])].
insts size
-> times
-> (#(hash speedOldHash newHash) collect: [:sym| (times at: sym) asDouble - (times at: #class) / ((times at: #hash) - (times at: #class))])
-> (insts collect: [:li| li hash]) asSet size
-> (insts collect: [:li| li speedOldHash]) asSet size
-> (insts collect: [:li| li newHash]) asSet size
-> ((insts collect: [:li| li newHash]) asSet size asDouble / (insts collect: [:li| li hash]) asSet size))

The results in my small image on our mondo Opteron server are:

14164
->Dictionary (#class->2880200 #speedOldHash->7130494 #hash->11650808 #newHash->25181745 )
->#(1.0d 0.48460654039036d 2.5427592933124d)
->1203
->1203
->14055
->11.683291770574d

i.e. the base case of looping sending #class takes 2.88 seconds.  The original hash takes 11.65 seconds, the new hash 25.18 seconds.  The original hash produces only 1203 values from 14164 integers whereas the new hash provides 14055.

The bottom line is that the new hash is two and a half times slower (5 times slower than a sped-up implementation of the old one) but (albeit with a small sample size) it collapses the space of hash values by only (14164-14055)/14164 = 0.77%, whereas the current hash function collapses it by (14164-1203)/14164 = 91.51%.
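The reported ratios and collapse percentages can be rechecked directly from the timings above (a quick Python sanity check):

```python
# Times in microseconds, from the results above.
times = {"class": 2880200, "speedOldHash": 7130494,
         "hash": 11650808, "newHash": 25181745}
base = times["class"]        # loop/send overhead to subtract out
den = times["hash"] - base   # current hash is the 1.0 baseline

ratio = lambda sym: (times[sym] - base) / den
assert abs(ratio("speedOldHash") - 0.4846) < 1e-3
assert abs(ratio("newHash") - 2.5428) < 1e-3

# Fraction of the 14164 distinct integers that lose a distinct hash:
assert round((14164 - 1203) / 14164 * 100, 2) == 91.51   # current hash
assert round((14164 - 14055) / 14164 * 100, 2) == 0.77   # new hash
```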

Paul, what was your new hash function and what realistic measurements do you have?  If yours differs markedly from mine could you compare the two in a realistic setting?

I'll reply to your message tomorrow.  The missus is banging on the ceiling; it's time for me to go upstairs and help with the kids :)
---
Eliot Miranda                 ,,,^..^,,,                mailto:[hidden email]
VisualWorks Engineering, Cincom  Smalltalk: scene not herd  Tel +1 408 216 4581
3350 Scott Blvd, Bldg 36 Suite B, Santa Clara, CA 95054 USA Fax +1 408 216 4500


Attachment: Integer-newHash.st (884 bytes)

RE: [bug] Integer>>hash

Paul Baumann
In reply to this post by Paul Baumann
Hi Eliot,

The "newHash" had a low distribution in a dispersed GS oop range to
1.5M:

VW 7.4.1 default:
10356 large integers. Rate 37 per ms
        Added to growing bag of 39 per ms.
        Matching add of 931 per ms.
        Hash ranges: 10356 hash values ranging 8192 to 23038 with 0
repeats.

LPI>>hash ^(SmallInteger maxVal bitShift: -1) + (self lastDigit
bitShift: 8)
10356 large integers. Rate 19 per ms
        Added to growing bag of 32 per ms.
        Matching add of 49 per ms.
        Hash ranges: 58 hash values ranging 268443647 to 268458239 with
10298 repeats.

Martin's LPI>>hash
10356 large integers. Rate 54 per ms
        Added to growing bag of 57 per ms. "Why so low?"
        Matching add of 1235 per ms.
        Hash ranges: 10356 hash values ranging 0 to 16775947 with 0
repeats.

LPI>>hash ^self
10356 large integers. Rate 326 per ms
        Added to growing bag of 693 per ms.
        Matching add of 617 per ms.
        Hash ranges: 10356 hash values ranging 536870912 to 1499916977
with 0 repeats.

Returning "self" is the winner here, but now look at the first 10000 values
above SmallInteger maxVal:

VW 7.4.1 default:
10000 large integers. Rate 0 per ms
        Added to growing bag of 1 per ms.
        Matching add of 1 per ms.
        Hash ranges: 256 hash values ranging 8192 to 8447 with 9744
repeats.

LPI>>hash ^(SmallInteger maxVal bitShift: -1) + (self lastDigit
bitShift: 8)
10000 large integers. Rate 0 per ms
        Added to growing bag of 0 per ms.
        Matching add of 0 per ms.
        Hash ranges: 1 hash values ranging 268443647 to 268443647 with
9999 repeats.

Martin's LPI>>hash
10000 large integers. Rate 531 per ms
        Added to growing bag of 1150 per ms.
        Matching add of 989 per ms.
        Hash ranges: 10000 hash values ranging 0 to 9999 with 0 repeats.

LPI>>hash ^self
10000 large integers. Rate 329 per ms
        Added to growing bag of 672 per ms.
        Matching add of 645 per ms.
        Hash ranges: 10000 hash values ranging 536870912 to 536880911
with 0 repeats.

Martin's implementation performs best in the lower range of
LargePositiveInteger. Returning the LPI itself performs best in a
broader GS range of LPI.

So far:
Martin's implementation is best if there is a need to constrain the
value returned from #hash. Returning the LPI itself is the best general
approach if no reason can be found to constrain the value.

It comes down to the questions:
        Is there a need to constrain the value returned from #hash?
        Are there even more efficient ways to return constrained values?

Paul Baumann
IntercontinentalExchange | ICE
2100 RiverEdge Pkwy | 5th Floor | Atlanta, GA 30328
Tel: 770.738.2137 | Fax: 770.951.1307 | Cel: 505.780.1470
[hidden email]


Attachment: hashPerformanceMetrics.st (1K)

Re: [bug] Integer>>hash

eliot-2
In reply to this post by Paul Baumann
"Paul Baumann" <[hidden email]> wrote:
| Hi Eliot,

| The "newHash" had a low distribution in a dispersed GS oop range to
| 1.5M:

Paul,

        you made a booboo.  My hash isn't
                ^(SmallInteger maxVal bitShift: -1) + (self lastDigit bitShift: 8)
        it's
                ^(self bitAnd: (SmallInteger maxVal bitShift: -1)) + (self lastDigit bitShift: 8)

i.e. bitAnd by one bit less than the maximum positive SmallInteger to avoid overflow on the subsequent add.  Add in the most significant byte to spread the hashes more.  Can you try again with the correct definition?
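The overflow rationale can be checked with a small Python model (my own helper name; 536870911 being the 32-bit SmallInteger maxVal from the thread): masking with maxVal itself can let the subsequent add exceed maxVal, while masking with maxVal bitShift: -1 cannot:

```python
MAXVAL = 536870911  # 32-bit VW SmallInteger maxVal, per the thread

def masked_hash(mask, n):
    """(n bitAnd: mask) + (lastDigit bitShift: 8), bytes little-endian."""
    last_digit = n.to_bytes((n.bit_length() + 7) // 8, "little")[-1]
    return (n & mask) + (last_digit << 8)

n = (255 << 56) + MAXVAL  # low bits all ones, most significant byte 255
assert masked_hash(MAXVAL, n) > MAXVAL        # would overflow SmallInteger
assert masked_hash(MAXVAL >> 1, n) <= MAXVAL  # stays within range
```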

TIA

| VW 7.4.1 default:
| 10356 large integers. Rate 37 per ms
| Added to growing bag of 39 per ms.
| Matching add of 931 per ms.
| Hash ranges: 10356 hash values ranging 8192 to 23038 with 0
| repeats.

| LPI>>hash ^(SmallInteger maxVal bitShift: -1) + (self lastDigit
| bitShift: 8)
| 10356 large integers. Rate 19 per ms
| Added to growing bag of 32 per ms.
| Matching add of 49 per ms.
| Hash ranges: 58 hash values ranging 268443647 to 268458239 with
| 10298 repeats.

| Martin's LPI>>hash
| 10356 large integers. Rate 54 per ms
| Added to growing bag of 57 per ms. "Why so low?"
| Matching add of 1235 per ms.
| Hash ranges: 10356 hash values ranging 0 to 16775947 with 0
| repeats.

| LPI>>hash ^self
| 10356 large integers. Rate 326 per ms
| Added to growing bag of 693 per ms.
| Matching add of 617 per ms.
| Hash ranges: 10356 hash values ranging 536870912 to 1499916977
| with 0 repeats.

| Returning "self" is winner here, but now look at the first 10000 values
| above maxValue:

| VW 7.4.1 default:
| 10000 large integers. Rate 0 per ms
| Added to growing bag of 1 per ms.
| Matching add of 1 per ms.
| Hash ranges: 256 hash values ranging 8192 to 8447 with 9744
| repeats.

| LPI>>hash ^(SmallInteger maxVal bitShift: -1) + (self lastDigit
| bitShift: 8)
| 10000 large integers. Rate 0 per ms
| Added to growing bag of 0 per ms.
| Matching add of 0 per ms.
| Hash ranges: 1 hash values ranging 268443647 to 268443647 with
| 9999 repeats.

| Martin's LPI>>hash
| 10000 large integers. Rate 531 per ms
| Added to growing bag of 1150 per ms.
| Matching add of 989 per ms.
| Hash ranges: 10000 hash values ranging 0 to 9999 with 0 repeats.

| LPI>>hash ^self
| 10000 large integers. Rate 329 per ms
| Added to growing bag of 672 per ms.
| Matching add of 645 per ms.
| Hash ranges: 10000 hash values ranging 536870912 to 536880911
| with 0 repeats.

| Martin's implementation performs best in the lower range of
| LargePositiveInteger. Returning the LPI itself performs best in a
| broader GS range of LPI.

| So far:
| Martin's implementation is best so far if there is a need to constrain
| the value returned from #hash. Returning the LPI itself is best general
| approach if no reason can be found to constrain the value.

| It comes down to the questions:
| Is there a need to constrain the value returned from #hash?
| Are there even more efficient ways to return constrained values?

| Paul Baumann
| IntercontinentalExchange | ICE
| 2100 RiverEdge Pkwy | 5th Floor | Atlanta, GA 30328
| Tel: 770.738.2137 | Fax: 770.951.1307 | Cel: 505.780.1470
| [hidden email]

| -----Original Message-----
| From: [hidden email] [mailto:[hidden email]]
| Sent: Thursday, September 14, 2006 8:46 PM
| To: [hidden email]
| Cc: [hidden email]; Paul Baumann
| Subject: Re: [bug] Integer>>hash

| I looked at using self bitAnd: (SmallInteger maxVal bitShift: -1) +
| (self lastDigit bitShift: 8) as a hash function and it looks ok.  here
| are some timings.

| hash is the current nteger>>hash.

| speedOldHash is the current Integer has implemented as fast as I can see
| in vanilla Smalltalk; essentially digitlength is made a primitive and
| lastDigit is inlined.

| newHash is (SmallInteger maxVal bitShift: -1) + (self lastDigit
| bitShift: 8) with lastDigit inlined, i.e.
| (self bitAnd: (SmallInteger maxVal bitShift: -1)) + ((self
| digitAt: self basicSize) bitShift: 8)

| The timing script collects all large integers, sees how many different
| values are produced and times the various hash functions.  A send of
| #class is used as the overhead measurement case and its time is subtracted
| when ratios are computed.

| (| insts r times |
| insts := LargeInteger allGeneralInstances asArray.
| times := Dictionary with: #class -> 0 with: #hash -> 0 with:
| #speedOldHash -> 0 with: #newHash -> 0.
| r := 1000.
| (1 to: 10) do:
| [:n|
| times at: #class put: (times at: #class) + (Time
| microsecondsToRun: [r timesRepeat: [1 to: insts size do: [:i| (insts at:
| i) class]]]).
| times at: #hash put: (times at: #hash) + (Time
| microsecondsToRun: [r timesRepeat: [1 to: insts size do: [:i| (insts at:
| i) hash]]]).
| times at: #speedOldHash put: (times at: #speedOldHash) + (Time
| microsecondsToRun: [r timesRepeat: [1 to: insts size do: [:i| (insts at:
| i) speedOldHash]]]).
| times at: #newHash put: (times at: #newHash) + (Time
| microsecondsToRun: [r timesRepeat: [1 to: insts size do: [:i| (insts at:
| i) newHash]]])].
| insts size
| -> times
| -> (#(hash speedOldHash newHash) collect: [:sym| (times at: sym) asDouble - (times at: #class) / ((times at: #hash) - (times at: #class))])
| -> (insts collect: [:li| li hash]) asSet size
| -> (insts collect: [:li| li speedOldHash]) asSet size
| -> (insts collect: [:li| li newHash]) asSet size
| -> ((insts collect: [:li| li newHash]) asSet size asDouble / (insts collect: [:li| li hash]) asSet size))

| The results in my image on our mondo Opteron server are:

| 14164
| ->Dictionary (#class->2880200 #speedOldHash->7130494 #hash->11650808 #newHash->25181745 )
| ->#(1.0d 0.48460654039036d 2.5427592933124d)
| ->1203
| ->1203
| ->14055
| ->11.683291770574d

| i.e. the base case of looping sending #class takes 2.88 seconds.  The
| original hash takes 11.65 seconds, the new hash 25.18 seconds.  The
| original hash produces only 1203 values from 14164 integers whereas the
| new hash provides 14055.

| The bottom line is that the new hash is two and a half times slower (5
| times slower than a sped-up implementation) but (albeit with a small
| sample size) it collapses the space of hash values by
| (14164-14055)/141.64 = 0.77% whereas the current hash function collapses
| it by (14164-1203)/141.64 = 91.51%.

| Paul, what was your new hash function and what realistic measurements do
| you have?  If yours differs markedly from mine could you compare the two
| in a realistic setting?

| I'll reply to your message tomorrow.  The missus is banging on the
| ceiling; it's time for me to go upstairs and help with the kids :)
| ---
| Eliot Miranda                 ,,,^..^,,,
| mailto:[hidden email]
| VisualWorks Engineering, Cincom  Smalltalk: scene not herd  Tel +1 408
| 216 4581 3350 Scott Blvd, Bldg 36 Suite B, Santa Clara, CA 95054 USA Fax
| +1 408 216 4500
|
|
|
| --------------------------------------------------------
| This message may contain confidential information and is intended for specific recipients unless explicitly noted otherwise. If you have reason to believe you are not an intended recipient of this message, please delete it and notify the sender. This message may not represent the opinion of IntercontinentalExchange, Inc. (ICE), its subsidiaries or affiliates, and does not constitute a contract or guarantee. Unencrypted electronic mail is not secure and the recipient of this message is expected to provide safeguards from viruses and pursue alternate means of communication where privacy or a binding message is desired.
|

| -----
Eliot Miranda                 ,,,^..^,,,                mailto:[hidden email]
VisualWorks Engineering, Cincom  Smalltalk: scene not herd  Tel +1 408 216 4581
3350 Scott Blvd, Bldg 36 Suite B, Santa Clara, CA 95054 USA Fax +1 408 216 4500



RE: [bug] Integer>>hash

Paul Baumann
In reply to this post by Paul Baumann
Oops. I read "newHash is" and started copying code.

The distribution is good now. The new numbers are much better in the
range of LPI below 1.5M. In that range the performance was 270/ms
compared to 326/ms returning "self". In the 10K range above SmallInteger
the performance was 30/ms compared to Martin's 531/ms.

"ints := (Core.SmallInteger maxVal + 1 to: 1500000000 by: 93003)
asArray."

| LPI>>hash ^(SmallInteger maxVal bitShift: -1) + (self lastDigit
| bitShift: 8)
| 10356 large integers. Rate 19 per ms
| Added to growing bag of 32 per ms.
| Matching add of 49 per ms.
| Hash ranges: 58 hash values ranging 268443647 to 268458239 with
| 10298 repeats.
 
Eliot's LPI>>hash ^(self bitAnd: (SmallInteger maxVal bitShift: -1)) +
(self lastDigit bitShift: 8)
10356 large integers. Rate 270 per ms
        Added to growing bag of 538 per ms.
        Matching add of 542 per ms.
        Hash ranges: 10356 hash values ranging 8192 to 268418690 with 0
repeats.


"ints := (Core.SmallInteger maxVal + 1 to: Core.SmallInteger maxVal +
10000) asArray. "

| LPI>>hash ^(SmallInteger maxVal bitShift: -1) + (self lastDigit
| bitShift: 8)
| 10000 large integers. Rate 0 per ms
| Added to growing bag of 0 per ms.
| Matching add of 0 per ms.
| Hash ranges: 1 hash values ranging 268443647 to 268443647 with
| 9999 repeats.

Eliot's LPI>>hash ^(self bitAnd: (SmallInteger maxVal bitShift: -1)) +
(self lastDigit bitShift: 8)
10000 large integers. Rate 30 per ms
        Added to growing bag of 70 per ms.
        Matching add of 55 per ms.
        Hash ranges: 10000 hash values ranging 8192 to 18191 with 0
repeats.
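The collapse in the first result block and the spread in the second can be reproduced outside Smalltalk. Below is a small Python model of the two expressions (function names are mine; SmallInteger maxVal is taken as 536870911, i.e. 2**29 - 1, as the ranges quoted in this thread imply). It reproduces both the single repeated value 268443647 and the 8192..18191 spread reported above:

```python
SMALL_INT_MAX = 536870911  # SmallInteger maxVal on 32-bit VisualWorks (2**29 - 1)

def last_digit(n):
    """Most significant base-256 digit of n, i.e. Smalltalk's #lastDigit."""
    while n > 255:
        n >>= 8
    return n

def booboo_hash(n):
    # ^(SmallInteger maxVal bitShift: -1) + (self lastDigit bitShift: 8)
    # n contributes only its top byte, so nearby values collide massively.
    return (SMALL_INT_MAX >> 1) + (last_digit(n) << 8)

def eliot_hash(n):
    # ^(self bitAnd: (SmallInteger maxVal bitShift: -1)) + (self lastDigit bitShift: 8)
    return (n & (SMALL_INT_MAX >> 1)) + (last_digit(n) << 8)

ints = range(SMALL_INT_MAX + 1, SMALL_INT_MAX + 10001)  # first 10000 LPIs
print(len({booboo_hash(n) for n in ints}))  # 1 distinct value (268443647)
print(len({eliot_hash(n) for n in ints}))   # 10000 distinct values
print(min(eliot_hash(n) for n in ints), max(eliot_hash(n) for n in ints))  # 8192 18191
```

For these 10000 integers the masked term recovers the low 28 bits, which is what spreads the hashes; the unmasked version degenerates to a constant plus the top byte.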


Paul Baumann
IntercontinentalExchange | ICE
2100 RiverEdge Pkwy | 5th Floor | Atlanta, GA 30328
Tel: 770.738.2137 | Fax: 770.951.1307 | Cel: 505.780.1470
[hidden email]

24-hour ice helpdesk 770.738.2101
www.theice.com

-----Original Message-----
From: [hidden email] [mailto:[hidden email]]
Sent: Friday, September 15, 2006 12:48 PM
To: Paul Baumann
Cc: [hidden email]; [hidden email]
Subject: Re: [bug] Integer>>hash

"Paul Baumann" <[hidden email]> wrote:
| Hi Eliot,

| The "newHash" had a low distribution in a dispersed GS oop range to
| 1.5M:

Paul,

        you made a booboo.  My hash isn't
                ^(SmallInteger maxVal bitShift: -1) + (self lastDigit
bitShift: 8)
        it's
                ^(self bitAnd: (SmallInteger maxVal bitShift: -1)) +
(self lastDigit bitShift: 8)

i.e. bitAnd by one bit less than the maximum positive SmallInteger to
avoid overflow on the subsequent add.  Add in the most significant byte
to spread the hashes more.  Can you try again with the correct definition?

TIA
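The no-overflow claim checks out arithmetically. A quick Python sketch (assuming 536870911, i.e. 2**29 - 1, as the 32-bit SmallInteger maxVal, per the ranges in this thread): masking self to 28 bits and adding at most 255 shifted left 8 can never leave SmallInteger range:

```python
SMALL_INT_MAX = 536870911            # 2**29 - 1: SmallInteger maxVal, 32-bit VW

mask = SMALL_INT_MAX >> 1            # (SmallInteger maxVal bitShift: -1) = 2**28 - 1
worst_case = mask + (255 << 8)       # largest masked self + largest lastDigit term
print(worst_case)                    # 268500735
assert worst_case <= SMALL_INT_MAX   # the sum always stays a SmallInteger
```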

| VW 7.4.1 default:
| 10356 large integers. Rate 37 per ms
| Added to growing bag of 39 per ms.
| Matching add of 931 per ms.
| Hash ranges: 10356 hash values ranging 8192 to 23038 with 0
repeats.

| LPI>>hash ^(SmallInteger maxVal bitShift: -1) + (self lastDigit
| bitShift: 8)
| 10356 large integers. Rate 19 per ms
| Added to growing bag of 32 per ms.
| Matching add of 49 per ms.
| Hash ranges: 58 hash values ranging 268443647 to 268458239 with
| 10298 repeats.

| Martin's LPI>>hash
| 10356 large integers. Rate 54 per ms
| Added to growing bag of 57 per ms. "Why so low?"
| Matching add of 1235 per ms.
| Hash ranges: 10356 hash values ranging 0 to 16775947 with 0
repeats.

| LPI>>hash ^self
| 10356 large integers. Rate 326 per ms
| Added to growing bag of 693 per ms.
| Matching add of 617 per ms.
| Hash ranges: 10356 hash values ranging 536870912 to 1499916977
with 0
| repeats.

| Returning "self" is winner here, but now look at the first 10000
| values above maxValue:

| VW 7.4.1 default:
| 10000 large integers. Rate 0 per ms
| Added to growing bag of 1 per ms.
| Matching add of 1 per ms.
| Hash ranges: 256 hash values ranging 8192 to 8447 with 9744
repeats.

| LPI>>hash ^(SmallInteger maxVal bitShift: -1) + (self lastDigit
| bitShift: 8)
| 10000 large integers. Rate 0 per ms
| Added to growing bag of 0 per ms.
| Matching add of 0 per ms.
| Hash ranges: 1 hash values ranging 268443647 to 268443647 with
| 9999 repeats.

| Martin's LPI>>hash
| 10000 large integers. Rate 531 per ms
| Added to growing bag of 1150 per ms.
| Matching add of 989 per ms.
| Hash ranges: 10000 hash values ranging 0 to 9999 with 0 repeats.

| LPI>>hash ^self
| 10000 large integers. Rate 329 per ms
| Added to growing bag of 672 per ms.
| Matching add of 645 per ms.
| Hash ranges: 10000 hash values ranging 536870912 to 536880911
with 0
| repeats.

| Martin's implementation performs best in the lower range of
| LargePositiveInteger. Returning the LPI itself performs best in a
| broader GS range of LPI.

| So far:
| Martin's implementation is best so far if there is a need to constrain

| the value returned from #hash. Returning the LPI itself is best
| general approach if no reason can be found to constrain the value.

| It comes down to the questions:
| Is there a need to constrain the value returned from #hash?
| Are there even more efficient ways to return constrained values?

| Paul Baumann
| IntercontinentalExchange | ICE
| 2100 RiverEdge Pkwy | 5th Floor | Atlanta, GA 30328
| Tel: 770.738.2137 | Fax: 770.951.1307 | Cel: 505.780.1470
| [hidden email]



RE: [bug] Integer>>hash

Paul Baumann
In reply to this post by Paul Baumann
I've figured out why Martin's hash did only 54/ms on one test but was
fastest with 531/ms on the other. The hash computes a sequential value
range--it behaves like the remainder of division by SmallInteger maxVal.
The bag became inefficient when it grew beyond the hash value range of
9999 (for those integers) to accommodate 10356 keys. Eliot's hash has a
widely varying distribution and that makes for a better hash value.
Eliot's hash may not be better than the full precision and efficiency of
returning "self", but it keeps values in the range of SmallInteger just
in case that is somehow a requirement.

Thanks, guys. I think it has been narrowed down well enough. Eliot,
please release the one you prefer. I'm happy this will be fixed.

Paul Baumann
IntercontinentalExchange | ICE
 
 
 


Re: [bug] Integer>>hash

Reinout Heeck

>  Eliot's
> hash may not be better than the full precision and efficiency of
> returning "self", but it keeps values in the range of SmallInteger  
> just
> in case that is somehow a requirement.

This requirement is there to easily create hash functions for
application classes. Since this is typically done by using #xor: on the
hashes of several of the object's ivars, intermediate large integers
would be created if any of those inner objects' hash implementations
returned large integers.

For easy creation of #hash implementations we should be able to trust
that all other #hash implementations return small integers only.
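Reinout's point can be illustrated with a quick model (Python standing in for Smalltalk integers; 536870911 assumed as the 32-bit SmallInteger maxVal, per the ranges earlier in the thread; the composite-hash example is hypothetical):

```python
import random

# Application classes typically combine ivar hashes, e.g. (hypothetical):
#   hash
#       ^self name hash bitXor: self id hash
# bitXor: never sets a bit higher than its widest operand, so as long as
# every component #hash stays in SmallInteger range, the combined hash
# does too and no intermediate LargeInteger is allocated.
SMALL_INT_MAX = 536870911  # 2**29 - 1: SmallInteger maxVal on 32-bit VW

random.seed(42)
small_hashes = [random.randrange(SMALL_INT_MAX + 1) for _ in range(1000)]
combined = [a ^ b for a, b in zip(small_hashes, small_hashes[1:])]
assert all(0 <= c <= SMALL_INT_MAX for c in combined)  # still SmallIntegers

# One component overflowing (say, Integer>>hash answering self for an LPI)
# drags every combination it touches into large-integer range:
big_hash = SMALL_INT_MAX + 12345
assert all(big_hash ^ h > SMALL_INT_MAX for h in small_hashes)
print("xor of SmallInteger-range hashes stays in SmallInteger range")
```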



R
-


Re: [bug] Integer>>hash

Dennis smith-4
In reply to this post by eliot-2
I pulled in the attached patch -- should there not be an override of
"newHash" in SmallInteger (if anyone is going to use the message
"newHash") so it can just return "self", for speed?


--
Dennis Smith                        [hidden email]
Cherniak Software Development Corporation       +1 905.771.7011
400-10 Commerce Valley Dr E                Fax: +1 905.771.6288
Thornhill, ON Canada L3T 7N7    http://www.CherniakSoftware.com


Re: [bug] Integer>>hash

eliot-2
In reply to this post by Paul Baumann
Dennis Smith <[hidden email]> wrote:

| I pulled in the attached patch -- should there not be an override of
| "newHash"
| in SmallInteger (if anyone is going to use the message "newHash") -- so
| it can just return "self"?? (for speed).

No, Dennis.  newHash is simply for testing purposes.  One would use newHash to compare the performance of the new algorithm with the current one.  Once convinced one wanted to migrate, one would rename newHash to hash, replacing the old definition.  You'll also need to rehash the system.  So you could use the attached, which is an as-yet-unreviewed patch and hence unofficial and not guaranteed to get into the product at all.  Use at your own risk...


| --
| Dennis Smith                        [hidden email]
| Cherniak Software Development Corporation       +1 905.771.7011
| 400-10 Commerce Valley Dr E                Fax: +1 905.771.6288
| Thornhill, ON Canada L3T 7N7    http://www.CherniakSoftware.com
---
Eliot Miranda                 ,,,^..^,,,                mailto:[hidden email]
VisualWorks Engineering, Cincom  Smalltalk: scene not herd  Tel +1 408 216 4581
3350 Scott Blvd, Bldg 36 Suite B, Santa Clara, CA 95054 USA Fax +1 408 216 4500


../codeFixes/51203/51203.st (1K) Download Attachment

Re: [bug] Integer>>hash

Dennis smith-4
OK -- at my own risk -- I will use it now just for GemStone, and under a
different name :)


--
Dennis Smith                        [hidden email]
Cherniak Software Development Corporation       +1 905.771.7011
400-10 Commerce Valley Dr E                Fax: +1 905.771.6288
Thornhill, ON Canada L3T 7N7    http://www.CherniakSoftware.com


Re: [bug] Integer>>hash

Martin McClure
Dennis Smith wrote:
> OK -- at my own risk -- I will use it now just for gemstone, and under a
> different name :)

Hi Dennis,

I don't recall which version of GBS you're currently using, but as I
commented recently on the GemStone customer forum, GBS 7.0 and later no
longer use large integers in the caches, so the performance of the hash
function should no longer make much (if any) difference.

-Martin


Re: [bug] Integer>>hash

Dennis smith-4
I read that -- and it did not register as to what it really meant.
I will pull the latest -- we are using 6.something.

Thanks, Martin.

Martin McClure wrote:

> Dennis Smith wrote:
>> OK -- at my own risk -- I will use it now just for gemstone, and
>> under a different name :)
>
> Hi Dennis,
>
> I don't recall which version of GBS you're currently using, but as I
> commented recently on the GemStone customer forum, GBS 7.0 and later
> no longer use large integers in the caches, so the performance of the
> hash function should no longer make much (if any) difference.
>
> -Martin

--
Dennis Smith                        [hidden email]
Cherniak Software Development Corporation       +1 905.771.7011
400-10 Commerce Valley Dr E                Fax: +1 905.771.6288
Thornhill, ON Canada L3T 7N7    http://www.CherniakSoftware.com