32 vs 64 bits and large integer hash


32 vs 64 bits and large integer hash

Eliot Miranda-2
Hi All,

    right now we have the following definition of Large(Positive)Integer>>hash:

hash
	^ByteArray hashBytes: self startingWith: self species hash

which means that for all integers outside the 32-bit SmallInteger range (-2^30 to 2^30 - 1) but still within the 64-bit SmallInteger range, the 32-bit system and the 64-bit system answer different values for hash.

e.g. in 64 bits: (2 raisedTo: 30) hash  "==> 1073741824"
 but in 32 bits: (2 raisedTo: 30) hash  "==> 230045764"

This is unsatisfactory.  I propose changing Large(Positive)Integer>>hash to

hash
	^self digitLength <= 8
		ifTrue: [self]
		ifFalse: [ByteArray hashBytes: self startingWith: self species hash]
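
A quick workspace check of the proposal (a sketch, assuming the method above is installed; 2 raisedTo: 30 has a digitLength of 4, so the ifTrue: branch answers the integer itself on either word size):

(2 raisedTo: 30) hash   "==> 1073741824 in both 32- and 64-bit images"
(2 raisedTo: 100) hash  "digitLength 13, so both images fall back to the same byte hash"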


P.S. Note that this will not break Float hash, which is defined as

Float>>hash
	"Hash is reimplemented because = is implemented. Both words of the float are used. (The bitShift:'s ensure that the intermediate results do not become a large integer.) Care is taken to answer same hash as an equal Integer."

	(self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self truncated hash].
	^((self basicAt: 1) bitShift: -4) +
		((self basicAt: 2) bitShift: -4)

P.P.S. I *think* that "(self isFinite and: [self fractionPart = 0.0])" is equivalent to "self - self = self fractionPart" ;-)

_,,,^..^,,,_
best, Eliot



Re: 32 vs 64 bits and large integer hash

Nicolas Cellier
+1
And brilliant hack! Though the former code explains itself, the latter not so much without a comment.





Re: [Cuis-dev] 32 vs 64 bits and large integer hash

Juan Vuletich-4
In reply to this post by Eliot Miranda-2
Hi Eliot,

On Cuis I have recently been working on hash consistency for numeric types, making the same value (or different values that compare as equal due to rounding) answer the same hash regardless of class. But I had missed the SmallInteger / LargeInteger consistency across 32- and 64-bit images. Thanks! Fixes for Cuis are now in the GitHub repo. In particular, it is necessary for the following to answer the same values in 32 and 64 bits (it wasn't the case!):


largeInteger := (LargePositiveInteger new: 4)
            digitAt: 1 put: 1;
            digitAt: 2 put: 2;
            digitAt: 3 put: 3;
            digitAt: 4 put: 4;
            yourself.
smallInteger := largeInteger normalize.
float := smallInteger asFloat.
boxedFloat := BoxedFloat64 new basicAt: 1 put: (float basicAt: 1); basicAt: 2 put: (float basicAt: 2); yourself.
{largeInteger class. smallInteger class. float class. boxedFloat class.
largeInteger hash. smallInteger hash. float hash. boxedFloat hash } print.



largeInteger := (LargePositiveInteger new: 4)
            digitAt: 1 put: 1;
            digitAt: 2 put: 2;
            digitAt: 3 put: 3;
            digitAt: 4 put: 80;
            yourself.
smallIntIn64ButLargeIntIn32Bits := largeInteger normalize.
float := smallIntIn64ButLargeIntIn32Bits asFloat.
boxedFloat := BoxedFloat64 new basicAt: 1 put: (float basicAt: 1); basicAt: 2 put: (float basicAt: 2); yourself.
{largeInteger class. smallIntIn64ButLargeIntIn32Bits class. float class. boxedFloat class.
largeInteger hash. smallIntIn64ButLargeIntIn32Bits hash. float hash. boxedFloat hash } print.


Note that I also included consistency between un-normalized LargeIntegers in the SmallInteger range (just in case).



Cheers,
-- 
Juan Vuletich
www.cuis-smalltalk.org
https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev
https://github.com/jvuletich
https://www.linkedin.com/in/juan-vuletich-75611b3
@JuanVuletich



Re: 32 vs 64 bits and large integer hash

Levente Uzonyi
In reply to this post by Eliot Miranda-2
Why not do it the other way around and implement SmallInteger >> #hash as
a primitive with the following fallback code?

SmallInteger >> #hash

	<primitive: XXX>
	| remainder hash |
	self < 0
		ifTrue: [
			remainder := 0 - self.
			hash := LargeNegativeInteger hash ]
		ifFalse: [
			remainder := self.
			hash := LargePositiveInteger hash ].
	[ remainder > 0 ] whileTrue: [
		hash := (hash + (remainder bitAnd: 16rFF)) hashMultiply.
		remainder := remainder bitShift: -8 ].
	^hash

The only problem to solve is the calculation of the initial hash value. The VM has to know the initial hash value (e.g. it could be a constant based on the sign of the receiver), or it has to know how to calculate it (but that's currently done differently among forks), or the value has to be an argument of the primitive.

Levente

P.S.: This is another case where Behavior >> #hash bites Squeak and causes
additional slowdown.
P.P.S.: Float >> #hash should use #bitXor: and #hashMultiply instead of
#+ and #bitShift:



Re: 32 vs 64 bits and large integer hash

Tobias Pape


I'm a bit puzzled. I thought (small)Integers being their own hash was a good thing?

Best regards
        -Tobias




Re: [Pharo-dev] 32 vs 64 bits and large integer hash

Marcus Denker-4
In reply to this post by Eliot Miranda-2



Thanks, I added an issue tracker entry:

	https://pharo.fogbugz.com/f/cases/22690/32-vs-64-bits-and-large-integer-hash

I will do a pull request later (unless someone else is faster).

        Marcus


--
Marcus Denker  --  [hidden email]
http://www.zweidenker.de


Re: 32 vs 64 bits and large integer hash

Levente Uzonyi
In reply to this post by Tobias Pape
On Thu, 22 Nov 2018, Tobias Pape wrote:

> I'm a bit puzzled. I thought (small)Integers being their own hash was a good thing?

I would call it simple but not necessarily good.
The problem with it is that consecutive numbers generate long chains in
HashedCollections:

a := (1 to: 1000) asArray.
s := Set withAll: a.
[ 1 to: 1000000 do: [ :each | s includes: each ] ] timeToRun.
"==> 7014"

The solution in Squeak is to use PluggableSet instead of Set,
because it applies #hashMultiply on the hash value:

ps := PluggableSet integerSet.
ps addAll: a.
[ 1 to: 1000000 do: [ :each | ps includes: each ] ] timeToRun.
"==> 95"

IIRC in Pharo SmallInteger's hash is based on #hashMultiply to avoid the
long chains. That was probably the main reason for the push to make
#hashMultiply a numbered primitive.
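
For reference, the image-side definition of #hashMultiply in Squeak-lineage systems computes self * 1664525 mod 2^28 without overflowing into LargeIntegers, roughly as follows (quoted from memory, so check your image):

hashMultiply
	"Multiply the receiver by 1664525 mod 2^28, splitting the receiver into
	14-bit halves so that every intermediate result stays within SmallInteger
	range."
	| low |
	low := self bitAnd: 16383.
	^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low)
		bitAnd: 16383) * 16384))
		bitAnd: 16r0FFFFFFF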

Levente



Re: 32 vs 64 bits and large integer hash

Tobias Pape


Interesting!





Re: 32 vs 64 bits and large integer hash

Eliot Miranda-2
In reply to this post by Tobias Pape

> On Nov 21, 2018, at 11:48 PM, Tobias Pape <[hidden email]> wrote:
> I'm a bit puzzled. I thought (small)Integers being their own hash was a good thing?

+1



Re: 32 vs 64 bits and large integer hash

Chris Muller-3
In reply to this post by Tobias Pape
> >> I'm a bit puzzled. I thought (small)Integers being their own hash was a good thing?

I was wondering exactly the same thing!


Indeed! When making a #hash method, one always focuses on hash
distribution and finding the elements, but it's easy to forget about
the performance of NOT finding an element.


Re: 32 vs 64 bits and large integer hash

Luciano Notarfrancesco
On Fri, Nov 23, 2018 at 12:36 AM Chris Muller <[hidden email]> wrote:
> Indeed! When making a #hash method, one always focuses on hash
> distribution and finding the elements, but it's easy to forget about
> the performance of NOT finding an element.


Yes! I had this problem blow up in my face a couple of years ago, and Juan agreed to include the change in Cuis to make SmallIntegers NOT their own hash.

I think it is a good idea in general, when programming a hash method in Smalltalk, to make it somewhat random. It will never be perfect for every use case, and it doesn't need to be cryptographically secure, but there shouldn't be simple use cases (e.g., consecutive integers) that produce hashes that are not uniformly distributed. This can be achieved with minimal performance impact (and potentially big performance gains in hashed collections) quite simply by sending some hashMultiply message in your hash method.
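
As a minimal sketch of that advice, a hash method for a hypothetical point-like class with instance variables x and y (the class and variable names are illustrative, not from the thread) could read:

hash
	"Mix the part hashes with hashMultiply so that regular inputs (e.g. points
	on a grid) do not produce regularly spaced hash values."
	^(x hash hashMultiply bitXor: y hash) hashMultiply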

Regards,
Luciano




Re: 32 vs 64 bits and large integer hash

Eliot Miranda-2


On Sat, Nov 24, 2018 at 7:40 AM Luciano Notarfrancesco <[hidden email]> wrote:
> Yes! I had this problem blow up in my face a couple of years ago, and Juan
> agreed to include the change in Cuis to make SmallIntegers NOT their own
> hash.

This seems to be a basic error.  The idea of a hash function is to produce a well-distributed set of integers for some set of values.  Since the SmallIntegers are themselves perfectly distributed (each unique SmallInteger is a unique value), it is impossible to produce a better distributed hash function than the integers themselves.  For some applications it may indeed be possible to produce a better distributed set of hashes for the integers; for example, an application which considers only powers of two could use the log base 2 to produce a smaller and better distributed set of values modulo N than the integers themselves.  But in general the integers are definitionally well-distributed.  In fact, unless one has a perfect hash function, one is in danger of producing a less well-distributed set of values from a hash function than the SmallIntegers themselves.

This argument doesn't apply as integers grow beyond the SmallInteger range, but not because we want a better distribution of values than the large integers themselves; rather because we want to avoid large integer arithmetic.

> I think it is a good idea in general, when programming a hash method in
> Smalltalk, to make it somewhat random.

As I've indicated, I think this is impossible in general.  It is only in specific applications, using specific subsets of the SmallIntegers, that a better hash function could be derived, and it would be specific to that application.  For purposes such as these we have PluggableDictionary et al, which can exploit an application-specific hash.  But in general the SmallIntegers are ideal values for their own hashes.
 

_,,,^..^,,,_
best, Eliot



Re: 32 vs 64 bits and large integer hash

Luciano Notarfrancesco
Hi Eliot!

On Sat, Nov 24, 2018 at 9:18 PM Eliot Miranda <[hidden email]> wrote:

> This seems to be a basic error.  The idea of a hash function is to produce
> a well-distributed set of integers for some set of values.  Since the
> SmallIntegers are themselves perfectly distributed (each unique
> SmallInteger is a unique value), it is impossible to produce a better
> distributed hash function than the integers themselves.


The problem I see is that the sequences of SmallIntegers that we humans tend to add to hashed collections are not usually uniformly distributed. Usually they have some meaning, and thus some patterns to them; most of the time they don't look random, and so if they are their own hash the performance of hashed collections will most likely suffer.

I found this problem when profiling my code. I guess some people might assume that Sets or Dictionaries are fast, use them with SmallIntegers without much care, and never discover that their code could be orders of magnitude faster. So, another option might be to modify scanFor: to randomize the hash of the argument (for example by sending an additional hashMultiply). This is because scanFor: assumes the arguments of successive calls will have uniformly distributed hashes:
    start := anObject hash \\ array size + 1.
The algorithm needs 'start' to be uniformly distributed between 1 and array size; with SmallIntegers' current hash this is very unlikely, and lookups degenerate into linear searches.

Also, any cryptographic hash will do great in pretty much ALL use cases. Finding a sequence of integers that produces non-uniform hashes is very hard, and is equivalent to breaking the cryptographic hash, making it unusable for cryptography. This is a sort of silver bullet; I seem to remember that Java does this now, but I think it is overkill and something like hashMultiply is enough (plus, it is cheaper).

My 2 maos,
Luciano



Re: 32 vs 64 bits and large integer hash

Eliot Miranda-2
Hi Luciano,

On Nov 24, 2018, at 9:06 PM, Luciano Notarfrancesco <[hidden email]> wrote:

> The problem I see is that the sequences of SmallIntegers that we humans
> tend to add to hashed collections are not usually uniformly distributed.

As I said, for specific cases where the distribution of the integers used fits some pattern, there are PluggableDictionary and PluggableSet.  In cases where integers are randomly or evenly distributed one cannot improve on integers being their own hashes, so KISS, and use the available alternatives.


> So, another option might be to modify scanFor: to randomize the hash of
> the argument (for example by sending an additional hashMultiply). This is
> because scanFor: assumes the arguments of successive calls will have
> uniformly distributed hashes:
>     start := anObject hash \\ array size + 1.
> The algorithm needs 'start' to be uniformly distributed between 1 and
> array size; with SmallIntegers' current hash this is very unlikely, and
> lookups degenerate into linear searches.

Whether it is likely or unlikely depends on the specific set of integers.  Any hash function can suffer from the same issue; and if a hash function produces fewer values than its input set then that hash function may make things worse.  Hence it makes sense to keep things simple and be aware of alternatives for specific situations.


> Also, any cryptographic hash will do great in pretty much ALL use cases.
> Finding a sequence of integers that produces non-uniform hashes is very
> hard,

Since one takes the result of the hash modulo N, I think this statement is false.  It depends on the hash table size and the specific set of integers one is hashing.





Re: 32 vs 64 bits and large integer hash

Nicolas Cellier


On Sun, Nov 25, 2018 at 22:21, Eliot Miranda <[hidden email]> wrote:
> As I said, for specific cases where the distribution of the integers used
> fits some pattern, there are PluggableDictionary and PluggableSet. In
> cases where integers are randomly or evenly distributed one cannot improve
> on integers being their own hashes, so KISS, and use the available
> alternatives.

Hi Eliot,
my bet is the same as Luciano's: a set of integers with uniform distribution is the least probable in my experience: it's just noise. Any other data of interest will show some pattern.
So I'm not sure it's that simple to force usage of Pluggable* just to preserve a theoretical property that we don't actually encounter...


> Whether it is likely or unlikely depends on the specific set of integers.
> Any hash function can suffer from the same issue; and if a hash function
> produces fewer values than its input set then that hash function may make
> things worse.  Hence it makes sense to keep things simple and be aware of
> alternatives for specific situations.


But that is the case for any other hash. What is important is the length of the hash in bits compared to the size of the Set.
If hashMultiply results in hashes that are too short, then we get a problem for Sets of any other objects too...


> Since one takes the result of the hash modulo N, I think this statement is
> false.  It depends on the hash table size and the specific set of integers
> one is hashing.

Hmm number theory, let's see... Hmm, that's really a question for Luciano or Andres ;)





Re: 32 vs 64 bits and large integer hash

Luciano Notarfrancesco
In reply to this post by Eliot Miranda-2
On Sun, Nov 25, 2018 at 9:21 PM Eliot Miranda <[hidden email]> wrote:
> Since one takes the result of the hash modulo N, I think this statement is
> false.  It depends on the hash table size and the specific set of integers
> one is hashing.


Well, in fact there might be small biases in the residues modulo N. For example, if the hash function produces uniformly distributed outputs of 8 bits (0 to 255 with probability 1/256), and you take the residues modulo 255, you'll find a small bias for 0:  1 to 254 have probability 1/254 while 0 has probability 2/254 (because both '0 \\ 255' and '255 \\ 255' are 0). Small biases like this can make all the difference for the security of a cryptographic system, but for a hashed collection it's fine.
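
A quick workspace tally of the bias (a sketch, not from the thread): each of the 256 possible outputs is reduced modulo 255, and the residue 0 is hit twice while every other residue is hit once.

| counts |
counts := Array new: 255 withAll: 0.
0 to: 255 do: [:h |
	| slot |
	slot := h \\ 255 + 1.		"slot 1 counts the residue 0"
	counts at: slot put: (counts at: slot) + 1].
counts first  "==> 2, while every other entry is 1"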

Cheers,
Luciano



Re: 32 vs 64 bits and large integer hash

Luciano Notarfrancesco
On Mon, Nov 26, 2018 at 4:05 AM Luciano Notarfrancesco <[hidden email]> wrote:
> 1 to 254 have probability 1/254 while 0 has probability 2/254
 
Oops, that should be 1/256 and 2/256, sorry.
When probabilities don't sum to 1, unexpected things can happen ;)



Re: 32 vs 64 bits and large integer hash

Luciano Notarfrancesco
Perhaps it is best to change Set>>scanFor:, as I suggested in another mail. I think it might be best because: 1) it adheres better to KISS principles, keeping Integer>>hash unchanged and simple; 2) it might address similar performance problems with other types of objects, not only SmallIntegers; 3) it leaves the responsibility of 'mixing' the hash bits to the hashed collections, where it seems to belong.

If we do this, then we can have simpler requirements for hash: it has to be a SmallInteger, equal objects must have equal hashes, and ideally it has to be well distributed. But it doesn't need to look random, so similar objects can have similar hashes (as is currently the case with SmallInteger, and maybe with Strings too? It might even be desirable for similar Strings to have similar hashes). Then the only required change is in Dictionary and Set scanFor:
  start _ (anObject hash \\ array size) + 1.
should be changed to:
  start _ (anObject hash hashMultiply \\ array size) + 1.
(or something similar; anything that 'mixes' the bits of the hash, perhaps with a primitive that does this and the modulus all at once). This would ensure that lookup in hashed collections is done in constant time and doesn't degenerate into linear searches, at a very small cost (the lookup time is increased slightly by the extra hashMultiply).
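
For concreteness, here is a sketch of the change in context, paraphrased from the classic linear-probing Set>>scanFor: of Squeak-lineage images (slot-matching details vary across forks, so treat the body as illustrative):

scanFor: anObject
	"Scan the key array for the first slot containing either nil (an empty
	slot) or an element equal to anObject. Answer that index, or zero if no
	slot is found. The extra hashMultiply mixes the hash bits so that regular
	hash sequences (e.g. consecutive SmallIntegers) spread across the table."
	| index start |
	index := start := anObject hash hashMultiply \\ array size + 1.
	[
		| element |
		((element := array at: index) == nil or: [element = anObject])
			ifTrue: [^index].
		(index := index \\ array size + 1) = start] whileFalse.
	^0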

And there's always the alternative of PluggableSet / PluggableDictionary for the cases when a custom hash is desired.

My other 2 cents,
Luciano






Re: 32 vs 64 bits and large integer hash

Eliot Miranda-2
Hi Luciano,

On Nov 26, 2018, at 2:34 AM, Luciano Notarfrancesco <[hidden email]> wrote:

> Then the only required change is in Dictionary and Set scanFor:
>   start _ (anObject hash \\ array size) + 1.
> should be changed to:
>   start _ (anObject hash hashMultiply \\ array size) + 1.

I find this much more compelling.  I have no objection to this and it suggests an obvious extension to the hashMultiply primitive, which is to support large Integer receivers.

(p.s. _ in place of := is deprecated).
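
A sketch of what that extension's fallback code might look like (XXX and the exact form are assumptions; the constant 1664525 and the mod 2^28 mask are carried over from the existing SmallInteger>>hashMultiply):

Integer >> hashMultiply
	"Multiply by 1664525 mod 2^28. This image-side fallback is correct for a
	receiver of any size, merely slower for LargeIntegers; the hypothetical
	primitive would perform the same computation without the large
	intermediate product."
	<primitive: XXX>
	^self * 1664525 bitAnd: 16r0FFFFFFF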
