Hash collision


Hash collision

HilaireFernandes
Hi,

In Dr. Geo I rely a lot on hash values for geometric objects. They
greatly speed up comparing geometric items (to avoid duplicated
items), for example to detect when the user is building an already
created object.

Also, when a geometric sketch is described with Smalltalk code,
hundreds or thousands of geometric items are created. The hash value
speeds up the equality check (otherwise a true comparison iterates
over the geometric sketch tree, which is slow).

But sometimes hash collisions occur. I guess in that case Dr. Geo
should fall back to the real equality check, or maybe I am overusing
the hash? I am interested in advice; for now I am doing nothing
special.

Here is an annoying hash collision in Pharo 3 / Pharo 5 that I think
should not occur:

    (0.5@0.25) hash   "==> 67006464"
    (-0.5@0.25) hash  "==> 67006464"


Squeak 5.0 does not have this collision. It looks related to the Float
hash implementation being a bit different.

What do you think?

Hilaire

--
Dr. Geo
http://drgeo.eu



Re: Hash collision

Ben Coman


On Fri, Dec 9, 2016 at 10:51 PM, Hilaire <[hidden email]> wrote:
> Here is an annoying hash collision in Pharo 3 / Pharo 5 that I think
> should not occur:
>
>     (0.5@0.25) hash   "==> 67006464"
>     (-0.5@0.25) hash  "==> 67006464"
>
> Squeak 5.0 does not have this collision. It looks related to the
> Float hash implementation being a bit different.
More to the point, this seems bad...

    -0.5 hash  "==> 57344"
     0.5 hash  "==> 57344"

I read page 307 of "Hashing in Smalltalk: Theory and Practice" **
regarding Squeak's (Pharo's) Point>>hash method...
   "at this time we will not test this hash function for points with our floating point data sets because we already know that [Pharo's] hash function for floats is of bad quality."

** Errm... actually first time I've opened the book since I bought it a couple of years ago.

And from page 273 reviewing Float's hash...
    "because of how the bitAnd: is used, the hash function will discard 32 of the original 64 bits. In other words, what seems a hash function using all the available bytes is anything but. [Also] because of the way the two 16 bit chunks are added together, by the time bitShift: -8 is sent we will only have, at most, 17 bits worth of hash value. Effectively all the bitShift: accomplishes is to get rid of the 8 lowest bits, which are zero anyway because of how the bitAnd: messages were set up. Can these objections add up to particularly poor performance? The answer is definitely yes - since we will have only 17 bit hash values this hash function will be particularly unsuitable for any data set containing more than 2^17 = 131,072 objects"

    Float>>hash  "Pharo"
        (self isFinite and: [self fractionPart = 0.0])
            ifTrue: [^self truncated hash].
        ^ (((self basicAt: 1) bitAnd: 16r00FFFF00) +
           ((self basicAt: 2) bitAnd: 16r00FFFF00)) bitShift: -8
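To see the effect concretely, here is a small Python sketch of that method for the non-integral case. It assumes `basicAt: 1` and `basicAt: 2` answer the high and low 32-bit words of the big-endian IEEE-754 representation (an assumption, but consistent with the values quoted in this thread):

```python
import struct

def pharo_float_hash(f):
    # Sketch of Pharo's Float>>hash for a finite, non-integral float:
    # mask the middle two bytes of each 32-bit word, add, shift.
    hi, lo = struct.unpack('>II', struct.pack('>d', f))
    return ((hi & 0x00FFFF00) + (lo & 0x00FFFF00)) >> 8

# The sign bit lives in the top byte of the high word, which
# bitAnd: 16r00FFFF00 throws away -- hence the collision.
print(pharo_float_hash(0.5))    # 57344
print(pharo_float_hash(-0.5))   # 57344
```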

The suggested fix depended on being willing to let go of byte ordering (which we might not want, since it would give different results on different platforms):
    Float>>hash
        ^ ByteArray
            hashBytes: self
            startingWith: self species hash
 
but anyway that doesn't actually help this case...

    -0.5 hash  "==> 120484352"
     0.5 hash  "==> 120484352"

I see Squeak5's float hash has changed...

    Float>>hash  "Squeak5"
        (self isFinite and: [self fractionPart = 0.0])
            ifTrue: [^self truncated hash].
        ^ ((self basicAt: 1) bitShift: -4) +
          ((self basicAt: 2) bitShift: -4)

which gives...

    0.5 hash   "==> 66977792"
    -0.5 hash  "==> 201195520"

which solves the immediate problem, but I'm not capable of analysing its quality further.
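For comparison, a Python sketch of the Squeak 5 method under the same big-endian word-order assumption reproduces those two values:

```python
import struct

def squeak5_float_hash(f):
    # Sketch of Squeak 5's Float>>hash for a finite, non-integral
    # float: shift each 32-bit word right by 4 and add.
    hi, lo = struct.unpack('>II', struct.pack('>d', f))
    return (hi >> 4) + (lo >> 4)

# Only the bottom 4 bits of each word are discarded, so the sign
# bit now participates and the two values differ.
print(squeak5_float_hash(0.5))    # 66977792
print(squeak5_float_hash(-0.5))   # 201195520
```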

cheers -ben

Re: Hash collision

philippeback
If you do not really care about performance because there are not that many geometric objects (say thousands), I guess this would work (bit pattern as string):

{
0.5 binaryLiteralString hash.
-0.5 binaryLiteralString hash.
}
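The idea works because the printed literal spells out every bit, including the sign. A Python sketch of the same idea, using the hex digits of the IEEE-754 bit pattern as the string (a stand-in for `binaryLiteralString`, not its exact output):

```python
import struct

# The two strings differ in their first hex digit (sign bit set),
# so any reasonable string hash will tell them apart.
print(struct.pack('>d', 0.5).hex())   # 3fe0000000000000
print(struct.pack('>d', -0.5).hex())  # bfe0000000000000
```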

Phil




Re: Hash collision

Martin McClure-2
In reply to this post by Ben Coman
On 12/09/2016 09:45 AM, Ben Coman wrote:

> The suggested fix depended on being willing to let go of byte ordering
> (which we might not want different results on different platforms)
>
>     Float>>hash
>         ^ ByteArray
>             hashBytes: self
>             startingWith: self species hash
This fix contains a good idea, but has at least two problems:
1) The check for integers is important, that's what makes sure that 1.0 has the same hash as 1. Which it must, since they compare equal.
2) #hashBytes:startingWith: is designed to hash *bytes*. A BoxedFloat64 is two *words*, which ensures that when you hashMultiply the high-order bits of each word get discarded.

I tried implementing a hash that actually uses the bytes of the float (code at the bottom of this message). I didn't do any real analysis of it, but at least it gives different answers for 0.5 and -0.5. I think it matches the *intent* of the code above (though it would still need the integer check, which I didn't bother with).

 
> but anyway that doesn't actually help this case...
>
>     -0.5 hash  "==> 120484352"
>     0.5 hash   "==> 120484352"
>
> I see Squeak5's float hash has changed...
>
>     Float>>hash  "Squeak5"
>         (self isFinite and: [self fractionPart = 0.0])
>             ifTrue: [^self truncated hash].
>         ^ ((self basicAt: 1) bitShift: -4) +
>           ((self basicAt: 2) bitShift: -4)

This is slightly better. It's only discarding eight bits out of 64, instead of discarding 32. But it would be better to mix all the bits into the hash.

Note that all of these hash implementations end up manipulating at least one large integer most of the time, since each word can be a large integer. It's possible that a primitive to copy a word object into the equivalent bytes might speed things up somewhat.

Regards,

-Martin


"Strictly experimental method on BoxedFloat64, could probably be made faster even without a primitive."

bytesHash
    | bytes word1 word2 |
    bytes := ByteArray new: 8.
    word1 := self basicAt: 1.
    word2 := self basicAt: 2.
    bytes at: 1 put: (word1 bitShift: -24).
    bytes at: 2 put: ((word1 bitShift: -16) bitAnd: 16rFF).
    bytes at: 3 put: ((word1 bitShift: -8) bitAnd: 16rFF).
    bytes at: 4 put: (word1 bitAnd: 16rFF).
    bytes at: 5 put: (word2 bitShift: -24).
    bytes at: 6 put: ((word2 bitShift: -16) bitAnd: 16rFF).
    bytes at: 7 put: ((word2 bitShift: -8) bitAnd: 16rFF).
    bytes at: 8 put: (word2 bitAnd: 16rFF).
    ^ ByteArray hashBytes: bytes startingWith: self class hash
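The effect of going byte by byte can be sketched in Python. The inner loop below assumes the hash is the `(hash + byte) * 1664525`, low-28-bits multiplicative hash described later in this thread, and the starting value stands in for an arbitrary species hash:

```python
import struct

def hash_bytes(data, initial):
    # Assumed shape of ByteArray>>hashBytes:startingWith::
    # fold each byte in with a multiply, keeping the low 28 bits.
    h = initial & 0xFFFFFFF
    for b in data:
        h = ((h + b) * 1664525) & 0xFFFFFFF
    return h

def bytes_hash(f):
    # Python analogue of the bytesHash method above: the 8 bytes of
    # the big-endian IEEE-754 representation, fed one at a time.
    return hash_bytes(struct.pack('>d', f), 12345)  # 12345: made-up species hash

print(bytes_hash(0.5) != bytes_hash(-0.5))   # True: the sign byte now matters
```

Since the multiplier is odd, multiplication modulo 2^28 is a bijection, so once the sign byte perturbs the state the two hashes can never re-converge over the remaining (identical) bytes.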


Re: Hash collision

Andres Valloud-4
On 12/10/16 10:36 , Martin McClure wrote:

> On 12/09/2016 09:45 AM, Ben Coman wrote:
>>
>> The suggested fix depended on being willing to let go of byte ordering
>> (which we might not want different results on different platforms)
>>     Float>>hash
>> ^ByteArray
>>    hashBytes: self
>>    startingWith: self species hash
> This fix contains a good idea, but has at least two problems:
> 1) The check for integers is important, that's what makes sure that 1.0
> has the same hash as 1. Which it must, since they compare equal.
> 2) #hashBytes:startingWith: is designed to hash *bytes*. A BoxedFloat64
> is two *words*, which ensures that when you hashMultiply the high-order
> bits of each word get discarded.

The high order bits of each word, yes.  However, I would have expected
the byte oriented code to be, basically,

h := some constant.
bytes do: [:x | h := h * 1644525 + x]

or

h := some constant.
bytes do: [:x | h := (h + x) * 1644525]

In either case, the sign bit in one of the x values should have made a
difference.  Apparently it didn't.  What am I missing here?

> I tried implementing a hash that actually uses the bytes of the float
> (code at bottom of this message). I didn't do any real analysis of it,
> but at least it gives different answers for 0.5 and -0.5. I think it
> matches the *intent* of the code above (though would still need the
> integer check, which I didn't bother with).

Side comment: equality between integers, floats, fractions and other
numbers induces a desire to have equal hashes... in my opinion, the
resulting complexity is hard to justify.

>> I see Squeak5's float hash has changed...
>>      Float>>hash "Squeak5"
>> (self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self
>> truncated hash].
>> ^ ((self basicAt: 1) bitShift: -4) +
>>   ((self basicAt: 2) bitShift: -4)
>>
> This is slightly better. It's only discarding eight bits out of 64,
> instead of discarding 32. But it would be better to mix all the bits
> into the hash.

Yes.

> Note that all of these hash implementations end up manipulating at least
> one large integer most of the time, since each word can be a large
> integer. It's possible that a primitive to copy a word object into the
> equivalent bytes might speed things up somewhat.

... or use the byte oriented hash.

> "Strictly experimental method on BoxedFloat64, could probably be made
> faster even without a primitive."
>
> bytesHash
>     | bytes word1 word2 |
>     bytes := ByteArray new: 8.
>     word1 := self basicAt: 1.
>     word2 := self basicAt: 2.
>     bytes at: 1 put: (word1 bitShift: -24).
>     bytes at: 2 put: ((word1 bitShift: -16) bitAnd: 16rFF).
>     bytes at: 3 put: ((word1 bitShift: -8) bitAnd: 16rFF).
>     bytes at: 4 put: (word1  bitAnd: 16rFF).
>     bytes at: 5 put: (word2 bitShift: -24).
>     bytes at: 6 put: ((word2 bitShift: -16) bitAnd: 16rFF).
>     bytes at: 7 put: ((word2 bitShift: -8) bitAnd: 16rFF).
>     bytes at: 8 put: (word2  bitAnd: 16rFF).
>     ^ ByteArray hashBytes: bytes startingWith: self class hash

I would have expected the floats to be a byte object of size 8.  Why is
this conversion needed?  Is somehow the primitive thinking the float has
size 2?  Or is the primitive hashing 32 bits at a time?  The prim used
to be a C version of the byte oriented 1644525 multiplicative hash, did
this change?

Andres.


Re: Hash collision

Martin McClure-2
On 12/13/2016 08:39 AM, Andres Valloud wrote:
> I would have expected the floats to be a byte object of size 8.  Why is
> this conversion needed?  Is somehow the primitive thinking the float has
> size 2?  Or is the primitive hashing 32 bits at a time?  The prim used
> to be a C version of the byte oriented 1644525 multiplicative hash, did
> this change?

The float is not a byte object of size 8, it's a word object of size 2.

When you feed it to #hashBytes:startingWith: the primitive fails. (I
didn't look at why -- maybe because it isn't a byte object?)

The alternative Smalltalk code treats it like it was a byte object, even
though it's not. So it iterates through the two words, multiplying each
by 1664525, then keeping only the low-order 28 bits of the result. But
since each word is a full 32 bits, the high-order four bits of each
word, which include the sign bit, can never affect the hash.
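A Python sketch makes the failure mode visible. Feeding the two 32-bit words (rather than bytes) through the assumed `(hash + x) * 1664525`, low-28-bits loop, the sign bit contributes a multiple of 2^28 and vanishes:

```python
import struct

def wordwise_hash(f):
    # What the failure code effectively does, per the description
    # above: iterate over the two 32-bit words as if they were bytes.
    hi, lo = struct.unpack('>II', struct.pack('>d', f))
    h = 12345 & 0xFFFFFFF          # made-up starting value
    for word in (hi, lo):
        h = ((h + word) * 1664525) & 0xFFFFFFF
    return h

# 0.5 and -0.5 differ only in bit 31 of the high word, and
# 2**31 * 1664525 is a multiple of 2**28, so that bit cannot
# survive the bitAnd: 16rFFFFFFF.
print(wordwise_hash(0.5) == wordwise_hash(-0.5))   # True: collision
```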

Regards,

-Martin


Re: Hash collision

Andres Valloud-4
Oh, ok... then the failure code isn't doing what was intended :).

On 12/13/16 15:16 , Martin McClure wrote:

> On 12/13/2016 08:39 AM, Andres Valloud wrote:
>> I would have expected the floats to be a byte object of size 8.  Why is
>> this conversion needed?  Is somehow the primitive thinking the float has
>> size 2?  Or is the primitive hashing 32 bits at a time?  The prim used
>> to be a C version of the byte oriented 1644525 multiplicative hash, did
>> this change?
>
> The float is not a byte object of size 8, it's a word object of size 2.
>
> When you feed it to #hashBytes:startingWith: the primitive fails. (I
> didn't look at why -- maybe because it isn't a byte object?)
>
> The alternative Smalltalk code treats it like it was a byte object, even
> though it's not. So it iterates through the two words, multiplying each
> by 1664525, then keeping only the low-order 28 bits of the result. But
> since each word is a full 32 bits, the high-order four bits of each
> word, which include the sign bit, can never affect the hash.
>
> Regards,
>
> -Martin
>


Re: Hash collision

stepharong
Sorry, but how can we improve the situation concretely?


Stef

On Wed, 14 Dec 2016 00:22:47 +0100, Andres Valloud  
<[hidden email]> wrote:

> Oh, ok... then the failure code isn't doing what was intended :).
>
> On 12/13/16 15:16 , Martin McClure wrote:
>> On 12/13/2016 08:39 AM, Andres Valloud wrote:
>>> I would have expected the floats to be a byte object of size 8.  Why is
>>> this conversion needed?  Is somehow the primitive thinking the float  
>>> has
>>> size 2?  Or is the primitive hashing 32 bits at a time?  The prim used
>>> to be a C version of the byte oriented 1644525 multiplicative hash, did
>>> this change?
>>
>> The float is not a byte object of size 8, it's a word object of size 2.
>>
>> When you feed it to #hashBytes:startingWith: the primitive fails. (I
>> didn't look at why -- maybe because it isn't a byte object?)
>>
>> The alternative Smalltalk code treats it like it was a byte object, even
>> though it's not. So it iterates through the two words, multiplying each
>> by 1664525, then keeping only the low-order 28 bits of the result. But
>> since each word is a full 32 bits, the high-order four bits of each
>> word, which include the sign bit, can never affect the hash.
>>
>> Regards,
>>
>> -Martin
>>
>


--
Using Opera's mail client: http://www.opera.com/mail/


Re: Hash collision

Andres Valloud-4
The primitive and the primitive failure code should behave the same?

On 12/18/16 2:13 , stepharong wrote:

> Sorry but how can we improve the situation concretely
>
>
> Stef
>
> On Wed, 14 Dec 2016 00:22:47 +0100, Andres Valloud
> <[hidden email]> wrote:
>
>> Oh, ok... then the failure code isn't doing what was intended :).
>>
>> On 12/13/16 15:16 , Martin McClure wrote:
>>> On 12/13/2016 08:39 AM, Andres Valloud wrote:
>>>> I would have expected the floats to be a byte object of size 8.  Why is
>>>> this conversion needed?  Is somehow the primitive thinking the float
>>>> has
>>>> size 2?  Or is the primitive hashing 32 bits at a time?  The prim used
>>>> to be a C version of the byte oriented 1644525 multiplicative hash, did
>>>> this change?
>>>
>>> The float is not a byte object of size 8, it's a word object of size 2.
>>>
>>> When you feed it to #hashBytes:startingWith: the primitive fails. (I
>>> didn't look at why -- maybe because it isn't a byte object?)
>>>
>>> The alternative Smalltalk code treats it like it was a byte object, even
>>> though it's not. So it iterates through the two words, multiplying each
>>> by 1664525, then keeping only the low-order 28 bits of the result. But
>>> since each word is a full 32 bits, the high-order four bits of each
>>> word, which include the sign bit, can never affect the hash.
>>>
>>> Regards,
>>>
>>> -Martin
>>>
>>
>
>


Re: Hash collision

stepharong
Hi andres

I did not follow the discussion. My remark is just that a good
discussion without code for integration does not improve Pharo, and
people will face the problem again.

Stef

On Sun, 18 Dec 2016 11:29:11 +0100, Andres Valloud  
<[hidden email]> wrote:

> The primitive and the primitive failure code should behave the same?
>
> On 12/18/16 2:13 , stepharong wrote:
>> Sorry but how can we improve the situation concretely
>>
>>
>> Stef
>>
>> On Wed, 14 Dec 2016 00:22:47 +0100, Andres Valloud
>> <[hidden email]> wrote:
>>
>>> Oh, ok... then the failure code isn't doing what was intended :).
>>>
>>> On 12/13/16 15:16 , Martin McClure wrote:
>>>> On 12/13/2016 08:39 AM, Andres Valloud wrote:
>>>>> I would have expected the floats to be a byte object of size 8.  Why  
>>>>> is
>>>>> this conversion needed?  Is somehow the primitive thinking the float
>>>>> has
>>>>> size 2?  Or is the primitive hashing 32 bits at a time?  The prim  
>>>>> used
>>>>> to be a C version of the byte oriented 1644525 multiplicative hash,  
>>>>> did
>>>>> this change?
>>>>
>>>> The float is not a byte object of size 8, it's a word object of size  
>>>> 2.
>>>>
>>>> When you feed it to #hashBytes:startingWith: the primitive fails. (I
>>>> didn't look at why -- maybe because it isn't a byte object?)
>>>>
>>>> The alternative Smalltalk code treats it like it was a byte object,  
>>>> even
>>>> though it's not. So it iterates through the two words, multiplying  
>>>> each
>>>> by 1664525, then keeping only the low-order 28 bits of the result. But
>>>> since each word is a full 32 bits, the high-order four bits of each
>>>> word, which include the sign bit, can never affect the hash.
>>>>
>>>> Regards,
>>>>
>>>> -Martin
>>>>
>>>
>>
>>
>


--
Using Opera's mail client: http://www.opera.com/mail/


Re: Hash collision

Martin McClure-2
In reply to this post by Andres Valloud-4
On 12/18/2016 02:29 AM, Andres Valloud wrote:
> The primitive and the primitive failure code should behave the same?

You think the primitive is faulty? I haven't looked at the primitive
code, but it just fails when given a Float. From the name, I'd guess
that it's designed for a byte object. The failure code certainly is.

A new primitive designed for word objects, or new code such as I gave
earlier (though ugly), would work.

Regards,

-Martin


Re: Hash collision

Andres Valloud-4
At first glance, that the failure code only sees "two" things when it
should see "eight" seems to be problematic.

Perhaps the primitive insists on hashing byte objects, and there is a
distinction between "byte" objects and "word" objects (whatever "word"
means, presumably a constant width integer across all platforms).  I
haven't looked at the code.

 From my perspective... back in the day that primitive used to hash
bytes, and from what I saw here the failure code is hashing multi-byte
things.  If all of these observations are correct, then I'd say the
failure code isn't doing what the primitive is doing, and in doing so
it's introducing a lot of collisions that I'd like to believe the
intended hash function wouldn't produce.

Andres.

On 12/18/16 23:05 , Martin McClure wrote:

> On 12/18/2016 02:29 AM, Andres Valloud wrote:
>> The primitive and the primitive failure code should behave the same?
>
> You think the primitive is faulty? I haven't looked at the primitive
> code, but it just fails when given a Float. From the name, I'd guess
> that it's designed for a byte object. The failure code certainly is.
>
> A new primitive designed for word objects, or new code such as I gave
> earlier (though ugly) would work, though.
>
> Regards,
>
> -Martin
>


Re: Hash collision

Martin McClure-2
On 12/19/2016 01:27 AM, Andres Valloud wrote:

> At first glance, that the failure code only sees "two" things when it
> should see "eight" seems to be problematic.
>
> Perhaps the primitive insists on hashing byte objects, and there is a
> distinction between "byte" objects and "word" objects (whatever "word"
> means, presumably a constant width integer across all platforms).  I
> haven't looked at the code.
>
> From my perspective... back in the day that primitive used to hash
> bytes, and from what I saw here the failure code is hashing multi-byte
> things.  If all of these observations are correct, then I'd say the
> failure code isn't doing what the primitive is doing, and in doing so
> it's introducing a lot of collisions that I'd like to believe the
> intended hash function wouldn't produce.

Ah, I see your concern.

As far as I can see, all classes that are using the StringHash primitive
are actually byte objects, so things are, I believe, working as designed.

The only problem is that Ben did an experiment to see whether Float's
hashing would be improved by using the StringHash primitive, which it
failed to do because Float is not a byte object.

We could use an equivalent primitive to hash word objects, but I haven't
found one.

We could also use a primitive to retrieve the bytes of a word object,
and I haven't found one of those either. There are places that are
converting the words to large integers and then hashing those, which
would work for Floats.
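The large-integer route can be sketched the same way: combine the two words into one integer, then hash that integer's byte representation (the multiplicative loop is again an assumption about the intended hash, not a confirmed primitive):

```python
import struct

def large_integer_hash(f):
    hi, lo = struct.unpack('>II', struct.pack('>d', f))
    n = (hi << 32) | lo            # the float's bits as one large integer
    h = 0
    for b in n.to_bytes(8, 'big'): # hash the integer's bytes, sign bit included
        h = ((h + b) * 1664525) & 0xFFFFFFF
    return h

print(large_integer_hash(0.5) != large_integer_hash(-0.5))   # True
```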

-Martin