Hash on collections

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Hash on collections

Vincent.Blondeau
Hi,

I was watching to the Collection>>#hash implementation and I found something strange:
self size <= 10 ifTrue:
                                [self do: [:elem | hash := hash bitXor: elem hash]].

Can someone know why the contents are only considered for the hash if the collection size is below 10? Is that a bug or a feature?

Moreover, it is in contradiction with the end of the comment:
-- two equal objects have equal hash values

TIA,

Cheers,
Vincent


winmail.dat (19K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Hash on collections

Nicolas Cellier
Hi,
it does not look like a good hash (in term of probability of collisions) and would deserve a revision.
But what problem did you detect wrt equality?

The implication is this way:
   self assert: (a = b) ==> (a hash = b hash)

Since the element hash are bitXor'ed, the resulting hash will be independent of ordering of elements.
It will thus work for unordered collections (like Set in which ordering depends on container size).
(and badly for others if they omit to refine hash).

Or is it because Collection of different species could be equal?

Nicolas

2018-06-20 23:15 GMT+02:00 <[hidden email]>:
Hi,

I was watching to the Collection>>#hash implementation and I found something strange:
self size <= 10 ifTrue:
                                [self do: [:elem | hash := hash bitXor: elem hash]].

Can someone know why the contents are only considered for the hash if the collection size is below 10? Is that a bug or a feature?

Moreover, it is in contradiction with the end of the comment:
-- two equal objects have equal hash values

TIA,

Cheers,
Vincent


Reply | Threaded
Open this post in threaded view
|

Re: Hash on collections

Vincent.Blondeau

 

 

From: Pharo-dev [mailto:[hidden email]] On Behalf Of Nicolas Cellier
Sent: Wednesday, June 20, 2018 14:52
To: Pharo Development List <[hidden email]>
Subject: Re: [Pharo-dev] Hash on collections

 

Hi,

 

Hi,

 

 

it does not look like a good hash (in term of probability of collisions) and would deserve a revision.

But what problem did you detect wrt equality?

 

None, I was just wondering why there is a limit defined at 10 and if a problem can exist due to this.

 

The implication is this way:

   self assert: (a = b) ==> (a hash = b hash)

 

Since the element hash are bitXor'ed, the resulting hash will be independent of ordering of elements.

It will thus work for unordered collections (like Set in which ordering depends on container size).

(and badly for others if they omit to refine hash).

 

Thx for the explanation!

Or is it because Collection of different species could be equal?

 

I was watching to the dictionaries #= and #hash. And indeed, it is not a problem in it. But the hash used is really not efficient for unordered collections of unordered collections of same size.

 

Vincent

 

Nicolas

 

2018-06-20 23:15 GMT+02:00 <[hidden email]>:

Hi,

I was watching to the Collection>>#hash implementation and I found something strange:
self size <= 10 ifTrue:
                                [self do: [:elem | hash := hash bitXor: elem hash]].

Can someone know why the contents are only considered for the hash if the collection size is below 10? Is that a bug or a feature?

Moreover, it is in contradiction with the end of the comment:
-- two equal objects have equal hash values

TIA,

Cheers,
Vincent

 

Reply | Threaded
Open this post in threaded view
|

Re: Hash on collections

Nicolas Cellier


2018-06-21 0:00 GMT+02:00 <[hidden email]>:

 

 

From: Pharo-dev [mailto:[hidden email]] On Behalf Of Nicolas Cellier
Sent: Wednesday, June 20, 2018 14:52
To: Pharo Development List <[hidden email]>
Subject: Re: [Pharo-dev] Hash on collections

 

Hi,

 

Hi,

 

 

it does not look like a good hash (in term of probability of collisions) and would deserve a revision.

But what problem did you detect wrt equality?

 

None, I was just wondering why there is a limit defined at 10 and if a problem can exist due to this.

 

The implication is this way:

   self assert: (a = b) ==> (a hash = b hash)

 

Since the element hash are bitXor'ed, the resulting hash will be independent of ordering of elements.

It will thus work for unordered collections (like Set in which ordering depends on container size).

(and badly for others if they omit to refine hash).

 

Thx for the explanation!

Or is it because Collection of different species could be equal?

 

I was watching to the dictionaries #= and #hash. And indeed, it is not a problem in it. But the hash used is really not efficient for unordered collections of unordered collections of same size.

 

Vincent

 


Indeed,
but remember that we even not have a guaranty that Collection are of finite size...
I would expect that they are countable (by virtue of do: we can iterate over its elements, possibly with an infinite loop).
So it would make sense to refine hash in Set/Dictionary/HashedCollection and remove the size limitation there...

Maybe something like ^self inject: self size hash into: [:hash :elem | hash bitXor: elem hash].


Nicolas

 

2018-06-20 23:15 GMT+02:00 <[hidden email]>:

Hi,

I was watching to the Collection>>#hash implementation and I found something strange:
self size <= 10 ifTrue:
                                [self do: [:elem | hash := hash bitXor: elem hash]].

Can someone know why the contents are only considered for the hash if the collection size is below 10? Is that a bug or a feature?

Moreover, it is in contradiction with the end of the comment:
-- two equal objects have equal hash values

TIA,

Cheers,
Vincent

 


Reply | Threaded
Open this post in threaded view
|

Re: Hash on collections

Nicolas Cellier


2018-06-21 0:16 GMT+02:00 Nicolas Cellier <[hidden email]>:


2018-06-21 0:00 GMT+02:00 <[hidden email]>:

 

 

From: Pharo-dev [mailto:[hidden email]] On Behalf Of Nicolas Cellier
Sent: Wednesday, June 20, 2018 14:52
To: Pharo Development List <[hidden email]>
Subject: Re: [Pharo-dev] Hash on collections

 

Hi,

 

Hi,

 

 

it does not look like a good hash (in term of probability of collisions) and would deserve a revision.

But what problem did you detect wrt equality?

 

None, I was just wondering why there is a limit defined at 10 and if a problem can exist due to this.

 

The implication is this way:

   self assert: (a = b) ==> (a hash = b hash)

 

Since the element hash are bitXor'ed, the resulting hash will be independent of ordering of elements.

It will thus work for unordered collections (like Set in which ordering depends on container size).

(and badly for others if they omit to refine hash).

 

Thx for the explanation!

Or is it because Collection of different species could be equal?

 

I was watching to the dictionaries #= and #hash. And indeed, it is not a problem in it. But the hash used is really not efficient for unordered collections of unordered collections of same size.

 

Vincent

 


Indeed,
but remember that we even not have a guaranty that Collection are of finite size...
I would expect that they are countable (by virtue of do: we can iterate over its elements, possibly with an infinite loop).
So it would make sense to refine hash in Set/Dictionary/HashedCollection and remove the size limitation there...

Maybe something like ^self inject: self size hash into: [:hash :elem | hash bitXor: elem hash].



Oh, and I forgot,
you could offer you the book of Andres Valloud:


it will give a pretty deep insight in hashing algorithm efficiency (and measurement of).

Nicolas

 

2018-06-20 23:15 GMT+02:00 <[hidden email]>:

Hi,

I was watching to the Collection>>#hash implementation and I found something strange:
self size <= 10 ifTrue:
                                [self do: [:elem | hash := hash bitXor: elem hash]].

Can someone know why the contents are only considered for the hash if the collection size is below 10? Is that a bug or a feature?

Moreover, it is in contradiction with the end of the comment:
-- two equal objects have equal hash values

TIA,

Cheers,
Vincent