hashing collections

Ralph Boland
Food for thought:

In my finite state machine generator I do a lot of
hashing of sometimes large collections.
For performance reasons I created hash-value-storing versions
of the relevant collection classes.  Getting the hash value
of such a collection was then very fast (just reading an
instance variable), and the hash values were good.

For example, for Array I created HashArray.  The hash
value of a HashArray was the XOR of the hash values of
its elements.  Note that this meant that each time you
changed an entry in a HashArray you had to update the
stored hash value.  So HashArray >> at: index put: object
would look something like:

     "XOR out the old element's hash, then XOR in the new one;
      integer XOR in Pharo is bitXor:, not xor:."
     hash := (hash bitXor: (self at: index) hash) bitXor: object hash.
     ^ super at: index put: object
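
Sketched out a little more fully, such a class might look like this
(a minimal sketch, assuming Pharo: a subclass of Array with one extra
instance variable, hash, created with the usual class-definition
template; the names are only illustrative):

     HashArray >> hash
         "Answer the cached XOR of the element hashes,
          computing it on first use."
         hash ifNil: [self rehash].
         ^ hash

     HashArray >> rehash
         "Recompute the cached value from scratch, e.g. after bulk loading."
         hash := self inject: 0 into: [:acc :each | acc bitXor: each hash]

     HashArray >> at: index put: object
         "Keep the cached hash in sync on every store."
         hash ifNil: [self rehash].
         hash := (hash bitXor: (self at: index) hash) bitXor: object hash.
         ^ super at: index put: object

One subtlety is that = must stay consistent with hash: two HashArrays
with equal elements should still compare equal, so the cached hash is
purely an optimization.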

If you like living on the edge you could instead defer updating
the hash value until you put the HashArray somewhere its hash
value is actually needed (as an element of a Set, or as a
Dictionary key).  This is of course error prone, but then putting
collections into Sets, or using them as Dictionary keys, is
inherently error prone anyway, because modifying the collection
changes its hash value and thus breaks every Set or Dictionary
containing it.
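
A slightly safer way to get a similar effect (just a sketch of a lazy
invalidation variant, replacing the at: index put: above) is to drop
the cache on each store and let #hash rebuild it on demand; this trades
the constant-time hash for an occasional full recomputation, which pays
off when stores are much more common than hash requests:

     HashArray >> at: index put: object
         "Cheaper store: drop the cache; #hash rebuilds it lazily."
         hash := nil.
         ^ super at: index put: object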

I would be tempted to make all real collection classes (e.g. not
Interval) not storable in Sets or as Dictionary keys at all, and
instead require that hash versions of the collections be used, but
of course this would surely break a lot of things.

Ralph Boland


--
Quantum Theory cannot save us from the
tyranny of a deterministic universe.
But it does give God something to do.


Re: hashing collections

Andres Valloud
Watch out for cancellation effects... if the collections are
sequenceable and the hash values of their contents are not well
distributed, you may end up with performance issues because the
hash values cancel each other out, since the combining function
(XOR) commutes and is its own inverse.  Consider for example the
hash values of the following collections:

#(1 2 3)
#(1 2 3 4 4)
#(1 2 3 5 5)
#(1 4 2 3 4)
#(5 4 3 2 1 4 5)

#(6)
#(1 2 3 4 5 6 5 4 3 2 1)

I am not familiar with your application, though, so perhaps the data
does not cause these problems.

Andres.
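
To see the cancellation concretely, here is a small workspace snippet
(plain Pharo, purely illustrative) that XOR-folds the element hashes of
the examples above; any element occurring an even number of times drops
out of the fold entirely:

     "The first five arrays all fold to
      (1 hash) bitXor: (2 hash) bitXor: (3 hash),
      and the last two both fold to 6 hash."
     #( #(1 2 3) #(1 2 3 4 4) #(1 2 3 5 5) #(1 4 2 3 4) #(5 4 3 2 1 4 5)
        #(6) #(1 2 3 4 5 6 5 4 3 2 1) )
         collect: [:array |
             array inject: 0 into: [:acc :each | acc bitXor: each hash]]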



Re: hashing collections

Ralph Boland
Andres Valloud wrote:
> Watch out for cancellation effects... [snip]

You are right.  These are serious problems with my idea, and I retract
the suggestion that it might be preferable to the problems we already
have with putting the standard collections into Sets and Dictionaries.

It works well in my application because in each case the collections
have no duplicates and order doesn't matter.  That is, in my application
the hash collections are all effectively sets or dictionaries in
compacted form, and I use a bit of application-specific code to ensure
they are then compared efficiently.

Ralph
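
For the duplicate-free case, the commutativity that causes trouble above
is exactly the property wanted: the XOR fold is order-insensitive, so any
two collections holding the same distinct elements hash alike.  A quick
illustration (plain Pharo, purely illustrative):

     | xorHash |
     xorHash := [:collection |
         collection inject: 0 into: [:acc :each | acc bitXor: each hash]].
     (xorHash value: #(3 1 4 15 9 2 6))
         = (xorHash value: #(1 2 3 4 6 9 15))   "true: same elements, different order"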

