Food for thought:
In my finite state machine generator I do a lot of hashing of sometimes
large collections. For performance reasons I created hash-value-storing
versions of the relevant collection classes. Getting the hash value of the
hashed version of a collection was then very fast (accessing an instance
variable) and the hash values were good.

For example, for Array I created HashArray. The hash value of a HashArray
was the XOR of the hash values of the elements of the HashArray. Note that
this meant that each time you changed an entry in a HashArray you had to
update the hash value. So for HashArray, at:put: would look something like:

    at: index put: object
        hash := (hash bitXor: (self at: index) hash) bitXor: object hash.
        super at: index put: object

If you like living on the edge you could instead put off updating the hash
value until you put the HashArray somewhere its hash value is needed (such
as a Set, or a Dictionary key). This is of course error prone, but then
putting collections into Sets or Dictionaries (as keys) is inherently error
prone anyway, because modifying the collection changes its hash value and
thus breaks every Set or Dictionary containing the collection.

I would be tempted to have all real collection classes (e.g., not Interval)
not be storable in Sets or Dictionaries (as keys) at all, and instead
require that hash versions of the collections be used, but of course this
would surely break a lot of things.

Ralph Boland

--
Quantum Theory cannot save us from the tyranny of a deterministic universe.
But it does give God something to do.

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
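[Editorial aside: the hash-caching scheme above is not Smalltalk-specific. Here is a minimal Python analogue — a hypothetical `HashArray` class, not part of any library — that caches the XOR of the element hashes and updates it incrementally on every write, so `hash()` is O(1):]

```python
from functools import reduce

class HashArray:
    """Array-like container that caches the XOR of its elements' hashes."""

    def __init__(self, elements):
        self._items = list(elements)
        # Initial hash: XOR of all element hashes (0 for an empty array).
        self._hash = reduce(lambda h, e: h ^ hash(e), self._items, 0)

    def __getitem__(self, index):
        return self._items[index]

    def __setitem__(self, index, obj):
        # XOR out the old element's hash, XOR in the new one's.
        self._hash = (self._hash ^ hash(self._items[index])) ^ hash(obj)
        self._items[index] = obj

    def __hash__(self):
        return self._hash  # O(1), regardless of array size

a = HashArray([10, 20, 30])
a[1] = 99
# The cached hash stays consistent with recomputing from scratch
# (hash of a small Python int is the int itself):
assert hash(a) == 10 ^ 99 ^ 30
```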
Watch out for cancellation effects... if the collections are sequenceable
and the hash values of their contents are not well distributed, you may end
up with performance issues because the hash values cancel each other out,
since the hash function commutes. Consider for example the hash values of
the following collections:

#(1 2 3)
#(1 2 3 4 4)
#(1 2 3 5 5)
#(1 4 2 3 4)
#(5 4 3 2 1 4 5)

#(6)
#(1 2 3 4 5 6 5 4 3 2 1)

I am not familiar with your application, though, so perhaps the data does
not cause these problems.

Andres.

Ralph Boland wrote:
> Food for thought:
> [snip]
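[Editorial aside: the cancellation Andres describes is easy to check mechanically. In Python, where the hash of a small non-negative int is the int itself, the first five example arrays all XOR-hash to the same value, and so do the last two — duplicates cancel because x ^ x == 0, and order is ignored because XOR commutes:]

```python
from functools import reduce

def xor_hash(seq):
    """Commutative hash: XOR of the element hashes (0 when empty)."""
    return reduce(lambda h, e: h ^ hash(e), seq, 0)

group_a = [(1, 2, 3), (1, 2, 3, 4, 4), (1, 2, 3, 5, 5),
           (1, 4, 2, 3, 4), (5, 4, 3, 2, 1, 4, 5)]
group_b = [(6,), (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)]

# Every collection in each group collides on a single hash value.
print([xor_hash(s) for s in group_a])  # [0, 0, 0, 0, 0]
print([xor_hash(s) for s in group_b])  # [6, 6]
```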
> Watch out for cancellation effects... if the collections are
> sequenceable and the hash values of their contents are not well
> distributed, you may end up with performance issues because the hash
> values cancel each other because the hash function commutes.
> [snip]
>
> I am not familiar with your application, though, so perhaps the data
> does not cause these problems.
>
> Andres.

You are right. These are serious problems with my idea, and I retract the
proposal that the idea may be preferable to the problems we have with
putting the standard collections into Sets and Dictionaries.

It works well in my application because in each case the collections have
no duplicates and order doesn't matter. That is, in my application the hash
collections are all effectively sets or dictionaries in compacted form. I
use a bit of code specific to my application to ensure they are then
compared efficiently.

Ralph

> Ralph Boland wrote:
> [snip]
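[Editorial aside: for the set-like data Ralph describes — no duplicates, no meaningful order — a commutative XOR combiner is a reasonable fit, and the cached value can still be maintained incrementally. A minimal Python sketch (the `set_xor_hash` helper is hypothetical, not from the original code):]

```python
from functools import reduce

def set_xor_hash(elements):
    """XOR-combined hash for a duplicate-free, unordered collection."""
    return reduce(lambda h, e: h ^ hash(e), elements, 0)

# Equal sets hash equally regardless of the order elements arrived in:
assert set_xor_hash({"a", "b", "c"}) == set_xor_hash({"c", "a", "b"})

# The cached value can be maintained incrementally: adding element x
# XORs in hash(x); removing it XORs the same value back out.
h = set_xor_hash({1, 2})
h ^= hash(3)          # add 3
assert h == set_xor_hash({1, 2, 3})
h ^= hash(2)          # remove 2
assert h == set_xor_hash({1, 3})
```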