Igor et al,
> Let me simplify the goal of perfect hashing:
> - Suppose you have N unique integers, with no strictly specified
>   range (any number could lie in 0..infinity).
> - Now we need to find a function f such that, for each element x in
>   the above finite set, the following conditions are met:
>
>   0 <= f(x) < k*N
>
>   and, for each pair of different elements x, y in the original set,
>
>   f(x) <> f(y)
>
>   and k < infinity.
>
> As you can see, this function maps arbitrary integer values, without
> a specific range, to integer values in the range [0 .. k*N), k >= 1.
> Given that we only need to deal with a limited number of elements for
> each particular set of numbers, such a function can be found.
> We just need to search for an optimal (minimal) k, which should be
> < 2 and in the perfect case == 1.

I thought I'd point out a subtle potential problem. Assume N = 1000 and
f(x_j) = j - 1, so k = 1, because 0 <= f(x_j) < N given 0 < j <= 1000.
Now store x_{7i} in a Set, for all i such that 1 <= 7i <= 1000.
Clearly, f(x) is a perfect hash function as defined above. However, if
the size of the table used by the Set is a multiple of 7, performance
will be terrible.

By the way, you may want to check out the Hash Analysis Tool I wrote
for VisualWorks. It's in the public Store repository (bundle name "Hash
Analysis Tool"; there's also an extensions bundle with more hash
functions and data sets). My blog has a (somewhat passable excuse for
a) manual here:
http://blogten.blogspot.com/2008/02/hash-analysis-tool-manual.html

Andres.
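To make Igor's search concrete, here is a brute-force sketch in Pharo
that only considers functions of the form f(x) = x \\ m; the selector
name is invented and the code is untested:

   findPerfectModulusFor: aCollection
       "Try successive table sizes m, starting at N, until x \\ m is
        collision-free over the whole input set."
       | m |
       m := aCollection size max: 1.
       [ (aCollection collect: [ :x | x \\ m ]) asSet size
             = aCollection size ]
           whileFalse: [ m := m + 1 ].
       ^ m   "k is then m / N, ideally < 2 as Igor asks"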
2009/10/22 Andres Valloud <[hidden email]>:
> I thought I'd point out a subtle potential problem. Assume N = 1000
> and f(x_j) = j - 1, so k = 1, because 0 <= f(x_j) < N given
> 0 < j <= 1000. Now store x_{7i} in a Set, for all i such that
> 1 <= 7i <= 1000. Clearly, f(x) is a perfect hash function as defined
> above. However, if the size of the table used by the Set is a
> multiple of 7, performance will be terrible.

Err.. I'm not sure I understood. If you have 1000 objects, each
answering a unique hash value of the form

   element hash == 7i

and, as you state, 1 <= 7i <= 1000, then there can't be 1000 unique
hash values in this range for 1000 objects. Obviously you can have only
1000/7 unique hash values meeting such criteria.

But if you read carefully, I'm not addressing the problem of duplicate
hash values in the input set; I'm presuming they are all unique. Nor am
I addressing the quality of the hash function of the objects
themselves; I'm just illustrating a hash function which could be
employed by set(s), and which would work perfectly once all objects
have unique hash values. Surely, no matter how good the hash function
used by the set is, if the hashes of its elements clash, you will see
performance degradation.

> By the way, you may want to check out the Hash Analysis Tool I wrote
> for VisualWorks. [snip]

--
Best regards,
Igor Stasenko AKA sig.
In reply to this post by Lukas Renggli
Yes, I know.

Now, in Pharo, if I do not change the hash of my objects and I get a
lot of them in a dictionary, I do get bad hashing: at least this is
what I understood from the limited number of identity hash bits. So
this means that by default we have bad performance, no?

Stef

>> yyyyyyyyeeeeeeeeeeesssssssssssssssssssssssss
>> and get linearrrrrrrrrrrrrrrrrrrrrly boring and slow.
>
> Yes, but only if your objects don't implement #hash properly. HashMap
> trades space and time to get O(1) behavior IF you have lots of
> elements with bad hash values. If the hash values are good (which is
> the case most of the time) or if you have not that many elements
> (which is the case most of the time), it is a waste of time and space
> though.
>
> Lukas
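For background on those limited hash bits: on the Squeak/Pharo VMs of
that era the identity hash field is only about 12 bits wide, so there
are at most ~4096 distinct identity hashes no matter how many objects
exist. A quick workspace check (a sketch; the exact limit depends on
the VM):

   | hashes |
   hashes := (1 to: 100000) collect: [ :i | Object new identityHash ].
   hashes asSet size   "tops out around 4096 on a 12-bit-hash VM"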
> So this means that by default we have bad performance, no?
No.

In a fresh Pharo Web image less than 6% of the keys in Dictionaries and
less than 10% of the values in Sets have weak hashes. Moreover, the
largest set with weak-hash values has 516 elements (on average only 1.8
elements), and the largest dictionary with weak-hash keys has 1002
elements (on average only 4.3 elements). Using HashTable in such a
situation would introduce a major speed penalty and waste a lot of
memory.

It would be cool if Set and Dictionary would choose their
implementation strategy automatically depending on the use case. In a
standard Pharo image, however, that would just be the current
implementation. There are simply no instances in the image where it
would be worthwhile (a large amount of data with bad hashes) to use a
HashMap.

Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch
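Figures like these can be gathered right in the image; for example (a
sketch only; the weak-hash criterion Lukas used is not reproduced
here):

   | sets |
   sets := Set allInstances reject: [ :each | each isEmpty ].
   (sets detectMax: [ :each | each size ]) size.       "largest Set"
   ((sets inject: 0 into: [ :sum :each | sum + each size ])
       / sets size) asFloat                            "average size"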
In the far past, VisualAge would change the implementation logic for
Sets depending on how big the set was; for example, with 10 elements it
would be a linear list. Of course, it's been 14 years, so perhaps I'm
remembering it wrong.

On 2009-10-22, at 6:46 AM, Lukas Renggli wrote:

>> So this means that by default we have bad performance, no?
>
> No.
>
> In a fresh Pharo Web image less than 6% of the keys in Dictionaries
> and less than 10% of the values in Sets have weak hashes. [snip]
>
> Lukas

--
========================================================================
John M. McIntosh <[hidden email]>   Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
========================================================================
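The scheme John remembers might look like this (the class, selectors,
and threshold are all invented; a sketch of the idea, not VisualAge's
actual code):

   AdaptiveSet >> includes: anObject
       "Below a small threshold a plain linear scan beats hashing;
        past it, fall back to hashed lookup."
       ^ items size <= 10
           ifTrue: [ items anySatisfy: [ :each | each = anObject ] ]
           ifFalse: [ self hashedIncludes: anObject ]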
In reply to this post by Andres Valloud-4
Andres Valloud wrote:
> I thought I'd point out a subtle potential problem. Assume N = 1000
> and f(x_j) = j - 1, so k = 1, because 0 <= f(x_j) < N given
> 0 < j <= 1000. Now store x_{7i} in a Set, for all i such that
> 1 <= 7i <= 1000. Clearly, f(x) is a perfect hash function as defined
> above. However, if the size of the table used by the Set is a
> multiple of 7, performance will be terrible.

Huh? If your hash function produces integers between 0 and 999, your
table size must be at least 1000; otherwise you store past the end of
your table.

Regards,

-Martin
In reply to this post by johnmci
The nice way of doing this would be to have a layer of indirection, so
that the storage strategy decides how to use the hashed storage.
However, compared to the current Set/Dictionary implementation, that
may be a bit slower. It would not be a terrific idea to have multiple
subclasses of Set/Dictionary depending on the storage strategy, as that
would result in an explosion of classes.

John M McIntosh wrote:
> In the far past, VisualAge would change the implementation logic for
> Sets depending on how big the set was; for example, with 10 elements
> it would be a linear list. Of course, it's been 14 years, so perhaps
> I'm remembering it wrong.
> [snip]
In reply to this post by Martin McClure-2
My point would be that either:
* the hashed storage is always of size 1000 (or greater), so if you
store only a few objects in the hashed collection, it wastes a lot of
space; or

* the hashed storage changes size to match the objects stored in it,
but looking at f(x) mod p risks reintroducing the collisions you wanted
to avoid in the first place.

So, to not waste space in this particular scenario, either N should be
small or basically all N objects should be stored in the hashed
collection. If N is small, then why not set f(x) to produce values
between 1 and N, and then use an array? And if N is large and the
utilization of the hashed collection is close to 100% of N, then why
not use an array in that case too? In the end, really nice perfect hash
functions point the finger at alternatives that may obviate using
hashed collections at all. Of course, your mileage may vary...

Andres.

Martin McClure wrote:
> Huh? If your hash function produces integers between 0 and 999, your
> table size must be at least 1000; otherwise you store past the end of
> your table.
> [snip]
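The mod-p resonance Andres describes is easy to check in a workspace,
using his numbers (a quick sketch):

   | hashes |
   hashes := (1 to: 142) collect: [ :i | (7 * i) - 1 ].
       "f(x_7i) for 1 <= 7i <= 1000: the values 6, 13, ..., 993"
   (hashes collect: [ :h | h \\ 203 ]) asSet size.
       "=> 29: with 203 = 7 * 29 slots, the 142 elements crowd into
        only the 29 slots congruent to 6 mod 7"
   (hashes collect: [ :h | h \\ 211 ]) asSet size.
       "=> 142: with a prime table size, every element gets its own
        slot"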
In reply to this post by Andres Valloud-4
2009/10/22 Andres Valloud <[hidden email]>:
> The nice way of doing this would be to have a layer of indirection,
> so that the storage strategy decides how to use the hashed storage.
> However, compared to the current Set/Dictionary implementation, that
> may be a bit slower. It would not be a terrific idea to have multiple
> subclasses of Set/Dictionary depending on the storage strategy, as
> that would result in an explosion of classes.

It doesn't necessarily lead to an explosion. We already have storage in
Set: the array slot. So it's easy to imagine that this slot may change
the class which implements the storage, while from the outside you
still see a Set.

--
Best regards,
Igor Stasenko AKA sig.
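A sketch of that shape (all class names and selectors here are invented
for illustration; this is not existing Pharo code):

   Set >> add: newObject
       "Delegate to whatever object currently sits in the storage
        slot."
       ^ storage add: newObject for: self

   Set >> switchStorageTo: newStorage
       storage := newStorage

   LinearStorage >> add: newObject for: aSet
       "Plain ordered storage; promote to hashed storage past a
        threshold."
       items addLast: newObject.
       items size > 32 ifTrue: [
           aSet switchStorageTo: (HashedStorage withAll: items) ].
       ^ newObject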
In reply to this post by Andres Valloud-4
2009/10/22 Andres Valloud <[hidden email]>:
> * the hashed storage changes size to match the objects stored in it,
> but looking at f(x) mod p risks reintroducing the collisions you
> wanted to avoid in the first place.

A different hash function is picked if the size of the input set
changes. So there is no chance of having a storage of size 1000 holding
just a few elements (unless your implementation really sucks :). That's
why you can always be sure that your storage is good enough to store
all of the elements, with the minimal possible chance of collisions.
The only problem here, as I noted, is that you have to rehash the
storage each time you modify the input set, because you are changing
the hash function which is 'optimal' for the new input set. And that's
why this approach is bad for sets which change often.

> So, to not waste space in this particular scenario, either N should
> be small or basically all N objects should be stored in the hashed
> collection. [snip] In the end, really nice perfect hash functions
> point the finger at alternatives that may obviate using hashed
> collections at all. Of course, your mileage may vary...

Perfect hashing is just about that: finding a function which has
minimal or no hash collisions, with a storage size very close to the
number of elements in the input set.

--
Best regards,
Igor Stasenko AKA sig.
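The maintenance cost Igor concedes might look like this in code
(invented names; a sketch only):

   PerfectSet >> add: newObject
       "Every mutation can invalidate the current perfect function, so
        re-derive it (e.g. with a modulus search like the one sketched
        earlier) and rehash the whole storage. Fine for static sets,
        costly for sets that change often."
       elements add: newObject.
       modulus := self findPerfectModulusFor: elements.
       self rehash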
In reply to this post by Lukas Renggli
Ok, I see. I would be curious to find out whether SystemDictionary
degrades once we load Moose, Mondrian, Glamour...

>> So this means that by default we have bad performance, no?
>
> No.
>
> In a fresh Pharo Web image less than 6% of the keys in Dictionaries
> and less than 10% of the values in Sets have weak hashes. [snip]
>
> Lukas
In reply to this post by johnmci
I was told that they had that, and that they removed it in VA5, I
guess.

On Oct 22, 2009, at 8:00 PM, John M McIntosh wrote:

> In the far past, VisualAge would change the implementation logic for
> Sets depending on how big the set was; for example, with 10 elements
> it would be a linear list. Of course, it's been 14 years, so perhaps
> I'm remembering it wrong.
> [snip]
In reply to this post by Stéphane Ducasse
Yeah, that could be a problem. Strangely, SystemDictionary is a
subclass of IdentityDictionary.

It can easily be fixed; I don't know if this has much of an impact on
performance if the dictionary is smaller:

   SystemDictionary superclass: Dictionary.
   SystemDictionary allInstances do: [ :each | each rehash ]

Lukas

2009/10/25 Stéphane Ducasse <[hidden email]>:
> Ok, I see. I would be curious to find out whether SystemDictionary
> degrades once we load Moose, Mondrian, Glamour...
> [snip]

--
Lukas Renggli
http://www.lukas-renggli.ch
In reply to this post by Stéphane Ducasse
Stéphane Ducasse wrote:
> Ok, I see. I would be curious to find out whether SystemDictionary
> degrades once we load Moose, Mondrian, Glamour...

SystemDictionary is a subclass of IdentityDictionary, which does have
some performance problems right around 2K elements, and more severe
performance problems at sizes around 4K elements, but it's not nearly
as bad as Set and Dictionary.

But the hashed-collection hash-spreading and prime-table-size code I
sent yesterday gets rid of those bumps, so it's fast from very small
sizes up through sizes > 10000, with very gradual degradation after
that.

Regards,

-Martin
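Why those bumps appear, and why spreading helps, can be seen in a
workspace (a sketch of the idea only; this is not Martin's code). With
roughly 12 significant identity-hash bits, raw hashes never reach the
upper slots of a large table, while a multiplicative spread covers all
of it:

   ((1 to: 4096) collect: [ :h | h \\ 8192 ]) max.
       "=> 4096: only the lower half of an 8192-slot table is ever
        used"
   ((1 to: 4096) collect: [ :h | h hashMultiply \\ 8192 ]) max.
       "=> well above 4096: the spread hashes cover the whole table"

A prime table size then also avoids the multiple-of-7 style resonance
discussed earlier in the thread.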
Martin et al,
Here is a series of checkpointed changesets that mutate the latest RC
Pharo image. Load the changesets in strict alphabetical order. I made a
few small changes compared to Martin's changes; hopefully I didn't miss
anything.

Note that the changesets S1 through S3 modify SystemDictionary to be a
hash-based collection. For the SystemDictionary sizes I have in my
image, this makes things slower. However, I'd expect that, as the
system dictionary grows, using #hash will become faster than using
#scaledIdentityHash. If you want to load these changes later, make sure
to load changeset A, then S1 through S3, and finally Z (I am hoping B's
hacks, which are undone by W, are not needed).

Enjoy!... I hope :). Let me know what you think.

Andres.

Martin McClure wrote:
> SystemDictionary is a subclass of IdentityDictionary, which does have
> some performance problems right around 2K elements, and more severe
> performance problems at sizes around 4K elements, but it's not nearly
> as bad as Set and Dictionary.
> [snip]

HashChanges.zip (30K)
In reply to this post by Lukas Renggli
Quick tests show that using #hash instead of #scaledIdentityHash makes
the system dictionary need 30% more time to retrieve all its keys. I
didn't try adding. The system dictionary size was about 3500.

Lukas Renggli wrote:
> Yeah, that could be a problem. Strangely, SystemDictionary is a
> subclass of IdentityDictionary.
>
> It can easily be fixed; I don't know if this has much of an impact on
> performance if the dictionary is smaller:
>
>    SystemDictionary superclass: Dictionary.
>    SystemDictionary allInstances do: [ :each | each rehash ]
>
> Lukas
> [snip]
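A measurement along these lines can be reproduced with something like
the following (a sketch; this is not Andres's actual benchmark):

   | d |
   d := Smalltalk.   "the system dictionary, ~3500 entries at the time"
   [ 100 timesRepeat: [ d keysDo: [ :key | d at: key ] ] ] timeToRun.
       "compare the milliseconds before and after switching the
        superclass to Dictionary and rehashing"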