HashTable vs Dictionary

HashTable vs Dictionary

jannik laval
Hi,

For some weeks now, we have been using the HashTable project in our
projects; it is available at http://www.squeaksource.com/ht.html

It could be a good idea to integrate these collections to replace Dictionary and Set.


---
Jannik Laval
PhD Student - Rmod Team - INRIA
Certified Project Management Associate (IPMA)
---



Re: HashTable vs Dictionary

Nicolas Cellier
Do you mean deep in the Smalltalk kernel too, or only in some applications?


Re: HashTable vs Dictionary

Lukas Renggli
In reply to this post by jannik laval
2009/10/20 Laval Jannik <[hidden email]>:

> [...]
> It could be a good idea to integrate these collections to replace Dictionary
> and Set.

I don't think that this is generally a good idea. It might be
beneficial in your situation, but certainly not for everybody in all
situations. I've been using the HashTable implementation in several
projects with great success. In other projects I've been using the
BTree package on SqueakSource, which provides similar functionality
optimized for yet another context. These highly specialized data
structures should only be applied if they make sense in the given
context (and if you understand both the cost and the gain). I guess this
was exactly the reason why SkipList was removed from the Pharo image.

Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch


Re: HashTable vs Dictionary

Andres Valloud-4
In general terms, open addressing with linear probing should tie or beat
chaining.  For this to happen, the hash function used must be of high
quality.  If chaining is beating open addressing with linear probing,
I'd strongly suspect that the hash function used is producing
significant collisions or unevenly distributed values for the data set
in question.  You may want to take a look at the Hash Analysis Tool I
wrote for VW (it's in the public Store repository under an MIT license).
For a more in-depth treatment of these matters, you may also want to
look at the hash book I wrote (www.lulu.com/avSmalltalkBooks), and its
matching Smalltalk Solutions 2008 presentation
(http://video.google.com/videoplay?docid=-7510297003494499688&hl=en#).
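
For concreteness, a simplified sketch of what open addressing with
linear probing looks like as a Squeak/Pharo-style #scanFor: (equality
variant; illustrative only, the shipped methods differ in details such
as how a full table is reported):

scanFor: anObject
  "Answer the index of the slot holding anObject, or of the first
  empty (nil) slot after its hash position. Probe linearly, wrapping
  around the end of the table; answer 0 if no slot is found."
  | finish start element |
  finish := array size.
  start := (anObject hash \\ finish) + 1.
  start to: finish do: [ :index |
    element := array at: index.
    (element isNil or: [ element = anObject ]) ifTrue: [ ^index ] ].
  1 to: start - 1 do: [ :index |
    element := array at: index.
    (element isNil or: [ element = anObject ]) ifTrue: [ ^index ] ].
  ^0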

Andres.




Re: HashTable vs Dictionary

Martin McClure-2
Andres Valloud wrote:

> In general terms, open addressing with linear probing should tie or beat
> chaining.  For this to happen, the hash function used must be of high
> quality. [...]

You are, of course, correct. I agree that the place to focus on is the
hash functions being used. And I believe that there are some simple
improvements that can be made.

From a (very!) brief look, this HashTable implementation appears to be
inspired by difficulties caused by having only 4096 identity hash values
in Squeak/Pharo. If you have a large collection using identity hash you
will, by definition, have lots of collisions.

The *best* (and most difficult) solution is to increase the number of
identity hash values that can be stored in the object header. Until that
can be done, we adjust as best we can...

--------------

First, the identity-based collections.

The current IdentitySet and IdentityDictionary implementations do adjust
the hash function for large table sizes, with this code in #scanFor:

   finish > 4096
     ifTrue: [hash := anObject identityHash * (finish // 4096)]
     ifFalse: [hash := anObject identityHash].
   start := (hash \\ finish) + 1.

This makes IdentityDictionary pretty much as fast as it can be at most
sizes, but there's a problem here. Take a look at the HashTable
comparison graph:

http://img247.echo.cx/img247/3702/chart8vv.png

IdentityDictionary is fast at most sizes, but has problems right around
4K of size. I'm going to assert (without testing, always dangerous in
performance work!) that this is because the hash spreading algorithm
above always leaves an area at the end of the table array that never
gets any hash values mapped to it. This will be worst when the table
size is between 4K and 8K, when the multiplier is still 1. When the
number of objects in the dictionary gets high enough that the table
grows again, the hashes start being spread out and it gets fast again.
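
A quick workspace check illustrates the dead zone (assuming, as above,
identityHash values in the range 0..4095):

  | finish starts |
  finish := 6000.  "any table size between 4097 and 8191 behaves alike"
  starts := (0 to: 4095) collect: [ :h |
    (h * (finish // 4096) \\ finish) + 1 ].
  starts max  "=> 4096: slots 4097..6000 never get a direct hit"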

I'd suggest trying to replace the above code with something like this:

   hash := anObject identityHash * (finish // 4096 + 1).
   start := (hash \\ finish) + 1.

Not only better performing (I expect) but simpler. Possibly even
better-performing might be this:

   hash := anObject identityHash * <well-chosen constant>.
   start := (hash \\ finish) + 1.

where the constant is large enough to get a spread across most of the
positive SmallInteger range, but not large enough to exceed it. Andres,
what's a good choice for the constant? Maybe a prime that is never close
to a prime used for a table size in these collections?


---------------

As the graph shows, Dictionary has a real problem when grown above 4000
entries, and I would expect Set to have a similar problem. I assume this
is with objects that do not define a well-spread hash function, such as
objects that implement hash as identity hash. And this is quite a few
objects.

I'd suggest using a hash-spreading function on the equality-based
collections as well. Something like that used for IdentitySet *could*
work, but if you give that formula objects whose hashes *are* already
well-spread across the entire SmallInteger range, on most entries you'd
be creating short-lived large integers, and if you're looking for
performance you don't want to do that. Using #hashMultiply would
probably improve things quite a bit, though it might be possible to do a
bit better than that.
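
For concreteness, the change could be as small as this (a sketch against
the usual #scanFor: shape; in Squeak/Pharo, #hashMultiply computes
(self * 1664525) bitAnd: 16r0FFFFFFF, so it spreads the bits while
staying within SmallInteger range):

  "current index computation in the equality-based collections:"
  start := (anObject hash \\ finish) + 1.

  "with hash spreading:"
  start := (anObject hash hashMultiply \\ finish) + 1.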

With changes along these lines, I agree with Andres that linear-probing
collections should be able to equal or beat the performance of chained
collections, and are a better choice for general-purpose collections.

Regards,

-Martin


Re: HashTable vs Dictionary

Andres Valloud-4
Martin,
> You are, of course, correct.

Well yes, thank you :), but it also helps when good, qualified people go
over one's assertions to make sure they're not wrong.

> --------------
>
> First, the identity-based collections.
> [...]
> IdentityDictionary is fast at most sizes, but has problems right around
> 4K of size. I'm going to assert (without testing, always dangerous in
> performance work!) that this is because the hash spreading algorithm
> above always leaves an area at the end of the table array that never
> gets any hash values mapped to it. [...]

At first sight, it seems to me that this should be fixed along the lines
of what you propose.

> [...]
> where the constant is large enough to get a spread across most of the
> positive SmallInteger range, but not large enough to exceed it. Andres,
> what's a good choice for the constant? Maybe a prime that is never close
> to a prime used for a table size in these collections?

Actually, since Squeak/Pharo have hashMultiply, I'd guess that constant
(1664525, IIRC) should suffice for the vast majority of cases (1664525 *
4095 > 1 billion, that should do).  Hashed collection sizes (past the
first trivial tiny sizes) should be chosen carefully.  Knuth has some
suggestions for how to select good prime sizes.  In fact, 1664525 should
be seen as one potential hashed collection size too (but never used as
such).  The hash book has all the details on this.

> I'd suggest using a hash-spreading function on the equality-based
> collections as well.

Ideally, I'd do something along the lines of

Object>>scaledIdentityHash

  ^self identityHash bitShift: 18
    "I think, basically as much as possible without creating a large
    integer --- note that multiplying by a power of two has the effect
    of a permutation modulo a well chosen prime, because
    gcd(2^k, wellChosenPrime) = 1 and so 2^k is invertible mod
    wellChosenPrime"


implement

Object>>hash

  ^self scaledIdentityHash  "as opposed to identityHash"


and then substitute sending scaledIdentityHash for identityHash in the
identity-based collections.  This has several benefits.

* All the scaling code in hashed collections, both identity and equality
if they scale values, can be thrown away.

* The index for an object becomes object hash "or scaledIdentityHash" \\
tableSize + 1 in all cases.  That's so little work the whole method in
which it is implemented can be inlined, removing another message send
and making collections even more efficient.

* Other senders of identityHash are not affected, particularly those
that assume identityHash has a certain number of bits and so on.


Finally, I am assuming that chaining means that each slot in the hash
table has a pointer that says where the collision chain continues (or a
node that points to the next node in the collision chain).  It may be
worthwhile to entertain using hash buckets, where the whole chain is in
an array, particularly for identity collections if there's a primitive
that scans for an object in an array.  That structure is also useful for
implementing the symbol table (and all sorts of other tables, actually).
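
A minimal sketch of the bucket idea (illustrative only; the selector and
the table ivar are hypothetical): each slot of the table is either nil
or an Array holding all the colliding elements, so a lookup scans one
contiguous array instead of chasing per-node links.

lookupBucketFor: anObject
  "Answer the stored element equal to anObject, or nil if absent."
  | bucket |
  bucket := table at: (anObject hash \\ table size) + 1.
  bucket isNil ifTrue: [ ^nil ].
  ^bucket detect: [ :each | each = anObject ] ifNone: [ nil ]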

As Martin says, modulo special cases where you know you will have a lot
of collisions, open addressing with linear probing is the way to go.
And, for special cases such as that of identityHash, it may be
worthwhile to consider comparing with == but implementing identityHash
in terms of hash.  After all, certainly if two objects are == they
should have the same hash (and identityHash) values.

Andres.


Re: HashTable vs Dictionary

Stéphane Ducasse
In reply to this post by Lukas Renggli
Yes, but what Jannik wants to say is that having a slow dictionary in
the image and a fast one outside is not really a good strategy.

Stef


Re: HashTable vs Dictionary

Stéphane Ducasse
In reply to this post by Andres Valloud-4
Andres, so that I understand clearly: is Dictionary chaining or open
addressing? I do not remember and am too busy to check.

Stef


Re: HashTable vs Dictionary

Stéphane Ducasse
In reply to this post by Martin McClure-2

> From a (very!) brief look, this HashTable implementation appears to be
> inspired by difficulties caused by having only 4096 identity hash values
> in Squeak/Pharo. If you have a large collection using identity hash you
> will, by definition, have lots of collisions.

yyyyyyyyeeeeeeeeeeesssssssssssssssssssssssss
and get linearrrrrrrrrrrrrrrrrrrrrly boring and slow.

> [...]
> With changes along these lines, I agree with Andres that linear-probing
> collections should be able to equal or beat the performance of chained
> collections, and are a better choice for general-purpose collections.

Martin,
don't you have some changes so that we could try? :)

Stef


Re: HashTable vs Dictionary

Lukas Renggli
> yyyyyyyyeeeeeeeeeeesssssssssssssssssssssssss
> and get linearrrrrrrrrrrrrrrrrrrrrly boring and slow.

Yes, but only if your objects don't implement #hash properly. HashTable
trades space and time to get O(1) behavior IF you have lots of
elements with bad hash values. If the hash values are good (which is
the case most of the time), or if you do not have that many elements
(also the case most of the time), it is a waste of time and space
though.

Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch


Re: HashTable vs Dictionary

Martin McClure-2
In reply to this post by Stéphane Ducasse
Stéphane Ducasse wrote:

[...]

>> With changes along these lines, I agree with Andres that linear-
>> probing
>> collections should be able to equal or beat the performance of chained
>> collections, and are a better choice for general-purpose collections.
>>
>> Regards,
>>
>> -Martin
>
> Martin,
> don't you have some changes so that we could try? :)

Sigh. Too many projects...

But I'll try to get you something. Andres' post had some good
suggestions that will change what I implement.

-Martin


Re: HashTable vs Dictionary

Igor Stasenko
In reply to this post by Lukas Renggli
2009/10/21 Lukas Renggli <[hidden email]>:

>> yyyyyyyyeeeeeeeeeeesssssssssssssssssssssssss
>> and get linearrrrrrrrrrrrrrrrrrrrrly boring and slow.
>
> Yes, but only if your objects don't implement #hash properly. [...]
>
Right.
That's why each of the implementations (Dictionary and HashTable) has
its own niche.
For small sizes, dictionaries in their current state are best.
For bigger sizes, one could choose HashTable, or any other
implementation that fits better.

Ultimately, there is no solution which is best in all cases (space,
access time, insertion/removal time, etc.).




--
Best regards,
Igor Stasenko AKA sig.


Re: HashTable vs Dictionary

Andres Valloud-4
Igor et al,
> That's why each of the implementations (Dictionary and HashTable) has
> its own niche. For small sizes, dictionaries in their current state are
> best. For bigger sizes, one could choose HashTable, or any other
> implementation that fits better.

As long as this refers to objects that implement #hash as #identityHash,
yes.  If you can produce a (reasonably efficient) hash function that has
very few collisions, then open addressing with linear probing (the usual
implementation of Set/Dictionary) tends to be the most efficient.

Andres.


Re: HashTable vs Dictionary

Igor Stasenko
2009/10/22 Andres Valloud <[hidden email]>:

> Igor et al,
> [...]
> As long as this refers to objects that implement #hash as #identityHash,
> yes.  If you can produce a (reasonably efficient) hash function that has
> very few collisions, then open addressing with linear probing (the usual
> implementation of Set/Dictionary) tends to be the most efficient.
>

Well, you can always attack the problem from a different side: create a
Set and Dictionary which use a perfect hash algorithm.
As you may know, a perfect hash is a function which produces unique
values for a limited set of elements.
So in sets hashed using a perfect hash function, no collisions are
possible, and hence the access time to any element is O(1).

The drawback of such an approach is that finding a perfect hash function
for a set of elements is time consuming (the best algorithms, AFAIK,
consume O(k*n) time, where n is the number of elements), so you would
never want to use perfect hashing for collections which tend to grow or
shrink often.

But for collections which are populated once and then used for read-only
access, this is worth doing.
And it's quite easy to modify the current classes to use a perfect hash:
just add a perfectHash ivar, and if it is not nil, use it to find the
index of an item. If perfectHash is nil, we just use the current
implementation.

And of course, you need to implement the perfect hash search itself,
which is a separate matter :)
But then you could just do:

set := (something collect: [ ... ]) asSet rehashWithPerfectHash.

and voila, you get a set with no collisions :)




--
Best regards,
Igor Stasenko AKA sig.


Re: Perfect hash [was: Re: HashTable vs Dictionary]

Martin McClure-2
Igor Stasenko wrote:

> Well, you can always attack the problem from a different side: create a
> Set and Dictionary which use a perfect hash algorithm.
> [...]
> But for collections which are populated once and then used for read-only
> access, this is worth doing.


Hi Igor,

I'm afraid I don't see the usefulness. What kind of read accesses will
you be doing on this hashed collection? I can think of only two kinds of
access that make sense: The 'is this object in this collection'
question, and iterating over the collection to do some operation on all
objects therein.

For iteration, you might as well use an Array, and forget hashing
altogether.

For the 'is this object in this collection' question, you might more
straightforwardly replace the perfectHash instvar with a Boolean
isInTheCollection instvar. Saves space and time; you don't need the
collection at all.

Perfect hashing sounds cool, but is of quite limited usefulness. I can
see it being useful when *all* of the following are true:

* There are a relatively small number of possible objects in the
universe that is being hashed. Then each object can be given a unique
(thus "perfect") hash value.

* There are multiple collections that can contain exactly this universe
of objects, and no others.

* Membership in each collection by a given object is not easily
predictable, or might vary over time.

> And it's quite easy to modify the current classes to use a perfect hash
> [...]
> But then you could just do:
>
> set := (something collect: [ ... ]) asSet rehashWithPerfectHash.

I don't think you can easily use class Set, or any existing
general-purpose collection class, to get a perfect hash. Existing
Smalltalk collection classes combine two functions sequentially to form
the hash function: The first function is a property of the object being
stored (typically implemented as #hash or #identityHash), and the second
function is a property of the collection itself (typically mod with the
current prime table size).

If your object knows its perfect hash with respect to a given
collection, you don't want that collection adding anything to that
function. In particular, you don't want the collection to take a modulo,
because then your hash function is no longer perfect and collisions
become quite likely again, defeating the entire purpose.

In general, you'd set it up so that the perfect hash value is exactly
the table index reserved for that object in the collection. Then that
slot either contains a reference to that object, or nil. In any
perfect-hash situation, you need a table the same size as your universe
in every collection. This means that the universe cannot change size
easily, and once it gets very large the space costs are no longer worth
the time savings.

Regards,

-Martin


Re: Perfect hash [was: Re: HashTable vs Dictionary]

Igor Stasenko
2009/10/22 Martin McClure <[hidden email]>:

> Igor Stasenko wrote:
> [...]
> Perfect hashing sounds cool, but is of quite limited usefulness. I can
> see it being useful when *all* of the following are true: [...]
> In any perfect-hash situation, you need a table the same size as your
> universe in every collection. This means that the universe cannot change
> size easily, and once it gets very large the space costs are no longer
> worth the time savings.

Err, your thoughts went in the wrong direction.
Let me simplify the goal of perfect hashing:
- suppose you have N unique integer numbers, without a strictly
specified range (any number could lie in the range 0..infinity).
- now we need to find a function f such that, for each element x in
the above finite set, the following conditions are met:

0 <= f(x) < k*N

and for each pair of two different elements x, y in the original set
  f(x) <> f(y)

and k < infinity.

As you can see, this function maps arbitrary integer values without a
specific range to integer values in the range [0..k*N), with k >= 1.
Given that we only need to deal with a limited number of elements for
each particular set of numbers, such a function can be found.
We just need to search for an optimal (minimal) k, which should be < 2
and in the perfect case == 1.
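
A toy illustration (numbers picked by hand, purely hypothetical): for
the set #(10 23 54 81), f(x) = x \\ 7 happens to be perfect. It maps the
four elements to the distinct slots 3, 2, 5 and 4 in a table of size 7,
so k = 7/4 < 2:

  | keys images |
  keys := #(10 23 54 81).
  images := keys collect: [ :x | x \\ 7 ].
  images asSet size = keys size  "=> true: no collisions"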

>



--
Best regards,
Igor Stasenko AKA sig.


Re: Perfect hash [was: Re: HashTable vs Dictionary]

Martin McClure-2
Igor Stasenko wrote:

> Err, your thoughts went in the wrong direction.
> Let me simplify the goal of perfect hashing:
> [...]
> We just need to search for an optimal (minimal) k, which should be < 2
> and in the perfect case == 1.

Hi Igor,

Yes, you have defined a perfect hash function nicely, and that is the
definition I was assuming in my previous comments.

And even if you find a perfect hash function for a given set of objects,
the situations in which the resulting collection is useful are quite
limited, which was the main idea behind my comments.

Unless I've missed something?

Regards,

-Martin


Re: Perfect hash [was: Re: HashTable vs Dictionary]

johnmci
*cough*

I can remember sitting at a client (rather late at night), under
intense pressure, looking at a performance curve that went straight up
because, well, some dictionary went over N elements and the hash target
then resolved to a *very* long chain on an insert.

Sadly, mmm, 12 years later we have not been able to provide decent hash
logic for Smalltalk (aka Pharo) that removes the hassle of having to
consider some specialized form of dictionary, if the experienced
programmer in the room even twigs to the fact that this could be the
problem...




--
===========================================================================
John M. McIntosh <[hidden email]>   Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================






Re: Perfect hash [was: Re: HashTable vs Dictionary]

Igor Stasenko
In reply to this post by Martin McClure-2
2009/10/22 Martin McClure <[hidden email]>:

> Igor Stasenko wrote:
> [...]
> And even if you find a perfect hash function for a given set of objects,
> the situations in which the resulting collection is useful are quite
> limited, which was the main idea behind my comments.
>
> Unless I've missed something?
>
Limited, why?

As I said, suppose you have a collection which is modified rarely.
For example MethodDictionary and Smalltalk,
as well as any collection which you use in your code as immutable
storage, once populated.

This can be done easily with a few modifications to the current classes.
Consider:

Set>>scanFor: anObject

-  start := (anObject hash \\ finish) + 1.
+  start := hashFunction
+    ifNil: [ (anObject hash \\ finish) + 1 ]
+    ifNotNil: [ hashFunction value: anObject hash on: array ].


Now you can tell the set to use a custom hash function through a new method:

Set>>rehashUsing: newHashFunction

  hashFunction := newHashFunction.
  hashFunction
    ifNil: [ self rehashUsingOldWay ]
    ifNotNil: [ array := hashFunction rehash: array filler: nil ].


Oh, and by the way, such modifications don't assume which hash function
will be used, so you can forget about perfect hashing if you don't like
it :)
Mainly, what this does is just replace the hardcoded hash function,
which is '(anObject hash \\ finish) + 1', by one provided by you.

--
Best regards,
Igor Stasenko AKA sig.


Re: Perfect hash [was: Re: HashTable vs Dictionary]

Martin McClure-2
Igor Stasenko wrote:

> [...]
> This can be done easily with a few modifications to the current classes.
> [...]
> Mainly, what this does is just replace the hardcoded hash function,
> which is '(anObject hash \\ finish) + 1', by one provided by you.

Ah, thanks. That does indeed make more sense than what I originally
thought you meant, sorry for the confusion.

Pluggable hashes could be useful, though I've been quite happy with
using subclasses for that, which keeps the code of the general
collection class cleaner.

However, this approach does mean that for a perfect hash (not just a
pluggable hash function) you have to find an actual function that is
fast and compact and that is a perfect hash, or you've lost the slight
advantage that perfect hashes have over good but imperfect hashes. And I
doubt this can be done for something like a MethodDictionary. Also,
you'd have to add code to control the size of the hash table so that it
is the right size for the perfect hash function.

Regards,

-Martin
