[vwnc] [vw76] Point>>hash still not good enough

[vwnc] [vw76] Point>>hash still not good enough

Reinout Heeck

Point>>hash is not defined symmetrically over its x and y variables:

    ^x hash hashMultiply bitXor: y hash


As a result performance of hashed collections containing points shows
very asymmetrical run times:


Time millisecondsToRun:[
    d := Dictionary new.
    1 to: 3 do:[:x |  1 to: 1000 do: [:y |
            d at: x@y put: #foo]].
    1 to: 3 do:[:x |  1 to: 1000 do: [:y |
            d removeKey: x@y]].
]
"30532 30022 29883"


Time millisecondsToRun:[
    d := Dictionary new.
    1 to: 3 do:[:y |  1 to: 1000 do: [:x |
            d at: x@y put: #foo]].
    1 to: 3 do:[:y |  1 to: 1000 do: [:x |
            d removeKey: x@y]].
]
"8 8 8"




If I change the implementation of Point>>hash to be symmetric:
    ^x hash hashMultiply bitXor: y hash hashMultiply

the run times reported are
" 9 10 9 "
and
" 9 10 10"




HTH,

Reinout
--------



Re: [vwnc] [vw76] Point>>hash still not good enough

Andres Valloud
Hmmm... I think what is going on is that clustering is not letting the
dictionary do its job quickly.  In other words, the hash of y being left
alone results in chunks of 1000 points wanting to be together, and the
position of these chunks in the hashed collection is controlled by x
hash hashMultiply.
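
The clustering is easy to see in a workspace.  A minimal sketch, assuming
(as the timings suggest) that SmallInteger>>hash answers something very
close to the receiver:

    | base |
    base := 3 hash hashMultiply.
    (1 to: 8) collect: [:y | base bitXor: y hash]
    "the results differ only in their low bits, so for a fixed x all 1000
     hashes of x@1 ... x@1000 land in one contiguous region of the table"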

There are a number of things to observe.  First is that doing this,

Time millisecondsToRun:[
    d := Dictionary new: 1300000.
    1 to: 3 do:[:x |  1 to: 1000 do: [:y |
            d at: x@y put: #foo]].
    1 to: 3 do:[:x |  1 to: 1000 do: [:y |
            d removeKey: x@y]].
]

cuts down the running time to under 100 ms.  The second thing is that
the suggestion, now being symmetric on x and y, has other issues as
well.  For example, now it will always be the case that

x@y hash = y@x hash

no matter how well the hashes of x and y are mixed / post processed.
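
A quick workspace check, treating the proposed function as a block; since
bitXor: is commutative, the collision holds for any pair:

    | h |
    h := [:p | p x hash hashMultiply bitXor: p y hash hashMultiply].
    (h value: 3@7) = (h value: 7@3)   "true"
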
That is the motivation for using non-symmetric hash functions when order
is important for comparison.  In fact, using the Hash Analysis Tool
reveals that the current implementation has these properties:

Collision rate: 1.
Chi square: 0.
Chi square mod p average: 0.343364.

On the other hand, the suggested hash function has these properties:

Collision rate: 2.346245 (anything over 1.01 is bad).
Chi square: 2.384693 (anything significantly over 0 is bad).
Chi square mod p average: 2.856256 (anything significantly over 0.5 is bad).

For the sake of completeness, the original VW 7.5 implementation that
was replaced produced these numbers:

Collision rate: 244.140625.
Chi square: 242.524372.
Chi square mod p average: 242.524372 (bonus bad points for not
distributing well mod p).

When doing the hash book, one of my data sets was indeed the 1000x1000
point data set.  It gave me headaches like you wouldn't believe.  I
tried your suggestion, as well as many others, and discarded them
because of their collision rate.  Of the hash functions that gave
perfect collision rate, chi square, and chi square mod p averages, what
happened was that individual chi square mod p values had issues.  This
is indicative of clustering, and this is what you found.

While I ended up creating hash functions that had nice chi square mod p
values at the cost of about 25k collisions in a data set of 10^6 points,
the issue is that they require considerably more work than a single line
of code.  Rather, they are about 5 long lines of stuff that *looks* like
magic constants and unwarranted complication.  The value judgment here
was that in practice it would be rare to see a dictionary of all the
integer points between 0@0 and 999@999 (or 1@1 and 1000@1000), and so
the simpler hash function already provides far more value than the old
one.

The last observation to make is that nothing pays more than knowing your
data set.  For example, if one really wanted to put the 1000x1000 square
point data set in a dictionary, then one could actually *use* the
characteristics of the data set.  What could be done?  Something like
subclassing Point and reimplementing hash like this:

hash

  ^x * 1000 + y
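
Spelled out as a sketch, with an invented subclass name and assuming the
inherited Point class>>x:y: constructor:

    Point subclass: #GridPoint

    GridPoint>>hash
      "A perfect hash for integer points with x and y in 1..1000:
       no two such points share a hash value."
      ^x * 1000 + y

    (GridPoint x: 2 y: 500) hash   "2500"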

In the face of this implementation, you could even claim the 5 lines of
densely packed code look ridiculous for this specific application.  On
the other hand, the hash function above uses information that is not
available to those that try to provide a hash function that will
actually work in most generic circumstances.

Andres.

PS: did you get to the section of the hash book that talks about
Point>>hash yet? :)


Re: [vwnc] [vw76] Point>>hash still not good enough

Reinout Heeck
Valloud, Andres wrote:
> Hmmm... I think what is going on is that clustering is not letting the
> dictionary do its job quickly.
Yes, it surprised me that this case was not fixed in the hashing overhaul.


> There are a number of things to observe.  First is that doing this,
>
>     d := Dictionary new: 1300000.
>  

Here the client of the dictionary has to be aware of the deficiencies in
the hashed collection implementations and tune its way around them.  IMO
that is the wrong place (repetitious, fragile, and in general cases like
library code impossible to pre-tune).
I had hoped these days would be over...

> The value judgment here
> was that in practice it would be rare to see a dictionary of all the
> integer points between 0@0 and 999@999 (or 1@1 and 1000@1000), and so
> the simpler hash function already provides far more value than the old
> one.
>  
In our practice this is not so rare: roughly any system that works
with integer IDs (in our case order numbers) will tend to see such
clusters of numbers used as dictionary keys or set members.  This
time around I ran into it with UI code, though: I'm busy overlaying the
display of dataset cells with a short animation whenever the value in a
cell changes.  For that I need a map from cell coordinates to some helper
objects, and indeed the coordinates of these Points come mostly in clusters.

My point is that the value judgment above does not hold water in my
daily experience (with the caveat that I am putting strong demands on
the base library: I want my code to be able to use a hashed collection
of points when I mean that, and I would like such code to be
performant enough without the need for cluttering it with special
collection tuning at every incarnation).

So (denying this value judgment) I conclude there is an impedance
mismatch between the new #hash implementations and the hashed
collections that use them:
the new hashing may yield clustered hash values if the values being
hashed are clustered, while on the other hand the hashed collections
break down with such clustered hash values.
My first proposal tried a quick fix on the first half, the hashing
implementation, but as you point out it has quality issues.  Another
solution to removing the impedance mismatch would be to fix the
collections, for example by throwing in an extra #hashMultiply in
#findKeyOrNil:.  I'll play some more with that idea in the coming week;
a rough sketch of the effect I am after follows.
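
The modulo below is just a stand-in for the real table-index computation,
not VW's actual code:

    | hashes |
    hashes := (1 to: 8) collect: [:y | 3 hash hashMultiply bitXor: y hash].
    (hashes collect: [:h | h \\ 4096])
      -> (hashes collect: [:h | h hashMultiply \\ 4096])
    "the first array is one contiguous run of indices,
     the second is scattered across the table"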


> The last observation to make is that nothing pays more than knowing your
> data set.
But that comes at a price: you have to be prepared to have your
client code express these details.
In large applications that need to stay maintainable in the light of
frequent requirement changes I want to be able to write code as naively
as possible, so that only business rules are expressed.  I don't want
future maintainers to be bogged down by tuning issues when they need to
concentrate on business functionality.  Hence my denial of the value
judgment that was applied.


> PS: did you get to the section of the hash book that talks about
> Point>>hash yet? :)
>  
No, exercise 1.12 (Fermat's last theorem) has got a total grip on my
mind.  When I thought to myself 'how would a hardware hacker solve that?'
I had the (mis?)fortune of having a vague insight into how to restate the
problem so that a weaker proof is required.  Whenever I pick up the book
now I cannot help myself; I just wander off thinking about that
exercise.  So I guess I'll have to bite the bullet, sit down, and write
out my thoughts on this to get it out of my system, so I can continue with
your book :-)

R
-

Re: [vwnc] [vw76] Point>>hash still not good enough

Andres Valloud
Reinout Heeck wrote:
> Valloud, Andres wrote:
>> Hmmm... I think what is going on is that clustering is not letting the
>> dictionary do its job quickly.
> Yes, it surprised me that this case was not fixed in the hashing overhaul.

For all hash functions there will be datasets that cause them to behave
badly.  This cannot be avoided.

> I had hoped these [value judgment] days would be over...

It is always a value judgment.  We could just resort to using MD5 for
everything, but then that would be too costly.  Plus, it would be a bad
hash if anybody decided to build a dictionary of MD5 collisions --- and
I know some people who have actually dealt with this particular matter.  
In other words, it is not possible to thoroughly "win" this game.

> In our practice this is not so rare: roughly any system that works
> with integer IDs (in our case order numbers) will tend to see such
> clusters of numbers used as dictionary keys or set members.  This
> time around I ran into it with UI code, though: I'm busy overlaying the
> display of dataset cells with a short animation whenever the value in a
> cell changes.  For that I need a map from cell coordinates to some helper
> objects, and indeed the coordinates of these Points come mostly in clusters.

Ok, then I guess something more complex is warranted.  Instead of what
you suggested, you may want to try this:

Point>>hash

  | xHash |
  xHash := self x hash hashMultiply.
  "Swap the low and high 14-bit halves of the 28-bit xHash (the low half
   shifted one extra bit), so that x and y contribute asymmetrically and
   transposed points no longer collide."
  xHash := (xHash bitShift: -14) bitXor:
    ((xHash bitAnd: 16r3FFF) bitShift: 15).
  ^xHash bitXor: self y hash hashMultiply


There are other, much more complicated implementations that are also
possible, but I'd rather see the need before using heavy-handed stuff.

> My point is that the value judgment above does not hold water in my
> daily experience (with the caveat that I am putting strong demands on
> the base library [...])

Right... and sooner or later it will be the case that somebody runs into
a situation in which the hash function above is also problematic...

> Another
> solution to removing the impedance mismatch would be to fix the
> collections, for example by throwing in an extra #hashMultiply in
> #findKeyOrNil:.  I'll play some more with that idea in the coming week.

Applying hashMultiply to every single hash value will cause problems
(truncation to 28 bits), will disturb perfect hash functions for no good
reason, and may introduce distribution issues that were not present
before its use.  Moreover, it is equivalent to a permutation mod 2^28,
and as such it will not improve distribution when there are no
collisions to begin with.  In other words, it may be more of a
performance hit than a benefit.  I'd strongly recommend against doing that.
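
That permutation claim is easy to check in a workspace: multiplication by
an odd constant mod 2^28 is injective, so #hashMultiply neither creates
nor removes collisions among hash values; it only relocates them.

    ((1 to: 100000) collect: [:i | i hashMultiply]) asSet size
    "100000: all distinct inputs remain distinct"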

>> The last observation to make is that nothing pays more than knowing your
>> data set.
> But that comes at a price: you have to be prepared to have your
> client code express these details.

Yes, there is always a price for everything.  This is normal.

>> PS: did you get to the section of the hash book that talks about
>> Point>>hash yet? :)
> No, exercise 1.12 (Fermat's last theorem) has got a total grip on my
> mind.  When I thought to myself 'how would a hardware hacker solve that?'
> I had the (mis?)fortune of having a vague insight into how to restate the
> problem so that a weaker proof is required.  Whenever I pick up the book
> now I cannot help myself; I just wander off thinking about that
> exercise.  So I guess I'll have to bite the bullet, sit down, and write
> out my thoughts on this to get it out of my system, so I can continue with
> your book :-)

Heh :)... funny how that happens... I had something like that occur to
me with Concrete Mathematics' exercise 5.112... to show that
$\binom{2n}{n}$ is always divisible by 4 or 9, except when (IIRC) 2n =
64 or 2n = 256.  It is marked as a research problem, although I think
Granville et al solved it in 1996.

Andres.

Re: [vwnc] [vw76] Point>>hash still not good enough

Reinout Heeck

On May 9, 2008, at 9:33 PM, Andres Valloud wrote:

> Reinout Heeck wrote:
>> Valloud, Andres wrote:
>>> Hmmm... I think what is going on is that clustering is not letting the
>>> dictionary do its job quickly.
>> Yes, it surprised me that this case was not fixed in the hashing overhaul.
>
> For all hash functions there will be datasets that cause them to behave
> badly.  This cannot be avoided.
>
>> I had hoped these [value judgment] days would be over...

Heh :-)

What I meant there was the days of worrying about clustered values.



More generally, before the hashing overhaul I had run into three
problems repeatedly:
1) Poor performance of hashed collections holding strings.
2) Poor performance of hashed collections holding clustered integer
values.
3) Poor abstraction of hash combination 'for dummies', leading to poor
#hash implementations on domain objects.

An earlier conversation we had extinguished my hopes for any solution
to 3) in the overhaul, but I had fully expected both 1) and 2) to be
resolved by the redesign.  Now I get the impression only 1) has been
addressed adequately.

You seem to be dismissing 2) as a special case that need not perform
well with the base collection implementations.  Since this is a value
judgment we can keep debating it; instead I can provide you a simple
data point: my shop would be relieved when 2) is addressed.


I had a quick look (for the first time) at the hash analysis tool in
order to measure the performance of some hashed collection access
patterns.  I only found quality assessments of #hash implementations
for integer-based values, but no benchmark tool to measure the run time
of collection accesses; it seems this is not supported.  Is there some
other tool I need to load in order to measure the performance of
hashed collections?



Thanks for your suggestions regarding Point>>hash; I'll experiment
with them when I'm back at work (bank holiday here today).

Cheers,

Reinout
-------


Re: [vwnc] [vw76] Point>>hash still not good enough

Andres Valloud
Reinout,

In the past, your comments made me conclude that you prefer hash
functions that take little time to evaluate over hash functions that
distribute hash values very well but take their time to do so.

However, now that Point>>hash substantially improves on what shipped
with VW 7.5 and before in terms of distribution, while keeping
computational cost near a bare minimum, it is still a problem because
it does not distribute well enough in cases where a 2d array (or a
hash bucket strategy, as opposed to open-addressing linear probing)
would serve you better.
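
For the grid-shaped case, the 2d array alternative is straightforward; a
sketch, sized for the 3 x 1000 benchmark from earlier in this thread:

    | grid |
    grid := (1 to: 3) collect: [:each | Array new: 1000].
    1 to: 3 do: [:x | 1 to: 1000 do: [:y |
        (grid at: x) at: y put: #foo]].
    (grid at: 2) at: 500   "lookup for 2@500, with no hashing involved"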

What I am basically hearing in response to this situation is "make the
hash function more expensive so my problem goes away".  However, if
Point>>hash is modified to do more work in order to distribute better,
then it starts becoming a target of your previous admonitions regarding
execution speed.

While I do not necessarily disagree with improving the distribution of
Point>>hash further at some expense of hash value computation, my
concern is that it is not possible to win.

Regarding the hash analysis tool, the stated measurements are direct
predictors of hashed collection performance.  See chapter 2 of the hash
book.

Andres.



Re: [vwnc] [vw76] Point>>hash still not good enough

Hernan Wilkinson
Hi all,
 I agree with Andres when he says that it is impossible to have the best solution.  For me, this means that having a message #hash as a responsibility of the objects is not the right solution.  Another object should have that responsibility, so that different hash algorithms could be used depending on the context... for sure somebody else has thought about this, but here are some Smalltalk lines of what I mean:

  Set hashingWith: [ :aPoint | aPoint quickHash ].
  Set hashingWith: [ :aPoint | aPoint goodDistributionHash ].
  Set hashingWith: [ :aPoint | aPoint x bitXor: aPoint y ].
  etc.
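
A self-contained way to approximate this without new collection classes
could be a wrapper key that delegates #hash to a block; all the names
below are made up:

  Object subclass: #HashedKey   "instance variables: object hashBlock"

  HashedKey class>>on: anObject hashingWith: aBlock
    ^self new setObject: anObject hashBlock: aBlock

  HashedKey>>setObject: anObject hashBlock: aBlock
    object := anObject.
    hashBlock := aBlock

  HashedKey>>object
    ^object

  HashedKey>>= other
    "Equal when the wrapped objects are equal."
    ^self class == other class and: [object = other object]

  HashedKey>>hash
    ^hashBlock value: object

  "usage:"
  d := Dictionary new.
  d at: (HashedKey on: 3@7 hashingWith: [:p | p x * 1000 + p y]) put: #foo.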

Bye,
Hernan.




Re: [vwnc] [vw76] Point>>hash still not good enough

Andres Valloud
Failing that, there is always subclassing:

Point subclass: #PointForThisParticularUsage

PointForThisParticularUsage>>hash

  ^self someValueThatMakesSenseInThisParticularContext
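
For the 1000x1000 grid discussed earlier, for instance, that value could
simply be the perfect hash ^x * 1000 + y.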

Andres.


Re: [vwnc] [vw76] Point>>hash still not good enough

Hernan Wilkinson
Hmmm... yes, but with more coupling... I prefer forwarding over subclassing; it is more flexible and dynamic.  But as you say, it is another option.
