Point>>hash is not defined symmetrically over its x and y variables:

    ^x hash hashMultiply bitXor: y hash

As a result, the performance of hashed collections containing points shows very
asymmetrical run times:

    Time millisecondsToRun: [
        d := Dictionary new.
        1 to: 3 do: [:x | 1 to: 1000 do: [:y | d at: x@y put: #foo]].
        1 to: 3 do: [:x | 1 to: 1000 do: [:y | d removeKey: x@y]].
    ]    "30532 30022 29883"

    Time millisecondsToRun: [
        d := Dictionary new.
        1 to: 3 do: [:y | 1 to: 1000 do: [:x | d at: x@y put: #foo]].
        1 to: 3 do: [:y | 1 to: 1000 do: [:x | d removeKey: x@y]].
    ]    "8 8 8"

If I change the implementation of Point>>hash to be symmetric:

    ^x hash hashMultiply bitXor: y hash hashMultiply

the run times reported are "9 10 9" and "9 10 10".

HTH,

Reinout
--------
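For readers following along in a workspace, here is a small sketch (not part of
Reinout's message) of why the asymmetry hurts: with the shipped hash, every
point sharing the same x lands on a nearby hash value, assuming
SmallInteger>>hash answers the receiver or something close to it, which is the
usual choice.

    "Sketch only: for a fixed x, consecutive y values yield hash values that
     differ only in their low bits, i.e. a tight cluster that a linearly
     probed table turns into long collision chains."
    | base |
    base := 1 hash hashMultiply.
    (1 to: 5) collect: [:y | base bitXor: y hash]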
Hmmm... I think what is going on is that clustering is not letting the
dictionary do its job quickly. In other words, leaving the hash of y alone
results in chunks of 1000 points wanting to be together, and the position of
these chunks in the hashed collection is controlled by x hash hashMultiply.

There are a number of things to observe. The first is that doing this,

    Time millisecondsToRun: [
        d := Dictionary new: 1300000.
        1 to: 3 do: [:x | 1 to: 1000 do: [:y | d at: x@y put: #foo]].
        1 to: 3 do: [:x | 1 to: 1000 do: [:y | d removeKey: x@y]].
    ]

cuts down the running time to under 100 ms.

The second is that the suggestion, now being symmetric on x and y, has other
issues as well. For example, it will always be the case that x@y hash = y@x
hash, no matter how well the hashes of x and y are mixed or post-processed.
This is the motivation for using non-symmetric hash functions when order is
important for comparison.

In fact, using the Hash Analysis Tool reveals that the current implementation
has these properties:

    Collision rate:             1
    Chi square:                 0
    Chi square mod p average:   0.343364

On the other hand, the suggested hash function has these properties:

    Collision rate:             2.346245  (anything over 1.01 is bad)
    Chi square:                 2.384693  (anything significantly over 0 is bad)
    Chi square mod p average:   2.856256  (anything significantly over 0.5 is bad)

For the sake of completeness, the original VW 7.5 implementation that was
replaced produced these numbers:

    Collision rate:             244.140625
    Chi square:                 242.524372
    Chi square mod p average:   242.524372  (bonus bad points for not
                                             distributing well mod p)

When doing the hash book, one of my data sets was indeed the 1000x1000 point
data set. It gave me headaches like you wouldn't believe. I tried your
suggestion, as well as many others, and discarded them because of their
collision rate. Of the hash functions that gave perfect collision rate, chi
square, and chi square mod p averages, what happened was that individual chi
square mod p values had issues. This is indicative of clustering, and this is
what you found.

While I ended up creating hash functions that had nice chi square mod p values
at the cost of about 25k collisions in a data set of 10^6 points, the issue is
that they require considerably more work than a single line of code. Rather,
they are about 5 long lines of stuff that *looks* like magic constants and
unwarranted complication. The value judgment here was that in practice it
would be rare to see a dictionary of all the integer points between 0@0 and
999@999 (or 1@1 and 1000@1000), and so the simpler hash function already
provides far more value than the old one.

The last observation to make is that nothing pays more than knowing your data
set. For example, if one really wanted to put the 1000x1000 square point data
set in a dictionary, then one could actually *use* the characteristics of the
data set. What could be done? Something like subclassing Point and
reimplementing hash like this:

    hash

        ^x * 1000 + y

In the face of this implementation, you could even claim the 5 lines of
densely packed code look ridiculous for this specific application. On the
other hand, the hash function above uses information that is not available to
those that try to provide a hash function that will actually work in most
generic circumstances.

Andres.

PS: did you get to the section of the hash book that talks about Point>>hash
yet? :)
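A quick workspace check (not part of the thread) of the forced collision
Andres points out above: because bitXor: is commutative, the symmetric variant
cannot tell x@y and y@x apart.

    "Reinout's symmetric proposal written as a block, so it can be probed
     without recompiling Point>>hash."
    | symmetric |
    symmetric := [:p | p x hash hashMultiply bitXor: p y hash hashMultiply].
    (symmetric value: 3@7) = (symmetric value: 7@3)    "true for every choice of x and y"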
Valloud, Andres wrote:
> Hmmm... I think what is going on is that clustering is not letting the
> dictionary do its job quickly.

Yes, it surprised me that this case was not fixed in the hashing overhaul.

> There are a number of things to observe. The first is that doing this,
>
>     d := Dictionary new: 1300000.

Here the client of the dictionary has to be aware of the deficiencies in the
hashed collection implementations and tune its way around them. IMO that is
the wrong place (repetitious, fragile, and in general cases like library code
impossible to pre-tune). I had hoped these days would be over...

> The value judgment here was that in practice it would be rare to see a
> dictionary of all the integer points between 0@0 and 999@999 (or 1@1 and
> 1000@1000), and so the simpler hash function already provides far more
> value than the old one.

In our practice this is not so rare; roughly, in any system that works with
integer IDs (in our case order numbers) you will tend to see such clusters of
numbers being used as dictionary keys or set members. This time around I ran
into it with UI code though: I'm busy overlaying the display of dataset cells
with a short animation whenever the value in a cell changes. Here I need a map
from cell coordinates to some helper objects, and indeed the coordinates of
these Points come mostly in clusters.

My point is that the value judgment above does not hold water in my daily
experience (with the caveat that I am putting strong demands on the base
library: I want my code to be able to use a hashed collection of points when I
mean that - I would like such code to be performant enough without the need
for cluttering it with special collection tuning at every incarnation).

So (denying this value judgment) I conclude there is an impedance mismatch
between the new #hash implementations and the hashed collections that use
them: the new hashing may yield clustered hash values if the values being
hashed are clustered - on the other hand, the hashed collections break down
with such clustered hash values.

My first proposal tried a quick fix on the first half - the hashing
implementation - but as you point out it has quality issues. Another solution
to removing the impedance mismatch would be to fix the collections, for
example by throwing in an extra #hashMultiply in #findKeyOrNil:. I'll play
some more with that idea in the coming week.

> The last observation to make is that nothing pays more than knowing your
> data set.

But that comes at a price: you will have to be prepared to have your client
code express these details. In large applications that need to be maintainable
in the light of frequent requirement changes I want to be able to write code
as naively as possible, so that only business rules are expressed - I don't
want future maintainers to be bogged down by tuning issues when they need to
concentrate on business functionality. Hence my denial of the value judgment
that was applied.

> PS: did you get to the section of the hash book that talks about
> Point>>hash yet? :)

No, exercise 1.12 (Fermat's last theorem) has got a total grip on my mind.
When I thought to myself 'how would a hardware hacker solve that' I had the
(mis?)fortune of having a vague insight on how to restate the problem so that
a weaker proof is required. Whenever I pick up the book now I cannot help
myself - I just wander off thinking about that exercise. So I guess I'll have
to bite the bullet and sit down and write out my thoughts on this - get it out
of my system so I can continue with your book :-)

R
-
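A workspace illustration (not from Reinout's message, and not the actual
#findKeyOrNil: source) of the effect he is after: one extra #hashMultiply
applied on the collection side pushes clustered hash values far apart,
although, as Andres notes in his reply, it remains a mere permutation of the
hash range.

    "Assuming SmallInteger>>hash answers the receiver, the first array holds
     nearby values; the second spreads them out."
    | clustered rescrambled |
    clustered := (1 to: 5) collect: [:y | (1@y) hash].
    rescrambled := clustered collect: [:each | each hashMultiply].
    Array with: clustered with: rescrambled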
Reinout Heeck wrote:
> Valloud, Andres wrote:
>> Hmmm... I think what is going on is that clustering is not letting the
>> dictionary do its job quickly.
>
> Yes, it surprised me that this case was not fixed in the hashing overhaul.

For all hash functions there will be data sets that cause them to behave
badly. This cannot be avoided.

> I had hoped these [value judgment] days would be over...

It is always a value judgment. We could just resort to using MD5 for
everything, but then that would be too costly. Plus, it would be a bad hash if
anybody decided to build a dictionary of MD5 collisions, and I know some
people who have actually dealt with this particular matter. In other words, it
is not possible to thoroughly "win" this game.

> In our practice this is not so rare; roughly, in any system that works with
> integer IDs (in our case order numbers) you will tend to see such clusters
> of numbers being used as dictionary keys or set members. This time around I
> ran into it with UI code though: I'm busy overlaying the display of dataset
> cells with a short animation whenever the value in a cell changes. Here I
> need a map from cell coordinates to some helper objects, and indeed the
> coordinates of these Points come mostly in clusters.

Ok, then I guess something more complex is warranted. Instead of what you
suggested, you may want to try this:

    Point>>hash

        | xHash yHash |
        xHash := self x hash hashMultiply.
        xHash := (xHash bitShift: -14) bitXor: ((xHash bitAnd: 16r3FFF) bitShift: 15).
        ^xHash bitXor: self y hash hashMultiply

There are other, much more complicated implementations which are also
possible, but I'd rather see the need before using heavy-handed stuff.

> My point is that the value judgment above does not hold water in my daily
> experience (with the caveat that I am putting strong demands on the base
> library [...])

Right... and sooner or later it will be the case that somebody runs into a
situation in which the hash function above is also problematic...

> Another solution to removing the impedance mismatch would be to fix the
> collections, for example by throwing in an extra #hashMultiply in
> #findKeyOrNil:. I'll play some more with that idea in the coming week.

Applying hashMultiply to every single hash value will cause problems
(truncation to 28 bits), will disturb perfect hash functions for no good
reason, and may introduce distribution issues that were not present before its
use. Moreover, it is equivalent to a permutation mod 2^28, and as such it will
not improve distribution when there are no collisions to begin with. In other
words, it may be more of a performance hit than a benefit. I'd strongly
recommend against doing that.

>> The last observation to make is that nothing pays more than knowing your
>> data set.
>
> But that comes at a price: you will have to be prepared to have your client
> code express these details.

Yes, there is always a price for everything. This is normal.

>> PS: did you get to the section of the hash book that talks about
>> Point>>hash yet? :)
>
> No, exercise 1.12 (Fermat's last theorem) has got a total grip on my mind.
> [...] Whenever I pick up the book now I cannot help myself - I just wander
> off thinking about that exercise. So I guess I'll have to bite the bullet
> and sit down and write out my thoughts on this - get it out of my system so
> I can continue with your book :-)

Heh :)... funny how that happens... I had something like that occur to me with
Concrete Mathematics' exercise 5.112: to show that $\binom{2n}{n}$ is always
divisible by 4 or 9, except when (IIRC) 2n = 64 or 2n = 256. It is marked as a
research problem, although I think Granville et al solved it in 1996.

Andres.
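A workspace sketch (not from the thread) wrapping Andres's suggested
implementation in a block, so its behaviour can be probed without recompiling
Point>>hash; rotating the bits of xHash is what removes the forced x@y / y@x
collision of the symmetric proposal.

    | suggested |
    suggested := [:p | | xHash |
        xHash := p x hash hashMultiply.
        xHash := (xHash bitShift: -14) bitXor: ((xHash bitAnd: 16r3FFF) bitShift: 15).
        xHash bitXor: p y hash hashMultiply].
    (suggested value: 3@7) = (suggested value: 7@3)
        "no longer forced to be true, unlike with the symmetric variant"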
On May 9, 2008, at 9:33 PM, Andres Valloud wrote:
> Reinout Heeck wrote:
>> Valloud, Andres wrote:
>>> Hmmm... I think what is going on is that clustering is not letting the
>>> dictionary do its job quickly.
>>
>> Yes, it surprised me that this case was not fixed in the hashing overhaul.
>
> For all hash functions there will be data sets that cause them to behave
> badly. This cannot be avoided.
>
>> I had hoped these [value judgment] days would be over...

Heh :-)

What I meant there was the days of worrying about clustered values.

More generally, before the hashing overhaul I had run into three problems
repeatedly:

1) Poor performance of hashed collections holding strings.
2) Poor performance of hashed collections holding clustered integer values.
3) Poor abstraction of hash combination 'for dummies', leading to poor #hash
   implementations on domain objects.

An earlier conversation we had extinguished my hopes for any solution to 3) in
the overhaul, but I had fully expected both 1) and 2) to be resolved by the
redesign. Now I get the impression only 1) has been addressed adequately.

You seem to be dismissing 2) as a special case that need not perform with the
base collection implementations. Since this is a value judgment we could keep
debating it; instead, I can provide you a simple data point: my shop would be
relieved if 2) were addressed.

I had a quick look (for the first time) at the hash analysis tool in order to
measure the performance of some hashed collection access patterns. I only
found quality assessments of #hash implementations for integer-based values,
but no benchmark tool to measure the run time of collection accesses - it
seems this is not supported. Is there some other tool I need to load in order
to measure the performance of hashed collections?

Thanks for your suggestions regarding Point>>hash, I'll experiment with them
when I'm back at work (bank holiday here today).

Cheers,

Reinout
-------
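In the absence of a dedicated benchmark tool, a rough workspace measurement in
the style of Reinout's first message is always available; this is only a
sketch, not a feature of the Hash Analysis Tool.

    | keys d |
    keys := OrderedCollection new.
    1 to: 3 do: [:x | 1 to: 1000 do: [:y | keys add: x@y]].
    d := Dictionary new.
    Array
        with: (Time millisecondsToRun: [keys do: [:k | d at: k put: #foo]])
        with: (Time millisecondsToRun: [keys do: [:k | d removeKey: k]])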
Reinout,
In the past, your comments made me conclude that you prefer hash functions
that take little time to evaluate over hash functions that distribute hash
values very well but take their time to do so.

However, now that Point>>hash substantially improves what was shipped with VW
7.5 and before in terms of distribution, while keeping computational
complexity at some sort of bare minimum, it is also a problem because it does
not distribute well enough in cases in which it seems that a 2d array (or a
hash bucket strategy, as opposed to open addressing with linear probing) would
do better for you.

What I am basically hearing in response to this situation is "make the hash
function more expensive so my problem goes away". However, if Point>>hash is
modified to do more work in order to distribute better, then it starts
becoming a target of your previous admonitions regarding execution speed.

While I do not necessarily disagree with improving the distribution of
Point>>hash further at some expense of hash value computation, my concern is
that it is not possible to win.

Regarding the hash analysis tool, the stated measurements are direct
predictors of hashed collection performance. See chapter 2 of the hash book.

Andres.
Hi all,
I agree with Andres when he says that it is impossible to have the best
solution. For me, it means that having the message #hash as a responsibility
of the objects is not the right solution. Another object should have that
responsibility, so that different hash algorithms could be used depending on
the context... for sure somebody else has thought about it, but here are some
Smalltalk lines of what I mean:

    Set hashingWith: [:aPoint | aPoint quickHash].
    Set hashingWith: [:aPoint | aPoint goodDistributionHash].
    Set hashingWith: [:aPoint | aPoint x bitXor: aPoint y].
    etc.

Bye,
Hernan.
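Hernan's #hashingWith: is a proposed protocol rather than an existing
base-image message, but the effect he is after, choosing a hash function per
use site, can be previewed in a workspace by treating candidate hash functions
as blocks and comparing how many distinct values each produces on a given key
set.

    | keys generic contextual |
    keys := OrderedCollection new.
    1 to: 3 do: [:x | 1 to: 1000 do: [:y | keys add: x@y]].
    generic := [:p | p hash].                "the general-purpose Point>>hash"
    contextual := [:p | p x * 1000 + p y].   "exploits the known grid shape"
    Array
        with: (keys collect: generic) asSet size
        with: (keys collect: contextual) asSet size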
Failing that, there is always subclassing:
    Point subclass: #PointForThisParticularUsage

    PointForThisParticularUsage>>hash

        ^self someValueThatMakesSenseInThisParticularContext

Andres.
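Connecting Andres's two sketches: for the 1000x1000 grid discussed earlier in
the thread, the placeholder someValueThatMakesSenseInThisParticularContext
could simply be his x * 1000 + y idea. A hedged method sketch, with the
injectivity argument spelled out; the class is Andres's placeholder, not an
existing one.

    PointForThisParticularUsage>>hash
        "Injective on the 1@1 .. 1000@1000 grid: if x1 * 1000 + y1 = x2 * 1000 + y2,
         then 1000 * (x1 - x2) = y2 - y1, and since |y2 - y1| < 1000 this forces
         x1 = x2 and therefore y1 = y2, so every key in that data set gets a
         distinct hash value."

        ^x * 1000 + y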
Hmmm... yes, but with more coupling... I prefer forwarding over subclassing;
it is more flexible and dynamic, but as you say, it is another option.
On Tue, May 13, 2008 at 3:01 PM, Andres Valloud <[hidden email]> wrote:

> Failing that, there is always subclassing: