Hi all,
I do not think that this behavior is correct:

  (2 = 2.0) = true.
  (2 hash = 2.0 hash) = false.

I saw a discussion about it in an old, old thread (1999), but did not understand the rationale for such a behavior. I can exhibit a lot of examples of code completely broken by this behavior. Try this one:

  | aSet |
  aSet := Set new.
  aSet add: #( 1.0 2 3 4 5 6).
  aSet add: #( 1 2.0 3 4 5 6).
  aSet add: #( 1 2 3.0 4 5 6).
  aSet add: #( 1 2 3 4.0 5 6).
  aSet add: #( 1 2 3 4 5.0 6).
  aSet add: #( 1 2 3 4 5 6.0).
  aSet size

Then try this slight modification:

  | aSet |
  aSet := Set new: 12.
  aSet add: #( 1.0 2 3 4 5 6).
  aSet add: #( 1 2.0 3 4 5 6).
  aSet add: #( 1 2 3.0 4 5 6).
  aSet add: #( 1 2 3 4.0 5 6).
  aSet add: #( 1 2 3 4 5.0 6).
  aSet add: #( 1 2 3 4 5 6.0).
  aSet size

You will get 6 or 5 depending on the initial Set capacity...

Secondly, Array hash, Matrix hash and WideString hash hash each and every element... Seeing this, I imagine one never stores 100x100 matrices in a Dictionary, unless prepared to spend a lot of time at the coffee machine...

In other words, finding a good balance between a hash algorithm that takes too long to run and one that produces too many identical hash values is difficult. I am not sure that the exhaustive way is a good balance. Betting that strings will be short is maybe not the worst bet (unless you store whole sentences, as I saw in translation work), but betting that every array in Smalltalk will be short is nonsense.

Try this one if you are not convinced:

  (SequenceableCollection allSubInstances collect: [:e | e size]) asBag
      valuesAndCounts associations asSortedCollection: [:a :b | a key > b key]

In a 3.9a image, I have an average of 74 and a maximum over a million.

So I raise the question again. Should we really let the above behavior exist in Squeak? Shouldn't we look for more efficient hashing? I'm asking for a good rationale (compatibility issues included).
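[Editor's note: for contrast, the same experiment can be run in Python rather than Smalltalk. Python's numeric tower enforces the invariant that equal numbers hash equally across int, float and Fraction, so the six rows collapse into a single set element regardless of the set's capacity. A runnable sketch:]

```python
from fractions import Fraction

# Python guarantees: a == b implies hash(a) == hash(b), even across
# numeric types -- exactly the invariant the Smalltalk example violates.
assert 2 == 2.0 and hash(2) == hash(2.0)
assert Fraction(3, 2) == 1.5 and hash(Fraction(3, 2)) == hash(1.5)

rows = {
    (1.0, 2, 3, 4, 5, 6),
    (1, 2.0, 3, 4, 5, 6),
    (1, 2, 3.0, 4, 5, 6),
    (1, 2, 3, 4.0, 5, 6),
    (1, 2, 3, 4, 5.0, 6),
    (1, 2, 3, 4, 5, 6.0),
}
# All six tuples compare equal and hash equally, so the set keeps one
# element, independent of its initial capacity.
print(len(rows))  # -> 1
```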
In reply to this post by Nicolas Cellier-3
"nicolas cellier" <[hidden email]>
> Hi all,
>
> i do not think that this behavior is correct
> (2 = 2.0) = true.
> (2 hash = 2.0 hash) = false.
> [example snipped]
> You will get 6 or 5 depending on initial Set capacity...

Wow, this is a very impressive example, thank you a lot.

VisualWorks, Dolphin and VisualAge for Smalltalk use different algorithms to ensure that a float fl which satisfies the condition fl = fl truncated has the same hash as the integer fl truncated. In VisualAge this is done with:

  Float >> hash
      "Answer a unique Integer which represents a hash for the receiver."
      ^self truncated hash

The other systems use something like:

  Float >> hash
      | tr |
      tr := self truncated.
      self = tr
          ifTrue: [^tr hash]
          ifFalse: [" some computation "]

Your example also works with fractions.
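[Editor's note: the two-branch scheme described above can be sketched in Python. The "some computation" branch differs between dialects and is not specified in the thread; as a stand-in, this sketch hashes the raw IEEE-754 bit pattern -- an assumption, not any vendor's actual algorithm:]

```python
import struct

def float_hash(f: float) -> int:
    """Sketch of the VisualWorks/Dolphin-style Float>>hash described above."""
    tr = int(f)  # truncation toward zero, like Smalltalk's #truncated
    if f == tr:
        return hash(tr)  # integral floats share the Integer hash: 2.0 -> hash(2)
    # "some computation": as a placeholder, hash the IEEE-754 bit pattern
    # (an assumption -- the dialects each do something different here)
    return hash(struct.pack('>d', f))

assert float_hash(2.0) == hash(2)    # (2 hash = 2.0 hash) now holds
assert float_hash(-3.0) == hash(-3)
```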
In reply to this post by Nicolas Cellier-3
Nicolas
do you have a fix for that behavior?

Stef

On 27 March 06, at 21:44, nicolas cellier wrote:
> i do not think that this behavior is correct
> (2 = 2.0) = true.
> (2 hash = 2.0 hash) = false.
> [examples and statistics snipped]
> So i raise the question again. Should we really let above behavior
> exist in squeak ?
> Shouldn't we look for more efficient hashing ?
> I'm asking for a good rationale (compatibility issues included).
In reply to this post by Nicolas Cellier-3
Hi, Stef and Boris,

We know from the 1997-99 discussions that the VisualAge implementation is not good. It computes the hash code fast, but handles Sets of Floats inefficiently. Example:

  ((1 to: 100) collect: [:i | 10.0 raisedTo: i negated]) asSet.

All elements have the same hash code (0 hash) and access performance degrades to a linear search.

In the other dialects, the

  self = self truncated ifFalse: [...]

branch is also there to handle fraction/float equality: (3/2) hash = 1.5 hash.

VW used to convert everything to Fraction (and had nasty bugs, because their imperfect asRational algorithm sometimes had (aDouble = aDouble asRational asDouble) answering false; see SYSBUG-asRational at the Cincom public store). Now they convert everything to Float, then to Double if anything fails (overflow), then fall back to the old Fraction hash (a large Fraction can also overflow the maximum IEEE 754 Double). I think the new implementation is faster, and it also covers other nasty bugs where aFloat = aDouble answers true although the two have different fractional representations. This also covers scaled numbers.

I do not have a Dolphin, ST/X, ... image here, but I imagine they follow one of the above paths.

Sorry, I have no original solution in mind, but to borrow the solutions of our Smalltalk cousins.
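[Editor's note: the degeneracy Nicolas describes is easy to reproduce outside Smalltalk. This Python sketch models the VisualAge-style "hash the truncated value" scheme and shows one hundred distinct small floats collapsing onto a single hash value:]

```python
def va_style_hash(f: float) -> int:
    # VisualAge-style hash as described in the thread (a sketch):
    # hash the truncated value, so every float in (-1, 1) hashes like 0.
    return hash(int(f))

values = [10.0 ** -i for i in range(1, 101)]  # 0.1, 0.01, ..., 1e-100
assert len(set(values)) == 100                # all distinct as floats
hashes = {va_style_hash(v) for v in values}
# One shared hash value -> one bucket chain -> lookups degrade to linear search.
print(len(hashes))  # -> 1
```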
Nicolas
thanks!
My question was also an interested one :)
If you want to see this bug fixed, submit a fix and a bunch of tests :).
You look like the right guy to do that :)

Stef

On 28 March 06, at 10:39, [hidden email] wrote:
> Hi, Stef and Boris,
>
> We know from 1997-99 discussions that VA implementation is not good.
> It computes hash code fast, but handle Set of Float inefficiently.
> [rest snipped]
> Sorry, I have no original solution in mind, but borrow solutions of
> our Smalltalk cousins.
On 28.03.2006, at 10:51, stéphane ducasse wrote:
> Nicolas
>
> thanks!
> If you want to see this bug to be fixed, submit a fix and a bunch
> of tests :).

And don't be discouraged if something is not added/fixed fast... it will be at some point.

Marcus
In reply to this post by Nicolas Cellier-3
So I think I should proceed with VW's latter solution:
Rationale:

Turning Float hash into Fraction hash really spoils things (factor 100):

  Time millisecondsToRun: [10000 timesRepeat: [Float pi asFraction hash]].  "1217"
  Time millisecondsToRun: [10000 timesRepeat: [Float pi hash]].  "13"

But turning Fraction hash into Float hash does less harm (factor 10):

  | tmp |
  tmp := Float pi asFraction.
  Array
      with: (Time millisecondsToRun: [10000 timesRepeat: [tmp hash]])
      with: (Time millisecondsToRun: [10000 timesRepeat: [tmp asFloat hash]])
  "#(47 339)"

This first naive implementation does not handle large fractions well (they all get the same Float infinity hash), but that can be fixed later (since they are not equal... but wait, wait, they ARE equal! arghh, see another bug in the next mail).

Nicolas

On Tuesday 28 March 2006 10:51, you wrote:
> Nicolas
>
> thanks!
> If you want to see this bug to be fixed, submit a fix and a bunch of
> tests :).
> You look like that right guy to do that :)
>
> Stef
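[Editor's note: for what it is worth, a later solution to exactly this large-fraction problem is to hash p/q with modular arithmetic instead of converting in either direction. CPython does this, reducing every rational modulo the Mersenne prime 2^61 - 1 on 64-bit builds; the sketch below follows that documented scheme, offered here only as one possible design, not as the fix adopted in Squeak:]

```python
from fractions import Fraction

M = (1 << 61) - 1  # Mersenne prime modulus used by CPython's numeric hash

def rational_hash(p: int, q: int) -> int:
    """Hash p/q without converting to Float or to Fraction: huge fractions
    no longer collapse onto one 'infinity' hash, and integral values agree
    with the plain integer hash."""
    # q^-1 mod M by Fermat's little theorem (M is prime; q not divisible by M)
    return (p % M) * pow(q, M - 2, M) % M

assert rational_hash(4, 2) == rational_hash(2, 1)  # 4/2 and 2/1 agree
assert rational_hash(3, 2) == hash(Fraction(3, 2)) == hash(1.5)
big = 10 ** 400                                    # would overflow a Float
assert rational_hash(big, 3) != rational_hash(big + 1, 3)
```

The point of the modulus trick is that equality of rationals is preserved by the reduction, so the "convert to Float, catch the overflow, fall back to Fraction hash" cascade becomes unnecessary.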