hash and equal problem

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

hash and equal problem

Nicolas Cellier-3
Hi all,

i do not think that this behavior is correct
  (2 = 2.0) = true.
  (2 hash = 2.0 hash) = false.

i saw a discussion about it in an old old thread (1999), but did not
understand the rationale for such a behavior.

i can exhibit a lot of examples of code completely broken by this behavior.

Try this one:

  | aSet |
  aSet := Set new.
  aSet add: #( 1.0 2 3 4 5 6).
  aSet add: #( 1 2.0 3 4 5 6).
  aSet add: #( 1 2 3.0 4 5 6).
  aSet add: #( 1 2 3 4.0 5 6).
  aSet add: #( 1 2 3 4 5.0 6).
  aSet add: #( 1 2 3 4 5 6.0).
  aSet size

Then, try this slight modification:

  | aSet |
  aSet := Set new: 12.
  aSet add: #( 1.0 2 3 4 5 6).
  aSet add: #( 1 2.0 3 4 5 6).
  aSet add: #( 1 2 3.0 4 5 6).
  aSet add: #( 1 2 3 4.0 5 6).
  aSet add: #( 1 2 3 4 5.0 6).
  aSet add: #( 1 2 3 4 5 6.0).
  aSet size

You will get 6 or 5 depending on initial Set capacity...

Secondly, Array hash, Matrix hash, WideSstring hash are hashing each and every
element... Sawing this, I imagine one never store 100x100 Matrices in a
Dictionary, unless prepared to spend a lot of time at coffee machine...

In other words, finding a good balance between long to run hash algorithm and
too many identical hash values is difficult. I am not sure that the
exhaustive way is a good balance.

The bet that string will be short is maybe not the worst one (unless you store
whole sentence like i saw in translation), but betting that every array in
Smalltalk will be short is nonsense.

Try this one if you are not convinced:
  (SequenceableCollection allSubInstances collect: [:e | e size]) asBag
     valuesAndCounts associations asSortedCollection: [:a :b | a key > b key]

In a 3.9a image, i have an average of 74 and a maximum over the millon.

So i raise the question again. Should we really let above behavior exist in
squeak ?
Shouldn't we look for more efficient hashing ?
I'm asking for a good rationale (compatibility issues included).


Reply | Threaded
Open this post in threaded view
|

Re: hash and equal problem

Nicolas Cellier-3
I put the shortest expression of the problem at
http://bugs.impara.de/view.php?id=3360


Reply | Threaded
Open this post in threaded view
|

Re: hash and equal problem

Boris.Gaertner
In reply to this post by Nicolas Cellier-3
"nicolas cellier" <[hidden email]>



> Hi all,
>
> i do not think that this behavior is correct
>   (2 = 2.0) = true.
>   (2 hash = 2.0 hash) = false.
>
> i saw a discussion about it in an old old thread (1999), but did not
> understand the rationale for such a behavior.
>
> i can exhibit a lot of examples of code completely broken by this
behavior.

>
> Try this one:
>
>   | aSet |
>   aSet := Set new.
>   aSet add: #( 1.0 2 3 4 5 6).
>   aSet add: #( 1 2.0 3 4 5 6).
>   aSet add: #( 1 2 3.0 4 5 6).
>   aSet add: #( 1 2 3 4.0 5 6).
>   aSet add: #( 1 2 3 4 5.0 6).
>   aSet add: #( 1 2 3 4 5 6.0).
>   aSet size
>
> Then, try this slight modification:
>
>   | aSet |
>   aSet := Set new: 12.
>   aSet add: #( 1.0 2 3 4 5 6).
>   aSet add: #( 1 2.0 3 4 5 6).
>   aSet add: #( 1 2 3.0 4 5 6).
>   aSet add: #( 1 2 3 4.0 5 6).
>   aSet add: #( 1 2 3 4 5.0 6).
>   aSet add: #( 1 2 3 4 5 6.0).
>   aSet size
>
> You will get 6 or 5 depending on initial Set capacity...

Wow, this is a very impressive example, thank you a lot.

VisualWorks, Dolphin and VisualAge for Smalltalk
use different algorithms to ensure that a float fl that satiyfies
the condition  fl = fl truncated has the same hash
as the integer  fl truncated.

In Visual age this is done with:

Float >> hash
 "Answer a unique Integer which represents a hash for the receiver."

 ^self truncated hash


the other systems use something like


float>>hash
   | tr |
  tr := self truncated.
  self = tr
    ifTrue: [^tr hash]
   ifFalse: [ " some computation " ]



Your example works also with fractions,


Reply | Threaded
Open this post in threaded view
|

Re: hash and equal problem

stéphane ducasse-2
In reply to this post by Nicolas Cellier-3
Nicolas

do you have a fix for that behavior?

Stef

On 27 mars 06, at 21:44, nicolas cellier wrote:

> Hi all,
>
> i do not think that this behavior is correct
>   (2 = 2.0) = true.
>   (2 hash = 2.0 hash) = false.
>
> i saw a discussion about it in an old old thread (1999), but did not
> understand the rationale for such a behavior.
>
> i can exhibit a lot of examples of code completely broken by this  
> behavior.
>
> Try this one:
>
>   | aSet |
>   aSet := Set new.
>   aSet add: #( 1.0 2 3 4 5 6).
>   aSet add: #( 1 2.0 3 4 5 6).
>   aSet add: #( 1 2 3.0 4 5 6).
>   aSet add: #( 1 2 3 4.0 5 6).
>   aSet add: #( 1 2 3 4 5.0 6).
>   aSet add: #( 1 2 3 4 5 6.0).
>   aSet size
>
> Then, try this slight modification:
>
>   | aSet |
>   aSet := Set new: 12.
>   aSet add: #( 1.0 2 3 4 5 6).
>   aSet add: #( 1 2.0 3 4 5 6).
>   aSet add: #( 1 2 3.0 4 5 6).
>   aSet add: #( 1 2 3 4.0 5 6).
>   aSet add: #( 1 2 3 4 5.0 6).
>   aSet add: #( 1 2 3 4 5 6.0).
>   aSet size
>
> You will get 6 or 5 depending on initial Set capacity...
>
> Secondly, Array hash, Matrix hash, WideSstring hash are hashing  
> each and every
> element... Sawing this, I imagine one never store 100x100 Matrices  
> in a
> Dictionary, unless prepared to spend a lot of time at coffee  
> machine...
>
> In other words, finding a good balance between long to run hash  
> algorithm and
> too many identical hash values is difficult. I am not sure that the
> exhaustive way is a good balance.
>
> The bet that string will be short is maybe not the worst one  
> (unless you store
> whole sentence like i saw in translation), but betting that every  
> array in
> Smalltalk will be short is nonsense.
>
> Try this one if you are not convinced:
>   (SequenceableCollection allSubInstances collect: [:e | e size])  
> asBag
>      valuesAndCounts associations asSortedCollection: [:a :b | a  
> key > b key]
>
> In a 3.9a image, i have an average of 74 and a maximum over the  
> millon.
>
> So i raise the question again. Should we really let above behavior  
> exist in
> squeak ?
> Shouldn't we look for more efficient hashing ?
> I'm asking for a good rationale (compatibility issues included).
>
>


Reply | Threaded
Open this post in threaded view
|

Re: hash and equal problem

Nicolas Cellier-3
In reply to this post by Nicolas Cellier-3

Hi, Stef and Boris,

We know from 1997-99 discussions that VA implementation is not good.
It computes hash code fast, but handle Set of Float inefficiently.
Example:
    ((1 to: 100) collect: [:i | 10.0 raisedTo: i negated]) asSet.
All elements have same hash code (0 hash) and access performance turn to linear search.

In other dialects,
    self=self truncated ifFalse: [...]
is also here to handle fraction/float equality (3/2) hash = 1.5 hash

VW used to convert everything to Fraction (and had nasty bugs because their imperfect asRational algorithm sometimes have (aDouble = aDouble asRational asDouble) answering false, see SYSBUG-asRational at cincom public store).
Now they convert everything to Float, then to Double if anything fail (Overflow) then to old Fraction hash (Large Fraction can also overflow max IEEE754 Double). I think new implementation is faster, and also cove rs other nasty bugs of aFloat=aDouble answering true, but having different fractional representations. This also cover scaled numbers.

I do not have Dolphin, ST/X, ... image here, but i imagine they follow one of the above paths.

Sorry, I have no original solution in mind, but borrow solutions of our Smalltalk cousins.





iFRANCE
exprimez-vous !


Reply | Threaded
Open this post in threaded view
|

Re: hash and equal problem

stéphane ducasse-2
Nicolas

thanks!
My question was also more interested :)
If you want to see this bug to be fixed, submit a fix and a bunch of  
tests :).
You look like that right guy to do that :)

Stef

On 28 mars 06, at 10:39, [hidden email] wrote:

>
> Hi, Stef and Boris,
>
> We know from 1997-99 discussions that VA implementation is not good.
> It computes hash code fast, but handle Set of Float inefficiently.
> Example:
>     ((1 to: 100) collect: [:i | 10.0 raisedTo: i negated]) asSet.
> All elements have same hash code (0 hash) and access performance  
> turn to linear search.
>
> In other dialects,
>     self=self truncated ifFalse: [...]
> is also here to handle fraction/float equality (3/2) hash = 1.5 hash
>
> VW used to convert everything to Fraction (and had nasty bugs  
> because their imperfect asRational algorithm sometimes have  
> (aDouble = aDouble asRational asDouble) answering false, see SYSBUG-
> asRational at cincom public store).
> Now they convert everything to Float, then to Double if anything  
> fail (Overflow) then to old Fraction hash (Large Fraction can also  
> overflow max IEEE754 Double). I think new implementation is faster,  
> and also cove rs other nasty bugs of aFloat=aDouble answering true,  
> but having different fractional representations. This also cover  
> scaled numbers.
>
> I do not have Dolphin, ST/X, ... image here, but i imagine they  
> follow one of the above paths.
>
> Sorry, I have no original solution in mind, but borrow solutions of  
> our Smalltalk cousins.
>
>
>
>
> iFRANCE
> exprimez-vous !
>


Reply | Threaded
Open this post in threaded view
|

Re: hash and equal problem

Marcus Denker

On 28.03.2006, at 10:51, stéphane ducasse wrote:

> Nicolas
>
> thanks!
> My question was also more interested :)
> If you want to see this bug to be fixed, submit a fix and a bunch  
> of tests :).


And don't be discuraged if something is not added/fixed fast... it  
will be at
some point.

      Marcus
Reply | Threaded
Open this post in threaded view
|

Re: hash and equal problem

Nicolas Cellier-3
In reply to this post by Nicolas Cellier-3
So i think i should proceed with VW latter solution:

Rationale:

Turning Float hash to Fraction hash really spoils things (factor 100) :

Time millisecondsToRun: [10000 timesRepeat: [Float pi asFraction hash]].
    1217
Time millisecondsToRun: [10000 timesRepeat: [Float pi hash]].
    13

But turning Fraction hash to Float hash does harm less (factor 10):
| tmp |

tmp := Float pi asFraction.
Array
    with: (Time millisecondsToRun: [10000 timesRepeat: [tmp hash]])
    with: (Time millisecondsToRun: [10000 timesRepeat: [tmp asFloat hash]])
  #(47 339)

This first naive implementation does not handle large fractions well (all
 have same Float infinity hash), but this can be fixed later (since they do
 not equal... but wait, wait, they equal ! arghh see another bug in next
 mail).

Nicolas

Le Mardi 28 Mars 2006 10:51, vous avez écrit :
> Nicolas
>
> thanks!
> My question was also more interested :)
> If you want to see this bug to be fixed, submit a fix and a bunch of
> tests :).
> You look like that right guy to do that :)
>
> Stef