Hi all,
I do not think that this behavior is correct:

  (2 = 2.0) = true.
  (2 hash = 2.0 hash) = false.

I saw a discussion about it in an old, old thread (1999), but did not understand the rationale for such a behavior. I can exhibit a lot of examples of code completely broken by this behavior. Try this one:

  | aSet |
  aSet := Set new.
  aSet add: #( 1.0 2 3 4 5 6).
  aSet add: #( 1 2.0 3 4 5 6).
  aSet add: #( 1 2 3.0 4 5 6).
  aSet add: #( 1 2 3 4.0 5 6).
  aSet add: #( 1 2 3 4 5.0 6).
  aSet add: #( 1 2 3 4 5 6.0).
  aSet size

Then try this slight modification:

  | aSet |
  aSet := Set new: 12.
  aSet add: #( 1.0 2 3 4 5 6).
  aSet add: #( 1 2.0 3 4 5 6).
  aSet add: #( 1 2 3.0 4 5 6).
  aSet add: #( 1 2 3 4.0 5 6).
  aSet add: #( 1 2 3 4 5.0 6).
  aSet add: #( 1 2 3 4 5 6.0).
  aSet size

You will get 6 or 5 depending on the initial Set capacity...

Secondly, Array hash, Matrix hash and WideString hash hash each and every element... Seeing this, I imagine one never stores 100x100 matrices in a Dictionary, unless prepared to spend a lot of time at the coffee machine...

In other words, finding a good balance between a hash algorithm that takes too long to run and one that produces too many identical hash values is difficult. I am not sure that the exhaustive way is a good balance. Betting that strings will be short is maybe not the worst bet (unless you store whole sentences, as I saw in translation work), but betting that every array in Smalltalk will be short is nonsense.

Try this one if you are not convinced:

  (SequenceableCollection allSubInstances collect: [:e | e size]) asBag
      valuesAndCounts associations asSortedCollection: [:a :b | a key > b key]

In a 3.9a image, I have an average of 74 and a maximum over a million.

So I raise the question again. Should we really let the above behavior exist in Squeak? Shouldn't we look for more efficient hashing? I'm asking for a good rationale (compatibility issues included).
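[Editor's note: for contrast, the same experiment can be run in Python rather than Smalltalk. Python's numeric tower enforces the invariant that equal numbers hash equally across int, float and Fraction, so the six rows collapse into a single set element regardless of the set's capacity. A runnable sketch:]

```python
from fractions import Fraction

# Python guarantees: a == b implies hash(a) == hash(b), even across
# numeric types -- exactly the invariant the Smalltalk example violates.
assert 2 == 2.0 and hash(2) == hash(2.0)
assert Fraction(3, 2) == 1.5 and hash(Fraction(3, 2)) == hash(1.5)

rows = {
    (1.0, 2, 3, 4, 5, 6),
    (1, 2.0, 3, 4, 5, 6),
    (1, 2, 3.0, 4, 5, 6),
    (1, 2, 3, 4.0, 5, 6),
    (1, 2, 3, 4, 5.0, 6),
    (1, 2, 3, 4, 5, 6.0),
}
# All six tuples compare equal and hash equally, so the set keeps one
# element, independent of its initial capacity.
print(len(rows))  # -> 1
```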
In reply to this post by Nicolas Cellier-3
"nicolas cellier" <[hidden email]>
> Hi all,
>
> i do not think that this behavior is correct
> (2 = 2.0) = true.
> (2 hash = 2.0 hash) = false.
> [example snipped]
> You will get 6 or 5 depending on initial Set capacity...

Wow, this is a very impressive example, thank you a lot.

VisualWorks, Dolphin and VisualAge for Smalltalk use different algorithms to ensure that a float fl which satisfies the condition fl = fl truncated has the same hash as the integer fl truncated. In VisualAge this is done with:

  Float >> hash
      "Answer a unique Integer which represents a hash for the receiver."
      ^self truncated hash

The other systems use something like:

  Float >> hash
      | tr |
      tr := self truncated.
      self = tr
          ifTrue: [^tr hash]
          ifFalse: [" some computation "]

Your example also works with fractions.
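[Editor's note: the two-branch scheme described above can be sketched in Python. The "some computation" branch differs between dialects and is not specified in the thread; as a stand-in, this sketch hashes the raw IEEE-754 bit pattern -- an assumption, not any vendor's actual algorithm:]

```python
import struct

def float_hash(f: float) -> int:
    """Sketch of the VisualWorks/Dolphin-style Float>>hash described above."""
    tr = int(f)  # truncation toward zero, like Smalltalk's #truncated
    if f == tr:
        return hash(tr)  # integral floats share the Integer hash: 2.0 -> hash(2)
    # "some computation": as a placeholder, hash the IEEE-754 bit pattern
    # (an assumption -- the dialects each do something different here)
    return hash(struct.pack('>d', f))

assert float_hash(2.0) == hash(2)    # (2 hash = 2.0 hash) now holds
assert float_hash(-3.0) == hash(-3)
```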
In reply to this post by Nicolas Cellier-3
Nicolas
do you have a fix for that behavior?

Stef

On 27 March 06, at 21:44, nicolas cellier wrote:
> i do not think that this behavior is correct
> (2 = 2.0) = true.
> (2 hash = 2.0 hash) = false.
> [examples and statistics snipped]
> So i raise the question again. Should we really let above behavior
> exist in squeak ?
> Shouldn't we look for more efficient hashing ?
> I'm asking for a good rationale (compatibility issues included).
In reply to this post by Nicolas Cellier-3
Hi, Stef and Boris,

We know from the 1997-99 discussions that the VisualAge implementation is not good. It computes the hash code fast, but handles Sets of Floats inefficiently. Example:

  ((1 to: 100) collect: [:i | 10.0 raisedTo: i negated]) asSet.

All elements have the same hash code (0 hash) and access performance degrades to a linear search.

In the other dialects, the

  self = self truncated ifFalse: [...]

branch is also there to handle fraction/float equality: (3/2) hash = 1.5 hash.

VW used to convert everything to Fraction (and had nasty bugs, because their imperfect asRational algorithm sometimes had (aDouble = aDouble asRational asDouble) answering false; see SYSBUG-asRational at the Cincom public store). Now they convert everything to Float, then to Double if anything fails (overflow), then fall back to the old Fraction hash (a large Fraction can also overflow the maximum IEEE 754 Double). I think the new implementation is faster, and it also covers other nasty bugs where aFloat = aDouble answers true although the two have different fractional representations. This also covers scaled numbers.

I do not have a Dolphin, ST/X, ... image here, but I imagine they follow one of the above paths.

Sorry, I have no original solution in mind, but to borrow the solutions of our Smalltalk cousins.
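[Editor's note: the degeneracy Nicolas describes is easy to reproduce outside Smalltalk. This Python sketch models the VisualAge-style "hash the truncated value" scheme and shows one hundred distinct small floats collapsing onto a single hash value:]

```python
def va_style_hash(f: float) -> int:
    # VisualAge-style hash as described in the thread (a sketch):
    # hash the truncated value, so every float in (-1, 1) hashes like 0.
    return hash(int(f))

values = [10.0 ** -i for i in range(1, 101)]  # 0.1, 0.01, ..., 1e-100
assert len(set(values)) == 100                # all distinct as floats
hashes = {va_style_hash(v) for v in values}
# One shared hash value -> one bucket chain -> lookups degrade to linear search.
print(len(hashes))  # -> 1
```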
Nicolas
thanks!
My question was also an interested one :)
If you want to see this bug fixed, submit a fix and a bunch of tests :).
You look like the right guy to do that :)

Stef

On 28 March 06, at 10:39, [hidden email] wrote:
> Hi, Stef and Boris,
>
> We know from 1997-99 discussions that VA implementation is not good.
> It computes hash code fast, but handle Set of Float inefficiently.
> [rest snipped]
> Sorry, I have no original solution in mind, but borrow solutions of
> our Smalltalk cousins.
On 28.03.2006, at 10:51, stéphane ducasse wrote:
> Nicolas
>
> thanks!
> If you want to see this bug to be fixed, submit a fix and a bunch
> of tests :).

And don't be discouraged if something is not added/fixed fast... it will be at some point.

Marcus
In reply to this post by Nicolas Cellier-3
So I think I should proceed with VW's latter solution:
Rationale:

Turning Float hash into Fraction hash really spoils things (factor 100):

  Time millisecondsToRun: [10000 timesRepeat: [Float pi asFraction hash]].  "1217"
  Time millisecondsToRun: [10000 timesRepeat: [Float pi hash]].  "13"

But turning Fraction hash into Float hash does less harm (factor 10):

  | tmp |
  tmp := Float pi asFraction.
  Array
      with: (Time millisecondsToRun: [10000 timesRepeat: [tmp hash]])
      with: (Time millisecondsToRun: [10000 timesRepeat: [tmp asFloat hash]])
  "#(47 339)"

This first naive implementation does not handle large fractions well (they all get the same Float infinity hash), but that can be fixed later (since they are not equal... but wait, wait, they ARE equal! arghh, see another bug in the next mail).

Nicolas

On Tuesday 28 March 2006 10:51, you wrote:
> Nicolas
>
> thanks!
> If you want to see this bug to be fixed, submit a fix and a bunch of
> tests :).
> You look like that right guy to do that :)
>
> Stef
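[Editor's note: for what it is worth, a later solution to exactly this large-fraction problem is to hash p/q with modular arithmetic instead of converting in either direction. CPython does this, reducing every rational modulo the Mersenne prime 2^61 - 1 on 64-bit builds; the sketch below follows that documented scheme, offered here only as one possible design, not as the fix adopted in Squeak:]

```python
from fractions import Fraction

M = (1 << 61) - 1  # Mersenne prime modulus used by CPython's numeric hash

def rational_hash(p: int, q: int) -> int:
    """Hash p/q without converting to Float or to Fraction: huge fractions
    no longer collapse onto one 'infinity' hash, and integral values agree
    with the plain integer hash."""
    # q^-1 mod M by Fermat's little theorem (M is prime; q not divisible by M)
    return (p % M) * pow(q, M - 2, M) % M

assert rational_hash(4, 2) == rational_hash(2, 1)  # 4/2 and 2/1 agree
assert rational_hash(3, 2) == hash(Fraction(3, 2)) == hash(1.5)
big = 10 ** 400                                    # would overflow a Float
assert rational_hash(big, 3) != rational_hash(big + 1, 3)
```

The point of the modulus trick is that equality of rationals is preserved by the reduction, so the "convert to Float, catch the overflow, fall back to Fraction hash" cascade becomes unnecessary.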