(a = b) => (a hash = b hash)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
14 messages Options
Reply | Threaded
Open this post in threaded view
|

(a = b) => (a hash = b hash)

Prof. Andrew P. Black
The golden rule of hashing is (a = b) => (a hash = b hash).  Right?

Look at this:

(1 to: 10) species  --->  Array
#(1 2 3 4 5 6 7 8 9 10) species  --->  Array

(1 to: 10) =  #(1 2 3 4 5 6 7 8 9 10)   --->  true

However,

(1 to: 10) hash = #(1 2 3 4 5 6 7 8 9 10) hash  --->  false.

I believe that the last is a bug.

        Andrew


_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
Reply | Threaded
Open this post in threaded view
|

Re: (a = b) => (a hash = b hash)

Nicolas Cellier
http://bugs.squeak.org/view.php?id=3380
http://bugs.squeak.org/view.php?id=3488

2009/10/24 Andrew P. Black <[hidden email]>:

> The golden rule of hashing is (a = b) => (a hash = b hash).  Right?
>
> Look at this:
>
> (1 to: 10) species  --->  Array
> #(1 2 3 4 5 6 7 8 9 10) species  --->  Array
>
> (1 to: 10) =  #(1 2 3 4 5 6 7 8 9 10)   --->  true
>
> However,
>
> (1 to: 10) hash = #(1 2 3 4 5 6 7 8 9 10) hash  --->  false.
>
> I believe that the last is a bug.
>
>        Andrew
>
>
> _______________________________________________
> Pharo-project mailing list
> [hidden email]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
Reply | Threaded
Open this post in threaded view
|

Re: (a = b) => (a hash = b hash)

Andres Valloud-4
I'd rather make the interval not equal to the array, at least with the
message #=.  Probably that can't be done safely.  If that's the case,
I'd say the best approach is to leave it "broken" but documented in
terms of "some collections may compare 'reasonably', however, since
equality in those cases is not well defined, then the rule a = b => a
hash = b hash may be waived".  For example, sure, #(1 2 3) = (1 to: 3),
however you can't do (1 to: 3) at: 2 put: 5.  And 1.0 may be equal to 1,
but the numbers represent completely different things so even though the
numeric value is the same, equality is ill defined, and so it's not
necessary for the hashing rule to apply.

Nicolas Cellier wrote:

> http://bugs.squeak.org/view.php?id=3380
> http://bugs.squeak.org/view.php?id=3488
>
> 2009/10/24 Andrew P. Black <[hidden email]>:
>  
>> The golden rule of hashing is (a = b) => (a hash = b hash).  Right?
>>
>> Look at this:
>>
>> (1 to: 10) species  --->  Array
>> #(1 2 3 4 5 6 7 8 9 10) species  --->  Array
>>
>> (1 to: 10) =  #(1 2 3 4 5 6 7 8 9 10)   --->  true
>>
>> However,
>>
>> (1 to: 10) hash = #(1 2 3 4 5 6 7 8 9 10) hash  --->  false.
>>
>> I believe that the last is a bug.
>>
>>        Andrew
>>
>>
>> _______________________________________________
>> Pharo-project mailing list
>> [hidden email]
>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>>
>>    
>
> _______________________________________________
> Pharo-project mailing list
> [hidden email]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>
>  

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
Reply | Threaded
Open this post in threaded view
|

Re: (a = b) => (a hash = b hash)

Andres Valloud-4
(furthermore, changing Interval>>hash to match that of Array in cases
like (1 to: SmallInteger minVal) is bound to cause unacceptable
performance penalties)

Andres Valloud wrote:

> I'd rather make the interval not equal to the array, at least with the
> message #=.  Probably that can't be done safely.  If that's the case,
> I'd say the best approach is to leave it "broken" but documented in
> terms of "some collections may compare 'reasonably', however, since
> equality in those cases is not well defined, then the rule a = b => a
> hash = b hash may be waived".  For example, sure, #(1 2 3) = (1 to: 3),
> however you can't do (1 to: 3) at: 2 put: 5.  And 1.0 may be equal to 1,
> but the numbers represent completely different things so even though the
> numeric value is the same, equality is ill defined, and so it's not
> necessary for the hashing rule to apply.
>
> Nicolas Cellier wrote:
>  
>> http://bugs.squeak.org/view.php?id=3380
>> http://bugs.squeak.org/view.php?id=3488
>>
>> 2009/10/24 Andrew P. Black <[hidden email]>:
>>  
>>    
>>> The golden rule of hashing is (a = b) => (a hash = b hash).  Right?
>>>
>>> Look at this:
>>>
>>> (1 to: 10) species  --->  Array
>>> #(1 2 3 4 5 6 7 8 9 10) species  --->  Array
>>>
>>> (1 to: 10) =  #(1 2 3 4 5 6 7 8 9 10)   --->  true
>>>
>>> However,
>>>
>>> (1 to: 10) hash = #(1 2 3 4 5 6 7 8 9 10) hash  --->  false.
>>>
>>> I believe that the last is a bug.
>>>
>>>        Andrew
>>>
>>>
>>> _______________________________________________
>>> Pharo-project mailing list
>>> [hidden email]
>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>>>
>>>    
>>>      
>> _______________________________________________
>> Pharo-project mailing list
>> [hidden email]
>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>>
>>  
>>    
>
> _______________________________________________
> Pharo-project mailing list
> [hidden email]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>
>  

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
Reply | Threaded
Open this post in threaded view
|

Re: (a = b) => (a hash = b hash)

Nicolas Cellier
2009/10/24 Andres Valloud <[hidden email]>:
> (furthermore, changing Interval>>hash to match that of Array in cases
> like (1 to: SmallInteger minVal) is bound to cause unacceptable
> performance penalties)
>

Yes, large Array hash has severe performance penalty anyway.
Maybe we shall not hash each and every element.

And, even forgetting about equality with Array, Inteval>>#= and
Interval>>#hash do not agree in Squeak and Pharo:

        (1 to: 10 by: 5)  = (1 to: 9 by: 5). -> true
        (1 to: 10 by: 5) hash = (1 to: 9 by: 5) hash. -> false

Nicolas

> Andres Valloud wrote:
>> I'd rather make the interval not equal to the array, at least with the
>> message #=.  Probably that can't be done safely.  If that's the case,
>> I'd say the best approach is to leave it "broken" but documented in
>> terms of "some collections may compare 'reasonably', however, since
>> equality in those cases is not well defined, then the rule a = b => a
>> hash = b hash may be waived".  For example, sure, #(1 2 3) = (1 to: 3),
>> however you can't do (1 to: 3) at: 2 put: 5.  And 1.0 may be equal to 1,
>> but the numbers represent completely different things so even though the
>> numeric value is the same, equality is ill defined, and so it's not
>> necessary for the hashing rule to apply.
>>
>> Nicolas Cellier wrote:
>>
>>> http://bugs.squeak.org/view.php?id=3380
>>> http://bugs.squeak.org/view.php?id=3488
>>>
>>> 2009/10/24 Andrew P. Black <[hidden email]>:
>>>
>>>
>>>> The golden rule of hashing is (a = b) => (a hash = b hash).  Right?
>>>>
>>>> Look at this:
>>>>
>>>> (1 to: 10) species  --->  Array
>>>> #(1 2 3 4 5 6 7 8 9 10) species  --->  Array
>>>>
>>>> (1 to: 10) =  #(1 2 3 4 5 6 7 8 9 10)   --->  true
>>>>
>>>> However,
>>>>
>>>> (1 to: 10) hash = #(1 2 3 4 5 6 7 8 9 10) hash  --->  false.
>>>>
>>>> I believe that the last is a bug.
>>>>
>>>>        Andrew
>>>>
>>>>
>>>> _______________________________________________
>>>> Pharo-project mailing list
>>>> [hidden email]
>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>>>>
>>>>
>>>>
>>> _______________________________________________
>>> Pharo-project mailing list
>>> [hidden email]
>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>>>
>>>
>>>
>>
>> _______________________________________________
>> Pharo-project mailing list
>> [hidden email]
>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>>
>>
>
> _______________________________________________
> Pharo-project mailing list
> [hidden email]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
Reply | Threaded
Open this post in threaded view
|

Re: (a = b) => (a hash = b hash)

Andres Valloud-4
Nicolas,
> Yes, large Array hash has severe performance penalty anyway.
> Maybe we shall not hash each and every element.
>  

I disagree.  The issue is that if hashing a very large array (or string)
becomes a performance problem, then we should engineer applications so
that we hash the metadata as opposed to the data.

In other words, assume you're hashing arbitrary 1kb strings.  You can't
have more than ~1 million of those in memory at any given time, and
there are only so many you can store in a database.  Chances are those
1kb strings represent chunks of data that can be uniquely identified
with no more than e.g.: 64 bits.  Then, just hash those 64 bits and be
done with it.  In the absence of assumptions on the hashed data, though,
you have no option other than to hash everything.

> And, even forgetting about equality with Array, Inteval>>#= and
> Interval>>#hash do not agree in Squeak and Pharo:
>
> (1 to: 10 by: 5)  = (1 to: 9 by: 5). -> true
> (1 to: 10 by: 5) hash = (1 to: 9 by: 5) hash. -> false
>
>  

Sometimes the end points of intervals are important... maybe it's better
to have equality fail than "fixing" hash in this case.

Andres.

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
Reply | Threaded
Open this post in threaded view
|

Re: (a = b) => (a hash = b hash)

Nicolas Cellier
2009/10/24 Andres Valloud <[hidden email]>:

> Nicolas,
>> Yes, large Array hash has severe performance penalty anyway.
>> Maybe we shall not hash each and every element.
>>
>
> I disagree.  The issue is that if hashing a very large array (or string)
> becomes a performance problem, then we should engineer applications so
> that we hash the metadata as opposed to the data.
>
> In other words, assume you're hashing arbitrary 1kb strings.  You can't
> have more than ~1 million of those in memory at any given time, and
> there are only so many you can store in a database.  Chances are those
> 1kb strings represent chunks of data that can be uniquely identified
> with no more than e.g.: 64 bits.  Then, just hash those 64 bits and be
> done with it.  In the absence of assumptions on the hashed data, though,
> you have no option other than to hash everything.
>

Good point.

Where is the balance between expensive hash and expensive collisions ?

The cost of collisions is
- perform hash and = on each successive keys untill hash is different
or = is true

Omitting a single byte in a ByteString leads to 256 possible collisions...

So with this POV, it is better to spend time in hash to reduce the
cost of worst-case distribution of keys with default hash algorithm.
My POV was to reduce cost for some distribution of keys, without
caring too much about worst case... One then has to provide its own
hash function for optimizing its problem... But maybe that was not a
good tradeoff...

In any case, restricting hash to leading characters is the worst
choice because it leads to late detection of string1 ~= string2 for
strings of equal hash...

>> And, even forgetting about equality with Array, Inteval>>#= and
>> Interval>>#hash do not agree in Squeak and Pharo:
>>
>>       (1 to: 10 by: 5)  = (1 to: 9 by: 5). -> true
>>       (1 to: 10 by: 5) hash = (1 to: 9 by: 5) hash. -> false
>>
>>
>
> Sometimes the end points of intervals are important... maybe it's better
> to have equality fail than "fixing" hash in this case.
>
> Andres.
>

Either changing Interval>>#= or Interval>>#hash is NECESSARY.
I note that Interval is used for Text selection, and in this context
(3 to: 2) is different from (4 to: 3), though they both equal #().
But this usage is related to another oddity (empty Interval have a
first and a last element !).

IMO, we should not let Array hash ~= Interval hash coexist with Array = Interval
Because it's like putting some traps on programmers path.
One day or the other an application will exhibit a non repeatable
error, just because the size of a Set was different and caused a
collision.

A general purpose library has to be robust, reliable and provide
repeatable behavior.
If behavior changes just because of allocated size of a container,
then it is a bad library.

I feel OK with the solution of having Array ~= Interval, if we have to
trade this feature against spoiling hashedCollection efficiency.
Of course, this could impact some code...

Nicolas

> _______________________________________________
> Pharo-project mailing list
> [hidden email]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
Reply | Threaded
Open this post in threaded view
|

Re: (a = b) => (a hash = b hash)

Prof. Andrew P. Black
I agree with Nicolas, but I think that it's really important that we have this discussion, reach a consensus, and them implement it!

That's why I posted the message here rather than just posting a bug report.

There used to be a method called hasSameElementsAs: (now called, inexplicably, sameElements:) that could be pressed into service so that two intervals, or an array and and interval, can have the same elements but still be unequal.  But the role of species was to define when things could be equal...

Maybe we need a method "hasSameSequenceOfElementsAs:" which would be applicable to any pair of  sequenceable collections, as well as being faster to implement, than sameElements: (which is currently quadratic).

On 25 Oct 2009, at 06:53, Nicolas Cellier wrote:


IMO, we should not let Array hash ~= Interval hash coexist with Array = Interval
Because it's like putting some traps on programmers path.
One day or the other an application will exhibit a non repeatable
error, just because the size of a Set was different and caused a
collision.


_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
Reply | Threaded
Open this post in threaded view
|

Re: (a = b) => (a hash = b hash)

Ralph Boland
In reply to this post by Prof. Andrew P. Black
> I agree with Nicolas, but I think that it's really important that we
> have this discussion, reach a consensus, and them implement it!
>
> That's why I posted the message here rather than just posting a bug
> report.
>
> There used to be a method called hasSameElementsAs: (now called,
> inexplicably, sameElements:) that could be pressed into service so
> that two intervals, or an array and and interval, can have the same
> elements but still be unequal.  But the role of species was to define
> when things could be equal...
>
> Maybe we need a method "hasSameSequenceOfElementsAs:" which would be
> applicable to any pair of  sequenceable collections, as well as being
> faster to implement, than sameElements: (which is currently quadratic).

I assume that by quadratic you mean a quadratic number of compares.
If you  implement hasSameElementsAs: by putting the elements of at least
one of your collections into a set and assume the include: operation for sets
is O(1) compares then the cost of hasSameElementsAs: is linear compares.
If you want to count duplicates then you need to use bags instead of sets.
Perhaps you already knew this.

Ralph Boland

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
Reply | Threaded
Open this post in threaded view
|

Re: (a = b) => (a hash = b hash)

Nicolas Cellier
In reply to this post by Prof. Andrew P. Black
2009/10/26 Andrew P. Black <[hidden email]>:

> I agree with Nicolas, but I think that it's really important that we have
> this discussion, reach a consensus, and them implement it!
> That's why I posted the message here rather than just posting a bug report.
> There used to be a method called hasSameElementsAs: (now called,
> inexplicably, sameElements:) that could be pressed into service so that two
> intervals, or an array and and interval, can have the same elements but
> still be unequal.  But the role of species was to define when things could
> be equal...
> Maybe we need a method "hasSameSequenceOfElementsAs:" which would be
> applicable to any pair of  sequenceable collections, as well as being faster
> to implement, than sameElements: (which is currently quadratic).
>

This is exactly SequenceableCollection>>#hasEqualElements:
I have no #sameElements: in core.

Nicolas


> On 25 Oct 2009, at 06:53, Nicolas Cellier wrote:
>
> IMO, we should not let Array hash ~= Interval hash coexist with Array =
> Interval
> Because it's like putting some traps on programmers path.
> One day or the other an application will exhibit a non repeatable
> error, just because the size of a Set was different and caused a
> collision.
>
> _______________________________________________
> Pharo-project mailing list
> [hidden email]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
Reply | Threaded
Open this post in threaded view
|

Re: (a = b) => (a hash = b hash)

Prof. Andrew P. Black
RIght on both counts.  sameElements: is in RoelTyper
On 26 Oct 2009, at 07:56, Nicolas Cellier wrote:

This is exactly SequenceableCollection>>#hasEqualElements:
I have no #sameElements: in core.

Nicolas

hasSameSequenceOfElements: is still a better name than hasEqualElements:

:-)

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
Reply | Threaded
Open this post in threaded view
|

Re: (a = b) => (a hash = b hash)

Stéphane Ducasse
In reply to this post by Prof. Andrew P. Black

On Oct 26, 2009, at 3:02 AM, Andrew P. Black wrote:

> I agree with Nicolas, but I think that it's really important that we  
> have this discussion, reach a consensus, and them implement it!

yes yes yes.
Communicating is the best way to increase the level of knowledge in  
the community.

>
> That's why I posted the message here rather than just posting a bug  
> report.
>
> There used to be a method called hasSameElementsAs: (now called,  
> inexplicably, sameElements:) that could be pressed into service so  
> that two intervals, or an array and and interval, can have the same  
> elements but still be unequal.  But the role of species was to  
> define when things could be equal...
>
> Maybe we need a method "hasSameSequenceOfElementsAs:" which would be  
> applicable to any pair of  sequenceable collections, as well as  
> being faster to implement, than sameElements: (which is currently  
> quadratic).
>
> On 25 Oct 2009, at 06:53, Nicolas Cellier wrote:
>
>>
>> IMO, we should not let Array hash ~= Interval hash coexist with  
>> Array = Interval
>> Because it's like putting some traps on programmers path.
>> One day or the other an application will exhibit a non repeatable
>> error, just because the size of a Set was different and caused a
>> collision.
>
> _______________________________________________
> Pharo-project mailing list
> [hidden email]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project


_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
Reply | Threaded
Open this post in threaded view
|

Re: (a = b) => (a hash = b hash)

Andres Valloud-4
In reply to this post by Nicolas Cellier
Nicolas,

> Where is the balance between expensive hash and expensive collisions ?
>
> The cost of collisions is
> - perform hash and = on each successive keys untill hash is different
> or = is true
>
> Omitting a single byte in a ByteString leads to 256 possible collisions...
>
> So with this POV, it is better to spend time in hash to reduce the
> cost of worst-case distribution of keys with default hash algorithm.
>  

In general, yes.

> My POV was to reduce cost for some distribution of keys, without
> caring too much about worst case... One then has to provide its own
> hash function for optimizing its problem... But maybe that was not a
> good tradeoff...
>
> In any case, restricting hash to leading characters is the worst
> choice because it leads to late detection of string1 ~= string2 for
> strings of equal hash...
>  

Yes.  Filenames and URLs are particularly problematic.  Fortunately they
tend to be O(100) in size, as opposed to O(1000) in size.  Thus, the
less demanding path of just hashing the characters is efficient enough
for the cases I've seen so far.

> IMO, we should not let Array hash ~= Interval hash coexist with Array = Interval
> Because it's like putting some traps on programmers path.
>  

Yes, I agree.  I'd vote on breaking #=, with the caveat that chances are
some code will be broken.  Does Pharo have a release strategy that
allows more fundamental changes to go in, as opposed to
maintenance/improvement fixes?

Andres.

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
Reply | Threaded
Open this post in threaded view
|

Re: (a = b) => (a hash = b hash)

Stéphane Ducasse
>
> Yes.  Filenames and URLs are particularly problematic.  Fortunately  
> they
> tend to be O(100) in size, as opposed to O(1000) in size.  Thus, the
> less demanding path of just hashing the characters is efficient enough
> for the cases I've seen so far.
>
>> IMO, we should not let Array hash ~= Interval hash coexist with  
>> Array = Interval
>> Because it's like putting some traps on programmers path.
>>
>
> Yes, I agree.  I'd vote on breaking #=, with the caveat that chances  
> are
> some code will be broken.  Does Pharo have a release strategy that
> allows more fundamental changes to go in, as opposed to
> maintenance/improvement fixes?

yes we talk then we summarize and describe a bug entry and
Now I'm thinking that it would be good to have a page with important  
changes per version
I would put there the discussions about the float changes we did in 1.0.

Stef

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project