[pharo-project/pharo-core] 974630: 50520

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
28 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[pharo-project/pharo-core] 974630: 50520

Eliot Miranda-3
  Branch: refs/heads/5.0
  Home:   https://github.com/pharo-project/pharo-core
  Commit: 974630dc178b6b59156dac23623fe2dc9f98446c
      https://github.com/pharo-project/pharo-core/commit/974630dc178b6b59156dac23623fe2dc9f98446c
  Author: Jenkins Build Server <[hidden email]>
  Date:   2016-01-10 (Sun, 10 Jan 2016)

  Changed paths:
    A Morphic-Base.package/LayoutFrame.class/class/costly instance creation/fractions_.st
    A Morphic-Base.package/LayoutFrame.class/class/costly instance creation/fractions_offsets_.st
    A Morphic-Base.package/LayoutFrame.class/class/costly instance creation/offsets_.st
    R Morphic-Base.package/LayoutFrame.class/class/instance creation/fractions_.st
    R Morphic-Base.package/LayoutFrame.class/class/instance creation/fractions_offsets_.st
    R Morphic-Base.package/LayoutFrame.class/class/instance creation/offsets_.st
    M Morphic-Base.package/LayoutFrame.class/instance/accessing/bottomFraction.st
    A Morphic-Base.package/LayoutFrame.class/instance/comparing/=.st
    A Morphic-Base.package/LayoutFrame.class/instance/comparing/hash.st
    A Morphic-Tests.package/LayoutFrameTest.class/instance/tests - conversion/testIdentity.st
    M Morphic-Tests.package/PaginatedMorphTreeMorphTests.class/instance/tests/testPager.st
    M Morphic-Widgets-Tree.package/PaginatedMorphTreeModel.class/class/examples/example.st
    R ScriptLoader50.package/ScriptLoader.class/instance/pharo - scripts/script50519.st
    A ScriptLoader50.package/ScriptLoader.class/instance/pharo - scripts/script50520.st
    R ScriptLoader50.package/ScriptLoader.class/instance/pharo - updates/update50519.st
    A ScriptLoader50.package/ScriptLoader.class/instance/pharo - updates/update50520.st
    M ScriptLoader50.package/ScriptLoader.class/instance/public/commentForCurrentUpdate.st

  Log Message:
  -----------
  50520
17361 Better LayoutFrame class comments and class method comments are needed
        https://pharo.fogbugz.com/f/cases/17361

17363 (LayoutFrame fractions: (0 @ 0 corner: 1 @ 1)) = LayoutFrame identity should be true and it is not.
        https://pharo.fogbugz.com/f/cases/17363

17364 PaginateMorph uses Wrong LayoutFrame API
        https://pharo.fogbugz.com/f/cases/17364

http://files.pharo.org/image/50/50520.zip


Reply | Threaded
Open this post in threaded view
|

Bad layoutFrame>>#hash

David Allouche
I happened to look at the new code for LayoutFrame>>#hash

hash
        ^self species hash bitXor:
                (self leftFraction bitXor:
                (self leftOffset bitXor:
                (self topFraction bitXor:
                (self topOffset bitXor:
                (self rightFraction bitXor:
                (self rightOffset bitXor:
                (self bottomFraction bitXor:
                self bottomOffset)))))))

This is a terrible, terrible way to compute a hash.

0 xor 0 = 1 xor 1 = 2 xor 2 = etc.

NEVER compute hashes with xor, that's wrong, and it causes horrible, hard to debug, performance bugs down the road.

SequenceableCollection>>#hash looks more like a correct way of doing it. I do not know what is the correct, efficient way to reuse this logic in LayoutFrame>>#hash.

By the way, how do you guys do code reviews for code integrated into the core?


Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

David Allouche
Since it's not obvious why xor-ing is a bad way to produce hash, here is a link that gives some background

https://en.wikipedia.org/wiki/Hash_function#Uniformity

> On 11 Jan 2016, at 10:01, David Allouche <[hidden email]> wrote:
>
> I happened to look at the new code for LayoutFrame>>#hash
>
> hash
> ^self species hash bitXor:
> (self leftFraction bitXor:
> (self leftOffset bitXor:
> (self topFraction bitXor:
> (self topOffset bitXor:
> (self rightFraction bitXor:
> (self rightOffset bitXor:
> (self bottomFraction bitXor:
> self bottomOffset)))))))
>
> This is a terrible, terrible way to compute a hash.
>
> 0 xor 0 = 1 xor 1 = 2 xor 2 = etc.
>
> NEVER compute hashes with xor, that's wrong, and it causes horrible, hard to debug, performance bugs down the road.
>
> SequenceableCollection>>#hash looks more like a correct way of doing it. I do not know what is the correct, efficient way to reuse this logic in LayoutFrame>>#hash.
>
> By the way, how do you guys do code reviews for code integrated into the core?
>


Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

Sven Van Caekenberghe-2
In reply to this post by David Allouche
David,

> On 11 Jan 2016, at 10:01, David Allouche <[hidden email]> wrote:
>
> I happened to look at the new code for LayoutFrame>>#hash
>
> hash
> ^self species hash bitXor:
> (self leftFraction bitXor:
> (self leftOffset bitXor:
> (self topFraction bitXor:
> (self topOffset bitXor:
> (self rightFraction bitXor:
> (self rightOffset bitXor:
> (self bottomFraction bitXor:
> self bottomOffset)))))))
>
> This is a terrible, terrible way to compute a hash.
>
> 0 xor 0 = 1 xor 1 = 2 xor 2 = etc.
>
> NEVER compute hashes with xor, that's wrong, and it causes horrible, hard to debug, performance bugs down the road.

Yeah, that doesn't look very good ;-)

> SequenceableCollection>>#hash looks more like a correct way of doing it. I do not know what is the correct, efficient way to reuse this logic in LayoutFrame>>#hash.

There is actually a 2 part book just about hashing in Smalltalk:

http://www.lulu.com/shop/andres-valloud/hashing-in-smalltalk-theory-and-practice/paperback/product-3788892.html

It is said to be very good, I sadly have not yet read it ;-)

> By the way, how do you guys do code reviews for code integrated into the core?

'Pharo is yours' which means: if you encounter something odd, you are encouraged to improve it. Maybe first discuss it on the ML. If necessary create an issue. Better provide a fix.

Sven




Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

Nicolas Cellier
Yep, a must read if you want to improve the hash methods.

2016-01-11 10:09 GMT+01:00 Sven Van Caekenberghe <[hidden email]>:
David,

> On 11 Jan 2016, at 10:01, David Allouche <[hidden email]> wrote:
>
> I happened to look at the new code for LayoutFrame>>#hash
>
> hash
>       ^self species hash bitXor:
>               (self leftFraction bitXor:
>               (self leftOffset bitXor:
>               (self topFraction bitXor:
>               (self topOffset bitXor:
>               (self rightFraction bitXor:
>               (self rightOffset bitXor:
>               (self bottomFraction bitXor:
>               self bottomOffset)))))))
>
> This is a terrible, terrible way to compute a hash.
>
> 0 xor 0 = 1 xor 1 = 2 xor 2 = etc.
>
> NEVER compute hashes with xor, that's wrong, and it causes horrible, hard to debug, performance bugs down the road.

Yeah, that doesn't look very good ;-)

> SequenceableCollection>>#hash looks more like a correct way of doing it. I do not know what is the correct, efficient way to reuse this logic in LayoutFrame>>#hash.

There is actually a 2 part book just about hashing in Smalltalk:

http://www.lulu.com/shop/andres-valloud/hashing-in-smalltalk-theory-and-practice/paperback/product-3788892.html

It is said to be very good, I sadly have not yet read it ;-)

> By the way, how do you guys do code reviews for code integrated into the core?

'Pharo is yours' which means: if you encounter something odd, you are encouraged to improve it. Maybe first discuss it on the ML. If necessary create an issue. Better provide a fix.

Sven





Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

stepharo
In reply to this post by David Allouche
You see K. Beck in his book smalltalk best pratices promotes this form.
So can you propose a better solution so that we learn?

Stef

Le 11/1/16 10:09, David Allouche a écrit :

> Since it's not obvious why xor-ing is a bad way to produce hash, here is a link that gives some background
>
> https://en.wikipedia.org/wiki/Hash_function#Uniformity
>
>> On 11 Jan 2016, at 10:01, David Allouche <[hidden email]> wrote:
>>
>> I happened to look at the new code for LayoutFrame>>#hash
>>
>> hash
>> ^self species hash bitXor:
>> (self leftFraction bitXor:
>> (self leftOffset bitXor:
>> (self topFraction bitXor:
>> (self topOffset bitXor:
>> (self rightFraction bitXor:
>> (self rightOffset bitXor:
>> (self bottomFraction bitXor:
>> self bottomOffset)))))))
>>
>> This is a terrible, terrible way to compute a hash.
>>
>> 0 xor 0 = 1 xor 1 = 2 xor 2 = etc.
>>
>> NEVER compute hashes with xor, that's wrong, and it causes horrible, hard to debug, performance bugs down the road.
>>
>> SequenceableCollection>>#hash looks more like a correct way of doing it. I do not know what is the correct, efficient way to reuse this logic in LayoutFrame>>#hash.
>>
>> By the way, how do you guys do code reviews for code integrated into the core?
>>
>
>


Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

stepharo
In reply to this post by Nicolas Cellier
Ok but be concrete. What will be the solution for this particular case?

Stef

Le 11/1/16 22:30, Nicolas Cellier a écrit :
Yep, a must read if you want to improve the hash methods.

2016-01-11 10:09 GMT+01:00 Sven Van Caekenberghe <[hidden email]>:
David,

> On 11 Jan 2016, at 10:01, David Allouche <[hidden email]> wrote:
>
> I happened to look at the new code for LayoutFrame>>#hash
>
> hash
>       ^self species hash bitXor:
>               (self leftFraction bitXor:
>               (self leftOffset bitXor:
>               (self topFraction bitXor:
>               (self topOffset bitXor:
>               (self rightFraction bitXor:
>               (self rightOffset bitXor:
>               (self bottomFraction bitXor:
>               self bottomOffset)))))))
>
> This is a terrible, terrible way to compute a hash.
>
> 0 xor 0 = 1 xor 1 = 2 xor 2 = etc.
>
> NEVER compute hashes with xor, that's wrong, and it causes horrible, hard to debug, performance bugs down the road.

Yeah, that doesn't look very good ;-)

> SequenceableCollection>>#hash looks more like a correct way of doing it. I do not know what is the correct, efficient way to reuse this logic in LayoutFrame>>#hash.

There is actually a 2 part book just about hashing in Smalltalk:

http://www.lulu.com/shop/andres-valloud/hashing-in-smalltalk-theory-and-practice/paperback/product-3788892.html

It is said to be very good, I sadly have not yet read it ;-)

> By the way, how do you guys do code reviews for code integrated into the core?

'Pharo is yours' which means: if you encounter something odd, you are encouraged to improve it. Maybe first discuss it on the ML. If necessary create an issue. Better provide a fix.

Sven






Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

stepharo
In reply to this post by David Allouche
Hi david

I rad the wikipedia page you mention and this looks a bit obvious to me.
I knew this.
now it does not really help fixing the problem you mention.

Stef

Le 11/1/16 10:09, David Allouche a écrit :

> Since it's not obvious why xor-ing is a bad way to produce hash, here is a link that gives some background
>
> https://en.wikipedia.org/wiki/Hash_function#Uniformity
>
>> On 11 Jan 2016, at 10:01, David Allouche <[hidden email]> wrote:
>>
>> I happened to look at the new code for LayoutFrame>>#hash
>>
>> hash
>> ^self species hash bitXor:
>> (self leftFraction bitXor:
>> (self leftOffset bitXor:
>> (self topFraction bitXor:
>> (self topOffset bitXor:
>> (self rightFraction bitXor:
>> (self rightOffset bitXor:
>> (self bottomFraction bitXor:
>> self bottomOffset)))))))
>>
>> This is a terrible, terrible way to compute a hash.
>>
>> 0 xor 0 = 1 xor 1 = 2 xor 2 = etc.
>>
>> NEVER compute hashes with xor, that's wrong, and it causes horrible, hard to debug, performance bugs down the road.
>>
>> SequenceableCollection>>#hash looks more like a correct way of doing it. I do not know what is the correct, efficient way to reuse this logic in LayoutFrame>>#hash.
>>
>> By the way, how do you guys do code reviews for code integrated into the core?
>>
>
>


Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

David Allouche

On 12 Jan 2016, at 20:59, stepharo <[hidden email]> wrote:

Hi david

I rad the wikipedia page you mention and this looks a bit obvious to me. I knew this.
now it does not really help fixing the problem you mention.

Stef

Since you asked for it…

The #hash method is used in Pharo to separate objects in bucket-based data structures.

It must have one required property which is required data integrity:

  • A. Objects that are equal according to #= must have equal hashes.

It should have two desirable properties that are required to support the algorithmic assumptions of bucket-based data structures:

  • B. Objects that are not equal should have different hashes. Ideally, the probability of hash collision should by 1/N where N is the size of hash value space.
  • C. The hash values should spread as much as possible over their value space. For example: ProtoObject>>#identityHash.

How these two desirable properties can be implemented depend on the actual distribution of the value used to compute the hash.

An object can safely use XOR hashing if its instance variables are all different objects that already provide a hash method with these two properties. For example, an object whose behaviour is defined by several of singleton delegates of different classes that do not override Object>>#hash.

But if the instance variables are not guaranteed to have those properties, it is necessary to more thoroughly "mix the bits". For example: SequenceableCollection>>#hash.

hash
| hash |

hash := self species hash.
1 to: self size do: [:i | hash := (hash + (self at: i) hash) hashMultiply].
^hash

The mixing of the bits is done by the combination of addition and hashMultiply. The value of "species hash" does not need to be processed by hashMultiply, because it is probably computed by ProtoObject>>identityHash, and that already provides C.

LayoutFrame is a particularly good example of a data structure that should not use XOR hashing. Its instance variables provide none of the required properties: SmallInteger>>#hash is just ^self. In addition, common LayoutFrame instances tend to use pairs of identical values in their instance variables. Like 0@0 corner: 1@1, or Margin fromNumber: 10.

A good hash function for LayoutFrame needs to:

  1. Get hashes for each instance variable that provides A and B.
  2. Combine them in a way that is order-sensitive (to maintain B) and that does some extra mixing (to provide C).

Now, we can assume that common values for the instance variables will be instances of SmallInteger, Float, or Fraction.

By the way, you can see in Fraction>>#hash, that a comment mentions the assumption that the fraction is already reduced, that is what makes it acceptable to use bitXor.

The hash of SmallInteger does provide A and B, but not C. The hash of Float is harder to understand, but tests show that it provide distinct values for 0.5, 0.25 and 0.125. So it hopefully provides B for the range of values of interest to LayoutFrame.

Another class that has similar hashing constraints to LayoutFrame is Point. Its hash method is:

hash
"Hash is reimplemented because = is implemented."

^(x hash hashMultiply + y hash) hashMultiply

It does not include species in the computation, which is less than ideal because it favours collisions with objects of different species that have a similar content and a the same #hash algorithm.

Finally, it is probably not necessary or useful to use the accessors to get to the instance variables.

So a good implementation would be this (untested, there might be a typo or two)

hash
| hash |
hash := self species hash
hash := (hash + leftFraction hash) hashMultiply
hash := (hash + leftOffset hash) hashMultiply
hash := (hash + topFraction hash) hashMultiply
hash := (hash + topOffset hash) hashMultiply
hash := (hash + rightFraction hash) hashMultiply
hash := (hash + rightOffset hash) hashMultiply
hash := (hash + bottomFraction hash) hashMultiply
hash := (hash + bottomOffset hash) hashMultiply
^ hash

I am sure you could have figured that by yourself, by thinking about how hash values are used and by looking at the code of kernel classes, instead of getting offended and turning all defensive.

I hope we can both learn from this episode. I do not enjoy antagonising people, but I will not write lengthy messages like this one every time someone questions my thinking.

Have a nice day.

Le 11/1/16 10:09, David Allouche a écrit :
Since it's not obvious why xor-ing is a bad way to produce hash, here is a link that gives some background

https://en.wikipedia.org/wiki/Hash_function#Uniformity

On 11 Jan 2016, at 10:01, David Allouche <[hidden email]> wrote:

I happened to look at the new code for LayoutFrame>>#hash

hash
^self species hash bitXor:
(self leftFraction bitXor:
(self leftOffset bitXor:
(self topFraction bitXor:
(self topOffset bitXor:
(self rightFraction bitXor:
(self rightOffset bitXor:
(self bottomFraction bitXor:
self bottomOffset)))))))

This is a terrible, terrible way to compute a hash.

0 xor 0 = 1 xor 1 = 2 xor 2 = etc.

NEVER compute hashes with xor, that's wrong, and it causes horrible, hard to debug, performance bugs down the road.

SequenceableCollection>>#hash looks more like a correct way of doing it. I do not know what is the correct, efficient way to reuse this logic in LayoutFrame>>#hash.

By the way, how do you guys do code reviews for code integrated into the core?






Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

Stephan Eggermont-3
Name:
SLICE-Issue-17363-LayoutFrame-fractions-0--0-corner-1--1--LayoutFrame-identity-should-be-true-and-it-is-not-StephanEggermont.2
Author: StephanEggermont
Time: 13 January 2016, 6:06:01.571074 pm
UUID: 9a5e9ae3-1357-4bfb-921a-3762d57402fd
Ancestors:
Dependencies: Morphic-Base-StephanEggermont.529

Better hash both in distribution and in working with Float coordinates



hash
        ^(((((((( self species hash
                + leftFraction hash) hashMultiply
                + leftOffset hash) hashMultiply
                + topFraction hash) hashMultiply
                + topOffset hash) hashMultiply
                + rightFraction hash) hashMultiply
                + rightOffset hash) hashMultiply
                + bottomFraction hash) hashMultiply
                + bottomOffset hash) hashMultiply



Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

David Allouche

> On 13 Jan 2016, at 18:09, Stephan Eggermont <[hidden email]> wrote:
>
> Name: SLICE-Issue-17363-LayoutFrame-fractions-0--0-corner-1--1--LayoutFrame-identity-should-be-true-and-it-is-not-StephanEggermont.2
> Author: StephanEggermont
> Time: 13 January 2016, 6:06:01.571074 pm
> UUID: 9a5e9ae3-1357-4bfb-921a-3762d57402fd
> Ancestors:
> Dependencies: Morphic-Base-StephanEggermont.529
>
> Better hash both in distribution and in working with Float coordinates
>
>
>
> hash
> ^(((((((( self species hash
> + leftFraction hash) hashMultiply
> + leftOffset hash) hashMultiply
> + topFraction hash) hashMultiply
> + topOffset hash) hashMultiply
> + rightFraction hash) hashMultiply
> + rightOffset hash) hashMultiply
> + bottomFraction hash) hashMultiply
> + bottomOffset hash) hashMultiply

Thank you for integrating the fix.

I just realised I forgot to put in "." expression separators in my code.

Why did you use a single expression instead of using a temporary variable? I find it harder to understand.



Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

Nicolas Cellier
In reply to this post by David Allouche


2016-01-13 17:22 GMT+01:00 David Allouche <[hidden email]>:

On 12 Jan 2016, at 20:59, stepharo <[hidden email]> wrote:

Hi david

I rad the wikipedia page you mention and this looks a bit obvious to me. I knew this.
now it does not really help fixing the problem you mention.

Stef

Since you asked for it…

The #hash method is used in Pharo to separate objects in bucket-based data structures.

It must have one required property which is required data integrity:

  • A. Objects that are equal according to #= must have equal hashes.

It should have two desirable properties that are required to support the algorithmic assumptions of bucket-based data structures:

  • B. Objects that are not equal should have different hashes. Ideally, the probability of hash collision should by 1/N where N is the size of hash value space.
  • C. The hash values should spread as much as possible over their value space. For example: ProtoObject>>#identityHash.

How these two desirable properties can be implemented depend on the actual distribution of the value used to compute the hash.

An object can safely use XOR hashing if its instance variables are all different objects that already provide a hash method with these two properties. For example, an object whose behaviour is defined by several of singleton delegates of different classes that do not override Object>>#hash.

But if the instance variables are not guaranteed to have those properties, it is necessary to more thoroughly "mix the bits". For example: SequenceableCollection>>#hash.

hash
| hash |

hash := self species hash.
1 to: self size do: [:i | hash := (hash + (self at: i) hash) hashMultiply].
^hash

The mixing of the bits is done by the combination of addition and hashMultiply. The value of "species hash" does not need to be processed by hashMultiply, because it is probably computed by ProtoObject>>identityHash, and that already provides C.

LayoutFrame is a particularly good example of a data structure that should not use XOR hashing. Its instance variables provide none of the required properties: SmallInteger>>#hash is just ^self. In addition, common LayoutFrame instances tend to use pairs of identical values in their instance variables. Like 0@0 corner: 1@1, or Margin fromNumber: 10.

A good hash function for LayoutFrame needs to:

  1. Get hashes for each instance variable that provides A and B.
  2. Combine them in a way that is order-sensitive (to maintain B) and that does some extra mixing (to provide C).

Now, we can assume that common values for the instance variables will be instances of SmallInteger, Float, or Fraction.

By the way, you can see in Fraction>>#hash, that a comment mentions the assumption that the fraction is already reduced, that is what makes it acceptable to use bitXor.

The hash of SmallInteger does provide A and B, but not C. The hash of Float is harder to understand, but tests show that it provide distinct values for 0.5, 0.25 and 0.125. So it hopefully provides B for the range of values of interest to LayoutFrame.


Note that Squeak has reduced the number of collisions for floats by not erasing the most significant bits of each word.
I've tested the squeak implementation with Andres Valloud tool in VW, it does not perform so bad for the sets I tested in term of collisions...

Float>>hash
    "Hash is reimplemented because = is implemented. Both words of the float are used. (The bitShift:'s ensure that the intermediate results do not become a large integer.) Care is taken to answer same hash as an equal Integer."

    (self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self truncated hash].
    ^ ((self basicAt: 1) bitShift: -4) +
       ((self basicAt: 2) bitShift: -4)

I also changed Fraction to use a hashMultiply and made a more explicit isAnExactFloat test.

Fraction>>hash
    "Hash is reimplemented because = is implemented."
   
    "Care is taken that a Fraction equal to a Float also has an equal hash"
    self isAnExactFloat ifTrue: [^self asExactFloat hash].
   
    "Else, I cannot be exactly equal to a Float, use own hash algorithm."
    ^numerator hash hashMultiply bitXor: denominator hash
 
Fraction>>isAnExactFloat
    "Answer true if this Fraction can be converted exactly to a Float"
    ^ denominator isPowerOfTwo
        and: ["I have a reasonable significand: not too big"
            numerator highBitOfMagnitude <= Float precision
                and: ["I have a reasonable exponent: not too small"
                    Float emin + denominator highBitOfMagnitude <= Float precision]]

Fraction>>asExactFloat
    "When we know that this Fraction is an exact Float, this conversion is much faster than asFloat."

    ^numerator asFloat timesTwoPower: 1 - denominator highBit

Not that Fraction should allways be reduced (this is an invariant of the class, I hope well defined in class comment).

Integer hash is not that good because the risk of occupying consecutive buckets is high... I wonder if returning ^self hashMultiply would not arrange things here and reduce the requirements for spreading hashMultiply everywhere else?

Nicolas

Another class that has similar hashing constraints to LayoutFrame is Point. Its hash method is:

hash
"Hash is reimplemented because = is implemented."

^(x hash hashMultiply + y hash) hashMultiply

It does not include species in the computation, which is less than ideal because it favours collisions with objects of different species that have a similar content and a the same #hash algorithm.

Finally, it is probably not necessary or useful to use the accessors to get to the instance variables.

So a good implementation would be this (untested, there might be a typo or two)

hash
| hash |
hash := self species hash
hash := (hash + leftFraction hash) hashMultiply
hash := (hash + leftOffset hash) hashMultiply
hash := (hash + topFraction hash) hashMultiply
hash := (hash + topOffset hash) hashMultiply
hash := (hash + rightFraction hash) hashMultiply
hash := (hash + rightOffset hash) hashMultiply
hash := (hash + bottomFraction hash) hashMultiply
hash := (hash + bottomOffset hash) hashMultiply
^ hash

I am sure you could have figured that by yourself, by thinking about how hash values are used and by looking at the code of kernel classes, instead of getting offended and turning all defensive.

I hope we can both learn from this episode. I do not enjoy antagonising people, but I will not write lengthy messages like this one every time someone questions my thinking.

Have a nice day.

Le 11/1/16 10:09, David Allouche a écrit :
Since it's not obvious why xor-ing is a bad way to produce hash, here is a link that gives some background

https://en.wikipedia.org/wiki/Hash_function#Uniformity

On 11 Jan 2016, at 10:01, David Allouche <[hidden email]> wrote:

I happened to look at the new code for LayoutFrame>>#hash

hash
^self species hash bitXor:
(self leftFraction bitXor:
(self leftOffset bitXor:
(self topFraction bitXor:
(self topOffset bitXor:
(self rightFraction bitXor:
(self rightOffset bitXor:
(self bottomFraction bitXor:
self bottomOffset)))))))

This is a terrible, terrible way to compute a hash.

0 xor 0 = 1 xor 1 = 2 xor 2 = etc.

NEVER compute hashes with xor, that's wrong, and it causes horrible, hard to debug, performance bugs down the road.

SequenceableCollection>>#hash looks more like a correct way of doing it. I do not know what is the correct, efficient way to reuse this logic in LayoutFrame>>#hash.

By the way, how do you guys do code reviews for code integrated into the core?







Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

David Allouche

On 13 Jan 2016, at 19:31, Nicolas Cellier <[hidden email]> wrote:

Note that Squeak has reduced the number of collisions for floats by not erasing the most significant bits of each word.
I've tested the squeak implementation with Andres Valloud tool in VW, it does not perform so bad for the sets I tested in term of collisions...

Float>>hash
    "Hash is reimplemented because = is implemented. Both words of the float are used. (The bitShift:'s ensure that the intermediate results do not become a large integer.) Care is taken to answer same hash as an equal Integer."

    (self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self truncated hash].
    ^ ((self basicAt: 1) bitShift: -4) +
       ((self basicAt: 2) bitShift: -4)

I also changed Fraction to use a hashMultiply and made a more explicit isAnExactFloat test.

Fraction>>hash
    "Hash is reimplemented because = is implemented."
    
    "Care is taken that a Fraction equal to a Float also has an equal hash"
    self isAnExactFloat ifTrue: [^self asExactFloat hash].
    
    "Else, I cannot be exactly equal to a Float, use own hash algorithm."
    ^numerator hash hashMultiply bitXor: denominator hash
 
Fraction>>isAnExactFloat
    "Answer true if this Fraction can be converted exactly to a Float"
    ^ denominator isPowerOfTwo
        and: ["I have a reasonable significand: not too big"
            numerator highBitOfMagnitude <= Float precision
                and: ["I have a reasonable exponent: not too small"
                    Float emin + denominator highBitOfMagnitude <= Float precision]]

Fraction>>asExactFloat
    "When we know that this Fraction is an exact Float, this conversion is much faster than asFloat."

    ^numerator asFloat timesTwoPower: 1 - denominator highBit

Not that Fraction should allways be reduced (this is an invariant of the class, I hope well defined in class comment).

Integer hash is not that good because the risk of occupying consecutive buckets is high... I wonder if returning ^self hashMultiply would not arrange things here and reduce the requirements for spreading hashMultiply everywhere else?

The risk does not really lie in occupying consecutive buckets. The risk is rather that integers used as keys could have a pattern that cause bucket collisions. For example, if all key are multiple of 100, and the bucket count is 2, 4, 5, 10, 20, 25 or 50.

It would make more sense to solve this problem by using multiplyHash in the bucket data structure classes, instead of having to pay this computation cost in SmallInteger.

Also, using it in SmallInteger would not dispense from using it in the #hash method of classes that use SmallInteger, because it matters that each component of the hash is processed a different number of times by multiplyHash.

Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

stepharo
In reply to this post by David Allouche
Thanks for your post it was really nice.

I am sure you could have figured that by yourself,
If I asked this is that I could not. And I read the book on hash numbers.

by thinking about how hash values are used and by looking at the code of kernel classes, instead of getting offended and turning all defensive.
I do not know where you see me offended and defensive. I asked because the wikipedia page was giving
no information that I could use.
So please do not read between the lines else tell me I can also ignore your emails.

Stef


I hope we can both learn from this episode. I do not enjoy antagonising people, but I will not write lengthy messages like this one every time someone questions my thinking.

Have a nice day.

Le 11/1/16 10:09, David Allouche a écrit :
Since it's not obvious why xor-ing is a bad way to produce hash, here is a link that gives some background

https://en.wikipedia.org/wiki/Hash_function#Uniformity

On 11 Jan 2016, at 10:01, David Allouche [hidden email] wrote:

I happened to look at the new code for LayoutFrame>>#hash

hash
^self species hash bitXor:
(self leftFraction bitXor:
(self leftOffset bitXor:
(self topFraction bitXor:
(self topOffset bitXor:
(self rightFraction bitXor:
(self rightOffset bitXor:
(self bottomFraction bitXor:
self bottomOffset)))))))

This is a terrible, terrible way to compute a hash.

0 xor 0 = 1 xor 1 = 2 xor 2 = etc.

NEVER compute hashes with xor, that's wrong, and it causes horrible, hard to debug, performance bugs down the road.

SequenceableCollection>>#hash looks more like a correct way of doing it. I do not know what is the correct, efficient way to reuse this logic in LayoutFrame>>#hash.

By the way, how do you guys do code reviews for code integrated into the core?







Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

stepharo
In reply to this post by David Allouche
Hi david

I would like to turn your post into a chapter for a forthcoming book so that people can read and get the idea.
I did not think about the same object. Is it ok for you?

Stef

PS: May be this was obviosu for you that we should all get this out of your previous mail but it was not my case
and not after 8 hours on intense work.



Le 13/1/16 17:22, David Allouche a écrit :

On 12 Jan 2016, at 20:59, stepharo <[hidden email]> wrote:

Hi david

I rad the wikipedia page you mention and this looks a bit obvious to me. I knew this.
now it does not really help fixing the problem you mention.

Stef

Since you asked for it…

The #hash method is used in Pharo to separate objects in bucket-based data structures.

It must have one required property which is required data integrity:

  • A. Objects that are equal according to #= must have equal hashes.

It should have two desirable properties that are required to support the algorithmic assumptions of bucket-based data structures:

  • B. Objects that are not equal should have different hashes. Ideally, the probability of hash collision should by 1/N where N is the size of hash value space.
  • C. The hash values should spread as much as possible over their value space. For example: ProtoObject>>#identityHash.

How these two desirable properties can be implemented depend on the actual distribution of the value used to compute the hash.

An object can safely use XOR hashing if its instance variables are all different objects that already provide a hash method with these two properties. For example, an object whose behaviour is defined by several of singleton delegates of different classes that do not override Object>>#hash.

But if the instance variables are not guaranteed to have those properties, it is necessary to more thoroughly "mix the bits". For example: SequenceableCollection>>#hash.

hash
| hash |

hash := self species hash.
1 to: self size do: [:i | hash := (hash + (self at: i) hash) hashMultiply].
^hash

The mixing of the bits is done by the combination of addition and hashMultiply. The value of "species hash" does not need to be processed by hashMultiply, because it is probably computed by ProtoObject>>identityHash, and that already provides C.

LayoutFrame is a particularly good example of a data structure that should not use XOR hashing. Its instance variables provide none of the required properties: SmallInteger>>#hash is just ^self. In addition, common LayoutFrame instances tend to use pairs of identical values in their instance variables. Like 0@0 corner: 1@1, or Margin fromNumber: 10.

A good hash function for LayoutFrame needs to:

  1. Get hashes for each instance variable that provides A and B.
  2. Combine them in a way that is order-sensitive (to maintain B) and that does some extra mixing (to provide C).

Now, we can assume that common values for the instance variables will be instances of SmallInteger, Float, or Fraction.

By the way, you can see in Fraction>>#hash, that a comment mentions the assumption that the fraction is already reduced, that is what makes it acceptable to use bitXor.

The hash of SmallInteger does provide A and B, but not C. The hash of Float is harder to understand, but tests show that it provide distinct values for 0.5, 0.25 and 0.125. So it hopefully provides B for the range of values of interest to LayoutFrame.

Another class that has similar hashing constraints to LayoutFrame is Point. Its hash method is:

hash
"Hash is reimplemented because = is implemented."

^(x hash hashMultiply + y hash) hashMultiply

It does not include species in the computation, which is less than ideal because it favours collisions with objects of different species that have a similar content and a the same #hash algorithm.

Finally, it is probably not necessary or useful to use the accessors to get to the instance variables.

So a good implementation would be this (untested, there might be a typo or two)

hash
| hash |
hash := self species hash
hash := (hash + leftFraction hash) hashMultiply
hash := (hash + leftOffset hash) hashMultiply
hash := (hash + topFraction hash) hashMultiply
hash := (hash + topOffset hash) hashMultiply
hash := (hash + rightFraction hash) hashMultiply
hash := (hash + rightOffset hash) hashMultiply
hash := (hash + bottomFraction hash) hashMultiply
hash := (hash + bottomOffset hash) hashMultiply
^ hash

I am sure you could have figured that by yourself, by thinking about how hash values are used and by looking at the code of kernel classes, instead of getting offended and turning all defensive.

I hope we can both learn from this episode. I do not enjoy antagonising people, but I will not write lengthy messages like this one every time someone questions my thinking.

Have a nice day.

Le 11/1/16 10:09, David Allouche a écrit :
Since it's not obvious why xor-ing is a bad way to produce hash, here is a link that gives some background

https://en.wikipedia.org/wiki/Hash_function#Uniformity

On 11 Jan 2016, at 10:01, David Allouche [hidden email] wrote:

I happened to look at the new code for LayoutFrame>>#hash

hash
^self species hash bitXor:
(self leftFraction bitXor:
(self leftOffset bitXor:
(self topFraction bitXor:
(self topOffset bitXor:
(self rightFraction bitXor:
(self rightOffset bitXor:
(self bottomFraction bitXor:
self bottomOffset)))))))

This is a terrible, terrible way to compute a hash.

0 xor 0 = 1 xor 1 = 2 xor 2 = etc.

NEVER compute hashes with xor, that's wrong, and it causes horrible, hard to debug, performance bugs down the road.

SequenceableCollection>>#hash looks more like a correct way of doing it. I do not know what is the correct, efficient way to reuse this logic in LayoutFrame>>#hash.

By the way, how do you guys do code reviews for code integrated into the core?







Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

Stephan Eggermont-3
In reply to this post by David Allouche
On 13-01-16 18:38, David Allouche wrote:
> Why did you use a single expression instead of using a temporary variable? I find it harder to understand.

I did find it very slightly easier to understand. It is not a strong
opinion however.

The ideomatic smalltalk would be

^#(leftFraction leftOffset topFraction topOffset rightFraction
rightOffset bottomFraction bottomOffset)
        inject: self species hash into: [:hashValue :selector |
                (hashValue + self perform: selector) hashMultiply ]

but that would be slower.

I would be tempted to add a binary message

Integer>>+** aNumber
        ^(aNumber +self hash) hashMultiply

(I'd prefer +#*, but that doesn't parse)

hash
        ^self species hash +**
        leftFraction +** leftOffset +**
        topFraction +** topOffset +**
        rightFraction +** rightOffset +**
        bottomFraction +** bottomOffset



Stephan


Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

stepharo
In reply to this post by Nicolas Cellier
Nicolas

two questions:
    - should we integrate your float changes in Pharo?
    - I would like to get a chapter explaining the points raised by david, would you like to review it?

Stef

Le 13/1/16 19:31, Nicolas Cellier a écrit :


2016-01-13 17:22 GMT+01:00 David Allouche <[hidden email]>:

On 12 Jan 2016, at 20:59, stepharo <[hidden email]> wrote:

Hi david

I rad the wikipedia page you mention and this looks a bit obvious to me. I knew this.
now it does not really help fixing the problem you mention.

Stef

Since you asked for it…

The #hash method is used in Pharo to separate objects in bucket-based data structures.

It must have one required property which is required data integrity:

  • A. Objects that are equal according to #= must have equal hashes.

It should have two desirable properties that are required to support the algorithmic assumptions of bucket-based data structures:

  • B. Objects that are not equal should have different hashes. Ideally, the probability of hash collision should by 1/N where N is the size of hash value space.
  • C. The hash values should spread as much as possible over their value space. For example: ProtoObject>>#identityHash.

How these two desirable properties can be implemented depend on the actual distribution of the value used to compute the hash.

An object can safely use XOR hashing if its instance variables are all different objects that already provide a hash method with these two properties. For example, an object whose behaviour is defined by several of singleton delegates of different classes that do not override Object>>#hash.

But if the instance variables are not guaranteed to have those properties, it is necessary to more thoroughly "mix the bits". For example: SequenceableCollection>>#hash.

hash
| hash |

hash := self species hash.
1 to: self size do: [:i | hash := (hash + (self at: i) hash) hashMultiply].
^hash

The mixing of the bits is done by the combination of addition and hashMultiply. The value of "species hash" does not need to be processed by hashMultiply, because it is probably computed by ProtoObject>>identityHash, and that already provides C.

LayoutFrame is a particularly good example of a data structure that should not use XOR hashing. Its instance variables provide none of the required properties: SmallInteger>>#hash is just ^self. In addition, common LayoutFrame instances tend to use pairs of identical values in their instance variables. Like 0@0 corner: 1@1, or Margin fromNumber: 10.

A good hash function for LayoutFrame needs to:

  1. Get hashes for each instance variable that provides A and B.
  2. Combine them in a way that is order-sensitive (to maintain B) and that does some extra mixing (to provide C).

Now, we can assume that common values for the instance variables will be instances of SmallInteger, Float, or Fraction.

By the way, you can see in Fraction>>#hash, that a comment mentions the assumption that the fraction is already reduced, that is what makes it acceptable to use bitXor.

The hash of SmallInteger does provide A and B, but not C. The hash of Float is harder to understand, but tests show that it provide distinct values for 0.5, 0.25 and 0.125. So it hopefully provides B for the range of values of interest to LayoutFrame.


Note that Squeak has reduced the number of collisions for floats by not erasing the most significant bits of each word.
I've tested the squeak implementation with Andres Valloud tool in VW, it does not perform so bad for the sets I tested in term of collisions...

Float>>hash
    "Hash is reimplemented because = is implemented. Both words of the float are used. (The bitShift:'s ensure that the intermediate results do not become a large integer.) Care is taken to answer same hash as an equal Integer."

    (self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self truncated hash].
    ^ ((self basicAt: 1) bitShift: -4) +
       ((self basicAt: 2) bitShift: -4)

I also changed Fraction to use a hashMultiply and made a more explicit isAnExactFloat test.

Fraction>>hash
    "Hash is reimplemented because = is implemented."
   
    "Care is taken that a Fraction equal to a Float also has an equal hash"
    self isAnExactFloat ifTrue: [^self asExactFloat hash].
   
    "Else, I cannot be exactly equal to a Float, use own hash algorithm."
    ^numerator hash hashMultiply bitXor: denominator hash
 
Fraction>>isAnExactFloat
    "Answer true if this Fraction can be converted exactly to a Float"
    ^ denominator isPowerOfTwo
        and: ["I have a reasonable significand: not too big"
            numerator highBitOfMagnitude <= Float precision
                and: ["I have a reasonable exponent: not too small"
                    Float emin + denominator highBitOfMagnitude <= Float precision]]

Fraction>>asExactFloat
    "When we know that this Fraction is an exact Float, this conversion is much faster than asFloat."

    ^numerator asFloat timesTwoPower: 1 - denominator highBit

Not that Fraction should allways be reduced (this is an invariant of the class, I hope well defined in class comment).

Integer hash is not that good because the risk of occupying consecutive buckets is high... I wonder if returning ^self hashMultiply would not arrange things here and reduce the requirements for spreading hashMultiply everywhere else?

Nicolas

Another class that has similar hashing constraints to LayoutFrame is Point. Its hash method is:

hash
"Hash is reimplemented because = is implemented."

^(x hash hashMultiply + y hash) hashMultiply

It does not include species in the computation, which is less than ideal because it favours collisions with objects of different species that have a similar content and a the same #hash algorithm.

Finally, it is probably not necessary or useful to use the accessors to get to the instance variables.

So a good implementation would be this (untested, there might be a typo or two)

hash
| hash |
hash := self species hash
hash := (hash + leftFraction hash) hashMultiply
hash := (hash + leftOffset hash) hashMultiply
hash := (hash + topFraction hash) hashMultiply
hash := (hash + topOffset hash) hashMultiply
hash := (hash + rightFraction hash) hashMultiply
hash := (hash + rightOffset hash) hashMultiply
hash := (hash + bottomFraction hash) hashMultiply
hash := (hash + bottomOffset hash) hashMultiply
^ hash

I am sure you could have figured that by yourself, by thinking about how hash values are used and by looking at the code of kernel classes, instead of getting offended and turning all defensive.

I hope we can both learn from this episode. I do not enjoy antagonising people, but I will not write lengthy messages like this one every time someone questions my thinking.

Have a nice day.

Le 11/1/16 10:09, David Allouche a écrit :
Since it's not obvious why xor-ing is a bad way to produce hash, here is a link that gives some background

https://en.wikipedia.org/wiki/Hash_function#Uniformity

On 11 Jan 2016, at 10:01, David Allouche <[hidden email]> wrote:

I happened to look at the new code for LayoutFrame>>#hash

hash
^self species hash bitXor:
(self leftFraction bitXor:
(self leftOffset bitXor:
(self topFraction bitXor:
(self topOffset bitXor:
(self rightFraction bitXor:
(self rightOffset bitXor:
(self bottomFraction bitXor:
self bottomOffset)))))))

This is a terrible, terrible way to compute a hash.

0 xor 0 = 1 xor 1 = 2 xor 2 = etc.

NEVER compute hashes with xor, that's wrong, and it causes horrible, hard to debug, performance bugs down the road.

SequenceableCollection>>#hash looks more like a correct way of doing it. I do not know what is the correct, efficient way to reuse this logic in LayoutFrame>>#hash.

By the way, how do you guys do code reviews for code integrated into the core?








Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

Nicolas Cellier
In reply to this post by David Allouche


2016-01-13 19:55 GMT+01:00 David Allouche <[hidden email]>:

On 13 Jan 2016, at 19:31, Nicolas Cellier <[hidden email]> wrote:

Note that Squeak has reduced the number of collisions for floats by not erasing the most significant bits of each word.
I've tested the squeak implementation with Andres Valloud tool in VW, it does not perform so bad for the sets I tested in term of collisions...

Float>>hash
    "Hash is reimplemented because = is implemented. Both words of the float are used. (The bitShift:'s ensure that the intermediate results do not become a large integer.) Care is taken to answer same hash as an equal Integer."

    (self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self truncated hash].
    ^ ((self basicAt: 1) bitShift: -4) +
       ((self basicAt: 2) bitShift: -4)

I also changed Fraction to use a hashMultiply and made a more explicit isAnExactFloat test.

Fraction>>hash
    "Hash is reimplemented because = is implemented."
    
    "Care is taken that a Fraction equal to a Float also has an equal hash"
    self isAnExactFloat ifTrue: [^self asExactFloat hash].
    
    "Else, I cannot be exactly equal to a Float, use own hash algorithm."
    ^numerator hash hashMultiply bitXor: denominator hash
 
Fraction>>isAnExactFloat
    "Answer true if this Fraction can be converted exactly to a Float"
    ^ denominator isPowerOfTwo
        and: ["I have a reasonable significand: not too big"
            numerator highBitOfMagnitude <= Float precision
                and: ["I have a reasonable exponent: not too small"
                    Float emin + denominator highBitOfMagnitude <= Float precision]]

Fraction>>asExactFloat
    "When we know that this Fraction is an exact Float, this conversion is much faster than asFloat."

    ^numerator asFloat timesTwoPower: 1 - denominator highBit

Not that Fraction should allways be reduced (this is an invariant of the class, I hope well defined in class comment).

Integer hash is not that good because the risk of occupying consecutive buckets is high... I wonder if returning ^self hashMultiply would not arrange things here and reduce the requirements for spreading hashMultiply everywhere else?

The risk does not really lie in occupying consecutive buckets. The risk is rather that integers used as keys could have a pattern that cause bucket collisions. For example, if all key are multiple of 100, and the bucket count is 2, 4, 5, 10, 20, 25 or 50.

It would make more sense to solve this problem by using multiplyHash in the bucket data structure classes, instead of having to pay this computation cost in SmallInteger.


Ah yes, I thought the same.
 
Also, using it in SmallInteger would not dispense from using it in the #hash method of classes that use SmallInteger, because it matters that each component of the hash is processed a different number of times by multiplyHash.


Agree again, bad suggestion from my side, because using + or bitXor: without breaking symmetry with hashMultiply would lead to bad distribution of hashes... Consider I posted too fast :)

BTW a 64 bits spur could have a larger hash pattern (small integers are 61 bits wide sign included, and identityHash is longer too, so there's no reason to stick to 28 bits).

Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

Nicolas Cellier
In reply to this post by stepharo


2016-01-13 20:34 GMT+01:00 stepharo <[hidden email]>:
Nicolas

two questions:
    - should we integrate your float changes in Pharo?
    - I would like to get a chapter explaining the points raised by david, would you like to review it?

Stef


1) yes, that's low hanging fruits if you have a good soul to review it.
   while at reviewing the other hashes...
2) maybe

Le 13/1/16 19:31, Nicolas Cellier a écrit :


2016-01-13 17:22 GMT+01:00 David Allouche <[hidden email]>:

On 12 Jan 2016, at 20:59, stepharo <[hidden email]> wrote:

Hi david

I rad the wikipedia page you mention and this looks a bit obvious to me. I knew this.
now it does not really help fixing the problem you mention.

Stef

Since you asked for it…

The #hash method is used in Pharo to separate objects in bucket-based data structures.

It must have one required property which is required data integrity:

  • A. Objects that are equal according to #= must have equal hashes.

It should have two desirable properties that are required to support the algorithmic assumptions of bucket-based data structures:

  • B. Objects that are not equal should have different hashes. Ideally, the probability of hash collision should by 1/N where N is the size of hash value space.
  • C. The hash values should spread as much as possible over their value space. For example: ProtoObject>>#identityHash.

How these two desirable properties can be implemented depend on the actual distribution of the value used to compute the hash.

An object can safely use XOR hashing if its instance variables are all different objects that already provide a hash method with these two properties. For example, an object whose behaviour is defined by several of singleton delegates of different classes that do not override Object>>#hash.

But if the instance variables are not guaranteed to have those properties, it is necessary to more thoroughly "mix the bits". For example: SequenceableCollection>>#hash.

hash
| hash |

hash := self species hash.
1 to: self size do: [:i | hash := (hash + (self at: i) hash) hashMultiply].
^hash

The mixing of the bits is done by the combination of addition and hashMultiply. The value of "species hash" does not need to be processed by hashMultiply, because it is probably computed by ProtoObject>>identityHash, and that already provides C.

LayoutFrame is a particularly good example of a data structure that should not use XOR hashing. Its instance variables provide none of the required properties: SmallInteger>>#hash is just ^self. In addition, common LayoutFrame instances tend to use pairs of identical values in their instance variables. Like 0@0 corner: 1@1, or Margin fromNumber: 10.

A good hash function for LayoutFrame needs to:

  1. Get hashes for each instance variable that provides A and B.
  2. Combine them in a way that is order-sensitive (to maintain B) and that does some extra mixing (to provide C).

Now, we can assume that common values for the instance variables will be instances of SmallInteger, Float, or Fraction.

By the way, you can see in Fraction>>#hash, that a comment mentions the assumption that the fraction is already reduced, that is what makes it acceptable to use bitXor.

The hash of SmallInteger does provide A and B, but not C. The hash of Float is harder to understand, but tests show that it provide distinct values for 0.5, 0.25 and 0.125. So it hopefully provides B for the range of values of interest to LayoutFrame.


Note that Squeak has reduced the number of collisions for floats by not erasing the most significant bits of each word.
I've tested the squeak implementation with Andres Valloud tool in VW, it does not perform so bad for the sets I tested in term of collisions...

Float>>hash
    "Hash is reimplemented because = is implemented. Both words of the float are used. (The bitShift:'s ensure that the intermediate results do not become a large integer.) Care is taken to answer same hash as an equal Integer."

    (self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self truncated hash].
    ^ ((self basicAt: 1) bitShift: -4) +
       ((self basicAt: 2) bitShift: -4)

I also changed Fraction to use a hashMultiply and made a more explicit isAnExactFloat test.

Fraction>>hash
    "Hash is reimplemented because = is implemented."
   
    "Care is taken that a Fraction equal to a Float also has an equal hash"
    self isAnExactFloat ifTrue: [^self asExactFloat hash].
   
    "Else, I cannot be exactly equal to a Float, use own hash algorithm."
    ^numerator hash hashMultiply bitXor: denominator hash
 
Fraction>>isAnExactFloat
    "Answer true if this Fraction can be converted exactly to a Float"
    ^ denominator isPowerOfTwo
        and: ["I have a reasonable significand: not too big"
            numerator highBitOfMagnitude <= Float precision
                and: ["I have a reasonable exponent: not too small"
                    Float emin + denominator highBitOfMagnitude <= Float precision]]

Fraction>>asExactFloat
    "When we know that this Fraction is an exact Float, this conversion is much faster than asFloat."

    ^numerator asFloat timesTwoPower: 1 - denominator highBit

Not that Fraction should allways be reduced (this is an invariant of the class, I hope well defined in class comment).

Integer hash is not that good because the risk of occupying consecutive buckets is high... I wonder if returning ^self hashMultiply would not arrange things here and reduce the requirements for spreading hashMultiply everywhere else?

Nicolas

Another class that has similar hashing constraints to LayoutFrame is Point. Its hash method is:

hash
"Hash is reimplemented because = is implemented."

^(x hash hashMultiply + y hash) hashMultiply

It does not include species in the computation, which is less than ideal because it favours collisions with objects of different species that have a similar content and a the same #hash algorithm.

Finally, it is probably not necessary or useful to use the accessors to get to the instance variables.

So a good implementation would be this (untested, there might be a typo or two)

hash
| hash |
hash := self species hash
hash := (hash + leftFraction hash) hashMultiply
hash := (hash + leftOffset hash) hashMultiply
hash := (hash + topFraction hash) hashMultiply
hash := (hash + topOffset hash) hashMultiply
hash := (hash + rightFraction hash) hashMultiply
hash := (hash + rightOffset hash) hashMultiply
hash := (hash + bottomFraction hash) hashMultiply
hash := (hash + bottomOffset hash) hashMultiply
^ hash

I am sure you could have figured that by yourself, by thinking about how hash values are used and by looking at the code of kernel classes, instead of getting offended and turning all defensive.

I hope we can both learn from this episode. I do not enjoy antagonising people, but I will not write lengthy messages like this one every time someone questions my thinking.

Have a nice day.

Le 11/1/16 10:09, David Allouche a écrit :
Since it's not obvious why xor-ing is a bad way to produce hash, here is a link that gives some background

https://en.wikipedia.org/wiki/Hash_function#Uniformity

On 11 Jan 2016, at 10:01, David Allouche <[hidden email]> wrote:

I happened to look at the new code for LayoutFrame>>#hash

hash
^self species hash bitXor:
(self leftFraction bitXor:
(self leftOffset bitXor:
(self topFraction bitXor:
(self topOffset bitXor:
(self rightFraction bitXor:
(self rightOffset bitXor:
(self bottomFraction bitXor:
self bottomOffset)))))))

This is a terrible, terrible way to compute a hash.

0 xor 0 = 1 xor 1 = 2 xor 2 = etc.

NEVER compute hashes with xor, that's wrong, and it causes horrible, hard to debug, performance bugs down the road.

SequenceableCollection>>#hash looks more like a correct way of doing it. I do not know what is the correct, efficient way to reuse this logic in LayoutFrame>>#hash.

By the way, how do you guys do code reviews for code integrated into the core?









Reply | Threaded
Open this post in threaded view
|

Re: Bad layoutFrame>>#hash

David Allouche
In reply to this post by stepharo

> On 13 Jan 2016, at 20:32, stepharo <[hidden email]> wrote:
>
> Hi david
>
> I would like to turn your post into a chapter for a forthcoming book so that people can read and get the idea.

Yes I would be glad :-) Please just remember to credit the author.

Is there a recommended license for Pharo documentation? If not, I am happy to release that text under the MIT License.

I would also like to review your derived work. But that is not a requirement.

> I did not think about the same object. Is it ok for you?
>
> Stef
>
> PS: May be this was obviosu for you that we should all get this out of your previous mail but it was not my case
> and not after 8 hours on intense work.

Sorry, I do not understand what you are saying here. I do not want to read between the lines, because I would probably misunderstand your intention.

Regards.
12