Hi,
In Dr. Geo I rely a lot on hash value for the geometric object. It largely improves speed up to compare geometric items (to avoid duplicated items). For example to detect when the user is building an already created object. Also when describing geometric sketch with Smalltalk code, hundred or thousand of geometric items are created. The hash value speeds up equality check (otherwise true comparing is iterative in the geometric sketch tree and it slow down) But sometime hash collisions occur. I guess in that case, DrGeo should do the real equality check, or may be I am overusing the hash? I am interested on advices. As for now I am doing nothing special. Here is an annoying hash collision with Pharo 3/ Pharo 5 I think should not occur: (0.5@0.25) hash 67006464 (-0.5@0.25) hash 67006464 Squeak5.0 does not have this collision. It looks related to the Float hash implementation a bit different. What do you think? Hilaire -- Dr. Geo http://drgeo.eu |
On Fri, Dec 9, 2016 at 10:51 PM, Hilaire <[hidden email]> wrote: Hi, More the point, this seems bad... -0.5 hash "==> 57344" 0.5 hash "==> 57344" I read page 307 of "Hashing in Smalltalk Theory and Practice" ** regarding Squeak's (Pharo's) Point>>hash method... "at this time we will not test this hash function for points with our floating point data sets because we already know that [Pharo's] hash function for floats is of bad quality." ** Errm... actually first time I've opened the book since I bought it a couple of years ago. And from page 273 reviewing Float's hash... "because of how the bitAnd: is used, the hash function will discard 32 of the original 64 bits. In other words what seems a hash function using all the availabel bytes is anything but. [Also] because of the way the two 16 bit chunks are added togather, by the time bitshift:-8 is sent we will only have, at most, 17 bits worth of hash value. Effectively all the bitshift: accomplishes is to get rid of the 8 lowest bits, which are zero anyway because of how the bitAnd: message were set up. Can these objections add up to particularly poor performance? The answer is definitely yes - since we will have only 17 bit hash values this hash function will be particularly unsuitable for any data set containing more than 2^17 = 131,072 objects" Float>>hash "Pharo" (self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self truncated hash]. ^ (((self basicAt: 1) bitAnd: 16r00FFFF00) + ((self basicAt: 2) bitAnd: 16r00FFFF00)) bitShift: -8 The suggested fix depended on being willing to let go of byte ordering (which we might not want different results on different platforms) Float>>hash ^ByteArray hashBytes: self startingWith: self species hash but anyway that doesn't actually help this case... -0.5 hash "==>120484352" 0.5 hash "==>120484352" I see Squeak5's float hash has changed... Float>>hash "Squeak5" (self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self truncated hash]. ^ ((self basicAt: 1) bitShift: -4) + ((self basicAt: 2) bitShift: -4) which gives... 0.5 hash "==>66977792" -0.5 hash "==>201195520" which solves the immediate problem, but I'm not capable of analysis its quality further. cheers -ben |
If you do not really care about performance because there are not that may geometric objects (say thousands), I guess that this would work (bit pattern as string). { 0.5 binaryLiteralString hash. -0.5 binaryLiteralString hash. } Phil On Fri, Dec 9, 2016 at 6:45 PM, Ben Coman <[hidden email]> wrote:
|
In reply to this post by Ben Coman
On 12/09/2016 09:45 AM, Ben Coman
wrote:
This fix contains a good idea, but has at least two problems: 1) The check for integers is important, that's what makes sure that 1.0 has the same hash as 1. Which it must, since they compare equal. 2) #hashBytes:startingWith: is designed to hash *bytes*. A BoxedFloat64 is two *words*, which ensures that when you hashMultiply the high-order bits of each word get discarded. I tried implementing a hash that actually uses the bytes of the float (code at bottom of this message). I didn't do any real analysis of it, but at least it gives different answers for 0.5 and -0.5. I think it matches the *intent* of the code above (though would still need the integer check, which I didn't bother with).
This is slightly better. It's only discarding eight bits out of 64, instead of discarding 32. But it would be better to mix all the bits into the hash. Note that all of these hash implementations end up manipulating at least one large integer most of the time, since each word can be a large integer. It's possible that a primitive to copy a word object into the equivalent bytes might speed things up somewhat. Regards, -Martin
"Strictly experimental method on BoxedFloat64, could probably be made faster even without a primitive." bytesHash |
On 12/10/16 10:36 , Martin McClure wrote:
> On 12/09/2016 09:45 AM, Ben Coman wrote: >> >> The suggested fix depended on being willing to let go of byte ordering >> (which we might not want different results on different platforms) >> Float>>hash >> ^ByteArray >> hashBytes: self >> startingWith: self species hash > This fix contains a good idea, but has at least two problems: > 1) The check for integers is important, that's what makes sure that 1.0 > has the same hash as 1. Which it must, since they compare equal. > 2) #hashBytes:startingWith: is designed to hash *bytes*. A BoxedFloat64 > is two *words*, which ensures that when you hashMultiply the high-order > bits of each word get discarded. The high order bits of each word, yes. However, I would have expected the byte oriented code to be, basically, h := some constant. bytes do: [:x | h * 1644525 + x] or h := some constant. bytes do: [:x | h + x * 1644525] In either case, the sign bit in one of the x values should have made a difference. Apparently it didn't. What am I missing here? > I tried implementing a hash that actually uses the bytes of the float > (code at bottom of this message). I didn't do any real analysis of it, > but at least it gives different answers for 0.5 and -0.5. I think it > matches the *intent* of the code above (though would still need the > integer check, which I didn't bother with). Side comment: equality between integers, floats, fractions and other numbers induces a desire to have equal hashes... in my opinion, the resulting complexity hard to justify. >> I see Squeak5's float hash has changed... >> Float>>hash "Squeak5" >> (self isFinite and: [self fractionPart = 0.0]) ifTrue: [^self >> truncated hash]. >> ^ ((self basicAt: 1) bitShift: -4) + >> ((self basicAt: 2) bitShift: -4) >> > This is slightly better. It's only discarding eight bits out of 64, > instead of discarding 32. But it would be better to mix all the bits > into the hash. Yes. > Note that all of these hash implementations end up manipulating at least > one large integer most of the time, since each word can be a large > integer. It's possible that a primitive to copy a word object into the > equivalent bytes might speed things up somewhat. ... or use the byte oriented hash. > "Strictly experimental method on BoxedFloat64, could probably be made > faster even without a primitive." > > bytesHash > | bytes word1 word2 | > bytes := ByteArray new: 8. > word1 := self basicAt: 1. > word2 := self basicAt: 2. > bytes at: 1 put: (word1 bitShift: -24). > bytes at: 2 put: ((word1 bitShift: -16) bitAnd: 16rFF). > bytes at: 3 put: ((word1 bitShift: -8) bitAnd: 16rFF). > bytes at: 4 put: (word1 bitAnd: 16rFF). > bytes at: 5 put: (word2 bitShift: -24). > bytes at: 6 put: ((word2 bitShift: -16) bitAnd: 16rFF). > bytes at: 7 put: ((word2 bitShift: -8) bitAnd: 16rFF). > bytes at: 8 put: (word2 bitAnd: 16rFF). > ^ ByteArray hashBytes: bytes startingWith: self class hash I would have expected the floats to be a byte object of size 8. Why is this conversion needed? Is somehow the primitive thinking the float has size 2? Or is the primitive hashing 32 bits at a time? The prim used to be a C version of the byte oriented 1644525 multiplicative hash, did this change? Andres. |
On 12/13/2016 08:39 AM, Andres Valloud wrote:
> I would have expected the floats to be a byte object of size 8. Why is > this conversion needed? Is somehow the primitive thinking the float has > size 2? Or is the primitive hashing 32 bits at a time? The prim used > to be a C version of the byte oriented 1644525 multiplicative hash, did > this change? The float is not a byte object of size 8, it's a word object of size 2. When you feed it to #hashBytes:startingWith: the primitive fails. (I didn't look at why -- maybe because it isn't a byte object?) The alternative Smalltalk code treats it like it was a byte object, even though it's not. So it iterates through the two words, multiplying each by 1664525, then keeping only the low-order 28 bits of the result. But since each word is a full 32 bits, the high-order four bits of each word, which include the sign bit, can never affect the hash. Regards, -Martin |
Oh, ok... then the failure code isn't doing what was intended :).
On 12/13/16 15:16 , Martin McClure wrote: > On 12/13/2016 08:39 AM, Andres Valloud wrote: >> I would have expected the floats to be a byte object of size 8. Why is >> this conversion needed? Is somehow the primitive thinking the float has >> size 2? Or is the primitive hashing 32 bits at a time? The prim used >> to be a C version of the byte oriented 1644525 multiplicative hash, did >> this change? > > The float is not a byte object of size 8, it's a word object of size 2. > > When you feed it to #hashBytes:startingWith: the primitive fails. (I > didn't look at why -- maybe because it isn't a byte object?) > > The alternative Smalltalk code treats it like it was a byte object, even > though it's not. So it iterates through the two words, multiplying each > by 1664525, then keeping only the low-order 28 bits of the result. But > since each word is a full 32 bits, the high-order four bits of each > word, which include the sign bit, can never affect the hash. > > Regards, > > -Martin > |
Sorry but how can we improve the situation concretely
Stef On Wed, 14 Dec 2016 00:22:47 +0100, Andres Valloud <[hidden email]> wrote: > Oh, ok... then the failure code isn't doing what was intended :). > > On 12/13/16 15:16 , Martin McClure wrote: >> On 12/13/2016 08:39 AM, Andres Valloud wrote: >>> I would have expected the floats to be a byte object of size 8. Why is >>> this conversion needed? Is somehow the primitive thinking the float >>> has >>> size 2? Or is the primitive hashing 32 bits at a time? The prim used >>> to be a C version of the byte oriented 1644525 multiplicative hash, did >>> this change? >> >> The float is not a byte object of size 8, it's a word object of size 2. >> >> When you feed it to #hashBytes:startingWith: the primitive fails. (I >> didn't look at why -- maybe because it isn't a byte object?) >> >> The alternative Smalltalk code treats it like it was a byte object, even >> though it's not. So it iterates through the two words, multiplying each >> by 1664525, then keeping only the low-order 28 bits of the result. But >> since each word is a full 32 bits, the high-order four bits of each >> word, which include the sign bit, can never affect the hash. >> >> Regards, >> >> -Martin >> > -- Using Opera's mail client: http://www.opera.com/mail/ |
The primitive and the primitive failure code should behave the same?
On 12/18/16 2:13 , stepharong wrote: > Sorry but how can we improve the situation concretely > > > Stef > > On Wed, 14 Dec 2016 00:22:47 +0100, Andres Valloud > <[hidden email]> wrote: > >> Oh, ok... then the failure code isn't doing what was intended :). >> >> On 12/13/16 15:16 , Martin McClure wrote: >>> On 12/13/2016 08:39 AM, Andres Valloud wrote: >>>> I would have expected the floats to be a byte object of size 8. Why is >>>> this conversion needed? Is somehow the primitive thinking the float >>>> has >>>> size 2? Or is the primitive hashing 32 bits at a time? The prim used >>>> to be a C version of the byte oriented 1644525 multiplicative hash, did >>>> this change? >>> >>> The float is not a byte object of size 8, it's a word object of size 2. >>> >>> When you feed it to #hashBytes:startingWith: the primitive fails. (I >>> didn't look at why -- maybe because it isn't a byte object?) >>> >>> The alternative Smalltalk code treats it like it was a byte object, even >>> though it's not. So it iterates through the two words, multiplying each >>> by 1664525, then keeping only the low-order 28 bits of the result. But >>> since each word is a full 32 bits, the high-order four bits of each >>> word, which include the sign bit, can never affect the hash. >>> >>> Regards, >>> >>> -Martin >>> >> > > |
Hi andres
I did not follow the discussion. My remark is just that a good discussion without code for integration does not improve Pharo and we will have people facing again the problem. Stef On Sun, 18 Dec 2016 11:29:11 +0100, Andres Valloud <[hidden email]> wrote: > The primitive and the primitive failure code should behave the same? > > On 12/18/16 2:13 , stepharong wrote: >> Sorry but how can we improve the situation concretely >> >> >> Stef >> >> On Wed, 14 Dec 2016 00:22:47 +0100, Andres Valloud >> <[hidden email]> wrote: >> >>> Oh, ok... then the failure code isn't doing what was intended :). >>> >>> On 12/13/16 15:16 , Martin McClure wrote: >>>> On 12/13/2016 08:39 AM, Andres Valloud wrote: >>>>> I would have expected the floats to be a byte object of size 8. Why >>>>> is >>>>> this conversion needed? Is somehow the primitive thinking the float >>>>> has >>>>> size 2? Or is the primitive hashing 32 bits at a time? The prim >>>>> used >>>>> to be a C version of the byte oriented 1644525 multiplicative hash, >>>>> did >>>>> this change? >>>> >>>> The float is not a byte object of size 8, it's a word object of size >>>> 2. >>>> >>>> When you feed it to #hashBytes:startingWith: the primitive fails. (I >>>> didn't look at why -- maybe because it isn't a byte object?) >>>> >>>> The alternative Smalltalk code treats it like it was a byte object, >>>> even >>>> though it's not. So it iterates through the two words, multiplying >>>> each >>>> by 1664525, then keeping only the low-order 28 bits of the result. But >>>> since each word is a full 32 bits, the high-order four bits of each >>>> word, which include the sign bit, can never affect the hash. >>>> >>>> Regards, >>>> >>>> -Martin >>>> >>> >> >> > -- Using Opera's mail client: http://www.opera.com/mail/ |
In reply to this post by Andres Valloud-4
On 12/18/2016 02:29 AM, Andres Valloud wrote:
> The primitive and the primitive failure code should behave the same? You think the primitive is faulty? I haven't looked at the primitive code, but it just fails when given a Float. From the name, I'd guess that it's designed for a byte object. The failure code certainly is. A new primitive designed for word objects, or new code such as I gave earlier (though ugly) would work, though. Regards, -Martin |
At first glance, that the failure code only sees "two" things when it
should see "eight" seems to be problematic. Perhaps the primitive insists on hashing byte objects, and there is a distinction between "byte" objects and "word" objects (whatever "word" means, presumably a constant width integer across all platforms). I haven't looked at the code. From my perspective... back in the day that primitive used to hash bytes, and from what I saw here the failure code is hashing multi-byte things. If all of these observations are correct, then I'd say the failure code isn't doing what the primitive is doing, and in doing so it's introducing a lot of collisions that I'd like to believe the intended hash function wouldn't produce. Andres. On 12/18/16 23:05 , Martin McClure wrote: > On 12/18/2016 02:29 AM, Andres Valloud wrote: >> The primitive and the primitive failure code should behave the same? > > You think the primitive is faulty? I haven't looked at the primitive > code, but it just fails when given a Float. From the name, I'd guess > that it's designed for a byte object. The failure code certainly is. > > A new primitive designed for word objects, or new code such as I gave > earlier (though ugly) would work, though. > > Regards, > > -Martin > |
On 12/19/2016 01:27 AM, Andres Valloud wrote:
> At first glance, that the failure code only sees "two" things when it > should see "eight" seems to be problematic. > > Perhaps the primitive insists on hashing byte objects, and there is a > distinction between "byte" objects and "word" objects (whatever "word" > means, presumably a constant width integer across all platforms). I > haven't looked at the code. > > From my perspective... back in the day that primitive used to hash > bytes, and from what I saw here the failure code is hashing multi-byte > things. If all of these observations are correct, then I'd say the > failure code isn't doing what the primitive is doing, and in doing so > it's introducing a lot of collisions that I'd like to believe the > intended hash function wouldn't produce. Ah, I see your concern. As far as I can see, all classes that are using the StringHash primitive are actually byte objects, so things are, I believe, working as designed. The only problem is that Ben did an experiment to see whether Float's hashing would be improved by using the StringHash primitive. Which it failed to do, because Float is not a byte object. We could use an equivalent primitive to hash word objects, but I haven't found one. We could also use a primitive to retrieve the bytes of a word object, and I haven't found one of those either. There are places that are converting the words to large integers and then hashing those, which would work for Floats. -Martin |
Free forum by Nabble | Edit this page |