Hashed collection changes, the performance graphs


Hashed collection changes, the performance graphs

Martin McClure-2
Gotta go see my wife in a play now; will comment on these graphs later.

-Martin

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

[Attachment: DictPerformanceGraphs.png (49K)]

Re: Hashed collection changes, the performance graphs

Martin McClure-2
Martin McClure wrote:
> Gotta go see my wife in a play now; will comment on these graphs later.

The play was fun :-)

Now about the graphs:

The test code, a variant of the test from the HashTable SqueakSource
wiki, is at the bottom of this message. Basically, it adds a bunch of
instances of Object to a Dictionary and sees how much time it takes to
look them up again.

From the graphs in the previous message, you can see that performance
for sizes > 4000 is greatly improved. For size = 10000, #at: is 1000
times faster, 2-3 microseconds vs. >2 milliseconds. At size=10000,
#at:put: is about 200 times faster, ~10 microseconds vs. >2
milliseconds, and the large spikes for growing the collection are 21
milliseconds vs. > 4 seconds, again a factor of about 200.

Performance for dictionary sizes < 4000 is essentially the same as
before, so these collections can serve as general-purpose collections
over a wide range of sizes. I've attached the graphs for sizes <4000 to
this message so you can see that more clearly than on the previous graphs.

These results should hold for any object that inherits #hash from
Object, in other words any object that uses its identity hash as its equality hash.
Other objects with better hashes did not have as serious a problem, but
will probably show increased performance as well due to using prime
table sizes.
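The effect of the prime table sizes on spread hashes can be sketched in a few lines of Python (an illustration only; it assumes Pharo-like 12-bit identity hashes scaled by `bitShift: 18`, with buckets chosen by hash mod table size):

```python
# Illustration (assumptions: 12-bit raw identity hashes, spread by << 18,
# bucket chosen as hash mod table size).
def buckets_used(hashes, table_size):
    """How many distinct buckets the given hash values land in."""
    return len({h % table_size for h in hashes})

raw = range(4096)                      # every possible 12-bit identity hash
spread = [h << 18 for h in raw]        # the scaled values discussed here

print(buckets_used(spread, 4099))      # prime table size: all 4096 values distinct
print(buckets_used(spread, 4096))      # power-of-two size: everything collides
```

With a prime table size, multiplication by 2^18 is invertible modulo the size, so all 4096 scaled values stay distinct; with a power-of-two size the shifted-in low bits are all zero and every entry lands in the same bucket.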

These changes are in Set, so should improve Set's subclasses as well.
IdentityDictionary, IdentitySet, and WeakIdentityKeyDictionary did not
have as serious a problem, but should see some improvement.
MethodDictionaries have been left alone on the assumption that the VM
depends on the hashing of those.


Since there are still only 4K possible values of identity hash,
collisions are inevitable in large collections, and the number of
collisions will grow linearly with collection size. So how well does the
spread hash / prime table size do at even larger sizes? I ran the same
test at sizes of one million. As expected, access was quite a bit slower
than it had been at 10000. Time for #at: was ~250 microseconds, and
#at:put: was about the same. Note, however, that this is still ten times
faster than Dictionary previously was at a size of 10000; 100 times
larger yet 10 times faster.
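The linear growth is just the pigeonhole principle; a rough Python model (assuming 4096 identity-hash values, evenly used) gives the expected number of entries sharing each hash value:

```python
# Rough model (assumption: 4096 possible identity-hash values, evenly used).
# With n identity-hashed objects, about n / 4096 entries share each hash
# value, so probe chains -- and lookup time -- grow linearly with size.
IDENTITY_HASH_VALUES = 4096

def mean_entries_per_hash(n):
    return n / IDENTITY_HASH_VALUES

print(mean_entries_per_hash(10_000))     # about 2.4 entries per hash value
print(mean_entries_per_hash(1_000_000))  # about 244 entries per hash value
```

This is consistent with #at: taking ~250 microseconds at a million entries: each lookup has to walk a chain a couple of hundred entries long.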

I had a lot of fun doing this. These are better results than I expected,
for a fairly minor (though deep) code change.

Regards,

-Martin



| test ord |
Transcript cr.
test := Dictionary new.
[ test size >= 10000 ] whileFalse: [
    ord := OrderedCollection new: 100.
    Transcript show: [
        100 timesRepeat: [
            test at: (ord add: Object new) put: nil ] ] timeToRun asString.
    Transcript tab;
        show: test size asString;
        tab.
    Transcript show: [
        1000 timesRepeat: [
            ord do: [ :each | test at: each ] ] ] timeToRun asString.
    Transcript tab; show: [
        1000 timesRepeat: [
            ord do: [ :each | ] ] ] timeToRun asString.
    Transcript cr ]



[Attachment: SmallDictPerformanceGraphs1.png (38K)]

Re: Hashed collection changes, the performance graphs

Stéphane Ducasse
Martin,

So what you are saying is that we can gain speed, and that these
changes are beneficial for the simple default case.


> [...Martin's test description and performance results quoted in full...]

So where are these changes? :)
Should we integrate them?
Andres and other hash freaks, what are your points of view?




Re: Hashed collection changes, the performance graphs

Andres Valloud-4
Stephane et al,

Martin and I discussed this last night at the PDX Smalltalk user
group meeting.  Mostly we agree, with perhaps one small exception.  In
order to make hashed collections simpler and identity hash handling more
straightforward, hashed collections should get a small integer range
identity hash value.  Something between 0 and SmallInteger maxVal, for
example.  The issue is how to get that.

Martin's approach is to change identityHash itself, and provide
primIdentityHash to access the existing identityHash implementation
(primitive 75).  My approach is to leave identityHash as is, and provide
a new message called scaledIdentityHash.  Note that in both Martin's and
my implementation, the new identity hash values are identical.  In other
words, whereas Martin's changes implement

identityHash

  ^self primIdentityHash bitShift: 18


my changes implement

scaledIdentityHash

  ^self identityHash bitShift: 18
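Either way, the shift spreads the 4096 raw values evenly across the positive SmallInteger range. A quick check in Python (assumptions: 12-bit raw hashes and a 30-bit positive SmallInteger range, as in 32-bit Squeak/Pharo):

```python
# Toy check (assumptions: 12-bit raw identity hashes, positive SmallIntegers
# up to 2^30 - 1). Shifting left by 18 bits (12 + 18 = 30) spreads the 4096
# raw values across the SmallInteger range without overflowing it.
SMALLINT_MAX = (1 << 30) - 1       # assumed positive SmallInteger maxVal

def spread_hash(raw):
    return raw << 18               # the `bitShift: 18` from both proposals

print(spread_hash(4095) <= SMALLINT_MAX)   # largest value is still a SmallInteger
print(spread_hash(1))                      # neighbouring raw hashes land 2^18 apart
```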


So, really, this is an issue of how to implement the *same* behavior,
with the *same* goals.  Now, there are advantages and disadvantages for
both.  I'll try to explain.  Martin, please feel free to correct what
follows if I miss something.

There are two kinds of senders of identityHash.  Some (most) do not care
whether the answer is well distributed or not.  Typically, these are
implementors of #hash, for example:

SomeObject>>hash

  ^self someInstanceVariable identityHash


If, on one hand, we change identityHash, then the behavior of these hash
methods improve because the values they answer would be distributed more
uniformly.  Nevertheless, precisely because they send identityHash, it
could be argued that these hash methods were implemented assuming there
would be just a few of these objects in hashed collections.  If this
turns out not to be the case, then performance could be much better by
implementing a better hash method.

Some (a few) other senders of identityHash assume that the answer has a
particular distribution (in Squeak's/Pharo's case, 0 through 4095).  
Consequently, the senders may do things like this:

SomeObject>>hash

  ^(self variableA identityHash bitShift: 12) + self variableB identityHash


Also, there may be hashed collections which use identityHash to index
hash buckets (e.g.: GemStone's own 32 bit object cache for VisualWorks,
which uses identityHash to access hash buckets).  So, if we change
identityHash, these senders which required special care to write would
break, requiring them to be written like this:

SomeObject>>hash

  ^(self variableA primIdentityHash bitShift: 12) + self variableB
primIdentityHash
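The reason such packed hashes need this care is that they assume the 12-bit range. A small Python sketch (same assumptions as above: 12-bit raw hashes, positive SmallIntegers up to 2^30 - 1) shows the packing fits with raw values but overflows if #identityHash starts answering pre-shifted ones:

```python
# Sketch (assumptions: 12-bit raw identity hashes, positive SmallIntegers
# up to 2^30 - 1). The "clever hack" packs two identity hashes into one:
#   (variableA identityHash bitShift: 12) + variableB identityHash
SMALLINT_MAX = (1 << 30) - 1

def packed_hash(a, b):
    return (a << 12) + b

# With raw 12-bit hashes the packed value fits in 24 bits:
print(packed_hash(4095, 4095) <= SMALLINT_MAX)               # True
# With pre-shifted hashes it overflows into LargeInteger territory:
print(packed_hash(4095 << 18, 4095 << 18) <= SMALLINT_MAX)   # False
```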


If, on the other hand, we provide a new scaledIdentityHash method, then
all senders of identityHash keep behaving the same way as they do now.  
Hashed collections, which currently send identityHash, would need to
change so that they send scaledIdentityHash instead.  Furthermore, any
implementor of hash such as

SomeObject>>hash

  ^self someInstanceVariable identityHash


would need to be rewritten as

SomeObject>>hash

  ^self someInstanceVariable scaledIdentityHash


if hashed collections are expected to contain numerous instances of
SomeObject.  If these are not rewritten, then they would continue to
have the same performance characteristics they have today (which may be
adequate to begin with).  Clever hacks such as

SomeObject>>hash

  ^(self variableA identityHash bitShift: 12) + self variableB identityHash


would also remain undisturbed.  Finally, I do not know of any Smalltalk
in which identityHash does not answer the actual object header bits for
the identity hash.  If we change identityHash, then AFAIK Pharo would
become the only Smalltalk in which identityHash does not answer
consecutive values between 0 and (2^k)-1 (k=12 for Squeak/Pharo, k=14
for VisualWorks 32 bits, k=20 for VisualWorks 64 bits, IIRC k=15 for VA
and VisualSmalltalk).  Moreover, AFAIK, identityHash has had this
behavior for decades.  Is it a good idea to change it?  I am not so sure
on that one.  Also, I wouldn't want to break clever hacks that assume
identityHash behaves the way it currently does, and I wouldn't want to
make Pharo different gratuitously.

So that's where we stand.  I think I understand where Martin is coming
from, and I think Martin understands where I am coming from.  One way or
the other, we also agree that something should be done to make hashed
collections faster in Pharo.

Andres.


Stéphane Ducasse wrote:

> [...]


Re: Hashed collection changes, the performance graphs

Martin McClure-2
Andres Valloud wrote:
> Stephane et al,
>
> Martin and I discussed this yesterday night at the PDX Smalltalk user
> group meeting.  Mostly we agree, with perhaps one small exception.  In
> order to make hashed collections simpler and identity hash handling more
> straightforward, hashed collections should get a small integer range
> identity hash value.  Something between 0 and SmallInteger maxVal, for
> example.  The issue is how to get that.

Hi Andres,

Oh, good; you started the discussion while I was being interrupted in
the middle of writing my previous message -- thanks. :-)

[...good description of the choices omitted...]


> There are two kinds of senders of identityHash.

[...]

I took a survey of the senders of #identityHash in the latest web image.
There aren't that many. The largest category is those that want the
printString of the identityHash.

Of those that care about the value of the identityHash, there are
several that use it in #hash methods. The most common is this definition:

hash
   ^self identityHash

These are presumably overriding superclass behavior to restore Object
behavior. If Object>>hash were to be defined to answer 'self
scaledIdentityHash' then these methods would also need to be changed in
order to get the performance improvement for collections containing
those objects. If, OTOH, #identityHash itself changes, these classes get
the benefit with no change.

> Some (most) do not care
> whether the answer is well distributed or not.  Typically, these are
> implementors of #hash, for example:
>
> SomeObject>>hash
>
>   ^self someInstanceVariable identityHash

And indeed there are also some that do this.

> If, on one hand, we change identityHash, then the behavior of these hash
> methods improve because the values they answer would be distributed more
> uniformly.  Nevertheless, precisely because they send identityHash, it
> could be argued that these hash methods were implemented assuming there
> would be just a few of these objects in hashed collections.  

If the authors knew about the limited range of #identityHash, that is
entirely possible. I tend to think it more likely that in most cases
these implementations are just the simplest way to follow the dictate
that 'a=b -> a hash = b hash', and that they didn't really think about
the impact on collection performance.


>
> Some (a few) other senders of identityHash assume that the answer has a
> particular distribution (in Squeak's/Pharo's case, 0 through 4095).  

Here are the senders I found that care about the actual values returned,
along with whether they'd be harmed or improved by increasing the range
returned:

FixedIdentitySet -- harmed
IdentityDictionary -- somewhat improved
IdentitySet -- somewhat improved
KeyedIdentitySet -- greatly improved
MethodDictionary -- harmed
RefactoryTyper -- probably improved
WeakIdentityKeyDictionary -- somewhat improved

5 improved, 2 harmed. And one of the two listed as harmed is
MethodDictionary, whose performance would not be harmed, but I assume
the VM would not be happy if its hashing was changed. (Anybody know for
sure whether that's true?)


> Consequently, the senders may do things like this:
>
> SomeObject>>hash
>
>   ^(self variableA identityHash bitShift: 12) + self variableB identityHash

They could, and I admit to having written this kind of code in the past,
but I doubt that I'm typical in doing so. Do you know of any Pharo code
that actually *does* this sort of thing? There isn't any in the
distributed web image, but I didn't look at every package that is meant
to be loadable in Pharo.

>
>
> If, on the other hand, we provide a new scaledIdentityHash method, then
> all senders of identityHash keep behaving the same way as they do now.  
> Hashed collections, which currently send identityHash, would need to
> change so that they send scaledIdentityHash instead.  Furthermore, any
> implementor of hash such as
>
> SomeObject>>hash
>
>   ^self someInstanceVariable identityHash
>
>
> would need to be rewritten as
>
> SomeObject>>hash
>
>   ^self someInstanceVariable scaledIdentityHash

Yes. I believe that this is a disadvantage of the #scaledIdentityHash
approach.

>
>
> if hashed collections are expected to contain numerous instances of
> SomeObject.  If these are not rewritten, then they would continue to
> have the same performance characteristics they have today (which may be
> adequate to begin with).  

True, if the collections are expected to remain small their performance
is already adequate. However, small collections will still have good
performance if #identityHash is changed, and collections that the
authors thought would remain small when they were written back in the
days of sub-megabyte memory may now get bigger than their authors intended.

> Clever hacks such as
>
> SomeObject>>hash
>
>   ^(self variableA identityHash bitShift: 12) + self variableB identityHash
>
>
> would also remain undisturbed.  

Yes, if #identityHash is changed it's the clever hacks that will have to
change. This could be a disadvantage of this approach, but often, as in
the case of IdentityDictionary, IdentitySet, and
WeakIdentityKeyDictionary, the necessary change is simply to remove the
clever hack, get simpler code, and enjoy better performance than you got
with the clever hack, so making the change is IMO an improvement.

> Finally, I do not know of any Smalltalk
> in which identityHash does not answer the actual object header bits for
> the identity hash. If we change identityHash, then AFAIK Pharo would
> become the only Smalltalk in which identityHash does not answer
> consecutive values between 0 and (2^k)-1 (k=12 for Squeak/Pharo, k=14
> for VisualWorks 32 bits, k=20 for VisualWorks 64 bits, IIRC k=15 for VA
> and VisualSmalltalk).  

GemStone is a Smalltalk that does not answer consecutive values for
identityHash. In GemStone the identityHash is computed from the object's
OOP, and OOPs are not consecutive. Every object's identityHash is
different; the range is not limited (at least in 32-bit GemStone; I'd
have to check what we do in 64-bit). And Smalltalk-80 basically used the
same scheme: though you could only have 32K objects, every one had a
different identityHash based on its OOP.

Also, most (all?) Smalltalks with limited ranges for identityHash do
have a larger range of identityHash for SmallIntegers (usually ^self),
so you can't use the clever hacks if you might have any SmallIntegers in
your collection. So any general-purpose collection must already deal
with the full SmallInteger range of identity hashes as keys, cannot use
the clever hacks, and so is likely to only be improved by changing
#identityHash. This is a key point that I forgot to bring up last night.
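In other words, reducing the hash modulo the (prime) table size already has to cope with hashes of any magnitude, as a short Python sketch illustrates (4099 is a stand-in for whatever prime table size is actually chosen):

```python
# Illustration (assumptions: SmallInteger>>identityHash answers self, so keys
# may hash anywhere in the SmallInteger range; 4099 is a stand-in prime size).
TABLE_SIZE = 4099

def bucket_for(hash_value):
    return hash_value % TABLE_SIZE     # handles any magnitude uniformly

print(bucket_for(7))                   # a SmallInteger hashing to itself
print(bucket_for((1 << 30) - 1))       # largest positive SmallInteger key
```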

>
> So that's where we stand.  I think I understand where Martin is coming
> from, and I think Martin understands where I am coming from.  One way or
> the other, we also agree that something should be done to make hashed
> collections faster in Pharo.

Yes, Andres and I are in far more agreement than disagreement about the
hashing changes.

Regards,

-Martin


Re: Hashed collection changes, the performance graphs

Andres Valloud-4
Martin,

One of the constituencies I thought of when I decided to leave
identityHash alone was folks like you.  Now, as a representative, if you
are ok with dealing with broken identityHash senders (which I hope will
be few), then most of my motivation for leaving identityHash unchanged
is gone.  Thus, I would not mind changing identityHash and implementing
primIdentityHash.

What about others?  Would anybody mind if identityHash was changed?

Some comments below...

> I took a survey of the senders of #identityHash in the latest web image.
> There aren't that many. The largest category is those that want the
> printString of the identityHash.
>  

These would probably need to be changed to get the printString of the
primIdentityHash.

> Of those that care about the value of the identityHash, there are
> several that use it in #hash methods. The most common is this definition:
>
> hash
>    ^self identityHash
>
> These are presumably overriding superclass behavior to restore Object
> behavior.

I'd like to take a look at these, I suspect there may be low hanging
fruit waiting to be fixed.

> If the authors knew about the limited range of #identityHash, that is
> entirely possible. I tend to think it more likely that in most cases
> these implementations are just the simplest way to follow the dictate
> that 'a=b -> a hash = b hash', and that they didn't really think about
> the impact on collection performance.
>  

Or maybe they chose identityHash because they can assume uniqueness (=
effectively being ==)...

> 5 improved, 2 harmed. And one of the listed harmed is MethodDictionary,
> whose performance would not be harmed, but I assume the VM would not be
> happy if their hashing was changed (anybody know for sure whether that's
> true?)
>  

The VM probably knows a lot about identityHash values, and most likely
uses the primIdentityHash values because then it doesn't have to shift
on access.

> They could, and I admit to having written this kind of code in the past,
> but I doubt that I'm typical in doing so. Do you know of any Pharo code
> that actually *does* this sort of thing? There isn't any in the
> distributed web image, but I didn't look at every package that is meant
> to be loadable in Pharo.
>  

I might suspect that Magma does this kind of stuff... but that's just a
guess.  I didn't immediately see any code doing so.  As long as package
maintainers are fine with two quite different versions of Pharo with
very different identityHash method behaviors, then I do not have a problem.

>> Clever hacks such as
>>
>> SomeObject>>hash
>>
>>   ^(self variableA identityHash bitShift: 12) + self variableB identityHash
>>
>>
>> would also remain undisturbed.  
>>    
>
> Yes, if #identityHash is changed it's the clever hacks that will have to
> change. This could be a disadvantage of this approach, but often, as in
> the case of IdentityDictionary, IdentitySet, and
> WeakIdentityKeyDictionary, the necessary change is simply to remove the
> clever hack, get simpler code, and enjoy better performance than you got
> with the clever hack, so making the change is IMO an improvement.
>  

We agree, modulo this: I wouldn't want to impose version maintenance homework on
maintainers of large packages.  For the sake of illustration only, and
using Magma without knowing if it would be affected, I wouldn't want
whoever is maintaining Magma to maintain two branches... one for Pharo
1.xyz, and one for Pharo 1.xyz++.

>> Finally, I do not know of any Smalltalk
>> in which identityHash does not answer the actual object header bits for
>> the identity hash. If we change identityHash, then AFAIK Pharo would
>> become the only Smalltalk in which identityHash does not answer
>> consecutive values between 0 and (2^k)-1 (k=12 for Squeak/Pharo, k=14
>> for VisualWorks 32 bits, k=20 for VisualWorks 64 bits, IIRC k=15 for VA
>> and VisualSmalltalk).  
>>    
>
> GemStone is a Smalltalk that does not answer consecutive values for
> identityHash.

Haha, I was thinking of "regular" image based Smalltalks...

> In GemStone the identityHash is computed from the object's
> OOP, and OOPs are not consecutive.

Not necessarily, although I suspect identityHash values map to an
integer interval along the lines of [0, 2^40-1].  So, if you look at
hash(x) as a function, the image of hash(x) is a set of consecutive
intervals.  Using bitShift: to scale identityHash values would make the
image of hash(x) sparse (with the exception of small integers,
characters and, to some extent in VW 64 bit, small doubles).

>  And Smalltalk-80 basically used the
> same scheme, though you could only have 32K objects, every one had a
> different identityHash based on OOP.
>  

These are also consecutive values... [0, 2^15-1], basically.

> Also, most (all?) Smalltalks with limited ranges for identityHash do
> have a larger range of identityHash for SmallIntegers (usually ^self),
> so you can't use the clever hacks if you might have any SmallIntegers in
> your collection. So any general-purpose collection must already deal
> with the full SmallInteger range of identity hashes as keys, cannot use
> the clever hacks, and so is likely to only be improved by changing
> #identityHash. This is a key point that I forgot to bring up last night.
>  

Well, more or less, because with scaledIdentityHash you'd need to
implement it in SmallInteger as ^self... but yes, I think hashed
collections shouldn't be put into a position where they judge what's a
good hash value and what isn't (and  spend CPU time doing so at
runtime!!!).  Java does this, and as far as I could see back when I
studied Java's hashing implementation, IMO it's not a good idea.

Andres.


Re: Hashed collection changes, the performance graphs

johnmci
I looked at the Squeak identityHash a few years back, with two questions in mind:

(a) How efficient was the algorithm? I attempted at one time to replace it with a fixed array of numbers. There was some benefit on slow machines.

(b) Could we calculate the identity hash only when required? But then how do you mark the oops with that fact when all the identityHash values are in use in the image? The hassle of fixing an image to use the new hash logic wasn't worth the gain, which was difficult to measure.

In both cases I was more interested in the CPU usage on those 68K machines.


On Wed, Oct 28, 2009 at 7:14 PM, Andres Valloud <[hidden email]> wrote:
Martin,

One of the constituencies I thought of when I decided to leave
identityHash alone was folks like you.  
--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================




Re: Hashed collection changes, the performance graphs

Stéphane Ducasse
In reply to this post by Martin McClure-2
Yes, this is my gut feeling too.

> If the authors knew about the limited range of #identityHash, that is
> entirely possible. I tend to think it more likely that in most cases
> these implementations are just the simplest way to follow the dictate
> that 'a=b -> a hash = b hash', and that they didn't really think about
> the impact on collection performance.



Re: Hashed collection changes, the performance graphs

Stéphane Ducasse
In reply to this post by Andres Valloud-4
What I can tell you is that I ***love*** this discussion.
It illustrates the spirit of Pharo that we want to push.
Let's make a better and cooler Smalltalk :)
And people are even picking up nice knowledge.
Thanks

On Oct 29, 2009, at 3:14 AM, Andres Valloud wrote:

> Martin,
>
> One of the constituencies I thought of when I decided to leave
> identityHash alone was folks like you.  Now, as a representative, if you
> are ok with dealing with broken identityHash senders (which I hope will
> be few), then most of my motivation for leaving identityHash unchanged
> is gone.  Thus, I would not mind changing identityHash and implementing
> primIdentityHash.
>
> What about others?  Would anybody mind if identityHash was changed?
>
> Some comments below...
>
>> I took a survey of the senders of #identityHash in the latest web image.
>> There aren't that many. The largest category is those that want the
>> printString of the identityHash.
>>
>
> These would probably need to be changed to get the printString of the
> primIdentityHash.
>
>> Of those that care about the value of the identityHash, there are
>> several that use it in #hash methods. The most common is this definition:
>>
>> hash
>>   ^self identityHash
>>
>> These are presumably overriding superclass behavior to restore Object
>> behavior.
>
> I'd like to take a look at these, I suspect there may be low hanging
> fruit waiting to be fixed.
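
To see what is at stake with these hash-restoring methods, here is a
workspace sketch (for illustration only, not part of Martin's survey):
Object>>hash defaults to the 12-bit identity hash, so any population of
such objects shares at most 4096 distinct hash values, and a hashed
collection of n of them averages about n / 4096 keys per hash value.

| objects distinctHashes |
objects := (1 to: 10000) collect: [ :i | Object new ].
distinctHashes := (objects collect: [ :each | each hash ]) asSet size.
distinctHashes.   "at most 4096, so roughly 2.4 objects share each hash value here"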
>
>> If the authors knew about the limited range of #identityHash, that is
>> entirely possible. I tend to think it more likely that in most cases
>> these implementations are just the simplest way to follow the dictate
>> that 'a=b -> a hash = b hash', and that they didn't really think about
>> the impact on collection performance.
>>
>
> Or maybe they chose identityHash because they can assume uniqueness (=
> effectively being ==)...
>
>> 5 improved, 2 harmed. And one of the listed harmed is MethodDictionary,
>> whose performance would not be harmed, but I assume the VM would not be
>> happy if their hashing was changed (anybody know for sure whether that's
>> true?)
>>
>
> The VM probably knows a lot about identityHash values, and most likely
> uses the primIdentityHash values because then it doesn't have to shift
> on access.
>
>> They could, and I admit to having written this kind of code in the past,
>> but I doubt that I'm typical in doing so. Do you know of any Pharo code
>> that actually *does* this sort of thing? There isn't any in the
>> distributed web image, but I didn't look at every package that is meant
>> to be loadable in Pharo.
>>
>
> I might suspect that Magma does this kind of stuff... but that's just a
> guess.  I didn't immediately see any code doing so.  As long as package
> maintainers are fine with two quite different versions of Pharo with
> very different identityHash method behaviors, then I do not have a
> problem.
>
>>> Clever hacks such as
>>>
>>> SomeObject>>hash
>>>
>>>  ^(self variableA identityHash bitShift: 12) + self variableB identityHash
>>>
>>>
>>> would also remain undisturbed.
>>>
>>
>> Yes, if #identityHash is changed it's the clever hacks that will have to
>> change. This could be a disadvantage of this approach, but often, as in
>> the case of IdentityDictionary, IdentitySet, and
>> WeakIdentityKeyDictionary, the necessary change is simply to remove the
>> clever hack, get simpler code, and enjoy better performance than you got
>> with the clever hack, so making the change is IMO an improvement.
>>
>
> We agree, modulo that I wouldn't want to impose version maintenance
> homework on maintainers of large packages.  For the sake of illustration
> only, and using Magma without knowing if it would be affected, I wouldn't
> want whoever is maintaining Magma to maintain two branches... one for
> Pharo 1.xyz, and one for Pharo 1.xyz++.
>
>>> Finally, I do not know of any Smalltalk
>>> in which identityHash does not answer the actual object header bits for
>>> the identity hash. If we change identityHash, then AFAIK Pharo would
>>> become the only Smalltalk in which identityHash does not answer
>>> consecutive values between 0 and (2^k)-1 (k=12 for Squeak/Pharo, k=14
>>> for VisualWorks 32 bits, k=20 for VisualWorks 64 bits, IIRC k=15 for VA
>>> and VisualSmalltalk).
>>>
>>
>> GemStone is a Smalltalk that does not answer consecutive values for
>> identityHash.
>
> Haha, I was thinking of "regular" image based Smalltalks...
>
>> In GemStone the identityHash is computed from the object's
>> OOP, and OOPs are not consecutive.
>
> Not necessarily, although I suspect identityHash values map to an
> integer interval along the lines of [0, 2^40-1].  So, if you look at
> hash(x) as a function, the image of hash(x) is a set of consecutive
> intervals.  Using bitShift: to scale identityHash values would make the
> image of hash(x) sparse (with the exception of small integers,
> characters and, to some extent in VW 64 bit, small doubles).
>
>> And Smalltalk-80 basically used the
>> same scheme, though you could only have 32K objects, every one had a
>> different identityHash based on OOP.
>>
>
> These are also consecutive values... [0, 2^15-1], basically.
>
>> Also, most (all?) Smalltalks with limited ranges for identityHash do
>> have a larger range of identityHash for SmallIntegers (usually ^self),
>> so you can't use the clever hacks if you might have any SmallIntegers in
>> your collection. So any general-purpose collection must already deal
>> with the full SmallInteger range of identity hashes as keys, cannot use
>> the clever hacks, and so is likely to only be improved by changing
>> #identityHash. This is a key point that I forgot to bring up last night.
>>
>
> Well, more or less, because with scaledIdentityHash you'd need to
> implement it in SmallInteger as ^self... but yes, I think hashed
> collections shouldn't be put into a position where they judge what's a
> good hash value and what isn't (and spend CPU time doing so at
> runtime!!!).  Java does this, and as far as I could see back when I
> studied Java's hashing implementation, IMO it's not a good idea.
>
> Andres.
>



Re: Hashed collection changes, the performance graphs

Stéphane Ducasse
In reply to this post by Andres Valloud-4
>
> We agree, modulo that I wouldn't want to impose version maintenance
> homework on maintainers of large packages.  For the sake of illustration
> only, and using Magma without knowing if it would be affected, I wouldn't
> want whoever is maintaining Magma to maintain two branches... one for
> Pharo 1.xyz, and one for Pharo 1.xyz++.

They will have to, and Metacello will help them with that.
Evolution has a price, and I think that, considering the advantages,
the price is low enough that this will be ok.

> Well, more or less, because with scaledIdentityHash you'd need to
> implement it in SmallInteger as ^self... but yes, I think hashed
> collections shouldn't be put into a position where they judge what's a
> good hash value and what isn't (and spend CPU time doing so at
> runtime!!!).  Java does this, and as far as I could see back when I
> studied Java's hashing implementation, IMO it's not a good idea.

Stef


Re: Hashed collection changes, the performance graphs

Stéphane Ducasse
In reply to this post by Andres Valloud-4
thanks for this cool mail!


On Oct 29, 2009, at 12:51 AM, Andres Valloud wrote:

> Stephane et al,
>
> Martin and I discussed this yesterday night at the PDX Smalltalk user
> group meeting.  Mostly we agree, with perhaps one small exception.  In
> order to make hashed collections simpler and identity hash handling more
> straightforward, hashed collections should get a small integer range
> identity hash value.  Something between 0 and SmallInteger maxVal, for
> example.  The issue is how to get that.
>
> Martin's approach is to change identityHash itself, and provide
> primIdentityHash to access the existing identityHash implementation
> (primitive 75).  My approach is to leave identityHash as is, and provide
> a new message called scaledIdentityHash.  Note that in both Martin's and
> my implementation, the new identity hash values are identical.  In other
> words, whereas Martin's changes implement
>
> identityHash
>
> ^self primIdentityHash bitShift: 18
>
>
> my changes implement
>
> scaledIdentityHash
>
> ^self identityHash bitShift: 18
>
>
> So, really, this is an issue of how to implement the *same* behavior,
> with the *same* goals.  Now, there are advantages and disadvantages for
> both.  I'll try to explain.  Martin, please feel free to correct what
> follows if I miss something.
>
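
(For illustration only -- the sketch below is not from either changeset;
it just shows the effect both spellings aim for. With only 4096 possible
raw identity hash values, unscaled hashes can only ever land in slots 0
through 4095 of a table, whereas the same 4096 values scaled by
bitShift: 18 scatter across the whole index range of a large prime-sized
table:

| rawHashes tableSize rawSlots scaledSlots |
rawHashes := 0 to: 4095.    "every value a 12-bit identity hash can take"
tableSize := 100003.        "a prime table size, well above 4096"
rawSlots := rawHashes collect: [ :h | h \\ tableSize ].
scaledSlots := rawHashes collect: [ :h | (h bitShift: 18) \\ tableSize ].
rawSlots asSortedCollection last.      "4095 -- all entries crowd the low slots"
scaledSlots asSortedCollection last.   "near tableSize -- spread across the table"

Spreading the occupied slots apart keeps probe chains short once a
collection outgrows 4096 elements.)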
> There are two kinds of senders of identityHash.  Some (most) do not care
> whether the answer is well distributed or not.  Typically, these are
> implementors of #hash, for example:
>
> SomeObject>>hash
>
> ^self someInstanceVariable identityHash
>
>
> If, on one hand, we change identityHash, then the behavior of these hash
> methods improves because the values they answer would be distributed more
> uniformly.  Nevertheless, precisely because they send identityHash, it
> could be argued that these hash methods were implemented assuming there
> would be just a few of these objects in hashed collections.  If this
> turns out not to be the case, then performance could be much better by
> implementing a better hash method.
>
> Some (a few) other senders of identityHash assume that the answer has a
> particular distribution (in Squeak's/Pharo's case, 0 through 4095).
> Consequently, the senders may do things like this:
>
> SomeObject>>hash
>
> ^(self variableA identityHash bitShift: 12) + self variableB identityHash
>
>
> Also, there may be hashed collections which use identityHash to index
> hash buckets (e.g.: GemStone's own 32 bit object cache for VisualWorks,
> which uses identityHash to access hash buckets).  So, if we change
> identityHash, these senders which required special care to write would
> break, requiring them to be written like this:
>
> SomeObject>>hash
>
> ^(self variableA primIdentityHash bitShift: 12) + self variableB primIdentityHash
>
>
> If, on the other hand, we provide a new scaledIdentityHash method, then
> all senders of identityHash keep behaving the same way as they do now.
> Hashed collections, which currently send identityHash, would need to
> change so that they send scaledIdentityHash instead.  Furthermore, any
> implementor of hash such as
>
> SomeObject>>hash
>
> ^self someInstanceVariable identityHash
>
>
> would need to be rewritten as
>
> SomeObject>>hash
>
> ^self someInstanceVariable scaledIdentityHash
>
>
> if hashed collections are expected to contain numerous instances of
> SomeObject.  If these are not rewritten, then they would continue to
> have the same performance characteristics they have today (which may be
> adequate to begin with).  Clever hacks such as
>
> SomeObject>>hash
>
> ^(self variableA identityHash bitShift: 12) + self variableB identityHash
>
>
> would also remain undisturbed.  Finally, I do not know of any Smalltalk
> in which identityHash does not answer the actual object header bits for
> the identity hash.  If we change identityHash, then AFAIK Pharo would
> become the only Smalltalk in which identityHash does not answer
> consecutive values between 0 and (2^k)-1 (k=12 for Squeak/Pharo, k=14
> for VisualWorks 32 bits, k=20 for VisualWorks 64 bits, IIRC k=15 for VA
> and VisualSmalltalk).  Moreover, AFAIK, identityHash has had this
> behavior for decades.  Is it a good idea to change it?  I am not so sure
> on that one.  Also, I wouldn't want to break clever hacks that assume
> identityHash behaves the way it currently does, and I wouldn't want to
> make Pharo different gratuitously.
>
> So that's where we stand.  I think I understand where Martin is coming
> from, and I think Martin understands where I am coming from.  One way or
> the other, we also agree that something should be done to make hashed
> collections faster in Pharo.
>
> Andres.
>
>
> Stéphane Ducasse wrote:
>> Martin
>>
>> so what you are saying is that we can gain speed. And that these
>> changes are beneficial for simple default.
>>
>>
>>
>>> The test code, a variant of the test from the HashTable SqueakSource
>>> wiki, is at the bottom of this message. Basically, it adds a bunch
>>> of instances of Object to a Dictionary and sees how much time it
>>> takes to look them up again.
>>>
>>> From the graphs in the previous message, you can see that
>>> performance for sizes > 4000 is greatly improved. For size = 10000,
>>> #at: is 1000 times faster, 2-3 microseconds vs. >2 milliseconds. At
>>> size=10000, #at:put: is about 200 times faster, ~10 microseconds vs.
>>> >2 milliseconds, and the large spikes for growing the collection are
>>> 21 milliseconds vs. >4 seconds, again a factor of about 200.
>>>
>>> Performance for dictionary sizes < 4000 is essentially the same as
>>> before, so these collections can serve as general-purpose
>>> collections over a wide range of sizes. I've attached the graphs for
>>> sizes <4000 to this message so you can see that more clearly than on
>>> the previous graphs.
>>>
>>> These results should hold for any object that inherits #hash from
>>> Object, in other words uses its identity hash as its equality hash.
>>> Other objects with better hashes did not have as serious a problem,
>>> but will probably show increased performance as well due to using
>>> prime table sizes.
>>>
>>> These changes are in Set, so should improve Set's subclasses as
>>> well. IdentityDictionary, IdentitySet, and WeakIdentityKeyDictionary
>>> did not have as serious a problem, but should see some improvement.
>>> MethodDictionaries have been left alone on the assumption that the
>>> VM depends on the hashing of those.
>>>
>>
>> So where are these changes :)
>> Should we integrate them?
>> Andres and other hash freaks, what is your point of view?
>>
>>
>>> Since there are still only 4K possible values of identity hash,
>>> collisions are inevitable in large collections, and the number of
>>> collisions will grow linearly with collection size. So how well does
>>> the spread hash / prime table size do at even larger sizes? I ran
>>> the same test at sizes of one million. As expected, access was quite
>>> a bit slower than it had been at 10000. Time for #at: was ~250
>>> microseconds, and #at:put: was about the same. Note, however, that
>>> this is still ten times faster than Dictionary previously was at a
>>> size of 10000; 100 times larger yet 10 times faster.
>>>
>>> I had a lot of fun doing this. This is better results than I
>>> expected, for a fairly minor (though deep) code change.
>>>
>>
>> So where are these changes :)
>> Should we integrate them?
>> Andres and other hash freaks, what is your point of view?
>>
>>
>>
>>> | test ord |
>>> Transcript cr.
>>> test := Dictionary new.
>>> [ test size >= 10000 ] whileFalse: [
>>>     ord := OrderedCollection new: 100.
>>>     Transcript show:
>>>         [ 100 timesRepeat: [
>>>             test at: (ord add: Object new) put: nil ] ] timeToRun asString.
>>>     Transcript tab;
>>>         show: test size asString;
>>>         tab.
>>>     Transcript show:
>>>         [ 1000 timesRepeat: [
>>>             ord do: [ :each | test at: each ] ] ] timeToRun asString.
>>>     Transcript tab; show:
>>>         [ 1000 timesRepeat: [
>>>             ord do: [ :each | ] ] ] timeToRun asString.
>>>     Transcript cr ]
>>>
>>> <SmallDictPerformanceGraphs1.png>
>>
>>
>>
>>
>

