Hash changes


Hash changes

Levente Uzonyi-2
Hi,

I uploaded the second part of Andrés' hash changes to the inbox. Every
package requires a new mcm. The load order is:

Collections-ul.217 "Initialization"
Kernel-ul.308 "Object >> #hash"
Kernel-ul.309
Kernel-ul.310
Collections-ul.218 "IdentityDictionary"
Collections-ul.219
Collections-ul.220 "KeyedIdentitySet"
Collections-ul.221
Collections-ul.222 "WeakIdentityKeyDictionary"
Collections-ul.223
Collections-ul.224 "IdentitySet"
Collections-ul.225
System-ul.185 "SystemDictionary"
System-ul.186
System-ul.187
Collections-ul.226 "Cleanup"
Collections-ul.227

These packages modify the following:
- Object >> #hash returns an integer between 0 and 1073479680 instead of
   between 0 and 4096, though the range is not contiguous (a sketch of the
   idea follows below)
- identity-based collections use the same range-extension method as
   Object >> #hash
- SystemDictionary uses #hash instead of #identityHash
- all preambles and postscripts are removed from the Collections, Kernel and
   System packages
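
[A minimal sketch of the range extension idea, for illustration only; the
actual method in the Kernel-ul packages may differ in detail. The 12-bit
identity hash is multiplied by a constant so that the largest value,
4095 * 262144 = 1073479680, still fits in a SmallInteger, while objects with
consecutive identity hashes no longer hash to consecutive values:

Object >> hash
    "Sketch only, not necessarily the code in Kernel-ul: spread the 12-bit
    identity hash (0..4095) over a wider, non-contiguous range."
    ^self identityHash * 262144
]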

Cheers,
Levente


Re: Hash changes

David T. Lewis
I think that this is important work. Most of the background discussion
took place on the Pharo list, so some may not have seen it. I personally
am not qualified to do anything with this, but hopefully others can
take a look at it and incorporate it in trunk if appropriate.

Dave




Re: Hash changes

Andreas.Raab
I'll be happy to do this but can someone summarize the expected effects
of those changes? The main thing I don't understand is why it is
advantageous to simply scale the id-hash given that it doesn't actually
increase the range of hashes, i.e., id-hash is still limited to 12 bits
so no matter how much you multiply these values the result will still
only be 4096 distinct values. Why is it advantageous to make these
numbers artificially larger without increasing the number of hash bits?

Cheers,
   - Andreas




Re: Re: Hash changes

Nicolas Cellier
From what I understand:

It does not reduce the rate of hash collisions.
It just reduces the cost of each collision by avoiding consecutive hashes.
Consecutive hashes are evil because, in case of a collision, scanFor: tends
to have linear cost (it scans all consecutive slots until it finds an empty
one).

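[To illustrate the point, a rough sketch of a linear-probing lookup, not the
actual HashedCollection >> #scanFor: code: the scan starts at the slot derived
from the hash and walks consecutive slots until it finds the object or an
empty slot, so long runs of occupied slots make every lookup and insertion
expensive.

| table probe |
table := Array new: 13.
probe := [ :key |
    | index |
    "Start at the slot derived from the hash, then walk consecutive slots
    until the key or an empty slot is found (the full-table case is left
    out for brevity)."
    index := key hash \\ table size + 1.
    [ (table at: index) isNil or: [ (table at: index) = key ] ]
        whileFalse: [ index := index \\ table size + 1 ].
    index ].
table at: (probe value: #foo) put: #foo.
probe value: #foo "answers the slot where #foo was stored"
]
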
Oh, but you participated in this discussion, didn't you?
http://bugs.squeak.org/view.php?id=1876

( via http://code.google.com/p/pharo/issues/detail?id=213 )

Nicolas



Re: Re: Hash changes

Levente Uzonyi-2
In reply to this post by Andreas.Raab
On Mon, 30 Nov 2009, Andreas Raab wrote:

> I'll be happy to do this but can someone summarize the expected effects of
> those changes? The main thing I don't understand is why it is advantageous to
> simply scale the id-hash given that it doesn't actually increase the range of
> hashes, i.e., id-hash is still limited to 12 bits so no matter how much you
> multiply these values the result will still only be 4096 distinct values. Why
> is it advantageous to make these numbers artificially larger without
> increasing the number of hash bits?
>

The following test is a modified version of Martin McClure's test
(microbenchmark); the original can be found here:
http://lists.gforge.inria.fr/pipermail/pharo-project/2009-October/015070.html
Run the test before and after the changes.

| test array |
Transcript open; cr.
test := Dictionary new.
[ test size >= 10000 ] whileFalse: [
    Smalltalk garbageCollect.
    array := Array new: 100 streamContents: [ :stream |
        100 timesRepeat: [ stream nextPut: Object new ] ]. "Create 100 new Objects"
    Transcript
        print: [ "Add the 100 new Objects to the Dictionary as keys with nil values"
            1 to: array size do: [ :index |
                test at: (array at: index) put: nil ] ] timeToRun;
        tab;
        print: test size;
        tab;
        print: [ "Access the added Objects 1000 times"
            1 to: 1000 do: [ :iteration |
                1 to: array size do: [ :index |
                    test at: (array at: index) ] ] ] timeToRun;
        tab;
        print: [ "This should be subtracted from the previous value"
            1 to: 1000 do: [ :iteration |
                1 to: array size do: [ :index |
                    array at: index ] ] ] timeToRun;
        cr;
        flush ]

The most important columns of the results printed to the Transcript are the
second (the number of elements in the dictionary) and the third (the time to
access the 100 most recently added elements 1000 times).

(I'm about to upload 2 more packages because the inlining trick
used in the implementation doesn't work with SmallIntegers (LargeIntegers
can be created while calculating the hash) and the enhancement for WeakSet
growing is missing)
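
[For reference, a quick illustration (assuming a 32-bit image) of the
SmallInteger boundary involved: results that exceed SmallInteger maxVal
become LargeIntegers, which is presumably what gets in the way of the
inlining mentioned above.

SmallInteger maxVal. "1073741823, i.e. (2 raisedTo: 30) - 1 on a 32-bit image"
(4095 * 262144) class. "SmallInteger - the largest scaled hash still fits"
(SmallInteger maxVal + 1) class. "LargePositiveInteger - one step beyond overflows"
]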

Levente



Re: Re: Hash changes

Levente Uzonyi-2
On Mon, 30 Nov 2009, Levente Uzonyi wrote:
>
> (I'm about to upload 2 more packages because the inlining trick used in the
> implementation doesn't work with SmallIntegers (LargeIntegers can be created
> while calculating the hash) and the enhancement for WeakSet growing is
> missing)

Two packages are not sufficient, so I'll update some of the existing mczs.

Levente


Re: Hash changes

Andreas.Raab
In reply to this post by Levente Uzonyi-2
Levente Uzonyi wrote:
> The following test is a modified version of Martin McClure's test
> (microbenchmark) the original can be found here:
> http://lists.gforge.inria.fr/pipermail/pharo-project/2009-October/015070.html 
>
> Run the test before and after the changes.

Since you have the changes loaded, why not just post your results? Also,
could you try making two modifications to the benchmark to see if this
has any impact:

> | test array |
> Transcript open; cr.
> test := Dictionary new.

"There used to be an old issue with growth of Sets that were initially
sized with goodSizeFor: 3. Try working around it by starting at 5."
test := Dictionary new: 5.

> [ test size >= 10000 ] whileFalse: [
>     Smalltalk garbageCollect.
>     array := Array new: 100 streamContents: [ :stream |
>         100 timesRepeat: [ stream nextPut: Object new ] ]. "Create 100
> new Objects"

"Having sequences of objects with strictly consecutive hashes is not
representative for real systems. Instead, create 4096 objects and choose
100 atRandom."

objSet := Array new: 4096 streamContents: [ :stream |
        4096 timesRepeat: [ stream nextPut: Object new ] ].
array := (1 to: 100) collect:[:i| objSet atRandom].

Cheers,
   - Andreas



Re: Re: Hash changes

Levente Uzonyi-2
On Mon, 30 Nov 2009, Andreas Raab wrote:

>> | test array |
>> Transcript open; cr.
>> test := Dictionary new.
>
> "There used to be an old issue with growth of Sets that were initially
> sized with goodSizeFor: 3. Try working around it by starting at 5."
> test := Dictionary new: 5.
>

This shouldn't be an issue in this case. The capacities of new
HashedCollections have already been modified to be carefully selected
primes. AFAIK only hashMultiply doesn't like 5 as a capacity, but it's not
involved here. Anyway, I added this to the new test.
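
[A quick way to check the capacity actually chosen, assuming
HashedCollection >> #capacity is available in the image:

(Dictionary new: 5) capacity "answers the allocated capacity (a prime, per the above)"
]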

> "Having sequences of objects with strictly consecutive hashes is not
> representative for real systems. Instead, create 4096 objects and choose 100
> atRandom."
>
> objSet := Array new: 4096 streamContents: [ :stream |
> 4096 timesRepeat: [ stream nextPut: Object new ] ].
> array := (1 to: 100) collect:[:i| objSet atRandom].
>

Good point, but the hashes are not consecutive (at least with my VM). Since
we need 10000 different Objects, I added this idea to the test with some
modifications: 10000 Objects are created up front, then shuffled and used as
needed.

The modified test:

| test array objects |
Transcript open; cr.
test := Dictionary new: 5.
objects := ((1 to: 10000) collect: [ :each | Object new ]) shuffled readStream.
[ test size >= 10000 ] whileFalse: [
    Smalltalk garbageCollect.
    array := Array new: 100 streamContents: [ :stream |
        100 timesRepeat: [ stream nextPut: objects next ] ]. "Take the next 100 Objects"
    Transcript
        print: [ "Add the 100 Objects to the Dictionary as keys with nil values"
            1 to: array size do: [ :index |
                test at: (array at: index) put: nil ] ] timeToRun;
        tab;
        print: test size;
        tab;
        print: [ "Access the added Objects 1000 times"
            1 to: 1000 do: [ :iteration |
                1 to: array size do: [ :index |
                    test at: (array at: index) ] ] ] timeToRun;
        tab;
        print: [ "This should be subtracted from the previous value"
            1 to: 1000 do: [ :iteration |
                1 to: array size do: [ :index |
                    array at: index ] ] ] timeToRun;
        cr;
        flush ]

The results:
Updated image:
0 100 93 4
0 200 144 4
0 300 125 4
0 400 105 3
0 500 111 4
0 600 150 4
0 700 115 3
-- snip --
0 9600 297 3
0 9700 296 4
0 9800 299 5
1 9900 312 4
0 10000 309 4

Values in the third column increase somewhat, but not significantly; there
are no spikes etc.

Current image:
0 100 76 3
1 200 119 4
0 300 105 4
0 400 80 4
0 500 112 4
0 600 111 3
1 700 98 4
-- snip --
1 3700 1279 4
2 3800 1998 4
3 3900 3196 4
5 4000 4473 4
11 4100 11301 4
19 4200 18823 4
36 4300 36391 3
93 4400 45294 4
47 4500 46674 3
...

Around 4000 elements things get really slow. There are also smaller spikes
near 2500. I didn't wait for 4600. As you can see in the first column,
#at:put: consumes more time with every iteration.


Levente


Re: Hash changes

Andreas.Raab
Thanks. Now I see what's happening. The problem is that adding objects
without a scaled hash to a "regular" Dictionary creates hugely
disproportionate collision chains as soon as we get north of 4096 objects.
At this point all the possible slots that any new object could start from
by default are already taken, and since we use a linear collision chain
they combine into one gigantonormeous (hope I spelled that correctly :-)
collision chain.

That explains the difference: for larger numbers of elements we end up
(with your changes) with an average collision chain length of (dict size
// 4096) for both adding and accessing, whereas without the scaling we end
up with a collision chain of length (dict size) for adding and (dict size
// 2) for accessing, once all the possible initial slots are filled up and
spill over.

(and we should probably add a comment explaining the issue)
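
[A quick way to see the 12-bit limit in action, illustration only; the exact
count depends on the VM:

| hashes |
hashes := Set new.
10000 timesRepeat: [ hashes add: Object new identityHash ].
hashes size "at most 4096 distinct values, so a dictionary holding more keys than that must chain"
]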

Cheers,
   - Andreas




Re: Re: Hash changes

Levente Uzonyi-2
In reply to this post by Levente Uzonyi-2
On Tue, 1 Dec 2009, Levente Uzonyi wrote:

> On Mon, 30 Nov 2009, Levente Uzonyi wrote:
>>
>> (I'm about to upload 2 more packages because the inlining trick used in the
>> implementation doesn't work with SmallIntegers (LargeIntegers can be
>> created while calculating the hash) and the enhancement for WeakSet growing
>> is missing)
>
> Two packages are not sufficent, so I'll update some of the exisiting mczs.
>

The new versions are in the inbox. The load order is:
Collections-ul.217 "Initialization"
Kernel-ul.312 "Object >> #hash"
Kernel-ul.313
Kernel-ul.314
Collections-ul.228 "IdentityDictionary"
Collections-ul.229
Collections-ul.230 "KeyedIdentitySet"
Collections-ul.231
Collections-ul.232 "WeakIdentityKeyDictionary"
Collections-ul.233
Collections-ul.234 "IdentitySet"
Collections-ul.235
System-ul.185 "SystemDictionary"
System-ul.186
System-ul.187
Collections-ul.236 "Cleanup"
Collections-ul.237

Every package requires a new mcm.
These changes also include the WeakSet growing tweak.

Please ignore the following versions from the inbox:
Kernel-ul.308
Kernel-ul.309
Kernel-ul.310
Collections-ul.218
Collections-ul.219
Collections-ul.220
Collections-ul.221
Collections-ul.222
Collections-ul.223
Collections-ul.224
Collections-ul.225
Collections-ul.226
Collections-ul.227


Levente


Re: Hash changes

Andreas.Raab
Levente Uzonyi wrote:
> The new versions are in the inbox. The load order is:

Thanks, I'll give it a shot. Folks, if you read this please abstain from
posting or updating while I'm in the process of integrating these
changes. I'll post an all clear message when I'm done. (plus: brace for
a flurry of commit messages and knock on wood that the server doesn't
die on the last commit ;-)

Cheers,
   - Andreas




All Clear (Re: Hash changes)

Andreas.Raab
Folks -

The updates went smoothly and I updated an image without problems. Looks
like we're good. Couple of comments:

- I was really impressed by the collection rehashing hack. I had not
seen this before.

- I'm actually thinking that we might want to keep the rehashing utility
with a comment on how to use it. It looks like a really useful tool for
doing this kind of stuff.

- Is there a standard way to deal with dog-slow tests (like searching
all the source code of Squeak, or running the decompiler on all methods)?
These few, crazily slow tests make it virtually impossible to "run all
tests" after loading each change if you want them to finish in a finite
amount of time. As a consequence, I've only run the collection tests.

Cheers,
   - Andreas


Re: All Clear (Re: Hash changes)

Levente Uzonyi-2
On Mon, 30 Nov 2009, Andreas Raab wrote:

> Folks -
>
> The updates went smoothly and I updated an image without problems. Looks like
> we're good. Couple of comments:
>
> - I was really impressed by the collection rehashing hack. I had not seen
> this before.
>
> - I'm actually thinking that we might want to keep the rehashing utility with
> a comment on how to use it. It looks like a really useful tool for doing this
> kind of stuff.
>

I wouldn't add it to the image, since it's rarely useful. But I plan to set
up a repository with various collections-related stuff, and it could be made
available there.

> - Is there a standard way to deal with dog-slow tests (like searching all the
> source code of Squeak, running the decompiler on all methods)? These few,
> crazily slow tests make it virtually impossible to "run all test" after
> loading each change if you want them to finish in a finite amount of time. As
> a consequence I've only run the collection tests.

I used to run all tests but Compiler-Tests and Traits-* first. When
everything seems to be ok, I run only Compiler-Tests and Traits-* in the
background. IIRC these tests are slow, because MultiByteFileStream is
really slow. Somehow we should improve it, but I couldn't find a simple
way to do that.


Levente



Re: Re: Hash changes

Nicolas Cellier
In reply to this post by Andreas.Raab
2009/12/1 Andreas Raab <[hidden email]>:

> Thanks. Now I see what's happening. The problem is that adding objects
> without scaled hash to a "regular" Dictionary creates hugely disproportional
> collision chain(s) as soon as we get north of 4096 objects. At this point
> all the possible entries that any new object could have by default are
> already taken and since we use a linear collision chain they combine into
> one gigantonormeous (hope I spelled that correctly :-) collision chain.
>
> That explains the difference - for larger number of elements we end up (with
> your changes) with an average length of the collision chain being (dict size
> // 4096) for both adding and accessing where without the scale we end up
> with a collision chain of (dict size) length (for adding) and (dict size //
> 2) for accessing elements once all possible initial slots are filled up and
> spill over.
>
> (and we should probably add a comment explaining the issue)
>
> Cheers,
>  - Andreas
>

Yes, that's what I tried to explain previously...
The change does not change the RATE of collisions, only the COST of collisions.
But I must just be bad at English ;)

Cheers

Nicolas



Re: Hash changes

Andreas.Raab
Nicolas Cellier wrote:
> Yes, that's what I tried to explain previously...
> The change does not change the RATE of collisions, only the COST of collisions.
> But I must be just bad in English ;)

Nope, my disconnect was in not realizing the devastating effect of filling
*all* the initial slots and creating a single collision chain as a result.
I thought you were saying something along the lines that creating objects
with strictly consecutive hashes, i.e., hash(n+1) = hash(n)+1, would lead
to some degenerate collision chain, which is why I asked to randomize the
input to see if the behavior was caused by that dependency.

Cheers,
   - Andreas




Re: All Clear (Re: Hash changes)

laza
In reply to this post by Andreas.Raab
It took almost 17(!) hours to run all the tests on my poor old machine.
But it seems that there are no new failed tests.

Alex

 result  |              class              |            selector
---------+---------------------------------+--------------------------------
 error   | DecompilerTestFailuresCollector | testDecompilerInClassesCAtoCM
 error   | DecompilerTestFailuresCollector | testDecompilerInClassesENtoEZ
 error   | DecompilerTestFailuresCollector | testDecompilerInClassesTNtoTZ
 error   | DecompilerTests                 | testDecompilerInClassesENtoEZ
 error   | MirrorPrimitiveTests            | testMirrorAt
 error   | MirrorPrimitiveTests            | testMirrorClass
 error   | MirrorPrimitiveTests            | testMirrorEqEq
 error   | MirrorPrimitiveTests            | testMirrorInstVarAt
 error   | MirrorPrimitiveTests            | testMirrorPerform
 error   | MirrorPrimitiveTests            | testMirrorSize
 error   | SMDependencyTest                | test2
 failure | BitBltTest                      | testAllAlphasRgbAdd
 failure | BitBltTest                      | testAllAlphasRgbMax
 failure | BitBltTest                      | testAllAlphasRgbMin
 failure | BitBltTest                      | testAllAlphasRgbMinInvert
 failure | BitBltTest                      | testAllAlphasRgbMul
 failure | BitBltTest                      | testAllAlphasRgbSub
 failure | ClosureCompilerTest             | testDebuggerTempAccess
 failure | ClosureCompilerTest             | testInjectIntoDecompilations
 failure | ClosureCompilerTest             | testInjectIntoDecompiledDebugs
 failure | DebuggerUnwindBug               | testUnwindDebuggerWithStep
 failure | DecompilerTests                 | testDecompilerInClassesAAtoAM
 failure | DecompilerTests                 | testDecompilerInClassesBAtoBM
 failure | DecompilerTests                 | testDecompilerInClassesCAtoCM
 failure | DecompilerTests                 | testDecompilerInClassesCNtoCZ
 failure | DecompilerTests                 | testDecompilerInClassesDAtoDM
 failure | DecompilerTests                 | testDecompilerInClassesFAtoFM
 failure | DecompilerTests                 | testDecompilerInClassesFNtoFZ
 failure | DecompilerTests                 | testDecompilerInClassesGNtoGZ
 failure | DecompilerTests                 | testDecompilerInClassesHNtoHZ
 failure | DecompilerTests                 | testDecompilerInClassesINtoIZ
 failure | DecompilerTests                 | testDecompilerInClassesJNtoJZ
 failure | DecompilerTests                 | testDecompilerInClassesKAtoKM
 failure | DecompilerTests                 | testDecompilerInClassesLNtoLZ
 failure | DecompilerTests                 | testDecompilerInClassesMAtoMM
 failure | DecompilerTests                 | testDecompilerInClassesMNtoMZ
 failure | DecompilerTests                 | testDecompilerInClassesNAtoNM
 failure | DecompilerTests                 | testDecompilerInClassesOAtoOM
 failure | DecompilerTests                 | testDecompilerInClassesPAtoPM
 failure | DecompilerTests                 | testDecompilerInClassesPNtoPZ
 failure | DecompilerTests                 | testDecompilerInClassesRAtoRM
 failure | DecompilerTests                 | testDecompilerInClassesSAtoSM
 failure | DecompilerTests                 | testDecompilerInClassesSNtoSZ
 failure | DecompilerTests                 | testDecompilerInClassesTAtoTM
 failure | DecompilerTests                 | testDecompilerInClassesTNtoTZ
 failure | DecompilerTests                 | testDecompilerInClassesWAtoWM
 failure | LocaleTest                      | testIsFontAvailable
 failure | MCChangeNotificationTest        | testCoreMethodModified
 failure | MorphicToolBuilderTests         | testGetButtonSideEffectFree
 failure | ScannerTest                     | testLiteralSymbols
 failure | TextMorphTest                   | testInitialize
 failure | TraitFileOutTest                | testFileOutCategory
 failure | TraitTest                       | testTraitMethodClass
 failure | TraitTest                       | testTraitMethodSelector
(54 rows)




Re: All Clear (Re: Hash changes)

Andres Valloud-4
In reply to this post by Andreas.Raab
Andreas et al,

Andreas Raab wrote:
> Folks -
>
> The updates went smoothly and I updated an image without problems. Looks
> like we're good. Couple of comments:
>
> - I was really impressed by the collection rehashing hack. I had not
> seen this before.

Thanks :).

Andres.