Hash changes


Hash changes

Levente Uzonyi-2
Hi,

I uploaded the second part of Andrés' hash changes to the inbox. Every
package requires a new mcm. The load order is:

Collections-ul.217 "Initialization"
Kernel-ul.308 "Object >> #hash"
Kernel-ul.309
Kernel-ul.310
Collections-ul.218 "IdentityDictionary"
Collections-ul.219
Collections-ul.220 "KeyedIdentitySet"
Collections-ul.221
Collections-ul.222 "WeakIdentityKeyDictionary"
Collections-ul.223
Collections-ul.224 "IdentitySet"
Collections-ul.225
System-ul.185 "SystemDictionary"
System-ul.186
System-ul.187
Collections-ul.226 "Cleanup"
Collections-ul.227

These packages modify the following:
- Object >> #hash returns an integer between 0 and 1073479680 instead of
   between 0 and 4096, though the range is not contiguous (a sketch of the
   idea follows below)
- identity-based collections use the same range-extension method as
   Object >> #hash
- SystemDictionary uses #hash instead of #identityHash
- all preambles and postscripts are removed from the Collections, Kernel and
   System packages
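
[A minimal sketch of the range extension idea, for illustration only; the
actual method in the Kernel-ul packages may differ in detail. The 12-bit
identity hash is multiplied by a constant so that the largest value,
4095 * 262144 = 1073479680, still fits in a SmallInteger, while objects with
consecutive identity hashes no longer hash to consecutive values:

Object >> hash
    "Sketch only, not necessarily the code in Kernel-ul: spread the 12-bit
    identity hash (0..4095) over a wider, non-contiguous range."
    ^self identityHash * 262144
]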

Cheers,
Levente


Re: Hash changes

David T. Lewis
I think that this is important work. Most of the background discussion
took place on the Pharo list, so some may not have seen it. I personally
am not qualified to do anything with this, but hopefully others can
take a look at it and incorporate it in trunk if appropriate.

Dave




Re: Hash changes

Andreas.Raab
I'll be happy to do this but can someone summarize the expected effects
of those changes? The main thing I don't understand is why it is
advantageous to simply scale the id-hash given that it doesn't actually
increase the range of hashes, i.e., id-hash is still limited to 12 bits
so no matter how much you multiply these values the result will still
only be 4096 distinct values. Why is it advantageous to make these
numbers artificially larger without increasing the number of hash bits?

Cheers,
   - Andreas




Re: Re: Hash changes

Nicolas Cellier
From what I understand:

It does not reduce the rate of hash collisions.
It just reduces the cost of each collision by avoiding consecutive hashes.
Consecutive hashes are evil because, in case of a collision, scanFor: tends
to have linear cost (it scans all consecutive slots until it finds an empty
one).

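[To illustrate the point, a rough sketch of a linear-probing lookup, not the
actual HashedCollection >> #scanFor: code: the scan starts at the slot derived
from the hash and walks consecutive slots until it finds the object or an
empty slot, so long runs of occupied slots make every lookup and insertion
expensive.

| table probe |
table := Array new: 13.
probe := [ :key |
    | index |
    "Start at the slot derived from the hash, then walk consecutive slots
    until the key or an empty slot is found (the full-table case is left
    out for brevity)."
    index := key hash \\ table size + 1.
    [ (table at: index) isNil or: [ (table at: index) = key ] ]
        whileFalse: [ index := index \\ table size + 1 ].
    index ].
table at: (probe value: #foo) put: #foo.
probe value: #foo "answers the slot where #foo was stored"
]
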
Oh, but you participated in this discussion, didn't you?
http://bugs.squeak.org/view.php?id=1876

( via http://code.google.com/p/pharo/issues/detail?id=213 )

Nicolas



Re: Re: Hash changes

Levente Uzonyi-2
In reply to this post by Andreas.Raab
On Mon, 30 Nov 2009, Andreas Raab wrote:

> I'll be happy to do this but can someone summarize the expected effects of
> those changes? The main thing I don't understand is why it is advantageous to
> simply scale the id-hash given that it doesn't actually increase the range of
> hashes, i.e., id-hash is still limited to 12 bits so no matter how much you
> multiply these values the result will still only be 4096 distinct values. Why
> is it advantageous to make these numbers artificially larger without
> increasing the number of hash bits?
>

The following test is a modified version of Martin McClure's test
(microbenchmark); the original can be found here:
http://lists.gforge.inria.fr/pipermail/pharo-project/2009-October/015070.html
Run the test before and after the changes.

| test array |
Transcript open; cr.
test := Dictionary new.
[ test size >= 10000 ] whileFalse: [
    Smalltalk garbageCollect.
    array := Array new: 100 streamContents: [ :stream |
        100 timesRepeat: [ stream nextPut: Object new ] ]. "Create 100 new Objects"
    Transcript
        print: [ "Add the 100 new Objects to the Dictionary as keys with nil values"
            1 to: array size do: [ :index |
                test at: (array at: index) put: nil ] ] timeToRun;
        tab;
        print: test size;
        tab;
        print: [ "Access the added Objects 1000 times"
            1 to: 1000 do: [ :iteration |
                1 to: array size do: [ :index |
                    test at: (array at: index) ] ] ] timeToRun;
        tab;
        print: [ "This should be subtracted from the previous value"
            1 to: 1000 do: [ :iteration |
                1 to: array size do: [ :index |
                    array at: index ] ] ] timeToRun;
        cr;
        flush ]

The most important columns of the results printed to the Transcript are the
second (the number of elements in the dictionary) and the third (the time to
access the 100 most recently added elements 1000 times).

(I'm about to upload 2 more packages because the inlining trick
used in the implementation doesn't work with SmallIntegers (LargeIntegers
can be created while calculating the hash) and the enhancement for WeakSet
growing is missing)
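
[For reference, a quick illustration (assuming a 32-bit image) of the
SmallInteger boundary involved: results that exceed SmallInteger maxVal
become LargeIntegers, which is presumably what gets in the way of the
inlining mentioned above.

SmallInteger maxVal. "1073741823, i.e. (2 raisedTo: 30) - 1 on a 32-bit image"
(4095 * 262144) class. "SmallInteger - the largest scaled hash still fits"
(SmallInteger maxVal + 1) class. "LargePositiveInteger - one step beyond overflows"
]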

Levente



Re: Re: Hash changes

Levente Uzonyi-2
On Mon, 30 Nov 2009, Levente Uzonyi wrote:
>
> (I'm about to upload 2 more packages because the inlining trick used in the
> implementation doesn't work with SmallIntegers (LargeIntegers can be created
> while calculating the hash) and the enhancement for WeakSet growing is
> missing)

Two packages are not sufficient, so I'll update some of the existing mczs.

Levente


Re: Hash changes

Andreas.Raab
In reply to this post by Levente Uzonyi-2
Levente Uzonyi wrote:
> The following test is a modified version of Martin McClure's test
> (microbenchmark) the original can be found here:
> http://lists.gforge.inria.fr/pipermail/pharo-project/2009-October/015070.html 
>
> Run the test before and after the changes.

Since you have the changes loaded, why not just post your results? Also,
could you try making two modifications to the benchmark to see if this
has any impact:

> | test array |
> Transcript open; cr.
> test := Dictionary new.

"There used to be an old issue with growth of Sets that were initially
sized with goodSizeFor: 3. Try working around it by starting at 5."
test := Dictionary new: 5.

> [ test size >= 10000 ] whileFalse: [
>     Smalltalk garbageCollect.
>     array := Array new: 100 streamContents: [ :stream |
>         100 timesRepeat: [ stream nextPut: Object new ] ]. "Create 100
> new Objects"

"Having sequences of objects with strictly consecutive hashes is not
representative for real systems. Instead, create 4096 objects and choose
100 atRandom."

objSet := Array new: 4096 streamContents: [ :stream |
        4096 timesRepeat: [ stream nextPut: Object new ] ].
array := (1 to: 100) collect:[:i| objSet atRandom].

Cheers,
   - Andreas



Re: Re: Hash changes

Levente Uzonyi-2
On Mon, 30 Nov 2009, Andreas Raab wrote:

>> | test array |
>> Transcript open; cr.
>> test := Dictionary new.
>
> "There used to be an old issue with growth of Sets that were initially
> sized with goodSizeFor: 3. Try working around it by starting at 5."
> test := Dictionary new: 5.
>

This shouldn't be an issue in this case. The capacities of new
HashedCollections have already been modified to be carefully selected
primes. AFAIK only hashMultiply doesn't like 5 as a capacity, but it's not
involved here. Anyway, I added this to the new test.
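
[A quick way to check the capacity actually chosen, assuming
HashedCollection >> #capacity is available in the image:

(Dictionary new: 5) capacity "answers the allocated capacity (a prime, per the above)"
]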

> "Having sequences of objects with strictly consecutive hashes is not
> representative for real systems. Instead, create 4096 objects and choose 100
> atRandom."
>
> objSet := Array new: 4096 streamContents: [ :stream |
> 4096 timesRepeat: [ stream nextPut: Object new ] ].
> array := (1 to: 100) collect:[:i| objSet atRandom].
>

Good point, but the hashes are not consecutive (at least with my VM). Since
we need 10000 different Objects, I added this idea to the test with some
modifications: 10000 Objects are created up front, then shuffled and used as
needed.

The modified test:

| test array objects |
Transcript open; cr.
test := Dictionary new: 5.
objects := ((1 to: 10000) collect: [ :each | Object new ]) shuffled readStream.
[ test size >= 10000 ] whileFalse: [
    Smalltalk garbageCollect.
    array := Array new: 100 streamContents: [ :stream |
        100 timesRepeat: [ stream nextPut: objects next ] ]. "Take the next 100 Objects"
    Transcript
        print: [ "Add the 100 Objects to the Dictionary as keys with nil values"
            1 to: array size do: [ :index |
                test at: (array at: index) put: nil ] ] timeToRun;
        tab;
        print: test size;
        tab;
        print: [ "Access the added Objects 1000 times"
            1 to: 1000 do: [ :iteration |
                1 to: array size do: [ :index |
                    test at: (array at: index) ] ] ] timeToRun;
        tab;
        print: [ "This should be subtracted from the previous value"
            1 to: 1000 do: [ :iteration |
                1 to: array size do: [ :index |
                    array at: index ] ] ] timeToRun;
        cr;
        flush ]

The results:
Updated image:
0 100 93 4
0 200 144 4
0 300 125 4
0 400 105 3
0 500 111 4
0 600 150 4
0 700 115 3
-- snip --
0 9600 297 3
0 9700 296 4
0 9800 299 5
1 9900 312 4
0 10000 309 4

Values in the third column increase somewhat, but not significantly; there
are no spikes etc.

Current image:
0 100 76 3
1 200 119 4
0 300 105 4
0 400 80 4
0 500 112 4
0 600 111 3
1 700 98 4
-- snip --
1 3700 1279 4
2 3800 1998 4
3 3900 3196 4
5 4000 4473 4
11 4100 11301 4
19 4200 18823 4
36 4300 36391 3
93 4400 45294 4
47 4500 46674 3
...

Around 4000 elements things get really slow. There are also smaller spikes
near 2500. I didn't wait for 4600. As you can see in the first column,
#at:put: consumes more time with every iteration.


Levente


Re: Hash changes

Andreas.Raab
Thanks. Now I see what's happening. The problem is that adding objects
without a scaled hash to a "regular" Dictionary creates hugely
disproportionate collision chains as soon as we get north of 4096 objects.
At this point all the possible slots that any new object could start from
by default are already taken, and since we use a linear collision chain
they combine into one gigantonormeous (hope I spelled that correctly :-)
collision chain.

That explains the difference: for larger numbers of elements we end up
(with your changes) with an average collision chain length of (dict size
// 4096) for both adding and accessing, whereas without the scaling we end
up with a collision chain of length (dict size) for adding and (dict size
// 2) for accessing, once all the possible initial slots are filled up and
spill over.

(and we should probably add a comment explaining the issue)
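
[A quick way to see the 12-bit limit in action, illustration only; the exact
count depends on the VM:

| hashes |
hashes := Set new.
10000 timesRepeat: [ hashes add: Object new identityHash ].
hashes size "at most 4096 distinct values, so a dictionary holding more keys than that must chain"
]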

Cheers,
   - Andreas




Re: Re: Hash changes

Levente Uzonyi-2
In reply to this post by Levente Uzonyi-2
On Tue, 1 Dec 2009, Levente Uzonyi wrote:

> On Mon, 30 Nov 2009, Levente Uzonyi wrote:
>>
>> (I'm about to upload 2 more packages because the inlining trick used in the
>> implementation doesn't work with SmallIntegers (LargeIntegers can be
>> created while calculating the hash) and the enhancement for WeakSet growing
>> is missing)
>
> Two packages are not sufficent, so I'll update some of the exisiting mczs.
>

The new versions are in the inbox. The load order is:
Collections-ul.217 "Initialization"
Kernel-ul.312 "Object >> #hash"
Kernel-ul.313
Kernel-ul.314
Collections-ul.228 "IdentityDictionary"
Collections-ul.229
Collections-ul.230 "KeyedIdentitySet"
Collections-ul.231
Collections-ul.232 "WeakIdentityKeyDictionary"
Collections-ul.233
Collections-ul.234 "IdentitySet"
Collections-ul.235
System-ul.185 "SystemDictionary"
System-ul.186
System-ul.187
Collections-ul.236 "Cleanup"
Collections-ul.237

Every package requires a new mcm.
These changes also include the WeakSet growing tweak.

Please ignore the following versions from the inbox:
Kernel-ul.308
Kernel-ul.309
Kernel-ul.310
Collections-ul.218
Collections-ul.219
Collections-ul.220
Collections-ul.221
Collections-ul.222
Collections-ul.223
Collections-ul.224
Collections-ul.225
Collections-ul.226
Collections-ul.227


Levente


Re: Hash changes

Andreas.Raab
Levente Uzonyi wrote:
> The new versions are in the inbox. The load order is:

Thanks, I'll give it a shot. Folks, if you read this please abstain from
posting or updating while I'm in the process of integrating these
changes. I'll post an all clear message when I'm done. (plus: brace for
a flurry of commit messages and knock on wood that the server doesn't
die on the last commit ;-)

Cheers,
   - Andreas




All Clear (Re: Hash changes)

Andreas.Raab
Folks -

The updates went smoothly and I updated an image without problems. Looks
like we're good. Couple of comments:

- I was really impressed by the collection rehashing hack. I had not
seen this before.

- I'm actually thinking that we might want to keep the rehashing utility
with a comment on how to use it. It looks like a really useful tool for
doing this kind of stuff.

- Is there a standard way to deal with dog-slow tests (like searching
all the source code of Squeak, or running the decompiler on all methods)?
These few, crazily slow tests make it virtually impossible to "run all
tests" after loading each change if you want them to finish in a finite
amount of time. As a consequence, I've only run the collection tests.

Cheers,
   - Andreas


Re: All Clear (Re: Hash changes)

Levente Uzonyi-2
On Mon, 30 Nov 2009, Andreas Raab wrote:

> Folks -
>
> The updates went smoothly and I updated an image without problems. Looks like
> we're good. Couple of comments:
>
> - I was really impressed by the collection rehashing hack. I had not seen
> this before.
>
> - I'm actually thinking that we might want to keep the rehashing utility with
> a comment on how to use it. It looks like a really useful tool for doing this
> kind of stuff.
>

I wouldn't add it to the image, since it's rarely useful. But I plan to set
up a repository with various collections-related stuff, and it could be made
available there.

> - Is there a standard way to deal with dog-slow tests (like searching all the
> source code of Squeak, running the decompiler on all methods)? These few,
> crazily slow tests make it virtually impossible to "run all test" after
> loading each change if you want them to finish in a finite amount of time. As
> a consequence I've only run the collection tests.

I used to run all tests but Compiler-Tests and Traits-* first. When
everything seems to be ok, I run only Compiler-Tests and Traits-* in the
background. IIRC these tests are slow, because MultiByteFileStream is
really slow. Somehow we should improve it, but I couldn't find a simple
way to do that.


Levente



Re: Re: Hash changes

Nicolas Cellier
In reply to this post by Andreas.Raab
2009/12/1 Andreas Raab <[hidden email]>:

> Thanks. Now I see what's happening. The problem is that adding objects
> without scaled hash to a "regular" Dictionary creates hugely disproportional
> collision chain(s) as soon as we get north of 4096 objects. At this point
> all the possible entries that any new object could have by default are
> already taken and since we use a linear collision chain they combine into
> one gigantonormeous (hope I spelled that correctly :-) collision chain.
>
> That explains the difference - for larger number of elements we end up (with
> your changes) with an average length of the collision chain being (dict size
> // 4096) for both adding and accessing where without the scale we end up
> with a collision chain of (dict size) length (for adding) and (dict size //
> 2) for accessing elements once all possible initial slots are filled up and
> spill over.
>
> (and we should probably add a comment explaining the issue)
>
> Cheers,
>  - Andreas
>

Yes, that's what I tried to explain previously...
The change does not change the RATE of collisions, only the COST of collisions.
But I must just be bad at English ;)

Cheers

Nicolas



Re: Hash changes

Andreas.Raab
Nicolas Cellier wrote:
> Yes, that's what I tried to explain previously...
> The change does not change the RATE of collisions, only the COST of collisions.
> But I must be just bad in English ;)

Nope, my disconnect was in not realizing the devastating effect of filling
*all* the initial slots and creating a single collision chain as a result.
I thought you were saying something along the lines that creating objects
with strictly consecutive hashes, i.e., hash(n+1) = hash(n)+1, would lead
to some degenerate collision chain, which is why I asked to randomize the
input to see if the behavior was caused by that dependency.

Cheers,
   - Andreas




Re: All Clear (Re: Hash changes)

laza
In reply to this post by Andreas.Raab
It took almost 17(!) hours to run all the tests on my poor old machine.
But it seems that there are no new failed tests.

Alex

 result  |              class              |            selector
---------+---------------------------------+--------------------------------
 error   | DecompilerTestFailuresCollector | testDecompilerInClassesCAtoCM
 error   | DecompilerTestFailuresCollector | testDecompilerInClassesENtoEZ
 error   | DecompilerTestFailuresCollector | testDecompilerInClassesTNtoTZ
 error   | DecompilerTests                 | testDecompilerInClassesENtoEZ
 error   | MirrorPrimitiveTests            | testMirrorAt
 error   | MirrorPrimitiveTests            | testMirrorClass
 error   | MirrorPrimitiveTests            | testMirrorEqEq
 error   | MirrorPrimitiveTests            | testMirrorInstVarAt
 error   | MirrorPrimitiveTests            | testMirrorPerform
 error   | MirrorPrimitiveTests            | testMirrorSize
 error   | SMDependencyTest                | test2
 failure | BitBltTest                      | testAllAlphasRgbAdd
 failure | BitBltTest                      | testAllAlphasRgbMax
 failure | BitBltTest                      | testAllAlphasRgbMin
 failure | BitBltTest                      | testAllAlphasRgbMinInvert
 failure | BitBltTest                      | testAllAlphasRgbMul
 failure | BitBltTest                      | testAllAlphasRgbSub
 failure | ClosureCompilerTest             | testDebuggerTempAccess
 failure | ClosureCompilerTest             | testInjectIntoDecompilations
 failure | ClosureCompilerTest             | testInjectIntoDecompiledDebugs
 failure | DebuggerUnwindBug               | testUnwindDebuggerWithStep
 failure | DecompilerTests                 | testDecompilerInClassesAAtoAM
 failure | DecompilerTests                 | testDecompilerInClassesBAtoBM
 failure | DecompilerTests                 | testDecompilerInClassesCAtoCM
 failure | DecompilerTests                 | testDecompilerInClassesCNtoCZ
 failure | DecompilerTests                 | testDecompilerInClassesDAtoDM
 failure | DecompilerTests                 | testDecompilerInClassesFAtoFM
 failure | DecompilerTests                 | testDecompilerInClassesFNtoFZ
 failure | DecompilerTests                 | testDecompilerInClassesGNtoGZ
 failure | DecompilerTests                 | testDecompilerInClassesHNtoHZ
 failure | DecompilerTests                 | testDecompilerInClassesINtoIZ
 failure | DecompilerTests                 | testDecompilerInClassesJNtoJZ
 failure | DecompilerTests                 | testDecompilerInClassesKAtoKM
 failure | DecompilerTests                 | testDecompilerInClassesLNtoLZ
 failure | DecompilerTests                 | testDecompilerInClassesMAtoMM
 failure | DecompilerTests                 | testDecompilerInClassesMNtoMZ
 failure | DecompilerTests                 | testDecompilerInClassesNAtoNM
 failure | DecompilerTests                 | testDecompilerInClassesOAtoOM
 failure | DecompilerTests                 | testDecompilerInClassesPAtoPM
 failure | DecompilerTests                 | testDecompilerInClassesPNtoPZ
 failure | DecompilerTests                 | testDecompilerInClassesRAtoRM
 failure | DecompilerTests                 | testDecompilerInClassesSAtoSM
 failure | DecompilerTests                 | testDecompilerInClassesSNtoSZ
 failure | DecompilerTests                 | testDecompilerInClassesTAtoTM
 failure | DecompilerTests                 | testDecompilerInClassesTNtoTZ
 failure | DecompilerTests                 | testDecompilerInClassesWAtoWM
 failure | LocaleTest                      | testIsFontAvailable
 failure | MCChangeNotificationTest        | testCoreMethodModified
 failure | MorphicToolBuilderTests         | testGetButtonSideEffectFree
 failure | ScannerTest                     | testLiteralSymbols
 failure | TextMorphTest                   | testInitialize
 failure | TraitFileOutTest                | testFileOutCategory
 failure | TraitTest                       | testTraitMethodClass
 failure | TraitTest                       | testTraitMethodSelector
(54 rows)




Re: All Clear (Re: Hash changes)

Andres Valloud-4
In reply to this post by Andreas.Raab
Andreas et al,

Andreas Raab wrote:
> Folks -
>
> The updates went smoothly and I updated an image without problems. Looks
> like we're good. Couple of comments:
>
> - I was really impressed by the collection rehashing hack. I had not
> seen this before.

Thanks :).

Andres.