Hi,
I uploaded the second part of Andrés' hash changes to the inbox. Every package requires a new mcm. The load order is:

Collections-ul.217 "Initialization"
Kernel-ul.308 "Object >> #hash"
Kernel-ul.309
Kernel-ul.310
Collections-ul.218 "IdentityDictionary"
Collections-ul.219
Collections-ul.220 "KeyedIdentitySet"
Collections-ul.221
Collections-ul.222 "WeakIdentityKeyDictionary"
Collections-ul.223
Collections-ul.224 "IdentitySet"
Collections-ul.225
System-ul.185 "SystemDictionary"
System-ul.186
System-ul.187
Collections-ul.226 "Cleanup"
Collections-ul.227

These packages modify the following:
- Object >> #hash returns an integer between 0 and 1073479680 instead of 0 and 4096, though the range is not continuous
- identity-based collections use the same range extension method as Object >> #hash
- SystemDictionary uses #hash instead of #identityHash
- all preambles and postscripts are removed from the Collections, Kernel and System packages

Cheers,
Levente
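For context, the quoted range is consistent with taking the 12-bit identityHash (0..4095) and shifting it left by 18 bits, since 4095 * 262144 = 1073479680 and the result still fits in a positive SmallInteger. A minimal sketch of that kind of range extension — an illustration only, not necessarily the exact code in Kernel-ul.308 — could look like this:

"Hypothetical sketch; the real Kernel-ul.308 method may differ in detail.
 The number of distinct hash values stays at 4096, but the values are
 spread far apart instead of being consecutive small integers."
Object >> hash
	"Answer the identityHash scaled into a wide, non-consecutive range:
	 4095 bitShift: 18 = 1073479680, matching the range described above."
	^self identityHash bitShift: 18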
I think that this is important work. Most of the background discussion
took place on the Pharo list, so some may not have seen it. I personally am not qualified to do anything with this, but hopefully others can take a look at it and incorporate it in trunk if appropriate.

Dave

On Mon, Nov 30, 2009 at 06:26:21AM +0100, Levente Uzonyi wrote:
> I uploaded the second part of Andrés' hash changes to the inbox. Every
> package requires a new mcm.
> -- snip --
I'll be happy to do this but can someone summarize the expected effects
of those changes? The main thing I don't understand is why it is advantageous to simply scale the id-hash given that it doesn't actually increase the range of hashes, i.e., id-hash is still limited to 12 bits, so no matter how much you multiply these values the result will still only be 4096 distinct values. Why is it advantageous to make these numbers artificially larger without increasing the number of hash bits?

Cheers,
  - Andreas

David T. Lewis wrote:
> I think that this is important work. Most of the background discussion
> took place on the Pharo list, so some may not have seen it.
> -- snip --
From what I understand:
It does not reduce the rate of hash collisions. It just reduces the cost of hash collisions by avoiding consecutive hashes. Consecutive hashes are evil because, in case of a collision, #scanFor: tends to have linear cost (it scans all consecutive slots until an empty one is found).

Oh, but you participated in this discussion, didn't you?
http://bugs.squeak.org/view.php?id=1876
(via http://code.google.com/p/pharo/issues/detail?id=213 )

Nicolas

2009/11/30 Andreas Raab <[hidden email]>:
> I'll be happy to do this but can someone summarize the expected effects of
> those changes?
> -- snip --
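To make the cost argument concrete: hashed collections resolve collisions by linear probing, so a lookup walks consecutive slots from the initial index until it finds the element or an empty slot. The sketch below is illustrative only (hypothetical selector and simplified logic, not the actual HashedCollection >> #scanFor: source); with only 4096 distinct, consecutive hash values, every initial slot eventually gets taken and the probe runs merge into one long chain.

"Illustrative sketch of linear probing; names and details are hypothetical."
scanSketchFor: anObject in: array
	| start index element |
	start := anObject hash \\ array size + 1.
	index := start.
	[ element := array at: index.
	  element == nil ifTrue: [ ^index ].      "empty slot: not present; this is the insert point"
	  element = anObject ifTrue: [ ^index ].  "found it"
	  index := index \\ array size + 1.       "probe the next consecutive slot, wrapping around"
	  index = start ] whileFalse.
	^0                                        "wrapped all the way around: the table is full"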
In reply to this post by Andreas.Raab
On Mon, 30 Nov 2009, Andreas Raab wrote:
> I'll be happy to do this but can someone summarize the expected effects of
> those changes? The main thing I don't understand is why it is advantageous to
> simply scale the id-hash given that it doesn't actually increase the range of
> hashes, i.e., id-hash is still limited to 12 bits so no matter how much you
> multiply these values the result will still only be 4096 distinct values. Why
> is it advantageous to make these numbers artificially larger without
> increasing the number of hash bits?

The following test is a modified version of Martin McClure's test (microbenchmark); the original can be found here:
http://lists.gforge.inria.fr/pipermail/pharo-project/2009-October/015070.html

Run the test before and after the changes.

| test array |
Transcript open; cr.
test := Dictionary new.
[ test size >= 10000 ] whileFalse: [
	Smalltalk garbageCollect.
	array := Array new: 100 streamContents: [ :stream |
		100 timesRepeat: [ stream nextPut: Object new ] ]. "Create 100 new Objects"
	Transcript
		print: [ "Add the 100 new Objects to the Dictionary as keys with nil values"
			1 to: array size do: [ :index |
				test at: (array at: index) put: nil ] ] timeToRun;
		tab;
		print: test size;
		tab;
		print: [ "Access the added Objects 1000 times"
			1 to: 1000 do: [ :iteration |
				1 to: array size do: [ :index |
					test at: (array at: index) ] ] ] timeToRun;
		tab;
		print: [ "This should be subtracted from the previous value"
			1 to: 1000 do: [ :iteration |
				1 to: array size do: [ :index |
					array at: index ] ] ] timeToRun;
		cr;
		flush ]

The most important columns of the result printed to the Transcript are the second (number of elements in the dictionary) and the third (time to access the 100 last added elements 1000 times).

(I'm about to upload 2 more packages because the inlining trick used in the implementation doesn't work with SmallIntegers (LargeIntegers can be created while calculating the hash) and the enhancement for WeakSet growing is missing.)

Levente

> Cheers,
> - Andreas
> -- snip --
On Mon, 30 Nov 2009, Levente Uzonyi wrote:
> (I'm about to upload 2 more packages because the inlining trick used in the
> implementation doesn't work with SmallIntegers (LargeIntegers can be created
> while calculating the hash) and the enhancement for WeakSet growing is
> missing)

Two packages are not sufficient, so I'll update some of the existing mczs.

Levente
In reply to this post by Levente Uzonyi-2
Levente Uzonyi wrote:
> The following test is a modified version of Martin McClure's test
> (microbenchmark); the original can be found here:
> http://lists.gforge.inria.fr/pipermail/pharo-project/2009-October/015070.html
>
> Run the test before and after the changes.

Since you have the changes loaded, why not just post your results?

Also, could you try making two modifications to the benchmark to see if this has any impact:

> | test array |
> Transcript open; cr.
> test := Dictionary new.

"There used to be an old issue with growth of Sets that were initially
sized with goodSizeFor: 3. Try working around it by starting at 5."
test := Dictionary new: 5.

> [ test size >= 10000 ] whileFalse: [
>	Smalltalk garbageCollect.
>	array := Array new: 100 streamContents: [ :stream |
>		100 timesRepeat: [ stream nextPut: Object new ] ]. "Create 100 new Objects"

"Having sequences of objects with strictly consecutive hashes is not
representative for real systems. Instead, create 4096 objects and choose
100 atRandom."
objSet := Array new: 4096 streamContents: [ :stream |
	4096 timesRepeat: [ stream nextPut: Object new ] ].
array := (1 to: 100) collect: [ :i | objSet atRandom ].

Cheers,
  - Andreas

> Transcript
> -- snip --
On Mon, 30 Nov 2009, Andreas Raab wrote:
>> | test array |
>> Transcript open; cr.
>> test := Dictionary new.
>
> "There used to be an old issue with growth of Sets that were initially
> sized with goodSizeFor: 3. Try working around it by starting at 5."
> test := Dictionary new: 5.

This shouldn't be an issue in this case. The capacities of new HashedCollections are already modified to be carefully selected primes. AFAIK only hashMultiply doesn't like 5 as a capacity, but it's not involved here. Anyway, I added this to the new test.

> "Having sequences of objects with strictly consecutive hashes is not
> representative for real systems. Instead, create 4096 objects and choose
> 100 atRandom."
>
> objSet := Array new: 4096 streamContents: [ :stream |
> 	4096 timesRepeat: [ stream nextPut: Object new ] ].
> array := (1 to: 100) collect: [ :i | objSet atRandom ].

Good point, but the hashes are not consecutive (at least with my vm). Since we need 10000 different Objects, I added this idea to the test with some modifications: 10000 Objects are created, then shuffled and used when needed.

The modified test:

| test array objects |
Transcript open; cr.
test := Dictionary new: 5.
objects := ((1 to: 10000) collect: [ :each | Object new ]) shuffled readStream.
[ test size >= 10000 ] whileFalse: [
	Smalltalk garbageCollect.
	array := Array new: 100 streamContents: [ :stream |
		100 timesRepeat: [ stream nextPut: objects next ] ]. "Take the next 100 Objects"
	Transcript
		print: [ "Add the 100 new Objects to the Dictionary as keys with nil values"
			1 to: array size do: [ :index |
				test at: (array at: index) put: nil ] ] timeToRun;
		tab;
		print: test size;
		tab;
		print: [ "Access the added Objects 1000 times"
			1 to: 1000 do: [ :iteration |
				1 to: array size do: [ :index |
					test at: (array at: index) ] ] ] timeToRun;
		tab;
		print: [ "This should be subtracted from the previous value"
			1 to: 1000 do: [ :iteration |
				1 to: array size do: [ :index |
					array at: index ] ] ] timeToRun;
		cr;
		flush ]

The results:

Updated image:
0	100	93	4
0	200	144	4
0	300	125	4
0	400	105	3
0	500	111	4
0	600	150	4
0	700	115	3
-- snip --
0	9600	297	3
0	9700	296	4
0	9800	299	5
1	9900	312	4
0	10000	309	4

Values in the third column are increasing somewhat, but not significantly; no spikes, etc.

Current image:
0	100	76	3
1	200	119	4
0	300	105	4
0	400	80	4
0	500	112	4
0	600	111	3
1	700	98	4
-- snip --
1	3700	1279	4
2	3800	1998	4
3	3900	3196	4
5	4000	4473	4
11	4100	11301	4
19	4200	18823	4
36	4300	36391	3
93	4400	45294	4
47	4500	46674	3
...

Around 4000 things get really slow. There are also smaller spikes near 2500. I didn't wait for 4600. As you can see in the first column, #at:put: is consuming more time every iteration.

Levente
Thanks. Now I see what's happening. The problem is that adding objects
without scaled hashes to a "regular" Dictionary creates hugely disproportional collision chain(s) as soon as we get north of 4096 objects. At this point all the possible entries that any new object could have by default are already taken, and since we use a linear collision chain they combine into one gigantonormeous (hope I spelled that correctly :-) collision chain.

That explains the difference: for larger numbers of elements we end up (with your changes) with an average collision chain length of (dict size // 4096) for both adding and accessing, whereas without the scaling we end up with a collision chain of (dict size) length for adding and (dict size // 2) for accessing elements, once all possible initial slots are filled up and spill over.

(and we should probably add a comment explaining the issue)

Cheers,
  - Andreas

Levente Uzonyi wrote:
> -- snip --
> Around 4000 things get really slow. There are also smaller spikes near
> 2500. I didn't wait for 4600. As you can see in the first column,
> #at:put: is consuming more time every iteration.
>
> Levente
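A quick sanity check of those chain lengths (my arithmetic, following the reasoning above, and assuming elements spread evenly over the 4096 possible identityHash values): a 10000-element Dictionary with scaled hashes still has only 4096 distinct hash values, so an average chain holds about 10000 // 4096 entries, while the unscaled case degenerates into a single chain and a lookup scans roughly half of it.

"Illustrative arithmetic only."
| size |
size := 10000.
Transcript
	print: size // 4096;	"average probes per lookup with scaled hashes (~2)"
	tab;
	print: size // 2;	"probes per lookup once everything collapses into one chain (~5000)"
	cr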
In reply to this post by Levente Uzonyi-2
On Tue, 1 Dec 2009, Levente Uzonyi wrote:
> On Mon, 30 Nov 2009, Levente Uzonyi wrote:
>> (I'm about to upload 2 more packages because the inlining trick used in the
>> implementation doesn't work with SmallIntegers (LargeIntegers can be
>> created while calculating the hash) and the enhancement for WeakSet growing
>> is missing)
>
> Two packages are not sufficient, so I'll update some of the existing mczs.

The new versions are in the inbox. The load order is:

Collections-ul.217 "Initialization"
Kernel-ul.312 "Object >> #hash"
Kernel-ul.313
Kernel-ul.314
Collections-ul.228 "IdentityDictionary"
Collections-ul.229
Collections-ul.230 "KeyedIdentitySet"
Collections-ul.231
Collections-ul.232 "WeakIdentityKeyDictionary"
Collections-ul.233
Collections-ul.234 "IdentitySet"
Collections-ul.235
System-ul.185 "SystemDictionary"
System-ul.186
System-ul.187
Collections-ul.236 "Cleanup"
Collections-ul.237

Every package requires a new mcm.
These changes also include the WeakSet growing tweak.

Please ignore the following versions from the inbox:
Kernel-ul.308
Kernel-ul.309
Kernel-ul.310
Collections-ul.218
Collections-ul.219
Collections-ul.220
Collections-ul.221
Collections-ul.222
Collections-ul.223
Collections-ul.224
Collections-ul.225
Collections-ul.226
Collections-ul.227

Levente
Levente Uzonyi wrote:
> The new versions are in the inbox. The load order is:
> -- snip --

Thanks, I'll give it a shot. Folks, if you read this please abstain from posting or updating while I'm in the process of integrating these changes. I'll post an all-clear message when I'm done.

(plus: brace for a flurry of commit messages, and knock on wood that the server doesn't die on the last commit ;-)

Cheers,
  - Andreas
Folks -
The updates went smoothly and I updated an image without problems. Looks like we're good. Couple of comments:

- I was really impressed by the collection rehashing hack. I had not seen this before.

- I'm actually thinking that we might want to keep the rehashing utility, with a comment on how to use it. It looks like a really useful tool for doing this kind of stuff.

- Is there a standard way to deal with dog-slow tests (like searching all the source code of Squeak, running the decompiler on all methods)? These few, crazily slow tests make it virtually impossible to "run all tests" after loading each change if you want them to finish in a finite amount of time. As a consequence I've only run the collection tests.

Cheers,
  - Andreas
On Mon, 30 Nov 2009, Andreas Raab wrote:
> Folks -
>
> The updates went smoothly and I updated an image without problems. Looks like
> we're good. Couple of comments:
>
> - I was really impressed by the collection rehashing hack. I had not seen
> this before.
>
> - I'm actually thinking that we might want to keep the rehashing utility with
> a comment on how to use it. It looks like a really useful tool for doing this
> kind of stuff.

I wouldn't add it to the image, since it's rarely useful. But I have a plan to set up a repository with various collections-related stuff, and it could be available there.

> - Is there a standard way to deal with dog-slow tests (like searching all the
> source code of Squeak, running the decompiler on all methods)? These few,
> crazily slow tests make it virtually impossible to "run all tests" after
> loading each change if you want them to finish in a finite amount of time. As
> a consequence I've only run the collection tests.

I used to run all tests but Compiler-Tests and Traits-* first. When everything seems to be ok, I run only Compiler-Tests and Traits-* in the background. IIRC these tests are slow because MultiByteFileStream is really slow. Somehow we should improve it, but I couldn't find a simple way to do that.

Levente

> Cheers,
> - Andreas
In reply to this post by Andreas.Raab
2009/12/1 Andreas Raab <[hidden email]>:
> Thanks. Now I see what's happening. The problem is that adding objects
> without scaled hashes to a "regular" Dictionary creates hugely disproportional
> collision chain(s) as soon as we get north of 4096 objects.
> -- snip --
> (and we should probably add a comment explaining the issue)

Yes, that's what I tried to explain previously... The change does not change the RATE of collisions, only the COST of collisions. But I must be just bad in English ;)

Cheers

Nicolas
Nicolas Cellier wrote:
> Yes, that's what I tried to explain previously...
> The change does not change the RATE of collisions, only the COST of collisions.
> But I must be just bad in English ;)

Nope, my disconnect was in not realizing the devastating effect of filling *all* the initial slots and creating a single collision chain as the result. I thought you were saying something along the lines that creating objects with strictly consecutive hashes, i.e., hash(n+1) = hash(n)+1, would lead to the creation of some degenerate collision chain, which is why I was asking to randomize the input to see if the behavior was caused by that dependency.

Cheers,
  - Andreas
In reply to this post by Andreas.Raab
It took almost 17(!) hours to run all tests on my poor old machine.
But it seems that there are no new failed tests.

Alex

 result  | class                           | selector
---------+---------------------------------+--------------------------------
 error   | DecompilerTestFailuresCollector | testDecompilerInClassesCAtoCM
 error   | DecompilerTestFailuresCollector | testDecompilerInClassesENtoEZ
 error   | DecompilerTestFailuresCollector | testDecompilerInClassesTNtoTZ
 error   | DecompilerTests                 | testDecompilerInClassesENtoEZ
 error   | MirrorPrimitiveTests            | testMirrorAt
 error   | MirrorPrimitiveTests            | testMirrorClass
 error   | MirrorPrimitiveTests            | testMirrorEqEq
 error   | MirrorPrimitiveTests            | testMirrorInstVarAt
 error   | MirrorPrimitiveTests            | testMirrorPerform
 error   | MirrorPrimitiveTests            | testMirrorSize
 error   | SMDependencyTest                | test2
 failure | BitBltTest                      | testAllAlphasRgbAdd
 failure | BitBltTest                      | testAllAlphasRgbMax
 failure | BitBltTest                      | testAllAlphasRgbMin
 failure | BitBltTest                      | testAllAlphasRgbMinInvert
 failure | BitBltTest                      | testAllAlphasRgbMul
 failure | BitBltTest                      | testAllAlphasRgbSub
 failure | ClosureCompilerTest             | testDebuggerTempAccess
 failure | ClosureCompilerTest             | testInjectIntoDecompilations
 failure | ClosureCompilerTest             | testInjectIntoDecompiledDebugs
 failure | DebuggerUnwindBug               | testUnwindDebuggerWithStep
 failure | DecompilerTests                 | testDecompilerInClassesAAtoAM
 failure | DecompilerTests                 | testDecompilerInClassesBAtoBM
 failure | DecompilerTests                 | testDecompilerInClassesCAtoCM
 failure | DecompilerTests                 | testDecompilerInClassesCNtoCZ
 failure | DecompilerTests                 | testDecompilerInClassesDAtoDM
 failure | DecompilerTests                 | testDecompilerInClassesFAtoFM
 failure | DecompilerTests                 | testDecompilerInClassesFNtoFZ
 failure | DecompilerTests                 | testDecompilerInClassesGNtoGZ
 failure | DecompilerTests                 | testDecompilerInClassesHNtoHZ
 failure | DecompilerTests                 | testDecompilerInClassesINtoIZ
 failure | DecompilerTests                 | testDecompilerInClassesJNtoJZ
 failure | DecompilerTests                 | testDecompilerInClassesKAtoKM
 failure | DecompilerTests                 | testDecompilerInClassesLNtoLZ
 failure | DecompilerTests                 | testDecompilerInClassesMAtoMM
 failure | DecompilerTests                 | testDecompilerInClassesMNtoMZ
 failure | DecompilerTests                 | testDecompilerInClassesNAtoNM
 failure | DecompilerTests                 | testDecompilerInClassesOAtoOM
 failure | DecompilerTests                 | testDecompilerInClassesPAtoPM
 failure | DecompilerTests                 | testDecompilerInClassesPNtoPZ
 failure | DecompilerTests                 | testDecompilerInClassesRAtoRM
 failure | DecompilerTests                 | testDecompilerInClassesSAtoSM
 failure | DecompilerTests                 | testDecompilerInClassesSNtoSZ
 failure | DecompilerTests                 | testDecompilerInClassesTAtoTM
 failure | DecompilerTests                 | testDecompilerInClassesTNtoTZ
 failure | DecompilerTests                 | testDecompilerInClassesWAtoWM
 failure | LocaleTest                      | testIsFontAvailable
 failure | MCChangeNotificationTest        | testCoreMethodModified
 failure | MorphicToolBuilderTests         | testGetButtonSideEffectFree
 failure | ScannerTest                     | testLiteralSymbols
 failure | TextMorphTest                   | testInitialize
 failure | TraitFileOutTest                | testFileOutCategory
 failure | TraitTest                       | testTraitMethodClass
 failure | TraitTest                       | testTraitMethodSelector
(54 rows)

On Tue, Dec 1, 2009 at 6:34 AM, Andreas Raab <[hidden email]> wrote:
> Folks -
>
> The updates went smoothly and I updated an image without problems. Looks
> like we're good.
> -- snip --
In reply to this post by Andreas.Raab
Andreas et al,
Andreas Raab wrote:
> Folks -
>
> The updates went smoothly and I updated an image without problems. Looks
> like we're good. Couple of comments:
>
> - I was really impressed by the collection rehashing hack. I had not
> seen this before.

Thanks :).

Andres.