Hi all,

I have a problem with Dictionaries:

dic := Dictionary new: 10000.

1 to: 10000 do: [ :i |
    dic at: i put: i printString ].

1 to: 10000 do: [ :i |
    dic removeKey: i ]

Removing each key is very slooow; the execution time difference between add and remove is crazy.
Any idea how to fix that? Can I use another Dictionary implementation or settings?

Thanks.

Pierre
Pierre

It’s all to do with rehashing. A dictionary is stored as an Array of Associations. After a key and its corresponding association are removed, the remaining entries are rehashed from the removal point to the end of the array. By doing the removals in natural order, you rehash the whole of the remaining array after each removal. Try doing the removals in reverse order, as in the sketch below; on my machine it reduces the time from 6 seconds to 6 milliseconds.

HTH

Peter Kenny
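A minimal sketch of the reverse-order removal (using the same Time microsecondsToRun: helper as the timings later in this thread; exact figures will of course vary by machine):

dic := Dictionary new: 10000.
1 to: 10000 do: [ :i | dic at: i put: i printString ].

"Removing in reverse order means each removal no longer rehashes
 the whole remaining run of entries after it."
Time microsecondsToRun: [
    10000 to: 1 by: -1 do: [ :i | dic removeKey: i ] ].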
Thanks Peter. Here is some information about execution times:

dic := Dictionary new: 10000.

Time microsecondsToRun: [
    1 to: 10000 do: [ :i |
        dic at: i put: i printString ] ].
"6000 µs"

Time microsecondsToRun: [
    1 to: 10000 do: [ :i |
        dic removeKey: i ] ].
"2683000 µs, about 447x slower"

Regarding your comment, the example I wrote was too simple. Here is a new test that better represents my needs and data (randomized order):

dic := Dictionary new: 10000.
pool := SortedCollection new: 10000.
random := Random new.
pool sort: [ :a :b | random next < 0.5 ].
1 to: 10000 do: [ :i | pool add: i ].

Time microsecondsToRun: [
    pool do: [ :i |
        dic at: i put: i printString ] ].
"6000 µs => same time, good point"

Time microsecondsToRun: [
    pool do: [ :i |
        dic removeKey: i ] ].
"17000 µs, about 2.8x slower => huge change!"

That appears much more reasonable!

I need to test it with real data; to be continued.

Thanks for all.

Pierre
This is actually a very good issue you have raised. As far as I am able to read the code for Dictionary and HashedCollection, the implementation is not well suited for deletions at all.

First, as you have discovered, the remove has an implementation which is unsound performance-wise.

Second, HashedCollections grow when you add to them, but they do not shrink when you remove from them. That is, in scenarios where your dictionaries grow huge and then later shrink, there is a potential “memory leak” in that the underlying array is kept at its largest size.

Btw, thanks for the shuffle function using sort. I do believe the built-in shuffle in SequenceableCollection is faster though (O(n) vs. O(n·log n)).

It seems from the Wikipedia page on hash tables that there are algorithms with efficient removal also for the “open addressing with linear probing” implementation used in HashedCollection. But it only seems to work if the hash table also implements a “shrink” function, which is where the re-hashing is done.

Interesting stuff. It would be a good programming exercise to get that to work.

— Kasper
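The no-shrink behaviour is easy to see in a workspace. A small sketch, assuming Pharo's built-in SequenceableCollection>>shuffled and HashedCollection>>capacity (the exact capacity values depend on the growth policy):

"Shuffle with the built-in Fisher-Yates shuffle instead of sorting with a random block."
pool := (1 to: 10000) asOrderedCollection shuffled.

dic := Dictionary new: 10000.
pool do: [ :i | dic at: i put: i printString ].
dic capacity.    "sized to hold 10000 entries"

pool do: [ :i | dic removeKey: i ].
dic size.        "0 - every entry removed"
dic capacity.    "still the large value: the underlying array never shrinks"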
It's a problem with the traditional Smalltalk-80 implementation of Dictionaries.
Using my compiler and library:

    1311 usec to build, 498 usec to delete (1..10000)
    1318 usec to build, 472 usec to delete (10000..1)

(The time difference between building and deleting is mostly #printString.)

There is no defect in the Pharo compiler here; it's the hash table design, which is basically flawed if you want to delete entries. My library uses separate chaining (https://en.wikipedia.org/wiki/Hash_table#Separate_chaining), which makes deletion simple and fast and allows 'null' keys. Pharo uses open addressing, which makes deletion much harder. It doesn't seem to improve retrieval performance either.
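Roughly, removal under separate chaining only ever touches one bucket. A sketch of the idea (the names buckets, tally and errorKeyNotFound: are made up for illustration; this is not Richard's actual library code):

removeKey: aKey
    "Each slot of 'buckets' holds nil or a small collection of associations
     whose keys all hash to that slot. Removal walks that one bucket only;
     no other entry is rehashed or moved."
    | index bucket |
    index := (aKey hash \\ buckets size) + 1.
    bucket := buckets at: index.
    bucket ifNil: [ ^ self errorKeyNotFound: aKey ].
    bucket
        detect: [ :assoc | assoc key = aKey ]
        ifFound: [ :assoc |
            bucket remove: assoc.
            tally := tally - 1.
            ^ assoc value ]
        ifNone: [ ^ self errorKeyNotFound: aKey ]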
Hi Richard,

On 9 February 2020 at 17.54.30, Richard O'Keefe ([hidden email]) wrote:

> My library uses separate chaining (https://en.wikipedia.org/wiki/Hash_table#Separate_chaining), which makes deletion simple and fast and allows 'null' keys.
Does your implementation use a fixed number of buckets, or do they grow (and shrink)?

I had been wondering about doing open addressing with items marked as deleted, to see how it would perform. Aka "Another technique for removal is simply to mark the slot as deleted. However, this eventually requires rebuilding the table simply to remove deleted records." from https://en.wikipedia.org/wiki/Open_addressing (roughly the idea in the sketch below).

Best,

Kasper
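A rough sketch of that mark-as-deleted (tombstone) technique, purely for illustration; the names array, scanFor:, Tombstone, tombstoneCount and rebuildWithoutTombstones are assumptions, not Pharo's actual HashedCollection code:

removeKey: aKey
    "Lazy deletion in an open-addressing table: instead of rehashing the
     entries that follow, overwrite the slot with a shared tombstone marker.
     Lookups treat a tombstone as 'occupied but not a match' and keep probing;
     when too many tombstones accumulate, the whole table is rebuilt."
    | index removed |
    index := self scanFor: aKey.    "probe for the slot holding aKey"
    removed := array at: index.
    removed ifNil: [ ^ self errorKeyNotFound: aKey ].
    array at: index put: Tombstone.    "Tombstone: an assumed unique marker object"
    tally := tally - 1.
    tombstoneCount := tombstoneCount + 1.
    tombstoneCount > (array size // 4)
        ifTrue: [ self rebuildWithoutTombstones ].
    ^ removed value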
My Dictionary and Set implementations (neither of which inherits from the other)
grow but do not shrink. It's easy enough to do; I just have not bothered, because I have a significant redesign that improves space and time and still needs finishing.

I've used the marking-deleted-entries technique in the past. I'd prefer not to do it again.