I've recently been playing with measuring the effectiveness of Sets and
similar hashed collections. I've added a bunch of methods to the various
hashed collection classes, like:

=========
PluggableLookupTable>>bestSlotFor: anObject
	"Answer the integer index of the slot where anObject would best belong
	if this collection were currently empty."

	#CUadded.
	^ searchPolicy hash: anObject max: self basicSize.
=========

And so on for all the other classes (it's actually a little bit more
complicated than that, but that's the general idea). With that in place, I
can define two very useful methods:

=========
Set>>collisions
	"Answer an Integer count of the elements which are not in their natural
	positions (i.e. not in the place they hash to)."

	| count |
	#CUadded.
	count := 0.
	1 to: self basicSize do:
		[:i || element |
		element := self basicAt: i.
		element notNil ifTrue:
			[(self bestSlotFor: element) ~= i ifTrue: [count := count + 1]]].
	^ count.
=========

and:

=========
Set>>averageProbesPerElement
	"Answer the average number of probes of the table necessary to find an
	element in the receiver. In the ideal case this is 1."

	| probes |
	#CUadded.
	self isEmpty ifTrue: [^ 1].
	probes := 0.
	1 to: self basicSize do:
		[:i || element pos distance |
		element := self basicAt: i.
		element notNil ifTrue:
			[pos := self bestSlotFor: element.
			distance := i - pos.
			distance < 0 ifTrue: [distance := distance + self basicSize].
			probes := probes + distance + 1]].
	^ probes / self size.
=========

And I've also added a #loadFactor method to Set.

The suggestion is that these would make a useful addition to the standard
image. I'm happy to post the code if anyone wants it (or email it, which
would avoid formatting problems). A further suggestion is to refactor the
various #findElementOrNil: implementations to make use of #bestSlotFor: --
that would eliminate quite a lot of duplication of code (and make sure that
the results of #bestSlotFor: were always correct ;-).

The bug: I noticed that some Collections in the image had absurdly high
#averageProbesPerElement values (over 20 in one case). Investigating showed
that the worst offenders were the PluggableSets held by Packages in their
methodNames collection. It turns out that methodNames is updated by code
like:

	methodNames := methodNames select: [...].
	methodNames := methodNames collect: [...].

both of which answer a new PluggableSet with the *default* search policy,
instead of the custom policy used by Package. I *suspect* that the
underlying bug is that Collection>>select: and Collection>>collect: both
start by doing something like:

	newCollection := self species new.

rather than:

	newCollection := self copyEmpty.

but I'm not too sure about the implications of the change.

    -- chris
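A quick workspace check of these statistics might look like the following
sketch (it assumes the methods above and #loadFactor are loaded; everything
else is standard protocol):

=========
"Build a Set of 1000 SmallIntegers and print the hashing statistics.
The exact figures depend on the image; in the ideal case the average
probe count is 1 and the collision count is 0."
| s |
s := (1 to: 1000) asSet.
Transcript
	show: 'load factor: ', s loadFactor printString; cr;
	show: 'collisions: ', s collisions printString; cr;
	show: 'average probes: ', s averageProbesPerElement printString; cr
=========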
Chris
You wrote in message news:[hidden email]...

> I've recently been playing with measuring the effectiveness of Sets and
> similar hashed collections.
>
> I've added a bunch of methods to the various hashed collection classes,
> like:
> ...[various statistical methods snipped]...
> The suggestion is that these would make a useful addition to the standard
> image. I'm happy to post the code if anyone wants it (or email it, which
> would avoid formatting problems). A further suggestion is to refactor the
> various #findElementOrNil: implementations to make use of #bestSlotFor: --
> that would eliminate quite a lot of duplication of code (and make sure that
> the results of #bestSlotFor: were always correct ;-).

Please e-mail them to me if you are happy for them to be included in the
image; they would certainly be useful for testing the effectiveness of one's
#hash implementations.

> The bug: I noticed that some Collections in the image had absurdly high
> #averageProbesPerElement values (over 20 in one case). Investigating showed
> that the worst offenders were the PluggableSets held by Packages in their
> methodNames collection. It turns out that methodNames is updated by code
> like:
>
> 	methodNames := methodNames select: [...].
> 	methodNames := methodNames collect: [...].
>
> both of which answer a new PluggableSet with the *default* search policy,
> instead of the custom policy used by Package. I *suspect* that the
> underlying bug is that Collection>>select: and Collection>>collect: both
> start by doing something like:
>
> 	newCollection := self species new.
>
> rather than:
>
> 	newCollection := self copyEmpty.
>
> but I'm not too sure about the implications of the change.

Thanks for the bug report (no. 417) -- that is rather worrying!

Regards

Blair
Chris
You wrote in message news:[hidden email]...

> ...
> The bug: I noticed that some Collections in the image had absurdly high
> #averageProbesPerElement values (over 20 in one case). Investigating showed
> that the worst offenders were the PluggableSets held by Packages in their
> methodNames collection. It turns out that methodNames is updated by code
> like:
>
> 	methodNames := methodNames select: [...].
> 	methodNames := methodNames collect: [...].
>
> both of which answer a new PluggableSet with the *default* search policy,
> instead of the custom policy used by Package. I *suspect* that the
> underlying bug is that Collection>>select: and Collection>>collect: both
> start by doing something like:
>
> 	newCollection := self species new.
>
> rather than:
>
> 	newCollection := self copyEmpty.
>
> but I'm not too sure about the implications of the change.

Looking into this a bit more, there are certainly some issues with
PluggableSets and PluggableLookupTables in relation to collect:/select:, but
this is a rather complex and subtle area of the Collection hierarchy.

It all touches on the #species mechanism, which is used to create a
collection "like" the receiver, but not necessarily of the exact same class.
The need for this is pretty clear cut with regard to #collect:, because the
transformation block may quite often evaluate to objects that are not
compatible with the original collection "type". For example:

	(1 to: 5) collect: [:each | each * 2]

Or the Collection may not itself be copyable (e.g. it is a proxy or wrapper
for an externally represented collection), or, in the case of weak
collections, if the answer were also weak the elements might immediately
disappear! In the case of Pluggable collections it seems quite possible that
the search policy associated with the collection may not be compatible with
the results of evaluating the transformation block (pluggable collections
are useful for representing homogeneous collections of objects, such as a
dictionary of strings with case-insensitive lookup). Therefore I think that
both PluggableSet and PluggableLookupTable should override #species to
answer Set and LookupTable respectively, and thus #collect: will answer a
"normal" set/dictionary. Furthermore, this suggests that the Package code
which is using #collect: on the methodNames collection is incorrect.

#select: is more tricky. It seems that St-80 has always used the #species
mechanism for #select: as well, which I think is dubious. Certainly there
are some cases where it simplifies the collection hierarchy implementation
(e.g. for Interval, where again one needs the result to be an Array, and the
default implementation of #select: in Collection, which uses #species, will
thus work for Interval without needing to be overridden because
Interval>>species answers Array), but personally I think it is wrong. It
seems to be overloading the #species mechanism with too many different
tasks. While #collect: is expected to answer a Collection "like" the
receiver, #select: should, IMO, answer a Collection the same as the
receiver, except in those cases (such as Interval) where that is not
possible. Since #select: is filtering down the same elements, it should
almost always be possible for the result to be the exact same class of
collection. Unfortunately, however, I am reluctant to change it unilaterally
since the existing behaviour is just too long established.
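As a rough sketch (untested, and assuming #copyEmpty answers an empty
collection of the receiver's class with its search policy intact -- if
#copyEmpty itself goes through #species, it would need a matching override),
the changes might look something like:

=========
PluggableSet>>species
	"Answer the class used when deriving new collections from the receiver.
	A plain Set, since the results of #collect: may not be compatible with
	the receiver's search policy."

	^Set
=========

=========
PluggableLookupTable>>species
	"As for PluggableSet: derived collections default to a plain LookupTable."

	^LookupTable
=========

=========
PluggableSet>>select: discriminator
	"Answer a new collection, of the same class and search policy as the
	receiver, containing just those elements for which the discriminator
	block evaluates to true. Overridden so that filtering no longer goes
	through #species (which would now answer Set)."

	| answer |
	answer := self copyEmpty.
	self do: [:each | (discriminator value: each) ifTrue: [answer add: each]].
	^answer
=========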
So we have the choice between just implementing #species in
PluggableSet/PluggableLookupTable, in which case both #collect: and #select:
will answer a "normal" Set/LookupTable, or overriding #select: as well, in
which case #select: can be made to maintain the pluggable hashed collection.

Regards

Blair
Blair,
> Looking into this a bit more, there are certainly some issues with
> PluggableSets and PluggableLookupTables in relation to collect:/select:,
> but this is a rather complex and subtle area of the Collection hierarchy.

Indeed.

> Therefore I think that both PluggableSet and PluggableLookupTable should
> override #species to answer Set and LookupTable respectively, and thus
> #collect: will answer a "normal" set/dictionary. Furthermore, this
> suggests that the Package code which is using #collect: on the methodNames
> collection is incorrect.

Yes, this all sounds very reasonable to me.

> #select: is more tricky. It seems that St-80 has always used the #species
> mechanism for #select: as well, which I think is dubious.

Agreed.

> Unfortunately, however, I am reluctant to change it unilaterally since the
> existing behaviour is just too long established.

I can see your point.

> So we have the choice between just implementing #species in
> PluggableSet/PluggableLookupTable, in which case both #collect: and
> #select: will answer a "normal" Set/LookupTable, or overriding #select: as
> well, in which case #select: can be made to maintain the pluggable hashed
> collection.

I'd be inclined to implement #species and to override #select:. The
justification (besides your logic that I've snipped) would be that the only
classic Collection with "pluggable" behaviour -- SortedCollection -- also
arranges for #select: to preserve the pluggable datum.

> Blair

    -- chris
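To make that SortedCollection precedent concrete, a quick workspace check
might look like this sketch (it relies only on standard SortedCollection
protocol and on the behaviour described above, namely that #select: on a
SortedCollection keeps the receiver's sort block):

=========
"Filter a descending SortedCollection; the result is expected to be a
SortedCollection still ordered by the same (descending) sort block."
| sc filtered |
sc := SortedCollection sortBlock: [:a :b | a > b].
sc addAll: #(3 1 4 1 5 9 2 6).
filtered := sc select: [:each | each > 2].
filtered	"expected: 9 6 5 4 3, still in descending order"
=========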