A suggestion and a bug

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

A suggestion and a bug

Chris Uppal-3
I've recently been playing with measuring the effectiveness of Sets and
similar hashed collections.

I've added a bunch of methods to the various hashed collection classes,
like:

=========
PluggableLookupTable>>bestSlotFor: anObject
     "Answer the integer index of the slot where anObject would
     best belong if this collection were currently empty."

    #CUadded.
     ^ searchPolicy hash: anObject max: self basicSize.
=========

And so on for all the other classes (it's actually a little bit more
complicated than that, but that's the general idea).  With that in place, I
can define two very useful methods:

=========
Set>>collisions
    "Answer an Integer count of the elements which are not in their
     natural positions (i.e. not in the place they hash to)."
    | count |
   #CUadded.

    count := 0.
    1 to: self basicSize do:
         [:i || element |
         element := self basicAt: i.
         element notNil ifTrue:
         [(self bestSlotFor: element) ~= i ifTrue: [count := count + 1]]].
    ^ count.
=========

and:

=========
Set>>averageProbesPerElement
    "Answer the average number of probes of the table necessary to
    find an element in the receiver.  In the ideal case this is 1."
    | probes |
#CUadded.

    self isEmpty ifTrue: [^ 1].

    probes := 0.
    1 to: self basicSize do:
        [:i || element pos distance |
        element := self basicAt: i.
        element notNil ifTrue:
             [pos := self bestSlotFor: element.
                  distance := i - pos.
                  distance < 0 ifTrue: [distance := distance + self
basicSize].
                  probes := probes + distance + 1]].
    ^ probes / self size.
=========

And I've also added a #loadFactor method to Set.

The suggestion is that these would make a useful addition to the standard
image.  I'm happy to post the code if anyone wants it (or email it, which
would avoid formatting problems).  A further suggestion is to refactor the
various #findElementOrNil: implementations to make use of #bestSlotFor: --
that would eliminate quite a lot of duplication of code (and make sure that
the results of #bestSlotFor: were always correct ;-).

The bug is that I noticed that some Collections in the image had absurdly
high #averageProbesPerElement values (over 20 in one case).  Investigating
showed that the worst of them were PluggableSets held in Packages, the
methodNames collection.  It turns out that methodNames is updated by code
like:

    methodNames := methodNames select: [...].
    methodNames := methodNames collect: [...].

both of which answer a new PluggableSet with the *default* search policy,
instead of the custom policy used by Package.  I *suspect* that the
underlying bug is that Collection>>select: and Collection>>collect: both
start by doing something like:

    newCollection = self species new.

rather than:

    newCollection := self copyEmpty.

but I'm not too sure about the implications of the change.

    -- chris


Reply | Threaded
Open this post in threaded view
|

Re: A suggestion and a bug

Blair McGlashan
Chris

You wrote in message
news:[hidden email]...

> I've recently been playing with measuring the effectiveness of Sets and
> similar hashed collections.
>
> I've added a bunch of methods to the various hashed collection classes,
> like:....[various statistical methods snipped]...
> The suggestion is that these would make a useful addition to the standard
> image.  I'm happy to post the code if anyone wants it (or email it, which
> would avoid formatting problems).  A further suggestion is to refactor the
> various #findElementOrNil: implementations to make use of #bestSlotFor: --
> that would eliminate quite a lot of duplication of code (and make sure
that
> the results of #bestSlotFor: were always correct ;-).

Please e-mail them to me if you are happy for them to be included in the
image, certainly they would be useful for testing the effectiveness of one's
#hash implementations.

>
> The bug is that I noticed that some Collections in the image had absurdly
> high #averageProbesPerElement values (over 20 in one case).  Investigating
> showed that the worst of them were PluggableSets held in Packages, the
> methodNames collection.  It turns out that methodNames is updated by code
> like:
>
>     methodNames := methodNames select: [...].
>     methodNames := methodNames collect: [...].
>
> both of which answer a new PluggableSet with the *default* search policy,
> instead of the custom policy used by Package.  I *suspect* that the
> underlying bug is that Collection>>select: and Collection>>collect: both
> start by doing something like:
>
>     newCollection = self species new.
>
> rather than:
>
>     newCollection := self copyEmpty.
>
> but I'm not too sure about the implications of the change.

Thanks for the bug report (no. 417) - that is rather worrying!

Regards

Blair


Reply | Threaded
Open this post in threaded view
|

Re: A suggestion and a bug

Blair McGlashan
In reply to this post by Chris Uppal-3
Chris

You wrote in message
news:[hidden email]...

> ...
> The bug is that I noticed that some Collections in the image had absurdly
> high #averageProbesPerElement values (over 20 in one case).  Investigating
> showed that the worst of them were PluggableSets held in Packages, the
> methodNames collection.  It turns out that methodNames is updated by code
> like:
>
>     methodNames := methodNames select: [...].
>     methodNames := methodNames collect: [...].
>
> both of which answer a new PluggableSet with the *default* search policy,
> instead of the custom policy used by Package.  I *suspect* that the
> underlying bug is that Collection>>select: and Collection>>collect: both
> start by doing something like:
>
>     newCollection = self species new.
>
> rather than:
>
>     newCollection := self copyEmpty.
>
> but I'm not too sure about the implications of the change.

Looking into this a bit more, there are certainly some issues with
PluggableSets and PluggableLookupTables in relation to collect:/select: but
this is a rather complex and subtle area of the Collection hierarchy. It all
touches on the #species mechanism which is used to create a collection
"like" the receiver, but not necessarily of the exact same class. The need
for this is pretty clear cut with regard to #collect:, because the
transformation block may quite often evaluate to objects that are not
compatible with the original collection "type". For example:
    (1 to: 5) collect: [:each | each * 2]
or the Collection may not itself be copyable (e.g. it is a proxy or wrapper
for an externally represented collection), or in the case of weak
collections if the answer was also weak the elements might immediately
disappear!

In the case of Pluggable collections it seems quite possible that the search
policy associated with the collection may not be compatible with the results
of evaluating the transformation block (pluggable collections are useful for
represent homogeneous collections of objects, such as a dictionary of
strings with case-insensitive lookup). Therefore I think that both
PluggableSet and PluggableLookupTable should override #species to answer Set
and LookupTable respectively, and thus #collect: will answer a "normal"
set/dictionary. Furthermore this suggests that the Package code which is
using #collect: on the methodNames collection is incorrect.

#select: is more tricky. It seems that St-80 has always used the #species
mechanism for #select: as well, which I think is dubious. Certainly there
are some cases where it simplifies the collection hierarchy implementation
(e.g. for Interval where again one needs the result to be an Array, and the
default implementation of #select: in Collection which uses #species will
thus work for Interval without needing to be overridden because
Interval>>species answers Array), but personally I think it is wrong. It
seems to be overloading the #species mechanism with too many different
tasks. While #collect: is expected to answer a Collection "like" the
receiver, #select: should, IMO, answer a Collection the same as the
receiver, except in those cases (such as Interval) where that is not
possible. Since #select: is filtering down the same elements, it should
almost always be possible for the result to be the exact same class of
collection. Unfortunately, however, I am reluctant to change it unilaterally
since the existing behaviour is just too long established.

So we have the choice between just implementing #species in
PluggableSet/PluggableLookupTable, in which case both #collect: and #select:
will answer a "normal" Set/LookupTable, or overridding #select: as well, in
which case #select: can be made to maintain the pluggable hashed collection.

Regards

Blair


Reply | Threaded
Open this post in threaded view
|

Re: A suggestion and a bug

Chris Uppal-3
Blair,
>
> Looking into this a bit more, there are certainly some issues with
> PluggableSets and PluggableLookupTables in relation to collect:/select:
but
> this is a rather complex and subtle area of the Collection hierarchy.

Indeed.

> Therefore I think that both
> PluggableSet and PluggableLookupTable should override #species to answer
Set
> and LookupTable respectively, and thus #collect: will answer a "normal"
> set/dictionary. Furthermore this suggests that the Package code which is
> using #collect: on the methodNames collection is incorrect.

Yes, this all sounds very reasonable to me.

> #select: is more tricky. It seems that St-80 has always used the #species
> mechanism for #select: as well, which I think is dubious.

Agreed.

> Unfortunately, however, I am reluctant to change it unilaterally
> since the existing behaviour is just too long established.

I can see your point.

> So we have the choice between just implementing #species in
> PluggableSet/PluggableLookupTable, in which case both #collect: and
#select:
> will answer a "normal" Set/LookupTable, or overridding #select: as well,
in
> which case #select: can be made to maintain the pluggable hashed
collection.

I'd be inclined to implement #species and to override #select:.  The
justification (besides your logic that I've snipped) would be that the only
classic Collection with "pluggable" behaviour -- SortedCollection -- also
arranges for #select: to preserve the pluggable datum.

> Blair

    -- chris