Posted by
Andrea Ferretti on
Feb 18, 2015; 7:16pm
URL: https://forum.world.st/Small-benchmark-tp4806092p4806394.html
For me there is no problem in including kmeans, but I should warn you
that this implementation is far from optimal, for various reasons.
First, I have reused the class Point, while it would make more sense
to do something that works in any dimension.
Second, these examples are (intentionally) written in a particularly
naive way. The main reason why is that I wanted to benchmark the speed
of Pharo for interactive work. I assumed that during an interactive
session, one would not bother to write an optimized implementation of
an algorithm, but just use the collection's `groupBy:` method; also I
did not want to make anything specific to kmeans, because I wanted the
comparison to be meaningful for other algorithms where kmeans-specific
optmizations would not
apply.
In the specific case of kmeans, there are a few easy optimizations to do:
- first, since sqrt is increasing, all comparisons can be made with
the squared distance, saving the sqrt operation
- moreover, the main step of the algorithm requires to group points by
their nearest centroid. In general a group-by operation requires a
hashmap, but in this case there are a finite, known number of choices
for the nearest centroid. It is more performant to just keep the index
of the nearest centroid for each point, thereby avoiding to build a
hashmap. In fact, it is more performant to accumulate the sum and the
count of the points near the centroid while traversing, and then just
perform a ratio at the end of the step.
2015-02-18 20:09 GMT+01:00 stepharo <
[hidden email]>: