David Griswold writes:
> Sends that have more than 4 receiver types, such as your micro-benchmark, > can't even use PICs or any kind of inline cache, so these are a full > megamorphic send in Strongtalk, which is implemented as an actual hashed > lookup, which is the slowest case of all. You might say that is what > Smalltalk is all about, but in reality megamorphic sends are relatively rare > as a percentage of sends. Compilers aren't magic- no one can eliminate the > fundamental computation that a truly megamorphic send has to do- it *has* to > do some kind of real lookup, and a call, so the performance will naturally > be similar across all Smalltalks. I'm fairly sure that VisualWorks has a hash PIC that it uses for mega-morphic sends. Eliot talked about this at Smalltalk Solutions. I also doubt that VW does any advanced optimizations such as global code motion (moving type-checks out of loops) or loop unrolling. If it did it would be faster than Exupery for the bytecode benchmark. However, in this case if you're actually compiling your benchmark in Strongtalk it's possible that the performance difference between VW and Strongtalk is the method specialization done by Strongtalk. Strongtalk, AFAIK, compiles a version for each receiver for a method. This is an optimization because it allows more precise type information to be gathered as it's not polluted by other classes use of an inherited method. Specializing methods by receiver should also allow faster inlining of self sends as they can be fully resolved at compile time. (1) Having a separate compiled method for every receiver may be doing bad things to your CPU's instruction cache. That could be where Strongtalk's lack of performance here is coming from. First level instruction caches are small, the largest on a desktop CPU is only 64Kb. If you want to find out then it is possible to measure cache misses unfortunately I only know how to do this under Linux. 
Microbenchmarks are getting less reliable as compilers and hardware become smarter.

Bryce

(1) Exupery also compiles a version of each method for each receiver. It does this to allow it to compile specialised versions of the #at: and #new primitives. Specialising is often the right thing to do, especially if you plan to inline methods. A fully tuned compiler might, but might not, only specialise methods when it helps. However, in general it may cost more to figure out when specialisation helps than it costs to always specialise. Without extensive macro benchmarking it is dangerous to guess.
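The dispatch strategy under discussion — a monomorphic inline cache that grows into a PIC of up to 4 receiver classes and then falls back to a hashed megamorphic lookup — can be sketched outside any particular VM. The following Python model is purely illustrative (the 4-entry PIC limit comes from the thread; the class and function names are invented, and real VMs do this by patching machine code, not with dictionaries):

```python
# Toy model of a Smalltalk-style send site: an inline cache that grows
# into a polymorphic inline cache (PIC) of up to 4 receiver classes,
# then falls back to a megamorphic hashed lookup once the PIC overflows.
# Illustrative only - not how Strongtalk or VisualWorks lay out code.

PIC_LIMIT = 4

MEGAMORPHIC_CACHE = {}           # (class, selector) -> method

def megamorphic_lookup(cls, selector):
    """Shared hashed lookup used once a send site goes megamorphic."""
    key = (cls, selector)
    method = MEGAMORPHIC_CACHE.get(key)
    if method is None:
        method = getattr(cls, selector)   # the unavoidable "real lookup"
        MEGAMORPHIC_CACHE[key] = method
    return method

class SendSite:
    def __init__(self, selector):
        self.selector = selector
        self.pic = {}            # class -> method, at most PIC_LIMIT entries
        self.megamorphic = False

    def send(self, receiver, *args):
        cls = type(receiver)
        if not self.megamorphic:
            method = self.pic.get(cls)
            if method is None:
                # PIC miss: do a real lookup and try to extend the PIC.
                method = getattr(cls, self.selector)
                if len(self.pic) < PIC_LIMIT:
                    self.pic[cls] = method
                else:
                    # A fifth receiver class: give up on the PIC and use
                    # the hashed lookup for all future sends at this site.
                    self.megamorphic = True
            return method(receiver, *args)
        return megamorphic_lookup(cls, self.selector)(receiver, *args)
```

With five receiver classes the site goes megamorphic as soon as the fifth class arrives, matching the "more than 4 receiver types" threshold quoted above.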
On 12/18/06, [hidden email] <[hidden email]> wrote:
> David Griswold writes:

I don't know exactly the details of how VW's hash PICs work, but I think my original comment holds: since both Strongtalk and VW do hashing for megamorphic sends, and type-feedback doesn't help Strongtalk in this case, I would expect them to be fairly similar in performance, modulo standard code-quality issues that reflect the level of tuning in the compiler.

It is quite possible that VW doesn't do loop unrolling (which is why my original post put less confidence in that), but I am pretty sure it does array bounds-check removal, which, as I said, I would expect to account for a good chunk of any performance difference (although at the moment we don't actually have comparative numbers, since no one has run both VW and compiled Strongtalk on the same machine on this benchmark). Strongtalk should be able to move the Array access type-test out of the loop; I had assumed that VW could do that too, since it seems like a relatively easy thing to do.

> However, in this case if you're actually compiling your benchmark in

I doubt the instruction cache is the issue here, since the only customized methods involved are a few different versions of #yourself, which does nothing but return self, so the methods should only be a few instructions long. It should take a lot more than that to thrash the instruction cache. And in general in Strongtalk, the code duplication caused by customization is counteracted by the fact that only hotspot code is compiled in the first place, unlike in VW. The entire compiled code cache in Strongtalk, for all code in the image, is rarely bigger than 2-4 megabytes total, which is probably smaller than VW's code cache. There is probably a bit more instruction cache pressure in Strongtalk, but we've never seen anything that looked like a performance hit because of it, since all that really matters is whether the inner-loop working set of the moment thrashes or not, not the whole code cache.
> Microbenchmarks are getting less reliable as compilers and hardware

Absolutely!

Cheers,
Dave
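The two loop optimizations David mentions — hoisting a loop-invariant type-test out of the loop, and removing per-iteration array bounds checks — can be shown as a hand-applied source transformation. A sketch in Python (the checks are written out explicitly to make visible what the compiler eliminates; the function names are invented for illustration):

```python
# Before: what a naively compiled loop conceptually does - every
# iteration re-checks the receiver's type and the index bounds.
def sum_checked(arr, n):
    total = 0
    for i in range(n):
        if not isinstance(arr, list):       # type-test, loop-invariant
            raise TypeError("expected an array")
        if not (0 <= i < len(arr)):         # per-iteration bounds check
            raise IndexError(i)
        total += arr[i]
    return total

# After hoisting: the type-test moves out of the loop (the receiver's
# class cannot change between iterations), and one range comparison
# before the loop subsumes every per-iteration bounds check, since i
# only ever runs from 0 to n-1.
def sum_hoisted(arr, n):
    if not isinstance(arr, list):           # checked once
        raise TypeError("expected an array")
    if not (0 <= n <= len(arr)):            # one check covers all i in [0, n)
        raise IndexError(n)
    total = 0
    for i in range(n):
        total += arr[i]                     # no checks left in the hot loop
    return total
```

Both versions accept and reject exactly the same inputs; only the number of checks executed inside the loop changes, which is why the transformation is safe for a compiler to apply automatically.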