Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Klaus D. Witzel
Folks,

I'm sorry to tell that Strongtalk is NOT that fast. I followed the
instructions and *compiled* the following benchmark in Strongtalk,
evaluated the same expression in Squeak and in VW, and got these
results (milliseconds, as reported by Time millisecondsToRun:) on my
1.73GHz 1.0GB WinXP notebook:

- VisualWorks: 16799 (N.C. 7.4.1)
- Strongtalk: 47517 (1.1.2)
- Squeak: 56726 (3.9#7056)

Below is the Squeak/VW source code, attached is the Strongtalk source
code. The test is simple: a long loop around a single polymorphic call
site "(instances at: i) yourself", straightforward to inline and with
intentionally unpredictable type information at the call site (modeled
after the Thue-Morse sequence).

I'm disappointed, Strongtalk was always advertised as being the fastest  
Smalltalk available "...executes Smalltalk much faster than any other  
Smalltalk implementation...", and now it shows to be in almost the same  
class as Squeak is :) :(

Can somebody reproduce the figures, any other results? Have I done  
something wrong?

BTW: congrats to the implementors of Squeak and, of course, to Cincom!  
(uhm, and also to the Strongtalk team!)

/Klaus

--------------
  | instances base |
  base := (Array
        with: OrderedCollection basicNew
        with: SequenceableCollection basicNew
        with: Collection basicNew
        with: Object basicNew) ,
        (Array
        with: Character space
        with: Date basicNew
        with: Time basicNew
        with: Magnitude basicNew).
  instances := OrderedCollection with: (base at: 1).
  2 to: base size do: [:i |
   instances := instances , instances reverse.
   instances addLast: (base at: i)].
  instances := (instances , instances reverse) asArray.
  ^ Time millisecondsToRun: [
        1234567 timesRepeat: [
                1 to: instances size do: [:i |
                        (instances at: i) yourself]]]
--------------
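
To get a feel for the size and shape of the receiver sequence, here is a rough sketch of the same doubling construction in Python (chosen only so it can be run outside a Smalltalk image). The strings are stand-in labels for the eight receiver classes, and `build_instances` is a name made up for this sketch; it is not part of the original benchmark.

```python
def build_instances(base):
    """Mirror the doubling construction from the Smalltalk snippet."""
    instances = [base[0]]
    for item in base[1:]:
        # instances := instances , instances reverse.
        instances = instances + instances[::-1]
        # instances addLast: (base at: i)
        instances.append(item)
    # Final mirror: (instances , instances reverse) asArray
    return instances + instances[::-1]


# Stand-in labels for the eight receiver classes used in the post.
BASE = ["OrderedCollection", "SequenceableCollection", "Collection", "Object",
        "Character", "Date", "Time", "Magnitude"]

instances = build_instances(BASE)
sends_per_pass = len(instances)          # receivers visited by one inner loop
total_sends = 1234567 * sends_per_pass   # sends of #yourself in the timed block
```

Under these assumptions the sequence has 510 elements, so the 1234567 passes amount to 629,629,170 sends of #yourself. The mix is not uniform: the first class appears 256 times per pass and the last one only twice.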


Attachment: Benchmark.dlt (1K)

Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Andreas.Raab
Klaus D. Witzel wrote:
> I'm disappointed, Strongtalk was always advertised as being the fastest
> Smalltalk available "...executes Smalltalk much faster than any other
> Smalltalk implementation...", and now it shows to be in almost the same
> class as Squeak is :) :(
>
> Can somebody reproduce the figures, any other results? Have I done
> something wrong?

Yes. First, you are equating the result of a single micro-benchmark with
overall system performance. Micro-benchmarks are used to measure
specific aspects of a particular implementation. Your benchmark measures
highly polymorphic send performance. Which is not typical for Smalltalk
code to begin with.

In other words, your claim is based on measuring a single atypical
performance characteristic. This has *nothing* to do with "Smalltalk
performance". If you want to measure "Smalltalk performance" you should
run a number of the standard benchmarks (Richards, Slopstone) that come
with Strongtalk and compare those.

Quite honestly, I'm surprised to see a person like you, who obviously
understands enough about dynamic systems to measure PIC effects, make
such unsubstantiated claims. I would expect that you know how to
evaluate the results of a micro-benchmark, and I would in particular
expect that you know that 80-90% of all call-sites in realistic code are
mono-morphic to begin with, which renders your benchmark results
absolutely useless for "Smalltalk code".

Cheers,
   - Andreas


Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Michael Haupt-3
In reply to this post by Klaus D. Witzel
Hi Klaus,

I haven't yet reproduced the benchmarks, but I'd never judge the
performance of an entire implementation based on one single simplistic
benchmark. Sorry, this sounds very negative, I don't mean to pick on
you.

The benchmark you have run is a micro-benchmark that measures the
performance of just one point of interest. Basically, it measures the
performance of sending #yourself to various different objects,
instances of 8 different classes. I believe that #yourself is
implemented in Object and never overwritten.

(Having 8 different classes doesn't exceed typical PICs; they normally
have 8 entries, if I'm not mistaken. The benchmark could really stress
the VM if much more than 8 different classes were chosen - but in the
end, it would be more interesting to have actual different
*implementations* of a message, because the VM can quite easily
determine that the implementation for #yourself is the same for all
objects.)

In a nutshell, micro-benchmarks are fine but should be more diverse. Measure
- monomorphic call sites (just one target),
- polymorphic call sites (small number of different targets), and
- megamorphic call sites (very large number of different targets).

The results of all of these together would tell more.
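
As a rough illustration of the three cases, here is a toy model of a per-send-site cache in Python. Everything in it is an assumption made for the sketch: the `CallSite` class, the capacity of 8 (taken from the "if I'm not mistaken" figure above), and the choice to simply stop caching once the site is full. Real VMs handle overflow differently, and this is not how Strongtalk or VisualWorks actually implement PICs.

```python
class Receiver:
    def yourself(self):
        return self


def make_receivers(n_classes):
    """One instance each of n_classes distinct (hypothetical) subclasses."""
    return [type(f"Class{i}", (Receiver,), {})() for i in range(n_classes)]


class CallSite:
    """Toy polymorphic inline cache for a single send site."""

    def __init__(self, selector, capacity=8):
        self.selector = selector
        self.capacity = capacity
        self.cache = {}   # receiver class -> method found by the full lookup
        self.hits = 0
        self.misses = 0

    def send(self, receiver):
        cls = type(receiver)
        method = self.cache.get(cls)
        if method is None:
            self.misses += 1
            # getattr stands in for the slow, full method lookup.
            method = getattr(cls, self.selector)
            if len(self.cache) < self.capacity:
                self.cache[cls] = method
        else:
            self.hits += 1
        return method(receiver)


def run(n_classes, passes=100, capacity=8):
    """Send #yourself to every receiver, `passes` times, through one site."""
    site = CallSite("yourself", capacity)
    receivers = make_receivers(n_classes)
    for _ in range(passes):
        for receiver in receivers:
            site.send(receiver)
    return site.hits, site.misses


monomorphic = run(1)    # one receiver class
polymorphic = run(8)    # exactly fills the toy cache
megamorphic = run(20)   # 12 classes never fit and miss on every pass
```

In this toy model the mono- and polymorphic sites miss only once per class, while the 20-class site keeps missing for every class that did not fit into the cache. Note that all receivers share one implementation of #yourself here, which is exactly the simplification pointed out above.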

Also, an optimising VM normally takes some time to start optimising -
before the adaptive optimisation logic sees that there are some "hot
spots", usually the interpreter has to execute stuff for some time. Of
course, this doesn't hold for Squeak.

And once the VM has started optimising, there is still some impact due
to optimisation (it consumes time as well!). You normally let the
benchmark run several times until you can be sure that the VM has
applied all optimisations and measure the performance yielded by this
"steady state". This results in numbers that report only actual
performance instead of VM and optimisation interference.

I wonder whether there is something like SPECjvm98 for Smalltalk systems.

Of course, we also shouldn't forget that Strongtalk has not been
developed for some 10 years now, whereas VisualWorks has been
constantly maintained by at least one VM guru. ;-)

Best,

Michael


Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Klaus D. Witzel
In reply to this post by Andreas.Raab
Hi Andreas,

on Sun, 17 Dec 2006 11:52:36 +0100, you wrote:

> Klaus D. Witzel wrote:
>> I'm disappointed, Strongtalk was always advertised as being the fastest  
>> Smalltalk available "...executes Smalltalk much faster than any other  
>> Smalltalk implementation...", and now it shows to be in almost the same  
>> class as Squeak is :) :(
>>  Can somebody reproduce the figures, any other results? Have I done  
>> something wrong?
>
> Yes. First, you are equating the result of a single micro-benchmark with  
> overall system performance. Micro-benchmarks are used to measure  
> specific aspects of a particular implementation. Your benchmark measures  
> highly polymorphic send performance. Which is not typical for Smalltalk  
> code to begin with.
>
> In other words, your claim is based on measuring a single atypical  
> performance characteristic.

Not really. There are at least two motivations:

  - a - being faster means >= and the = part is missing
  - b - the test puts some stress on the call site. Any specific suggestion
        from your side on how to test and compare that in typical situations
        (this is *not* a rhetorical question)? And no, it's not atypical,
        see below.

> This has *nothing* to do with "Smalltalk performance". If you want to  
> measure "Smalltalk performance" you should run a number of the standard  
> benchmarks (Richards, Slopstone) that come with Strongtalk and compare  
> those.

Well, how about something new, or are you after stagnation, Andreas? (no
offense, really ;-)

> Quite honestly, I'm surprised to see a person like you who obviously  
> understands enough about dynamic systems to measure PIC effects to make  
> such unsubstantiated claims.

O.K. I understand that as lack of use case. Take this (take that ;-)

  | allCs |
  allCs := Smalltalk allClasses.
  "start timing here"
  1 to: allCs size do: [:i |
        (allCs at: i) methodDict "just access the iVar"]
  "note that #yourself from the previous example is now just #methodDict"

This snippet is performed on behalf of every developer who asks for  
senders and/or implementors. It is, IMHO, the most often used piece of  
code of every Smalltalk, ever.

So I limited the amount of information to be handled by PICs (as noted in
the comment of Michael), otherwise the comparison would've just been on
the possible "methodCache" performance (which would've been a nice test as
well [after eliminating differences with (Smalltalk size)] but I was not
looking for that).

But, even for the latter I expect to find the = in >= when Strongtalk  
"...executes Smalltalk much faster than any other Smalltalk  
implementation...".

> I would expect that you know how to evaluate the results of a  
> micro-benchmarks, and I would in particular expect that you know that  
> 80-90% of all call-sites in realistic code are mono-morphic to begin  
> with which render your benchmark results absolutely useless for  
> "Smalltalk code".

Absolutely not (yes, I know about these figures. no, I disagree: see  
above).

BTW: does anybody know about recent work in the direction of "Message  
Dispatch on Pipelined Processors", thanks for any pointers.

/Klaus


Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Klaus D. Witzel
In reply to this post by Michael Haupt-3
Thank you Michael for your illustrative response.

I had taken most of the steps you mention, before posting, but the one  
with stress on a small PIC size was rather unexpected. Perhaps it would be  
interesting to find out the actual limit and try again. But the  
performance of this mini-morphic situation is not convincing.

I intentionally used instances of classes which inherit from each other:
this is the typical situation when processing collections - regardless of
using Traits. And yes, as you mention, it can be interesting to have actual
different *implementations* of a message. But I doubt that there will be
a remarkable difference, since the methodCache is per receiver class (and so
IMO there is no change for the example in my previous post).

Thanks again.

/Klaus

On Sun, 17 Dec 2006 12:04:21 +0100, Michael Haupt wrote:

> Hi Klaus,
>
> I haven't yet reproduced the benchmarks, but I'd never judge the
> performance of an entire implementation based on one single simplistic
> benchmark. Sorry, this sounds very negative, I don't mean to pick on
> you.
>
> The benchmark you have run is a micro-benchmark that measures the
> performance of just one point of interest. Basically, it measures the
> performance of sending #yourself to various different objects,
> instances of 8 different classes. I believe that #yourself is
> implemented in Object and never overwritten.
>
> (Having 8 different classes doesn't exceed typical PICs; they normally
> have 8 entries, if I'm not mistaken. The benchmark could really stress
> the VM if much more than 8 different classes were chosen - but in the
> end, it would be more interesting to have actual different
> *implementations* of a message, because the VM can quite easily
> determine that the implementation for #yourself is the same for all
> objects.)
>
> In a nutshell, micro-benchmarks are fine but should be more diverse.  
> Measure
> - monomorphic call sites (just one target),
> - polymorphic call sites (small number of different targets), and
> - megamorphic call sites (very large number of different targets).
>
> The results of all of these together would tell more.
>
> Also, an optimising VM normally takes some time to start optimising -
> before the adaptive optimisation logic sees that there are some "hot
> spots", usually the interpreter has to execute stuff for some time. Of
> course, this doesn't hold for Squeak.
>
> And once the VM has started optimising, there is still some impact due
> to optimisation (it consumes time as well!). You normally let the
> benchmark run several times until you can be sure that the VM has
> applied all optimisations and measure the performance yielded by this
> "steady state". This results in numbers that report only actual
> performance instead of VM and optimisation interference.
>
> I wonder whether there is something like SPECjvm98 for Smalltalk systems.
>
> Of course, we also shouldn't forget that Strongtalk has not been
> developed for some 10 years now, whereas VisualWorks has been
> constantly maintained by at least one VM guru. ;-)
>
> Best,
>
> Michael
>
>




Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Andreas.Raab
In reply to this post by Klaus D. Witzel
Klaus D. Witzel wrote:

> O.K. I understand that as lack of use case. Take this (take that ;-)
>
>  | allCs |
>  allCs := Smalltalk allClasses.
>  "start timing here"
>  1 to: allCs size do: [:i | (allCs at: i) methodDict "just access the
> iVar"]
>  "note that #yourself from the previous example is now just #methodDict"
>
> This snippet is performed on behalf of every developer who asks for
> senders and/or implementors. It is, IMHO, the most often used piece of
> code of every Smalltalk, ever.

Absolutely not. Your claim that "this snippet is performed on behalf of
every developer who asks for senders and/or implementors" is misleading.
This is not what is *actually* performed. What is actually done is a lot
more. For every single mega-morphic send you have dozens of mono-morphic
sends.

That is my whole point. Just like in your previous post, you are not
using actual code but rather a specifically devised micro-benchmark that
has none of the characteristics of actual code. It's not done sending
#methodDict - this is when the work starts not when it ends. If you look
at the actual code that is executed, say:

     MessageTally tallySends:[Time browseAllCallsOn: #yourself]

you will find that when browsing senders there are some 50 messages sent
in addition to the single mega-morphic send, and it is *those* fifty
messages where the real work is - the single mega-morphic send is
simply noise in the overall performance. And it's these fifty messages
(which have different performance characteristics) where Strongtalk just
completely rulez.

> But, even for the latter I expect to find the = in >= when Strongtalk
> "...executes Smalltalk much faster than any other Smalltalk
> implementation...".

The claim is about "Smalltalk code", not about "Klaus Witzel Benchmarks"
(the difference between the two should be obvious). I can always design
you a benchmark that makes a particular system look bad.

>> I would expect that you know how to evaluate the results of a
>> micro-benchmarks, and I would in particular expect that you know that
>> 80-90% of all call-sites in realistic code are mono-morphic to begin
>> with which render your benchmark results absolutely useless for
>> "Smalltalk code".
>
> Absolutely not (yes, I know about these figures. no, I disagree: see
> above).

If you know about the figures, then how can you claim that your
benchmark has any validity for general code? And as I am saying in the
above the *actual* code has "Smalltalk performance characteristics"
whereas your made-up micro-benchmark doesn't.

Cheers,
   - Andreas


Re: Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Michael Haupt-3
In reply to this post by Klaus D. Witzel
Hi Klaus,

On 12/17/06, Klaus D. Witzel <[hidden email]> wrote:
> But I doubt that there will be
> remarkable difference, since the methodCache is per receiver class

I don't understand that bit. What do you mean by it? In case of PICs,
the cache is per send site.

Best,

Michael


Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Klaus D. Witzel
Hi Michael,

on Sun, 17 Dec 2006 14:03:24 +0100, you wrote:
> Hi Klaus,
>
> On 12/17/06, Klaus D. Witzel <[hidden email]> wrote:
>> But I doubt that there will be
>> remarkable difference, since the methodCache is per receiver class
>
> I don't understand that bit. What do you mean by it? In case of PICs,
> the cache is per send site.

In VW and in Squeak there's no PIC and the corresponding thing which
contributes to performance I called "methodCache" (still in the context of
comparison).

I also believe (without having searched for it) that, when the PIC is
exhausted (or not in use at bytecode time) the comparison
(Squeak/VW/Strongtalk) reflects the "methodCache" thing performance.


Re: Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Michael Haupt-3
Hi Klaus,

On 12/17/06, Klaus D. Witzel <[hidden email]> wrote:
> In VW and in Squeak there's no PIC and the corresponding thing which
> contributes to performance I called "methodCache" (still in the context of
> comparision).

Squeak of course doesn't have PICs, but I was pretty sure VisualWorks had.

Best,

Michael


Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Klaus D. Witzel
In reply to this post by Andreas.Raab
Hi Andreas,

on Sun, 17 Dec 2006 13:51:53 +0100, you wrote:

> Klaus D. Witzel wrote:
>> O.K. I understand that as lack of use case. Take this (take that ;-)
>>   | allCs |
>>  allCs := Smalltalk allClasses.
>>  "start timing here"
>>  1 to: allCs size do: [:i | (allCs at: i) methodDict "just access the  
>> iVar"]
>>  "note that #yourself from the previous example is now just #methodDict"
>>  This snippet is performed on behalf of every developer who asks for  
>> senders and/or implementors. It is, IMHO, the most often used piece of  
>> code of every Smalltalk, ever.
>
> Absolutely not. Your claim that "this snippet is performed on behalf of  
> every developer who asks for senders and/or implementors" is misleading.  
> This is not what is *actually* performed.

Of course it is performed, even in reality.

> What is actually done is a lot more.

That's your point, accepted, agreed, out (and what about over ;-)

> For every single mega-morphic send you have dozens of mono-morphic sends.
>
> That is my whole point.

Agreed, NP.

> Just like in your previous post, you are not using actual code

The previous post has a simulation of actual code, I have no doubts (see  
also my layers thing below, which I hope explains).

> but rather a specifically devised micro-benchmark that has none of the  
> characteristics of actual code.

Sending #methodDict to a collection of instances of Behavior IS reality,
sorry. Perhaps you meant something else?

> It's not done sending #methodDict - this is when the work starts not  
> when it ends.

But this is "only" your point. My point is, performance is performance.
The possible dozens of mono-morphic sends do not amortize the bad (in
my case) mini-morphic performance.

I agree they could've been responsible for amortization of the investment
if the figures were true for ">=" but the latter is apparently not the
case.

Hey man, I understand your point. But a PIC of size 8 is not a mega-morphic
thing. Let's not take this one any further (if possible, please).

> If you look at the actual code that is executed, say:
>
>      MessageTally tallySends:[Time browseAllCallsOn: #yourself]
>
> you will find that when browsing senders there are some 50 messages sent  
>   in addition to the single mega-morphic send and it is *those* fifty  
> messages are where the real work is - the single mega-morphic send is  
> simply noise in the overall performance.

Well, I used #yourself because a) I was not interested in any particular
implementation, which b) has constant response time and c) because it is
guaranteed to not choke the test. I was not interested in the leaves, right
you are.

In my imagination a system like Smalltalk has several layers. And I timed  
just one of them and found the results.

> And it's these fifty messages (which have different performance  
> characteristics) where Strongtalk just completely rulez.
>
>> But, even for the latter I expect to find the = in >= when Strongtalk  
>> "...executes Smalltalk much faster than any other Smalltalk  
>> implementation...".
>
> The claim is about "Smalltalk code", not about "Klaus Witzel Benchmarks"  
> (the difference between the two should be obvious).

Not that I see any difference, I posted Smalltalk code (perhaps you meant  
something else?)

> I can always design you a benchmark that makes a particular system look  
> bad.

C'mon. It's either faster or it's not. No way out.

>>> I would expect that you know how to evaluate the results of a  
>>> micro-benchmarks, and I would in particular expect that you know that  
>>> 80-90% of all call-sites in realistic code are mono-morphic to begin  
>>> with which render your benchmark results absolutely useless for  
>>> "Smalltalk code".
>>  Absolutely not (yes, I know about these figures. no, I disagree: see  
>> above).
>
> If you know about the figures, then how can you claim that your  
> benchmark has any validity for general code?

Did I? I see this rather as a counter example which sheds some light on  
the performance claim (using your words, sheds some "Klaus Witzel code"  
light ;-)

And I found some system whose performance didn't pass a simple Thue-Morse  
sequence test :-)

> And as I am saying in the above the *actual* code has "Smalltalk  
> performance characteristics" whereas your made-up micro-benchmark  
> doesn't.

C'mon. Sending messages to elements of collections _is_ characteristic for  
the Smalltalks.

/Klaus


Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Klaus D. Witzel
In reply to this post by Michael Haupt-3
Hi Michael,

on Sun, 17 Dec 2006 14:44:01 +0100, you wrote:

> Hi Klaus,
>
> On 12/17/06, Klaus D. Witzel <[hidden email]> wrote:
>> In VW and in Squeak there's no PIC and the corresponding thing which
>> contributes to performance I called "methodCache" (still in the context  
>> of
>> comparision).
>
> Squeak of course doesn't have PICs, but I was pretty sure VisualWorks  
> had.

Will have a look if that is responsible for the figures :)


Re: Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Michael Haupt-3
In reply to this post by Klaus D. Witzel
Hi Klaus,

On 12/17/06, Klaus D. Witzel <[hidden email]> wrote:
> > I can always design you a benchmark that makes a particular system look
> > bad.
>
> C'mon. It's either faster or it's not. No way out.

exactly, you're absolutely right there.

The question is which conclusions you draw from such an isolated
observation. And the conclusion you initially drew - something along
the lines of "I thought Strongtalk was the fastest Smalltalk, whoops,
that's not true" - is far too general given the limited light your
particular measurement sheds on Strongtalk performance.

I guess this is what it boils down to. Judging an entire system's
performance by just one small simple point of observation just doesn't
work. (I must admit that this is what set me off in the first place.)

Best,

Michael


Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Klaus D. Witzel
Hi Michael,

on Sun, 17 Dec 2006 15:58:12 +0100, you wrote:

> Hi Klaus,
>
> On 12/17/06, Klaus D. Witzel <[hidden email]> wrote:
>> > I can always design you a benchmark that makes a particular system  
>> look
>> > bad.
>>
>> C'mon. It's either faster or it's not. No way out.
>
> exactly, you're absolutely right there.
>
> The question is which conclusions you draw from such a punctual
> observation.

Right you are, and so is Andreas.

> And the conclusion you initially drew - something along
> the lines of "I thought Strongtalk was the fastest Smalltalk, whoops,
> that's not true" - is far too general in the limited light your
> particular measurement sheds on Strongtalk performance.

Yes, I can now see that my "and now it shows to be in almost the same
class as Squeak is" is understood as a strong claim. But it just expresses
my disappointment and disillusionment.

> I guess this is what it boils down to. Judging an entire system's
> performance by just one small simple point of observation just doesn't
> work. (I must admit that this started me in the first place.)

No (agreeing with your "just doesn't work"). But the message is clear  
(reflecting Andreas' point): if you have nothing to inline (etc), then  
PICs (which can run out of steam) won't help.

No misunderstanding, my point did not change: even if so (have nothing to  
inline), *faster than all others* is "fast even beyond PICs capabilities".  
Otherwise it can be contradicted by Andreas (... can always design you a  
benchmark that makes a particular system look bad... :)

Perhaps there is something to learn from VW (without compromising the  
existing, I mean). Who knows.

 From a pragmatic point of view, if you can't write inlineable (type  
feedback'able, etc) code (or, as Andreas pointed out: can't write such a  
test :) , whatever the reason, don't expect a guarantee for superior  
performance.

/Klaus


Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Bert Freudenberg
On Dec 17, 2006, at 16:32 , Klaus D. Witzel wrote:
>
> Yes, I can now see that my "and now it shows to be in almost the
> same class as Squeak is" is understood as a strong claim. But it
> just expresses my disappointment and disillusionment.

I can understand the disillusionment part, although it's hardly
surprising. I mean, there are benchmarks where Squeak outperforms  
even standard C. But disappointment? Hardly, unless you believe in  
magic ;)

- Bert -




Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Klaus D. Witzel
Thank you Bert, you re+setted me up (needless to say, pun intended :)

On Sun, 17 Dec 2006 19:06:55 +0100, Bert Freudenberg wrote:

> On Dec 17, 2006, at 16:32 , Klaus D. Witzel wrote:
>>
>> Yes, I can now see that my "and now it shows to be in almost the same  
>> class as Squeak is" is understood as a strong claim. But it just  
>> expresses my disappointment and desillusion.
>
> I can understand the desillusion part, although it's hardly surprising.  
> I mean, there are benchmarks where Squeak outperforms even standard C.  
> But disappointment? Hardly, unless you believe in magic ;)
>
> - Bert -

Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Andreas.Raab
In reply to this post by Klaus D. Witzel
Klaus D. Witzel wrote:
>> The claim is about "Smalltalk code", not about "Klaus Witzel
>> Benchmarks" (the difference between the two should be obvious).
>
> Not that I see any difference, I posted Smalltalk code (perhaps you
> meant something else?)

Yes, clearly you don't see the difference and this seems to be at the
heart of the problem. You are running a micro-benchmark with specific
performance characteristics, that are not typical for Smalltalk code in
the large. Of course, you are free to make up your own performance
characteristics and measure these but that's what I call "Klaus Witzel
Benchmarks" - code that has been chosen because it has performance
characteristics that you want to measure not the performance
characteristics that "Smalltalk code" *typically* has.

The Strongtalk claims are about *typical* Smalltalk performance
characteristics, nobody has ever claimed that Strongtalk would run any
code with any performance characteristic that anyone could ever come up
with faster than other Smalltalks. In particular, there is no claim
about "faster polymorphic send performance than any other Smalltalk".

Nevertheless, solely based on this benchmark (which, again, does not
reflect typical Smalltalk performance characteristics) you are making
outrageous claims like: "I'm sorry to tell that Strongtalk is NOT that
fast." or "I'm disappointed, Strongtalk was always advertised as being
the fastest Smalltalk available "...executes Smalltalk much faster than
any other Smalltalk implementation...", and now it shows to be in almost
the same class as Squeak is".

That's what I object to. Your benchmark is absolutely no basis for such
far-reaching and (once you do some real benchmarking) obviously false
claims. A single micro-benchmark is simply not enough to judge overall
performance.

>> And as I am saying in the above the *actual* code has "Smalltalk
>> performance characteristics" whereas your made-up micro-benchmark
>> doesn't.
>
> C'mon. Sending messages to elements of collections _is_ characteristic
> for the Smalltalks.

Yes, sending messages to elements of collections is characteristic. But
sending messages to elements of *highly polymorphic* collections (which
you specifically constructed for the benchmark) is not.

Fortunately, it is very easy to show just how non-characteristic your
choice of collection is by looking at an actual image:

        lastObj := Object new.
        nextObj := nil someObject.
        bag := Bag new.
        [nextObj == lastObj] whileFalse:[
                nextObj isCollection ifTrue:[
                        set := Set new.
                        nextObj do:[:each| set add: each class].
                        bag add: set size.
                ].
                nextObj := nextObj nextObject.
        ].
        max := bag size.
        bag sortedCounts do:[:assoc|
                Transcript crtab; show: assoc key.
                Transcript show: ' (',
                        ((100.0 * assoc key / max) truncateTo: 0.01) asString,
                        '%): '.
                Transcript show: assoc value.
        ].

The result of which is (in a Croquet image I'm doing my work in):

        306384 (85.12%): 1
        31278 (8.69%): 2
        19377 (5.38%): 0
        2487 (0.69%): 3
        178 (0.04%): 4
        51 (0.01%): 5
        38 (0.01%): 6
        18 (0.0%): 10
        17 (0.0%): 7
        14 (0.0%): 8
        8 (0.0%): 9
        [...etc...]

In other words, more than 90% of all the collections (some 350,000 so
it's a nice big sample) have at most a single receiver type. 8% have two
receiver types. Everything else is noise. If you keep in mind that a good
amount of the 8% are due to monomorphic collections using Arrays
utilizing nil to indicate empty slots, the practical percentage of
monomorphic collections is probably somewhere between 95% and 98%.

So no, your benchmark is not characteristic for Smalltalk code.
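
The same kind of tally can be sketched outside an image. The Python below is only an analogue of the workspace code above, not a port of it: `distinct_type_histogram` is a made-up name, and the sample data is invented for illustration, whereas the figures above come from walking every object in a live image.

```python
from collections import Counter


def distinct_type_histogram(collections):
    """Tally how many distinct element classes each collection holds.

    Returns rows of (number of collections, percentage, distinct class count),
    most frequent first, like the listing above.
    """
    histogram = Counter(len({type(e) for e in c}) for c in collections)
    total = sum(histogram.values())
    return [(count, 100.0 * count / total, n_types)
            for n_types, count in histogram.most_common()]


# Invented sample: three monomorphic, two two-class and one empty collection.
sample = [[1, 2, 3], ["a"], [1, "a"], [], [1.0, 2.0], [None, 1]]
rows = distinct_type_histogram(sample)
for count, percent, n_types in rows:
    print(f"{count} ({percent:.2f}%): {n_types}")
```

As in the image-wide tally, a collection that holds nil next to instances of one class is counted as having two receiver types, which is the effect described above for Arrays with empty slots.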

Cheers,
   - Andreas


Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Klaus D. Witzel
Hi Andreas,

as said earlier, I understand your argument, and now I can also appreciate
the figures you extracted from a Croquet image. I have something similar
(in terms of extracted figures :) from Squeak 3.9 running Morphic. I assume
that all morphs are used in the World's steps and looked at what is there
most often (the expression's code is below). As can be seen, I concentrate
just on the "collective" aspect, i.e. when #submorphs are fetched. There are
290 senders of #submorphs (and 307 accessors of the corresponding iVar,
</phew>) and I collect possible call-sites where the PIC's size must be >=
3 (counting that as the non-trivial case, and to keep a safe distance from
your figures).

I so found 1034 non-trivial elements (sum of distinct types [when >= 3]  
over the 846 morphs [objects which respond to #submorphs]) in my running  
image (strange things these morphs :) But these figures are not used in  
the next computation, just for selecting a single subject:

AlignmentMorph (which here has 159 instances and 161 users) looks to be  
max. Using your (as always excellent!) piece of code, that shows that  
roughly 33% of them have non-trivial #submorphs. So much for your "noise"  
 from the morphic side ;-)

        97 (61.0%): 1
        50 (31.44%): 3
        6 (3.77%): 0
        3 (1.88%): 2
        3 (1.88%): 4

/Klaus

P.S. Please note that in your post you compared all collections, with a
distribution over omega types, to my smaller collection with a distribution
over 8 types. We already remarked (have we?) that it makes no sense to
compare [for example] the collection of all classes (because of its 100%
distinct types). The same goes for the other dimension, if I'm not mistaken.

---------
Figures produced with:
---------
  | bag max |
  bag := Bag new.
  "tally, per AlignmentMorph instance, the number of distinct submorph classes"
  AlignmentMorph allInstances do: [:each |
   bag add: (each submorphs collect: [:object | object class]) asIdentitySet size].
        "Andreas' code follows"
  max := bag size.
  bag sortedCounts do: [:assoc |
   Transcript crtab; show: assoc key.
   Transcript show: ' (', ((100.0 * assoc key / max) truncateTo: 0.01) asString, '%): '.
   Transcript show: assoc value]
---------
AlignmentMorph seems to be max:
---------
  | morphInstances subtypeMorphs distinctTypes submorphs minTypes maxTypes  
outliners |
  morphInstances := subtypeMorphs := 0.
  minTypes := Smalltalk size. maxTypes := 0.
  distinctTypes := IdentitySet new: 1000.
  outliners := IdentitySet new: 100.
  Smalltalk garbageCollect; garbageCollect.
  "count # of submorphs containers and their distinct submorphs' type(s)"
  SystemNavigation default allObjectsDo: [:object |
        ((object respondsTo: #submorphs)
                and: [(submorphs := object submorphs) notNil and: [
                                submorphs isEmpty not]]) ifTrue: [
                "start from an empty set; removing elements while iterating over the same set is unsafe"
                distinctTypes := IdentitySet new.
                submorphs do: [:each | distinctTypes add: each class name].
                morphInstances := morphInstances + 1.
                subtypeMorphs := subtypeMorphs + (submorphs := distinctTypes size).
                minTypes := minTypes min: submorphs.
                maxTypes := maxTypes max: submorphs.
                submorphs >= 3
                        ifTrue: [outliners add: object class name].
                ]].
  "determine the popularity of morphs with most # of distinct submorphs"
  outliners := outliners asArray collect: [:each |
        each -> (SystemNavigation default allCallsOn: (distinctTypes := Smalltalk  
associationAt: each)) size
                        -> distinctTypes value instanceCount].
  ^ {morphInstances. subtypeMorphs. minTypes. maxTypes} , outliners asArray
---------


Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

David Griswold-3
In reply to this post by Klaus D. Witzel
Klaus,

There are three issues here:

1) You did *not* run it enough under Strongtalk to compile the benchmark, so you are measuring interpreted performance.  You need to run it until the performance speeds up and stabilizes.  When it is compiled, on my machine (Sonoma Pentium M 1.7GHz), Squeak 3.1 runs the benchmark in 60453 ms, and Strongtalk runs it in 22139 ms.  That's not the latest Squeak, but I doubt it has changed much.  I don't have a recent VisualWorks installed, but from my knowledge of how the various systems work, I would expect VisualWorks to be a bit faster than Strongtalk at this (very poor) microbenchmark, for reasons explained below.

2) Andreas Raab was right in his comments.  The performance you are measuring is *not* general Smalltalk performance, it is specifically the performance of megamorphic sends, which are one of the few cases where Strongtalk's type-feedback doesn't help at all. 

Here is how sends work in Strongtalk:

Monomorphic and slightly polymorphic sends (1 or 2 receiver classes at the send site) can be inlined, which is the common case (over 90% of sends fall in this category), and that is where Strongtalk can give you big speedups.  

Sends that have between 2 and 4 receiver classes are usually handled with a polymorphic inline cache (PIC), which is still a real dispatch and call, and is only slightly faster (if at all) than in other Smalltalks, since that is the most highly optimized piece of code in any normal Smalltalk implementation.  PICs are not primarily for optimization; their real role is to gather type information for the inlining compiler.  Note that VisualWorks now has PICs, so it uses the same technology for non-inlined sends as Strongtalk.   

Sends that have more than 4 receiver types, such as your micro-benchmark, can't even use PICs or any kind of inline cache, so these are a full megamorphic send in Strongtalk, which is implemented as an actual hashed lookup, which is the slowest case of all.  You might say that is what Smalltalk is all about, but in reality megamorphic sends are relatively rare as a percentage of sends.  Compilers aren't magic- no one can eliminate the fundamental computation that a truly megamorphic send has to do- it *has* to do some kind of real lookup, and a call, so the performance will naturally be similar across all Smalltalks.

Every Smalltalk has that overhead.  What Strongtalk does is eliminate that overhead when you don't really need it, when a send doesn't actually have many receiver classes.  That is what other Smalltalks can't do: they make you pay the cost of a dispatch and call all the time, even when you don't need it, which is the common case.

So your 'picBench' isn't even measuring PIC performance.

3) I would expect VisualWorks to be about the same speed or a bit faster than Strongtalk on this atypical benchmark because of several factors.  We have established that type-feedback doesn't help this benchmark, so from the point of view of sends, VisualWorks and Strongtalk would be doing basically the same kind of things.  The reason VisualWorks would probably be a bit faster on this benchmark is because it probably does array bounds-check elimination and maybe even loop unrolling, which aren't yet implemented in Strongtalk, and I'm sure aren't implemented in Squeak.  We did those in the Java VM, but hadn't yet gotten to that for Strongtalk; Strongtalk hasn't even really been tuned, and VisualWorks has been tuned for many years.  Your benchmark consists of a tight inner loop that does only two things: a megamorphic send, and an array lookup.  So the array bounds check and loop overhead are a significant factor, and if VisualWorks can optimize those, it would make a real difference. 

But once again, this is not even remotely typical Smalltalk code.  Array bounds-check elimination and loop unrolling are rarely applicable optimizations that generally only help when you have a very tight inner loop that does almost nothing, where the loop itself is a literal SmallInteger>>to:do: send, you are accessing an array, and the array access is literally embedded in the loop, not in a called method.  How much of your code really looks like that?
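
To make the megamorphic point concrete, here is a minimal sketch (Squeak/VW syntax, as in the original post) that rebuilds the benchmark's array and counts how many distinct receiver classes reach its single "yourself" call site. The variable name classes and the Transcript output are illustrative additions, not part of the original benchmark.

```smalltalk
"Sketch: count the distinct receiver classes seen at the benchmark's call site."
| base instances classes |
base := (Array
    with: OrderedCollection basicNew
    with: SequenceableCollection basicNew
    with: Collection basicNew
    with: Object basicNew) ,
    (Array
    with: Character space
    with: Date basicNew
    with: Time basicNew
    with: Magnitude basicNew).
"Same Thue-Morse-like construction as in the original post."
instances := OrderedCollection with: (base at: 1).
2 to: base size do: [:i |
    instances := instances , instances reverse.
    instances addLast: (base at: i)].
instances := (instances , instances reverse) asArray.
"Every element of instances is a receiver of #yourself at the one call site."
classes := (instances collect: [:each | each class]) asSet.
Transcript
    show: instances size printString; show: ' elements, ';
    show: classes size printString; show: ' receiver classes'; cr.
```

With the eight base objects above this should report 510 elements and 8 receiver classes, i.e. twice the 4-class limit described for PICs, which is why the send ends up as a full hashed lookup.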

-Dave


Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

Klaus D. Witzel
Thank you David for answering my question

>> Can somebody reproduce the figures, any other results? Have I done
>> something wrong?

and thank you also for the explanations. I understand that PICs in  
Strongtalk are [in the current incarnation] limited to 4 entries, that's  
good to know.

Just a minor adjustment: the #at: on the array was never in doubt, and the
integer loop was intentional because (I think) on all three systems it's
compiled away already at the bytecode level, and the #at: is expected to be
subsumed at the primitive level. I've seen walkbacks in Strongtalk in
which the source code #to:do: was inlined with #whileTrue sans block, like
in Squeak.

As to your figures, I will retry with a "warmer" image :)
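
A retry with a warmer image could look roughly like the sketch below: time the same expression several times in one session and watch whether the figures drop and then level off. This is only a sketch in Squeak/VW syntax; the number of runs (5) and the reduced repeat count are arbitrary assumptions, and whether and when Strongtalk compiles the code depends on its own heuristics and on the separately attached Strongtalk source, which is not shown in this thread.

```smalltalk
"Sketch: repeat the timed loop and compare successive timings."
| base instances repeats timings |
base := (Array
    with: OrderedCollection basicNew
    with: SequenceableCollection basicNew
    with: Collection basicNew
    with: Object basicNew) ,
    (Array
    with: Character space
    with: Date basicNew
    with: Time basicNew
    with: Magnitude basicNew).
instances := OrderedCollection with: (base at: 1).
2 to: base size do: [:i |
    instances := instances , instances reverse.
    instances addLast: (base at: i)].
instances := (instances , instances reverse) asArray.
"Assumption: far fewer repeats than the 1234567 of the original post,
 so a trial finishes quickly; restore that value for comparable figures."
repeats := 10000.
"Assumption: 5 runs; keep going until the timings stop improving."
timings := (1 to: 5) collect: [:run |
    Time millisecondsToRun: [
        repeats timesRepeat: [
            1 to: instances size do: [:i |
                (instances at: i) yourself]]]].
timings do: [:ms | Transcript show: ms printString; cr].
```

If the later entries are clearly lower than the first ones and then stay flat, the stable value is the one worth reporting.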

And I have nothing against people calling my test a poor benchmark. I
wanted to compare the performance at this particular level, and according
to your report even there Strongtalk [unoptimized at this level] is
close to VW. And no, I would never say that megamorphic sends are all that
Smalltalk is about.

Let me comment on this one

> ...  How much of your code really looks like that?

Well, at that level almost all users of collection #do: look like that. I  
just made the level below an O(1) constant, otherwise the polymorphic  
nature of "(array at: i) doSomethingPolymorphically" would perhaps have  
gone unnoticed.

Thanks again, very insightful.

/Klaus


Re: Thue-Morse and performance: Squeak v.s. Strongtalk v.s. VisualWorks

David Griswold-3
Hi Klaus,

On 12/17/06, Klaus D. Witzel <[hidden email]> wrote:
> Thank you David for answering my question
>
>>> Can somebody reproduce the figures, any other results? Have I done
>>> something wrong?
>
> and thank you also for the explanations. I understand that PICs in
> Strongtalk are [in the current incarnation] limited to 4 entries, that's
> good to know.
>
> Just a minor adjustment: the #at: on the array was never in doubt, and the
> integer loop was intentional because (I think) on all three systems it's
> compiled away already at the bytecode level, and the #at: is expected to be
> subsumed at the primitive level. I've seen walkbacks in Strongtalk in
> which the source code #to:do: was inlined with #whileTrue sans block, like
> in Squeak.

Yes, #to:do: is treated specially by the bytecode compiler, although it doesn't really have to be, since type-feedback would be able to inline and eliminate the block.  The only reason it is treated specially is so that it still runs reasonably fast in the interpreter, before methods are compiled, because it is so important in inner loops.  #at:, on the other hand, is not treated specially in Strongtalk, unlike in most other Smalltalks.

> As to your figures, I will retry with a "warmer" image :)
>
> And I have nothing against people calling my test a poor benchmark. I
> wanted to compare the performance at this particular level, and according
> to your report even there Strongtalk [unoptimized at this level] is
> close to VW. And no, I would never say that megamorphic sends are all that
> Smalltalk is about.
>
> Let me comment on this one
>
>> ...  How much of your code really looks like that?
>
> Well, at that level almost all users of collection #do: look like that. I
> just made the level below an O(1) constant, otherwise the polymorphic
> nature of "(array at: i) doSomethingPolymorphically" would perhaps have
> gone unnoticed.

#do: loops are significantly different, because 1) they are not treated specially by the bytecode compiler, so there is a real block and usually a closure in most Smalltalks, 2) the implementation of #do:, which is where the inner loop might be, does not literally contain the body of the loop, so loop unrolling can't be applied by a non-inlining Smalltalk.  Array bounds-check elimination might apply, but when the loop contains more than a few sends (including the additional Block>>value: send), the benefits rapidly become minor.

So in fact, a #do: benchmark (with a block that needs a closure, since all real #do: sends need a closure) would be a much better benchmark, because it's the way people actually write code, and sure enough Strongtalk can both inline the #do: implementation, and inline the block into the loop, so it would show much bigger advantages compared to other Smalltalks.  And even that would understate the potential Strongtalk advantage, because if the compiler was tuned, it would be able to do bounds-check elimination and loop unrolling even for #do:, because it can inline the block, whereas VisualWorks would never be able to.
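
For illustration, a #do: variant of the benchmark might be sketched as follows (Squeak/VW syntax). It is not a benchmark anyone in this thread ran: the count variable exists only so that the block refers to a temporary of the enclosing scope, which is one reading of "a block that needs a closure", and the repeat count is deliberately smaller than the original 1234567.

```smalltalk
"Sketch: the same polymorphic send, driven by #do: with a block that captures a temporary."
| base instances repeats count elapsed |
base := (Array
    with: OrderedCollection basicNew
    with: SequenceableCollection basicNew
    with: Collection basicNew
    with: Object basicNew) ,
    (Array
    with: Character space
    with: Date basicNew
    with: Time basicNew
    with: Magnitude basicNew).
instances := OrderedCollection with: (base at: 1).
2 to: base size do: [:i |
    instances := instances , instances reverse.
    instances addLast: (base at: i)].
instances := (instances , instances reverse) asArray.
"Assumption: reduced repeat count for a quick trial."
repeats := 10000.
count := 0.
elapsed := Time millisecondsToRun: [
    repeats timesRepeat: [
        instances do: [:each |
            each yourself.
            "Writing to count forces the block to reference its enclosing scope."
            count := count + 1]]].
Transcript
    show: elapsed printString; show: ' ms for ';
    show: count printString; show: ' sends'; cr.
```

Note that the receiver mix at the "yourself" send is unchanged, so this sketch still exercises a megamorphic send; what it adds is the #do: and block overhead that an inlining compiler can remove.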

Cheers,
Dave
