Playing with Exupery


Playing with Exupery

Göran Krampe
Hi folks!

Just stumbled on this little tiny benchmark:

   http://butunclebob.com/ArticleS.UncleBob.SpeedOfJavaCppRuby

As you can see Squeak is referenced a bit down in the comments with code:

"Java is the winner, for 2100000 it spends 250 msec on average. Ruby 6500
msec on average. Squeak Smalltalk 2269 msec, VisualWorks Smalltalk 460
msec."

I threw the same code into my Gjallar 3.8 image using the Exupery VM for
Win32 and cobbled together this (excluding the actual #runOn: methods):
----------
bench
        "self bench"

        Transcript show: 'Normal:', ([self runOn: 2100000] timeToRun) asString; cr.

        Exupery initialiseExupery.
        ExuperyProfiler optimise: [self runOnX: 2100000].
        Transcript show: 'ExuperyOptimised:', ([self runOnX: 2100000] timeToRun) asString; cr.

        Exupery dynamicallyInline.
        Transcript show: 'ExuperyOptimisedInlined:', ([self runOnX: 2100000] timeToRun) asString; cr.
------------------------
Which gave this (three runs):

Normal:2698
ExuperyOptimised:5217
ExuperyOptimisedInlined:1807
Normal:2666
ExuperyOptimised:5030
ExuperyOptimisedInlined:1812
Normal:2672
ExuperyOptimised:5182
ExuperyOptimisedInlined:1770

Note: #runOnX: is exactly the same as #runOn:, I just didn't want to get
Exupery mixed up with "normal" code.
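For reference, the benchmark in the linked article is essentially a prime-counting loop (trial division with a square-root bound, going by the code snippets in its comments). Here's a rough sketch of that shape in Python for illustration; this is my reconstruction, not the actual #runOn: code, which I'm leaving out:

```python
import math

def count_primes(limit):
    """Count primes below limit by trial division up to sqrt(n) --
    roughly the shape of the benchmark in the linked article."""
    count = 0
    for n in range(2, limit):
        is_prime = True
        for d in range(2, math.isqrt(n) + 1):
            if n % d == 0:
                is_prime = False
                break
        if is_prime:
            count += 1
    return count
```

The article's figure of 2100000 is the loop bound, so the inner division loop is where all the time goes.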

So.... well, evidently Exupery manages to do some inlining here :) and the
end result is roughly 32% faster than interpreted. I did look at the
exupery.log file but couldn't make much out of it.

I also played around a little bit with the constants in #optimise but that
only made things *worse* - possibly it decided to compile more and then
didn't manage to inline? Have no idea. :)

regards, Göran

PS. This was using Exupery 0.10 - I see there is a 0.11 on SM... Darn! :)



Re: Playing with Exupery

Andres Valloud
Hello Göran et al,

Friday, February 16, 2007, 8:20:30 AM, you wrote:

GK> "Java is the winner, for 2100000 it spends 250 msec on average.
GK> Ruby 6500 msec on average. Squeak Smalltalk 2269 msec, VisualWorks
GK> Smalltalk 460 msec."

I found the webpage a bit confusing.  For example, the original code
snippets calculate square roots while the later ones don't.

Nevertheless, I found the Smalltalk code worthy of improvement.
For example, switching to use byte arrays and using 0 and 1 as
booleans results in a 25% speedup in VW.  The scaled times would be
250ms for Java and 370ms for VW.
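The same trick, sketched in Python for illustration (not the actual VW code): a byte array of 0/1 flags avoids allocating and dereferencing Boolean objects, which is where the speedup comes from.

```python
import math

def sieve_bytearray(limit):
    """Prime sieve using a bytearray of 0/1 flags instead of an
    array of Boolean objects -- the byte-array trick described above."""
    flags = bytearray([1]) * limit   # 1 = "maybe prime", 0 = "composite"
    flags[0:2] = b'\x00\x00'         # 0 and 1 are not prime
    for p in range(2, math.isqrt(limit) + 1):
        if flags[p]:
            # clear all multiples of p starting from p*p
            flags[p * p::p] = bytearray(len(flags[p * p::p]))
    return sum(flags)                # count of primes below limit
```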

Something important to note: not setting a single threaded program to
run exclusively on one of multiple available cores can result in
severe performance degradation under Windows XP (anywhere from no
degradation up to 2x slower).  The degradation itself is not stable
either, as it swings from moderate to really bad from one second to
the next.

This is something we're investigating in the context of VW, and it
seems to be related to Windows' process scheduler migrating
single-threaded programs between cores, leading to CPU cache
thrashing.  Other single-threaded programs such as WinRAR exhibit the
same behavior.

I do not have access to an AMD dual core CPU, would somebody be able
to run some tests with and without core affinity set?
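Setting affinity for such a test looks roughly like this sketch (Python for illustration; `pin_to_core` is my hypothetical helper, and this uses the Linux API, since on Windows the equivalent Win32 call is SetProcessAffinityMask):

```python
import os

def pin_to_core(core):
    """Pin the current process to a single CPU core so the OS scheduler
    cannot migrate it between cores (Linux API; on Windows the
    corresponding Win32 call is SetProcessAffinityMask)."""
    os.sched_setaffinity(0, {core})  # pid 0 = the current process

pin_to_core(0)
print(os.sched_getaffinity(0))  # the process is now restricted to core 0
```

Running the benchmark once with and once without the pin should show whether the scheduler-migration effect is present on a given machine.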

The "on average" description of run times may have something to do
with this, and to me it casts a bit of doubt as to the precision of
the results.

We also observed that running two images on a dual core CPU, with each
image set to run on one of the cores, results in both images running
faster than a single image with core affinity by about 10%.

As you can see, further research is required.  If anybody has good
pointers on this, I'd appreciate a link.

--
Best regards,
 Andres                            mailto:[hidden email]



Playing with Exupery

Bryce Kampjes
In reply to this post by Göran Krampe
Göran Krampe writes:
 > Hi folks!
 >
 > Just stumbled on this little tiny benchmark:
 >
 >    http://butunclebob.com/ArticleS.UncleBob.SpeedOfJavaCppRuby

Interesting. For what it's worth, Exupery should be as fast as or
slightly faster than VW for that benchmark. I may have lost a little
time recently, but there's plenty of room for a little tuning to make
it back up.

First I had a look at what Exupery's doing with:

   tail -f Exupery.log | grep -v block

Which shows the following:

   9:38:21 pm: Initializing the Code Cache
   9:38:24 pm: Compiling SmallInteger>>to:by:do: inlining #()
   9:38:25 pm: Compiling WBKToys>>runOn: inlining #()
   9:38:29 pm: Compiling SmallInteger>>to:by:do: inlining {{61 . ExuperyBlockContext}}
   9:38:29 pm: Failed to inline ExuperyBlockContext>>value:
   9:38:29 pm: Compiling WBKToys>>runOn: inlining {{33 . Array} . {44 . Array} . {61 . Array}}

What's interesting is the #to:by:do:. If you look at the bytecodes,
the send is still there. If I wrote a compiled version of
ExuperyBlockContext>>value: then even with the current bytecodes it
should improve noticeably. Only compiled primitives can be inlined at
the moment.

Here's Exupery's benchmark suite:

   arithmaticLoopBenchmark 1387 compiled 127 ratio: 10.920
   bytecodeBenchmark 2139 compiled 484 ratio: 4.419
   sendBenchmark 1582 compiled 728 ratio: 2.173
   doLoopsBenchmark 1063 compiled 843 ratio: 1.261
   pointCreation 1075 compiled 1030 ratio: 1.044
   largeExplorers 585 compiled 595 ratio: 0.983
   compilerBenchmark 474 compiled 454 ratio: 1.044
   Cumulative Time 1058.337 compiled 521.541 ratio 2.028

The bytecode benchmark is a prime number sieve similar to what you
were using, though coded to avoid sends and real blocks.

Bryce