Re: Cog VM -- Thanks and Performance / Optimization Questions [Micro-Bench Loop]

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: Cog VM -- Thanks and Performance / Optimization Questions [Micro-Bench Loop]

John B Thiel
Thanks Eliot, Stef and All for good answers on the Cog VM inlining and
implementation details.  That is also exciting news about Eliot and
MarkusD's work on an adaptive optimizer. By "completely close the gap
to C"  how close do you mean? The StrongTalk work posits a theoretical
best of about 2x, so I suppose you mean in that general 2x-5x
neighborhood vs machine code... ?

The 20x slower macro algorithm I mentioned is too large to post.
Below is a simple micro-loop I made up to explore some aspects of
closures, loops and relative VM timings.  The C-equivalent machine
code takes 400msec on my test machine.  In Squeak/Pharo, I see a huge
range of speed factors with this, from 7x to 220x (!!) slower than
machine code, on current and near-history VMs (Cog and standard
Squeak).

Consider:

"A = to:by:do:  loop"
benchLooperA
|sum|
  sum:=1e8.
        sum to: 1 by: -1 do: [:x |
                sum := (sum bitShift: -1) + x ] .
        ^sum

"B = timesRepeat loop"
benchLooperB
|sum count|
  count := sum:=1e8.
        sum timesRepeat:  [
                sum := (sum bitShift: -1) + count.
                count := count - 1 ] .
        ^sum


A and B compute the same except one using #to:by:do:,  the other
#timesRepeat.  The near-machine code for this computation runs about
400msec  (0.4 sec) on my test system. (x86 Windows 7, N450 Atom cpu,
1.66GHz)

On the following 3 test VM/Images,     :

1.  Squeak 4.1 stock (vm = Squeak 4.0.2, 2010apr3)
2.  Pharo 1.1.1 stock (vm = Cog 2010sep21)
3.  Pharo 1.2rc2 + Cog2011feb6

I get the following times for
   A   Time millisecondsToRun: [ Sandbox benchLooperA ]
   B   Time millisecondsToRun: [ Sandbox benchLooperB ]

(A, B) execution times rounded to seconds

1.   (24, 88)
2.   (34, 11)
3.   (3, 8)


Here we see a 3x-4x difference A to B,  with an anomoly that Pharo
1.1.1 is actually much *slower* on #to:by:do:   (is it the closure bug
(?), see below).  (Also we see 8x-11x speedup with the latest
Cog-2011feb6, great!)

Now here is something odd -- if I invoke the loops via workspace DoIt,
the timing changes in Pharo 1.1.1,  like this:

Time millisecondsToRun: [
|sum|
  sum:=1e8.
        sum to: 1 by: -1 do: [:x |
                sum := (sum bitShift: -1) + x ] .
        sum ]
       
(highlight and DoIt)

(A, B) invoked from workspace, time in seconds

1.  (24, 88)
2.  (5, 11)
3.  (3, 8)


The Pharo 1.1.1 timing anomaly for case 2A disappeared - it's now 5
seconds instead of 34.

So, 4 questions:

* What causes the Pharo 1.1.1. anomaly timing difference for #to:by:do
invoked via workspace DoIt vs in the #benchLooper method  (5 sec vs 34
sec)  ?

* Why is the #timesRepeat loop 3x-4x slower than #to:by:do ?  Is that
simply the difference between inlined and non-inlined methods, or
other factors?

* Even the best Cog time here is 7x slower vs machine-code (and 20x
slower for the timesRepeat case), whereas latest Cog fib: runs at 3x.
What factors make this micro-bench loop not as optimizable as fib: for
Cog?

* Why is the stock Squeak 4.1 VM  **220x slower** than machine code
for case B = timesRepeat loop,  (88 sec vs 0.4 sec) ?


Thanks to everyone for insights and comments.

-- jbthiel

Reply | Threaded
Open this post in threaded view
|

Re: Cog VM -- Thanks and Performance / Optimization Questions [Micro-Bench Loop]

Eliot Miranda-2
On Thu, Feb 24, 2011 at 2:13 PM, John B Thiel <[hidden email]> wrote:

> Thanks Eliot, Stef and All for good answers on the Cog VM inlining and
> implementation details.  That is also exciting news about Eliot and
> MarkusD's work on an adaptive optimizer. By "completely close the gap
> to C"  how close do you mean? The StrongTalk work posits a theoretical
> best of about 2x, so I suppose you mean in that general 2x-5x
> neighborhood vs machine code... ?
>
> The 20x slower macro algorithm I mentioned is too large to post.
> Below is a simple micro-loop I made up to explore some aspects of
> closures, loops and relative VM timings.  The C-equivalent machine
> code takes 400msec on my test machine.  In Squeak/Pharo, I see a huge
> range of speed factors with this, from 7x to 220x (!!) slower than
> machine code, on current and near-history VMs (Cog and standard
> Squeak).
>
> Consider:
>
> "A = to:by:do:  loop"
> benchLooperA
> |sum|
>        sum:=1e8.
>        sum to: 1 by: -1 do: [:x |
>                sum := (sum bitShift: -1) + x ] .
>        ^sum
>
> "B = timesRepeat loop"
> benchLooperB
> |sum count|
>        count := sum:=1e8.
>        sum timesRepeat:  [
>                sum := (sum bitShift: -1) + count.
>                count := count - 1 ] .
>        ^sum
>
>
> A and B compute the same except one using #to:by:do:,  the other
> #timesRepeat.  The near-machine code for this computation runs about
> 400msec  (0.4 sec) on my test system. (x86 Windows 7, N450 Atom cpu,
> 1.66GHz)
>
> On the following 3 test VM/Images,     :
>
> 1.  Squeak 4.1 stock (vm = Squeak 4.0.2, 2010apr3)
> 2.  Pharo 1.1.1 stock (vm = Cog 2010sep21)
> 3.  Pharo 1.2rc2 + Cog2011feb6
>
> I get the following times for
>   A   Time millisecondsToRun: [ Sandbox benchLooperA ]
>   B   Time millisecondsToRun: [ Sandbox benchLooperB ]
>
> (A, B) execution times rounded to seconds
>
> 1.   (24, 88)
> 2.   (34, 11)
> 3.   (3, 8)
>
>
> Here we see a 3x-4x difference A to B,  with an anomoly that Pharo
> 1.1.1 is actually much *slower* on #to:by:do:   (is it the closure bug
> (?), see below).  (Also we see 8x-11x speedup with the latest
> Cog-2011feb6, great!)
>
> Now here is something odd -- if I invoke the loops via workspace DoIt,
> the timing changes in Pharo 1.1.1,  like this:
>
> Time millisecondsToRun: [
> |sum|
>        sum:=1e8.
>        sum to: 1 by: -1 do: [:x |
>                sum := (sum bitShift: -1) + x ] .
>        sum     ]
>
> (highlight and DoIt)
>
> (A, B) invoked from workspace, time in seconds
>
> 1.  (24, 88)
> 2.  (5, 11)
> 3.  (3, 8)
>
>
> The Pharo 1.1.1 timing anomaly for case 2A disappeared - it's now 5
> seconds instead of 34.
>
> So, 4 questions:
>
> * What causes the Pharo 1.1.1. anomaly timing difference for #to:by:do
> invoked via workspace DoIt vs in the #benchLooper method  (5 sec vs 34
> sec)  ?

Cog has a policy in the VM to avoid compilation for rarely used
methods.  It will not compile a method to machine code unless
a) it is found in the first level method lookup cache
b) it is evaluated via withArgs:executeMethod: (which is how Doits are
now evaluated)

a) prevents Cog wasting time compiling large methods that are only run
once, e.g. class initializers, where compiling the method is likely to
take much longer than simply interpreting it.  This si a simple
heuristic but seems to work well.  To avoid its effects when
benchmarking simply arrange to run all code at least twice, ignoring
the first run.

b) means that code in a workspace, which can only ever be run once,
runs at Cog speed instead of always being interpreted.


> * Why is the #timesRepeat loop 3x-4x slower than #to:by:do ?  Is that
> simply the difference between inlined and non-inlined methods, or
> other factors?

Yes, it is due to the benefits of inlining blocks.

> * Even the best Cog time here is 7x slower vs machine-code (and 20x
> slower for the timesRepeat case), whereas latest Cog fib: runs at 3x.
> What factors make this micro-bench loop not as optimizable as fib: for
> Cog?

It is not that it is not as optimizable, it is that Cog does not
(currently) do aggressive optimization.  In particular the temporary
variables remain temporary variables on the stack and require reads
and writes to access, whereas the C compiler is very probably putting
these variables in registers.  The plan is for Cog to do register
allocation as part of Sista, i.e. when presented with optimized
methods that contain hints that it should do register allocation.

>
> * Why is the stock Squeak 4.1 VM  **220x slower** than machine code
> for case B = timesRepeat loop,  (88 sec vs 0.4 sec) ?

Because interpreters are slow.

>
>
> Thanks to everyone for insights and comments.
>
> -- jbthiel
>
>