Thanks Eliot, Stef and All for good answers on the Cog VM inlining and
implementation details. That is also exciting news about Eliot and
MarkusD's work on an adaptive optimizer. By "completely close the gap
to C" how close do you mean? The StrongTalk work posits a theoretical
best of about 2x, so I suppose you mean in that general 2x-5x
neighborhood vs machine code?

The 20x-slower macro algorithm I mentioned is too large to post.
Below is a simple micro-loop I made up to explore some aspects of
closures, loops, and relative VM timings. The C-equivalent machine
code takes 400 msec on my test machine. In Squeak/Pharo, I see a huge
range of speed factors with this, from 7x to 220x (!!) slower than
machine code, on current and near-history VMs (Cog and standard
Squeak).

Consider:

    "A = to:by:do: loop"
    benchLooperA
        | sum |
        sum := 1e8.
        sum to: 1 by: -1 do: [:x |
            sum := (sum bitShift: -1) + x].
        ^sum

    "B = timesRepeat: loop"
    benchLooperB
        | sum count |
        count := sum := 1e8.
        sum timesRepeat: [
            sum := (sum bitShift: -1) + count.
            count := count - 1].
        ^sum

A and B compute the same result, one using #to:by:do:, the other
#timesRepeat:. The near-machine code for this computation runs in
about 400 msec (0.4 sec) on my test system (x86 Windows 7, N450 Atom
CPU, 1.66 GHz).

On the following 3 test VM/images:

1. Squeak 4.1 stock (vm = Squeak 4.0.2, 2010apr3)
2. Pharo 1.1.1 stock (vm = Cog 2010sep21)
3. Pharo 1.2rc2 + Cog 2011feb6

I get the following times for

    A: Time millisecondsToRun: [Sandbox benchLooperA]
    B: Time millisecondsToRun: [Sandbox benchLooperB]

(A, B) execution times rounded to seconds:

1. (24, 88)
2. (34, 11)
3. (3, 8)

Here we see a 3x-4x difference from A to B, with an anomaly that Pharo
1.1.1 is actually much *slower* on #to:by:do: (is it the closure bug?
see below). (Also we see an 8x-11x speedup with the latest
Cog 2011feb6, great!)

Now here is something odd -- if I invoke the loops via a workspace
DoIt, the timing changes in Pharo 1.1.1, like this:

    Time millisecondsToRun: [
        | sum |
        sum := 1e8.
        sum to: 1 by: -1 do: [:x |
            sum := (sum bitShift: -1) + x].
        sum]

(highlight and DoIt)

(A, B) invoked from workspace, time in seconds:

1. (24, 88)
2. (5, 11)
3. (3, 8)

The Pharo 1.1.1 timing anomaly for case 2A disappeared - it's now 5
seconds instead of 34.

So, 4 questions:

* What causes the Pharo 1.1.1 timing anomaly for #to:by:do: invoked
via workspace DoIt vs in the #benchLooper method (5 sec vs 34 sec)?

* Why is the #timesRepeat: loop 3x-4x slower than #to:by:do:? Is that
simply the difference between inlined and non-inlined methods, or are
other factors involved?

* Even the best Cog time here is 7x slower than machine code (and 20x
slower for the timesRepeat: case), whereas the latest Cog runs fib: at
3x. What factors make this micro-benchmark loop not as optimizable as
fib: for Cog?

* Why is the stock Squeak 4.1 VM **220x slower** than machine code
for case B, the timesRepeat: loop (88 sec vs 0.4 sec)?

Thanks to everyone for insights and comments.

-- jbthiel
On Thu, Feb 24, 2011 at 2:13 PM, John B Thiel <[hidden email]> wrote:
> Thanks Eliot, Stef and All for good answers on the Cog VM inlining and
> implementation details. That is also exciting news about Eliot and
> MarkusD's work on an adaptive optimizer. By "completely close the gap
> to C" how close do you mean? The StrongTalk work posits a theoretical
> best of about 2x, so I suppose you mean in that general 2x-5x
> neighborhood vs machine code?
>
> The 20x-slower macro algorithm I mentioned is too large to post.
> Below is a simple micro-loop I made up to explore some aspects of
> closures, loops, and relative VM timings. The C-equivalent machine
> code takes 400 msec on my test machine. In Squeak/Pharo, I see a huge
> range of speed factors with this, from 7x to 220x (!!) slower than
> machine code, on current and near-history VMs (Cog and standard
> Squeak).
>
> Consider:
>
>     "A = to:by:do: loop"
>     benchLooperA
>         | sum |
>         sum := 1e8.
>         sum to: 1 by: -1 do: [:x |
>             sum := (sum bitShift: -1) + x].
>         ^sum
>
>     "B = timesRepeat: loop"
>     benchLooperB
>         | sum count |
>         count := sum := 1e8.
>         sum timesRepeat: [
>             sum := (sum bitShift: -1) + count.
>             count := count - 1].
>         ^sum
>
> A and B compute the same result, one using #to:by:do:, the other
> #timesRepeat:. The near-machine code for this computation runs in
> about 400 msec (0.4 sec) on my test system (x86 Windows 7, N450 Atom
> CPU, 1.66 GHz).
>
> On the following 3 test VM/images:
>
> 1. Squeak 4.1 stock (vm = Squeak 4.0.2, 2010apr3)
> 2. Pharo 1.1.1 stock (vm = Cog 2010sep21)
> 3. Pharo 1.2rc2 + Cog 2011feb6
>
> I get the following times for
>
>     A: Time millisecondsToRun: [Sandbox benchLooperA]
>     B: Time millisecondsToRun: [Sandbox benchLooperB]
>
> (A, B) execution times rounded to seconds:
>
> 1. (24, 88)
> 2. (34, 11)
> 3. (3, 8)
>
> Here we see a 3x-4x difference from A to B, with an anomaly that Pharo
> 1.1.1 is actually much *slower* on #to:by:do: (is it the closure bug?
> see below). (Also we see an 8x-11x speedup with the latest
> Cog 2011feb6, great!)
>
> Now here is something odd -- if I invoke the loops via a workspace
> DoIt, the timing changes in Pharo 1.1.1, like this:
>
>     Time millisecondsToRun: [
>         | sum |
>         sum := 1e8.
>         sum to: 1 by: -1 do: [:x |
>             sum := (sum bitShift: -1) + x].
>         sum]
>
> (highlight and DoIt)
>
> (A, B) invoked from workspace, time in seconds:
>
> 1. (24, 88)
> 2. (5, 11)
> 3. (3, 8)
>
> The Pharo 1.1.1 timing anomaly for case 2A disappeared - it's now 5
> seconds instead of 34.
>
> So, 4 questions:
>
> * What causes the Pharo 1.1.1 timing anomaly for #to:by:do: invoked
> via workspace DoIt vs in the #benchLooper method (5 sec vs 34 sec)?

Cog has a policy in the VM to avoid compiling rarely used methods. It
will not compile a method to machine code unless

a) it is found in the first-level method lookup cache, or
b) it is evaluated via withArgs:executeMethod: (which is how DoIts are
now evaluated).

a) prevents Cog from wasting time compiling large methods that are
only run once, e.g. class initializers, where compiling the method is
likely to take much longer than simply interpreting it. This is a
simple heuristic but it seems to work well. To avoid its effects when
benchmarking, simply arrange to run all code at least twice and ignore
the first run (see the sketch below).

b) means that code in a workspace, which can only ever be run once,
runs at Cog speed instead of always being interpreted.

> * Why is the #timesRepeat: loop 3x-4x slower than #to:by:do:? Is that
> simply the difference between inlined and non-inlined methods, or are
> other factors involved?

Yes, it is due to the benefits of inlining blocks.
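To spell that out (a paraphrase from memory; check Integer>>timesRepeat:
in your image for the exact code): the bytecode compiler inlines
#to:by:do: when it is written with a literal step and a literal block,
as in benchLooperA, so that loop becomes plain compare-and-jump
bytecodes with no block activation at all. #timesRepeat: is an
ordinary message send, implemented along these lines:

    timesRepeat: aBlock
        "Evaluate aBlock the number of times represented by the receiver."
        | count |
        count := 1.
        [count <= self] whileTrue: [
            aBlock value.        "a full block activation on every iteration"
            count := count + 1]

The [...] whileTrue: [...] here is itself inlined, but the 'aBlock
value' send cannot be, so each of benchLooperB's 1e8 iterations pays
for a block activation on top of the loop body (and benchLooperB also
maintains its own count temporary).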
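As for the run-all-code-twice advice above, a minimal pattern (reusing
Sandbox and benchLooperA from your post) would be:

    Sandbox benchLooperA.                            "1st run: interpreted; fills the lookup cache"
    Time millisecondsToRun: [Sandbox benchLooperA]   "2nd run: times the jitted code"

By the second send the method is found in the first-level lookup cache
and has been compiled to machine code, so the timing reflects
Cog-compiled execution rather than interpretation.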
> * Even the best Cog time here is 7x slower than machine code (and 20x
> slower for the timesRepeat: case), whereas the latest Cog runs fib: at
> 3x. What factors make this micro-benchmark loop not as optimizable as
> fib: for Cog?

It is not that the loop is not as optimizable; it is that Cog does not
(currently) do aggressive optimization. In particular, the temporary
variables remain temporary variables on the stack and require memory
reads and writes to access, whereas the C compiler is very probably
keeping these variables in registers. The plan is for Cog to do
register allocation as part of Sista, i.e. when presented with
optimized methods that contain hints telling it to do register
allocation.

> * Why is the stock Squeak 4.1 VM **220x slower** than machine code
> for case B, the timesRepeat: loop (88 sec vs 0.4 sec)?

Because interpreters are slow.

> Thanks to everyone for insights and comments.
>
> -- jbthiel