Cog performance

9 messages

Cog performance

Levente Uzonyi-2
Hi,

I was curious how much speedup Cog gives when the code has only a few
message sends, so I ran the following "benchmark":

| s1 s2 |
Smalltalk garbageCollect.
s1 := String streamContents: [ :stream |
  1000 timesRepeat: [
  'aab' do: [ :e | stream nextPut: e; cr ] ] ].
s2 := String streamContents: [ :stream |
  1000 timesRepeat: [
  'abb' do: [ :e | stream nextPut: e; cr ] ] ].
[ TextDiffBuilder from: s1 to: s2 ] timeToRun.

The above pattern makes TextDiffBuilder >> #lcsFor:and: run for a while.
My results are a bit surprising:
CogVM: 2914
SqueakVM: 1900

MessageTally shows that CogVM's garbage collector is a bit better, but
that it runs the code slower than SqueakVM does (I wonder if MessageTally
is accurate with Cog at all):

CogVM:
**Leaves**
60.6% {1886ms} TextDiffBuilder>>lcsFor:and:
36.2% {1127ms} DiffElement>>=
1.8% {56ms} ByteString(String)>>=

**GCs**
  full 1 totalling 153ms (5.0% uptime), avg 153.0ms
  incr 21 totalling 76ms (2.0% uptime), avg 4.0ms
  tenures 13 (avg 1 GCs/tenure)
  root table 0 overflows

SqueakVM:
**Leaves**
46.8% {888ms} TextDiffBuilder>>lcsFor:and:
35.3% {670ms} DiffElement>>=
9.8% {186ms} ByteString(String)>>compare:with:collated:
6.9% {131ms} ByteString(String)>>=

**GCs**
  full 3 totalling 254ms (13.0% uptime), avg 85.0ms
  incr 301 totalling 110ms (6.0% uptime), avg 0.0ms
  tenures 272 (avg 1 GCs/tenure)
  root table 0 overflows

Is Cog slower because #to:do: loops are not optimized, or is there some
other reason for the slowdown?
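(For reference, #lcsFor:and: presumably computes a longest common subsequence of the two line sequences. A rough Python sketch of that kind of computation — not TextDiffBuilder's actual Smalltalk code — is:)

```python
# Illustrative line-based LCS, the kind of O(n*m) dynamic-programming
# computation the benchmark above exercises.  Not TextDiffBuilder's code.
def lcs_lines(a, b):
    n, m = len(a), len(b)
    # table[i][j] = length of LCS of a[:i] and b[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Recover one LCS by walking back through the table.
    lcs = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            lcs.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return lcs[::-1]

s1 = ['a', 'a', 'b'] * 4   # stands in for the repeated 'aab' lines
s2 = ['a', 'b', 'b'] * 4   # stands in for the repeated 'abb' lines
print(len(lcs_lines(s1, s2)))  # → 8
```

The quadratic table fill means the inner element comparisons run millions of times, so even small per-send costs show up strongly.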


Levente


Re: Cog performance

Eliot Miranda-2
Hi Levente,

On Tue, Jun 22, 2010 at 2:28 PM, Levente Uzonyi <[hidden email]> wrote:
<snip>
Is Cog slower because #to:do: loops are not optimized, or is there some other reason for the slowdown?

I can't say for sure without profiling (you'll find a good VM profiler QVMProfiler in the image in the tarball, which as yet works on MacOS only).
But I expect that the reason is the cost of invoking interpreter primitives from machine code.  Cog only implements a few primitives in machine code (arithmetic, at: & block value) and for all others (e.g. nextPut: above) it executes the interpreter primitives.  lcsFor:and: uses at:put: heavily and Cog is using the interpreter version.  But the cost of invoking an interpreter primitive from machine code is higher than invoking it from the interpreter because of the system-call-like glue between the machine-code stack pages and the C stack on which the interpreter primitive runs.

Three primitives that are currently interpreter primitives but must be implemented in machine code for better performance are new/basicNew, new:/basicNew: and at:put:.  I've avoided implementing these in machine code because the object representation is so complex and am instead about to start work on a simpler object representation.  When I have that I'll implement these primitives and then the speed difference should tilt the other way.

Of course if anyone would like to implement these in the context of the current object representation be my guest and report back asap...

best
Eliot



Re: Cog performance

Levente Uzonyi-2
On Tue, 22 Jun 2010, Eliot Miranda wrote:

<snip>

> I can't say for sure without profiling (you'll find a good VM profiler
> QVMProfiler in the image in the tarball, which as yet works on MacOS only).

This looks promising; I (or someone else :)) just have to implement
#primitiveExecutableModulesAndOffsets under win32 (and un*x), but that
doesn't seem to be easy (at least the win32 part).

> But I expect that the reason is the cost of invoking interpreter primitives
> from machine code.  Cog only implements a few primitives in machine code
> (arithmetic, at: & block value) and for all others (e.g. nextPut: above) it
> executes the interpreter primitives.  lcsFor:and: uses at:put: heavily and
> Cog is using the interpreter version.  But the cost of invoking an
> interpreter primitive from machine code is higher than invoking it from the
> interpreter because of the system-call-like glue between the machine-code
> stack pages and the C stack on which the interpreter primitive runs.
>
> Three primitives that are currently interpreter primitives but must be
> implemented in machine code for better performance are new/basicNew,
> new:/basicNew: and at:put:.  I've avoided implementing these in machine code
> because the object representation is so complex and am instead about to
> start work on a simpler object representation.  When I have that I'll
> implement these primitives and then the speed difference should tilt the
> other way.

This sounds reasonable. #lcsFor:and: uses #at:put: twice in the inner
loop. One of them (lcss at: max + k + 1 put: lcs) can be eliminated
without affecting the computation, because it only stores the results.
With only one #at:put: left, the benchmark took 2423ms, which is still a
bit too high. I think only the profiler can help here.

Btw, is MessageTally less accurate with CogVM than with the SqueakVM?


Levente

>
> Of course if anyone would like to implement these in the context of the
> current object representation be my guest and report back asap...
>
> best
> Eliot
>
>>
>>
>> Levente
>>
>>
>


Re: Cog performance

Eliot Miranda-2


On Thu, Jun 24, 2010 at 3:19 PM, Levente Uzonyi <[hidden email]> wrote:
On Tue, 22 Jun 2010, Eliot Miranda wrote:

<snip>


I can't say for sure without profiling (you'll find a good VM profiler
QVMProfiler in the image in the tarball, which as yet works on MacOS only).

This looks promising, I (or someone else :)) just have to implement #primitiveExecutableModulesAndOffsets under win32 (and un*x), but that doesn't seem to be easy (at least the win32 part).

If you look at platforms/win32/vm/sqWin32Backtrace.c you'll find code that extracts symbols from dlls for constructing a symbolic backtrace on crashes.  The code also uses a Teleplace.map file generated by the VM makefile which contains the symbols for the VM.  From this code you ought to be able to implement a QVMProfilerWin32SymbolsManager almost entirely out of primitives.


<snip>

Btw, is MessageTally less accurate with CogVM than with the SqueakVM?

I'm not sure.  We use a variant written by Andreas that is more accurate than MessageTally but that may use different plumbing.  

best
Eliot




Re: Cog performance

Levente Uzonyi-2
On Thu, 24 Jun 2010, Eliot Miranda wrote:

> <snip>
> If you look at platforms/win32/vm/sqWin32Backtrace.c you'll find code that
> extracts symbols from dlls for constructing a symbolic backtrace on crashes.
> The code also uses a Teleplace.map file generated by the VM makefile which
> contains the symbols for the VM.  From this code you ought to be able to
> implement a QVMProfilerWin32SymbolsManager almost entirely out of
> primitives.

Thanks for the pointers. I took the "easy" way and did pretty much the
same as you did for the Mac: a primitive lists the loaded modules, and
the symbols are extracted by external programs (nm, or the Dependency
Walker if nm didn't work) via OSProcess (actually ProcessWrapper, because
OSProcess doesn't work on Windows). So now I have a version of the
QVMProfiler which seems to be working on win32.
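(The symbol-extraction step described above amounts to parsing nm's output into offset/name pairs so the image-side profiler can map PCs to VM function names. A hypothetical Python sketch — the real plugin and class names may differ:)

```python
# Hypothetical helper: parse `nm` output (address, type letter, name per
# line) into sorted (offset, symbol) pairs.  Sample data is illustrative.
def parse_nm_output(text):
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        # A typical nm line looks like "00401000 T _interpret".
        # Undefined symbols ("U") have no address field, so len < 3.
        if len(parts) == 3 and parts[1] in ("T", "t"):
            address, _, name = parts
            symbols.append((int(address, 16), name))
    symbols.sort()
    return symbols

sample = """00401000 T _interpret
00412340 T _stObjectatput
         U _imported_symbol
00415678 t _markAndTrace"""
print(parse_nm_output(sample)[0])  # → (4198400, '_interpret')
```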

And now I can answer my original question: Why is the following code
slower on the CogVM than on the SqueakVM?

| s1 s2 |
Smalltalk garbageCollect.
s1 := String streamContents: [ :stream |
         1000 timesRepeat: [
                 'aab' do: [ :e | stream nextPut: e; cr ] ] ].
s2 := String streamContents: [ :stream |
         1000 timesRepeat: [
                 'abb' do: [ :e | stream nextPut: e; cr ] ] ].
[ TextDiffBuilder from: s1 to: s2 ] timeToRun.

CogVM: 2914
SqueakVM: 1900

The answer is: Cog didn't compile the code to machine code; it's mostly
interpreted:

2.859 seconds; sampling frequency 997 hz
2742 samples in the VM (2851 samples in the entire program)  96.18% of total

243 samples in generated vm code   8.86% of entire vm (  8.52% of total)
2499 samples in vanilla vm code 91.14% of entire vm (87.65% of total)

% of generated vm code (% of total) (samples) (cumulative)
32.10%    (  2.74%) DiffElement>>= (78) (32.10%)
11.93%    (  1.02%) ceReturnToInterpreterTrampoline (29) (44.03%)
11.11%    (  0.95%) String>>compare:with:collated: (27) (55.14%)
   9.88%    (  0.84%) Object>>at:put: (24) (65.02%)
   9.88%    (  0.84%) String>>= (24) (74.90%)
   6.17%    (  0.53%) SmallInteger>>= (15) (81.07%)
   4.94%    (  0.42%) cePrimReturnEnterCogCode (12) (86.01%)
   3.70%    (  0.32%) ByteString class>>compare:with:collated: (9) (89.71%)
   2.47%    (  0.21%) PIC isString (6) (92.18%)
   2.47%    (  0.21%) ceEnterCogCodePopReceiverReg (6) (94.65%)
   2.47%    (  0.21%) DiffElement>>hash (6) (97.12%)
   0.82%    (  0.07%) PIC isByteString (2) (97.94%)
   0.41%    (  0.04%) ByteString>>isByteString (1) (98.35%)
   0.41%    (  0.04%) DiffElement>>string (1) (98.77%)
   0.41%    (  0.04%) SequenceableCollection>>copyFrom:to: (1) (99.18%)
   0.41%    (  0.04%) String>>isString (1) (99.59%)
   0.41%    (  0.04%) WriteStream>>nextPut: (1) (100.0%)


% of vanilla vm code (% of total) (samples) (cumulative)
84.23%    (73.83%) _interpret (2105) (84.23%)
   2.60%    (  2.28%) _installinAtCacheatstring (65) (86.83%)
   1.20%    (  1.05%) _assertValidExecutionPointersimbar (30) (88.04%)
etc.

Whenever this test is repeated, the result is similar. Somehow the CogVM
doesn't compile the code to machine code. But we can force it. Just repeat
the test a few times:

| s1 s2 |
Smalltalk garbageCollect.
s1 := String streamContents: [ :stream |
         1000 timesRepeat: [
                 'aab' do: [ :e | stream nextPut: e; cr ] ] ].
s2 := String streamContents: [ :stream |
         1000 timesRepeat: [
                 'abb' do: [ :e | stream nextPut: e; cr ] ] ].
5 timesRepeat: [
  Transcript show: [ TextDiffBuilder from: s1 to: s2 ] timeToRun; cr ]


Transcript:
2905
845
850
843
850

So, when the CogVM generates native code, things get a bit different. The
benchmark is evaluated 2.2x faster than with SqueakVM (instead of being
1.5x slower). Here's the new vmreport from QVMProfiler:

0.782 seconds; sampling frequency 996 hz
779 samples in the VM (779 samples in the entire program)  100.0% of total

545 samples in generated vm code 69.96% of entire vm (69.96% of total)
234 samples in vanilla vm code 30.04% of entire vm (30.04% of total)

% of generated vm code (% of total) (samples) (cumulative)
27.34%    (19.13%) TextDiffBuilder>>lcsFor:and: (149) (27.34%)
16.15%    (11.30%) Object>>at: (88) (43.49%)
15.60%    (10.91%) SmallInteger>>+ (85) (59.08%)
10.46%    (  7.32%) DiffElement>>= (57) (69.54%)
   5.32%    (  3.72%) SmallInteger>>= (29) (74.86%)
   4.22%    (  2.95%) Object>>at:put: (23) (79.08%)
   3.49%    (  2.44%) String>>compare:with:collated: (19) (82.57%)
   2.75%    (  1.93%) SmallInteger>>< (15) (85.32%)
   2.75%    (  1.93%) String>>= (15) (88.07%)
   1.83%    (  1.28%) cePrimReturnEnterCogCode (10) (89.91%)
   1.65%    (  1.16%) SmallInteger>>- (9) (91.56%)
   1.47%    (  1.03%) ByteString class>>compare:with:collated: (8) (93.03%)
   1.47%    (  1.03%) String>>isString (8) (94.50%)
etc.

% of vanilla vm code (% of total)
(samples) (cumulative)
24.36%    (  7.32%) _stObjectatput (57) (24.36%)
13.68%    (  4.11%) _noAtCacheCommonAtPut (32) (38.03%)
   8.97%    (  2.70%) _primitiveCompareString (21) (47.01%)
   8.55%    (  2.57%) _ceNewArraySlotSize (20) (55.56%)
   8.12%    (  2.44%) _markAndTrace (19) (63.68%)
   7.69%    (  2.31%) _isWordsOrBytesNonInt (18) (71.37%)
   5.13%    (  1.54%) _sizeOfSTArrayFromCPrimitive (12) (76.50%)
   4.70%    (  1.41%) _stackValue (11) (81.20%)
   3.42%    (  1.03%) _arrayValueOf (8) (84.62%)
etc.

This also shows that the lack of the machine code version of the #at:put:
primitive is not significant in this case.


Levente


Re: Cog performance

Eliot Miranda-2
Hi Levente,

On Sat, Jul 3, 2010 at 6:07 PM, Levente Uzonyi <[hidden email]> wrote:
<snip>

Thanks for the pointers. I took the "easy" way and did pretty much the same you did for mac. A primitive lists the loaded modules and the symbols are extracted by external programs (nm or the dependency walker if nm didn't work) via OSProcess (actually ProcessWrapper, because OSProcess doesn't work on windows). So now I have a version of the QVMProfiler which seems to be working on win32.

First, FANTASTIC!  Thanks!  Can you send me your code?  How are you running the external programs, with OSProcess?


And now I can answer my original question: Why is the following code slower on the CogVM than on the SqueakVM?


| s1 s2 |
Smalltalk garbageCollect.
s1 := String streamContents: [ :stream |
       1000 timesRepeat: [
               'aab' do: [ :e | stream nextPut: e; cr ] ] ].
s2 := String streamContents: [ :stream |
       1000 timesRepeat: [
               'abb' do: [ :e | stream nextPut: e; cr ] ] ].
[ TextDiffBuilder from: s1 to: s2 ] timeToRun.

CogVM: 2914
SqueakVM: 1900

The answer is: Cog didn't compile the code to machine code, it's mostly interpreted:

<snip>

Whenever this test is repeated, the result is similar. Somehow the CogVM doesn't compile the code to machine code. But we can force it. Just repeat the test a few times:


| s1 s2 |
Smalltalk garbageCollect.
s1 := String streamContents: [ :stream |
       1000 timesRepeat: [
               'aab' do: [ :e | stream nextPut: e; cr ] ] ].
s2 := String streamContents: [ :stream |
       1000 timesRepeat: [
               'abb' do: [ :e | stream nextPut: e; cr ] ] ].
5 timesRepeat: [
       Transcript show: [ TextDiffBuilder from: s1 to: s2 ] timeToRun; cr ]


Transcript:
2905
845
850
843
850

I suspected as much.  At the moment the heuristic the VM uses to decide whether to compile to native code is as follows:

a) if a method is found in the lookup cache rather than through a class hierarchy lookup (and it has less than L literals) compile it to machine code (i.e. JIT it on the second send)

b) if a block is about to be evaluated interpretively and it is the same as the most recently interpretively evaluated block,  (and its method has less than L literals) compile the method to machine code (i.e. if a block is repeatedly evaluated JIT it on the second repetition)

c) if a method is evaluated through primitiveExecuteMethod[ArgsArray] (and it has less than L literals) compile it to machine code (i.e. always compile doits, because they never get repeated).

So the above gets faster on the second time.  I thought of adding another heuristic for long-running interpreted methods which would be
d) if an interpreted backwards jump occurs within the same method N times in a row (and its containing method has less than L literals) then compile to machine code and map the current frame from an interpreted one to a machine code one.

This is doable, but it requires mapping bytecode pcs to machine-code pcs, whereas the current code only requires mapping machine-code pcs to bytecode pcs, and it could obscure benchmarks that don't repeat twice, because some time would be lost before compiling.  But this heuristic would be the only way things like the finalization loop would ever get jitted, and so I think it's worthwhile doing.  Alas, it's not my highest priority right now.
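(Heuristic (a) — JIT on the second send via the lookup cache — can be pictured with a toy model; the class names and the literal-count threshold below are illustrative, not Cog's actual code:)

```python
# Toy model of heuristic (a): a method found in the lookup cache (i.e.
# sent at least once before) is compiled to machine code, provided it is
# small enough.  The threshold value is a made-up stand-in for "L".
LITERAL_LIMIT = 60

class Method:
    def __init__(self, name, literal_count):
        self.name = name
        self.literal_count = literal_count
        self.compiled = False

class ToyVM:
    def __init__(self):
        self.lookup_cache = set()

    def send(self, method):
        if method.name in self.lookup_cache:
            # Cache hit: second (or later) send -> JIT if small enough.
            if method.literal_count < LITERAL_LIMIT:
                method.compiled = True
        else:
            # Cache miss: full class-hierarchy lookup, interpret this time.
            self.lookup_cache.add(method.name)
        return "machine code" if method.compiled else "interpreted"

vm = ToyVM()
m = Method("lcsFor:and:", literal_count=12)
print(vm.send(m))  # first send: interpreted
print(vm.send(m))  # second send: machine code
```

This is why the first `timeToRun` in the transcript above is slow and the subsequent ones are fast: the first run pays for interpretation plus compilation, the rest run compiled code.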


So, when the CogVM generates native code, things get a bit different. The benchmark is evaluated 2.2x faster than with SqueakVM (instead of being 1.5x slower). Here's the new vmreport from QVMProfiler:

0.782 seconds; sampling frequency 996 hz
779 samples in the VM   (779 samples in the entire program)  100.0% of total

545 samples in generated vm code 69.96% of entire vm (69.96% of total)
234 samples in vanilla vm code 30.04% of entire vm (30.04% of total)

% of generated vm code (% of total) (samples) (cumulative)
27.34%    (19.13%) TextDiffBuilder>>lcsFor:and: (149) (27.34%)
16.15%    (11.30%) Object>>at: (88) (43.49%)
15.60%    (10.91%) SmallInteger>>+ (85) (59.08%)
10.46%    (  7.32%) DiffElement>>= (57) (69.54%)
   5.32%    (  3.72%) SmallInteger>>= (29) (74.86%)
   4.22%    (  2.95%) Object>>at:put: (23) (79.08%)
   3.49%    (  2.44%) String>>compare:with:collated: (19) (82.57%)
   2.75%    (  1.93%) SmallInteger>>< (15) (85.32%)
   2.75%    (  1.93%) String>>= (15) (88.07%)
   1.83%    (  1.28%) cePrimReturnEnterCogCode (10) (89.91%)
   1.65%    (  1.16%) SmallInteger>>- (9) (91.56%)
   1.47%    (  1.03%) ByteString class>>compare:with:collated: (8) (93.03%)
   1.47%    (  1.03%) String>>isString (8) (94.50%)
etc.

% of vanilla vm code (% of total) (samples) (cumulative)
24.36%    (  7.32%) _stObjectatput (57) (24.36%)
13.68%    (  4.11%) _noAtCacheCommonAtPut (32) (38.03%)
   8.97%    (  2.70%) _primitiveCompareString (21) (47.01%)
   8.55%    (  2.57%) _ceNewArraySlotSize (20) (55.56%)
   8.12%    (  2.44%) _markAndTrace (19) (63.68%)
   7.69%    (  2.31%) _isWordsOrBytesNonInt (18) (71.37%)
   5.13%    (  1.54%) _sizeOfSTArrayFromCPrimitive (12) (76.50%)
   4.70%    (  1.41%) _stackValue (11) (81.20%)
   3.42%    (  1.03%) _arrayValueOf (8) (84.62%)

This also shows that the lack of the machine code version of the #at:put: primitive is not significant in this case.

On the contrary, fully 24.36% + 13.68%, or 38%, of entire execution time is going into at:put: (stObjectatput & noAtCacheCommonAtPut are the guts of the interpreter's at:put: implementation).  The code would probably be better than 3x faster than the interpreter (up from 2.2x) with a machine-code at:put:.
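(A back-of-the-envelope check of that estimate, taking the quoted 38% at:put: share at face value; the "three quarters eliminated" fraction below is a hypothetical assumption, not a measured number:)

```python
# Rough Amdahl-style estimate from the figures quoted in this thread.
interpreter_ms = 1900.0           # SqueakVM run, from earlier in the thread
cog_ms = 845.0                    # warmed-up CogVM run
atput_fraction = 0.2436 + 0.1368  # _stObjectatput + _noAtCacheCommonAtPut

current_speedup = interpreter_ms / cog_ms
# Suppose a machine-code at:put: eliminated three quarters of that time:
optimistic_ms = cog_ms * (1 - 0.75 * atput_fraction)
projected_speedup = interpreter_ms / optimistic_ms
print(round(current_speedup, 1), round(projected_speedup, 1))  # → 2.2 3.1
```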

great that you've done this Levente!  Thanks!

best,
Eliot





Re: Cog performance

David T. Lewis
On Sat, Jul 03, 2010 at 06:53:47PM -0700, Eliot Miranda wrote:
>
> <snip>
>
> First, FANTASTIC!  Thanks!  Can you send me your code?  How are you running
> the external programs, with OSProcess?

http://squeaksource.com/ProcessWrapper



QVMProfiler with win32 support (was: Re: [squeak-dev] Cog performance)

Levente Uzonyi-2
In reply to this post by Eliot Miranda-2
On Sat, 3 Jul 2010, Eliot Miranda wrote:

> <snip>
> First, FANTASTIC!  Thanks!  Can you send me your code?  How are you running
> the external programs, with OSProcess?

I uploaded the code to http://leves.web.elte.hu/QVMProfiler/ , the latest
version is http://leves.web.elte.hu/QVMProfiler/Qwaq-VMProfiling-ul.64.mcz 
. I used ProcessWrapper (http://squeaksource.com/ProcessWrapper.html ) to
run external programs. OSProcess would be nice, but it doesn't work on
win32. FFI would be sufficient, but the cygwin version doesn't work with
Cog.

How to load it:
1. Get a recent CogVM (this should do it:
http://leves.web.elte.hu/squeak/Cog17.zip )
2. Get the plugin from
http://leves.web.elte.hu/QVMProfiler/QVMProfileWin32SupportPlugin.dll
3. Get nm.exe (part of the MinGW binutils http://mingw.org) and grep.exe
(part of UnxUtils http://unxutils.sourceforge.net/ ) or other sources. I
uploaded the version I use here:
http://leves.web.elte.hu/QVMProfiler/tools/
4. Get the Dependency Walker from
http://dependencywalker.com/depends22_x86.zip . Make sure that the
binaries (nm.exe, grep.exe, depends.exe) are on your path, or just copy
them to the same directory as your Squeak image. Otherwise the vm may
crash...
5. Open your Squeak image and load ProcessWrapper by evaluating: Installer
ss project: 'ProcessWrapper'; install: 'ProcessWrapper'. When the
installer asks if you want to download the plugin for ProcessWrapper,
answer yes.
6. Install the profiler by evaluating: Installer monticello http:
'http://leves.web.elte.hu/QVMProfiler'; install: 'Qwaq-VMProfiling'.
Click proceed when you see the warning about the missing class
InterpreterPlugin.
7. Evaluate QVMProfiler spyOn: [ 0 tinyBenchmarks ]
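
For convenience, steps 5-7 can also be run as a single doit. This is just a
sketch combining the expressions above; it assumes the VM, plugin and
external tools from steps 1-4 are already in place:

```smalltalk
"Sketch only: steps 5-7 combined. Assumes the CogVM, the
QVMProfileWin32SupportPlugin.dll and nm.exe/grep.exe/depends.exe
from steps 1-4 are already installed."
Installer ss project: 'ProcessWrapper'; install: 'ProcessWrapper'.
"Answer yes when asked to download the ProcessWrapper plugin."
Installer monticello
	http: 'http://leves.web.elte.hu/QVMProfiler';
	install: 'Qwaq-VMProfiling'.
"Click proceed on the warning about the missing class InterpreterPlugin."
QVMProfiler spyOn: [ 0 tinyBenchmarks ]
```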


Notes:
- I think this package should have its own squeaksource repository; the
mac and win32 parts could be separated just like the plugins (it's not
possible at the moment, since to minimize effort the win32 classes are
just subclasses of the mac classes).
- Un*x version would be nice too.
- I added SparseLargeArray to the package, because the package depends on
it. I had to change QVMSparseLargeArray >> #noCheckAt:put: to initialize
the chunk's slots with the default value.
- The GUI doesn't really work in Squeak 4.1; it's probably the effect of
the changes since 3.8. (Zooming works, but the selected area is not
highlighted, and the "profile:" button doesn't work. Some checkboxes can
only be pressed when the window is large enough.)
- nm.exe only works with cygwin/mingw binaries, so it's not really useful
at the moment. Using the generated map file for the vm (and vm plugins)
might be a better idea.
- The dependency walker can only extract the exported symbols. It
also creates huge output files which is pretty bad. I also tried mingw's
pexports, but it crashes while it's processing ntdll.dll.
- The code processing the output of the dependency walker is just a hack;
it should be improved.
- The name of the temp directory is simply vmsyms, without the pid,
because OSProcess can't tell it. The directory is created in the image's
directory instead of %TEMP%.
- The files generated by nm have .nm suffix to avoid conflicts with the
files generated by the dependency walker. They could be generated in
separate directories to avoid the suffixes.
- I changed some mac related parts, but I don't have a mac, so I couldn't
check if I broke something or not. Hopefully not.
- The code is probably full of bugs, as usual. :)


About the plugin:
- The Cog VMMaker doesn't generate declared variables if they are not used
by slang code. I don't know if Squeak's VMMaker does so, but it's really
annoying.
- For simplicity the primitive generates the same output as the mac
primitive.
- psapi is loaded dynamically (just like in sqWin32Backtrace.c), but
psapi.h is included. I saw that the vm is linked against psapi, so the
methods could be used directly in sqWin32Backtrace. And the plugin could
use it too.
- The path of a module can be at most 1023 bytes long at the moment.
- The code may contain bugs.


Feedback is welcome.

>
>
> And now I can answer my original question: Why is the following code slower
>> on the CogVM than on the SqueakVM?
>>
>>
>> | s1 s2 |
>> Smalltalk garbageCollect.
>> s1 := String streamContents: [ :stream |
>>        1000 timesRepeat: [
>>                'aab' do: [ :e | stream nextPut: e; cr ] ] ].
>> s2 := String streamContents: [ :stream |
>>        1000 timesRepeat: [
>>                'abb' do: [ :e | stream nextPut: e; cr ] ] ].
>> [ TextDiffBuilder from: s1 to: s2 ] timeToRun.
>>
>> CogVM: 2914
>> SqueakVM: 1900
>>
>> The answer is: Cog didn't compile the code to machine code, it's mostly
>> interpreted:
>>
>> 2.859 seconds; sampling frequency 997 hz
>> 2742 samples in the VM  (2851 samples in the entire program)  96.18% of
>> total
>>
>> 243 samples in generated vm code   8.86% of entire vm (  8.52% of total)
>> 2499 samples in vanilla vm code 91.14% of entire vm (87.65% of total)
>>
>> % of generated vm code (% of total)              (samples) (cumulative)
>> 32.10%  (  2.74%)  DiffElement>>=                            (78)  (32.10%)
>> 11.93%  (  1.02%)  ceReturnToInterpreterTrampoline           (29)  (44.03%)
>> 11.11%  (  0.95%)  String>>compare:with:collated:            (27)  (55.14%)
>>  9.88%  (  0.84%)  Object>>at:put:                           (24)  (65.02%)
>>  9.88%  (  0.84%)  String>>=                                 (24)  (74.90%)
>>  6.17%  (  0.53%)  SmallInteger>>=                           (15)  (81.07%)
>>  4.94%  (  0.42%)  cePrimReturnEnterCogCode                  (12)  (86.01%)
>>  3.70%  (  0.32%)  ByteString class>>compare:with:collated:   (9)  (89.71%)
>>  2.47%  (  0.21%)  PIC isString                               (6)  (92.18%)
>>  2.47%  (  0.21%)  ceEnterCogCodePopReceiverReg               (6)  (94.65%)
>>  2.47%  (  0.21%)  DiffElement>>hash                          (6)  (97.12%)
>>  0.82%  (  0.07%)  PIC isByteString                           (2)  (97.94%)
>>  0.41%  (  0.04%)  ByteString>>isByteString                   (1)  (98.35%)
>>  0.41%  (  0.04%)  DiffElement>>string                        (1)  (98.77%)
>>  0.41%  (  0.04%)  SequenceableCollection>>copyFrom:to:       (1)  (99.18%)
>>  0.41%  (  0.04%)  String>>isString                           (1)  (99.59%)
>>  0.41%  (  0.04%)  WriteStream>>nextPut:                      (1)  (100.0%)
>>
>>
>> % of vanilla vm code (% of total)                (samples) (cumulative)
>> 84.23%  (73.83%)  _interpret                              (2105)  (84.23%)
>>  2.60%  (  2.28%)  _installinAtCacheatstring                 (65)  (86.83%)
>>  1.20%  (  1.05%)  _assertValidExecutionPointersimbar        (30)  (88.04%)
>> etc.
>>
>> Whenever this test is repeated, the result is similar. Somehow the CogVM
>> doesn't compile the code to machine code. But we can force it. Just repeat
>> the test a few times:
>>
>>
>> | s1 s2 |
>> Smalltalk garbageCollect.
>> s1 := String streamContents: [ :stream |
>>        1000 timesRepeat: [
>>                'aab' do: [ :e | stream nextPut: e; cr ] ] ].
>> s2 := String streamContents: [ :stream |
>>        1000 timesRepeat: [
>>                'abb' do: [ :e | stream nextPut: e; cr ] ] ].
>> 5 timesRepeat: [
>>        Transcript show: [ TextDiffBuilder from: s1 to: s2 ] timeToRun; cr ]
>>
>>
>> Transcript:
>> 2905
>> 845
>> 850
>> 843
>> 850
>>
>
> I suspected as much.  At the moment the heuristic the VM uses to decide
> whether to compile to native code is as follows:
>
> a) if a method is found in the lookup cache rather than through a class
> hierarchy lookup (and it has less than L literals) compile it to machine
> code (i.e. JIT it on the second send)
>
> b) if a block is about to be evaluated interpretively and it is the same as
> the most recently interpretively evaluated block,  (and its method has less
> than L literals) compile the method to machine code (i.e. if a block is
> repeatedly evaluated JIT it on the second repetition)
>
> c) if a method is evaluated through primitiveExecuteMethod[ArgsArray] (and
> it has less than L literals) compile it to machine code (i.e. always compile
> doits, because they never get repeated).
>
> So the above gets faster on the second time.  I thought of adding another
> heuristic for long-running interpreted methods which would be
> d) if an interpreted backwards jump occurs within the same method N times in
> a row (and its containing method has less than L literals) then compile to
> machine code and map the current frame from an interpreted one to a machine
> code one.
>
> This is doable, but it requires mapping of bytecode pcs to machine code pcs,
> whereas the current code only requires mapping of machine-code pcs to
> bytecode pcs, and could obscure benchmarks that didn't repeat twice, because
> some time would be lost before compiling.  But this heuristic would be the
> only way things like the finalization loop would ever get jitted, and so I
> think it's worthwhile doing.  Alas, it's not my highest priority right now.

So if WeakArray >> #finalizationProcess were implemented as
[ ... ] repeat instead of [ true ] whileTrue: [ ... ], then most of it
would be compiled to native code via b).
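
A minimal sketch of the two loop shapes (hedged: this assumes the bytecode
compiler inlines #whileTrue: on literal blocks into plain jump bytecodes,
while #repeat is an ordinary send to BlockClosure>>repeat, whose body
re-evaluates the receiver block on every iteration; #doOneFinalizationStep
is a hypothetical placeholder for the loop body):

```smalltalk
"Inlined form: the whole loop becomes jump bytecodes in the home method,
so no block object is ever evaluated and heuristic b) never fires."
[ true ] whileTrue: [ self doOneFinalizationStep ].

"Send form: the receiver block is evaluated again on each iteration
inside BlockClosure>>repeat, so heuristic b) JITs its method on the
second repetition."
[ self doOneFinalizationStep ] repeat
```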

>
>
> So, when the CogVM generates native code, things get a bit different. The
>> benchmark is evaluated 2.2x faster than with SqueakVM (instead of being 1.5x
>> slower). Here's the new vmreport from QVMProfiler:
>>
>> 0.782 seconds; sampling frequency 996 hz
>> 779 samples in the VM   (779 samples in the entire program)  100.0% of
>> total
>>
>> 545 samples in generated vm code 69.96% of entire vm (69.96% of total)
>> 234 samples in vanilla vm code 30.04% of entire vm (30.04% of total)
>>
>> % of generated vm code (% of total)              (samples) (cumulative)
>> 27.34%  (19.13%)  TextDiffBuilder>>lcsFor:and:             (149)  (27.34%)
>> 16.15%  (11.30%)  Object>>at:                               (88)  (43.49%)
>> 15.60%  (10.91%)  SmallInteger>>+                           (85)  (59.08%)
>> 10.46%  (  7.32%)  DiffElement>>=                            (57)  (69.54%)
>>  5.32%  (  3.72%)  SmallInteger>>=                           (29)  (74.86%)
>>  4.22%  (  2.95%)  Object>>at:put:                           (23)  (79.08%)
>>  3.49%  (  2.44%)  String>>compare:with:collated:            (19)  (82.57%)
>>  2.75%  (  1.93%)  SmallInteger>><                           (15)  (85.32%)
>>  2.75%  (  1.93%)  String>>=                                 (15)  (88.07%)
>>  1.83%  (  1.28%)  cePrimReturnEnterCogCode                  (10)  (89.91%)
>>  1.65%  (  1.16%)  SmallInteger>>-                            (9)  (91.56%)
>>  1.47%  (  1.03%)  ByteString class>>compare:with:collated:   (8)  (93.03%)
>>  1.47%  (  1.03%)  String>>isString                           (8)  (94.50%)
>> etc.
>>
>> % of vanilla vm code (% of total)                (samples) (cumulative)
>> 24.36%  (  7.32%)  _stObjectatput                            (57)  (24.36%)
>> 13.68%  (  4.11%)  _noAtCacheCommonAtPut                     (32)  (38.03%)
>>  8.97%  (  2.70%)  _primitiveCompareString                   (21)  (47.01%)
>>  8.55%  (  2.57%)  _ceNewArraySlotSize                       (20)  (55.56%)
>>  8.12%  (  2.44%)  _markAndTrace                             (19)  (63.68%)
>>  7.69%  (  2.31%)  _isWordsOrBytesNonInt                     (18)  (71.37%)
>>  5.13%  (  1.54%)  _sizeOfSTArrayFromCPrimitive              (12)  (76.50%)
>>  4.70%  (  1.41%)  _stackValue                               (11)  (81.20%)
>>  3.42%  (  1.03%)  _arrayValueOf                              (8)  (84.62%)
>> etc.
>>
>> This also shows that the lack of the machine code version of the #at:put:
>> primitive is not significant in this case.
>>
>
> On the contrary, fully 24.36% +  13.68%, or 38% of entire execution time is
> going into at:put: (stObjectatput & noAtCacheCommonAtPut are the guts of the
> interpreter's at:put: implementation).  The code would probably be better
> than 3x faster than the interpreter (up from 2.2x) with a machine-code
> at:put:.

No. It's just 7.32% + 4.11% = 11.43%. The first column contains the
percent of "vanilla vm code"; the second column contains the "% of total".
I fell for it too. So the speedup would be 2.3x instead of 2.2x if
#at:put: were implemented in machine code.


Levente

>
> great that you've done this Levente!  Thanks!
>
> best,
> Eliot
>
>
>
>>
>> Levente
>>
>>
>>
>>>
>>> But I expect that the reason is the cost of invoking interpreter
>>> primitives
>>>
>>>> from machine code.  Cog only implements a few primitives in machine code
>>>>> (arithmetic, at: & block value) and for all others (e.g. nextPut: above)
>>>>> it
>>>>> executes the interpreter primitives.  lcsFor:and: uses at:put: heavily
>>>>> and
>>>>> Cog is using the interpreter version.  But the cost of invoking an
>>>>> interpreter primitive from machine code is higher than invoking it from
>>>>> the
>>>>> interpreter because of the system-call-like glue between the
>>>>> machine-code
>>>>> stack pages and the C stack on which the interpreter primitive runs.
>>>>>
>>>>> Three primitives that are currently interpreter primitives but must be
>>>>> implemented in machine code for better performance are new/basicNew,
>>>>> new:/basicNew: and at:put:.  I've avoided implementing these in machine
>>>>> code
>>>>> because the object representation is so complex and am instead about to
>>>>> start work on a simpler object representation.  When I have that I'll
>>>>> implement these primitives and then the speed difference should tilt the
>>>>> other way.
>>>>>
>>>>>
>>>> This sounds reasonable. #lcsFor:and: uses #at:put: twice in the inner
>>>> loop.
>>>> One of them (lcss at: max + k + 1 put: lcs) can be eliminated without
>>>> affecting the computation, because that just stores the results. With
>>>> only one #at:put: it took me 2423ms to run the benchmark, which is still
>>>> a bit too high. I think only the profiler can help here.
>>>>
>>>> Btw, is MessageTally less accurate with CogVM than with the SqueakVM?
>>>>
>>>>
>>> I'm not sure.  We use a variant written by Andreas that is more accurate
>>> than MessageTally but that may use different plumbing.
>>>
>>> best
>>> Eliot
>>>
>>>
>>>
>>>> Levente
>>>>
>>>>
>>>>
>>>>  Of course if anyone would like to implement these in the context of the
>>>>> current object representation be my guest and report back asap...
>>>>>
>>>>> best
>>>>> Eliot
>>>>>
>>>>>
>>>>>
>>>>>> Levente
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Cog performance

Göran Krampe
In reply to this post by Eliot Miranda-2
Hi guys!

On 07/04/2010 03:53 AM, Eliot Miranda wrote:
> great that you've done this Levente!  Thanks!

Just shows what can happen when stuff is actually out in the open... ;)

(sorry, couldn't help that remark)

Anyway, great to see stuff "happening" with Cog - and Cog of course
makes Squeak a much stronger choice in the jungle of dynamic runtimes.

regards, Göran