[squeak-dev] VM performance discrepancy on Linux and Windows


[squeak-dev] VM performance discrepancy on Linux and Windows

Yoshiki Ohshima-2
  Hello,

  I had some suspicions for a while, so we did a little test on a
computer that dual-boots Windows XP and (Ubuntu) Linux, running
tinyBenchmarks.  (The computer happens to be a 1.8GHz Pentium-M Dell
laptop.)

On Windows, 3.10.6 VM (pre-compiled one on the site) with
etoys-dev.image, the result was:

  311 million bytecodes/sec, 8.9 million sends/sec

On Linux, 3.9-8 VM (pre-compiled one on the site) with etoys-dev.image
the result was:

  190 million bytecodes/sec, 5.7 million sends/sec

  Have any of you experienced a similar gap?  Has anybody looked
at the generated code, or done any experiments recently?

-- Yoshiki


Re: [squeak-dev] VM performance discrepancy on Linux and Windows

johnmci
Ian was/is aware of the magic needed to make the compiler emit a more
efficient set of assembler instructions for non-generic Intel CPU
flavors.
I'll assume these were not applied to the 3.9-8 VM you are using.

On Apr 9, 2008, at 1:38 PM, Yoshiki Ohshima wrote:

>  Hello,
>
>  I had some suspicions for a while, so we did a little test on a
> computer that dual-boots Windows XP and (Ubuntu) Linux, running
> tinyBenchmarks.  (The computer happens to be a 1.8GHz Pentium-M Dell
> laptop.)
>
> On Windows, 3.10.6 VM (pre-compiled one on the site) with
> etoys-dev.image, the result was:
>
>  311 million bytecodes/sec, 8.9 million sends/sec
>
> On Linux, 3.9-8 VM (pre-compiled one on the site) with etoys-dev.image
> the result was:
>
>  190 million bytecodes/sec, 5.7 million sends/sec
>
>  Have any of you experienced a similar gap?  Has anybody looked
> at the generated code, or done any experiments recently?
>
> -- Yoshiki
>

--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================




Re: [Vm-dev] Re: [squeak-dev] VM performance discrepancy on Linux and Windows

Yoshiki Ohshima-2
  Well,

  So some god words came and now I'm looking at the assembly code...

Bottom line is: gcc 2.95.2 on my Linux makes the bytecodes/sec count
larger, but makes the sends/sec count smaller.

gcc 2.95.2 on Windows generates a code sequence for two bytecodes like this:

-------------------
    69d8: 46                   inc    %esi
    69d9: 0f b6 1e             movzbl (%esi),%ebx
    69dc: 83 c7 04             add    $0x4,%edi
    69df: a1 00 00 00 00       mov    0x0,%eax
    69e4: 8b 40 08             mov    0x8(%eax),%eax
    69e7: 89 07                mov    %eax,(%edi)
    69e9: ff 24 9d 80 27 00 00 jmp    *0x2780(,%ebx,4)
    69f0: 46                   inc    %esi
    69f1: 0f b6 1e             movzbl (%esi),%ebx
    69f4: 83 c7 04             add    $0x4,%edi
    69f7: a1 00 00 00 00       mov    0x0,%eax
    69fc: 8b 40 0c             mov    0xc(%eax),%eax
    69ff: 89 07                mov    %eax,(%edi)
    6a01: ff 24 9d 80 27 00 00 jmp    *0x2780(,%ebx,4)
-------------------

Apparently, %esi is used (exclusively) for the IP, %ebx holds the next
byte, and "jmp *" takes you to the next location, looked up in the
table that starts at 0x2780.

gcc 4.1.2 on Fedora Core 7 generates a code sequence for two bytecodes like this:

-------------------
    efcf: 8d 46 01             lea    0x1(%esi),%eax
    efd2: 0f b6 08             movzbl (%eax),%ecx
    efd5: 89 c6                mov    %eax,%esi
    efd7: a1 40 00 00 00       mov    0x40,%eax
    efdc: 8d 57 04             lea    0x4(%edi),%edx
    efdf: 89 d7                mov    %edx,%edi
    efe1: 89 cb                mov    %ecx,%ebx
    efe3: 8b 40 2c             mov    0x2c(%eax),%eax
    efe6: 89 02                mov    %eax,(%edx)
    efe8: 8b 04 8d 20 04 00 00 mov    0x420(,%ecx,4),%eax
    efef: ff e0                jmp    *%eax
    eff1: 8d 46 01             lea    0x1(%esi),%eax
    eff4: 0f b6 08             movzbl (%eax),%ecx
    eff7: 89 c6                mov    %eax,%esi
    eff9: a1 40 00 00 00       mov    0x40,%eax
    effe: 8d 57 04             lea    0x4(%edi),%edx
    f001: 89 d7                mov    %edx,%edi
    f003: 89 cb                mov    %ecx,%ebx
    f005: 8b 40 30             mov    0x30(%eax),%eax
    f008: 89 02                mov    %eax,(%edx)
    f00a: 8b 04 8d 20 04 00 00 mov    0x420(,%ecx,4),%eax
    f011: ff e0                jmp    *%eax
-------------------

%esi is mostly used for the IP, but %eax is used to fetch the next byte.
The jmp also seems to go through %eax, so right before the jump %eax is
spilled and the destination address is loaded into it.

  I'd be surprised if this were optimized for a specific x86
variant.  I copied the command line options from the Windows Makefile
to Fedora:

-mpentium -mwindows -Werror-implicit-function-declaration -fomit-frame-pointer -funroll-loops -fschedule-insns2

and got an equally unsatisfying (slightly different) sequence.

Ok, so one thing to try is to install gcc 2.95.2 on Fedora Core 7 and
compile the interpreter with it.  The resulting assembly code is close
to the one on Windows.  The bytecodes/sec count went up but sends/sec
went down.  I have a feeling that I saw this before, but of course I
cannot remember the exact conditions...

  If somebody has a dual-boot machine and can compare 8 (or more) cases
(namely, the combinations of Windows/Linux, 2.95.2/4.1.2, more
options/fewer options), that would be great.

-- Yoshiki


Re: [Vm-dev] Re: [squeak-dev] VM performance discrepancy on Linux and Windows

Andreas.Raab
Yoshiki Ohshima wrote:
> Apparently, %esi is used (exclusively) for the IP, %ebx holds the next
> byte, and "jmp *" takes you to the next location, looked up in the
> table that starts at 0x2780.

All of that comes straight out of sqGnu.h:

#define BC_CASE(N) case N: _##N:
#define BC_BREAK goto *jumpTable[currentBytecode]

#if defined(__i386__)
# define IP_REG asm("%esi")
# define SP_REG asm("%edi")
# define CB_REG asm("%ebx")
#endif

You might want to check if the gnuifier got confused over time - I had
to update it to deal correctly with sqInt etc. gnu-interp.c should look
like this:

sqInt interpret(void) {
     sqInt localReturnValue;
     sqInt localReturnContext;
     sqInt localHomeContext;
     register char* localSP SP_REG;
     register char* localIP IP_REG;
     register sqInt currentBytecode CB_REG;
     BC_JUMP_TABLE;

                switch (currentBytecode) {
                BC_CASE(0)
                        /* pushReceiverVariableBytecode */
                        BC_BREAK;

> %esi is mostly used for the IP, but %eax is used to fetch the next byte.
> The jmp also seems to go through %eax, so right before the jump %eax is
> spilled and the destination address is loaded into it.

Sounds more like the static register assignments get ignored.

Cheers,
   - Andreas


Re: [Vm-dev] Re: [squeak-dev] VM performance discrepancy on Linux and Windows

johnmci
The sqGnu.h I have reads

#if defined(__i386__)
# define IP_REG asm("%esi")
# define SP_REG asm("%edi")
//# if (__GNUC__ > 2) || ((__GNUC__ == 2) && (__GNUC_MINOR__ >= 95))
#   define CB_REG asm("%ebx")
//# else
//#   define CB_REG /* avoid undue register pressure */
//# endif
#endif


The first two bytecodes assemble to this when done right:

L10161:
        addl $1, %esi
        movzbl (%esi), %ebx
        addl $4, %edi
        movl _foo, %eax
        movl 84(%eax), %eax
        movl 4(%eax), %eax
        movl %eax, (%edi)
        movl 512(%esp,%ebx,4), %eax
L10421:
        jmp *%eax

L10162:
        addl $1, %esi
        movzbl (%esi), %ebx
        addl $4, %edi
        movl _foo, %eax
        movl 84(%eax), %eax
        movl 8(%eax), %eax
        movl %eax, (%edi)
        movl 512(%esp,%ebx,4), %eax
        jmp *%eax



sqInt interpret(void) {
#ifdef FOO_REG
     register struct foo * foo FOO_REG = &fum;
#endif
     sqInt localReturnValue;
     sqInt localReturnContext;
     sqInt localHomeContext;
     char* localSP;
     char* localIP;
     sqInt currentBytecode;
     JUMP_TABLE;


Plus use of -DUSE_INLINE_MEMORY_ACCESSORS.

However, much of this also depends on the GCC version, in this case 4.01.
Usage of SP_REG etc. produced dreadful code with GCC 4.x, but was
required for earlier versions.

For PowerPC I noted that building with GCC 3.3 produces better code than
gcc 4.0, gcc 3.1 or gcc 2.95 (FYI, gcc 3.1 produces lousy code).
But since you are building on Intel, your mileage will vary (lots).

On Apr 11, 2008, at 12:22 AM, Andreas Raab wrote:

> Yoshiki Ohshima wrote:
>> Apparently, %esi is used (exclusively) for the IP, %ebx holds the
>> next byte, and "jmp *" takes you to the next location, looked up in
>> the table that starts at 0x2780.
>
> All of that comes straight out of sqGnu.h:
>
> #define BC_CASE(N) case N: _##N:
> #define BC_BREAK goto *jumpTable[currentBytecode]
>
> #if defined(__i386__)
> # define IP_REG asm("%esi")
> # define SP_REG asm("%edi")
> # define CB_REG asm("%ebx")
> #endif
>
> You might want to check if the gnuifier got confused over time - I
> had to update it to deal correctly with sqInt etc. gnu-interp.c
> should look like this:
>
> sqInt interpret(void) {
>    sqInt localReturnValue;
>    sqInt localReturnContext;
>    sqInt localHomeContext;
>    register char* localSP SP_REG;
>    register char* localIP IP_REG;
>    register sqInt currentBytecode CB_REG;
>    BC_JUMP_TABLE;
>
> switch (currentBytecode) {
> BC_CASE(0)
> /* pushReceiverVariableBytecode */
> BC_BREAK;
>
>> %esi is mostly used for the IP, but %eax is used to fetch the next byte.
>> The jmp also seems to go through %eax, so right before the jump %eax is
>> spilled and the destination address is loaded into it.
>
> Sounds more like the static register assignments get ignored.
>
> Cheers,
>  - Andreas
>





[squeak-dev] Re: [Vm-dev] Re: VM performance discrepancy on Linux and Windows

Paolo Bonzini-2
In reply to this post by Andreas.Raab

>> %esi is mostly used for the IP, but %eax is used to fetch the next byte.
>> The jmp also seems to go through %eax, so right before the jump %eax is
>> spilled and the destination address is loaded into it.
>
> Sounds more like the static register assignments get ignored.

It does not really get ignored, but the compiler performs more
aggressive live range splitting (because it uses SSA in 4.x, so live
range splitting comes for free -- sometimes even if you don't want
it...).  OTOH the optimizer is better, which is why sends are faster.

GCC 3.x should be in the same ballpark as 2.95.

Paolo


Re: [squeak-dev] Re: [Vm-dev] Re: VM performance discrepancy on Linux and Windows

Yoshiki Ohshima-2
At Fri, 11 Apr 2008 09:47:13 +0200,
Paolo Bonzini wrote:

>
>
> >> %esi is mostly used for the IP, but %eax is used to fetch the next byte.
> >> The jmp also seems to go through %eax, so right before the jump %eax is
> >> spilled and the destination address is loaded into it.
> >
> > Sounds more like the static register assignments get ignored.
>
> It does not really get ignored, but the compiler performs more
> aggressive live range splitting (because it uses SSA in 4.x, so live
> range splitting comes for free -- sometimes even if you don't want
> it...).  OTOH the optimizer is better, which is why sends are faster.
>
> GCC 3.x should be in the same ballpark as 2.95.

  Hmm, so a simple increment assigns a new value and the range is
split there?

  BTW, I got a report that says the result from gcc 3.3 was comparable
with 4.1.

  Now, I still have a feeling that there is some difference between
the compiler in
http://squeakvm.org/win32/release/Squeak-Win32-Tools-1.2.zip and the
2.95.2 I compiled on my Linux, as there still seems to be some
unexplained discrepancy.  Can anybody with a dual-boot machine do some
more testing?  (If the machine is a cute green-and-white one, that
would be really interesting.)

-- Yoshiki


[squeak-dev] Re: [Vm-dev] Re: VM performance discrepancy on Linux and Windows

Paolo Bonzini-2
>> It does not really get ignored, but the compiler performs more
>> aggressive live range splitting (because it uses SSA in 4.x, so live
>> range splitting comes for free -- sometimes even if you don't want
>> it...).  OTOH the optimizer is better, which is why sends are faster.
>
>   Hmm, so a simple increment assigns a new value and the range is
> split there?

while (i < j)
   {
     *a++ = *i++;
     *b++ = *i++;
     *c++ = *i++;
   }

The increments of a/b/c are not splittable, because each value is used
again in the next loop iteration and obviously it has to be in the same
register.  But the increments of i can be rewritten as

   {
     *a++ = *i; temp1 = i + 1;
     *b++ = *temp1; temp2 = temp1 + 1;
     *c++ = *temp2; i = temp2 + 1;
   }

Most of the time register allocation will allocate temp1 and temp2 to
the same register as i ("coalescing"), but there's no guarantee that it
will.

Paolo


[squeak-dev] Re: [Vm-dev] Re: VM performance discrepancy on Linux and Windows

Andreas.Raab
In reply to this post by Paolo Bonzini-2
Paolo Bonzini wrote:
> GCC 3.x should be in the same ballpark as 2.95.

I have not found this to be true. I tried multiple times (gcc 3.1, 3.3
and 3.4) and each time the resulting VM was *significantly* slower than
2.95.2.

Cheers,
   - Andreas


[squeak-dev] Re: [Vm-dev] Re: VM performance discrepancy on Linux and Windows

johnmci
GCC 3.3 on PowerPC gives the best figures compared to 3.1, 2.95 or 4.01;
however, since this is Intel, your mileage will vary.

Also, setting -mtune= and -march= changes the compiler's optimization
decisions too, for better or worse.

On Apr 11, 2008, at 8:57 AM, Andreas Raab wrote:

> Paolo Bonzini wrote:
>> GCC 3.x should be in the same ballpark as 2.95.
>
> I have not found this to be true. I tried multiple times (gcc 3.1,  
> 3.3 and 3.4) and each time the resulting VM was *significantly*  
> slower than 2.95.2.
>
> Cheers,
>  - Andreas
