Hello,
I have had some suspicions for a while, so we did a little test with a computer that dual-boots Windows XP and (Ubuntu) Linux, running tinyBenchmarks. (The computer happens to be a 1.8GHz Pentium-M Dell laptop.)

On Windows, with the 3.10.6 VM (the pre-compiled one on the site) and etoys-dev.image, the result was:

  311 million bytecodes/sec, 8.9 million sends/sec

On Linux, with the 3.9-8 VM (the pre-compiled one on the site) and etoys-dev.image, the result was:

  190 million bytecodes/sec, 5.7 million sends/sec

Have any of you experienced a similar gap? Has anybody looked at the generated code, or done some experiments recently?

-- Yoshiki

Ian was/is aware of the magic needed to make the compiler emit a more efficient sequence of assembler instructions for non-generic Intel CPU flavors. I'll assume it was not applied to the 3.9-8 VM you are using.

On Apr 9, 2008, at 1:38 PM, Yoshiki Ohshima wrote:

> Hello,
>
> I had some suspicions for a while but we did a little test with a
> computer that dual boot Windows XP and (Ubuntu) Linux to run
> tinyBenchmarks. (The computer happens to be a 1.8GHz Pentium-M Dell
> laptop.)
>
> On Windows, 3.10.6 VM (pre-compiled one on the site) with
> etoys-dev.image, the result was:
>
> 311 million bytecodes/sec, 8.9 million sends/sec
>
> On Linux, 3.9-8 VM (pre-compiled one on the site) with etoys-dev.image
> the result was:
>
> 190 million bytecodes/sec, 5.7 million sends/sec
>
> Has any of you been experiencing similar gap? Have anybody looked
> at the generated code, or has anybody done some experiment recently?
>
> -- Yoshiki

--
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com

Well,
So some good words came in, and now I'm looking at the assembly code...

Bottom line: gcc 2.95.2 on my Linux makes the bytecodes/sec count larger, but makes the sends/sec count smaller.

gcc 2.95.2 on Windows generates a code sequence for two bytecodes like this:

-------------------
    69d8:  46                    inc    %esi
    69d9:  0f b6 1e              movzbl (%esi),%ebx
    69dc:  83 c7 04              add    $0x4,%edi
    69df:  a1 00 00 00 00        mov    0x0,%eax
    69e4:  8b 40 08              mov    0x8(%eax),%eax
    69e7:  89 07                 mov    %eax,(%edi)
    69e9:  ff 24 9d 80 27 00 00  jmp    *0x2780(,%ebx,4)
    69f0:  46                    inc    %esi
    69f1:  0f b6 1e              movzbl (%esi),%ebx
    69f4:  83 c7 04              add    $0x4,%edi
    69f7:  a1 00 00 00 00        mov    0x0,%eax
    69fc:  8b 40 0c              mov    0xc(%eax),%eax
    69ff:  89 07                 mov    %eax,(%edi)
    6a01:  ff 24 9d 80 27 00 00  jmp    *0x2780(,%ebx,4)
-------------------

Apparently, %esi is used (exclusively) for IP, %ebx keeps the next byte, and "jmp *" takes you to the next location, read from the table that starts at 0x2780.

gcc 4.1.2 on Fedora Core 7 generates a code sequence for two bytecodes like this:

-------------------
    efcf:  8d 46 01              lea    0x1(%esi),%eax
    efd2:  0f b6 08              movzbl (%eax),%ecx
    efd5:  89 c6                 mov    %eax,%esi
    efd7:  a1 40 00 00 00        mov    0x40,%eax
    efdc:  8d 57 04              lea    0x4(%edi),%edx
    efdf:  89 d7                 mov    %edx,%edi
    efe1:  89 cb                 mov    %ecx,%ebx
    efe3:  8b 40 2c              mov    0x2c(%eax),%eax
    efe6:  89 02                 mov    %eax,(%edx)
    efe8:  8b 04 8d 20 04 00 00  mov    0x420(,%ecx,4),%eax
    efef:  ff e0                 jmp    *%eax
    eff1:  8d 46 01              lea    0x1(%esi),%eax
    eff4:  0f b6 08              movzbl (%eax),%ecx
    eff7:  89 c6                 mov    %eax,%esi
    eff9:  a1 40 00 00 00        mov    0x40,%eax
    effe:  8d 57 04              lea    0x4(%edi),%edx
    f001:  89 d7                 mov    %edx,%edi
    f003:  89 cb                 mov    %ecx,%ebx
    f005:  8b 40 30              mov    0x30(%eax),%eax
    f008:  89 02                 mov    %eax,(%edx)
    f00a:  8b 04 8d 20 04 00 00  mov    0x420(,%ecx,4),%eax
    f011:  ff e0                 jmp    *%eax
-------------------

%esi is mostly used for IP, but %eax is used for fetching the next byte. The jmp also seems to use %eax, so right before the jump %eax is spilled and the destination address is brought into it. I'd be surprised if this were optimized for a specific x86 variation.

I copied the command line options from the Windows Makefile to Fedora:

    -mpentium -mwindows -Werror-implicit-function-declaration -fomit-frame-pointer -funroll-loops -fschedule-insns2

and got an equally unsatisfying (slightly different) sequence.

OK, so one thing to try was to install gcc 2.95.2 on Fedora Core 7 and compile the interpreter with it. The resulting assembly code is close to the one on Windows. The bytecodes/sec count went up but sends/sec went down.

I have a feeling that I saw this before, but of course I cannot remember the exact conditions...

If somebody has a dual-boot machine and can compare 8 (or more) cases (namely, the combinations of Windows/Linux, 2.95.2/4.1.2, more options/fewer options), that would be great.

-- Yoshiki

Yoshiki Ohshima wrote:
> Apparently, %esi is used (exclusively) for IP, and %ebx keeps the next
> byte, and "jmp *" takes you to the next location stored in the table
> starts at 0x2780.

All of that comes straight out of sqGnu.h:

    #define BC_CASE(N) case N: _##N:
    #define BC_BREAK goto *jumpTable[currentBytecode]

    #if defined(__i386__)
    # define IP_REG asm("%esi")
    # define SP_REG asm("%edi")
    # define CB_REG asm("%ebx")
    #endif

You might want to check whether the gnuifier got confused over time - I had to update it to deal correctly with sqInt etc. gnu-interp.c should look like this:

    sqInt interpret(void) {
        sqInt localReturnValue;
        sqInt localReturnContext;
        sqInt localHomeContext;
        register char* localSP SP_REG;
        register char* localIP IP_REG;
        register sqInt currentBytecode CB_REG;
        BC_JUMP_TABLE;

        switch (currentBytecode) {
        BC_CASE(0)
            /* pushReceiverVariableBytecode */
        BC_BREAK;

> %esi is almost used for IP but use %eax for fetching the next byte,
> jmp also seems to use %eax so right before it is spilled and the
> destination address is brought into %eax.

Sounds more like the static register assignments get ignored.

Cheers,
  - Andreas

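For readers who have not seen the BC_BREAK idiom before, here is a minimal, self-contained sketch of the same dispatch technique (GCC's labels-as-values extension, sometimes called threaded dispatch). It is not Squeak code: the three opcodes, the function name run_threaded and the NEXT macro are hypothetical names made up for illustration, and the stack size is arbitrary. Each handler ends by fetching the next byte and jumping through a table of label addresses, which is what the "jmp *0x2780(,%ebx,4)" in the Windows listing does.

```c
#include <assert.h>

/* Hypothetical opcodes for illustration only; not Squeak's bytecode set. */
enum { OP_PUSH1 = 0, OP_ADD = 1, OP_HALT = 2 };

/* Runs a tiny stack program and returns the top of stack at OP_HALT.
   Requires GCC or Clang: "&&label" and "goto *ptr" are extensions. */
long run_threaded(const unsigned char *code)
{
    /* One label address per opcode, playing the role of jumpTable. */
    static void *const jump_table[] = { &&op_push1, &&op_add, &&op_halt };
    long stack[16];
    long *sp = stack;                 /* points one past the top of stack */
    const unsigned char *ip = code;   /* instruction pointer */
    unsigned char current;            /* current bytecode */

/* Fetch the next bytecode and jump straight to its handler. */
#define NEXT()                                   \
    do {                                         \
        current = *ip++;                         \
        assert(current <= OP_HALT);              \
        goto *jump_table[current];               \
    } while (0)

    NEXT();

op_push1:
    assert(sp < stack + 16);
    *sp++ = 1;
    NEXT();

op_add:
    assert(sp - stack >= 2);
    sp[-2] += sp[-1];
    sp--;
    NEXT();

op_halt:
    return sp > stack ? sp[-1] : 0;

#undef NEXT
}
```

Calling run_threaded on the byte sequence OP_PUSH1, OP_PUSH1, OP_ADD, OP_HALT leaves 2 on the stack. The point of the design is that there is no central switch: every handler has its own indirect jump, so the quality of the code around that jump (which registers hold ip and current) is paid for on every bytecode.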
The sqGnu.h I have reads

    #if defined(__i386__)
    # define IP_REG asm("%esi")
    # define SP_REG asm("%edi")
    //# if (__GNUC__ > 2) || ((__GNUC__ == 2) && (__GNUC_MINOR__ >= 95))
    # define CB_REG asm("%ebx")
    //# else
    //# define CB_REG /* avoid undue register pressure */
    //# endif
    #endif

The first two bytecodes assemble to this when done right:

    L10161:
        addl    $1, %esi
        movzbl  (%esi), %ebx
        addl    $4, %edi
        movl    _foo, %eax
        movl    84(%eax), %eax
        movl    4(%eax), %eax
        movl    %eax, (%edi)
        movl    512(%esp,%ebx,4), %eax
    L10421:
        jmp     *%eax
    L10162:
        addl    $1, %esi
        movzbl  (%esi), %ebx
        addl    $4, %edi
        movl    _foo, %eax
        movl    84(%eax), %eax
        movl    8(%eax), %eax
        movl    %eax, (%edi)
        movl    512(%esp,%ebx,4), %eax
        jmp     *%eax

    sqInt interpret(void) {
    #ifdef FOO_REG
        register struct foo * foo FOO_REG = &fum;
    #endif
        sqInt localReturnValue;
        sqInt localReturnContext;
        sqInt localHomeContext;
        char* localSP;
        char* localIP;
        sqInt currentBytecode;
        JUMP_TABLE;

Plus use of -DUSE_INLINE_MEMORY_ACCESSORS.

However, much of this also depends on the GCC version, in this case 4.01. Use of SP_REG etc. produced dreadful code with GCC 4.x, but was required for earlier versions.

For PowerPC I noted that building with GCC 3.3 produces better code than gcc 4.0, gcc 3.1 or gcc 2.95 (FYI, gcc 3.1 produces lousy code). But since you are building on Intel, your mileage will vary (lots).

On Apr 11, 2008, at 12:22 AM, Andreas Raab wrote:

> Yoshiki Ohshima wrote:
>> Apparently, %esi is used (exclusively) for IP, and %ebx keeps the
>> next
>> byte, and "jmp *" takes you to the next location stored in the table
>> starts at 0x2780.
>
> All of that comes straight out of sqGnu.h:
>
> #define BC_CASE(N) case N: _##N:
> #define BC_BREAK goto *jumpTable[currentBytecode]
>
> #if defined(__i386__)
> # define IP_REG asm("%esi")
> # define SP_REG asm("%edi")
> # define CB_REG asm("%ebx")
> #endif
>
> You might want to check if the gnuifier got confused over time - I
> had to update it to deal correctly with sqInt etc. gnu-interp.c
> should look like here:
>
> sqInt interpret(void) {
> sqInt localReturnValue;
> sqInt localReturnContext;
> sqInt localHomeContext;
> register char* localSP SP_REG;
> register char* localIP IP_REG;
> register sqInt currentBytecode CB_REG;
> BC_JUMP_TABLE;
>
> switch (currentBytecode) {
> BC_CASE(0)
> /* pushReceiverVariableBytecode */
> BC_BREAK;
>
>> %esi is almost used for IP but use %eax for fetching the next byte,
>> jmp also seems to use %eax so right before it is spilled and the
>> destination address is brought into %eax.
>
> Sounds more like the static register assignments get ignored.
>
> Cheers,
> - Andreas

--
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com

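One way to act on the observation that the register pinning helped older compilers but hurt with GCC 4.x is to gate the macros on the compiler version. The sketch below is only an illustration of that idea, not what any shipped sqGnu.h does: the "GCC older than 4" cutoff is an assumption taken from the remarks in this thread, and sum_until_zero is a hypothetical stand-in for the interpreter loop. On any other compiler or architecture the macros expand to nothing and the variables are ordinary locals.

```c
#include <assert.h>

/* Assumption for illustration: pin registers only on 32-bit x86 when
   building with GCC older than 4; otherwise let the compiler choose. */
#if defined(__i386__) && defined(__GNUC__) && (__GNUC__ < 4)
# define IP_REG asm("%esi")
# define SP_REG asm("%edi")
# define CB_REG asm("%ebx")
#else
# define IP_REG
# define SP_REG
# define CB_REG
#endif

/* Hypothetical stand-in for an interpreter loop: walks a zero-terminated
   byte sequence and sums the bytes, using the same declaration style as
   the localIP / currentBytecode variables in gnu-interp.c. */
long sum_until_zero(const unsigned char *code)
{
    register const unsigned char *localIP IP_REG = code;
    register long currentBytecode CB_REG;
    long total = 0;

    assert(code != 0);
    while ((currentBytecode = *localIP++) != 0)
        total += currentBytecode;
    return total;
}
```

Whether the cutoff belongs at 4, or needs to be decided per platform, is exactly the open question in this thread, so any such condition should be checked against tinyBenchmarks results and the generated assembly rather than taken on faith.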
>> %esi is almost used for IP but use %eax for fetching the next byte,
>> jmp also seems to use %eax so right before it is spilled and the
>> destination address is brought into %eax.
>
> Sounds more like the static register assignments get ignored.

It does not really get ignored, but the compiler performs more aggressive live range splitting (because it uses SSA in 4.x, live range splitting comes for free -- sometimes even if you don't want it...). OTOH the optimizer is better, which is why sends are faster.

GCC 3.x should be in the same ballpark as 2.95.

Paolo

At Fri, 11 Apr 2008 09:47:13 +0200,
Paolo Bonzini wrote:
>
> >> %esi is almost used for IP but use %eax for fetching the next byte,
> >> jmp also seems to use %eax so right before it is spilled and the
> >> destination address is brought into %eax.
> >
> > Sounds more like the static register assignments get ignored.
>
> It does not really get ignored, but the compiler performs more
> aggressive live range splitting (because it uses SSA in 4.x so live
> range splitting comes from free -- sometimes even if you don't want
> it...). OTOH the optimizer is better, which is why sends are faster.
>
> GCC 3.x should be in the same ballpark as 2.95.

Hmm, so a simple increment assigns a new value and the range is split there?

BTW, I got a report that says the result from gcc 3.3 was comparable with 4.1.

Now, I still have a feeling that there is some difference between the compiler in http://squeakvm.org/win32/release/Squeak-Win32-Tools-1.2.zip and the 2.95.2 I compiled on my Linux, as there seems to be some unexplained discrepancy.

Can anybody with a dual-boot machine do some more testing? (If the machine is the cute green and white one, that would be really interesting.)

-- Yoshiki

>> It does not really get ignored, but the compiler performs more
>> aggressive live range splitting (because it uses SSA in 4.x so live
>> range splitting comes from free -- sometimes even if you don't want
>> it...). OTOH the optimizer is better, which is why sends are faster.
>
> Hmm, so simple increment assigns a new value and the range is
> splitted there?

    while (i < j) {
        *a++ = *i++;
        *b++ = *i++;
        *c++ = *i++;
    }

The increments of a/b/c are not splittable, because each value is used again in the next loop iteration and obviously it has to be in the same register. But the increments of i can be rewritten as

    {
        *a++ = *i;
        temp1 = i + 1;
        *b++ = *temp1;
        temp2 = temp1 + 1;
        *c++ = *temp2;
        i = temp2 + 1;
    }

Most of the time register allocation will allocate temp1 and temp2 to the same register as i ("coalescing"), but there's no guarantee that it will.

Paolo

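The rewrite above can be checked by hand as an ordinary source-level transformation. The following sketch spells out both forms as compilable C so they can be compared; the function names copy3_original and copy3_split and the use of int elements are assumptions made for the example, and both functions assume the source range holds a multiple of three elements. The compiler performs this splitting on its internal representation, so the second function only shows the shape of what the register allocator sees, not literal GCC output.

```c
#include <assert.h>

/* Original form: the single pointer i is incremented three times
   per iteration. */
void copy3_original(int *a, int *b, int *c, const int *i, const int *j)
{
    assert((j - i) % 3 == 0);
    while (i < j) {
        *a++ = *i++;
        *b++ = *i++;
        *c++ = *i++;
    }
}

/* Split form: each intermediate value of i gets its own name, so it
   has its own (short) live range and may land in a different register
   unless the allocator coalesces temp1 and temp2 back into i. */
void copy3_split(int *a, int *b, int *c, const int *i, const int *j)
{
    assert((j - i) % 3 == 0);
    while (i < j) {
        const int *temp1;
        const int *temp2;

        *a++ = *i;
        temp1 = i + 1;
        *b++ = *temp1;
        temp2 = temp1 + 1;
        *c++ = *temp2;
        i = temp2 + 1;
    }
}
```

Both functions write the same values to a, b and c. In the interpreter, the analogue of i is the instruction pointer: if the temporaries are not coalesced into %esi, the result is the extra lea/mov pairs seen in the gcc 4.1.2 listing.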
Paolo Bonzini wrote:
> GCC 3.x should be in the same ballpark as 2.95.

I have not found this to be true. I tried multiple times (gcc 3.1, 3.3 and 3.4) and each time the resulting VM was *significantly* slower than 2.95.2.

Cheers,
  - Andreas

GCC 3.3 on PowerPC gives the best figures compared to 3.1, 2.95 or 4.01; however, since this is Intel, your mileage will vary. Also, setting -mtune= and -march= optimizes/de-optimizes decisions too.

On Apr 11, 2008, at 8:57 AM, Andreas Raab wrote:

> Paolo Bonzini wrote:
>> GCC 3.x should be in the same ballpark as 2.95.
>
> I have not found this to be true. I tried multiple times (gcc 3.1,
> 3.3 and 3.4) and each time the resulting VM was *significantly*
> slower than 2.95.2.
>
> Cheers,
> - Andreas

--
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com