The two areas most in need of improvement before a 1.0 are now runtime performance and reliability. Hopefully 0.15 will lead to a decent improvement in both. First, runtime performance, as the end of 0.14 involved a decent round of testing and debugging. Here are some benchmarks:

  arithmaticLoopBenchmark  417  compiled   94  ratio: 4.436
  bytecodeBenchmark        725  compiled  262  ratio: 2.767
  sendBenchmark            692  compiled  403  ratio: 1.717
  doLoopsBenchmark         389  compiled  385  ratio: 1.010
  pointCreation            423  compiled  426  ratio: 0.993
  largeExplorers           198  compiled  199  ratio: 0.995
  compilerBenchmark        245  compiled  249  ratio: 0.984
  Cumulative Time          401  compiled  260  ratio: 1.542

The primary goal is to improve the last two benchmarks, the two macro benchmarks. Both use a profiler to decide what to compile; the goal is to compile enough methods to make a difference reasonably quickly, so the benchmark doesn't take too long to run.

Here's the profile for compilerBenchmark:

  CPU: Core 2, speed 3005.67 MHz (estimated)
  Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
  Counted INST_RETIRED.ANY_P events (number of instructions retired) with a unit mask of 0x00 (No unit mask) count 100000

  samples    %        samples    %        image name           app name             symbol name
  4122385    62.5654  4860169    58.3687  squeak               squeak               interpret
  447635      6.7937  715498      8.5928  anon (tgid:6321 range:0xb1c91000-0xb7bf0000)  squeak  (no symbols)
  224375      3.4053  412666      4.9560  squeak               squeak               exuperyCreateContext
  157809      2.3951  269117      3.2320  squeak               squeak               exuperyIsNativeContext
  126427      1.9188  230642      2.7699  squeak               squeak               allocateheaderSizeh1h2h3doFillwith
  96316       1.4618  107350      1.2892  squeak               squeak               sweepPhase
  87506       1.3281  47304       0.5681  squeak               squeak               lookupMethodInClass
  53014       0.8046  71782       0.8621  squeak               squeak               markAndTrace
  52262       0.7932  84112       1.0102  squeak               squeak               exuperySetupMessageSend
  51999       0.7892  76301       0.9163  squeak               squeak               exuperyCallMethod
  50920       0.7728  84568       1.0156  squeak               squeak               instantiateContextsizeInBytes
  47231       0.7168  31047       0.3729  no-vmlinux           no-vmlinux           (no symbols)
  42841       0.6502  52130       0.6261  MiscPrimitivePlugin  MiscPrimitivePlugin  primitiveStringHash
  42560       0.6459  77447       0.9301  squeak               squeak               activateNewMethod

Only 14% of the time is going into code compiled by Exupery and its helper functions; 62% of the time is still in the main interpreter loop. Interestingly, the ratio between time in native code and time in exuperyCreateContext is the same as in the send benchmark, so it's likely that the native code is mostly doing send processing, either being called from interpreted code or sending to compiled code.

The native code is executing 1.6 instructions per cycle; the CPU maxes out at 4 instructions per cycle. 1.6 instructions per cycle would be excellent for an Athlon, and while Cores are more efficient, it's still good. Half of the time spent in exuperySetupMessageSend is going to dispatching to unhandled primitives; the other half will be going to sends to interpreted code.

There's a few obvious things to do to improve performance:

  * Implement more addressing modes
  * Natively compile calls to C primitives
  * Implement the ^ true, ^ false, and ^ nil primitives
  * Remove jumps to jumps

Implementing more addressing modes looks the most promising. It should speed up most of the benchmarks, as all but the bytecode benchmark spend significant time in code that suffers badly from a single missing addressing mode, especially object creation code and send/return code.

The current send optimisation is PICs, which only work when sending from compiled code to compiled code. Sends to and from interpreted code are about the same speed as, or a little slower than, interpreted-to-interpreted sends. It's true that this can be avoided by compiling more methods so that most sends are compiled-to-compiled, but it's much easier to decide what to compile if compiling anything is likely to lead to a speed improvement without risking a speed loss.
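The PIC idea above can be sketched in a few lines of Python. This is purely an illustration of the technique (a per-send-site cache of class-to-method entries with a slow-path fallback); Exupery's real PICs are patched native code, and names like SendSite and full_lookup are hypothetical:

```python
# Sketch of a polymorphic inline cache (PIC): each send site caches a few
# (receiver class, method) entries and falls back to a full method lookup
# on a miss. Illustrative only; not Exupery's actual implementation.

PIC_LIMIT = 8  # entries before a site would be treated as megamorphic

class SendSite:
    def __init__(self, selector, full_lookup):
        self.selector = selector
        self.full_lookup = full_lookup  # slow path: the interpreter's lookup
        self.entries = []               # cached (receiver class, method) pairs

    def send(self, receiver, *args):
        cls = type(receiver)
        for cached_class, method in self.entries:  # fast path: scan the PIC
            if cached_class is cls:
                return method(receiver, *args)
        method = self.full_lookup(cls, self.selector)  # miss: full dispatch
        if len(self.entries) < PIC_LIMIT:
            self.entries.append((cls, method))  # grow the cache for next time
        return method(receiver, *args)

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def sum(self):
        return self.x + self.y

site = SendSite('sum', lambda cls, selector: getattr(cls, selector))
first = site.send(Point(1, 2))   # miss: full lookup, then cached
second = site.send(Point(3, 4))  # hit: dispatched through the cached entry
```

The second send never reaches full_lookup, which is exactly the saving a PIC buys; the problem described above is that this fast path only exists for compiled-to-compiled sends.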
Compiling the call to the primitive function into native code will allow primitives to be dispatched via PIC instead of needing to go through exuperySetupMessageSend. Half of the calls to exuperySetupMessageSend in the compiler benchmark are for primitives; in the large explorers benchmark, three quarters of the calls are for primitives. That time will disappear. Evaluating blocks uses a primitive send, which takes a large proportion of the block dispatch time.

There's a handful of primitives that are implemented inside the main interpret loop; ^ true, ^ false, and ^ nil are some of them. They often show up when they fail to inline, as Exupery cannot yet compile them. If code uses them, then compiling it will cause a large time loss, because a full primitive dispatch is used where the interpreter has a quick in-line case. Given how simple they are, implementing them makes sense.

Exupery can create code that jumps directly to an unconditional jump, and this does happen in some inner loops. Such jumps should be modified to go straight to the target jump's destination. Jumping to a jump makes the CPU front end's life difficult: in the compiler benchmark the reservation stations are full for only 9% of the time, which indicates that for most of the time the front end cannot keep up with instruction execution.

Here's an example of the kind of code that's commonly generated with addressing mode problems. This example is from the method return sequence.
Every compiled method goes through a block like this when returning:

  (block24
    (mov #nilObj eax)
    (mov (eax) eax)
    (mov eax (8 ecx))
    (mov #activeContext eax)
    (mov ebx (eax))
    (mov #youngStart eax)
    (mov #activeContext ebx)
    (mov (ebx) ebx)
    (cmp (eax) ebx)
    (jumpUnsignedGreaterEqualThan block25)
    (mov #activeContext eax)
    (mov (eax) eax)
    (mov (eax) ebx)
    (mov 1073741824 eax)
    (and ebx eax)
    (jnz block25)
    (mov 2400 eax)
    (mov #rootTableCount ecx)
    (cmp eax (ecx))
    (jumpSignedGreaterEqualThan block26)
    (jmp block27))

The problem is instructions like "(mov #nilObj eax)": the address should be encoded in the memory access that uses it. There's no need to move an address into a register before using it. There are other problems besides not handling literal indirect addressing, but the literal indirect problem is the largest by a long shot. I'm going to add literal indirect addressing first; it's harder to estimate what it'll do to overall performance, but it is a problem for almost all the benchmarks.

It would also be worthwhile improving the profiling tools. It should be relatively easy to get oprofile to show the compiled method names instead of lumping all compiled code into the "anon" memory bucket. It would also be worthwhile, and easy, to write some code to read the oprofile files and compute the ratios rather than calculating them by hand.

Bryce

_______________________________________________
Exupery mailing list
[hidden email]
http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/exupery
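The literal-indirect fix is essentially a peephole rewrite. Here is a sketch over tuples that mimic the listing above; the tuple encoding is purely illustrative, not Exupery's real instruction objects:

```python
# Peephole sketch: fold the pair
#     (mov #lit reg) (mov (reg) reg)
# into a single literal-indirect load
#     (mov (#lit) reg)
# so the literal address is encoded in the memory access that uses it.
# The fold is safe here because the second mov overwrites reg anyway, so
# the address never needs to live in the register.

def fold_literal_indirect(code):
    out = []
    i = 0
    while i < len(code):
        a = code[i]
        b = code[i + 1] if i + 1 < len(code) else None
        if (a[0] == 'mov' and isinstance(a[1], str) and a[1].startswith('#')
                and b == ('mov', (a[2],), a[2])):
            out.append(('mov', (a[1],), a[2]))  # one literal-indirect load
            i += 2
        else:
            out.append(a)
            i += 1
    return out

before = [('mov', '#nilObj', 'eax'),
          ('mov', ('eax',), 'eax'),
          ('mov', 'eax', (8, 'ecx'))]
after = fold_literal_indirect(before)
```

In the real compiler the win comes from selecting the literal-indirect addressing mode during code generation rather than patching it up afterwards, but the before/after shapes are the same: three instructions become two, and a register-to-register dependency disappears.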
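The jump-to-jump removal mentioned earlier is a standard jump-threading pass. A minimal sketch, using a hypothetical dict-of-blocks IR rather than Exupery's actual intermediate representation:

```python
# Jump threading: when a jump's target block consists solely of an
# unconditional jump, redirect the original jump straight to that jump's
# destination, sparing the CPU front end a useless redirect.

JUMP_OPCODES = {'jmp', 'jnz', 'jumpUnsignedGreaterEqualThan',
                'jumpSignedGreaterEqualThan'}

def thread_jumps(blocks):
    def final_target(label, seen=()):
        body = blocks.get(label, [])
        # A block that is a single unconditional jump can be skipped over;
        # 'seen' guards against cycles of trampoline blocks.
        if len(body) == 1 and body[0][0] == 'jmp' and label not in seen:
            return final_target(body[0][1], seen + (label,))
        return label

    for body in blocks.values():
        for i, instr in enumerate(body):
            if instr[0] in JUMP_OPCODES:
                body[i] = (instr[0], final_target(instr[1]))
    return blocks

blocks = {
    'block24': [('jnz', 'block26'), ('jmp', 'block27')],
    'block26': [('jmp', 'block27')],  # pure trampoline: a jump to a jump
    'block27': [('ret',)],
}
thread_jumps(blocks)
```

After the pass, block24 branches straight to block27 on both paths; block26 becomes unreachable and a later dead-block sweep could delete it.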
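The ratio-computing helper suggested at the end could be only a few lines of Python. This sketch assumes the two-event column layout shown in the profile above (CPU_CLK_UNHALTED samples and percent, then INST_RETIRED.ANY_P samples and percent); real opreport output varies by version, so the parsing is an assumption:

```python
# Sketch: parse opreport-style data rows and compute instructions retired
# per unhalted cycle for each row, instead of doing the division by hand.
# Taking parts[-1] as the symbol name is a crude simplification.

def parse_report(lines):
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or short lines
        try:
            cycles, _, instructions, _ = (float(p) for p in parts[:4])
        except ValueError:
            continue  # header line, not a data row
        rows.append({'symbol': parts[-1],
                     'cycles': cycles,
                     'instructions': instructions,
                     'ipc': instructions / cycles})
    return rows

report = [
    "samples  %        samples  %        image name  app name  symbol name",
    "4122385  62.5654  4860169  58.3687  squeak      squeak    interpret",
    "447635    6.7937  715498    8.5928  anon        squeak    (no symbols)",
]
rows = parse_report(report)
```

Run over the two sample rows, it reproduces the figures quoted above: roughly 1.2 instructions per cycle for the interpreter loop and roughly 1.6 for the anon (native code) bucket.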