The two areas most in need of improvement before a 1.0 are now runtime performance and reliability. Hopefully 0.15 will lead to a decent improvement in both. First, runtime performance, as the end of 0.14 involved a decent round of testing and debugging. Here are some benchmarks:

  arithmaticLoopBenchmark  417  compiled   94  ratio: 4.436
  bytecodeBenchmark        725  compiled  262  ratio: 2.767
  sendBenchmark            692  compiled  403  ratio: 1.717
  doLoopsBenchmark         389  compiled  385  ratio: 1.010
  pointCreation            423  compiled  426  ratio: 0.993
  largeExplorers           198  compiled  199  ratio: 0.995
  compilerBenchmark        245  compiled  249  ratio: 0.984
  Cumulative Time          401  compiled  260  ratio: 1.542

The primary goal is to improve the last two benchmarks, the two macro benchmarks. Both use a profiler to decide what to compile; the goal is to compile enough methods to make a difference reasonably quickly, so the benchmark doesn't take too long to run.

Here's the profile for compilerBenchmark:

  CPU: Core 2, speed 3005.67 MHz (estimated)
  Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
  Counted INST_RETIRED.ANY_P events (number of instructions retired) with a unit mask of 0x00 (No unit mask) count 100000

  samples    %        samples    %        image name           app name             symbol name
  4122385    62.5654  4860169    58.3687  squeak               squeak               interpret
  447635      6.7937  715498      8.5928  anon (tgid:6321 range:0xb1c91000-0xb7bf0000)  squeak  (no symbols)
  224375      3.4053  412666      4.9560  squeak               squeak               exuperyCreateContext
  157809      2.3951  269117      3.2320  squeak               squeak               exuperyIsNativeContext
  126427      1.9188  230642      2.7699  squeak               squeak               allocateheaderSizeh1h2h3doFillwith
  96316       1.4618  107350      1.2892  squeak               squeak               sweepPhase
  87506       1.3281  47304       0.5681  squeak               squeak               lookupMethodInClass
  53014       0.8046  71782       0.8621  squeak               squeak               markAndTrace
  52262       0.7932  84112       1.0102  squeak               squeak               exuperySetupMessageSend
  51999       0.7892  76301       0.9163  squeak               squeak               exuperyCallMethod
  50920       0.7728  84568       1.0156  squeak               squeak               instantiateContextsizeInBytes
  47231       0.7168  31047       0.3729  no-vmlinux           no-vmlinux           (no symbols)
  42841       0.6502  52130       0.6261  MiscPrimitivePlugin  MiscPrimitivePlugin  primitiveStringHash
  42560       0.6459  77447       0.9301  squeak               squeak               activateNewMethod

Only 14% of the time is going into code compiled by Exupery and its helper functions; 62% of the time is still in the main interpreter loop. Interestingly, the ratio between time in native code and time in exuperyCreateContext is the same as in the send benchmark, so it's likely that the native code is mostly doing send processing, either being called from interpreted code or sending to compiled code.

The native code is executing 1.6 instructions per cycle; the CPU maxes out at 4 instructions per cycle. 1.6 instructions per cycle would be excellent for an Athlon, and while Cores are more efficient, it's still good. Half of the time spent in exuperySetupMessageSend is going to dispatching to unhandled primitives; the other half will be going to sends to interpreted code.

There's a few obvious things to do to improve performance:

  * Implement more addressing modes
  * Natively compile calls to C primitives
  * Implement the ^ true, ^ false, and ^ nil primitives
  * Remove jumps to jumps

Implementing more addressing modes looks the most promising. It should speed up most of the benchmarks, as all but the bytecode benchmark spend significant time in code that suffers badly from a single missing addressing mode, especially object creation code and send/return code.

The current send optimisation is PICs, which only work when sending from compiled code to compiled code. Sends to and from interpreted code are about the same speed as, or a little slower than, interpreted-to-interpreted sends. It's true that this can be avoided by compiling more methods so that most sends are compiled-to-compiled, but it's much easier to decide what to compile if compiling anything is likely to lead to a speed improvement without risking a speed loss.
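The PIC idea above can be sketched in a few lines of Python. This is purely an illustration of the technique (a per-send-site cache of class-to-method entries with a slow-path fallback); Exupery's real PICs are patched native code, and names like SendSite and full_lookup are hypothetical:

```python
# Sketch of a polymorphic inline cache (PIC): each send site caches a few
# (receiver class, method) entries and falls back to a full method lookup
# on a miss. Illustrative only; not Exupery's actual implementation.

PIC_LIMIT = 8  # entries before a site would be treated as megamorphic

class SendSite:
    def __init__(self, selector, full_lookup):
        self.selector = selector
        self.full_lookup = full_lookup  # slow path: the interpreter's lookup
        self.entries = []               # cached (receiver class, method) pairs

    def send(self, receiver, *args):
        cls = type(receiver)
        for cached_class, method in self.entries:  # fast path: scan the PIC
            if cached_class is cls:
                return method(receiver, *args)
        method = self.full_lookup(cls, self.selector)  # miss: full dispatch
        if len(self.entries) < PIC_LIMIT:
            self.entries.append((cls, method))  # grow the cache for next time
        return method(receiver, *args)

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def sum(self):
        return self.x + self.y

site = SendSite('sum', lambda cls, selector: getattr(cls, selector))
first = site.send(Point(1, 2))   # miss: full lookup, then cached
second = site.send(Point(3, 4))  # hit: dispatched through the cached entry
```

The second send never reaches full_lookup, which is exactly the saving a PIC buys; the problem described above is that this fast path only exists for compiled-to-compiled sends.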
Compiling the call to the primitive function into native code will allow primitives to be dispatched via PIC instead of needing to go through exuperySetupMessageSend. Half of the calls to exuperySetupMessageSend in the compiler benchmark are for primitives; in the large explorers benchmark, three quarters of the calls are for primitives. That time will disappear. Evaluating blocks uses a primitive send, which takes a large proportion of the block dispatch time.

There's a handful of primitives that are implemented inside the main interpret loop; ^ true, ^ false, and ^ nil are some of them. They often show up when they fail to inline, as Exupery cannot yet compile them. If code uses them, then compiling it will cause a large time loss, because a full primitive dispatch is used where the interpreter has a quick in-line case. Given how simple they are, implementing them makes sense.

Exupery can create code that jumps directly to an unconditional jump, and this does happen in some inner loops. Such jumps should be modified to go straight to the target jump's destination. Jumping to a jump makes the CPU front end's life difficult: in the compiler benchmark the reservation stations are full for only 9% of the time, which indicates that for most of the time the front end cannot keep up with instruction execution.

Here's an example of the kind of code that's commonly generated with addressing mode problems. This example is from the method return sequence.
Every compiled method goes through a block like this when returning:

  (block24
    (mov #nilObj eax)
    (mov (eax) eax)
    (mov eax (8 ecx))
    (mov #activeContext eax)
    (mov ebx (eax))
    (mov #youngStart eax)
    (mov #activeContext ebx)
    (mov (ebx) ebx)
    (cmp (eax) ebx)
    (jumpUnsignedGreaterEqualThan block25)
    (mov #activeContext eax)
    (mov (eax) eax)
    (mov (eax) ebx)
    (mov 1073741824 eax)
    (and ebx eax)
    (jnz block25)
    (mov 2400 eax)
    (mov #rootTableCount ecx)
    (cmp eax (ecx))
    (jumpSignedGreaterEqualThan block26)
    (jmp block27))

The problem is instructions like "(mov #nilObj eax)": the address should be encoded in the memory access that uses it. There's no need to move an address into a register before using it. There are other problems besides not handling literal indirect addressing, but the literal indirect problem is the largest by a long shot. I'm going to add literal indirect addressing first; it's harder to estimate what it'll do to overall performance, but it is a problem for almost all the benchmarks.

It would also be worthwhile improving the profiling tools. It should be relatively easy to get oprofile to show the compiled method names instead of lumping all compiled code into the "anon" memory bucket. It would also be worthwhile, and easy, to write some code to read the oprofile files and compute the ratios rather than calculating them by hand.

Bryce

_______________________________________________
Exupery mailing list
[hidden email]
http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/exupery
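The literal-indirect fix is essentially a peephole rewrite. Here is a sketch over tuples that mimic the listing above; the tuple encoding is purely illustrative, not Exupery's real instruction objects:

```python
# Peephole sketch: fold the pair
#     (mov #lit reg) (mov (reg) reg)
# into a single literal-indirect load
#     (mov (#lit) reg)
# so the literal address is encoded in the memory access that uses it.
# The fold is safe here because the second mov overwrites reg anyway, so
# the address never needs to live in the register.

def fold_literal_indirect(code):
    out = []
    i = 0
    while i < len(code):
        a = code[i]
        b = code[i + 1] if i + 1 < len(code) else None
        if (a[0] == 'mov' and isinstance(a[1], str) and a[1].startswith('#')
                and b == ('mov', (a[2],), a[2])):
            out.append(('mov', (a[1],), a[2]))  # one literal-indirect load
            i += 2
        else:
            out.append(a)
            i += 1
    return out

before = [('mov', '#nilObj', 'eax'),
          ('mov', ('eax',), 'eax'),
          ('mov', 'eax', (8, 'ecx'))]
after = fold_literal_indirect(before)
```

In the real compiler the win comes from selecting the literal-indirect addressing mode during code generation rather than patching it up afterwards, but the before/after shapes are the same: three instructions become two, and a register-to-register dependency disappears.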
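The jump-to-jump removal mentioned earlier is a standard jump-threading pass. A minimal sketch, using a hypothetical dict-of-blocks IR rather than Exupery's actual intermediate representation:

```python
# Jump threading: when a jump's target block consists solely of an
# unconditional jump, redirect the original jump straight to that jump's
# destination, sparing the CPU front end a useless redirect.

JUMP_OPCODES = {'jmp', 'jnz', 'jumpUnsignedGreaterEqualThan',
                'jumpSignedGreaterEqualThan'}

def thread_jumps(blocks):
    def final_target(label, seen=()):
        body = blocks.get(label, [])
        # A block that is a single unconditional jump can be skipped over;
        # 'seen' guards against cycles of trampoline blocks.
        if len(body) == 1 and body[0][0] == 'jmp' and label not in seen:
            return final_target(body[0][1], seen + (label,))
        return label

    for body in blocks.values():
        for i, instr in enumerate(body):
            if instr[0] in JUMP_OPCODES:
                body[i] = (instr[0], final_target(instr[1]))
    return blocks

blocks = {
    'block24': [('jnz', 'block26'), ('jmp', 'block27')],
    'block26': [('jmp', 'block27')],  # pure trampoline: a jump to a jump
    'block27': [('ret',)],
}
thread_jumps(blocks)
```

After the pass, block24 branches straight to block27 on both paths; block26 becomes unreachable and a later dead-block sweep could delete it.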
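The ratio-computing helper suggested at the end could be only a few lines of Python. This sketch assumes the two-event column layout shown in the profile above (CPU_CLK_UNHALTED samples and percent, then INST_RETIRED.ANY_P samples and percent); real opreport output varies by version, so the parsing is an assumption:

```python
# Sketch: parse opreport-style data rows and compute instructions retired
# per unhalted cycle for each row, instead of doing the division by hand.
# Taking parts[-1] as the symbol name is a crude simplification.

def parse_report(lines):
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or short lines
        try:
            cycles, _, instructions, _ = (float(p) for p in parts[:4])
        except ValueError:
            continue  # header line, not a data row
        rows.append({'symbol': parts[-1],
                     'cycles': cycles,
                     'instructions': instructions,
                     'ipc': instructions / cycles})
    return rows

report = [
    "samples  %        samples  %        image name  app name  symbol name",
    "4122385  62.5654  4860169  58.3687  squeak      squeak    interpret",
    "447635    6.7937  715498    8.5928  anon        squeak    (no symbols)",
]
rows = parse_report(report)
```

Run over the two sample rows, it reproduces the figures quoted above: roughly 1.2 instructions per cycle for the interpreter loop and roughly 1.6 for the anon (native code) bucket.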