Posted by Bryce Kampjes on Dec 21, 2007; 9:00pm
URL: https://forum.world.st/Thinking-about-Exupery-0-14-tp133531p133535.html
Igor Stasenko writes:
>
> I suspect that main bottleneck in largeExplorers is not
> compiled/bytecode code, but memory allocations and GC.
> So, i doubt that you can gain any performance increase here.
>
Below are the raw numbers. They come from largeExplorers, but with the
profiling compiler turned up to compile a bit more code. About 60% of
the time is going into the interpreter, compiled code, and primitives
that should be natively compiled. That's enough time to provide a
decent speed improvement. Normally about 70% of the time is spent in
the interpreter. The GC is probably only consuming about 5% of the time.
CPU: AMD64 processors, speed 1000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 1000000
Counted LS_BUFFER_FULL events (LS Buffer 2 Full) with a unit mask of 0x00 (No unit mask) count 100000
Counted RETIRED_BRANCHES_MISPREDICTED events (Retired branches mispredicted) with a unit mask of 0x00 (No unit mask) count 100000
Counted RETIRED_INSNS events (Retired instructions (includes exceptions, interrupts, re-syncs)) with a unit mask of 0x00 (No unit mask) count 1000000
cycles           LS buffer full   mispredicts      instructions
samples       %  samples       %  samples       %  samples       %  image name                                    app name      symbol name
1792637 57.4476    65779 12.9048   391686 84.8932  1970781 58.4824  squeak                                        squeak        interpret
 223739  7.1700      150  0.0294      211  0.0457   376848 11.1829  BitBltPlugin                                  BitBltPlugin  alphaBlendwith
 110588  3.5439    48505  9.5159     1302  0.2822   104008  3.0864  BitBltPlugin                                  BitBltPlugin  copyBits
  76361  2.4471   124873 24.4982     3679  0.7974    64529  1.9149  libc-2.4.so                                   libc-2.4.so   (no symbols)
  65089  2.0859    91160 17.8842     2155  0.4671    12648  0.3753  no-vmlinux                                    no-vmlinux    (no symbols)
  60351  1.9340    51072 10.0195     2297  0.4978    31543  0.9360  anon (tgid:6681 range:0xb1c0d000-0xb7b6c000)  squeak        (no symbols)
  52940  1.6965        5 9.8e-04     1632  0.3537    82896  2.4599  B2DPlugin                                     B2DPlugin     fillSpanfromto
  45634  1.4624    12633  2.4784     2849  0.6175    59950  1.7790  BitBltPlugin                                  BitBltPlugin  copyLoopPixMap
  39297  1.2593        1 2.0e-04      161  0.0349    12604  0.3740  squeak                                        squeak        sweepPhase
  34504  1.1057       31  0.0061     3079  0.6673    10196  0.3026  squeak                                        squeak        lookupMethodInClass
  31797  1.0190       19  0.0037     3408  0.7386    39560  1.1739  squeak                                        squeak        markAndTrace
  28467  0.9123       11  0.0022     1757  0.3808    15187  0.4507  squeak                                        squeak        updatePointersInRangeFromto
  28316  0.9074      197  0.0386     2484  0.5384    22647  0.6720  BitBltPlugin                                  BitBltPlugin  loadBitBltFromwarping
  26469  0.8482        7  0.0014      739  0.1602    39972  1.1862  squeak                                        squeak        finalizeReference
  26380  0.8454       15  0.0029      342  0.0741    38994  1.1571  squeak                                        squeak        updatePointersInRootObjectsFromto
  24235  0.7766      727  0.1426      522  0.1131    29049  0.8620  squeak                                        squeak        exuperyIsNativeContext
  17343  0.5558        7  0.0014      790  0.1712     9012  0.2674  squeak                                        squeak        positive32BitValueOf
  15935  0.5107     5546  1.0880      336  0.0728    22525  0.6684  squeak                                        squeak        allocateheaderSizeh1h2h3doFillwith
  15559  0.4986     2712  0.5321     1191  0.2581    25552  0.7582  BitBltPlugin                                  BitBltPlugin  pixPaintwith
  14371  0.4605       11  0.0022     3301  0.7155    19116  0.5673  squeak                                        squeak        commonAt
  13348  0.4278      466  0.0914      383  0.0830     6048  0.1795  squeak                                        squeak        lookupSelectorclass
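As a rough cross-check of the "about 5%" GC estimate, the sketch below sums the cycle percentages of the symbols in the profile above that look like garbage collector work. Which symbols count as GC is an assumption made here for illustration, not something stated in the profile, and symbols that fall below the listed rows are not included.

```python
# Percent of CPU_CLK_UNHALTED samples per symbol, copied from the profile above.
# ASSUMPTION: these five symbols are the garbage collector's share of the
# listed rows; the grouping is illustrative and not part of the oprofile output.
gc_cycle_percent = {
    "sweepPhase": 1.2593,
    "markAndTrace": 1.0190,
    "updatePointersInRangeFromto": 0.9123,
    "finalizeReference": 0.8482,
    "updatePointersInRootObjectsFromto": 0.8454,
}

# Total share of cycles attributed to the assumed GC symbols.
gc_share = sum(gc_cycle_percent.values())
print(f"GC share of cycles: {gc_share:.2f}%")
```

Under that grouping the listed GC symbols add up to a little under 5% of the cycles.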
The anon block is the native code. What's interesting is that its
instructions per clock is about 0.5, while the interpreter's
instructions per clock is a little over one. The native code has fewer
branch mispredicts but much more memory traffic. About 8% of the time
the native code has the load/store unit's buffer full and is probably
stalled waiting for a memory request to finish.
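Those figures can be rederived from the sample counts, roughly as sketched below. Each sample stands for the number of events given by the "count" value on the matching "Counted ..." line, so samples have to be scaled before events are compared. The dictionary layout and function name are made up for this sketch, and treating each LS_BUFFER_FULL event as one cycle with the buffer full is an assumption.

```python
# Events represented by one sample, taken from the "Counted ..." lines.
CYCLES_PER_SAMPLE = 1_000_000       # CPU_CLK_UNHALTED
LS_FULL_PER_SAMPLE = 100_000        # LS_BUFFER_FULL
MISPREDICTS_PER_SAMPLE = 100_000    # RETIRED_BRANCHES_MISPREDICTED
INSNS_PER_SAMPLE = 1_000_000        # RETIRED_INSNS

# (cycle, LS-buffer-full, mispredict, instruction) samples from the profile.
rows = {
    "interpreter": (1792637, 65779, 391686, 1970781),
    "native code (anon)": (60351, 51072, 2297, 31543),
}

def derived_metrics(cycle_s, ls_full_s, mispredict_s, insn_s):
    """Scale sample counts to event counts and compute ratios."""
    cycles = cycle_s * CYCLES_PER_SAMPLE
    ls_full = ls_full_s * LS_FULL_PER_SAMPLE
    mispredicts = mispredict_s * MISPREDICTS_PER_SAMPLE
    insns = insn_s * INSNS_PER_SAMPLE
    return {
        "ipc": insns / cycles,
        # ASSUMPTION: one LS_BUFFER_FULL event is one cycle with the buffer full.
        "ls_full_fraction": ls_full / cycles,
        "mispredicts_per_1000_insns": 1000 * mispredicts / insns,
    }

metrics = {name: derived_metrics(*samples) for name, samples in rows.items()}
for name, m in metrics.items():
    print(f"{name}: IPC {m['ipc']:.2f}, "
          f"LS buffer full {100 * m['ls_full_fraction']:.1f}% of cycles, "
          f"{m['mispredicts_per_1000_insns']:.1f} mispredicts per 1000 instructions")
```

Note that the mispredict rate is taken per retired instruction rather than per branch, since the profile does not count retired branches.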
Based on the profiling I've done, I'm fairly confident that one of the
reasons why the macro benchmarks often do not show a performance
improvement on an Athlon 64 is excess spill code causing too
much memory traffic. The register allocator is not handling heavy
register pressure well, and I doubt the spill heuristics are ideal for
larger methods.
Bryce