The primary goal for the next releases will be making the following
benchmarks more compelling. I've added a compile-time benchmark, as
there are a few performance bugs in the compiler that should be
removed.

  arithmaticLoopBenchmark 1396 compiled  128 ratio: 10.906
  bytecodeBenchmark       2111 compiled  460 ratio:  4.589
  sendBenchmark           1637 compiled  668 ratio:  2.451
  doLoopsBenchmark        1081 compiled  715 ratio:  1.512
  pointCreation           1245 compiled 1317 ratio:  0.945
  largeExplorers           728 compiled  715 ratio:  1.018
  compilerBenchmark        483 compiled  489 ratio:  0.988
  Cumulative Time         1125 compiled  537 ratio:  2.093

  ExuperyBenchmarks>>arithmeticLoop                 249ms
  SmallInteger>>benchmark                          1112ms
  InstructionStream>>interpretExtension:in:for:  113460ms
  Average                                        3155.360

First, I'll get the register allocator to allocate each section of a
method separately. After that I'll probably do some work on further
optimising the register allocator, though I might work on improving the
generated native code instead.

Register allocating each section separately will allow both better and
faster allocation: it makes it easy to avoid dealing with registers and
interference from other sections of the code, and it reduces the size
of the problem. A well-written colouring register allocator should run
in about n log n time on average, but the performance bugs will raise
that to something more like n^2.

It's possible that just allocating each section of the method
separately will be enough to bring allocation down to a reasonable
time. It should definitely help for the larger methods, but it is
unlikely to do anything for the arithmaticLoop benchmark and will only
help the bytecode benchmark slightly. Compiling more quickly will make
it easier to run more extensive tests.

Bryce
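A minimal sketch of the per-section idea, assuming hypothetical
sections, liveRanges, interferenceGraphFor: and colour: selectors
rather than Exupery's actual interface: each section gets its own, much
smaller interference graph, so the colouring cost is paid on several
small problems instead of one large one.

  "Sketch only: every selector here is hypothetical. Allocating per
   section keeps each interference graph small and free of
   interference from the rest of the method."
  allocateRegistersFor: aMethod
      aMethod sections do: [:section |
          | graph |
          graph := self interferenceGraphFor: section liveRanges.
          self colour: graph]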
On 17/12/2007, [hidden email] <[hidden email]> wrote:
> The primary goal for the next releases will be making the following
> benchmarks more compelling. [...] Compiling quicker will make it
> easier to run more extensive tests.

Do you make any distinction between calling a compiled method and, for
instance, a primitive function?

As I remember, you compile methods into a form of routine that can be
called using the cdecl convention. But beyond that, knowing that you
are calling a compiled method, you can use register optimisations such
as passing arguments in registers, and in general, by knowing where you
change registers, you can predict which of them change across a call
and which stay unchanged. And, of course, nothing stops you from using
your own calling convention to make the code run faster. There are also
the MMX/SSE registers, which can be used for various purposes. All of
the above, depending on the choices made, can greatly improve send
speed. I just want to know what you think about it.

And a small trick when compiling SmallInteger methods: you already know
that the receiver is a SmallInteger, so by using that knowledge some
tests can be omitted. You can deal in the same manner with compiling
methods for classes that have byte- or reference-indexed instances. Too
bad it's not Slate, where you know the types of all method arguments;
in Smalltalk it's just the receiver :)

--
Best regards,
Igor Stasenko AKA sig.
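For context on the "tests can be omitted" point: in Squeak's 32-bit
object format a SmallInteger oop is tagged with its low bit set, so the
receiver guard is a single bit test. A sketch of the guard a
receiver-specialised method could drop, with entirely hypothetical
emitter selectors:

  "Hypothetical emitter sketch. The low tag bit identifies a
   SmallInteger oop in Squeak, so the guard is one bit test; a method
   compiled for a known SmallInteger receiver can omit it."
  emitReceiverGuard
      self receiverIsKnownSmallInteger ifTrue: [^self].
      self emitTestLowBitOf: self receiverRegister
           ifZero: [self emitJumpToInterpreter]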
Igor Stasenko writes:
> On 17/12/2007, [hidden email] <[hidden email]> wrote:
> > arithmaticLoopBenchmark 1396 compiled 128 ratio: 10.906
> > [...]
> > Cumulative Time 1125 compiled 537 ratio: 2.093
> >
> > ExuperyBenchmarks>>arithmeticLoop 249ms
> > SmallInteger>>benchmark 1112ms
> > InstructionStream>>interpretExtension:in:for: 113460ms
> > Average 3155.360

First, from the numbers above, I'd say that having a method that takes
two minutes to compile is currently the biggest practical problem. The
second set of numbers is the compilation-time benchmark. The second
biggest problem is that a 2.4-times increase in send speed is not
carrying through to the two macro-benchmarks (largeExplorers and
compilerBenchmark).

> Do you make any distinction between calling a compiled method and,
> for instance, a primitive function?

The sender doesn't know whether it's sending to a primitive or to a
full method. If Exupery compiles a primitive then it executes in the
sender's context, just as the interpreter does.

> As I remember, you compile methods into a form of routine that can
> be called using the cdecl convention. [...] All of the above,
> depending on the choices made, can greatly improve send speed.

Currently Exupery uses C's calling conventions combined with the
interpreter's handling of contexts. There's plenty of room to improve
this, but I doubt that raw send speed is why the macro benchmarks
aren't performing.

Also, full method inlining will change the value of other send
optimisations by removing most of the common sends; it's the best
optimisation for common sends. 1.0 is a base to add full method
inlining to.

> And a small trick when compiling SmallInteger methods: you already
> know that the receiver is a SmallInteger, so by using that knowledge
> some tests can be omitted. [...]

Exupery compiles a method for each receiver, so this is possible but
not done yet. It'll get even more interesting when combined with full
method inlining; then common self sends will become completely free.

Bryce
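A sketch of what "compiles a method for each receiver" implies for code
lookup; the cache shape and every selector here are hypothetical, not
Exupery's actual interface:

  "Hypothetical sketch: native code is keyed by both the method and
   the receiver's class, which is what lets the compiler specialise
   on the receiver."
  nativeCodeFor: aCompiledMethod receiverClass: aClass
      ^nativeCodeCache
          at: (Array with: aCompiledMethod with: aClass)
          ifAbsentPut: [self compile: aCompiledMethod
                             forReceiverClass: aClass]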
On 18/12/2007, [hidden email] <[hidden email]> wrote:
> Igor Stasenko writes:
> > First, from the numbers above, I'd say that having a method that
> > takes two minutes to compile is currently the biggest practical
> > problem. [...] The second biggest problem is that a 2.4-times
> > increase in send speed is not carrying through to the two
> > macro-benchmarks (largeExplorers and compilerBenchmark).

I suspect that the main bottleneck in largeExplorers is not compiled or
bytecode execution but memory allocation and GC, so I doubt you can
gain much performance there.

> Exupery compiles a method for each receiver, so this is possible but
> not done yet. It'll get even more interesting when combined with
> full method inlining; then common self sends will become completely
> free.

This is close to an approach I have been experimenting with:
translating method source to a set of lambda-sends. Lambdas are a very
simple and yet powerful way to represent an abstract algorithm. With
only a few rules - substitution and reduction - I can transform method
code into any form, and that is very helpful in the case of full method
inlining.
A method can be represented as a lambda with a single free variable,
<context>: lambda method(context), where any contextual parts and
actions are represented as messages to the context, like:

  receiver:        Lambda context receiver.
  receiver class:  Lambda context receiver class.
  argument 1:      Lambda context argumentAt: 1.
  push:            Lambda context push: arg.
  return:          Lambda context return: expression.

etc. But when you are going to inline a method, things become
interesting: if you know the receiver's class, you can reduce some
lambdas at compile time, such as any accesses to the receiver's class,
the results of method lookups on the receiver, etc. And even more: if
you have a way to determine that methods are referentially transparent,
you can reduce some of them to their returned results at compile time.
For example, having:

  isInteger
      ^ true

then, when in some method you encounter

  self isInteger
      ifTrue: [ codeA ]
      ifFalse: [ codeB ]

you can reduce the whole expression to codeA or codeB, depending on the
class of the receiver. So, in the given example, knowing the class of
the receiver you can eliminate two sends: #isInteger and
#ifTrue:ifFalse:. I think that making the compiler able to detect and
perform such reductions would render other kinds of optimisation much
less important.

As I understand it, you translate methods by taking their bytecode. I'm
not sure whether the above is possible when compiling from bytecode; I
think it would be much easier to operate on abstract parse trees (with
lambda-sends as nodes).

--
Best regards,
Igor Stasenko AKA sig.
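A toy sketch of the reduction described above, over a hypothetical
send-node tree (none of these classes or accessors exist in Exupery;
the point is just the two folding rules):

  "Toy reducer over a hypothetical send-node tree. Rule 1: a send of
   #isInteger to a receiver of known class folds to a constant.
   Rule 2: #ifTrue:ifFalse: with a constant condition reduces to one
   of its two branches."
  reduce: aSendNode
      (aSendNode selector = #isInteger
          and: [aSendNode receiverClassIsKnown]) ifTrue: [
              ^ConstantNode value:
                  (aSendNode receiverClass inheritsFrom: Integer)].
      (aSendNode selector = #ifTrue:ifFalse:
          and: [aSendNode receiver isConstant]) ifTrue: [
              ^aSendNode receiver value
                  ifTrue: [aSendNode arguments first]
                  ifFalse: [aSendNode arguments second]].
      ^aSendNode

Applying reduce: bottom-up over the tree for a SmallInteger receiver
first folds "self isInteger" to a true constant, and the second rule
then replaces the whole conditional with codeA, eliminating both sends.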
Igor Stasenko writes:
> I suspect that the main bottleneck in largeExplorers is not compiled
> or bytecode execution but memory allocation and GC, so I doubt you
> can gain much performance there.

Below are the raw numbers. This is from largeExplorers, but with the
profiling compiler turned up to compile a bit more code. About 60% of
the time is going into the interpreter, compiled code, and primitives
that should be natively compiled; that's enough time to provide a
decent speed improvement. 70% is the normal amount spent in the
interpreter. The GC is probably only consuming about 5% of the time.

CPU: AMD64 processors, speed 1000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 1000000
Counted LS_BUFFER_FULL events (LS Buffer 2 Full) with a unit mask of
0x00 (No unit mask) count 100000
Counted RETIRED_BRANCHES_MISPREDICTED events (Retired branches
mispredicted) with a unit mask of 0x00 (No unit mask) count 100000
Counted RETIRED_INSNS events (Retired instructions (includes
exceptions, interrupts, re-syncs)) with a unit mask of 0x00 (No unit
mask) count 1000000

samples  %        samples  %        samples  %       samples  %        image name     app name       symbol name
1792637  57.4476  65779    12.9048  391686   84.8932  1970781  58.4824  squeak         squeak         interpret
223739    7.1700    150     0.0294     211    0.0457   376848  11.1829  BitBltPlugin   BitBltPlugin   alphaBlendwith
110588    3.5439  48505     9.5159    1302    0.2822   104008   3.0864  BitBltPlugin   BitBltPlugin   copyBits
76361     2.4471  124873   24.4982    3679    0.7974    64529   1.9149  libc-2.4.so    libc-2.4.so    (no symbols)
65089     2.0859  91160    17.8842    2155    0.4671    12648   0.3753  no-vmlinux     no-vmlinux     (no symbols)
60351     1.9340  51072    10.0195    2297    0.4978    31543   0.9360  anon (tgid:6681 range:0xb1c0d000-0xb7b6c000)  squeak  (no symbols)
52940     1.6965      5    9.8e-04    1632    0.3537    82896   2.4599  B2DPlugin      B2DPlugin      fillSpanfromto
45634     1.4624  12633     2.4784    2849    0.6175    59950   1.7790  BitBltPlugin   BitBltPlugin   copyLoopPixMap
39297     1.2593      1    2.0e-04     161    0.0349    12604   0.3740  squeak         squeak         sweepPhase
34504     1.1057     31     0.0061    3079    0.6673    10196   0.3026  squeak         squeak         lookupMethodInClass
31797     1.0190     19     0.0037    3408    0.7386    39560   1.1739  squeak         squeak         markAndTrace
28467     0.9123     11     0.0022    1757    0.3808    15187   0.4507  squeak         squeak         updatePointersInRangeFromto
28316     0.9074    197     0.0386    2484    0.5384    22647   0.6720  BitBltPlugin   BitBltPlugin   loadBitBltFromwarping
26469     0.8482      7     0.0014     739    0.1602    39972   1.1862  squeak         squeak         finalizeReference
26380     0.8454     15     0.0029     342    0.0741    38994   1.1571  squeak         squeak         updatePointersInRootObjectsFromto
24235     0.7766    727     0.1426     522    0.1131    29049   0.8620  squeak         squeak         exuperyIsNativeContext
17343     0.5558      7     0.0014     790    0.1712     9012   0.2674  squeak         squeak         positive32BitValueOf
15935     0.5107   5546     1.0880     336    0.0728    22525   0.6684  squeak         squeak         allocateheaderSizeh1h2h3doFillwith
15559     0.4986   2712     0.5321    1191    0.2581    25552   0.7582  BitBltPlugin   BitBltPlugin   pixPaintwith
14371     0.4605     11     0.0022    3301    0.7155    19116   0.5673  squeak         squeak         commonAt
13348     0.4278    466     0.0914     383    0.0830     6048   0.1795  squeak         squeak         lookupSelectorclass

The anon block is the native code. What's interesting is that its
instructions per clock are about 0.5, while the interpreter's are a
little over one. The native code has fewer branch mispredicts but much
more memory traffic: about 8% of the time it has the load store unit's
buffer full and is probably stalled waiting for a memory request to
finish.
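For the arithmetic behind those figures: RETIRED_INSNS and
CPU_CLK_UNHALTED were both sampled once per 1000000 events, so
instructions per clock is simply the ratio of their sample counts;
LS_BUFFER_FULL was sampled once per 100000 events, so its samples must
be rescaled before comparing against cycles. A workspace-style check
against the rows above:

  "IPC estimated from the table: RETIRED_INSNS samples divided by
   CPU_CLK_UNHALTED samples, both sampled at the same rate."
  31543 / 60351.0.       "anon (native code): ~0.52 instructions/clock"
  1970781 / 1792637.0.   "interpret:          ~1.10 instructions/clock"
  51072 * 100000 / (60351 * 1000000.0)
      "~0.085: LS buffer full for ~8% of the native code's cycles"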
Based on the profiling I've done, I'm fairly confident that one reason
the macro benchmarks are often not showing a performance improvement on
an Athlon 64 is excess spill code causing too much memory traffic. The
register allocator is not handling heavy register pressure well, and I
doubt the spill heuristics are ideal for larger methods.

Bryce
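A sketch of the kind of spill heuristic in question, in the classic
Chaitin style: estimate each live range's reload cost, weighting uses
by loop depth, and spill the range with the lowest cost relative to its
degree in the interference graph. All selectors here are hypothetical,
not Exupery's actual allocator interface.

  "Hypothetical sketch of a Chaitin-style spill choice: spill the
   live range whose estimated reload cost is lowest relative to how
   many other ranges it interferes with (its degree in the graph)."
  spillCandidate
      ^interferenceGraph liveRanges detectMin: [:range |
          (range uses inject: 0 into: [:cost :use |
              cost + (10 raisedTo: use loopDepth)])
          / (range degree max: 1)]

Getting these weights right matters most under heavy pressure: a range
that is cheap to reload but blocks many neighbours is a better spill
victim than a hot loop variable, which is exactly the distinction the
cost/degree ratio encodes.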