Thinking about Exupery 0.14


Thinking about Exupery 0.14

Bryce Kampjes

The primary goal for the next releases will be making the following
benchmarks more compelling. I've added a compile time benchmark as
there are a few performance bugs in the compiler that should be
removed.

  arithmaticLoopBenchmark 1396 compiled  128 ratio: 10.906
  bytecodeBenchmark       2111 compiled  460 ratio:  4.589
  sendBenchmark           1637 compiled  668 ratio:  2.451
  doLoopsBenchmark        1081 compiled  715 ratio:  1.512
  pointCreation           1245 compiled 1317 ratio:  0.945
  largeExplorers           728 compiled  715 ratio:  1.018
  compilerBenchmark        483 compiled  489 ratio:  0.988
  Cumulative Time         1125 compiled  537 ratio   2.093

  ExuperyBenchmarks>>arithmeticLoop                249ms
  SmallInteger>>benchmark                         1112ms
  InstructionStream>>interpretExtension:in:for: 113460ms
  Average                                         3155.360

First, I'll get the register allocator to allocate each section of a
method separately. After that, I'll probably do some work on further
optimising the register allocator, though I might work on improving the
generated native code instead.

Register allocating each section separately will allow both better and
faster allocation. It makes it easy to avoid dealing with registers and
interference from other sections of the code, and it reduces the size
of the problem. A well-written colouring register allocator should run
in O(n log n) time on average, but the performance bugs probably raise
that to O(n^2).

It's possible that just allocating each section of the method
separately will be enough to bring allocation down to a reasonable
time. It should definitely help for the larger methods, but it is
unlikely to do anything for the arithmeticLoop benchmark and will only
help the bytecode benchmark slightly. Compiling more quickly will make
it easier to run more extensive tests.

Bryce
_______________________________________________
Exupery mailing list
[hidden email]
http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/exupery

Re: Thinking about Exupery 0.14

Igor Stasenko
On 17/12/2007, [hidden email] <[hidden email]> wrote:

>
> The primary goal for the next releases will be making the following
> benchmarks more compelling. I've added a compile time benchmark as
> there are a few performance bugs in the compiler that should be
> removed.
>
> First, I'll get the register allocator to allocate each section of
> method separately. After that, I'll probably do some work on further
> optimising the register allocator but I might work on improving the
> generated native code.
>
> Register allocating each section separately will both allow for better
> and faster allocation. It will make it easy to avoid dealing with
> registers and interference from other sections of the code and will
> reduce the size of the problem. Colouring register allocation written
> well should be on average n log n time but the performance bugs will
> raise that to probably n^2.
>
> It's possible that just allocating each section of the method
> separately will be enough to bring allocation down to a reasonable
> time. It should definitely help for the larger methods but is unlikely
> to do anything for the arithmaticLoop and will only help the bytecode
> benchmark slightly. Compiling quicker will make it easier to run more
> extensive tests.
>

Do you distinguish between calling a compiled method and, for
instance, a primitive function?
As I remember, you compile methods into some form of routine which can
be called using the cdecl convention.
But on top of that, knowing that you are calling a compiled method,
you can use register optimisations such as passing arguments in
registers; and in general, by knowing where you change registers, you
can predict which of them change across a call and which stay
unchanged.
And, of course, nothing stops you from using your own calling
convention to make the code run faster. There are also the MMX/SSE
registers, which can be used for different purposes.
All of the above, depending on your choices, can greatly improve send
speed. I just want to know what you think about it.

And a small trick when compiling SmallInteger methods: you already
know that the receiver is a SmallInteger, so by using that knowledge
some tests can be omitted.
You can deal the same way with compiling methods for classes whose
instances are byte- or reference-indexed.

Too bad it's not Slate, where you know the types of all method
arguments. In Smalltalk it's just the receiver :)
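The trick above can be illustrated with a toy code generator (hypothetical pseudo-instructions, not Exupery's output). In Squeak, SmallIntegers are tagged with the low bit set, so a version of a method compiled only for SmallInteger receivers can skip the tag check entirely:

```python
# Hypothetical sketch: emitting pseudo-instructions for `self + 1` with and
# without compile-time knowledge that self is a tagged SmallInteger.

TAG_MASK = 1  # Squeak-style low-bit tag on SmallIntegers

def emit_increment(receiver_known_small_integer):
    """Return a list of pseudo-instructions for `self + 1`."""
    code = []
    if not receiver_known_small_integer:
        # Generic version: must first verify the receiver really is tagged.
        code.append('test self, %d' % TAG_MASK)
        code.append('jz   send_fallback')
    # Tagged add: adding 2 to (2 * value + 1) yields (2 * (value + 1) + 1).
    code.append('add  self, 2')
    code.append('jo   overflow_fallback')
    return code

# The specialised version is two instructions shorter than the generic one.
print(emit_increment(True))
print(emit_increment(False))
```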



> Bryce


--
Best regards,
Igor Stasenko AKA sig.

Re: Thinking about Exupery 0.14

Bryce Kampjes
Igor Stasenko writes:
 > On 17/12/2007, [hidden email] <[hidden email]> wrote:
 > >   arithmaticLoopBenchmark 1396 compiled  128 ratio: 10.906
 > >   bytecodeBenchmark       2111 compiled  460 ratio:  4.589
 > >   sendBenchmark           1637 compiled  668 ratio:  2.451
 > >   doLoopsBenchmark        1081 compiled  715 ratio:  1.512
 > >   pointCreation           1245 compiled 1317 ratio:  0.945
 > >   largeExplorers           728 compiled  715 ratio:  1.018
 > >   compilerBenchmark        483 compiled  489 ratio:  0.988
 > >   Cumulative Time         1125 compiled  537 ratio   2.093
 > >
 > >   ExuperyBenchmarks>>arithmeticLoop                249ms
 > >   SmallInteger>>benchmark                         1112ms
 > >   InstructionStream>>interpretExtension:in:for: 113460ms
 > >   Average                                         3155.360

First, from the numbers above, I'd say that having a method that takes
two minutes to compile is currently the biggest practical problem. The
second set of numbers is a compilation time benchmark. The second
biggest problem is that a 2.4 times increase in send speed is not
carrying through to the two macro-benchmarks (largeExplorers and
compilerBenchmark).

 >
 > Do you make any difference between calling compiling method and , for
 > instance, a primitive function?

The sender doesn't know whether it's sending to a primitive or to a
full method. If Exupery compiles a primitive, then it executes in the
sender's context, just like in the interpreter.

 > As i remember, you compiling methods to some form of a routine, which
 > can be called using cdecl convention.
 > But on top of that, knowing the fact that you calling a compiled
 > method you can use some register optimizations like passing arguments
 > in it, and in general by knowing where you changing registers, you can
 > predict what of them are changing after call, and what will stay
 > unchanged.
 > And, of course, nothing stops you from using own calling convention to
 > make code working faster. There's also a MMX/SSE registers which can
 > be used for different purposes.
 > All of the above, depending on choices, can greatly improve sends speed.
 > Just want to know, what you thinking about it.

Currently Exupery uses C's calling conventions combined with the
interpreter's handling of contexts. There's plenty of room to improve
this, but I doubt that raw send speed is why the macro benchmarks
aren't performing.

Also, full method inlining will change the value of other send
optimisations by removing most of the common sends; it's the best
optimisation for common sends. Version 1.0 is a base to add full
method inlining to.

 > And small trick when compiling SmallInteger methods: you already know
 > that receiver is a smallinteger. So, by using that knowledge, some
 > tests can be omitted.
 > In same manner you can deal with compiling methods for classes which
 > have byte/reference indexed instances.

Exupery compiles a method for each receiver, so this is possible but
not done yet. It'll get even more interesting when combined with full
method inlining; then common self sends will become completely free.
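As a rough sketch of why per-receiver compilation makes self sends cheap (method names and instruction strings here are hypothetical, not Exupery's representation): once the receiver class is fixed for a compiled version, a self send can be looked up, and inlined, at compile time.

```python
# Illustrative sketch: per-receiver compilation lets self sends be resolved
# at compile time. Methods are toy instruction lists keyed by (class, selector).

class_methods = {
    ('SmallInteger', 'isInteger'): ['push true', 'return'],
    ('SmallInteger', 'double'):    ['send self isInteger', 'mul self, 2', 'return'],
}

def compile_for(receiver_class, selector):
    """Compile one version per receiver class, inlining self sends."""
    out = []
    for instr in class_methods[(receiver_class, selector)]:
        if instr.startswith('send self '):
            callee = instr.split()[-1]
            # The receiver class is fixed, so the lookup happens now, not at
            # run time; splice in the callee's body (dropping its return).
            out.extend(i for i in compile_for(receiver_class, callee)
                       if i != 'return')
        else:
            out.append(instr)
    return out

print(compile_for('SmallInteger', 'double'))
```

The self send disappears entirely from the compiled version, which is the sense in which common self sends become free.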

Bryce

Re: Thinking about Exupery 0.14

Igor Stasenko
On 18/12/2007, [hidden email] <[hidden email]> wrote:

> Igor Stasenko writes:
>  > On 17/12/2007, [hidden email] <[hidden email]> wrote:
>  > >   arithmaticLoopBenchmark 1396 compiled  128 ratio: 10.906
>  > >   bytecodeBenchmark       2111 compiled  460 ratio:  4.589
>  > >   sendBenchmark           1637 compiled  668 ratio:  2.451
>  > >   doLoopsBenchmark        1081 compiled  715 ratio:  1.512
>  > >   pointCreation           1245 compiled 1317 ratio:  0.945
>  > >   largeExplorers           728 compiled  715 ratio:  1.018
>  > >   compilerBenchmark        483 compiled  489 ratio:  0.988
>  > >   Cumulative Time         1125 compiled  537 ratio   2.093
>  > >
>  > >   ExuperyBenchmarks>>arithmeticLoop                249ms
>  > >   SmallInteger>>benchmark                         1112ms
>  > >   InstructionStream>>interpretExtension:in:for: 113460ms
>  > >   Average                                         3155.360
>
> First, from the numbers above, I'd say that having a method that takes
> 2 minutes to compile is currently the biggest practical problem. The
> second set of numbers is a compilation time benchmark. The second
> biggest problem is that a 2.4 times increase in send speed is not
> transferring through to the two macro-benchmarks (largeExplorers and
> compilerBenchmark).
>

I suspect that the main bottleneck in largeExplorers is not the
compiled/bytecode code but memory allocation and GC.
So I doubt that you can gain much performance there.

>  > And small trick when compiling SmallInteger methods: you already know
>  > that receiver is a smallinteger. So, by using that knowledge, some
>  > tests can be omitted.
>  > In same manner you can deal with compiling methods for classes which
>  > have byte/reference indexed instances.
>
> Exupery compiles a method for each receiver so this is possible but
> not done yet. It'll get even more interesting when combined with full
> method inlining, then common self sends will become completely free.
>
Yes, and toward that end I've started coding some classes to translate
Smalltalk method source into a set of lambda sends.
Lambdas are a very simple yet powerful way to represent an abstract
algorithm. With only a few rules, substitution and reduction, I can
transform method code into any form, which is very helpful in the case
of full method inlining.
A method can be represented as a lambda having a single free variable,
<context>:

lambda method(context),

where any contextual parts/actions are represented as messages to the context, like:

receiver: Lambda context receiver.
receiver class: Lambda context receiver class.
argument1: Lambda context argumentAt: 1
push: Lambda context push: arg
return: Lambda context return: expression.
etc.

But now, when you go to inline a method, things become interesting:
if you know the receiver's class, then you can reduce some lambdas at
compile time, such as any accesses to the receiver's class, the
results of receiver method lookups, etc.

And even more: if you have a way to determine whether methods are
referentially transparent, then you can reduce some of them to their
returned results at compile time.

Like having:

isInteger
  ^ true

and then, in some method, when you encounter
  self isInteger ifTrue: [ codeA ] ifFalse: [ codeB ]

you can reduce the whole expression to codeA or codeB, depending on
the class of the receiver.
So, in the given example, knowing the class of the receiver, you can
eliminate two sends: #isInteger and #ifTrue:ifFalse:.
I think that making the compiler able to detect and perform such
reductions would render other kinds of optimisation much less
important.

As I understand it, you translate methods from their bytecode. I'm not
sure the above is possible when compiling from bytecode; I think it
would be much easier to operate on abstract parse trees (with lambda
sends as nodes).
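The reduction described above can be sketched as a tiny partial evaluator over a toy tuple AST (illustrative Python, not the lambda classes being discussed):

```python
# Toy partial evaluator: with the receiver's class known at compile time,
# `self isInteger` folds to a constant, and ifTrue:ifFalse: on a constant
# folds to a single branch. AST nodes are ('send', receiver, selector, *args).

INTEGER_CLASSES = {'SmallInteger', 'LargePositiveInteger'}

def reduce_node(node, receiver_class):
    """Return a reduced AST given the compile-time class of self."""
    if isinstance(node, tuple) and node[0] == 'send':
        _, receiver, selector, *args = node
        receiver = reduce_node(receiver, receiver_class)
        args = [reduce_node(a, receiver_class) for a in args]
        if receiver == 'self' and selector == 'isInteger':
            # Referentially transparent send: fold it to its known result.
            return receiver_class in INTEGER_CLASSES
        if selector == 'ifTrue:ifFalse:' and isinstance(receiver, bool):
            # The condition is a compile-time constant: keep one branch only.
            return args[0] if receiver else args[1]
        return ('send', receiver, selector, *args)
    return node

ast = ('send', ('send', 'self', 'isInteger'), 'ifTrue:ifFalse:', 'codeA', 'codeB')
print(reduce_node(ast, 'SmallInteger'))
print(reduce_node(ast, 'Float'))
```

Both the #isInteger send and the #ifTrue:ifFalse: send vanish, leaving only the surviving branch, which matches the two-send elimination described above.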

> Bryce


--
Best regards,
Igor Stasenko AKA sig.

Re: Thinking about Exupery 0.14

Bryce Kampjes
Igor Stasenko writes:
 >
 > I suspect that main bottleneck in largeExplorers is not
 > compiled/bytecode code, but memory allocations and GC.
 > So, i doubt that you can gain any performance increase here.
 >

Below are the raw numbers. This is from largeExplorers, but with the
profiling compiler turned up to compile a bit more code. About 60% of
the time is going into the interpreter, compiled code, and primitives
that should be natively compiled. That's enough time to provide a
decent speed improvement. 70% is the normal amount spent in the
interpreter. The GC is probably only consuming about 5% of the time.

CPU: AMD64 processors, speed 1000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 1000000
Counted LS_BUFFER_FULL events (LS Buffer 2 Full) with a unit mask of 0x00 (No unit mask) count 100000
Counted RETIRED_BRANCHES_MISPREDICTED events (Retired branches mispredicted) with a unit mask of 0x00 (No unit mask) count 100000
Counted RETIRED_INSNS events (Retired instructions (includes exceptions, interrupts, re-syncs)) with a unit mask of 0x00 (No unit mask) count 1000000
samples  %        samples  %        samples  %        samples  %        image name               app name                 symbol name
1792637  57.4476  65779    12.9048  391686   84.8932  1970781  58.4824  squeak                   squeak                   interpret
223739    7.1700  150       0.0294  211       0.0457  376848   11.1829  BitBltPlugin             BitBltPlugin             alphaBlendwith
110588    3.5439  48505     9.5159  1302      0.2822  104008    3.0864  BitBltPlugin             BitBltPlugin             copyBits
76361     2.4471  124873   24.4982  3679      0.7974  64529     1.9149  libc-2.4.so              libc-2.4.so              (no symbols)
65089     2.0859  91160    17.8842  2155      0.4671  12648     0.3753  no-vmlinux               no-vmlinux               (no symbols)
60351     1.9340  51072    10.0195  2297      0.4978  31543     0.9360  anon (tgid:6681 range:0xb1c0d000-0xb7b6c000) squeak                   (no symbols)
52940     1.6965  5        9.8e-04  1632      0.3537  82896     2.4599  B2DPlugin                B2DPlugin                fillSpanfromto
45634     1.4624  12633     2.4784  2849      0.6175  59950     1.7790  BitBltPlugin             BitBltPlugin             copyLoopPixMap
39297     1.2593  1        2.0e-04  161       0.0349  12604     0.3740  squeak                   squeak                   sweepPhase
34504     1.1057  31        0.0061  3079      0.6673  10196     0.3026  squeak                   squeak                   lookupMethodInClass
31797     1.0190  19        0.0037  3408      0.7386  39560     1.1739  squeak                   squeak                   markAndTrace
28467     0.9123  11        0.0022  1757      0.3808  15187     0.4507  squeak                   squeak                   updatePointersInRangeFromto
28316     0.9074  197       0.0386  2484      0.5384  22647     0.6720  BitBltPlugin             BitBltPlugin             loadBitBltFromwarping
26469     0.8482  7         0.0014  739       0.1602  39972     1.1862  squeak                   squeak                   finalizeReference
26380     0.8454  15        0.0029  342       0.0741  38994     1.1571  squeak                   squeak                   updatePointersInRootObjectsFromto
24235     0.7766  727       0.1426  522       0.1131  29049     0.8620  squeak                   squeak                   exuperyIsNativeContext
17343     0.5558  7         0.0014  790       0.1712  9012      0.2674  squeak                   squeak                   positive32BitValueOf
15935     0.5107  5546      1.0880  336       0.0728  22525     0.6684  squeak                   squeak                   allocateheaderSizeh1h2h3doFillwith
15559     0.4986  2712      0.5321  1191      0.2581  25552     0.7582  BitBltPlugin             BitBltPlugin             pixPaintwith
14371     0.4605  11        0.0022  3301      0.7155  19116     0.5673  squeak                   squeak                   commonAt
13348     0.4278  466       0.0914  383       0.0830  6048      0.1795  squeak                   squeak                   lookupSelectorclass

The anon block is the native code. What's interesting is that its
instructions per clock is about 0.5, while the interpreter's
instructions per clock is a little over one. The native code has fewer
branch mispredicts but much more memory traffic. About 8% of the time
the native code has the load-store unit's buffer full and is probably
stalled waiting for a memory request to finish.
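The instructions-per-clock figures can be recomputed directly from the table: both CPU_CLK_UNHALTED and RETIRED_INSNS were sampled at the same rate (one sample per 1,000,000 events), so the ratio of the two sample counts approximates IPC.

```python
# IPC from the oprofile rows above: (cycle samples, retired-instruction samples).
rows = {
    'interpret (interpreter)':     (1792637, 1970781),
    'anon (compiled native code)': (60351, 31543),
}
for name, (cycles, insns) in rows.items():
    # Same sampling rate for both events, so sample ratio ~ instructions/clock.
    print('%s: %.2f instructions per clock' % (name, insns / cycles))
```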

Based on the profiling I've done, I'm fairly confident that one of the
reasons the macro benchmarks are often not showing a performance
improvement on an Athlon 64 is excess spill code causing too much
memory traffic. The register allocator is not handling heavy register
pressure well, and I doubt the spill heuristics are ideal for larger
methods.

Bryce