[squeak-dev] The Old Man


RE: [squeak-dev] Re: The Old Man

Sebastian Sastre-2
> Anyway, we seem to agree that incremental improvements is precisely  
> what gets you usable near-term pink-plane results, and I think that  
> was the gist of Marcus' message, too.
>
> - Bert -
>
This is the anticipation-of-value effect Kent Beck mentions in his books. Unit
tests help you make incremental improvements and anticipate value. For a
company that value is delivered as customer satisfaction and $$; for an open
software community, on the other hand, it would be motivation for volunteers/users.

It's good that we talk about our volunteer economy: we are not a big group of
people, so we must embrace and understand efficiency in this regard.

Sebastian



[squeak-dev] jitter (was: The Old Man)

Jecel Assumpcao Jr
In reply to this post by Bert Freudenberg
Bert Freudenberg wrote:
> What we will never know is if the first Jitter had been incrementally  
> improved rather than being abandoned like all its successors, it may  
> have surpassed the current interpreter performance by far. The  
> downside is that it would inherently be much more complex - the  
> interpreter strikes a nice balance here.

The first jitter had the advantage of being cross-platform, but as I had
predicted it only improved performance by around 50% which is less than
what we got by making the interpreter better. The second jitter (known
as Jitter3) was actually finished as far as I can tell. There was still
some stuff to be done, but it was no worse in that regard than Unicode
or Traits support in current Squeak. There simply was not enough
interest for it to be adopted.

And since it was finished, you can download it (get Squeak 2.3) and run
some benchmarks (on Linux machines, at least). Here are the numbers on
this machine (3GHz Pentium 4):

2.3 image and normal 2.3 VM - 62,500,000 bytecodes/sec; 4,591,325 sends/sec
2.3 image and Jitter3 VM - 100,000,000 bytecodes/sec; 10,494,459 sends/sec
3.9 image and 3.7 VM - 160,602,258 bytecodes/sec; 7,292,693 sends/sec

The Jitter3 numbers varied wildly, but even the more stable numbers for
the normal 2.3 VM are very suspect. The problem is that the old image
didn't expect such a fast machine and doesn't seem to loop enough times.
But the point is that the code is out there and we can actually run it.

-- Jecel


[squeak-dev] Re: jitter

Paolo Bonzini-2
In reply to this post by Bert Freudenberg

> 2.3 image and normal 2.3 VM - 62,500,000 bytecodes/sec; 4,591,325
> sends/sec
> 2.3 image and Jitter3 VM - 100,000,000 bytecodes/sec; 10,494,459
> sends/sec
> 3.9 image and 3.7 VM - 160,602,258 bytecodes/sec; 7,292,693 sends/sec
>
> The Jitter3 numbers varied wildly, but even the more stable numbers for
> the normal 2.3 VM are very suspect. The problem is that the old image
> didn't expect such a fast machine and doesn't seem to loop enough times.
> But the point is that the code is out there and we can actually run it.

Just increment the numbers in #tinyBenchmarks (incrementing by one
roughly doubles the run time, so incrementing by 4 should be more than
enough).
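
For instance - a sketch from memory, not the exact 2.3 source - the counts live in Integer>>tinyBenchmarks, and raising them stretches the timed runs:

```smalltalk
"Sketch only - the exact literals in the 2.3 image may differ.
In Integer>>tinyBenchmarks, raise the arguments so the timed runs
last long enough to measure accurately on a fast machine:"
t1 := Time millisecondsToRun: [4 benchmark].     "was 1 benchmark"
t2 := Time millisecondsToRun: [30 benchFib].     "was 26 benchFib"
```

Each +1 on the benchFib argument roughly doubles its run time because of the Fibonacci-style recursion.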

Paolo


[squeak-dev] Re: jitter

Andreas.Raab
In reply to this post by Bert Freudenberg
Jecel Assumpcao Jr wrote:

> And since it was finished, you can download it (get Squeak 2.3) and run
> some benchmarks (in Linux machines, at least). Here are the numbers on
> this machine (3GHz Pentium 4):
>
> 2.3 image and normal 2.3 VM - 62,500,000 bytecodes/sec; 4,591,325
> sends/sec
> 2.3 image and Jitter3 VM - 100,000,000 bytecodes/sec; 10,494,459
> sends/sec
> 3.9 image and 3.7 VM - 160,602,258 bytecodes/sec; 7,292,693 sends/sec
>
> The Jitter3 numbers varied wildly, but even the more stable numbers for
> the normal 2.3 VM are very suspect. The problem is that the old image
> didn't expect such a fast machine and doesn't seem to loop enough times.

Easy. Just copy and paste #tinyBenchmarks, #benchFib, and #benchmark to
Integer. I did this in Squeak1.1.image (this was the baseline of my
benchmark comparison) so it should work fine with 2.3.
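
For reference, #benchFib is the send-heavy half of that pair; from memory it reads roughly like this (treat it as a sketch, not a verbatim copy of the image source):

```smalltalk
benchFib
    "Handy send-heavy benchmark; reconstructed from memory.
    (result // seconds to run) gives approximate sends per second."
    ^ self < 2
        ifTrue: [1]
        ifFalse: [(self - 1) benchFib + (self - 2) benchFib + 1]
```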

Cheers,
   - Andreas


RE: [squeak-dev] jitter (was: The Old Man)

Sebastian Sastre-2
In reply to this post by Jecel Assumpcao Jr
> The first jitter had the advantage of being cross-platform, but as I had
> predicted it only improved performance by around 50% which is less than
> what we got by making the interpreter better. The second jitter (known
> as Jitter3) was actually finished as far as I can tell. There simply was
> not enough interest for it to be adopted.
[SNIP]
> -- Jecel
>
Hi Jecel,

         thanks for the update on the state of jitter. Those extra 3 million
sends/sec represent a nice 30% improvement in send performance. Could that
jitter work in tandem with the current interpreter? And would you say an
updated jitter, ported to, let's say, 3dot10, would add close to that 30%
in sends?

        cheers,

Sebastian Sastre



RE: [squeak-dev] jitter (was: The Old Man)

Göran Krampe
Hi folks!

"Sebastian Sastre" <[hidden email]> wrote:
> > > downside is that it would inherently be much more complex - the  
> > > interpreter strikes a nice balance here.
> >
> > The first jitter had the advantage of being cross-platform, but as I
> > had predicted it only improved performance by around 50% which is less
> > than what we got by making the interpreter better. The second jitter
> > (known as Jitter3) was actually finished as far as I can tell.
[SNIP]
> thanks for informing about the state of jitter. Those 3 million
> sends/sec more represents a nice pie of 30% better sends/sec. That jitter can
> make team work with current interpreter? An updated and ported jitter, for lets
> say 3dot10, you will say it sum performance near that 30% in sends?

If I remember correctly Jitter3 was actually a VM written in C++; Ian
can correct me if I am wrong. And if my memory serves correctly, it
didn't show numbers as impressive in "real world scenarios" as in
benchmarks, simply because most real-world Squeak apps spend 50% of
their time in primitives, and those aren't affected.

(modulo my memory failing me of course)

My perception in all this is that Exupery is *clearly* the most
promising speed technology we have available, and it is getting faster
and faster all the time. And it works together with the normal VM.

The numbers Exupery shows today with normal Squeak are very impressive,
and the numbers Exupery shows in Huemul are even better - due to some
improvements that Guillermo could make because he has full control of
the design of the VM, if I understand it correctly.

Fonc/cola/coke can also turn out to be really fast - but that is a whole
new platform, not really "Squeak".

regards, Göran


[squeak-dev] Re: jitter (was: The Old Man)

Andreas.Raab
[hidden email] wrote:

>> thanks for informing about the state of jitter. Those 3 million
>> sends/sec more represents a nice pie of 30% better sends/sec. That jitter can
>> make team work with current interpreter? An updated and ported jitter, for lets
>> say 3dot10, you will say it sum performance near that 30% in sends?
>
> If I remember correctly Jitter3 was actually a VM written in C++, Ian
> can correct me if I am wrong. And if my memory serves correctly it
> didn't show as impressive numbers in "real world scenarios" as compared
> to benchmarks simply due to the fact that most real world Squeak apps
> spend 50% of their time in primitives and those aren't affected.
>
> (modulo my memory failing me of course)

Yes, that's my recollection too. J3 had portions that were compiled into
native code (I remember that there were a few macros that one had to
implement on each platform) but the main reason why it didn't show much
real-world improvement was (IIRC) that it didn't do context
mapping and inline caches. Those are the places that *really* make a
difference (context mapping of course is hard without fixing blocks to
be strictly LIFO).

> My perception in all this is that Exupery is the *clearly* most
> promising speed technology we have available and it is getting faster
> and faster all the time. And it works together with the normal VM.

One of my problems with Exupery is that I've only seen claims about byte
code speed and if you know where the time goes in a real-life
environment then you know it ain't bytecodes. In other words, it seems
to me that Exupery is optimizing the least significant portion of the
VM. I'd be rather more impressed if it did double the send speed.

> The numbers Exupery shows today with normal Squeak are very impressive
> and the numbers Exupery shows in Huemul are even faster - due to some
> improvements that Guillermo could make due to the fact that he has full
> control of the design of the VM, if I understand it correctly.

Based on which benchmarks? Can I run them on Windows?

> Fonc/cola/coke can also turn out to be really fast - but that is a whole
> new platform, not really "Squeak".

True.

Cheers,
   - Andreas


Re: [squeak-dev] Re: jitter (was: The Old Man)

Stephen Pair


On Wed, Apr 2, 2008 at 4:36 AM, Andreas Raab <[hidden email]> wrote:
One of my problems with Exupery is that I've only seen claims about byte code speed and if you know where the time goes in a real-life environment then you know it ain't bytecodes. In other words, it seems to me that Exupery is optimizing the least significant portion of the VM. I'd be rather more impressed if it did double the send speed.

I share similar views.  In the purest of OO systems, everything is a message send except the things that are not, and the only things that are not are primitives.  Message sends translate directly into some canned set of machine code, and you apply your compiling and optimizing dexterity to the primitives.  Optimization is done through PICs and recursive inlining of primitives with recompiling/optimizing.  Any method that isn't a primitive is just doing message sends, but any method could potentially get compiled into a primitive through the process of inlining.

Of course, Exupery isn't trying to reinvent the world either.  And I'm sure aspects of the Exupery compiler could be leveraged in a system that was closer to this kind of purity.

- Stephen




[squeak-dev] Re: jitter (was: The Old Man)

Bryce Kampjes
In reply to this post by Andreas.Raab
Andreas Raab writes:

 > One of my problems with Exupery is that I've only seen claims about byte
 > code speed and if you know where the time goes in a real-life
 > environment then you know it ain't bytecodes. In other words, it seems
 > to me that Exupery is optimizing the least significant portion of the
 > VM. I'd be rather more impressed if it did double the send speed.

Then be impressed. Exupery has had double Squeak's send performance
since March 2005.

 http://people.squeakfoundation.org/person/willembryce/diary.html?start=23

That's done by using polymorphic inline caches which are also used to
drive dynamic primitive inlining. It is true that further send
performance gains are not planned before 1.0. Doubling send
performance should be enough to provide a practical performance
improvement. It's better to solve all the problems standing in the way
of a practical performance improvement before starting work on full
method inlining which should provide serious send performance.
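
As a conceptual sketch (not Exupery's actual generated code - a real PIC is machine code, and every name below is invented for illustration), a polymorphic inline cache turns a send site into a short run of class checks with a full lookup as the fallback:

```smalltalk
"Conceptual sketch only - all selectors and variables here are
hypothetical, chosen to illustrate the dispatch logic of a PIC."
dispatchFor: receiver
    receiver class == cachedClass1
        ifTrue: [^ cachedMethod1 runFor: receiver].
    receiver class == cachedClass2
        ifTrue: [^ cachedMethod2 runFor: receiver].
    "cache miss: do a full lookup, then extend the cache for next time"
    ^ self fullLookupAndExtendCacheFor: receiver
```

The class recorded at the send site is also what lets the compiler specialise and inline primitives for receivers it has already seen.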

Here's the current benchmarks:
  Executing Code
  ==============
  arithmaticLoopBenchmark 1397 compiled 138 ratio: 10.122
  bytecodeBenchmark 2183 compiled 435 ratio: 5.017
  sendBenchmark 1657 compiled 741 ratio: 2.236
  doLoopsBenchmark 1100 compiled 813 ratio: 1.353
  pointCreation 988 compiled 968 ratio: 1.021
  largeExplorers 729 compiled 780 ratio: 0.935
  compilerBenchmark 529 compiled 480 ratio: 1.102
  Cumulative Time 1113.161 compiled 538.355 ratio 2.068

  Compile Time
  ============
  ExuperyBenchmarks>>arithmeticLoop 199ms
  SmallInteger>>benchmark 791ms
  InstructionStream>>interpretExtension:in:for: 14266ms
  Average 1309.515

The bottom two executing code benchmarks are macro benchmarks. They
compile a few methods based on a profile run then re-run the
benchmark.

There are several primitives that are inlined into the main interpret()
loop in the interpreter but require full worst-case dispatching in
Exupery. They'll need to be implemented to prevent slowdowns in code
that uses them. Also there are a few limitations that can cause
Exupery to produce poorly performing code in some situations. There
are also bugs: the last release would run for about an hour of
development before crashing. These are the issues that are currently
being worked on.

Here's the benchmarks from the 0.13 release:
  arithmaticLoopBenchmark 1397 compiled 138 ratio: 10.122
  bytecodeBenchmark 2183 compiled 435 ratio: 5.017
  sendBenchmark 1657 compiled 741 ratio: 2.236
  doLoopsBenchmark 1100 compiled 813 ratio: 1.353
  pointCreation 988 compiled 968 ratio: 1.021
  largeExplorers 729 compiled 780 ratio: 0.935
  compilerBenchmark 529 compiled 480 ratio: 1.102
  Cumulative Time 1113.161 compiled 538.355 ratio 2.068

  ExuperyBenchmarks>>arithmeticLoop 199ms
  SmallInteger>>benchmark 791ms
  InstructionStream>>interpretExtension:in:for: 14266ms
  Average 1309.515

The major gains are in the compileBenchmark macro benchmark and in
compilation time. Both due to work on the register allocator.

Exupery from the beginning has been an attempt to combine serious
optimisation with full method inlining similar to Self while having the
entire compiler written in Smalltalk. It's an ambitious goal that's
best tackled in smaller steps.

Bryce


[squeak-dev] Re: jitter (was: The Old Man)

Bryce Kampjes

I had the wrong benchmarks for 0.13. This post fixes my copy/paste
error. Thanks Goran.

[hidden email] writes:
 > Andreas Raab writes:
 >
 >  > One of my problems with Exupery is that I've only seen claims about byte
 >  > code speed and if you know where the time goes in a real-life
 >  > environment then you know it ain't bytecodes. In other words, it seems
 >  > to me that Exupery is optimizing the least significant portion of the
 >  > VM. I'd be rather more impressed if it did double the send speed.
 >
 > Then be impressed. Exupery has had double Squeak's send performance
 > since March 2005.
 >
 >  http://people.squeakfoundation.org/person/willembryce/diary.html?start=23
 >
 > That's done by using polymorphic inline caches which are also used to
 > drive dynamic primitive inlining. It is true that further send
 > performance gains are not planned before 1.0. Doubling send
 > performance should be enough to provide a practical performance
 > improvement. It's better to solve all the problems standing in the way
 > of a practical performance improvement before starting work on full
 > method inlining which should provide serious send performance.
 >
 > Here's the current benchmarks:
 >   Executing Code
 >   ==============
 >   arithmaticLoopBenchmark 1397 compiled 138 ratio: 10.122
 >   bytecodeBenchmark 2183 compiled 435 ratio: 5.017
 >   sendBenchmark 1657 compiled 741 ratio: 2.236
 >   doLoopsBenchmark 1100 compiled 813 ratio: 1.353
 >   pointCreation 988 compiled 968 ratio: 1.021
 >   largeExplorers 729 compiled 780 ratio: 0.935
 >   compilerBenchmark 529 compiled 480 ratio: 1.102
 >   Cumulative Time 1113.161 compiled 538.355 ratio 2.068
 >
 >   Compile Time
 >   ============
 >   ExuperyBenchmarks>>arithmeticLoop 199ms
 >   SmallInteger>>benchmark 791ms
 >   InstructionStream>>interpretExtension:in:for: 14266ms
 >   Average 1309.515
 >
 > The bottom two executing code benchmarks are macro benchmarks. They
 > compile a few methods based on a profile run then re-run the
 > benchmark.
 >
 > There's several primitives that are inlined into the main interpret()
 > loop in the interpreter but require full worst case dispatching in
 > Exupery. They'll need to be implemented to prevent slow downs to
 > code the benefits. Also there are few limitations that can cause
 > Exupery to produce unperformant code in some situations. There
 > are also bugs, the last release would run for about an hour of
 > development before crashing. These are the issues that are currently
 > being worked on.
 >
 > Here's the benchmarks from the 0.13 release:
  arithmaticLoopBenchmark 1396 compiled  128 ratio: 10.906
  bytecodeBenchmark       2111 compiled  460 ratio:  4.589
  sendBenchmark           1637 compiled  668 ratio:  2.451
  doLoopsBenchmark        1081 compiled  715 ratio:  1.512
  pointCreation           1245 compiled 1317 ratio:  0.945
  largeExplorers           728 compiled  715 ratio:  1.018
  compilerBenchmark        483 compiled  489 ratio:  0.988
  Cumulative Time         1125 compiled  537 ratio   2.093

  ExuperyBenchmarks>>arithmeticLoop 249ms
  SmallInteger>>benchmark 1112ms
  InstructionStream>>interpretExtension:in:for: 113460ms
  Average 3155.360

 >
 > The major gains are in the compileBenchmark macro benchmark and in
 > compilation time. Both due to work on the register allocator.
 >
 > Exupery from the beginning has been an attempt to combine serious
 > optimisation with full method inlining similar to Self while having the
 > entire compiler written in Smalltalk. It's an ambitious goal that's
 > best tackled in smaller steps.
 >
 > Bryce
 >


[squeak-dev] Re: jitter (was: The Old Man)

Andreas.Raab
In reply to this post by Bryce Kampjes
[hidden email] wrote:

> Andreas Raab writes:
>
>  > One of my problems with Exupery is that I've only seen claims about byte
>  > code speed and if you know where the time goes in a real-life
>  > environment then you know it ain't bytecodes. In other words, it seems
>  > to me that Exupery is optimizing the least significant portion of the
>  > VM. I'd be rather more impressed if it did double the send speed.
>
> Then be impressed. Exupery has had double Squeak's send performance
> since March 2005.
>
>  http://people.squeakfoundation.org/person/willembryce/diary.html?start=23

That's pretty impressive.

> That's done by using polymorphic inline caches which are also used to
> drive dynamic primitive inlining. It is true that further send
> performance gains are not planned before 1.0. Doubling send
> performance should be enough to provide a practical performance
> improvement. It's better to solve all the problems standing in the way
> of a practical performance improvement before starting work on full
> method inlining which should provide serious send performance.

Indeed. So what's in the way of practical performance improvement at
this point? I was quite surprised that in your corrected benchmarks the
two that were macros wouldn't show any improvement:
   bytecodeBenchmark       2111 compiled  460 ratio:  4.589
   sendBenchmark           1637 compiled  668 ratio:  2.451
   [...]
   largeExplorers           728 compiled  715 ratio:  1.018
   compilerBenchmark        483 compiled  489 ratio:  0.988

With sends 2.5x faster I would expect *some* noticeable improvement. Any
ideas what the problem is?

Cheers,
   - Andreas


[squeak-dev] Re: jitter (was: The Old Man)

Dan Ingalls
In reply to this post by Bryce Kampjes
>Andreas Raab writes:
>
> > One of my problems with Exupery is that I've only seen claims about byte
> > code speed and if you know where the time goes in a real-life
> > environment then you know it ain't bytecodes. In other words, it seems
> > to me that Exupery is optimizing the least significant portion of the
> > VM. I'd be rather more impressed if it did double the send speed.
>
>Then be impressed. Exupery has had double Squeak's send performance
>since March 2005.
>
> http://people.squeakfoundation.org/person/willembryce/diary.html?start=23

Way back when we were playing around with J3, we put together the macroBenchmark methods as a more realistic indication of performance than the various micro benchmarks.  Are they still available?  If not, they can be run in older images (3.4 has them and runs them even now).  They have the minor downside of changing with tweaks to the image, but they have the upside of being quite realistic and absolutely comparable from VM to VM.  I highly recommend them to anyone serious about performance evaluation.

        - Dan

PS:  As a reminder, here is what they do...
        "1: Decompile, pretty-print, and compile a bunch of methods.
                Does not install in classes, so does not flush cache."
        "2: Build morphic tiles for all methods over 800 bytes (;-).
                Does no display."
        "3: Translate the interpreter with inlining.
                Does not include any plugins."
        "4: Run the context step simulator.
                200 iterations printing pi and 15 factorial."
        "5: Run the InterpreterSimulator for 150,000 bytecodes.
                Will only run if you have mini.image in your directory."
        "6: Open 10 browsers and close them.
                Includes browsing to a specific method."
        "7: Play a game of FreeCell with display, while running the MessageTally.
                Thanks to Bob Arning for the clever part of this one."
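
If memory serves, in a 3.4-era image these live on SystemDictionary as #macroBenchmark1 through #macroBenchmark7, so a timed run would look something like this (selector names are from memory - verify in your image before relying on them):

```smalltalk
"Hedged sketch - assumes the 3.4-era selectors #macroBenchmark1..7
exist on SystemDictionary; check your image first."
1 to: 7 do: [:i |
    | ms |
    ms := Time millisecondsToRun:
        [Smalltalk perform: ('macroBenchmark', i printString) asSymbol].
    Transcript show: 'macroBenchmark', i printString, ': ',
        ms printString, ' ms'; cr]
```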


[squeak-dev] Re: jitter

Andreas.Raab
Dan Ingalls wrote:
> Way back when we were playing around with J3, we put together the macroBenchmark methods as a more realistic indication of performance than the various micro benchmarks.  Are they still available?  If not, they can be run in older images (3.4 has them and runs them even now).  They have the minor downside of changing with tweaks to the image, but they have the upside of being quite realistic and absolutely comparable from VM to VM.  I highly recommend them to anyone serious about performance evaluation.

I have resurrected most of them for Croquet.

Cheers,
   - Andreas


[squeak-dev] Re: jitter (was: The Old Man)

Bryce Kampjes
In reply to this post by Andreas.Raab
Andreas Raab writes:
 > [hidden email] wrote:
 > Indeed. So what's in the way of practical performance improvement at
 > this point? I was quite surprised that in your corrected benchmarks the
 > two that were macros wouldn't show any improvement:
 >    bytecodeBenchmark       2111 compiled  460 ratio:  4.589
 >    sendBenchmark           1637 compiled  668 ratio:  2.451
 >    [...]
 >    largeExplorers           728 compiled  715 ratio:  1.018
 >    compilerBenchmark        483 compiled  489 ratio:  0.988
 >
 > With sends 2.5x faster I would expect *some* noticable improvement. Any
 > ideas what the problem is?

Here are the latest benchmarks; there's a 10% gain for the compiler
benchmark. There's a bigger loss for largeExplorers, but I think that's
triggered by compiling more methods and thus hitting a missing
optimisation that previous benchmark runs didn't hit.

   arithmaticLoopBenchmark 1397 compiled 138 ratio: 10.122
   bytecodeBenchmark 2183 compiled 435 ratio: 5.017
   sendBenchmark 1657 compiled 741 ratio: 2.236
   doLoopsBenchmark 1100 compiled 813 ratio: 1.353
   pointCreation 988 compiled 968 ratio: 1.021
   largeExplorers 729 compiled 780 ratio: 0.935
   compilerBenchmark 529 compiled 480 ratio: *1.102*
   Cumulative Time 1113.161 compiled 538.355 ratio 2.068

   ExuperyBenchmarks>>arithmeticLoop 199ms
   SmallInteger>>benchmark 791ms
   InstructionStream>>interpretExtension:in:for: 14266ms
   Average 1309.515

There are many reasons why there isn't a larger gain. One is that only
a few methods are being compiled. There was a register allocation
issue. String>>at: and at:put: are not yet compiled, so they take the
slow path through to the interpreter. The same goes for ^ true, ^ false,
and ^ nil. All are used by those macro benchmarks.

The problem I'm working on now is that the stack is loaded into
registers when Exupery enters a context and then saved back into the
context when it leaves. This is particularly inefficient if the
registers get spilled to the C stack, as they're then copied into
registers only to be immediately copied to memory at a different
location. This makes real-world sends potentially worse than the send
benchmark.

With earlier benchmarks compilation was slowing down object allocation
because allocation is inlined into the main interpreter loop but
Exupery was doing a full worst case send to get to the primitive. (1)

To benefit from PICs both the sending and receiving methods must be
compiled. That may not be happening as much as it should be. This can
be blocked by some of the few missing bytecodes. Stack duplication is
the only serious missing bytecode.

The generated code is still relatively sloppy. Exupery doesn't handle
SIB-byte addressing modes, so it can't access literal indirections,
which are used heavily to access interpreter state. Temporaries are
not stored in registers; that will wait until after the stack register
improvements are finished.

The visible progress since the last release is the compiler benchmark
now shows a 10% gain and compiling interpretExtension:in:for: is about
9 times faster.

Bryce

(1) It wasn't creating the context but it was going through the PIC
then dropping into a helper method which ran through the interpreter's
send code.


[squeak-dev] Re: jitter (was: The Old Man)

Andreas.Raab
[hidden email] wrote:
> There's many reasons why there isn't a larger gain. One is only a few
> methods are being compiled.

Why not compile everything to see what the baseline gain is? This is
really the number I'm interested in - compile the entire image, and run
the macro benchmarks and compare that. And not only would it be a good
stress-test but for some applications (like our Croquet servers) that
would be entirely reasonable; even desirable.

Cheers,
   - Andreas




[squeak-dev] Re: jitter (was: The Old Man)

Bryce Kampjes
Andreas Raab writes:
 > [hidden email] wrote:
 > > There's many reasons why there isn't a larger gain. One is only a few
 > > methods are being compiled.
 >
 > Why not compile everything to see what the baseline gain is? This is
 > really the number I'm interested in - compile the entire image, and run
 > the macro benchmarks and compare that. And not only would it be a good
 > stress-test but for some applications (like our Croquet servers) that
 > would be entirely reasonable; even desirable.

The short answer is it's not ready yet. That's what I'm working
on. Exupery normally crashes after about an hour of real
use. Reliability is improving but it's not yet ready for real use
except in very controlled ways.

The benchmarks I showed are for driving development; they must be
reasonably quick so they can be easily run during development. Total
test time should be only a few minutes. I'd say if both macro
benchmarks were providing over 10% performance gains, then reliability
would be the top priority rather than one of the top priorities.

Compiling the entire image is probably overkill and may eat a few
hundred megabytes of RAM. Bytecodes are a very dense format while
machine code isn't. Also, Exupery at the moment needs to compile a
method once for each receiver class; this is only required when
compiling primitives such as #at: and #new, where it's useful to
specialise the compiled code for each different receiver. It would be
easy to re-use the same compiled code for multiple receivers, but this
hasn't yet been done. It's not been a problem when compilation is
driven from a profiler.

Bryce
