Interpreter>>isContextHeader: optimization


Re: Interpreter>>isContextHeader: optimization

David T. Lewis
 
On Mon, Feb 23, 2009 at 08:28:14AM +0200, Igor Stasenko wrote:

>
> Some more observations:
>
> #include <stdio.h>
>
> int main(int argc , char** argv)
> {
>   int i;
>
>   i = 1;
>   i = ({ int y=i; int i=6; printf("foo"); 10+y+i; });
>   printf("%d", i);
>
> }
>
> prints:
> foo17
>
> so, to generate proper inlined code, we only need to care about name
> clashes for method arguments.
> In the above code, suppose 'y' is an argument var,
> and 'i', declared in the inner scope, is a temporary var.

There is a bug in the inliner that produces name clashes like this. The fix is
in http://bugs.squeak.org/view.php?id=7155
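For illustration, here is a minimal sketch of the kind of clash the renamer must avoid (hypothetical variable names; compiles with GCC's statement-expression extension). If the inliner declares the inlined method's temporary before copying the argument expression, the temporary captures the reference to the outer variable:

#include <stdio.h>

int main(void)
{
  int i = 1;
  /* Broken inlining: the method's temporary 'i' is declared before the
     argument copy 'y' is made, so 'y' captures the inner 'i' (6), not
     the caller's 'i' (1). */
  i = ({ int i = 6; int y = i; 10 + y + i; });
  printf("%d\n", i);  /* prints 22 instead of the intended 17 */
  return 0;
}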

Related issues:
Mantis 0007183: Various Slang problems masked by CPP macros.
Mantis 0007189: CCodeGenerator>>doInlining: fix.
Mantis 0007186: Interpreter>>loadFloatOrIntFrom: inlining broken.
Mantis 0007155: Bug in TMethod>>renameVariablesUsing:
Mantis 0007154: Interpreter main loop case generation fix.

Dave


Re: Interpreter>>isContextHeader: optimization

Bryce Kampjes
In reply to this post by Eliot Miranda-2
 
Eliot Miranda writes:
 >  On Sun, Feb 22, 2009 at 12:54 PM, <[hidden email]> wrote:
 > > All you need is the optimiser to run early in compilation for it to be
 > > portable.
 >
 >
 > ...and for it to be untimely.  An adaptive optimizer by definition needs to
 > be running intermittently all the time.  It optimizes what is happening now,
 > not what happened at start-up.

Exupery runs as a Smalltalk background thread; it already uses dynamic
feedback to inline some primitives, including #at: and #at:put:.

 > > I see only one sixth of the time going into context creation for the
 > > send benchmark which is about as send heavy as you can get. That's
 > > running native code at about twice Squeak's speed. Also there's still
 > > plenty of inefficiency in Exupery's call return sequences.
 >
 >
 > So you could get a 17% speedup if you could remove the context overhead.
 >  That's quite a tidy gain.  I see a 26% increase in benchFib performance
 > between base Squeak and the StackVM with no native code at all.
 >
 > What are the inefficiences in Exupery's call return sequences?

Exupery uses a C call sequence so that it's easy to enter from the
interpreter; that C call frame is torn down when exiting each
compiled method, then re-created when re-entering native code. That's
a complete waste when going from one native method to another.

Also, the send/return sequence isn't yet well optimised; there are still
plenty of inefficiencies due to the lack of addressing modes etc., and because
it's a fairly naive translation of the interpreter's send code.

17% would be rather optimistic; some of the work required to set up a
context will always be needed. Temporaries will still need to be
nilled out, etc.

Bryce

Re: Interpreter>>isContextHeader: optimization

Eliot Miranda-2
 


On Mon, Feb 23, 2009 at 1:49 PM, <[hidden email]> wrote:

Eliot Miranda writes:
 >  On Sun, Feb 22, 2009 at 12:54 PM, <[hidden email]> wrote:
 > > All you need is the optimiser to run early in compilation for it to be
 > > portable.
 >
 >
 > ...and for it to be untimely.  An adaptive optimizer by definition needs to
 > be running intermittently all the time.  It optimizes what is happening now,
 > not what happened at start-up.

Exupery runs as a Smalltalk background thread; it already uses dynamic
feedback to inline some primitives, including #at: and #at:put:.

 > > I see only one sixth of the time going into context creation for the
 > > send benchmark which is about as send heavy as you can get. That's
 > > running native code at about twice Squeak's speed. Also there's still
 > > plenty of inefficiency in Exupery's call return sequences.
 >
 >
 > So you could get a 17% speedup if you could remove the context overhead.
 >  That's quite a tidy gain.  I see a 26% increase in benchFib performance
 > between base Squeak and the StackVM with no native code at all.
 >
 > What are the inefficiences in Exupery's call return sequences?

Exupery uses a C call sequence so that it's easy to enter from the
interpreter,

That's a non sequitur. The Cog VM has full interoperability between interpreter and machine code without a C calling convention. The interpreter doesn't call machine-code methods; instead it calls a trampoline that jumps to the right point in a machine-code method, allowing the system to optimize the common case of machine-code to machine-code calls (through inline caches). The return address of a machine-code frame above an interpreter frame is that of a routine that performs a return to the interpreter, so that returns don't have to check for returning to the interpreter.
 
that C call frame is torn down when exiting each
compiled method, then re-created when re-entering native code. That's
a complete waste when going from one native method to another.

So you go to all the effort of producing native code and then you wrap it in so much gunk that you get minimal performance benefit from it.  I don't understand.  What are your goals?  Experimenting with compilers or producing an efficient Squeak VM?

Also, the send/return sequence isn't yet well optimised; there are still
plenty of inefficiencies due to the lack of addressing modes etc., and because
it's a fairly naive translation of the interpreter's send code.

17% would be rather optimistic; some of the work required to set up a
context will always be needed. Temporaries will still need to be
nilled out, etc.

Again invalid assumptions.  Do without contexts (except as a frame access abstraction).  Sufficient adaptive optimization can avoid temporary initializations (e.g. by embedding information that records live ranges).
 


Bryce


Re: Interpreter>>isContextHeader: optimization

Eliot Miranda-2
In reply to this post by Bryce Kampjes
 
Bryce,

    please forgive my earlier reply, which I realise was extraordinarily rude and unnecessarily critical. Let me try to engage more constructively. I think I'll be making the same points, but hopefully I'll do so while being less of an a***hole.

On Mon, Feb 23, 2009 at 1:49 PM, <[hidden email]> wrote:

Eliot Miranda writes:
 >  On Sun, Feb 22, 2009 at 12:54 PM, <[hidden email]> wrote:
 > > All you need is the optimiser to run early in compilation for it to be
 > > portable.
 >
 >
 > ...and for it to be untimely.  An adaptive optimizer by definition needs to
 > be running intermittently all the time.  It optimizes what is happening now,
 > not what happened at start-up.

Exupery runs as a Smalltalk background thread; it already uses dynamic
feedback to inline some primitives, including #at: and #at:put:.

The background thread was used by Typed Smalltalk and is also used by some Java JITs. But how does running in a background thread help it be portable? Surely if it targets native code it needs to be split into a front-end and a back-end, of which only the front-end will be portable, right?
 
 > > I see only one sixth of the time going into context creation for the
 > > send benchmark which is about as send heavy as you can get. That's
 > > running native code at about twice Squeak's speed. Also there's still
 > > plenty of inefficiency in Exupery's call return sequences.

As your VM gets faster, that 1/6th will loom ever larger. If you triple the speed of your VM while keeping the same context creation scheme, that 1/6 will become 1/2 of the entire execution time. So if you want truly high performance you're going to have to tackle context elimination at some stage. Since it is so integral to the central issue of call/return design, I would encourage you to address it earlier rather than later.
 

 >
 >
 > So you could get a 17% speedup if you could remove the context overhead.
 >  That's quite a tidy gain.  I see a 26% increase in benchFib performance
 > between base Squeak and the StackVM with no native code at all.
 >
 > What are the inefficiences in Exupery's call return sequences?

Exupery uses a C call sequence so that it's easy to enter from the
interpreter; that C call frame is torn down when exiting each
compiled method, then re-created when re-entering native code. That's
a complete waste when going from one native method to another.

One sage piece of advice is to optimize for the common case. Try and make the common case as fast as possible. You know this, since you're also interested in adaptive optimization, which is fundamentally to do with optimizing the common case. So design your calling convention around machine-code to machine-code calls, and make the uncommon interpreter/machine-code call do the necessary work to interface with the calling convention, not the other way around. The dog should wag the tail (although try telling my father-in-law's labrador puppy that).

The way I've done this in Cog is to generate a trivial piece of machine code, a thunk/trampoline etc., that I actually call an enilopmart, because it jumps from the interpreter/run-time into machine code, whereas jumps in the other direction are via trampolines. The interpreter uses this by pushing the pc of the first instruction past the inline-cache checking code in the method, followed by the values of the register(s) that need loading, and then calling the enilopmart. The enilopmart assigns the stack and frame pointers to those of the machine-code frame being called, pops all the register values that are live on entry to the method off the stack, and returns. The return is effectively a jump to the start of the machine-code method.

To arrange that returns from machine-code frames don't have to check for a return to the interpreter, the interpreter saves its own instruction pointer in a slot in its frame and substitutes the address of a routine that handles returning to the interpreter as the return address. So when the machine-code frame returns, it returns to the trampoline, which retrieves the saved instruction pointer from the slot and longjmps back to the interpreter. I use a longjmp to avoid stack growth in any dance between interpreter and machine code.
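A minimal, runnable C model of this return scheme (names are hypothetical; the real VM patches raw return addresses in stack frames rather than using C structs and function pointers):

#include <setjmp.h>
#include <stdio.h>

static jmp_buf interpreterEntry;

/* A machine-code frame above an interpreter frame: the interpreter has
   parked its own ip in a slot and patched the return address to point
   at the trampoline below. */
typedef struct {
    int savedInterpreterIP;
    void (*returnAddress)(void);
} Frame;

static Frame topFrame;

/* Ordinary machine-code returns never test "am I returning to the
   interpreter?"; frames that do return there land here instead. */
static void returnToInterpreterTrampoline(void)
{
    printf("retrieving saved ip=%d\n", topFrame.savedInterpreterIP);
    longjmp(interpreterEntry, 1);   /* avoid C stack growth on the switch */
}

static void machineCodeMethod(void)
{
    printf("running machine code\n");
    topFrame.returnAddress();       /* models the frame's return */
}

int main(void)
{
    if (setjmp(interpreterEntry) == 0) {
        topFrame.savedInterpreterIP = 42;
        topFrame.returnAddress = returnToInterpreterTrampoline;
        machineCodeMethod();
    }
    printf("interpreter resumed\n");
    return 0;
}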
 
Also, the send/return sequence isn't yet well optimised; there are still
plenty of inefficiencies due to the lack of addressing modes etc., and because
it's a fairly naive translation of the interpreter's send code.

I would sit down and draw what you want the calling convention to look like, and do it sooner rather than later. This is crucial to overall performance until you solve the more difficult problem of doing significant adaptive optimization so that call/return is eliminated. However, as mentioned previously, eliminating call/return isn't such a great idea per se, and so keeping a fast call/return sequence is probably a very good idea anyway. It's not as if modern processors don't do call/return well.

17% would be rather optimistic, some of the work required to set up a
context will always be required. Temporaries will still need to be
nilled out etc.

Aim higher :)  I'm hoping for 10x current Squeak performance for Smalltalk-intensive benchmarks some time later this year.

 


Bryce

Cheers
and again apologies for having been so critical

Re: Interpreter>>isContextHeader: optimization

Bryce Kampjes
In reply to this post by Eliot Miranda-2
 
I've read your other mail.

Eliot Miranda writes:
 >  On Mon, Feb 23, 2009 at 1:49 PM, <[hidden email]> wrote:
 > >  > What are the inefficiences in Exupery's call return sequences?
 > >
 > > Exupery uses a C call sequence so it's easy to enter from the
 > > interpreter,
 >
 >
 > that's a non sequitur.  The Cog VM has full interoperability between
 > interpreter and machine code without a C calling convention.  The
 > interpreter doesn't call machine code methods, instead it calls a trampoline
 > that jumps to the right point in a machine code method, allowing the system
 > to optimize the common case of machine-code to machine-code calls (through
 > inline caches).  The return address of a machine-code frame above an
 > interpreter frame is that of a routine that does a return to the interpreter
 > so that returns don't have to check for returning to the interpreter.
 >
 >
 > > that C call frame is torn down when exiting each
 > > compiled method then re-created when reentering native code. That's
 > > a complete waste when going from one native method to another.
 >
 >
 > So you go to all the effort of producing native code and then you wrap it in
 > so much gunk that you get minimal performance benefit from it.  I don't
 > understand.  What are your goals?  Experimenting with compilers or producing
 > an efficient Squeak VM?

My goal is an efficient Squeak VM. Exupery's current send performance
is about twice the interpreter's: just above twice on Athlons, just
below twice on the Core 2. This was achieved primarily by using PICs.
Once send performance was twice the interpreter's, other issues became
more pressing, including reliability and compile speed. More call
sequence optimisation is planned but is unlikely to be next.

If Cog's successful then it may make sense for Exupery to use the same
calling conventions as Cog, so Cog->Exupery->Cog calls are zero
cost. Exupery is built to perform heavy optimisation on inlined code;
inlining is also needed to expose entire loops for further
optimisation.

Bryce

Re: Interpreter>>isContextHeader: optimization

David T. Lewis
In reply to this post by David T. Lewis
 
On Mon, Feb 23, 2009 at 07:39:23AM -0500, David T. Lewis wrote:

>  
> On Mon, Feb 23, 2009 at 07:05:39AM +0200, Igor Stasenko wrote:
> >
> > I never took a look at how the method inliner works in the code
> > generator, but some of its limitations are a pain,
> > like being unable to inline a method that has cCode or C variable
> > declarations/definitions.
>
> I think I forgot to put this on Mantis (sorry), but look at VMMaker-dtl.103
> on SqueakSource for the changes that allow inlining methods with embedded C
> code:

Correction: The changes to permit inlining of methods with embedded C code
are on Mantis 7190, and were first included in VMMaker as of VMMaker-dtl.94
on SqueakSource. The PermitInliningCCode-dtl.5.cs change set is attached to
the Mantis report.

Dave


Re: Interpreter>>isContextHeader: optimization

Bryce Kampjes
In reply to this post by Eliot Miranda-2
 
Eliot Miranda writes:
 >  Brice,
 >     please forgive my earlier reply which I realise was extraordinarily rude
 > and unnecessarily critical.  Let me try and engage more constructively.  I
 > think I'll be making the same points but hopefulyl I'll do so while being
 > less of an a***hole.
 >
 > On Mon, Feb 23, 2009 at 1:49 PM, <[hidden email]> wrote:
 >
 > >
 > > Eliot Miranda writes:
 > >  >  On Sun, Feb 22, 2009 at 12:54 PM, <[hidden email]> wrote:
 > >  > > All you need is the optimiser to run early in compilation for it to be
 > >  > > portable.
 > >  >
 > >  >
 > >  > ...and for it to be untimely.  An adaptive optimizer by definition needs
 > > to
 > >  > be running intermittently all the time.  It optimizes what is happening
 > > now,
 > >  > not what happened at start-up.
 > >
 > > Exupery runs as a Smalltalk background thread, it already uses dynamic
 > > feed back to inline some primitives including #at: and #at:put.
 >
 >
 > The background thread was used by Typed Smalltalk and is also used by some
 > Java jits.  But how does it running in a background thread help it be
 > portable?  Surely if it targets native code it needs to be split into a
 > front-end and a back-end of which only the front end will be portable right?

Most of the code in the back end is portable too. Exupery's back end
is split into three stages: instruction selection, register
allocation, then assembly. The register allocator is the biggest and
most complex part of Exupery so far, and it is portable.

Running as a background thread doesn't make it portable, but it does
make it more timely than a compile-on-load system.

 >
 > >  > > I see only one sixth of the time going into context creation for the
 > >  > > send benchmark which is about as send heavy as you can get. That's
 > >  > > running native code at about twice Squeak's speed. Also there's still
 > >  > > plenty of inefficiency in Exupery's call return sequences.
 > >
 >
 > As your VM gets faster so that 1/6th will loom ever larger.  If you triple
 > the speed of your VM while keeping that same context creation scheme then
 > that 1/6 will become 1/2 of entire execution time.  So if you want truly
 > high performance you're going to have to tackle context elimination at some
 > stage.  Since it is so integral to the central issue of call/return design I
 > would encourage you to address it earlier rather than later.
 >
 >
 > >
 > >  >
 > >  >
 > >  > So you could get a 17% speedup if you could remove the context overhead.
 > >  >  That's quite a tidy gain.  I see a 26% increase in benchFib performance
 > >  > between base Squeak and the StackVM with no native code at all.
 > >  >
 > >  > What are the inefficiences in Exupery's call return sequences?
 > >
 > > Exupery uses a C call sequence so it's easy to enter from the
 > > interpreter, that C call frame is torn down when exiting each
 > > compiled method then re-created when reentering native code. That's
 > > a complete waste when going from one native method to another.
 >
 >
 > One sage piece of advice is to optimize for the common case.  Try and make
 > the common case as fast as possible.  You know this since you're also
 > interested in adaptive optimization which is fundamentally to do with
 > optimizing the common case.  So design your calling convention around
 > machine-code to machine-code calls and make the uncommon
 > interpreter/machine-code call do the necessary work to interface with the
 > calling-convention not the other way around.  The dog should wag the tail
 > (although try telling my father-in-law's labrador puppy that).
 >
 > The way I've done this in Cog is to generate a trivial piece of machine
 > code, a thunk/trampoline etc, that I actually call an enilopmart because it
 > jumps from the interpreter/run-time into machine-code whereas jumps in the
 > other direction are via trampolines.  The interpreter uses this by pushing
 > the pc of the first instruction past the in-line cache checking code in the
 > method, followed by the values of the register(s) that need loading and then
 > calls the enilopmart. The enilopmart assigns stack and frame pointers with
 > that of the machine-code frame being called, pops all the register values
 > that are live on entry to the method off the stack and returns.  The return
 > is effectively a jump to the start of the machine-code method.

Why does Cog still rely on the interpreter? Is this just a
bootstrapping phase?

 > To arrange that returns form machine-code frames don't have to check for a
 > return to the interpreter the interpreter saves its own instruction pointer
 > in a slot in its frame, and substitutes the address of a routine that
 > handles returning to the interpreter as the return address.  So when the
 > machine code frame returns it'll return to the trampoline that retrieves the
 > saved instruction pointer form the slot and longjmps back to the
 > interpreter.  I use a lngjmp to avoid stack growth in any dance between
 > interpreter and machine code.
 >
 >
 > > Also the send/return sequence isn't yet that optimised, there's still
 > > plenty of inefficiencies due to lack of addressing modes etc and because
 > > it's fairly naive translation of the interpreters send code.
 >
 >
 > I would sit down and draw what you want the calling convention to look like
 > and do it sooner rather than later.  This is crucial to overall performance
 > until you solve the more difficult problem of doing significant adaptive
 > optimization so that call/return is eliminated.  However, as mentioned
 > previously eliminating call/return isn't such a great idea per se and so
 > keeping a fast call/return sequence is probably a very good idea anyway.
 >  Its not as if modern processors don't do call/return well.

For Exupery to make sense it needs significant adaptive optimisation
to expose enough code to allow heavy optimisation to justify the
slower compiler. Otherwise it would make more sense to move the
compiler into the VM and allow compilation for all execution like VW.

The current system's primary goal is to enable the development of
adaptive optimisation. I'll tune to provide decent performance now, but
not if it makes it harder to add key features later.

I've thought about using a context stack several times over the years.
The key benefits are faster returns and possibly faster
de-optimisation of inlined contexts. After inlining it's possible that
a code change will break an optimisation. If an inlined method is
modified then it will continue to be entered until all contexts have
died or it is actively removed. Removing inlined contexts from object
memory requires a full memory scan to find them (allInstances).

 > 17% would be rather optimistic, some of the work required to set up a
 > > context will always be required. Temporaries will still need to be
 > > nilled out etc.
 >
 >
 > Aim higher :)  I'm hoping for 10x current Squeak performance for
 > Smalltalk-intensive benchmarks some time later this year.

My original and current aim is to double VW's performance, or be roughly
equivalent to C.

Bryce

Re: Interpreter>>isContextHeader: optimization

Eliot Miranda-2
 


On Sun, Mar 1, 2009 at 1:46 PM, <[hidden email]> wrote:

Eliot Miranda writes:
 >  Brice,
 >     please forgive my earlier reply which I realise was extraordinarily rude
 > and unnecessarily critical.  Let me try and engage more constructively.  I
 > think I'll be making the same points but hopefulyl I'll do so while being
 > less of an a***hole.
 >
 > On Mon, Feb 23, 2009 at 1:49 PM, <[hidden email]> wrote:
 >
 > >
 > > Eliot Miranda writes:
 > >  >  On Sun, Feb 22, 2009 at 12:54 PM, <[hidden email]> wrote:
 > >  > > All you need is the optimiser to run early in compilation for it to be
 > >  > > portable.
 > >  >
 > >  >
 > >  > ...and for it to be untimely.  An adaptive optimizer by definition needs
 > > to
 > >  > be running intermittently all the time.  It optimizes what is happening
 > > now,
 > >  > not what happened at start-up.
 > >
 > > Exupery runs as a Smalltalk background thread, it already uses dynamic
 > > feed back to inline some primitives including #at: and #at:put.
 >
 >
 > The background thread was used by Typed Smalltalk and is also used by some
 > Java jits.  But how does it running in a background thread help it be
 > portable?  Surely if it targets native code it needs to be split into a
 > front-end and a back-end of which only the front end will be portable right?

Most of the code in the back end is portable too. Exupery's back end
is split into three stages: instruction selection, register
allocation, then assembly. The register allocator is the biggest and
most complex part of Exupery so far, and it is portable.

Running as a background thread doesn't make it portable, but it does
make it more timely than a compile-on-load system.

 >
 > >  > > I see only one sixth of the time going into context creation for the
 > >  > > send benchmark which is about as send heavy as you can get. That's
 > >  > > running native code at about twice Squeak's speed. Also there's still
 > >  > > plenty of inefficiency in Exupery's call return sequences.
 > >
 >
 > As your VM gets faster so that 1/6th will loom ever larger.  If you triple
 > the speed of your VM while keeping that same context creation scheme then
 > that 1/6 will become 1/2 of entire execution time.  So if you want truly
 > high performance you're going to have to tackle context elimination at some
 > stage.  Since it is so integral to the central issue of call/return design I
 > would encourage you to address it earlier rather than later.
 >
 >
 > >
 > >  >
 > >  >
 > >  > So you could get a 17% speedup if you could remove the context overhead.
 > >  >  That's quite a tidy gain.  I see a 26% increase in benchFib performance
 > >  > between base Squeak and the StackVM with no native code at all.
 > >  >
 > >  > What are the inefficiences in Exupery's call return sequences?
 > >
 > > Exupery uses a C call sequence so it's easy to enter from the
 > > interpreter, that C call frame is torn down when exiting each
 > > compiled method then re-created when reentering native code. That's
 > > a complete waste when going from one native method to another.
 >
 >
 > One sage piece of advice is to optimize for the common case.  Try and make
 > the common case as fast as possible.  You know this since you're also
 > interested in adaptive optimization which is fundamentally to do with
 > optimizing the common case.  So design your calling convention around
 > machine-code to machine-code calls and make the uncommon
 > interpreter/machine-code call do the necessary work to interface with the
 > calling-convention not the other way around.  The dog should wag the tail
 > (although try telling my father-in-law's labrador puppy that).
 >
 > The way I've done this in Cog is to generate a trivial piece of machine
 > code, a thunk/trampoline etc, that I actually call an enilopmart because it
 > jumps from the interpreter/run-time into machine-code whereas jumps in the
 > other direction are via trampolines.  The interpreter uses this by pushing
 > the pc of the first instruction past the in-line cache checking code in the
 > method, followed by the values of the register(s) that need loading and then
 > calls the enilopmart. The enilopmart assigns stack and frame pointers with
 > that of the machine-code frame being called, pops all the register values
 > that are live on entry to the method off the stack and returns.  The return
 > is effectively a jump to the start of the machine-code method.

Why does Cog still rely on the interpreter? Is this just a
bootstrapping phase?

 It is by design, given my experience with HPS (VW's VM).

First, being able to rely on the interpreter means the JIT doesn't have to waste time compiling infrequently used or large methods.  There are methods, such as the Unicode initializers, that are humongous, are very rarely run, and when run only run once (e.g. once on start-up).  These methods typically take longer to compile than to interpret and consume huge amounts of code space.  It makes more sense to leave these to the interpreter.  My hunch is that overall performance will improve because machine code working set size will be kept small (currently Cog starts up a Qwaq Croquet development image generating less than 640k of code).
 
Second, if one is using a fixed-size code cache (there are advantages in being able to traverse all generated code quickly, and in keeping it in one place) then one must be able to survive running out of space. In VW this leads to a few points where things get very tricky: e.g. one is trying to link an inline cache to a newly compiled machine-code method target, but the compilation caused the code zone to be compacted, moving the call site one is trying to link. With the interpreter to fall back on, the VM can always make progress even if it has run out of space. Hence compacting the code zone can be deferred until the next event check, just as is done with GC. This makes code compaction simpler, and hence more reliable and quicker to implement.

Third, keeping the interpreter around means maintaining the current levels of portability.  One can always fall back on the interpreter if one doesn't have the back-end for the current ISA or one is having to run in ROM.

Fourth, I've long been curious about JIT/Interpreter hybrids and whether they're really hard to get to work well.  Turns out it is not too bad.  I of course had to iterate twice before I understood what I was doing, but the system seems to be quite comprehensible.  It is complicated, but not as complicated as HPS.

 > To arrange that returns form machine-code frames don't have to check for a
 > return to the interpreter the interpreter saves its own instruction pointer
 > in a slot in its frame, and substitutes the address of a routine that
 > handles returning to the interpreter as the return address.  So when the
 > machine code frame returns it'll return to the trampoline that retrieves the
 > saved instruction pointer form the slot and longjmps back to the
 > interpreter.  I use a lngjmp to avoid stack growth in any dance between
 > interpreter and machine code.
 >
 >
 > > Also the send/return sequence isn't yet that optimised, there's still
 > > plenty of inefficiencies due to lack of addressing modes etc and because
 > > it's fairly naive translation of the interpreters send code.
 >
 >
 > I would sit down and draw what you want the calling convention to look like
 > and do it sooner rather than later.  This is crucial to overall performance
 > until you solve the more difficult problem of doing significant adaptive
 > optimization so that call/return is eliminated.  However, as mentioned
 > previously eliminating call/return isn't such a great idea per se and so
 > keeping a fast call/return sequence is probably a very good idea anyway.
 >  Its not as if modern processors don't do call/return well.

For Exupery to make sense it needs significant adaptive optimisation
to expose enough code to allow heavy optimisation to justify the
slower compiler. Otherwise it would make more sense to move the
compiler into the VM and allow compilation for all execution like VW.

The current system's primary goal is to enable the development of
adaptive optimisation. I'll tune to provide decent performance now, but
not if it makes it harder to add key features later.

AOStA/SIStA is, I think, a quicker route. I like your suggestion of looking at moving Exupery to the stack VM. Even though I would put the back-end code generator in the VM, I would love to use Exupery's front-end in AOStA/SIStA.

I've thought about using a context stack several times over the years.
The key benefits are faster returns

and much faster sends
 
and possibly faster
de-optimisation of inlined contexts. After inlining it's possible that
a code change will break an optimisation. If an inlined method is
modified then it will continue to be entered until all contexts have
died or it is actively removed. Removing inlined contexts from object
memory requires a full memory scan to find them (allInstances).

There's another key advantage, and that is the freedom to lay out an optimized stack. A key goal of AOStA/SIStA is unboxing floating-point values and mapping them to floating-point registers. I like the idea of having an abstract model, represented by an OptimizedContext, of two stacks: an object stack, and a byte-data stack for raw data (unboxed floating-point, untagged integers, etc.). That would be good for the debugger and deoptimization. But in the VM one would want to map these two different stacks to a single machine stack with info for the GC as to which slots are object references and which are not (a la Java stack frames), and have the active frame keep much of this unboxed state in machine registers.
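A small sketch of what such a GC map might look like on a single merged frame (purely illustrative names and layout, not Cog's):

#include <stdint.h>
#include <stdio.h>

/* One machine frame holding both oops and raw (unboxed) values, with a
   bitmask telling the GC which slots are object references, a la Java
   stack maps. */
typedef struct {
    uintptr_t slots[8];
    uint8_t   oopMask;   /* bit i set => slots[i] is an oop */
} MixedFrame;

static void scanFrameForGC(const MixedFrame *f)
{
    for (int i = 0; i < 8; i++)
        if (f->oopMask & (1u << i))
            printf("slot %d: trace as oop\n", i);
        else
            printf("slot %d: raw data, skip\n", i);
}

int main(void)
{
    MixedFrame f = { {0}, 0 };
    f.slots[0] = 0x1000;          /* an object reference */
    f.oopMask |= 1u << 0;
    f.slots[1] = 12345;           /* an untagged integer */
    scanFrameForGC(&f);
    return 0;
}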

 > 17% would be rather optimistic, some of the work required to set up a
 > > context will always be required. Temporaries will still need to be
 > > nilled out etc.
 >
 >
 > Aim higher :)  I'm hoping for 10x current Squeak performance for
 > Smalltalk-intensive benchmarks some time later this year.

My original and current aim is to double VW's performance, or be roughly
equivalent to C.

And what's the state of this effort? What metrics lead you to believe you can double VW's performance? Where are you currently? What do you mean by "equivalent to C": unoptimized, -O1, -O4 -funroll-loops, gcc, the Intel C compiler? What benchmarks have you focussed on?

Bryce

Best
Eliot

Re: Interpreter>>isContextHeader: optimization

Eliot Miranda-2
 


On Sun, Mar 1, 2009 at 2:21 PM, Eliot Miranda <[hidden email]> wrote:


On Sun, Mar 1, 2009 at 1:46 PM, <[hidden email]> wrote:

Eliot Miranda writes:
 >  Brice,
 >     please forgive my earlier reply which I realise was extraordinarily rude
 > and unnecessarily critical.  Let me try and engage more constructively.  I
 > think I'll be making the same points but hopefulyl I'll do so while being
 > less of an a***hole.
 >
 > On Mon, Feb 23, 2009 at 1:49 PM, <[hidden email]> wrote:
 >
 > >
 > > Eliot Miranda writes:
 > >  >  On Sun, Feb 22, 2009 at 12:54 PM, <[hidden email]> wrote:
 > >  > > All you need is the optimiser to run early in compilation for it to be
 > >  > > portable.
 > >  >
 > >  >
 > >  > ...and for it to be untimely.  An adaptive optimizer by definition needs
 > > to
 > >  > be running intermittently all the time.  It optimizes what is happening
 > > now,
 > >  > not what happened at start-up.
 > >
 > > Exupery runs as a Smalltalk background thread, it already uses dynamic
 > > feed back to inline some primitives including #at: and #at:put.
 >
 >
 > The background thread was used by Typed Smalltalk and is also used by some
 > Java jits.  But how does it running in a background thread help it be
 > portable?  Surely if it targets native code it needs to be split into a
 > front-end and a back-end of which only the front end will be portable right?

Most of the code in the back end is portable too. Exupery's back end
is split into three stages: instruction selection, register
allocation, then assembly. The register allocator is the biggest and
most complex part of Exupery so far, and it is portable.

Running as a background thread doesn't make it portable, but it does
make it more timely than a compile-on-load system.

 >
 > >  > > I see only one sixth of the time going into context creation for the
 > >  > > send benchmark which is about as send heavy as you can get. That's
 > >  > > running native code at about twice Squeak's speed. Also there's still
 > >  > > plenty of inefficiency in Exupery's call return sequences.
 > >
 >
 > As your VM gets faster so that 1/6th will loom ever larger.  If you triple
 > the speed of your VM while keeping that same context creation scheme then
 > that 1/6 will become 1/2 of entire execution time.  So if you want truly
 > high performance you're going to have to tackle context elimination at some
 > stage.  Since it is so integral to the central issue of call/return design I
 > would encourage you to address it earlier rather than later.
 >
 >
 > >
 > >  >
 > >  >
 > >  > So you could get a 17% speedup if you could remove the context overhead.
 > >  >  That's quite a tidy gain.  I see a 26% increase in benchFib performance
 > >  > between base Squeak and the StackVM with no native code at all.
 > >  >
 > >  > What are the inefficiences in Exupery's call return sequences?
 > >
 > > Exupery uses a C call sequence so it's easy to enter from the
 > > interpreter, that C call frame is torn down when exiting each
 > > compiled method then re-created when reentering native code. That's
 > > a complete waste when going from one native method to another.
 >
 >
 > One sage piece of advice is to optimize for the common case.  Try and make
 > the common case as fast as possible.  You know this since you're also
 > interested in adaptive optimization which is fundamentally to do with
 > optimizing the common case.  So design your calling convention around
 > machine-code to machine-code calls and make the uncommon
 > interpreter/machine-code call do the necessary work to interface with the
 > calling-convention not the other way around.  The dog should wag the tail
 > (although try telling my father-in-law's labrador puppy that).
 >
 > The way I've done this in Cog is to generate a trivial piece of machine
 > code, a thunk/trampoline etc, that I actually call an enilopmart because it
 > jumps from the interpreter/run-time into machine-code whereas jumps in the
 > other direction are via trampolines.  The interpreter uses this by pushing
 > the pc of the first instruction past the in-line cache checking code in the
 > method, followed by the values of the register(s) that need loading and then
 > calls the enilopmart. The enilopmart assigns stack and frame pointers with
 > that of the machine-code frame being called, pops all the register values
 > that are live on entry to the method off the stack and returns.  The return
 > is effectively a jump to the start of the machine-code method.

Why does Cog still rely on the interpreter? Is this just a
bootstrapping phase?

 It is by design, given my experience with HPS (VW's VM).

First, being able to rely on the interpreter means the JIT doesn't have to waste time compiling infrequently used or large methods.  There are methods, such as the Unicode initializers, that are humongous, are very rarely run, and when run only run once (e.g. once on start-up).  These methods typically take longer to compile than to interpret and consume huge amounts of code space.  It makes more sense to leave these to the interpreter.  My hunch is that overall performance will improve because machine code working set size will be kept small (currently Cog starts up a Qwaq Croquet development image generating less than 640k of code).
 
Second, if one is using a fixed-size code cache (there are advantages in being able to traverse all generated code quickly, and in keeping it in one place) then one must be able to survive running out of space. In VW this leads to a few points where things get very tricky: e.g. one is trying to link an inline cache to a newly compiled machine-code method target, but the compilation caused the code zone to be compacted, moving the call site one is trying to link. With the interpreter to fall back on, the VM can always make progress even if it has run out of space. Hence compacting the code zone can be deferred until the next event check, just as is done with GC. This makes code compaction simpler, and hence more reliable and quicker to implement.

Third, keeping the interpreter around means maintaining the current levels of portability.  One can always fall back on the interpreter if one doesn't have the back-end for the current ISA or one is having to run in ROM.

Fourth, I've long been curious about JIT/Interpreter hybrids and whether they're really hard to get to work well.  Turns out it is not too bad.  I of course had to iterate twice before I understood what I was doing, but the system seems to be quite comprehensible.  It is complicated, but not as complicated as HPS.

Oops! I forgot to mention another important advantage. With a pure JIT, the system somehow has to cope with resuming execution at an arbitrary bytecode. In the Debugger one can step a context to a point that may not be a resumption point in machine code. In code generated by an optimizing JIT that is doing "deferred code generation" (Ian Piumarta's term for eliminating intermediate pushes and pops when generating machine code, something HPS does too) there may only be resumption points for a small subset of the bytecodes, e.g. the bytecode following a send, which maps to the return from a call in machine code. If you have a look at the VW debugger you'll see that it has the responsibility of stepping a context up to a bytecode pc that corresponds to a resumption point in machine code. This code is, um, opaque and fragile, because it constitutes an unwritten contract between the image and the VM.

With an interpreter, however, the VM can simply resume execution in the interpreter if a context is not at a point that corresponds to a resumption point in machine code. Nice. (And this advantage would also apply in SIStA, where the interpreter would still be capable of interpreting optimized bytecode; apologies if this doesn't make sense, there's a lot of background I'm skipping here.)


 > To arrange that returns form machine-code frames don't have to check for a
 > return to the interpreter the interpreter saves its own instruction pointer
 > in a slot in its frame, and substitutes the address of a routine that
 > handles returning to the interpreter as the return address.  So when the
 > machine code frame returns it'll return to the trampoline that retrieves the
 > saved instruction pointer form the slot and longjmps back to the
 > interpreter.  I use a lngjmp to avoid stack growth in any dance between
 > interpreter and machine code.
 >
 >
 > > Also the send/return sequence isn't yet that optimised, there's still
 > > plenty of inefficiencies due to lack of addressing modes etc and because
 > > it's fairly naive translation of the interpreters send code.
 >
 >
 > I would sit down and draw what you want the calling convention to look like
 > and do it sooner rather than later.  This is crucial to overall performance
 > until you solve the more difficult problem of doing significant adaptive
 > optimization so that call/return is eliminated.  However, as mentioned
 > previously eliminating call/return isn't such a great idea per se and so
 > keeping a fast call/return sequence is probably a very good idea anyway.
 >  Its not as if modern processors don't do call/return well.

For Exupery to make sense it needs significant adaptive optimisation
to expose enough code to allow heavy optimisation to justify the
slower compiler. Otherwise it would make more sense to move the
compiler into the VM and allow compilation for all execution like VW.

The current system's primary goal is to enable the development of
adaptive optimisation. I'll tune to provide decent performance now, but
not if it makes it harder to add key features later.

AOStA/SIStA is, I think, a quicker route. I like your suggestion of looking at moving Exupery to the stack VM. Even though I would put the back-end code generator in the VM, I would love to use Exupery's front-end in AOStA/SIStA.

I've thought about using a context stack several times over the years.
The key benefits are faster returns

and much faster sends
 
and possibly faster
de-optimisation of inlined contexts. After inlining it's possible that
a code change will break an optimisation. If an inlined method is
modified then it will continue to be entered until all contexts have
died or it is actively removed. Removing inlined contexts from object
memory requires a full memory scan to find them (allInstances).

There's another key advantage, and that is the freedom to lay out an optimized stack. A key goal of AOStA/SIStA is unboxing floating-point values and mapping them to floating-point registers. I like the idea of having an abstract model, represented by an OptimizedContext, of two stacks: an object stack, and a byte-data stack for raw data (unboxed floating-point, untagged integers, etc.). That would be good for the debugger and deoptimization. But in the VM one would want to map these two different stacks to a single machine stack with info for the GC as to which slots are object references and which are not (a la Java stack frames), and have the active frame keep much of this unboxed state in machine registers.

 > 17% would be rather optimistic, some of the work required to set up a
 > > context will always be required. Temporaries will still need to be
 > > nilled out etc.
 >
 >
 > Aim higher :)  I'm hoping for 10x current Squeak performance for
 > Smalltalk-intensive benchmarks some time later this year.

My original and current aim is to double VW's performance, or be roughly
equivalent to C.

And what's the state of this effort? What metrics lead you to believe you can double VW's performance? Where are you currently? What do you mean by "equivalent to C": unoptimized, -O1, -O4 -funroll-loops, gcc, the Intel C compiler? What benchmarks have you focussed on?

Bryce

Best
Eliot


Re: Cog and Exupery

Bryce Kampjes
In reply to this post by Eliot Miranda-2
 
Eliot Miranda writes:
 >  On Sun, Mar 1, 2009 at 1:46 PM, <[hidden email]> wrote:
 > I've thought about using a context stack several times over the years.
 > > The key benefits are faster returns
 >
 > and much faster sends

Well, sends with few arguments could be optimised without a context
stack: pass the arguments in registers; getting a context from the
list is quick; and the PIC uses an unconditional jump rather than a
call.

It's even possible to avoid adjusting the size of the stack frame in
most cases, by allocating a frame big enough for most contexts. Exupery
only uses the stack frame for spilling internal temporary registers, so
its use tends to be small. Stack variables spill directly into the
context. (1)

 > >  > Aim higher :)  I'm hoping for 10x current Squeak performance for
 > >  > Smalltalk-intensive benchmarks some time later this year.
 > >
 > > My original and current aim is double VW's performance or be roughly
 > > equivalent to C.
 >
 >
 > And what's the state of this effort?  What metrics lead you to believe you
 > can double VW's performance?  Where are you currently?  What do you mean by
 > "equivalent to C", unoptimized, -O1, -O4 -funroll-loops, gcc, Intel C
 > compiler?  What benchmarks have you focussed on?

Exupery is fairly stable and providing modest gains, but needs more
tuning before it performs as it should. Effort could now go into
either tuning the current engine so it performs as it should, or adding
full method inlining. Until recently, reliability was the main
show-stopper to actual use. There are still plenty of areas to optimise
to improve both compile time and run time.

Here's the current benchmarks:

  arithmaticLoopBenchmark 390 compiled 80 ratio: 4.875
  bytecodeBenchmark 724 compiled 250 ratio: 2.895
  sendBenchmark 663 compiled 385 ratio: 1.722
  doLoopsBenchmark 381 compiled 235 ratio: 1.621
  pointCreation 394 compiled 389 ratio: 1.013
  largeExplorers 269 compiled 210 ratio: 1.280
  compilerBenchmark 273 compiled 250 ratio: 1.092
  Cumulative Time 413.408 compiled 232.706 ratio 1.777

  ExuperyBenchmarks>>arithmeticLoop 103ms
  SmallInteger>>benchmark 341ms
  InstructionStream>>interpretExtension:in:for: 6069ms
  Average 597.360

largeExplorers and compilerBenchmark are both real code. They do vary
depending on what the profiler decides to profile. The rest are micro
benchmarks. The benchmark suite looks much worse since upgrading to
a Core 2, which is very efficient at running the interpreter.

Optimisation so far has been about removing gross inefficiencies. I
avoid any optimisation that's likely to make other cases worse; e.g.
the last two optimisations were calling all primitives from native
code (rather than just the few that Exupery compiles) and adding
indirect literal addressing to access VM variables.
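Indirect literal addressing here presumably amounts to something like the following sketch (hypothetical names; one shared table of variable addresses instead of an absolute address embedded at every access site):

#include <stdio.h>

static long stackPointerVar, framePointerVar;

/* Generated code references one table and small indices, rather than
   embedding each VM variable's absolute address as an immediate. */
static long *vmVariableTable[] = { &stackPointerVar, &framePointerVar };

#define VM_VAR(i) (*vmVariableTable[i])   /* one indirection per access */

int main(void)
{
    VM_VAR(0) = 16;
    VM_VAR(1) = 32;
    printf("%ld %ld\n", stackPointerVar, framePointerVar);
    return 0;
}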

What do I mean by equivalent to C? Basically, being able to fully
optimise predictable code, and to optimise unpredictable code enough that
the remaining inefficiencies are hidden behind L2/L3 cache misses and
branch mispredicts. C was executing around 1 instruction per clock
measured from system code; that leaves some room to hide inefficiencies.

Why do I believe it's possible? With dynamic type information we can
optimise for the common case, and after inlining loops we can often
pull the type checks out of the loops. Then it's just a case of
competing optimiser against optimiser, but most of the gains for most code
are likely to come from a small set of optimisations.

I've also played around with a few thought experiments including
looking at what it would take to compile a dot product with zero
overhead inside the loops.
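For reference, the kind of inner loop such a compilation would aim to reach: unboxed doubles, no tag or bounds checks in the body (illustrative target code, not Exupery output):

#include <stdio.h>

static double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];   /* no dispatch or type-check overhead */
    return sum;
}

int main(void)
{
    double a[] = {1, 2, 3}, b[] = {4, 5, 6};
    printf("%f\n", dot(a, b, 3));   /* 32.000000 */
    return 0;
}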

Bryce

(1) I really should move temps and arguments into registers
too. Currently only stack values are held in registers. Moving
more variables into registers only made sense after spilling
directly into the context and reducing interference in the
register allocator.  

Re: Cog and Exupery

Eliot Miranda-2
 


On Wed, Mar 4, 2009 at 2:47 PM, <[hidden email]> wrote:

Eliot Miranda writes:
 >  On Sun, Mar 1, 2009 at 1:46 PM, <[hidden email]> wrote:
 > I've thought about using a context stack several times over the years.
 > > The key benefits are faster returns
 >
 > and much faster sends

Well sends with few arguments could be optimised without a context
stack. Pass the arguments in registers, getting a context from the
list is quick, and the PIC uses an unconditional jump rather than a
call.

Getting anything off a list is not quick compared with a call. Taking things off a list involves a write. Calls only write a return address to the stack (and then only on CISCs), which is well optimized by current processors.

PICs don't eliminate calls.  A send is a call whether through a PIC or not.  One calls the PIC (if only to collect a return address) and the PIC jumps.  There's still always a call.
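A toy model of that shape (hypothetical names; a real PIC is a generated stub comparing class tags and jumping, not a C switch):

#include <stdio.h>

typedef struct { int classTag; } Object;

static void stringPrintOn(Object *o) { (void)o; printf("String>>printOn:\n"); }
static void symbolPrintOn(Object *o) { (void)o; printf("Symbol>>printOn:\n"); }

/* The send site always *calls* the stub (pushing a return address);
   the stub compares the receiver's class and *jumps* to the case. */
static void picStub(Object *receiver)
{
    switch (receiver->classTag) {
    case 1: stringPrintOn(receiver); return;
    case 2: symbolPrintOn(receiver); return;
    default: printf("PIC miss: full lookup, then extend the PIC\n");
    }
}

int main(void)
{
    Object aString = {1}, aSymbol = {2};
    picStub(&aString);
    picStub(&aSymbol);
    return 0;
}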

Do some measurements on modern processors and see if there are significant differences between a call and an unconditional jump. A call is an unconditional jump plus the pushing of a return address. When I last looked at this carefully I found there was no significant difference in performance unless one was being deeply recursive; i.e. attempting to use unconditional jumps to save time, e.g. by inlining accessors, gained nothing, because the processor was taking care to make calls and returns cost the same as unconditional jumps (this on x86 in the 2004 timeframe).
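In the spirit of that suggestion, a rough micro-benchmark sketch (uses GCC's noinline attribute; a careful measurement needs performance counters and attention to what the optimizer folds away):

#include <stdio.h>
#include <time.h>

__attribute__((noinline)) static long addOne(long x) { return x + 1; }

int main(void)
{
    const long N = 100000000L;
    long acc = 0;
    clock_t t0 = clock();
    for (long i = 0; i < N; i++) acc = addOne(acc);   /* call + return */
    clock_t t1 = clock();
    for (long i = 0; i < N; i++) acc = acc + 1;       /* inlined body  */
    clock_t t2 = clock();
    printf("call: %ldms  inline: %ldms  (acc=%ld)\n",
           (long)((t1 - t0) * 1000 / CLOCKS_PER_SEC),
           (long)((t2 - t1) * 1000 / CLOCKS_PER_SEC), acc);
    return 0;
}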


It's even possible to avoid adjusting the size of the stack frame in
most cases by allocating a frame big enough for most contexts. Exupery
only uses the stack frame for spilling internal temporary registers so
it's use tends to be small. Stack variables spill directly into the
context. (1)

I don't understand the implication here.  With a stack the frame size is essentially irrelevant.  With a context it certainly is not.  If all contexts are large then one eats through memory faster and pays the price.


 > >  > Aim higher :)  I'm hoping for 10x current Squeak performance for
 > >  > Smalltalk-intensive benchmarks some time later this year.
 > >
 > > My original and current aim is double VW's performance or be roughly
 > > equivalent to C.
 >
 >
 > And what's the state of this effort?  What metrics lead you to believe you
 > can double VW's performance?  Where are you currently?  What do you mean by
 > "equivalent to C", unoptimized, -O1, -O4 -funroll-loops, gcc, Intel C
 > compiler?  What benchmarks have you focussed on?

Exupery is fairly stable and providing modest gains, but needs more
tuning before it performs as it should. Effort could now go into
either tuning the current engine so it performs as it should, or adding
full method inlining. Until recently, reliability was the main
show-stopper to actual use. There are still plenty of areas to optimise
to improve both compile time and run time.

Here's the current benchmarks:

 arithmaticLoopBenchmark 390 compiled 80 ratio: 4.875
 bytecodeBenchmark 724 compiled 250 ratio: 2.895
 sendBenchmark 663 compiled 385 ratio: 1.722
 doLoopsBenchmark 381 compiled 235 ratio: 1.621
 pointCreation 394 compiled 389 ratio: 1.013
 largeExplorers 269 compiled 210 ratio: 1.280
 compilerBenchmark 273 compiled 250 ratio: 1.092
 Cumulative Time 413.408 compiled 232.706 ratio 1.777

 ExuperyBenchmarks>>arithmeticLoop 103ms
 SmallInteger>>benchmark 341ms
 InstructionStream>>interpretExtension:in:for: 6069ms
 Average 597.360

largeExplorers and compilerBenchmark are both real code. They do vary
depending on what the profiler decides to profile. The rest are micro
benchmarks. The benchmark suite looks much worse since upgrading to
a Core 2, which is very efficient at running the interpreter.

Which package version and repository are you taking the benchmarks from? I'd like to run them in Cog. Would you be interested in running the computer language shootout benchmarks I'm using?


Optimisation so far has been about removing gross inefficiencies. I
avoid any optimisation that's likely to make other cases worse; e.g.
the last two optimisations were calling all primitives from native
code (rather than just the few that Exupery compiles) and adding
indirect literal addressing to access VM variables.

What do I mean by equivalent to C? Basically, being able to fully
optimise predictable code, and to optimise unpredictable code enough that
the remaining inefficiencies are hidden behind L2/L3 cache misses and
branch mispredicts. C was executing around 1 instruction per clock
measured from system code; that leaves some room to hide inefficiencies.

Why do I believe it's possible? With dynamic type information we can
optimise for the common case, and after inlining loops we can often
pull the type checks out of the loops. Then it's just a case of
competing optimiser against optimiser, but most of the gains for most code
are likely to come from a small set of optimisations.

I've also played around with a few thought experiments including
looking at what it would take to compile a dot product with zero
overhead inside the loops.

I'll ask again ;)  "What metrics lead you to believe you can double VW's performance?"

Bryce

(1) I really should move temps and arguments into registers
too. Currently only stack values are held in registers. Moving
more variables into registers only made sense after spilling
directly into the context and reducing interference in the
register allocator.

Cheers
Eliot

Re: Cog and Exupery

Igor Stasenko

2009/3/5 Eliot Miranda <[hidden email]>:

>
>
>
> On Wed, Mar 4, 2009 at 2:47 PM, <[hidden email]> wrote:
>>
>> Eliot Miranda writes:
>>  >  On Sun, Mar 1, 2009 at 1:46 PM, <[hidden email]> wrote:
>>  > I've thought about using a context stack several times over the years.
>>  > > The key benefits are faster returns
>>  >
>>  > and much faster sends
>>
>> Well sends with few arguments could be optimised without a context
>> stack. Pass the arguments in registers, getting a context from the
>> list is quick, and the PIC uses an unconditional jump rather than a
>> call.
>
> Getting anything off a list is not quick compared with a call.  taking things off a list involves a write.  Calls only write a return address to the stack (and then only on CISCs) which is well optimized by current processors.
> PICs don't eliminate calls.  A send is a call whether through a PIC or not.  One calls the PIC (if only to collect a return address) and the PIC jumps.  There's still always a call.
> Do some measurements on modern processors and see if there are significant differences between call and unconditional jump.  A call is an unconditional jump plus the pushing of a return address.  When I last looked at this carefully I found there was no significant difference in performance unless one was being deeply recursive.  i.e. attempting to use unconditional jumps to save performance, e.g. by inlining accessors, gained nothing because the processor was taking care to make calls and returns of the same cost as unconditional jumps (this on x86 in the 2004 timeframe).
>
>> It's even possible to avoid adjusting the size of the stack frame in
>> most cases by allocating a frame big enough for most contexts. Exupery
>> only uses the stack frame for spilling internal temporary registers so
>> it's use tends to be small. Stack variables spill directly into the
>> context. (1)
>
> I don't understand the implication here.  With a stack the frame size is essentially irrelevant.  With a context it certainly is not.  If all contexts are large then one eats through memory faster and pays the price.
>
>>  > >  > Aim higher :)  I'm hoping for 10x current Squeak performance for
>>  > >  > Smalltalk-intensive benchmarks some time later this year.
>>  > >
>>  > > My original and current aim is double VW's performance or be roughly
>>  > > equivalent to C.
>>  >
>>  >
>>  > And what's the state of this effort?  What metrics lead you to believe you
>>  > can double VW's performance?  Where are you currently?  What do you mean by
>>  > "equivalent to C", unoptimized, -O1, -O4 -funroll-loops, gcc, Intel C
>>  > compiler?  What benchmarks have you focussed on?
>>
>> Exupery is fairly stable and providing modest gains but needs more
>> tuning before it's performing as it should. Effort could now go into
>> either tuning the current engine so it performs as it should or adding
>> full method inlining. Until recently reliability was the main
>> show-stopper to actual use. There are still plenty of areas to optimise
>> to improve both compile time and run time.
>>
>> Here's the current benchmarks:
>>
>>  arithmaticLoopBenchmark 390 compiled 80 ratio: 4.875
>>  bytecodeBenchmark 724 compiled 250 ratio: 2.895
>>  sendBenchmark 663 compiled 385 ratio: 1.722
>>  doLoopsBenchmark 381 compiled 235 ratio: 1.621
>>  pointCreation 394 compiled 389 ratio: 1.013
>>  largeExplorers 269 compiled 210 ratio: 1.280
>>  compilerBenchmark 273 compiled 250 ratio: 1.092
>>  Cumulative Time 413.408 compiled 232.706 ratio 1.777
>>
>>  ExuperyBenchmarks>>arithmeticLoop 103ms
>>  SmallInteger>>benchmark 341ms
>>  InstructionStream>>interpretExtension:in:for: 6069ms
>>  Average 597.360
>>
>> largeExplorers and compilerBenchmark are both real code. They do vary
>> depending on what the profiler decides to profile. The rest are micro
>> benchmarks. The benchmark suite looks much worse since upgrading to
>> a Core 2, which is very efficient at running the interpreter.
>
> Which package version and repository are you taking the benchmarks from?  I'd like to run them in Cog.  Would you be interested in running the compiler language shootout benchmarks I'm using?
>
>> Optimisation so far has been about removing gross inefficiencies. I
>> avoid any optimisation that's likely to make other cases worse. e.g.
>> the last two optimisations were calling all primitives from native
>> code rather than just the few that Exupery compiles and adding
>> indirect literal addressing used to access VM variables.
>>
>> What do I mean by equivalent to C? Basically being able to fully
>> optimise predictable code and optimise unpredictable code enough that
>> the remaining inefficiencies are hidden behind L2/L3 cache misses and
>> branch mispredicts. C was executing around 1 instruction per clock
>> measured from system code, which leaves some room to hide inefficiencies.
>>
>> Why do I believe it's possible? With dynamic type information we can
>> optimise for the common case, and after inlining loops we can often
>> pull the type checks out of the loops. Then it's just a case of
>> competing optimiser to optimiser, but most of the gains for most code
>> are likely to come from a small set of optimisations.
>>
>> I've also played around with a few thought experiments including
>> looking at what it would take to compile a dot product with zero
>> overhead inside the loops.
>
> I'll ask again ;)  "What metrics lead you to believe you can double VW's performance?"

Eliot, even if you see it as unrealistic, I think this is a good aim to
try to achieve :)

>>
>> Bryce
>>
>> (1) I really should move temps and arguments into registers
>> too. Currently only stack values are held in registers. Moving
>> more variables into registers only made sense after spilling
>> directly into the context and reducing interference in the
>> register allocator.
>
> Cheers
> Eliot
>



--
Best regards,
Igor Stasenko AKA sig.
Reply | Threaded
Open this post in threaded view
|

Re: Cog and Exupery

Eliot Miranda-2
 


On Wed, Mar 4, 2009 at 4:29 PM, Igor Stasenko <[hidden email]> wrote:

2009/3/5 Eliot Miranda <[hidden email]>:
>
>
>
> On Wed, Mar 4, 2009 at 2:47 PM, <[hidden email]> wrote:
>>
>> Eliot Miranda writes:
>>  >  On Sun, Mar 1, 2009 at 1:46 PM, <[hidden email]> wrote:
>>  > I've thought about using a context stack several times over the years.
>>  > > The key benefits are faster returns
>>  >
>>  > and much faster sends
>>
>> Well sends with few arguments could be optimised without a context
>> stack. Pass the arguments in registers, getting a context from the
>> list is quick, and the PIC uses an unconditional jump rather than a
>> call.
>
> Getting anything off a list is not quick compared with a call.  Taking things off a list involves a write.  Calls only write a return address to the stack (and then only on CISCs), which is well optimized by current processors.
> PICs don't eliminate calls.  A send is a call whether through a PIC or not.  One calls the PIC (if only to collect a return address) and the PIC jumps.  There's still always a call.
> Do some measurements on modern processors and see if there are significant differences between call and unconditional jump.  A call is an unconditional jump plus the pushing of a return address.  When I last looked at this carefully I found there was no significant difference in performance unless one was being deeply recursive.  That is, attempting to use unconditional jumps to improve performance, e.g. by inlining accessors, gained nothing because the processor was taking care to make calls and returns of the same cost as unconditional jumps (this on x86 in the 2004 timeframe).
>
>> It's even possible to avoid adjusting the size of the stack frame in
>> most cases by allocating a frame big enough for most contexts. Exupery
>> only uses the stack frame for spilling internal temporary registers so
>> its use tends to be small. Stack variables spill directly into the
>> context. (1)
>
> I don't understand the implication here.  With a stack the frame size is essentially irrelevant.  With a context it certainly is not.  If all contexts are large then one eats through memory faster and pays the price.
>
>>  > >  > Aim higher :)  I'm hoping for 10x current Squeak performance for
>>  > >  > Smalltalk-intensive benchmarks some time later this year.
>>  > >
>>  > > My original and current aim is double VW's performance or be roughly
>>  > > equivalent to C.
>>  >
>>  >
>>  > And what's the state of this effort?  What metrics lead you to believe you
>>  > can double VW's performance?  Where are you currently?  What do you mean by
>>  > "equivalent to C", unoptimized, -O1, -O4 -funroll-loops, gcc, Intel C
>>  > compiler?  What benchmarks have you focussed on?
>>
>> Exupery is fairly stable and providing modest gains but needs more
>> tuning before it's performing as it should. Effort could now go into
>> either tuning the current engine so it performs as it should or adding
>> full method inlining. Until recently reliability was the main
>> show-stopper to actual use. There are still plenty of areas to optimise
>> to improve both compile time and run time.
>>
>> Here's the current benchmarks:
>>
>>  arithmaticLoopBenchmark 390 compiled 80 ratio: 4.875
>>  bytecodeBenchmark 724 compiled 250 ratio: 2.895
>>  sendBenchmark 663 compiled 385 ratio: 1.722
>>  doLoopsBenchmark 381 compiled 235 ratio: 1.621
>>  pointCreation 394 compiled 389 ratio: 1.013
>>  largeExplorers 269 compiled 210 ratio: 1.280
>>  compilerBenchmark 273 compiled 250 ratio: 1.092
>>  Cumulative Time 413.408 compiled 232.706 ratio 1.777
>>
>>  ExuperyBenchmarks>>arithmeticLoop 103ms
>>  SmallInteger>>benchmark 341ms
>>  InstructionStream>>interpretExtension:in:for: 6069ms
>>  Average 597.360
>>
>> largeExplorers and compilerBenchmark are both real code. They do vary
>> depending on what the profiler decides to profile. The rest are micro
>> benchmarks. The benchmark suite looks much worse since upgrading to
>> a Core 2, which is very efficient at running the interpreter.
>
> Which package version and repository are you taking the benchmarks from?  I'd like to run them in Cog.  Would you be interested in running the compiler language shootout benchmarks I'm using?
>
>> Optimisation so far has been about removing gross inefficiencies. I
>> avoid any optimisation that's likely to make other cases worse. e.g.
>> the last two optimisations were calling all primitives from native
>> code rather than just the few that Exupery compiles and adding
>> indirect literal addressing used to access VM variables.
>>
>> What do I mean by equivalent to C? Basically being able to fully
>> optimise predictable code and optimise unpredictable code enough that
>> the remaining inefficiencies are hidden behind L2/L3 cache misses and
>> branch mispredicts. C was executing around 1 instruction per clock
>> measured from system code, which leaves some room to hide inefficiencies.
>>
>> Why do I believe it's possible? With dynamic type information we can
>> optimise for the common case, and after inlining loops we can often
>> pull the type checks out of the loops. Then it's just a case of
>> competing optimiser to optimiser, but most of the gains for most code
>> are likely to come from a small set of optimisations.
>>
>> I've also played around with a few thought experiments including
>> looking at what it would take to compile a dot product with zero
>> overhead inside the loops.
>
> I'll ask again ;)  "What metrics lead you to believe you can double VW's performance?"

Eliot, even if you see it as unrealistic, I think this is a good aim to
try to achieve :)

I don't think it's unrealistic at all.  It's my aim too!  There are other Smalltalk VMs out there that are faster than VW.  I just think Bryce is making some strange decisions (not eliminating contexts early on) and I want to try and understand why he is going a different way from me.  I might be very wrong in my approach.  On the other hand, Bryce might be going the wrong way, and my questions might help.

In any case, doubling VW's performance is a great idea.  But what's the claim based on?  How is Bryce going to beat VW, and on what benchmarks?  If it's not based on any analysis it's an empty claim.  If it's based on real analysis then cool, and I'd like to understand the analysis.


>>
>> Bryce
>>
>> (1) I really should move temps and arguments into registers
>> too. Currently only stack values are held in registers. Moving
>> more variables into registers only made sense after spilling
>> directly into the context and reducing interference in the
>> register allocator.
>
> Cheers
> Eliot
>



--
Best regards,
Igor Stasenko AKA sig.

Reply | Threaded
Open this post in threaded view
|

Re: Cog and Exupery

Bryce Kampjes
In reply to this post by Eliot Miranda-2
 
Eliot Miranda writes:
 >  On Wed, Mar 4, 2009 at 2:47 PM, <[hidden email]> wrote:
 >
 > >
 > > Eliot Miranda writes:
 > >  >  On Sun, Mar 1, 2009 at 1:46 PM, <[hidden email]> wrote:
 > >  > I've thought about using a context stack several times over the years.
 > >  > > The key benefits are faster returns
 > >  >
 > >  > and much faster sends
 > >
 > > Well sends with few arguments could be optimised without a context
 > > stack. Pass the arguments in registers, getting a context from the
 > > list is quick, and the PIC uses an unconditional jump rather than a
 > > call.
 >
 >
 > Getting anything off a list is not quick compared with a call.  Taking
 > things off a list involves a write.  Calls only write a return address to
 > the stack (and then only on CISCs), which is well optimized by current
 > processors.
 >
 > PICs don't eliminate calls.  A send is a call whether through a PIC or not.
 >  One calls the PIC (if only to collect a return address) and the PIC jumps.
 >  There's still always a call.
 >
 > Do some measurements on modern processors and see if there are significant
 > differences between call and unconditional jump.  A call is an unconditional
 > jump plus the pushing of a return address.  When I last looked at this
 > carefully I found there was no significant difference in performance unless
 > one was being deeply recursive.  That is, attempting to use unconditional jumps
 > to improve performance, e.g. by inlining accessors, gained nothing because the
 > processor was taking care to make calls and returns of the same cost as
 > unconditional jumps (this on x86 in the 2004 timeframe).

No argument from me that unconditional jumps are faster than calls; I'd
guess they're the same, without needing to look at reference material or
do any timing.
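
For anyone who does want to measure it, a harness along these lines
would do. This is an illustrative sketch, not Exupery code; it assumes
a POSIX clock_gettime and GCC/Clang's noinline attribute, and it simply
compares a loop of call/returns against an empty loop:

#include <stdio.h>
#include <time.h>

/* A no-op callee; noinline keeps the call/return pair in place
   instead of letting the compiler inline it away. */
__attribute__((noinline)) static void callee(void)
{
    __asm__ volatile ("");
}

static double ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    enum { N = 100 * 1000 * 1000 };
    struct timespec t0, t1, t2;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < N; i++)
        callee();                  /* one call + return per iteration */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (i = 0; i < N; i++)
        __asm__ volatile ("");     /* loop overhead alone */
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("with calls: %.0fms, loop only: %.0fms\n",
           ms(t0, t1), ms(t1, t2));
    return 0;
}

If calls and returns really are as cheap as Eliot says, the two
timings should come out close.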

When I added adaptive inlining of primitives, it doubled Exupery's
performance on the bytecode benchmark by inlining #at: and #at:put:
rather than accessing them via a PIC.
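
Roughly, what that inlining buys looks like the following C sketch.
The names (Array, send_at, expected_class) and the object layout are
illustrative, not Exupery's actual code:

typedef struct {
    void *klass;    /* receiver's class, checked by the inline guard */
    long  size;     /* number of indexable slots */
    long  slots[];  /* the indexable fields (C99 flexible array) */
} Array;

/* The general path: a full send, dispatched through the PIC and,
   on a miss, the method lookup. */
extern long send_at(Array *receiver, long index);

/* The inlined fast path: a class check and a bounds check guard a
   direct load, so the common case involves no call at all. */
long inlined_at(Array *receiver, long index, void *expected_class)
{
    if (receiver->klass != expected_class
        || index < 1 || index > receiver->size)
        return send_at(receiver, index);  /* uncommon case: real send */
    return receiver->slots[index - 1];    /* Smalltalk indexing is 1-based */
}

The win is that the common path has neither the dispatch nor the
call/return, which is exactly what the bytecode benchmark rewards.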

The costs of a send are:
 1) copying arguments
 2) finding the method to call
 3) getting memory for a new context

Copying arguments can often be avoided by passing the arguments
in registers.

Finding the method to call is the same with or without a context
stack.

Getting the memory for a new context will take a handful of
instructions in both cases.

The worst case sequence to get an item from a linked list is:
  1 - load (freeList) temp1
  2 - compare temp1 0
  3 - jumpEqual listEmptyBlock
  4 - load (temp1 nextOffset) temp2
  5 - store temp2 (freeList)

For the stack case:
  1 - load (overflowBlock) temp1
  2 - add newSize stackPointer
  3 - compare temp1 stackPointer
  4 - jumpGreater stackOverflowBlock

OK, the stack case wins by one instruction. Both could be optimised
by keeping values in registers: in the linked-list case instructions 1
and 4 can be removed, in the stack case just one. A C sketch of the
two fast paths follows.
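
In C the two fast paths come out roughly as below. This is a sketch
under assumed names; free_list, stack_ptr, stack_limit and the two
slow-path helpers are illustrative, not Exupery's actual code:

typedef struct Context {
    struct Context *next;   /* free-list link */
    /* ... the context's other fields ... */
} Context;

extern Context *free_list;               /* head of the context free list */
extern char *stack_ptr, *stack_limit;    /* bump-allocated frame area */

extern Context *refill_free_list(void);  /* the listEmptyBlock case */
extern char *handle_overflow(long size); /* the stackOverflowBlock case */

/* Linked-list case: load the head, test it, load next, store the
   new head (the five instructions above). */
Context *alloc_context(void)
{
    Context *c = free_list;
    if (c == NULL)
        return refill_free_list();
    free_list = c->next;
    return c;
}

/* Stack case: bump the pointer and compare against the limit
   (the four instructions above). */
char *alloc_frame(long size)
{
    char *frame = stack_ptr;
    stack_ptr += size;
    if (stack_ptr > stack_limit)
        return handle_overflow(size);
    return frame;
}

Either way the hot path is only a handful of instructions; the
interesting difference is the extra store in the list case.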

Writes are quick so long as the CPU's write bandwidth hasn't been
exceeded. Write performance has improved recently with the write-back
caches added for multi-cores.

Using a context cache will give a large gain for returns.
There's no way I can see to avoid checking that the returnee is valid
when executing off contexts. With a stack, the thunk can handle that
along with underflow.

 > It's even possible to avoid adjusting the size of the stack frame in
 > > most cases by allocating a frame big enough for most contexts. Exupery
 > > only uses the stack frame for spilling internal temporary registers so
 > > its use tends to be small. Stack variables spill directly into the
 > > context. (1)
 >
 >
 > I don't understand the implication here.  With a stack the frame size is
 > essentially irrelevant.  With a context it certainly is not.  If all
 > contexts are large then one eats through memory faster and pays the price.

It removes the need for the two instructions that adjust the C stack pointer.

 > > Here's the current benchmarks:
 > >
 > >  arithmaticLoopBenchmark 390 compiled 80 ratio: 4.875
 > >  bytecodeBenchmark 724 compiled 250 ratio: 2.895
 > >  sendBenchmark 663 compiled 385 ratio: 1.722
 > >  doLoopsBenchmark 381 compiled 235 ratio: 1.621
 > >  pointCreation 394 compiled 389 ratio: 1.013
 > >  largeExplorers 269 compiled 210 ratio: 1.280
 > >  compilerBenchmark 273 compiled 250 ratio: 1.092
 > >  Cumulative Time 413.408 compiled 232.706 ratio 1.777
 > >
 > >  ExuperyBenchmarks>>arithmeticLoop 103ms
 > >  SmallInteger>>benchmark 341ms
 > >  InstructionStream>>interpretExtension:in:for: 6069ms
 > >  Average 597.360
 > >
 > > largeExplorers and compilerBenchmark are both real code. They do vary
 > > depending on what the profiler decides to profile. The rest are micro
 > > benchmarks. The benchmark suite looks much worse since upgrading to
 > > a Core 2, which is very efficient at running the interpreter.
 >
 >
 > Which package version and repository are you taking the benchmarks from?
 >  I'd like to run them in Cog.  Would you be interested in running the
 > compiler language shootout benchmarks I'm using?

I'd be interested in adding your benchmarks to my suite. Given I've
pretty much decided to implement adaptive method inlining next, I'm
unlikely to do much tuning based on them.

My benchmarks are in the Exupery package on SqueakSource here:

  http://www.squeaksource.com/Exupery.html

To run them use:

  ExuperyBenchmarks new run


 > > I've also played around with a few thought experiments including
 > > looking at what it would take to compile a dot product with zero
 > > overhead inside the loops.
 >
 >
 > I'll ask again ;)  "What metrics lead you to believe you can double VW's
 > performance?"

The best detailed analysis is in the appendix here:

   http://www.kampjes.demon.co.uk/articles/exuperyDesign.pdf

but it's a little dated now.
   
The following article is a bit more up to date and describes why
Exupery's high level design is the way it is.

    http://www.kampjes.demon.co.uk/articles/exuperyDirection.pdf

Bryce
Reply | Threaded
Open this post in threaded view
|

Re: Cog and Exupery

Bryce Kampjes
In reply to this post by Eliot Miranda-2
 
Eliot Miranda writes:
 >  On Wed, Mar 4, 2009 at 4:29 PM, Igor Stasenko <[hidden email]> wrote:
 >
 > I don't think it's unrealistic at all.  It's my aim too!  There are other
 > Smalltalk VMs out there that are faster than VW.  I just think Bryce is
 > making some strange decisions (not eliminating contexts early on) and I want
 > to try and understand why he is going a different way from me.  I might be
 > very wrong in my approach.  On the other hand, Bryce might be going the
 > wrong way, and my questions might help.

What are you asking here?
 1) Why I haven't implemented any serious context optimisation yet?
 2) Why I haven't implemented a context cache/stack?

The answer to 1) is that debugging a compiler and adaptive inlining at
the same time seemed needlessly difficult, so I've been working
on getting the compiler ready and reliable first. Up until now
the goal has been to get everything needed to benefit from adaptive
inlining working, except adaptive inlining itself.

The answer to 2) is that adaptive inlining is the main send optimisation
planned. It's needed to expose entire loops to allow further
optimisation. Once adaptive optimisation is in place and working,
other major send optimisations may be considered based on real
measurements.

Bryce
