Cog status & FFI directions [was rearchitecting the FFI implementation for reentrancy]

Eliot Miranda-2
 


2009/8/6 Göran Krampe <[hidden email]>

> Hi Eliot and all!
>
> Eliot Miranda wrote:
>> Hi All,
>>    I'm looking at making the Squeak FFI reentrant to support nested calls
>> and possibly threading.  The current FFI has a couple of issues which render
>> it non-reentrant.
>
> The tech stuff is over my head, but I do have three questions related to this:
>
> 1. What about Alien? Shouldn't we try to move towards Alien instead of current FFI? Or is that too much work at this point?

I intend to merge Alien into the current FFI to allow the current FFI to marshal Aliens.  Aliens are fine for modelling external data but the Alien FFI call-out mechanism is a little too naive for general use.  It works well on x86 but has issues on anything with an exotic calling convention (one that passes arguments in integer and/or floating-point registers).  And see the next point about callbacks.


> 2. Callbacks have been a sore point in Squeak for a long time. AFAIK there is a patch available on www.squeakgtk.org/wiki/download.html, not sure what it does or if it is the original patch from Andreas when wxSqueak was being built. wxSqueak had a patched VM I recall. Perhaps that stuff is not related.

One thing that IMO is much better about Alien is the callback mechanism, which effectively allows one to pass blocks where C expects function pointers.  The current FFI's callback mechanism is weak.  It simply does a process switch away from the process calling out and requires further work in the image, e.g. a process waiting on a semaphore that is signalled by external code, to then collect the information for performing the callback.  So adding in the Alien callback mechanism is also something I intend to do.
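As an illustration only, here is a minimal C sketch of what a callback looks like from the external library's side: the library calls through a function pointer and expects the callback to run synchronously, nested inside the call-out. The names `visit_fn`, `for_each_int`, `sum_visitor` and `sum_with_callback` are made up for this example and are not part of Alien or the Squeak FFI. With an Alien-style mechanism, the pointer passed as `visit` would presumably be a thunk that runs a Smalltalk block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: the "library" calls this once per element. */
typedef void (*visit_fn)(int element, void *context);

/* Hypothetical library routine that calls back into its caller.
   The callback runs synchronously, nested inside the call-out. */
void for_each_int(const int *items, size_t count,
                  visit_fn visit, void *context)
{
    assert(visit != NULL);
    for (size_t i = 0; i < count; i++)
        visit(items[i], context);
}

/* Example callback: accumulates a running sum in the caller's context. */
void sum_visitor(int element, void *context)
{
    *(long *)context += element;
}

/* One complete call-out that hands the library a callback. */
long sum_with_callback(const int *items, size_t count)
{
    long total = 0;
    for_each_int(items, count, sum_visitor, &total);
    return total;
}
```

Because the callback returns into the middle of `for_each_int`, the caller's state has to survive a nested re-entry, which is why reentrancy and callbacks are tied together in this discussion.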

 
> 3. Could we possibly ask for a status update on Cog and related activities? We are itching for news! :) Also curious about your interest in Factor and its lower bowels (definitely cool stuff going on there).

The status is as follows.
The Cog stack VM is being reviewed for release to the community.  We hope to have this done soon, certainly before the end of September, but we're busy and this isn't on the critical path.  Once it is released there will have to be some integration and merge activities before it is part of the standard VMs because we have effectively forked (although not a lot).

The first incarnation of the Cog JIT is complete (for x86 only) and in use at Qwaq.  We are gearing up for a new server release and the Cog VM is the VM beneath it.  The next client release will include it also.  This VM has a naive code generator (every push or pop in the bytecode results in a push or pop in machine code) but good inline caching.  Performance is as high as 5x the current interpreter for certain computer-language-shootout benchmarks.  The naive code generator means there is poor loop performance (1 to: n do: ... style code can be 4 times slower than VisualWorks) and the object model means there is no machine code instance creation and no machine code at:put: primitive.  But send performance is good and block activation is almost as fast as VisualWorks.  In our real-world experience we were last week able to run almost three times as many Qwaq Forums clients against a QF server running on the Cog VM as we could against one running on the interpreter.  So the Cog JIT is providing significant speedups in real-world use.

I am (clearly) looking at FFI issues right now.  In the Autumn I intend to start work on a less naive code generator, a better object model and a faster garbage collector, the three of which should raise performance levels to VisualWorks levels, i.e. a further 2x to 3x increase over the 4x - 5x already achieved for pure Smalltalk execution.

I expect we'll be in a position to release some version of the Cog JIT to the community by Christmas.

I'll be giving a guided tour of the current Cog JIT VM at SqueakFest LA on Monday.



> regards, Göran

Best
Eliot 


Re: Cog status & FFI directions [was rearchitecting the FFI implementation for reentrancy]

Igor Stasenko

2009/8/6 Eliot Miranda <[hidden email]>:

> I am (clearly) looking at FFI issues right now.  In the Autumn I intend to start work on a less naive code generator, a better object model and a faster garbage collector, the three of which should raise performance levels to VisualWorks levels, i.e. a further 2x to 3x increase over the 4x - 5x already achieved for pure Smalltalk execution.

Yes, an FFI is heavily used in Croquet (and Qwaq Forums, I suppose) to
render graphics using OpenGL. So it is critical for high performance.
Btw, do you plan to use the JIT to generate the call-out code?




--
Best regards,
Igor Stasenko AKA sig.

Re: Cog status & FFI directions [was rearchitecting the FFI implementation for reentrancy]

Eliot Miranda-2
 


On Thu, Aug 6, 2009 at 10:29 AM, Igor Stasenko <[hidden email]> wrote:


> Yes, an FFI is heavily used in Croquet (and Qwaq Forums, I suppose) to
> render graphics using OpenGL. So it is critical for high performance.
> Btw, do you plan to use the JIT to generate the call-out code?

Eventually yes.  IMO this is the best way to get a correct and portable FFI.  ABIs like x86-64 SysV are too complicated to interpret efficiently and very complicated to implement in a low-level language.  I think the right architecture is one where the FFI compiler is written in Smalltalk and lives in the image.  When the image starts up on a different platform all the FFI call-out methods have their generated code flushed.  The first time an FFI method is invoked the invocation will fail because there is no generated code.  E.g. one writes call-outs thus:

ffiPrintString: aString
        <cdecl: char* 'ffiPrintString' (char *) error: errorCode>
        ^self externalCallFailedWith: errorCode

A call failing due to no code will return e.g. #'need to compile code' or perhaps simply #'not yet linked'.  externalCallFailedWith: then invokes the ABI compiler to compile the FFI spec to some sort of abstract register transfer language, looks up the function name, stores the info in the ExternalFunction which, as it is now, is the method's first literal, and retries the invocation.  The JIT then translates the RTL into actual machine code and executes it.
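The fail, link and retry cycle described above can be sketched in C. This is only an illustration under stated assumptions: `ExternalFunction` here is a plain struct standing in for the Smalltalk object, `lookup_symbol`, `primitive_call`, `call_out`, `flush_linkage` and the `FFI_*` codes are hypothetical names, the hard-coded lookup stands in for real module searching, and the compile-to-RTL and JIT steps are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*c_function)(const char *);

/* Hypothetical stand-in for the ExternalFunction literal: a name plus a
   cached address that stays NULL until the function has been linked. */
typedef struct {
    const char *name;
    c_function address;
} ExternalFunction;

enum { FFI_OK = 0, FFI_NOT_YET_LINKED = 1, FFI_NOT_FOUND = 2 };

/* A sample "external" function; it just returns its argument's length. */
int sample_print_string(const char *s) { return (int)strlen(s); }

/* Hypothetical symbol lookup; a real system would search loaded modules. */
c_function lookup_symbol(const char *name)
{
    if (strcmp(name, "ffiPrintString") == 0)
        return sample_print_string;
    return NULL;
}

/* The primitive: fails with FFI_NOT_YET_LINKED when nothing is cached. */
int primitive_call(ExternalFunction *fn, const char *arg, int *result)
{
    assert(fn != NULL && result != NULL);
    if (fn->address == NULL)
        return FFI_NOT_YET_LINKED;
    *result = fn->address(arg);
    return FFI_OK;
}

/* The failure path: link on the first failure, then retry the call once. */
int call_out(ExternalFunction *fn, const char *arg, int *result)
{
    int error = primitive_call(fn, arg, result);
    if (error == FFI_NOT_YET_LINKED) {
        fn->address = lookup_symbol(fn->name);
        if (fn->address == NULL)
            return FFI_NOT_FOUND;
        error = primitive_call(fn, arg, result);
    }
    return error;
}

/* Starting up on a different platform just clears the cached linkage. */
void flush_linkage(ExternalFunction *fn) { fn->address = NULL; }
```

The point of the design is that the fast path stays a single cached-address check, while everything slow or platform-dependent happens once, on the failure path.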

One may need an additional layer of Smalltalk code that exists to coerce arguments, raising errors for arguments that can't be coerced.  E.g.

ffiPrintString: aString
        ^self ffiPrintStringInner: aString asNullTerminatedCString

ffiPrintStringInner: aString
        <cdecl: char* 'ffiPrintString' (char *) error: errorCode>
        ^self externalCallFailedWith: errorCode
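To show what such a coercion layer does, here is a rough C analogue of the outer/inner split. It is a sketch, not the FFI's actual code: `as_null_terminated_c_string`, `print_string_inner` and `print_string_outer` are hypothetical names, and the "cannot be coerced" rule (an embedded NUL byte) is an assumption chosen for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical coercion: copy a counted byte buffer (no terminator) into a
   freshly allocated NUL-terminated C string.  Returns NULL when the bytes
   contain an embedded NUL, which a C string cannot represent, or when
   allocation fails.  The caller frees the result. */
char *as_null_terminated_c_string(const char *bytes, size_t length)
{
    assert(bytes != NULL || length == 0);
    if (length > 0 && memchr(bytes, '\0', length) != NULL)
        return NULL;
    char *copy = malloc(length + 1);
    if (copy == NULL)
        return NULL;
    if (length > 0)
        memcpy(copy, bytes, length);
    copy[length] = '\0';
    return copy;
}

/* Stand-in for the inner call-out, which only accepts a real C string. */
size_t print_string_inner(const char *c_string) { return strlen(c_string); }

/* Outer layer: coerce first, and report an error instead of calling out
   when the argument cannot be coerced.  Returns 0 on success, -1 on error. */
int print_string_outer(const char *bytes, size_t length, size_t *printed)
{
    char *c_string = as_null_terminated_c_string(bytes, length);
    if (c_string == NULL)
        return -1;
    *printed = print_string_inner(c_string);
    free(c_string);
    return 0;
}
```

The inner function never sees an invalid argument, so it can stay as simple as the generated call-out code would be.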

This kind of approach can move much of the complexity up into Smalltalk where it can be mastered, and the system extended on the fly, leaving the lower-level VM the simpler task of generating platform-specifics.  In particular, lifting the dll/module searching machinery up into the image is a good idea.

I also like the following idea for accessing platform-specific constants.  I implemented a prototype of this for VisualWorks but it hasn't been deployed yet.

"For example, we want to move the socket layer out of the VM almost entirely.  To do this the VI must be able to reference the correct values for defines such as O_NONBLOCK which have an annoying habit of having different values on different unix variants.  One way to do this is to have the VI spit out a C file containing a table of all the constants it needs, name to value.  This gets compiled into a dll on each platform and loaded to retrieve the relevant values.  When developing the VI will need to get hold of new values, and so new versions of the dll will need to get spat out, compiled and reloaded.  Again providing recompilation as a service would enable users to deploy across platforms for which they have no C compiler."

My VW prototype generated a C file from a shared pool.  For example, here's a snippet of an autogenerated socketconstants.c:

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
typedef struct {
        char * name;
        int value, flags;
} constantTable;
constantTable constants[] = {
{"AF_DECnet", (int)
#ifdef AF_DECnet
AF_DECnet, 1},
#else
0, 0},
#endif
{"AF_FILE", (int)
#ifdef AF_FILE
AF_FILE, 1},
#else
0, 0},
#endif

where flags indicated the size of the field amongst other things.
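For readers who want to try the idea, here is a small self-contained variant of such a table together with a lookup routine. It is a sketch under assumptions rather than the prototype's code: the choice of `AF_INET` and `O_NONBLOCK`, the deliberately undefined `AF_NO_SUCH_FAMILY`, the NULL sentinel entry and the `lookup_constant` function are all mine, and `flags` is simplified to mean only "defined on this platform" (1) or "absent" (0). It assumes a POSIX system.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

typedef struct {
    const char *name;
    int value, flags;   /* flags: 1 = defined on this platform, 0 = absent */
} constantTable;

constantTable constants[] = {
{"AF_INET", (int)
#ifdef AF_INET
AF_INET, 1},
#else
0, 0},
#endif
{"O_NONBLOCK", (int)
#ifdef O_NONBLOCK
O_NONBLOCK, 1},
#else
0, 0},
#endif
{"AF_NO_SUCH_FAMILY", (int)     /* deliberately undefined, for illustration */
#ifdef AF_NO_SUCH_FAMILY
AF_NO_SUCH_FAMILY, 1},
#else
0, 0},
#endif
{NULL, 0, 0}                    /* sentinel marking the end of the table */
};

/* Hypothetical lookup the image might perform after loading the dll.
   Returns 1 and stores the value if the constant is defined here, else 0. */
int lookup_constant(const char *name, int *value)
{
    assert(name != NULL && value != NULL);
    for (const constantTable *entry = constants; entry->name != NULL; entry++) {
        if (strcmp(entry->name, name) == 0 && entry->flags != 0) {
            *value = entry->value;
            return 1;
        }
    }
    return 0;
}
```

Since every value is resolved by the platform's own C compiler and headers, the image never has to hard-code a number that differs between unix variants.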

These files get compiled to dlls which can be loaded and inspected by the image.  To do a portable distribution of an application one needs to deploy a dll for each platform.  But one only needs to compile it when the set of constants changes.  The image therefore only loads the constants dll when it finds it is starting up on a different operating system from the one it was saved on.  So a shrink-wrap application for a specific platform need not be deployed with the constants dll.

My VW prototype automagically generated and compiled the dll when constants were added to the shared pool.  The C compiler was invoked automatically using VW's equivalent of OSProcess.  One could imagine providing compilation-as-a-service or a central library of these constant dlls so that developers and application deployers didn't need to have the C compiler for all platforms upon which they wish to deploy.





Re: Cog status & FFI directions [was rearchitecting the FFI implementation for reentrancy]

Andreas.Raab
In reply to this post by Eliot Miranda-2
 
Eliot Miranda wrote:

> The first incarnation of the Cog JIT is complete (for x86 only) and in
> use at Qwaq.  We are gearing up for a new server release and the Cog VM
> is the VM beneath it.  The next client release will include it also.
> This VM has a naive code generator (every push or pop in the bytecode
> results in a push or pop in machine code) but good inline caching.
> Performance is as high as 5x the current interpreter for certain
> computer-language-shootout benchmarks.  The naive code generator means
> there is poor loop performance (1 to: n do: ... style code can be 4
> times slower than VisualWorks) and the object model means there is no
> machine code instance creation and no machine code at:put: primitive.
> But send performance is good and block activation almost as fast as
> VisualWorks.  In our real-world experience we were last week able to run
> almost three times as many Qwaq Forums clients against a QF server
> running on the Cog VM than we were able to on the interpreter.  So
> the Cog JIT is providing significant speedups in real-world use.

Indeed. Here are some numbers that I took earlier this year:

VM version             bc/sec   sends/sec   Macro1   Macro2   Macro5    Total
Closure (3.11.2)  198,295,894   5,801,773   3124ms  79333ms   9935ms  92411ms
Stack (2.0.10)    178,521,617   8,141,165   2136ms  43081ms   6874ms  52117ms
Cog (current)     199,221,789  17,509,420    982ms  29392ms   4053ms  34445ms
Stack vs. Closure        0.9         1.4      1.46     1.84     1.45     1.77
Cog vs. Stack            1.12        2.16     2.17     1.46     1.69     1.51
Cog vs. Closure          1.0         3.0      3.18     2.7      2.45     2.68

As a total improvement in performance Cog ranks at approx. 2.7x faster
in macro benchmarks than what we started from. That's a pretty decent
bit of speedup for real-world applications.
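
For anyone wanting to re-derive the ratio rows, the arithmetic is simply baseline time divided by new time. A minimal C sketch follows, using the timings from the table above; the `MacroTimes` struct and the `speedup` function are names invented for this example.

```c
#include <assert.h>

/* Macro benchmark timings in milliseconds, copied from the table above. */
typedef struct {
    const char *vm;
    double macro1_ms, macro2_ms, macro5_ms, total_ms;
} MacroTimes;

const MacroTimes closure_vm = {"Closure (3.11.2)", 3124, 79333, 9935, 92411};
const MacroTimes stack_vm   = {"Stack (2.0.10)",   2136, 43081, 6874, 52117};
const MacroTimes cog_vm     = {"Cog (current)",     982, 29392, 4053, 34445};

/* Speedup of a faster run over a slower one: baseline time / new time. */
double speedup(double slow_ms, double fast_ms)
{
    assert(fast_ms > 0.0);
    return slow_ms / fast_ms;
}
```

For example, `speedup(closure_vm.total_ms, cog_vm.total_ms)` is 92411 / 34445, which is where the 2.68 in the last row comes from.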

Compare this (for example) with j3 [1] which despite a speedup of 6x in
microbenchmarks only provided a 2x speedup in the macros.

[1] http://aspn.activestate.com/ASPN/Mail/Message/squeak-list/2369033:

"Of course, that was 2001. Revisiting the benchmarks is kind of
interesting...

Interp:     '43805612 bytecodes/sec; 1325959 sends/sec'
J3:         '135665076 bytecodes/sec; 8100691 sends/sec'

Today: (PowerBookG4 1.5GHz), interp:

             '114387846 bytecodes/sec; 5152891 sends/sec'

But the microBenchmarks don't tell the whole story: Even with a speedup
of factor 6 in sends, we only saw the performance doubled on real world
benchmarks (e.g. the MacroBenchmarks)."


Cheers,
   - Andreas

Re: Cog status & FFI directions [was rearchitecting the FFI implementation for reentrancy]

Igor Stasenko

2009/8/7 Andreas Raab <[hidden email]>:

> Indeed. Here are some numbers that I took earlier this year:
>
> VM version           bc/sec  sends/sec  Macro1  Macro2  Macro5    Total
> Closure(3.11.2) 198,295,894  5,801,773  3124ms  79333ms 9935ms  92411ms
> Stack (2.0.10)  178,521,617  8,141,165  2136ms  43081ms 6874ms  52117ms

It has always confused me how it is possible to have a higher send
rate and a lower bytecode execution rate at the same time.
The way tinyBenchmarks calculates these figures is tricky.




--
Best regards,
Igor Stasenko AKA sig.