pharo closures


pharo closures

Toon Verwaest-2
I've just been working the whole day on the OPAL decompiler, getting it
almost to run (already works for various methods). Now ...

could someone please explain to me why the current implementation of
closures isn't a terribly bad idea?

I can see why it would pay off for Lisp programmers to have closures
that run like the Pharo closures, since it has O(1) access performance.
However, this performance boost only starts paying off once you have at
least more than 4 levels of nested closures, something which, unlike in
LISP, almost never happens in Pharo. Or at least shouldn't happen (if it
does, it's probably ok to punish the people by giving them slower
performance).

This implementation is pretty hard to understand, and it makes
decompilation semi-impossible unless you make very strong assumptions
about how the bytecodes are used. This then again reduces the
reusability of the new bytecodes and probably of the decompiler once
people start actually using the pushNewArray: bytecodes.

You might save a teeny tiny bit of memory by having stuff garbage
collected when it's not needed anymore ... but I doubt that the whole
design is based on that? Especially since it just penalizes the
performance in almost all possible ways for standard methods. And it
even wastes memory in general cases. I don't get it.

But probably I'm missing something?

cheers,
Toon


Re: pharo closures

Eliot Miranda-2
Hi Toon,

On Thu, Mar 24, 2011 at 4:46 PM, Toon Verwaest <[hidden email]> wrote:
I've just been working the whole day on the OPAL decompiler, getting it almost to run (already works for various methods). Now ...

could someone please explain to me why the current implementation of closures isn't a terribly bad idea?

No I can't.  Since I did it, I naturally think it's a good idea.  Perhaps, instead of denigrating it without substantiating your claims you could propose (and then implement, and then get adopted) a better idea?



I can see why it would pay off for Lisp programmers to have closures that run like the Pharo closures, since it has O(1) access performance. However, this performance boost only starts paying off once you have at least more than 4 levels of nested closures, something which, unlike in LISP, almost never happens in Pharo. Or at least shouldn't happen (if it does, it's probably ok to punish the people by giving them slower performance).

Slower performance than what?  BTW, I think you have things backwards.  I modelled the pharo closure implementation on lisp closures, not the other way around.


This implementation is pretty hard to understand, and it makes decompilation semi-impossible unless you make very strong assumptions about how the bytecodes are used. This then again reduces the reusability of the new bytecodes and probably of the decompiler once people start actually using the pushNewArray: bytecodes.

Um, the decompiler works, and in fact works better now than it did a couple of years ago.  So how does your claim stand up?
 

You might save a teeny tiny bit of memory by having stuff garbage collected when it's not needed anymore ... but I doubt that the whole design is based on that? Especially since it just penalizes the performance in almost all possible ways for standard methods. And it even wastes memory in general cases. I don't get it.

What has garbage collection got to do with anything?  What precisely are you talking about?  Indirection vectors?  To understand the rationale for indirection vectors you have to understand the rationale for implementing closures on a conventional machine stack.  For lisp that's clear; compile to a conventional stack as that's an easy model, in which case one has to store values that outlive LIFO discipline on the heap, hence indirection vectors.  Why you might want to do that in a Smalltalk implementation when you could just access the outer context directly has a lot to do with VM internals.  Basically its the same argument.  If one can map Smalltalk execution to a conventional stack organization then the JIT can produce a more efficient execution engine. Not doing this causes significant problems in context management.
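To make that concrete, here is a rough source-level picture of what the indirection-vector transformation amounts to (only a sketch; the real transformation happens at the bytecode level and the code the compiler actually emits may differ in detail):

    counter
        "a temp that is written after being closed over cannot stay in the frame"
        | count |
        count := 0.
        ^[ count := count + 1 ]

    "is compiled roughly as if one had written"

    counter
        | vector |
        vector := Array new: 1.        "built by the pushNewArray: bytecode"
        vector at: 1 put: 0.
        ^[ vector at: 1 put: (vector at: 1) + 1 ]

The closure then only needs a copied reference to the vector, so nothing that has to outlive LIFO discipline is left in the frame itself.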

The explanation is all on my blog and in my Context Management in VisualWorks 5i paper.

But does a bright guy like yourself find this /really/ hard to understand?  It's not that hard a transformation, and compared to what goes on in the JIT (e.g. in bytecode to machine-code pc mapping) it's pretty trivial.


But probably I'm missing something?

It's me who's missing something.  I did the simplest thing I knew could possibly work re getting an efficient JIT and a Squeak with closures (there's huge similarity between the above scheme and the changes I made to VisualWorks that resulted in a VM that was 2 to 3 times faster depending on platform than VW 1.0).  But you can see a far more efficient and simple scheme.  What is it?

best,
Eliot




Re: pharo closures

Toon Verwaest-2

No I can't.  Since I did it, I naturally think it's a good idea.  Perhaps, instead of denigrating it without substantiating your claims you could propose (and then implement, and then get adopted) a better idea?
Sure. My own VM will take a lot longer to get done! ;) I don't want to blemish any of your credit for building a cool VM. I was rather just wondering why you decided to go for this particular implementation which seems unobvious to me. Hence the question. I guess I should've formulated it slightly differently :) More info below.
I can see why it would pay off for Lisp programmers to have closures that run like the Pharo closures, since it has O(1) access performance. However, this performance boost only starts paying off once you have at least more than 4 levels of nested closures, something which, unlike in LISP, almost never happens in Pharo. Or at least shouldn't happen (if it does, it's probably ok to punish the people by giving them slower performance).

Slower performance than what?  BTW, I think you have things backwards.  I modelled the pharo closure implementation on lisp closures, not the other way around.
This is exactly what I meant. The closures seem like a very good idea for languages with very deeply nested closures. Lisp is such a language with all the macros ... I don't really see this being so in Pharo.
This implementation is pretty hard to understand, and it makes decompilation semi-impossible unless you make very strong assumptions about how the bytecodes are used. This then again reduces the reusability of the new bytecodes and probably of the decompiler once people start actually using the pushNewArray: bytecodes.

Um, the decompiler works, and in fact works better now than it did a couple of years ago.  So how does your claim stand up?
For example when I just use the InstructionClient I get in pushNewArray: and then later popIntoTemp. This combination is supposed to make clear that you are storing a remote array. This is not what the bytecode says however. And this bytecode can easily be reused for something else; what if I use the bytecode to make my own arrays? What if this array is created in a different way? I can think of a lot of ways the temparray could come to be using lots of variations of bytecodes, from which I would never (...) be able to figure out that it's actually making the tempvector. Somehow I just feel there's a bigger disconnect between the bytecodes and the Smalltalk code and I'm unsure if this isn't harmful.
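To illustrate the kind of pattern matching I mean, compiling even a trivial method with a shared temp produces it (the method and its name are just an illustration, and the exact names in the printout depend on the image, so take this as a sketch):

    Object compile: 'tempVectorExample
        | shared |
        shared := 0.
        ^[ shared := shared + 1 ]'.
    (Object >> #tempVectorExample) symbolic
        "the listing shows a pushNewArray: followed by a popIntoTemp:, and the
         accesses to shared inside the block go through the remote-temp bytecodes"

Nothing in those individual instructions says "this array is a temp vector"; I only know it because of how the compiler happens to emit them.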

But ok, I am working on the Opal decompiler of course. Are you building an IR out with your decompiler? If so I'd like to have a look since I'm spending the whole day already trying to get the Opal compiler to somehow do what I want... getting one that works and builds a reusable IR would be useful. (I'm implementing your field-index-updating through bytecode transformation btw).
You might save a teeny tiny bit of memory by having stuff garbage collected when it's not needed anymore ... but I doubt that the whole design is based on that? Especially since it just penalizes the performance in almost all possible ways for standard methods. And it even wastes memory in general cases. I don't get it.

What has garbage collection got to do with anything?  What precisely are you talking about?  Indirection vectors?  To understand the rationale for indirection vectors you have to understand the rationale for implementing closures on a conventional machine stack.  For lisp that's clear; compile to a conventional stack as that's an easy model, in which case one has to store values that outlive LIFO discipline on the heap, hence indirection vectors.  Why you might want to do that in a Smalltalk implementation when you could just access the outer context directly has a lot to do with VM internals.  Basically its the same argument.  If one can map Smalltalk execution to a conventional stack organization then the JIT can produce a more efficient execution engine. Not doing this causes significant problems in context management.
With the garbage collection I meant the fact that you can already collect part of the stack frames and leave other parts (the remote temps) and only get them GCd later on when possible.

I do understand why you want to keep them on the stack as long as possible. The stack-frame marriage stuff for optimizations is very neat indeed. What I'm more worried about myself is the fact that stackframes aren't just linked to each other and share memory that way. This means that you only have 1 indirection to access the method-frame (via the homeContext), and 1 for the outer context. You can directly access yourself. So only the 4th context will have 2 indirections (what all contexts have now for remotes). From the 5th on it gets worse... but I can't really see this happening in real world situations.

Then you have the problem that since you don't just link the frames and don't look up values via the frames, you have to copy over part of your frame for activation. This isn't necessarily -that- slow (although it is an overhead); but it's slightly clumsy and uses more memory. And that's where my problem lies I guess ... There's such a straightforward implementation possible, by just linking up stackframes (well... they are already linked up anyway), and traversing them. You'll have to do some rewriting whenever you leave a context that's still needed, but you do that anyway for the remote temps right?
The explanation is all on my blog and in my Context Management in VisualWorks 5i paper.

But does a bright guy like yourself find this /really/ hard to understand?  It's not that hard a transformation, and compared to what goes on in the JIT (e.g. in bytecode to machine-code pc mapping) it's pretty trivial.
I guess I just like to really see what's going on by having a decent model around. When I look at the bytecodes; in the end I can reconstruct what it's doing ... as long as they are aligned in the way that the compiler currently generates them. But I can easily see how slight permutations would already throw me off completely.


But probably I'm missing something?

It's me who's missing something.  I did the simplest thing I knew could possibly work re getting an efficient JIT and a Squeak with closures (there's huge similarity between the above scheme and the changes I made to VisualWorks that resulted in a VM that was 2 to 3 times faster depending on platform than VW 1.0).  But you can see a far more efficient and simple scheme.  What is it?
Basically my scheme isn't necessarily far more efficient. It's just more understandable I think. I can understand scopes that point to their outer scope; and I can follow these scopes to see how the lookup works. And the fact that it does some pointer dereferencing and copying of data less is just something that makes me think it wouldn't be less efficient than what you have now. My problem is not that your implementation is slow, rather that it's complex. And I don't really see why this complexity is needed.

Obviously playing on my ego by telling me I should be clever enough to understand it makes me say I do! But I feel it's not the easiest; and probably less people understand this than the general model of just linking contexts together.

cheers,
Toon

Re: pharo closures

Igor Stasenko
I can't say that I clearly understood your concept. But if it will
simplify the implementation without a noticeable speed loss, I am all ears :)


--
Best regards,
Igor Stasenko AKA sig.


Re: pharo closures

Toon Verwaest-2

> I can't say that I clearly understood your concept. But if it will
> simplify the implementation without a noticeable speed loss, I am all ears :)
>

test
     |b|
     [ |a|
       a + b ]

Suppose you can't compile anything away, then you get

|==============
|MethodContext
|
|b := ...
|==============
      ^
      |
|==============
|BlockContext
|
|a := ...
|==============

And you just look up starting at the current context and go up. Except
if the var is from the homeContext, then you directly follow the
home-context pointer.
Since all contexts link to the home-context, this makes it 1 pointer
indirection to get to the method's context, and 1 for the parent context. So
that makes only 2 indirections starting from the 3rd nested block (that is,
when you have [ ... [ ... [ ... ] ... ] ... ] where all of them are required
for storing captured data). ifTrue:ifFalse: etc. blocks obviously don't
count. And blocks without shared locals could be left out (although we
might not do that, for debugging reasons).
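In pseudo-Smalltalk the lookup I have in mind is nothing more than this (a
sketch of the proposal, not code that exists anywhere; the selector and the
#outerContext accessor are made up for illustration):

    ContextPart >> valueOfTemp: index atLexicalLevel: level
        "level 0 = this frame, level 1 = the lexically enclosing frame, and so on;
         temps of the home method would be reached directly through the home
         pointer instead of walking"
        | scope |
        scope := self.
        level timesRepeat: [scope := scope outerContext].
        ^scope tempAt: index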

Hope that helps.

cheers,
Toon


Re: pharo closures

Eliot Miranda-2
In reply to this post by Toon Verwaest-2


On Thu, Mar 24, 2011 at 5:44 PM, Toon Verwaest <[hidden email]> wrote:

No I can't.  Since I did it, I naturally think it's a good idea.  Perhaps, instead of denigrating it without substantiating your claims you could propose (and then implement, and then get adopted) a better idea?
Sure. My own VM will take a lot longer to get done! ;) I don't want to blemish any of your credit for building a cool VM. I was rather just wondering why you decided to go for this particular implementation which seems unobvious to me. Hence the question. I guess I should've formulated it slightly differently :) More info below.

I can see why it would pay off for Lisp programmers to have closures that run like the Pharo closures, since it has O(1) access performance. However, this performance boost only starts paying off once you have at least more than 4 levels of nested closures, something which, unlike in LISP, almost never happens in Pharo. Or at least shouldn't happen (if it does, it's probably ok to punish the people by giving them slower performance).

Slower performance than what?  BTW, I think you have things backwards.  I modelled the pharo closure implementation on lisp closures, not the other way around.
This is exactly what I meant. The closures seem like a very good idea for languages with very deeply nested closures. Lisp is such a language with all the macros ... I don't really see this being so in Pharo.

The issue is nothing to do with nesting and everything to do with executing closures on a stack.   Since closures are non-LIFO (can outlive their dynamic extent) any non-local state they have should /not/ be held on the stack.  Hence copying and indirection vectors.
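The canonical illustration is a block that escapes its defining activation, e.g. (just an example, not anything from the image):

    makeAdder: n
        "the block outlives this activation, so n cannot live only in a machine
         stack frame that is about to be popped; its value is copied into the closure"
        ^[:x | x + n]

Evaluating (SomeClass new makeAdder: 3) value: 4 answers 7 long after makeAdder: has returned.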

This implementation is pretty hard to understand, and it makes decompilation semi-impossible unless you make very strong assumptions about how the bytecodes are used. This then again reduces the reusability of the new bytecodes and probably of the decompiler once people start actually using the pushNewArray: bytecodes.

Um, the decompiler works, and in fact works better now than it did a couple of years ago.  So how does your claim stand up?
For example when I just use the InstructionClient I get in pushNewArray: and then later popIntoTemp. This combination is supposed to make clear that you are storing a remote array. This is not what the bytecode says however. And this bytecode can easily be reused for something else; what if I use the bytecode to make my own arrays? What if this array is created in a different way? I can think of a lot of ways the temparray could come to be using lots of variations of bytecodes, from which I would never (...) be able to figure out that it's actually making the tempvector. Somehow I just feel there's a bigger disconnect between the bytecodes and the Smalltalk code and I'm unsure if this isn't harmful.

I think you're setting up a straw-man here.  Until you find an example which is ambiguous you haven't really got a case.  First of all pushNewArray:/consNewArray is /not/ used to implement Array new: N, quite rightly.  It is currently only used for tuples and indirection vectors.  It is clearly unnecessary (and for the decompiler extremely confusing) to use it for Array new: N.  What other uses do you have in mind?
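For the record, the tuple case is just brace construction; roughly like this (recalling the generated code from memory, so treat the details as approximate):

    tuple: x with: y
        "compiles to pushes of x, y and x + y followed by the cons form of
         pushNewArray:, not to an Array new: 3 plus a series of at:put:s"
        ^{ x. y. x + y }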

And how, other than vague stirrings, is this actually harmful?  Be sure :)

 

But ok, I am working on the Opal decompiler of course. Are you building an IR out with your decompiler? If so I'd like to have a look since I'm spending the whole day already trying to get the Opal compiler to somehow do what I want... getting one that works and builds a reusable IR would be useful. (I'm implementing your field-index-updating through bytecode transformation btw).

I just modified the base Squeak decompiler.  The IR there-in is a stack of stacks of parse nodes.  It's a bit hairy but works well enough.

You might save a teeny tiny bit of memory by having stuff garbage collected when it's not needed anymore ... but I doubt that the whole design is based on that? Especially since it just penalizes the performance in almost all possible ways for standard methods. And it even wastes memory in general cases. I don't get it.

What has garbage collection got to do with anything?  What precisely are you talking about?  Indirection vectors?  To understand the rationale for indirection vectors you have to understand the rationale for implementing closures on a conventional machine stack.  For lisp that's clear; compile to a conventional stack as that's an easy model, in which case one has to store values that outlive LIFO discipline on the heap, hence indirection vectors.  Why you might want to do that in a Smalltalk implementation when you could just access the outer context directly has a lot to do with VM internals.  Basically its the same argument.  If one can map Smalltalk execution to a conventional stack organization then the JIT can produce a more efficient execution engine. Not doing this causes significant problems in context management.
With the garbage collection I meant the fact that you can already collect part of the stack frames and leave other parts (the remote temps) and only get them GCd later on when possible.

That's not the real issue one is solving.  The real issue one is solving is not having to write-back stack state to contexts.  Read the f-ing paper ;)
 

I do understand why you want to keep them on the stack as long as possible. The stack-frame marriage stuff for optimizations is very neat indeed. What I'm more worried about myself is the fact that stackframes aren't just linked to each other and share memory that way. This means that you only have 1 indirection to access the method-frame (via the homeContext), and 1 for the outer context. You can directly access yourself. So only the 4rd context will have 2 indirections (what all contexts have now for remotes). From the 5th on it gets worse... but I can't really see this happening in real world situations.

The only place extra indirections are introduced is in the static chain for nesting, i.e. non-local return being more expensive.  Variable access is O(1).  Either you pay at variable dereference time (by walking the static chain, a bad idea, closed over variables are read at least as many times as they're closed over, so reduce the read time, not the close-over time) or at closure-creation time (copying).  Yes, creating them isn't cheap, but it's just an allocation (and that the allocator is slow isn't the closures' fault, and is remediable), and an adaptive optimizer would eliminate them altogether and to an extent make this entire conversation moot.
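Concretely, the two cases look something like this (a sketch; which treatment a given temp gets is the compiler's decision, not the programmer's):

    copiedVersusRemote
        | readOnly written |
        readOnly := 1.
        written := 0.
        ^Array
            with: [ readOnly * 2 ]             "never assigned after capture, so its
                                                value is simply copied into the closure"
            with: [ written := written + 1 ]   "assigned after capture, so it has to live
                                                in a heap-allocated indirection vector
                                                shared with the home frame"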

Then you have the problem that since you don't just link the frames and don't look up values via the frames, you have to copy over part of your frame for activation. This isn't necessarily -that- slow (although it is an overhead); but it's slightly clumsy and uses more memory. And that's where my problem lies I guess ... There's such a straightforward implementation possible, by just linking up stackframes (well... they are already linked up anyway), and traversing them. You'll have to do some rewriting whenever you leave a context that's still needed, but you do that anyway for the remote temps right?

Read the f-ing paper ;)
 

The explanation is all on my blog and in my Context Management in VisualWorks 5i paper.

But does a bright guy like yourself find this /really/ hard to understand?  It's not that hard a transformation, and compared to what goes on in the JIT (e.g. in bytecode to machine-code pc mapping) it's pretty trivial.
I guess I just like to really see what's going on by having a decent model around. When I look at the bytecodes; in the end I can reconstruct what it's doing ... as long as they are aligned in the way that the compiler currently generates them. But I can easily see how slight permutations would already throw me off completely.



But probably I'm missing something?

It's me who's missing something.  I did the simplest thing I knew could possibly work re getting an efficient JIT and a Squeak with closures (there's huge similarity between the above scheme and the changes I made to VisualWorks that resulted in a VM that was 2 to 3 times faster depending on platform than VW 1.0).  But you can see a far more efficient and simple scheme.  What is it?
Basically my scheme isn't necessarily far more efficient. It's just more understandable I think. I can understand scopes that point to their outer scope; and I can follow these scopes to see how the lookup works. And the fact that it does some pointer dereferencing and copying of data less is just something that makes me think it wouldn't be less efficient than what you have now. My problem is not that your implementation is slow, rather that it's complex. And I don't really see why this complexity is needed.

Obviously playing on my ego by telling me I should be clever enough to understand it makes me say I do! But I feel it's not the easiest; and probably less people understand this than the general model of just linking contexts together.

Read the f-ing paper ;)
 


best
Eliot 


Re: pharo closures

Eliot Miranda-2
In reply to this post by Toon Verwaest-2
Toon,

    what you describe is how Peter Deutsch designed closures for ObjectWorks 2.4 & ObjectWorks 2.5, whose virtual machine and bytecode set served all the way through VisualWorks 3.0.  If you read the context management paper you'll understand why this is a really slow design for a JIT.  When I replaced that scheme by one essentially isomorphic to the Squeak one the VM became substantially faster; for example factors of two and three in exception delivery performance.  The description of the problem and the performance numbers are all in the paper.  There are two main optimizations I performed on the VisualWorks VM, one is the closures scheme and the other is PICs.  Those together sped-up what was the fastest commercial Smalltalk implementation by a factor of two on most platforms and a factor of three on Windows.

I'm sorry it's complex, but if one wants good performance it's a price well-worth paying.  After all I was able to implement the compiler and decompiler within a month, and Jorge proved at INRIA-Lille that I'm far from the only person on the planet who understands it.  Lispers have understood the scheme for a long time now.

best,
Eliot




Re: pharo closures

Toon Verwaest-2
Ok, I will do so. (read the f-ing paper) I only read the blogpost until now.

I just realized that I actually made a mistake in my mental model of your model. See! It's complex!
So I realized that getting to the remotes is exactly as fast as going to the parent or outer context.

This makes it as fast as having a method context with maximally 2 nested contexts (3 blocks nested), and faster than deeper nestings. How often does it occur that you have deeper nesting in Pharo? Is it worthwhile to make the remote arrays just for those cases?

Is the copying really worthwhile to make those cases faster?

My biggest problem until now is... why wouldn't you be able to do everything you do with the remote arrays, directly with the context frames? Why limit it to only the part that is being closed over? The naive implementation that just extends squeak with proper closure-links will obviously be slow. I agree that you need a stack. Now I'd just like to read why you choose to just take a part of the frame (the remote array) rather than the whole frame. This would avoid the copyTemps thing...

But then. I guess I should go off and read the f-ing paper. I hope that particular thing is described there, since it's basically the piece I'm missing.

Also I don't exactly know what Peter Deutsch did, but if it was the straightforward implementation then it seems obvious you get such a speedup. Implementing it is less obvious, naturally ;)
These responses are exactly why I posed the question here... I'd like to understand why. No offense.

cheers,
Toon




Re: pharo closures

Igor Stasenko
On 25 March 2011 02:51, Toon Verwaest <[hidden email]> wrote:

> Ok, I will do so. (read the f-ing paper) I only read the blogpost until now.
>
> I just realized that I actually made a mistake in my mental model of your
> model. See! It's complex!
> So I realized that getting to the remotes is exactly as fast as going to the
> parent or outer context.
>
> This makes it as fast as having a method context with maximally 2 nested
> contexts (3 blocks nested), and faster than deeper nestings. How often does
> it occur that you have deeper nesting in Pharo? Is it worthwhile to make the
> remote arrays just for those cases?
>
> Is the copying really worthwhile to make those cases faster?
>
> My biggest problem until now is... why wouldn't you be able to do everything
> you do with the remote arrays, directly with the context frames? Why limit
> it to only the part that is being closed over? The naive implementation that
> just extends squeak with proper closure-links will obviously be slow. I
> agree that you need a stack. Now I'd just like to read why you choose to
> just take a part of the frame (the remote array) rather than the whole
> frame. This would avoid the copyTemps thing...
>
Perhaps because you don't need to copy unnecessary stuff which the
closure isn't accessing anyway.
This could save some cycles.


--
Best regards,
Igor Stasenko AKA sig.


Re: pharo closures

Stéphane Ducasse
In reply to this post by Toon Verwaest-2

On Mar 25, 2011, at 2:51 AM, Toon Verwaest wrote:

> Ok, I will do so. (read the f-ing paper) I only read the blogpost until now.

which paper?
Is there something more than the blog? I read the old VW5 paper but eliot told me that this is old and not accurate with Cog anymore.




Re: pharo closures

Toon Verwaest-2

> which paper?
> Is there something more than the blog? I read the old VW5 paper but eliot told me that this is old and not accurate with Cog anymore.
Mmh... he was referring me to that "old" paper indeed.
http://www.esug.org/data/Articles/misc/oopsla99-contexts.pdf

@Igor: that was exactly my point. You avoid copying temps around which
you might not need, and accesses to shared temps from your own frame
keep direct access rather than going through 1 indirection. But then again,
this might not pay off vs the speedup of the remote array in the long
run, depending on the implementation details of the stack/JIT VM; I
clearly have no idea. It would be cool to see some benchmarks that
compare exactly remote arrays vs linked context frames, but both
on a stack.

Unless I'm still completely missing the point and there's a grand
JIT-reason or other to not take the whole captured stackframe off... If
this hasn't been tried out yet I'd be interested in collaborating on
that. I really wonder what the speed diff is.

cheers,
Toon


Re: pharo closures

Igor Stasenko
On 25 March 2011 10:25, Toon Verwaest <[hidden email]> wrote:

>
>> which paper?
>> Is there something more than the blog? I read the old VW5 paper but eliot
>> told me that this is old and not accurate with Cog anymore.
>
> Mmh... he was referring me to that "old" paper indeed.
> http://www.esug.org/data/Articles/misc/oopsla99-contexts.pdf
>
> @igor that was exactly my point. You avoid copying temps around which you
> might not need, and accesses to shared temps from your own frame keep on
> having direct access rather than 1 indirection. But then again, this might
> not pay off vs the speedup of the remote array in the long run depending on
> the implementation details of the stack/jit VM, I clearly have no idea. It
> would be cool to see some benchmarks that compare exactly having remote
> arrays vs linked context frames but both on a stack.
>
If I'm not mistaken, remote temp vectors are rare.
(Eliot could give you a nice doit to count all the closures in the system
which use a temp vector.)

So even if it's slow, it's not a big deal anyway.
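Something along these lines should give a rough count (a sketch: it assumes
the closure bytecode set where 138 is the pushNewArray bytecode used to build
temp vectors, and it also counts brace-array construction, which uses the same
bytecode; I haven't double-checked the numbers):

    | withVectors |
    withVectors := CompiledMethod allInstances select: [:m |
        (InstructionStream on: m) scanFor: [:byte | byte = 138]].
    withVectors size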



--
Best regards,
Igor Stasenko AKA sig.


Re: pharo closures

Toon Verwaest-2

> If I'm not mistaken, remote temp vectors are rare.
> (Eliot could give you a nice doit to count all the closures in the system
> which use a temp vector.)
>
> So even if it's slow, it's not a big deal anyway.
My point was not really that I would make it faster. My point was more
that I would do the straightforward thing and link context frames
together rather than remote vectors. If I see a context frame linked
from another frame, I can generally understand what this means. If I see
an array with values shared between context frames ... it's not
immediately obvious from inspecting the objects. So the semantics get
slightly obscured.

If you really win a lot of performance by doing so, this is justified
(although I would still give it a different class than just Array ;)).
I'm just trying to find out if that performance boost is really there.
But I'm going to take this offline I think not to pollute the whole
mailing list with our circular conversations, and I'll report back if we
figure something interesting out ;)

cheers,
Toon


Re: pharo closures

Stefan Marr-4
Hi:

On 25 Mar 2011, at 11:01, Toon Verwaest wrote:

> If you really win a lot of performance by doing so, this is justified (although I would still give it a different class than just Array ;)). I'm just trying to find out if that performance boost is really there. But I'm going to take this offline I think not to pollute the whole mailing list with our circular conversations, and I'll report back if we figure something interesting out ;)
Well, for what it is worth, I really enjoy such discussions. And they are much more than all the usual ones here on the list ;)

If you do not feel that the Pharo mailing list is the right place, there is always an almost unused smalltalk research mailing list...

Best regards
Stefan


--
Stefan Marr
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://soft.vub.ac.be/~smarr
Phone: +32 2 629 2974
Fax:   +32 2 629 3525



Re: pharo closures

Igor Stasenko
On 25 March 2011 12:35, Stefan Marr <[hidden email]> wrote:
> Hi:
>
> On 25 Mar 2011, at 11:01, Toon Verwaest wrote:
>
>> If you really win a lot of performance by doing so, this is justified (although I would still give it a different class than just Array ;)). I'm just trying to find out if that performance boost is really there. But I'm going to take this offline I think not to pollute the whole mailing list with our circular conversations, and I'll report back if we figure something interesting out ;)
> Well, for what it is worth, I really enjoy such discussions. And they are much more than all the usual ones here on the list ;)
>
> If you do not feel that the Pharo mailing list is the right place, there is always an almost unused smalltalk research mailing list...
>
No, it is right place for such discussions.
Come on, since when extra mail traffic in Pharo list became a problem? :)

For people who interested in topic (like me), they can read and learn
something new (and put their ignorant 2cents ;) .
And if you do it privately, then we will have just two people who know
how closures are implemented instead of 10 or 20 ones.



--
Best regards,
Igor Stasenko AKA sig.


Re: pharo closures

Henrik Sperre Johansen

On Mar 25, 2011, at 12:54:25 PM, Igor Stasenko wrote:

> On 25 March 2011 12:35, Stefan Marr <[hidden email]> wrote:
>> Hi:
>>
>> On 25 Mar 2011, at 11:01, Toon Verwaest wrote:
>>
>>> If you really win a lot of performance by doing so, this is justified (although I would still give it a different class than just Array ;)). I'm just trying to find out if that performance boost is really there. But I'm going to take this offline I think not to pollute the whole mailing list with our circular conversations, and I'll report back if we figure something interesting out ;)
>> Well, for what it is worth, I really enjoy such discussions. And they are much more than all the usual ones here on the list ;)
>>
>> If you do not feel that the Pharo mailing list is the right place, there is always an almost unused smalltalk research mailing list...
>>
> No, it is right place for such discussions.
> Come on, since when extra mail traffic in Pharo list became a problem? :)

Agreed.
Interesting technical discussion is much better than all of the "Cool!" and "Me too!" posts we already get. (Which of course, ironically, this is one)

> For people who interested in topic (like me), they can read and learn
> something new (and put their ignorant 2cents ;) .
> And if you do it privately, then we will have just two people who know
> how closures are implemented instead of 10 or 20 ones.

And if I don't have time to delve in deep now, it's always searchable later.
No need to send emails asking participants what happened to the experiments they ran in private as a result of that thread I spotted, but didn't have time to read thoroughly 4 months ago :)

Cheers,
Henry

Re: pharo closures

Stéphane Ducasse
In reply to this post by Igor Stasenko
+1




Re: pharo closures

Miguel Cobá
In reply to this post by Henrik Sperre Johansen
On Fri, 25-03-2011 at 13:05 +0100, Henrik Johansen wrote:

> Agreed.
> Interesting technical discussion is much better than all of the "Cool!" and "Me too!" posts we already get. (Which of course, ironically, this is one)

+100 ;) he he!


--
Miguel Cobá
http://twitter.com/MiguelCobaMtz
http://miguel.leugim.com.mx





Re: pharo closures

Eliot Miranda-2
In reply to this post by Stéphane Ducasse


On Fri, Mar 25, 2011 at 12:54 AM, Stéphane Ducasse <[hidden email]> wrote:

On Mar 25, 2011, at 2:51 AM, Toon Verwaest wrote:

> Ok, I will do so. (read the f-ing paper) I only read the blogpost until now.

which paper?
Is there something more than the blog? I read the old VW5 paper but eliot told me that this is old and not accurate with Cog anymore.

The http://www.esug.org/data/Articles/misc/oopsla99-contexts.pdf paper describes the problem with an implementation of closures using contexts such as that described by Toon, since the pre-5i implementation is isomorphic to Toon's.  It also describes the 5i solution which is also essentially isomorphic to my Squeak closure implementation.  But in detail this paper clearly does not describe my Squeak closure implementation; the bytecodes have different names etc.

