I have read the description of the stack vm again:

> http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/

It seems to me that it would be possible to keep the arguments and the remaining temporaries together. This would require keeping the numTemps in the flags word instead of the numArgs. It would also mean moving the code to nil out the temps towards the beginning of #internalActivateNewMethod and #activateNewClosureMethod:numArgs: (before pushing the IP). The idea is that then #temporary:in: would become simpler, which is important since it is a very popular operation.

This change would make it a little harder to fix the bug in #marryFrame: but I don't see any other changes that would be needed. Is there some important design issue with keeping the arguments and temps separate that I am missing? I can imagine a compiler that avoids initially filling out the temps with nils by creating them lazily on first assignment, and that wouldn't work with my change. But I don't know if this is a planned feature (it complicates reflection a bit since you have to fake the uninitialized temps in order not to confuse the debugger).

I found the use of a one-byte flag to indicate that the context pointer is valid interesting, since it seems to me that the same information is available by looking at the pointer itself (nil or not). Is this just a performance issue, or are there situations where the flag can be zero but the pointer is not nil?

Some small details that seem like errors to me but could be a lack of understanding on my part:

- the drawing above the definition of "activateNewClosureMethod: blockClosure numArgs: numArgs" shows the flag word right after fp, but the code indicates that it should be a method pointer instead.
- callersFPOrNull is used in #commonCallerReturn but not assigned to. Ah - I see that the first definition near the top of the text is incomplete. The real definition in the discussion about return does have the assignment.

Cheers,
-- Jecel
Hi Jecel,

On Wed, May 13, 2009 at 3:46 PM, Jecel Assumpcao Jr <[hidden email]> wrote:
But the split arguments/temporaries organization is determined by frame build order, and the anticipation of a JIT.
The caller pushes receiver and arguments, and in a JIT then pushes the return pc (as part of the call instruction that makes up the send). Then a frame is built at the end of which temporaries are initialized. In the JIT there will always be at least the return pc/caller's saved instruction pointer between the arguments and the temporaries, and since the offsets to access each can be determined at compile time the cost of separating them is low.
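As a rough illustration of this frame build order (not actual VM code; all names and the upward-growing layout are simplifications for readability), the split arguments/temporaries layout and the compile-time offsets look like this:

```python
# Model of the split frame layout described above: the caller pushes
# receiver and arguments, the call pushes the return pc, then the callee
# builds a frame and nils out its temporaries. Because the saved pc sits
# between them, argument and temp offsets are both fixed at compile time.

def build_frame(stack, receiver, args, num_temps):
    """Push one activation; return the frame pointer (index of saved pc)."""
    stack.append(receiver)
    stack.extend(args)                 # caller's part: receiver + arguments
    stack.append("return-pc")          # pushed by the call instruction
    fp = len(stack) - 1                # frame pointer points at the saved pc
    stack.extend([None] * num_temps)   # callee initializes temps to nil
    return fp

def arg_at(stack, fp, num_args, i):
    # arguments sit just below the saved pc, at a fixed offset from fp
    return stack[fp - num_args + i]

def temp_at(stack, fp, i):
    # temporaries sit just above the saved pc, also at a fixed offset
    return stack[fp + 1 + i]
```

This is why the cost of the separation is low: neither accessor needs a runtime check of where the split falls.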
So while one could, I don't see that it's worthwhile. Even if one did keep the arguments and temporaries together, one would still have the stack contents separate from the arguments and temporaries, and temporary access bytecodes can still access those, so arguably one would still have to check the index against the temp count.
In practice the cost is not that high in the stack vm, or at least the stack vm is still substantially faster than the context vm with arguments and temporaries split. And in the JIT it really isn't a performance issue at all.
> This change would make it a little harder to fix the bug in #marryFrame:

It was my original intent to avoid having to write the context pointer field. It used to be important to avoid unnecessary writes, but on current processors with good cache interfaces I don't think avoiding the write makes any difference and I write it anyway in the current VM (which has moved on a little from the blog posts). But it is quicker to test the flag than to test against nil, simply because the reference to nil is an arbitrary 32-bit value, not simply 0 or 1.
In the JIT the flag is a bit in the method reference's LSBs and is set for free on frame build.

> Some small details that seem like errors to me but could be a lack of
> understanding on my part

You're right. I missed out the methods by accident in the left-hand stack page.

> - callersFPOrNull is used in #commonCallerReturn but not assigned to. Ah

Thanks!
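The trick of carrying a flag in a pointer's low bits can be sketched generically. The mask value and helper names below are assumptions for illustration, not the actual Cog encoding; the underlying idea is only that word-aligned object references leave their low bits free:

```python
# Sketch of a low-bit flag in an aligned reference: testing it is a
# single AND + branch, cheaper than comparing against the arbitrary
# address of nil, and it can be set "for free" when the field is written.

HAS_CONTEXT_FLAG = 0x1  # assumed mask: low bit marks "context pointer valid"

def with_flag(method_oop):
    # set the flag as part of writing the method field on frame build
    return method_oop | HAS_CONTEXT_FLAG

def flag_is_set(field):
    return (field & HAS_CONTEXT_FLAG) != 0

def method_oop_of(field):
    # mask the flag back off to recover the aligned reference
    return field & ~HAS_CONTEXT_FLAG
```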
In reply to this post by Jecel Assumpcao Jr
On Wed, May 13, 2009 at 3:46 PM, Jecel Assumpcao Jr <[hidden email]> wrote:
[snip]

Fixed. Thanks again!
In reply to this post by Eliot Miranda-2
Eliot,

> [JIT code uses call which pushes PC first]

Ok, so this can't be helped.

> So while one could I don't see that its worth-while. Even if one did
> keep the arguments and temporaries together one would still have the
> stack contents separate from the arguments and temporaries and
> temporary access bytecodes can still access those so arguably one
> would still have to check the index against the temp count.

Really? I wouldn't expect the compiler to ever generate such bytecodes and so wasn't too worried if the VM did the wrong thing in this situation.

> In the JIT the flag is a bit in the method reference's LSBs and is set for free
> on frame build.

That sounds like a neat trick. Are the stack formats for the interpreted stack vm and the jit a little different?

Thanks for the explanations. I haven't figured out how to do this in hardware in a reasonable way and so might have to go with a different design.

-- Jecel
In reply to this post by Eliot Miranda-2
On Wed, May 13, 2009 at 5:05 PM, Jecel Assumpcao Jr <[hidden email]> wrote:
There's a tension between implementing what the current compiler produces and implementing what the instruction set defines. For example should one assume arguments are never written to? I lean on the side of implementing the instruction set.
Yes. In the JIT an interpreted frame needs an extra field to hold the saved bytecode instruction pointer when an interpreted frame calls a machine code frame because the return address is the "return to interpreter trampoline" pc. There is no flag word in a machine code frame. So machine code frames save one word w.r.t. the stack vm and interpreted frames gain a word. But most frames are machine code ones so most of the time one is saving space.
> Thanks for the explanations. I haven't figured out how to do this in
> hardware in a reasonable way and so might have to go with a different design.

I guess that in hardware you can create an instruction that will load a descriptor register as part of the return sequence in parallel with restoring the frame pointer and method so one would never indirect through the frame pointer to fetch the flags word; instead it would be part of the register state. But that's an extremely uneducated guess :)
2009/5/14 Eliot Miranda <[hidden email]>:
[snip]

The one thing you mentioned is separation of data stack & code stack (use separate space to push args & temps, and another space to push VM/JIT-specific state, such as context pointers & return addresses).

In this way, there is no big issue with this separation, because to access the state in both variants you still have to track two pointers - namely the stack pointer and frame pointer. But such a separation could be very helpful to handle the uninitialized temps.

To call a method you usually push args (sp decreasing - if the stack is organized in bottom-to-top order), then push the return address and save sp & other required context state (fp decreasing). And then you are ready to activate the new method. Now there are 2 variants: at the time when you enter the new method, you can simply reserve the stack space for its temps, OR leave the sp unchanged but organize the code in such a way that each time a temp gets initialized, you decrease sp. Then, at each point of execution, if the debugger needs to determine whether some of the method's temps are uninitialized, it can simply check that the context's saved sp is <= the temp pointer.

This of course makes things more complex, because you should reorganize the temps in order of their initialization, i.e.

  | a b |
  b := 5.
  a := 6.

to make the pointer to a be < the pointer to b. As well, you should not alter the sp when "pushing" arguments (otherwise the debugger will assume that you have already initialized the temps).

But it allows you to conserve stack space between method calls:

  someMethod
    | a b |
    "1" b := 5.
    "2" self foo.
    "3" a := 6.

at 1) you decrement sp by pushing the initialized temp value of b
at 2) you save the sp, then increase the sp by the number of arguments for the method (foo) and activate the new method; on return you restore sp
at 3) you push a newly initialized value, which leads to another decrement of sp.

And, as I said, if you suspend the process in the middle of the "self foo" call and inspect the context of someMethod, the debugger can easily see that its saved sp is too small to hold all the temps, and if the user wants to see the current uninitialized value of 'a', then the answer is nil, because the offset of 'a' is greater than the currently saved sp.

There's another cost, of course - if a method has 10 temps, then the generated code needs 10 pushes for the initializations, in contrast to a single instruction for reserving space at the method's activation, where you simply decrease the sp by a known constant. But I think there is not much difference: in both variants you have to write to some memory location, and there is just a little difference in how you calculate the address at the initial store, i.e. doing

  push reg
or
  mov [fp + offset] <- reg

So the major pros are:
- conserving stack space
- easy to determine uninitialized temps for the debugger

and the cons are:
- this could complicate the code generation

-- 
Best regards,
Igor Stasenko AKA sig.
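A minimal simulation of this scheme (all names invented; a dict stands in for stack slots) shows how a debugger could answer nil for temps whose slots lie beyond the saved sp:

```python
# Simulation of lazy temp initialization: sp counts only the temp slots
# that have actually been assigned, and temps are laid out in order of
# first assignment. The debugger answers nil for any slot >= sp.

class Frame:
    def __init__(self, temp_names):
        # temps must be ordered by first assignment (e.g. b before a)
        self.temp_names = temp_names
        self.values = {}
        self.sp = 0                # number of temp slots "pushed" so far

    def assign(self, name, value):
        idx = self.temp_names.index(name)
        if name not in self.values:
            assert idx == self.sp, "temps must initialize in layout order"
            self.sp += 1           # the first store is a push: sp moves
        self.values[name] = value

    def debugger_read(self, name):
        idx = self.temp_names.index(name)
        if idx >= self.sp:         # slot beyond the saved sp
            return None            # debugger answers nil
        return self.values[name]
```

The assertion inside `assign` makes the compiler's obligation explicit: the scheme only works if code generation orders the temp slots to match first-assignment order.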
2009/5/14 Igor Stasenko <[hidden email]>:
[snip]

> So the major pros are:
> - conserving stack space
> - easy to determine uninitialized temps for the debugger

[...] stored at location > sp is left uninitialized. This simplifies the GC code for visiting the stack to a simple loop from the beginning of the stack to the current stack pointer. (Visiting the code stack, tracked by fp, is done separately.)

-- 
Best regards,
Igor Stasenko AKA sig.
In reply to this post by Igor Stasenko
Hi Igor,

On Wed, May 13, 2009 at 6:18 PM, Igor Stasenko <[hidden email]> wrote:
I think the "official" names used in Forth implementations are "control stack" and "operand stack".
> In this way, there are no big issue with this separation, because to
> access the state in both variants you still have to track two pointers
> - namely stack pointer and frame pointer.

I think you end up with three pointers. There's a frame pointer into the control stack, a stack pointer to the receiver/0th temp for that frame on the operand stack, and a stack pointer proper to the top of stack on the operand stack. An operand stack page needs to be large enough to hold the maximum amount of stack space used by as many frames as can fit on the control stack, and this is an issue because one ends up wasting a lot of space, something the single stack organization avoids.
> But such a separation could be very helpful to handle the unitialized temps.

Right. I took this approach in my BrouHaHa VMs, with the simplification that there was only one stack page. After becoming familiar with HPS's organization I prefer it. It fits better with conventional hardware. But I think Jecel could use the split stack approach profitably on custom hardware where having 3 registers (frame, stack base and top of stack) is no problem, and where memory is still cheap. It's fine to throw transistors at the operand stack as that memory will be made frequent use of, even if one is wasting 25% of it most of the time.
Jecel can also design the machine to avoid taking interrupts on the operand stack and provide a separate interrupt stack.
2009/5/14 Eliot Miranda <[hidden email]>:
[snip]

> I think you end up with three pointers. There's a frame pointer into the control stack, a stack pointer to the receiver/0th temp for that frame on the operand stack and a stack pointer proper to the top of stack on the operand stack.

Maybe you are right, but I think this can be avoided, because at each call site you know exactly the number of arguments and the order in which they will be pushed on the stack. So to make the send

  self x: 1 y: 2

you can do:

  [sp - 4] <- self
  [sp - 8] <- 1
  [sp - 12] <- 2

on a 32-bit machine. In case you have nested sends:

  self x: (12 + 5) y: 2

" at the moment of preparing to send 12+5 "
  [sp - 4] <- self
  [sp - 8] <- 12
  [sp - 12] <- 5

" at the moment of preparing to send #x:y: "
  [sp - 4] <- self
  [sp - 8] <- 17
  [sp - 12] <- 2

and it is even more complex if you have a temp initialization in the middle:

  | temp |
  self x: 2 y: (temp := 5)

here you have to reserve the space for 'temp' first (push nil), before storing 'self' at [sp - 4].

Are there other cases when you need to know the 'real' top of the operand stack for a given context, except by the code flow? I mean, for the debugger, it will see the initialized temps in the caller's context and the arguments in the callee's context - so there is no problem with losing introspection abilities.

> An operand stack page needs to be large enough to hold the maximum amount of stack space used by as many frames as can fit on the control stack and this is an issue because one ends up wasting a lot of space, something the single stack organization avoids.

I can't catch up with this statement. Do you mean that you have to maintain 2 stack spaces instead of 1 (and consequently double the overhead of reserving extra space to prevent occasional overflow)? And check 2 stacks before proceeding with a call instead of just one. But from the other side, control stack frames always have a constant size - so it is very easy to predict how deep the call chain can go before you need to allocate another page for the control stack. This means that you can check the control stack 1/4, 1/8, 1/16 times as often as the depth of the call chain changes, to minimize this overhead.

-- 
Best regards,
Igor Stasenko AKA sig.
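Igor's point that a third pointer can be avoided - the compiler knows at each call site how many argument slots it is about to write, so it stores them at fixed negative offsets from sp instead of pushing - can be sketched as follows. The word size of 4 matches his [sp - 4] / [sp - 8] example; the helper name and the dict standing in for memory are illustrative assumptions:

```python
# Sketch of writing call arguments at fixed negative offsets from sp,
# without moving sp: for `self x: 1 y: 2` this stores self at [sp - 4],
# 1 at [sp - 8], and 2 at [sp - 12], exactly as in the mail.

WORD = 4

def write_call_args(memory, sp, receiver_and_args):
    for i, value in enumerate(receiver_and_args):
        memory[sp - WORD * (i + 1)] = value
    return memory
```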
On Wed, May 13, 2009 at 7:08 PM, Igor Stasenko <[hidden email]> wrote:
Hang on; I think you're changing the context. Jecel and I (I think) have been discussing how to avoid checking the number of arguments on each temp var access on a machine where one has push/Store/StorePopTempVarN bytecodes. You're discussing a more general instruction set now. Yes, if you use a conventional instruction set where you've computed the offsets of all variables relative to the stack pointer at all points in the computation then you can do this; this is what gcc does when one says -fomit-frame-pointer. But that instruction set is not a 1 token = 1 instruction set like a bytecode set and so is difficult to compile for, difficult to decompile, etc.
Yes, but you don't have to check 2 stacks instead of one if the operand stack is large enough. i.e. if a control stack page holds N frames and the maximum stack space per frame is M then if an operand stack page is M * N it can't overflow before the control stack page does. But if the average stack space per frame is much less than the maximum (which it is) then one ends up wasting a lot of space. This is not a big issue if memory is cheap. I expect in Jecel's process technology memory isn't that expensive and would also expect that since an operand stack page would be heavily used it is good use for the memory.
And in hardware the check would be done in parallel with other instructions so it would effectively be free. One only pays for moving the overflow state from one page to the next.
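Eliot's sizing argument can be checked with back-of-envelope arithmetic. The numbers in the test are made up for illustration; only the relationship (page size M * N, waste growing as average use falls below the maximum) comes from the mail:

```python
# If a control stack page holds N frames and a frame can use at most M
# words of operand stack, an operand page of M * N words cannot overflow
# before the control page does - but when the average use per frame is
# much less than M, most of that operand page sits idle.

def operand_page_words(frames_per_control_page, max_words_per_frame):
    return frames_per_control_page * max_words_per_frame

def wasted_fraction(avg_words_per_frame, max_words_per_frame):
    return 1.0 - avg_words_per_frame / max_words_per_frame
```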
cheers Eliot
Thanks Eliot and Igor for your comments! Eliot is right that what Igor wrote is a good way to do a JIT but the problem is that my hardware is essentially an interpreter and must deal at runtime with expressions that the JIT would optimize away. My design is described in: http://www.squeakphone.com:8203/seaside/pier/siliconsqueak/details What is missing from that description is what the control registers (x0 to x31) do, and that depends on the stack organization. Certainly it is possible to have three registers as Eliot suggested. A separate control stack is not only used on Forth processors but was also popular in Lisp Machine designs. Registers t0 to t29 are really implemented through a small amount of hardware as a range of words in the stack cache. Having to split this set differently for each method would make the hardware larger and slower. With a separate control stack the mapping is really simple. About the JIT having to use call which pushes the PC, that is not true on RISC processors. But I suppose the focus for Cog is making Squeak fast on the x86. > There's a tension between implementing what the current compiler > produces and implementing what the instruction set defines. For > example should one assume arguments are never written to? I lean > on the side of implementing the instruction set. That is specially a good idea if the same VM ends up being used for other languages like Newspeak. Certainly the Java VM runs many languages that the original designers never expected. > Yes. In the JIT an interpreted frame needs an extra field to hold > the saved bytecode instruction pointer when an interpreted frame > calls a machine code frame because the return address is the "return > to interpreter trampoline" pc. There is no flag word in a machine > code frame. So machine code frames save one word w.r.t. the > stack vm and interpreted frames gain a word. But most frames > are machine code ones so most of the time one is saving space. 
Ok, so the JIT VM will still have an interpreter. Self originally didn't have one but most of the effort that David Ungar put into the project when it was restarted was making it more interpreter friendly. The bytecodes became very similar to the ones in Little Smalltalk, for example. Will images be compatible between the JIT VM and the Stack VM? Or do you expect the latter to not be used anymore once the JIT is available? I had originally understood that the Stack VM would be compatible with older images (since you divorce all frames on save and remarry them on reload) but I had missed the detail of the different bytecodes for variable instance access in the case of Context objects. > I guess that in hardware you can create an instruction that will > load a descriptor register as part of the return sequence in parallel > with restoring the frame pointer and method so one would never > indirect through the frame pointer to fetch the flags word; instead > it would be part of the register state. But that's an extremely > uneducated guess :) Well, I am trying to avoid having a flags word even though in hardware it is so easy to have any size fields that you might want. I can check if context == nil very efficiently. For methods, t0 is the same value as the "self" register (x4, for example) while for blocks it is different. And with three pointers (fp, sp and control pointer) I shouldn't need to keep track of the number of arguments. > Jecel can also design the machine to avoid taking interrupts on > the operand stack and provide a separate interrupt stack. Hmmm... it has been a while since I designed hardware with interrupts, but have normally used Alto style coroutines instead. The stack cache is divided up into 32 word blocks and can hold parts of stacks from several threads at once. Checking for overflow/underflow only needs to happen when the stack pointer moves from one block to a different one (and even then, only in certain cases which aren't too common). 
An interesting feature of this scheme is that only 5 bit adders are needed, which are much faster than 16 or 32 bit adders, for example (wide adders could reduce the clock speed or make the operation take an extra clock). Another detail is that having operand or control frames split between two stack pages is no problem at all.

Address of tN in the stack cache:

  raddr := fp[4:0] + N.
  scaddr := (raddr[5] ? tHigh : tLow) , raddr[4:0].

When fp[5] changes value, then tLow := tHigh and tHigh := head of the free block list (if fp was going up). If there are no free blocks, then some have to be flushed to their stack pages in main memory. When going down, tLow is loaded from a linked list, which might have to be extended by loading blocks from stack pages. With a 4KB stack cache, for example, there are 32 blocks of 32 words each, and so block 0 can dedicate a word to each of the other 31 blocks. The bottom 7 bits of that word (only 5 are actually needed, but it is nice to have a little room to grow) can form the "previous" linked list (tHigh and tLow would also be 5 bits wide) while the remaining bits can hold the block's address in main memory.

This might seem far more complicated than a split arg/temp frame and it certainly would be if implemented in software. In hardware, it is mostly wires, a multiplexer and a small adder.

-- Jecel
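The tN address computation described above can be modeled in a few lines of Python. This is a sketch under my own reading of the scheme, not hardware source: tHigh and tLow hold the 5-bit block numbers of the two cache blocks that may contain the current frame's temporaries, and bit 5 of the 5-bit add is treated as the carry that selects between them.

```python
def stack_cache_address(fp, n, t_high, t_low):
    """Model of the 'address of tN' computation in the stack cache.

    fp: frame pointer (only its low 6 bits matter here)
    n: temporary index (t0..t29)
    t_high, t_low: 5-bit block numbers in the 32-block cache
    Returns a (block, offset) pair into the 32x32-word cache.
    """
    raddr = (fp & 0x1F) + n           # 5-bit add; bit 5 is the carry out
    block = t_high if raddr & 0x20 else t_low
    return block, raddr & 0x1F        # block number , 5-bit offset

# Example: a frame starting at word 30 of its block means t3
# spills over into the "high" block at offset 1.
print(stack_cache_address(30, 3, t_high=7, t_low=2))  # -> (7, 1)
```

The point of the model is visible in the example: crossing a block boundary changes only which 5-bit block number is concatenated, so no wide adder is ever needed.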
In reply to this post by Eliot Miranda-2
Hi Jecel, you ask some pointed questions about the Stack & JIT Cog VMs that need good answers and are, I suspect, generally interesting, so I'll answer on squeak-dev as well.
On Thu, May 14, 2009 at 4:36 AM, Jecel Assumpcao Jr <[hidden email]> wrote: [snip]
At least in the first version of the JIT, yes. There are a couple of restrictions:

- the code that rebuilds the Character table in SystemDictionary>>recreateSpecialObjectsArray has to be replaced by code that simply copies the table. This is because the JIT caches the Character table's oop in the generated machine code for the at: primitive
- the use (e.g. in Pharo) of the SmallInteger primitives 1 through 17 on LargePositiveInteger must be replaced by the relevant large integer primitives, which will be included in both the Stack and JIT Cog VMs.
But I do expect to evolve the object representation quite soon, and this will break image backward-compatibility altogether. However, I will want to produce a Stack VM that is compatible with the new format, not just a JIT VM, because platform spread is very important to maintain and it'll be a while before more ISAs than x86 are supported.
> Or do you expect the latter to not be used anymore once the JIT is available?

I expect that an interpreted Stack VM will survive for a long time and be the standard VM on ISAs such as SPARC, Itanium and MIPS, while the JIT will become available probably in the following order: IA32/x86, ARM32, x86-64, PowerPC.
> I had originally understood that the Stack VM would be compatible with
> older images

No, the Stack VM is not compatible with older images. The Context VM with closure bytecode extensions (e.g. John's current 4.0 beta VM) is backward-compatible and is the bridge over which one can bring forward pre-closure images to the Stack VM and JIT VM, which are Closure-only VMs.
One could make the Stack VM run older images but it would be difficult (and hence probably buggy) and I think unnecessary. One can always run older images on older VMs for as long as OSs support the older executables (and that other VM technology makes that easier these days). One can migrate an older image forward to the Stack VM using the bootstrap and the Context VM with Closure extensions. Packaging technologies such as Monticello make images much less important these days. One should expect to be able to export one's code in the form of a package and load it into a newer image, perhaps with a little massage. So keeping those dusty decks running on up-to-date VMs is IMO both wasted effort and a hindrance to obtaining maximum performance on newer VMs.
Best,
Eliot
In reply to this post by Eliot Miranda-2
On Thu, May 14, 2009 at 4:36 AM, Jecel Assumpcao Jr <[hidden email]> wrote:
Right, but also for sanity the run-time support code needs to be ISA-agnostic. So the frame organization will be unchanged between different ISAs, and that means even on RISCs the return pc will be pushed; perhaps not in frameless leaf methods, but definitely in normal frames. Hence it might as well be pushed in the same place, and incremented by the size of the call instruction, instead of having the return instruction add the offset at return time.
Good point.
Have you got a URL for that bytecode set? If not, can you mail me some code or documentation that describes it?

> Will images be compatible between the JIT VM and the Stack VM? Or do you
> expect the latter to not be used anymore once the JIT is available?

Most of this I've answered in a different message. Yes, accessing Context inst vars needs special treatment. One can add the test to all relevant inst var accesses (only inst vars 0 through 2 are affected on read and 0 through 5 on write) but that would be very slow. Hence the recompile to use the long-form bytecodes is necessary. This whole treatment of Context inst var access is complex, but IMO it is better to hide it in the VM behind the primitive and long-form IV bytecode interface than to expose the internals to the image, because that makes the bootstrap process easier and makes evolving the VM easier too.
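To illustrate why adding the test to every access would be slow, here is a hypothetical Python sketch (not Cog source; every name in it is invented) of what a naive interpreter would have to do on each inst-var read bytecode:

```python
# Hypothetical sketch of the per-access check that the long-form
# bytecode recompile avoids. A "married" context's first fields
# (roughly: sender, pc, stack pointer) live in the stack frame
# rather than in the object, so a naive inst-var read must branch
# on the receiver's class every single time.

class Obj:
    def __init__(self, slots, is_context=False, frame=None):
        self.slots = slots
        self.is_context = is_context
        self.frame = frame  # simulated married stack frame, if any

def push_inst_var(stack, receiver, index):
    """Naive interpreter action for a 'push inst var' bytecode."""
    if receiver.is_context and index <= 2 and receiver.frame is not None:
        # Slow path: fetch through the married frame, not the object.
        stack.append(receiver.frame[index])
    else:
        # Common fast path, taken by every non-Context receiver --
        # yet the class check above is still paid on each access.
        stack.append(receiver.slots[index])

stack = []
plain = Obj(slots=["a", "b", "c"])
ctx = Obj(slots=[None, None, None], is_context=True,
          frame=["sender", "pc", "sp"])
push_inst_var(stack, plain, 1)
push_inst_var(stack, ctx, 0)
print(stack)  # -> ['b', 'sender']
```

Recompiling Context inst var accesses to dedicated long-form bytecodes moves the branch out of the common path: ordinary accesses never pay the check at all.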
Of course; a much better architecture!

> The stack cache is divided up into 32 word blocks and can hold parts of
> stacks from several threads at once.

Cool :)

> address of tN in the stack cache:

A different world. Cool!
In reply to this post by Eliot Miranda-2
2009/5/14 Jecel Assumpcao Jr <[hidden email]>:
>
> Thanks Eliot and Igor for your comments!
>
> Eliot is right that what Igor wrote is a good way to do a JIT but the
> problem is that my hardware is essentially an interpreter and must deal
> at runtime with expressions that the JIT would optimize away.

Yes, I proposed it for the JIT only. For a bytecode interpreter such a separation could really be much less effective. What is bad is that testing the idea (how much more, or less, effective it is compared to a single stack) requires a huge implementation effort. So it is a bit risky to spend many hours changing the code generator & data formats only to discover that in practice it gives much smaller benefits than expected :)

[snip]

--
Best regards,
Igor Stasenko AKA sig.
In reply to this post by Eliot Miranda-2
Eliot,

> Have you got an URL for that bytecode set? If not, can you mail
> me some code or documentation that describes it?

I have a very messy page with several different instruction set designs, including the two used for Self, at:

http://www.merlintec.com:8080/software/3

The Self stuff is near the bottom and easy to miss, but it is simple enough that I can describe it here so you don't have to bother with that page.

The original (Self 4.1 and earlier) bytecodes had a 3 bit op field followed by a 5 bit data field. The 8 operations were: extend, pushSelf, pushLiteral, nonLocalReturn, setDirectee, send, selfSend and resend. The extend is a prefix that will add its 5 bits of data to the following bytecode. Note that the pushSelf and nonLocalReturn bytecodes ignored their data field. The resend was like super in Squeak, and since we have multiple inheritance it could be preceded by setDirectee to limit lookup to a specific parent. One important detail is that the pushLiteral bytecode checked for blocks and actually created a closure and pushed that instead. All variable access is done through the selfSend bytecode. Note that selfSend #self does the same thing as pushSelf, because all Contexts have a slot named "self", so this latter bytecode is not really needed (but does save an entry in the literal array).

For Self 4.1.2 the bytecode set was changed to make it more interpreter friendly. It now has a 4 bit op field followed by a 4 bit data field. The 16 operations are: extend, pushLiteral, send, selfSend, extra, readLocal, writeLocal, lexicalLevel, branchAlways, branchIfTrue, branchIfFalse, branchIndexed, delegatee, undefined, undefined and undefined. Only the first four extra operations are defined: pushSelf, pop, nonLocalReturn, undirectedResend. The lexicalLevel bytecode would change the meaning of the following readLocal or writeLocal.
Resending is a bit different, with the delegatee bytecode not only setting the parent for lookup but also changing the following send into a resend.

The original bytecodes are described in various papers and theses, but the new ones are not well documented except in the code itself. I mentioned that these are similar to the current Little Smalltalk bytecodes, which are a bit different from the original LST instruction set described in the book. The new LST bytecodes are only described in the source code for the VM and also have a 4 bit op field and 4 bit data field, with the ops: extended, pushInstance, pushArgument, pushTemporary, pushLiteral, pushConstant, assignInstance, assignTemporary, markArguments, sendMessage, sendUnary, sendBinary, pushBlock, doPrimitive and doSpecial. The first 12 special instructions are defined: selfReturn, stackReturn, blockReturn, duplicate, popTop, branch, branchIfTrue, branchIfFalse, sendToSuper and breakPoint. The branch bytecodes don't have a data field but seem to use the following byte as their data.

This is all far more understandable when I am not trying to be so compact, of course. I always like to see instruction sets in table form, for example.

-- Jecel
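For concreteness, the original 3-bit-op/5-bit-data Self format described above can be sketched as a small Python decoder. The shift-and-or accumulation for the extend prefix is my assumption (the description only says extend "adds its 5 bits of data to the following bytecode"), and the example byte values are invented:

```python
# Sketch of a decoder for the original (Self 4.1 and earlier)
# bytecode format: 3-bit op in the high bits, 5-bit data in the low
# bits. Assumption: each extend prefix contributes its data field as
# higher-order bits of the next bytecode's data field.
OPS = ["extend", "pushSelf", "pushLiteral", "nonLocalReturn",
       "setDirectee", "send", "selfSend", "resend"]

def decode(bytecodes):
    """Yield (op, data) pairs, folding extend prefixes into the data.

    Note: pushSelf and nonLocalReturn ignored their data field in
    the real machine; this sketch reports it anyway.
    """
    extended = 0
    for byte in bytecodes:
        op, data = byte >> 5, byte & 0x1F
        if OPS[op] == "extend":
            extended = (extended << 5) | data  # accumulate prefix bits
            continue
        yield OPS[op], (extended << 5) | data
        extended = 0

# extend(1) followed by selfSend(1) decodes as selfSend of index 33,
# i.e. a literal index too large for a single 5-bit data field.
print(list(decode([0x01, 0xC1])))  # -> [('selfSend', 33)]
```

This also shows why the format is so interpreter friendly: decode is a shift, a mask and a table lookup, with only the extend prefix carrying state between bytecodes.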