I have read the description of the stack vm again:

> http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/

It seems to me that it would be possible to keep the arguments and the remaining temporaries together. This would require keeping the numTemps in the flags word instead of the numArgs. It would also mean moving the code to nil out the temps towards the beginning of #internalActivateNewMethod and #activateNewClosureMethod:numArgs: (before pushing the IP). The idea is that then #temporary:in: would become simpler, which is important since it is a very popular operation.

This change would make it a little harder to fix the bug in #marryFrame: but I don't see any other changes that would be needed. Is there some important design issue with keeping the arguments and temps separate that I am missing? I can imagine a compiler that avoids initially filling out the temps with nils by creating them lazily on first assignment, and that wouldn't work with my change. But I don't know if this is a planned feature (it complicates reflection a bit since you have to fake the uninitialized temps in order not to confuse the debugger).

I found the use of a one-byte flag to indicate that the context pointer is valid interesting, since it seems to me that the same information is available by looking at the pointer itself (nil or not). Is this just a performance issue, or are there situations where the flag can be zero but the pointer is not nil?

Some small details that seem like errors to me but could be a lack of understanding on my part:

- the drawing above the definition of "activateNewClosureMethod: blockClosure numArgs: numArgs" shows the flag word right after fp, but the code indicates that it should be a method pointer instead.
- callersFPOrNull is used in #commonCallerReturn but not assigned to. Ah - I see that the first definition near the top of the text is incomplete. The real definition in the discussion about return does have the assignment.

Cheers,
-- Jecel
Hi Jecel,

On Wed, May 13, 2009 at 3:46 PM, Jecel Assumpcao Jr <[hidden email]> wrote:
But the split arguments/temporaries organization is determined by frame build order, and the anticipation of a JIT.
The caller pushes receiver and arguments, and in a JIT then pushes the return pc (as part of the call instruction that makes up the send). Then a frame is built at the end of which temporaries are initialized. In the JIT there will always be at least the return pc/caller's saved instruction pointer between the arguments and the temporaries, and since the offsets to access each can be determined at compile time the cost of separating them is low.
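As a rough illustration of this frame build order (not actual VM code; all names and the upward-growing layout are simplifications for readability), the split arguments/temporaries layout and the compile-time offsets look like this:

```python
# Model of the split frame layout described above: the caller pushes
# receiver and arguments, the call pushes the return pc, then the callee
# builds a frame and nils out its temporaries. Because the saved pc sits
# between them, argument and temp offsets are both fixed at compile time.

def build_frame(stack, receiver, args, num_temps):
    """Push one activation; return the frame pointer (index of saved pc)."""
    stack.append(receiver)
    stack.extend(args)                 # caller's part: receiver + arguments
    stack.append("return-pc")          # pushed by the call instruction
    fp = len(stack) - 1                # frame pointer points at the saved pc
    stack.extend([None] * num_temps)   # callee initializes temps to nil
    return fp

def arg_at(stack, fp, num_args, i):
    # arguments sit just below the saved pc, at a fixed offset from fp
    return stack[fp - num_args + i]

def temp_at(stack, fp, i):
    # temporaries sit just above the saved pc, also at a fixed offset
    return stack[fp + 1 + i]
```

This is why the cost of the separation is low: neither accessor needs a runtime check of where the split falls.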
So while one could, I don't see that it's worthwhile. Even if one did keep the arguments and temporaries together, one would still have the stack contents separate from the arguments and temporaries, and temporary access bytecodes can still access those, so arguably one would still have to check the index against the temp count.
In practice the cost is not that high in the stack vm, or at least the stack vm is still substantially faster than the context vm with arguments and temporaries split. And in the JIT it really isn't a performance issue at all.
> This change would make it a little harder to fix the bug in #marryFrame:

It was my original intent to avoid having to write the context pointer field. It used to be important to avoid unnecessary writes, but on current processors with good cache interfaces I don't think avoiding the write makes any difference and I write it anyway in the current VM (which has moved on a little from the blog posts). But it is quicker to test the flag than to test against nil, simply because the reference to nil is an arbitrary 32-bit value, not simply 0 or 1.
In the JIT the flag is a bit in the method reference's LSBs and is set for free on frame build.

> Some small details that seem like errors to me but could be a lack of
> understanding on my part

You're right. I missed out the methods by accident in the left-hand stack page.

> - callersFPOrNull is used in #commonCallerReturn but not assigned to. Ah

Thanks!
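The trick of carrying a flag in a pointer's low bits can be sketched generically. The mask value and helper names below are assumptions for illustration, not the actual Cog encoding; the underlying idea is only that word-aligned object references leave their low bits free:

```python
# Sketch of a low-bit flag in an aligned reference: testing it is a
# single AND + branch, cheaper than comparing against the arbitrary
# address of nil, and it can be set "for free" when the field is written.

HAS_CONTEXT_FLAG = 0x1  # assumed mask: low bit marks "context pointer valid"

def with_flag(method_oop):
    # set the flag as part of writing the method field on frame build
    return method_oop | HAS_CONTEXT_FLAG

def flag_is_set(field):
    return (field & HAS_CONTEXT_FLAG) != 0

def method_oop_of(field):
    # mask the flag back off to recover the aligned reference
    return field & ~HAS_CONTEXT_FLAG
```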
In reply to this post by Jecel Assumpcao Jr
On Wed, May 13, 2009 at 3:46 PM, Jecel Assumpcao Jr <[hidden email]> wrote:
[snip]

Fixed. Thanks again!
In reply to this post by Eliot Miranda-2
Eliot,

> [JIT code uses call which pushes PC first]

Ok, so this can't be helped.

> So while one could I don't see that its worth-while. Even if one did
> keep the arguments and temporaries together one would still have the
> stack contents separate from the arguments and temporaries and
> temporary access bytecodes can still access those so arguably one
> would still have to check the index against the temp count.

Really? I wouldn't expect the compiler to ever generate such bytecodes and so wasn't too worried if the VM did the wrong thing in this situation.

> In the JIT the flag is a bit in the method reference's LSBs and is set for free
> on frame build.

That sounds like a neat trick. Are the stack formats for the interpreted stack vm and the jit a little different?

Thanks for the explanations. I haven't figured out how to do this in hardware in a reasonable way and so might have to go with a different design.

-- Jecel
In reply to this post by Eliot Miranda-2
On Wed, May 13, 2009 at 5:05 PM, Jecel Assumpcao Jr <[hidden email]> wrote:
There's a tension between implementing what the current compiler produces and implementing what the instruction set defines. For example should one assume arguments are never written to? I lean on the side of implementing the instruction set.
Yes. In the JIT an interpreted frame needs an extra field to hold the saved bytecode instruction pointer when an interpreted frame calls a machine code frame because the return address is the "return to interpreter trampoline" pc. There is no flag word in a machine code frame. So machine code frames save one word w.r.t. the stack vm and interpreted frames gain a word. But most frames are machine code ones so most of the time one is saving space.
> Thanks for the explanations. I haven't figured out how to do this in
> hardware in a reasonable way and so might have to go with a different design.

I guess that in hardware you can create an instruction that will load a descriptor register as part of the return sequence in parallel with restoring the frame pointer and method so one would never indirect through the frame pointer to fetch the flags word; instead it would be part of the register state. But that's an extremely uneducated guess :)
2009/5/14 Eliot Miranda <[hidden email]>:
[snip]

The one thing you mentioned is separation of data stack & code stack (use separate space to push args & temps, and another space to push VM/JIT-specific state, such as context pointers & return addresses).

In this way, there is no big issue with this separation, because to access the state in both variants you still have to track two pointers - namely the stack pointer and frame pointer. But such a separation could be very helpful to handle the uninitialized temps.

To call a method you usually push args (sp decreasing - if the stack is organized in bottom-to-top order), then push the return address and save sp & other required context state (fp decreasing). And then you are ready to activate the new method. Now there are 2 variants: at the time when you enter the new method, you can simply reserve the stack space for its temps, OR leave the sp unchanged but organize the code in such a way that each time a temp gets initialized, you decrease sp. Then, at each point of execution, if the debugger needs to determine whether some of the method's temps are uninitialized, it can simply check that the context's saved sp is <= the temp pointer.

This of course makes things more complex, because you should reorganize the temps in order of their initialization, i.e.

  | a b |
  b := 5.
  a := 6.

to make the pointer to a be < the pointer to b. As well, you should not alter the sp when "pushing" arguments (otherwise the debugger will assume that you have already initialized the temps).

But it allows you to conserve stack space between method calls:

  someMethod
    | a b |
    "1" b := 5.
    "2" self foo.
    "3" a := 6.

at 1) you decrement sp by pushing the initialized temp value of b
at 2) you save the sp, then increase the sp by the number of arguments for the method (foo) and activate the new method; on return you restore sp
at 3) you push a newly initialized value, which leads to another decrement of sp.

And, as I said, if you suspend the process in the middle of the "self foo" call and inspect the context of someMethod, the debugger can easily see that its saved sp is too small to hold all the temps, and if the user wants to see the current uninitialized value of 'a', then the answer is nil, because the offset of 'a' is greater than the currently saved sp.

There's another cost, of course - if a method has 10 temps, then the generated code needs 10 pushes for the initializations, in contrast to a single instruction for reserving space at the method's activation, where you simply decrease the sp by a known constant. But I think there is not much difference: in both variants you have to write to some memory location, and there is just a little difference in how you calculate the address at the initial store, i.e. doing

  push reg
or
  mov [fp + offset] <- reg

So the major pros are:
- conserving stack space
- easy to determine uninitialized temps for the debugger

and the cons are:
- this could complicate the code generation

-- 
Best regards,
Igor Stasenko AKA sig.
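A minimal simulation of this scheme (all names invented; a dict stands in for stack slots) shows how a debugger could answer nil for temps whose slots lie beyond the saved sp:

```python
# Simulation of lazy temp initialization: sp counts only the temp slots
# that have actually been assigned, and temps are laid out in order of
# first assignment. The debugger answers nil for any slot >= sp.

class Frame:
    def __init__(self, temp_names):
        # temps must be ordered by first assignment (e.g. b before a)
        self.temp_names = temp_names
        self.values = {}
        self.sp = 0                # number of temp slots "pushed" so far

    def assign(self, name, value):
        idx = self.temp_names.index(name)
        if name not in self.values:
            assert idx == self.sp, "temps must initialize in layout order"
            self.sp += 1           # the first store is a push: sp moves
        self.values[name] = value

    def debugger_read(self, name):
        idx = self.temp_names.index(name)
        if idx >= self.sp:         # slot beyond the saved sp
            return None            # debugger answers nil
        return self.values[name]
```

The assertion inside `assign` makes the compiler's obligation explicit: the scheme only works if code generation orders the temp slots to match first-assignment order.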
2009/5/14 Igor Stasenko <[hidden email]>:
[snip]

> So the major pros are:
> - conserving stack space
> - easy to determine uninitialized temps for the debugger

[...] stored at location > sp is left uninitialized. This simplifies the GC code for visiting the stack to a simple loop from the beginning of the stack to the current stack pointer. (Visiting the code stack, tracked by fp, is done separately.)

-- 
Best regards,
Igor Stasenko AKA sig.
In reply to this post by Igor Stasenko
Hi Igor,

On Wed, May 13, 2009 at 6:18 PM, Igor Stasenko <[hidden email]> wrote:
I think the "official" names used in Forth implementations are "control stack" and "operand stack".
> In this way, there are no big issue with this separation, because to
> access the state in both variants you still have to track two pointers
> - namely stack pointer and frame pointer.

I think you end up with three pointers. There's a frame pointer into the control stack, a stack pointer to the receiver/0th temp for that frame on the operand stack, and a stack pointer proper to the top of stack on the operand stack. An operand stack page needs to be large enough to hold the maximum amount of stack space used by as many frames as can fit on the control stack, and this is an issue because one ends up wasting a lot of space, something the single stack organization avoids.
> But such a separation could be very helpful to handle the unitialized temps.

Right. I took this approach in my BrouHaHa VMs, with the simplification that there was only one stack page. After becoming familiar with HPS's organization I prefer it. It fits better with conventional hardware. But I think Jecel could use the split stack approach profitably on custom hardware where having 3 registers (frame, stack base and top of stack) is no problem, and where memory is still cheap. It's fine to throw transistors at the operand stack as that memory will be made frequent use of, even if one is wasting 25% of it most of the time.
Jecel can also design the machine to avoid taking interrupts on the operand stack and provide a separate interrupt stack.
2009/5/14 Eliot Miranda <[hidden email]>:
[snip]

> I think you end up with three pointers. There's a frame pointer into the control stack, a stack pointer to the receiver/0th temp for that frame on the operand stack and a stack pointer proper to the top of stack on the operand stack.

Maybe you are right, but I think this can be avoided, because at each call site you know exactly the number of arguments and the order in which they will be pushed on the stack. So to make the send

  self x: 1 y: 2

you can do:

  [sp - 4] <- self
  [sp - 8] <- 1
  [sp - 12] <- 2

on a 32-bit machine. In case you have nested sends:

  self x: (12 + 5) y: 2

" at the moment of preparing to send 12+5 "
  [sp - 4] <- self
  [sp - 8] <- 12
  [sp - 12] <- 5

" at the moment of preparing to send #x:y: "
  [sp - 4] <- self
  [sp - 8] <- 17
  [sp - 12] <- 2

and it is even more complex if you have a temp initialization in the middle:

  | temp |
  self x: 2 y: (temp := 5)

here you have to reserve the space for 'temp' first (push nil), before storing 'self' at [sp - 4].

Are there other cases when you need to know the 'real' top of the operand stack for a given context, except by the code flow? I mean, for the debugger, it will see the initialized temps in the caller's context and the arguments in the callee's context - so there is no problem with losing introspection abilities.

> An operand stack page needs to be large enough to hold the maximum amount of stack space used by as many frames as can fit on the control stack and this is an issue because one ends up wasting a lot of space, something the single stack organization avoids.

I can't catch up with this statement. Do you mean that you have to maintain 2 stack spaces instead of 1 (and consequently double the overhead of reserving extra space to prevent occasional overflow)? And check 2 stacks before proceeding with a call instead of just one. But from the other side, control stack frames always have a constant size - so it is very easy to predict how deep the call chain can go before you need to allocate another page for the control stack. This means that you can check the control stack 1/4, 1/8, 1/16 times as often as the depth of the call chain changes, to minimize this overhead.

-- 
Best regards,
Igor Stasenko AKA sig.
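Igor's point that a third pointer can be avoided - the compiler knows at each call site how many argument slots it is about to write, so it stores them at fixed negative offsets from sp instead of pushing - can be sketched as follows. The word size of 4 matches his [sp - 4] / [sp - 8] example; the helper name and the dict standing in for memory are illustrative assumptions:

```python
# Sketch of writing call arguments at fixed negative offsets from sp,
# without moving sp: for `self x: 1 y: 2` this stores self at [sp - 4],
# 1 at [sp - 8], and 2 at [sp - 12], exactly as in the mail.

WORD = 4

def write_call_args(memory, sp, receiver_and_args):
    for i, value in enumerate(receiver_and_args):
        memory[sp - WORD * (i + 1)] = value
    return memory
```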
On Wed, May 13, 2009 at 7:08 PM, Igor Stasenko <[hidden email]> wrote:
Hang on; I think you're changing the context. Jecel and I (I think) have been discussing how to avoid checking the number of arguments on each temp var access on a machine where one has push/Store/StorePopTempVarN bytecodes. You're discussing a more general instruction set now. Yes, if you use a conventional instruction set where you've computed the offsets of all variables relative to the stack pointer at all points in the computation then you can do this; this is what gcc does when one says -fomit-frame-pointer. But that instruction set is not a 1 token = 1 instruction set like a bytecode set and so is difficult to compile for, difficult to decompile, etc.
Yes, but you don't have to check 2 stacks instead of one if the operand stack is large enough. i.e. if a control stack page holds N frames and the maximum stack space per frame is M then if an operand stack page is M * N it can't overflow before the control stack page does. But if the average stack space per frame is much less than the maximum (which it is) then one ends up wasting a lot of space. This is not a big issue if memory is cheap. I expect in Jecel's process technology memory isn't that expensive and would also expect that since an operand stack page would be heavily used it is good use for the memory.
And in hardware the check would be done in parallel with other instructions so it would effectively be free. One only pays for moving the overflow state from one page to the next.
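Eliot's sizing argument can be checked with back-of-envelope arithmetic. The numbers in the test are made up for illustration; only the relationship (page size M * N, waste growing as average use falls below the maximum) comes from the mail:

```python
# If a control stack page holds N frames and a frame can use at most M
# words of operand stack, an operand page of M * N words cannot overflow
# before the control page does - but when the average use per frame is
# much less than M, most of that operand page sits idle.

def operand_page_words(frames_per_control_page, max_words_per_frame):
    return frames_per_control_page * max_words_per_frame

def wasted_fraction(avg_words_per_frame, max_words_per_frame):
    return 1.0 - avg_words_per_frame / max_words_per_frame
```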
cheers Eliot
Thanks Eliot and Igor for your comments! Eliot is right that what Igor wrote is a good way to do a JIT but the problem is that my hardware is essentially an interpreter and must deal at runtime with expressions that the JIT would optimize away. My design is described in: http://www.squeakphone.com:8203/seaside/pier/siliconsqueak/details What is missing from that description is what the control registers (x0 to x31) do, and that depends on the stack organization. Certainly it is possible to have three registers as Eliot suggested. A separate control stack is not only used on Forth processors but was also popular in Lisp Machine designs. Registers t0 to t29 are really implemented through a small amount of hardware as a range of words in the stack cache. Having to split this set differently for each method would make the hardware larger and slower. With a separate control stack the mapping is really simple. About the JIT having to use call which pushes the PC, that is not true on RISC processors. But I suppose the focus for Cog is making Squeak fast on the x86. > There's a tension between implementing what the current compiler > produces and implementing what the instruction set defines. For > example should one assume arguments are never written to? I lean > on the side of implementing the instruction set. That is specially a good idea if the same VM ends up being used for other languages like Newspeak. Certainly the Java VM runs many languages that the original designers never expected. > Yes. In the JIT an interpreted frame needs an extra field to hold > the saved bytecode instruction pointer when an interpreted frame > calls a machine code frame because the return address is the "return > to interpreter trampoline" pc. There is no flag word in a machine > code frame. So machine code frames save one word w.r.t. the > stack vm and interpreted frames gain a word. But most frames > are machine code ones so most of the time one is saving space. 
Ok, so the JIT VM will still have an interpreter. Self originally didn't have one but most of the effort that David Ungar put into the project when it was restarted was making it more interpreter friendly. The bytecodes became very similar to the ones in Little Smalltalk, for example. Will images be compatible between the JIT VM and the Stack VM? Or do you expect the latter to not be used anymore once the JIT is available? I had originally understood that the Stack VM would be compatible with older images (since you divorce all frames on save and remarry them on reload) but I had missed the detail of the different bytecodes for variable instance access in the case of Context objects. > I guess that in hardware you can create an instruction that will > load a descriptor register as part of the return sequence in parallel > with restoring the frame pointer and method so one would never > indirect through the frame pointer to fetch the flags word; instead > it would be part of the register state. But that's an extremely > uneducated guess :) Well, I am trying to avoid having a flags word even though in hardware it is so easy to have any size fields that you might want. I can check if context == nil very efficiently. For methods, t0 is the same value as the "self" register (x4, for example) while for blocks it is different. And with three pointers (fp, sp and control pointer) I shouldn't need to keep track of the number of arguments. > Jecel can also design the machine to avoid taking interrupts on > the operand stack and provide a separate interrupt stack. Hmmm... it has been a while since I designed hardware with interrupts, but have normally used Alto style coroutines instead. The stack cache is divided up into 32 word blocks and can hold parts of stacks from several threads at once. Checking for overflow/underflow only needs to happen when the stack pointer moves from one block to a different one (and even then, only in certain cases which aren't too common). 
An interesting feature of this scheme is that only 5 bit adders are needed, which are much faster than 16 or 32 bit adders, for example (wide adders could reduce the clock speed or make the operation take an extra clock). Another detail is that having operand or control frames split between two stack pages is no problem at all.

Address of tN in the stack cache:

  raddr := fp[4:0] + N.
  scaddr := (raddr[5] ? tHigh : tLow) , raddr[4:0].

When fp[5] changes value, then tLow := tHigh and tHigh := head of the free block list (if fp was going up). If there are no free blocks, then some have to be flushed to their stack pages in main memory. When going down, tLow is loaded from a linked list, which might have to be extended by loading blocks from stack pages. With a 4KB stack cache, for example, there are 32 blocks of 32 words each, and so block 0 can dedicate a word to each of the other 31 blocks. The bottom 7 bits of that word (only 5 are actually needed, but it is nice to have a little room to grow) can form the "previous" linked list (tHigh and tLow would also be 5 bits wide) while the remaining bits can hold the block's address in main memory.

This might seem far more complicated than a split arg/temp frame and it certainly would be if implemented in software. In hardware, it is mostly wires, a multiplexer and a small adder.

-- Jecel
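The tN address computation described above can be modeled in a few lines of Python. This is a sketch under my own reading of the scheme, not hardware source: tHigh and tLow hold the 5-bit block numbers of the two cache blocks that may contain the current frame's temporaries, and bit 5 of the 5-bit add is treated as the carry that selects between them.

```python
def stack_cache_address(fp, n, t_high, t_low):
    """Model of the 'address of tN' computation in the stack cache.

    fp: frame pointer (only its low 6 bits matter here)
    n: temporary index (t0..t29)
    t_high, t_low: 5-bit block numbers in the 32-block cache
    Returns a (block, offset) pair into the 32x32-word cache.
    """
    raddr = (fp & 0x1F) + n           # 5-bit add; bit 5 is the carry out
    block = t_high if raddr & 0x20 else t_low
    return block, raddr & 0x1F        # block number , 5-bit offset

# Example: a frame starting at word 30 of its block means t3
# spills over into the "high" block at offset 1.
print(stack_cache_address(30, 3, t_high=7, t_low=2))  # -> (7, 1)
```

The point of the model is visible in the example: crossing a block boundary changes only which 5-bit block number is concatenated, so no wide adder is ever needed.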
In reply to this post by Eliot Miranda-2
Hi Jecel, you ask some pointed questions about the Stack & JIT Cog VMs that need good answers and are, I suspect, generally interesting, so I'll answer on squeak-dev as well.
On Thu, May 14, 2009 at 4:36 AM, Jecel Assumpcao Jr <[hidden email]> wrote: [snip]
At least in the first version of the JIT, yes. There are a couple of restrictions:

- the code that rebuilds the Character table in SystemDictionary>>recreateSpecialObjectsArray has to be replaced by code that simply copies the table. This is because the JIT caches the Character table's oop in the generated machine code for the at: primitive
- the use (e.g. in Pharo) of the SmallInteger primitives 1 through 17 on LargePositiveInteger must be replaced by the relevant large integer primitives, which will be included in both the Stack and JIT Cog VMs.
But I do expect to evolve the object representation quite soon, and this will break image backward-compatibility altogether. However, I will want to produce a Stack VM that is compatible with the new format, not just a JIT VM, because platform spread is very important to maintain and it'll be a while before more ISAs than x86 are supported.
> Or do you expect the latter to not be used anymore once the JIT is available?

I expect that an interpreted Stack VM will survive for a long time and be the standard VM on ISAs such as SPARC, Itanium and MIPS, while the JIT will become available probably in the following order: IA32/x86, ARM32, x86-64, PowerPC.
> I had originally understood that the Stack VM would be compatible with
> older images

No, the Stack VM is not compatible with older images. The Context VM with closure bytecode extensions (e.g. John's current 4.0 beta VM) is backward-compatible and is the bridge over which one can bring forward pre-closure images to the Stack VM and JIT VM, which are Closure-only VMs.
One could make the Stack VM run older images but it would be difficult (and hence probably buggy) and I think unnecessary. One can always run older images on older VMs for as long as OSs support the older executables (and that other VM technology makes that easier these days). One can migrate an older image forward to the Stack VM using the bootstrap and the Context VM with Closure extensions. Packaging technologies such as Monticello make images much less important these days. One should expect to be able to export one's code in the form of a package and load it into a newer image, perhaps with a little massage. So keeping those dusty decks running on up-to-date VMs is IMO both wasted effort and a hindrance to obtaining maximum performance on newer VMs.
Best,
Eliot
In reply to this post by Eliot Miranda-2
On Thu, May 14, 2009 at 4:36 AM, Jecel Assumpcao Jr <[hidden email]> wrote:
Right, but also for sanity the run-time support code needs to be ISA-agnostic. So the frame organization will be unchanged between different ISAs, and that means even on RISCs the return pc will be pushed; perhaps not in frameless leaf methods, but definitely in normal frames. Hence it might as well be pushed in the same place, and incremented by the size of the call instruction, instead of having the return instruction add the offset at return time.
Good point.
Have you got a URL for that bytecode set? If not, can you mail me some code or documentation that describes it?

> Will images be compatible between the JIT VM and the Stack VM? Or do you
> expect the latter to not be used anymore once the JIT is available?

Most of this I've answered in a different message. Yes, accessing Context inst vars needs special treatment. One can add the test to all relevant inst var accesses (only inst vars 0 through 2 are affected on read and 0 through 5 on write) but that would be very slow. Hence the recompile to use the long-form bytecodes is necessary. This whole treatment of Context inst var access is complex, but IMO it is better to hide it in the VM behind the primitive and long-form IV bytecode interface than to expose the internals to the image, because that makes the bootstrap process easier and makes evolving the VM easier too.
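To illustrate why adding the test to every access would be slow, here is a hypothetical Python sketch (not Cog source; every name in it is invented) of what a naive interpreter would have to do on each inst-var read bytecode:

```python
# Hypothetical sketch of the per-access check that the long-form
# bytecode recompile avoids. A "married" context's first fields
# (roughly: sender, pc, stack pointer) live in the stack frame
# rather than in the object, so a naive inst-var read must branch
# on the receiver's class every single time.

class Obj:
    def __init__(self, slots, is_context=False, frame=None):
        self.slots = slots
        self.is_context = is_context
        self.frame = frame  # simulated married stack frame, if any

def push_inst_var(stack, receiver, index):
    """Naive interpreter action for a 'push inst var' bytecode."""
    if receiver.is_context and index <= 2 and receiver.frame is not None:
        # Slow path: fetch through the married frame, not the object.
        stack.append(receiver.frame[index])
    else:
        # Common fast path, taken by every non-Context receiver --
        # yet the class check above is still paid on each access.
        stack.append(receiver.slots[index])

stack = []
plain = Obj(slots=["a", "b", "c"])
ctx = Obj(slots=[None, None, None], is_context=True,
          frame=["sender", "pc", "sp"])
push_inst_var(stack, plain, 1)
push_inst_var(stack, ctx, 0)
print(stack)  # -> ['b', 'sender']
```

Recompiling Context inst var accesses to dedicated long-form bytecodes moves the branch out of the common path: ordinary accesses never pay the check at all.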
Of course; a much better architecture!

> The stack cache is divided up into 32 word blocks and can hold parts of
> stacks from several threads at once.

Cool :)

> address of tN in the stack cache:

A different world. Cool!
In reply to this post by Eliot Miranda-2
2009/5/14 Jecel Assumpcao Jr <[hidden email]>:
>
> Thanks Eliot and Igor for your comments!
>
> Eliot is right that what Igor wrote is a good way to do a JIT but the
> problem is that my hardware is essentially an interpreter and must deal
> at runtime with expressions that the JIT would optimize away.

Yes, I proposed it for the JIT only. For a bytecode interpreter such a separation could really be much less effective. What is bad is that testing the idea (how much more, or less, effective it is compared to a single stack) requires a huge implementation effort. So it is a bit risky to spend many hours changing the code generator & data formats only to discover that in practice it gives much smaller benefits than expected :)

[snip]

--
Best regards,
Igor Stasenko AKA sig.
In reply to this post by Eliot Miranda-2
Eliot,

> Have you got an URL for that bytecode set? If not, can you mail
> me some code or documentation that describes it?

I have a very messy page with several different instruction set designs, including the two used for Self, at:

http://www.merlintec.com:8080/software/3

The Self stuff is near the bottom and easy to miss, but it is simple enough that I can describe it here so you don't have to bother with that page.

The original (Self 4.1 and earlier) bytecodes had a 3 bit op field followed by a 5 bit data field. The 8 operations were: extend, pushSelf, pushLiteral, nonLocalReturn, setDirectee, send, selfSend and resend. The extend is a prefix that will add its 5 bits of data to the following bytecode. Note that the pushSelf and nonLocalReturn bytecodes ignored their data field. The resend was like super in Squeak, and since we have multiple inheritance it could be preceded by setDirectee to limit lookup to a specific parent. One important detail is that the pushLiteral bytecode checked for blocks and actually created a closure and pushed that instead. All variable access is done through the selfSend bytecode. Note that selfSend #self does the same thing as pushSelf, because all Contexts have a slot named "self", so this latter bytecode is not really needed (but does save an entry in the literal array).

For Self 4.1.2 the bytecode set was changed to make it more interpreter friendly. It now has a 4 bit op field followed by a 4 bit data field. The 16 operations are: extend, pushLiteral, send, selfSend, extra, readLocal, writeLocal, lexicalLevel, branchAlways, branchIfTrue, branchIfFalse, branchIndexed, delegatee, undefined, undefined and undefined. Only the first four extra operations are defined: pushSelf, pop, nonLocalReturn, undirectedResend. The lexicalLevel bytecode would change the meaning of the following readLocal or writeLocal.
Resending is a bit different, with the delegatee bytecode not only setting the parent for lookup but also changing the following send into a resend.

The original bytecodes are described in various papers and theses, but the new ones are not well documented except in the code itself. I mentioned that these are similar to the current Little Smalltalk bytecodes, which are a bit different from the original LST instruction set described in the book. The new LST bytecodes are only described in the source code for the VM and also have a 4 bit op field and 4 bit data field, with the ops: extended, pushInstance, pushArgument, pushTemporary, pushLiteral, pushConstant, assignInstance, assignTemporary, markArguments, sendMessage, sendUnary, sendBinary, pushBlock, doPrimitive and doSpecial. The first 12 special instructions are defined: selfReturn, stackReturn, blockReturn, duplicate, popTop, branch, branchIfTrue, branchIfFalse, sendToSuper and breakPoint. The branch bytecodes don't have a data field but seem to use the following byte as their data.

This is all far more understandable when I am not trying to be so compact, of course. I always like to see instruction sets in table form, for example.

-- Jecel
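For concreteness, the original 3-bit-op/5-bit-data Self format described above can be sketched as a small Python decoder. The shift-and-or accumulation for the extend prefix is my assumption (the description only says extend "adds its 5 bits of data to the following bytecode"), and the example byte values are invented:

```python
# Sketch of a decoder for the original (Self 4.1 and earlier)
# bytecode format: 3-bit op in the high bits, 5-bit data in the low
# bits. Assumption: each extend prefix contributes its data field as
# higher-order bits of the next bytecode's data field.
OPS = ["extend", "pushSelf", "pushLiteral", "nonLocalReturn",
       "setDirectee", "send", "selfSend", "resend"]

def decode(bytecodes):
    """Yield (op, data) pairs, folding extend prefixes into the data.

    Note: pushSelf and nonLocalReturn ignored their data field in
    the real machine; this sketch reports it anyway.
    """
    extended = 0
    for byte in bytecodes:
        op, data = byte >> 5, byte & 0x1F
        if OPS[op] == "extend":
            extended = (extended << 5) | data  # accumulate prefix bits
            continue
        yield OPS[op], (extended << 5) | data
        extended = 0

# extend(1) followed by selfSend(1) decodes as selfSend of index 33,
# i.e. a literal index too large for a single 5-bit data field.
print(list(decode([0x01, 0xC1])))  # -> [('selfSend', 33)]
```

This also shows why the format is so interpreter friendly: decode is a shift, a mask and a table lookup, with only the extend prefix carrying state between bytecodes.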