FFI exception failure support on Win64 (Win32 also?)


FFI exception failure support on Win64 (Win32 also?)

Eliot Miranda-2
 
Hi Windows Experts,

    I'm trying to have the FFI exception failure support work across the board.  What I see now is that the 64-bit StackVM works, while the 64-bit CogVM does not, even though they have exactly the same exception handling machinery in place (the new squeakExceptionHandler machinery; thank you whoever wrote it, it looks very nice), and the same essential architecture for jumping back into the interpreter.

I expect the issue is that the machinery for maintaining a chain through the stack, and/or the stack search used in exception delivery, is broken by my careless management of the C stack in the Cog VM, whereas in the StackVM the C stack remains undisturbed.

To reproduce the crash in the Cog VM and the successful failure in the StackVM in an updated trunk image simply 
- load the FFI (script attached; it does a snapshot and quit)
- run the example at the end of ExceptionInFFICallError's class comment

Any help on debugging this appreciated.
_,,,^..^,,,_
best, Eliot

LoadFFI.st (2K) Download Attachment

Re: FFI exception failure support on Win64 (Win32 also?)

Eliot Miranda-2
 
On Sat, Aug 25, 2018 at 2:22 PM Eliot Miranda <[hidden email]> wrote:
Hi Windows Experts,

    I'm trying to have the FFI exception failure support work across the board.  What I see now is that the 64-bit StackVM works, while the 64-bit CogVM does not, even though they have exactly the same exception handling machinery in place (the new squeakExceptionHandler machinery; thank you whoever wrote it, it looks very nice), and the same essential architecture for jumping back into the interpreter.

I expect the issue is that the machinery for maintaining a chain through the stack, and/or the stack search used in exception delivery, is broken by my careless management of the C stack in the Cog VM, whereas in the StackVM the C stack remains undisturbed.

To reproduce the crash in the Cog VM and the successful failure in the StackVM in an updated trunk image simply 
- load the FFI (script attached; it does a snapshot and quit)
- run the example at the end of ExceptionInFFICallError's class comment

And curiously on 32-bits both StackVM and CogVM fail (but I'm running on 64-bit Windows so that might affect things; I hope not).
 
Any help on debugging this appreciated.
_,,,^..^,,,_
best, Eliot


--
_,,,^..^,,,_
best, Eliot

context change versus primitive failure return value (was: FFI exception failure support on Win64 (Win32 also?))

Ben Coman
In reply to this post by Eliot Miranda-2
 
I've changed the subject since I'm not sure if this is related,
but the description dredged up a memory of a concern I had a while ago regarding the interaction of context-switching and primitive-failure-return-values.  

On Sun, 26 Aug 2018 at 05:23, Eliot Miranda <[hidden email]> wrote:
 
Hi Windows Experts,

    I'm trying to have the FFI exception failure support work across the board.  What I see now is that the 64-bit StackVM works, while the 64-bit CogVM does not, even though they have exactly the same exception handling machinery in place (the new squeakExceptionHandler machinery; thank you whoever wrote it, it looks very nice), and the same essential architecture for jumping back into the interpreter.

I expect the issue is that the machinery for maintaining a chain through the stack, and/or the stack search used in exception delivery, is broken by my careless management of the C stack in the Cog VM, whereas in the StackVM the C stack remains undisturbed.

Back when I was having a go at new mutex primitives, 
when a process "A" failed to lock a mutex, I wanted to return 
a primitive-failed code in addition to the usual context change to a different process "B".
However what I observed was that, because of the context change, 
the primitive-failed code incorrectly ended up returned on the stack of process "B".

I'm stretching my memory so there is a reasonable chance this is misleading...
but I believe I observed this happening in CoInterpreter>>internalExecuteNewMethod  
near this code..

    "slowPrimitiveResponse may of course context-switch. ..."
     succeeded := self slowPrimitiveResponse.
     ...
     succeeded ifTrue: [....

Though I can't exactly put my finger on explaining why, my intuition is that 
changing threads "half way" through a bytecode is a bad thing.   
I started (but lost my way) to develop an idea to propose... 
that rather than any context-changing primitive (e.g. #primitiveWait) directly 
calling CoInterpreter>>transferTo:from:,   it would just flag for #transferTo:from:
to be called at the next interpreter cycle, before the next bytecode is started.  
Thus #internalExecuteNewMethod gets to exit normally, placing any primitive-failure-code 
onto the correct process stack before the context-change.  

Interestingly the comment in #transferTo:from: says... 
     "Record a process to be awoken on the next interpreter cycle."
which sounds like what I'd propose, but actually it doesn't wait for 
the next interpreter cycle and instead immediately changes context.

Philosophically it seems cleaner for context changes to happen 
"between" bytecodes rather than "in the middle" of them,
but I'm unclear on the practical implications.

Also, probably a year later, out of curiosity I was browsing the MT code
and it seemed to do something like what I'd propose, 
however I can't remember that reference.

hope that made some sense,
cheers -ben

Re: context change versus primitive failure return value (was: FFI exception failure support on Win64 (Win32 also?))

Eliot Miranda-2
 
Hi Ben,

On Mon, Aug 27, 2018 at 11:36 AM Ben Coman <[hidden email]> wrote:
 
I've changed the subject since I'm not sure if this is related,
but the description dredged up a memory of a concern I had a while ago regarding the interaction of context-switching and primitive-failure-return-values.  

On Sun, 26 Aug 2018 at 05:23, Eliot Miranda <[hidden email]> wrote:
 
Hi Windows Experts,

    I'm trying to have the FFI exception failure support work across the board.  What I see now is that the 64-bit StackVM works, while the 64-bit CogVM does not, even though they have exactly the same exception handling machinery in place (the new squeakExceptionHandler machinery; thank you whoever wrote it, it looks very nice), and the same essential architecture for jumping back into the interpreter.

I expect the issue is that the machinery for maintaining a chain through the stack, and/or the stack search used in exception delivery, is broken by my careless management of the C stack in the Cog VM, whereas in the StackVM the C stack remains undisturbed.

Back when I was having a go at new mutex primitives, 
when a process "A" failed to lock a mutex, I wanted to return 
a primitive-failed-code in addition to the usual context-change to different process "B"
However what I observed was that because of the context change 
the primitive-failed-code incorrectly ended up returned on the stack of process "B".

Your description doesn't match how (I understand) the VM works.  The only way that Process A can initiate a process switch, mutex lock, et al, is by sending a message to some object (a process, mutex or semaphore).  So we're talking about Process>>suspend & resume as well as the lock/unlock and wait/signal primitives.  Primitive failure *always* delivers a primitive error to the method that contains the primitive and, in this case, initiated the process switch.  Primitives validate their arguments and then succeed, or fail, leaving their arguments undisturbed (there is one regrettable exception to this in the BttF/Cog VM which is the segment loading primitive that leaves its input word array scrambled if a failure occurs, rather than incur the cost of cloning the array).

So the only way that a primitive could fail and have the error code end up on the wrong process's stack would be if the primitive were mis-designed to act before validating.  Essentially a primitive cannot both fail and cause a side effect.  Primitives should be side-effect free when they fail, and hence if a process switch primitive fails, it cannot yet have caused a process switch, and therefore the error code would have to be delivered to process A's stack.
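Eliot's validate-then-act discipline can be sketched in a few lines. This is a toy model in Python, not the real VM's code; the names (Interpreter, primitive_wait, PrimitiveFailed) are illustrative assumptions:

```python
# Toy model of the "validate first, then act" primitive discipline:
# a primitive either fails with NO side effects, or completes atomically.
# All names here are illustrative, not the real VM's.

class PrimitiveFailed(Exception):
    pass

class Interpreter:
    def __init__(self):
        self.active_process = "A"

    def primitive_wait(self, semaphore):
        # Validate *before* causing any side effect: if validation fails,
        # the process switch below cannot have happened yet, so the
        # failure code is necessarily delivered on process A's stack.
        if not isinstance(semaphore, dict) or "excess_signals" not in semaphore:
            raise PrimitiveFailed("bad receiver")   # no side effects so far
        if semaphore["excess_signals"] > 0:
            semaphore["excess_signals"] -= 1        # succeed atomically
        else:
            self.active_process = "B"               # suspend A, resume B

interp = Interpreter()
try:
    interp.primitive_wait("not a semaphore")        # fails validation
except PrimitiveFailed:
    pass
assert interp.active_process == "A"                 # no switch happened
sem = {"excess_signals": 1}
interp.primitive_wait(sem)                          # succeeds, no switch
assert sem["excess_signals"] == 0
```

Because validation happens before any state is touched, a failing call provably leaves the process switch unperformed, which is exactly why the error code can only land on process A's stack.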
 

I'm stretching my memory so there is a reasonable chance this is misleading...
but I believe I observed this happening in CoInterpreter>>internalExecuteNewMethod  
near this code..

    "slowPrimitiveResponse may of course context-switch. ..."
     succeeded := self slowPrimitiveResponse.
     ...
     succeeded ifTrue: [....

But internalExecuteNewMethod doesn't contain the switch code, internalActivateNewMethod does, and it does the process switch *after* delivering the primitive failure code, see reapAndResetErrorCodeTo:header: in the following:

internalActivateNewMethod
	...
	(self methodHeaderHasPrimitive: methodHeader) ifTrue:
		["Skip the CallPrimitive bytecode, if it's there, and store the error code if the method starts
		  with a long store temp.  Strictly no need to skip the store because it's effectively a noop."
		 localIP := localIP + (self sizeOfCallPrimitiveBytecode: methodHeader).
		 primFailCode ~= 0 ifTrue:
			[self reapAndResetErrorCodeTo: localSP header: methodHeader]].

	self assert: (self frameNumArgs: localFP) == argumentCount.
	self assert: (self frameIsBlockActivation: localFP) not.
	self assert: (self frameHasContext: localFP) not.

	"Now check for stack overflow or an event (interrupt, must scavenge, etc)."
	localSP < stackLimit ifTrue:
		[self externalizeIPandSP.
		 switched := self handleStackOverflowOrEventAllowContextSwitch:
						(self canContextSwitchIfActivating: newMethod header: methodHeader).
		 self returnToExecutive: true postContextSwitch: switched.
		 self internalizeIPandSP]


Though I can't exactly put my finger on explaining why, my intuition is that 
changing threads "half way" through a bytecode is a bad thing. 

Indeed it is, and the VM does not do this.  It is possible that the execution simulation machinery in Context, InstructionStream et al could have been written carelessly to allow this to occur, but it cannot and does not occur in the VM proper.
 
I started (but lost my way) to develop an idea to propose... 
that rather than any context-changing primitive (e.g. #primitiveWait) directly 
calling CoInterpreter>>transferTo:from:,   it would just flag for #transferTo:from:
to be called at the next interpreter cycle, before the next bytecode is started.  
Thus #internalExecuteNewMethod gets to exit normally, placing any primitive-failure-code 
onto the correct process stack before the context-change.

Given the design constraint that primitives either fail without side effects or complete atomically this architectural change isn't necessary.
 
Interestingly the comment in #transferTo:from: says... 
     "Record a process to be awoken on the next interpreter cycle."
which sounds like what I'd propose, but actually it doesn't wait for 
the next interpreter cycle and instead immediately changes context.

No it doesn't.  It effects the process change, but control continues with the caller, allowing the caller to do other things before the process resumes.  For example, if in checkForEventsMayContextSwitch: more than one semaphore is signaled (it could initiate any combination of  signals of the low space semaphore, the input semaphore, external semaphores (associated with file descriptors & sockets), and the delay semaphore) then only the highest priority process would be runnable after the sequence, and several transferTo:[from:]'s could have been initiated from these signals, depending on process priority.  But  checkForEventsMayContextSwitch: will not finish mid-sequence.  It will always complete all of its signals before returning to code that can then resume the newly activated process.
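The behaviour Eliot describes here, several signals processed to completion in one pass with only the highest-priority woken process actually resumed afterwards, can be modelled roughly as follows (Python; all names are illustrative stand-ins, not the VM's):

```python
# Sketch of the point above: several semaphores may be signalled during one
# check-for-events pass; each signal may initiate a transferTo:, but the
# pass always runs to completion, so the process resumed at the end is the
# highest-priority one woken, never an intermediate winner.

class Scheduler:
    def __init__(self):
        self.runnable = []               # (priority, name) pairs
        self.active = (10, "interp")     # the currently running process

    def signal(self, priority, name):
        # Wake the process waiting on this semaphore; transferTo: records
        # the best candidate so far, but control RETURNS to the caller --
        # no bytecode of the new process runs yet.
        self.runnable.append((priority, name))
        self.active = max(self.runnable + [self.active])

def check_for_events(sched, pending_signals):
    for prio, name in pending_signals:   # completes ALL signals mid-pass
        sched.signal(prio, name)
    return sched.active                  # only now is a process resumed

s = Scheduler()
winner = check_for_events(s, [(20, "delay"), (60, "lowSpace"), (40, "input")])
assert winner == (60, "lowSpace")        # highest priority wins the pass
```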


Philosophically it seems cleaner for context changes to happen 
"between" bytecodes rather than "in the middle" of them,
but I'm unclear on the practical implications.

That's right and that's what happens.  Context switches only occur at suspension points, and these are only between bytecodes.  In fact, process switches can only occur on sends (a subset of sends: those that aren't implemented as primitives, or are amongst the primitives in which a process switch is allowed, namely the process, mutex, semaphore, and eval (perform, valeOfMethod:, BlockClosure>>value*) primitives) and on backward jumps at the end of loops (this last to prevent infinite loops from blocking response to interrupts).  checkForEventsMayContextSwitch: is invoked only at these points.
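A toy interpreter scan (Python, illustrative only) of the rule just stated: the event check fires only at suspension points, i.e. at sends and backward jumps, never in the middle of a bytecode:

```python
# Identify where checkForEventsMayContextSwitch: would run in a bytecode
# sequence: only at sends and backward jumps. Opcode names are invented
# for illustration; they are not the real bytecode set.

def suspension_points(bytecodes):
    points = []
    for ip, op in enumerate(bytecodes):
        if op in ("send", "jump_back"):
            points.append(ip)       # a context switch may happen here...
        # ...but never at "push", "pop", etc.: a bytecode runs to completion
    return points

trace = suspension_points(["push", "push", "send", "pop", "jump_back"])
assert trace == [2, 4]
```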
 
Also, probably a year later, out of curiosity I was browsing the MT code
and it seemed to do something like what I'd propose, 
however I can't remember that reference.

Again this is about scheduling a thread switch when a process is bound to some other thread.  This is written as a two-level scheduler where the current process completes its event check in checkForEventsMayContextSwitch: before a thread switch occurs prior to resuming the new process.  The MT code may be incorrect in that it's still a work in progress.  I intend it to work as I describe, and if so I'd expect to see a thread switch in CoInterpreter>>MT>>returnToExecutive:postContextSwitch:, but I see no such method.  In which case I have work to do ;-)

hope that made some sense,

Yes, and I hope equally that I've allayed your fears.
 
cheers -ben

_,,,^..^,,,_
best, Eliot

Re: context change versus primitive failure return value (was: FFI exception failure support on Win64 (Win32 also?))

Ben Coman
 
Hi Eliot, Thanks for the detailed response.

On Tue, 28 Aug 2018 at 04:21, Eliot Miranda <[hidden email]> wrote:

On Mon, Aug 27, 2018 at 11:36 AM Ben Coman <[hidden email]> wrote:
Back when I was having a go at new mutex primitives, 
when a process "A" failed to lock a mutex, I wanted to return 
a primitive-failed-code in addition to the usual context-change to different process "B"
However what I observed was that because of the context change 
the primitive-failed-code incorrectly ended up returned on the stack of process "B".

Your description doesn't match how (I understand) the VM works.  The only way that Process A can initiate a process switch, mutex lock, et al, is by sending a message to some object (a process, mutex or semaphore).  So we're talking about Process>>suspend & resume as well as the lock/unlock and wait/signal primitives.  Primitive failure *always* delivers a primitive error to the method that contains the primitive and, in this case, initiated the process switch.  Primitives validate their arguments and then succeed, or fail, leaving their arguments undisturbed (there is one regrettable exception to this in the BttF/Cog VM which is the segment loading primitive that leaves its input word array scrambled if a failure occurs, rather than incur the cost of cloning the array).

So the only way that a primitive could fail and the error code end up on the wrong process's stack would be if the primitive was mis-designed to not validate before occurring. 


Essentially it can not fail and cause a side effect. 

This was my first foray into writing a primitive (still on my backlog to be completed).
I was aware of validating the arguments and leaving them undisturbed for a failure, 
but wasn't paying attention to primitive failure being completely free from side effects. 

Primitives should be side-effect free when they fail and hence if a process switch primitive fails, it cannot yet have caused a process switch and therefore the error code would have to be delivered to process A's stack.

That is kind of the outcome of what I was proposing.  I was trying for a mutex locking primitive that could fail without causing a side effect "in the image".   Maybe it's not valid to distinguish between side effects "inside" or "outside" the image, but I thought it might be reasonable for a flag hidden "in the VM" to be just another event checked by #checkForEventsMayContextSwitch: .  Effectively the "in image" effect happens outside the primitive, a bit like reaching nextWakeupUsecs, or like I imagine a callback might work.
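Ben's flag idea, as restated here, might be modelled like this (Python; a sketch of the proposal, with hypothetical names, not of the actual VM):

```python
# Hypothetical model of the deferred-switch proposal: the primitive only
# sets a flag inside the VM; the switch itself is performed by the event
# check before the next bytecode, so any failure code is delivered on the
# initiating process's stack first.

class VM:
    def __init__(self):
        self.active = "A"
        self.pending_switch = None
        self.stacks = {"A": [], "B": []}

    def primitive_try_lock(self, locked):
        if locked:
            self.pending_switch = "B"    # flag only; no switch yet
            return "fail-code"
        return "ok"

    def deliver(self, result):
        self.stacks[self.active].append(result)   # still process A's stack

    def check_for_events(self):
        if self.pending_switch:
            self.active = self.pending_switch     # switch BETWEEN bytecodes
            self.pending_switch = None

vm = VM()
vm.deliver(vm.primitive_try_lock(locked=True))    # lock held: prim "fails"
vm.check_for_events()
assert vm.stacks["A"] == ["fail-code"]            # error code on A's stack
assert vm.active == "B"                           # switch happened afterwards
```

As Eliot notes above, this extra machinery is unnecessary when primitives obey the fail-without-side-effects rule; the sketch only illustrates what the proposal would change.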

A side thought was that if context changes occurred in a *single* location in #checkForEventsMayContextSwitch:, 
it might be easier to make an "Idle VM".
 
 

I'm stretching my memory so there is a reasonable chance this is misleading...
but I believe I observed this happening in CoInterpreter>>internalExecuteNewMethod  
near this code..

    "slowPrimitiveResponse may of course context-switch. ..."
     succeeded := self slowPrimitiveResponse.
     ...
     succeeded ifTrue: [....

But internalExecuteNewMethod doesn't contain the switch code, internalActivateNewMethod does, and it does the process switch *after* delivering the primitive failure code, see reapAndResetErrorCodeTo:header: in the following:

internalActivateNewMethod
	...
	(self methodHeaderHasPrimitive: methodHeader) ifTrue:
		["Skip the CallPrimitive bytecode, if it's there, and store the error code if the method starts
		  with a long store temp.  Strictly no need to skip the store because it's effectively a noop."
		 localIP := localIP + (self sizeOfCallPrimitiveBytecode: methodHeader).
		 primFailCode ~= 0 ifTrue:
			[self reapAndResetErrorCodeTo: localSP header: methodHeader]].

	self assert: (self frameNumArgs: localFP) == argumentCount.
	self assert: (self frameIsBlockActivation: localFP) not.
	self assert: (self frameHasContext: localFP) not.

	"Now check for stack overflow or an event (interrupt, must scavenge, etc)."
	localSP < stackLimit ifTrue:
		[self externalizeIPandSP.
		 switched := self handleStackOverflowOrEventAllowContextSwitch:
						(self canContextSwitchIfActivating: newMethod header: methodHeader).
		 self returnToExecutive: true postContextSwitch: switched.
		 self internalizeIPandSP]


Though I can't exactly put my finger on explaining why, my intuition is that 
changing threads "half way" through a bytecode is a bad thing. 

Indeed it is, and the VM does not do this.  It is possible that the execution simulation machinery in Context, InstructionStream et al could have been written carelessly to allow this to occur, but it cannot and does not occur in the VM proper.

I made a chart to understand this better. One thing first: I'm not sure I've correctly linked execution of the primitives into slowPrimitiveResponse.  I'm not at all clear about how internalExecuteNewMethod selects between internalQuickPrimitiveResponse 
and slowPrimitiveResponse, and what is the difference between them?

ContextChange-Existing.png

So I understand that checkForEventsMayContextSwitch: called at the end of internalActivateNewMethod
occurs after bytecode execution has completed, so that context switches made there are done "between" bytecodes.
However my perspective is that internalExecuteNewMethod is only half way through a bytecode execution when 
the primitives effect context changes.  So internalActivateNewMethod ends up working on a different Process than internalExecuteNewMethod started with.  The bottom three red lines in the chart are what I considered to be changing threads "half way" through a bytecode.  

Ahh, I'm slowly coming to grips with this.  It was extremely confusing at the time why my failure code from the primitive was turning up in a different Process, though I then learnt a lot digging to discover why.   In summary, if the primitive succeeds it simply returns from internalExecuteNewMethod and internalActivateNewMethod never sees the new Process B.  My problem in violating the primitive-failure side-effect rule was that internalActivateNewMethod, trying to run Process A's in-image primitive-failure code, 
instead ran Process B's in-image primitive-failure code.

 
Interestingly the comment in #transferTo:from: says... 
     "Record a process to be awoken on the next interpreter cycle."
which sounds like what I'd propose, but actually it doesn't wait for 
the next interpreter cycle and instead immediately changes context.

No it doesn't.  It effects the process change, but control continues with the caller, allowing the caller to do other things before the process resumes. 

In my case I believe "control continues with the caller" was not true, since internalActivateNewMethod 
was trying to run the in-image primitive-failure code after the process changed.
But that was because I violated the side-effect rule.

For example, if in checkForEventsMayContextSwitch: more than one semaphore is signaled (it could initiate any combination of  signals of the low space semaphore, the input semaphore, external semaphores (associated with file descriptors & sockets), and the delay semaphore) then only the highest priority process would be runnable after the sequence, and several transferTo:[from:]'s could have been initiated from these signals, depending on process priority.  But  checkForEventsMayContextSwitch: will not finish mid-sequence.  It will always complete all of its signals before returning to code that can then resume the newly activated process.

Just to summarise, to check I understood this correctly: no bytecode is executed during checkForEventsMayContextSwitch:.  
That is, its multiple transferTo: calls don't re-enter the interpreter?  Only which Process is set to run changes,
until at the end of checkForEventsMayContextSwitch: it returns to the interpreter to pick up the next bytecode of the active process.

cheers -ben. 


Re: context change versus primitive failure return value (was: FFI exception failure support on Win64 (Win32 also?))

Eliot Miranda-2
 
Hi Ben, (and hi all prospective VM hackers)
On Tue, Aug 28, 2018 at 12:22 PM Ben Coman <[hidden email]> wrote:
 
Hi Eliot, Thanks for the detailed response.

On Tue, 28 Aug 2018 at 04:21, Eliot Miranda <[hidden email]> wrote:

On Mon, Aug 27, 2018 at 11:36 AM Ben Coman <[hidden email]> wrote:
Back when I was having a go at new mutex primitives, 
when a process "A" failed to lock a mutex, I wanted to return 
a primitive-failed-code in addition to the usual context-change to different process "B"
However what I observed was that because of the context change 
the primitive-failed-code incorrectly ended up returned on the stack of process "B".

Your description doesn't match how (I understand) the VM works.  The only way that Process A can initiate a process switch, mutex lock, et al, is by sending a message to some object (a process, mutex or semaphore).  So we're talking about Process>>suspend & resume as well as the lock/unlock and wait/signal primitives.  Primitive failure *always* delivers a primitive error to the method that contains the primitive and, in this case, initiated the process switch.  Primitives validate their arguments and then succeed, or fail, leaving their arguments undisturbed (there is one regrettable exception to this in the BttF/Cog VM which is the segment loading primitive that leaves its input word array scrambled if a failure occurs, rather than incur the cost of cloning the array).

So the only way that a primitive could fail and the error code end up on the wrong process's stack would be if the primitive was mis-designed to not validate before occurring. 


Essentially it can not fail and cause a side effect. 

This was my first foray into writing a primitive (still on my backlog to be completed).
I was aware of validating the arguments and leaving them undisturbed for a failure, 
but wasn't paying attention to primitive failure being completely free from side effects. 

Primitives should be side-effect free when they fail and hence if a process switch primitive fails, it cannot yet have caused a process switch and therefore the error code would have to be delivered to process A's stack.

That is kind of the outcome of what I was proposing.  I was trying for a mutex locking primitive that could fail without causing a side-effect "in the image".   Maybe its not valid to distinguish between side-effects "inside" or "outside" the image, but I thought it might be reasonable for a flag hidden "in the VM" to be just another event checked by #checkForEventsMayContextSwitch: .  Effectively the "in image" effect happens outside the primitive, a bit like reaching nextWakeupUsecs, or like I imagine a callback might work.

A side thought was that if context-changes occurred in a *single* location in #checkForEventsMayContextSwitch, 
it might be easier to make an "Idle VM"
 
 

I'm stretching my memory so there is a reasonable chance this is misleading...
but I believe I observed this happening in CoInterpreter>>internalExecuteNewMethod  
near this code..

    "slowPrimitiveResponse may of course context-switch. ..."
     succeeded := self slowPrimitiveResponse.
     ...
     succeeded ifTrue: [....

But internalExecuteNewMethod doesn't contain the switch code, internalActivateNewMethod does, and it does the process switch *after* delivering the primitive failure code, see reapAndResetErrorCodeTo:header: in the following:

internalActivateNewMethod
	...
	(self methodHeaderHasPrimitive: methodHeader) ifTrue:
		["Skip the CallPrimitive bytecode, if it's there, and store the error code if the method starts
		  with a long store temp.  Strictly no need to skip the store because it's effectively a noop."
		 localIP := localIP + (self sizeOfCallPrimitiveBytecode: methodHeader).
		 primFailCode ~= 0 ifTrue:
			[self reapAndResetErrorCodeTo: localSP header: methodHeader]].

	self assert: (self frameNumArgs: localFP) == argumentCount.
	self assert: (self frameIsBlockActivation: localFP) not.
	self assert: (self frameHasContext: localFP) not.

	"Now check for stack overflow or an event (interrupt, must scavenge, etc)."
	localSP < stackLimit ifTrue:
		[self externalizeIPandSP.
		 switched := self handleStackOverflowOrEventAllowContextSwitch:
						(self canContextSwitchIfActivating: newMethod header: methodHeader).
		 self returnToExecutive: true postContextSwitch: switched.
		 self internalizeIPandSP]


Though I can't exactly put my finger on explaining why, my intuition is that 
changing threads "half way" through a bytecode is a bad thing. 

Indeed it is, and the VM does not do this.  It is possible that the execution simulation machinery in Context, InstructionStream et al could have been written carelessly to allow this to occur, but it cannot and does not occur in the VM proper.

I made a chart to understand this better. One thing first: I'm not sure I've correctly linked execution of the primitives into slowPrimitiveResponse.  I'm not at all clear about how internalExecuteNewMethod selects between internalQuickPrimitiveResponse 
and slowPrimitiveResponse, and what is the difference between them?

Well, the first thing to say is that this is a magnificent diagram; thank you.  The problem is that the VM, and hence the diagram, is much more complex than the Blue Book specification, essentially because the VM is a highly optimized interpreter, whereas the specification is bare bones.  So I would ask you, and anyone else who wants to understand a Smalltalk-80 VM (a VM that provides Context objects for method activations, rather than a Smalltalk that uses a more conventional stack model), to read the Blue Book specification carefully and fully: http://www.mirandabanda.org/bluebook/bluebook_chapter28.html. This is the last section of Smalltalk-80: The Language and its Implementation, by Adele Goldberg and David Robson.  The specification is well-written and clear, and once digested it serves as an essential reference for understanding a more complex production VM.

Now to your question: "I'm not at all clear about how internalExecuteNewMethod selects between internalQuickPrimitiveResponse and slowPrimitiveResponse, and what is the difference between them?".

First, in a system that notionally allocates a context object to hold every activation, leaf routines are extremely expensive if all they do is answer a constant or an instance variable.  Dan Ingalls's optimization is to avoid activations by providing a set of quick primitives that answer an instance variable whose slot index is from 0 to 255, or self, nil, true, false, -1, 0, 1 & 2.  The self, nil, true, false, -1, 0, 1 & 2 constants are derived from a static frequency analysis of literals and variable references in Smalltalk code and are echoed in the original bytecode set, bytecodes 112-119 being 01110iii Push (receiver, true, false, nil, -1, 0, 1, 2) [iii].  internalQuickPrimitiveResponse handles precisely these primitives, and these primitives only.  These primitives have the property that they can never fail, so they are invoked along a path that neither resets nor tests the flag used to identify primitive failure.
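A rough Python model of quick-primitive dispatch as just described: a fixed set of infallible responses (self, a constant, or an instance variable by slot), answered without building an activation. The index encoding below is illustrative, not the VM's actual primitive numbering:

```python
# Quick primitives can never fail, so no failure flag is touched on this
# path. Indices here are illustrative; the real VM uses its own numbering.

QUICK_CONSTANTS = ["self", "true", "false", "nil", -1, 0, 1, 2]

def quick_primitive_response(index, receiver, inst_vars):
    if 0 <= index < 8:
        c = QUICK_CONSTANTS[index]
        return receiver if c == "self" else c     # ^self or ^constant
    # remaining indices: answer the inst var at slot (index - 8)
    return inst_vars[index - 8]                   # ^instVar, no activation

assert quick_primitive_response(0, "aPoint", []) == "aPoint"       # ^self
assert quick_primitive_response(6, "aPoint", []) == 1              # ^1
assert quick_primitive_response(8, "aPoint", ["x-val"]) == "x-val" # ^instVar0
```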

All other primitives are handled by slowPrimitiveResponse.  This requires clearing primErrorCode, calling a function implementing the primitive, and then testing primErrorCode before either continuing or building an activation for the failing primitive.  There is another important task of internalExecuteNewMethod, which is to store the "internal" frame and stack pointers (localFP & localSP) into the global interpreter frame and stack pointers (framePointer and stackPointer) before invoking slowPrimitiveResponse, and to restore localFP & localSP from framePointer and stackPointer afterwards.  In the Back-to-the-Future (BttF) VMs (the original Squeak VM and the Stack and Cog VMs) primitives access their receiver and arguments through framePointer and stackPointer.  But these, being global, are slow and without compiler-specific hacks cannot be placed in registers.  The Slang translator and the interpreter code collaborate to inline much of the interpreter, including every method beginning with "internal", into the interpret routine in which localFP, localSP and localIP are declared, hence allowing a C compiler to assign these variables to registers.  So another reason slowPrimitiveResponse is slow is that it writes and reads localFP, localSP & localIP to/from framePointer, stackPointer & instructionPointer.  But because it does so, primitives that change the execution context (process switch primitives that switch to another "stack" of contexts, or eval primitives such as perform:with:* and value:value:* which build another frame) can be written and can change the execution context at a send point (primitive invocation is always at a send).
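The externalize/call/internalize dance around slowPrimitiveResponse might be modelled like this (Python, stand-in names; the real interpreter works on C globals and a Slang-inlined interpret loop):

```python
# Toy model of the slow-primitive path described above: the "register"
# copies (local_sp/local_fp) are written out to the globals, the failure
# flag is cleared, the primitive runs against the globals (and may switch
# execution context), then the locals are reloaded and the flag tested.
# All names are illustrative stand-ins for the real interpreter's.

class Interp:
    def __init__(self):
        self.local_sp, self.local_fp = 100, 90        # "register" copies
        self.stack_pointer = self.frame_pointer = None  # globals
        self.prim_fail_code = 0

    def slow_primitive_response(self, primitive):
        # externalize: registers -> globals
        self.stack_pointer, self.frame_pointer = self.local_sp, self.local_fp
        self.prim_fail_code = 0
        primitive(self)                                # works on the globals
        # internalize: globals -> registers (may now point elsewhere!)
        self.local_sp, self.local_fp = self.stack_pointer, self.frame_pointer
        return self.prim_fail_code == 0

def failing_prim(interp):
    interp.prim_fail_code = 1      # validation failed; no side effects

def switching_prim(interp):
    interp.stack_pointer = 500     # e.g. switched to another context stack
    interp.frame_pointer = 490

i = Interp()
assert i.slow_primitive_response(failing_prim) is False
assert i.slow_primitive_response(switching_prim) is True
assert (i.local_sp, i.local_fp) == (500, 490)   # registers reloaded
```

The reload after the call is why a context-switching primitive is safe here: the interpreter resumes from whatever frame and stack the primitive left in the globals.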

Note that all the internal methods have non-internal duals that do the same thing but use framePointer, stackPointer & instructionPointer, not localFP, localSP & localIP.  These are used to implement the eval primitives perform:* and withArgs:executeMethod: since these may also invoke primitives.  And hence you might be able to get your head around my favorite Smalltalk construction:
| array |
array := { #perform:withArguments:. nil }.
array at: 2 put: array.
array perform: array first withArguments: array
;-)

Given that primitives are always invoked at a send, we can see how elegant Dan's invention of primitive failure is.  Primitives are essentially transactional and atomic.  They validate their arguments and if validation succeeds they carry out their action and answer a result as if from some normal send.  But if validation fails, or if they are unimplemented, the method containing the primitive reference (<primitive: 61>, <primitive: 'primitiveSocketAccept' module: 'SocketPlugin'>) is simply activated as if the primitive didn't exist, or as if the method was a normal method.  Hence primitives are optional, in that if the method body does what the primitive does then no one can tell whether the primitive is doing the work or Smalltalk code, except by measuring performance.  Hence, for example, large integer arithmetic and string display primitives are optional and serve to accelerate the system.

There is one other route by which primitives are executed, also at the point of send, and this is via the special selector bytecodes.  Another of Dan's excellent optimizations, the special selectors both save space and make the interpreter faster.  They save space by encoding the 32 most frequently occurring sends as one byte bytecodes, hence saving the 2 (16-bit Smalltalk-80), 4 (32-bit Squeak) or 8 (64-bit Squeak) bytes to store the selector in a method's literal frame.  But some of them also speed up the interpreter by statically predicting the receiver type.  i.e. #+, #-, #/, #*, #<, #>, #<= et al are most often sent to integers, and hence these bytecodes, as specified in the Blue Book, check for the top two elements on the stack being SmallIntegers, and if so replace the two top elements by the result of the operation, avoiding a send and primitive dispatch.  Note that in the BttF interpreter this checking is extended both to apply to Float, and to check for a following branch after the conditionals #<, #<= et al, so that the result doesn't have to be reified into a boolean that is tested immediately; effectively the following branch gets folded into the relational special selector bytecode.  The JIT uses this same technique, but is able to do a much better job because, for example, it can know if a relational special selector send is followed by a jump bytecode or not at JIT time.


ContextChange-Existing.png

So I understand that checkForEventsMayContextSwitch: called at the end of internalActivateNewMethod
occurs after bytecode execution has completed, so that context switches made there are done "between" bytecodes.
However my perspective is that internalExecuteNewMethod is only half way through a bytecode execution when 
the primitives effect context changes.  So internalActivateNewMethod ends up working on a different Process than internalExecuteNewMethod started with.  The bottom three red lines in the chart are what I considered to be changing threads "half way" through a bytecode.

More accurately, checkForEventsMayContextSwitch: is called on activating a method, after the send has occurred, but before the first bytecode has executed.  Hence primitive sends are not suspension points unless the primitive is a process switch or eval primitive.  Instead, just as a non-primitive (or failing primitive) method is being activated checkForEventsMayContextSwitch: is invoked.

Ahh, I'm slowly coming to grips with this.  It was extremely confusing at the time why my failure code from the primitive was turning up in a different Process, though I then learnt a lot digging to discover why.   In summary, if the primitive succeeds it simply returns from internalExecuteNewMethod and internalActivateNewMethod never sees the new Process B.  My problem violating the primitive-failure side-effect rule was that internalActivateNewMethod, trying to run Process A's in-Image primitive-failure code,
instead ran Process B's in-Image primitive-failure code.

Right.  So that design requirement is key to the VM architecture.  That was something I had to explain to Alistair when he did the first cut of the FileAttributesPlugin, which used to return failure codes on error instead of failing, leaving error recovery to clients.  And so it underscores the importance of reading the blue book specification.  [We really should do an up-to-date version that describes a simplified 32-bit implementation].

The design requirement that primitives validate their arguments and complete atomically or fail without side-effects is also key to Spur.  Spur speeds up become by using transparent forwarders (any object can become a forwarder) but reduces the cost of transparent forwarders by arranging that forwarders only have to be checked for during a send or during primitive argument validation.  The receiver has to be accessed during a send anyway, and so Spur is able to move the check for a forwarder to the lookup side of a send, only checking for forwarders if the method cache probe failed, which will always be the case for forwarders.  Likewise, primitive argument validation always fails for forwarders, and hence in Spur slowPrimitiveResponse makes a check on failure using a primitive's "accessor depth", the depth of the graph of arguments it validates.  If a primitive has a non-negative accessor depth then on failure the arguments are traversed to that depth, and any forwarders encountered are followed, fixing up that part of the object graph, and the primitive retried.  Without Dan's primitive design, Spur could not work well.


Interestingly the comment in #transferTo:from says... 
     "Record a process to be awoken on the next interpreter cycle."
which sounds like what I'd propose, but actually it doesn't wait for 
the next interpreter cycle and instead immediately changes context.

No it doesn't.  It effects the process change, but control continues with the caller, allowing the caller to do other things before the process resumes. 

In my case I believe "control continues with the caller" was not true since internalActivateNewMethod 
was trying to run the in-Image-primitive-failure code after the process changed.
But that was because I violated the side-effect rule.

For example, if in checkForEventsMayContextSwitch: more than one semaphore is signaled (it could initiate any combination of  signals of the low space semaphore, the input semaphore, external semaphores (associated with file descriptors & sockets), and the delay semaphore) then only the highest priority process would be runnable after the sequence, and several transferTo:[from:]'s could have been initiated from these signals, depending on process priority.  But  checkForEventsMayContextSwitch: will not finish mid-sequence.  It will always complete all of its signals before returning to code that can then resume the newly activated process.

Just to summarise to check I understood this correctly, no bytecode is executed during checkForEventsMayContextSwitch:.  
That is, its multiple transferTo: calls don't re-enter the interpreter?  Only which Process is set to run changes,
until at the end of checkForEventsMayContextSwitch: it returns to the interpreter to pick up the next bytecode of the active process.

That's right.  transferTo: merely sets framePointer, stackPointer & instructionPointer to those of the highest priority runnable process.  Execution resumes at that point once checkForEventsMayContextSwitch returns to its caller.  For details to do with mixing JITted code and interpreted code checkForEventsMayContextSwitch: answers a variable indicating if a switch actually took place, and returnToExecutive:postContextSwitch: may or may not have to longjmp back into the interpreter or jump into machine code.  But that's just an optimization.  Execution could always resume in the interpreter; the VM would simply be a little slower, and arguably a lot less complicated.

So the moral is, both the BttF and Cog VMs are complex, because they are optimized, Cog adding an entirely new level of complexity over the BttF interpreter VM.  If you want to see the wood for the trees first read the Blue Book spec http://www.mirandabanda.org/bluebook/bluebook_chapter28.html.
 
cheers -ben. 

_,,,^..^,,,_
best, Eliot

Re: context change versus primitive failure return value (was: FFI exception failure support on Win64 (Win32 also?))

Ben Coman
 


On Wed, 29 Aug 2018 at 11:08, Eliot Miranda <[hidden email]> wrote:
 
Well, the first thing to say is that this is a magnificent diagram; thank you. 

You're welcome.  Actually I tried to avoid doing it because I've got some other work to go on with ;)
but I got sucked in :).   It's a side effect of how I *need* to approach understanding complex problems.
My eyes were glazing over getting lost in the tree of senders, and creating such a chart makes me pay attention to details.
btw, I created it with the cross-platform https://www.yworks.com/products/yed  
(although I should be using Roassal to ensure completeness, doing it manually provides time to absorb details)

 
The problem is that the VM, and hence the diagram, is much more complex than the blue book specification, essentially because the VM is a highly optimized interpreter, whereas the specification is bare bones.  So I would ask you, and anyone else who wants to understand a Smalltalk-80 VM (a VM that provides Context objects for method activations, rather than a Smalltalk that uses a more conventional stack model) to read the Blue Book Specification: http://www.mirandabanda.org/bluebook/bluebook_chapter28.html carefully and fully. This is the last section of Smalltalk-80: The Language and its Implementation, by Adele Goldberg and David Robson.  The specification is well-written and clear and once digested serves as essential reference for understanding a more complex production VM.

I've been meaning to do this for a while, so took the opportunity just now to read chapters 28 & 29.  
I've split my insights from that into a few posts.


Now to your question: "I'm not at all clear about how internalExecuteNewMethod selects between internalQuickPrimitiveResponse and slowPrimitiveResponse, and what is the difference between them?".

First, in a system that notionally allocates a context object to hold every activation, leaf routines are extremely expensive if all they do is answer a constant or an instance variable.  Dan Ingalls's optimization is to avoid activations by providing a set of quick primitives that answer an instance variable whose slot index is from 0 to 255, or self, nil, true, false, -1, 0, 1 & 2.  The self, nil, true, false, -1, 0, 1 & 2 constants are derived from a static frequency analysis of literals and variable references in Smalltalk code and are echoed in the original bytecode set, bytecodes 112-119 being 01110iii Push (receiver, true, false, nil, -1, 0, 1, 2) [iii].  internalQuickPrimitiveResponse handles precisely these primitives, and these primitives only.  These primitives have the property that they can never fail, so they are invoked along a path that does not reset the flag used to identify primitive failure, nor test it.

I see BlueBook p605 & p620 code these primitives like this...

BlueBook Interpreter >> executeNewMethod    "no separate quick/slow primitive handling here"
    self primitiveResponse
        ifFalse: [self activateNewMethod] 

BlueBook Interpreter >> primitiveResponse 
    | flagValue thisReceiver offset | 
    primitiveIndex = 0 
        ifTrue: [     "quick primitives"
            flagValue := self flagValueOf: newMethod. 
            flagValue = 5 ifTrue: [self quickReturnSelf.   ^true]. 
            flagValue = 6 ifTrue: [self quickInstanceLoad. ^true].  "Quick return inst vars"
            ^false] 
        ifFalse: [    "slow primitives"
            self initPrimitive. 
            self dispatchPrimitives. 
            ^self success]  

BlueBook Interpreter >> quickInstanceLoad  "Quick return inst vars, push instance variable whose slot index is from 0 to 255"
    | thisReceiver fieldIndex |
    thisReceiver := self popStack. 
    fieldIndex := self fieldIndexOf: newMethod. 
    self push: (memory fetchPointer: fieldIndex ofObject: thisReceiver) 

So it looks like {nil, true, false, -1, 0, 1 & 2} were not handled. Although I see them handled by StackInterpreter.
I noted that StackInterpreter dispensed with "flagValue" and just used primitive number to identify individual quick variables.  

StackInterpreter >> externalQuickPrimitiveResponse
	"Invoke a quick primitive.
	 Called under the assumption that primFunctionPtr has been preloaded"
	| localPrimIndex |
	self assert: self isPrimitiveFunctionPointerAnIndex.
	localPrimIndex := self cCoerceSimple: primitiveFunctionPointer to: #sqInt.
	self assert: (localPrimIndex > 255 and: [localPrimIndex < 520]).
	"Quick return inst vars"
	localPrimIndex >= 264 ifTrue:
		[self pop: 1 thenPush: (objectMemory fetchPointer: localPrimIndex - 264 ofObject: self stackTop).
		 ^true].
	"Quick return constants"
	localPrimIndex = 256 ifTrue: [^true "return self"].
	localPrimIndex = 257 ifTrue: [self pop: 1 thenPush: objectMemory trueObject. ^true].
	localPrimIndex = 258 ifTrue: [self pop: 1 thenPush: objectMemory falseObject. ^true].
	localPrimIndex = 259 ifTrue: [self pop: 1 thenPush: objectMemory nilObject. ^true].
	self pop: 1 thenPush: (objectMemory integerObjectOf: localPrimIndex - 261).
	^true

And distinguishing between quick/slow primitives moved out to executeNewMethod...

StackInterpreter >> executeNewMethod
	"Execute newMethod - either primitiveFunctionPointer must be set directly
	 (i.e. from primitiveExecuteMethod et al), or it would have been set probing
	 the method cache (i.e. primitivePerform et al)."
	primitiveFunctionPointer ~= 0 ifTrue:
		[self isPrimitiveFunctionPointerAnIndex ifTrue:
			[self externalQuickPrimitiveResponse.
			 ^nil].
		 self slowPrimitiveResponse.
		 self successful ifTrue: [^nil]].
	"if not primitive, or primitive failed, activate the method"
	self activateNewMethod


Having soaked all that up, I've now better identified my difficulty understanding 
"how executeNewMethod distinguishes between quick and slow primitives"

I was looking at isPrimitiveFunctionPointerAnIndex (called from internalExecuteNewMethod),
seeing it only checked   ```primitiveFunctionPointer <= 520```   to identify quick primitives,
while quick primitives were between 256 and 520, i.e. not checking ```primitiveFunctionPointer >= 256```

StackInterpreter class >> initializePrimitiveTable
"Quick Push Const Methods primitiveIndex=0"
(256 nil) "primitivePushSelf"
(257 nil) "primitivePushTrue"
(258 nil) "primitivePushFalse"
(259 nil) "primitivePushNil"
(260 nil) "primitivePushMinusOne"
(261 nil) "primitivePushZero"
(262 nil) "primitivePushOne"
(263 nil) "primitivePushTwo"
"Quick Push Inst Var Methods"
(264 519 nil) "primitiveLoadInstVar"

But now I realise that I didn't properly read/understand the comments marked *** here...

StackInterpreter >> isPrimitiveFunctionPointerAnIndex
"We save slots in the method cache by using the primitiveFunctionPointer
to hold  ***either*** a function pointer or the index of a quick primitive.
         ***Since quick primitive indices are small they can't be confused with function addresses*** ." 
^(self cCoerce: primitiveFunctionPointer to: #'usqIntptr_t') <= MaxQuickPrimitiveIndex

I haven't dug into where primitiveFunctionPointer is set, but I now get the general idea.
cheers -ben

Re: context change versus primitive failure return value (was: FFI exception failure support on Win64 (Win32 also?))

Ben Coman
In reply to this post by Eliot Miranda-2
 
On Wed, 29 Aug 2018 at 11:08, Eliot Miranda <[hidden email]> wrote:

All other primitives are handled by slowPrimitiveResponse.  This requires clearing primErrorCode and calling a function implementing the primitive and then testing primErrorCode before either continuing or building an activation for the failing primitive. There is another important task of internalExecuteNewMethod, which is to store the "internal" frame and stack pointers (localFP & localSP) into the global interpreter frame and stack pointers (framePointer and stackPointer) before invoking and then restoring localFP & localSP from framePointer and stackPointer after having invoked slowPrimitiveResponse.  In the Back-to-the-Future (BTTF) VMs (the original Squeak VM and the Stack and Cog VMs) primitives access their receiver and arguments through framePointer and stackPointer.  But these, being global, are slow and without compiler-specific hacks cannot be placed in registers. 

 
The Slang translator and the interpreter code collaborate to inline much of the interpreter, including every method beginning with internal, into the interpret routine in which localFP localSP and localIP are declared, hence allowing a C compiler to assign these variables to registers. 

Ahhh. Maybe I finally get what "internal" means. IIUC, its code gets generated internal to the C interpret() function,
as I can see searching for "internalExecuteNewMethod"

One thing I'm curious about is why...   
searching on "externalNewMethod" shows it is inlined several times, 
but StackInterpreter>>executeNewMethod doesn't have the inline pragma   ??


Now a naive question, since as a 13,000 line function it's a bit hard to absorb interpret()...
why inline by direct code generation rather than by using the "inline" directive on individually generated functions ??
(I quite like how that shows the function being folded inline and then function parameters optimized away)

So another reason slowPrimitiveResponse is slow is that it writes and reads localFP, localSP & localIP to/from framePointer, stackPointer & instructionPointer. 

Also, along the same naive track, why not inline the primitive C functions so that 
you don't need to manually {externalize,internalize}IPandSP  and make those primitives faster ??
I guess it wouldn't work for instructionPointer, with definitions like...

    _iss char * stackPointer;       // global
    _iss char * framePointer;       // global
    _iss usqInt instructionPointer; // global

    sqInt
    interpret(void)
    {   char * localFP;
        char * localIP;
        char * localSP;

    StackInterpreter >> externalizeIPandSP
        instructionPointer := self oopForPointer: localIP.
        stackPointer := localSP.
        framePointer := localFP

Now on page 594 of the Bluebook I read "The fetchByte routine fetches the
byte indicated by the active context's instruction pointer and increments the instructionPointer"
That sounds like each Context has its own instructionPointer, but I didn't think that was so ??

 
But because it does so, primitives that change the execution context (process switch primitives that switch to another "stack" of contexts, or eval primitives such as perform:with:* and value:value* which build another frame) can be written and change the execution context at a send point (primitive invocation is always at a send). 

Note that all the internal methods have non-internal duals that do the same thing but use framePointer, stackPointer & instructionPointer, not localFP, localSP & localIP.  These are used to implement the eval primitives perform:* and withArgs:executeMethod: since these may also invoke primitives.  And hence you might be able to get your head around my favorite Smalltalk construction:
| array |
array := { #perform:withArguments:. nil }.
array at: 2 put: array.
array perform: array first withArguments: array
;-)

That's rather recursively evil.
 

Given the primitives are always invoked at a send we can see how elegant Dan's invention of primitive failure is.  Primitives are essentially transactional and atomic.  They validate their arguments and if validation succeeds they carry out their action and answer a result as if from some normal send.  But if validation fails, or if they are unimplemented, the method containing the primitive reference (<primitive: 61>, <primitive: 'primitiveSocketAccept' module: 'SocketPlugin'>) is simply activated as if the primitive didn't exist, or as if the method was a normal method.  Hence primitives are optional, in that if the method body does what the primitive does then no one can tell if the primitive is doing the work or Smalltalk code, except by measuring performance.  Hence for example large integer arithmetic and string display primitives are optional and serve to accelerate the system.

There is one other route by which primitives are executed, also at the point of send, and this is via the special selector bytecodes.  Another of Dan's excellent optimizations, the special selectors both save space and make the interpreter faster.  They save space by encoding the 32 most frequently occurring sends as one byte bytecodes, hence saving the 2 (16-bit Smalltalk-80), 4 (32-bit Squeak) or 8 (64-bit Squeak) bytes to store the selector in a method's literal frame.  But some of them also speed up the interpreter by statically predicting the receiver type.  i.e. #+, #-, #/, #*, #<, #>, #<= et al are most often sent to integers, and hence these bytecodes, as specified in the Blue Book, check for the top two elements on the stack being SmallIntegers, and if so replace the two top elements by the result of the operation, avoiding a send and primitive dispatch.  Note that in the BttF interpreter this checking is extended both to apply to Float,

In the Bluebook p619 I see the simple bytecode dispatch...
    currentBytecode = 176 ifTrue: [ ^self primitiveAdd].
and...
Bluebook Interpreter >> primitiveAdd 
    | integerReceiver integerArgument integerResult | 
    integerArgument := self popInteger. 
    integerReceiver := self popInteger. 
    self success 
        ifTrue: [
            integerResult := integerReceiver + integerArgument. 
            self success: (memory isIntegerValue: integerResult)]. 
    self success 
        ifTrue:  [self pushInteger: integerResult] 
        ifFalse: [self unPop: 2] 

and notice that StackInterpreter is doing a lot more within the bytecode before a primitive is needed...
spurstacksrc/vm/interp.c has... 
     case 176: /* bytecodePrimAdd */
which is...
StackInterpreter >> bytecodePrimAdd
	| rcvr arg result |
	rcvr := self internalStackValue: 1.
	arg := self internalStackValue: 0.
	(objectMemory areIntegers: rcvr and: arg)
		ifTrue: [result := (objectMemory integerValueOf: rcvr) + (objectMemory integerValueOf: arg).
			(objectMemory isIntegerValue: result) ifTrue:
				[self internalPop: 2 thenPush: (objectMemory integerObjectOf: result).
				 ^self fetchNextBytecode "success"]]
		ifFalse: [self initPrimCall.
			self externalizeIPandSP.
			self primitiveFloatAdd: rcvr toArg: arg.
			self internalizeIPandSP.
			self successful ifTrue: [^self fetchNextBytecode "success"]].
	messageSelector := self specialSelector: 0.
	argumentCount := 1.
	self normalSend

Now I'm curious about the different handling above of integers (ifTrue: path) and floats (ifFalse: path).
I guess the integer code is due to being immediate values where the object doesn't need to be looked up,
while the non-immediate floats need a primitive call.
Now I'm wondering, since 64-bit has immediate floats, is bytecodePrimAdd due an update to make them faster?

btw, my impression is that  #areImmediateIntegers:and:  seems a more explicit name than  #areIntegers:and:  
since I guess the ifTrue: path doesn't apply to integers that are LargePositiveIntegers.

 
and to check for a following branch after the conditionals #<, #<= et al, so that the result doesn't have to be reified into a boolean that is tested immediately; effectively the following branch gets folded into the relational special selector bytecode. 

Very interesting to know.

 
The JIT uses this same technique, but is able to do a much better job because, for example, it can know if a relational special selector send is followed by a jump bytecode or not at JIT time.




So I understand that checkForEventsMayContextSwitch: called at the end of internalActivateNewMethod
occurs after bytecode execution has completed, so that context switches made there are done "between" bytecodes.
However my perspective is that internalExecuteNewMethod is only half way through a bytecode execution when 
the primitives effect context changes.  So internalActivateNewMethod ends up working on a different Process than internalExecuteNewMethod started with.  The bottom three red lines in the chart are what I considered to be changing threads "half way" through a bytecode.

More accurately, checkForEventsMayContextSwitch: is called on activating a method, after the send has occurred, but before the first bytecode has executed.  Hence primitive sends are not suspension points unless the primitive is a process switch or eval primitive.  Instead, just as a non-primitive (or failing primitive) method is being activated checkForEventsMayContextSwitch: is invoked.

On BlueBook page 594 I see checkProcessSwitch is called at the start of the cycle rather than at the end as we have.
Probably it's buried in history, but do you know of any particular reason for the change?
 

Ahh, I'm slowly coming to grips with this.  It was extremely confusing at the time why my failure code from the primitive was turning up in a different Process, though I then learnt a lot digging to discover why.   In summary, if the primitive succeeds it simply returns from internalExecuteNewMethod and internalActivateNewMethod never sees the new Process B.  My problem violating the primitive-failure side-effect rule was that internalActivateNewMethod, trying to run Process A's in-Image primitive-failure code,
instead ran Process B's in-Image primitive-failure code.

Right.  So that design requirement is key to the VM architecture.  That was something I had to explain to Alistair when he did the first cut of the FileAttributesPlugin, which used to return failure codes on error instead of failing, leaving error recovery to clients.  And so it underscores the importance of reading the blue book specification.  [We really should do an up-to-date version that describes a simplified 32-bit implementation].

That would be great.  First step would be to seek permission to reuse that chapter.  
It would be best to reuse most of its structure and content and update details.

cheers -ben

Re: context change versus primitive failure return value (was: FFI exception failure support on Win64 (Win32 also?))

timrowledge
 
I can answer that at least in part - history.

> On 2018-08-30, at 9:22 AM, Ben Coman <[hidden email]> wrote:
>
> Now a naive question, since as a 13,000 line function it's a bit hard to absorb interpret()...
> why inline by direct code generation rather than by using the "inline" directive on individually generated functions ??

Remember that the basics of this code are not much changed since 1994/5/6 and that the original code was very much Mac oriented. The Mac C compiler of the day didn't really do the inlining thing very well (it's too long ago to remember if many compilers in general use did!) and so a text based approach was used. It literally did simple text substitution and similar hacks. Because it became an essential element of the VM code in that far off time we even got to the point (several times, probably) where *not* doing the inlining run(s) produced code that wouldn't compile. To tell the truth, I have no idea if that is the current state; bursts of hard work over the years have fixed it at least once.

An interesting question to answer would be whether current general compilers actually do a good job with inlining. My last experiments were before I moved to Canada in '04, so very out of date, but IIRC it seemed things got slower back then.

It *should* be the better solution, logically. Compilers are meant to be better at this than us meatbags. If you, or anybody, can find a way to make the considerable time to work on it then we could find out and hopefully benefit.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful Latin Phrases:- Fac ut gaudeam = Make my day.



Re: context change versus primitive failure return value (was: FFI exception failure support on Win64 (Win32 also?))

David T. Lewis
 
On Thu, Aug 30, 2018 at 09:59:16AM -0700, tim Rowledge wrote:

>  
> I can answer that at least in part - history.
>
> > On 2018-08-30, at 9:22 AM, Ben Coman <[hidden email]> wrote:
> >
> > Now a naive question, since as a 13,000 line function it's a bit hard to absorb interpret()...
> > why inline by direct code generation rather than by using the "inline" directive on individually generated functions ??
>
> Remember that the basics of this code are not much changed since 1994/5/6 and that the original code was very much Mac oriented. The Mac C compiler of the day didn't really do the inlining thing very well (it's too long ago to remember if many compilers in general use did!) and so a text based approach was used. It literally did simple text substitution and similar hacks. Because it became an essential element of the VM code in that far off time we even got to the point (several times, probably) where *not* doing the inlining run(s) produced code that wouldn't compile. To tell the truth, I have no idea if that is the current state; bursts of hard work over the years has fixed it at least once.
>

I can add from my own experience that the Slang inliner is very effective
in practice. I was truly amazed to find out how good it is.

Consider the memory access macros (sqMemoryAccess.h). These basically serve
to inline the very lowest level (and hence performance critical) functions
of the VM. When I replaced those macros with equivalent Smalltalk methods
(see the MemoryAccess package in VMMaker), and relied on the inliner to
unwind them, the result was a VM with performance that was, as nearly as
I could tell, the same as the performance of the same VM with traditional
cpp macros.

Think about what that really means. You can write very low level code in
Smalltalk, and rely on the code generator and Slang inliner to spit out
very high performance code. Forget about trying to read the generated C
output, and forget about trying to figure out what those clever macros
are really doing. You can work directly in Smalltalk, no indecipherable
C preprocessor gunk, no compiler directives, no inlined macros. Just
Smalltalk, and the results are outstanding.

Oh, and one more thing. Just try running your VM under gdb to debug it
if you are depending on all those compiler optimizations. Forget it. But
use your Smalltalk tools, write in Smalltalk, translate to C in Smalltalk,
and inline in Smalltalk. Now you can turn off all that compiler optimization
and run even the lowest level memory access code under a debugger, line
by generated line of C source code. Wonderful.

That does not mean that compiler optimizers and directives do not matter.
They do. But do not underestimate the Slang inliner, it is very good.

> An interesting question to answer would be whether current general compilers actually do a good job with inlining. My last experiments were before I moved to Canada in '04, so very out of date, but IIRC it seemed things got slower back then.
>

It would be interesting to know the answer to this. My guess is that
the results would be at best similar to the Slang inliner. But to me
this falls into the category of great things for somebody else to spend
time on ;-)


> It *should* be the better solution, logically. Compilers are meant to be better at this than us meatbags. If you, or anybody, can find a way to make the considerable time to work on it then we could find out and hopefully benefit.
>

Actually, I do not think that it would be better. We meatbags need to
be able to run things under a debugger from time to time, and you simply
cannot do that with a big C program that depends heavily on C compiler
inlining.

Dave


Re: context change versus primitive failure return value (was: FFI exception failure support on Win64 (Win32 also?))

Bert Freudenberg
In reply to this post by Ben Coman
 
On Thu, Aug 30, 2018 at 09:22 Ben Coman <[hidden email]> wrote:
 
On Wed, 29 Aug 2018 at 11:08, Eliot Miranda <[hidden email]> wrote:

The Slang translator and the interpreter code collaborate to inline much of the interpreter, including every method beginning with internal, into the interpret routine in which localFP localSP and localIP are declared, hence allowing a C compiler to assign these variables to registers. 

Ahhh. Maybe I finally get what "internal" means.

Dan Ingalls described the secret recipe to achieve high performance in a dynamic language as the Art of Cheating Without Getting Caught.

That's what the "internal" vs "external" is about. To external code, everything looks like expected - e.g. when you inspect a context object, its stack has all the temp objects and its instruction pointer is right past the last bytecode it executed, just like the Blue Book describes.

But in order to get higher performance, even the plain interpreter cheats, the stack interpreter a lot more, and Cog / Spur does unmentionable things.

To avoid getting caught (by the image or by primitives that are blissfully unaware of the amount of cheating going on) the internal state gets externalized at strategic points to preserve the illusion of order. Nothing to see here, move on.

That sounds like each Context has its own instructionPointer, but I didn't think that was so.

That's exactly what happens. The VM has an active context, and each context has an instruction pointer for the next bytecode, as well as a stack pointer into its own little stack for values (not for return addresses). Each send creates a new context object which is linked to its sender via the "sender" inst var. This linked list is the equivalent of a call stack in other languages. 

And the cool thing is you can inspect all of that in the image. The even cooler thing is that you can manipulate the context objects, like switching out the sender to a completely different context (that's how co-routines work in e.g. a Squeak Generator). The coolest thing is that ever since the StackVM, there is something completely different going on behind the scenes that's much more performant, while still maintaining all the features.

HTH

- Bert -
-- 
Dr. Bert Freudenberg
7275 Franklin Avenue #210
Los Angeles CA 90046
+1 (818) 482-3991
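[Editorial sketch] Bert's picture of the context chain can be sketched in C. All names and the struct layout here are invented — the real object format is nothing this simple — but the mechanics are as he describes: each context carries its own instruction pointer and little value stack, the sender links form the call stack, and a coroutine switch is just reassigning the sender.

```c
/* Invented layout: each Context has a sender link, its own instruction
   pointer, and its own small value stack. */
typedef struct Context {
    struct Context *sender;  /* "sender" inst var: the linked-list call stack */
    int ip;                  /* index of the next bytecode, per context */
    int sp;                  /* index into this context's own value stack */
    long stack[8];
} Context;

/* A "send" activates a fresh context linked to the caller. */
static Context *contextSend(Context *caller, Context *fresh)
{
    fresh->sender = caller;
    fresh->ip = 0;
    fresh->sp = 0;
    return fresh;  /* the new active context */
}

/* Depth of the sender chain, as an image-side inspector might walk it. */
static int chainDepth(const Context *active)
{
    int depth = 0;
    while (active) { depth++; active = active->sender; }
    return depth;
}
```

Setting `active->sender` to a completely different context is the "switching out the sender" trick behind Squeak's Generator-style co-routines.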
 

Re: context change versus primitive failure return value (was: FFI exception failure support on Win64 (Win32 also?))

johnmci
 
The Slang inliner logic had a tiny bit of magic in it that I added to help it fold the entire GC mark-sweep logic into a single C routine and ensure all working variables became local variables, avoiding single or double indirections to variables in the interp.c file or to the struct. This made a huge impact on GC times on 68K machines. 
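[Editorial sketch] John's folded-GC trick can be caricatured like this — a toy mark phase over an invented object layout, sized only for this example. The point is solely that the whole phase is one routine whose working state (the mark stack and its pointer) is held in C locals, so the hot loop never dereferences a global or a struct field for its scan state.

```c
/* Invented toy heap: not the real object format. */
#define NOBJ 8

typedef struct { int marked; int nrefs; int refs[2]; } Obj;

/* The entire mark phase folded into one routine; the work stack and its
   pointer are locals, as the Slang inliner arranged for the real code.
   The stack is sized for this toy graph only. */
static int markAll(Obj heap[NOBJ], int root)
{
    int stack[NOBJ];
    int sp = 0, nmarked = 0;
    stack[sp++] = root;
    while (sp > 0) {
        Obj *o = &heap[stack[--sp]];
        if (o->marked) continue;
        o->marked = 1;
        nmarked++;
        for (int i = 0; i < o->nrefs; i++)
            stack[sp++] = o->refs[i];
    }
    return nmarked;
}
```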

Sent from my iPhone

On Aug 30, 2018, at 17:57, Bert Freudenberg <[hidden email]> wrote:

 

Re: context change versus primitive failure return value (was: FFI exception failure support on Win64 (Win32 also?))

Nicolas Cellier
 
I want to add three things about inlining:
- inline was not standardized before C99, even though it existed in gcc well before that; historically, it was not an option at the time the original VM was written.
- inline is a compiler directive, or rather a hint or suggestion, not a mandatory requirement - very much like register. So if we want fine control, we can't entirely trust it.
- third, with type inference taking place at Slang translation time, some Smalltalk methods behave like generic templates specialized for different types depending on the sender, and this specialization happens during Slang inlining. I don't remember whether the VM really depends on this feature, but if so, we couldn't just replace it with an inline directive; we would also have to generate several specialized signatures instead of a single function...

For understanding the code, it's better to stick to the Smalltalk side and emulate the VM.
But there are subtle differences between emulated Slang and the generated C code, especially in the handling of types, and it is also sometimes necessary to debug the translation itself (with respect to undefined behaviors, for example), in which case compiler warnings are one of the low-level companion tools.
Personally, Slang inlining annoyed me when tracking down those compiler warnings, because it generates a lot of noise!
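[Editorial sketch] Nicolas's third point — several monomorphic signatures stamped out per inferred type, rather than one generic function behind an inline hint — can be mimicked in C with a macro. Names and the example routine are invented; this is only the shape of the specialization, not what Slang actually emits.

```c
#include <stdint.h>

/* One "generic" body, stamped out once per concrete type, the way the
   Slang inliner can specialize a method per sender-inferred type. */
#define DEFINE_SUM(name, type)                         \
    static type name(const type *xs, int n)            \
    {                                                  \
        type total = 0;                                \
        for (int i = 0; i < n; i++) total += xs[i];    \
        return total;                                  \
    }

DEFINE_SUM(sum32, int32_t)   /* one signature per type... */
DEFINE_SUM(sum64, int64_t)   /* ...instead of a single function */
```

An `inline` directive on a single `sum` could not express this: the 32-bit and 64-bit callers genuinely need different function bodies.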



On Fri, Aug 31, 2018 at 03:41, John McIntosh <[hidden email]> wrote:
 