Two new curious Context/primitive questions :-)


Two new curious Context/primitive questions :-)

Christoph Thiede

Hi all, hi Eliot,


new year, new simulation fun, and I have collected two new questions about the Context implementation which I'd love to get answered here:


First, I was confused by the following:

(BlockClosure >> #numArgs) primitive. "266"
(Context >> #pc) primitive. "0"
What makes Context so special that it cannot be compiled with quick accessor methods?

Second, Context >> #isPrimFailToken: attracted my attention multiple times when looking at different #timeProfile results of simulation sessions. In the expression [100000 factorial] it takes up more than 44% of the whole execution time!
I have no idea why it could be so slow, but this message is sent really often, which is at least 2 times per simulation of a special message send.
Would there be an easy change to resolve this bottleneck and speed up the simulation by 40% or more?
Would it be possible to provide a primitive for this method? Or do you see any other way to optimize it?

Best,
Christoph



Carpe Squeak!

Re: Two new curious Context/primitive questions :-)

Eliot Miranda-2
Hi Christoph,

    happy new year!!

On Fri, Jan 1, 2021 at 11:28 AM Thiede, Christoph <[hidden email]> wrote:

Hi all, hi Eliot,


new year, new simulation fun, and I have collected two new questions about the Context implementation which I'd love to get answered here:


First, I was confused by the following:

(BlockClosure >> #numArgs) primitive. "266"
(Context >> #pc) primitive. "0"
What makes Context so special that it cannot be compiled with quick accessor methods?


The gory details are here: http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/

But that's full of implementation detail.  Let me try and explain at a higher level.

Contexts are wonderful but very expensive.  Any VM that actually creates contexts for every method or block activation will inevitably run slowly.  There are two overheads here.  One is the basic overhead of allocating and reclaiming the contexts that are created.  The other is the cost of moving the outgoing receiver and arguments from the sending context to the new context.  A fast VM must avoid these overheads, and the key solution is to use a stack frame organization as is used in conventional language implementations.  The key here is that with stack frames the receiver and incoming arguments do not have to be moved; they stay where they are and a new frame is built underneath them which includes them as the incoming receiver and arguments; stack frames overlap outgoing and incoming arguments.

But crucially, in addition the VM must still create contexts if and when they are needed.  If not, this would no longer be a Smalltalk-80 implementation.  Now one could imagine a VM which converted stack frames to contexts whenever a method or block activation was accessed using thisContext (or thisContext sender etc which access contexts by following the sender field).  But that would be very slow.  Instead, the fastest approach is to try and keep stack frames as stack frames for as long as possible.  For this to work contexts have to be able to function as "proxy objects" for stack frames.  The VM does indeed create contexts when one uses thisContext, or even when one creates a block, but internally the context is in a very special state, a state in which it points at a stack frame.

Some terminology is useful.  Peter Deutsch used "volatile", "hybrid" and "stable".  I like "single", "married", "widowed" and "divorced" (Peter's scheme didn't have "widowed" which is the key to why my scheme is faster, and why the closure implementation is as it is).

When a stack frame is created it is "single".  It has a slot within it to refer to a context object if it is needed, but the slot is nil.
When a stack frame needs a context object one is created and the two are "married".
When a method or block returns its stack frame is discarded; the stack frame has died and if it was married its context becomes "widowed". (*)
When space is needed for new stack frames all reclaimed stack frames are "divorced" from their contexts.  The stack frame state is written to the context object and the context object no longer acts as a proxy; it is a fully fledged normal stack object. (*)

(*) Stack frames exist in a small memory zone organized as stack pages, the stack zone.  The size of the stack zone is set at startup, and its size is either the VM's default or set by a value that persists in the image header (see vmParameterAt: 43).  Since the system may run out of stack pages (a deep call stack, many active processes), the VM makes new pages available by purging the least recently used pages to the heap in the form of context objects.  All the frames in the least recently used stack page are divorced at the same time. A context must be created for each frame that is not yet married.

(*) The reason why Peter's original scheme is slower is that at return time the VM has to check for a frame being hybrid and copy state to the hybrid context, making it stable.  This has to be done because a closure's outer temporary variables are stored in the lexically enclosing stack frame, and so have to be made to persist in the stable context or their values will be stale.  In my (lisp-derived) closure implementation there are no such references to outer temporary variables; values are either copied if they do not change, or stored in an Array (the "indirection vector"), which can then be copied.

Note that divorce is a process.  Once divorced a context is single and free to function as a context does.  Indeed when the system starts up it loads an image which only contains single contexts.  All contexts are divorced on snapshot.
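For readers who prefer code to prose, the lifecycle above can be modeled as a small state machine. This is a conceptual sketch only, not VM code; all class and method names here (StackFrame, ContextObject, ensure_context, etc.) are invented for illustration.

```python
# Toy model of the single/married/widowed/divorced lifecycle described
# above.  Illustrative only; names are invented.

class StackFrame:
    def __init__(self):
        self.context = None   # "single": the context slot starts out nil
        self.live = True

    def ensure_context(self):
        # thisContext (or block creation) marries the frame to a proxy context
        if self.context is None:
            self.context = ContextObject(frame=self)
        return self.context

    def on_return(self):
        # the frame dies on return; a married context becomes widowed
        self.live = False

    def divorce(self):
        # page eviction: frame state is written into the context object,
        # which then stops acting as a proxy
        ctx = self.ensure_context()
        ctx.frame = None
        self.live = False
        return ctx

class ContextObject:
    def __init__(self, frame=None):
        self.frame = frame    # proxies a stack frame while this is set

    def state(self):
        if self.frame is None:
            return "divorced"   # fully fledged, self-contained context
        return "married" if self.frame.live else "widowed"

frame = StackFrame()
ctx = frame.ensure_context()
print(ctx.state())   # married
frame.on_return()
print(ctx.state())   # widowed
```

A frame whose context slot is still nil corresponds to the "single" state; only frames that are observed via thisContext (or that create blocks) ever acquire a context at all, which is what keeps the common case cheap.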


So now, finally, we can answer your question.  Why is (Context>>#pc), a method that simply answers an instance variable, not compiled as a quick method with a primitive that answers that variable?  "married" and "widowed" contexts are proxies for stack frames.  Only the instance variables they contain which do not change can be fetched from the context object itself; these are the receiver, arguments, closureOrNil, and method.  The sender, pc, stackp and other stack contents may change as execution proceeds.  These mutable values must be fetched from the stack frame itself, if it still exists.

All accesses to sender, pc, stackp, and stack contents (Context>>at:[put:]) must be mediated.  A married or widowed context has its sender field set to the frame pointer of its stack frame, encoded as a SmallInteger.  You can never see this from the image (except via a special xRay primitive). By following the frame pointer encoded in the sender field the VM can detect if the frame is still live.  If so, the relevant values are synthesized or fetched from the stack frame itself.  If not, the context is widowed and "mourns"; it is changed into a context with a nil sender field, a nil pc, and a stackp that points at the last argument.

How is this mediation done?  One design could be to require that all image level accesses to Context inst vars are through primitives.  This feels contrived to me.  Instead, inst var access is via inst var access bytecodes.  But if we do this naively we would have to check in all inst var access bytecodes whose index is the same as sender (0), pc (1) & stackp (2).  This would slow down inst var access enormously.  Instead we observe that the inst var access bytecodes pushReceiverVariable:, popIntoReceiverVariable: have short forms for the most frequently used (indexes 0 to 15 for push, indexes 0 to 7 for popInto).  By arranging that we only use the long forms for accessing inst vars in Context, and its superclasses and subclasses, we only have to check in the long forms and the check is cheap.  We check that the index is <= 2, because all other inst var accesses with these indices will use the short-form bytecodes.

By arranging that Context's inst vars are *not* accessed by quick primitives we do not have to duplicate this machinery in quick primitive dispatch, which would also slow things down.  Hence Context>>pc is compiled as a vanilla method and its inst var access bytecode is the long form:

25 <E2 01> pushRcvr: 1
27 <5C> returnTop
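The check the long-form bytecode performs can be sketched as follows. This is a rough Python model of the logic described above, not the actual interpreter code; the class names and the frame-pointer representation are invented stand-ins.

```python
# Rough model of the mediated long-form inst var fetch described above.
# Illustrative only: the real check lives in the VM's bytecode routines.

SENDER, PC, STACKP = 0, 1, 2   # the three mutable Context inst var indexes

class PlainObject:
    def __init__(self, slots):
        self.slots = slots

class ContextProxy(PlainObject):
    def __init__(self, slots, frame=None):
        super().__init__(slots)
        self.frame = frame     # stands in for the encoded frame pointer

class LiveFrame:
    def __init__(self, sender, pc, stackp):
        self.values = [sender, pc, stackp]
        self.live = True
    def read(self, index):
        return self.values[index]

def push_rcvr_long(receiver, index):
    """Long-form pushReceiverVariable: with the cheap index <= 2 guard."""
    if index <= STACKP and isinstance(receiver, ContextProxy):
        frame = receiver.frame
        if frame is not None and frame.live:
            return frame.read(index)   # synthesize from the live stack frame
    return receiver.slots[index]       # everything else: a plain slot fetch

frame = LiveFrame(sender=None, pc=25, stackp=3)
ctx = ContextProxy(slots=[None, None, None], frame=frame)
print(push_rcvr_long(ctx, PC))   # 25, fetched from the frame, not the slot
```

The point of the design is visible in the guard: ordinary objects, and all Context inst vars beyond index 2, never reach the mediation path, so the common case stays a plain slot fetch.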

'Nuff said
Second, Context >> #isPrimFailToken: attracted my attention multiple times when looking at different #timeProfile results of simulation sessions. In the expression [100000 factorial] it takes up more than 44% of the whole execution time!
I have no idea why it could be so slow, but this message is sent really often, which is at least 2 times per simulation of a special message send.
Would there be an easy change to resolve this bottleneck and speed up the simulation by 40% or more?
Would it be possible to provide a primitive for this method? Or do you see any other way to optimize it?

It's not slow so much as that it has to be used all the time to check whether a send has invoked a primitive which has failed.  Its function is to support primitive error codes.  Before primitive error codes (IIRC) we simply checked for nil.  But with primitive error codes we need a container that both reliably distinguishes a primitive failure result from all other objects in the system, and holds onto the primitive failure code.  It may indeed be possible to reduce the overhead by only testing when absolutely necessary.  It may be that reimplementing the scheme is faster.

For example, one could imagine a class whose instances are the only objects to answer #isPrimFailToken, which would be faster than a one-argument message send.  But beware, because if one were debugging the debugger one mustn't confuse a PrimFailToken created in the simulation with a PrimFailToken created by the simulation.
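The two designs can be contrasted in a toy model. This is a sketch only; the Squeak implementation differs, and all names here (prim_fail_token_for, PrimFailToken, the marker object) are invented for illustration.

```python
# Toy comparison of the two designs discussed above.  All names invented.

_MARKER = object()   # plays the role of the unique object tagging fail tokens

def prim_fail_token_for(error_code):
    # a container that distinguishes a primitive failure from every
    # ordinary result while carrying the error code along
    return (_MARKER, error_code)

def is_prim_fail_token(value):
    # the structural check every simulated send must pay for:
    # right shape, and tagged with the unique marker
    return (isinstance(value, tuple) and len(value) == 2
            and value[0] is _MARKER)

class PrimFailToken:
    """Alternative sketch: only instances of this class would answer true
    to an isPrimFailToken query, replacing the structural check above
    with a single class test."""
    def __init__(self, error_code):
        self.error_code = error_code

print(is_prim_fail_token(prim_fail_token_for("bad receiver")))   # True
print(is_prim_fail_token(42))                                    # False
```

Note that the caveat above still applies to the class-based variant: a class test alone cannot tell a token created *in* the simulation from one created *by* the simulation.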
Remember to distinguish between price and value.  Implementing the context-to-stack mapping scheme outlined above was very costly, but its value is very high.  It allows us to have contexts, with all their utility, while reducing their costs significantly.  Similarly, and closely related, being able to simulate execution is costly but hugely valuable.  If the price is 40% of simulated execution then it is a price worth paying.


Happy New Year!

_,,,^..^,,,_
best, Eliot



Re: Two new curious Context/primitive questions :-)

Eliot Miranda-2
errata:

On Fri, Jan 1, 2021 at 1:22 PM Eliot Miranda <[hidden email]> wrote:
Hi Christoph,

    happy new year!!

On Fri, Jan 1, 2021 at 11:28 AM Thiede, Christoph <[hidden email]> wrote:
Hi all, hi Eliot,

new year, new simulation fun, and I have collected two new questions about the Context implementation which I'd love to get answered here:

First, I was confused by the following:
(BlockClosure >> #numArgs) primitive. "266"
(Context >> #pc) primitive. "0"

What makes Context so special that it cannot be compiled with quick accessor methods?

The gory details are here: http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/

But that's full of implementation detail.  Let me try and explain at a higher level.

Contexts are wonderful but very expensive.  Any VM that actually creates contexts for every method or block activation will inevitably run slowly.  There are two overheads here.  One is the basic overhead of allocating and reclaiming the contexts that are created.  The other is the cost of moving the outgoing receiver and arguments from the sending context to the new context.  A fast VM must avoid these overheads, and the key solution is to use a stack frame organization as is used in conventional language implementations.  The key here is that with stack frames the receiver and incoming arguments do not have to be moved; they stay where they are and a new frame is built underneath them which includes them as the incoming receiver and arguments; stack frames overlap outgoing and incoming arguments.

But crucially, in addition the VM must still create contexts if and when they are needed.  If not, this would no longer be a Smalltalk-80 implementation.  Now one could imagine a VM which converted stack frames to contexts whenever a method or block activation was accessed using thisContext (or thisContext sender etc which access contexts by following the sender field).  But that would be very slow.  Instead, the fastest approach is to try and keep stack frames as stack frames for as long as possible.  For this to work contexts have to be able to function as "proxy objects" for stack frames.  The VM does indeed create contexts when one uses thisContext, or even when one creates a block, but internally the context is in a very special state, a state in which it points at a stack frame.

Some terminology is useful.  Peter Deutsch used "volatile", "hybrid" and "stable".  I like "single", "married", "widowed" and "divorced" (Peter's scheme didn't have "widowed" which is the key to why my scheme is faster, and why the closure implementation is as it is).

When a stack frame is created it is "single".  It has a slot within it to refer to a context object if it is needed, but the slot is nil.
When a stack frame needs a context object one is created and the two are "married".
When a method or block returns its stack frame is discarded; the stack frame has died and if it was married its context becomes "widowed". (*)
When space is needed for new stack frames all reclaimed stack frames are "divorced" from their contexts.  The stack frame state is written to the context object and the context object no longer acts as a proxy; it is a fully fledged normal stack object. (*)

(*) Stack frames exist in a small memory zone organized as stack pages, the stack zone.  The size of the stack zone is set at startup, and its size is either the VM's default or set by a value that persists in the image header (see vmParameterAt: 43).  Since the system may run out of stack pages (a deep call stack, many active processes), the VM makes new pages available by purging the least recently used pages to the heap in the form of context objects.  All the frames in the least recently used stack page are divorced at the same time. A context must be created for each frame that is not yet married. A page is big enough to hold on average about 40 method activations; since activations are of different sizes and change size as execution proceeds, the number of activations per full page varies.

(*) The reason why Peter's original scheme is slower is that in HPS at return time the VM has to check for a frame being hybrid and copy state to the hybrid context, making it stable.  This has to be done because a closure's outer temporary variables are stored in the lexically enclosing stack frame, and so have to be made to persist in the stable context or their values will be stale.  In my (lisp-derived) closure implementation there are no such references to outer temporary variables; values are either copied if they do not change, or stored in an Array (the "indirection vector"), which can then be copied.

Note that divorce is a process.  Once divorced a context is single and free to function as a context does.  Indeed when the system starts up it loads an image which only contains single contexts.  All contexts are divorced on snapshot.  Divorced contexts are amnesiacs.


So now, finally, we can answer your question.  Why is (Context>>#pc), a method that simply answers an instance variable, not compiled as a quick method with a primitive that answers that variable?  "married" and "widowed" contexts are proxies for stack frames.  Only the instance variables they contain which do not change can be fetched from the context object itself; these are the receiver, arguments, closureOrNil, and method.  The sender, pc, stackp and other stack contents may change as execution proceeds.  These mutable values must be fetched from the stack frame itself, if it still exists.

All accesses to sender, pc, stackp, and stack contents (Context>>at:[put:]) must be mediated.  A married or widowed context has its sender field set to the frame pointer of its stack frame, encoded as a SmallInteger.  You can never see this from the image (except via a special xRay primitive). By following the frame pointer encoded in the sender field the VM can detect if the frame is still live.  If so, the relevant values are synthesized or fetched from the stack frame itself.  If not, the context is widowed and "mourns"; it is changed into a context with a nil sender field, a nil pc, and a stackp that points at the last argument.

How is this mediation done?  One design could be to require that all image level accesses to Context inst vars are through primitives.  This feels contrived to me.  Instead, inst var access is via inst var access bytecodes.  But if we do this naively we would have to check in all inst var access bytecodes whose index is the same as sender (0), pc (1) & stackp (2).  This would slow down inst var access enormously.  Instead we observe that the inst var access bytecodes pushReceiverVariable:, popIntoReceiverVariable: have short forms for the most frequently used (indexes 0 to 15 for push, indexes 0 to 7 for popInto).  By arranging that we only use the long forms for accessing inst vars in Context, and its superclasses and subclasses, we only have to check in the long forms and the check is cheap.  We check that the index is <= 2, because all other inst var accesses with these indices will use the short-form bytecodes.

By arranging that Context's inst vars are *not* accessed by quick primitives we do not have to duplicate this machinery in quick primitive dispatch, which would also slow things down.  Hence Context>>pc is compiled as a vanilla method and its inst var access bytecode is the long form:

25 <E2 01> pushRcvr: 1
27 <5C> returnTop

'Nuff said

Second, Context >> #isPrimFailToken: attracted my attention multiple times when looking at different #timeProfile results of simulation sessions. In the expression [100000 factorial] it takes up more than 44% of the whole execution time!
I have no idea why it could be so slow, but this message is sent really often, which is at least 2 times per simulation of a special message send.
Would there be an easy change to resolve this bottleneck and speed up the simulation by 40% or more?
Would it be possible to provide a primitive for this method? Or do you see any other way to optimize it?

It's not slow so much as that it has to be used all the time to check whether a send has invoked a primitive which has failed.  Its function is to support primitive error codes.  Before primitive error codes (IIRC) we simply checked for nil.  But with primitive error codes we need a container that both reliably distinguishes a primitive failure result from all other objects in the system, and holds onto the primitive failure code.  It may indeed be possible to reduce the overhead by only testing when absolutely necessary.  It may be that reimplementing the scheme is faster.


For example, one could imagine a class whose instances are the only objects to answer #isPrimFailToken, which would be faster than a one-argument message send.  But beware, because if one were debugging the debugger one mustn't confuse a PrimFailToken created in the simulation with a PrimFailToken created by the simulation.
Remember to distinguish between price and value.  Implementing the context-to-stack mapping scheme outlined above was very costly, but its value is very high.  It allows us to have contexts, with all their utility, while reducing their costs significantly.  Similarly, and closely related, being able to simulate execution is costly but hugely valuable.  If the price is 40% of simulated execution then it is a price worth paying.


Happy New Year!

_,,,^..^,,,_
best, Eliot



Re: Two new curious Context/primitive questions :-)

Christoph Thiede

Hi Eliot,


once again I am very pleasantly surprised by your professional and detailed answer which was a joy to read!


'Nuff said

No further questions. This was very interesting!
It also gives me a possible explanation for why one cannot execute an instance of a Context subclass without problems. That's because the marriage between stack frames and context objects is hard-coded based on the Context class (specialObjects at: 11), isn't it?

But beware, because if one was debugging the debugger one mustn't confuse a PrimFailToken created in the simulation for a PrimFailToken created by the simulation.

Hm, aren't we already in a very similar situation? I had planned to write a separate message for this issue, but here is my case:

{Context primitiveFailTokenFor: nil} at: 1. "{an Object . nil}"
Context runSimulated: [{Context primitiveFailTokenFor: nil} at: 1]. "⚡ Error: subscript is out of bounds: 1"

Now one could argue that this is not a very likely situation to ever occur in "production", but the same could be said for a hypothetical #isPrimFailToken selector, and above all, it precludes an ideal, i.e. completely transparent, simulation machinery ...
So I would like to suggest finding a different solution for this problem anyway, provided that you support this idea in general.

An alternative I could imagine would be the SetElement pattern from Collections which is completely transparent. But this would probably be too expensive for simulation purposes (because we would need to instantiate one object per simulated primitive call), right?
Why not encode a failing primitive via a changed control flow? Couldn't we pass an additional failBlock argument to #doPrimitive:method:receiver:args:, #tryPrimitive:withArgs:/#tryNamedPrimitiveIn:for:withArgs:, and #simulateValueWithArguments:caller: (and analogously to ExternalFunction >> #tryInvokeWithArguments: and the critsect primitive methods on Mutex)?
Then we could rewrite #send:to:with:lookupIn: like this (pseudo):

send: selector to: rcvr with: arguments lookupIn: lookupClass
    "Simulate the action of sending a message with selector and arguments to rcvr. The argument, lookupClass, is the class in which to lookup the message. This is the receiver's class for normal messages, but for super messages it will be some specific class related to the source method."

    | meth primIndex primFailCode val ctxt |
    (meth := lookupClass lookupSelector: selector) ifNil:
        [selector == #doesNotUnderstand: ifTrue:
            [self error: 'Recursive message not understood!' translated].
        ^self send: #doesNotUnderstand:
                to: rcvr
                with: {(Message selector: selector arguments: arguments) lookupClass: lookupClass}
                lookupIn: lookupClass].
    
    meth isCompiledMethod ifFalse:
        ["Object as Methods (OaM) protocol: 'The contract is that, when the VM encounters an ordinary object (rather than a compiled method) in the method dictionary during lookup, it sends it the special selector #run:with:in: providing the original selector, arguments, and receiver.'. DOI: 10.1145/2991041.2991062."
        ^self send: #run:with:in:
            to: meth
            with: {selector. arguments. rcvr}].
    
    meth numArgs = arguments size ifFalse:
        [^ self error: ('Wrong number of arguments in simulated message {1}' translated format: {selector})].
    (primIndex := meth primitive) > 0 ifTrue:
-         [val := self doPrimitive: primIndex method: meth receiver: rcvr args: arguments.
-         (self isPrimFailToken: val) ifFalse:
-             [^val]].
+         [val := self doPrimitive: primIndex method: meth receiver: rcvr args: arguments ifFail:
+             [:code | primFailCode := code].
+         primFailCode ifNil: [^ val]].
    
    (selector == #doesNotUnderstand: and: [lookupClass == ProtoObject]) ifTrue:
        [^self error: ('Simulated message {1} not understood' translated format: {arguments first selector})].
    
    ctxt := Context sender: self receiver: rcvr method: meth arguments: arguments.
-     (primIndex isInteger and: [primIndex > 0]) ifTrue:
-         [ctxt failPrimitiveWith: val].
+   primFailCode ifNotNil:
+       [ctxt failPrimitiveWith: primFailCode].
    
    ^ctxt

Current senders of #primitiveFailTokenFor: could then evaluate the failBlock if the primitive fails.
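The proposed control flow can be modeled in Python, with a callback standing in for the Smalltalk fail block. This is only a sketch of the idea in the pseudocode above; do_primitive, PrimitiveFailure, and the other names here are invented, not the actual Squeak selectors.

```python
# Sketch of the proposed failBlock-style control flow.  A Python callback
# stands in for the Smalltalk fail block; all names are invented.

class PrimitiveFailure(Exception):
    def __init__(self, code):
        self.code = code

def do_primitive(prim, args, if_fail):
    """Run a primitive; on failure, answer the fail callback's value
    instead of a sentinel token the caller must test for."""
    try:
        return prim(*args)
    except PrimitiveFailure as failure:
        return if_fail(failure.code)

def simulate_send(prim, args):
    fail_code = None
    def note_failure(code):
        nonlocal fail_code
        fail_code = code
        return None
    val = do_primitive(prim, args, note_failure)
    if fail_code is None:
        return ("answer", val)                       # primitive succeeded
    return ("run fallback bytecodes", fail_code)     # activate the method body

def failing_prim(*args):
    raise PrimitiveFailure("bad argument")

print(simulate_send(lambda a, b: a + b, (3, 4)))   # ('answer', 7)
print(simulate_send(failing_prim, ()))             # ('run fallback bytecodes', 'bad argument')
```

The per-send token test disappears: failure is communicated through the callback, so an ordinary result can be any object at all, with no reserved sentinel to confuse with simulated values.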
Do you think this approach would be worth a try? :-)

Best,
Christoph

Von: Squeak-dev <[hidden email]> im Auftrag von Eliot Miranda <[hidden email]>
Gesendet: Freitag, 1. Januar 2021 22:34:28
An: The general-purpose Squeak developers list
Betreff: Re: [squeak-dev] Two new curious Context/primitive questions :-)
 
errata:

On Fri, Jan 1, 2021 at 1:22 PM Eliot Miranda <[hidden email]> wrote:
Hi Christoph,

    happy new year!!

On Fri, Jan 1, 2021 at 11:28 AM Thiede, Christoph <[hidden email]> wrote:
Hi all, hi Eliot,

new year, new simulation fun, and I have collected two new questions about the Context implementation which I'd love to get answered here:

First, I was confused by the following:
(BlockClosure >> #numArgs) primitive. "266"
(Context >> #pc) primitive. "0"

What makes Context so special that it cannot be compiled with quick accessor methods?

The gory details are here: http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/

But that's full of implementation detail.  Let me try and explain at a higher level.

Contexts are wonderful but very expensive.  Any VM that actually creates contexts for every method or block activation will inevitably run slowly.  There are two overheads here.  One is the basic overhead of allocating and reclaiming the contexts that are created.  The other is the cost of moving the outgoing receiver and arguments from the sending context to the new context.  A fast VM must avoid these overheads, and the key solution is to use a stack frame organization as is used in conventional language implementations.  The key here is that with stack frames the receiver and incoming arguments do not have to be moved; they stay where they are and a new frame is built underneath them which includes them as the incoming receiver and arguments; stack frames overlap outgoing and incoming arguments.

But crucially, in addition the VM must still create contexts if and when they are needed.  If not, this would no longer be a Smalltalk-80 implementation.  Now one could imagine a VM which converted stack frames to contexts whenever a method or block activation was accessed using thisContext (or thisContext sender etc which access contexts by following the sender field).  But that would be very slow.  Instead, the fastest approach is to try and keep stack frames as stack frames for as long as possible.  For this to work contexts have to be able to function as "proxy objects" for stack frames.  The VM does indeed create contexts when one uses thisContext, or even when one creates a block, but internally the context is in a very special state, a state in which it points at a stack frame.

Some terminology is useful.  Peter Deutsch used "volatile", "hybrid" and "stable".  I like "single", "married", "widowed" and "divorced" (Peter's scheme didn't have "widowed" which is the key to why my scheme is faster, and why the closure implementation is as it is).

When a stack frame is created it is "single".  It has a slot within it to refer to a context object if it is needed, but the slot is nil.
When a stack frame needs a context object one is created and the two are "married".
When a method or block returns its stack frame is discarded; the stack frame has died and if it was married its context becomes "widowed". (*)
When space is needed for new stack frames all reclaimed stack frames are "divorced" from their contexts.  The stack frame state is written to the context object and the context object no longer acts as a proxy; it is a fully fledged normal stack object. (*)

(*) Stack frames exist in a small memory zone organized as stack pages, the stack zone.  The size of the stack zone is set at startup, and its size is either the VM's default or set by a value that persists in the image header (see vmParameterAt: 43).  Since the system may run out of stack pages (a deep call stack, many active processes), the VM makes new pages available by purging the least recently used pages to the heap in the form of context objects.  All the frames in the least recently used stack page are divorced at the same time. A context must be created for each frame that is not yet married. A page is big enough to hold on average about 40 method activations; since activations are of different sizes and change size as execution proceeds, the number of activations per full page varies.

(*) The reason why Peter's original scheme is slower is that in HPS at return time the VM has to check for a frame being hybrid and copy state to the hybrid context, making it stable.  This has to be done because a closure's outer temporary variables are stored in the lexically enclosing stack frame, and so have to be made to persist in the stable context or their values will be stale.  In my (lisp-derived) closure implementation there are no such references to outer temporary variables; values are either copied if they do not change, or stored in an Array (the "indirection vector"), which can then be copied.

Note that divorce is a process.  Once divorced a context is single and free to function as a context does.  Indeed when the system starts up it loads an image which only contains single contexts.  All contexts are divorced on snapshot.  Divorced contexts are amnesiacs.


So now, finally, we can answer your question.  Why is (Context>>#pc), a method that simply answers an instance variable, not compiled as a quick method with a primitive that answers that variable?  "married" and "widowed" contexts are proxies for stack frames.  Only the instance variables they contain which do not change can be fetched from the context object itself; these are the receiver, arguments, closureOrNil, and method. sender, pc, stackp and other stack contents may change as execution proceeds. These mutable values must be fetched from the stack frame itself, if it still exists.

All accesses to sender, pc, stackp, and stack contents (Context>>at:[put:]) must be mediated.  A married or widowed context has its sender field set to the frame pointer of its stack frame, encoded as a SmallInteger.  You can never see this from the image (except via a special xRay primitive). By following the frame pointer encoded in the sender field the VM can detect if the frame is still live.  If so, the relevant values are synthesized or fetched from the stack frame itself.  If not, the context is widowed and "mourns"; it is changed into a context with a nil sender field, a nil pc, and a stackp that points at the last argument.

How is this mediation done?  One design could be to require that all image level accesses to Context inst vars are through primitives.  This feels contrived to me.  Instead, inst var access is via inst var access bytcodes.  But if we do this naively we would have to check in all inst var access bytecodes whose index is the same as sender (0), pc (1) & stackp (2).  This would slow down inst var access enormously.  Instead we observe that inst var access bytecodes pushReceiverVariable:, popIntoReceiverVariable: have short forms for the most frequently used (indexes 0 to 15 for push, indexes 0 to 7 for popInto).  By arranging that we only use the long forms for accessing inst vars in Context, and superclasses and subclasses, we only have to check in the long forms and the check is cheap.  We check that the index is <= 2, because all other inst var accesses with these indices will use the short-form bytecodes.

By arranging that Context's inst vars are *not* accessed by quick primitives we do not have to duplicate this machinery in quick primitive dispatch, which would also slow things down.  Hence Context>>pc is compiled as a vanilla method and its inst var access bytecode is the long form:

25 <E2 01> pushRcvr: 1
27 <5C> returnTop
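The difference is also visible from the image, assuming Squeak's CompiledMethod>>#isQuick:

```smalltalk
(BlockClosure >> #numArgs) isQuick.  "true  - quick primitive 266"
(Context >> #pc) isQuick.            "false - vanilla method, primitive 0"
```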

'Nuff said

Second, Context >> #isPrimFailToken: attracted my attention multiple times when looking at different #timeProfile results of simulation sessions. In the expression [100000 factorial] it takes up more than 44% of the whole execution time!
I have no idea why it could be so slow, but this message is sent really often, which is at least 2 times per simulation of a special message send.
Would there be an easy change to resolve this bottleneck and speed up the simulation by 40% or more?
Would it be possible to provide a primitive for this method? Or do you see any other way to optimize it?

It's not slow so much as that it has to be used all the time to check whether a send has invoked a primitive which has failed.  Its function is to support primitive error codes.  Before primitive error codes (IIRC) we simply checked for nil.  But with primitive error codes we need a container that both reliably distinguishes a primitive failure result from all other objects in the system, and holds onto the primitive failure code.  It may indeed be possible to reduce the overhead by only testing when absolutely necessary.  It may also be that reimplementing the scheme would make it faster.
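Concretely, the container is a two-element Array holding a unique marker object and the error code, as you can see by inspecting one in a workspace:

```smalltalk
"First slot: the unique marker object; second slot: the primitive error code."
Context primitiveFailTokenFor: #'bad argument'.  "{an Object . #'bad argument'}"
```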


For example, one could imagine a class whose instances are the only objects to answer true to #isPrimFailToken, which would be faster than a one-argument message send.  But beware, because if one were debugging the debugger one mustn't confuse a PrimFailToken created in the simulation with a PrimFailToken created by the simulation.
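A sketch of that idea, with invented names (a hypothetical class plus a default testing selector on Object):

```smalltalk
Object subclass: #SimulationPrimFailure
	instanceVariableNames: 'errorCode'
	classVariableNames: ''
	category: 'Kernel-Methods'

SimulationPrimFailure >> isPrimFailToken
	"Only instances of this class answer true."
	^true

SimulationPrimFailure >> errorCode
	^errorCode

Object >> isPrimFailToken
	"Default for every other object in the system."
	^false
```

The test then becomes the unary send `val isPrimFailToken` rather than `self isPrimFailToken: val`, saving the argument push on every check.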
Remember to distinguish between price and value.  Implementing the context-to-stack mapping scheme outlined above was very costly, but its value is very high.  It allows us to have contexts, with all their utility, while reducing their costs significantly.  Similarly, and closely related, being able to simulate execution is costly but hugely valuable.  If the price is 40% of simulated execution then it is a price worth paying.

Best,
Christoph

Happy New Year!

_,,,^..^,,,_
best, Eliot


Carpe Squeak!
Re: Two new curious Context/primitive questions :-)

Eliot Miranda-2
Hi Christoph,

On Fri, Jan 1, 2021 at 4:18 PM Thiede, Christoph <[hidden email]> wrote:

Hi Eliot,


once again I am very pleasantly surprised by your professional and detailed answer which was a joy to read!


'Nuff said

No further questions. This was very interesting!
It also gives me a possible explanation for why one cannot execute an instance of a Context subclass without any problems. That's because the marriage between stack frames and context objects is hard-coded based on the Context class (specialObjects at: 11), isn't it?

Indeed. One may be able to go the other way: create a stack of contexts whose class is different.  These should retain their class, because the associated stack frames will be married to them.  But the direction you describe is not supported.
But beware, because if one was debugging the debugger one mustn't confuse a PrimFailToken created in the simulation for a PrimFailToken created by the simulation.

Hm, aren't we already in a very similar situation? I had planned to write a separate message for this issue, but here is my case:

{Context primitiveFailTokenFor: nil} at: 1. "{an Object . nil}"
Context runSimulated: [{Context primitiveFailTokenFor: nil} at: 1]. "⚡ Error: subscript is out of bounds: 1"

Now one could argue that this is not a very likely situation to occur ever in "production", but the same could be said for a hypothetical #isPrimFailToken selector, and above all, it precludes an ideal, i.e. completely transparent simulation machinery ...
So I would like to suggest finding a different solution for this problem anyway, provided that you support this idea in general.

An alternative I could imagine would be the SetElement pattern from Collections which is completely transparent. But this would probably be too expensive for simulation purposes (because we would need to instantiate one object per simulated primitive call), right?

Not sure.  If instantiation can be deferred until failure then things should be OK.
Why not encode a failing primitive via a changed control flow? Couldn't we pass an additional failBlock argument to #doPrimitive:method:receiver:args:, #tryPrimitive:withArgs:/#tryNamedPrimitiveIn:for:withArgs:, and #simulateValueWithArguments:caller: (and analogously to ExternalFunction >> #tryInvokeWithArguments: and the critsect primitive methods on Mutex)?

We can, but remember that passing a block is in fact hiding an allocation, maybe two.  Mentioning a block in code causes it to be allocated and if not already done so, causes allocation of the lexically enclosing context.
Then we could rewrite #send:to:with:lookupIn: like this (pseudo):

send: selector to: rcvr with: arguments lookupIn: lookupClass
    "Simulate the action of sending a message with selector and arguments to rcvr. The argument, lookupClass, is the class in which to lookup the message. This is the receiver's class for normal messages, but for super messages it will be some specific class related to the source method."

    | meth primIndex primFailCode val ctxt |
    (meth := lookupClass lookupSelector: selector) ifNil:
        [selector == #doesNotUnderstand: ifTrue:
            [self error: 'Recursive message not understood!' translated].
        ^self send: #doesNotUnderstand:
                to: rcvr
                with: {(Message selector: selector arguments: arguments) lookupClass: lookupClass}
                lookupIn: lookupClass].
    
    meth isCompiledMethod ifFalse:
        ["Object as Methods (OaM) protocol: 'The contract is that, when the VM encounters an ordinary object (rather than a compiled method) in the method dictionary during lookup, it sends it the special selector #run:with:in: providing the original selector, arguments, and receiver.'. DOI: 10.1145/2991041.2991062."
        ^self send: #run:with:in:
            to: meth
            with: {selector. arguments. rcvr}].
    
    meth numArgs = arguments size ifFalse:
        [^ self error: ('Wrong number of arguments in simulated message {1}' translated format: {selector})].
    (primIndex := meth primitive) > 0 ifTrue:
-         [val := self doPrimitive: primIndex method: meth receiver: rcvr args: arguments.
-         (self isPrimFailToken: val) ifFalse:
-             [^val]].
+         [val := self doPrimitive: primIndex method: meth receiver: rcvr args: arguments ifFail:
+             [:code | primFailCode := code].
+         primFailCode ifNil: [^ val]].
    
    (selector == #doesNotUnderstand: and: [lookupClass == ProtoObject]) ifTrue:
        [^self error: ('Simulated message {1} not understood' translated format: {arguments first selector})].
    
    ctxt := Context sender: self receiver: rcvr method: meth arguments: arguments.
-     (primIndex isInteger and: [primIndex > 0]) ifTrue:
-         [ctxt failPrimitiveWith: val].
+   primFailCode ifNotNil:
+       [ctxt failPrimitiveWith: primFailCode].
    
    ^ctxt

Current senders of #primitiveFailTokenFor: could then evaluate the failBlock if the primitive fails.
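A corresponding failure path in, say, #doPrimitive:method:receiver:args:ifFail: could be sketched like this (hypothetical, reusing the existing token machinery internally):

```smalltalk
doPrimitive: primitiveIndex method: meth receiver: receiver args: arguments ifFail: failBlock
	"Hypothetical sketch: answer the primitive's value, or evaluate failBlock
	 with the error code instead of answering a PrimFailToken to the caller."
	| value |
	value := receiver tryPrimitive: primitiveIndex withArgs: arguments.
	(self isPrimFailToken: value) ifTrue:
		[^failBlock value: (value at: 2)].  "second slot holds the error code"
	^value
```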
Do you think this approach would be worth a try? :-)

Definitely.  It might eliminate multiple tests.  I think you should definitely try it.  It looks nice to me.  Thank you!

Best,
Christoph


--
_,,,^..^,,,_
best, Eliot



Re: Two new curious Context/primitive questions :-)

Christoph Thiede

Hi Eliot,


thanks for the feedback! This is an exciting task; I'll put it on my list and I'm looking forward to tackling it ... :-)


We can, but remember that passing a block is in fact hiding an allocation, maybe two.  Mentioning a block in code causes it to be allocated and if not already done so, causes allocation of the lexically enclosing context.


That's correct, but I believe it could be possible to forgo closure variables entirely and use only block arguments, block returns, and method returns. This should be relatively fast, shouldn't it?

Best,
Christoph

Von: Squeak-dev <[hidden email]> im Auftrag von Eliot Miranda <[hidden email]>
Gesendet: Samstag, 2. Januar 2021 02:45:53
An: The general-purpose Squeak developers list
Betreff: Re: [squeak-dev] Two new curious Context/primitive questions :-)
 
Hi Christoph,

On Fri, Jan 1, 2021 at 4:18 PM Thiede, Christoph <[hidden email]> wrote:

Hi Eliot,


once again I am very pleasantly surprised by your professional and detailed answer which was a joy to read!


'Nuff said

No further questions. This was very interesting!
It also gives me a possible explanation for why one cannot execute a context subinstance without any problems. That's because the marriage between stack frame and context objects is hard-coded based on the Context class (specialObjects at: 11), isn't it?

Indeed. One may be able to go the other way, create a stack of contexts whose class is different.  These should retain their class because the associated stack frames will be married to them.  But going the other way is not supported.
But beware, because if one was debugging the debugger one mustn't confuse a PrimFailToken created in the simulation for a PrimFailToken created by the simulation.

Hm, aren't we already in a very similar situation? I had planned to write a separate message for this issue, but here is my case:

{Context primitiveFailTokenFor: nil} at: 1. "{an Object . nil}"
Context runSimulated: [{Context primitiveFailTokenFor: nil} at: 1]. "⚡ Error: subscript is out of bounds: 1"

Now one could argue that this is not a very likely situation to occur ever in "production", but the same could be said for a hypothetical #isPrimFailToken selector, and above all, it precludes an ideal, i.e. completely transparent simulation machinery ...
So I would like to suggest to find a different solution for this problem anyway, provided that you support this idea in general.

An alternative I could imagine would be the SetElement pattern from Collections which is completely transparent. But this would probably be too expensive for simulation purposes (because we would need to instantiate one object per simulated primitive call), right?

Not sure.  If instantiation can be deferred until failure then things should be OK.
Why not encode a failing primitive via a changed control flow? Couldn't we pass an additional failBlock argument to #doPrimitive:method:receiver:args:, #tryPrimitive:withArgs:/#tryNamedPrimitiveIn:for:withArgs:, and #simulateValueWithArguments:caller: (and analogously to ExternalFunction >> #tryInvokeWithArguments: and the critsect primitive methods on Mutex)?

We can, but remember that passing a block is in fact hiding an allocation, maybe two.  Mentioning a block in code causes it to be allocated and if not already done so, causes allocation of the lexically enclosing context.
Then we could rewrite #send:to:with:lookupIn: like this (pseudo):

send: selector to: rcvr with: arguments lookupIn: lookupClass
    "Simulate the action of sending a message with selector and arguments to rcvr. The argument, lookupClass, is the class in which to lookup the message. This is the receiver's class for normal messages, but for super messages it will be some specific class related to the source method."

    | meth primFailCode val ctxt |
    (meth := lookupClass lookupSelector: selector) ifNil:
        [selector == #doesNotUnderstand: ifTrue:
            [self error: 'Recursive message not understood!' translated].
        ^self send: #doesNotUnderstand:
                to: rcvr
                with: {(Message selector: selector arguments: arguments) lookupClass: lookupClass}
                lookupIn: lookupClass].
    
    meth isCompiledMethod ifFalse:
        ["Object as Methods (OaM) protocol: 'The contract is that, when the VM encounters an ordinary object (rather than a compiled method) in the method dictionary during lookup, it sends it the special selector #run:with:in: providing the original selector, arguments, and receiver.'. DOI: 10.1145/2991041.2991062."
        ^self send: #run:with:in:
            to: meth
            with: {selector. arguments. rcvr}].
    
    meth numArgs = arguments size ifFalse:
        [^ self error: ('Wrong number of arguments in simulated message {1}' translated format: {selector})].
    (primIndex := meth primitive) > 0 ifTrue:
-         [val := self doPrimitive: primIndex method: meth receiver: rcvr args: arguments.
-         (self isPrimFailToken: val) ifFalse:
-             [^val]].
+         [val := self doPrimitive: primIndex method: meth receiver: rcvr args: arguments ifFail:
+             [:code | primFailCode := code].
+         primFailCode ifNil: [^ val]].
    
    (selector == #doesNotUnderstand: and: [lookupClass == ProtoObject]) ifTrue:
        [^self error: ('Simulated message {1} not understood' translated format: {arguments first selector})].
    
    ctxt := Context sender: self receiver: rcvr method: meth arguments: arguments.
-     (primIndex isInteger and: [primIndex > 0]) ifTrue:
-         [ctxt failPrimitiveWith: val].
+   primFailCode ifNotNil:
+       [ctxt failPrimitiveWith: primFailCode].
    
    ^ctxt

Current senders of #primitiveFailTokenFor: could then evaluate the failBlock if the primitive fails.
Do you think this approach would be worth a try? :-)

Definitely.  It might eliminate multiple tests.  I think you should definitely try it.  It looks nice to me.  Thank you!

Best,
Christoph

Von: Squeak-dev <[hidden email]> im Auftrag von Eliot Miranda <[hidden email]>
Gesendet: Freitag, 1. Januar 2021 22:34:28
An: The general-purpose Squeak developers list
Betreff: Re: [squeak-dev] Two new curious Context/primitive questions :-)
 
errata:

On Fri, Jan 1, 2021 at 1:22 PM Eliot Miranda <[hidden email]> wrote:
Hi Christoph,

    happy new year!!

On Fri, Jan 1, 2021 at 11:28 AM Thiede, Christoph <[hidden email]> wrote:
Hi all, hi Eliot,

new year, new simulation fun, and I have collected two new questions about the Context implementation which I'd love to get answered here:

First, I was confused by the following:
(BlockClosure >> #numArgs) primitive. "266"
(Context >> #pc) primitive. "0"

What makes Context so special that it cannot be compiled with quick accessor methods?

The gory details are here: http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/

But that's full of implementation detail.  Let me try and explain at a higher level.

Contexts are wonderful but very expensive.  Any VM that actually creates contexts for every method or block activation will inevitably run slowly.  There are two overheads here.  One is the basic overhead of allocating and reclaiming the contexts that are created.  The other is the cost of moving the outgoing receiver and arguments from the sending context to the new context.  A fast VM must avoid these overheads, and the key solution is to use a stack frame organization as is used in conventional language implementations.  The key here is that with stack frames the receiver and incoming arguments do not have to be moved; they stay where they are and a new frame is built underneath them which includes them as the incoming receiver and arguments; stack frames overlap outgoing and incoming arguments.

But crucially, in addition the VM must still create contexts if and when they are needed.  If not, this would no longer be a Smalltalk-80 implementation.  Now one could imagine a VM which converted stack frames to contexts whenever a method or block activation was accessed using thisContext (or thisContext sender etc which access contexts by following the sender field).  But that would be very slow.  Instead, the fastest approach is to try and keep stack frames as stack frames for as long as possible.  For this to work contexts have to be able to function as "proxy objects" for stack frames.  The VM does indeed create contexts when one uses thisContext, or even when one creates a block, but internally the context is in a very special state, a state in which it points at a stack frame.

Some terminology is useful.  Peter Deutsch used "volatile", "hybrid" and "stable".  I like "single", "married", "widowed" and "divorced" (Peter's scheme didn't have "widowed" which is the key to why my scheme is faster, and why the closure implementation is as it is).

When a stack frame is created it is "single".  It has a slot within it to refer to a context object if it is needed, but the slot is nil.
When a stack frame needs a context object one is created and the two are "married".
When a method or block returns its stack frame is discarded; the stack frame has died and if it was married its context becomes "widowed". (*)
When space is needed for new stack frames all reclaimed stack frames are "divorced" from their contexts.  The stack frame state is written to the context object and the context object no longer acts as a proxy; it is a fully fledged normal stack object. (*)

(*) Stack frames exist in a small memory zone organized as stack pages, the stack zone.  The size of the stack zone is set at startup, and its size is either the VM's default or set by a value that persists in the image header (see vmParameterAt: 43).  Since the system may run out of stack pages (a deep call stack, many active processes), the VM makes new pages available by purging the least recently used pages to the heap in the form of context objects.  All the frames in the least recently used stack page are divorced at the same time. A context must be created for each frame that is not yet married. A page is big enough to hold on average about 40 method activations; since activations are of different sizes and change size as execution proceeds, the number of activations per full page varies.

(*) The reason why Peter's original scheme is slower is that in HPS at return time the VM has to check for a frame being hybrid and copy state to the hybrid context, making it stable.  This has to be done because a closure's outer temporary variables are stored in the lexically enclosing stack frame, and so have to be made to persist in the stable context or their values will be stale.  In my (lisp-derived) closure implementation there are no such references to outer temporary variables; values are either copied if they do not change, or stored in an Array (the "indirection vector"), which can then be copied.

Note that divorce is a process.  Once divorced a context is single and free to function as a context does.  Indeed when the system starts up it loads an image which only contains single contexts.  All contexts are divorced on snapshot.  Divorced contexts are amnesiacs.


So now, finally, we can answer your question.  Why is (Context>>#pc), a method that simply answers an instance variable, not compiled as a quick method with a primitive that answers that variable?  "married" and "widowed" contexts are proxies for stack frames.  Only the instance variables they contain which do not change can be fetched from the context object itself; these are the receiver, arguments, closureOrNil, and method. sender, pc, stackp and other stack contents may change as execution proceeds. These mutable values must be fetched from the stack frame itself, if it still exists.

All accesses to sender, pc, stackp, and stack contents (Context>>at:[put:]) must be mediated.  A married or widowed context has its sender field set to the frame pointer of its stack frame, encoded as a SmallInteger.  You can never see this from the image (except via a special xRay primitive). By following the frame pointer encoded in the sender field the VM can detect if the frame is still live.  If so, the relevant values are synthesized or fetched from the stack frame itself.  If not, the context is widowed and "mourns"; it is changed into a context with a nil sender field, a nil pc, and a stackp that points at the last argument.

How is this mediation done?  One design could be to require that all image level accesses to Context inst vars are through primitives.  This feels contrived to me.  Instead, inst var access is via inst var access bytcodes.  But if we do this naively we would have to check in all inst var access bytecodes whose index is the same as sender (0), pc (1) & stackp (2).  This would slow down inst var access enormously.  Instead we observe that inst var access bytecodes pushReceiverVariable:, popIntoReceiverVariable: have short forms for the most frequently used (indexes 0 to 15 for push, indexes 0 to 7 for popInto).  By arranging that we only use the long forms for accessing inst vars in Context, and superclasses and subclasses, we only have to check in the long forms and the check is cheap.  We check that the index is <= 2, because all other inst var accesses with these indices will use the short-form bytecodes.

By arranging that Context's inst vars are *not* accessed by quick primitives we do not have to duplicate this machinery in quick primitive dispatch, which would also slow things down.  Hence Context>>pc is compiled as a vanilla method and its inst var access bytecode is the long form:

25 <E2 01> pushRcvr: 1
27 <5C> returnTop

'Nuff said

Second, Context >> #isPrimFailToken: attracted my attention multiple times when looking at different #timeProfile results of simulation sessions. In the expression [100000 factorial] it takes up more than 44% of the whole execution time!
I have no idea why it could be so slow, but this message is sent really often, which is at least 2 times per simulation of a special message send.
Would there be an easy change to resolve this bottleneck and speed up the simulation by 40% or more?
Would it be possible to provide a primitive for this method? Or do you see any other way to optimize it?

It's not slow so much as that it has to be used all the time to check whether a send has invoked a primitive which has failed.  Its function is to support primitive error codes.  Before primitive error codes (IIRC) we simply checked for nil.  But with primitive error codes  we need a container that both reliably distinguishes a primitive failure result from all other objects in the system, and holds onto the primitive failure code.  It may indeed be able to reduce the overhead by only testing when absolutely necessary.  It may be that reimplementing the scheme is faster.


For example, one could imagine a class whose instances are the only objects to answer #isPrimFailToken, which would be faster than a one-argument message send.  But beware: if one is debugging the debugger, one mustn't confuse a PrimFailToken created in the simulation with a PrimFailToken created by the simulation.
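A minimal sketch of that idea (all names here are hypothetical, not the current implementation):

```smalltalk
"By default nothing is a fail token, so the common case is a fast
 zero-argument send answering false:"
Object >> isPrimFailToken
	^false

"A hypothetical wrapper class; only its instances answer true, and
 each carries its primitive error code:"
Object subclass: #PrimFailToken
	instanceVariableNames: 'errorCode'
	classVariableNames: ''
	category: 'Kernel-Methods'.

PrimFailToken >> isPrimFailToken
	^true

PrimFailToken >> errorCode
	^errorCode
```

The simulator would then test `result isPrimFailToken` rather than `self isPrimFailToken: result`, though it would still need some way to tell tokens created in the simulation from those created by the simulation.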
Remember to distinguish between price and value.  Implementing the context-to-stack mapping scheme outlined above was very costly, but its value is very high.  It allows us to have contexts, with all their utility, while reducing their costs significantly.  Similarly, and closely related, being able to simulate execution is costly but hugely valuable.  If the price is 40% of simulated execution then it is a price worth paying.

Best,
Christoph

Happy New Year!

_,,,^..^,,,_
best, Eliot





Carpe Squeak!

Re: Two new curious Context/primitive questions :-)

codefrau
In reply to this post by Eliot Miranda-2
This was a very interesting read - thank you Eliot! That should go in some book :)

Another overview is in the class comment for MaybeContextInstanceVariableNode. Which I found by trying to figure out how the heck the compiler knows to use the long form for inst var refs in Context. 

Vanessa





Re: Two new curious Context/primitive questions :-)

frank.lesser
In reply to this post by Christoph Thiede

Hi Christoph,

yes this optimization is possible -

IMO Pharo/Squeak has a CleanBlockClosure -

I used the term red & green BlockClosure ( demonstrated at a customer in 2010  )

- implemented in LSWGVM for more than a decade.

Frank
On 1/6/2021 01:16, Thiede, Christoph wrote:

Hi Eliot,


thanks for the feedback! This is an exciting task and I'll put it onto my list, I'm looking forward to tackling it ... :-)


We can, but remember that passing a block is in fact hiding an allocation, maybe two.  Mentioning a block in code causes it to be allocated and if not already done so, causes allocation of the lexically enclosing context.


That's correct, but I believe it could be possible to avoid any closure variables and use only block arguments, block returns, and method returns. This should be relatively fast, shouldn't it?
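For illustration, the difference is visible in ordinary code:

```smalltalk
"A block that uses only its arguments (no ^-return, no self, no
 outer temps) needs neither an indirection vector nor a reference
 to its enclosing context:"
double := [:x | x * 2].

"Writing to an outer temp forces the compiler to hoist 'sum' into
 an indirection vector (an Array) shared by method and block:"
sum := 0.
#(1 2 3) do: [:each | sum := sum + each].
```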

Best,
Christoph

From: Squeak-dev [hidden email] on behalf of Eliot Miranda [hidden email]
Sent: Saturday, January 2, 2021 02:45:53
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Two new curious Context/primitive questions :-)
Hi Christoph,

On Fri, Jan 1, 2021 at 4:18 PM Thiede, Christoph <[hidden email]> wrote:

Hi Eliot,


once again I am very pleasantly surprised by your professional and detailed answer which was a joy to read!


'Nuff said

No further questions. This was very interesting!
It also gives me a possible explanation for why one cannot execute an instance of a Context subclass without problems. That's because the marriage between stack frames and context objects is hard-coded to the Context class (specialObjects at: 11), isn't it?

Indeed. One may be able to go the other way, create a stack of contexts whose class is different.  These should retain their class because the associated stack frames will be married to them.  But going the other way is not supported.
But beware, because if one was debugging the debugger one mustn't confuse a PrimFailToken created in the simulation for a PrimFailToken created by the simulation.

Hm, aren't we already in a very similar situation? I had planned to write a separate message for this issue, but here is my case:

{Context primitiveFailTokenFor: nil} at: 1. "{an Object . nil}"
Context runSimulated: [{Context primitiveFailTokenFor: nil} at: 1]. "⚡ Error: subscript is out of bounds: 1"

Now one could argue that this is not a very likely situation to occur ever in "production", but the same could be said for a hypothetical #isPrimFailToken selector, and above all, it precludes an ideal, i.e. completely transparent simulation machinery ...
So I would like to suggest to find a different solution for this problem anyway, provided that you support this idea in general.

An alternative I could imagine would be the SetElement pattern from Collections which is completely transparent. But this would probably be too expensive for simulation purposes (because we would need to instantiate one object per simulated primitive call), right?

Not sure.  If instantiation can be deferred until failure then things should be OK.
Why not encode a failing primitive via a changed control flow? Couldn't we pass an additional failBlock argument to #doPrimitive:method:receiver:args:, #tryPrimitive:withArgs:/#tryNamedPrimitiveIn:for:withArgs:, and #simulateValueWithArguments:caller: (and analogously to ExternalFunction >> #tryInvokeWithArguments: and the critsect primitive methods on Mutex)?

We can, but remember that passing a block is in fact hiding an allocation, maybe two.  Mentioning a block in code causes it to be allocated and if not already done so, causes allocation of the lexically enclosing context.
Then we could rewrite #send:to:with:lookupIn: like this (pseudo):

send: selector to: rcvr with: arguments lookupIn: lookupClass
    "Simulate the action of sending a message with selector and arguments to rcvr. The argument, lookupClass, is the class in which to lookup the message. This is the receiver's class for normal messages, but for super messages it will be some specific class related to the source method."

    | meth primIndex primFailCode val ctxt |
    (meth := lookupClass lookupSelector: selector) ifNil:
        [selector == #doesNotUnderstand: ifTrue:
            [self error: 'Recursive message not understood!' translated].
        ^self send: #doesNotUnderstand:
                to: rcvr
                with: {(Message selector: selector arguments: arguments) lookupClass: lookupClass}
                lookupIn: lookupClass].
    
    meth isCompiledMethod ifFalse:
        ["Object as Methods (OaM) protocol: 'The contract is that, when the VM encounters an ordinary object (rather than a compiled method) in the method dictionary during lookup, it sends it the special selector #run:with:in: providing the original selector, arguments, and receiver.'. DOI: 10.1145/2991041.2991062."
        ^self send: #run:with:in:
            to: meth
            with: {selector. arguments. rcvr}].
    
    meth numArgs = arguments size ifFalse:
        [^ self error: ('Wrong number of arguments in simulated message {1}' translated format: {selector})].
    (primIndex := meth primitive) > 0 ifTrue:
-         [val := self doPrimitive: primIndex method: meth receiver: rcvr args: arguments.
-         (self isPrimFailToken: val) ifFalse:
-             [^val]].
+         [val := self doPrimitive: primIndex method: meth receiver: rcvr args: arguments ifFail:
+             [:code | primFailCode := code].
+         primFailCode ifNil: [^ val]].
    
    (selector == #doesNotUnderstand: and: [lookupClass == ProtoObject]) ifTrue:
        [^self error: ('Simulated message {1} not understood' translated format: {arguments first selector})].
    
    ctxt := Context sender: self receiver: rcvr method: meth arguments: arguments.
-     (primIndex isInteger and: [primIndex > 0]) ifTrue:
-         [ctxt failPrimitiveWith: val].
+   primFailCode ifNotNil:
+       [ctxt failPrimitiveWith: primFailCode].
    
    ^ctxt

Current senders of #primitiveFailTokenFor: could then evaluate the failBlock if the primitive fails.
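A sketch of how such a sender might change (again pseudocode; the real #doPrimitive:method:receiver:args: does much more, and the way the error code is extracted from the token below is only an assumption):

```smalltalk
doPrimitive: primIndex method: meth receiver: rcvr args: args ifFail: failBlock
	"Attempt the primitive. On failure, evaluate failBlock with the
	 primitive error code instead of answering a fail token."
	| result |
	result := rcvr tryPrimitive: primIndex withArgs: args.
	(self isPrimFailToken: result) ifTrue:
		[^failBlock value: result last].	"assumes the code sits in the token's last slot"
	^result
```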
Do you think this approach would be worth a try? :-)

Definitely.  It might eliminate multiple tests.  I think you should definitely try it.  It looks nice to me.  Thank you!

Best,
Christoph

From: Squeak-dev <[hidden email]> on behalf of Eliot Miranda <[hidden email]>
Sent: Friday, January 1, 2021 22:34:28
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Two new curious Context/primitive questions :-)
errata:

On Fri, Jan 1, 2021 at 1:22 PM Eliot Miranda <[hidden email]> wrote:
Hi Christoph,

    happy new year!!

On Fri, Jan 1, 2021 at 11:28 AM Thiede, Christoph <[hidden email]> wrote:
Hi all, hi Eliot,

new year, new simulation fun, and I have collected two new questions about the Context implementation which I'd love to get answered here:

First, I was confused by the following:
(BlockClosure >> #numArgs) primitive. "266"
(Context >> #pc) primitive. "0"
What makes Context so special that it cannot be compiled with quick accessor methods?

The gory details are here: http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/

But that's full of implementation detail.  Let me try and explain at a higher level.

Contexts are wonderful but very expensive.  Any VM that actually creates contexts for every method or block activation will inevitably run slowly.  There are two overheads here.  One is the basic overhead of allocating and reclaiming the contexts that are created.  The other is the cost of moving the outgoing receiver and arguments from the sending context to the new context.  A fast VM must avoid these overheads, and the key solution is to use a stack frame organization as is used in conventional language implementations.  The key here is that with stack frames the receiver and incoming arguments do not have to be moved; they stay where they are and a new frame is built underneath them which includes them as the incoming receiver and arguments; stack frames overlap outgoing and incoming arguments.

But crucially, in addition the VM must still create contexts if and when they are needed.  If not, this would no longer be a Smalltalk-80 implementation.  Now one could imagine a VM which converted stack frames to contexts whenever a method or block activation was accessed using thisContext (or thisContext sender etc which access contexts by following the sender field).  But that would be very slow.  Instead, the fastest approach is to try and keep stack frames as stack frames for as long as possible.  For this to work contexts have to be able to function as "proxy objects" for stack frames.  The VM does indeed create contexts when one uses thisContext, or even when one creates a block, but internally the context is in a very special state, a state in which it points at a stack frame.

Some terminology is useful.  Peter Deutsch used "volatile", "hybrid" and "stable".  I like "single", "married", "widowed" and "divorced" (Peter's scheme didn't have "widowed" which is the key to why my scheme is faster, and why the closure implementation is as it is).

When a stack frame is created it is "single".  It has a slot within it to refer to a context object if it is needed, but the slot is nil.
When a stack frame needs a context object one is created and the two are "married".
When a method or block returns its stack frame is discarded; the stack frame has died and if it was married its context becomes "widowed". (*)
When space is needed for new stack frames all reclaimed stack frames are "divorced" from their contexts.  The stack frame state is written to the context object and the context object no longer acts as a proxy; it is a fully fledged normal stack object. (*)

(*) Stack frames exist in a small memory zone organized as stack pages, the stack zone.  The size of the stack zone is set at startup, and its size is either the VM's default or set by a value that persists in the image header (see vmParameterAt: 43).  Since the system may run out of stack pages (a deep call stack, many active processes), the VM makes new pages available by purging the least recently used pages to the heap in the form of context objects.  All the frames in the least recently used stack page are divorced at the same time. A context must be created for each frame that is not yet married. A page is big enough to hold on average about 40 method activations; since activations are of different sizes and change size as execution proceeds, the number of activations per full page varies.
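From a workspace this looks like the following (the put: value is arbitrary, chosen only for illustration):

```smalltalk
"Read the persisted stack zone size setting:"
Smalltalk vmParameterAt: 43.

"Persist a different value in the image header; since the zone is
 sized at startup, it takes effect on the next launch:"
Smalltalk vmParameterAt: 43 put: 160.
```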

(*) The reason why Peter's original scheme is slower is that in HPS at return time the VM has to check for a frame being hybrid and copy state to the hybrid context, making it stable.  This has to be done because a closure's outer temporary variables are stored in the lexically enclosing stack frame, and so have to be made to persist in the stable context or their values will be stale.  In my (lisp-derived) closure implementation there are no such references to outer temporary variables; values are either copied if they do not change, or stored in an Array (the "indirection vector"), which can then be copied.

Note that divorce is a process.  Once divorced a context is single and free to function as a context does.  Indeed when the system starts up it loads an image which only contains single contexts.  All contexts are divorced on snapshot.  Divorced contexts are amnesiacs.


So now, finally, we can answer your question.  Why is (Context>>#pc), a method that simply answers an instance variable, not compiled as a quick method with a primitive that answers that variable?  "married" and "widowed" contexts are proxies for stack frames.  Only the instance variables they contain which do not change can be fetched from the context object itself; these are the receiver, arguments, closureOrNil, and method. sender, pc, stackp and other stack contents may change as execution proceeds. These mutable values must be fetched from the stack frame itself, if it still exists.

All accesses to sender, pc, stackp, and stack contents (Context>>at:[put:]) must be mediated.  A married or widowed context has its sender field set to the frame pointer of its stack frame, encoded as a SmallInteger.  You can never see this from the image (except via a special xRay primitive). By following the frame pointer encoded in the sender field the VM can detect if the frame is still live.  If so, the relevant values are synthesized or fetched from the stack frame itself.  If not, the context is widowed and "mourns"; it is changed into a context with a nil sender field, a nil pc, and a stackp that points at the last argument.

How is this mediation done?  One design could be to require that all image-level accesses to Context inst vars go through primitives.  This feels contrived to me.  Instead, inst var access is via inst var access bytecodes.  But if we did this naively we would have to check in all inst var access bytecodes whose index is the same as sender (0), pc (1) & stackp (2).  This would slow down inst var access enormously.  Instead we observe that the inst var access bytecodes pushReceiverVariable: and popIntoReceiverVariable: have short forms for the most frequently used indexes (0 to 15 for push, 0 to 7 for popInto).  By arranging that we only use the long forms for accessing inst vars in Context, its superclasses and subclasses, we only have to check in the long forms, and the check is cheap.  We check that the index is <= 2, because all other inst var accesses with these indices will use the short-form bytecodes.

By arranging that Context's inst vars are *not* accessed by quick primitives we do not have to duplicate this machinery in quick primitive dispatch, which would also slow things down.  Hence Context>>pc is compiled as a vanilla method and its inst var access bytecode is the long form:

25 <E2 01> pushRcvr: 1
27 <5C> returnTop

'Nuff said

Second, Context >> #isPrimFailToken: attracted my attention multiple times when looking at different #timeProfile results of simulation sessions. In the expression [100000 factorial] it takes up more than 44% of the whole execution time!
I have no idea why it could be so slow, but this message is sent really often, which is at least 2 times per simulation of a special message send.
Would there be an easy change to resolve this bottleneck and speed up the simulation by 40% or more?
Would it be possible to provide a primitive for this method? Or do you see any other way to optimize it?

It's not so much that it's slow as that it has to be used all the time to check whether a send has invoked a primitive which has failed.  Its function is to support primitive error codes.  Before primitive error codes (IIRC) we simply checked for nil.  But with primitive error codes we need a container that both reliably distinguishes a primitive failure result from all other objects in the system, and holds onto the primitive failure code.  It may indeed be possible to reduce the overhead by only testing when absolutely necessary.  It may be that reimplementing the scheme is faster.


For example, one could imagine a class whose instances are the only objects to answer #isPrimFailToken, which would be faster than a one-argument message send.  But beware: if one is debugging the debugger, one mustn't confuse a PrimFailToken created in the simulation with a PrimFailToken created by the simulation.
Remember to distinguish between price and value.  Implementing the context-to-stack mapping scheme outlined above was very costly, but its value is very high.  It allows us to have contexts, with all their utility, while reducing their costs significantly.  Similarly, and closely related, being able to simulate execution is costly but hugely valuable.  If the price is 40% of simulated execution then it is a price worth paying.

Best,
Christoph

Happy New Year!

_,,,^..^,,,_
best, Eliot



--
_,,,^..^,,,_
best, Eliot


    



Re: Two new curious Context/primitive questions :-)

Tobias Pape
In reply to this post by codefrau


> On 6. Jan 2021, at 03:05, Vanessa Freudenberg <[hidden email]> wrote:
>
> This was a very interesting read - thank you Eliot! That should go in some book :)

Fast Smalltalk by Example…
A collection of VM essays

let's do it.
-t

>
> Another overview is in the class comment for MaybeContextInstanceVariableNode. Which I found by trying to figure out how the heck the compiler knows to use the long form for inst var refs in Context.
>
> Vanessa
>
> On Fri, Jan 1, 2021 at 1:23 PM Eliot Miranda <[hidden email]> wrote:
> Hi Christoph,
>
>     happy new year!!
>
> On Fri, Jan 1, 2021 at 11:28 AM Thiede, Christoph <[hidden email]> wrote:
> Hi all, hi Eliot,
>
>
> new year, new simulation fun, and I have collected two new questions about the Context implementation which I'd love to get answered here:
>
>
> First, I was confused by the following:
>
>
> (BlockClosure >> #numArgs) primitive. "266"
> (Context >> #pc) primitive. "0"
> What makes Context so special that it cannot be compiled with quick accessor methods?
>
> The gory details are here: http://www.mirandabanda.org/cogblog/2009/01/14/under-cover-contexts-and-the-big-frame-up/
>
> But that's full of implementation detail.  Let me try and explain at a higher level.
>
> Contexts are wonderful but very expensive.  Any VM that actually creates contexts for every method or block activation will inevitably run slowly.  There are two overheads here.  One is the basic overhead of allocating and reclaiming the contexts that are created.  The other is the cost of moving the outgoing receiver and arguments from the sending context to the new context.  A fast VM must avoid these overheads, and the key solution is to use a stack frame organization as is used in conventional language implementations.  The key here is that with stack frames the receiver and incoming arguments do not have to be moved; they stay where they are and a new frame is built underneath them which includes them as the incoming receiver and arguments; stack frames overlap outgoing and incoming arguments.
>
> But crucially, in addition the VM must still create contexts if and when they are needed.  If not, this would no longer be a Smalltalk-80 implementation.  Now one could imagine a VM which converted stack frames to contexts whenever a method or block activation was accessed using thisContext (or thisContext sender etc which access contexts by following the sdender field).  But that would be very slow.  Instead, the fastest approach is to try and keep stack frames as stack frames for as long as possible.  For this to work contexts have to be able to function as "proxy objects" for stack frames.  The VM does indeed create contexts when one uses thisContext, or even when one creates a block, but internally the context is in a very special state, a state in which it points at a stack frame.
>
> Some terminology is useful.  Peter Deutsch used "volatile", "hybrid" and "stable".  I like "single", "married", "widowed" and "divorced" (Peter's scheme didn't have "widowed" which is the key to why my scheme is faster, and why the closure implementation is as it is).
>
> When a stack frame is created it is "single".  It has a slot within it to refer to a context object if it is needed, but the slot is nil.
> When a stack frame needs a context object one is created and the two are "married".
> When a method or block returns its stack frame is discarded; the stack frame has died and if it was married its context becomes "widowed". (*)
> When space is needed for new stack frames all reclaimed stack frames are "divorced" from their contexts.  The stack frame state is written to the context object and the context object no longer acts as a proxy; it is a fully fledged normal stack object. (*)
>
> (*) Stack frames exist in a small memory zone organized as stack pages, the stack zone.  The size of the stack zone is set at startup, and its size is either the VM's defaut oir set by a value that persists in the image header (see vmParameterAt: 43).  Since the system may run out of stack pages (a deep call stack, many active processes), the VM makes new pages available by purging the least recently used pages to the heap in the form of context objects.  All the frames in the least recently used stack page are divorced at the same time. A context must be created for each frame that is not yet married.
>
> (*) The reason why Peter's original scheme is slower is that at return time the VM has to check for being married and copy state to the hybrid context, making it stable.  This has to be done because a closure's outer temporary variables are stored in the lexically enclosing stack frame, and so have to be made to persist in the stable context or their values will be stale.  In my (lisp-derived) closure implementation there are no such references to outer temporary variables; values are either copied if they do not change, or stored in an Array (the "indirection vector"), which can then be copied.
>
> Note that divorce is a process.  Once divorced a context is single and free to function as a context does.  Indeed when the system starts up it loads an image which only contains single contexts.  All contexts are divorced on snapshot.
>
>
> So now, finally, we can answer your question.  Why is (Context>>#pc), a method that simply answers an instance variable, not compiled as a quick method with a primitive that answers that variable?  "married" and "widowed" contexts are proxies for stack frames.  Only the instance variables they contain which do not change can be fetched from the context object itself; these are the receiver, arguments, closureOrNil, and method. sender, pc, stackp and other stack contents may change as execution proceeds. These mutable values must be fetched from the stack frame itself, if it still exists.
>
> All accesses to sender, pc, stackp, and stack contents (Context>>at:[put:]) must be mediated.  A married or widowed context has its sender field set to the frame pointer of its stack frame, encoded as a SmallInteger.  You can never see this from the image (except via a special xRay primitive). By following the frame pointer encoded in the sender field the VM can detect if the frame is still live.  If so, the relevant values are synthesized or fetched from the stack frame itself.  If not, the context is widowed and "mourns"; it is changed into a context with a nil sender field, a nil pc, and a stackp that points at the last argument.
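[This mediation protocol can be modelled in a few lines of Python. This is a loose sketch under my own names (MarriedContext, mourn, etc.); the real VM encodes the frame pointer as a SmallInteger in the sender slot rather than holding a reference.]

```python
# Toy model of mediated access to a married context's mutable state:
# while the frame is alive, pc is synthesized from the frame; once the
# frame is gone, the context is widowed and "mourns" (pc becomes nil).

class Frame:
    def __init__(self, pc):
        self.pc = pc
        self.live = True

class MarriedContext:
    def __init__(self, frame):
        self._frame = frame      # stands in for the encoded frame pointer
        self._pc = None

    def pc(self):
        if self._frame is not None and self._frame.live:
            return self._frame.pc        # fetch from the live frame
        self._mourn()
        return self._pc                  # widowed: the stable (nil) value

    def _mourn(self):
        if self._frame is not None:
            self._frame = None           # sender becomes nil
            self._pc = None              # pc becomes nil

frame = Frame(pc=25)
ctx = MarriedContext(frame)
live_pc = ctx.pc()        # mediated fetch from the live frame
frame.live = False        # the frame returns; the context is widowed
dead_pc = ctx.pc()        # after mourning, pc is nil (None)
```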
>
> How is this mediation done?  One design could be to require that all image-level accesses to Context inst vars go through primitives.  This feels contrived to me.  Instead, inst var access is via the inst var access bytecodes.  But if we did this naively we would have to check in every inst var access bytecode whose index matches sender (0), pc (1), or stackp (2), which would slow down inst var access enormously.  Instead we observe that the inst var access bytecodes pushReceiverVariable: and popIntoReceiverVariable: have short forms for the most frequently used indices (0 to 15 for push, 0 to 7 for popInto).  By arranging that only the long forms are used for accessing inst vars in Context, its superclasses, and its subclasses, we only have to check in the long forms, and the check is cheap.  We check that the index is <= 2, because all other inst var accesses with these indices will use the short-form bytecodes.
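[The "check only in the long form" trick amounts to a single cheap comparison on a rare path. A hypothetical Python rendering, with names of my own invention, might look like this:]

```python
# Sketch of the long-form bytecode's extra check. Short-form push/popInto
# bytecodes never reach Context's sender/pc/stackp (those always compile
# to the long form), so only the long form needs the index <= 2 test.

SENDER, PC, STACKP = 0, 1, 2

def push_receiver_variable_long(interp, receiver, index):
    if index <= STACKP and interp.is_context(receiver):
        return interp.mediated_context_fetch(receiver, index)
    return receiver.slots[index]         # the common, unchecked path

class Interp:
    def is_context(self, obj):
        return getattr(obj, 'is_context', False)
    def mediated_context_fetch(self, ctx, index):
        return ('mediated', index)       # stands in for the frame lookup

class Obj:
    def __init__(self, slots, is_context=False):
        self.slots = slots
        self.is_context = is_context

interp = Interp()
plain = Obj(slots=['a', 'b', 'c', 'd'])
ctx = Obj(slots=[None, None, None], is_context=True)
fast = push_receiver_variable_long(interp, plain, 3)   # ordinary fetch
slow = push_receiver_variable_long(interp, ctx, PC)    # mediated fetch
```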
>
> By arranging that Context's inst vars are *not* accessed by quick primitives we do not have to duplicate this machinery in quick primitive dispatch, which would also slow things down.  Hence Context>>pc is compiled as a vanilla method and its inst var access bytecode is the long form:
>
> 25 <E2 01> pushRcvr: 1
> 27 <5C> returnTop
>
> 'Nuff said
> Second, Context >> #isPrimFailToken: attracted my attention multiple times when looking at different #timeProfile results of simulation sessions. In the expression [100000 factorial] it takes up more than 44% of the whole execution time!
> I have no idea why it could be so slow, but this message is sent really often, which is at least 2 times per simulation of a special message send.
> Would there be an easy change to resolve this bottleneck and speed up the simulation by 40% or more?
> Would it be possible to provide a primitive for this method? Or do you see any other way to optimize it?
>
> It's not slow so much as it has to be used all the time, to check whether a send has invoked a primitive that has failed.  Its function is to support primitive error codes.  Without primitive error codes we could simply check for nil; with them we need a container that both reliably distinguishes a primitive failure result from every other object in the system and holds onto the primitive failure code.  It may indeed be possible to reduce the overhead by only testing when absolutely necessary.  It may also be that a reimplementation of the scheme would be faster.
>
> For example, one could imagine a class whose instances are the only objects to answer true to #isPrimFailToken, which would be faster than a one-argument message send.  But beware: if one were debugging the debugger, one mustn't confuse a PrimFailToken created in the simulation with a PrimFailToken created by the simulation.
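[The structural-check vs class-based-check alternatives can be contrasted in a short Python sketch. The marker, the tuple shape, and the class name are all hypothetical stand-ins, not the image's actual representation:]

```python
# Two designs for recognizing primitive-failure results.
# Structural check: inspect the candidate's shape on every test
# (roughly the style of check a method like isPrimFailToken: performs).
# Class-based check: a dedicated class, so one type test suffices.

PRIM_FAIL_MARKER = object()

def is_prim_fail_token_structural(obj):
    return (isinstance(obj, tuple) and len(obj) == 2
            and obj[0] is PRIM_FAIL_MARKER)

class PrimFailToken:
    def __init__(self, error_code):
        self.error_code = error_code     # the primitive failure code

def is_prim_fail_token_class_based(obj):
    return isinstance(obj, PrimFailToken)

structural = is_prim_fail_token_structural((PRIM_FAIL_MARKER, 'bad argument'))
class_based = is_prim_fail_token_class_based(PrimFailToken('bad argument'))
ordinary = is_prim_fail_token_class_based(42)
```

Note the caveat from the thread still applies to the class-based design: a token created *in* a simulated execution and a token created *by* the simulator would be instances of the same class, so the simulator would need some extra way to tell them apart.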
> Remember to distinguish between price and value.  Implementing the context-to-stack mapping scheme outlined above was very costly, but its value is very high.  It allows us to have contexts, with all their utility, while reducing their costs significantly.  Similarly, and closely related, being able to simulate execution is costly but hugely valuable.  If the price is 40% of simulated execution then it is a price worth paying.
>
> Best,
> Christoph
>
> Happy New Year!
>
> _,,,^..^,,,_
> best, Eliot
>
>