Solving termination of critical sections in the context of priority inversion was: SemaphoreTest fails in trunk, is a fix needed for the 5.2 release?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Solving termination of critical sections in the context of priority inversion was: SemaphoreTest fails in trunk, is a fix needed for the 5.2 release?

Eliot Miranda-2
Hi David, Hi Bert, Clément, Juan, Levente and Marcus, Hi Anyone else with strong experience in the VM with processes,

On Mon, Jul 23, 2018 at 7:38 PM, David T. Lewis <[hidden email]> wrote:
Semaphore seems like a rather basic thing that should work correctly in
any Squeak image. The tests do not pass in trunk any more.

Specifically, SemaphoreTest>>testSemaInCriticalWait fails in trunk, but passes
in the earlier Squeak 4.6 / 5.0 images.

Is this a real problem? Does it need to be fixed for the 5.2 release?

Yes.  Yes.  And it needs to be fixed in Pharo too.  I know this message will strike you as TL;DR, but please, especially if you're Bert, Clément, Juan, Levente or Marcus, read this carefully.  It's quite important.  And below I'll present the Squeak code but will work with Clément and Marcus to implement semantically equivalent Pharo code asap.

And apologies in advance for the repetitious nature of this message.  It is better that I am precise than I am brief and anyone miss anything.  This is an old problem and it will be nice if I've fixed it, but I could easily have missed something; this problem having been around for decades.  OK...


This is an old problem which boiled down to there being no way to determine by looking at a process's suspendedContext whether a process is either waiting on a Semaphore or Mutex or is no longer waiting, but has made no progress because it is at a lower priority than runnable processes and so has not got a chance to run yet.

So in
    | s |
    s := Semaphore new.
    ...
    s wait
    ...

if we look at the context containing the wait its pc will be the same whether the process is blocked, waiting on the semaphore, or whether the semaphore has been signalled but the process has not been able to proceed because it is of lower priority than runnable processes and so can make no progress.  This caused problems for code such as this:

Semaphore>>critical: mutuallyExcludedBlock
self wait.
^mutuallyExcludedBlock ensure: [self signal]

because the ensure: won't be entered if higher priority runnable processes are preventing it from running.

And for code such as this:

Semaphore>>critical: mutuallyExcludedBlock
^[self wait.
mutuallyExcludedBlock value]
ensure: [self signal]

because if the process is terminated when the semaphore has not been signalled (i.e. the process is blocked in the wait), Process>>terminate will run the ensure: block anyway, resulting in the Semaphore getting an extra signal.

This occupied Andreas and I at Qwaq, and we didn't solve it.  We developed Mutex as a more efficient version of Monitor, but this is also subject so the same problem.  We did change the definition of ensure: so that it is not a suspension point, by adding valueNoContextSwitch[:]

BlockClosure>>ensure: aBlock
"Evaluate a termination block after evaluating the receiver, regardless of
whether the receiver's evaluation completes.  N.B.  This method is *not*
implemented as a primitive.  Primitive 198 always fails.  The VM uses prim
198 in a context's method as the mark for an ensure:/ifCurtailed: activation."

| complete returnValue |
<primitive: 198>
returnValue := self valueNoContextSwitch.
complete ifNil:[
complete := true.
aBlock value.
].
^ returnValue

This means that we don't have to deal with suspensions here (marked with !!!)

I now understand how to distinguish between the two cases, between blocking and not blocked but no progress.  Process>>suspend answers the list the Process was on when it was suspended.  If the process is already suspended Process>>suspend answers nil.  If the process is waiting on a Semaphore or a Mutex, Process>>suspend answers the Semaphore or Mutex. And if the process is runnable then Process>>suspend answers the process's run list (a LinkedList in ProcessorScheduler's quiescentProcessLists array corresponding to the process's priority).

So Process>>#terminate can distinguish between #wait or #primitiveEnterCriticalSection or #primitiveTestAndSetOwnershipOfCriticalSection being blocked, or being unblocked but having made no progress due to too low a priority.  We do so by testing the class of the result of suspending the process.  If it is a LinkedList, the process has past the #wait or #primitiveEnterCriticalSection but has made no progress due to too low a priority.

The version of Process>>#terminate I'm about to commit deals with several cases.  Let me present the cases first.  There are three versions of Semaphore>>#critical: to handle, and one version of Mutex>>critical: and Mutex>>#critical:ifLocked:.

The two basic versions of Semaphore>>critical: are

V1
critical: mutuallyExcludedBlock
"Evaluate mutuallyExcludedBlock only if the receiver is not currently in
the process of running the critical: message. If the receiver is, evaluate
mutuallyExcludedBlock after the other critical: message is finished."
<criticalSection>
self wait.
^mutuallyExcludedBlock ensure: [self signal]

V2
critical: mutuallyExcludedBlock
"Evaluate mutuallyExcludedBlock only if the receiver is not currently in
the process of running the critical: message. If the receiver is, evaluate
mutuallyExcludedBlock after the other critical: message is finished."
<criticalSection>
^[self wait.
  mutuallyExcludedBlock value]
ensure: [self signal]

and Juan's safer version is (after I added the criticalSection pragma)

V3
critical: mutuallyExcludedBlock
"Evaluate mutuallyExcludedBlock only if the receiver is not currently in
the process of running the critical: message. If the receiver is, evaluate
mutuallyExcludedBlock after the other critical: message is finished."
<criticalSection>
| caught |
"We need to catch eventual interruptions very carefully. 
The naive approach of just doing, e.g.,:
self wait.
aBlock ensure:[self signal].
will fail if the active process gets terminated while in the wait.
However, the equally naive:
[self wait.
aBlock value] ensure:[self signal].
will fail too, since the active process may get interrupted while
entering the ensured block and leave the semaphore signaled twice.
To avoid both problems we make use of the fact that interrupts only
occur on sends (or backward jumps) and use an assignment (bytecode)
right before we go into the wait primitive (which is not a real send and
therefore not interruptable either)."

caught := false.
^[
caught := true.
self wait.
mutuallyExcludedBlock value
] ensure: [ caught ifTrue: [self signal] ]

and the Mutex>>critical:'s are

critical: aBlock
"Evaluate aBlock protected by the receiver."
<criticalSection>
^self primitiveEnterCriticalSection
ifTrue: [aBlock value]
ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]

critical: aBlock ifLocked: lockedBlock
"Answer the evaluation of aBlock protected by the receiver.  If it is already in a critical
section on behalf of some other process answer the evaluation of lockedBlock."
<criticalSection>
^self primitiveTestAndSetOwnershipOfCriticalSection
ifNil: [lockedBlock value]
ifNotNil:
[:alreadyOwner|
alreadyOwner
ifTrue: [aBlock value]
ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]]

primitiveEnterCriticalSection answers false if the Mutex was unowned, and true if it was already owned by the active process.  It blocks otherwise.  primitiveTestAndSetOwnershipOfCriticalSection answers false if the Mutex was unowned, true if it was already owned by the active process, and nil if owned by some other process.

So we want Process>>#terminate to correctly release the semaphores and mutexes no matter where in these methods they are.  We don't have to worry if the process is within the block argument to a critical: itself, only if it is actually within the critical: method or a block within it. If it is already within the block argument to critical: then Process>>#terminate's unwind handling will unwind things correctly.  Taking Juan's version of Semaphore>>#critical: above, the key issue is whether the process being terminated is blocked on the wait, not blocked but still stuck at the wait, or at the start of the block argument to ensure:.

I have extracted the processing into Process>>releaseCriticalSection:, so now Process>>terminate reads

terminate 
"Stop the process that the receiver represents forever.
Unwind to execute pending ensure:/ifCurtailed: blocks before terminating.
If the process is in the middle of a critical: critical section, release it properly."

| ctxt unwindBlock oldList |
self isActiveProcess ifTrue: [
ctxt := thisContext.
[ ctxt := ctxt findNextUnwindContextUpTo: nil.
ctxt isNil
] whileFalse: [
(ctxt tempAt: 2) ifNil:[
ctxt tempAt: 2 put: nil.
unwindBlock := ctxt tempAt: 1.
thisContext terminateTo: ctxt.
unwindBlock value].
].
thisContext terminateTo: nil.
self suspend.
] ifFalse:[
"Always suspend the process first so it doesn't accidentally get woken up.
N.B. If oldList is a LinkedList then the process is runnable. If it is a Semaphore/Mutex et al
then the process is blocked, and if it is nil then the process is already suspended."
oldList := self suspend.
suspendedContext ifNotNil:
["Release any method marked with the <criticalSection> pragma.
 The argument is whether the process is runnable."
self releaseCriticalSection: (oldList isNil or: [oldList class == LinkedList]).

"If terminating a process halfways through an unwind, try to complete that unwind block first."
(suspendedContext findNextUnwindContextUpTo: nil) ifNotNil:
[:outer|
(suspendedContext findContextSuchThat:[:c| c closure == (outer tempAt: 1)]) ifNotNil:
[:inner| "This is an unwind block currently under evaluation"
suspendedContext runUntilErrorOrReturnFrom: inner]].

ctxt := self popTo: suspendedContext bottomContext.
ctxt == suspendedContext bottomContext ifFalse:
[self debug: ctxt title: 'Unwind error during termination'].
"Set the context to its endPC for the benefit of isTerminated."
ctxt pc: ctxt endPC]]

In implementing releaseCriticalSection: we need to know which selector a context has just sent.  selectorJustSentOrSelf is implemented in Squeak as

InstructionStream>>selectorJustSentOrSelf
"If this instruction follows a send, answer the send's selector, otherwise answer self."

| method |
method := self method.
^method encoderClass selectorToSendOrItselfFor: self in: method at: self previousPc

c.f.

InstructionStream>>selectorToSendOrSelf
"If this instruction is a send, answer the selector, otherwise answer self."

| method |
method := self method.
^method encoderClass selectorToSendOrItselfFor: self in: method at: pc

Now we can implement Process>>#releaseCriticalSection:

releaseCriticalSection: runnable
"Figure out if we are terminating a process that is in the ensure: block of a critical section.
In this case, if the block has made progress, pop the suspendedContext so that we leave the
ensure: block inside the critical: without signaling the semaphore/exiting the primitive section,
since presumably this has already happened.  But if it hasn't made progress but is beyond the
wait (which we can tell my the oldList being one of the runnable lists, i.e. a LinkedList, not a
Semaphore or Mutex, et al), then the ensure: block needs to be run."
| selectorJustSent |
(suspendedContext method pragmaAt: #criticalSection) ifNil: [^self].
selectorJustSent := suspendedContext selectorJustSentOrSelf.

"Receiver and/or argument blocks of ensure: in Semaphore>>critical: or Mutex>>#critical:"
suspendedContext isClosureContext ifTrue:
[suspendedContext sender selector == #ensure: ifTrue:
[| notWaitingButMadeNoProgress |
"Avoid running the ensure: block twice, popping it if it has already been run. If runnable
but at the wait, leave it in place. N.B. No need to check if the block receiver of ensure: has
not started to run (via suspendedContext pc = suspendedContext startpc) because ensure:
uses valueNoContextSwitch, and so there is no suspension point before the wait."
notWaitingButMadeNoProgress :=
runnable
and: [selectorJustSent == #wait
and: [suspendedContext sender selectorJustSentOrSelf == #valueNoContextSwitch]].
notWaitingButMadeNoProgress ifFalse:
[suspendedContext := suspendedContext home]].
^self].

"Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still blocked?  If so, nothing further to do."
runnable ifFalse: [^self].

"If still at the wait the ensure: block has not been activated, so signal to restore."
selectorJustSent == #wait ifTrue:
[suspendedContext receiver signal].

"If still at the lock primitive and the lock primitive just acquired ownership (indicated by it answering false)
then the ensure block has not been activated, so explicitly primitiveExitCriticalSection to unlock."
(selectorJustSent == #primitiveEnterCriticalSection
or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection]) ifTrue:
[(suspendedContext stackPtr > 0
 and: [suspendedContext top == false]) ifTrue:
[suspendedContext receiver primitiveExitCriticalSection]]


Let's go through it line by line.  First, runnable is an argument, determined in Process>>#terminate.  One could invoke it with

    self releaseCriticalSection: oldList class == LinkedList

but this means that an already suspended process is assumed to be not runnable, which makes it tricky to debug the Process>>#terminate method.  One has to assign to oldList while stepping though the method.  I've chosen safety, assuming that the process is still runnable if suspend answers nil, its simply being debugged.

Then we're only interested in <criticalSection> marked methods se we return if there's no such pragma.

Then we deal with blocks in these methods.  One issue here is to avoid running the ensure: block twice if it is already being evaluated.  The other is to run it if it is stalled and has yet to be run.

So if

suspendedContext isClosureContext ifTrue:

we're in the ensure: receiver or argument blocks in any <criticalSection> marked method, i.e. Semaphore>>critical: and Mutex>>critical:[ifLocked:].  If wait was just sent then we're in the ensure: receiver block of Semaphore>>critical: (V2 & V3 above) and the issue is whether the process is blocked or is unblocked and has made no progress. If blocked then nothing needs to be done; the ensure: block is discarded and the stack cut back to the critical: activation.  If progress has been made then nothing needs to be done (in fact we can't be in this state; the ensure: receiver will have started evaluating the critical: block argument).  If unblocked, but no progress has been made, do /not/ discard the unwind block and it will be run in Process>>#terminate when this method returns.  Hence...

[suspendedContext sender selector == #ensure: ifTrue:
[| notWaitingButMadeNoProgress |
"Avoid running the ensure: block twice, popping it if it has already been run. If runnable
but at the wait, leave it in place. N.B. No need to check if the block receiver of ensure: has
not started to run (via suspendedContext pc = suspendedContext startpc) because ensure:
uses valueNoContextSwitch, and so there is no suspension point before the wait."
notWaitingButMadeNoProgress :=
runnable
and: [selectorJustSent == #wait
and: [suspendedContext sender selectorJustSentOrSelf == #valueNoContextSwitch]].
notWaitingButMadeNoProgress ifFalse:
[suspendedContext := suspendedContext home]].
^self].

Now we're left with the simpler version of Semaphore>>critical: (V1 above) and the two Mutex methods Mutex>>#critical:[ifLocked:].  Here the only state we have to worry about is that the process is unblocked but has made no progress.  If not runnable the process is still blocked and we can simply return.

"Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still blocked?  If so, nothing further to do."
runnable ifFalse: [^self].

If #wait was just sent the process is in Semaphore>>#critical: and, because ensure: has not been sent we signal explicitly to restore the signal count:

"If still at the wait the ensure: block has not been activated, so signal to restore."
selectorJustSent == #wait ifTrue:
[suspendedContext receiver signal].

If either of primitiveEnterCriticalSection or primitiveTestAndSetOwnershipOfCriticalSection have just been sent then either the Mutex is already owned, in which case the ensure block is elsewhere in the colder part of the stack, or has just been owned, and because ensure: has not been sent we unlock explicitly to release the Mutex:

"If still at the lock primitive and the lock primitive just acquired ownership (indicated by it answering false)
then the ensure block has not been activated, so explicitly primitiveExitCriticalSection to unlock."
(selectorJustSent == #primitiveEnterCriticalSection
or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection]) ifTrue:
[(suspendedContext stackPtr > 0
 and: [suspendedContext top == false]) ifTrue:
[suspendedContext receiver primitiveExitCriticalSection]]

So Pharoers can you read and say whether you think this is sane or not?  If so, then we can kibbutz to write the Pharo version.

Squeakers can you review Kernel-eem.1183 & Kernel-eem.1184 in the inbox?  Kernel-eem.1183 includes the fix as described above.  Kernel-eem.1184 reverts Semaphore>>#critical: to V1 above.


P.S. Looking at V1 above it seems to me that there is an issue if the process is preempted in ensure: before sending valueNoContextSwitch:.  I'll try and write a test that advances a process to that precise point.  If that test fails I think we have to use V2 or V3, and V2 is clearly preferable.

_,,,^..^,,,_
best, Eliot


Reply | Threaded
Open this post in threaded view
|

Re: Solving termination of critical sections in the context of priority inversion was: SemaphoreTest fails in trunk, is a fix needed for the 5.2 release?

Eliot Miranda-2
Ha!

On Thu, Jul 26, 2018 at 9:31 PM, Eliot Miranda <[hidden email]> wrote:
Hi David, Hi Bert, Clément, Juan, Levente and Marcus, Hi Anyone else with strong experience in the VM with processes,

On Mon, Jul 23, 2018 at 7:38 PM, David T. Lewis <[hidden email]> wrote:
Semaphore seems like a rather basic thing that should work correctly in
any Squeak image. The tests do not pass in trunk any more.

Specifically, SemaphoreTest>>testSemaInCriticalWait fails in trunk, but passes
in the earlier Squeak 4.6 / 5.0 images.

Is this a real problem? Does it need to be fixed for the 5.2 release?

Yes.  Yes.  And it needs to be fixed in Pharo too.  I know this message will strike you as TL;DR, but please, especially if you're Bert, Clément, Juan, Levente or Marcus, read this carefully.  It's quite important.  And below I'll present the Squeak code but will work with Clément and Marcus to implement semantically equivalent Pharo code asap.

And apologies in advance for the repetitious nature of this message.  It is better that I am precise than I am brief and anyone miss anything.  This is an old problem and it will be nice if I've fixed it, but I could easily have missed something; this problem having been around for decades.  OK...


This is an old problem which boiled down to there being no way to determine by looking at a process's suspendedContext whether a process is either waiting on a Semaphore or Mutex or is no longer waiting, but has made no progress because it is at a lower priority than runnable processes and so has not got a chance to run yet.

So in
    | s |
    s := Semaphore new.
    ...
    s wait
    ...

if we look at the context containing the wait its pc will be the same whether the process is blocked, waiting on the semaphore, or whether the semaphore has been signalled but the process has not been able to proceed because it is of lower priority than runnable processes and so can make no progress.  This caused problems for code such as this:

Semaphore>>critical: mutuallyExcludedBlock
self wait.
^mutuallyExcludedBlock ensure: [self signal]

because the ensure: won't be entered if higher priority runnable processes are preventing it from running.

And for code such as this:

Semaphore>>critical: mutuallyExcludedBlock
^[self wait.
mutuallyExcludedBlock value]
ensure: [self signal]

because if the process is terminated when the semaphore has not been signalled (i.e. the process is blocked in the wait), Process>>terminate will run the ensure: block anyway, resulting in the Semaphore getting an extra signal.

This occupied Andreas and I at Qwaq, and we didn't solve it.  We developed Mutex as a more efficient version of Monitor, but this is also subject so the same problem.  We did change the definition of ensure: so that it is not a suspension point, by adding valueNoContextSwitch[:]

BlockClosure>>ensure: aBlock
"Evaluate a termination block after evaluating the receiver, regardless of
whether the receiver's evaluation completes.  N.B.  This method is *not*
implemented as a primitive.  Primitive 198 always fails.  The VM uses prim
198 in a context's method as the mark for an ensure:/ifCurtailed: activation."

| complete returnValue |
<primitive: 198>
returnValue := self valueNoContextSwitch.
complete ifNil:[
complete := true.
aBlock value.
].
^ returnValue

This means that we don't have to deal with suspensions here (marked with !!!)

I now understand how to distinguish between the two cases, between blocking and not blocked but no progress.  Process>>suspend answers the list the Process was on when it was suspended.  If the process is already suspended Process>>suspend answers nil.  If the process is waiting on a Semaphore or a Mutex, Process>>suspend answers the Semaphore or Mutex. And if the process is runnable then Process>>suspend answers the process's run list (a LinkedList in ProcessorScheduler's quiescentProcessLists array corresponding to the process's priority).

So Process>>#terminate can distinguish between #wait or #primitiveEnterCriticalSection or #primitiveTestAndSetOwnershipOfCriticalSection being blocked, or being unblocked but having made no progress due to too low a priority.  We do so by testing the class of the result of suspending the process.  If it is a LinkedList, the process has past the #wait or #primitiveEnterCriticalSection but has made no progress due to too low a priority.

The version of Process>>#terminate I'm about to commit deals with several cases.  Let me present the cases first.  There are three versions of Semaphore>>#critical: to handle, and one version of Mutex>>critical: and Mutex>>#critical:ifLocked:.

The two basic versions of Semaphore>>critical: are

V1
critical: mutuallyExcludedBlock
"Evaluate mutuallyExcludedBlock only if the receiver is not currently in
the process of running the critical: message. If the receiver is, evaluate
mutuallyExcludedBlock after the other critical: message is finished."
<criticalSection>
self wait.
^mutuallyExcludedBlock ensure: [self signal]

V2
critical: mutuallyExcludedBlock
"Evaluate mutuallyExcludedBlock only if the receiver is not currently in
the process of running the critical: message. If the receiver is, evaluate
mutuallyExcludedBlock after the other critical: message is finished."
<criticalSection>
^[self wait.
  mutuallyExcludedBlock value]
ensure: [self signal]

and Juan's safer version is (after I added the criticalSection pragma)

V3
critical: mutuallyExcludedBlock
"Evaluate mutuallyExcludedBlock only if the receiver is not currently in
the process of running the critical: message. If the receiver is, evaluate
mutuallyExcludedBlock after the other critical: message is finished."
<criticalSection>
| caught |
"We need to catch eventual interruptions very carefully. 
The naive approach of just doing, e.g.,:
self wait.
aBlock ensure:[self signal].
will fail if the active process gets terminated while in the wait.
However, the equally naive:
[self wait.
aBlock value] ensure:[self signal].
will fail too, since the active process may get interrupted while
entering the ensured block and leave the semaphore signaled twice.
To avoid both problems we make use of the fact that interrupts only
occur on sends (or backward jumps) and use an assignment (bytecode)
right before we go into the wait primitive (which is not a real send and
therefore not interruptable either)."

caught := false.
^[
caught := true.
self wait.
mutuallyExcludedBlock value
] ensure: [ caught ifTrue: [self signal] ]

and the Mutex>>critical:'s are

critical: aBlock
"Evaluate aBlock protected by the receiver."
<criticalSection>
^self primitiveEnterCriticalSection
ifTrue: [aBlock value]
ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]

critical: aBlock ifLocked: lockedBlock
"Answer the evaluation of aBlock protected by the receiver.  If it is already in a critical
section on behalf of some other process answer the evaluation of lockedBlock."
<criticalSection>
^self primitiveTestAndSetOwnershipOfCriticalSection
ifNil: [lockedBlock value]
ifNotNil:
[:alreadyOwner|
alreadyOwner
ifTrue: [aBlock value]
ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]]

primitiveEnterCriticalSection answers false if the Mutex was unowned, and true if it was already owned by the active process.  It blocks otherwise.  primitiveTestAndSetOwnershipOfCriticalSection answers false if the Mutex was unowned, true if it was already owned by the active process, and nil if owned by some other process.

So we want Process>>#terminate to correctly release the semaphores and mutexes no matter where in these methods they are.  We don't have to worry if the process is within the block argument to a critical: itself, only if it is actually within the critical: method or a block within it. If it is already within the block argument to critical: then Process>>#terminate's unwind handling will unwind things correctly.  Taking Juan's version of Semaphore>>#critical: above, the key issue is whether the process being terminated is blocked on the wait, not blocked but still stuck at the wait, or at the start of the block argument to ensure:.

I have extracted the processing into Process>>releaseCriticalSection:, so now Process>>terminate reads

terminate 
"Stop the process that the receiver represents forever.
Unwind to execute pending ensure:/ifCurtailed: blocks before terminating.
If the process is in the middle of a critical: critical section, release it properly."

| ctxt unwindBlock oldList |
self isActiveProcess ifTrue: [
ctxt := thisContext.
[ ctxt := ctxt findNextUnwindContextUpTo: nil.
ctxt isNil
] whileFalse: [
(ctxt tempAt: 2) ifNil:[
ctxt tempAt: 2 put: nil.
unwindBlock := ctxt tempAt: 1.
thisContext terminateTo: ctxt.
unwindBlock value].
].
thisContext terminateTo: nil.
self suspend.
] ifFalse:[
"Always suspend the process first so it doesn't accidentally get woken up.
N.B. If oldList is a LinkedList then the process is runnable. If it is a Semaphore/Mutex et al
then the process is blocked, and if it is nil then the process is already suspended."
oldList := self suspend.
suspendedContext ifNotNil:
["Release any method marked with the <criticalSection> pragma.
 The argument is whether the process is runnable."
self releaseCriticalSection: (oldList isNil or: [oldList class == LinkedList]).

"If terminating a process halfways through an unwind, try to complete that unwind block first."
(suspendedContext findNextUnwindContextUpTo: nil) ifNotNil:
[:outer|
(suspendedContext findContextSuchThat:[:c| c closure == (outer tempAt: 1)]) ifNotNil:
[:inner| "This is an unwind block currently under evaluation"
suspendedContext runUntilErrorOrReturnFrom: inner]].

ctxt := self popTo: suspendedContext bottomContext.
ctxt == suspendedContext bottomContext ifFalse:
[self debug: ctxt title: 'Unwind error during termination'].
"Set the context to its endPC for the benefit of isTerminated."
ctxt pc: ctxt endPC]]

In implementing releaseCriticalSection: we need to know which selector a context has just sent.  selectorJustSentOrSelf is implemented in Squeak as

InstructionStream>>selectorJustSentOrSelf
"If this instruction follows a send, answer the send's selector, otherwise answer self."

| method |
method := self method.
^method encoderClass selectorToSendOrItselfFor: self in: method at: self previousPc

c.f.

InstructionStream>>selectorToSendOrSelf
"If this instruction is a send, answer the selector, otherwise answer self."

| method |
method := self method.
^method encoderClass selectorToSendOrItselfFor: self in: method at: pc

Now we can implement Process>>#releaseCriticalSection:

releaseCriticalSection: runnable
"Figure out if we are terminating a process that is in the ensure: block of a critical section.
In this case, if the block has made progress, pop the suspendedContext so that we leave the
ensure: block inside the critical: without signaling the semaphore/exiting the primitive section,
since presumably this has already happened.  But if it hasn't made progress but is beyond the
wait (which we can tell my the oldList being one of the runnable lists, i.e. a LinkedList, not a
Semaphore or Mutex, et al), then the ensure: block needs to be run."
| selectorJustSent |
(suspendedContext method pragmaAt: #criticalSection) ifNil: [^self].
selectorJustSent := suspendedContext selectorJustSentOrSelf.

"Receiver and/or argument blocks of ensure: in Semaphore>>critical: or Mutex>>#critical:"
suspendedContext isClosureContext ifTrue:
[suspendedContext sender selector == #ensure: ifTrue:
[| notWaitingButMadeNoProgress |
"Avoid running the ensure: block twice, popping it if it has already been run. If runnable
but at the wait, leave it in place. N.B. No need to check if the block receiver of ensure: has
not started to run (via suspendedContext pc = suspendedContext startpc) because ensure:
uses valueNoContextSwitch, and so there is no suspension point before the wait."
notWaitingButMadeNoProgress :=
runnable
and: [selectorJustSent == #wait
and: [suspendedContext sender selectorJustSentOrSelf == #valueNoContextSwitch]].
notWaitingButMadeNoProgress ifFalse:
[suspendedContext := suspendedContext home]].
^self].

"Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still blocked?  If so, nothing further to do."
runnable ifFalse: [^self].

"If still at the wait the ensure: block has not been activated, so signal to restore."
selectorJustSent == #wait ifTrue:
[suspendedContext receiver signal].

"If still at the lock primitive and the lock primitive just acquired ownership (indicated by it answering false)
then the ensure block has not been activated, so explicitly primitiveExitCriticalSection to unlock."
(selectorJustSent == #primitiveEnterCriticalSection
or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection]) ifTrue:
[(suspendedContext stackPtr > 0
 and: [suspendedContext top == false]) ifTrue:
[suspendedContext receiver primitiveExitCriticalSection]]


Let's go through it line by line.  First, runnable is an argument, determined in Process>>#terminate.  One could invoke it with

    self releaseCriticalSection: oldList class == LinkedList

but this means that an already suspended process is assumed to be not runnable, which makes it tricky to debug the Process>>#terminate method.  One has to assign to oldList while stepping though the method.  I've chosen safety, assuming that the process is still runnable if suspend answers nil, its simply being debugged.

Then we're only interested in <criticalSection> marked methods se we return if there's no such pragma.

Then we deal with blocks in these methods.  One issue here is to avoid running the ensure: block twice if it is already being evaluated.  The other is to run it if it is stalled and has yet to be run.

So if

suspendedContext isClosureContext ifTrue:

we're in the ensure: receiver or argument blocks in any <criticalSection> marked method, i.e. Semaphore>>critical: and Mutex>>critical:[ifLocked:].  If wait was just sent then we're in the ensure: receiver block of Semaphore>>critical: (V2 & V3 above) and the issue is whether the process is blocked or is unblocked and has made no progress. If blocked then nothing needs to be done; the ensure: block is discarded and the stack cut back to the critical: activation.  If progress has been made then nothing needs to be done (in fact we can't be in this state; the ensure: receiver will have started evaluating the critical: block argument).  If unblocked, but no progress has been made, do /not/ discard the unwind block and it will be run in Process>>#terminate when this method returns.  Hence...

[suspendedContext sender selector == #ensure: ifTrue:
[| notWaitingButMadeNoProgress |
"Avoid running the ensure: block twice, popping it if it has already been run. If runnable
but at the wait, leave it in place. N.B. No need to check if the block receiver of ensure: has
not started to run (via suspendedContext pc = suspendedContext startpc) because ensure:
uses valueNoContextSwitch, and so there is no suspension point before the wait."
notWaitingButMadeNoProgress :=
runnable
and: [selectorJustSent == #wait
and: [suspendedContext sender selectorJustSentOrSelf == #valueNoContextSwitch]].
notWaitingButMadeNoProgress ifFalse:
[suspendedContext := suspendedContext home]].
^self].

Now we're left with the simpler version of Semaphore>>critical: (V1 above) and the two Mutex methods Mutex>>#critical:[ifLocked:].  Here the only state we have to worry about is that the process is unblocked but has made no progress.  If not runnable the process is still blocked and we can simply return.

"Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still blocked?  If so, nothing further to do."
runnable ifFalse: [^self].

If #wait was just sent the process is in Semaphore>>#critical: and, because ensure: has not been sent we signal explicitly to restore the signal count:

"If still at the wait the ensure: block has not been activated, so signal to restore."
selectorJustSent == #wait ifTrue:
[suspendedContext receiver signal].

If either of primitiveEnterCriticalSection or primitiveTestAndSetOwnershipOfCriticalSection have just been sent then either the Mutex is already owned, in which case the ensure block is elsewhere in the colder part of the stack, or has just been owned, and because ensure: has not been sent we unlock explicitly to release the Mutex:

"If still at the lock primitive and the lock primitive just acquired ownership (indicated by it answering false)
then the ensure block has not been activated, so explicitly primitiveExitCriticalSection to unlock."
(selectorJustSent == #primitiveEnterCriticalSection
or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection]) ifTrue:
[(suspendedContext stackPtr > 0
 and: [suspendedContext top == false]) ifTrue:
[suspendedContext receiver primitiveExitCriticalSection]]

So Pharoers can you read and say whether you think this is sane or not?  If so, then we can kibbutz to write the Pharo version.

Squeakers can you review Kernel-eem.1183 & Kernel-eem.1184 in the inbox?  Kernel-eem.1183 includes the fix as described above.  Kernel-eem.1184 reverts Semaphore>>#critical: to V1 above.


P.S. Looking at V1 above it seems to me that there is an issue if the process is preempted in ensure: before sending valueNoContextSwitch:.  I'll try and write a test that advances a process to that precise point.  If that test fails I think we have to use V2 or V3, and V2 is clearly preferable.

Lovely.  I added the test (testSemaCriticalWaitInEnsure & testMutexCriticalBlockedInEnsure) and it is V1 that works and V2 that does not.  I think it best not to try and be too clever and fix terminate and/or releaseCriticalSection: for this case.  We can simply stick with V1 for now.

_,,,^..^,,,_
best, Eliot


Reply | Threaded
Open this post in threaded view
|

Re: Solving termination of critical sections in the context of priority inversion was: SemaphoreTest fails in trunk, is a fix needed for the 5.2 release?

Tobias Pape
In reply to this post by Eliot Miranda-2
Hi Eliot,

Thanks for the comprehensive write-up.

Looks all got from over here.

> On 27.07.2018, at 06:31, Eliot Miranda <[hidden email]> wrote:
>
> Hi David, Hi Bert, Clément, Juan, Levente and Marcus, Hi Anyone else with strong experience in the VM with processes,
>
> On Mon, Jul 23, 2018 at 7:38 PM, David T. Lewis <[hidden email]> wrote:
> Semaphore seems like a rather basic thing that should work correctly in
> any Squeak image. The tests do not pass in trunk any more.
>
> Specifically, SemaphoreTest>>testSemaInCriticalWait fails in trunk, but passes
> in the earlier Squeak 4.6 / 5.0 images.
>
> Is this a real problem? Does it need to be fixed for the 5.2 release?
>
> Yes.  Yes.  And it needs to be fixed in Pharo too.  I know this message will strike you as TL;DR, but please, especially if you're Bert, Clément, Juan, Levente or Marcus, read this carefully.  It's quite important.  And below I'll present the Squeak code but will work with Clément and Marcus to implement semantically equivalent Pharo code asap.
>
> And apologies in advance for the repetitious nature of this message.  It is better that I am precise than I am brief and anyone miss anything.  This is an old problem and it will be nice if I've fixed it, but I could easily have missed something; this problem having been around for decades.  OK...
>
>
> This is an old problem which boiled down to there being no way to determine by looking at a process's suspendedContext whether a process is either waiting on a Semaphore or Mutex or is no longer waiting, but has made no progress because it is at a lower priority than runnable processes and so has not got a chance to run yet.
>
> So in
>     | s |
>     s := Semaphore new.
>     ...
>     s wait
>     ...
>
> if we look at the context containing the wait its pc will be the same whether the process is blocked, waiting on the semaphore, or whether the semaphore has been signalled but the process has not been able to proceed because it is of lower priority than runnable processes and so can make no progress.  This caused problems for code such as this:
>
> Semaphore>>critical: mutuallyExcludedBlock
> self wait.
> ^mutuallyExcludedBlock ensure: [self signal]
>
> because the ensure: won't be entered if higher priority runnable processes are preventing it from running.
>
> And for code such as this:
>
> Semaphore>>critical: mutuallyExcludedBlock
> ^[self wait.
>   mutuallyExcludedBlock value]
> ensure: [self signal]
>
> because if the process is terminated when the semaphore has not been signalled (i.e. the process is blocked in the wait), Process>>terminate will run the ensure: block anyway, resulting in the Semaphore getting an extra signal.
>
> This occupied Andreas and I at Qwaq, and we didn't solve it.  We developed Mutex as a more efficient version of Monitor, but this is also subject so the same problem.  We did change the definition of ensure: so that it is not a suspension point, by adding valueNoContextSwitch[:]
>
> BlockClosure>>ensure: aBlock
> "Evaluate a termination block after evaluating the receiver, regardless of
> whether the receiver's evaluation completes.  N.B.  This method is *not*
> implemented as a primitive.  Primitive 198 always fails.  The VM uses prim
> 198 in a context's method as the mark for an ensure:/ifCurtailed: activation."
>
> | complete returnValue |
> <primitive: 198>
> returnValue := self valueNoContextSwitch.
> complete ifNil:[
> complete := true.
> aBlock value.
> ].
> ^ returnValue
>
> This means that we don't have to deal with suspensions here (marked with !!!)
>
> I now understand how to distinguish between the two cases, between blocking and not blocked but no progress.  Process>>suspend answers the list the Process was on when it was suspended.  If the process is already suspended Process>>suspend answers nil.  If the process is waiting on a Semaphore or a Mutex, Process>>suspend answers the Semaphore or Mutex. And if the process is runnable then Process>>suspend answers the process's run list (a LinkedList in ProcessorScheduler's quiescentProcessLists array corresponding to the process's priority).
>
> So Process>>#terminate can distinguish between #wait or #primitiveEnterCriticalSection or #primitiveTestAndSetOwnershipOfCriticalSection being blocked, or being unblocked but having made no progress due to too low a priority.  We do so by testing the class of the result of suspending the process.  If it is a LinkedList, the process has past the #wait or #primitiveEnterCriticalSection but has made no progress due to too low a priority.
>
> The version of Process>>#terminate I'm about to commit deals with several cases.  Let me present the cases first.  There are three versions of Semaphore>>#critical: to handle, and one version of Mutex>>critical: and Mutex>>#critical:ifLocked:.
>
> The two basic versions of Semaphore>>critical: are
>
> V1
> critical: mutuallyExcludedBlock
> "Evaluate mutuallyExcludedBlock only if the receiver is not currently in
> the process of running the critical: message. If the receiver is, evaluate
> mutuallyExcludedBlock after the other critical: message is finished."
> <criticalSection>
> self wait.
> ^mutuallyExcludedBlock ensure: [self signal]
>
> V2
> critical: mutuallyExcludedBlock
> "Evaluate mutuallyExcludedBlock only if the receiver is not currently in
> the process of running the critical: message. If the receiver is, evaluate
> mutuallyExcludedBlock after the other critical: message is finished."
> <criticalSection>
> ^[self wait.
>   mutuallyExcludedBlock value]
> ensure: [self signal]
>
> and Juan's safer version is (after I added the criticalSection pragma)
>
> V3
> critical: mutuallyExcludedBlock
> "Evaluate mutuallyExcludedBlock only if the receiver is not currently in
> the process of running the critical: message. If the receiver is, evaluate
> mutuallyExcludedBlock after the other critical: message is finished."
> <criticalSection>
> | caught |
> "We need to catch eventual interruptions very carefully.
> The naive approach of just doing, e.g.,:
> self wait.
> aBlock ensure:[self signal].
> will fail if the active process gets terminated while in the wait.
> However, the equally naive:
> [self wait.
> aBlock value] ensure:[self signal].
> will fail too, since the active process may get interrupted while
> entering the ensured block and leave the semaphore signaled twice.
> To avoid both problems we make use of the fact that interrupts only
> occur on sends (or backward jumps) and use an assignment (bytecode)
> right before we go into the wait primitive (which is not a real send and
> therefore not interruptable either)."
>
> caught := false.
> ^[
> caught := true.
> self wait.
> mutuallyExcludedBlock value
> ] ensure: [ caught ifTrue: [self signal] ]
>
> and the Mutex>>critical:'s are
>
> critical: aBlock
> "Evaluate aBlock protected by the receiver."
> <criticalSection>
> ^self primitiveEnterCriticalSection
> ifTrue: [aBlock value]
> ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]
>
> critical: aBlock ifLocked: lockedBlock
> "Answer the evaluation of aBlock protected by the receiver.  If it is already in a critical
> section on behalf of some other process answer the evaluation of lockedBlock."
> <criticalSection>
> ^self primitiveTestAndSetOwnershipOfCriticalSection
> ifNil: [lockedBlock value]
> ifNotNil:
> [:alreadyOwner|
> alreadyOwner
> ifTrue: [aBlock value]
> ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]]
>
> primitiveEnterCriticalSection answers false if the Mutex was unowned, and true if it was already owned by the active process.  It blocks otherwise.  primitiveTestAndSetOwnershipOfCriticalSection answers false if the Mutex was unowned, true if it was already owned by the active process, and nil if owned by some other process.
>
> So we want Process>>#terminate to correctly release the semaphores and mutexes no matter where in these methods they are.  We don't have to worry if the process is within the block argument to a critical: itself, only if it is actually within the critical: method or a block within it. If it is already within the block argument to critical: then Process>>#terminate's unwind handling will unwind things correctly.  Taking Juan's version of Semaphore>>#critical: above, the key issue is whether the process being terminated is blocked on the wait, not blocked but still stuck at the wait, or at the start of the block argument to ensure:.
>
> I have extracted the processing into Process>>releaseCriticalSection:, so now Process>>terminate reads
>
> terminate
> "Stop the process that the receiver represents forever.
> Unwind to execute pending ensure:/ifCurtailed: blocks before terminating.
> If the process is in the middle of a critical: critical section, release it properly."
>
> | ctxt unwindBlock oldList |
> self isActiveProcess ifTrue: [
> ctxt := thisContext.
> [ ctxt := ctxt findNextUnwindContextUpTo: nil.
> ctxt isNil
> ] whileFalse: [
> (ctxt tempAt: 2) ifNil:[
> ctxt tempAt: 2 put: nil.
> unwindBlock := ctxt tempAt: 1.
> thisContext terminateTo: ctxt.
> unwindBlock value].
> ].
> thisContext terminateTo: nil.
> self suspend.
> ] ifFalse:[
> "Always suspend the process first so it doesn't accidentally get woken up.
> N.B. If oldList is a LinkedList then the process is runnable. If it is a Semaphore/Mutex et al
> then the process is blocked, and if it is nil then the process is already suspended."
> oldList := self suspend.
> suspendedContext ifNotNil:
> ["Release any method marked with the <criticalSection> pragma.
>  The argument is whether the process is runnable."
> self releaseCriticalSection: (oldList isNil or: [oldList class == LinkedList]).
>
> "If terminating a process halfways through an unwind, try to complete that unwind block first."
> (suspendedContext findNextUnwindContextUpTo: nil) ifNotNil:
> [:outer|
> (suspendedContext findContextSuchThat:[:c| c closure == (outer tempAt: 1)]) ifNotNil:
> [:inner| "This is an unwind block currently under evaluation"
> suspendedContext runUntilErrorOrReturnFrom: inner]].
>
> ctxt := self popTo: suspendedContext bottomContext.
> ctxt == suspendedContext bottomContext ifFalse:
> [self debug: ctxt title: 'Unwind error during termination'].
> "Set the context to its endPC for the benefit of isTerminated."
> ctxt pc: ctxt endPC]]
>
> In implementing releaseCriticalSection: we need to know which selector a context has just sent.  selectorJustSentOrSelf is implemented in Squeak as
>
> InstructionStream>>selectorJustSentOrSelf
> "If this instruction follows a send, answer the send's selector, otherwise answer self."
>
> | method |
> method := self method.
> ^method encoderClass selectorToSendOrItselfFor: self in: method at: self previousPc
>
> c.f.
>
> InstructionStream>>selectorToSendOrSelf
> "If this instruction is a send, answer the selector, otherwise answer self."
>
> | method |
> method := self method.
> ^method encoderClass selectorToSendOrItselfFor: self in: method at: pc
>
> Now we can implement Process>>#releaseCriticalSection:
>
> releaseCriticalSection: runnable
> "Figure out if we are terminating a process that is in the ensure: block of a critical section.
> In this case, if the block has made progress, pop the suspendedContext so that we leave the
> ensure: block inside the critical: without signaling the semaphore/exiting the primitive section,
> since presumably this has already happened.  But if it hasn't made progress but is beyond the
> wait (which we can tell my the oldList being one of the runnable lists, i.e. a LinkedList, not a
> Semaphore or Mutex, et al), then the ensure: block needs to be run."
> | selectorJustSent |
> (suspendedContext method pragmaAt: #criticalSection) ifNil: [^self].
> selectorJustSent := suspendedContext selectorJustSentOrSelf.
>
> "Receiver and/or argument blocks of ensure: in Semaphore>>critical: or Mutex>>#critical:"
> suspendedContext isClosureContext ifTrue:
> [suspendedContext sender selector == #ensure: ifTrue:
> [| notWaitingButMadeNoProgress |
> "Avoid running the ensure: block twice, popping it if it has already been run. If runnable
> but at the wait, leave it in place. N.B. No need to check if the block receiver of ensure: has
> not started to run (via suspendedContext pc = suspendedContext startpc) because ensure:
> uses valueNoContextSwitch, and so there is no suspension point before the wait."
> notWaitingButMadeNoProgress :=
> runnable
> and: [selectorJustSent == #wait
> and: [suspendedContext sender selectorJustSentOrSelf == #valueNoContextSwitch]].
> notWaitingButMadeNoProgress ifFalse:
> [suspendedContext := suspendedContext home]].
> ^self].
>
> "Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still blocked?  If so, nothing further to do."
> runnable ifFalse: [^self].
>
> "If still at the wait the ensure: block has not been activated, so signal to restore."
> selectorJustSent == #wait ifTrue:
> [suspendedContext receiver signal].
>
> "If still at the lock primitive and the lock primitive just acquired ownership (indicated by it answering false)
> then the ensure block has not been activated, so explicitly primitiveExitCriticalSection to unlock."
> (selectorJustSent == #primitiveEnterCriticalSection
> or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection]) ifTrue:
> [(suspendedContext stackPtr > 0
>  and: [suspendedContext top == false]) ifTrue:
> [suspendedContext receiver primitiveExitCriticalSection]]
>
>
> Let's go through it line by line.  First, runnable is an argument, determined in Process>>#terminate.  One could invoke it with
>
>     self releaseCriticalSection: oldList class == LinkedList
>
> but this means that an already suspended process is assumed to be not runnable, which makes it tricky to debug the Process>>#terminate method.  One has to assign to oldList while stepping though the method.  I've chosen safety, assuming that the process is still runnable if suspend answers nil, its simply being debugged.
>
> Then we're only interested in <criticalSection> marked methods se we return if there's no such pragma.
>
> Then we deal with blocks in these methods.  One issue here is to avoid running the ensure: block twice if it is already being evaluated.  The other is to run it if it is stalled and has yet to be run.
>
> So if
>
> suspendedContext isClosureContext ifTrue:
>
> we're in the ensure: receiver or argument blocks in any <criticalSection> marked method, i.e. Semaphore>>critical: and Mutex>>critical:[ifLocked:].  If wait was just sent then we're in the ensure: receiver block of Semaphore>>critical: (V2 & V3 above) and the issue is whether the process is blocked or is unblocked and has made no progress. If blocked then nothing needs to be done; the ensure: block is discarded and the stack cut back to the critical: activation.  If progress has been made then nothing needs to be done (in fact we can't be in this state; the ensure: receiver will have started evaluating the critical: block argument).  If unblocked, but no progress has been made, do /not/ discard the unwind block and it will be run in Process>>#terminate when this method returns.  Hence...
>
> [suspendedContext sender selector == #ensure: ifTrue:

Just one nit. Yes we know the method sends #ensure:, but should the code below also be run for other methods that are marked by primitive 198?
(eg, #ifCurtailed:, as send in Semaphore>>critical:ifCurtailed:).
I am just asking :) checking for the prim might be a tiny tad more robust.

best
        -tobias



> [| notWaitingButMadeNoProgress |
> "Avoid running the ensure: block twice, popping it if it has already been run. If runnable
> but at the wait, leave it in place. N.B. No need to check if the block receiver of ensure: has
> not started to run (via suspendedContext pc = suspendedContext startpc) because ensure:
> uses valueNoContextSwitch, and so there is no suspension point before the wait."
> notWaitingButMadeNoProgress :=
> runnable
> and: [selectorJustSent == #wait
> and: [suspendedContext sender selectorJustSentOrSelf == #valueNoContextSwitch]].
> notWaitingButMadeNoProgress ifFalse:
> [suspendedContext := suspendedContext home]].
> ^self].
>
> Now we're left with the simpler version of Semaphore>>critical: (V1 above) and the two Mutex methods Mutex>>#critical:[ifLocked:].  Here the only state we have to worry about is that the process is unblocked but has made no progress.  If not runnable the process is still blocked and we can simply return.
>
> "Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still blocked?  If so, nothing further to do."
> runnable ifFalse: [^self].
>
> If #wait was just sent the process is in Semaphore>>#critical: and, because ensure: has not been sent we signal explicitly to restore the signal count:
>
> "If still at the wait the ensure: block has not been activated, so signal to restore."
> selectorJustSent == #wait ifTrue:
> [suspendedContext receiver signal].
>
> If either of primitiveEnterCriticalSection or primitiveTestAndSetOwnershipOfCriticalSection have just been sent then either the Mutex is already owned, in which case the ensure block is elsewhere in the colder part of the stack, or has just been owned, and because ensure: has not been sent we unlock explicitly to release the Mutex:
>
> "If still at the lock primitive and the lock primitive just acquired ownership (indicated by it answering false)
> then the ensure block has not been activated, so explicitly primitiveExitCriticalSection to unlock."
> (selectorJustSent == #primitiveEnterCriticalSection
> or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection]) ifTrue:
> [(suspendedContext stackPtr > 0
>  and: [suspendedContext top == false]) ifTrue:
> [suspendedContext receiver primitiveExitCriticalSection]]
>
> So Pharoers can you read and say whether you think this is sane or not?  If so, then we can kibbutz to write the Pharo version.
>
> Squeakers can you review Kernel-eem.1183 & Kernel-eem.1184 in the inbox?  Kernel-eem.1183 includes the fix as described above.  Kernel-eem.1184 reverts Semaphore>>#critical: to V1 above.
>
>
> P.S. Looking at V1 above it seems to me that there is an issue if the process is preempted in ensure: before sending valueNoContextSwitch:.  I'll try and write a test that advances a process to that precise point.  If that test fails I think we have to use V2 or V3, and V2 is clearly preferable.
>
> _,,,^..^,,,_
> best, Eliot
>


Reply | Threaded
Open this post in threaded view
|

Re: Solving termination of critical sections in the context of priority inversion was: SemaphoreTest fails in trunk, is a fix needed for the 5.2 release?

David T. Lewis
On Fri, Jul 27, 2018 at 09:31:20AM +0200, Tobias Pape wrote:
> Hi Eliot,
>
> Thanks for the comprehensive write-up.

+1000


>
> Looks all got from over here.

It looks good to me also. The tests pass now, and CommandShell (which
does a lot of Semaphore stuff) still works, so it looks to me.

Thanks a lot!

Dave


>
> > On 27.07.2018, at 06:31, Eliot Miranda <[hidden email]> wrote:
> >
> > Hi David, Hi Bert, Cl??ment, Juan, Levente and Marcus, Hi Anyone else with strong experience in the VM with processes,
> >
> > On Mon, Jul 23, 2018 at 7:38 PM, David T. Lewis <[hidden email]> wrote:
> > Semaphore seems like a rather basic thing that should work correctly in
> > any Squeak image. The tests do not pass in trunk any more.
> >
> > Specifically, SemaphoreTest>>testSemaInCriticalWait fails in trunk, but passes
> > in the earlier Squeak 4.6 / 5.0 images.
> >
> > Is this a real problem? Does it need to be fixed for the 5.2 release?
> >
> > Yes.  Yes.  And it needs to be fixed in Pharo too.  I know this message will strike you as TL;DR, but please, especially if you're Bert, Cl??ment, Juan, Levente or Marcus, read this carefully.  It's quite important.  And below I'll present the Squeak code but will work with Cl??ment and Marcus to implement semantically equivalent Pharo code asap.
> >
> > And apologies in advance for the repetitious nature of this message.  It is better that I am precise than I am brief and anyone miss anything.  This is an old problem and it will be nice if I've fixed it, but I could easily have missed something; this problem having been around for decades.  OK...
> >
> >
> > This is an old problem which boiled down to there being no way to determine by looking at a process's suspendedContext whether a process is either waiting on a Semaphore or Mutex or is no longer waiting, but has made no progress because it is at a lower priority than runnable processes and so has not got a chance to run yet.
> >
> > So in
> >     | s |
> >     s := Semaphore new.
> >     ...
> >     s wait
> >     ...
> >
> > if we look at the context containing the wait its pc will be the same whether the process is blocked, waiting on the semaphore, or whether the semaphore has been signalled but the process has not been able to proceed because it is of lower priority than runnable processes and so can make no progress.  This caused problems for code such as this:
> >
> > Semaphore>>critical: mutuallyExcludedBlock
> > self wait.
> > ^mutuallyExcludedBlock ensure: [self signal]
> >
> > because the ensure: won't be entered if higher priority runnable processes are preventing it from running.
> >
> > And for code such as this:
> >
> > Semaphore>>critical: mutuallyExcludedBlock
> > ^[self wait.
> >   mutuallyExcludedBlock value]
> > ensure: [self signal]
> >
> > because if the process is terminated when the semaphore has not been signalled (i.e. the process is blocked in the wait), Process>>terminate will run the ensure: block anyway, resulting in the Semaphore getting an extra signal.
> >
> > This occupied Andreas and I at Qwaq, and we didn't solve it.  We developed Mutex as a more efficient version of Monitor, but this is also subject so the same problem.  We did change the definition of ensure: so that it is not a suspension point, by adding valueNoContextSwitch[:]
> >
> > BlockClosure>>ensure: aBlock
> > "Evaluate a termination block after evaluating the receiver, regardless of
> > whether the receiver's evaluation completes.  N.B.  This method is *not*
> > implemented as a primitive.  Primitive 198 always fails.  The VM uses prim
> > 198 in a context's method as the mark for an ensure:/ifCurtailed: activation."
> >
> > | complete returnValue |
> > <primitive: 198>
> > returnValue := self valueNoContextSwitch.
> > complete ifNil:[
> > complete := true.
> > aBlock value.
> > ].
> > ^ returnValue
> >
> > This means that we don't have to deal with suspensions here (marked with !!!)
> >
> > I now understand how to distinguish between the two cases, between blocking and not blocked but no progress.  Process>>suspend answers the list the Process was on when it was suspended.  If the process is already suspended Process>>suspend answers nil.  If the process is waiting on a Semaphore or a Mutex, Process>>suspend answers the Semaphore or Mutex. And if the process is runnable then Process>>suspend answers the process's run list (a LinkedList in ProcessorScheduler's quiescentProcessLists array corresponding to the process's priority).
> >
> > So Process>>#terminate can distinguish between #wait or #primitiveEnterCriticalSection or #primitiveTestAndSetOwnershipOfCriticalSection being blocked, or being unblocked but having made no progress due to too low a priority.  We do so by testing the class of the result of suspending the process.  If it is a LinkedList, the process has past the #wait or #primitiveEnterCriticalSection but has made no progress due to too low a priority.
> >
> > The version of Process>>#terminate I'm about to commit deals with several cases.  Let me present the cases first.  There are three versions of Semaphore>>#critical: to handle, and one version of Mutex>>critical: and Mutex>>#critical:ifLocked:.
> >
> > The two basic versions of Semaphore>>critical: are
> >
> > V1
> > critical: mutuallyExcludedBlock
> > "Evaluate mutuallyExcludedBlock only if the receiver is not currently in
> > the process of running the critical: message. If the receiver is, evaluate
> > mutuallyExcludedBlock after the other critical: message is finished."
> > <criticalSection>
> > self wait.
> > ^mutuallyExcludedBlock ensure: [self signal]
> >
> > V2
> > critical: mutuallyExcludedBlock
> > "Evaluate mutuallyExcludedBlock only if the receiver is not currently in
> > the process of running the critical: message. If the receiver is, evaluate
> > mutuallyExcludedBlock after the other critical: message is finished."
> > <criticalSection>
> > ^[self wait.
> >   mutuallyExcludedBlock value]
> > ensure: [self signal]
> >
> > and Juan's safer version is (after I added the criticalSection pragma)
> >
> > V3
> > critical: mutuallyExcludedBlock
> > "Evaluate mutuallyExcludedBlock only if the receiver is not currently in
> > the process of running the critical: message. If the receiver is, evaluate
> > mutuallyExcludedBlock after the other critical: message is finished."
> > <criticalSection>
> > | caught |
> > "We need to catch eventual interruptions very carefully.
> > The naive approach of just doing, e.g.,:
> > self wait.
> > aBlock ensure:[self signal].
> > will fail if the active process gets terminated while in the wait.
> > However, the equally naive:
> > [self wait.
> > aBlock value] ensure:[self signal].
> > will fail too, since the active process may get interrupted while
> > entering the ensured block and leave the semaphore signaled twice.
> > To avoid both problems we make use of the fact that interrupts only
> > occur on sends (or backward jumps) and use an assignment (bytecode)
> > right before we go into the wait primitive (which is not a real send and
> > therefore not interruptable either)."
> >
> > caught := false.
> > ^[
> > caught := true.
> > self wait.
> > mutuallyExcludedBlock value
> > ] ensure: [ caught ifTrue: [self signal] ]
> >
> > and the Mutex>>critical:'s are
> >
> > critical: aBlock
> > "Evaluate aBlock protected by the receiver."
> > <criticalSection>
> > ^self primitiveEnterCriticalSection
> > ifTrue: [aBlock value]
> > ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]
> >
> > critical: aBlock ifLocked: lockedBlock
> > "Answer the evaluation of aBlock protected by the receiver.  If it is already in a critical
> > section on behalf of some other process answer the evaluation of lockedBlock."
> > <criticalSection>
> > ^self primitiveTestAndSetOwnershipOfCriticalSection
> > ifNil: [lockedBlock value]
> > ifNotNil:
> > [:alreadyOwner|
> > alreadyOwner
> > ifTrue: [aBlock value]
> > ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]]
> >
> > primitiveEnterCriticalSection answers false if the Mutex was unowned, and true if it was already owned by the active process.  It blocks otherwise.  primitiveTestAndSetOwnershipOfCriticalSection answers false if the Mutex was unowned, true if it was already owned by the active process, and nil if owned by some other process.
> >
> > So we want Process>>#terminate to correctly release the semaphores and mutexes no matter where in these methods they are.  We don't have to worry if the process is within the block argument to a critical: itself, only if it is actually within the critical: method or a block within it. If it is already within the block argument to critical: then Process>>#terminate's unwind handling will unwind things correctly.  Taking Juan's version of Semaphore>>#critical: above, the key issue is whether the process being terminated is blocked on the wait, not blocked but still stuck at the wait, or at the start of the block argument to ensure:.
> >
> > I have extracted the processing into Process>>releaseCriticalSection:, so now Process>>terminate reads
> >
> > terminate
> > "Stop the process that the receiver represents forever.
> > Unwind to execute pending ensure:/ifCurtailed: blocks before terminating.
> > If the process is in the middle of a critical: critical section, release it properly."
> >
> > | ctxt unwindBlock oldList |
> > self isActiveProcess ifTrue: [
> > ctxt := thisContext.
> > [ ctxt := ctxt findNextUnwindContextUpTo: nil.
> > ctxt isNil
> > ] whileFalse: [
> > (ctxt tempAt: 2) ifNil:[
> > ctxt tempAt: 2 put: nil.
> > unwindBlock := ctxt tempAt: 1.
> > thisContext terminateTo: ctxt.
> > unwindBlock value].
> > ].
> > thisContext terminateTo: nil.
> > self suspend.
> > ] ifFalse:[
> > "Always suspend the process first so it doesn't accidentally get woken up.
> > N.B. If oldList is a LinkedList then the process is runnable. If it is a Semaphore/Mutex et al
> > then the process is blocked, and if it is nil then the process is already suspended."
> > oldList := self suspend.
> > suspendedContext ifNotNil:
> > ["Release any method marked with the <criticalSection> pragma.
> >  The argument is whether the process is runnable."
> > self releaseCriticalSection: (oldList isNil or: [oldList class == LinkedList]).
> >
> > "If terminating a process halfways through an unwind, try to complete that unwind block first."
> > (suspendedContext findNextUnwindContextUpTo: nil) ifNotNil:
> > [:outer|
> > (suspendedContext findContextSuchThat:[:c| c closure == (outer tempAt: 1)]) ifNotNil:
> > [:inner| "This is an unwind block currently under evaluation"
> > suspendedContext runUntilErrorOrReturnFrom: inner]].
> >
> > ctxt := self popTo: suspendedContext bottomContext.
> > ctxt == suspendedContext bottomContext ifFalse:
> > [self debug: ctxt title: 'Unwind error during termination'].
> > "Set the context to its endPC for the benefit of isTerminated."
> > ctxt pc: ctxt endPC]]
> >
> > In implementing releaseCriticalSection: we need to know which selector a context has just sent.  selectorJustSentOrSelf is implemented in Squeak as
> >
> > InstructionStream>>selectorJustSentOrSelf
> > "If this instruction follows a send, answer the send's selector, otherwise answer self."
> >
> > | method |
> > method := self method.
> > ^method encoderClass selectorToSendOrItselfFor: self in: method at: self previousPc
> >
> > c.f.
> >
> > InstructionStream>>selectorToSendOrSelf
> > "If this instruction is a send, answer the selector, otherwise answer self."
> >
> > | method |
> > method := self method.
> > ^method encoderClass selectorToSendOrItselfFor: self in: method at: pc
> >
> > Now we can implement Process>>#releaseCriticalSection:
> >
> > releaseCriticalSection: runnable
> > "Figure out if we are terminating a process that is in the ensure: block of a critical section.
> > In this case, if the block has made progress, pop the suspendedContext so that we leave the
> > ensure: block inside the critical: without signaling the semaphore/exiting the primitive section,
> > since presumably this has already happened.  But if it hasn't made progress but is beyond the
> > wait (which we can tell my the oldList being one of the runnable lists, i.e. a LinkedList, not a
> > Semaphore or Mutex, et al), then the ensure: block needs to be run."
> > | selectorJustSent |
> > (suspendedContext method pragmaAt: #criticalSection) ifNil: [^self].
> > selectorJustSent := suspendedContext selectorJustSentOrSelf.
> >
> > "Receiver and/or argument blocks of ensure: in Semaphore>>critical: or Mutex>>#critical:"
> > suspendedContext isClosureContext ifTrue:
> > [suspendedContext sender selector == #ensure: ifTrue:
> > [| notWaitingButMadeNoProgress |
> > "Avoid running the ensure: block twice, popping it if it has already been run. If runnable
> > but at the wait, leave it in place. N.B. No need to check if the block receiver of ensure: has
> > not started to run (via suspendedContext pc = suspendedContext startpc) because ensure:
> > uses valueNoContextSwitch, and so there is no suspension point before the wait."
> > notWaitingButMadeNoProgress :=
> > runnable
> > and: [selectorJustSent == #wait
> > and: [suspendedContext sender selectorJustSentOrSelf == #valueNoContextSwitch]].
> > notWaitingButMadeNoProgress ifFalse:
> > [suspendedContext := suspendedContext home]].
> > ^self].
> >
> > "Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still blocked?  If so, nothing further to do."
> > runnable ifFalse: [^self].
> >
> > "If still at the wait the ensure: block has not been activated, so signal to restore."
> > selectorJustSent == #wait ifTrue:
> > [suspendedContext receiver signal].
> >
> > "If still at the lock primitive and the lock primitive just acquired ownership (indicated by it answering false)
> > then the ensure block has not been activated, so explicitly primitiveExitCriticalSection to unlock."
> > (selectorJustSent == #primitiveEnterCriticalSection
> > or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection]) ifTrue:
> > [(suspendedContext stackPtr > 0
> >  and: [suspendedContext top == false]) ifTrue:
> > [suspendedContext receiver primitiveExitCriticalSection]]
> >
> >
> > Let's go through it line by line.  First, runnable is an argument, determined in Process>>#terminate.  One could invoke it with
> >
> >     self releaseCriticalSection: oldList class == LinkedList
> >
> > but this means that an already suspended process is assumed to be not runnable, which makes it tricky to debug the Process>>#terminate method.  One has to assign to oldList while stepping though the method.  I've chosen safety, assuming that the process is still runnable if suspend answers nil, its simply being debugged.
> >
> > Then we're only interested in <criticalSection> marked methods se we return if there's no such pragma.
> >
> > Then we deal with blocks in these methods.  One issue here is to avoid running the ensure: block twice if it is already being evaluated.  The other is to run it if it is stalled and has yet to be run.
> >
> > So if
> >
> > suspendedContext isClosureContext ifTrue:
> >
> > we're in the ensure: receiver or argument blocks in any <criticalSection> marked method, i.e. Semaphore>>critical: and Mutex>>critical:[ifLocked:].  If wait was just sent then we're in the ensure: receiver block of Semaphore>>critical: (V2 & V3 above) and the issue is whether the process is blocked or is unblocked and has made no progress. If blocked then nothing needs to be done; the ensure: block is discarded and the stack cut back to the critical: activation.  If progress has been made then nothing needs to be done (in fact we can't be in this state; the ensure: receiver will have started evaluating the critical: block argument).  If unblocked, but no progress has been made, do /not/ discard the unwind block and it will be run in Process>>#terminate when this method returns.  Hence...
> >
> > [suspendedContext sender selector == #ensure: ifTrue:
>
> Just one nit. Yes we know the method sends #ensure:, but should the code below also be run for other methods that are marked by primitive 198?
> (eg, #ifCurtailed:, as send in Semaphore>>critical:ifCurtailed:).
> I am just asking :) checking for the prim might be a tiny tad more robust.
>
> best
> -tobias
>
>
>
> > [| notWaitingButMadeNoProgress |
> > "Avoid running the ensure: block twice, popping it if it has already been run. If runnable
> > but at the wait, leave it in place. N.B. No need to check if the block receiver of ensure: has
> > not started to run (via suspendedContext pc = suspendedContext startpc) because ensure:
> > uses valueNoContextSwitch, and so there is no suspension point before the wait."
> > notWaitingButMadeNoProgress :=
> > runnable
> > and: [selectorJustSent == #wait
> > and: [suspendedContext sender selectorJustSentOrSelf == #valueNoContextSwitch]].
> > notWaitingButMadeNoProgress ifFalse:
> > [suspendedContext := suspendedContext home]].
> > ^self].
> >
> > Now we're left with the simpler version of Semaphore>>critical: (V1 above) and the two Mutex methods Mutex>>#critical:[ifLocked:].  Here the only state we have to worry about is that the process is unblocked but has made no progress.  If not runnable the process is still blocked and we can simply return.
> >
> > "Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still blocked?  If so, nothing further to do."
> > runnable ifFalse: [^self].
> >
> > If #wait was just sent the process is in Semaphore>>#critical: and, because ensure: has not been sent we signal explicitly to restore the signal count:
> >
> > "If still at the wait the ensure: block has not been activated, so signal to restore."
> > selectorJustSent == #wait ifTrue:
> > [suspendedContext receiver signal].
> >
> > If either of primitiveEnterCriticalSection or primitiveTestAndSetOwnershipOfCriticalSection have just been sent then either the Mutex is already owned, in which case the ensure block is elsewhere in the colder part of the stack, or has just been owned, and because ensure: has not been sent we unlock explicitly to release the Mutex:
> >
> > "If still at the lock primitive and the lock primitive just acquired ownership (indicated by it answering false)
> > then the ensure block has not been activated, so explicitly primitiveExitCriticalSection to unlock."
> > (selectorJustSent == #primitiveEnterCriticalSection
> > or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection]) ifTrue:
> > [(suspendedContext stackPtr > 0
> >  and: [suspendedContext top == false]) ifTrue:
> > [suspendedContext receiver primitiveExitCriticalSection]]
> >
> > So Pharoers can you read and say whether you think this is sane or not?  If so, then we can kibbutz to write the Pharo version.
> >
> > Squeakers can you review Kernel-eem.1183 & Kernel-eem.1184 in the inbox?  Kernel-eem.1183 includes the fix as described above.  Kernel-eem.1184 reverts Semaphore>>#critical: to V1 above.
> >
> >
> > P.S. Looking at V1 above it seems to me that there is an issue if the process is preempted in ensure: before sending valueNoContextSwitch:.  I'll try and write a test that advances a process to that precise point.  If that test fails I think we have to use V2 or V3, and V2 is clearly preferable.
> >
> > _,,,^..^,,,_
> > best, Eliot
> >
>
>

Reply | Threaded
Open this post in threaded view
|

Re: Solving termination of critical sections in the context of priority inversion was: SemaphoreTest fails in trunk, is a fix needed for the 5.2 release?

Eliot Miranda-2
In reply to this post by Tobias Pape
Hi Tobias,

> On Jul 27, 2018, at 12:31 AM, Tobias Pape <[hidden email]> wrote:
>
> Hi Eliot,
>
> Thanks for the comprehensive write-up.
>
> Looks all got from over here.
>
>> On 27.07.2018, at 06:31, Eliot Miranda <[hidden email]> wrote:
>>
>> Hi David, Hi Bert, Clément, Juan, Levente and Marcus, Hi Anyone else with strong experience in the VM with processes,
>>
>> On Mon, Jul 23, 2018 at 7:38 PM, David T. Lewis <[hidden email]> wrote:
>> Semaphore seems like a rather basic thing that should work correctly in
>> any Squeak image. The tests do not pass in trunk any more.
>>
>> Specifically, SemaphoreTest>>testSemaInCriticalWait fails in trunk, but passes
>> in the earlier Squeak 4.6 / 5.0 images.
>>
>> Is this a real problem? Does it need to be fixed for the 5.2 release?
>>
>> Yes.  Yes.  And it needs to be fixed in Pharo too.  I know this message will strike you as TL;DR, but please, especially if you're Bert, Clément, Juan, Levente or Marcus, read this carefully.  It's quite important.  And below I'll present the Squeak code but will work with Clément and Marcus to implement semantically equivalent Pharo code asap.
>>
>> And apologies in advance for the repetitious nature of this message.  It is better that I am precise than I am brief and anyone miss anything.  This is an old problem and it will be nice if I've fixed it, but I could easily have missed something; this problem having been around for decades.  OK...
>>
>>
>> This is an old problem which boiled down to there being no way to determine by looking at a process's suspendedContext whether a process is either waiting on a Semaphore or Mutex or is no longer waiting, but has made no progress because it is at a lower priority than runnable processes and so has not got a chance to run yet.
>>
>> So in
>>    | s |
>>    s := Semaphore new.
>>    ...
>>    s wait
>>    ...
>>
>> if we look at the context containing the wait its pc will be the same whether the process is blocked, waiting on the semaphore, or whether the semaphore has been signalled but the process has not been able to proceed because it is of lower priority than runnable processes and so can make no progress.  This caused problems for code such as this:
>>
>> Semaphore>>critical: mutuallyExcludedBlock
>>    self wait.
>>    ^mutuallyExcludedBlock ensure: [self signal]
>>
>> because the ensure: won't be entered if higher priority runnable processes are preventing it from running.
>>
>> And for code such as this:
>>
>> Semaphore>>critical: mutuallyExcludedBlock
>>    ^[self wait.
>>       mutuallyExcludedBlock value]
>>        ensure: [self signal]
>>
>> because if the process is terminated when the semaphore has not been signalled (i.e. the process is blocked in the wait), Process>>terminate will run the ensure: block anyway, resulting in the Semaphore getting an extra signal.
>>
>> This occupied Andreas and I at Qwaq, and we didn't solve it.  We developed Mutex as a more efficient version of Monitor, but this is also subject so the same problem.  We did change the definition of ensure: so that it is not a suspension point, by adding valueNoContextSwitch[:]
>>
>> BlockClosure>>ensure: aBlock
>>    "Evaluate a termination block after evaluating the receiver, regardless of
>>     whether the receiver's evaluation completes.  N.B.  This method is *not*
>>     implemented as a primitive.  Primitive 198 always fails.  The VM uses prim
>>     198 in a context's method as the mark for an ensure:/ifCurtailed: activation."
>>
>>    | complete returnValue |
>>    <primitive: 198>
>>    returnValue := self valueNoContextSwitch.
>>    complete ifNil:[
>>        complete := true.
>>        aBlock value.
>>    ].
>>    ^ returnValue
>>
>> This means that we don't have to deal with suspensions here (marked with !!!)
>>
>> I now understand how to distinguish between the two cases, between blocking and not blocked but no progress.  Process>>suspend answers the list the Process was on when it was suspended.  If the process is already suspended Process>>suspend answers nil.  If the process is waiting on a Semaphore or a Mutex, Process>>suspend answers the Semaphore or Mutex. And if the process is runnable then Process>>suspend answers the process's run list (a LinkedList in ProcessorScheduler's quiescentProcessLists array corresponding to the process's priority).
>>
>> So Process>>#terminate can distinguish between #wait or #primitiveEnterCriticalSection or #primitiveTestAndSetOwnershipOfCriticalSection being blocked, or being unblocked but having made no progress due to too low a priority.  We do so by testing the class of the result of suspending the process.  If it is a LinkedList, the process has past the #wait or #primitiveEnterCriticalSection but has made no progress due to too low a priority.
>>
>> The version of Process>>#terminate I'm about to commit deals with several cases.  Let me present the cases first.  There are three versions of Semaphore>>#critical: to handle, and one version of Mutex>>critical: and Mutex>>#critical:ifLocked:.
>>
>> The two basic versions of Semaphore>>critical: are
>>
>> V1
>> critical: mutuallyExcludedBlock
>>    "Evaluate mutuallyExcludedBlock only if the receiver is not currently in
>>    the process of running the critical: message. If the receiver is, evaluate
>>    mutuallyExcludedBlock after the other critical: message is finished."
>>    <criticalSection>
>>    self wait.
>>    ^mutuallyExcludedBlock ensure: [self signal]
>>
>> V2
>> critical: mutuallyExcludedBlock
>>    "Evaluate mutuallyExcludedBlock only if the receiver is not currently in
>>    the process of running the critical: message. If the receiver is, evaluate
>>    mutuallyExcludedBlock after the other critical: message is finished."
>>    <criticalSection>
>>    ^[self wait.
>>       mutuallyExcludedBlock value]
>>        ensure: [self signal]
>>
>> and Juan's safer version is (after I added the criticalSection pragma)
>>
>> V3
>> critical: mutuallyExcludedBlock
>>    "Evaluate mutuallyExcludedBlock only if the receiver is not currently in
>>    the process of running the critical: message. If the receiver is, evaluate
>>    mutuallyExcludedBlock after the other critical: message is finished."
>>    <criticalSection>
>>    | caught |
>>    "We need to catch eventual interruptions very carefully.
>>    The naive approach of just doing, e.g.,:
>>        self wait.
>>        aBlock ensure:[self signal].
>>    will fail if the active process gets terminated while in the wait.
>>    However, the equally naive:
>>        [self wait.
>>        aBlock value] ensure:[self signal].
>>    will fail too, since the active process may get interrupted while
>>    entering the ensured block and leave the semaphore signaled twice.
>>    To avoid both problems we make use of the fact that interrupts only
>>    occur on sends (or backward jumps) and use an assignment (bytecode)
>>    right before we go into the wait primitive (which is not a real send and
>>    therefore not interruptable either)."
>>
>>    caught := false.
>>    ^[
>>        caught := true.
>>        self wait.
>>        mutuallyExcludedBlock value
>>    ] ensure: [ caught ifTrue: [self signal] ]
>>
>> and the Mutex>>critical:'s are
>>
>> critical: aBlock
>>    "Evaluate aBlock protected by the receiver."
>>    <criticalSection>
>>    ^self primitiveEnterCriticalSection
>>        ifTrue: [aBlock value]
>>        ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]
>>
>> critical: aBlock ifLocked: lockedBlock
>>    "Answer the evaluation of aBlock protected by the receiver.  If it is already in a critical
>>     section on behalf of some other process answer the evaluation of lockedBlock."
>>    <criticalSection>
>>    ^self primitiveTestAndSetOwnershipOfCriticalSection
>>        ifNil: [lockedBlock value]
>>        ifNotNil:
>>            [:alreadyOwner|
>>             alreadyOwner
>>                ifTrue: [aBlock value]
>>                ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]]
>>
>> primitiveEnterCriticalSection answers false if the Mutex was unowned, and true if it was already owned by the active process.  It blocks otherwise.  primitiveTestAndSetOwnershipOfCriticalSection answers false if the Mutex was unowned, true if it was already owned by the active process, and nil if owned by some other process.
>>
>> So we want Process>>#terminate to correctly release the semaphores and mutexes no matter where in these methods they are.  We don't have to worry if the process is within the block argument to a critical: itself, only if it is actually within the critical: method or a block within it. If it is already within the block argument to critical: then Process>>#terminate's unwind handling will unwind things correctly.  Taking Juan's version of Semaphore>>#critical: above, the key issue is whether the process being terminated is blocked on the wait, not blocked but still stuck at the wait, or at the start of the block argument to ensure:.
>>
>> I have extracted the processing into Process>>releaseCriticalSection:, so now Process>>terminate reads
>>
>> terminate
>>    "Stop the process that the receiver represents forever.
>>     Unwind to execute pending ensure:/ifCurtailed: blocks before terminating.
>>     If the process is in the middle of a critical: critical section, release it properly."
>>
>>    | ctxt unwindBlock oldList |
>>    self isActiveProcess ifTrue: [
>>        ctxt := thisContext.
>>        [    ctxt := ctxt findNextUnwindContextUpTo: nil.
>>            ctxt isNil
>>        ] whileFalse: [
>>            (ctxt tempAt: 2) ifNil:[
>>                ctxt tempAt: 2 put: nil.
>>                unwindBlock := ctxt tempAt: 1.
>>                thisContext terminateTo: ctxt.
>>                unwindBlock value].
>>        ].
>>        thisContext terminateTo: nil.
>>        self suspend.
>>    ] ifFalse:[
>>        "Always suspend the process first so it doesn't accidentally get woken up.
>>         N.B. If oldList is a LinkedList then the process is runnable. If it is a Semaphore/Mutex et al
>>         then the process is blocked, and if it is nil then the process is already suspended."
>>        oldList := self suspend.
>>        suspendedContext ifNotNil:
>>            ["Release any method marked with the <criticalSection> pragma.
>>              The argument is whether the process is runnable."
>>             self releaseCriticalSection: (oldList isNil or: [oldList class == LinkedList]).
>>
>>            "If terminating a process halfways through an unwind, try to complete that unwind block first."
>>            (suspendedContext findNextUnwindContextUpTo: nil) ifNotNil:
>>                [:outer|
>>                (suspendedContext findContextSuchThat:[:c| c closure == (outer tempAt: 1)]) ifNotNil:
>>                    [:inner| "This is an unwind block currently under evaluation"
>>                    suspendedContext runUntilErrorOrReturnFrom: inner]].
>>
>>            ctxt := self popTo: suspendedContext bottomContext.
>>            ctxt == suspendedContext bottomContext ifFalse:
>>                [self debug: ctxt title: 'Unwind error during termination'].
>>            "Set the context to its endPC for the benefit of isTerminated."
>>            ctxt pc: ctxt endPC]]
>>
>> In implementing releaseCriticalSection: we need to know which selector a context has just sent.  selectorJustSentOrSelf is implemented in Squeak as
>>
>> InstructionStream>>selectorJustSentOrSelf
>>    "If this instruction follows a send, answer the send's selector, otherwise answer self."
>>
>>    | method |
>>    method := self method.
>>    ^method encoderClass selectorToSendOrItselfFor: self in: method at: self previousPc
>>
>> c.f.
>>
>> InstructionStream>>selectorToSendOrSelf
>>    "If this instruction is a send, answer the selector, otherwise answer self."
>>
>>    | method |
>>    method := self method.
>>    ^method encoderClass selectorToSendOrItselfFor: self in: method at: pc
>>
>> Now we can implement Process>>#releaseCriticalSection:
>>
>> releaseCriticalSection: runnable
>>    "Figure out if we are terminating a process that is in the ensure: block of a critical section.
>>     In this case, if the block has made progress, pop the suspendedContext so that we leave the
>>     ensure: block inside the critical: without signaling the semaphore/exiting the primitive section,
>>     since presumably this has already happened.  But if it hasn't made progress but is beyond the
>>     wait (which we can tell my the oldList being one of the runnable lists, i.e. a LinkedList, not a
>>     Semaphore or Mutex, et al), then the ensure: block needs to be run."
>>    | selectorJustSent |
>>    (suspendedContext method pragmaAt: #criticalSection) ifNil: [^self].
>>    selectorJustSent := suspendedContext selectorJustSentOrSelf.
>>
>>    "Receiver and/or argument blocks of ensure: in Semaphore>>critical: or Mutex>>#critical:"
>>    suspendedContext isClosureContext ifTrue:
>>        [suspendedContext sender selector == #ensure: ifTrue:
>>            [| notWaitingButMadeNoProgress |
>>            "Avoid running the ensure: block twice, popping it if it has already been run. If runnable
>>             but at the wait, leave it in place. N.B. No need to check if the block receiver of ensure: has
>>             not started to run (via suspendedContext pc = suspendedContext startpc) because ensure:
>>             uses valueNoContextSwitch, and so there is no suspension point before the wait."
>>             notWaitingButMadeNoProgress :=
>>                runnable
>>                and: [selectorJustSent == #wait
>>                and: [suspendedContext sender selectorJustSentOrSelf == #valueNoContextSwitch]].
>>             notWaitingButMadeNoProgress ifFalse:
>>                [suspendedContext := suspendedContext home]].
>>         ^self].
>>
>>    "Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still blocked?  If so, nothing further to do."
>>    runnable ifFalse: [^self].
>>
>>    "If still at the wait the ensure: block has not been activated, so signal to restore."
>>    selectorJustSent == #wait ifTrue:
>>        [suspendedContext receiver signal].
>>
>>    "If still at the lock primitive and the lock primitive just acquired ownership (indicated by it answering false)
>>     then the ensure block has not been activated, so explicitly primitiveExitCriticalSection to unlock."
>>    (selectorJustSent == #primitiveEnterCriticalSection
>>     or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection]) ifTrue:
>>        [(suspendedContext stackPtr > 0
>>          and: [suspendedContext top == false]) ifTrue:
>>            [suspendedContext receiver primitiveExitCriticalSection]]
>>
>>
>> Let's go through it line by line.  First, runnable is an argument, determined in Process>>#terminate.  One could invoke it with
>>
>>    self releaseCriticalSection: oldList class == LinkedList
>>
>> but this means that an already suspended process is assumed to be not runnable, which makes it tricky to debug the Process>>#terminate method.  One has to assign to oldList while stepping though the method.  I've chosen safety, assuming that the process is still runnable if suspend answers nil, its simply being debugged.
>>
>> Then we're only interested in <criticalSection> marked methods se we return if there's no such pragma.
>>
>> Then we deal with blocks in these methods.  One issue here is to avoid running the ensure: block twice if it is already being evaluated.  The other is to run it if it is stalled and has yet to be run.
>>
>> So if
>>
>>    suspendedContext isClosureContext ifTrue:
>>
>> we're in the ensure: receiver or argument blocks in any <criticalSection> marked method, i.e. Semaphore>>critical: and Mutex>>critical:[ifLocked:].  If wait was just sent then we're in the ensure: receiver block of Semaphore>>critical: (V2 & V3 above) and the issue is whether the process is blocked or is unblocked and has made no progress. If blocked then nothing needs to be done; the ensure: block is discarded and the stack cut back to the critical: activation.  If progress has been made then nothing needs to be done (in fact we can't be in this state; the ensure: receiver will have started evaluating the critical: block argument).  If unblocked, but no progress has been made, do /not/ discard the unwind block and it will be run in Process>>#terminate when this method returns.  Hence...
>>
>>        [suspendedContext sender selector == #ensure: ifTrue:
>
> Just one nit. Yes we know the method sends #ensure:, but should the code below also be run for other methods that are marked by primitive 198?
> (eg, #ifCurtailed:, as send in Semaphore>>critical:ifCurtailed:).
> I am just asking :) checking for the prim might be a tiny tad more robust.

Doh!  Much better.  

    unwindProtectPrimitiveNumber := 198.
    ...
    suspendedContext sender method primitive = unwindProtectPrimitiveNumber ifTrue:

or implement isUnwindProtect in Context.

Thanks!!

>
> best
>    -tobias
>
>
>
>>            [| notWaitingButMadeNoProgress |
>>            "Avoid running the ensure: block twice, popping it if it has already been run. If runnable
>>             but at the wait, leave it in place. N.B. No need to check if the block receiver of ensure: has
>>             not started to run (via suspendedContext pc = suspendedContext startpc) because ensure:
>>             uses valueNoContextSwitch, and so there is no suspension point before the wait."
>>             notWaitingButMadeNoProgress :=
>>                runnable
>>                and: [selectorJustSent == #wait
>>                and: [suspendedContext sender selectorJustSentOrSelf == #valueNoContextSwitch]].
>>             notWaitingButMadeNoProgress ifFalse:
>>                [suspendedContext := suspendedContext home]].
>>         ^self].
>>
>> Now we're left with the simpler version of Semaphore>>critical: (V1 above) and the two Mutex methods Mutex>>#critical:[ifLocked:].  Here the only state we have to worry about is that the process is unblocked but has made no progress.  If not runnable the process is still blocked and we can simply return.
>>
>>    "Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still blocked?  If so, nothing further to do."
>>    runnable ifFalse: [^self].
>>
>> If #wait was just sent the process is in Semaphore>>#critical: and, because ensure: has not been sent we signal explicitly to restore the signal count:
>>
>>    "If still at the wait the ensure: block has not been activated, so signal to restore."
>>    selectorJustSent == #wait ifTrue:
>>        [suspendedContext receiver signal].
>>
>> If either of primitiveEnterCriticalSection or primitiveTestAndSetOwnershipOfCriticalSection have just been sent then either the Mutex is already owned, in which case the ensure block is elsewhere in the colder part of the stack, or has just been owned, and because ensure: has not been sent we unlock explicitly to release the Mutex:
>>
>>    "If still at the lock primitive and the lock primitive just acquired ownership (indicated by it answering false)
>>     then the ensure block has not been activated, so explicitly primitiveExitCriticalSection to unlock."
>>    (selectorJustSent == #primitiveEnterCriticalSection
>>     or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection]) ifTrue:
>>        [(suspendedContext stackPtr > 0
>>          and: [suspendedContext top == false]) ifTrue:
>>            [suspendedContext receiver primitiveExitCriticalSection]]
>>
>> So Pharoers can you read and say whether you think this is sane or not?  If so, then we can kibbutz to write the Pharo version.
>>
>> Squeakers can you review Kernel-eem.1183 & Kernel-eem.1184 in the inbox?  Kernel-eem.1183 includes the fix as described above.  Kernel-eem.1184 reverts Semaphore>>#critical: to V1 above.
>>
>>
>> P.S. Looking at V1 above it seems to me that there is an issue if the process is preempted in ensure: before sending valueNoContextSwitch:.  I'll try and write a test that advances a process to that precise point.  If that test fails I think we have to use V2 or V3, and V2 is clearly preferable.
>>
>> _,,,^..^,,,_
>> best, Eliot
>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: Solving termination of critical sections in the context of priority inversion was: SemaphoreTest fails in trunk, is a fix needed for the 5.2 release?

Tobias Pape

> On 27.07.2018, at 18:01, Eliot Miranda <[hidden email]> wrote:
>
> Hi Tobias,
[…]

>
>> Just one nit. Yes we know the method sends #ensure:, but should the code below also be run for other methods that are marked by primitive 198?
>> (eg, #ifCurtailed:, as send in Semaphore>>critical:ifCurtailed:).
>> I am just asking :) checking for the prim might be a tiny tad more robust.
>
> Doh!  Much better.  
>
>    unwindProtectPrimitiveNumber := 198.
>    ...
>    suspendedContext sender method primitive = unwindProtectPrimitiveNumber ifTrue:
>
> or implement isUnwindProtect in Context.
>
> Thanks!!

No biggie ;)
        -t