Re: [squeak-dev] Solving termination of critical sections in the context of priority inversion was: SemaphoreTest fails in trunk, is a fix needed for the 5.2 release?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Solving termination of critical sections in the context of priority inversion was: SemaphoreTest fails in trunk, is a fix needed for the 5.2 release?

Juan Vuletich-3
 
Hi Eliot,

The V3 code in Cuis was last updated on 2010, but that was just the comment. I went back in time, and this (except for the comment) identical version of the code is 10/5/2007 by Andreas.

critical: mutuallyExcludedBlock           
    "Evaluate mutuallyExcludedBlock only if the receiver is not currently in
    the process of running the critical: message. If the receiver is, evaluate
    mutuallyExcludedBlock after the other critical: message is finished."
    | blockValue caught |
    caught := false.
    [
        caught := true.
        self wait.
        blockValue := mutuallyExcludedBlock value
    ] ensure: [caught ifTrue: [self signal]].
    ^blockValue

If you are interested, in Cuis, unzip /PreviousUpdates and in file list set this filter: *-ar.cs.st;*-ar.?.cs.st;*-ar.??.cs.st
This will show all the change sets in Cuis history directly based on Andreas' contributions. In this case it is 0104-SemaCritical-ar.cs.st

You know this stuff better than anyone, and I'm happy to follow your lead here, Eliot.

Just a question. You are using primitives #primitiveExitCriticalSection  #primitiveEnterCriticalSection #primitiveTestAndSetOwnershipOfCriticalSection (185, 186, 187), that Cuis hasn't used so far. Are they supported by SqueakJS and the classic interpreter?

Thanks,
-- 
Juan Vuletich
www.cuis-smalltalk.org
https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev
https://github.com/jvuletich
https://www.linkedin.com/in/juan-vuletich-75611b3
@JuanVuletich

On 7/27/2018 2:40 AM, Eliot Miranda wrote:

Ha!

On Thu, Jul 26, 2018 at 9:31 PM, Eliot Miranda <[hidden email]> wrote:
Hi David, Hi Bert, Clément, Juan, Levente and Marcus, Hi Anyone else with strong experience in the VM with processes,

On Mon, Jul 23, 2018 at 7:38 PM, David T. Lewis <[hidden email]> wrote:
Semaphore seems like a rather basic thing that should work correctly in
any Squeak image. The tests do not pass in trunk any more.

Specifically, SemaphoreTest>>testSemaInCriticalWait fails in trunk, but passes
in the earlier Squeak 4.6 / 5.0 images.

Is this a real problem? Does it need to be fixed for the 5.2 release?

Yes.  Yes.  And it needs to be fixed in Pharo too.  I know this message will strike you as TL;DR, but please, especially if you're Bert, Clément, Juan, Levente or Marcus, read this carefully.  It's quite important.  And below I'll present the Squeak code but will work with Clément and Marcus to implement semantically equivalent Pharo code asap.

And apologies in advance for the repetitious nature of this message.  It is better that I am precise than I am brief and anyone miss anything.  This is an old problem and it will be nice if I've fixed it, but I could easily have missed something; this problem having been around for decades.  OK...


This is an old problem which boiled down to there being no way to determine by looking at a process's suspendedContext whether a process is either waiting on a Semaphore or Mutex or is no longer waiting, but has made no progress because it is at a lower priority than runnable processes and so has not got a chance to run yet.

So in
    | s |
    s := Semaphore new.
    ...
    s wait
    ...

if we look at the context containing the wait its pc will be the same whether the process is blocked, waiting on the semaphore, or whether the semaphore has been signalled but the process has not been able to proceed because it is of lower priority than runnable processes and so can make no progress.  This caused problems for code such as this:

Semaphore>>critical: mutuallyExcludedBlock
self wait.
^mutuallyExcludedBlock ensure: [self signal]

because the ensure: won't be entered if higher priority runnable processes are preventing it from running.

And for code such as this:

Semaphore>>critical: mutuallyExcludedBlock
^[self wait.
mutuallyExcludedBlock value]
ensure: [self signal]

because if the process is terminated when the semaphore has not been signalled (i.e. the process is blocked in the wait), Process>>terminate will run the ensure: block anyway, resulting in the Semaphore getting an extra signal.

This occupied Andreas and I at Qwaq, and we didn't solve it.  We developed Mutex as a more efficient version of Monitor, but this is also subject so the same problem.  We did change the definition of ensure: so that it is not a suspension point, by adding valueNoContextSwitch[:]

BlockClosure>>ensure: aBlock
"Evaluate a termination block after evaluating the receiver, regardless of
whether the receiver's evaluation completes.  N.B.  This method is *not*
implemented as a primitive.  Primitive 198 always fails.  The VM uses prim
198 in a context's method as the mark for an ensure:/ifCurtailed: activation."

| complete returnValue |
<primitive: 198>
returnValue := self valueNoContextSwitch.
complete ifNil:[
complete := true.
aBlock value.
].
^ returnValue

This means that we don't have to deal with suspensions here (marked with !!!)

I now understand how to distinguish between the two cases, between blocking and not blocked but no progress.  Process>>suspend answers the list the Process was on when it was suspended.  If the process is already suspended Process>>suspend answers nil.  If the process is waiting on a Semaphore or a Mutex, Process>>suspend answers the Semaphore or Mutex. And if the process is runnable then Process>>suspend answers the process's run list (a LinkedList in ProcessorScheduler's quiescentProcessLists array corresponding to the process's priority).

So Process>>#terminate can distinguish between #wait or #primitiveEnterCriticalSection or #primitiveTestAndSetOwnershipOfCriticalSection being blocked, or being unblocked but having made no progress due to too low a priority.  We do so by testing the class of the result of suspending the process.  If it is a LinkedList, the process has past the #wait or #primitiveEnterCriticalSection but has made no progress due to too low a priority.

The version of Process>>#terminate I'm about to commit deals with several cases.  Let me present the cases first.  There are three versions of Semaphore>>#critical: to handle, and one version of Mutex>>critical: and Mutex>>#critical:ifLocked:.

The two basic versions of Semaphore>>critical: are

V1
critical: mutuallyExcludedBlock
"Evaluate mutuallyExcludedBlock only if the receiver is not currently in
the process of running the critical: message. If the receiver is, evaluate
mutuallyExcludedBlock after the other critical: message is finished."
<criticalSection>
self wait.
^mutuallyExcludedBlock ensure: [self signal]

V2
critical: mutuallyExcludedBlock
"Evaluate mutuallyExcludedBlock only if the receiver is not currently in
the process of running the critical: message. If the receiver is, evaluate
mutuallyExcludedBlock after the other critical: message is finished."
<criticalSection>
^[self wait.
  mutuallyExcludedBlock value]
ensure: [self signal]

and Juan's safer version is (after I added the criticalSection pragma)

V3
critical: mutuallyExcludedBlock
"Evaluate mutuallyExcludedBlock only if the receiver is not currently in
the process of running the critical: message. If the receiver is, evaluate
mutuallyExcludedBlock after the other critical: message is finished."
<criticalSection>
| caught |
"We need to catch eventual interruptions very carefully. 
The naive approach of just doing, e.g.,:
self wait.
aBlock ensure:[self signal].
will fail if the active process gets terminated while in the wait.
However, the equally naive:
[self wait.
aBlock value] ensure:[self signal].
will fail too, since the active process may get interrupted while
entering the ensured block and leave the semaphore signaled twice.
To avoid both problems we make use of the fact that interrupts only
occur on sends (or backward jumps) and use an assignment (bytecode)
right before we go into the wait primitive (which is not a real send and
therefore not interruptable either)."

caught := false.
^[
caught := true.
self wait.
mutuallyExcludedBlock value
] ensure: [ caught ifTrue: [self signal] ]

and the Mutex>>critical:'s are

critical: aBlock
"Evaluate aBlock protected by the receiver."
<criticalSection>
^self primitiveEnterCriticalSection
ifTrue: [aBlock value]
ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]

critical: aBlock ifLocked: lockedBlock
"Answer the evaluation of aBlock protected by the receiver.  If it is already in a critical
section on behalf of some other process answer the evaluation of lockedBlock."
<criticalSection>
^self primitiveTestAndSetOwnershipOfCriticalSection
ifNil: [lockedBlock value]
ifNotNil:
[:alreadyOwner|
alreadyOwner
ifTrue: [aBlock value]
ifFalse: [aBlock ensure: [self primitiveExitCriticalSection]]]

primitiveEnterCriticalSection answers false if the Mutex was unowned, and true if it was already owned by the active process.  It blocks otherwise.  primitiveTestAndSetOwnershipOfCriticalSection answers false if the Mutex was unowned, true if it was already owned by the active process, and nil if owned by some other process.

So we want Process>>#terminate to correctly release the semaphores and mutexes no matter where in these methods they are.  We don't have to worry if the process is within the block argument to a critical: itself, only if it is actually within the critical: method or a block within it. If it is already within the block argument to critical: then Process>>#terminate's unwind handling will unwind things correctly.  Taking Juan's version of Semaphore>>#critical: above, the key issue is whether the process being terminated is blocked on the wait, not blocked but still stuck at the wait, or at the start of the block argument to ensure:.

I have extracted the processing into Process>>releaseCriticalSection:, so now Process>>terminate reads

terminate 
"Stop the process that the receiver represents forever.
Unwind to execute pending ensure:/ifCurtailed: blocks before terminating.
If the process is in the middle of a critical: critical section, release it properly."

| ctxt unwindBlock oldList |
self isActiveProcess ifTrue: [
ctxt := thisContext.
[ ctxt := ctxt findNextUnwindContextUpTo: nil.
ctxt isNil
] whileFalse: [
(ctxt tempAt: 2) ifNil:[
ctxt tempAt: 2 put: nil.
unwindBlock := ctxt tempAt: 1.
thisContext terminateTo: ctxt.
unwindBlock value].
].
thisContext terminateTo: nil.
self suspend.
] ifFalse:[
"Always suspend the process first so it doesn't accidentally get woken up.
N.B. If oldList is a LinkedList then the process is runnable. If it is a Semaphore/Mutex et al
then the process is blocked, and if it is nil then the process is already suspended."
oldList := self suspend.
suspendedContext ifNotNil:
["Release any method marked with the <criticalSection> pragma.
 The argument is whether the process is runnable."
self releaseCriticalSection: (oldList isNil or: [oldList class == LinkedList]).

"If terminating a process halfways through an unwind, try to complete that unwind block first."
(suspendedContext findNextUnwindContextUpTo: nil) ifNotNil:
[:outer|
(suspendedContext findContextSuchThat:[:c| c closure == (outer tempAt: 1)]) ifNotNil:
[:inner| "This is an unwind block currently under evaluation"
suspendedContext runUntilErrorOrReturnFrom: inner]].

ctxt := self popTo: suspendedContext bottomContext.
ctxt == suspendedContext bottomContext ifFalse:
[self debug: ctxt title: 'Unwind error during termination'].
"Set the context to its endPC for the benefit of isTerminated."
ctxt pc: ctxt endPC]]

In implementing releaseCriticalSection: we need to know which selector a context has just sent.  selectorJustSentOrSelf is implemented in Squeak as

InstructionStream>>selectorJustSentOrSelf
"If this instruction follows a send, answer the send's selector, otherwise answer self."

| method |
method := self method.
^method encoderClass selectorToSendOrItselfFor: self in: method at: self previousPc

c.f.

InstructionStream>>selectorToSendOrSelf
"If this instruction is a send, answer the selector, otherwise answer self."

| method |
method := self method.
^method encoderClass selectorToSendOrItselfFor: self in: method at: pc

Now we can implement Process>>#releaseCriticalSection:

releaseCriticalSection: runnable
"Figure out if we are terminating a process that is in the ensure: block of a critical section.
In this case, if the block has made progress, pop the suspendedContext so that we leave the
ensure: block inside the critical: without signaling the semaphore/exiting the primitive section,
since presumably this has already happened.  But if it hasn't made progress but is beyond the
wait (which we can tell my the oldList being one of the runnable lists, i.e. a LinkedList, not a
Semaphore or Mutex, et al), then the ensure: block needs to be run."
| selectorJustSent |
(suspendedContext method pragmaAt: #criticalSection) ifNil: [^self].
selectorJustSent := suspendedContext selectorJustSentOrSelf.

"Receiver and/or argument blocks of ensure: in Semaphore>>critical: or Mutex>>#critical:"
suspendedContext isClosureContext ifTrue:
[suspendedContext sender selector == #ensure: ifTrue:
[| notWaitingButMadeNoProgress |
"Avoid running the ensure: block twice, popping it if it has already been run. If runnable
but at the wait, leave it in place. N.B. No need to check if the block receiver of ensure: has
not started to run (via suspendedContext pc = suspendedContext startpc) because ensure:
uses valueNoContextSwitch, and so there is no suspension point before the wait."
notWaitingButMadeNoProgress :=
runnable
and: [selectorJustSent == #wait
and: [suspendedContext sender selectorJustSentOrSelf == #valueNoContextSwitch]].
notWaitingButMadeNoProgress ifFalse:
[suspendedContext := suspendedContext home]].
^self].

"Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still blocked?  If so, nothing further to do."
runnable ifFalse: [^self].

"If still at the wait the ensure: block has not been activated, so signal to restore."
selectorJustSent == #wait ifTrue:
[suspendedContext receiver signal].

"If still at the lock primitive and the lock primitive just acquired ownership (indicated by it answering false)
then the ensure block has not been activated, so explicitly primitiveExitCriticalSection to unlock."
(selectorJustSent == #primitiveEnterCriticalSection
or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection]) ifTrue:
[(suspendedContext stackPtr > 0
 and: [suspendedContext top == false]) ifTrue:
[suspendedContext receiver primitiveExitCriticalSection]]


Let's go through it line by line.  First, runnable is an argument, determined in Process>>#terminate.  One could invoke it with

    self releaseCriticalSection: oldList class == LinkedList

but this means that an already suspended process is assumed to be not runnable, which makes it tricky to debug the Process>>#terminate method.  One has to assign to oldList while stepping though the method.  I've chosen safety, assuming that the process is still runnable if suspend answers nil, its simply being debugged.

Then we're only interested in <criticalSection> marked methods se we return if there's no such pragma.

Then we deal with blocks in these methods.  One issue here is to avoid running the ensure: block twice if it is already being evaluated.  The other is to run it if it is stalled and has yet to be run.

So if

suspendedContext isClosureContext ifTrue:

we're in the ensure: receiver or argument blocks in any <criticalSection> marked method, i.e. Semaphore>>critical: and Mutex>>critical:[ifLocked:].  If wait was just sent then we're in the ensure: receiver block of Semaphore>>critical: (V2 & V3 above) and the issue is whether the process is blocked or is unblocked and has made no progress. If blocked then nothing needs to be done; the ensure: block is discarded and the stack cut back to the critical: activation.  If progress has been made then nothing needs to be done (in fact we can't be in this state; the ensure: receiver will have started evaluating the critical: block argument).  If unblocked, but no progress has been made, do /not/ discard the unwind block and it will be run in Process>>#terminate when this method returns.  Hence...

[suspendedContext sender selector == #ensure: ifTrue:
[| notWaitingButMadeNoProgress |
"Avoid running the ensure: block twice, popping it if it has already been run. If runnable
but at the wait, leave it in place. N.B. No need to check if the block receiver of ensure: has
not started to run (via suspendedContext pc = suspendedContext startpc) because ensure:
uses valueNoContextSwitch, and so there is no suspension point before the wait."
notWaitingButMadeNoProgress :=
runnable
and: [selectorJustSent == #wait
and: [suspendedContext sender selectorJustSentOrSelf == #valueNoContextSwitch]].
notWaitingButMadeNoProgress ifFalse:
[suspendedContext := suspendedContext home]].
^self].

Now we're left with the simpler version of Semaphore>>critical: (V1 above) and the two Mutex methods Mutex>>#critical:[ifLocked:].  Here the only state we have to worry about is that the process is unblocked but has made no progress.  If not runnable the process is still blocked and we can simply return.

"Either Semaphore>>critical: or Mutex>>#critical:.  Is the process still blocked?  If so, nothing further to do."
runnable ifFalse: [^self].

If #wait was just sent the process is in Semaphore>>#critical: and, because ensure: has not been sent we signal explicitly to restore the signal count:

"If still at the wait the ensure: block has not been activated, so signal to restore."
selectorJustSent == #wait ifTrue:
[suspendedContext receiver signal].

If either of primitiveEnterCriticalSection or primitiveTestAndSetOwnershipOfCriticalSection have just been sent then either the Mutex is already owned, in which case the ensure block is elsewhere in the colder part of the stack, or has just been owned, and because ensure: has not been sent we unlock explicitly to release the Mutex:

"If still at the lock primitive and the lock primitive just acquired ownership (indicated by it answering false)
then the ensure block has not been activated, so explicitly primitiveExitCriticalSection to unlock."
(selectorJustSent == #primitiveEnterCriticalSection
or: [selectorJustSent == #primitiveTestAndSetOwnershipOfCriticalSection]) ifTrue:
[(suspendedContext stackPtr > 0
 and: [suspendedContext top == false]) ifTrue:
[suspendedContext receiver primitiveExitCriticalSection]]

So Pharoers can you read and say whether you think this is sane or not?  If so, then we can kibbutz to write the Pharo version.

Squeakers can you review Kernel-eem.1183 & Kernel-eem.1184 in the inbox?  Kernel-eem.1183 includes the fix as described above.  Kernel-eem.1184 reverts Semaphore>>#critical: to V1 above.


P.S. Looking at V1 above it seems to me that there is an issue if the process is preempted in ensure: before sending valueNoContextSwitch:.  I'll try and write a test that advances a process to that precise point.  If that test fails I think we have to use V2 or V3, and V2 is clearly preferable.

Lovely.  I added the test (testSemaCriticalWaitInEnsure & testMutexCriticalBlockedInEnsure) and it is V1 that works and V2 that does not.  I think it best not to try and be too clever and fix terminate and/or releaseCriticalSection: for this case.  We can simply stick with V1 for now.

_,,,^..^,,,_
best, Eliot
Reply | Threaded
Open this post in threaded view
|

Re: [squeak-dev] Solving termination of critical sections in the context of priority inversion was: SemaphoreTest fails in trunk, is a fix needed for the 5.2 release?

David T. Lewis
 
Hi Juan,

On Sat, Jul 28, 2018 at 11:43:18AM -0300, Juan Vuletich wrote:

>  
> Hi Eliot,
>
> The V3 code in Cuis was last updated on 2010, but that was just the
> comment. I went back in time, and this (except for the comment)
> identical version of the code is 10/5/2007 by Andreas.
>
> critical: mutuallyExcludedBlock
>     "Evaluate mutuallyExcludedBlock only if the receiver is not
> currently in
>     the process of running the critical: message. If the receiver is,
> evaluate
>     mutuallyExcludedBlock after the other critical: message is finished."
>     | blockValue caught |
>     caught := false.
>     [
>         caught := true.
>         self wait.
>         blockValue := mutuallyExcludedBlock value
>     ] ensure: [caught ifTrue: [self signal]].
>     ^blockValue
>
> If you are interested, in Cuis, unzip /PreviousUpdates and in file list
> set this filter: *-ar.cs.st;*-ar.?.cs.st;*-ar.??.cs.st
> This will show all the change sets in Cuis history directly based on
> Andreas' contributions. In this case it is 0104-SemaCritical-ar.cs.st
>
> You know this stuff better than anyone, and I'm happy to follow your
> lead here, Eliot.
>
> Just a question. You are using primitives #primitiveExitCriticalSection  
> #primitiveEnterCriticalSection
> #primitiveTestAndSetOwnershipOfCriticalSection (185, 186, 187), that
> Cuis hasn't used so far. Are they supported by SqueakJS and the classic
> interpreter?
>

Yes, these primitives are present in the classic interpreter VM since
VMMaker-dtl.278 of 10 July, 2012. I have tested this with a V3 image
on interpreter VM, and after installing the changes that Eliot has
introduced here, all tests are green in KernelTests-Processes (covering
Semaphore, Mutex and Monitor).

The primitives are also in the last classic interpreter VMs that were
published on squeakvm.org, although I did not specifically test that
(I'm in the habit of compiling my own).

It should be safe to adopt Eliot's latest changes in Cuis, and they
will work for V3 images on both Cog and classic interpreter.

Dave