[squeak-dev] Suspending process fix


[squeak-dev] Suspending process fix

Igor Stasenko
Hello, I am trying to provide a fix for the issue
http://bugs.squeak.org/view.php?id=6822
which is needed to make the following code work without error:

|sema proc |
sema := Semaphore new.
proc := [ sema wait. self error: 'should not be there' ] fork.
Processor yield.
proc suspend.
proc resume.

The problem here is that once you send #suspend to a process, it
sets its myList ivar to nil, ignoring the fact that it could be
waiting on a semaphore.
This disconnects the process from the semaphore, and any
subsequent 'sema signal' is simply lost:

suspend
	| oldList |
	<primitive: 88> "the primitive will fail if the receiver is not the active process, which is just our case"
	myList ifNil: [^nil].
	oldList := myList.
	myList := nil.
	oldList remove: self ifAbsent: [].
	^oldList

A possible fix would be:
- add an instance variable to Process, say 'semaphore';

- change #suspend to the following:

suspend
	| oldList |
	<primitive: 88> "the primitive will fail if the receiver is not the active process, which is just our case"
	myList ifNil: [^nil].
	oldList := myList.
	myList := nil.
	oldList remove: self ifAbsent: [].
	(oldList isKindOf: Semaphore) ifTrue: [semaphore := oldList].
	^oldList


- then, change #resume to:

resume
	suspendedContext ifNil: [^self primitiveFailed].
	semaphore ifNotNil: [
		semaphore resumeWaiting: self.
		semaphore := nil.
		^self].
	^self primitiveResume

- and add:

Semaphore>>resumeWaiting: aProcess
	excessSignals > 0
		ifTrue: [excessSignals := excessSignals - 1.
			aProcess primitiveResume "resume immediately"]
		ifFalse: [self addLastLink: aProcess]

----------
The attached file is a changeset which makes the following test work
without failure:

|sema proc bool |
bool := true.
sema := Semaphore new.
proc := [ sema wait. bool ifTrue: [self error: 'should not be there'].
Transcript show: 'semaphore signaled' ] fork.
Processor yield.
proc newSuspend.
proc newResume.
self assert: (sema last == proc).
bool := false.
sema signal.


P.S. I didn't check how this fix affects other things around
processes, like the one mentioned by Mike:

Note that Process>>signalException: (as of Squeak 3.8; it may be
fixed in later versions?) relies on this broken behaviour.


--
Best regards,
Igor Stasenko AKA sig.



Attachment: process-suspend-fix.1.cs (3K)

[squeak-dev] Re: Suspending process fix

Andreas.Raab
Hi Igor -

Having spent more time than I dare admit debugging multi-process
interactions on heavily utilized servers, I can tell you that your fix
isn't going to work very well ;-) If only because it relies on #suspend
being non-atomic, which was one of the largest sources of problems when
used with semaphores. Atomic suspension is absolutely critical since the
VM may or may not decide to switch processes while you are in the midst
of modifying that list, and when this happens you're in for a very bad
time (processes will vanish into thin air and get garbage collected).
Making Process>>suspend atomic in all circumstances was one of the major
advances for stability on our servers, and I'm kinda surprised these
changes are not in the latest VMs (I will verify this later).

One thing I'm curious about is what use case you are looking at.
I have never come across this particular problem in practice, since
explicit suspend and resume operations are exceptionally rare outside of
the debugger.

That said, I am not certain the behavior is "broken" to begin with; in
particular, consider the alternatives. Depending on what problem you're
trying to solve, you might want to work around the issue by utilizing
the return value from #suspend. For example, doing something like:

   list := process suspend.
   list resumeProcess: process.

and then implement

Semaphore>>resumeProcess: aProcess
   "The process was waiting on this semaphore. Put it back on."

Of course, this also illustrates one of the problems with doing this in
general which is: Are you going to put the process at the beginning of
the list? In the middle? At the end? Since any decision probably depends
on application specific context, one could argue that the best way to
deal with it then is along the lines of, say:

list := process suspend.
list
   ifNil:[process resume "process was active or suspended already"]
   ifNotNil:[list addFirst: process] "put process back on list"

But of course the latter then raises the issue that if this is an
already signaled semaphore this shouldn't be doing that either. Icky stuff.

Which is why I really do prefer the current behavior, which is consistent
across the board: #suspend takes a process off a semaphore, #resume simply
makes the process runnable again. It may not be what you expected it to be,
but it is consistent, and sometimes that is all you can ask for. If you need
specifically different behavior you can implement it in a way described
above, but it isn't clear to me whether any of these would be improvements,
because when it comes to the edge cases they are just as problematic as the
current solution.
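To make the current contract concrete, here is a workspace sketch against a
stock image (nothing in it is new API): #suspend hands you back whatever list
the process was on, and #resume just makes it runnable again.

	| sema proc list |
	sema := Semaphore new.
	proc := [sema wait] fork.
	Processor yield.
	list := proc suspend.	"answers sema; proc is now off the semaphore's list"
	proc resume.	"proc is runnable again and simply falls through the wait"
	list == sema	"true"

Whether you put the process back on that list, and where, is then an
application decision.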

Cheers,
   - Andreas




Re: [squeak-dev] Re: Suspending process fix

Igor Stasenko
2009/4/28 Andreas Raab <[hidden email]>:

> Hi Igor -
>
> Having spent more time than I dare admitting in debugging multi-process
> interactions on heavily utilized servers I can tell you that your fix isn't
> going to work very well ;-) If only because it relies on #suspend being
> non-atomic which was one of the largest sources of problems when used with
> semaphores. Atomic suspension is absolutely critical since the VM may or may
> not decide to switch processes when you are in the midst of modifying that
> list and when this happens you're in for a very bad time (processes will
> vanish in thin air and garbage collected). Making Process>>suspend be atomic
> in all circumstances was one of the major advances for stability on our
> servers and I'm kinda surprised these changes are not in the latest VMs (I
> will verify this later).
>
+100. I didn't mention that these changes should find their way into the
VM only because I'm sure that everyone who is interested in this topic
understands that by default :)

> One thing I'm curious about is what is the use case are you looking at? I
> have never come across this particular problem in practice since explicit
> suspend and resume operations are exceptionally rare outside of the
> debugger.
>

Well, I discovered this issue when I wanted to make the current process
the only process that can run during a certain period of time, without
any chance of being interrupted by another, higher-priority process.
And how naive of me, thinking that simply doing suspend/resume would do
what I want. :)

IMO, suspend/resume should guarantee that the suspended process will
not perform any actions under any circumstances and will be able to
continue normally after the corresponding #resume.
As my issue illustrates, this is not true for processes which are
waiting on a semaphore.

Ask yourself why a developer who wants to suspend any process
(regardless of his intent) and resume it later should have to ask
questions like "what will be broken if I suspend it?".

> That said, I am not certain the behavior is "broken" to gegin with, in
> particular consider the alternatives. Depending on what problem you're
> trying to solve, you might want to work around the issue by utilizing the
> return value from #suspend. For example, doing something like:
>
>  list := process suspend.
>  list resumeProcess: process.
>
> and then implement
>
> Semaphore>>resumeProcess: aProcess
>  "The process was waiting on this semaphore. Put it back on."
>

Of course, there are alternatives.
" sarcasm starts here "
For instance, integer multiplication for positive a*b can be implemented as:

result := 0.
1 to: a do: [:i | result := result + b].
^result

and then we could tell developers how to handle the 'rare' cases where
they need to multiply non-positive integers, or floating-point values,
and so on. Depending on their problem, we could always propose a
solution :)
" end of sarcasm "


> Of course, this also illustrates one of the problems with doing this in
> general which is: Are you going to put the process at the beginning of the
> list? In the middle? At the end? Since any decision probably depends on
> application specific context, one could argue that the best way to deal with
> it then is along the lines of, say:
>
> list := process suspend.
> list
>  ifNil:[process resume "process was active or suspended already"]
>  ifNotNil:[list addFirst: process] "put process back on list"
>
> But of course the latter then raises the issue that if this is an already
> signaled semaphore this shouldn't be doing that either. Icky stuff.
>
> Which is why I really do prefer the current behavior which gives you a
> consistent behavior across the board. #suspend takes a process of a
> semaphore, #resume simply makes the process runnable again. It may not be
> what you expected it to be but it is consistent and sometimes this is all
> you can ask for. If you need specifically different behavior you can
> implement that in a way described above but it isn't clear to me whether any
> of these would be improvements because when it comes to the edge cases they
> are just as problematic as the current solution.
>
Right. My proposal doesn't deal correctly with cases where there are
multiple processes waiting on a single semaphore.
It is correct only for a single process.

If we suppose that we are running in an ideal environment, where
processes run in parallel, then nothing prevents us from implementing
waiting as follows:

Semaphore>>wait
	| flag |
	flag := Array with: false with: nil.
		"the second slot will be used as a 'next' pointer, for chaining the
		 list of consumers waiting on a signal"
	self scheduleFlagToSignal: flag.
	[flag at: 1] whileFalse: []

It is clear that, to exit the loop, an external process must flip the
flag from false to true.
It is also clear that suspending the loop in the middle can't affect
other processes which may be waiting for the same semaphore signal, and
that semaphore signals are still consumed in the correct order.
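For symmetry, the signalling side of that idealised design could look like
this (a sketch only: scheduleFlagToSignal:, the firstFlag ivar and the flag
layout are all hypothetical, not existing Squeak API):

Semaphore>>signal
	| flag |
	firstFlag isNil
		ifTrue: [excessSignals := excessSignals + 1]
		ifFalse: [
			flag := firstFlag.
			firstFlag := flag at: 2.	"unlink via the 'next' slot"
			flag at: 1 put: true]	"the oldest waiter's polling loop now terminates"

Because a frozen waiter still owns its flag, it neither loses its signal nor
lets a later waiter jump the queue.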

This example illustrates the fundamental difference between waiting
and suspending/preemption, while in Squeak, unfortunately, there is an
equals sign between them (semaphores use the VM's preemption mechanism to
implement waiting).
If we take this example as a base, then of course we can implement a
correct scenario which deals with preemption and scheduling.




--
Best regards,
Igor Stasenko AKA sig.


Re: [squeak-dev] Re: Suspending process fix

Igor Stasenko
2009/4/28 Igor Stasenko <[hidden email]>:
[snip]

>
> If we suppose, that we are running in ideal environment, where
> processes are running in parallel, then nothing prevents us to
> implement waiting as following:
>
> Semaphore>>wait
> | flag |
>  flag := Array with: false with: nil.  "a second slot will be used
> for the 'next' pointer, for chaining the list of consumers who waiting
> on signal"
>
>  self scheduleFlagToSignal: flag.
>
>  [ flag at: 1 ] whileFalse: []
>
> it is clear, that to exit from loop, there should be external process
> who alters the flag from false to true.
> Also, it is clear, that suspending the loop in the middle can't affect
> other processes which possibly waiting for same semaphore signal, as
> well as it guarantees that semaphore signal will be consumed in
> correct order.
>
A small correction: it will work correctly even in a preemptive
environment, except that it will waste CPU cycles in the loop while
waiting.


--
Best regards,
Igor Stasenko AKA sig.


[squeak-dev] Re: Suspending process fix

Andreas.Raab
In reply to this post by Igor Stasenko
Igor Stasenko wrote:

> 2009/4/28 Andreas Raab <[hidden email]>:
>> One thing I'm curious about is what is the use case are you looking at? I
>> have never come across this particular problem in practice since explicit
>> suspend and resume operations are exceptionally rare outside of the
>> debugger.
>
> Well, i discovered this issue when wanted to make a current process be
> the only process which can be running during a certain period of time
> - without any chance for being interrupted by another, higher priority
> process.

I'm missing some context here. How does this issue relate to sending a
process suspend; resume and expecting it to keep waiting on a semaphore?
If I had to solve this problem I would just bump the process' priority
temporarily.
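One concrete reading of that (a workspace sketch, not a tested recipe; the
work placeholder is obviously hypothetical):

	| me oldPriority |
	me := Processor activeProcess.
	oldPriority := me priority.
	[me priority: Processor userInterruptPriority.
	 "... the work that must not be preempted by ordinary user processes ..."]
		ensure: [me priority: oldPriority]

It doesn't protect you from the I/O and timing priority levels, but it covers
the usual case.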

> IMO, the suspend/resume should guarantee that suspended process will
> not perform any actions under any circumstances and be able to
> continue normally after issuing corresponding #resume.
> As my issue illustrates, this is not true for processes which is
> waiting on semaphore.

Yes, but even with your proposal this wouldn't be true since a process
suspended from some position in the list wouldn't be put back on in the
same position. In practice, there are *severe* limits to that statement
about how the system "ought" to behave when you run hundreds of
processes with some 50,000 network interrupts per second behind a Tweak
UI ;-) I think I can prove that your implied definition of "continuing
normally" is impossible to achieve in any system that has to deal with
asynchronous signals.

> Ask yourself, why a developer, who may want to suspend any process
> (regardless of his intents) to resume it later, should make any
> assertions like "what will be broken if i suspend it?".

Thus my question about use cases. I haven't seen many uses of suspend
outside of the debugger. And I don't think that's by accident - suspend
is very tricky to deal with in a realistic setting that needs to deal
with asynchronous signals. Most of the time it is a last resort solution
(i.e., don't care too much about what happens afterwards) not something
that you would do casually and expect to be side-effect free.

> Right. My proposal doesn't deals correctly with cases when there are
> multiple processes waiting on a single semaphore.
> It is correct only for a single process.
>
> If we suppose, that we are running in ideal environment, where
> processes are running in parallel, then nothing prevents us to
> implement waiting as following:

The problem is that in a "real" environment signals are asynchronous.
Unless you have some way of stopping time and other external interrupts
at the same time you simply cannot guarantee that after the #suspend
there isn't an external signal which causes some other process waiting
on the semaphore to execute before the process that "ought" to be released.

For example, just consider a mutex where for some reason ordering
matters like in Tweak (which does break if processes are not put back in
the same order in which they were taken off the list): You have a
process which holds the mutex, two more are waiting. You send #suspend
to the first (waiting) one, it is off-list. Now the current mutex owner
leaves that mutex. What should happen? Should the entire mutex stall
because the process that was supposed to go next was suspended? If it
proceeds it changes the ordering and that can cause all sorts of
problems (as I found out when testing some earlier versions of the
semaphore fixes that weren't quite correct ;-)

That is the main problem with the whole idea of trying to have
fine-grained control of processes externally: you cannot know the
precise circumstances under which these things happen, or whether issues
such as ordering matter, which is why it is generally better to do this
either implicitly (using only priorities) or by messaging (send a
signal/message to the process and have it pick it up later). Outside of
that, transparent external control of process execution ranges somewhere
between very tricky and plain impossible when asynchronous signals are
involved.

Cheers,
   - Andreas


Re: [squeak-dev] Re: Suspending process fix

Igor Stasenko
2009/4/28 Andreas Raab <[hidden email]>:

> Igor Stasenko wrote:
>>
>> 2009/4/28 Andreas Raab <[hidden email]>:
>>>
>>> One thing I'm curious about is what is the use case are you looking at? I
>>> have never come across this particular problem in practice since explicit
>>> suspend and resume operations are exceptionally rare outside of the
>>> debugger.
>>
>> Well, i discovered this issue when wanted to make a current process be
>> the only process which can be running during a certain period of time
>> - without any chance for being interrupted by another, higher priority
>> process.
>
> I'm missing some context here. How does this issue relate to sending a
> process suspend; resume and expect it to keep waiting on a semaphore? If I'd
> have to solve this problem I would just bump the process' priority
> temporarily.
>

Process suspension is a STRONG guarantee that the given process will not
perform any actions until it receives #resume.
Priority is the wrong way to ensure this: different VMs could break that
contract easily, while breaking the #suspend contract is something that
doesn't fit in my mind :)

>> IMO, the suspend/resume should guarantee that suspended process will
>> not perform any actions under any circumstances and be able to
>> continue normally after issuing corresponding #resume.
>> As my issue illustrates, this is not true for processes which is
>> waiting on semaphore.
>
> Yes, but even with your proposal this wouldn't be true since a process
> suspended from some position in the list wouldn't be put back on in the same
> position. In practice, there are *severe* limits to that statement about how
> the system "ought" to behave when you run hundreds of processes with some
> 50,000 network interrupts per second behind a Tweak UI ;-) I think I can
> prove that your implied definition of "continuing normally" is impossible to
> achieve in any system that has to deal with asynchronous signals.
>

Do not try to scare me with numbers: if things work correctly for
2-3 processes, why should they fail for 50,000? ;)
Certainly, the problem is to correctly identify the set of operations
which require atomicity (at the language side and at the VM side, if it
uses many native threads). But if that is done right, then who cares
about the numbers?

>> Ask yourself, why a developer, who may want to suspend any process
>> (regardless of his intents) to resume it later, should make any
>> assertions like "what will be broken if i suspend it?".
>
> Thus my question about use cases. I haven't seen many uses of suspend
> outside of the debugger. And I don't think that's by accident - suspend is
> very tricky to deal with in a realistic setting that needs to deal with
> asynchronous signals. Most of the time it is a last resort solution (i.e.,
> don't care too much about what happens afterwards) not something that you
> would do casually and expect to be side-effect free.
>

Suspending a process is an explicit way to control what happens in your system.
Many facilities can benefit from it, if we guarantee that certain
contracts are fulfilled.
Actually, we use suspend/resume every day, even without noticing
it - consider an image snapshot/startup :)
Do processes which were waiting on a semaphore, and saved in the image
in that state, start working after startup as if the semaphore had been
signalled? Do such processes lose their 'wait' state?

>> Right. My proposal doesn't deals correctly with cases when there are
>> multiple processes waiting on a single semaphore.
>> It is correct only for a single process.
>>
>> If we suppose, that we are running in ideal environment, where
>> processes are running in parallel, then nothing prevents us to
>> implement waiting as following:
>
> The problem is that in a "real" environment signals are asynchronous. Unless
> you have some way of stopping time and other external interrupts at the same
> time you simply cannot guarantee that after the #suspend there isn't an
> external signal which causes some other process waiting on the semaphore to
> execute before the process that "ought" to be released.
>
If you are speaking about the Squeak VM and its green threading model,
then this is certainly doable, because at the primitive (VM) level there
is no language-side activity other than what the VM itself performs.

> For example, just consider a mutex where for some reason ordering matters
> like in Tweak (which does break if processes are not put back in the same
> order in which they were taken off the list): You have a process which holds
> the mutex, two more are waiting. You send #suspend to the first (waiting)
> one, it is off-list. Now the current mutex owner leaves that mutex. What
> should happen? Should the entire mutex stall because the process that was
> supposed to go next was suspended? If it proceeds it changes the ordering
> and that can cause all sorts of problems (as I found out when testing some
> earlier versions of the semaphore fixes that weren't quite correct ;-)
>

It should stall, of course, because the first process waiting on the mutex
should obtain it first. The fact that it is suspended is not relevant.
This is what I'm trying to say: waiting semantics should be kept
separate from scheduling.
A proof case is:

mutex critical: [
	proc := Processor activeProcess.
	[ proc suspend.  "do something here"  proc resume ] fork.
	Processor yield.
].

It shows that a process which has obtained a mutex can be suspended at any
point in time. And it gives you the right answer: any other processes
waiting on the same mutex will stall forever, until eventually the
suspended process is resumed and releases the mutex.

> That is the main problem with the whole idea of trying to have fine-grained
> control of processes externally - you cannot know the precise circumstances
> when these things happen and whether issues such as ordering matter which is
> why it is generally better to do this either implicitly (using only
> priorities) or by messaging (send a signal/message to the process and have
> it pick it up later). Outside of that transparent external control of
> process execution ranges somewhere between very tricky and plain impossible
> when asynchronous signals are involved.
>
I do not agree. As I said before, priority is a fluid thing which
simply tells the VM which process has a better chance of taking
control of computing resources (other VMs can treat priority
differently - say, as a percentage of computing resources allocated to
a given process, with a guarantee that no active process will starve
during a certain period of time).
I don't like implicit control; explicit is much better, because
it guarantees that under any circumstances your code will work the same
as before.

I will try to implement VM-side primitives which guarantee
atomicity for Semaphore wait/signal operations. Then we can continue
our discussion with more grounded arguments. :)



--
Best regards,
Igor Stasenko AKA sig.


[squeak-dev] Re: Suspending process fix

Andreas.Raab
Igor Stasenko wrote:

> 2009/4/28 Andreas Raab <[hidden email]>:
>> Igor Stasenko wrote:
>>> 2009/4/28 Andreas Raab <[hidden email]>:
>>>> One thing I'm curious about is what is the use case are you looking at? I
>>>> have never come across this particular problem in practice since explicit
>>>> suspend and resume operations are exceptionally rare outside of the
>>>> debugger.
>>> Well, i discovered this issue when wanted to make a current process be
>>> the only process which can be running during a certain period of time
>>> - without any chance for being interrupted by another, higher priority
>>> process.
>> I'm missing some context here. How does this issue relate to sending a
>> process suspend; resume and expect it to keep waiting on a semaphore? If I'd
>> have to solve this problem I would just bump the process' priority
>> temporarily.
>>
>
> Process suspension is a STRONG guarantee that given process will not
> perform any actions,until it receives #resume.

This is not an answer to my (honest) question. I don't understand how
suspending and resuming processes allows you to solve your originally
stated problem, which is "running during a certain period of time
without any chance of being interrupted by another, higher priority
process". I can do what you state as the problem by changing the process
priority; I don't see how suspending a process would even begin to
address it. In short, I don't see what solution you are proposing to
address your stated problem, nor why the issue with suspend and resume
arises from it.

> Priority is a wrong way to ensure this: different VMs could break this
> contract easily , while breaking a #suspend contract is something what
> doesn't fits in my mind :)

I'll fix that when I have a reason to switch to a VM that has a
different scheduling policy. I'm not into discussing problems I don't have.

>>> IMO, the suspend/resume should guarantee that suspended process will
>>> not perform any actions under any circumstances and be able to
>>> continue normally after issuing corresponding #resume.
>>> As my issue illustrates, this is not true for processes which is
>>> waiting on semaphore.
>> Yes, but even with your proposal this wouldn't be true since a process
>> suspended from some position in the list wouldn't be put back on in the same
>> position. In practice, there are *severe* limits to that statement about how
>> the system "ought" to behave when you run hundreds of processes with some
>> 50,000 network interrupts per second behind a Tweak UI ;-) I think I can
>> prove that your implied definition of "continuing normally" is impossible to
>> achieve in any system that has to deal with asynchronous signals.
>>
>
> Do not try to scare me with numbers: if things working correctly for
> 2-3 processes, why they should fail for 50000? ;)

Because your assumption that the code is "working correctly" is just
that - a wild guess about what might happen in what circumstances. Since
there is no mathematically sound proof that the code is indeed "working
correctly" (in fact you probably don't even have a definition of what it
means to "work correctly"), running that code at 10,000 times the
frequency allows you to find problems with a much higher probability
than you would otherwise be able to. Seriously, we didn't find the
problems we fixed over the last years by reasoning - we found them
because the system came to a screeching halt often enough to allow us to
find the issues.

> Certainly, the problem is to correctly identify a set of operations
> which require atomicity (at language side and at VM side, if its using
> many native threads). But if its done right, then who cares about
> numbers?

Numbers show that your code *actually* works instead of you just thinking
it works. Do you really believe that the people who wrote the code we
fixed knew their code was buggy and were just too lazy to write correct
code? Come on, get serious.

>>> Ask yourself, why a developer, who may want to suspend any process
>>> (regardless of his intents) to resume it later, should make any
>>> assertions like "what will be broken if i suspend it?".
>> Thus my question about use cases. I haven't seen many uses of suspend
>> outside of the debugger. And I don't think that's by accident - suspend is
>> very tricky to deal with in a realistic setting that needs to deal with
>> asynchronous signals. Most of the time it is a last resort solution (i.e.,
>> don't care too much about what happens afterwards) not something that you
>> would do casually and expect to be side-effect free.
>
> Suspeding process is an explicit way to control on what happens in your system.
> Many facilities can benefit from it, is we guarantee a certain
> contracts to be fullfilled.
> Actually, we are using suspend/resume every day, even without noticing
> it - consider an image snapshot/startup :)

But that a) suspends *all* processes (in effect it stops time) and b) is
not side-effect free, and nobody expects it to be. In fact, image snapshot
is a great example of cooperating processes (including shutdown /
startup processing) and of why one cannot assume that an external suspend /
resume can be completely side-effect free.

> Does processes which were waiting for semaphore and saved in image in
> such state start working after startup as if semaphore signalled? Do
> such processes lose their 'wait' state?

In the non-trivial cases, they do. That's because they are being shut
down during the system shutdown and startup sequence. Check out what
(for example) web servers do: they stop the listener process and
restart it after the system has been restarted. But even that is beside
the point, because you are cherry-picking one particular aspect of image
saving without looking at the parts that make it possible - including,
for example, the *great* care Delay takes when adjusting wakeup
times, or the resource management necessary for sockets and files, just
so the rest of the system can *pretend* that nothing of importance
happened. Or are you now suggesting that a process that is waiting on a
delay must also adjust the wakeup time for that delay when it is
suspended and resumed later? Close and reopen files? Shut down and
reopen sockets? ;-)

>> The problem is that in a "real" environment signals are asynchronous. Unless
>> you have some way of stopping time and other external interrupts at the same
>> time you simply cannot guarantee that after the #suspend there isn't an
>> external signal which causes some other process waiting on the semaphore to
>> execute before the process that "ought" to be released.
>>
> If you speaking about Squeak VM, and its green threading model then
> this is certanly doable, because at primitive (VM) level there is no
> other activity at language side, other than VM does.

I don't understand that sentence.

> It should stall, of course, because first process who waiting on mutex
> should obtain it first. The fact that its suspended is not relevant.

Well, I guess that's the end of this discussion for me. I have little
interest in discussing hypotheticals. All of my needs are very practical.

> This is what i'm trying to say: waiting semantics should be kept
> separated from scheduling.
> A proof case is:
>
> mutex critical: [
>   proc := Processor activeProcess.
>   [ proc suspend.  self do something here.  proc resume. ] fork.
>   Processor yield.
> ].
>
> it shows, that process which obtained a mutex can be suspended at any
> point of time. And it gives you the right answer: any other processes
> who waiting on same mutex will stall forever, until eventually,
> suspended process will be resumed and release the mutex.

I don't get what you are trying to show with the above. Yes, a process
which holds a mutex can be suspended while it holds the mutex. This has
been true forever, and it has nothing to do with the case I was
describing, which was about the order in which processes arrive at and
leave a mutex.

> I do not agree. As i said before, priorities is a fluid essence, which
> simply shows to VM , which process have a better chance to take
> control over computing resources (other VMs can treat priority
> differently - like a percentage of computing resources which can be
> allocated for a given process and guarantee that all active processes
> will not starve during a certain period of time).
> I don't like implicit control, explicit is much more better , because
> it guarantees that under any circumstances your code will work same as
> before.
>
> I will try to implement a VM-side primitives which will guarantee
> atomicity for Semaphore wait/signal operations. Then we can continue
> our discussion using more grounded arguments. :)

Good luck with that. I'll settle for a definition of "correct" behavior
in the presence of asynchronous interrupts since it is really not clear
to me what you mean when you say "correct".

My problem here is that I've been knee-deep for years now in all of
the things that *actually* happen, so I have no illusions left about
hypotheticals. I'll settle for a least-surprise approach that gives me
"a" result consistently. I'd rather take a wrong result 100% of the
time than a right result 95% of the time, because the former you can
learn to work around quickly whereas the latter may work for weeks and
then fall over three times in a day.

But you know all that already - just compare our discussion about the
expected behavior of forking processes at the same priority. Obviously
we have different opinions about these issues, but in my experience with
these (and related) issues, whenever we removed uncertainty we improved
robustness, because it's the rare and unexpected things that really get
you, not the obvious ones.

Cheers,
   - Andreas


[squeak-dev] Re: Suspending process fix

Andreas.Raab
Andreas Raab wrote:
> I'll settle for a definition of "correct" behavior
> in the presence of asynchronous interrupts since it is really not clear
> to me what you mean when you say "correct".

I had a thought about that this morning. One possible definition of
"correct" in this context may be that a process in a "frozen" state
(to disambiguate it from the current "suspended" state) simply does not
advance its pc but behaves identically to an unfrozen process in all other
respects.

The effect can be simulated (within reason) by something that simply
drops the priority of a process to below the idle process' priority so
that it just doesn't get any computational resources.

The point being that freezing a process doesn't even touch the
suspended list. You could implement this, for example, by using negative
priorities to indicate that a process is frozen, along the lines of:

Process>>isFrozen
   "Answer if the process is frozen"
   ^priority < 0

Process>>freeze
   "This needs to be a primitive"
   self isFrozen ifTrue:[^self].
   priority := priority negated.

Process>>unfreeze
   "This needs to be a primitive"
   self isFrozen ifFalse:[^self].
   priority := priority negated.

Changing synchronousSignal: not to allow frozen processes to become
runnable (i.e., test for negative priority and just ignore it if that's
the case) would achieve most of what you need (modulo the primitives for
freeze/unfreeze).
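
To make the intent concrete, usage might look like this (a sketch only:
#freeze and #unfreeze are the hypothetical selectors above and would still
need their primitives plus the synchronousSignal: guard):

	| worker |
	worker := [[(Delay forSeconds: 1) wait. Transcript show: 'tick'; cr] repeat]
		forkAt: Processor userBackgroundPriority.
	worker freeze.	"ticking stops; worker keeps its place on whatever it is waiting on"
	worker unfreeze.	"the priority is negated back to its original value and ticking resumes"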

Obviously the effects will be "interesting" (to say the least) if you
freeze a process on its way into a mutex since that process will indeed
hold the mutex until such a point that it becomes unfrozen. But I guess
that is kind of what you were asking for after all ;-)

Cheers,
   - Andreas



Re: [squeak-dev] Re: Suspending process fix

Igor Stasenko
2009/4/28 Andreas Raab <[hidden email]>:


Good idea, except that you lose the original priority value. And
freezing/suspending are pretty much synonyms to me :)

After your criticism (sorry, I haven't commented on it yet..), my thoughts
went elsewhere, trying to identify what is wrong with Squeak
scheduling/concurrency and why it is so brittle even at the core levels
(not to mention the application levels which rely on it).

I came to an idea you might be interested in.
As many of us know, some CPUs have a special mode - interrupt mode.
What if we introduce an interrupt mode for the scheduler?

Processor interruptWith: aBlock
	"Evaluate the block closure in interrupt mode. The VM guarantees the
	 atomicity of the block's evaluation (no external events can interrupt it,
	 since the processor is already in interrupt mode).
	 'Processor interruptedProcess' holds the process which is being interrupted.
	 Once block evaluation is done, the VM continues with the process held in
	 the interruptedProcess ivar.
	 Any external events (such as semaphore signals) which occur during the
	 interrupt are queued in a list. The interpreter keeps handling interrupts
	 in interrupt mode while there are pending interrupts in its queue.
	 If code sends #interruptWith: when already in interrupt mode, the block
	 is evaluated immediately, as if it had been sent #value."

Now I am trying to imagine how the basic stuff might look (please
correct me if this is an utterly wrong way ;) if we were able to use
interrupt mode.

First, task switching and the scheduling policy are in the hands of developers, not the VM:

Processor>>initializeSchedulingAction
	"When the VM is going to switch tasks - as in wait/suspend/running to the
	 bottom of the stack - it picks the block closure held in Processor's
	 schedulingAction instance variable and runs it as an interrupt."
	schedulingAction := [ self interruptedProcess:
		self pickNextScheduledProcessBasedOnWhateverCriteria ].

and the next one is trivial:

Processor>>yield
"yield the current process"
   self interruptWith: [
       self schedulingAction value.
   ].

As a consequence, semaphores can be implemented as-is (without the
need for special primitives):

Semaphore>>signal
	Processor interruptWith: [
		self isEmpty
			ifTrue: [excessSignals := excessSignals + 1]
			ifFalse: [Processor resume: self removeFirstLink]].

Semaphore>>wait
	Processor interruptWith: [
		excessSignals > 0
			ifTrue: [excessSignals := excessSignals - 1]
			ifFalse: [self addLastLink: Processor activeProcess suspend]]


Delays:

Delay>>wait
	"action & delayedProcess are instance variables"
	[ Processor interruptWith: [
		delayedProcess := Processor interruptedProcess.
		Processor unschedule: delayedProcess.
		action := [ Processor rescheduleProcess: delayedProcess ].
		self class scheduleDelay: self.
		Processor continueWithNextScheduledProcess ]
	] ifCurtailed: [ action := [] ].

---

Delay class>>scheduleDelay: delay
	"gory details about picking the nearest delay to be signaled... "
	ActiveDelay := ...
	self primitiveScheduleAction:
			[ ActiveDelay action value. self scheduleNextDelay ] registerAsExternalObject
		atMilliseconds: pointInTime.

The 'primitiveScheduleAction:...' primitive will schedule an interrupt that
evaluates a closure registered in the external objects table.
Note that since access to the state (ActiveDelay) is possible only during
the interrupt phase (we can enforce this quite easily), we don't need extra
synchronization like AccessProtect critical: []. So, the obvious benefit:
the code is greatly simplified.

Also, note the trick with the delay's 'action' ivar. The ifCurtailed:
block sets action to an empty block, so nothing bad happens when the
delay is later activated by the timer interrupt.

In a similar way, we could add a 'signalAction' ivar to Semaphore and
replace it when we don't want it to be performed (as in the
#terminate / #critical: case).

And here is what now becomes possible - a shift in mindset: use direct
actions (closures) instead of semaphores. Let's get straight to the
business: why do we need to wrap stuff with semaphores first, and only
then do something? :)
There could be many variations of actions - different mixtures of
waiting for a semaphore plus abandoning the wait after a timeout, etc.
There is almost no need for extra primitives, except the one which gives
you interrupt mode.

Take sockets, for example: what if, instead of creating a bunch of semaphores:

        semaIndex := Smalltalk registerExternalObject: semaphore.
        readSemaIndex := Smalltalk registerExternalObject: readSemaphore.
        writeSemaIndex := Smalltalk registerExternalObject: writeSemaphore.

and passing them to the primitive:
        socketHandle :=
                self primSocketCreateNetwork: 0
                        type: socketType
                        receiveBufferSize: 8000
                        sendBufSize: 8000
                        semaIndex: semaIndex
                        readSemaIndex: readSemaIndex
                        writeSemaIndex: writeSemaIndex.

we passed closures instead?

Then, when something happens on the socket, the VM evaluates the given
closure in interrupt mode, and you're free to react to those events
however you like.
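
For concreteness, the closure-based variant of that call might look roughly
like this (purely illustrative: the
primSocketCreateNetwork:...connectAction:readAction:writeAction: selector and
the note* handler methods do not exist; they are hypothetical counterparts of
the current semaphore-index primitive):

	socketHandle := self
		primSocketCreateNetwork: 0
		type: socketType
		receiveBufferSize: 8000
		sendBufSize: 8000
		connectAction: [self noteConnectionStateChanged]
		readAction: [self noteDataReadable]
		writeAction: [self noteSpaceForWriting]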

What do you think?




--
Best regards,
Igor Stasenko AKA sig.


Re: [squeak-dev] Re: Suspending process fix

Nicolas Cellier
2009/4/28 Igor Stasenko <[hidden email]>:
>
> Good idea, except that you losing the original priority value. And
> freezing/suspending is very much synonyms as to me :)
>

Nah, look closer: Andreas keeps the priority by just negating it... :)


Re: [squeak-dev] Suspending process fix

Michael van der Gulik-2
In reply to this post by Igor Stasenko
On 4/28/09, Igor Stasenko <[hidden email]> wrote:

> Hello, i trying to provide a fix to the issue
> http://bugs.squeak.org/view.php?id=6822
> what is needed to make following code work w/o error:
>
> |sema proc |
> sema := Semaphore new.
> proc := [ sema wait. self error: 'should not be there' ] fork.
> Processor yield.
> proc suspend.
> proc resume.
>
> The problem here, is that once you issue a #suspend on a process it
> puts nil to its myList ivar, ignoring the fact that it could be
> waiting on semaphore.
> This leads to disconnecting the process from semaphore, and any
> subsequent 'sema signal' will be simply lost:
>
> suspend
> | oldList |
> <primitive: 88>   "primitive will fail, if receiver is not active
> process, which is just our case"
> myList ifNil:[^nil].
> oldList := myList.
> myList := nil.
> oldList remove: self ifAbsent:[].
> ^oldList
>
> the possible fix would be:
> - add an instance variable to Process, say 'semaphore'.

Personally, I'd add a boolean instance variable called "suspended" to
Process, and then modify the scheduler (which unfortunately is in the
VM code) to respect it.

Gulik.

--
http://gulik.pbwiki.com/


Re: [squeak-dev] Re: Suspending process fix

Michael van der Gulik-2
In reply to this post by Andreas.Raab
On 4/28/09, Andreas Raab <[hidden email]> wrote:

> Igor Stasenko wrote:
>> Ask yourself, why a developer, who may want to suspend any process
>> (regardless of his intents) to resume it later, should make any
>> assertions like "what will be broken if i suspend it?".
>
> Thus my question about use cases. I haven't seen many uses of suspend
> outside of the debugger. And I don't think that's by accident - suspend
> is very tricky to deal with in a realistic setting that needs to deal
> with asynchronous signals. Most of the time it is a last resort solution
> (i.e., don't care too much about what happens afterwards) not something
> that you would do casually and expect to be side-effect free.

IMHO Process>>suspend should never be used in "normal" code. It should
only be used from debuggers and system tools. But it should still be
implemented to exhibit the correct behaviour.

The above is exactly the use case where this bug has bitten me. When I
was trying to debug concurrent code, the debugger would simply ignore
Semaphore>>wait and step right over it! The debugger quickly became
useless when I had to manually keep track of which semaphores were
signalled and which weren't.

Of course, the debugger would also need improvement to make sure that
it doesn't suspend the entire GUI every time its simulated process
waits on a semaphore, but that's another issue.

> The problem is that in a "real" environment signals are asynchronous.
> Unless you have some way of stopping time and other external interrupts
> at the same time you simply cannot guarantee that after the #suspend
> there isn't an external signal which causes some other process waiting
> on the semaphore to execute before the process that "ought" to be released.

By my understanding of how suspending a process should work, if a
Process is suspended (by calling >>suspend) then no force on Earth
other than calling >>resume on it should resume it again. Any events or
signals on it should accumulate until it is resumed.

> For example, just consider a mutex where for some reason ordering
> matters like in Tweak (which does break if processes are not put back in
> the same order in which they were taken off the list): You have a
> process which holds the mutex, two more are waiting. You send #suspend
> to the first (waiting) one, it is off-list. Now the current mutex owner
> leaves that mutex. What should happen? Should the entire mutex stall
> because the process that was supposed to go next was suspended? If it
> proceeds it changes the ordering and that can cause all sorts of
> problems (as I found out when testing some earlier versions of the
> semaphore fixes that weren't quite correct ;-)

What list? Are you referring to the linked list that a Semaphore
maintains? I would consider the linked list of Processes that
Semaphores maintain to be an implementation detail of Processes and
Semaphores. Your code should be written to be completely oblivious to
it.

I believe the correct behaviour of Semaphore>>signal should be that
the next process to run is either the process doing the signalling or
any other process waiting on that semaphore. Assuming a multi-core
capable VM, two processes might end up concurrently continuing
execution. There shouldn't be any guaranteed ordering in the resuming
of processes; that's an implementation detail in the VM that could
potentially change.

And yes, I believe that in your example, it is correct that the entire
mutex should "stall", meaning that the process that entered the mutex
has entered a "suspended" state and all processes still waiting on
that mutex remain in their "waiting" state. If the mutex didn't
"stall", the debugger wouldn't be particularly helpful.

Gulik.

--
http://gulik.pbwiki.com/


Re: [squeak-dev] Re: Suspending process fix

Eliot Miranda-2
Hi Michael,

   responding to clear up some apparent confusions.


On Tue, Apr 28, 2009 at 3:29 PM, Michael van der Gulik <[hidden email]> wrote:
On 4/28/09, Andreas Raab <[hidden email]> wrote:
> Igor Stasenko wrote:
>> Ask yourself, why a developer, who may want to suspend any process
>> (regardless of his intents) to resume it later, should make any
>> assertions like "what will be broken if i suspend it?".
>
> Thus my question about use cases. I haven't seen many uses of suspend
> outside of the debugger. And I don't think that's by accident - suspend
> is very tricky to deal with in a realistic setting that needs to deal
> with asynchronous signals. Most of the time it is a last resort solution
> (i.e., don't care too much about what happens afterwards) not something
> that you would do casually and expect to be side-effect free.

IMHO Process>>suspend should never by used in "normal" code. It should
only be used from debuggers and system tools. But it should still be
implemented to exhibit the correct behaviour.

Define "normal" :)
 
The above is exactly the use case where this bug has bitten me. When I
was trying to debug concurrent code, the debugger would simply ignore
Semaphore>>wait and step right over it! The debugger quickly became
useless when I had to manually keep track of which semaphores were
signalled and which weren't.

Of course, the debugger would also need improvement to make sure that
it doesn't suspend the entire GUI every time its simulated process
waits on a semaphore, but that's another issue.

> The problem is that in a "real" environment signals are asynchronous.
> Unless you have some way of stopping time and other external interrupts
> at the same time you simply cannot guarantee that after the #suspend
> there isn't an external signal which causes some other process waiting
> on the semaphore to execute before the process that "ought" to be released.

By my understanding of how suspending a process should work, if a
Process is suspended (by calling >>suspend) then no force on Earth
other than called >>resume on it should resume it again. Any events or
signals on it should accumulate until it is resumed.

Signals accumulate on Semaphores, not on processes.
 
> For example, just consider a mutex where for some reason ordering
> matters like in Tweak (which does break if processes are not put back in
> the same order in which they were taken off the list): You have a
> process which holds the mutex, two more are waiting. You send #suspend
> to the first (waiting) one, it is off-list. Now the current mutex owner
> leaves that mutex. What should happen? Should the entire mutex stall
> because the process that was supposed to go next was suspended? If it
> proceeds it changes the ordering and that can cause all sorts of
> problems (as I found out when testing some earlier versions of the
> semaphore fixes that weren't quite correct ;-)

What list? Are you referring to the linked list that a Semaphore
maintains? I would consider the linked list of Processes that
Semaphores maintain to be an implementation detail of Processes and
Semaphores. Your code should be written to be completely oblivious to
it.

This is more than an implementation detail.  Process is a subclass of Link that adds a "myList" instance variable, and Semaphore is a subclass of LinkedList.  The runnable process lists in Processor are LinkedList instances.  So a process is either the sole activeProcess and not on any list, or suspended and not on any list, or on one of the runnable process lists, or waiting on a Semaphore.  The "myList" instance variable holds whichever list the process is currently on, or nil.  So a process is suspended iff myList == nil and the process ~~ Processor activeProcess.

None of Processor (an access of the value of (Smalltalk associationAt: #Processor)), activeProcess (an inst var accessor), #== or #and: are suspension points, so one can write
    Process isSuspended
        ^myList == nil and: [self ~~ Processor activeProcess]

But wait!?!?  The Squeak definition is simply
    Process methods for accessing
    isSuspended
        ^myList isNil

which is broken.  Take for example the following two forks, the first of which gets added to the runnable process lists before it runs, the second of which runs immediately:

    | s r |
    s := Semaphore new.
    [| thisProcess |
     thisProcess := Processor activeProcess.
     r := {(Processor instVarNamed: 'quiescentProcessLists')
              indexOf: (thisProcess instVarNamed: 'myList').
           thisProcess priority.
           thisProcess isSuspended}.
     s signal] forkAt: Processor userBackgroundPriority.
    s wait.
    r    "answers #(1 30 false)"

    | s r |
    s := Semaphore new.
    [| thisProcess |
     thisProcess := Processor activeProcess.
     r := {(Processor instVarNamed: 'quiescentProcessLists')
              indexOf: (thisProcess instVarNamed: 'myList').
           thisProcess priority.
           thisProcess isSuspended}.
     s signal] forkAt: Processor activePriority + 20.
    s wait.
    r    "answers #(0 60 true)"

So I'm confused too!  There's a bug in the VM and in the definition of isSuspended.  isSuspended needs to read

    isSuspended
        ^myList == nil and: [self ~~ Processor activeProcess]

and the VM needs to set myList to nil in transferTo:.

If I were thinking of a truly multi-process implementation I'd set myList to the process itself so that we could define


    isSuspended
        ^myList == nil

    isActive
        ^myList == self

but one step at a time.


I believe the correct behaviour of Semaphore>>signal should be that
the next process to be run would be either the process doing the
signalling, or any other process waiting on that semaphore. Assuming a
multi-core capable VM, two processes might end up concurrently
continuing execution. There shouldn't be any guaranteed ordering in
the resuming of processes; that's an implementation detail in the VM
that could potentially change.

Um, this seems very confused.  The process doing the signalling should not be affected at all.  Only processes waiting on the semaphore should be affected, and by definition the signalling process can't be waiting on the semaphore it signals because it has to be runnable in order to signal.

Further, yes, there *must* be an ordering on resumption.  It must be strictly FIFO for many scheduling algorithms to work.  Even on a multi-core CPU the scheduler can be well-defined such that processes waiting on a semaphore become active in the order in which they started waiting on it.  Even in a hypothetical multi-core VM with two concurrent processes simultaneously signalling a semaphore on which two processes are waiting, the system would have to ensure that the two signals end up scheduling both processes, not that both signals somehow end up resuming only one; i.e. the VM is going to have to serialize or mutually exclude access to the semaphore to prevent signals being lost.

And yes, this is guaranteed, very consciously so.  The Smalltalk scheduler is strictly a real-time scheduler, preemptive across priorities and cooperatively scheduled round-robin within a priority, such that lower-priority runnable processes are preempted as soon as any higher-priority process becomes runnable.

The only bug in this in Squeak (apart from the myList snafu above) is that when a lower-priority process is preempted by a higher one it gets added to the back of its scheduling queue, so within a priority processes can't rely on being co-operatively scheduled.  This has been fixed in VisualWorks, where a preempted process is added to the front of its runnable process list.  (But VisualWorks goes on to break Semaphores by making them priority queues in which the highest-priority waiting process is the one that gets run, which will break certain scheduling algorithms.)
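
To make the FIFO guarantee concrete, here is a small workspace sketch; it assumes a standard single-threaded Squeak image, where #fork creates processes at the active priority, so Processor yield is enough to interleave them:

    | sema order |
    sema := Semaphore new.
    order := OrderedCollection new.
    1 to: 3 do: [:i | [sema wait. order add: i] fork].
    Processor yield.                 "let the three blocks run up to their waits, in order 1, 2, 3"
    3 timesRepeat: [sema signal].
    Processor yield.                 "let the released processes finish"
    order                            "expected: an OrderedCollection(1 2 3), i.e. strictly FIFO wake-up"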


And yes, I believe that in your example, it is correct that the entire
mutex should "stall", meaning that the process that entered the mutex
has entered a "suspended" state and all processes still waiting on
that mutex remain in their "waiting" state. If the mutex didn't
"stall", the debugger wouldn't be particularly helpful.

Cheers
Eliot



Re: [squeak-dev] Re: Suspending process fix

Michael van der Gulik-2
On 4/29/09, Eliot Miranda <[hidden email]> wrote:

> Hi Michael,
>> What list? Are you referring to the linked list that a Semaphore
>> maintains? I would consider the linked list of Processes that
>> Semaphores maintain to be an implementation detail of Processes and
>> Semaphores. Your code should be written to be completely oblivious to
>> it.
>
>
> This is more than an implementation detail.  Processes are subclasses of
> Link that add a "myList" instance variable, and Semaphores subclasses of
> LinkedList.  The runnable process lists in Processor are LinkedList
> instances.  So a process is either the sole activeProcess
> and not on any
> list, or suspended, and not on any list, or on one of the runnable process
> lists, or waiting on a Semaphore.  The "myList" instance variable is set to
> the list a process is waiting on or not.  So a process is suspended iff
> myList == nil and the process ~~ Processor activeProcess.  <snip>

Okay, so we have two types of "suspended" going on here:

1: The process is not currently running, but might be soon.
2: The process is in a special "suspended" state for debugging or
management, and won't be scheduled.

Again, you're also working with implementation details. As a
programmer of "normal" code (where "normal" isn't defined here :-P ),
I don't know or care whether Process is a subclass of Link or whether
Semaphores are a subclass of LinkedList. They have protocols that I'd
hopefully never need to use. That sort of stuff should be abstracted
away from me.

What I care about is that:

* When I fork a process, conceptually it is running at the same time.
How the processes are actually scheduled or run should not be of
interest to me.

* When a process waits on a semaphore, it won't restart until that
semaphore is signalled... unless I've asked for a timeout (a small
sketch of the timeout case follows this list). This includes when the
process has been suspended and resumed - it should still remain in the
"waiting" state.

And either of (depending on who is right between us; discussion
continued below):

* When a semaphore is signalled, *some* process that was waiting on it
shall, at some stage in the future depending on scheduling behaviour,
resume execution.
or
* When a semaphore is signalled, the process that has been waiting on
this semaphore the longest shall, at some stage in the future
depending on scheduling behaviour, resume execution.
(noting that the current implementation in Squeak resumes the waiting
process immediately)

And then:

* If Process>>suspend is invoked on a Process, that Process enters a
suspended state and won't run until >>resume is invoked on it. (If the
Process was waiting on a semaphore, it will continue waiting on that
semaphore when resumed).

This is my impression of the API that should be presented to
application developers.
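
As a small aside on the timeout case in the list above, here's a minimal sketch; it assumes only Semaphore>>waitTimeoutMSecs:, which Squeak images provide, and the 200 ms figure is arbitrary:

    | sema |
    sema := Semaphore new.
    [sema waitTimeoutMSecs: 200.        "gives up after roughly 200 ms if nobody signals"
     Transcript show: 'worker resumed, signalled or timed out'; cr] fork.
    "evaluate  sema signal  within 200 ms to take the signalled path instead of the timeout"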

> I believe the correct behaviour of Semaphore>>signal should be that
>> the next process to be run would be either the process doing the
>> signalling, or any other process waiting on that semaphore. Assuming a
>> multi-core capable VM, two processes might end up concurrently
>> continuing execution. There shouldn't be any guaranteed ordering in
>> the resuming of processes; that's an implementation detail in the VM
>> that could potentially change.
>
>
> Um, this seems very confused.  The process doing the signalling should not
> be affected at all.  Only processes waiting on the semaphore should be
> affected, and by definition the signalling process can't be waiting on the
> semaphore it signals because it has to be runnable to be able to signal.
>
> Further, yes there *must* be an ordering in the resumption order.  It must
> be strictly FIFO for many scheduling algorithms to work.

Would you point me in the right direction on this? Do you have
examples of algorithms where the resumption order of processes waiting
on a Semaphore needs to be FIFO?

>  Even on a
> multi-core CPU the scheduler can be well-defined such that processes waiting
> on a semaphore become active in the order they are waiting on the semaphore.
>  Even in a hypothetical multi-core VM with two concurrent processes
> simultaneously signalling a semaphore with two processes waiting on it the
> system would have to ensure that the two signals ended up scheduling both
> processes, not that both signals ended up somehow proceeding only one; i.e.
> the VM is going to have to serialize or mutually-exclude access to the
> semaphore to prevent signals being lost.
>
> And yes, this is guaranteed, very conciously so.  The Smalltalk scheduler is
> strictly a real-time scheduler preemptive across priority
> and cooperatively-scheduled round-robin within priority, such that all
> lower-priority runnable processes are preempted by any higher-priority
> runnable processes as soon as any higher-priority process becomes runnable.

This would break on a multi-core capable VM. On such a hypothetical VM
(we can dream, right?), any number of processes from any number of
runlevels could be running truly concurrently.

I keep going on about the hypothetical multi-core capable VM, because
I honestly believe that one day some bright spark will make us one and
there will be much rejoicing and joviality. If nobody else does, I
will, once I've finished all my current projects(*).

If all our code relies on the behaviour of the scheduler, then it's
going to be painful for a very long time before our images run nicely
and concurrently on multi-core PCs. Standard PCs these days have
two or four cores, and I can see the number of cores going all
exponential on us.

Gulik.

(*) ha ha ha ha. See my wiki below.

--
http://gulik.pbwiki.com/


[squeak-dev] Re: Suspending process fix

Andreas.Raab
In reply to this post by Igor Stasenko
Igor Stasenko wrote:
> I came to an idea , you might be interested in.
> As many of us know, some CPUs having a special mode - interrupt mode.
> What if we introduce the interrupt mode for scheduler?
[... snip ...]
> Now i trying to imagine, how a basic stuff might look like(please
> correct me if its utterly wrong way ;), if we will be able to use
> interrupt mode.

This is actually along lines similar to what I had in mind when I was
thinking of how to get rid of the built-in VM scheduling behavior. The
main thought that I had was that the VM may have a "special" process -
the scheduler process (duh!) which it runs when it doesn't know what
else to do. The VM would then not directly schedule processes after
semaphore signals but rather put them onto a "ready" queue that can be
read by the scheduler process and switch to the scheduler process. The
scheduler process decides what to run next and resumes the process via a
primitive. Whenever an external signal comes in, the VM automatically
activates the scheduler process and the scheduler process then decides
whether to resume the previously running process or to switch to a
different process.

In a way this folds the timer process into the scheduler (which makes
good sense from my perspective because much of the work in the timer is
stuff that could more effectively take place in the scheduler). The
implementation should be relatively straightforward - just add a
scheduler process and a ready list to the special objects, and wherever
the VM would normally switch processes you just switch to the scheduler.
Voila, there is your user-manipulable scheduler ;-) And obviously,
anything that is run out of the scheduler process is by definition
non-interruptible because there is simply nothing to switch to!

Cheers,
   - Andreas


Re: [squeak-dev] Re: Suspending process fix

Eliot Miranda-2
Hi Andreas,

On Tue, Apr 28, 2009 at 8:33 PM, Andreas Raab <[hidden email]> wrote:
Igor Stasenko wrote:
I came to an idea , you might be interested in.
As many of us know, some CPUs having a special mode - interrupt mode.
What if we introduce the interrupt mode for scheduler?
[... snip ...]

Now i trying to imagine, how a basic stuff might look like(please
correct me if its utterly wrong way ;), if we will be able to use
interrupt mode.

This is actually along similar lines of thought that I had when I was thinking of how to get rid of the builtin VM scheduling behavior. The main thought that I had was that the VM may have a "special" process - the scheduler process (duh!) which it runs when it doesn't know what else to do. The VM would then not directly schedule processes after semaphore signals but rather put them onto a "ready" queue that can be read by the scheduler process and switch to the scheduler process. The scheduler process decides what to run next and resumes the process via a primitive. Whenever an external signal comes in, the VM automatically activates the scheduler process and the scheduler process then decides whether to resume the previously running process or to switch to a different process.

In a way this folds the timer process into the scheduler (which makes good sense from my perspective because much of the work in the timer is stuff that could be more effectively take place in the scheduler). The implementation should be relatively straightforward - just add a scheduler process and a ready list to the special objects, and wherever the VM would normally process switch you just switch to the scheduler. Voila, there is your user-manipulable scheduler ;-) And obviously, anything that is run out of the scheduler process is by definition non-interruptable because there is simply nothing to switch to!

How would you generalise this to a natively multi-threaded VM?  Obviously one simple way is to stop the other processors at a suspension point before letting the scheduler process proceed, but is there anything cleverer that doesn't halt all processors until the singleton scheduler has made its mind up?
 
Cheers,
 - Andreas





Re: [squeak-dev] Re: Suspending process fix

Michael van der Gulik-2
On 4/29/09, Eliot Miranda <[hidden email]> wrote:
> Hi Andreas,
>
> On Tue, Apr 28, 2009 at 8:33 PM, Andreas Raab <[hidden email]> wrote:

>> This is actually along similar lines of thought that I had when I was
>> thinking of how to get rid of the builtin VM scheduling behavior. The main
>> thought that I had was that the VM may have a "special" process - the
>> scheduler process (duh!) which it runs when it doesn't know what else to
>> do.
>> The VM would then not directly schedule processes after semaphore signals
>> but rather put them onto a "ready" queue that can be read by the scheduler
>> process and switch to the scheduler process. The scheduler process decides
>> what to run next and resumes the process via a primitive. Whenever an
>> external signal comes in, the VM automatically activates the scheduler
>> process and the scheduler process then decides whether to resume the
>> previously running process or to switch to a different process.
>>
>> In a way this folds the timer process into the scheduler (which makes good
>> sense from my perspective because much of the work in the timer is stuff
>> that could be more effectively take place in the scheduler). The
>> implementation should be relatively straightforward - just add a scheduler
>> process and a ready list to the special objects, and wherever the VM would
>> normally process switch you just switch to the scheduler. Voila, there is
>> your user-manipulable scheduler ;-) And obviously, anything that is run
>> out
>> of the scheduler process is by definition non-interruptable because there
>> is
>> simply nothing to switch to!
>
>
> How would you generalise this to a natively multi-threaded VM?  Obviously
> one simple way is to stop the other processors at a suspension point before
> letting the scheduler process proceed, but is there anything cleverer that
> doesn't halt all processors until the singleton scheduler has made its mind
> up?

Every OS thread would have its own scheduler process. Each scheduler
process has its own data structure (collection of process lists in
runlevels / balancing tree / whatever) which it has complete ownership
over, meaning that it doesn't need to synchronise access to it. Each
process has affinity with a particular scheduler, so that the
interrupt-like mechanism Igor mentioned would hold only for that
particular scheduler instance and its associated list/tree/whatever
of processes.

Each scheduler process would then communicate with other scheduler
processes using something like... umm... a shared queue of messages
that it would poll on... or something; a rough sketch of that hand-off
follows the list below. I haven't thought out all the details here yet.
The communication would be necessary for (at least):

* moving processes between schedulers. This would happen when a
scheduler goes idle so that load balancing happens.

* signalling semaphores where the waiting process belongs to another scheduler.

* and other stuff I haven't thought of yet.
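
Here is that rough sketch of the shared-queue hand-off. Everything named here - CoreScheduler, its inbox instance variable, #addRunnable:, and the #migrate/#signal message keys - is invented for illustration; only SharedQueue and Association come from the image, and a single consumer per inbox is assumed:

    "posting side: another scheduler asks this one to adopt a process or to signal a semaphore"
    CoreScheduler>>post: anAssociation
        inbox nextPut: anAssociation        "inbox is a SharedQueue"

    "receiving side: drained at the top of the scheduling loop, before picking the next process"
    CoreScheduler>>drainInbox
        [inbox isEmpty] whileFalse:
            [| msg |
             msg := inbox next.
             msg key == #migrate ifTrue: [self addRunnable: msg value].
             msg key == #signal ifTrue: [msg value signal]]

    "cross-scheduler use would then look something like (otherScheduler, aProcess and aSemaphore are placeholders):"
    otherScheduler post: #migrate -> aProcess.
    otherScheduler post: #signal -> aSemaphore.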

Gulik.

--
http://gulik.pbwiki.com/


[squeak-dev] Re: Suspending process fix

Andreas.Raab
In reply to this post by Eliot Miranda-2
Eliot Miranda wrote:
> How would you generalise this to a natively multi-threaded VM?
> Obviously one simple way is to stop the other processors at a
> suspension point before letting the scheduler process proceed, but is
> there anything cleverer that doesn't halt all processors until the
> singleton scheduler has made its mind up?

There is probably something cleverer - just not as the first step ;-)
The advantage of this approach is that if you halt all threads then you
know you're in a safe spot which gives you room to experiment. Note that
you could even run GC only from the scheduler (i.e., treat GC as an
external signal that requests a GC from the scheduler) which solves the
concurrent GC problem even with the current VM design.

I think this would be a great starting point to learn more about
multi-threaded issues in your environment. As you learn how to cope with
the situation and what the implications of having a single scheduler
really are, you can move to relax the dependencies and see where this
gets you.

Cheers,
   - Andreas


Re: [squeak-dev] Re: Suspending process fix

Michael van der Gulik-2
On 4/29/09, Andreas Raab <[hidden email]> wrote:
> Eliot Miranda wrote:
> Note that
> you could even run GC only from the scheduler (i.e., treat GC as an
> external signal that requests a GC from the scheduler) which solves the
> concurrent GC problem even with the current VM design.

...or even implement the GC in bytecodes. You could provide primitives
which give lower-level access to the image - raw memory maybe, or
perhaps objects indexed by oop.
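
For illustration only, here is a sketch of what an image-level mark phase might look like given such primitives. The receiver class and the selectors #primPointersOf:, #isMarked: and #mark: are invented stand-ins for the kind of low-level access described above; nothing like them exists today:

    InImageCollector>>markFrom: rootOop
        "iterative marking, so the collector's own call stack stays shallow;
         note the work list itself is ordinary garbage - exactly the problem
         mentioned below"
        | work oop |
        work := OrderedCollection with: rootOop.
        [work isEmpty] whileFalse:
            [oop := work removeLast.
             (self isMarked: oop) ifFalse:
                [self mark: oop.
                 (self primPointersOf: oop) do: [:child | work addLast: child]]]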

It would be a bit tricky managing the garbage the GC itself would
generate. Maybe you could just leave it lying around, or maybe
deallocate it manually? And it would be a bit slow.

Gulik.

--
http://gulik.pbwiki.com/


Re: [squeak-dev] Re: Suspending process fix

Igor Stasenko
In reply to this post by Andreas.Raab
2009/4/29 Andreas Raab <[hidden email]>:

> Igor Stasenko wrote:
>>
>> I came to an idea , you might be interested in.
>> As many of us know, some CPUs having a special mode - interrupt mode.
>> What if we introduce the interrupt mode for scheduler?
>
> [... snip ...]
>>
>> Now i trying to imagine, how a basic stuff might look like(please
>> correct me if its utterly wrong way ;), if we will be able to use
>> interrupt mode.
>
> This is actually along similar lines of thought that I had when I was
> thinking of how to get rid of the builtin VM scheduling behavior. The main
> thought that I had was that the VM may have a "special" process - the
> scheduler process (duh!) which it runs when it doesn't know what else to do.
> The VM would then not directly schedule processes after semaphore signals
> but rather put them onto a "ready" queue that can be read by the scheduler
> process and switch to the scheduler process. The scheduler process decides
> what to run next and resumes the process via a primitive. Whenever an
> external signal comes in, the VM automatically activates the scheduler
> process and the scheduler process then decides whether to resume the
> previously running process or to switch to a different process.
>
> In a way this folds the timer process into the scheduler (which makes good
> sense from my perspective because much of the work in the timer is stuff
> that could be more effectively take place in the scheduler). The
> implementation should be relatively straightforward - just add a scheduler
> process and a ready list to the special objects, and wherever the VM would
> normally process switch you just switch to the scheduler. Voila, there is
> your user-manipulable scheduler ;-) And obviously, anything that is run out
> of the scheduler process is by definition non-interruptable because there is
> simply nothing to switch to!
>

Very nice indeed. That's even better than my first proposal.

    ProcessorScheduler>>schedulingProcessLoop
        [self handlePendingSignalsAndActions.
         activeProcess
            ifNil: [self idle]
            ifNotNil: [self primitiveTransferControlTo: activeProcess]] repeat

and when any process somehow stops running
(suspend/wait/terminate/interrupt, etc.), the VM will switch back to the
scheduler process loop.

What is important about having it is the guarantee that it cannot be
preempted by anything. Simply by having this, many
concurrency/scheduling-related problems can be solved by a language-side
implementation, without fear of gotchas from the VM side.

Also, the VM doesn't need to know the details about priorities,
suspending, etc. - which means that we can simplify the VM considerably
and implement the same parts on the language side, where everything is
late bound :)
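
For example, the priority policy could then live entirely in the image. A hypothetical helper for the loop above - readyLists is an assumed instance variable holding one runnable list per priority, lowest priority first; none of this is existing protocol:

    ProcessorScheduler>>pickNextProcess
        "image-side policy: answer the first process of the highest non-empty
         priority list, or nil if nothing is runnable"
        readyLists reverseDo: [:list |
            list isEmpty ifFalse: [^list removeFirst]].
        ^nil

The loop above would then do activeProcess := self pickNextProcess just before deciding whether to idle or to transfer control.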

As for moving to multiple cores... yes, as Gulik suggests, it's like
adding a new dimension:
 - a local scheduler for each core
 - a single global scheduler for freezing everything

This, of course, is if we can afford to run the same object memory over
multiple cores. Handling interpreter/object memory state(s) with
multiple cores is not a trivial thing.

If we are going to keep a more isolated model (islands, Hydra), then we
need no or minimal changes to the scheduler - each scheduler serves its
own island and receives asynchronous signals from its colleagues through
a shared queue.

> Cheers,
>  - Andreas
>


--
Best regards,
Igor Stasenko AKA sig.
