Hi -
[cc: vm-dev which I accidentally left out earlier]
Attached are my proposed fixes for myList manipulation. The first CS
(PrimSuspend-ar) changes primitiveSuspend to enable atomic removal from
myList for the non-active process. The second (SuspendFixes-ar) has the
main modifications for the in-image part (this may not work for all
Squeak versions - I used a Croquet image as the basis, YMMV).
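
For anyone reading without the attachments, here is a rough sketch of the
idea. It is NOT the code in SuspendFixes-ar. The primitive index (88, the
usual slot for primitiveSuspend in Squeak) and the fallback body are
assumptions, following the proposal quoted below: try the primitive first,
answer the list the process was on, and otherwise do what we do today.

```smalltalk
Process >> suspend
	"Sketch only, not the attached changeset.
	 With an updated VM the primitive takes the receiver off whatever
	 list it is on, atomically, and answers that list.
	 ASSUMPTION: primitive index 88 is primitiveSuspend."
	<primitive: 88>
	| oldList |
	"Fallback for older VMs: the same non-atomic removal as today,
	 so the race is still possible on this path."
	myList ifNil: [^nil].
	oldList := myList.
	myList := nil.
	oldList remove: self ifAbsent: [].
	^oldList
```

With something along these lines, Process>>terminate could presumably send
"self suspend" instead of touching myList directly.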
Feedback is welcome, in particular from the usual suspects on vm-dev.
Cheers,
- Andreas
Andreas Raab wrote:
> Hi -
>
> I had an eventful (which is euphemistic for @!^# up) morning caused by
> Process>>terminate. In our last round of delay and semaphore discussions
> I had noticed that there is a possibility of having a race condition in
> Process>>terminate but dismissed it as being an application problem
> (e.g., if you send #terminate make sure you have only one place where
> you send it).
>
> This morning proved conclusively that this is a race condition which can
> affect *every* user of the system. It is caused by Process>>terminate
> which says:
>
> myList remove: self ifAbsent: [].
>
> The reason this is so problematic is that the modification of myList is
> not atomic: while the image is in the middle of it, the VM may manipulate
> the very same list concurrently in response to an external event (like a
> network interrupt). When this happens in
> "just the right way" the effect is that any number of processes at the
> same priority will "fall off" of the scheduled list. In the image that I
> was looking at earlier we had the following situation:
> * ~40 processes were not running
> * The processes each had an empty linked list as their myList
> * The processes were internally linked (via nextLink)
> * The processes were all at the same priority
> Given that most of the processes were unrelated other than having the
> same priority I think the evidence is pretty clear.
>
> The question is now: How can we fix it? My proposal would be to simply
> change primitiveSuspend such that for a non-active process it will
> primitively take the process off its suspendingList. This makes suspend
> a little more general and (by returning the previous suspendingList) it
> will also guard us against any following cleanup (like the Semaphore
> situations earlier).
>
> Unfortunately, this *will* require VM changes but I don't think it can
> be helped at this point since the VM will be manipulating these lists
> atomically anyway. The good news though is that we can have reasonable
> fallback code for primitiveSuspend which does exactly what we do today.
>
> Any comments? Alternatives? Suggestions?
>
>
> Cheers,
> - Andreas
>
>