Hi -
I had an eventful (which is euphemistic for @!^# up) morning caused by Process>>terminate. In our last round of delay and semaphore discussions I had noticed that there is a possibility of having a race condition in Process>>terminate but dismissed it as being of an application problem (e.g., if you send #terminate make sure you have only one place where you send it). This morning proved conclusively that this is a race condition which can affect *every* user of the system. It is caused by Process>>terminate which says: myList remove: self ifAbsent: []. The reason this is so problematic is that the modification of myList is not atomic and that because of the non-atomic modification there is a possibility of the VM manipulating the very same list concurrently due to an external event (like a network interrupt). When this happens in "just the right way" the effect is that any number of processes at the same priority will "fall off" of the scheduled list. In the image that I was looking at earlier we had the following situation: * ~40 processes were not running * The processes had their myList be an empty linked list * The processes were internally linked (via nextLink) * The processes were all at the same priority Given that most of the processes were unrelated other than having the same priority I think the evidence is pretty clear. The question is now: How can we fix it? My proposal would be to simply change primitiveSuspend such that for a non-active process it will primitively take the process off its suspendingList. This makes suspend a little more general and (by returning the previous suspendingList) it will also guard us against any following cleanup (like the Semaphore situations earlier). Unfortunately, this *will* require VM changes but I don't think it can be helped at this point since the VM will be manipulating these lists atomically anyway. The good news though is that we can have reasonable fallback code which does just exactly what we do today as a fallback to primitiveSuspend. Any comments? Alternatives? Suggestions? Cheers, - Andreas |
I vote for VM changes.
First, i think, we should consider, what makes Process (and all connected to it) so fragile to concurrency issues, and redesign them to simply make such things impossible to appear. All these dirty hacks and additional guard code only shows that the original design have flaws and best way to improve it is to change it, instead of patching again and again and making things even more fragile. On 05/12/2007, Andreas Raab <[hidden email]> wrote: > Hi - > > I had an eventful (which is euphemistic for @!^# up) morning caused by > Process>>terminate. In our last round of delay and semaphore discussions > I had noticed that there is a possibility of having a race condition in > Process>>terminate but dismissed it as being of an application problem > (e.g., if you send #terminate make sure you have only one place where > you send it). > > This morning proved conclusively that this is a race condition which can > affect *every* user of the system. It is caused by Process>>terminate > which says: > > myList remove: self ifAbsent: []. > > The reason this is so problematic is that the modification of myList is > not atomic and that because of the non-atomic modification there is a > possibility of the VM manipulating the very same list concurrently due > to an external event (like a network interrupt). When this happens in > "just the right way" the effect is that any number of processes at the > same priority will "fall off" of the scheduled list. In the image that I > was looking at earlier we had the following situation: > * ~40 processes were not running > * The processes had their myList be an empty linked list > * The processes were internally linked (via nextLink) > * The processes were all at the same priority > Given that most of the processes were unrelated other than having the > same priority I think the evidence is pretty clear. > > The question is now: How can we fix it? My proposal would be to simply > change primitiveSuspend such that for a non-active process it will > primitively take the process off its suspendingList. This makes suspend > a little more general and (by returning the previous suspendingList) it > will also guard us against any following cleanup (like the Semaphore > situations earlier). > > Unfortunately, this *will* require VM changes but I don't think it can > be helped at this point since the VM will be manipulating these lists > atomically anyway. The good news though is that we can have reasonable > fallback code which does just exactly what we do today as a fallback to > primitiveSuspend. > > Any comments? Alternatives? Suggestions? > > > Cheers, > - Andreas > > -- Best regards, Igor Stasenko AKA sig. |
Igor Stasenko wrote:
> I vote for VM changes. > First, i think, we should consider, what makes Process (and all > connected to it) so fragile to concurrency issues, and redesign them > to simply make such things impossible to appear. Good luck with that ;-) Seriously, I don't think it's fair to say processes are "fragile by design"; these problems are bugs more than they are design issues. > All these dirty hacks and additional guard code only shows that the > original design have flaws and best way to improve it is to change it, > instead of patching again and again and making things even more > fragile. I wouldn't call what I'm proposing a "dirty hack" - it's rather what I think primitiveSuspend really *ought* to do. The primitive is (and always was in my understanding) incomplete in this respect. Cheers, - Andreas |
Andreas Raab wrote:
> I wouldn't call what I'm proposing a "dirty hack" - it's rather what I > think primitiveSuspend really *ought* to do. The primitive is (and > always was in my understanding) incomplete in this respect. BTW, to emphasize this aspect have a look at the senders of #suspendingList. For example, ControlManager>>interruptName: does this (suspendingList := activeControllerProcess suspendingList) == nil ifTrue: [activeControllerProcess == Processor activeProcess ifTrue: [activeControllerProcess suspend]] ifFalse: [suspendingList remove: activeControllerProcess ifAbsent:[]. activeControllerProcess offList]. With the proposed change to #primitiveSuspend this becomes merely: activeControllerProcess suspend. (which is obviously the intended and correct way of doing things). Other places that get cleaned up with the modification include for example Process>>suspend (which becomes merely a call to #primitiveSuspend) and Project class>>interruptName: (and probably some others). Cheers, -Andreas |
On Dec 5, 2007, at 0:43 , Andreas Raab wrote: > Andreas Raab wrote: >> I wouldn't call what I'm proposing a "dirty hack" - it's rather >> what I think primitiveSuspend really *ought* to do. The primitive >> is (and always was in my understanding) incomplete in this respect. > > BTW, to emphasize this aspect have a look at the senders of > #suspendingList. For example, ControlManager>>interruptName: does this > > (suspendingList := activeControllerProcess suspendingList) == nil > ifTrue: [activeControllerProcess == Processor activeProcess > ifTrue: [activeControllerProcess suspend]] > ifFalse: [suspendingList remove: activeControllerProcess ifAbsent: > []. > activeControllerProcess offList]. > > With the proposed change to #primitiveSuspend this becomes merely: > > activeControllerProcess suspend. > > (which is obviously the intended and correct way of doing things). > Other places that get cleaned up with the modification include for > example Process>>suspend (which becomes merely a call to > #primitiveSuspend) and Project class>>interruptName: (and probably > some others). Sounds like the Right Thing to me. - Bert - |
In reply to this post by Andreas.Raab
On 05/12/2007, Andreas Raab <[hidden email]> wrote:
> Andreas Raab wrote: > > I wouldn't call what I'm proposing a "dirty hack" - it's rather what I > > think primitiveSuspend really *ought* to do. The primitive is (and > > always was in my understanding) incomplete in this respect. > Maybe, again i was not precise with what i saying :), i just only meant that without changes to VM, its very difficult to create a good non-buggy implementation of Process without concurrency issues mentioned above. That's why, the key moments, which should performed atomically, should be implemented by VM to prevent any possibilities to step into problems in future. But still, a simple modification, like managing myList by VM means, that VM "knows" the structure of process, which means that we tightly bound the processes state with VM, and this is what i don't like. Because we making it rigid and will be not able to change in future. Its obvious, that to avoid such dependency, we should rethink the structure(s) to not force VM "knowing" the Process layout. > BTW, to emphasize this aspect have a look at the senders of > #suspendingList. For example, ControlManager>>interruptName: does this > > (suspendingList := activeControllerProcess suspendingList) == nil > ifTrue: [activeControllerProcess == Processor activeProcess > ifTrue: [activeControllerProcess suspend]] > ifFalse: [suspendingList remove: activeControllerProcess ifAbsent:[]. > activeControllerProcess offList]. > > With the proposed change to #primitiveSuspend this becomes merely: > > activeControllerProcess suspend. > > (which is obviously the intended and correct way of doing things). Other > places that get cleaned up with the modification include for example > Process>>suspend (which becomes merely a call to #primitiveSuspend) and > Project class>>interruptName: (and probably some others). > > Cheers, > -Andreas > > -- Best regards, Igor Stasenko AKA sig. |
In reply to this post by Andreas.Raab
Hi -
[cc: vm-dev which I accidentally left out earlier] Attached my proposed fixes for myList manipulation. The first CS (PrimSuspend-ar) changes primitiveSuspend to enable atomic removal from myList for the non-active process. The second (SuspendFixes-ar) has the main modifications for the in-image part (this may not work for all Squeak versions - I used a Croquet image as the basis, YMMV). Feedback is welcome, in particular from the usual suspects on vm-dev. Cheers, - Andreas Andreas Raab wrote: > Hi - > > I had an eventful (which is euphemistic for @!^# up) morning caused by > Process>>terminate. In our last round of delay and semaphore discussions > I had noticed that there is a possibility of having a race condition in > Process>>terminate but dismissed it as being of an application problem > (e.g., if you send #terminate make sure you have only one place where > you send it). > > This morning proved conclusively that this is a race condition which can > affect *every* user of the system. It is caused by Process>>terminate > which says: > > myList remove: self ifAbsent: []. > > The reason this is so problematic is that the modification of myList is > not atomic and that because of the non-atomic modification there is a > possibility of the VM manipulating the very same list concurrently due > to an external event (like a network interrupt). When this happens in > "just the right way" the effect is that any number of processes at the > same priority will "fall off" of the scheduled list. In the image that I > was looking at earlier we had the following situation: > * ~40 processes were not running > * The processes had their myList be an empty linked list > * The processes were internally linked (via nextLink) > * The processes were all at the same priority > Given that most of the processes were unrelated other than having the > same priority I think the evidence is pretty clear. > > The question is now: How can we fix it? My proposal would be to simply > change primitiveSuspend such that for a non-active process it will > primitively take the process off its suspendingList. This makes suspend > a little more general and (by returning the previous suspendingList) it > will also guard us against any following cleanup (like the Semaphore > situations earlier). > > Unfortunately, this *will* require VM changes but I don't think it can > be helped at this point since the VM will be manipulating these lists > atomically anyway. The good news though is that we can have reasonable > fallback code which does just exactly what we do today as a fallback to > primitiveSuspend. > > Any comments? Alternatives? Suggestions? > > > Cheers, > - Andreas > > |
In reply to this post by Andreas.Raab
this talks by itself!
Sounds like this is the way to go. But I'm only a beginner on concurrency (even if I'm improving slowly). >> BTW, to emphasize this aspect have a look at the senders of >> #suspendingList. For example, ControlManager>>interruptName: does >> this > > (suspendingList := activeControllerProcess suspendingList) == nil > ifTrue: [activeControllerProcess == Processor activeProcess > ifTrue: [activeControllerProcess suspend]] > ifFalse: [suspendingList remove: activeControllerProcess ifAbsent: > []. > activeControllerProcess offList]. > > With the proposed change to #primitiveSuspend this becomes merely: > > activeControllerProcess suspend. > > (which is obviously the intended and correct way of doing things). > Other places that get cleaned up with the modification include for > example Process>>suspend (which becomes merely a call to > #primitiveSuspend) and Project class>>interruptName: (and probably > some others). > > Cheers, > -Andreas > > |
On 05/12/2007, stephane ducasse <[hidden email]> wrote:
> this talks by itself! > Sounds like this is the way to go. > But I'm only a beginner on concurrency (even if I'm improving slowly). > I was involved in C++ project, where we had to deal with more than 12 running threads, each performing own task (timing/sockets/sound/UI). It was a good practice in fighting with deadlocks and learning how to avoid writing non-concurrent code. -- Best regards, Igor Stasenko AKA sig. |
In reply to this post by Andreas.Raab
I've also noticed that Process>>isTerminated is not thread-safe... I have had situations where a perfectly live process has answered true to this message and, while investigating, even managed to get a case where the suspendedContext pc was nil! |
> I've also noticed that Process>>isTerminated is not thread-safe...
And #signalException: Cheers, Lukas -- Lukas Renggli http://www.lukas-renggli.ch |
In reply to this post by Igor Stasenko
On 5 déc. 07, at 10:45, Igor Stasenko wrote: > On 05/12/2007, stephane ducasse <[hidden email]> wrote: >> this talks by itself! >> Sounds like this is the way to go. >> But I'm only a beginner on concurrency (even if I'm improving >> slowly). >> > I was involved in C++ project, where we had to deal with more than 12 > running threads, each performing own task (timing/sockets/sound/UI). > It was a good practice in fighting with deadlocks and learning how to > avoid writing non-concurrent code. I imagine :) I'm reading the code, several books and will do the exercises but still I'm really humble with concurrent programming. I'm doing the exercises of the little book on semaphores. Just to get exposed to my own limits :) Stef > > -- > Best regards, > Igor Stasenko AKA sig. > > |
In reply to this post by Lukas Renggli
Lukas Renggli wrote:
>> I've also noticed that Process>>isTerminated is not thread-safe... > > And #signalException: Good one. Yes, this needs fixing. My suggestion would be (with the other changes): signalException: anException | oldList | self isActiveProcess ifTrue: [^anException signal]. "Suspend myself first to ensure that I won't run away in the midst of the following modifications." oldList := self suspend. "Add a new method context to the stack that will signal the exception" suspendedContext := MethodContext sender: suspendedContext receiver: self method: (self class lookupSelector: #pvtSignal:list:) arguments: (Array with: anException with: oldList). "If we were on a list to run, then suspend and restart the receiver (this lets the receiver run if it is currently blocked on a semaphore). If we are not on a list to be run (i.e. this process is suspended), then when the process is resumed, it will signal the exception" oldList ifNotNil: [self resume]. Cheers, - Andreas |
In reply to this post by Andreas.Raab
Am Dienstag, den 04.12.2007, 16:40 -0800 schrieb Andreas Raab:
> Feedback is welcome, in particular from the usual suspects on vm-dev. Got any stress test for a patched vm to test this? FWIW I built a 3.9.12 linux vm based on the debian package from Matej and the 3.9.12 sources from the svn repo. For easy patch creation I changed the patch handling to dpatch. I applied the following patches: 10_configure recreation of configure 11_primitiveSuspend the proposed fix for primitiveSuspend (see http://lists.squeakfoundation.org/pipermail/squeak-dev/2007-December/123025.html) by Andreas Raab 15_biasToGrow fix biasToGrow (see http://bugs.squeak.org/view.php?id=6667 for details) by John M. McIntosh 20_sym_link_patch fix symlinks on linux (see http://lists.squeakfoundation.org/pipermail/squeak-dev/2007-November/122485.html) by me 30_window_icon original patch by Matej Kosik 40_usage_builtin_modules this only adds a list of internal vm-plugins to the output (squeak-vm -h) to keep me sane switching between different vms On a debian system you need to have these entries in your sources.list, if you want to use apt-get for installation: deb http://ftp.squeak.org/debian/ stable main deb http://www.lazarevic.de/debian/ lenny main As root or with sudo do: apt-get update apt-get install squeak-vm To get rid of the vm just remove the last entry from the sources.list, update and "downgrade" to the 3.9.8 version again. Have fun, Alex PS: No guarantees for nothing! :) |
In reply to this post by Andreas.Raab
If we intend to modify/enhance the VM what about the old classic
#valueUninterrupatably (VA didn't quite pin it totally down, but it did the job, mostly). That is, a way of doing something totally atomically with respect to the VM. To be used sparingly, of course! > -----Original Message----- > From: [hidden email] > [mailto:[hidden email]]On Behalf Of > Andreas Raab > Sent: 05 December 2007 4:44 PM > To: The general-purpose Squeak developers list > Subject: Re: Process>>terminate woes > > > Lukas Renggli wrote: > >> I've also noticed that Process>>isTerminated is not thread-safe... > > > > And #signalException: > > Good one. Yes, this needs fixing. > My suggestion would be (with the other changes): > > signalException: anException > | oldList | > self isActiveProcess ifTrue: [^anException signal]. > > "Suspend myself first to ensure that I won't run away in the > midst of the following modifications." > oldList := self suspend. > > "Add a new method context to the stack that will signal the exception" > suspendedContext := MethodContext > sender: suspendedContext > receiver: self > method: (self class lookupSelector: #pvtSignal:list:) > arguments: (Array with: anException with: oldList). > > "If we were on a list to run, then suspend and restart the receiver > (this lets the receiver run if it is currently blocked on a > semaphore). If > we are not on a list to be run (i.e. this process is suspended), then > when the > process is resumed, it will signal the exception" > > oldList ifNotNil: [self resume]. > > Cheers, > - Andreas > |
Maybe this is related, maybe not.
www.squeaksource.com has been successfully running with the recently published Semaphore patches for about a month. This morning we received the attached stack-trace. Going to the image shows that there are thousands of processes waiting at the Semaphore you see in the stack-trace. Cleaning up the mess only helps for a couple of hours, after a while it looks up again ... Any ideas? Cheers, Lukas -- Lukas Renggli http://www.lukas-renggli.ch error.txt (10K) Download Attachment |
Lukas Renggli wrote:
> www.squeaksource.com has been successfully running with the recently > published Semaphore patches for about a month. This morning we > received the attached stack-trace. Going to the image shows that there > are thousands of processes waiting at the Semaphore you see in the > stack-trace. Cleaning up the mess only helps for a couple of hours, > after a while it looks up again ... > > Any ideas? Very hard to tell without looking at direct evidence. It may just be a problem with your error handler, e.g., that the unwind blocks were not run correctly (leaving the semaphore locked). OTOH, it *is* perfectly possible that some processes "fell off" the schedulers list. Here is how to find out: Process allInstances reject:[:p| p isActiveProcess or:[p suspendingList == nil or:[p suspendingList includes: p]]. ]. If this comes up non-empty you are in trouble. Cheers, - Andreas |
Is your code snippet thread safe Andreas?
All "fun and games"! To be fair, after your delay/semaphore fixes we have what seems a stable solution for our products. Can never be sure, sometimes, though! Our system seems to run happily now, around 16 high priority processes and a few others. Synchronisation in the inportant areas (Squeak can work with an "optimistic" outlook as long as the critical areas are protected). Again I'll advocate a "non-preemptive" primitive, on a block, would be nice! Gary > -----Original Message----- > From: [hidden email] > [mailto:[hidden email]]On Behalf Of > Andreas Raab > Sent: 05 December 2007 7:50 PM > To: The general-purpose Squeak developers list > Subject: Re: Process>>terminate woes > > > Lukas Renggli wrote: > > www.squeaksource.com has been successfully running with the recently > > published Semaphore patches for about a month. This morning we > > received the attached stack-trace. Going to the image shows that there > > are thousands of processes waiting at the Semaphore you see in the > > stack-trace. Cleaning up the mess only helps for a couple of hours, > > after a while it looks up again ... > > > > Any ideas? > > Very hard to tell without looking at direct evidence. It may just be a > problem with your error handler, e.g., that the unwind blocks were not > run correctly (leaving the semaphore locked). OTOH, it *is* perfectly > possible that some processes "fell off" the schedulers list. Here is how > to find out: > > Process allInstances reject:[:p| > p isActiveProcess > or:[p suspendingList == nil > or:[p suspendingList includes: p]]. > ]. > > If this comes up non-empty you are in trouble. > > Cheers, > - Andreas > |
In reply to this post by Alexander Lazarevic'
squeak-vm_3.9.12-3 now features the dbus-plugin and a default sources
path via SystemAttributes 1007. Patch 11_primitiveSuspend is still in the package, but is deactivated. The image in package squeak-image3.9_3.9.7067.exp1 contains support code for IPv6, DBus and the default sources path. Alex Alexander Lazarević schrieb: > Am Dienstag, den 04.12.2007, 16:40 -0800 schrieb Andreas Raab: > >> Feedback is welcome, in particular from the usual suspects on vm-dev. >> > > Got any stress test for a patched vm to test this? > > FWIW I built a 3.9.12 linux vm based on the debian package from Matej > and the 3.9.12 sources from the svn repo. For easy patch creation I > changed the patch handling to dpatch. I applied the following patches: > > 10_configure > recreation of configure > > 11_primitiveSuspend > the proposed fix for primitiveSuspend (see > http://lists.squeakfoundation.org/pipermail/squeak-dev/2007-December/123025.html) > by Andreas Raab > > 15_biasToGrow > fix biasToGrow (see http://bugs.squeak.org/view.php?id=6667 > for details) by John M. McIntosh > > 20_sym_link_patch > fix symlinks on linux (see > http://lists.squeakfoundation.org/pipermail/squeak-dev/2007-November/122485.html) > by me > > 30_window_icon > original patch by Matej Kosik > > 40_usage_builtin_modules > this only adds a list of internal vm-plugins to the output > (squeak-vm -h) to keep me sane switching between different vms > > On a debian system you need to have these entries in your sources.list, > if you want to use apt-get for installation: > > deb http://ftp.squeak.org/debian/ stable main > deb http://www.lazarevic.de/debian/ lenny main > > As root or with sudo do: > > apt-get update > apt-get install squeak-vm > > To get rid of the vm just remove the last entry from the sources.list, > update and "downgrade" to the 3.9.8 version again. > > Have fun, > Alex > > PS: No guarantees for nothing! :) > > > |
Alexander Lazarević schrieb:
> The image in package squeak-image3.9_3.9.7067.exp1 contains support code > for IPv6, DBus and the default sources path. > The latest squeak-dev image with this is in the package squeak-image3.9dev_3.9.7067.1 Alex |
Free forum by Nabble | Edit this page |