Platform: VW7.6, Gentoo linux 2.6.27-r8
I'm completely befuddled by this but I've managed to reproduce it in a very simple way. There are three processes involved:

Parent (visualworks)
  Child (visualworks)
    Grandchild (netcat -- but anything that can listen on a socket is probably fine)

Child is spawned by Parent using ExternalProcess class>>shOne:. Child spawns Grandchild via ExternalProcess class>>execute:arguments:do:errorStreamDo:. Grandchild opens a TCP server socket and waits for a connection. Meanwhile Child exits via ObjectMemory quit. The fact that Child has exited can be verified by grepping through 'ps aux'. It is gone, beyond all doubt. At this point Parent should (in my opinion, which seems to differ from reality) return from shOne:. In reality, however, it sits waiting for Grandchild to exit. Killing Grandchild frees up Parent (shOne: returns).

This is very strange behavior and is causing me lockups galore (in my case Grandchild is firefox launched to run SeasideTesting tests in an automated test environment, Child is the image-under-test and Parent is an image building toolset).

Attached is a complete set of scripts to reproduce the problem. You may have to touch up the paths a bit for your system. Run the 'run' script. You'll note (in headless-transcript.log) that the child has exited (verify with ps if you like) but Parent is waiting. Killing the nc instance frees up the Parent.

I'm really clueless on how to fix this (or who the culprit is). My guess, after looking through some strace output, was that the problem is in the use of clone(). In my case I can't modify the Grandchild, so I'm stuck trying to work around these problems in VW or somehow wrapping the Grandchild (to insulate Child and Parent from its behavior), but I haven't hit on a winning wrapper yet. At the moment I'm working around it by having Child kill Grandchild before it exits, but this is suboptimal (I end up killing firefox for no good reason).

Has anyone else hit this problem?
David

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

Attachment: ext.tar.gz (780 bytes)
This sounds like Unix behavior. Have you checked the descriptions of how fork() works? Have you tried setting the SIGCHLD signal to SIGIGNORE, if VW will let you? I'm assuming that you don't care if/when the child exits...
Cheers!

Tom Hawker
--------------------------
Senior Framework Developer
--------------------------

-----Original Message-----
From: [hidden email] On Behalf Of C. David Shaffer
Sent: Tuesday, September 22, 2009 2:27 PM
To: [hidden email]
Subject: [vwnc] ExternalProcess problems
In reply to this post by cdavidshaffer
I can't say that I understand the implications of
everything, but why not just try using the lower-level
#execute:arguments:do:errorStreamDo:, which lets you give it blocks that
deal with the input and output streams, rather than blocking the calling
process. It might just leave you with blocks stuck waiting on things that
haven't closed yet, but that might be better than blocking your calling
process.
At 05:26 PM 2009-09-22, C. David Shaffer wrote:
> Platform: VW7.6, Gentoo linux 2.6.27-r8

--
Alan Knight, Engineering Manager, Cincom Smalltalk
In reply to this post by thomas.hawker
[hidden email] wrote:
> This sounds like Unix behavior. Have you checked the descriptions of how fork() works? Have you tried setting the SIGCHLD signal to SIGIGNORE, if VW will let you? I'm assuming that you don't care if/when the child exits...
>
> Cheers!
>
> Tom Hawker

Sorry for sounding dense, but could you be more specific about Unix behavior and fork()? I've used fork() and clone() pretty extensively in the C world and never hit this particular problem. It seems connected to the network layer in a subtle way (although I could be wrong about that!). I can't ignore the child exit in Parent since I want its execution to be synchronous to Child. Bash, for example, doesn't seem to have this problem:

    trap 'echo saw exit' SIGCHLD && bash -c 'nc -l -p 7755 &'

produces a similar situation, but the echo /is/ invoked by the parent shell even though nc is left running in the background. I don't see the VW behavior as typical at all.

David
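The bash comparison above can be reduced to a sketch that needs no network at all. It is an illustration under assumptions, not a reproduction of the VisualWorks case: `sleep 3` stands in for nc, and the grandchild's standard descriptors are pointed at /dev/null so that it shares nothing with the waiting shell. Under those assumptions, `wait` on the direct child returns as soon as that child exits, even though the grandchild is still running.

```sh
#!/bin/sh
# Child: a shell that backgrounds a long-lived grandchild and exits at once.
# The grandchild (sleep 3, standing in for nc) has stdin, stdout and stderr
# redirected to /dev/null so it holds none of our descriptors.
start=$(date +%s)
sh -c 'sleep 3 </dev/null >/dev/null 2>&1 &' &
child=$!

# wait only concerns the direct child, not the child's own offspring.
wait "$child"

elapsed=$(( $(date +%s) - start ))
echo "wait returned after ${elapsed}s; the grandchild may still be running"
```

The elapsed time should be well under the three seconds the grandchild sleeps for, which matches the observation that the parent shell's SIGCHLD trap fires while nc keeps running.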
In reply to this post by cdavidshaffer
David,
can you try and localise the problem further? If you e.g. kick VW's reaper process in Parent (see ExternalProcess class>>startReaper) by signalling the status-changed semaphore does VW see the child's exit? If so, either the SIGCHLD is getting lost somehow or the reaper process isn't getting to run (unlikely; it's quite high priority).

Perhaps more than one process is waiting on the child. I see that in UnixProcess>>releaseHandle the exit semaphore is only signalled once. IMO it should read something like

    releaseHandle
        "Break the reference to the (presumably non-existant) process, awaken any waiters."
        | s |
        super releaseHandle.
        s := exitSemaphore.
        exitSemaphore := nil.
        s == nil ifFalse:
            [[s signal.
              s isEmpty] whileFalse]

On Tue, Sep 22, 2009 at 2:26 PM, C. David Shaffer <[hidden email]> wrote:
> Platform: VW7.6, Gentoo linux 2.6.27-r8
Eliot Miranda wrote:
> David,
>
> can you try and localise the problem further? If you e.g. kick
> VW's reaper process in Parent (see ExternalProcess class>>startReaper)
> by signalling the status-changed semaphore does VW see the child's
> exit? If so, either the SIGCHLD is getting lost somehow or the reaper
> process isn't getting to run (unlikely; it's quite high priority).

1) As you suggested, I stored the reaper's semaphore in a global and signaled it after a delay. No progress. Attached is the patched startReaper. I added UnixProcess startReaper to my parent startup script, which now looks like:

    Transcript cr; show: 'Launching child...'; cr.
    UnixProcess startReaper.
    [(Delay forSeconds: 5) wait.
     Transcript show: 'Kicking semaphore.'; cr.
     (Smalltalk at: #JackTheReaper) signal] fork.
    ExternalProcess shOne: 'visual /usr/local/vw7.6nc/image/visualnc.im -headless -fileIn UnixProcess-releaseHandle.st -fileIn child.st'.
    Transcript show: 'Child returned'; cr.
    ObjectMemory quit.

I see the "Kicking semaphore" message but the image remains hung.

2) An strace of parent and offspring is here: http://cdshaffer.com/david/strace.log
I believe Parent pid = 3878, Child pid = 3879 and Grandchild pid = 3880. At around line 3672 I see Parent receive SIGCHLD and call waitpid on it.

> Perhaps more than one process is waiting on the child. I see that in
> UnixProcess>>releaseHandle the exit semaphore is only signalled once.
> IMO it should read something like
>
> releaseHandle
>     "Break the reference to the (presumably non-existant) process,
>     awaken any waiters."
>     | s |
>     super releaseHandle.
>     s := exitSemaphore.
>     exitSemaphore := nil.
>     s == nil ifFalse:
>         [[s signal.
>           s isEmpty] whileFalse]

David

<?xml version="1.0"?>
<st-source>
<time-stamp>From VisualWorks® NonCommercial, 7.6 of March 3, 2008 on September 22, 2009 at 4:27:36 pm</time-stamp>
<methods>
<class-id>OS.ExternalProcess class</class-id>
<category>process reaping</category>
<body package="OS-ExternalProcess" selector="startReaper">startReaper
    "Start the child-reap process."
    "ExternalProcess defaultClass startReaper"
    | sem |
    self stopReaper.
    sem := Semaphore new.
    Transcript show: 'Storing reaper'; cr.
    Smalltalk at: #JackTheReaper put: sem.
    self setStatusChangedSemaphore: sem.
    Reaper := [[sem wait. self reapSome] repeat] forkAt: Processor lowIOPriority.
    Reaper setIsSystemProcess.
    Reaper name: 'ExternalProcessReaper'.
</body>
</methods>
</st-source>

<?xml version="1.0"?>
<st-source>
<time-stamp>From VisualWorks® NonCommercial, 7.6 of March 3, 2008 on September 22, 2009 at 4:19:05 pm</time-stamp>
<methods>
<class-id>OS.UnixProcess</class-id>
<category>private-initialize/release</category>
<body package="OS-ExternalProcess" selector="releaseHandle">releaseHandle
    "Break the reference to the (presumably non-existant) process, awaken any waiters."
    | s |
    super releaseHandle.
    s := exitSemaphore.
    exitSemaphore := nil.
    s == nil ifFalse:
        [[s signal.
          s isEmpty] whileFalse]</body>
</methods>
</st-source>
Let me add that it seems to have nothing to do with network I/O (sorry for the misdirection, my test case for eliminating general processes as a problem was flawed so I thought it was network connected). So, if you modify child to use:

    ExternalProcess
        execute: 'sleep'
        arguments: #('500')
        do: [:in :out | in close. out close]
        errorStreamDo: [:err | err close].

you will have the same problem.

David
In reply to this post by cdavidshaffer
On Tue, Sep 22, 2009 at 4:43 PM, C. David Shaffer <[hidden email]> wrote:
You need to kick JackTheReaper _after_ spawning the child. Kicking it before there is a process to reap won't achieve anything.
Then verify whether the semaphore is signalled or not. If it is not then either a) there is some problem in the VM which causes the signal not to be translated into a signal of the reaper semaphore or b) an image bug where the wrong semaphore is being registered with the VM.
Eliot Miranda wrote:
> > Transcript cr; show: 'Launching child...'; cr.
> > UnixProcess startReaper.
> > [(Delay forSeconds: 5) wait.
> >  Transcript show: 'Kicking semaphore.'; cr.
> >  (Smalltalk at: #JackTheReaper) signal] fork.
> > ExternalProcess shOne: 'visual /usr/local/vw7.6nc/image/visualnc.im -headless -fileIn UnixProcess-releaseHandle.st -fileIn child.st'.
> > Transcript show: 'Child returned'; cr.
> > ObjectMemory quit.
>
> You need to kick JackTheReaper _after_ spawning the child. Kicking
> it before there is a process to reap won't achieve anything.

enough for the ExternalProcess to be launched. I can't actually fork the block after I call shOne: since that call never returns.

> Then verify whether the semaphore is signalled or not. If it is not
> then either a) there is some problem in the VM which causes the
> signal not to be translated into a signal of the reaper semaphore or
> b) an image bug where the wrong semaphore is being registered with the VM.

Thanks for walking me through it. Yes, the signal is being delivered to the image. The attached patch prints the proper "process done" message as a result of the signal. As Alan suggested, it looks like the image is waiting for a SIGIO that is never delivered. Killing the Grandchild causes it. Seems like an odd connection with Linux async IO and clone()???

David
C. David Shaffer wrote:
> The attached patch prints the proper "process done"
> message as a result of the signal.

gosh darn it. -attached.

<?xml version="1.0"?>
<st-source>
<time-stamp>From VisualWorks® NonCommercial, 7.6 of March 3, 2008 on September 22, 2009 at 7:10:00 pm</time-stamp>
<methods>
<class-id>OS.UnixProcess</class-id>
<category>private-initialize/release</category>
<body package="OS-ExternalProcess" selector="done:with:">done: status with: usig
    "Handle a process which has exited."
    "Record the status (non-zero usig means terminated due to signal) and cut the process loose (doesn't need watching anymore)."
    Transcript show: 'Process done ' , self key printString; cr.
    exitStatus := usig = 0 ifTrue: [status] ifFalse: [usig negated].
    self releaseHandle</body>
</methods>
</st-source>
In reply to this post by cdavidshaffer
At 02:26 PM 9/22/2009, C. David Shaffer wrote:
>Platform: VW7.6, Gentoo linux 2.6.27-r8
>
>I'm completely befuddled by this but I've managed to reproduce it in a
>very simple way. There are three processes involved:
>
>Parent (visualworks)
> Child (visualworks)
> Grandchild (netcat -- but anything that can listen on a socket is
>probably fine)
>
>Child is spawned by Parent using ExternalProcess class>>shOne:.

And so waits for offspring to finish. So far so good.

>Child spawns Grandchild via
>ExternalProcess class>>execute:arguments:do:errorStreamDo:.

I'm a little rusty here, but this sounds like a fork?

>Grandchild opens a TCP server socket and waits for a connection.
>Meanwhile Child exits via ObjectMemory quit.

Ok, so it must be a fork.

>The fact that Child has exited can be verified by
>grepping through 'ps aux'. It is gone, beyond all doubt.

Ok so far.

>At this point Parent should
>(in my opinion which seems to differ from reality)
>return from shOne:.

Hmmm. I'd expect parent (which is waiting for children -- ALL children -- to exit) would continue waiting until grandchild completes. Unless/until the grandchild is emancipated, which is *not* the usual thing.

Your later example, using bash as child and nc as grandchild, is different. IIRC, you've got bash doing a 'background fork' of nc (the & at the end), which emancipates the grandchild, leaving the parent waiting on bash and only bash, whereas the vw case does a normal fork, and when your child exits, the parent reacquires responsibility for all the child's unemancipated offspring. Since the parent is waiting for all offspring to finish, and the offspring haven't finished, the parent continues waiting.

A simple fix would be to <bash> your example into working the way you intend:

Parent spawns child via #cshOne:.
Child spawns grandchild via something like cshOne: 'bash -c ''originalGrandchild &'''
Child waits on bash, which does a background fork of the grandchild and exits, allowing child to exit, allowing parent to proceed.

Or, you could look up process groups, and methods for escaping therefrom, and find a way to convey emancipation after a fork, from within VW.

CAVEAT - I don't know if your OS variant actually does anything about process groups.

Regards,

-cstb

> In reality, however, it sits waiting for Grandchild to
>exit. Killing Grandchild frees up Parent (shOne: returns).

Makes sense to me. But then, so do I. ;-)

YMMV...
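One shape such a wrapper could take, sketched in plain sh. `detach` is a hypothetical helper name, not part of VisualWorks or of any standard tool, and `sleep 3` stands in for the real grandchild. The sketch assumes that what ties the waiting side to the grandchild is an inherited descriptor rather than the process hierarchy, which the thread has not established at this point. Here the `$(...)` capture plays the role of a parent reading the child's output.

```sh
#!/bin/sh
# detach is a hypothetical helper: start "$@" in the background with
# stdin, stdout and stderr on /dev/null, so the new process keeps none
# of the caller's pipes open.
detach() {
    "$@" </dev/null >/dev/null 2>&1 &
}

start=$(date +%s)
# The subshell launches the stand-in grandchild through detach, prints a
# line and exits. The capture ends as soon as the subshell is gone.
out=$(detach sleep 3; echo "child exiting")
elapsed=$(( $(date +%s) - start ))
echo "read '$out' after ${elapsed}s"
```

In the terms used above, Child would launch something along the lines of `sh -c 'originalGrandchild </dev/null >/dev/null 2>&1 &'`, with the redirections added to the suggested `bash -c` form. Whether that is enough for the VisualWorks case is untested here.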
cstb wrote:
> Or, you could look up process groups,
> and methods for escaping therefrom,
> and find a way to convey emancipation
> after a fork, from within VW.

I toyed with that ages ago (vw3), see #setsid here:
http://web.archive.org/web/20041010012021/wiki.cs.uiuc.edu/VisualWorks/Running+in+the+background+on+Unix

R
-
In reply to this post by jas
> Makes sense to me. But then, so do I. ;-) YMMV...

:-)

Sorry for posting in HTML...need a fixed width font. For further clarification here's the output of ps axjf for this cluster of processes. I'm running in strace as you can see.

 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
 5282  5317  5317  5317 pts/5     5694 Ss    1000   0:00  \_ bash
 5317  5694  5694  5317 pts/5     5694 S+    1000   0:00      \_ /bin/sh ./run
 5694  5695  5694  5317 pts/5     5694 S+    1000   0:00          \_ strace -f visual /usr/local/vw7.6nc/image/visualnc.im -run
 5695  5696  5694  5317 pts/5     5694 S+    1000   0:00              \_ visual /usr/local/vw7.6nc/image/visualnc.im -runtime -
    1  5701  5701  5701 ?           -1 Ss    1000   0:00 sleep 500

Note the sleep line.

David
In reply to this post by cdavidshaffer
On Tue, Sep 22, 2009 at 7:20 PM, C. David Shaffer <[hidden email]> wrote:
Doh! I find increasingly I scan emails without properly reading them and post really stupid replies as a result. Must try to do better. Sorry.
It should be long
In reply to this post by cdavidshaffer
[hidden email] wrote:
> FWIW, isn't clone() the same as BSD's vfork() - a virtually efficient fork() call that doesn't copy memory in the expectation that it will be immediately followed by an exec*() call?
>
> Cheers!

I don't know vfork() but my guess is "maybe." clone() is like fork() but the resulting process shares a good bit with the parent (including memory, signal handlers/masks etc). The reason I /thought/ the distinction might be important is that a clone() process does not necessarily produce SIGCHLD to its parent when it exits. It can produce, in fact, any signal or no signal at all (the signal is one of the arguments to clone()). I thought this might be related to the problem I was having, but in going back and forth with Eliot in this thread I now see that it isn't.

The problem I'm having seems to be connected to I/O handles in some way. Process groups/sessions were suggested as possibly connected but I still don't see that one. What seems to happen is that VW creates what it calls a "pipe" between parent and child (I put it in quotes because it doesn't seem to be a UNIX pipe...have to dig deeper on this one). In the VW VM's usual style, async I/O is used on this pipe. When Child exits, Linux never delivers SIGIO (or even SIGPIPE) to Parent. This is why Parent gets stuck. I have verified that Parent thinks Child has exited (UnixProcess>>isActive produces false) but it sits blocked in the read. I think I can reproduce this without async I/O in a simple clone()/exec C program but I haven't hit it yet.

This might seem like a silly little corner case to a lot of people but I can think of lots of server arrangements for which this would cause problems. This connection to the Grandchild is subtle, undocumented and probably platform-dependent behavior. For example, a lot of people assumed Parent and Child and Grandchild were in the same session when, in fact, the VW VM calls setsid() right after clone().

I'm hoping to correct it by making the parent-child arrangement more explicit (and hence probably more platform dependent but definitely less subtle). Reinout's sample code has made me less afraid to throw some manual C callouts into the mix to try to patch things up once I understand the problem better.

Thanks for everyone's suggestions! Please keep them coming as you think of possibilities...

David
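The stuck read can be imitated outside VisualWorks with an ordinary pipe. This is a sketch of general Unix behaviour, not a trace of what the VW VM does, and it assumes (as suspected above, but not confirmed) that the Grandchild inherits the write end of whatever channel Parent is reading. A reader of a pipe only sees end-of-file once every process holding the write end has closed it, so a lingering grandchild keeps the reader blocked after the child is gone. `sleep 2` stands in for the Grandchild and `$(...)` for Parent's read.

```sh
#!/bin/sh
start=$(date +%s)

# $(...) reads the child's stdout through a pipe until end-of-file.
# The child exits right after echo, but the backgrounded sleep has
# inherited the pipe's write end, so end-of-file only arrives when
# sleep itself exits about two seconds later.
out=$(sh -c 'sleep 2 & echo "child exiting"')

elapsed=$(( $(date +%s) - start ))
echo "read '$out' after ${elapsed}s"
```

The line printed by the child arrives at once, yet the capture does not complete until the sleeper exits, which resembles a Parent that knows Child is gone but is still blocked in the read.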