[OpenSmalltalk/opensmalltalk-vm] UnixProcess forkSqueak broken since October (#548)
UnixProcess class>>forkSqueak is no longer working. The forked child VM process crashes with a segmentation fault. Testing with VMs from bintray shows that version 5.0-202009300634 works, and any version from 5.0-202010192227 onward fails. The stack dump sometimes (but not always) shows a failure in aioPoll(), for example:
I am not able to catch the failure in gdb because it happens in the child process. My initial guess is that it may be related to the epoll enhancements added in this time frame, because forking the VM requires initializing things like this in the new child VM process.
Re: [OpenSmalltalk/opensmalltalk-vm] UnixProcess forkSqueak broken since October (#548)
On Sun, 2021-01-24 at 18:59 -0800, David T Lewis wrote:
> I am not able to catch the failure in gdb because it happens in the
> child process.
GDB can follow fork(), see
(gdb) help set follow-fork-mode
Set debugger response to a program call of fork or vfork.
A fork or vfork creates a new process. follow-fork-mode can be:
parent - the original process is debugged after a fork
child - the new process is debugged after a fork
The unfollowed process will continue to run.
By default, the debugger will follow the parent process.
The segfault happens in the child process that was forked by the forkSqueak prim. It occurs in the new epoll code. I don't yet see the cause (there is no obvious null pointer issue) but the gdb backtrace is:
#0 0x0000000000dd39b0 in ?? ()
#1 0x00000000004d62dc in aioPoll (microSeconds=0)
   at /home/lewis/squeak/git/opensmalltalk-vm/platforms/unix/vm/aio.c:405
#2 0x00007fa016315e39 in display_ioProcessEvents ()
   at /home/lewis/squeak/git/opensmalltalk-vm/platforms/unix/vm-display-X11/sqUnixX11.c:4867
#3 0x0000000000417ca3 in ioProcessEvents ()
   at /home/lewis/squeak/git/opensmalltalk-vm/platforms/unix/vm/sqUnixMain.c:726
#4 0x0000000000441f58 in checkForEventsMayContextSwitch (mayContextSwitch=1)
   at /home/lewis/squeak/git/opensmalltalk-vm/spurstack64src/vm/gcc3x-interp.c:50306
#5 0x00000000004401ca in handleStackOverflowOrEventAllowContextSwitch (mayContextSwitch=1)
   at /home/lewis/squeak/git/opensmalltalk-vm/spurstack64src/vm/gcc3x-interp.c:53718
#6 0x0000000000426cbd in interpret ()
   at /home/lewis/squeak/git/opensmalltalk-vm/spurstack64src/vm/gcc3x-interp.c:5844
#7 0x000000000043ab1f in enterSmalltalkExecutiveImplementation ()
   at /home/lewis/squeak/git/opensmalltalk-vm/spurstack64src/vm/gcc3x-interp.c:51798
#8 0x000000000041d4ba in interpret ()
   at /home/lewis/squeak/git/opensmalltalk-vm/spurstack64src/vm/gcc3x-interp.c:2493
#9 0x000000000041ad0a in main (argc=2, argv=0x7ffc3ba11f78, envp=0x7ffc3ba11f90)
   at /home/lewis/squeak/git/opensmalltalk-vm/platforms/unix/vm/sqUnixMain.c:2164
The problem is that the file descriptors and structures are shared between parent and child after fork. However, after the fork, the epoll structures point to data that belongs to the parent. At line 405 the child process tries to access that data, and I think that causes the segfault.
The child should close the inherited epoll file descriptor and recreate it along with the necessary data structures. This can be done by a handler registered with pthread_atfork().
See it explained at
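The pthread_atfork() suggestion above can be sketched roughly as follows. This is a standalone illustration, not the VM's aio.c: the names epoll_fd, reset_epoll_in_child, demo_aio_init and demo_fork_check are hypothetical, and a real fix would also have to re-register whatever descriptors the child still needs.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/epoll.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical module state; the real aio.c keeps its own variables. */
static int epoll_fd = -1;

/* Runs in the child right after fork(). Closing the inherited descriptor
 * only drops the child's reference; the parent keeps its own. The child
 * then creates a fresh, empty epoll instance for itself. */
static void reset_epoll_in_child(void)
{
    if (epoll_fd >= 0)
        close(epoll_fd);
    epoll_fd = epoll_create1(EPOLL_CLOEXEC);
    /* A real VM would re-register the descriptors it still uses here. */
}

/* Create the epoll instance and install the fork handler.
 * Returns 0 on success, -1 on failure. */
int demo_aio_init(void)
{
    epoll_fd = epoll_create1(EPOLL_CLOEXEC);
    if (epoll_fd < 0)
        return -1;
    return pthread_atfork(NULL, NULL, reset_epoll_in_child) == 0 ? 0 : -1;
}

/* Register a readable pipe, fork, and compare what each side sees.
 * Returns 0 if the child's new instance is empty while the parent's
 * instance still reports the pipe, -1 otherwise. */
int demo_fork_check(void)
{
    int p[2];
    struct epoll_event ev = { .events = EPOLLIN };
    struct epoll_event out;

    assert(epoll_fd >= 0);          /* demo_aio_init() must have succeeded */
    if (pipe(p) < 0)
        return -1;
    ev.data.fd = p[0];
    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, p[0], &ev) < 0)
        return -1;
    if (write(p[1], "x", 1) != 1)   /* make the read end ready */
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: the atfork handler has already replaced epoll_fd. */
        int n = epoll_wait(epoll_fd, &out, 1, 0);
        _exit(n == 0 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    int n = epoll_wait(epoll_fd, &out, 1, 0);
    close(p[0]);
    close(p[1]);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0 && n == 1) ? 0 : -1;
}
```

Calling demo_aio_init() once at startup and then demo_fork_check() exercises the handler. Without the handler, the child would share the parent's epoll instance and see the pipe as ready too.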
On Sat, 30 Jan 2021 at 23:38, smalltalking <[hidden email]> wrote:
> The problem is that the file descriptors and structures are shared between
> parent and child after fork. However, after the fork, the epoll structures
> point to data that belongs to the parent. At line 405 the child process
> tries to access that data, and I think that causes the segfault.
> The child should close the inherited epoll file descriptor and recreate it
> along with the necessary data structures. This can be done by a handler
> registered with pthread_atfork().
Update - The actual forkSqueak is working fine, but we get failures associated with aio handling for the socket connection to the X11 server. The child closes the socket and calls aioDisable for the socket fd to unregister it. When using epoll rather than generic aio event handling, this apparently affects the Linux kernel epoll registration for the socket fd in the parent as well (I am not sure that I understand this correctly, but it appears to be the case). The result seems to be failures in either the child or parent VM process, or both. The problem goes away if I #ifdef out the call to aioDisable() in the forgetXDisplay() function. I am not sure if this is a proper fix or just a workaround kludge, but it does work.
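The effect described above can be reproduced outside the VM with a small test program. It is only a sketch under one assumption: that aioDisable() in the epoll build ends up issuing epoll_ctl(EPOLL_CTL_DEL) on the inherited epoll descriptor. The function name parent_events_after_child_del is hypothetical, and a pipe stands in for the X11 socket. Because fork() duplicates the descriptor but not the kernel epoll instance, a removal done by the child is visible to the parent.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the number of ready events the parent sees after the child has
 * removed the pipe from the inherited epoll instance, or -1 on a setup
 * error. */
int parent_events_after_child_del(void)
{
    int p[2];
    struct epoll_event ev = { .events = EPOLLIN };
    struct epoll_event out;

    int ep = epoll_create1(0);
    if (ep < 0 || pipe(p) < 0)
        return -1;
    ev.data.fd = p[0];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev) < 0)
        return -1;
    if (write(p[1], "x", 1) != 1)   /* the read end is now ready */
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: unregister the fd, as a stand-in for aioDisable().
         * This edits the interest list shared with the parent. */
        int rc = epoll_ctl(ep, EPOLL_CTL_DEL, p[0], NULL);
        _exit(rc == 0 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* Parent: the pipe still holds unread data, but the registration
     * was removed by the child, so no event is reported. */
    int n = epoll_wait(ep, &out, 1, 0);
    close(p[0]);
    close(p[1]);
    close(ep);
    return n;
}
```

On Linux the parent gets 0 events even though its pipe is readable. That matches the observation that skipping aioDisable() in the child leaves the parent's registration intact.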