If I comment /* signal(SIGIO,forceInterruptCheck) */ in platforms//unix/vm/aio.c then I don't experiment crashes anymore One solution if we want to manipulate the rsp safely is - to define a specific stack for signal handling via sigaltstack() - to use that specific stack via sigaction() with SA_ONSTACK flag That's too low level and I prefer some assistance here. There are several calls to signal in the VM so they should all be reviewed. Le jeu. 3 oct. 2019 à 22:10, Nicolas Cellier <[hidden email]> a écrit :
|
Hi Nicolas, I am only reading your (very impressive) emails here, and not really looking at the code. But for whatever it may be worth: I am not surprised if OS X and Linux behave differently with respect to signal delivery and handling. OS X started with a Mach kernel that was based on lightweight threads, and it evolved to be a flavor of unix with a unix process model based on threads. Linux started as a unix clone with unix processes that were gradually evolved to act like threads. Nowadays they are probably more or less the same in most practical respects, but the mechanisms for delivering signals to threads and/or processes might be different in the details. IIUC, your hypothesis is that signals are being delivered to threads at some point at which the state of the thread stack is somehow not consistent with what is expected (somehow related to competing uses of stacks in the interpreter and jit frames). That makes sense to me. You suggest a possible solution of defining a specific stack for signal handling via sigaltstack() and using that stack via sigaction() with SA_ONSTACK flag. That is probably the right thing to do. One reason I think it may be the right thing to do is that Eliot and/or I did something very similar for the signal forwarding mechanism in OSProcessPlugin. Checking the old commit notices (because I do not really recall the details), I added SA_ONSTACK to deal with some Cog (stack intepreter) issues back in 2012. Here is the commit notice that explains the issue at that time: Name: VMConstruction-Plugins-OSProcessPlugin-dtl.37 Author: dtl Time: 22 December 2012, 3:08:51.228 pm UUID: e3c06c1b-76d8-4025-9d13-5fe582618c62 Ancestors: VMConstruction-Plugins-OSProcessPlugin-dtl.36 OSProcessPlugin 4.5.1 Use an alternate signal stack to avoid consuming native stack space. This prevents stack usage problems when running on Cog (stack depth exceeded when a signal is being handled). Adapted from VMConstruction-Plugins-OSProcessPlugin.oscog-eem.38 with the following changes: - Always use sigaltstack if the platform supports it (i.e. do not test for Cog). - If SQ_ONSTACK is not defined, then assume the platform cannot support signalstack and use the default stack. Use macros to define away unsupportable code. - Fix some type declarations to address int/pointer comparison - Use #isDefined:inSmalltalk:comment:ifTrue:ifFalse: And for reference here is the generated C code from UnixOSProcessPlugin>>setSignalNumber:handler (my original version, oscog is different but the same in essence). It sets SA_ONSTACK for the signal stack, and also SA_RESTART because the VM code generally cannot be expected to handle restarting interrupted primitives rigorously, therefore use the simple default. /* Set a signal handler. The C code translator will convert #sig:nal: into 'signal(parm1, parm2)' */ static void * setSignalNumberhandler(sqInt signalNumber, void *signalHandlerAddress) { struct sigaction oldHandlerAction; struct sigaction sigHandlerAction; # ifdef SA_ONSTACK // true if platform supports sigaltstack sigHandlerAction.sa_sigaction = signalHandlerAddress; sigHandlerAction.sa_flags = SA_ONSTACK | SA_RESTART; sigemptyset(&sigHandlerAction.sa_mask); if ((sigaction(signalNumber, (&sigHandlerAction), (&oldHandlerAction))) != 0) { perror("signal"); } return oldHandlerAction.sa_sigaction; # else return signal(signalNumber, signalHandlerAddress); # endif // SA_ONSTACK return null; } Dave On Thu, Oct 03, 2019 at 10:47:03PM +0200, Nicolas Cellier wrote: > > If I comment /* signal(SIGIO,forceInterruptCheck) */ in > platforms//unix/vm/aio.c > then I don't experiment crashes anymore > > One solution if we want to manipulate the rsp safely is > - to define a specific stack for signal handling via sigaltstack() > - to use that specific stack via sigaction() with SA_ONSTACK flag > > That's too low level and I prefer some assistance here. > There are several calls to signal in the VM so they should all be reviewed. > > Le jeu. 3 oct. 2019 ?? 22:10, Nicolas Cellier < > [hidden email]> a ??crit : > > > I confirm that each crash follows a SIGIO -> forceInterruptCheck() with > > corrupted stack pointer pointing to methodZone (that is executable zone so > > program counter). > > I did: > > 1) remove static declaration of methodZoneBase and limitAddress in > > cogitX64Sysv.c so as to make them exported globals > > 2) declared them as extern sqInt in gcc-cointerp.c > > 3) instrumented forceInterruptCheck() so as to emit a warning and/or set a > > breakpoint: > > > > if ( getsp() > methodZoneBase && getsp() <= limitAddress ) { > > printf("pb in forceInterruptCheck\n"); > > } > > > > Unlike what I said, forceInterruptCheck() does nothing special and > > certainly not invoke Smalltalk code nor JITTED code. > > The interruptCheckChain function that it could invoke is generally null > > (except for debug/simulation purpose), and always null in the crash i > > observed. > > The signal handler also generally operate on same stack (unless > > specifically asked via sigaction() - which is not the case). > > My earler diagnostic was driven by the false assumption that handler would > > use the interrupt stack pointer (IST), and thus that only special stack > > pointer manipulation happening in trampoline could corrupt it, hmmm that > > was late night blurred minded ;) > > > > What comes to my mind: > > - the signal SIGIO might happen while we are manipulating the stack > > pointer, hence the heisenbug > > - the stack pointer could be offseted by some incorrect code (but why no > > error would happen before SIGIO then?) > > > > I don't have clear understanding of all spur machinery details that would > > help further diagnosis > > Eliot, doesn't it sound interesting? > > > > Le jeu. 3 oct. 2019 ?? 08:10, Nicolas Cellier < > > [hidden email]> a ??crit : > > > >> Maybe I draw conclusions too quickly. > >> But clearly, the signal handler has a stack pointer pointing in > >> methodZone. > >> > >> (lldb) p/x $rsp > >> (unsigned long) $211 = 0x0000000107c92da0 > >> (lldb) p/x methodZoneBase > >> (sqInt) $212 = 0x0000000107c01d30 > >> (lldb) p/x limitAddress > >> (usqInt) $213 = 0x0000000107d5e000 > >> > >> It remains to see how that happened. > >> > >> Le jeu. 3 oct. 2019 ?? 05:49, Nicolas Cellier < > >> [hidden email]> a ??crit : > >> > >>> Hi all, > >>> So I finally got it! > >>> That's our specific handling of SIGIO with forceInterruptCheck. > >>> Don't you JIT the signal handler! > >>> > >>> The signal handler is using a different stack pointer so as to not > >>> interfere with the program. > >>> Until Cog decide to jump into JITTED code from within the signal handler. > >>> It now share a stack pointer pointing somewhere in cog methodZone, and > >>> anything it does with it's stack (call a function, save registers...) will > >>> destroy the JITTED code. > >>> It even happened when I used the debugger which shares the interrupt > >>> handler stack pointer! > >>> > >>> Why does it not happen in linux, and why only Mac OS 64, I have no > >>> clue... > >>> But we can't decently live very long with such handling. > >>> > >>> Le mer. 2 oct. 2019 ?? 23:29, Nicolas Cellier < > >>> [hidden email]> a ??crit : > >>> > >>>> Hi Eliot, > >>>> so I have set watchpoint to 4 previous crash and got this: > >>>> > >>>> (lldb) run ../../image/crash.image > >>>> There is a running process, kill it and restart?: [Y/n] > >>>> Process 98150 exited with status = 9 (0x00000009) > >>>> Process 98164 launched: > >>>> '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/SqueakDebug.app/Contents/MacOS/Squeak' > >>>> (x86_64) > >>>> Process 98164 stopped > >>>> * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint > >>>> 1.1 > >>>> frame #0: 0x00000001001618ec Squeak`-[sqSqueakMainApplication > >>>> setupMenus](self=0x000000010076e1a0, _cmd="setupMenus") at > >>>> sqSqueakMainApplication.m:123 > >>>> 120 > >>>> 121 - (void) setupMenus { > >>>> 122 // nothing to do so far since the menu is setup in the > >>>> MainMenu.nib file > >>>> -> 123 } > >>>> 124 > >>>> 125 - (void) setupTimers { > >>>> 126 extern void SetUpTimers(void); > >>>> Target 0: (Squeak) stopped. > >>>> (lldb) wa se e -s 1 -- 0x107c992b4 > >>>> Watchpoint created: Watchpoint 35: addr = 0x107c992b4 size = 1 state = > >>>> enabled type = w > >>>> (lldb) wa se e -s 1 -- 0x107c929e0 > >>>> Watchpoint created: Watchpoint 34: addr = 0x107c929e0 size = 1 state = > >>>> enabled type = w > >>>> (lldb) wa se e -s 1 -- 0x0000000107c945df > >>>> Watchpoint created: Watchpoint 36: addr = 0x107c945df size = 1 state = > >>>> enabled type = w > >>>> (lldb) wa se e -s 1 -- 0x0000000107c92da3 > >>>> Watchpoint created: Watchpoint 37: addr = 0x107c92da3 size = 1 state = > >>>> enabled type = w > >>>> new value: -3689348814741910324 > >>>> (lldb) c > >>>> Process 98164 resuming > >>>> > >>>> Watchpoint 34 hit: > >>>> new value: -3689348814741910456 > >>>> Process 98164 stopped > >>>> * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint > >>>> 34 > >>>> frame #0: 0x0000000100102067 > >>>> Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, > >>>> val=72) at sqMemoryAccess.h:136 > >>>> 133 /* Use static inline functions when the compiler produces > >>>> efficient code for small accessors. > >>>> 134 These are preferred because static type checking will > >>>> prevent inadvertent confusion of pointers and oops. */ > >>>> 135 static inline sqInt byteAtPointer(char *ptr) { return > >>>> (sqInt)(*((unsigned char *)ptr)); } > >>>> -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { > >>>> return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } > >>>> 137 static inline sqInt shortAtPointer(char *ptr) { return > >>>> (sqInt)(*((short *)ptr)); } > >>>> 138 static inline sqInt shortAtPointerput(char *ptr, int val) { > >>>> return (sqInt)(*((short *)ptr)= (short)val); } > >>>> 139 static inline sqInt intAtPointer(char *ptr) { return > >>>> (sqInt)(*((int *)ptr)); } > >>>> Target 0: (Squeak) stopped. > >>>> (lldb) > >>>> Process 98164 resuming > >>>> > >>>> Watchpoint 37 hit: > >>>> old value: -3689348814741910324 > >>>> new value: -3689912642157150198 > >>>> Process 98164 stopped > >>>> * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint > >>>> 37 > >>>> frame #0: 0x000000010012de84 > >>>> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c92da0, size=1240, > >>>> selector=4435376352) at cogitX64SysV.c:6321 > >>>> 6318 > >>>> 6319 (method->cmType = CMMethod); > >>>> 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); > >>>> -> 6321 (method->blockSize = size); > >>>> 6322 (method->methodObject = methodObj); > >>>> 6323 > >>>> 6324 /* If the method has already been cogged (e.g. Newspeak > >>>> accessors) then > >>>> Target 0: (Squeak) stopped. > >>>> (lldb) > >>>> Process 98164 resuming > >>>> > >>>> Watchpoint 36 hit: > >>>> new value: -3689348814741910520 > >>>> Process 98164 stopped > >>>> * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint > >>>> 36 > >>>> frame #0: 0x0000000100102067 > >>>> Squeak`byteAtPointerput(ptr="\b???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, > >>>> val=8) at sqMemoryAccess.h:136 > >>>> 133 /* Use static inline functions when the compiler produces > >>>> efficient code for small accessors. > >>>> 134 These are preferred because static type checking will > >>>> prevent inadvertent confusion of pointers and oops. */ > >>>> 135 static inline sqInt byteAtPointer(char *ptr) { return > >>>> (sqInt)(*((unsigned char *)ptr)); } > >>>> -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { > >>>> return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } > >>>> 137 static inline sqInt shortAtPointer(char *ptr) { return > >>>> (sqInt)(*((short *)ptr)); } > >>>> 138 static inline sqInt shortAtPointerput(char *ptr, int val) { > >>>> return (sqInt)(*((short *)ptr)= (short)val); } > >>>> 139 static inline sqInt intAtPointer(char *ptr) { return > >>>> (sqInt)(*((int *)ptr)); } > >>>> Target 0: (Squeak) stopped. > >>>> (lldb) > >>>> Process 98164 resuming > >>>> > >>>> Watchpoint 35 hit: > >>>> new value: -3689348814741910456 > >>>> Process 98164 stopped > >>>> * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint > >>>> 35 > >>>> frame #0: 0x0000000100102067 > >>>> Squeak`byteAtPointerput(ptr="H???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????, > >>>> val=72) at sqMemoryAccess.h:136 > >>>> 133 /* Use static inline functions when the compiler produces > >>>> efficient code for small accessors. > >>>> 134 These are preferred because static type checking will > >>>> prevent inadvertent confusion of pointers and oops. */ > >>>> 135 static inline sqInt byteAtPointer(char *ptr) { return > >>>> (sqInt)(*((unsigned char *)ptr)); } > >>>> -> 136 static inline sqInt byteAtPointerput(char *ptr, int val) { > >>>> return (sqInt)(*((unsigned char *)ptr)= (unsigned char)val); } > >>>> 137 static inline sqInt shortAtPointer(char *ptr) { return > >>>> (sqInt)(*((short *)ptr)); } > >>>> 138 static inline sqInt shortAtPointerput(char *ptr, int val) { > >>>> return (sqInt)(*((short *)ptr)= (short)val); } > >>>> 139 static inline sqInt intAtPointer(char *ptr) { return > >>>> (sqInt)(*((int *)ptr)); } > >>>> Target 0: (Squeak) stopped. > >>>> (lldb) > >>>> Process 98164 resuming > >>>> > >>>> Watchpoint 37 hit: > >>>> old value: -3689912642157150198 > >>>> new value: -5066732099722018553 > >>>> Process 98164 stopped > >>>> * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint > >>>> 37 > >>>> frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at > >>>> gcc3x-cointerp.c:64151 > >>>> 64148 /* StackInterpreter>>#forceInterruptCheck */ > >>>> 64149 sqInt > >>>> 64150 forceInterruptCheck(void) > >>>> -> 64151 { DECL_MAYBE_SQ_GLOBAL_STRUCT > >>>> 64152 void (*iccFunc)(); > >>>> 64153 StackPage *thePage; > >>>> 64154 > >>>> Target 0: (Squeak) stopped. > >>>> (lldb) bt > >>>> * thread #1, queue = 'com.apple.main-thread', stop reason = watchpoint > >>>> 37 > >>>> * frame #0: 0x000000010002e801 Squeak`forceInterruptCheck at > >>>> gcc3x-cointerp.c:64151 > >>>> frame #1: 0x00007fff6ab9af5a libsystem_platform.dylib`_sigtramp + 26 > >>>> frame #2: 0x0000000107c10699 > >>>> frame #3: 0x0000000107c9646f > >>>> frame #4: 0x0000000107c9633e > >>>> frame #5: 0x0000000107c962ac > >>>> frame #6: 0x0000000107c961e9 > >>>> frame #7: 0x0000000107c8fcf8 > >>>> frame #8: 0x0000000107c98605 > >>>> frame #9: 0x0000000107c8bbfe > >>>> frame #10: 0x0000000107c1100e > >>>> frame #11: 0x0000000107c8c4ad > >>>> frame #12: 0x0000000107c0cef6 > >>>> frame #13: 0x0000000107c8c2a5 > >>>> frame #14: 0x0000000107c8bb4c > >>>> frame #15: 0x0000000107c0cef6 > >>>> frame #16: 0x0000000107c8ba05 > >>>> frame #17: 0x0000000107c8b85b > >>>> frame #18: 0x0000000107c1100e > >>>> frame #19: 0x0000000107c8b7e1 > >>>> frame #20: 0x0000000107c57018 > >>>> frame #21: 0x0000000107c018c8 > >>>> (lldb) call printCallStack() > >>>> > >>>> Watchpoint 34 hit: > >>>> old value: -3689348814741910456 > >>>> new value: 4435376368 > >>>> error: Execution was interrupted, reason: watchpoint 34. > >>>> The process has been returned to the state before expression evaluation. > >>>> (lldb) call printCogMethodFor((void*)0x0000000107c10699) > >>>> > >>>> Watchpoint 34 hit: > >>>> old value: 4435376368 > >>>> new value: 0 > >>>> error: Execution was interrupted, reason: watchpoint 34. > >>>> The process has been returned to the state before expression evaluation. > >>>> > >>>> I don't know how to proceed, but maybe there's enough clue for you... > >>>> > >>>> Le lun. 30 sept. 2019 ?? 01:28, Eliot Miranda <[hidden email]> > >>>> a ??crit : > >>>> > >>>>> > >>>>> Hi Nicolas, > >>>>> > >>>>> On Wed, Sep 25, 2019 at 12:32 AM Nicolas Cellier < > >>>>> [hidden email]> wrote: > >>>>> > >>>>>> > >>>>>> I noticed that many crash happen in testSendTimeout. > >>>>>> This test is organizing a race: > >>>>>> - Smalltalk fill the socket send buffer > >>>>>> - the OS tries to drain > >>>>>> > >>>>>> The other thing I noticed is that [SocketTest suite run] takes 7 to > >>>>>> 8s on macos while only 2 to 3s on ubuntu and windows > >>>>>> So it might be that some test times out on macos, while it doesn't on > >>>>>> other OS. > >>>>>> > >>>>>> This would explain that many crash also happen in JITted Timer loop, > >>>>>> and that we cannot observe it in other OSes. > >>>>>> > >>>>> > >>>>> Yes, that is, I think, an important clue. The other one is that, > >>>>> given that the crash always happens only early in a run, plus the address > >>>>> at which it crashes in machine code, suggests to me that the crash occurs > >>>>> when the interpreter invokes a block in the JITted startTimerLoop. > >>>>> > >>>>> What worries me is that the JITted code for startTimerLoop looks > >>>>> completely corrupted, not merely moved or overwritten (the method header > >>>>> looks OK up to the selector field, which is the last field in the header). > >>>>> So I'm wondering whether it could be a bug in support code overwriting > >>>>> machine code, and hence that not might be related to a 32-bit vs 64-bit > >>>>> code issue. I was hoping your recent sqUnixSocket.c changes might work but > >>>>> I'm sure they have no bearing. > >>>>> > >>>>> The access to OS ressources and the race may also explain some > >>>>>> randomization of the crash... > >>>>>> > >>>>>> So one idea would be to make the test timeout in linux too, see if we > >>>>>> can make it crash then try using rr. > >>>>>> Maybe it's possible with a huge Socket buffer, a smaller image side > >>>>>> buffer (we could reduce the size from 1000 to 1 so as to increase overhead). > >>>>>> > >>>>> > >>>>> Yes, that's a good idea. But I do think we need to set up a reliable > >>>>> remote server, som thing that acts like an infinite sink for receiving a > >>>>> par4ticular large Monticello package, so that ev very time the example runs > >>>>> the same package is indeed copied to a remote server. Would it be possible > >>>>> to construct such a fake Monticello server? > >>>>> > >>>>> What I'm wanting the server to do is > >>>>> a) take a while to respond to the Monticello package contents request > >>>>> b) to never include a particular package which the example would > >>>>> always try and upload > >>>>> > >>>>> The test case would be to copy a particular version of VMMaker from > >>>>> package-cache to the remote "fake" server > >>>>> > >>>>> > >>>>>> > >>>>>> It's just theories I tried to construct while biking, not in front of > >>>>>> a keyboard, and I won't be able to verify until this evening... > >>>>>> But the watchpoint strategy is too much vain shooting in the dark, we > >>>>>> need more ellaborated plans ;) > >>>>>> > >>>>> > >>>>> Does anyone see the crash on macOS in 32 bits? > >>>>> > >>>>> Le mar. 24 sept. 2019 ?? 00:28, Nicolas Cellier < > >>>>>> [hidden email]> a ??crit : > >>>>>> > >>>>>>> > >>>>>>> > >>>>>>> Le mar. 24 sept. 2019 ?? 00:25, Nicolas Cellier < > >>>>>>> [hidden email]> a ??crit : > >>>>>>> > >>>>>>>> Follow up: so we have a problem of repeatability... > >>>>>>>> I found nothing like rr for macos, and CPU watchpoint capability is > >>>>>>>> too restricted to be usefull. > >>>>>>>> > >>>>>>>> So what I did is save a crash.image attempting to crash at > >>>>>>>> resumption time: > >>>>>>>> Smalltalk saveAs: 'crash'. > >>>>>>>> Smalltalk snapshot: true andQuit: true. > >>>>>>>> 5 timesRepeat: [SocketTest suite run]. > >>>>>>>> > >>>>>>>> Then I launched the vm under lldb and gathered $pc stats to see if > >>>>>>>> I could observe some reproducibility > >>>>>>>> (lldb) p/x $pc - methodZoneBase > >>>>>>>> 0x00000000000a0e94 > >>>>>>>> 0x0000000000090f70 > >>>>>>>> 0x000000000009f824 > >>>>>>>> 0x000000000009bfc8 > >>>>>>>> 0x000000000009d030 > >>>>>>>> 0x000000000009c7c0 > >>>>>>>> 0x00000000000951e0 > >>>>>>>> 0x00000000000949c8 <- once > >>>>>>>> 0x0000000000094418 > >>>>>>>> 0x00000000000949c8 <??? twice > >>>>>>>> 0x000000000009b848 > >>>>>>>> 0x00000000000a0e94 <- twice > >>>>>>>> 0x000000000009a176 > >>>>>>>> 0x000000006dfff2e0 ??? > >>>>>>>> 0x00000000000996d1 > >>>>>>>> 0x00000000000949c8 <- thrice good candidate for watchpoint set > >>>>>>>> 0x000000000009ae70 > >>>>>>>> 0x000000000009ae70 > >>>>>>>> 0x00000000000a18e7 > >>>>>>>> 0x0000000000095470 > >>>>>>>> 0x000000000009c458 > >>>>>>>> 0x00000000000a0a7c > >>>>>>>> 0x0000000000093e89 > >>>>>>>> 0x000000000009beb0 > >>>>>>>> 0x0000000000095a37 > >>>>>>>> 0x0000000000095408 > >>>>>>>> 0x00000000000a06bc > >>>>>>>> 0x000000000009cd48 > >>>>>>>> 0x0000000000095408 <- twice > >>>>>>>> 0x000000000009ba41 > >>>>>>>> 0x0000000000095408 <- thrice new candidate for watchpoint set > >>>>>>>> 0x0000000000045258 <??? smallest ever > >>>>>>>> 0x000000000009c7c0 > >>>>>>>> 0x0000000000095408 <- fourth ! the watch point was set here, why > >>>>>>>> didn???t it stop??? argh methodZoneBase was incremented after I set the > >>>>>>>> watchpoint!!! > >>>>>>>> 0x00000000000a12dc > >>>>>>>> 0x00000000000951e0 > >>>>>>>> 0x00000000000a0e24 > >>>>>>>> 0x000000000009ff8c > >>>>>>>> 0x000000000009c7f6 > >>>>>>>> 0x000000000009a626 > >>>>>>>> > >>>>>>>> Then I have chosen to watch the most frequent crash address, > >>>>>>>> that is methodZoneBase+0x0000000000095408, > >>>>>>>> Unfortunately, s commented above, I missed one (I set the > >>>>>>>> watchpoint too early, before methodZoneBase took its definitive address). > >>>>>>>> > >>>>>>>> And after a few (a bunch of) trials, here is what I found: > >>>>>>>> > >>>>>>>> (lldb) run ../../image/crash.image > >>>>>>>> There is a running process, kill it and restart?: [Y/n] > >>>>>>>> Process 28175 exited with status = 9 (0x00000009) > >>>>>>>> Process 28185 launched: > >>>>>>>> '/Users/nicolas/Smalltalk/OpenSmalltalk/opensmalltalk-vm/build.macos64x64/squeak.cog.spur/Squeak.app/Contents/MacOS/Squeak' > >>>>>>>> (x86_64) > >>>>>>>> Squeak was compiled with optimization - stepping may behave oddly; > >>>>>>>> variables may not be available. > >>>>>>>> Process 28185 stopped > >>>>>>>> * thread #1, queue = 'com.apple.main-thread', stop reason = > >>>>>>>> breakpoint 6.1 > >>>>>>>> frame #0: 0x000000010007bc22 Squeak`-[sqSqueakOSXApplication > >>>>>>>> doHeadlessSetup](self=0x000000010066fc00, _cmd="doHeadlessSetup") at > >>>>>>>> sqSqueakOSXApplication.m:125 [opt] > >>>>>>>> 122 //No super call here as we've already set headless from > >>>>>>>> command line or info.plist > >>>>>>>> 123 extern BOOL gSqueakHeadless; > >>>>>>>> 124 // Notice that setActivationPolicy: is available in OSX > >>>>>>>> 10.6 and later > >>>>>>>> -> 125 if ([NSApp respondsToSelector:@selector(setActivationPolicy:)]) > >>>>>>>> { > >>>>>>>> 126 if (gSqueakHeadless) { > >>>>>>>> 127 [NSApp > >>>>>>>> setActivationPolicy:NSApplicationActivationPolicyProhibited]; > >>>>>>>> 128 } else { > >>>>>>>> Target 0: (Squeak) stopped. > >>>>>>>> (lldb) wa s e -s 1 -- methodZoneBase+0x0000000000095408 > >>>>>>>> Watchpoint created: Watchpoint 33: addr = 0x107c58138 size = 1 > >>>>>>>> state = enabled type = w > >>>>>>>> new value: -3689348814741910324 > >>>>>>>> (lldb) cont > >>>>>>>> Process 28185 resuming > >>>>>>>> > >>>>>>>> Watchpoint 33 hit: > >>>>>>>> old value: -3689348814741910324 > >>>>>>>> new value: -3689348814741910836 > >>>>>>>> Process 28185 stopped > >>>>>>>> * thread #1, queue = 'com.apple.main-thread', stop reason = > >>>>>>>> watchpoint 33 > >>>>>>>> frame #0: 0x00000001000731dc > >>>>>>>> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, > >>>>>>>> selector=4435118304) at cogitX64SysV.c:6320 [opt] > >>>>>>>> 6317 sqInt rawHeader; > >>>>>>>> 6318 > >>>>>>>> 6319 (method->cmType = CMMethod); > >>>>>>>> -> 6320 (method->objectHeader = nullHeaderForMachineCodeMethod()); > >>>>>>>> 6321 (method->blockSize = size); > >>>>>>>> 6322 (method->methodObject = methodObj); > >>>>>>>> 6323 > >>>>>>>> Target 0: (Squeak) stopped. > >>>>>>>> (lldb) > >>>>>>>> Process 28185 resuming > >>>>>>>> > >>>>>>>> Watchpoint 33 hit: > >>>>>>>> old value: -3689348814741910836 > >>>>>>>> new value: -3689571654825360894 > >>>>>>>> Process 28185 stopped > >>>>>>>> * thread #1, queue = 'com.apple.main-thread', stop reason = > >>>>>>>> watchpoint 33 > >>>>>>>> frame #0: 0x0000000100073255 > >>>>>>>> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, > >>>>>>>> selector=4435118304) at cogitX64SysV.c:6339 [opt] > >>>>>>>> 6336 (method->selector = selector); > >>>>>>>> 6337 (method->cmNumArgs = > >>>>>>>> argumentCountOfMethodHeader(methodHeader)); > >>>>>>>> 6338 (method->cmHasMovableLiteral = hasMovableLiteral); > >>>>>>>> -> 6339 if ((method->cmRefersToYoung = hasYoungReferent)) { > >>>>>>>> 6340 addToYoungReferrers(method); > >>>>>>>> 6341 } > >>>>>>>> 6342 (method->cmUsageCount = initialMethodUsageCount()); > >>>>>>>> Target 0: (Squeak) stopped. > >>>>>>>> (lldb) bt > >>>>>>>> * thread #1, queue = 'com.apple.main-thread', stop reason = > >>>>>>>> watchpoint 33 > >>>>>>>> * frame #0: 0x0000000100073255 > >>>>>>>> Squeak`fillInMethodHeadersizeselector(method=0x0000000107c58130, size=544, > >>>>>>>> selector=4435118304) at cogitX64SysV.c:6339 [opt] > >>>>>>>> frame #1: 0x000000010004f90c Squeak`compileCogMethod [inlined] > >>>>>>>> generateCogMethod(selector=<unavailable>) at cogitX64SysV.c:7098 [opt] > >>>>>>>> frame #2: 0x000000010004f7a5 > >>>>>>>> Squeak`compileCogMethod(selector=4435118304) at cogitX64SysV.c:27780 [opt] > >>>>>>>> frame #3: 0x000000010004bc54 > >>>>>>>> Squeak`cogselector(aMethodObj=4468481712, aSelectorOop=4435118304) at > >>>>>>>> cogitX64SysV.c:4833 [opt] > >>>>>>>> frame #4: 0x0000000100013d4d Squeak`primitiveClosureValue at > >>>>>>>> gcc3x-cointerp.c:27237 [opt] > >>>>>>>> frame #5: 0x0000000107bd1925 > >>>>>>>> frame #6: 0x000000010000236b Squeak`interpret at > >>>>>>>> gcc3x-cointerp.c:2767 [opt] > >>>>>>>> frame #7: 0x000000010008ed55 Squeak`-[sqSqueakMainApplication > >>>>>>>> runSqueak](self=0x000000010066fc00, _cmd=<unavailable>) at > >>>>>>>> sqSqueakMainApplication.m:201 [opt] > >>>>>>>> frame #8: 0x00007fff44b077b8 > >>>>>>>> Foundation`__NSFirePerformWithOrder + 360 > >>>>>>>> frame #9: 0x00007fff4297cf57 > >>>>>>>> CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ > >>>>>>>> + 23 > >>>>>>>> frame #10: 0x00007fff4297ce7f > >>>>>>>> CoreFoundation`__CFRunLoopDoObservers + 527 > >>>>>>>> frame #11: 0x00007fff4295f3f8 CoreFoundation`__CFRunLoopRun + > >>>>>>>> 1240 > >>>>>>>> frame #12: 0x00007fff4295ec93 > >>>>>>>> CoreFoundation`CFRunLoopRunSpecific + 483 > >>>>>>>> frame #13: 0x00007fff41c49d96 > >>>>>>>> HIToolbox`RunCurrentEventLoopInMode + 286 > >>>>>>>> frame #14: 0x00007fff41c49a0f HIToolbox`ReceiveNextEventCommon > >>>>>>>> + 366 > >>>>>>>> frame #15: 0x00007fff41c49884 > >>>>>>>> HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 64 > >>>>>>>> frame #16: 0x00007fff3fef9a73 AppKit`_DPSNextEvent + 2085 > >>>>>>>> frame #17: 0x00007fff4068fe34 AppKit`-[NSApplication(NSEvent) > >>>>>>>> _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 3044 > >>>>>>>> frame #18: 0x00007fff3feee885 AppKit`-[NSApplication run] + 764 > >>>>>>>> frame #19: 0x00007fff3febda72 AppKit`NSApplicationMain + 804 > >>>>>>>> frame #20: 0x00007fff6a88c015 libdyld.dylib`start + 1 > >>>>>>>> (lldb) > >>>>>>>> > >>>>>>>> Here I don't know how to proceed, because I have no knowledge of > >>>>>>>> this part of the VM, it's also quite late now... > >>>>>>>> Eliot, does it ring some bell? > >>>>>>>> > >>>>>>>> > >>>>>>> Ah, forget it, I thought it was overwriting, but it is the first > >>>>>>> write on yet virgin methodeZone... > >>>>>>> (it's just that the scenario was different from all the previous run) > >>>>>>> > >>>>>>> Le dim. 22 sept. 2019 ?? 11:21, Jan Vrany <[hidden email]> a > >>>>>>>> ??crit : > >>>>>>>> > >>>>>>>>> > >>>>>>>>> On 2019-09-22 10:57, Nicolas Cellier wrote: > >>>>>>>>> > Hi Jan, > >>>>>>>>> > Thanks, a recording debugger would indeed be a tremendous help! > >>>>>>>>> > Unfortunately, I'm unable to reproduce the bug under linux, only > >>>>>>>>> > macosx... > >>>>>>>>> > Do you know something equivalent to rr for mac? > >>>>>>>>> > >>>>>>>>> No, I don't. There might be something though, I never had a mac... > >>>>>>>>> > >>>>>>>>> Jan > >>>>>>>>> > >>>>>>>>> > > >>>>>>>>> > Le dim. 22 sept. 2019 ?? 09:55, Jan Vrany <[hidden email]> > >>>>>>>>> a > >>>>>>>>> > ??crit : > >>>>>>>>> > > >>>>>>>>> >> Hi, > >>>>>>>>> >> > >>>>>>>>> >>> > >>>>>>>>> >>> - can anyone think of any other strategies I might take to try > >>>>>>>>> and > >>>>>>>>> >>> reproduce this? > >>>>>>>>> >>> > >>>>>>>>> >> at the risk of stating the obvious, in cases similar to this rr > >>>>>>>>> (+ > >>>>>>>>> >> VDB) > >>>>>>>>> >> helped > >>>>>>>>> >> me many times. See https://www.youtube.com/watch?v=Qbu4ED8wPYE > >>>>>>>>> >> > >>>>>>>>> >> Never used with opensmalltalk VM, though. > >>>>>>>>> >> > >>>>>>>>> >> HTH, Jan > >>>>>>>>> > >>>>>>>> > >>>>> > >>>>> -- > >>>>> _,,,^..^,,,_ > >>>>> best, Eliot > >>>>> > >>>> |
In reply to this post by Nicolas Cellier
Hi Nicolas, On Thu, Oct 3, 2019 at 1:10 PM Nicolas Cellier <[hidden email]> wrote:
It looks to me like the right cause. The corruption I was seeing in the methodSZOne doesn't;t look like a bug in the JIT; it looks like overwriting. I had thought we were using sigaltstack to ensure the signal handler runs on a dedicated stack, but it appears we don't. If I'm right that we're not using sigaltstack then this may be very easy to fix :-) [I have to mention ALSA and its awful SIGIO sign al handling code; see platforms/unix//vm-sound-ALSA/sqUnixSoundALSA.c. I don't think this affects the Mac, but it could be an issue on Linux]
_,,,^..^,,,_ best, Eliot |
In reply to this post by Nicolas Cellier
On Thu, Oct 3, 2019 at 1:47 PM Nicolas Cellier <[hidden email]> wrote:
That's the right solution as far as I understand these issues.
Right, I will do so today.
_,,,^..^,,,_ best, Eliot |
Free forum by Nabble | Edit this page |