On Wed, Oct 9, 2019 at 2:00 PM Nicolas Cellier < [hidden email]> wrote: Hi Eliot, Yesterday I thought I would opt for 4) because it's TSTTCPW. Of course it's very hackish and trampoline specific. But loading stack and frame pointer is very specific.
We are of the same mind. When I look at the code it is currently always safe to do this. So implementing #4 and documenting the assumption properly is, I think, enough. We can then revert making sigaltstack installation the default for a Cog VM and test properly.
P.S. Here's how I search for all occurrences:
| bindings |
bindings := { Cogit bindingOf: #FPReg. Cogit bindingOf: #SPReg }.
self systemNavigation browseAllSelect: [:m|
	m scanForInstructionSequence: [:a :b|
		a selector == #pushLiteralVariable:
			and: [(bindings includes: a argument)
			and: [b selector = #send:super:numArgs:
			and: [b argument == #MoveAw:R:]]]]]
Searching for MoveR:Aw: isn't nearly as easy :-(.
Hi Nicolas, Clément, et al,
On Tue, Oct 8, 2019 at 1:40 PM Nicolas Cellier < [hidden email]> wrote: More on the problem that Eliot is speaking about: it can happen in these conditions: if SIGIO is delivered while executing any trampoline transiting from SmalltalkToCStackSwitch. What happens here?
We must load the 64-bit contents of memory (cStackPointerAddress) into the stack pointer register (SPReg = $rsp) - see genLoadCStackPointer. This is done using the CogRTLOpcode abstract instruction MoveAwR. But we have no matching single instruction in x86-64... We can only load a 64-bit memory content from an absolute address into $rax.
So the idea is to generate this sequence (CogX64Compiler>>concretizeMoveAwR):

    xchgq   %rsp, %rax
    movabsq 0x10027c338, %rax   ; cStackPointerAddress
    xchgq   %rsp, %rax

That's clever because it preserves $rax, which could be in use when we want to MoveAwR.
Clever, but in the case of %rsp and %rbp, dead wrong :blush:. The constraint is that x86_64 only provides a 64-bit load from an absolute address into %rax; no other register can be used. But the sequence has an unfortunate side effect: between the two exchanges the stack pointer holds the contents of $rax and can thus temporarily point anywhere.
Right; which is unacceptable. It's a tiny window, but one we hit all the time. What happens when performing Squeak SocketTest is that we use some trampolines (I guess for invoking primitives, for example) and we generate SIGIO (for some reason, a lot of SIGIOs are generated on my particular macOS machine, so I can trigger the bug more easily than Eliot).
We previously installed a handler for SIGIO via signal(). With signal(), the handler runs on the interrupted thread's current stack, i.e. it shares the stack pointer with the user program.
If the signal is delivered in between the two xchgq instructions above, the signal handler will then use a corrupted stack pointer pointing anywhere (depending on the contents of $rax). When the VM enters the signal handler function, it uses the stack pointer to save some state, and so corrupts a memory zone, segfaults, or whatever. In the case described by the opensmalltalk vm-dev thread, $rax was pointing to the generated code zone (the JITted methodZone), so we corrupted the generated code and soon got punished for that. But it's probable that there are other (rare) occurrences of this bug.
Not sure if it is causing the bugs described by Sean, but it's important to use the fix from Eliot ASAP and retry. There might be other occurrences of signal(SIGIO,forceInterruptCheck) in the minheadless flavour; I did not check whether Eliot also corrected it. If not, it should also be corrected ASAP, as should every usage of signal() be replaced by usage of sigaction() with appropriate flags to use sigaltstack() - see Eliot's commit details.
I want to discuss potential fixes with Clément and Nicolas, and anyone else interested. So...
1. the straightforward fix is to generate different code for setting %rsp and %rbp. I shall do this very soon. On x86_64 we dedicate a register to code generation purposes, called RISCTempReg, which is either %r8 (SysV ABI (unix)) or %r11 (Windows). RISCTempReg is never assumed to be live except within a single instruction sequence. We can swap with this before assigning to %rsp, e.g.

    xchgq   %r8, %rax
    movabsq 0x10027c338, %rax   ; cStackPointerAddress
    xchgq   %r8, %rax
    movq    %r8, %rsp

this adds a couple of bytes, but is reliable.
2. one could imagine exchanging TempReg and RISCTempReg, i.e. TempReg would be either %r8 or %r11, and RISCTempReg would be %rax. That would allow

    movabsq 0x10027c338, %rax   ; cStackPointerAddress
    movq    %rax, %rsp
this at least worked on Ryan's MIPS32 back end when it was in use (no one has tested it in a while AFAIA), where TempReg is S5 (r21), RISCTempReg is AT (r1) and CResultRegister is V0 (r2). So this is worth investigating. [I had tried something similar with HPS and it completely broke the code generator, so my gut is trying to tell me this will never work; Ryan has proved otherwise].
3. I would much rather implement some form of tracking whether a particular register is live or not. A trivial implementation would track only TempReg, and only its being not live up until the first assignment to it. A more sophisticated approach could deal with control-flow branching and merging of the liveness info. I like the trivial approach for now.
4. we could simply treat assignments to SPReg and FPReg specially, knowing that these are done only in trampolines and enilopmarts (the pieces of code that sit between JITted code and the rest of the run-time (interpreter, memory manager, primitives, etc), either to call the run-time from JITted code or, as in this case, to enter machine code from the run-time). I suppose this is OK as long as we document it. This would allow us to generate what we would generate for #2 above, i.e.

    movabsq 0x10027c338, %rax   ; cStackPointerAddress
    movq    %rax, %rsp
N. are there better ways?
Clément, Nicolas (et al), what do you think?
Hi Sean, Hi All,
This issue is characterized by the system crashing soon after start-up when some significant I/O is done, typically either to files or sockets. It affects macOS only and may indeed affect only 64-bits. We have strong evidence that it is caused by the dynamic linker being invoked in the signal handler for SIGIO when the signal is delivered while the VM is executing JITted code. The symptom that causes the crash is corruption of a particular JITted method's machine code, e.g. Delay class>>#startEventLoop, and we believe that the corruption is caused by the linker when it misinterprets a JITted Smalltalk stack frame as an ABI-compliant stack frame and attempts to scan code to link it.
Our diagnosis is speculative; this is extremely hard to reproduce. Typically in repeating a crashing run SIGIO may no longer be delivered at the same point because any remote server has now woken up and delivers results sooner, etc. However, Nicolas Cellier and I are both confident that we have correctly identified the bug.
The fix is simple; SIGIO should be delivered on a dedicated signal stack (see sigaltstack(2)). I committed a fix yesterday evening and we should see within a week or so if these crashes have disappeared.
Eliot _,,,^..^,,,_ (phone)

Segmentation fault Thu Oct 3 15:52:33 2019
VM: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git
Date: Sat Jan 5 20:00:11 2019 CommitHash: 7a3c6b6
Plugins: 201901051900 https://github.com/OpenSmalltalk/opensmalltalk-vm.git

C stack backtrace & registers:
    rax 0x0000000124380000 rbx 0x00007ffeebd00050 rcx 0x0000000000468260 rdx 0x0000000000dd6800
    rdi 0x0000000124cee5a0 rsi 0x0000000124cee5a0 rbp 0x00007ffeebcffe50 rsp 0x00007ffeebcffe50
    r8  0x00007fff3f2cefe5 r9  0x0000000000000b00 r10 0x0000000000006000 r11 0xfffffffffcd8d5a0
    r12 0x0000000000000002 r13 0x0000000035800000 r14 0x00007ffeebd00064 r15 0x0000000000002800
    rip 0x00007fff630f7d09
0   libsystem_platform.dylib      0x00007fff630f7d09 _platform_memmove$VARIANT$Haswell + 41
1   Pharo                         0x0000000103f52642 reportStackState + 952
2   Pharo                         0x0000000103f52987 sigsegv + 174
3   libsystem_platform.dylib      0x00007fff630fab3d _sigtramp + 29
4   ???                           0x0000058900000a00 0x0 + 6085968660992
5   libGLImage.dylib              0x00007fff3f2ce29e glgProcessPixelsWithProcessor + 2149
6   AMDRadeonX5000GLDriver        0x000000010db16db1 glrATIStoreLevels + 1600
7   AMDRadeonX5000GLDriver        0x000000010db52c83 glrAMD_GFX9_LoadSysTextureStandard + 45
8   AMDRadeonX5000GLDriver        0x000000010db519bb glrUpdateTexture + 1346
9   libGPUSupportMercury.dylib    0x00007fff5181279d gpusLoadCurrentTextures + 591
10  AMDRadeonX5000GLDriver        0x000000010db5a099 gldUpdateDispatch + 397
11  GLEngine                      0x00007fff3ff72078 gleDoDrawDispatchCore + 629
12  GLEngine                      0x00007fff3ff16369 glDrawArraysInstanced_STD_Exec + 264
13  GLEngine                      0x00007fff3ff1625a glDrawArrays_UnpackThread + 40
14  GLEngine                      0x00007fff3ff6dce1 gleCmdProcessor + 77
15  libdispatch.dylib             0x00007fff62ec2dcf _dispatch_client_callout + 8
16  libdispatch.dylib             0x00007fff62ecea2c _dispatch_lane_barrier_sync_invoke_and_complete + 60
17  GLEngine                      0x00007fff3fec4b85 glFlush_ExecThread + 15
18  Pharo                         0x0000000103f4cc62 -[sqSqueakOSXOpenGLView drawRect:flush:] + 314
19  Pharo                         0x0000000103f4cb22 -...

Smalltalk stack dump:
0x7ffeebd14238 M DelaySemaphoreScheduler>unscheduleAtTimingPriority 0x10fab3ad0: a(n) DelaySemaphoreScheduler
0x7ffeebd14270 M [] in DelaySemaphoreScheduler(DelayBasicScheduler)>runBackendLoopAtTimingPriority 0x10fab3ad0: a(n) DelaySemaphoreScheduler
0x1125923f8 s BlockClosure>ensure:
0x111e88d30 s DelaySemaphoreScheduler(DelayBasicScheduler)>runBackendLoopAtTimingPriority
0x112590a50 s [] in DelaySemaphoreScheduler(DelayBasicScheduler)>startTimerEventLoopPriority:
0x111e88e08 s [] in BlockClosure>newProcess

Most recent primitives
actualScreenSize
millisecondClockValue
tempAt:

Cheers,
Sean
--
Sent from: http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html