Monticello / OS deadlock ?

Monticello / OS deadlock ?

Jose San Leandro
Hi,

In one of our projects we are using Pharo 4. The image gets built by Gradle, which loads the Metacello project. Sometimes the build process hangs; it just doesn't progress.

When adding local gitfiletree:// dependencies manually through Monticello, Pharo freezes after a while. It's not always the same repository, and not always the same number of repositories, before it hangs.

I launched the image with strace and attached gdb to the frozen process.
It turns out it's waiting for a lock that never gets released.

The environment is a 64-bit Gentoo Linux with enough of everything (multiple monitors, multiple cores, enough RAM).

I hope somebody can point me at how to dig deeper into this.

# gdb
(gdb) attach [pid]
[..]
Reading symbols from /usr/lib32/libbz2.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib32/libbz2.so.1
0x0809d8bb in signalSemaphoreWithIndex ()
(gdb) backtrace
#0  0x0809d8bb in signalSemaphoreWithIndex ()
#1  0x0810868c in handleSignal ()
#2  <signal handler called>
#3  0x0809d8c8 in signalSemaphoreWithIndex ()
#4  0x0809f0af in aioPoll ()
#5  0xf76f9671 in display_ioRelinquishProcessorForMicroseconds () from /home/chous/realhome/toolbox/pharo-5.0/pharo-vm/vm-display-X11
#6  0x080a1887 in ioRelinquishProcessorForMicroseconds ()
#7  0x080767fa in primitiveRelinquishProcessor ()
#8  0xb6fc838c in ?? ()
#9  0xb6fc3700 in ?? ()
#10 0xb7952882 in ?? ()
#11 0xb6fc3648 in ?? ()
(gdb) disassemble
Dump of assembler code for function handleSignal:
   0x081085e0 <+0>:     sub    $0x9c,%esp
   0x081085e6 <+6>:     mov    %ebx,0x90(%esp)
   0x081085ed <+13>:    mov    0xa0(%esp),%ebx
   0x081085f4 <+20>:    mov    %esi,0x94(%esp)
   0x081085fb <+27>:    mov    %edi,0x98(%esp)
   0x08108602 <+34>:    movzbl 0x8168420(%ebx),%esi
   0x08108609 <+41>:    mov    %ebx,%eax
   0x0810860b <+43>:    mov    %esi,%edx
   0x0810860d <+45>:    call   0x81070d0 <forwardSignaltoSemaphoreAt>
   0x08108612 <+50>:    call   0x805aae0 <pthread_self@plt>
   0x08108617 <+55>:    mov    0x8168598,%edi
   0x0810861d <+61>:    cmp    %edi,%eax
   0x0810861f <+63>:    je     0x8108680 <handleSignal+160>
   0x08108621 <+65>:    lea    0x10(%esp),%esi
   0x08108625 <+69>:    mov    %esi,(%esp)
   0x08108628 <+72>:    call   0x805b330 <sigemptyset@plt>
   0x0810862d <+77>:    mov    %ebx,0x4(%esp)
   0x08108631 <+81>:    mov    %esi,(%esp)
   0x08108634 <+84>:    call   0x805b0c0 <sigaddset@plt>
   0x08108639 <+89>:    movl   $0x0,0x8(%esp)
   0x08108641 <+97>:    mov    %esi,0x4(%esp)
   0x08108645 <+101>:   movl   $0x0,(%esp)
   0x0810864c <+108>:   call   0x805ada0 <pthread_sigmask@plt>
   0x08108651 <+113>:   mov    %ebx,0x4(%esp)
   0x08108655 <+117>:   mov    %edi,(%esp)
   0x08108658 <+120>:   call   0x805b240 <pthread_kill@plt>
   0x0810865d <+125>:   mov    0x90(%esp),%ebx
   0x08108664 <+132>:   mov    0x94(%esp),%esi
   0x0810866b <+139>:   mov    0x98(%esp),%edi
   0x08108672 <+146>:   add    $0x9c,%esp
   0x08108678 <+152>:   ret
   0x08108679 <+153>:   lea    0x0(%esi,%eiz,1),%esi
   0x08108680 <+160>:   test   %esi,%esi
   0x08108682 <+162>:   je     0x810865d <handleSignal+125>
   0x08108684 <+164>:   mov    %esi,(%esp)
   0x08108687 <+167>:   call   0x809d8a0 <signalSemaphoreWithIndex>
=> 0x0810868c <+172>:   jmp    0x810865d <handleSignal+125>
End of assembler dump.
(gdb) up 3
(gdb) disassemble
Dump of assembler code for function signalSemaphoreWithIndex:
   0x0809d8a0 <+0>:     push   %esi
   0x0809d8a1 <+1>:     xor    %eax,%eax
   0x0809d8a3 <+3>:     push   %ebx
   0x0809d8a4 <+4>:     sub    $0x24,%esp
   0x0809d8a7 <+7>:     mov    0x30(%esp),%esi
   0x0809d8ab <+11>:    test   %esi,%esi
   0x0809d8ad <+13>:    jle    0x809d918 <signalSemaphoreWithIndex+120>
   0x0809d8af <+15>:    mov    $0x1,%edx
   0x0809d8b4 <+20>:    lea    0x0(%esi,%eiz,1),%esi
   0x0809d8b8 <+24>:    mfence
   0x0809d8bb <+27>:    mov    $0x0,%eax
   0x0809d8c0 <+32>:    lock cmpxchg %edx,0x8152d80
=> 0x0809d8c8 <+40>:    mov    %eax,0x1c(%esp)
   0x0809d8cc <+44>:    mov    0x1c(%esp),%eax
   0x0809d8d0 <+48>:    test   %eax,%eax
   0x0809d8d2 <+50>:    jne    0x809d8b8 <signalSemaphoreWithIndex+24>
   0x0809d8d4 <+52>:    mov    0x8152d84,%edx
   0x0809d8da <+58>:    cmp    $0x1ff,%edx
   0x0809d8e0 <+64>:    lea    0x1(%edx),%ebx
   0x0809d8e3 <+67>:    cmove  %eax,%ebx
   0x0809d8e6 <+70>:    mov    0x8152d88,%eax
   0x0809d8eb <+75>:    cmp    %ebx,%eax
   0x0809d8ed <+77>:    je     0x809d920 <signalSemaphoreWithIndex+128>
   0x0809d8ef <+79>:    mov    0x8152d84,%eax
   0x0809d8f4 <+84>:    mov    %esi,0x8152da0(,%eax,4)
   0x0809d8fb <+91>:    mfence
   0x0809d8fe <+94>:    mov    %ebx,0x8152d84
   0x0809d904 <+100>:   movl   $0x0,0x8152d80
   0x0809d90e <+110>:   call   0x807c2c0 <forceInterruptCheck>
   0x0809d913 <+115>:   mov    $0x1,%eax
   0x0809d918 <+120>:   add    $0x24,%esp
   0x0809d91b <+123>:   pop    %ebx
   0x0809d91c <+124>:   pop    %esi
   0x0809d91d <+125>:   ret
   0x0809d91e <+126>:   xchg   %ax,%ax
   0x0809d920 <+128>:   movl   $0x810c888,(%esp)
   0x0809d927 <+135>:   movl   $0x0,0x8152d80
   0x0809d931 <+145>:   call   0x80a3720 <error>
   0x0809d936 <+150>:   jmp    0x809d8ef <signalSemaphoreWithIndex+79>
End of assembler dump.

Meanwhile, strace gets frozen showing this:
[..]
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f63665cd9d0) = 3736
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x42a8a0, [], SA_RESTORER, 0x7f6365ba3ad0}, {SIG_DFL, [], SA_RESTORER, 0x7f6365ba3ad0}, 8) = 0
wait4(-1, 0x7ffc4ef7f7e8, 0, NULL)      = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
wait4(-1,

Re: Monticello / OS deadlock ?

Thierry Goubier
Hi Jose,

Yes, I've noticed that as well. At one point it was drastic (i.e. it would almost always lock up) on my work development laptop; it now happens far less often (but it does still happen to me from time to time).

Dave Lewis, the author of OSProcess, fixed one issue which solved most of the lockups I had, but not all of them. The lockup is in the interaction between OSProcess inside Pharo and the external shell command (i.e. it concerns anything that uses OSProcess), and looks like a missed signal. It is also machine- and Linux-version-dependent (Ubuntu 14.10 was horrible; 14.04 and 15.04 on the same hardware are far less sensitive), and seems to also depend on the load of the machine itself.

By the way, which version of OSProcess are you using?

Thierry

2015-06-02 11:10 GMT+02:00 Jose San Leandro <[hidden email]>:
[quoted text trimmed]


Re: Monticello / OS deadlock ?

Jose San Leandro
Hi Thierry,

ConfigurationOfOSProcess-ThierryGoubier.38.mcz, which corresponds to version 4.6.2.

Another workaround that would work for me would be the ability to "resume" a previous load attempt of a Metacello project, or a custom "hook" in Metacello to save the image after every successfully loaded dependency.


2015-06-02 11:25 GMT+02:00 Thierry Goubier <[hidden email]>:
[quoted text trimmed]



Re: Monticello / OS deadlock ?

Thierry Goubier


2015-06-02 12:14 GMT+02:00 Jose San Leandro <[hidden email]>:
> Hi Thierry,
>
> ConfigurationOfOSProcess-ThierryGoubier.38.mcz, which corresponds to version 4.6.2.

Ok, then this is the latest.

> Another workaround that would work for me is to be able to "resume" a previous load attempt of a Metacello project. Or a custom "hook" in Metacello to save the image after every dependency is successfully loaded.

Yes, this would work. I'll ask Dave again if he has any idea; the bug is hard to reproduce.

Would you mind telling me the Linux kernel / libc version of your Gentoo box?

Thierry
 






Re: Monticello / OS deadlock ?

Jose San Leandro
No problem, of course.

It's a dual-core machine running a custom 4.0.4-hardened-r2 kernel with the hardened/linux/amd64/selinux profile (but in permissive mode), glibc 2.20-r2, and the multilib and selinux USE flags active.

I can provide more information if that helps, of course, even ssh access to a Docker container running on it, though I fear it won't support X.

Thanks!

2015-06-02 14:34 GMT+02:00 Thierry Goubier <[hidden email]>:
[quoted text trimmed]
   0x0809d8ef <+79>:    mov    0x8152d84,%eax
   0x0809d8f4 <+84>:    mov    %esi,0x8152da0(,%eax,4)
   0x0809d8fb <+91>:    mfence
   0x0809d8fe <+94>:    mov    %ebx,0x8152d84
   0x0809d904 <+100>:   movl   $0x0,0x8152d80
   0x0809d90e <+110>:   call   0x807c2c0 <forceInterruptCheck>
   0x0809d913 <+115>:   mov    $0x1,%eax
   0x0809d918 <+120>:   add    $0x24,%esp
   0x0809d91b <+123>:   pop    %ebx
   0x0809d91c <+124>:   pop    %esi
   0x0809d91d <+125>:   ret
   0x0809d91e <+126>:   xchg   %ax,%ax
   0x0809d920 <+128>:   movl   $0x810c888,(%esp)
   0x0809d927 <+135>:   movl   $0x0,0x8152d80
   0x0809d931 <+145>:   call   0x80a3720 <error>
   0x0809d936 <+150>:   jmp    0x809d8ef <signalSemaphoreWithIndex+79>
End of assembler dump.

Meanwhile, strace is frozen, showing this:
[..]
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f63665cd9d0) = 3736
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x42a8a0, [], SA_RESTORER, 0x7f6365ba3ad0}, {SIG_DFL, [], SA_RESTORER, 0x7f6365ba3ad0}, 8) = 0
wait4(-1, 0x7ffc4ef7f7e8, 0, NULL)      = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
wait4(-1,





Re: Monticello / OS deadlock ?

Thierry Goubier


2015-06-02 15:03 GMT+02:00 Jose San Leandro <[hidden email]>:
No problem, of course.

It's a dual-core running a custom 4.0.4-hardened-r2 kernel, hardened/linux/amd64/selinux profile (but in permissive mode), glibc version 2.20-r2, with multilib and selinux USE flags active.

I can provide more information if that helps, of course, or even ssh access to a Docker container running on it, though I fear it won't support X.

When the Pharo process gets locked, can you do a kill -SIGUSR1 on the Pharo process and look at the output? It will give the status inside the VM.

Thierry
 

Thanks!

2015-06-02 14:34 GMT+02:00 Thierry Goubier <[hidden email]>:


2015-06-02 12:14 GMT+02:00 Jose San Leandro <[hidden email]>:
Hi Thierry,

ConfigurationOfOSProcess-ThierryGoubier.38.mcz, which corresponds to version 4.6.2.

Ok, then this is the latest.
 

Another workaround that would work for me is to be able to "resume" a previous load attempt of a Metacello project. Or a custom "hook" in Metacello to save the image after every dependency is successfully loaded.

Yes, this would work. I'll ask Dave again whether he has any idea; the bug is hard to reproduce.

Would you mind telling the linux kernel / libc version of your gentoo box?

Thierry
 


2015-06-02 11:25 GMT+02:00 Thierry Goubier <[hidden email]>:
Hi Jose,

yes, I've noticed that as well. It was, at one point, drastic (i.e. an almost guaranteed lock-up) on my work development laptop; it now happens far less often (but it does still happen to me from time to time).

Dave Lewis, the author of OSProcess, fixed one issue which solved most of the lockups I had, but not all of them. The lockup is in the interaction between OSProcess inside Pharo and the external shell command (i.e. it concerns anything which uses OSProcess), and it seems to involve a missed signal. It is also machine- and Linux-version-dependent (Ubuntu 14.10 was horrible; 14.04 and 15.04 on the same hardware are far less sensitive), and it seems to also depend on the load of the machine itself.

By the way, which version of OSProcess are you using?

Thierry


2015-06-02 11:10 GMT+02:00 Jose San Leandro <[hidden email]>:
[..]






Re: Monticello / OS deadlock ?

David T. Lewis
In reply to this post by Thierry Goubier
On Tue, Jun 02, 2015 at 02:34:49PM +0200, Thierry Goubier wrote:

> 2015-06-02 12:14 GMT+02:00 Jose San Leandro <[hidden email]>:
>
> > Hi Thierry,
> >
> > ConfigurationOfOSProcess-ThierryGoubier.38.mcz, which corresponds to
> > version 4.6.2.
> >
>
> Ok, then this is the latest.
>
>
> >
> > Another workaround that would work for me is to be able to "resume" a
> > previous load attempt of a Metacello project. Or a custom "hook" in
> > Metacello to save the image after every dependency is successfully loaded.
> >
>
> Yes, this would work. I'll ask again Dave if he has any idea; the bug is
> hard to reproduce.


Hi Thierry and Jose,

I am reading this thread with interest and will help if I can.

I do have one idea that we have not tried before. I have a theory that this may
be an intermittent problem caused by SIGCHLD signals (from the external OS process
when it exits) being missed by the UnixOSProcessAccessor>>grimReaperProcess
that handles them.

If this is happening, then I may be able to change grimReaperProcess to
work around the problem.

When you see the OS deadlock condition, are you able to tell whether your Pharo VM
process has subprocesses in the zombie state (indicating that grimReaperProcess
did not clean them up)? The unix command "ps -axf | less" will let you look
at the process tree, and that may give us a clue as to whether this is happening.
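[Editor's note] Dave's missed-SIGCHLD theory is easy to demonstrate outside Pharo. The sketch below is plain Python on Linux, purely illustrative (it is not OSProcess code): a child that has exited but was never reaped shows up in exactly the zombie state described above, and a non-blocking `waitpid(-1, WNOHANG)` loop reaps such children even if the SIGCHLD that announced them was lost.

```python
import os
import time

def reap_children_nonblocking():
    """Reap every exited child without blocking (what a robust reaper would do)."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break          # no children left at all
        if pid == 0:
            break          # children exist, but none has exited yet
        reaped.append(pid)
    return reaped

if __name__ == "__main__":
    child = os.fork()
    if child == 0:
        os._exit(0)                      # child exits immediately
    time.sleep(0.2)                      # child has exited; nobody reaped it yet
    with open("/proc/%d/stat" % child) as f:
        # field 3 of /proc/<pid>/stat is the process state; split after the
        # parenthesised comm field, which may itself contain spaces
        state = f.read().rsplit(")", 1)[1].split()[0]
    print("child state before reaping:", state)    # 'Z' (zombie) on Linux
    print("reaped:", reap_children_nonblocking())
```

A reaper built on `WNOHANG` polling does not depend on signal delivery at all, which is the essence of the workaround Dave sketches for grimReaperProcess.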

Thanks!

Dave



> [..]


Re: Monticello / OS deadlock ?

Thierry Goubier
Hi Dave,

On 03/06/2015 03:15, David T. Lewis wrote:

> Hi Thierry and Jose,
>
> I am reading this thread with interest and will help if I can.
>
> I do have one idea that we have not tried before. I have a theory that this may
> be an intermittent problem caused by SIGCHLD signals (from the external OS process
> when it exits) being missed by the UnixOSProcessAccessor>>grimReaperProcess
> that handles them.
>
> If this is happening, then I may be able to change grimReaperProcess to
> work around the problem.
>
> When you see the OS deadlock condition, are you able tell if your Pharo VM
> process has subprocesses in the zombie state (indicating that grimReaperProcess
> did not clean them up)? The unix command "ps -axf | less" will let you look
> at the process tree and that may give us a clue if this is happening.

I found it very easy to reproduce, and I do have a zombie child
process under the pharo process.

Interestingly enough, the lock-up happens in a very specific place: a call
to git branch, which is a very short command returning just a few
characters (whereas all the other commands have longer outputs). Reducing the
frequency of the calls to git branch with a bit of caching reduces the
chances of a lock-up.
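[Editor's note] The caching workaround can be sketched generically; this is an illustrative Python memoization of an external command's output, not the actual GitFileTree code (the `echo` command here is just a stand-in for `git branch`):

```python
import functools
import subprocess

@functools.lru_cache(maxsize=None)
def cached_command_output(*cmd):
    """Spawn cmd once per distinct argument tuple and cache its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# The first call spawns a process; the identical second call hits the cache,
# so the external command runs only once.
first = cached_command_output("echo", "hello")
second = cached_command_output("echo", "hello")
assert first == second == "hello\n"
assert cached_command_output.cache_info().hits == 1
```

Fewer spawns means fewer SIGCHLD deliveries to miss, which matches the observation that caching only reduces the odds of the lock-up rather than eliminating it.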

Thanks,

Dave

> Thanks!
>
> Dave
>
>
>



Re: Monticello / OS deadlock ?

Ben Coman
On Wed, Jun 3, 2015 at 1:05 PM, Thierry Goubier
<[hidden email]> wrote:

> Hi Dave,
>
> On 03/06/2015 03:15, David T. Lewis wrote:
>>
>> Hi Thierry and Jose,
>>
>> I am reading this thread with interest and will help if I can.
>>
>> I do have one idea that we have not tried before. I have a theory that
>> this may
>> be an intermittent problem caused by SIGCHLD signals (from the external OS
>> process
>> when it exits) being missed by the
>> UnixOSProcessAccessor>>grimReaperProcess
>> that handles them.
>>
>> If this is happening, then I may be able to change grimReaperProcess to
>> work around the problem.
>>
>> When you see the OS deadlock condition, are you able tell if your Pharo VM
>> process has subprocesses in the zombie state (indicating that
>> grimReaperProcess
>> did not clean them up)? The unix command "ps -axf | less" will let you
>> look
>> at the process tree and that may give us a clue if this is happening.
>
>
> I found it very easy to reproduce and I do have a zombie children process to
> the pharo process.
>
> Interesting enough, the lock-up happens in a very specific place, a call to
> git branch, which is a very short command returning just a few characters
> (where all other commands have longuer outputs). Reducing the frequency of
> the calls to git branch by a bit of caching reduces the chances of a
> lock-up.

As a workaround and for investigation, can you wrap the "git branch" call in a
script and experiment with extending its running time?

    #!/bin/sh
    # e.g. saved as /usr/local/mygitbranch; forwards all arguments to git branch
    git branch "$@"
    STATUS=$?
    # sleep 1    # uncomment to experiment with a delay
    exit $STATUS

http://www.tldp.org/LDP/abs/html/exit-status.html

http://stackoverflow.com/questions/18492443/pass-all-parameters-of-one-shell-script-to-another

cheers -ben


Re: Monticello / OS deadlock ?

Jose San Leandro
In reply to this post by Thierry Goubier
Hi Dave, Thierry,

Here's what I get in all recent attempts:

[..]
/2.3/lib/gradle-launcher-2.3.jar org.gradle.launcher.GradleMain assemble
18620 pts/5    S+     0:00  |       \_ bash /home/chous/toolbox/pharo/pharo Pharo.image config gitfiletree:///home/chous/osoco/open-badges/game-core ConfigurationOfGameCore --install=bleedingEdge
18635 pts/5    R+     5:07  |           \_ /home/chous/toolbox/pharo/pharo-vm/pharo --nodisplay Pharo.image config gitfiletree:///home/chous/osoco/open-badges/game-core ConfigurationOfGameCore --install=bleedingEdge
32741 pts/5    Z+     0:00  |               \_ [git] <defunct>




2015-06-03 7:05 GMT+02:00 Thierry Goubier <[hidden email]>:
Hi Dave,

On 03/06/2015 03:15, David T. Lewis wrote:
Hi Thierry and Jose,

I am reading this thread with interest and will help if I can.

I do have one idea that we have not tried before. I have a theory that this may
be an intermittent problem caused by SIGCHLD signals (from the external OS process
when it exits) being missed by the UnixOSProcessAccessor>>grimReaperProcess
that handles them.

If this is happening, then I may be able to change grimReaperProcess to
work around the problem.

When you see the OS deadlock condition, are you able tell if your Pharo VM
process has subprocesses in the zombie state (indicating that grimReaperProcess
did not clean them up)? The unix command "ps -axf | less" will let you look
at the process tree and that may give us a clue if this is happening.

I found it very easy to reproduce and I do have a zombie children process to the pharo process.

Interesting enough, the lock-up happens in a very specific place, a call to git branch, which is a very short command returning just a few characters (where all other commands have longuer outputs). Reducing the frequency of the calls to git branch by a bit of caching reduces the chances of a lock-up.

Thanks,

Dave

Thanks!

Dave







Re: Monticello / OS deadlock ?

Jose San Leandro
Sending the SIGUSR1 signal prints this:

0xb84e2a7c s NonInteractiveUIManager(UIManager)>defer:
0xb84e29f4 s PharoCommandLineHandler class>activateWith:
0xb84f5e08 s [] in BasicCommandLineHandler>activateSubCommand:
0xb84e2e98 s BlockClosure>on:do:
0xb84e2970 s BasicCommandLineHandler>activateSubCommand:
0xb84e2914 s BasicCommandLineHandler>handleSubcommand
0xb84f5e64 s BasicCommandLineHandler>handleArgument:
0xb84e281c s [] in BasicCommandLineHandler>activate
0xb84e2878 s BlockClosure>on:do:
0xb84e27a0 s BasicCommandLineHandler>activate
0xb84f5cf4 s [] in BasicCommandLineHandler class>startUp:
0xb84f5d50 s BlockClosure>cull:
0xb84f5dac s [] in SmalltalkImage>executeDeferredStartupActions:
0xb84e2ef4 s BlockClosure>on:do:
0xb84e0ee4 s SmalltalkImage>logStartUpErrorDuring:into:tryDebugger:
0xb84e0e1c s SmalltalkImage>executeDeferredStartupActions:
0xb84e0c1c s SmalltalkImage>startupImage:snapshotWorked:
0xb84e0170 s SmalltalkImage>snapshot:andQuit:
0xb84e04ac s [] in WorldState class>saveAndQuit
0xb84e0508 s BlockClosure>ensure:
0xb84d3274 s CursorWithMask(Cursor)>showWhile:
0xb84d3208 s WorldState class>saveAndQuit
0xb84e0564 s [] in ToggleMenuItemMorph(MenuItemMorph)>invokeWithEvent:
0xb84e05c0 s BlockClosure>ensure:
0xb84d3198 s CursorWithMask(Cursor)>showWhile:
0xb84d30c8 s ToggleMenuItemMorph(MenuItemMorph)>invokeWithEvent:
0xb84d306c s ToggleMenuItemMorph(MenuItemMorph)>mouseUp:
0xb84e061c s ToggleMenuItemMorph(MenuItemMorph)>handleMouseUp:
0xb84e0678 s MouseButtonEvent>sentTo:
0xb84e06d4 s ToggleMenuItemMorph(Morph)>handleEvent:
0xb84e0730 s MorphicEventDispatcher>dispatchDefault:with:
0xb84e078c s MorphicEventDispatcher>handleMouseUp:
0xb84e07e8 s MouseButtonEvent>sentTo:
0xb84e0844 s [] in MorphicEventDispatcher>dispatchEvent:with:
0xb84e08a0 s BlockClosure>ensure:
0xb84d2fec s MorphicEventDispatcher>dispatchEvent:with:
0xb84e08fc s ToggleMenuItemMorph(Morph)>processEvent:using:
0xb84d2f08 s MorphicEventDispatcher>dispatchDefault:with:
0xb84d2f64 s MorphicEventDispatcher>handleMouseUp:
0xb84e01cc s MouseButtonEvent>sentTo:
0xb84e0228 s [] in MorphicEventDispatcher>dispatchEvent:with:
0xb84e0284 s BlockClosure>ensure:
0xb84d2e88 s MorphicEventDispatcher>dispatchEvent:with:
0xb84e02e0 s MenuMorph(Morph)>processEvent:using:
0xb84e033c s MenuMorph(Morph)>processEvent:
0xb84e0398 s MenuMorph>handleFocusEvent:
0xb84e03f4 s [] in HandMorph>sendFocusEvent:to:clear:
0xb84e0450 s BlockClosure>on:do:
0xb84d2d88 s WorldMorph(PasteUpMorph)>becomeActiveDuring:
0xb84d2d10 s HandMorph>sendFocusEvent:to:clear:
0xb84d2e00 s HandMorph>sendEvent:focus:clear:
0xb84d2c9c s HandMorph>sendMouseEvent:
0xb84e0958 s HandMorph>handleEvent:
0xb84e09b4 s HandMorph>processEvents
0xb84e0a10 s [] in WorldState>doOneCycleNowFor:
0xb84e0a6c s Array(SequenceableCollection)>do:
0xb84e0ac8 s WorldState>handsDo:
0xb84d2ba4 s WorldState>doOneCycleNowFor:
0xb84e0b24 s WorldState>doOneCycleFor:
0xb84e0b80 s WorldMorph>doOneCycle
0xb84a0b60 s [] in MorphicUIManager>spawnNewProcess
0xb84a0adc s [] in BlockClosure>newProcess

Most recent primitives
[..]


On Wed, Jun 3, 2015 at 8:32 AM, Jose San Leandro <[hidden email]> wrote:
Hi Dave, Thierry,

Here's what I get in all recent attempts:

[..]
/2.3/lib/gradle-launcher-2.3.jar org.gradle.launcher.GradleMain assemble
18620 pts/5    S+     0:00  |       \_ bash /home/chous/toolbox/pharo/pharo Pharo.image config gitfiletree:///home/chous/osoco/open-badges/game-core ConfigurationOfGameCore --install=bleedingEdge
18635 pts/5    R+     5:07  |           \_ /home/chous/toolbox/pharo/pharo-vm/pharo --nodisplay Pharo.image config gitfiletree:///home/chous/osoco/open-badges/game-core ConfigurationOfGameCore --install=bleedingEdge
32741 pts/5    Z+     0:00  |               \_ [git] <defunct>




2015-06-03 7:05 GMT+02:00 Thierry Goubier <[hidden email]>:
Hi Dave,

Le 03/06/2015 03:15, David T. Lewis a écrit :
Hi Thierry and Jose,

I am reading this thread with interest and will help if I can.

I do have one idea that we have not tried before. I have a theory that this may
be an intermittent problem caused by SIGCHLD signals (from the external OS process
when it exits) being missed by the UnixOSProcessAccessor>>grimReaperProcess
that handles them.

If this is happening, then I may be able to change grimReaperProcess to
work around the problem.

When you see the OS deadlock condition, are you able tell if your Pharo VM
process has subprocesses in the zombie state (indicating that grimReaperProcess
did not clean them up)? The unix command "ps -axf | less" will let you look
at the process tree and that may give us a clue if this is happening.

I found it very easy to reproduce and I do have a zombie children process to the pharo process.

Interesting enough, the lock-up happens in a very specific place, a call to git branch, which is a very short command returning just a few characters (where all other commands have longuer outputs). Reducing the frequency of the calls to git branch by a bit of caching reduces the chances of a lock-up.

Thanks,

Dave

Thanks!

Dave








2015-06-03 8:32 GMT+02:00 Jose San Leandro <[hidden email]>:
Hi Dave, Thierry,

Here's what I get in all recent attempts:

[..]
/2.3/lib/gradle-launcher-2.3.jar org.gradle.launcher.GradleMain assemble
18620 pts/5    S+     0:00  |       \_ bash /home/chous/toolbox/pharo/pharo Pharo.image config gitfiletree:///home/chous/osoco/open-badges/game-core ConfigurationOfGameCore --install=bleedingEdge
18635 pts/5    R+     5:07  |           \_ /home/chous/toolbox/pharo/pharo-vm/pharo --nodisplay Pharo.image config gitfiletree:///home/chous/osoco/open-badges/game-core ConfigurationOfGameCore --install=bleedingEdge
32741 pts/5    Z+     0:00  |               \_ [git] <defunct>




2015-06-03 7:05 GMT+02:00 Thierry Goubier <[hidden email]>:
[..]


Re: Monticello / OS deadlock ?

David T. Lewis
In reply to this post by Thierry Goubier
On Wed, Jun 03, 2015 at 07:05:15AM +0200, Thierry Goubier wrote:

> Hi Dave,
>
> [..]
>
> I found it very easy to reproduce, and I do see a zombie child
> process under the pharo process.
Jose confirms this also (thanks).

Can you try filing in the attached UnixOSProcessAccessor>>grimReaperProcess
and see if it helps? I do not know if it will make a difference, but the
idea is to put a timeout on the semaphore that is waiting for signals from
SIGCHLD. I am hoping that if these signals are sometimes being missed, then
the timeout will allow the process to recover from the problem.


>
> Interestingly enough, the lock-up happens in a very specific place: a call
> to git branch, which is a very short command returning just a few
> characters (where all the other commands have longer outputs). Reducing the
> frequency of the calls to git branch with a bit of caching reduces the
> chances of a lock-up.
>

This is a good clue, and it may indicate a different kind of problem (so
maybe I am looking in the wrong place). Ben's suggestion of adding a delay
to the external process sounds like a good idea to help troubleshoot it.

Dave

 

UnixOSProcessAccessor-grimReaperProcess.st (1K) Download Attachment

Re: Monticello / OS deadlock ?

Jose San Leandro
Unfortunately it doesn't fix it, or at least I get the same symptoms.

Sending SIGUSR1 prints this:

SIGUSR1 Wed Jun  3 16:53:50 2015


/home/chous/toolbox/pharo-4.0/pharo-vm/pharo
pharo VM version: 3.9-7 #1 Thu Apr  2 00:51:45 CEST 2015 gcc 4.6.3 [Production ITHB VM]
Built from: NBCoInterpreter NativeBoost-CogPlugin-EstebanLorenzano.21 uuid: 4d9b9bdf-2dfa-4c0b-99eb-5b110dadc697 Apr  2 2015
With: NBCogit NativeBoost-CogPlugin-EstebanLorenzano.21 uuid: 4d9b9bdf-2dfa-4c0b-99eb-5b110dadc697 Apr  2 2015
Revision: https://github.com/pharo-project/pharo-vm.git Commit: 32d18ba0f2db9bee7f3bdbf16bdb24fe4801cfc5 Date: 2015-03-24 11:08:14 +0100 By: Esteban Lorenzano <[hidden email]> Jenkins build #14904
Build host: Linux pharo-linux 3.2.0-31-generic-pae #50-Ubuntu SMP Fri Sep 7 16:39:45 UTC 2012 i686 i686 i386 GNU/Linux
plugin path: /home/chous/toolbox/pharo-4.0/pharo-vm/ [default: /home/chous/toolbox/pharo-4.0/pharo-vm/]


C stack backtrace & registers:
        eax 0xff981e94 ebx 0xff981db0 ecx 0xff981e48 edx 0xff981dfc
        edi 0xff981c80 esi 0xff981c80 ebp 0xff981d18 esp 0xff981d64
        eip 0xff981f78
*[0xff981f78]
/home/chous/toolbox/pharo/pharo-vm/pharo[0x80a33a2]
/home/chous/toolbox/pharo/pharo-vm/pharo[0x80a3649]
linux-gate.so.1(__kernel_rt_sigreturn+0x0)[0xf773acc0]
/home/chous/toolbox/pharo/pharo-vm/pharo(signalSemaphoreWithIndex+0x28)[0x809d8c8]
/home/chous/toolbox/pharo/pharo-vm/pharo[0x810868c]
linux-gate.so.1(__kernel_sigreturn+0x0)[0xf773acb0]
/home/chous/toolbox/pharo/pharo-vm/pharo(signalSemaphoreWithIndex+0x5e)[0x809d8fe]
/home/chous/toolbox/pharo/pharo-vm/pharo(aioPoll+0x22f)[0x809f0af]
/home/chous/toolbox/pharo-4.0/pharo-vm/vm-display-X11(+0xe671)[0xf772a671]
/home/chous/toolbox/pharo/pharo-vm/pharo(ioRelinquishProcessorForMicroseconds+0x17)[0x80a1887]
/home/chous/toolbox/pharo/pharo-vm/pharo[0x80767fa]
[0xb4a2fe0c]
[0xb4a2d700]
[0xb53b9382]
[0xb4a2d648]
[0x5b]


All Smalltalk process stacks (active first):
Process 0xb6d930c4 priority 10
0xff9ad450 M ProcessorScheduler class>idleProcess 0xb4d935c0: a(n) ProcessorScheduler class
0xff9ad470 I [] in ProcessorScheduler class>startUp 0xb4d935c0: a(n) ProcessorScheduler class
0xff9ad490 I [] in BlockClosure>newProcess 0xb6d92fe8: a(n) BlockClosure

suspended processes
Process 0xb68e1984 priority 50
0xff9a6490 M WeakArray class>finalizationProcess 0xb4d93790: a(n) WeakArray class
0xb69beb68 s [] in WeakArray class>restartFinalizationProcess
0xb68e1924 s [] in BlockClosure>newProcess

Process 0xb5ced038 priority 80
0xff9af490 M DelayMicrosecondScheduler>runTimerEventLoop 0xb5bb6f9c: a(n) DelayMicrosecondScheduler
0xb6098314 s [] in DelayMicrosecondScheduler>startTimerEventLoop
0xb5cecfd8 s [] in BlockClosure>newProcess

Process 0xb68ec880 priority 40
0xff9b2478 M [] in UnixOSProcessAccessor>(nil) 0xb60dc6d0: a(n) UnixOSProcessAccessor
0xff9b2490 M BlockClosure>repeat 0xb68ef2d4: a(n) BlockClosure
0xb68ef278 s [] in UnixOSProcessAccessor>(nil)
0xb68ec820 s [] in BlockClosure>newProcess

Process 0xb6d92d78 priority 60
0xff98742c M InputEventFetcher>waitForInput 0xb5a09718: a(n) InputEventFetcher
0xff987450 M InputEventFetcher>eventLoop 0xb5a09718: a(n) InputEventFetcher
0xff987470 I [] in InputEventFetcher>installEventLoop 0xb5a09718: a(n) InputEventFetcher
0xff987490 I [] in BlockClosure>newProcess 0xb6d92c9c: a(n) BlockClosure

Process 0xb6f25f94 priority 60
0xb6f25fcc s SmalltalkImage>lowSpaceWatcher
0xb71523e4 s [] in SmalltalkImage>installLowSpaceWatcher
0xb6f25f34 s [] in BlockClosure>newProcess

Process 0xb73a4e7c priority 30
0xff99b470 M [] in AioEventHandler>handleExceptions:readEvents:writeEvents: 0xb73a49e4: a(n) AioEventHandler
0xff99b490 I [] in BlockClosure>newProcess 0xb73a4d90: a(n) BlockClosure
Process 0xb6686c88 priority 40
0xffa073d0 M [] in Delay>wait 0xb73a63fc: a(n) Delay
0xffa073f0 M BlockClosure>ifCurtailed: 0xb73a6614: a(n) BlockClosure
0xffa0740c M Delay>wait 0xb73a63fc: a(n) Delay
0xffa07428 M PipeableOSProcess(PipeJunction)>outputOn: 0xb73a0d34: a(n) PipeableOSProcess
0xffa07444 M PipeableOSProcess(PipeJunction)>output 0xb73a0d34: a(n) PipeableOSProcess
0xffa0746c M [] in MCFileTreeGitRepository class>runOSProcessGitCommand:in: 0xb611fa88: a(n) MCFileTreeGitRepository class
0xffa0748c M BlockClosure>ensure: 0xb739d9dc: a(n) BlockClosure
0xff9e538c M MCFileTreeGitRepository class>runOSProcessGitCommand:in: 0xb611fa88: a(n) MCFileTreeGitRepository class
0xff9e53ac M MCFileTreeGitRepository class>runGitCommand:in: 0xb611fa88: a(n) MCFileTreeGitRepository class
0xff9e53cc M MCFileTreeGitRepository>gitCommand:in: 0xb612926c: a(n) MCFileTreeGitRepository
0xff9e53f4 M MCFileTreeGitRepository>gitVersionsForPackage: 0xb612926c: a(n) MCFileTreeGitRepository
0xff9e543c M [] in MCFileTreeGitRepository>loadAllFileNames 0xb612926c: a(n) MCFileTreeGitRepository
0xff9e5458 M FileSystemDirectoryEntry(Object)>in: 0xb71b3fe8: a(n) FileSystemDirectoryEntry
0xff9e548c M [] in MCFileTreeGitRepository>loadAllFileNames 0xb612926c: a(n) MCFileTreeGitRepository
0xffa04310 M BlockClosure>cull: 0xb71b4894: a(n) BlockClosure
0xffa04338 I [] in Job>run 0xb71b48b4: a(n) Job
0xffa04350 M BlockClosure>on:do: 0xb71b56b8: a(n) BlockClosure
0xffa0437c I [] in Job>run 0xb71b48b4: a(n) Job
0xffa0439c M BlockClosure>ensure: 0xb71b4980: a(n) BlockClosure
0xffa043c4 I Job>run 0xb71b48b4: a(n) Job
0xffa043e4 I MorphicUIManager(UIManager)>displayProgress:from:to:during: 0xb50a8790: a(n) MorphicUIManager
0xffa04414 I ByteString(String)>displayProgressFrom:to:during: 0xb61238d8: a(n) ByteString
0xffa04444 M MCFileTreeGitRepository>loadAllFileNames 0xb612926c: a(n) MCFileTreeGitRepository
0xffa04464 I MCFileTreeGitRepository>allFileNames 0xb612926c: a(n) MCFileTreeGitRepository
0xffa0448c M MCFileTreeGitRepository>goferVersionFrom: 0xb612926c: a(n) MCFileTreeGitRepository
0xff9e238c I MetacelloCachingGoferResolvedReference(GoferResolvedReference)>version 0xb71b3134: a(n) MetacelloCachingGoferResolvedReference
0xff9e23a4 M MetacelloCachingGoferResolvedReference>version 0xb71b3134: a(n) MetacelloCachingGoferResolvedReference
0xff9e23bc M [] in MetacelloFetchingMCSpecLoader>resolveDependencies:nearest:into: 0xb706d83c: a(n) MetacelloFetchingMCSpecLoader
0xff9e23e0 M OrderedCollection>do: 0xb71b3234: a(n) OrderedCollection
0xff9e240c M [] in MetacelloFetchingMCSpecLoader>resolveDependencies:nearest:into: 0xb706d83c: a(n) MetacelloFetchingMCSpecLoader
0xff9e2424 M BlockClosure>on:do: 0xb71b3334: a(n) BlockClosure
0xff9e244c M MetacelloFetchingMCSpecLoader>resolveDependencies:nearest:into: 0xb706d83c: a(n) MetacelloFetchingMCSpecLoader
0xff9e2490 M [] in MetacelloFetchingMCSpecLoader>linearLoadPackageSpec:gofer: 0xb706d83c: a(n) MetacelloFetchingMCSpecLoader
0xff9ae318 M MetacelloPharo30Platform(MetacelloPlatform)>do:displaying: 0xb50e8b94: a(n) MetacelloPharo30Platform
0xff9ae338 M MetacelloFetchingMCSpecLoader>linearLoadPackageSpec:gofer: 0xb706d83c: a(n) MetacelloFetchingMCSpecLoader
0xff9ae358 M MetacelloPackageSpec>loadUsing:gofer: 0xb706be54: a(n) MetacelloPackageSpec
0xff9ae37c M [] in MetacelloFetchingMCSpecLoader(MetacelloCommonMCSpecLoader)>linearLoadPackageSpecs:repositories: 0xb706d83c: a(n) MetacelloFetchingMCSpecLoader
0xff9ae3a0 M OrderedCollection>do: 0xb70c807c: a(n) OrderedCollection
0xff9ae3c0 M MetacelloFetchingMCSpecLoader(MetacelloCommonMCSpecLoader)>linearLoadPackageSpecs:repositories: 0xb706d83c: a(n) MetacelloFetchingMCSpecLoader
0xff9ae3f0 I [] in MetacelloFetchingMCSpecLoader>linearLoadPackageSpecs:repositories: 0xb706d83c: a(n) MetacelloFetchingMCSpecLoader
0xff9ae410 M BlockClosure>ensure: 0xb70c813c: a(n) BlockClosure
0xff9ae438 I MetacelloLoaderPolicy>pushLoadDirective:during: 0xb706cb7c: a(n) MetacelloLoaderPolicy
0xff9ae460 I MetacelloLoaderPolicy>pushLinearLoadDirectivesDuring:for: 0xb706cb7c: a(n) MetacelloLoaderPolicy
0xff9ae488 I MetacelloFetchingMCSpecLoader>linearLoadPackageSpecs:repositories: 0xb706d83c: a(n) MetacelloFetchingMCSpecLoader
0xb70c33c0 s MetacelloFetchingMCSpecLoader(MetacelloCommonMCSpecLoader)>load
0xb706d898 s MetacelloMCVersionSpecLoader>load
0xb71948d0 s MetacelloMCVersion>executeLoadFromArray:
0xb719492c s [] in MetacelloMCVersion>fetchRequiredFromArray:
0xb7194988 s [] in MetacelloPharo30Platform(MetacelloPlatform)>useStackCacheDuring:defaultDictionary:
0xb706d96c s BlockClosure>on:do:
0xb706d2d8 s MetacelloPharo30Platform(MetacelloPlatform)>useStackCacheDuring:defaultDictionary:
0xb706d258 s [] in MetacelloMCVersion>fetchRequiredFromArray:
0xb71949e4 s BlockClosure>ensure:
0xb706d15c s [] in MetacelloMCVersion>fetchRequiredFromArray:
0xb706d1e4 s MetacelloPharo30Platform(MetacelloPlatform)>do:displaying:
0xb706d0e4 s MetacelloMCVersion>fetchRequiredFromArray:
0xb706ccc0 s [] in MetacelloMCVersion>doLoadRequiredFromArray:
0xb715327c s BlockClosure>ensure:
0xb706cc34 s MetacelloMCVersion>doLoadRequiredFromArray:
0xb71532d8 s MetacelloMCVersion>load
0xb7153334 s UndefinedObject>(nil)
0xb7153390 s OpalCompiler>evaluate
0xb706ab30 s RubSmalltalkEditor>evaluate:andDo:
0xb706a7f4 s RubSmalltalkEditor>highlightEvaluateAndDo:
0xb7152edc s [] in GLMMorphicPharoPlaygroundRenderer(GLMMorphicPharoCodeRenderer)>actOnHighlightAndEvaluate:
0xb7152f38 s RubEditingArea(RubAbstractTextArea)>handleEdit:
0xb706a784 s [] in GLMMorphicPharoPlaygroundRenderer(GLMMorphicPharoCodeRenderer)>actOnHighlightAndEvaluate:
0xb7152f94 s WorldState>runStepMethodsIn:
0xb7152ff0 s WorldMorph>runStepMethods
0xb706a1cc s WorldState>doOneCycleNowFor:
0xb715304c s WorldState>doOneCycleFor:
0xb71530a8 s WorldMorph>doOneCycle
0xb6686f8c s [] in MorphicUIManager>spawnNewProcess
0xb6686c28 s [] in BlockClosure>newProcess

Most recent primitives
primCreatePipe
new:
at:put:  
at:put:  
basicNew 
basicNew:
basicNew 
basicNew:
primSQFileSetBlocking:
basicNew:
basicAt:put:
basicNew:
basicAt:put:
at:put:  
basicNew 
primSigPipeNumber
basicNew 
wait
at:put:  
signal
primForwardSignal:toSemaphore:
wait
at:put:  
signal
primCreatePipe
new:
at:put:  
at:put:  
basicNew 
basicNew:
basicNew 
basicNew:
primSQFileSetNonBlocking:
basicNew:
basicAt:put:
basicNew:
basicAt:put:
at:put:  
basicNew 
signal
basicNew:
basicAt:put:
basicNew:
basicAt:put:
at:put:  
new:
basicNew 
new:
replaceFrom:to:with:startingAt:
basicNew 
basicNew:
primSQFileSetNonBlocking:
basicNew 
stringHash:initialHash:
primOSFileHandle:
basicNew 
wait
at:put:  
signal
primAioEnable:forSemaphore:externalObject:
basicNew 
objectAt:
basicNew:
stackp:  
basicNew 
primitiveResume
wait
wait
signal
wait
signal
primAioHandle:exceptionEvents:readEvents:writeEvents:
signal
basicNew:
basicAt:put:
primSQFileSetNonBlocking:
basicNew:
basicAt:put:
basicNew:
basicAt:put:
at:put:  
basicNew 
basicNew 
wait
signal
primUTCMicrosecondsClock
+
>=
+
<
primSignal:atUTCMicroseconds:
wait
signal
wait
wait
relinquishProcessorForMicroseconds:
relinquishProcessorForMicroseconds:
primUTCMicrosecondsClock
>=
signal
+
primSignal:atUTCMicroseconds:
wait
basicNew 
basicNew 
basicNew 
basicNew 
signal
basicNew 
signal
basicNew 
new:
wait
new:
at:put:  
at:put:  
at:put:  
basicNew:
at:put:  
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
basicNew 
new:
at:put:  
new:
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
at:put:  
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
at:put:  
at:put:  
at:put:  
new:
replaceFrom:to:with:startingAt:
primSizeOfPointer
new:
at:put:  
at:put:  
at:put:  
primSizeOfPointer
basicNew:
basicNew 
at:put:  
at:put:  
at:put:  
at:put:  
at:put:  
at:put:  
at:put:  
at:put:  
at:put:  
at:put:  
at:put:  
at:put:  
at:put:  
at:put:  
at:put:  
at:put:  
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
new:
basicNew 
new:
at:put:  
at:put:  
at:put:  
at:put:  
at:put:  
at:put:  
new:
replaceFrom:to:with:startingAt:
new:
at:put:  
at:put:  
primGetCurrentWorkingDirectory
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
primForkExec:stdIn:stdOut:stdErr:argBuf:argOffsets:envBuf:envOffsets:workingDir:
primGetPid
primGetPid
primGetPid
basicNew 
basicNew 
wait
at:put:  
signal
wait
shallowCopy
new:
replaceFrom:to:with:startingAt:
signal
wait
replaceFrom:to:with:startingAt:
at:put:  
signal
primCloseNoError:
primCloseNoError:
primCloseNoError:
signal
basicNew:
basicNew 
basicNew 
basicNew 
wait
signal
primUTCMicrosecondsClock
+
>=
+
<
primSignal:atUTCMicroseconds:
wait
signal
wait
relinquishProcessorForMicroseconds:
relinquishProcessorForMicroseconds:
relinquishProcessorForMicroseconds:
relinquishProcessorForMicroseconds:
relinquishProcessorForMicroseconds:
relinquishProcessorForMicroseconds:
relinquishProcessorForMicroseconds:
basicNew:
primRead:into:startingAt:count:
basicNew 
signal
wait
basicNew:
basicNew 
basicNew:
replaceFrom:to:with:startingAt:
replaceFrom:to:with:startingAt:
signal
basicNew 
signal
basicNew 
new:
wait
signal
wait
signal
primAioHandle:exceptionEvents:readEvents:writeEvents:
signal
wait
relinquishProcessorForMicroseconds:
relinquishProcessorForMicroseconds:
relinquishProcessorForMicroseconds:
relinquishProcessorForMicroseconds:
relinquishProcessorForMicroseconds:
relinquishProcessorForMicroseconds:
relinquishProcessorForMicroseconds:

stack page bytes 4096 available headroom 3300 minimum unused headroom 2152

        (SIGUSR1)


2015-06-03 14:15 GMT+02:00 David T. Lewis <[hidden email]>:
[..]




Re: Monticello / OS deadlock ?

Ben Coman
In reply to this post by Jose San Leandro
On Tue, Jun 2, 2015 at 5:10 PM, Jose San Leandro
<[hidden email]> wrote:

> Hi,
>
> In one of our projects we are using Pharo4. The image gets built by gradle,
> which loads the Metacello project. Sometimes, we see the build process
> hangs. It just doesn't progress.
>
> When adding local gitfiletree:// dependencies manually through Monticello
> after a while Pharo gets frozen. It's not always the same repository, it's
> not always the same number of repositories before it hangs.
>
> I launched the image with strace, and attached gdb to the frozen process.
> It turns out it's waiting for a lock that never gets released.
>

Perhaps try each of the experimental delay schedulers under
World > System > Settings > System > Delay Scheduler.

I have no reason to think this will help, but it's easy to try (as a shotgun
approach to troubleshooting).

cheers -ben


Re: Monticello / OS deadlock ?

David T. Lewis
In reply to this post by Jose San Leandro
On Wed, Jun 03, 2015 at 05:03:18PM +0200, Jose San Leandro wrote:
> Unfortunately it doesn't fix it, or at least I get the same sympthoms.

Thanks for trying it. Sorry it did not help :-/

Dave



>
> Sending SIGUSR1 prints this:
>
> SIGUSR1 Wed Jun  3 16:53:50 2015
>
>
> /home/chous/toolbox/pharo-4.0/pharo-vm/pharo
> pharo VM version: 3.9-7 #1 Thu Apr  2 00:51:45 CEST 2015 gcc 4.6.3
> [Production ITHB VM]
> Built from: NBCoInterpreter NativeBoost-CogPlugin-EstebanLorenzano.21 uuid:
> 4d9b9bdf-2dfa-4c0b-99eb-5b110dadc697 Apr  2 2015
> With: NBCogit NativeBoost-CogPlugin-EstebanLorenzano.21 uuid:
> 4d9b9bdf-2dfa-4c0b-99eb-5b110dadc697 Apr  2 2015
> Revision: https://github.com/pharo-project/pharo-vm.git Commit:
> 32d18ba0f2db9bee7f3bdbf16bdb24fe4801cfc5 Date: 2015-03-24 11:08:14 +0100
> By: Esteban Lorenzano <[hidden email]> Jenkins build #14904
> Build host: Linux pharo-linux 3.2.0-31-generic-pae #50-Ubuntu SMP Fri Sep 7
> 16:39:45 UTC 2012 i686 i686 i386 GNU/Linux
> plugin path: /home/chous/toolbox/pharo-4.0/pharo-vm/ [default:
> /home/chous/toolbox/pharo-4.0/pharo-vm/]
>
>
> C stack backtrace & registers:
>         eax 0xff981e94 ebx 0xff981db0 ecx 0xff981e48 edx 0xff981dfc
>         edi 0xff981c80 esi 0xff981c80 ebp 0xff981d18 esp 0xff981d64
>         eip 0xff981f78
> *[0xff981f78]
> /home/chous/toolbox/pharo/pharo-vm/pharo[0x80a33a2]
> /home/chous/toolbox/pharo/pharo-vm/pharo[0x80a3649]
> linux-gate.so.1(__kernel_rt_sigreturn+0x0)[0xf773acc0]
> /home/chous/toolbox/pharo/pharo-vm/pharo(signalSemaphoreWithIndex+0x28)[0x809d8c8]
> /home/chous/toolbox/pharo/pharo-vm/pharo[0x810868c]
> linux-gate.so.1(__kernel_sigreturn+0x0)[0xf773acb0]
> /home/chous/toolbox/pharo/pharo-vm/pharo(signalSemaphoreWithIndex+0x5e)[0x809d8fe]
> /home/chous/toolbox/pharo/pharo-vm/pharo(aioPoll+0x22f)[0x809f0af]
> /home/chous/toolbox/pharo-4.0/pharo-vm/vm-display-X11(+0xe671)[0xf772a671]
> /home/chous/toolbox/pharo/pharo-vm/pharo(ioRelinquishProcessorForMicroseconds+0x17)[0x80a1887]
> /home/chous/toolbox/pharo/pharo-vm/pharo[0x80767fa]
> [0xb4a2fe0c]
> [0xb4a2d700]
> [0xb53b9382]
> [0xb4a2d648]
> [0x5b]
>
>
> All Smalltalk process stacks (active first):
> Process 0xb6d930c4 priority 10
> 0xff9ad450 M ProcessorScheduler class>idleProcess 0xb4d935c0: a(n)
> ProcessorScheduler class
> 0xff9ad470 I [] in ProcessorScheduler class>startUp 0xb4d935c0: a(n)
> ProcessorScheduler class
> 0xff9ad490 I [] in BlockClosure>newProcess 0xb6d92fe8: a(n) BlockClosure
>
> suspended processes
> Process 0xb68e1984 priority 50
> 0xff9a6490 M WeakArray class>finalizationProcess 0xb4d93790: a(n) WeakArray
> class
> 0xb69beb68 s [] in WeakArray class>restartFinalizationProcess
> 0xb68e1924 s [] in BlockClosure>newProcess
>
> Process 0xb5ced038 priority 80
> 0xff9af490 M DelayMicrosecondScheduler>runTimerEventLoop 0xb5bb6f9c: a(n)
> DelayMicrosecondScheduler
> 0xb6098314 s [] in DelayMicrosecondScheduler>startTimerEventLoop
> 0xb5cecfd8 s [] in BlockClosure>newProcess
>
> Process 0xb68ec880 priority 40
> 0xff9b2478 M [] in UnixOSProcessAccessor>(nil) 0xb60dc6d0: a(n)
> UnixOSProcessAccessor
> 0xff9b2490 M BlockClosure>repeat 0xb68ef2d4: a(n) BlockClosure
> 0xb68ef278 s [] in UnixOSProcessAccessor>(nil)
> 0xb68ec820 s [] in BlockClosure>newProcess
>
> Process 0xb6d92d78 priority 60
> 0xff98742c M InputEventFetcher>waitForInput 0xb5a09718: a(n)
> InputEventFetcher
> 0xff987450 M InputEventFetcher>eventLoop 0xb5a09718: a(n) InputEventFetcher
> 0xff987470 I [] in InputEventFetcher>installEventLoop 0xb5a09718: a(n)
> InputEventFetcher
> 0xff987490 I [] in BlockClosure>newProcess 0xb6d92c9c: a(n) BlockClosure
>
> Process 0xb6f25f94 priority 60
> 0xb6f25fcc s SmalltalkImage>lowSpaceWatcher
> 0xb71523e4 s [] in SmalltalkImage>installLowSpaceWatcher
> 0xb6f25f34 s [] in BlockClosure>newProcess
>
> Process 0xb73a4e7c priority 30
> 0xff99b470 M [] in AioEventHandler>handleExceptions:readEvents:writeEvents:
> 0xb73a49e4: a(n) AioEventHandler
> 0xff99b490 I [] in BlockClosure>newProcess 0xb73a4d90: a(n) BlockClosure
> Process 0xb6686c88 priority 40
> 0xffa073d0 M [] in Delay>wait 0xb73a63fc: a(n) Delay
> 0xffa073f0 M BlockClosure>ifCurtailed: 0xb73a6614: a(n) BlockClosure
> 0xffa0740c M Delay>wait 0xb73a63fc: a(n) Delay
> 0xffa07428 M PipeableOSProcess(PipeJunction)>outputOn: 0xb73a0d34: a(n)
> PipeableOSProcess
> 0xffa07444 M PipeableOSProcess(PipeJunction)>output 0xb73a0d34: a(n)
> PipeableOSProcess
> 0xffa0746c M [] in MCFileTreeGitRepository class>runOSProcessGitCommand:in:
> 0xb611fa88: a(n) MCFileTreeGitRepository class
> 0xffa0748c M BlockClosure>ensure: 0xb739d9dc: a(n) BlockClosure
> 0xff9e538c M MCFileTreeGitRepository class>runOSProcessGitCommand:in:
> 0xb611fa88: a(n) MCFileTreeGitRepository class
> 0xff9e53ac M MCFileTreeGitRepository class>runGitCommand:in: 0xb611fa88:
> a(n) MCFileTreeGitRepository class
> 0xff9e53cc M MCFileTreeGitRepository>gitCommand:in: 0xb612926c: a(n)
> MCFileTreeGitRepository
> 0xff9e53f4 M MCFileTreeGitRepository>gitVersionsForPackage: 0xb612926c:
> a(n) MCFileTreeGitRepository
> 0xff9e543c M [] in MCFileTreeGitRepository>loadAllFileNames 0xb612926c:
> a(n) MCFileTreeGitRepository
> 0xff9e5458 M FileSystemDirectoryEntry(Object)>in: 0xb71b3fe8: a(n)
> FileSystemDirectoryEntry
> 0xff9e548c M [] in MCFileTreeGitRepository>loadAllFileNames 0xb612926c:
> a(n) MCFileTreeGitRepository
> 0xffa04310 M BlockClosure>cull: 0xb71b4894: a(n) BlockClosure
> 0xffa04338 I [] in Job>run 0xb71b48b4: a(n) Job
> 0xffa04350 M BlockClosure>on:do: 0xb71b56b8: a(n) BlockClosure
> 0xffa0437c I [] in Job>run 0xb71b48b4: a(n) Job
> 0xffa0439c M BlockClosure>ensure: 0xb71b4980: a(n) BlockClosure
> 0xffa043c4 I Job>run 0xb71b48b4: a(n) Job
> 0xffa043e4 I MorphicUIManager(UIManager)>displayProgress:from:to:during:
> 0xb50a8790: a(n) MorphicUIManager
> 0xffa04414 I ByteString(String)>displayProgressFrom:to:during: 0xb61238d8:
> a(n) ByteString
> 0xffa04444 M MCFileTreeGitRepository>loadAllFileNames 0xb612926c: a(n)
> MCFileTreeGitRepository
> 0xffa04464 I MCFileTreeGitRepository>allFileNames 0xb612926c: a(n)
> MCFileTreeGitRepository
> 0xffa0448c M MCFileTreeGitRepository>goferVersionFrom: 0xb612926c: a(n)
> MCFileTreeGitRepository
> 0xff9e238c I
> MetacelloCachingGoferResolvedReference(GoferResolvedReference)>version
> 0xb71b3134: a(n) MetacelloCachingGoferResolvedReference
> 0xff9e23a4 M MetacelloCachingGoferResolvedReference>version 0xb71b3134:
> a(n) MetacelloCachingGoferResolvedReference
> 0xff9e23bc M [] in
> MetacelloFetchingMCSpecLoader>resolveDependencies:nearest:into: 0xb706d83c:
> a(n) MetacelloFetchingMCSpecLoader
> 0xff9e23e0 M OrderedCollection>do: 0xb71b3234: a(n) OrderedCollection
> 0xff9e240c M [] in
> MetacelloFetchingMCSpecLoader>resolveDependencies:nearest:into: 0xb706d83c:
> a(n) MetacelloFetchingMCSpecLoader
> 0xff9e2424 M BlockClosure>on:do: 0xb71b3334: a(n) BlockClosure
> 0xff9e244c M
> MetacelloFetchingMCSpecLoader>resolveDependencies:nearest:into: 0xb706d83c:
> a(n) MetacelloFetchingMCSpecLoader
> 0xff9e2490 M [] in
> MetacelloFetchingMCSpecLoader>linearLoadPackageSpec:gofer: 0xb706d83c: a(n)
> MetacelloFetchingMCSpecLoader
> 0xff9ae318 M MetacelloPharo30Platform(MetacelloPlatform)>do:displaying:
> 0xb50e8b94: a(n) MetacelloPharo30Platform
> 0xff9ae338 M MetacelloFetchingMCSpecLoader>linearLoadPackageSpec:gofer:
> 0xb706d83c: a(n) MetacelloFetchingMCSpecLoader
> 0xff9ae358 M MetacelloPackageSpec>loadUsing:gofer: 0xb706be54: a(n)
> MetacelloPackageSpec
> 0xff9ae37c M [] in
> MetacelloFetchingMCSpecLoader(MetacelloCommonMCSpecLoader)>linearLoadPackageSpecs:repositories:
> 0xb706d83c: a(n) MetacelloFetchingMCSpecLoader
> 0xff9ae3a0 M OrderedCollection>do: 0xb70c807c: a(n) OrderedCollection
> 0xff9ae3c0 M
> MetacelloFetchingMCSpecLoader(MetacelloCommonMCSpecLoader)>linearLoadPackageSpecs:repositories:
> 0xb706d83c: a(n) MetacelloFetchingMCSpecLoader
> 0xff9ae3f0 I [] in
> MetacelloFetchingMCSpecLoader>linearLoadPackageSpecs:repositories:
> 0xb706d83c: a(n) MetacelloFetchingMCSpecLoader
> 0xff9ae410 M BlockClosure>ensure: 0xb70c813c: a(n) BlockClosure
> 0xff9ae438 I MetacelloLoaderPolicy>pushLoadDirective:during: 0xb706cb7c:
> a(n) MetacelloLoaderPolicy
> 0xff9ae460 I MetacelloLoaderPolicy>pushLinearLoadDirectivesDuring:for:
> 0xb706cb7c: a(n) MetacelloLoaderPolicy
> 0xff9ae488 I
> MetacelloFetchingMCSpecLoader>linearLoadPackageSpecs:repositories:
> 0xb706d83c: a(n) MetacelloFetchingMCSpecLoader
> 0xb70c33c0 s MetacelloFetchingMCSpecLoader(MetacelloCommonMCSpecLoader)>load
> 0xb706d898 s MetacelloMCVersionSpecLoader>load
> 0xb71948d0 s MetacelloMCVersion>executeLoadFromArray:
> 0xb719492c s [] in MetacelloMCVersion>fetchRequiredFromArray:
> 0xb7194988 s [] in
> MetacelloPharo30Platform(MetacelloPlatform)>useStackCacheDuring:defaultDictionary:
> 0xb706d96c s BlockClosure>on:do:
> 0xb706d2d8 s
> MetacelloPharo30Platform(MetacelloPlatform)>useStackCacheDuring:defaultDictionary:
> 0xb706d258 s [] in MetacelloMCVersion>fetchRequiredFromArray:
> 0xb71949e4 s BlockClosure>ensure:
> 0xb706d15c s [] in MetacelloMCVersion>fetchRequiredFromArray:
> 0xb706d1e4 s MetacelloPharo30Platform(MetacelloPlatform)>do:displaying:
> 0xb706d0e4 s MetacelloMCVersion>fetchRequiredFromArray:
> 0xb706ccc0 s [] in MetacelloMCVersion>doLoadRequiredFromArray:
> 0xb715327c s BlockClosure>ensure:
> 0xb706cc34 s MetacelloMCVersion>doLoadRequiredFromArray:
> 0xb71532d8 s MetacelloMCVersion>load
> 0xb7153334 s UndefinedObject>(nil)
> 0xb7153390 s OpalCompiler>evaluate
> 0xb706ab30 s RubSmalltalkEditor>evaluate:andDo:
> 0xb706a7f4 s RubSmalltalkEditor>highlightEvaluateAndDo:
> 0xb7152edc s [] in
> GLMMorphicPharoPlaygroundRenderer(GLMMorphicPharoCodeRenderer)>actOnHighlightAndEvaluate:
> 0xb7152f38 s RubEditingArea(RubAbstractTextArea)>handleEdit:
> 0xb706a784 s [] in
> GLMMorphicPharoPlaygroundRenderer(GLMMorphicPharoCodeRenderer)>actOnHighlightAndEvaluate:
> 0xb7152f94 s WorldState>runStepMethodsIn:
> 0xb7152ff0 s WorldMorph>runStepMethods
> 0xb706a1cc s WorldState>doOneCycleNowFor:
> 0xb715304c s WorldState>doOneCycleFor:
> 0xb71530a8 s WorldMorph>doOneCycle
> 0xb6686f8c s [] in MorphicUIManager>spawnNewProcess
> 0xb6686c28 s [] in BlockClosure>newProcess
>
> Most recent primitives
> primCreatePipe
> new:
> at:put:
> at:put:
> basicNew
> basicNew:
> basicNew
> basicNew:
> primSQFileSetBlocking:
> basicNew:
> basicAt:put:
> basicNew:
> basicAt:put:
> at:put:
> basicNew
> primSigPipeNumber
> basicNew
> wait
> at:put:
> signal
> primForwardSignal:toSemaphore:
> wait
> at:put:
> signal
> primCreatePipe
> new:
> at:put:
> at:put:
> basicNew
> basicNew:
> basicNew
> basicNew:
> primSQFileSetNonBlocking:
> basicNew:
> basicAt:put:
> basicNew:
> basicAt:put:
> at:put:
> basicNew
> signal
> basicNew:
> basicAt:put:
> basicNew:
> basicAt:put:
> at:put:
> new:
> basicNew
> new:
> replaceFrom:to:with:startingAt:
> basicNew
> basicNew:
> primSQFileSetNonBlocking:
> basicNew
> stringHash:initialHash:
> primOSFileHandle:
> basicNew
> wait
> at:put:
> signal
> primAioEnable:forSemaphore:externalObject:
> basicNew
> objectAt:
> basicNew:
> stackp:
> basicNew
> primitiveResume
> wait
> wait
> signal
> wait
> signal
> primAioHandle:exceptionEvents:readEvents:writeEvents:
> signal
> basicNew:
> basicAt:put:
> primSQFileSetNonBlocking:
> basicNew:
> basicAt:put:
> basicNew:
> basicAt:put:
> at:put:
> basicNew
> basicNew
> wait
> signal
> primUTCMicrosecondsClock
> +
> >=
> +
> <
> primSignal:atUTCMicroseconds:
> wait
> signal
> wait
> wait
> relinquishProcessorForMicroseconds:
> relinquishProcessorForMicroseconds:
> primUTCMicrosecondsClock
> >=
> signal
> +
> primSignal:atUTCMicroseconds:
> wait
> basicNew
> basicNew
> basicNew
> basicNew
> signal
> basicNew
> signal
> basicNew
> new:
> wait
> new:
> at:put:
> at:put:
> at:put:
> basicNew:
> at:put:
> basicNew:
> replaceFrom:to:with:startingAt:
> replaceFrom:to:with:startingAt:
> basicNew
> new:
> at:put:
> new:
> basicNew:
> replaceFrom:to:with:startingAt:
> replaceFrom:to:with:startingAt:
> at:put:
> basicNew:
> replaceFrom:to:with:startingAt:
> replaceFrom:to:with:startingAt:
> at:put:
> at:put:
> at:put:
> new:
> replaceFrom:to:with:startingAt:
> primSizeOfPointer
> new:
> at:put:
> at:put:
> at:put:
> primSizeOfPointer
> basicNew:
> basicNew
> at:put:
> at:put:
> at:put:
> at:put:
> at:put:
> at:put:
> at:put:
> at:put:
> at:put:
> at:put:
> at:put:
> at:put:
> at:put:
> at:put:
> at:put:
> at:put:
> replaceFrom:to:with:startingAt:
> replaceFrom:to:with:startingAt:
> replaceFrom:to:with:startingAt:
> new:
> basicNew
> new:
> at:put:
> at:put:
> at:put:
> at:put:
> at:put:
> at:put:
> new:
> replaceFrom:to:with:startingAt:
> new:
> at:put:
> at:put:
> primGetCurrentWorkingDirectory
> basicNew:
> replaceFrom:to:with:startingAt:
> replaceFrom:to:with:startingAt:
> primForkExec:stdIn:stdOut:stdErr:argBuf:argOffsets:envBuf:envOffsets:workingDir:
> primGetPid
> primGetPid
> primGetPid
> basicNew
> basicNew
> wait
> at:put:
> signal
> wait
> shallowCopy
> new:
> replaceFrom:to:with:startingAt:
> signal
> wait
> replaceFrom:to:with:startingAt:
> at:put:
> signal
> primCloseNoError:
> primCloseNoError:
> primCloseNoError:
> signal
> basicNew:
> basicNew
> basicNew
> basicNew
> wait
> signal
> primUTCMicrosecondsClock
> +
> >=
> +
> <
> primSignal:atUTCMicroseconds:
> wait
> signal
> wait
> relinquishProcessorForMicroseconds:
> relinquishProcessorForMicroseconds:
> relinquishProcessorForMicroseconds:
> relinquishProcessorForMicroseconds:
> relinquishProcessorForMicroseconds:
> relinquishProcessorForMicroseconds:
> relinquishProcessorForMicroseconds:
> basicNew:
> primRead:into:startingAt:count:
> basicNew
> signal
> wait
> basicNew:
> basicNew
> basicNew:
> replaceFrom:to:with:startingAt:
> replaceFrom:to:with:startingAt:
> signal
> basicNew
> signal
> basicNew
> new:
> wait
> signal
> wait
> signal
> primAioHandle:exceptionEvents:readEvents:writeEvents:
> signal
> wait
> relinquishProcessorForMicroseconds:
> relinquishProcessorForMicroseconds:
> relinquishProcessorForMicroseconds:
> relinquishProcessorForMicroseconds:
> relinquishProcessorForMicroseconds:
> relinquishProcessorForMicroseconds:
> relinquishProcessorForMicroseconds:
>
> stack page bytes 4096 available headroom 3300 minimum unused headroom 2152
>
>         (SIGUSR1)
>
>
> 2015-06-03 14:15 GMT+02:00 David T. Lewis <[hidden email]>:
>
> > On Wed, Jun 03, 2015 at 07:05:15AM +0200, Thierry Goubier wrote:
> > > Hi Dave,
> > >
> > > On 03/06/2015 03:15, David T. Lewis wrote:
> > > >Hi Thierry and Jose,
> > > >
> > > >I am reading this thread with interest and will help if I can.
> > > >
> > > >I do have one idea that we have not tried before. I have a theory that
> > > >this may
> > > >be an intermittent problem caused by SIGCHLD signals (from the external OS
> > > >process when it exits) being missed by the
> > > >UnixOSProcessAccessor>>grimReaperProcess that handles them.
> > > >
> > > >If this is happening, then I may be able to change grimReaperProcess to
> > > >work around the problem.
> > > >
> > > >When you see the OS deadlock condition, are you able to tell if your Pharo VM
> > > >process has subprocesses in the zombie state (indicating that
> > > >grimReaperProcess did not clean them up)? The Unix command "ps -axf | less"
> > > >will let you look at the process tree and that may give us a clue if this
> > > >is happening.
> > >
> > > I found it very easy to reproduce, and I do have a zombie child
> > > process under the Pharo process.
> >
> > Jose confirms this also (thanks).
> >
> > Can you try filing in the attached UnixOSProcessAccessor>>grimReaperProcess
> > and see if it helps? I do not know if it will make a difference, but the
> > idea is to put a timeout on the semaphore that is waiting for signals from
> > SIGCHLD. I am hoping that if these signals are sometimes being missed, then
> > the timeout will allow the process to recover from the problem.
> >
> >
> > >
> > > Interestingly enough, the lock-up happens in a very specific place: a call
> > > to git branch, which is a very short command returning just a few
> > > characters (whereas all the other commands have longer outputs). Reducing the
> > > frequency of the calls to git branch with a bit of caching reduces the
> > > chances of a lock-up.
> > >
> >
> > This is a good clue, and it may indicate a different kind of problem (so
> > maybe I am looking in the wrong place). Ben's suggestion of adding a delay
> > to the external process sounds like a good idea to help troubleshoot it.
> >
> > Dave
> >
> >
> >
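Dave's suggested workaround, putting a timeout on the semaphore that waits for SIGCHLD, could look roughly like this. This is only a sketch: `waitTimeoutMSecs:` is a real Semaphore method in Pharo/Squeak, but the surrounding selectors (`sigChldSemaphore`, `checkForExitedChildren`) are placeholders, not the actual OSProcess code:

```smalltalk
grimReaperProcess
	"Sketch of the timeout workaround. Instead of blocking forever on the
	SIGCHLD semaphore, wake up once per second and reap any exited children,
	so that a missed signal cannot leave zombies (and blocked readers) behind.
	sigChldSemaphore and checkForExitedChildren are hypothetical selectors."
	^ [[ self sigChldSemaphore waitTimeoutMSecs: 1000.
	self checkForExitedChildren ] repeat ]
		newProcess
			priority: Processor highIOPriority;
			resume;
			yourself
```

With a timeout in place, even if a SIGCHLD signal is lost, the loop still reaps exited children within a second, which is exactly the recovery behaviour Dave hopes for above.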


Re: Monticello / OS deadlock ?

Thierry Goubier
In reply to this post by Jose San Leandro
Hi Jose,

I have pushed a new version of GitFileTree (the development version for
Pharo4) with a complete rewrite of the underlying OSProcess use. Could
you test to see if it solves your deadlocks?

It should also be a tad faster.

Regards,

Thierry

On 03/06/2015 17:03, Jose San Leandro wrote:
> Unfortunately it doesn't fix it, or at least I get the same symptoms.
>



Re: Monticello / OS deadlock ?

Jose San Leandro
Hi,

So far it works perfectly. I'll let you know if it happens again.
Thank you very much!

2015-06-11 23:28 GMT+02:00 Thierry Goubier <[hidden email]>:
Hi Jose,

I have pushed a new version of GitFileTree (the development version for Pharo4) with a complete rewrite of the underlying OSProcess use. Could you test to see if it solves your deadlocks?

It should also be a tad faster.

Regards,

Thierry


On 03/06/2015 17:03, Jose San Leandro wrote:
Unfortunately it doesn't fix it, or at least I get the same symptoms.





Re: Monticello / OS deadlock ?

Thierry Goubier


2015-06-18 10:32 GMT+02:00 Jose San Leandro <[hidden email]>:
Hi,

So far it works perfectly. I'll let you know if it happens again.

Thanks.
 

Thank you very much!

You're welcome. Just a question: which version of the vm are you using? Or which zeroconf scripts are you using to download Pharo? I made some changes related to OSProcess in the latest vm (and they have been integrated); if, say, you're using the normal Pharo4 vm, then it would mean that your problem was solved by changing the way GitFileTree uses OSProcess.

Thierry

 

2015-06-11 23:28 GMT+02:00 Thierry Goubier <[hidden email]>:
Hi Jose,

I have pushed a new version of GitFileTree (the development version for Pharo4) with a complete rewrite of the underlying OSProcess use. Could you test to see if it solves your deadlocks?

It should also be a tad faster.

Regards,

Thierry


On 03/06/2015 17:03, Jose San Leandro wrote:
Unfortunately it doesn't fix it, or at least I get the same symptoms.