Problem with OSSubprocess / signals / heartbeat ?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Problem with OSSubprocess / signals / heartbeat ?

Mariano Martinez Peck
Hi guys, 

There is a bug happening in OSSubprocess since quite some time which we were never able to fix. What happens is that at certain point (after having used OSSubprocess for a while), a simple command like the following would "hang" Pharo:

OSSUnixSubprocess new
    command: 'ls';
    arguments: #();
    redirectStdout;
    runAndWaitOnExitDo: [ :process :outString  | outString ]

Once the image does that, it keeps doing it in a 80% of the cases (so it is easy enough to debug). 

Guille Polito kept one of these images if someone can give us a hand. He also proposed the great idea of using `strace` to see what was going on. He  (together with Pable Tesone) suspected that the heartbeat could be interrupting the `clone()` function which is (I think) called internally by the 'posix_spawn()' which is the one used by OSSubprocess.

When these images are "hung" they found at a infinitive loop like this:

[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 593498}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 600126}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 606282}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 612528}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 618689}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 624922}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 631145}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 637241}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 643578}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 17477] gettimeofday({1482152630, 649974}, NULL) = 0
[pid 17477] sigreturn() (mask [])       = 120
[pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---

As you can see, there is a SIGALARM involved. It also looks like the `gettimeofday` is used by the heartbeat ?  Could it be that someone the heartbeat is interrupting the `clone()` ?

Guille also showed me the `strace` output with a regular / working image:


[pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
[pid 18647] gettimeofday({1482152829, 481014}, NULL) = 0
[pid 18647] sigreturn() (mask [])       = -1 EINTR (Interrupted system call)
[pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0, 1917}}) = 0
[pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource temporarily unavailable)
[pid 18647] select(4, [3], [], [3], {0, 1000}) = 0 (Timeout)
[pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0, 797}}) = 0
[pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource temporarily unavailable)
[pid 18647] select(4, [3], [], [3], {0, 1000}) = ? ERESTARTNOHAND (To be restarted if no handler)
[pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---


Does anyone have any hint here?  

Thanks in advance, 


--
Reply | Threaded
Open this post in threaded view
|

Re: [Vm-dev] Problem with OSSubprocess / signals / heartbeat ?

Holger Freyther

> On 19 Dec 2016, at 14:41, Mariano Martinez Peck <[hidden email]> wrote:
>
> Hi guys,
>
> Guille Polito kept one of these images if someone can give us a hand. He also proposed the great idea of using `strace` to see what was going on. He  (together with Pable Tesone) suspected that the heartbeat could be interrupting the `clone()` function which is (I think) called internally by the 'posix_spawn()' which is the one used by OSSubprocess.
>
> When these images are "hung" they found at a infinitive loop like this:

Okay, how many child processes do you have at that point? How many processes does the system have?


>
> [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
> [pid 17477] gettimeofday({1482152630, 593498}, NULL) = 0
> [pid 17477] sigreturn() (mask [])       = 120
> [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
> [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
> [pid 17477] gettimeofday({1482152630, 600126}, NULL) = 0
> [pid 17477] sigreturn() (mask [])       = 120
> [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)

so above.. 7ms between the two gettimeofday calls? Nothing else? Set a breakpoint on clone/fork in gdb and look at the c-stack at this point? Could you strace with timestamps to see how much time is spent? Is the process suspicious in other ways?

So yes.. sounds like clone doesn't complete.. the question is why? Is it out of resources? Is something in the VM blocking longer than the heartbeat, is the heartbeat more frequent than expected?


> As you can see, there is a SIGALARM involved. It also looks like the `gettimeofday` is used by the heartbeat ?  Could it be that someone the heartbeat is interrupting the `clone()` ?
>
> Guille also showed me the `strace` output with a regular / working image:
>
>
> [pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
> [pid 18647] gettimeofday({1482152829, 481014}, NULL) = 0
> [pid 18647] sigreturn() (mask [])       = -1 EINTR (Interrupted system call)
> [pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0, 1917}}) = 0
> [pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource temporarily unavailable)
> [pid 18647] select(4, [3], [], [3], {0, 1000}) = 0 (Timeout)
> [pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0, 797}}) = 0
> [pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource temporarily unavailable)
> [pid 18647] select(4, [3], [], [3], {0, 1000}) = ? ERESTARTNOHAND (To be restarted if no handler)
> [pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>
>
> Does anyone have any hint here?  

Get timestamps in there. How long does it take to fail/end in this situation?

holger




Reply | Threaded
Open this post in threaded view
|

Re: [Vm-dev] Problem with OSSubprocess / signals / heartbeat ?

Guillermo Polito
Hi Holger,

Just to add to the issue, when we talk about hanging, the VM thread is
stopped in an FFI call,

0xffdfe1c0 I
OSSUnixSubprocess>primitivePosixSpawn:command:fileActions:attributes:argPointer:envPointer:
0x9482438: a(n) OSSUnixSubprocess
0xffdfe200 I OSSUnixSubprocess>internalRun 0x9482438: a(n) OSSUnixSubprocess
0xffdfe220 I [] in OSSUnixSubprocess>run 0x9482438: a(n) OSSUnixSubprocess
0xffdf9fe0 M BlockClosure>ensure: 0x9482750: a(n) BlockClosure
0xffdfa004 I OSSUnixSubprocess>run 0x9482438: a(n) OSSUnixSubprocess
0xffdfa024 I OSSUnixSubprocess>runAndWaitOnExitDo: 0x9482438: a(n)
OSSUnixSubprocess

self ffiCall: #( int posix_spawnp (void* pidPointer, String
commandString, void* fileActions, void * attributes, void* argPointer,
void* envPointer) )


Now, here is an extract of the timestamped `strace -ftt` of the hanging
process:

[pid 24108] 10:28:09.044534 --- SIGALRM {si_signo=SIGALRM,
si_code=SI_KERNEL} ---
[pid 24108] 10:28:09.044577 gettimeofday({1482226089, 44582}, NULL) = 0
[pid 24108] 10:28:09.044644 sigreturn() (mask []) = 120
[pid 24108] 10:28:09.044664 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7500768) = ? ERESTARTNOINTR (To be restarted)
[pid 24108] 10:28:09.050924 --- SIGALRM {si_signo=SIGALRM,
si_code=SI_KERNEL} ---
[pid 24108] 10:28:09.050970 gettimeofday({1482226089, 50976}, NULL) = 0
[pid 24108] 10:28:09.050996 sigreturn() (mask []) = 120
[pid 24108] 10:28:09.051009 --- SIGALRM {si_signo=SIGALRM,
si_code=SI_KERNEL} ---
[pid 24108] 10:28:09.051016 gettimeofday({1482226089, 51020}, NULL) = 0
[pid 24108] 10:28:09.051028 sigreturn() (mask []) = 120
[pid 24108] 10:28:09.051040 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7500768) = ? ERESTARTNOINTR (To be restarted)
[pid 24108] 10:28:09.066154 --- SIGALRM {si_signo=SIGALRM,
si_code=SI_KERNEL} ---
[pid 24108] 10:28:09.066196 gettimeofday({1482226089, 66201}, NULL) = 0
[pid 24108] 10:28:09.066220 sigreturn() (mask []) = 120
[pid 24108] 10:28:09.066233 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7500768) = ? ERESTARTNOINTR (To be restarted)
[pid 24108] 10:28:09.072599 --- SIGALRM {si_signo=SIGALRM,
si_code=SI_KERNEL} ---
[pid 24108] 10:28:09.072643 gettimeofday({1482226089, 72648}, NULL) = 0
[pid 24108] 10:28:09.072716 sigreturn() (mask []) = 120
[pid 24108] 10:28:09.072740 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7500768) = ? ERESTARTNOINTR (To be restarted)
[pid 24108] 10:28:09.079552 --- SIGALRM {si_signo=SIGALRM,
si_code=SI_KERNEL} ---
[pid 24108] 10:28:09.079580 gettimeofday({1482226089, 79585}, NULL) = 0
[pid 24108] 10:28:09.079648 sigreturn() (mask []) = 120
[pid 24108] 10:28:09.079672 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7500768) = ? ERESTARTNOINTR (To be restarted)
[pid 24108] 10:28:09.089178 --- SIGALRM {si_signo=SIGALRM,
si_code=SI_KERNEL} ---
[pid 24108] 10:28:09.089233 gettimeofday({1482226089, 89239}, NULL) = 0
[pid 24108] 10:28:09.089264 sigreturn() (mask []) = 120
[pid 24108] 10:28:09.089278 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7500768) = ? ERESTARTNOINTR (To be restarted)
[pid 24108] 10:28:09.101007 --- SIGALRM {si_signo=SIGALRM,
si_code=SI_KERNEL} ---
[pid 24108] 10:28:09.101046 gettimeofday({1482226089, 101050}, NULL) = 0
[pid 24108] 10:28:09.101096 sigreturn() (mask []) = 120
[pid 24108] 10:28:09.101116 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7500768) = ? ERESTARTNOINTR (To be restarted)
[pid 24108] 10:28:09.107650 --- SIGALRM {si_signo=SIGALRM,
si_code=SI_KERNEL} ---
[pid 24108] 10:28:09.107699 gettimeofday({1482226089, 107706}, NULL) = 0
[pid 24108] 10:28:09.107771 sigreturn() (mask []) = 120
[pid 24108] 10:28:09.107789 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7500768) = ? ERESTARTNOINTR (To be restarted)
[pid 24108] 10:28:09.116188 --- SIGALRM {si_signo=SIGALRM,
si_code=SI_KERNEL} ---
[pid 24108] 10:28:09.116226 gettimeofday({1482226089, 116232}, NULL) = 0
[pid 24108] 10:28:09.116254 sigreturn() (mask []) = 120
[pid 24108] 10:28:09.116273 clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0xf7500768) = ? ERESTARTNOINTR (To be restarted)
[pid 24108] 10:28:09.129739 --- SIGALRM {si_signo=SIGALRM,
si_code=SI_KERNEL} ---


The second trace in Mariano's email was a trace in the case of no FFI
call. Just to compare both of them and their differences, but there is
no hang in there.
Although my knowledge arrives until here, I can help in providing any
other traces or information.

I asked Santa an OSSubprocess that won't hang :)

Thanks,
Guille


-------- Original Message --------

>  
>
>> On 19 Dec 2016, at 14:41, Mariano Martinez Peck <[hidden email]> wrote:
>>
>> Hi guys,
>>
>> Guille Polito kept one of these images if someone can give us a hand. He also proposed the great idea of using `strace` to see what was going on. He  (together with Pable Tesone) suspected that the heartbeat could be interrupting the `clone()` function which is (I think) called internally by the 'posix_spawn()' which is the one used by OSSubprocess.
>>
>> When these images are "hung" they found at a infinitive loop like this:
> Okay, how many child processes do you have at that point? How many processes does the system have?
>
>
>> [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>> [pid 17477] gettimeofday({1482152630, 593498}, NULL) = 0
>> [pid 17477] sigreturn() (mask [])       = 120
>> [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
>> [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>> [pid 17477] gettimeofday({1482152630, 600126}, NULL) = 0
>> [pid 17477] sigreturn() (mask [])       = 120
>> [pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
> so above.. 7ms between the two gettimeofday calls? Nothing else? Set a breakpoint on clone/fork in gdb and look at the c-stack at this point? Could you strace with timestamps to see how much time is spent? Is the process suspicious in other ways?
>
> So yes.. sounds like clone doesn't complete.. the question is why? Is it out of resources? Is something in the VM blocking longer than the heartbeat, is the heartbeat more frequent than expected?
>
>
>> As you can see, there is a SIGALARM involved. It also looks like the `gettimeofday` is used by the heartbeat ?  Could it be that someone the heartbeat is interrupting the `clone()` ?
>>
>> Guille also showed me the `strace` output with a regular / working image:
>>
>>
>> [pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>> [pid 18647] gettimeofday({1482152829, 481014}, NULL) = 0
>> [pid 18647] sigreturn() (mask [])       = -1 EINTR (Interrupted system call)
>> [pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0, 1917}}) = 0
>> [pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource temporarily unavailable)
>> [pid 18647] select(4, [3], [], [3], {0, 1000}) = 0 (Timeout)
>> [pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0, 797}}) = 0
>> [pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource temporarily unavailable)
>> [pid 18647] select(4, [3], [], [3], {0, 1000}) = ? ERESTARTNOHAND (To be restarted if no handler)
>> [pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>>
>>
>> Does anyone have any hint here?
> Get timestamps in there. How long does it take to fail/end in this situation?
>
> holger
>
>
>
>
>



Reply | Threaded
Open this post in threaded view
|

Re: [Vm-dev] Problem with OSSubprocess / signals / heartbeat ?

stepharong

> I asked Santa an OSSubprocess that won't hang :)

I really appreciate to see such not so small polar elfves helping santa :)

Stef



> Thanks,
> Guille
>
>
> -------- Original Message --------
>>
>>> On 19 Dec 2016, at 14:41, Mariano Martinez Peck  
>>> <[hidden email]> wrote:
>>>
>>> Hi guys,
>>>
>>> Guille Polito kept one of these images if someone can give us a hand.  
>>> He also proposed the great idea of using `strace` to see what was  
>>> going on. He  (together with Pable Tesone) suspected that the  
>>> heartbeat could be interrupting the `clone()` function which is (I  
>>> think) called internally by the 'posix_spawn()' which is the one used  
>>> by OSSubprocess.
>>>
>>> When these images are "hung" they found at a infinitive loop like this:
>> Okay, how many child processes do you have at that point? How many  
>> processes does the system have?
>>
>>
>>> [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>>> [pid 17477] gettimeofday({1482152630, 593498}, NULL) = 0
>>> [pid 17477] sigreturn() (mask [])       = 120
>>> [pid 17477] clone(child_stack=0,  
>>> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,  
>>> child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
>>> [pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>>> [pid 17477] gettimeofday({1482152630, 600126}, NULL) = 0
>>> [pid 17477] sigreturn() (mask [])       = 120
>>> [pid 17477] clone(child_stack=0,  
>>> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,  
>>> child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)
>> so above.. 7ms between the two gettimeofday calls? Nothing else? Set a  
>> breakpoint on clone/fork in gdb and look at the c-stack at this point?  
>> Could you strace with timestamps to see how much time is spent? Is the  
>> process suspicious in other ways?
>>
>> So yes.. sounds like clone doesn't complete.. the question is why? Is  
>> it out of resources? Is something in the VM blocking longer than the  
>> heartbeat, is the heartbeat more frequent than expected?
>>
>>
>>> As you can see, there is a SIGALARM involved. It also looks like the  
>>> `gettimeofday` is used by the heartbeat ?  Could it be that someone  
>>> the heartbeat is interrupting the `clone()` ?
>>>
>>> Guille also showed me the `strace` output with a regular / working  
>>> image:
>>>
>>>
>>> [pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>>> [pid 18647] gettimeofday({1482152829, 481014}, NULL) = 0
>>> [pid 18647] sigreturn() (mask [])       = -1 EINTR (Interrupted system  
>>> call)
>>> [pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000},  
>>> it_value={0, 1917}}) = 0
>>> [pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource  
>>> temporarily unavailable)
>>> [pid 18647] select(4, [3], [], [3], {0, 1000}) = 0 (Timeout)
>>> [pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000},  
>>> it_value={0, 797}}) = 0
>>> [pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource  
>>> temporarily unavailable)
>>> [pid 18647] select(4, [3], [], [3], {0, 1000}) = ? ERESTARTNOHAND (To  
>>> be restarted if no handler)
>>> [pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>>>
>>>
>>> Does anyone have any hint here?
>> Get timestamps in there. How long does it take to fail/end in this  
>> situation?
>>
>> holger
>>
>>
>>
>>
>>
>
>
>


--
Using Opera's mail client: http://www.opera.com/mail/

Reply | Threaded
Open this post in threaded view
|

Re: [Vm-dev] Problem with OSSubprocess / signals / heartbeat ?

Guillermo Polito

Ok, so following a bit on this. I'll summarize some of our findings, some of them maybe obvious for some people in this list.

I saw that there are actually two different kind of VMs for *nix [1] :
 - threaded heartbeat
 - itimer + signal heartbeat
 
Specially, I'd like to cite the following paragraph for the lazy:

A distinction on linux is between VMs with an itimer hearbeat or a threaded heartbeat. VMs with an itimer hearbeat use setitimer to deliver a SIGALRM signal at regular intervals to interrupt the VM to check for events. These signals can be troublesome, interrupting foreign code that cannot cope with such signals. VMs with a threaded heartbeat use a high-priority thread that loops, blocking on nanosleep and then interrupting the VM, performing the same function as the itimer heartbeat but without using signals. These VMs are to be preferred but suport for multiple thread priorities in user-level processes has only been available on linux in kernels later than 2.6.12.

So, I downloaded the heartbeat squeak VM from bintray [2]. This VM requires so deploy some configuration files in /etc/security [2].

Under this configuration OSSubprocess worked like a charm (or I did not found the issue again so far).

Now, this heartbeat threaded VM is the recommended in the README file, and we see that OSSubprocess generates the exact issue stated. The main problem remains for the moment since Pharo's default download includes not this VM but the itimer one. I talked with Esteban about it and he was aware of these two VM flavours, and the reason why we are using the itimer one is the need to deploy those permission files in /etc, which makes installation a bit less automatic.

Cheers,
Guille

[1] https://github.com/OpenSmalltalk/opensmalltalk-vm/blob/e17db79411cfec767e04f3d94d12a642d920a30e/build.linux64x64/HowToBuild
[2] https://bintray.com/opensmalltalk/vm/download_file?file_path=cog_linux32x86_squeak.sista.spur_201612170124.tar.gz
[3] https://github.com/OpenSmalltalk/opensmalltalk-vm/releases/tag/r3732#linux


-------- Original Message --------


I asked Santa an OSSubprocess that won't hang :)



I really appreciate to see such not so small polar elfves helping santa :)



Stef







Thanks,

Guille





-------- Original Message --------



On 19 Dec 2016, at 14:41, Mariano Martinez Peck [hidden email] wrote:



Hi guys,



Guille Polito kept one of these images if someone can give us a hand. He also proposed the great idea of using `strace` to see what was going on. He  (together with Pable Tesone) suspected that the heartbeat could be interrupting the `clone()` function which is (I think) called internally by the 'posix_spawn()' which is the one used by OSSubprocess.



When these images are "hung" they found at a infinitive loop like this:

Okay, how many child processes do you have at that point? How many processes does the system have?





[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---

[pid 17477] gettimeofday({1482152630, 593498}, NULL) = 0

[pid 17477] sigreturn() (mask [])       = 120

[pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)

[pid 17477] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---

[pid 17477] gettimeofday({1482152630, 600126}, NULL) = 0

[pid 17477] sigreturn() (mask [])       = 120

[pid 17477] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf7578768) = ? ERESTARTNOINTR (To be restarted)

so above.. 7ms between the two gettimeofday calls? Nothing else? Set a breakpoint on clone/fork in gdb and look at the c-stack at this point? Could you strace with timestamps to see how much time is spent? Is the process suspicious in other ways?



So yes.. sounds like clone doesn't complete.. the question is why? Is it out of resources? Is something in the VM blocking longer than the heartbeat, is the heartbeat more frequent than expected?





As you can see, there is a SIGALARM involved. It also looks like the `gettimeofday` is used by the heartbeat ?  Could it be that someone the heartbeat is interrupting the `clone()` ?



Guille also showed me the `strace` output with a regular / working image:





[pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---

[pid 18647] gettimeofday({1482152829, 481014}, NULL) = 0

[pid 18647] sigreturn() (mask [])       = -1 EINTR (Interrupted system call)

[pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0, 1917}}) = 0

[pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource temporarily unavailable)

[pid 18647] select(4, [3], [], [3], {0, 1000}) = 0 (Timeout)

[pid 18647] getitimer(ITIMER_REAL, {it_interval={0, 2000}, it_value={0, 797}}) = 0

[pid 18647] recvmsg(3, 0xff7b0734, 0)   = -1 EAGAIN (Resource temporarily unavailable)

[pid 18647] select(4, [3], [], [3], {0, 1000}) = ? ERESTARTNOHAND (To be restarted if no handler)

[pid 18647] --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---





Does anyone have any hint here?

Get timestamps in there. How long does it take to fail/end in this situation?



holger

























Reply | Threaded
Open this post in threaded view
|

Re: [Vm-dev] Problem with OSSubprocess / signals / heartbeat ?

Holger Freyther

> On 21 Dec 2016, at 22:29, Guille Polito <[hidden email]> wrote:
>


Hi!

> Now, this heartbeat threaded VM is the recommended in the README file, and we see that OSSubprocess generates the exact issue stated. The main problem remains for the moment since Pharo's default download includes not this VM but the itimer one. I talked with Esteban about it and he was aware of these two VM flavours, and the reason why we are using the itimer one is the need to deploy those permission files in /etc, which makes installation a bit less automatic.

It is not a matter of configuration, e.g. more infrastructure is migrated into docker containers (I run the CI with docker containers) and even as root inside the container you might not be allowed to use realtime priority. And you might not be able to influence how your container is being executed.

holger

PS: Have you seen the fix David Lewis made to the UnixProcess plugin, maybe it already helps with your bug as well?