Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
25 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Mariano Martinez Peck
Hi guys,

I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot explain how this could happened but it does, and it happened to me a couple of times (but not fully reproducible).

In OSSubprocess I have this child watcher:

initializeChildWatcher
"This is a process which waits for the death of a child processes. Use SIGCHLD events rather than a Delay to poll."

| processSynchronizationDelay |
processSynchronizationDelay := Delay forMilliseconds: 20.
childWatcher ifNil: [
childWatcher := [[
"OSProcess authors suspected that there were various ways in which OS signals
could be missed under conditions of heavy load. For that reason, we use
#waitTimeoutMSecs: with the semaphore, so that if a signal is missed,
we time out and rescan occasionally anyway (#updateActiveChildrenAndNotifyDead
sends queryExitStatus which sends waitpid() )
"
self sigChldSemaphore waitTimeoutMSecs: 1000.
processSynchronizationDelay wait. "Avoids lost signals in heavy process switching"
self updateActiveChildrenAndNotifyDead.
] repeat] newProcess.

childWatcher resume.
"name selected to look reasonable in the process browser"
childWatcher name: ((ReadStream on: childWatcher hash asString) next: 5)
, ': the OSSubprocess child watcher'
]


The problem is that now above process is hung on the "processSynchronizationDelay wait.". At least that's what I can see from the process browser.  That should not hung, of course, and should stop waiting after 20 ms. 

If I inspect the Delay instance in question (doing the wait) I see that the print string is bad.... "a Delay(20 msecs; 3641032396485 msecs remaining)". WTF? 20ms  and you remain 3641032396485 to wait??? I attach an screenshot of the Delay instance in case there could be something wrong.

Maybe it is wrong to keep and re-use same instance of Delay? (note it is outside the closure) (this was not a problem in the past). 

Thanks in advance for any clue.


--

Screen Shot 2016-05-18 at 12.47.14 PM.png (892K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Ben Coman
I don't have time to look at this properly until tomorrow.
In the meantime, could you just post the result of...

   Delay delaySchedulerClass
it should be "DelayExperimentalSpinScheduler"   ***

Also could you try a few of the other options from
 World > System > Settings > System ...


***whoops, the Experimental tag should have been dropped for Pharo 5
Release.  Its been fine for months (unless this is a fault :)  ]

On Wed, May 18, 2016 at 11:49 PM, Mariano Martinez Peck
<[hidden email]> wrote:

> Hi guys,
>
> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot explain
> how this could happened but it does, and it happened to me a couple of times
> (but not fully reproducible).
>
> In OSSubprocess I have this child watcher:
>
> initializeChildWatcher
> "This is a process which waits for the death of a child processes. Use
> SIGCHLD events rather than a Delay to poll."
>
> | processSynchronizationDelay |
> processSynchronizationDelay := Delay forMilliseconds: 20.
> childWatcher ifNil: [
> childWatcher := [[
> "OSProcess authors suspected that there were various ways in which OS
> signals
> could be missed under conditions of heavy load. For that reason, we use
> #waitTimeoutMSecs: with the semaphore, so that if a signal is missed,
> we time out and rescan occasionally anyway
> (#updateActiveChildrenAndNotifyDead
> sends queryExitStatus which sends waitpid() )
> "
> self sigChldSemaphore waitTimeoutMSecs: 1000.
> processSynchronizationDelay wait. "Avoids lost signals in heavy process
> switching"
> self updateActiveChildrenAndNotifyDead.
> ] repeat] newProcess.
>
> childWatcher resume.
> "name selected to look reasonable in the process browser"
> childWatcher name: ((ReadStream on: childWatcher hash asString) next: 5)
> , ': the OSSubprocess child watcher'
> ]
>
>
> The problem is that now above process is hung on the
> "processSynchronizationDelay wait.". At least that's what I can see from the
> process browser.  That should not hung, of course, and should stop waiting
> after 20 ms.
>
> If I inspect the Delay instance in question (doing the wait) I see that the
> print string is bad.... "a Delay(20 msecs; 3641032396485 msecs remaining)".
> WTF? 20ms  and you remain 3641032396485 to wait??? I attach an screenshot of
> the Delay instance in case there could be something wrong.

>
> Maybe it is wrong to keep and re-use same instance of Delay? (note it is
> outside the closure) (this was not a problem in the past).

Not sure.  I've heard others say Delays should only be used once. To
experiment, could you try creating a delay each time.

cheers -ben

Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Mariano Martinez Peck


On Wed, May 18, 2016 at 2:10 PM, Ben Coman <[hidden email]> wrote:
I don't have time to look at this properly until tomorrow.
In the meantime, could you just post the result of...

   Delay delaySchedulerClass
it should be "DelayExperimentalSpinScheduler"   ***


Yes, it is. 
 
Also could you try a few of the other options from
 World > System > Settings > System ...


Do you know which is the "Safest" to use? Or at least which was used in say, 4.0?


 

***whoops, the Experimental tag should have been dropped for Pharo 5
Release.  Its been fine for months (unless this is a fault :)  ]


Yes, I found that strange too. 


>
> Maybe it is wrong to keep and re-use same instance of Delay? (note it is
> outside the closure) (this was not a problem in the past).

Not sure.  I've heard others say Delays should only be used once. To
experiment, could you try creating a delay each time.


Ok, I will try this. But again, it would take some some time to test and it happens to me every in while only.

Thanks!! 



--
Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Martin McClure-2
In reply to this post by Mariano Martinez Peck
On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
> Hi guys,
>
> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
> explain how this could happened but it does, and it happened to me a
> couple of times (but not fully reproducible).
>

Hmm. The schedulerResumptionTime is, somehow, being (approximately)
doubled. It's not clear how that can happen, but I'll look a little more.

-Martin

Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Paul DeBruicker
In reply to this post by Mariano Martinez Peck
the reason I asked about OSProcess in Pharo 5 was that

OSProcess waitForCommand: 'ls'


never returned.  


Maybe its another instance of this problem.  And it happens everytime.  

There is a 50ms delay in the #waitForCommand: method




hope this helps

Paul



Mariano Martinez Peck wrote
On Wed, May 18, 2016 at 2:10 PM, Ben Coman <[hidden email]> wrote:

> I don't have time to look at this properly until tomorrow.
> In the meantime, could you just post the result of...
>
>    Delay delaySchedulerClass
> it should be "DelayExperimentalSpinScheduler"   ***
>
>
Yes, it is.


> Also could you try a few of the other options from
>  World > System > Settings > System ...
>
>
Do you know which is the "Safest" to use? Or at least which was used in
say, 4.0?




>
> ***whoops, the Experimental tag should have been dropped for Pharo 5
> Release.  Its been fine for months (unless this is a fault :)  ]
>
>
Yes, I found that strange too.


> >
> > Maybe it is wrong to keep and re-use same instance of Delay? (note it is
> > outside the closure) (this was not a problem in the past).
>
> Not sure.  I've heard others say Delays should only be used once. To
> experiment, could you try creating a delay each time.
>
>
Ok, I will try this. But again, it would take some some time to test and it
happens to me every in while only.

Thanks!!



--
Mariano
http://marianopeck.wordpress.com
Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Martin McClure-2
On 05/18/2016 02:46 PM, Paul DeBruicker wrote:

> the reason I asked about OSProcess in Pharo 5 was that
>
> OSProcess waitForCommand: 'ls'
>
>
> never returned.
>
>
> Maybe its another instance of this problem.  And it happens everytime.
>
> There is a 50ms delay in the #waitForCommand: method
>
I'm currently having a problem with an OSProcess of a chmod never
returning. And that's in Pharo 4! Never seen a problem there before.
Investigating.

Regards,

-Martin

Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Mariano Martinez Peck
In reply to this post by Paul DeBruicker


On Wed, May 18, 2016 at 6:46 PM, Paul DeBruicker <[hidden email]> wrote:
the reason I asked about OSProcess in Pharo 5 was that

OSProcess waitForCommand: 'ls'


never returned.


Maybe its another instance of this problem.  And it happens everytime.

There is a 50ms delay in the #waitForCommand: method


It looks quite similar indeed. And... both, OSProcess and OSSubprocess have a similar child watcher (I based mine in OSProcess one).
Paul could you inspect the child process from the process browser tool, do an inspect, and check the suspended process? Check the wait time to see if it is 50ms or 20ms etc...that way we can see which part is the "hunging".
 

 


hope this helps

Paul




Mariano Martinez Peck wrote
> On Wed, May 18, 2016 at 2:10 PM, Ben Coman &lt;

> btc@

> &gt; wrote:
>
>> I don't have time to look at this properly until tomorrow.
>> In the meantime, could you just post the result of...
>>
>>    Delay delaySchedulerClass
>> it should be "DelayExperimentalSpinScheduler"   ***
>>
>>
> Yes, it is.
>
>
>> Also could you try a few of the other options from
>>  World > System > Settings > System ...
>>
>>
> Do you know which is the "Safest" to use? Or at least which was used in
> say, 4.0?
>
>
>
>
>>
>> ***whoops, the Experimental tag should have been dropped for Pharo 5
>> Release.  Its been fine for months (unless this is a fault :)  ]
>>
>>
> Yes, I found that strange too.
>
>
>> >
>> > Maybe it is wrong to keep and re-use same instance of Delay? (note it
>> is
>> > outside the closure) (this was not a problem in the past).
>>
>> Not sure.  I've heard others say Delays should only be used once. To
>> experiment, could you try creating a delay each time.
>>
>>
> Ok, I will try this. But again, it would take some some time to test and
> it
> happens to me every in while only.
>
> Thanks!!
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com





--
View this message in context: http://forum.world.st/Problem-with-delay-waiting-OSSubprocess-code-on-Pharo-5-0-tp4895780p4895890.html
Sent from the Pharo Smalltalk Developers mailing list archive at Nabble.com.




--
Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Mariano Martinez Peck
In reply to this post by Martin McClure-2


On Wed, May 18, 2016 at 7:28 PM, Martin McClure <[hidden email]> wrote:
On 05/18/2016 02:46 PM, Paul DeBruicker wrote:
the reason I asked about OSProcess in Pharo 5 was that

OSProcess waitForCommand: 'ls'


never returned.


Maybe its another instance of this problem.  And it happens everytime.

There is a 50ms delay in the #waitForCommand: method

I'm currently having a problem with an OSProcess of a chmod never returning. And that's in Pharo 4! Never seen a problem there before. Investigating.


Thanks Martin. I cannot tell about 4.0 as OSSubprocess only works in 5.0... 
What I know is that in 5.0 there have been changes in the wait  delay (that's why Ben answered). 
 
Regards,

-Martin




--
Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Martin McClure-2
In reply to this post by Martin McClure-2
On 05/18/2016 03:17 PM, Martin McClure wrote:

> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>> Hi guys,
>>
>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>> explain how this could happened but it does, and it happened to me a
>> couple of times (but not fully reproducible).
>>
>
> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
> doubled. It's not clear how that can happen, but I'll look a little more.
>

Mario, is there any chance that you might be saving the image during one
of these Delays?


This one smells like a race condition, and I think I see something that
*might* explain it. But I don't have any more time to spend on this one,
so I'll leave the rest to someone else. I hope this is helpful:

The only way I immediately see for the schedulerResumptionTime to become
approximately doubled is if the Delay's resumption time is adjusted by
#restoreResumptionTimes without previously having been adjusted by
#saveResumptionTimes.

The only time either of those are sent, that I can see, is on saving the
image. Both are normally sent, (save before the snapshot, restore
afterwards), but there may be a hole there.

#saveResumptionTimes is only sent (by this scheduler class) when the
accessProtect semaphore is held, but #handleTimerEvent: is executed in
the timing Process *without* the protection of accessProtect, in the
case of the VM signaling the timingSemaphore. If the VM signals the
timingSemaphore, #handleTimerEvent: could run in the middle of
#saveResumptionTimes. If some Delay expires because of that timer event,
our Delay could move from being the first suspended delay to being the
active delay. If that happens after we've adjusted the active delay, but
before we've processed the suspended delays, that Delay will not get
adjusted, and will show the symptoms that Mariano is seeing.

Also, I'm not sure how the Heap that holds the suspendedDelays will
react to being modified in the middle of an enumeration. That might open
a larger window for the problem.

Regards,

-Martin

Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Ben Coman
On Thu, May 19, 2016 at 8:49 AM, Martin McClure <[hidden email]> wrote:

> On 05/18/2016 03:17 PM, Martin McClure wrote:
>>
>> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>>>
>>> Hi guys,
>>>
>>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>>> explain how this could happened but it does, and it happened to me a couple
>>> of times (but not fully reproducible).
>>>
>>
>> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>> doubled. It's not clear how that can happen, but I'll look a little more.
>>
>
> Mario, is there any chance that you might be saving the image during one of
> these Delays?
>
>
> This one smells like a race condition, and I think I see something that
> *might* explain it. But I don't have any more time to spend on this one, so
> I'll leave the rest to someone else. I hope this is helpful:
>
> The only way I immediately see for the schedulerResumptionTime to become
> approximately doubled is if the Delay's resumption time is adjusted by
> #restoreResumptionTimes without previously having been adjusted by
> #saveResumptionTimes.
>
> The only time either of those are sent, that I can see, is on saving the
> image. Both are normally sent, (save before the snapshot, restore
> afterwards), but there may be a hole there.
>
> #saveResumptionTimes is only sent (by this scheduler class) when the
> accessProtect semaphore is held, but #handleTimerEvent: is executed in the
> timing Process *without* the protection of accessProtect, in the case of the
> VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
> #handleTimerEvent: could run in the middle of #saveResumptionTimes. If some
> Delay expires because of that timer event, our Delay could move from being
> the first suspended delay to being the active delay. If that happens after
> we've adjusted the active delay, but before we've processed the suspended
> delays, that Delay will not get adjusted, and will show the symptoms that
> Mariano is seeing.

A quick experiment to test this might be in shutDown/#startUp trying...
    [ self saveResumptionTimes ] valueAt: Processor timingPriority
    [ self resumeResumptionTimes ] valueAt: Processor timingPriority

>
> Also, I'm not sure how the Heap that holds the suspendedDelays will react to
> being modified in the middle of an enumeration. That might open a larger
> window for the problem.
>
> Regards,
>
> -Martin
>

Even if not directly related to Mariano's problem, I agree with your
general assessment.  I'm not comfortable with the way that
#save/#restoreResumptionTimes (which manipulate suspendedDelays) are
called from user priority code via #shutDown/#startUp.  Per the
original code**, accessProtect can't be used inside the timing
priority #handleTimerEvent since accessProtect is held by the user
priority #schedule when it uses "timingSemaphore signal" to invoke
invokes #handleTimerEvent.  accessProtect never protected
timingPriority manipulation of suspendedDelays by #handleTimerEvent,
nor expired delays waking up.  But ahhh... the disabling of
accessProtect previously prevented new delays being scheduled between
a #save and #restore. If a new delay is scheduled after the #save,
when it is #restore'd its resumptionTime would be wrong.

Waiting in the wings for Pharo 6 I have changes that should help:
* have #save/#restoreResumptionTimes *only* called from timing
priority event loop (i.e. #handleTimerEvent)
* shutDown/startUp suspends/resumes the timing priority event loop,
instead of trying to block signals to timingSemaphore

I haven't touched it for a few months so I'll need to chase it up to
provide a preview.


Mariano, can you try DelayMillisecondScheduler (which however is
missing some fixes for other issues).

cheers -ben

Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Martin McClure-2
On 05/18/2016 07:39 PM, Ben Coman wrote:
Waiting in the wings for Pharo 6 I have changes that should help:
* have #save/#restoreResumptionTimes *only* called from timing
priority event loop (i.e. #handleTimerEvent)
* shutDown/startUp suspends/resumes the timing priority event loop,
instead of trying to block signals to timingSemaphore

Yeah, that sounds better, thanks.

-Martin
Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Mariano Martinez Peck
In reply to this post by Martin McClure-2


On Wed, May 18, 2016 at 9:49 PM, Martin McClure <[hidden email]> wrote:
On 05/18/2016 03:17 PM, Martin McClure wrote:
On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
Hi guys,

I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot explain how this could happened but it does, and it happened to me a couple of times (but not fully reproducible).


Hmm. The schedulerResumptionTime is, somehow, being (approximately) doubled. It's not clear how that can happen, but I'll look a little more.


Mario, is there any chance that you might be saving the image during one of these Delays?


This one smells like a race condition, and I think I see something that *might* explain it. But I don't have any more time to spend on this one, so I'll leave the rest to someone else. I hope this is helpful:

The only way I immediately see for the schedulerResumptionTime to become approximately doubled is if the Delay's resumption time is adjusted by #restoreResumptionTimes without previously having been adjusted by #saveResumptionTimes.

The only time either of those are sent, that I can see, is on saving the image. Both are normally sent, (save before the snapshot, restore afterwards), but there may be a hole there.


Martin, first off, thanks for the research!!!

Now....your email made me remember something: I did get VM crash when saving the image a couple of times. The VM crashed when saving the image. If I re-opened the image, it looks like if the image was indeed saved (so the snapshot primitive itself did work), but I suspect not all shutdown code could have been run correctly. 

The VM crash looks like the FreeTypeFace >> pvtDestroyHandle  which, as far as I know, it's a "known crash" (I attach crash dump). From what I can see, if I follow all the stack, the crash starts from the WeakArray >> startUp: .
That means that...depending on the order of the startup list...the Scheduler may not have been run after the crash.

Now.... WeakArray initialization does:

SessionManager default
registerSystemClassNamed: self name.
While... 

Delay class >> startUp "Restart active delay, if any, when resuming a snapshot." Scheduler startUp.  

And the Delay registration is 

SessionManager default 
registerSystemClassNamed: self name 
atPriority: 20.

So...that seems correct...

I can verify this by:

SessionManager default systemCategory prioritizedList 

Anyway...not sure if this adds something, but just wanted to note this. 

 
#saveResumptionTimes is only sent (by this scheduler class) when the accessProtect semaphore is held, but #handleTimerEvent: is executed in the timing Process *without* the protection of accessProtect, in the case of the VM signaling the timingSemaphore. If the VM signals the timingSemaphore, #handleTimerEvent: could run in the middle of #saveResumptionTimes. If some Delay expires because of that timer event, our Delay could move from being the first suspended delay to being the active delay. If that happens after we've adjusted the active delay, but before we've processed the suspended delays, that Delay will not get adjusted, and will show the symptoms that Mariano is seeing.

Also, I'm not sure how the Heap that holds the suspendedDelays will react to being modified in the middle of an enumeration. That might open a larger window for the problem.

Regards,

-Martin




--

crash.dmp (49K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Mariano Martinez Peck
Ben, for the record, I am using DelayMillisecondScheduler for a day and a half and so far no problem. 


On Thu, May 19, 2016 at 9:19 AM, Mariano Martinez Peck <[hidden email]> wrote:


On Wed, May 18, 2016 at 9:49 PM, Martin McClure <[hidden email]> wrote:
On 05/18/2016 03:17 PM, Martin McClure wrote:
On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
Hi guys,

I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot explain how this could happened but it does, and it happened to me a couple of times (but not fully reproducible).


Hmm. The schedulerResumptionTime is, somehow, being (approximately) doubled. It's not clear how that can happen, but I'll look a little more.


Mario, is there any chance that you might be saving the image during one of these Delays?


This one smells like a race condition, and I think I see something that *might* explain it. But I don't have any more time to spend on this one, so I'll leave the rest to someone else. I hope this is helpful:

The only way I immediately see for the schedulerResumptionTime to become approximately doubled is if the Delay's resumption time is adjusted by #restoreResumptionTimes without previously having been adjusted by #saveResumptionTimes.

The only time either of those are sent, that I can see, is on saving the image. Both are normally sent, (save before the snapshot, restore afterwards), but there may be a hole there.


Martin, first off, thanks for the research!!!

Now....your email made me remember something: I did get VM crash when saving the image a couple of times. The VM crashed when saving the image. If I re-opened the image, it looks like if the image was indeed saved (so the snapshot primitive itself did work), but I suspect not all shutdown code could have been run correctly. 

The VM crash looks like the FreeTypeFace >> pvtDestroyHandle  which, as far as I know, it's a "known crash" (I attach crash dump). From what I can see, if I follow all the stack, the crash starts from the WeakArray >> startUp: .
That means that...depending on the order of the startup list...the Scheduler may not have been run after the crash.

Now.... WeakArray initialization does:

SessionManager default
registerSystemClassNamed: self name.
While... 

Delay class >> startUp "Restart active delay, if any, when resuming a snapshot." Scheduler startUp.  

And the Delay registration is 

SessionManager default 
registerSystemClassNamed: self name 
atPriority: 20.

So...that seems correct...

I can verify this by:

SessionManager default systemCategory prioritizedList 

Anyway...not sure if this adds something, but just wanted to note this. 

 
#saveResumptionTimes is only sent (by this scheduler class) when the accessProtect semaphore is held, but #handleTimerEvent: is executed in the timing Process *without* the protection of accessProtect, in the case of the VM signaling the timingSemaphore. If the VM signals the timingSemaphore, #handleTimerEvent: could run in the middle of #saveResumptionTimes. If some Delay expires because of that timer event, our Delay could move from being the first suspended delay to being the active delay. If that happens after we've adjusted the active delay, but before we've processed the suspended delays, that Delay will not get adjusted, and will show the symptoms that Mariano is seeing.

Also, I'm not sure how the Heap that holds the suspendedDelays will react to being modified in the middle of an enumeration. That might open a larger window for the problem.

Regards,

-Martin




--



--
Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Ben Coman
On Sat, May 21, 2016 at 12:05 AM, Mariano Martinez Peck
<[hidden email]> wrote:
> Ben, for the record, I am using DelayMillisecondScheduler for a day and a
> half and so far no problem.

Cool. Thats why I left it there. I hope to soon have something for you
to try with the newer design. Thanks for the update.

cheers -ben

> On Thu, May 19, 2016 at 9:19 AM, Mariano Martinez Peck
> <[hidden email]> wrote:
>>
>>
>>
>> On Wed, May 18, 2016 at 9:49 PM, Martin McClure <[hidden email]>
>> wrote:
>>>
>>> On 05/18/2016 03:17 PM, Martin McClure wrote:
>>>>
>>>> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>>>>> explain how this could happened but it does, and it happened to me a couple
>>>>> of times (but not fully reproducible).
>>>>>
>>>>
>>>> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>>>> doubled. It's not clear how that can happen, but I'll look a little more.
>>>>
>>>
>>> Mario, is there any chance that you might be saving the image during one
>>> of these Delays?
>>>
>>>
>>> This one smells like a race condition, and I think I see something that
>>> *might* explain it. But I don't have any more time to spend on this one, so
>>> I'll leave the rest to someone else. I hope this is helpful:
>>>
>>> The only way I immediately see for the schedulerResumptionTime to become
>>> approximately doubled is if the Delay's resumption time is adjusted by
>>> #restoreResumptionTimes without previously having been adjusted by
>>> #saveResumptionTimes.
>>>
>>> The only time either of those are sent, that I can see, is on saving the
>>> image. Both are normally sent, (save before the snapshot, restore
>>> afterwards), but there may be a hole there.
>>>
>>
>> Martin, first off, thanks for the research!!!
>>
>> Now....your email made me remember something: I did get VM crash when
>> saving the image a couple of times. The VM crashed when saving the image. If
>> I re-opened the image, it looks like if the image was indeed saved (so the
>> snapshot primitive itself did work), but I suspect not all shutdown code
>> could have been run correctly.
>>
>> The VM crash looks like the FreeTypeFace >> pvtDestroyHandle  which, as
>> far as I know, it's a "known crash" (I attach crash dump). From what I can
>> see, if I follow all the stack, the crash starts from the WeakArray >>
>> startUp: .
>> That means that...depending on the order of the startup list...the
>> Scheduler may not have been run after the crash.
>>
>> Now.... WeakArray initialization does:
>>
>> SessionManager default
>> registerSystemClassNamed: self name.
>> While...
>>
>> Delay class >> startUp "Restart active delay, if any, when resuming a
>> snapshot." Scheduler startUp.
>>
>> And the Delay registration is
>>
>> SessionManager default
>> registerSystemClassNamed: self name
>> atPriority: 20.
>>
>> So...that seems correct...
>>
>> I can verify this by:
>>
>> SessionManager default systemCategory prioritizedList
>>
>> Anyway...not sure if this adds something, but just wanted to note this.
>>
>>
>>>
>>> #saveResumptionTimes is only sent (by this scheduler class) when the
>>> accessProtect semaphore is held, but #handleTimerEvent: is executed in the
>>> timing Process *without* the protection of accessProtect, in the case of the
>>> VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
>>> #handleTimerEvent: could run in the middle of #saveResumptionTimes. If some
>>> Delay expires because of that timer event, our Delay could move from being
>>> the first suspended delay to being the active delay. If that happens after
>>> we've adjusted the active delay, but before we've processed the suspended
>>> delays, that Delay will not get adjusted, and will show the symptoms that
>>> Mariano is seeing.
>>>
>>> Also, I'm not sure how the Heap that holds the suspendedDelays will react
>>> to being modified in the middle of an enumeration. That might open a larger
>>> window for the problem.
>>>
>>> Regards,
>>>
>>> -Martin
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com

Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Mariano Martinez Peck
Hi Ben, 

I opened this issue for tracking the problem and not forget:  https://pharo.fogbugz.com/f/cases/18359/Problem-with-DelayExperimentalSpinScheduler-and-delay

Cheers,

On Fri, May 20, 2016 at 2:58 PM, Ben Coman <[hidden email]> wrote:
On Sat, May 21, 2016 at 12:05 AM, Mariano Martinez Peck
<[hidden email]> wrote:
> Ben, for the record, I am using DelayMillisecondScheduler for a day and a
> half and so far no problem.

Cool. Thats why I left it there. I hope to soon have something for you
to try with the newer design. Thanks for the update.

cheers -ben

> On Thu, May 19, 2016 at 9:19 AM, Mariano Martinez Peck
> <[hidden email]> wrote:
>>
>>
>>
>> On Wed, May 18, 2016 at 9:49 PM, Martin McClure <[hidden email]>
>> wrote:
>>>
>>> On 05/18/2016 03:17 PM, Martin McClure wrote:
>>>>
>>>> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>>>>> explain how this could happened but it does, and it happened to me a couple
>>>>> of times (but not fully reproducible).
>>>>>
>>>>
>>>> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>>>> doubled. It's not clear how that can happen, but I'll look a little more.
>>>>
>>>
>>> Mario, is there any chance that you might be saving the image during one
>>> of these Delays?
>>>
>>>
>>> This one smells like a race condition, and I think I see something that
>>> *might* explain it. But I don't have any more time to spend on this one, so
>>> I'll leave the rest to someone else. I hope this is helpful:
>>>
>>> The only way I immediately see for the schedulerResumptionTime to become
>>> approximately doubled is if the Delay's resumption time is adjusted by
>>> #restoreResumptionTimes without previously having been adjusted by
>>> #saveResumptionTimes.
>>>
>>> The only time either of those are sent, that I can see, is on saving the
>>> image. Both are normally sent, (save before the snapshot, restore
>>> afterwards), but there may be a hole there.
>>>
>>
>> Martin, first off, thanks for the research!!!
>>
>> Now....your email made me remember something: I did get VM crash when
>> saving the image a couple of times. The VM crashed when saving the image. If
>> I re-opened the image, it looks like if the image was indeed saved (so the
>> snapshot primitive itself did work), but I suspect not all shutdown code
>> could have been run correctly.
>>
>> The VM crash looks like the FreeTypeFace >> pvtDestroyHandle  which, as
>> far as I know, it's a "known crash" (I attach crash dump). From what I can
>> see, if I follow all the stack, the crash starts from the WeakArray >>
>> startUp: .
>> That means that...depending on the order of the startup list...the
>> Scheduler may not have been run after the crash.
>>
>> Now.... WeakArray initialization does:
>>
>> SessionManager default
>> registerSystemClassNamed: self name.
>> While...
>>
>> Delay class >> startUp "Restart active delay, if any, when resuming a
>> snapshot." Scheduler startUp.
>>
>> And the Delay registration is
>>
>> SessionManager default
>> registerSystemClassNamed: self name
>> atPriority: 20.
>>
>> So...that seems correct...
>>
>> I can verify this by:
>>
>> SessionManager default systemCategory prioritizedList
>>
>> Anyway...not sure if this adds something, but just wanted to note this.
>>
>>
>>>
>>> #saveResumptionTimes is only sent (by this scheduler class) when the
>>> accessProtect semaphore is held, but #handleTimerEvent: is executed in the
>>> timing Process *without* the protection of accessProtect, in the case of the
>>> VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
>>> #handleTimerEvent: could run in the middle of #saveResumptionTimes. If some
>>> Delay expires because of that timer event, our Delay could move from being
>>> the first suspended delay to being the active delay. If that happens after
>>> we've adjusted the active delay, but before we've processed the suspended
>>> delays, that Delay will not get adjusted, and will show the symptoms that
>>> Mariano is seeing.
>>>
>>> Also, I'm not sure how the Heap that holds the suspendedDelays will react
>>> to being modified in the middle of an enumeration. That might open a larger
>>> window for the problem.
>>>
>>> Regards,
>>>
>>> -Martin
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com




--
Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Gastón Dall' Oglio
Hi  Mariano.

Some time ago I had a similar problem (Windows, Linux did not happen) and the solution was "OSProcessAccessor initialize" to open the image.


Cheers

2016-05-30 10:31 GMT-03:00 Mariano Martinez Peck <[hidden email]>:
Hi Ben, 

I opened this issue for tracking the problem and not forget:  https://pharo.fogbugz.com/f/cases/18359/Problem-with-DelayExperimentalSpinScheduler-and-delay

Cheers,

On Fri, May 20, 2016 at 2:58 PM, Ben Coman <[hidden email]> wrote:
On Sat, May 21, 2016 at 12:05 AM, Mariano Martinez Peck
<[hidden email]> wrote:
> Ben, for the record, I am using DelayMillisecondScheduler for a day and a
> half and so far no problem.

Cool. Thats why I left it there. I hope to soon have something for you
to try with the newer design. Thanks for the update.

cheers -ben

> On Thu, May 19, 2016 at 9:19 AM, Mariano Martinez Peck
> <[hidden email]> wrote:
>>
>>
>>
>> On Wed, May 18, 2016 at 9:49 PM, Martin McClure <[hidden email]>
>> wrote:
>>>
>>> On 05/18/2016 03:17 PM, Martin McClure wrote:
>>>>
>>>> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>>>>> explain how this could happened but it does, and it happened to me a couple
>>>>> of times (but not fully reproducible).
>>>>>
>>>>
>>>> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>>>> doubled. It's not clear how that can happen, but I'll look a little more.
>>>>
>>>
>>> Mario, is there any chance that you might be saving the image during one
>>> of these Delays?
>>>
>>>
>>> This one smells like a race condition, and I think I see something that
>>> *might* explain it. But I don't have any more time to spend on this one, so
>>> I'll leave the rest to someone else. I hope this is helpful:
>>>
>>> The only way I immediately see for the schedulerResumptionTime to become
>>> approximately doubled is if the Delay's resumption time is adjusted by
>>> #restoreResumptionTimes without previously having been adjusted by
>>> #saveResumptionTimes.
>>>
>>> The only time either of those are sent, that I can see, is on saving the
>>> image. Both are normally sent, (save before the snapshot, restore
>>> afterwards), but there may be a hole there.
>>>
>>
>> Martin, first off, thanks for the research!!!
>>
>> Now....your email made me remember something: I did get VM crash when
>> saving the image a couple of times. The VM crashed when saving the image. If
>> I re-opened the image, it looks like if the image was indeed saved (so the
>> snapshot primitive itself did work), but I suspect not all shutdown code
>> could have been run correctly.
>>
>> The VM crash looks like the FreeTypeFace >> pvtDestroyHandle  which, as
>> far as I know, it's a "known crash" (I attach crash dump). From what I can
>> see, if I follow all the stack, the crash starts from the WeakArray >>
>> startUp: .
>> That means that...depending on the order of the startup list...the
>> Scheduler may not have been run after the crash.
>>
>> Now.... WeakArray initialization does:
>>
>> SessionManager default
>> registerSystemClassNamed: self name.
>> While...
>>
>> Delay class >> startUp "Restart active delay, if any, when resuming a
>> snapshot." Scheduler startUp.
>>
>> And the Delay registration is
>>
>> SessionManager default
>> registerSystemClassNamed: self name
>> atPriority: 20.
>>
>> So...that seems correct...
>>
>> I can verify this by:
>>
>> SessionManager default systemCategory prioritizedList
>>
>> Anyway...not sure if this adds something, but just wanted to note this.
>>
>>
>>>
>>> #saveResumptionTimes is only sent (by this scheduler class) when the
>>> accessProtect semaphore is held, but #handleTimerEvent: is executed in the
>>> timing Process *without* the protection of accessProtect, in the case of the
>>> VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
>>> #handleTimerEvent: could run in the middle of #saveResumptionTimes. If some
>>> Delay expires because of that timer event, our Delay could move from being
>>> the first suspended delay to being the active delay. If that happens after
>>> we've adjusted the active delay, but before we've processed the suspended
>>> delays, that Delay will not get adjusted, and will show the symptoms that
>>> Mariano is seeing.
>>>
>>> Also, I'm not sure how the Heap that holds the suspendedDelays will react
>>> to being modified in the middle of an enumeration. That might open a larger
>>> window for the problem.
>>>
>>> Regards,
>>>
>>> -Martin
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com




--

Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Mariano Martinez Peck
In reply to this post by Ben Coman


On Wed, May 18, 2016 at 11:39 PM, Ben Coman <[hidden email]> wrote:
On Thu, May 19, 2016 at 8:49 AM, Martin McClure <[hidden email]> wrote:
> On 05/18/2016 03:17 PM, Martin McClure wrote:
>>
>> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>>>
>>> Hi guys,
>>>
>>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>>> explain how this could happened but it does, and it happened to me a couple
>>> of times (but not fully reproducible).
>>>
>>
>> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>> doubled. It's not clear how that can happen, but I'll look a little more.
>>
>
> Mario, is there any chance that you might be saving the image during one of
> these Delays?
>
>
> This one smells like a race condition, and I think I see something that
> *might* explain it. But I don't have any more time to spend on this one, so
> I'll leave the rest to someone else. I hope this is helpful:
>
> The only way I immediately see for the schedulerResumptionTime to become
> approximately doubled is if the Delay's resumption time is adjusted by
> #restoreResumptionTimes without previously having been adjusted by
> #saveResumptionTimes.
>
> The only time either of those are sent, that I can see, is on saving the
> image. Both are normally sent, (save before the snapshot, restore
> afterwards), but there may be a hole there.
>
> #saveResumptionTimes is only sent (by this scheduler class) when the
> accessProtect semaphore is held, but #handleTimerEvent: is executed in the
> timing Process *without* the protection of accessProtect, in the case of the
> VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
> #handleTimerEvent: could run in the middle of #saveResumptionTimes. If some
> Delay expires because of that timer event, our Delay could move from being
> the first suspended delay to being the active delay. If that happens after
> we've adjusted the active delay, but before we've processed the suspended
> delays, that Delay will not get adjusted, and will show the symptoms that
> Mariano is seeing.

A quick experiment to test this might be in shutDown/#startUp trying...
    [ self saveResumptionTimes ] valueAt: Processor timingPriority
    [ self resumeResumptionTimes ] valueAt: Processor timingPriority

>
> Also, I'm not sure how the Heap that holds the suspendedDelays will react to
> being modified in the middle of an enumeration. That might open a larger
> window for the problem.
>
> Regards,
>
> -Martin
>

Even if not directly related to Mariano's problem, I agree with your
general assessment.  I'm not comfortable with the way that
#save/#restoreResumptionTimes (which manipulate suspendedDelays) are
called from user priority code via #shutDown/#startUp.  Per the
original code**, accessProtect can't be used inside the timing
priority #handleTimerEvent since accessProtect is held by the user
priority #schedule when it uses "timingSemaphore signal" to invoke
invokes #handleTimerEvent.  accessProtect never protected
timingPriority manipulation of suspendedDelays by #handleTimerEvent,
nor expired delays waking up.  But ahhh... the disabling of
accessProtect previously prevented new delays being scheduled between
a #save and #restore. If a new delay is scheduled after the #save,
when it is #restore'd its resumptionTime would be wrong.

Waiting in the wings for Pharo 6 I have changes that should help:
* have #save/#restoreResumptionTimes *only* called from timing
priority event loop (i.e. #handleTimerEvent)
* shutDown/startUp suspends/resumes the timing priority event loop,
instead of trying to block signals to timingSemaphore

I haven't touched it for a few months so I'll need to chase it up to
provide a preview.


Mariano, can you try DelayMillisecondScheduler (which however is
missing some fixes for other issues).



Ben,

Did you find a chance to do something else with this? I am still finding lookups every in a while even with the DelayMillisecondScheduler  :(

Thanks!
 



--
Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Ben Coman
On Wed, Jul 13, 2016 at 12:30 AM, Mariano Martinez Peck
<[hidden email]> wrote:

>
>
> On Wed, May 18, 2016 at 11:39 PM, Ben Coman <[hidden email]> wrote:
>>
>> On Thu, May 19, 2016 at 8:49 AM, Martin McClure <[hidden email]>
>> wrote:
>> > On 05/18/2016 03:17 PM, Martin McClure wrote:
>> >>
>> >> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>> >>>
>> >>> Hi guys,
>> >>>
>> >>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>> >>> explain how this could happened but it does, and it happened to me a
>> >>> couple
>> >>> of times (but not fully reproducible).
>> >>>
>> >>
>> >> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>> >> doubled. It's not clear how that can happen, but I'll look a little
>> >> more.
>> >>
>> >
>> > Mario, is there any chance that you might be saving the image during one
>> > of
>> > these Delays?
>> >
>> >
>> > This one smells like a race condition, and I think I see something that
>> > *might* explain it. But I don't have any more time to spend on this one,
>> > so
>> > I'll leave the rest to someone else. I hope this is helpful:
>> >
>> > The only way I immediately see for the schedulerResumptionTime to become
>> > approximately doubled is if the Delay's resumption time is adjusted by
>> > #restoreResumptionTimes without previously having been adjusted by
>> > #saveResumptionTimes.
>> >
>> > The only time either of those are sent, that I can see, is on saving the
>> > image. Both are normally sent, (save before the snapshot, restore
>> > afterwards), but there may be a hole there.
>> >
>> > #saveResumptionTimes is only sent (by this scheduler class) when the
>> > accessProtect semaphore is held, but #handleTimerEvent: is executed in
>> > the
>> > timing Process *without* the protection of accessProtect, in the case of
>> > the
>> > VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
>> > #handleTimerEvent: could run in the middle of #saveResumptionTimes. If
>> > some
>> > Delay expires because of that timer event, our Delay could move from
>> > being
>> > the first suspended delay to being the active delay. If that happens
>> > after
>> > we've adjusted the active delay, but before we've processed the
>> > suspended
>> > delays, that Delay will not get adjusted, and will show the symptoms
>> > that
>> > Mariano is seeing.
>>
>> A quick experiment to test this might be in shutDown/#startUp trying...
>>     [ self saveResumptionTimes ] valueAt: Processor timingPriority
>>     [ self resumeResumptionTimes ] valueAt: Processor timingPriority
>>
>> >
>> > Also, I'm not sure how the Heap that holds the suspendedDelays will
>> > react to
>> > being modified in the middle of an enumeration. That might open a larger
>> > window for the problem.
>> >
>> > Regards,
>> >
>> > -Martin
>> >
>>
>> Even if not directly related to Mariano's problem, I agree with your
>> general assessment.  I'm not comfortable with the way that
>> #save/#restoreResumptionTimes (which manipulate suspendedDelays) are
>> called from user priority code via #shutDown/#startUp.  Per the
>> original code**, accessProtect can't be used inside the timing
>> priority #handleTimerEvent since accessProtect is held by the user
>> priority #schedule when it uses "timingSemaphore signal" to invoke
>> invokes #handleTimerEvent.  accessProtect never protected
>> timingPriority manipulation of suspendedDelays by #handleTimerEvent,
>> nor expired delays waking up.  But ahhh... the disabling of
>> accessProtect previously prevented new delays being scheduled between
>> a #save and #restore. If a new delay is scheduled after the #save,
>> when it is #restore'd its resumptionTime would be wrong.
>>
>> Waiting in the wings for Pharo 6 I have changes that should help:
>> * have #save/#restoreResumptionTimes *only* called from timing
>> priority event loop (i.e. #handleTimerEvent)
>> * shutDown/startUp suspends/resumes the timing priority event loop,
>> instead of trying to block signals to timingSemaphore
>>
>> I haven't touched it for a few months so I'll need to chase it up to
>> provide a preview.
>>
>>
>>
>> Mariano, can you try DelayMillisecondScheduler (which however is
>> missing some fixes for other issues).
>>
>
>
> Ben,
>
> Did you find a chance to do something else with this? I am still finding
> lookups every in a while even with the DelayMillisecondScheduler  :(

Hi Mariano,

Sorry for the delay attending to this.  I got a bit twisted up on
whether I'm complicating things too much with my DelayScheduler
refactoring for Pharo 6 (I should just put it out for early review),
but then I had to get through a Netflix addiction to the Star Trek
prequel Enterprise, and family time over my kids school holidays.  I'm
back looking at it now - and taking a step back to see what is the
minimum change required to fix this.

Do you have some test code or a test image I could run to reproduce
the problem?

Now I have one thing for you to try.  In the move of DelaySchedulerXxx
from milliseconds to microseconds, the code for dealing with clock
wrap-around was removed, but I wonder if it was also covering clock
jitter as a side effect.  Could you try again the SpinScheduler but
restore this code...

From handletTimerEvent...
  "Check for clock wrap-around."
  millisecondNowTick < activeDelayStartTime ifTrue: [
    "clock wrapped"
    self saveResumptionTimes.
    self restoreResumptionTimes ].
  activeDelayStartTime := millisecondNowTick.

From runTimerEventLoop...
  Time millisecondClockValue < millisecondNowTick
    ifTrue:[ timingSemaphore signal ]. "retry"

From startup...
   activeDelayStartTime := Time millisecondClockValue.

cheers -ben

Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Mariano Martinez Peck


On Sun, Jul 17, 2016 at 6:46 AM, Ben Coman <[hidden email]> wrote:
On Wed, Jul 13, 2016 at 12:30 AM, Mariano Martinez Peck
<[hidden email]> wrote:
>
>
> On Wed, May 18, 2016 at 11:39 PM, Ben Coman <[hidden email]> wrote:
>>
>> On Thu, May 19, 2016 at 8:49 AM, Martin McClure <[hidden email]>
>> wrote:
>> > On 05/18/2016 03:17 PM, Martin McClure wrote:
>> >>
>> >> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>> >>>
>> >>> Hi guys,
>> >>>
>> >>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>> >>> explain how this could happened but it does, and it happened to me a
>> >>> couple
>> >>> of times (but not fully reproducible).
>> >>>
>> >>
>> >> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>> >> doubled. It's not clear how that can happen, but I'll look a little
>> >> more.
>> >>
>> >
>> > Mario, is there any chance that you might be saving the image during one
>> > of
>> > these Delays?
>> >
>> >
>> > This one smells like a race condition, and I think I see something that
>> > *might* explain it. But I don't have any more time to spend on this one,
>> > so
>> > I'll leave the rest to someone else. I hope this is helpful:
>> >
>> > The only way I immediately see for the schedulerResumptionTime to become
>> > approximately doubled is if the Delay's resumption time is adjusted by
>> > #restoreResumptionTimes without previously having been adjusted by
>> > #saveResumptionTimes.
>> >
>> > The only time either of those are sent, that I can see, is on saving the
>> > image. Both are normally sent, (save before the snapshot, restore
>> > afterwards), but there may be a hole there.
>> >
>> > #saveResumptionTimes is only sent (by this scheduler class) when the
>> > accessProtect semaphore is held, but #handleTimerEvent: is executed in
>> > the
>> > timing Process *without* the protection of accessProtect, in the case of
>> > the
>> > VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
>> > #handleTimerEvent: could run in the middle of #saveResumptionTimes. If
>> > some
>> > Delay expires because of that timer event, our Delay could move from
>> > being
>> > the first suspended delay to being the active delay. If that happens
>> > after
>> > we've adjusted the active delay, but before we've processed the
>> > suspended
>> > delays, that Delay will not get adjusted, and will show the symptoms
>> > that
>> > Mariano is seeing.
>>
>> A quick experiment to test this might be in shutDown/#startUp trying...
>>     [ self saveResumptionTimes ] valueAt: Processor timingPriority
>>     [ self resumeResumptionTimes ] valueAt: Processor timingPriority
>>
>> >
>> > Also, I'm not sure how the Heap that holds the suspendedDelays will
>> > react to
>> > being modified in the middle of an enumeration. That might open a larger
>> > window for the problem.
>> >
>> > Regards,
>> >
>> > -Martin
>> >
>>
>> Even if not directly related to Mariano's problem, I agree with your
>> general assessment.  I'm not comfortable with the way that
>> #save/#restoreResumptionTimes (which manipulate suspendedDelays) are
>> called from user priority code via #shutDown/#startUp.  Per the
>> original code**, accessProtect can't be used inside the timing
>> priority #handleTimerEvent since accessProtect is held by the user
>> priority #schedule when it uses "timingSemaphore signal" to invoke
>> invokes #handleTimerEvent.  accessProtect never protected
>> timingPriority manipulation of suspendedDelays by #handleTimerEvent,
>> nor expired delays waking up.  But ahhh... the disabling of
>> accessProtect previously prevented new delays being scheduled between
>> a #save and #restore. If a new delay is scheduled after the #save,
>> when it is #restore'd its resumptionTime would be wrong.
>>
>> Waiting in the wings for Pharo 6 I have changes that should help:
>> * have #save/#restoreResumptionTimes *only* called from timing
>> priority event loop (i.e. #handleTimerEvent)
>> * shutDown/startUp suspends/resumes the timing priority event loop,
>> instead of trying to block signals to timingSemaphore
>>
>> I haven't touched it for a few months so I'll need to chase it up to
>> provide a preview.
>>
>>
>>
>> Mariano, can you try DelayMillisecondScheduler (which however is
>> missing some fixes for other issues).
>>
>
>
> Ben,
>
> Did you find a chance to do something else with this? I am still finding
> lookups every in a while even with the DelayMillisecondScheduler  :(

Hi Mariano,

Sorry for the delay attending to this.  I got a bit twisted up on
whether I'm complicating things too much with my DelayScheduler
refactoring for Pharo 6 (I should just put it out for early review),
but then I had to get through a Netflix addiction to the Star Trek
prequel Enterprise, and family time over my kids school holidays.  I'm
back looking at it now - and taking a step back to see what is the
minimum change required to fix this.


hahaha no problem.
 
Do you have some test code or a test image I could run to reproduce
the problem?


The problem is that I cannot reproduce it immediately. It takes me some days and likely some image save and start again until I reproduce it. 
 

Now I have one thing for you to try.  In the move of DelaySchedulerXxx
from milliseconds to microseconds, the code for dealing with clock
wrap-around was removed, but I wonder if it was also covering clock
jitter as a side effect.  Could you try again the SpinScheduler but
restore this code...

From handletTimerEvent...
  "Check for clock wrap-around."
  millisecondNowTick < activeDelayStartTime ifTrue: [
    "clock wrapped"
    self saveResumptionTimes.
    self restoreResumptionTimes ].
  activeDelayStartTime := millisecondNowTick.

From runTimerEventLoop...
  Time millisecondClockValue < millisecondNowTick
    ifTrue:[ timingSemaphore signal ]. "retry"

From startup...
   activeDelayStartTime := Time millisecondClockValue.


OK, I put back the code into the spin scheduler. Then I kept using milisecond one until I finished the changes. Then I saved image and switch to spin one: the image hungs. I cannot even interrupt it. I attach my modifications. 
Do they look correct? 

Thanks!



--

Kernel-MarianoMartinezPeck.2291.mcz (875K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Ben Coman


On Fri, Jul 22, 2016 at 4:42 AM, Mariano Martinez Peck <[hidden email]> wrote:


On Sun, Jul 17, 2016 at 6:46 AM, Ben Coman <[hidden email]> wrote:
 
Do you have some test code or a test image I could run to reproduce
the problem?


The problem is that I cannot reproduce it immediately. It takes me some days and likely some image save and start again until I reproduce it. 

Could you try 
How about something that invokes 
 
 

Now I have one thing for you to try.  In the move of DelaySchedulerXxx
from milliseconds to microseconds, the code for dealing with clock
wrap-around was removed, but I wonder if it was also covering clock
jitter as a side effect.  Could you try again the SpinScheduler but
restore this code...

From handletTimerEvent...
  "Check for clock wrap-around."
  millisecondNowTick < activeDelayStartTime ifTrue: [
    "clock wrapped"
    self saveResumptionTimes.
    self restoreResumptionTimes ].
  activeDelayStartTime := millisecondNowTick.

From runTimerEventLoop...
  Time millisecondClockValue < millisecondNowTick
    ifTrue:[ timingSemaphore signal ]. "retry"

From startup...
   activeDelayStartTime := Time millisecondClockValue.


OK, I put back the code into the spin scheduler. Then I kept using milisecond one until I finished the changes. Then I saved image and switch to spin one: the image hungs. I cannot even interrupt it. I attach my modifications. 
Do they look correct? 

I confirm that locked my image also. 

One change of milli to micro was missed at the bottom of runTimeEventLoop...
That is...
   Time millisecondClockValue < microsecondNowTick
to...
   Time microsecondClockValue < microsecondNowTick

(just a case of domestic blindness) [1]  :) :)

cheers -ben


P.S. A trick to debug delays is to bypass interCycleDelay by at the top of it putting...
    true ifTrue: [^self].


12