Smalltalk › Pharo › Pharo Smalltalk Developers

Problem with delay waiting (OSSubprocess code) on Pharo 5.0

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

25 messages Options

Mariano Martinez Peck

Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Hi guys,

I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot explain how this could happened but it does, and it happened to me a couple of times (but not fully reproducible).

In OSSubprocess I have this child watcher:

initializeChildWatcher

"This is a process which waits for the death of a child processes. Use SIGCHLD events rather than a Delay to poll."

| processSynchronizationDelay |

processSynchronizationDelay := Delay forMilliseconds: 20.

childWatcher ifNil: [

childWatcher := [[

"OSProcess authors suspected that there were various ways in which OS signals

could be missed under conditions of heavy load. For that reason, we use

#waitTimeoutMSecs: with the semaphore, so that if a signal is missed,

we time out and rescan occasionally anyway (#updateActiveChildrenAndNotifyDead

sends queryExitStatus which sends waitpid() )

self sigChldSemaphore waitTimeoutMSecs: 1000.

processSynchronizationDelay wait. "Avoids lost signals in heavy process switching"

self updateActiveChildrenAndNotifyDead.

] repeat] newProcess.

childWatcher resume.

"name selected to look reasonable in the process browser"

childWatcher name: ((ReadStream on: childWatcher hash asString) next: 5)

, ': the OSSubprocess child watcher'

]

The problem is that now above process is hung on the "processSynchronizationDelay wait.". At least that's what I can see from the process browser. That should not hung, of course, and should stop waiting after 20 ms.

If I inspect the Delay instance in question (doing the wait) I see that the print string is bad.... "a Delay(20 msecs; 3641032396485 msecs remaining)". WTF? 20ms and you remain 3641032396485 to wait??? I attach an screenshot of the Delay instance in case there could be something wrong.

Maybe it is wrong to keep and re-use same instance of Delay? (note it is outside the closure) (this was not a problem in the past).

Thanks in advance for any clue.

Mariano
http://marianopeck.wordpress.com

Screen Shot 2016-05-18 at 12.47.14 PM.png (892K) Download Attachment

Ben Coman

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

I don't have time to look at this properly until tomorrow.
In the meantime, could you just post the result of...

Delay delaySchedulerClass
it should be "DelayExperimentalSpinScheduler" ***

Also could you try a few of the other options from
World > System > Settings > System ...

***whoops, the Experimental tag should have been dropped for Pharo 5
Release. Its been fine for months (unless this is a fault :) ]

On Wed, May 18, 2016 at 11:49 PM, Mariano Martinez Peck
<[hidden email]> wrote:

> Hi guys,
>
> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot explain
> how this could happened but it does, and it happened to me a couple of times
> (but not fully reproducible).
>
> In OSSubprocess I have this child watcher:
>
> initializeChildWatcher
> "This is a process which waits for the death of a child processes. Use
> SIGCHLD events rather than a Delay to poll."
>
> | processSynchronizationDelay |
> processSynchronizationDelay := Delay forMilliseconds: 20.
> childWatcher ifNil: [
> childWatcher := [[
> "OSProcess authors suspected that there were various ways in which OS
> signals
> could be missed under conditions of heavy load. For that reason, we use
> #waitTimeoutMSecs: with the semaphore, so that if a signal is missed,
> we time out and rescan occasionally anyway
> (#updateActiveChildrenAndNotifyDead
> sends queryExitStatus which sends waitpid() )
> "
> self sigChldSemaphore waitTimeoutMSecs: 1000.
> processSynchronizationDelay wait. "Avoids lost signals in heavy process
> switching"
> self updateActiveChildrenAndNotifyDead.
> ] repeat] newProcess.
>
> childWatcher resume.
> "name selected to look reasonable in the process browser"
> childWatcher name: ((ReadStream on: childWatcher hash asString) next: 5)
> , ': the OSSubprocess child watcher'
> ]
>
>
> The problem is that now above process is hung on the
> "processSynchronizationDelay wait.". At least that's what I can see from the
> process browser. That should not hung, of course, and should stop waiting
> after 20 ms.
>
> If I inspect the Delay instance in question (doing the wait) I see that the
> print string is bad.... "a Delay(20 msecs; 3641032396485 msecs remaining)".
> WTF? 20ms and you remain 3641032396485 to wait??? I attach an screenshot of
> the Delay instance in case there could be something wrong.

>
> Maybe it is wrong to keep and re-use same instance of Delay? (note it is
> outside the closure) (this was not a problem in the past).

Not sure. I've heard others say Delays should only be used once. To
experiment, could you try creating a delay each time.

cheers -ben

Mariano Martinez Peck

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

On Wed, May 18, 2016 at 2:10 PM, Ben Coman <[hidden email]> wrote:

I don't have time to look at this properly until tomorrow.
In the meantime, could you just post the result of...

Delay delaySchedulerClass
it should be "DelayExperimentalSpinScheduler" ***

Yes, it is.

Also could you try a few of the other options from
World > System > Settings > System ...

Do you know which is the "Safest" to use? Or at least which was used in say, 4.0?

***whoops, the Experimental tag should have been dropped for Pharo 5
Release. Its been fine for months (unless this is a fault :) ]

Yes, I found that strange too.

>
> Maybe it is wrong to keep and re-use same instance of Delay? (note it is
> outside the closure) (this was not a problem in the past).

Not sure. I've heard others say Delays should only be used once. To
experiment, could you try creating a delay each time.

Ok, I will try this. But again, it would take some some time to test and it happens to me every in while only.

Thanks!!

Mariano
http://marianopeck.wordpress.com

Martin McClure-2

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

In reply to this post by Mariano Martinez Peck

On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
> Hi guys,
>
> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
> explain how this could happened but it does, and it happened to me a
> couple of times (but not fully reproducible).
>

Hmm. The schedulerResumptionTime is, somehow, being (approximately)
doubled. It's not clear how that can happen, but I'll look a little more.

-Martin

Paul DeBruicker

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

In reply to this post by Mariano Martinez Peck

the reason I asked about OSProcess in Pharo 5 was that

OSProcess waitForCommand: 'ls'

never returned.

Maybe its another instance of this problem. And it happens everytime.

There is a 50ms delay in the #waitForCommand: method

hope this helps

Paul

Mariano Martinez Peck wrote

On Wed, May 18, 2016 at 2:10 PM, Ben Coman <[hidden email]> wrote:

> I don't have time to look at this properly until tomorrow.
> In the meantime, could you just post the result of...
>
> Delay delaySchedulerClass
> it should be "DelayExperimentalSpinScheduler" ***
>
>
Yes, it is.

> Also could you try a few of the other options from
> World > System > Settings > System ...
>
>
Do you know which is the "Safest" to use? Or at least which was used in
say, 4.0?

>
> ***whoops, the Experimental tag should have been dropped for Pharo 5
> Release. Its been fine for months (unless this is a fault :) ]
>
>
Yes, I found that strange too.

> >
> > Maybe it is wrong to keep and re-use same instance of Delay? (note it is
> > outside the closure) (this was not a problem in the past).
>
> Not sure. I've heard others say Delays should only be used once. To
> experiment, could you try creating a delay each time.
>
>
Ok, I will try this. But again, it would take some some time to test and it
happens to me every in while only.

Thanks!!

--
Mariano
http://marianopeck.wordpress.com

Martin McClure-2

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

On 05/18/2016 02:46 PM, Paul DeBruicker wrote:

> the reason I asked about OSProcess in Pharo 5 was that
>
> OSProcess waitForCommand: 'ls'
>
>
> never returned.
>
>
> Maybe its another instance of this problem. And it happens everytime.
>
> There is a 50ms delay in the #waitForCommand: method
>

I'm currently having a problem with an OSProcess of a chmod never
returning. And that's in Pharo 4! Never seen a problem there before.
Investigating.

Regards,

-Martin

Mariano Martinez Peck

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

In reply to this post by Paul DeBruicker

On Wed, May 18, 2016 at 6:46 PM, Paul DeBruicker <[hidden email]> wrote:

the reason I asked about OSProcess in Pharo 5 was that

OSProcess waitForCommand: 'ls'

never returned.

Maybe its another instance of this problem. And it happens everytime.

There is a 50ms delay in the #waitForCommand: method

It looks quite similar indeed. And... both, OSProcess and OSSubprocess have a similar child watcher (I based mine in OSProcess one).

Paul could you inspect the child process from the process browser tool, do an inspect, and check the suspended process? Check the wait time to see if it is 50ms or 20ms etc...that way we can see which part is the "hunging".

hope this helps

Paul

Mariano Martinez Peck wrote
> On Wed, May 18, 2016 at 2:10 PM, Ben Coman <

> btc@

> > wrote:
>
>> I don't have time to look at this properly until tomorrow.
>> In the meantime, could you just post the result of...
>>
>> Delay delaySchedulerClass
>> it should be "DelayExperimentalSpinScheduler" ***
>>
>>
> Yes, it is.
>
>
>> Also could you try a few of the other options from
>> World > System > Settings > System ...
>>
>>
> Do you know which is the "Safest" to use? Or at least which was used in
> say, 4.0?
>
>
>
>
>>
>> ***whoops, the Experimental tag should have been dropped for Pharo 5
>> Release. Its been fine for months (unless this is a fault :) ]
>>
>>
> Yes, I found that strange too.
>
>
>> >
>> > Maybe it is wrong to keep and re-use same instance of Delay? (note it
>> is
>> > outside the closure) (this was not a problem in the past).
>>
>> Not sure. I've heard others say Delays should only be used once. To
>> experiment, could you try creating a delay each time.
>>
>>
> Ok, I will try this. But again, it would take some some time to test and
> it
> happens to me every in while only.
>
> Thanks!!
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com

--
View this message in context: http://forum.world.st/Problem-with-delay-waiting-OSSubprocess-code-on-Pharo-5-0-tp4895780p4895890.html
Sent from the Pharo Smalltalk Developers mailing list archive at Nabble.com.

Mariano
http://marianopeck.wordpress.com

Mariano Martinez Peck

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

In reply to this post by Martin McClure-2

On Wed, May 18, 2016 at 7:28 PM, Martin McClure <[hidden email]> wrote:

On 05/18/2016 02:46 PM, Paul DeBruicker wrote:

the reason I asked about OSProcess in Pharo 5 was that

OSProcess waitForCommand: 'ls'

never returned.

Maybe its another instance of this problem. And it happens everytime.

There is a 50ms delay in the #waitForCommand: method

I'm currently having a problem with an OSProcess of a chmod never returning. And that's in Pharo 4! Never seen a problem there before. Investigating.

Thanks Martin. I cannot tell about 4.0 as OSSubprocess only works in 5.0...

What I know is that in 5.0 there have been changes in the wait delay (that's why Ben answered).

Regards,

-Martin

Mariano
http://marianopeck.wordpress.com

Martin McClure-2

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

In reply to this post by Martin McClure-2

On 05/18/2016 03:17 PM, Martin McClure wrote:

> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>> Hi guys,
>>
>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>> explain how this could happened but it does, and it happened to me a
>> couple of times (but not fully reproducible).
>>
>
> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
> doubled. It's not clear how that can happen, but I'll look a little more.
>

Mario, is there any chance that you might be saving the image during one
of these Delays?

This one smells like a race condition, and I think I see something that
*might* explain it. But I don't have any more time to spend on this one,
so I'll leave the rest to someone else. I hope this is helpful:

The only way I immediately see for the schedulerResumptionTime to become
approximately doubled is if the Delay's resumption time is adjusted by
#restoreResumptionTimes without previously having been adjusted by
#saveResumptionTimes.

The only time either of those are sent, that I can see, is on saving the
image. Both are normally sent, (save before the snapshot, restore
afterwards), but there may be a hole there.

#saveResumptionTimes is only sent (by this scheduler class) when the
accessProtect semaphore is held, but #handleTimerEvent: is executed in
the timing Process *without* the protection of accessProtect, in the
case of the VM signaling the timingSemaphore. If the VM signals the
timingSemaphore, #handleTimerEvent: could run in the middle of
#saveResumptionTimes. If some Delay expires because of that timer event,
our Delay could move from being the first suspended delay to being the
active delay. If that happens after we've adjusted the active delay, but
before we've processed the suspended delays, that Delay will not get
adjusted, and will show the symptoms that Mariano is seeing.

Also, I'm not sure how the Heap that holds the suspendedDelays will
react to being modified in the middle of an enumeration. That might open
a larger window for the problem.

Regards,

-Martin

Ben Coman

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

On Thu, May 19, 2016 at 8:49 AM, Martin McClure <[hidden email]> wrote:

> On 05/18/2016 03:17 PM, Martin McClure wrote:
>>
>> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>>>
>>> Hi guys,
>>>
>>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>>> explain how this could happened but it does, and it happened to me a couple
>>> of times (but not fully reproducible).
>>>
>>
>> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>> doubled. It's not clear how that can happen, but I'll look a little more.
>>
>
> Mario, is there any chance that you might be saving the image during one of
> these Delays?
>
>
> This one smells like a race condition, and I think I see something that
> *might* explain it. But I don't have any more time to spend on this one, so
> I'll leave the rest to someone else. I hope this is helpful:
>
> The only way I immediately see for the schedulerResumptionTime to become
> approximately doubled is if the Delay's resumption time is adjusted by
> #restoreResumptionTimes without previously having been adjusted by
> #saveResumptionTimes.
>
> The only time either of those are sent, that I can see, is on saving the
> image. Both are normally sent, (save before the snapshot, restore
> afterwards), but there may be a hole there.
>
> #saveResumptionTimes is only sent (by this scheduler class) when the
> accessProtect semaphore is held, but #handleTimerEvent: is executed in the
> timing Process *without* the protection of accessProtect, in the case of the
> VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
> #handleTimerEvent: could run in the middle of #saveResumptionTimes. If some
> Delay expires because of that timer event, our Delay could move from being
> the first suspended delay to being the active delay. If that happens after
> we've adjusted the active delay, but before we've processed the suspended
> delays, that Delay will not get adjusted, and will show the symptoms that
> Mariano is seeing.

A quick experiment to test this might be in shutDown/#startUp trying...
[ self saveResumptionTimes ] valueAt: Processor timingPriority
[ self resumeResumptionTimes ] valueAt: Processor timingPriority

>
> Also, I'm not sure how the Heap that holds the suspendedDelays will react to
> being modified in the middle of an enumeration. That might open a larger
> window for the problem.
>
> Regards,
>
> -Martin
>

Even if not directly related to Mariano's problem, I agree with your
general assessment. I'm not comfortable with the way that
#save/#restoreResumptionTimes (which manipulate suspendedDelays) are
called from user priority code via #shutDown/#startUp. Per the
original code**, accessProtect can't be used inside the timing
priority #handleTimerEvent since accessProtect is held by the user
priority #schedule when it uses "timingSemaphore signal" to invoke
invokes #handleTimerEvent. accessProtect never protected
timingPriority manipulation of suspendedDelays by #handleTimerEvent,
nor expired delays waking up. But ahhh... the disabling of
accessProtect previously prevented new delays being scheduled between
a #save and #restore. If a new delay is scheduled after the #save,
when it is #restore'd its resumptionTime would be wrong.

Waiting in the wings for Pharo 6 I have changes that should help:
* have #save/#restoreResumptionTimes *only* called from timing
priority event loop (i.e. #handleTimerEvent)
* shutDown/startUp suspends/resumes the timing priority event loop,
instead of trying to block signals to timingSemaphore

I haven't touched it for a few months so I'll need to chase it up to
provide a preview.

Mariano, can you try DelayMillisecondScheduler (which however is
missing some fixes for other issues).

cheers -ben

Martin McClure-2

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

On 05/18/2016 07:39 PM, Ben Coman wrote:

Waiting in the wings for Pharo 6 I have changes that should help:
* have #save/#restoreResumptionTimes *only* called from timing
priority event loop (i.e. #handleTimerEvent)
* shutDown/startUp suspends/resumes the timing priority event loop,
instead of trying to block signals to timingSemaphore

Yeah, that sounds better, thanks.

-Martin

Mariano Martinez Peck

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

In reply to this post by Martin McClure-2

On Wed, May 18, 2016 at 9:49 PM, Martin McClure <[hidden email]> wrote:

On 05/18/2016 03:17 PM, Martin McClure wrote:

On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:

Hi guys,

I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot explain how this could happened but it does, and it happened to me a couple of times (but not fully reproducible).

Hmm. The schedulerResumptionTime is, somehow, being (approximately) doubled. It's not clear how that can happen, but I'll look a little more.

Mario, is there any chance that you might be saving the image during one of these Delays?

This one smells like a race condition, and I think I see something that *might* explain it. But I don't have any more time to spend on this one, so I'll leave the rest to someone else. I hope this is helpful:

The only way I immediately see for the schedulerResumptionTime to become approximately doubled is if the Delay's resumption time is adjusted by #restoreResumptionTimes without previously having been adjusted by #saveResumptionTimes.

The only time either of those are sent, that I can see, is on saving the image. Both are normally sent, (save before the snapshot, restore afterwards), but there may be a hole there.

Martin, first off, thanks for the research!!!

Now....your email made me remember something: I did get VM crash when saving the image a couple of times. The VM crashed when saving the image. If I re-opened the image, it looks like if the image was indeed saved (so the snapshot primitive itself did work), but I suspect not all shutdown code could have been run correctly.

The VM crash looks like the FreeTypeFace >> pvtDestroyHandle which, as far as I know, it's a "known crash" (I attach crash dump). From what I can see, if I follow all the stack, the crash starts from the WeakArray >> startUp: .

That means that...depending on the order of the startup list...the Scheduler may not have been run after the crash.

Now.... WeakArray initialization does:

SessionManager default

registerSystemClassNamed: self name.

While...

Delay class >> startUp "Restart active delay, if any, when resuming a snapshot." Scheduler startUp.

And the Delay registration is

SessionManager default

registerSystemClassNamed: self name

atPriority: 20.

So...that seems correct...

I can verify this by:

SessionManager default systemCategory prioritizedList

Anyway...not sure if this adds something, but just wanted to note this.

#saveResumptionTimes is only sent (by this scheduler class) when the accessProtect semaphore is held, but #handleTimerEvent: is executed in the timing Process *without* the protection of accessProtect, in the case of the VM signaling the timingSemaphore. If the VM signals the timingSemaphore, #handleTimerEvent: could run in the middle of #saveResumptionTimes. If some Delay expires because of that timer event, our Delay could move from being the first suspended delay to being the active delay. If that happens after we've adjusted the active delay, but before we've processed the suspended delays, that Delay will not get adjusted, and will show the symptoms that Mariano is seeing.

Also, I'm not sure how the Heap that holds the suspendedDelays will react to being modified in the middle of an enumeration. That might open a larger window for the problem.

Regards,

-Martin

Mariano
http://marianopeck.wordpress.com

crash.dmp (49K) Download Attachment

Mariano Martinez Peck

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Ben, for the record, I am using DelayMillisecondScheduler for a day and a half and so far no problem.

On Thu, May 19, 2016 at 9:19 AM, Mariano Martinez Peck <[hidden email]> wrote:

On Wed, May 18, 2016 at 9:49 PM, Martin McClure <[hidden email]> wrote:
On 05/18/2016 03:17 PM, Martin McClure wrote:

On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:

Hi guys,

I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot explain how this could happened but it does, and it happened to me a couple of times (but not fully reproducible).

Hmm. The schedulerResumptionTime is, somehow, being (approximately) doubled. It's not clear how that can happen, but I'll look a little more.

Mario, is there any chance that you might be saving the image during one of these Delays?

This one smells like a race condition, and I think I see something that *might* explain it. But I don't have any more time to spend on this one, so I'll leave the rest to someone else. I hope this is helpful:

The only way I immediately see for the schedulerResumptionTime to become approximately doubled is if the Delay's resumption time is adjusted by #restoreResumptionTimes without previously having been adjusted by #saveResumptionTimes.

The only time either of those are sent, that I can see, is on saving the image. Both are normally sent, (save before the snapshot, restore afterwards), but there may be a hole there.

Martin, first off, thanks for the research!!!

Now....your email made me remember something: I did get VM crash when saving the image a couple of times. The VM crashed when saving the image. If I re-opened the image, it looks like if the image was indeed saved (so the snapshot primitive itself did work), but I suspect not all shutdown code could have been run correctly.

The VM crash looks like the FreeTypeFace >> pvtDestroyHandle which, as far as I know, it's a "known crash" (I attach crash dump). From what I can see, if I follow all the stack, the crash starts from the WeakArray >> startUp: .
That means that...depending on the order of the startup list...the Scheduler may not have been run after the crash.

Now.... WeakArray initialization does:

SessionManager default
registerSystemClassNamed: self name.

While...

Delay class >> startUp "Restart active delay, if any, when resuming a snapshot." Scheduler startUp.

And the Delay registration is

SessionManager default
registerSystemClassNamed: self name
atPriority: 20.

So...that seems correct...

I can verify this by:

SessionManager default systemCategory prioritizedList

Anyway...not sure if this adds something, but just wanted to note this.

#saveResumptionTimes is only sent (by this scheduler class) when the accessProtect semaphore is held, but #handleTimerEvent: is executed in the timing Process *without* the protection of accessProtect, in the case of the VM signaling the timingSemaphore. If the VM signals the timingSemaphore, #handleTimerEvent: could run in the middle of #saveResumptionTimes. If some Delay expires because of that timer event, our Delay could move from being the first suspended delay to being the active delay. If that happens after we've adjusted the active delay, but before we've processed the suspended delays, that Delay will not get adjusted, and will show the symptoms that Mariano is seeing.

Also, I'm not sure how the Heap that holds the suspendedDelays will react to being modified in the middle of an enumeration. That might open a larger window for the problem.

Regards,

-Martin

--
Mariano
http://marianopeck.wordpress.com

Mariano
http://marianopeck.wordpress.com

Ben Coman

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

On Sat, May 21, 2016 at 12:05 AM, Mariano Martinez Peck
<[hidden email]> wrote:
> Ben, for the record, I am using DelayMillisecondScheduler for a day and a
> half and so far no problem.

Cool. Thats why I left it there. I hope to soon have something for you
to try with the newer design. Thanks for the update.

cheers -ben

> On Thu, May 19, 2016 at 9:19 AM, Mariano Martinez Peck
> <[hidden email]> wrote:
>>
>>
>>
>> On Wed, May 18, 2016 at 9:49 PM, Martin McClure <[hidden email]>
>> wrote:
>>>
>>> On 05/18/2016 03:17 PM, Martin McClure wrote:
>>>>
>>>> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>>>>> explain how this could happened but it does, and it happened to me a couple
>>>>> of times (but not fully reproducible).
>>>>>
>>>>
>>>> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>>>> doubled. It's not clear how that can happen, but I'll look a little more.
>>>>
>>>
>>> Mario, is there any chance that you might be saving the image during one
>>> of these Delays?
>>>
>>>
>>> This one smells like a race condition, and I think I see something that
>>> *might* explain it. But I don't have any more time to spend on this one, so
>>> I'll leave the rest to someone else. I hope this is helpful:
>>>
>>> The only way I immediately see for the schedulerResumptionTime to become
>>> approximately doubled is if the Delay's resumption time is adjusted by
>>> #restoreResumptionTimes without previously having been adjusted by
>>> #saveResumptionTimes.
>>>
>>> The only time either of those are sent, that I can see, is on saving the
>>> image. Both are normally sent, (save before the snapshot, restore
>>> afterwards), but there may be a hole there.
>>>
>>
>> Martin, first off, thanks for the research!!!
>>
>> Now....your email made me remember something: I did get VM crash when
>> saving the image a couple of times. The VM crashed when saving the image. If
>> I re-opened the image, it looks like if the image was indeed saved (so the
>> snapshot primitive itself did work), but I suspect not all shutdown code
>> could have been run correctly.
>>
>> The VM crash looks like the FreeTypeFace >> pvtDestroyHandle which, as
>> far as I know, it's a "known crash" (I attach crash dump). From what I can
>> see, if I follow all the stack, the crash starts from the WeakArray >>
>> startUp: .
>> That means that...depending on the order of the startup list...the
>> Scheduler may not have been run after the crash.
>>
>> Now.... WeakArray initialization does:
>>
>> SessionManager default
>> registerSystemClassNamed: self name.
>> While...
>>
>> Delay class >> startUp "Restart active delay, if any, when resuming a
>> snapshot." Scheduler startUp.
>>
>> And the Delay registration is
>>
>> SessionManager default
>> registerSystemClassNamed: self name
>> atPriority: 20.
>>
>> So...that seems correct...
>>
>> I can verify this by:
>>
>> SessionManager default systemCategory prioritizedList
>>
>> Anyway...not sure if this adds something, but just wanted to note this.
>>
>>
>>>
>>> #saveResumptionTimes is only sent (by this scheduler class) when the
>>> accessProtect semaphore is held, but #handleTimerEvent: is executed in the
>>> timing Process *without* the protection of accessProtect, in the case of the
>>> VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
>>> #handleTimerEvent: could run in the middle of #saveResumptionTimes. If some
>>> Delay expires because of that timer event, our Delay could move from being
>>> the first suspended delay to being the active delay. If that happens after
>>> we've adjusted the active delay, but before we've processed the suspended
>>> delays, that Delay will not get adjusted, and will show the symptoms that
>>> Mariano is seeing.
>>>
>>> Also, I'm not sure how the Heap that holds the suspendedDelays will react
>>> to being modified in the middle of an enumeration. That might open a larger
>>> window for the problem.
>>>
>>> Regards,
>>>
>>> -Martin
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com

Mariano Martinez Peck

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Hi Ben,

I opened this issue for tracking the problem and not forget: https://pharo.fogbugz.com/f/cases/18359/Problem-with-DelayExperimentalSpinScheduler-and-delay

Cheers,

On Fri, May 20, 2016 at 2:58 PM, Ben Coman <[hidden email]> wrote:

On Sat, May 21, 2016 at 12:05 AM, Mariano Martinez Peck
<[hidden email]> wrote:
> Ben, for the record, I am using DelayMillisecondScheduler for a day and a
> half and so far no problem.

Cool. Thats why I left it there. I hope to soon have something for you
to try with the newer design. Thanks for the update.

cheers -ben

> On Thu, May 19, 2016 at 9:19 AM, Mariano Martinez Peck
> <[hidden email]> wrote:
>>
>>
>>
>> On Wed, May 18, 2016 at 9:49 PM, Martin McClure <[hidden email]>
>> wrote:
>>>
>>> On 05/18/2016 03:17 PM, Martin McClure wrote:
>>>>
>>>> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>>>>> explain how this could happened but it does, and it happened to me a couple
>>>>> of times (but not fully reproducible).
>>>>>
>>>>
>>>> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>>>> doubled. It's not clear how that can happen, but I'll look a little more.
>>>>
>>>
>>> Mario, is there any chance that you might be saving the image during one
>>> of these Delays?
>>>
>>>
>>> This one smells like a race condition, and I think I see something that
>>> *might* explain it. But I don't have any more time to spend on this one, so
>>> I'll leave the rest to someone else. I hope this is helpful:
>>>
>>> The only way I immediately see for the schedulerResumptionTime to become
>>> approximately doubled is if the Delay's resumption time is adjusted by
>>> #restoreResumptionTimes without previously having been adjusted by
>>> #saveResumptionTimes.
>>>
>>> The only time either of those are sent, that I can see, is on saving the
>>> image. Both are normally sent, (save before the snapshot, restore
>>> afterwards), but there may be a hole there.
>>>
>>
>> Martin, first off, thanks for the research!!!
>>
>> Now....your email made me remember something: I did get VM crash when
>> saving the image a couple of times. The VM crashed when saving the image. If
>> I re-opened the image, it looks like if the image was indeed saved (so the
>> snapshot primitive itself did work), but I suspect not all shutdown code
>> could have been run correctly.
>>
>> The VM crash looks like the FreeTypeFace >> pvtDestroyHandle which, as
>> far as I know, it's a "known crash" (I attach crash dump). From what I can
>> see, if I follow all the stack, the crash starts from the WeakArray >>
>> startUp: .
>> That means that...depending on the order of the startup list...the
>> Scheduler may not have been run after the crash.
>>
>> Now.... WeakArray initialization does:
>>
>> SessionManager default
>> registerSystemClassNamed: self name.
>> While...
>>
>> Delay class >> startUp "Restart active delay, if any, when resuming a
>> snapshot." Scheduler startUp.
>>
>> And the Delay registration is
>>
>> SessionManager default
>> registerSystemClassNamed: self name
>> atPriority: 20.
>>
>> So...that seems correct...
>>
>> I can verify this by:
>>
>> SessionManager default systemCategory prioritizedList
>>
>> Anyway...not sure if this adds something, but just wanted to note this.
>>
>>
>>>
>>> #saveResumptionTimes is only sent (by this scheduler class) when the
>>> accessProtect semaphore is held, but #handleTimerEvent: is executed in the
>>> timing Process *without* the protection of accessProtect, in the case of the
>>> VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
>>> #handleTimerEvent: could run in the middle of #saveResumptionTimes. If some
>>> Delay expires because of that timer event, our Delay could move from being
>>> the first suspended delay to being the active delay. If that happens after
>>> we've adjusted the active delay, but before we've processed the suspended
>>> delays, that Delay will not get adjusted, and will show the symptoms that
>>> Mariano is seeing.
>>>
>>> Also, I'm not sure how the Heap that holds the suspendedDelays will react
>>> to being modified in the middle of an enumeration. That might open a larger
>>> window for the problem.
>>>
>>> Regards,
>>>
>>> -Martin
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com

Mariano
http://marianopeck.wordpress.com

Gastón Dall' Oglio

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

Hi Mariano.

Some time ago I had a similar problem (Windows, Linux did not happen) and the solution was "OSProcessAccessor initialize" to open the image.

HTH http://forum.world.st/OSProcess-error-in-1-3-tp4602389p4604877.html

Cheers

2016-05-30 10:31 GMT-03:00 Mariano Martinez Peck <[hidden email]>:

Hi Ben,

I opened this issue for tracking the problem and not forget: https://pharo.fogbugz.com/f/cases/18359/Problem-with-DelayExperimentalSpinScheduler-and-delay

Cheers,

On Fri, May 20, 2016 at 2:58 PM, Ben Coman <[hidden email]> wrote:
On Sat, May 21, 2016 at 12:05 AM, Mariano Martinez Peck
<[hidden email]> wrote:
> Ben, for the record, I am using DelayMillisecondScheduler for a day and a
> half and so far no problem.

Cool. Thats why I left it there. I hope to soon have something for you
to try with the newer design. Thanks for the update.

cheers -ben

> On Thu, May 19, 2016 at 9:19 AM, Mariano Martinez Peck
> <[hidden email]> wrote:
>>
>>
>>
>> On Wed, May 18, 2016 at 9:49 PM, Martin McClure <[hidden email]>
>> wrote:
>>>
>>> On 05/18/2016 03:17 PM, Martin McClure wrote:
>>>>
>>>> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>>>>> explain how this could happened but it does, and it happened to me a couple
>>>>> of times (but not fully reproducible).
>>>>>
>>>>
>>>> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>>>> doubled. It's not clear how that can happen, but I'll look a little more.
>>>>
>>>
>>> Mario, is there any chance that you might be saving the image during one
>>> of these Delays?
>>>
>>>
>>> This one smells like a race condition, and I think I see something that
>>> *might* explain it. But I don't have any more time to spend on this one, so
>>> I'll leave the rest to someone else. I hope this is helpful:
>>>
>>> The only way I immediately see for the schedulerResumptionTime to become
>>> approximately doubled is if the Delay's resumption time is adjusted by
>>> #restoreResumptionTimes without previously having been adjusted by
>>> #saveResumptionTimes.
>>>
>>> The only time either of those are sent, that I can see, is on saving the
>>> image. Both are normally sent, (save before the snapshot, restore
>>> afterwards), but there may be a hole there.
>>>
>>
>> Martin, first off, thanks for the research!!!
>>
>> Now....your email made me remember something: I did get VM crash when
>> saving the image a couple of times. The VM crashed when saving the image. If
>> I re-opened the image, it looks like if the image was indeed saved (so the
>> snapshot primitive itself did work), but I suspect not all shutdown code
>> could have been run correctly.
>>
>> The VM crash looks like the FreeTypeFace >> pvtDestroyHandle which, as
>> far as I know, it's a "known crash" (I attach crash dump). From what I can
>> see, if I follow all the stack, the crash starts from the WeakArray >>
>> startUp: .
>> That means that...depending on the order of the startup list...the
>> Scheduler may not have been run after the crash.
>>
>> Now.... WeakArray initialization does:
>>
>> SessionManager default
>> registerSystemClassNamed: self name.
>> While...
>>
>> Delay class >> startUp "Restart active delay, if any, when resuming a
>> snapshot." Scheduler startUp.
>>
>> And the Delay registration is
>>
>> SessionManager default
>> registerSystemClassNamed: self name
>> atPriority: 20.
>>
>> So...that seems correct...
>>
>> I can verify this by:
>>
>> SessionManager default systemCategory prioritizedList
>>
>> Anyway...not sure if this adds something, but just wanted to note this.
>>
>>
>>>
>>> #saveResumptionTimes is only sent (by this scheduler class) when the
>>> accessProtect semaphore is held, but #handleTimerEvent: is executed in the
>>> timing Process *without* the protection of accessProtect, in the case of the
>>> VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
>>> #handleTimerEvent: could run in the middle of #saveResumptionTimes. If some
>>> Delay expires because of that timer event, our Delay could move from being
>>> the first suspended delay to being the active delay. If that happens after
>>> we've adjusted the active delay, but before we've processed the suspended
>>> delays, that Delay will not get adjusted, and will show the symptoms that
>>> Mariano is seeing.
>>>
>>> Also, I'm not sure how the Heap that holds the suspendedDelays will react
>>> to being modified in the middle of an enumeration. That might open a larger
>>> window for the problem.
>>>
>>> Regards,
>>>
>>> -Martin
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com

--
Mariano
http://marianopeck.wordpress.com

Mariano Martinez Peck

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

In reply to this post by Ben Coman

On Wed, May 18, 2016 at 11:39 PM, Ben Coman <[hidden email]> wrote:

On Thu, May 19, 2016 at 8:49 AM, Martin McClure <[hidden email]> wrote:
> On 05/18/2016 03:17 PM, Martin McClure wrote:
>>
>> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>>>
>>> Hi guys,
>>>
>>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>>> explain how this could happened but it does, and it happened to me a couple
>>> of times (but not fully reproducible).
>>>
>>
>> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>> doubled. It's not clear how that can happen, but I'll look a little more.
>>
>
> Mario, is there any chance that you might be saving the image during one of
> these Delays?
>
>
> This one smells like a race condition, and I think I see something that
> *might* explain it. But I don't have any more time to spend on this one, so
> I'll leave the rest to someone else. I hope this is helpful:
>
> The only way I immediately see for the schedulerResumptionTime to become
> approximately doubled is if the Delay's resumption time is adjusted by
> #restoreResumptionTimes without previously having been adjusted by
> #saveResumptionTimes.
>
> The only time either of those are sent, that I can see, is on saving the
> image. Both are normally sent, (save before the snapshot, restore
> afterwards), but there may be a hole there.
>
> #saveResumptionTimes is only sent (by this scheduler class) when the
> accessProtect semaphore is held, but #handleTimerEvent: is executed in the
> timing Process *without* the protection of accessProtect, in the case of the
> VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
> #handleTimerEvent: could run in the middle of #saveResumptionTimes. If some
> Delay expires because of that timer event, our Delay could move from being
> the first suspended delay to being the active delay. If that happens after
> we've adjusted the active delay, but before we've processed the suspended
> delays, that Delay will not get adjusted, and will show the symptoms that
> Mariano is seeing.

A quick experiment to test this might be in shutDown/#startUp trying...
[ self saveResumptionTimes ] valueAt: Processor timingPriority
[ self resumeResumptionTimes ] valueAt: Processor timingPriority

>
> Also, I'm not sure how the Heap that holds the suspendedDelays will react to
> being modified in the middle of an enumeration. That might open a larger
> window for the problem.
>
> Regards,
>
> -Martin
>

Even if not directly related to Mariano's problem, I agree with your
general assessment. I'm not comfortable with the way that
#save/#restoreResumptionTimes (which manipulate suspendedDelays) are
called from user priority code via #shutDown/#startUp. Per the
original code**, accessProtect can't be used inside the timing
priority #handleTimerEvent since accessProtect is held by the user
priority #schedule when it uses "timingSemaphore signal" to invoke
invokes #handleTimerEvent. accessProtect never protected
timingPriority manipulation of suspendedDelays by #handleTimerEvent,
nor expired delays waking up. But ahhh... the disabling of
accessProtect previously prevented new delays being scheduled between
a #save and #restore. If a new delay is scheduled after the #save,
when it is #restore'd its resumptionTime would be wrong.

Waiting in the wings for Pharo 6 I have changes that should help:
* have #save/#restoreResumptionTimes *only* called from timing
priority event loop (i.e. #handleTimerEvent)
* shutDown/startUp suspends/resumes the timing priority event loop,
instead of trying to block signals to timingSemaphore

I haven't touched it for a few months so I'll need to chase it up to
provide a preview.

Mariano, can you try DelayMillisecondScheduler (which however is
missing some fixes for other issues).

Ben,

Did you find a chance to do something else with this? I am still finding lookups every in a while even with the DelayMillisecondScheduler :(

Thanks!

Mariano
http://marianopeck.wordpress.com

Ben Coman

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

On Wed, Jul 13, 2016 at 12:30 AM, Mariano Martinez Peck
<[hidden email]> wrote:

>
>
> On Wed, May 18, 2016 at 11:39 PM, Ben Coman <[hidden email]> wrote:
>>
>> On Thu, May 19, 2016 at 8:49 AM, Martin McClure <[hidden email]>
>> wrote:
>> > On 05/18/2016 03:17 PM, Martin McClure wrote:
>> >>
>> >> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>> >>>
>> >>> Hi guys,
>> >>>
>> >>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>> >>> explain how this could happened but it does, and it happened to me a
>> >>> couple
>> >>> of times (but not fully reproducible).
>> >>>
>> >>
>> >> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>> >> doubled. It's not clear how that can happen, but I'll look a little
>> >> more.
>> >>
>> >
>> > Mario, is there any chance that you might be saving the image during one
>> > of
>> > these Delays?
>> >
>> >
>> > This one smells like a race condition, and I think I see something that
>> > *might* explain it. But I don't have any more time to spend on this one,
>> > so
>> > I'll leave the rest to someone else. I hope this is helpful:
>> >
>> > The only way I immediately see for the schedulerResumptionTime to become
>> > approximately doubled is if the Delay's resumption time is adjusted by
>> > #restoreResumptionTimes without previously having been adjusted by
>> > #saveResumptionTimes.
>> >
>> > The only time either of those are sent, that I can see, is on saving the
>> > image. Both are normally sent, (save before the snapshot, restore
>> > afterwards), but there may be a hole there.
>> >
>> > #saveResumptionTimes is only sent (by this scheduler class) when the
>> > accessProtect semaphore is held, but #handleTimerEvent: is executed in
>> > the
>> > timing Process *without* the protection of accessProtect, in the case of
>> > the
>> > VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
>> > #handleTimerEvent: could run in the middle of #saveResumptionTimes. If
>> > some
>> > Delay expires because of that timer event, our Delay could move from
>> > being
>> > the first suspended delay to being the active delay. If that happens
>> > after
>> > we've adjusted the active delay, but before we've processed the
>> > suspended
>> > delays, that Delay will not get adjusted, and will show the symptoms
>> > that
>> > Mariano is seeing.
>>
>> A quick experiment to test this might be in shutDown/#startUp trying...
>> [ self saveResumptionTimes ] valueAt: Processor timingPriority
>> [ self resumeResumptionTimes ] valueAt: Processor timingPriority
>>
>> >
>> > Also, I'm not sure how the Heap that holds the suspendedDelays will
>> > react to
>> > being modified in the middle of an enumeration. That might open a larger
>> > window for the problem.
>> >
>> > Regards,
>> >
>> > -Martin
>> >
>>
>> Even if not directly related to Mariano's problem, I agree with your
>> general assessment. I'm not comfortable with the way that
>> #save/#restoreResumptionTimes (which manipulate suspendedDelays) are
>> called from user priority code via #shutDown/#startUp. Per the
>> original code**, accessProtect can't be used inside the timing
>> priority #handleTimerEvent since accessProtect is held by the user
>> priority #schedule when it uses "timingSemaphore signal" to invoke
>> invokes #handleTimerEvent. accessProtect never protected
>> timingPriority manipulation of suspendedDelays by #handleTimerEvent,
>> nor expired delays waking up. But ahhh... the disabling of
>> accessProtect previously prevented new delays being scheduled between
>> a #save and #restore. If a new delay is scheduled after the #save,
>> when it is #restore'd its resumptionTime would be wrong.
>>
>> Waiting in the wings for Pharo 6 I have changes that should help:
>> * have #save/#restoreResumptionTimes *only* called from timing
>> priority event loop (i.e. #handleTimerEvent)
>> * shutDown/startUp suspends/resumes the timing priority event loop,
>> instead of trying to block signals to timingSemaphore
>>
>> I haven't touched it for a few months so I'll need to chase it up to
>> provide a preview.
>>
>>
>>
>> Mariano, can you try DelayMillisecondScheduler (which however is
>> missing some fixes for other issues).
>>
>
>
> Ben,
>
> Did you find a chance to do something else with this? I am still finding
> lookups every in a while even with the DelayMillisecondScheduler :(

Hi Mariano,

Sorry for the delay attending to this. I got a bit twisted up on
whether I'm complicating things too much with my DelayScheduler
refactoring for Pharo 6 (I should just put it out for early review),
but then I had to get through a Netflix addiction to the Star Trek
prequel Enterprise, and family time over my kids school holidays. I'm
back looking at it now - and taking a step back to see what is the
minimum change required to fix this.

Do you have some test code or a test image I could run to reproduce
the problem?

Now I have one thing for you to try. In the move of DelaySchedulerXxx
from milliseconds to microseconds, the code for dealing with clock
wrap-around was removed, but I wonder if it was also covering clock
jitter as a side effect. Could you try again the SpinScheduler but
restore this code...

From handletTimerEvent...
"Check for clock wrap-around."
millisecondNowTick < activeDelayStartTime ifTrue: [
"clock wrapped"
self saveResumptionTimes.
self restoreResumptionTimes ].
activeDelayStartTime := millisecondNowTick.

From runTimerEventLoop...
Time millisecondClockValue < millisecondNowTick
ifTrue:[ timingSemaphore signal ]. "retry"

From startup...
activeDelayStartTime := Time millisecondClockValue.

cheers -ben

Mariano Martinez Peck

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

On Sun, Jul 17, 2016 at 6:46 AM, Ben Coman <[hidden email]> wrote:

On Wed, Jul 13, 2016 at 12:30 AM, Mariano Martinez Peck
<[hidden email]> wrote:
>
>
> On Wed, May 18, 2016 at 11:39 PM, Ben Coman <[hidden email]> wrote:
>>
>> On Thu, May 19, 2016 at 8:49 AM, Martin McClure <[hidden email]>
>> wrote:
>> > On 05/18/2016 03:17 PM, Martin McClure wrote:
>> >>
>> >> On 05/18/2016 08:49 AM, Mariano Martinez Peck wrote:
>> >>>
>> >>> Hi guys,
>> >>>
>> >>> I am seeing a problem in Pharo 5.0 regarding Delay >> wait. I cannot
>> >>> explain how this could happened but it does, and it happened to me a
>> >>> couple
>> >>> of times (but not fully reproducible).
>> >>>
>> >>
>> >> Hmm. The schedulerResumptionTime is, somehow, being (approximately)
>> >> doubled. It's not clear how that can happen, but I'll look a little
>> >> more.
>> >>
>> >
>> > Mario, is there any chance that you might be saving the image during one
>> > of
>> > these Delays?
>> >
>> >
>> > This one smells like a race condition, and I think I see something that
>> > *might* explain it. But I don't have any more time to spend on this one,
>> > so
>> > I'll leave the rest to someone else. I hope this is helpful:
>> >
>> > The only way I immediately see for the schedulerResumptionTime to become
>> > approximately doubled is if the Delay's resumption time is adjusted by
>> > #restoreResumptionTimes without previously having been adjusted by
>> > #saveResumptionTimes.
>> >
>> > The only time either of those are sent, that I can see, is on saving the
>> > image. Both are normally sent, (save before the snapshot, restore
>> > afterwards), but there may be a hole there.
>> >
>> > #saveResumptionTimes is only sent (by this scheduler class) when the
>> > accessProtect semaphore is held, but #handleTimerEvent: is executed in
>> > the
>> > timing Process *without* the protection of accessProtect, in the case of
>> > the
>> > VM signaling the timingSemaphore. If the VM signals the timingSemaphore,
>> > #handleTimerEvent: could run in the middle of #saveResumptionTimes. If
>> > some
>> > Delay expires because of that timer event, our Delay could move from
>> > being
>> > the first suspended delay to being the active delay. If that happens
>> > after
>> > we've adjusted the active delay, but before we've processed the
>> > suspended
>> > delays, that Delay will not get adjusted, and will show the symptoms
>> > that
>> > Mariano is seeing.
>>
>> A quick experiment to test this might be in shutDown/#startUp trying...
>> [ self saveResumptionTimes ] valueAt: Processor timingPriority
>> [ self resumeResumptionTimes ] valueAt: Processor timingPriority
>>
>> >
>> > Also, I'm not sure how the Heap that holds the suspendedDelays will
>> > react to
>> > being modified in the middle of an enumeration. That might open a larger
>> > window for the problem.
>> >
>> > Regards,
>> >
>> > -Martin
>> >
>>
>> Even if not directly related to Mariano's problem, I agree with your
>> general assessment. I'm not comfortable with the way that
>> #save/#restoreResumptionTimes (which manipulate suspendedDelays) are
>> called from user priority code via #shutDown/#startUp. Per the
>> original code**, accessProtect can't be used inside the timing
>> priority #handleTimerEvent since accessProtect is held by the user
>> priority #schedule when it uses "timingSemaphore signal" to invoke
>> invokes #handleTimerEvent. accessProtect never protected
>> timingPriority manipulation of suspendedDelays by #handleTimerEvent,
>> nor expired delays waking up. But ahhh... the disabling of
>> accessProtect previously prevented new delays being scheduled between
>> a #save and #restore. If a new delay is scheduled after the #save,
>> when it is #restore'd its resumptionTime would be wrong.
>>
>> Waiting in the wings for Pharo 6 I have changes that should help:
>> * have #save/#restoreResumptionTimes *only* called from timing
>> priority event loop (i.e. #handleTimerEvent)
>> * shutDown/startUp suspends/resumes the timing priority event loop,
>> instead of trying to block signals to timingSemaphore
>>
>> I haven't touched it for a few months so I'll need to chase it up to
>> provide a preview.
>>
>>
>>
>> Mariano, can you try DelayMillisecondScheduler (which however is
>> missing some fixes for other issues).
>>
>
>
> Ben,
>
> Did you find a chance to do something else with this? I am still finding
> lookups every in a while even with the DelayMillisecondScheduler :(

Hi Mariano,

Sorry for the delay attending to this. I got a bit twisted up on
whether I'm complicating things too much with my DelayScheduler
refactoring for Pharo 6 (I should just put it out for early review),
but then I had to get through a Netflix addiction to the Star Trek
prequel Enterprise, and family time over my kids school holidays. I'm
back looking at it now - and taking a step back to see what is the
minimum change required to fix this.

hahaha no problem.

Do you have some test code or a test image I could run to reproduce
the problem?

The problem is that I cannot reproduce it immediately. It takes me some days and likely some image save and start again until I reproduce it.

Now I have one thing for you to try. In the move of DelaySchedulerXxx
from milliseconds to microseconds, the code for dealing with clock
wrap-around was removed, but I wonder if it was also covering clock
jitter as a side effect. Could you try again the SpinScheduler but
restore this code...

From handletTimerEvent...
"Check for clock wrap-around."
millisecondNowTick < activeDelayStartTime ifTrue: [
"clock wrapped"
self saveResumptionTimes.
self restoreResumptionTimes ].
activeDelayStartTime := millisecondNowTick.

From runTimerEventLoop...
Time millisecondClockValue < millisecondNowTick
ifTrue:[ timingSemaphore signal ]. "retry"

From startup...
activeDelayStartTime := Time millisecondClockValue.

OK, I put back the code into the spin scheduler. Then I kept using milisecond one until I finished the changes. Then I saved image and switch to spin one: the image hungs. I cannot even interrupt it. I attach my modifications.

Do they look correct?

Thanks!

Mariano
http://marianopeck.wordpress.com

Kernel-MarianoMartinezPeck.2291.mcz (875K) Download Attachment

Ben Coman

Re: Problem with delay waiting (OSSubprocess code) on Pharo 5.0

On Fri, Jul 22, 2016 at 4:42 AM, Mariano Martinez Peck <[hidden email]> wrote:

On Sun, Jul 17, 2016 at 6:46 AM, Ben Coman <[hidden email]> wrote:

Do you have some test code or a test image I could run to reproduce
the problem?

The problem is that I cannot reproduce it immediately. It takes me some days and likely some image save and start again until I reproduce it.

Could you try

How about something that invokes

Now I have one thing for you to try. In the move of DelaySchedulerXxx
from milliseconds to microseconds, the code for dealing with clock
wrap-around was removed, but I wonder if it was also covering clock
jitter as a side effect. Could you try again the SpinScheduler but
restore this code...

From handletTimerEvent...
"Check for clock wrap-around."
millisecondNowTick < activeDelayStartTime ifTrue: [
"clock wrapped"
self saveResumptionTimes.
self restoreResumptionTimes ].
activeDelayStartTime := millisecondNowTick.

From runTimerEventLoop...
Time millisecondClockValue < millisecondNowTick
ifTrue:[ timingSemaphore signal ]. "retry"

From startup...
activeDelayStartTime := Time millisecondClockValue.

OK, I put back the code into the spin scheduler. Then I kept using milisecond one until I finished the changes. Then I saved image and switch to spin one: the image hungs. I cannot even interrupt it. I attach my modifications.
Do they look correct?

I confirm that locked my image also.

One change of milli to micro was missed at the bottom of runTimeEventLoop...

That is...

Time millisecondClockValue < microsecondNowTick

to...

Time microsecondClockValue < microsecondNowTick

(just a case of domestic blindness) [1] :) :)

cheers -ben

[1] http://www.urbandictionary.com/define.php?term=domestic%20blindness

P.S. A trick to debug delays is to bypass interCycleDelay by at the top of it putting...

true ifTrue: [^self].