Delay and Server reliability

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
38 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Delay and Server reliability

Andreas.Raab
Hi -

We recently had some "fun" chasing server lockups (with truly awful
uptimes of about a day or less before things went downhill) and were
finally able to track a huge portion of it down to problems with Delay.
The effect we were seeing on our servers was that the system would
randomly lock up and either go down to 0% CPU or 100% CPU.

After poking it with a USR1 signal (which, in our VMs is hooked up such
that it prints all the call stacks in the image; it's a life-safer if
you need to debug these issues) we usually found that all processes were
waiting on Delay's AccessProtect (0%) or alternatively found that a
particular process (the event tickler) would sit in a tight loop
swallowing repeated errors complaining that "this delay is already
scheduled".

After hours and hours of testing, debugging, and a little stroke of luck
we finally found out that all of these issues were caused by the fact
that Delay's internal structures are updated by the calling process
(insertion into and removal from SuspendedDelays) which renders the
process susceptible to being terminated in the midst of updating these
structures.

If you look at the code, this is obviously an issue because if (for
example) the calling process gets terminated while it's resorting
SuspendedDelays the result is unpredictable. This is in particular an
issue because the calling process is often running at a relatively low
priority so interruption by other, high-priority processes is a common
case. And if any of these higher priority processes kills the one that
just happens to execute SortedCollection>>remove: anything can happen -
from leaving a later delay in front of an earlier one (one of the cases
we had indicated that this was just what had happened) to errors when
doing the next insert/remove ("trying to evaluate a block that is
already evaluated") to many more weirdnesses. Unfortunately, it is
basically impossible to recreate this problem under any kind of
controlled circumstances, mostly because you need a source of events
that is truly independent from your time source.

As a consequence of our findings we rewrote Delay to deal with these
issues properly and, having deployed the changes about ten days ago on
our servers, all of these sources of problems simply vanished. We
haven't had a single server problem which we couldn't attribute to our
own stupidity (such as running out of disk space ;-)

The changes will in particular be helpful to you if you:
* run network servers
* fork processes to handle network requests
* terminate these processes explicitly (on error conditions for example)
* use Semaphore>>waitTimeoutMsecs: (all socket functions use this)

If you have seen random, unexplained lockups of your server (0% CPU load
while being locked up is a dead giveaway[*]) I'd recommend using the
attached changes (which work best on top of a VM with David Lewis' 64bit
fixes applied) and see if that helps. For us, they made the difference
between running the server in Squeak and rewriting it in Java.

I've also filed this as http://bugs.squeak.org/view.php?id=6576

[*] The 0% CPU lockups have sometimes been attributed to issues with
Linux wait functions. After having seen the havoc that Delay wrecks on
the system I don't buy these explanations any longer. A much simpler
(and more likely) explanation is that Delay went wild.

Cheers,
   - Andreas



SafeDelay.cs (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

Janko Mivšek
Hi Andreas,

That's very important patch and very interesting to me too, because I'm
just deciding to put some of my public Aida/Web websites from VW to
Squeak and I was afraid of such issues as one you just solved.

Is there any chance that this patch goes to 3.10?

Best regards
Janko

Andreas Raab wrote:

> Hi -
>
> We recently had some "fun" chasing server lockups (with truly awful
> uptimes of about a day or less before things went downhill) and were
> finally able to track a huge portion of it down to problems with Delay.
> The effect we were seeing on our servers was that the system would
> randomly lock up and either go down to 0% CPU or 100% CPU.
>
> After poking it with a USR1 signal (which, in our VMs is hooked up such
> that it prints all the call stacks in the image; it's a life-safer if
> you need to debug these issues) we usually found that all processes were
> waiting on Delay's AccessProtect (0%) or alternatively found that a
> particular process (the event tickler) would sit in a tight loop
> swallowing repeated errors complaining that "this delay is already
> scheduled".
>
> After hours and hours of testing, debugging, and a little stroke of luck
> we finally found out that all of these issues were caused by the fact
> that Delay's internal structures are updated by the calling process
> (insertion into and removal from SuspendedDelays) which renders the
> process susceptible to being terminated in the midst of updating these
> structures.
>
> If you look at the code, this is obviously an issue because if (for
> example) the calling process gets terminated while it's resorting
> SuspendedDelays the result is unpredictable. This is in particular an
> issue because the calling process is often running at a relatively low
> priority so interruption by other, high-priority processes is a common
> case. And if any of these higher priority processes kills the one that
> just happens to execute SortedCollection>>remove: anything can happen -
> from leaving a later delay in front of an earlier one (one of the cases
> we had indicated that this was just what had happened) to errors when
> doing the next insert/remove ("trying to evaluate a block that is
> already evaluated") to many more weirdnesses. Unfortunately, it is
> basically impossible to recreate this problem under any kind of
> controlled circumstances, mostly because you need a source of events
> that is truly independent from your time source.
>
> As a consequence of our findings we rewrote Delay to deal with these
> issues properly and, having deployed the changes about ten days ago on
> our servers, all of these sources of problems simply vanished. We
> haven't had a single server problem which we couldn't attribute to our
> own stupidity (such as running out of disk space ;-)
>
> The changes will in particular be helpful to you if you:
> * run network servers
> * fork processes to handle network requests
> * terminate these processes explicitly (on error conditions for example)
> * use Semaphore>>waitTimeoutMsecs: (all socket functions use this)
>
> If you have seen random, unexplained lockups of your server (0% CPU load
> while being locked up is a dead giveaway[*]) I'd recommend using the
> attached changes (which work best on top of a VM with David Lewis' 64bit
> fixes applied) and see if that helps. For us, they made the difference
> between running the server in Squeak and rewriting it in Java.
>
> I've also filed this as http://bugs.squeak.org/view.php?id=6576
>
> [*] The 0% CPU lockups have sometimes been attributed to issues with
> Linux wait functions. After having seen the havoc that Delay wrecks on
> the system I don't buy these explanations any longer. A much simpler
> (and more likely) explanation is that Delay went wild.
>
> Cheers,
>   - Andreas
>
>
> ------------------------------------------------------------------------
>
> 'From Croquet1.0beta of 11 April 2006 [latest update: #1] on 23 July 2007 at 11:53:23 pm'!
> "Change Set: SafeDelay
> Date: 23 July 2007
> Author: Andreas Raab
>
> This change set fixes a set of severe problems with concurrent use of Delay. Previously, many of the delay-internal structures were modified by the calling process which made it susceptible to being terminated in the middle of manipulating these structures and leave Delay (and consequently the entire system) in an inconsistent state.
>
> This change set fixes this problem by moving *all* manipulation of Delay's internal structures out of the calling process. As a side-effect it also removes the requirement of Delays being limited to SmallInteger range; the new code has no limitation on the duration of a delay.
>
> No tests are provided since outside of true asynchronous environments (networks) it is basically impossible to recreate the situation reliably."!
>
>
> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:24'!
> activate
> "Private!! Make the receiver the Delay to be awoken when the next timer interrupt occurs. This method should only be called from a block protected by the AccessProtect semaphore."
> TimerEventLoop ifNotNil:[^nil].
> ActiveDelay := self.
> ActiveDelayStartTime := Time millisecondClockValue.
> ActiveDelayStartTime > resumptionTime ifTrue:[
> ActiveDelay signalWaitingProcess.
> SuspendedDelays isEmpty ifTrue:[
> ActiveDelay := nil.
> ActiveDelayStartTime := nil.
> ] ifFalse:[SuspendedDelays removeFirst activate].
> ] ifFalse:[
> TimingSemaphore initSignals.
> Delay primSignal: TimingSemaphore atMilliseconds: resumptionTime.
> ].! !
>
> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
> schedule
> "Private!! Schedule this Delay, but return immediately rather than waiting. The receiver's semaphore will be signalled when its delay duration has elapsed."
>
> beingWaitedOn ifTrue: [self error: 'This Delay has already been scheduled.'].
>
> TimerEventLoop ifNotNil:[^self scheduleEvent].
> AccessProtect critical: [
> beingWaitedOn := true.
> resumptionTime := Time millisecondClockValue + delayDuration.
> ActiveDelay == nil
> ifTrue: [self activate]
> ifFalse: [
> resumptionTime < ActiveDelay resumptionTime
> ifTrue: [
> SuspendedDelays add: ActiveDelay.
> self activate]
> ifFalse: [SuspendedDelays add: self]]].
> ! !
>
> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 22:33'!
> scheduleEvent
> "Schedule this delay"
> resumptionTime := Time millisecondClockValue + delayDuration.
> AccessProtect critical:[
> ScheduledDelay := self.
> TimingSemaphore signal.
> ].! !
>
> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
> unschedule
> "Unschedule this Delay. Do nothing if it wasn't scheduled."
>
> | done |
> TimerEventLoop ifNotNil:[^self unscheduleEvent].
> AccessProtect critical: [
> done := false.
> [done] whileFalse:
> [SuspendedDelays remove: self ifAbsent: [done := true]].
> ActiveDelay == self ifTrue: [
> SuspendedDelays isEmpty
> ifTrue: [
> ActiveDelay := nil.
> ActiveDelayStartTime := nil]
> ifFalse: [
> SuspendedDelays removeFirst activate]]].
> ! !
>
> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:56'!
> unscheduleEvent
> AccessProtect critical:[
> FinishedDelay := self.
> TimingSemaphore signal.
> ].! !
>
> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
> beingWaitedOn
> "Answer whether this delay is currently scheduled, e.g., being waited on"
> ^beingWaitedOn! !
>
> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
> beingWaitedOn: aBool
> "Indicate whether this delay is currently scheduled, e.g., being waited on"
> beingWaitedOn := aBool! !
>
> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 20:56'!
> delayDuration
> ^delayDuration! !
>
>
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 10:35'!
> handleTimerEvent
> "Handle a timer event; which can be either:
> - a schedule request (ScheduledDelay notNil)
> - an unschedule request (FinishedDelay notNil)
> - a timer signal (not explicitly specified)
> We check for timer expiry every time we get a signal."
> | nextTick |
> "Wait until there is work to do."
> TimingSemaphore wait.
>
> "Process any schedule requests"
> ScheduledDelay ifNotNil:[
> "Schedule the given delay"
> self scheduleDelay: ScheduledDelay.
> ScheduledDelay := nil.
> ].
>
> "Process any unschedule requests"
> FinishedDelay ifNotNil:[
> self unscheduleDelay: FinishedDelay.
> FinishedDelay := nil.
> ].
>
> "Check for clock wrap-around."
> nextTick := Time millisecondClockValue.
> nextTick < ActiveDelayStartTime ifTrue: [
> "clock wrapped"
> self saveResumptionTimes.
> self restoreResumptionTimes.
> ].
> ActiveDelayStartTime := nextTick.
>
> "Signal any expired delays"
> [ActiveDelay notNil and:[
> Time millisecondClockValue >= ActiveDelay resumptionTime]] whileTrue:[
> ActiveDelay signalWaitingProcess.
> SuspendedDelays isEmpty
> ifTrue: [ActiveDelay := nil]
> ifFalse:[ActiveDelay := SuspendedDelays removeFirst].
> ].
>
> "And signal when the next request is due. We sleep at most 1sec here
> as a soft busy-loop so that we don't accidentally miss signals."
> nextTick := Time millisecondClockValue + 1000.
> ActiveDelay ifNotNil:[nextTick := nextTick min: ActiveDelay resumptionTime].
> nextTick := nextTick min: SmallInteger maxVal.
>
> "Since we have processed all outstanding requests, reset the timing semaphore so
> that only new work will wake us up again. Do this RIGHT BEFORE setting the next
> wakeup call from the VM because it is only signaled once so we mustn't miss it."
> TimingSemaphore initSignals.
> Delay primSignal: TimingSemaphore atMilliseconds: nextTick.
> ! !
>
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 09:04'!
> runTimerEventLoop
> "Run the timer event loop."
> [
> [RunTimerEventLoop] whileTrue: [self handleTimerEvent]
> ] on: Error do:[:ex|
> "Clear out the process so it does't get killed"
> TimerEventLoop := nil.
> "Launch the old-style interrupt watcher"
> self startTimerInterruptWatcher.
> "And pass the exception on"
> ex pass.
> ].! !
>
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:32'!
> scheduleDelay: aDelay
> "Private. Schedule this Delay."
> aDelay beingWaitedOn: true.
> ActiveDelay ifNil:[
> ActiveDelay := aDelay
> ] ifNotNil:[
> aDelay resumptionTime < ActiveDelay resumptionTime ifTrue:[
> SuspendedDelays add: ActiveDelay.
> ActiveDelay := aDelay.
> ] ifFalse: [SuspendedDelays add: aDelay].
> ].
> ! !
>
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 10:18'!
> startTimerEventLoop
> "Start the timer event loop"
> "Delay startTimerEventLoop"
> self stopTimerEventLoop.
> self stopTimerInterruptWatcher.
> AccessProtect := Semaphore forMutualExclusion.
> ActiveDelayStartTime := Time millisecondClockValue.
> SuspendedDelays :=
> Heap withAll: (SuspendedDelays ifNil:[#()])
> sortBlock: [:d1 :d2 | d1 resumptionTime <= d2 resumptionTime].
> TimingSemaphore := Semaphore new.
> RunTimerEventLoop := true.
> TimerEventLoop := [self runTimerEventLoop] newProcess.
> TimerEventLoop priority: Processor timingPriority.
> TimerEventLoop resume.
> TimingSemaphore signal. "get going"
> ! !
>
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:32'!
> startTimerInterruptWatcher
> "Reset the class variables that keep track of active Delays and re-start the timer interrupt watcher process. Any currently scheduled delays are forgotten."
> "Delay startTimerInterruptWatcher"
> | p |
> self stopTimerEventLoop.
> self stopTimerInterruptWatcher.
> TimingSemaphore := Semaphore new.
> AccessProtect := Semaphore forMutualExclusion.
> SuspendedDelays :=
> SortedCollection sortBlock:
> [:d1 :d2 | d1 resumptionTime <= d2 resumptionTime].
> ActiveDelay := nil.
> p := [self timerInterruptWatcher] newProcess.
> p priority: Processor timingPriority.
> p resume.
> ! !
>
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 21:26'!
> stopTimerEventLoop
> "Stop the timer event loop"
> RunTimerEventLoop := false.
> TimingSemaphore signal.
> TimerEventLoop := nil.! !
>
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 21:32'!
> stopTimerInterruptWatcher
> "Reset the class variables that keep track of active Delays and re-start the timer interrupt watcher process. Any currently scheduled delays are forgotten."
> "Delay startTimerInterruptWatcher"
> self primSignal: nil atMilliseconds: 0.
> TimingSemaphore ifNotNil:[TimingSemaphore terminateProcess].! !
>
> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:33'!
> unscheduleDelay: aDelay
> "Private. Unschedule this Delay."
> ActiveDelay == aDelay ifTrue: [
> SuspendedDelays isEmpty ifTrue:[
> ActiveDelay := nil.
> ] ifFalse: [
> ActiveDelay := SuspendedDelays removeFirst.
> ]
> ] ifFalse:[
> SuspendedDelays remove: aDelay ifAbsent: [].
> ].
> aDelay beingWaitedOn: false.! !
>
> !Delay class methodsFor: 'class initialization' stamp: 'ar 7/11/2007 18:16'!
> initialize
> "Delay initialize"
> self startTimerEventLoop.! !
>
> Delay initialize!
>
>
> ------------------------------------------------------------------------
>
>

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si

Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

Damien Cassou-3
Hi,

2007/7/24, Janko Mivšek <[hidden email]>:
> That's very important patch and very interesting to me too, because I'm
> just deciding to put some of my public Aida/Web websites from VW to
> Squeak and I was afraid of such issues as one you just solved.
>
> Is there any chance that this patch goes to 3.10?

Chance would be greater if unit tests were included.

--
Damien Cassou


Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

Andreas.Raab
In reply to this post by Janko Mivšek
You can apply the fix yourself; it works in all Squeak versions that I'm
aware of (and if not, you'll find out really quickly ;-) This is just
the kind of thing for which I wanted to see some sort of "standard
package" for so that people across various Squeak versions can benefit
from it.

Cheers,
   - Andreas

Janko Mivšek wrote:

> Hi Andreas,
>
> That's very important patch and very interesting to me too, because I'm
> just deciding to put some of my public Aida/Web websites from VW to
> Squeak and I was afraid of such issues as one you just solved.
>
> Is there any chance that this patch goes to 3.10?
>
> Best regards
> Janko
>
> Andreas Raab wrote:
>> Hi -
>>
>> We recently had some "fun" chasing server lockups (with truly awful
>> uptimes of about a day or less before things went downhill) and were
>> finally able to track a huge portion of it down to problems with
>> Delay. The effect we were seeing on our servers was that the system
>> would randomly lock up and either go down to 0% CPU or 100% CPU.
>>
>> After poking it with a USR1 signal (which, in our VMs is hooked up
>> such that it prints all the call stacks in the image; it's a
>> life-safer if you need to debug these issues) we usually found that
>> all processes were waiting on Delay's AccessProtect (0%) or
>> alternatively found that a particular process (the event tickler)
>> would sit in a tight loop swallowing repeated errors complaining that
>> "this delay is already scheduled".
>>
>> After hours and hours of testing, debugging, and a little stroke of
>> luck we finally found out that all of these issues were caused by the
>> fact that Delay's internal structures are updated by the calling
>> process (insertion into and removal from SuspendedDelays) which
>> renders the process susceptible to being terminated in the midst of
>> updating these structures.
>>
>> If you look at the code, this is obviously an issue because if (for
>> example) the calling process gets terminated while it's resorting
>> SuspendedDelays the result is unpredictable. This is in particular an
>> issue because the calling process is often running at a relatively low
>> priority so interruption by other, high-priority processes is a common
>> case. And if any of these higher priority processes kills the one that
>> just happens to execute SortedCollection>>remove: anything can happen
>> - from leaving a later delay in front of an earlier one (one of the
>> cases we had indicated that this was just what had happened) to errors
>> when doing the next insert/remove ("trying to evaluate a block that is
>> already evaluated") to many more weirdnesses. Unfortunately, it is
>> basically impossible to recreate this problem under any kind of
>> controlled circumstances, mostly because you need a source of events
>> that is truly independent from your time source.
>>
>> As a consequence of our findings we rewrote Delay to deal with these
>> issues properly and, having deployed the changes about ten days ago on
>> our servers, all of these sources of problems simply vanished. We
>> haven't had a single server problem which we couldn't attribute to our
>> own stupidity (such as running out of disk space ;-)
>>
>> The changes will in particular be helpful to you if you:
>> * run network servers
>> * fork processes to handle network requests
>> * terminate these processes explicitly (on error conditions for example)
>> * use Semaphore>>waitTimeoutMsecs: (all socket functions use this)
>>
>> If you have seen random, unexplained lockups of your server (0% CPU
>> load while being locked up is a dead giveaway[*]) I'd recommend using
>> the attached changes (which work best on top of a VM with David Lewis'
>> 64bit fixes applied) and see if that helps. For us, they made the
>> difference between running the server in Squeak and rewriting it in Java.
>>
>> I've also filed this as http://bugs.squeak.org/view.php?id=6576
>>
>> [*] The 0% CPU lockups have sometimes been attributed to issues with
>> Linux wait functions. After having seen the havoc that Delay wrecks on
>> the system I don't buy these explanations any longer. A much simpler
>> (and more likely) explanation is that Delay went wild.
>>
>> Cheers,
>>   - Andreas
>>
>>
>> ------------------------------------------------------------------------
>>
>> 'From Croquet1.0beta of 11 April 2006 [latest update: #1] on 23 July
>> 2007 at 11:53:23 pm'!
>> "Change Set:        SafeDelay
>> Date:            23 July 2007
>> Author:            Andreas Raab
>>
>> This change set fixes a set of severe problems with concurrent use of
>> Delay. Previously, many of the delay-internal structures were modified
>> by the calling process which made it susceptible to being terminated
>> in the middle of manipulating these structures and leave Delay (and
>> consequently the entire system) in an inconsistent state.
>>
>> This change set fixes this problem by moving *all* manipulation of
>> Delay's internal structures out of the calling process. As a
>> side-effect it also removes the requirement of Delays being limited to
>> SmallInteger range; the new code has no limitation on the duration of
>> a delay.
>>
>> No tests are provided since outside of true asynchronous environments
>> (networks) it is basically impossible to recreate the situation
>> reliably."!
>>
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:24'!
>> activate
>>     "Private!! Make the receiver the Delay to be awoken when the next
>> timer interrupt occurs. This method should only be called from a block
>> protected by the AccessProtect semaphore."
>>     TimerEventLoop ifNotNil:[^nil].
>>     ActiveDelay := self.
>>     ActiveDelayStartTime := Time millisecondClockValue.
>>     ActiveDelayStartTime > resumptionTime ifTrue:[
>>         ActiveDelay signalWaitingProcess.
>>         SuspendedDelays isEmpty ifTrue:[
>>             ActiveDelay := nil.
>>             ActiveDelayStartTime := nil.
>>         ] ifFalse:[SuspendedDelays removeFirst activate].
>>     ] ifFalse:[
>>         TimingSemaphore initSignals.
>>         Delay primSignal: TimingSemaphore atMilliseconds: resumptionTime.
>>     ].! !
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
>> schedule
>>     "Private!! Schedule this Delay, but return immediately rather than
>> waiting. The receiver's semaphore will be signalled when its delay
>> duration has elapsed."
>>
>>     beingWaitedOn ifTrue: [self error: 'This Delay has already been
>> scheduled.'].
>>
>>     TimerEventLoop ifNotNil:[^self scheduleEvent].
>>     AccessProtect critical: [
>>         beingWaitedOn := true.
>>         resumptionTime := Time millisecondClockValue + delayDuration.
>>         ActiveDelay == nil
>>             ifTrue: [self activate]
>>             ifFalse: [
>>                 resumptionTime < ActiveDelay resumptionTime
>>                     ifTrue: [
>>                         SuspendedDelays add: ActiveDelay.
>>                         self activate]
>>                     ifFalse: [SuspendedDelays add: self]]].
>> ! !
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 22:33'!
>> scheduleEvent
>>     "Schedule this delay"
>>     resumptionTime := Time millisecondClockValue + delayDuration.
>>     AccessProtect critical:[
>>         ScheduledDelay := self.
>>         TimingSemaphore signal.
>>     ].! !
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
>> unschedule
>>     "Unschedule this Delay. Do nothing if it wasn't scheduled."
>>
>>     | done |
>>     TimerEventLoop ifNotNil:[^self unscheduleEvent].
>>     AccessProtect critical: [
>>         done := false.
>>         [done] whileFalse:
>>             [SuspendedDelays remove: self ifAbsent: [done := true]].
>>         ActiveDelay == self ifTrue: [
>>             SuspendedDelays isEmpty
>>                 ifTrue: [
>>                     ActiveDelay := nil.
>>                     ActiveDelayStartTime := nil]
>>                 ifFalse: [
>>                     SuspendedDelays removeFirst activate]]].
>> ! !
>>
>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:56'!
>> unscheduleEvent
>>     AccessProtect critical:[
>>         FinishedDelay := self.
>>         TimingSemaphore signal.
>>     ].! !
>>
>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
>> beingWaitedOn
>>     "Answer whether this delay is currently scheduled, e.g., being
>> waited on"
>>     ^beingWaitedOn! !
>>
>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
>> beingWaitedOn: aBool
>>     "Indicate whether this delay is currently scheduled, e.g., being
>> waited on"
>>     beingWaitedOn := aBool! !
>>
>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 20:56'!
>> delayDuration
>>     ^delayDuration! !
>>
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 10:35'!
>> handleTimerEvent
>>     "Handle a timer event; which can be either:
>>         - a schedule request (ScheduledDelay notNil)
>>         - an unschedule request (FinishedDelay notNil)
>>         - a timer signal (not explicitly specified)
>>     We check for timer expiry every time we get a signal."
>>     | nextTick |
>>     "Wait until there is work to do."
>>     TimingSemaphore wait.
>>
>>     "Process any schedule requests"
>>     ScheduledDelay ifNotNil:[
>>         "Schedule the given delay"
>>         self scheduleDelay: ScheduledDelay.
>>         ScheduledDelay := nil.
>>     ].
>>
>>     "Process any unschedule requests"
>>     FinishedDelay ifNotNil:[
>>         self unscheduleDelay: FinishedDelay.
>>         FinishedDelay := nil.
>>     ].
>>
>>     "Check for clock wrap-around."
>>     nextTick := Time millisecondClockValue.
>>     nextTick < ActiveDelayStartTime ifTrue: [
>>         "clock wrapped"
>>         self saveResumptionTimes.
>>         self restoreResumptionTimes.
>>     ].
>>     ActiveDelayStartTime := nextTick.
>>
>>     "Signal any expired delays"
>>     [ActiveDelay notNil and:[
>>         Time millisecondClockValue >= ActiveDelay resumptionTime]]
>> whileTrue:[
>>             ActiveDelay signalWaitingProcess.
>>             SuspendedDelays isEmpty                 ifTrue:
>> [ActiveDelay := nil]                 ifFalse:[ActiveDelay :=
>> SuspendedDelays removeFirst].
>>         ].
>>
>>     "And signal when the next request is due. We sleep at most 1sec here
>>     as a soft busy-loop so that we don't accidentally miss signals."
>>     nextTick := Time millisecondClockValue + 1000.
>>     ActiveDelay ifNotNil:[nextTick := nextTick min: ActiveDelay
>> resumptionTime].
>>     nextTick := nextTick min: SmallInteger maxVal.
>>
>>     "Since we have processed all outstanding requests, reset the
>> timing semaphore so
>>     that only new work will wake us up again. Do this RIGHT BEFORE
>> setting the next
>>     wakeup call from the VM because it is only signaled once so we
>> mustn't miss it."
>>     TimingSemaphore initSignals.
>>     Delay primSignal: TimingSemaphore atMilliseconds: nextTick.
>> ! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 09:04'!
>> runTimerEventLoop
>>     "Run the timer event loop."
>>     [
>>         [RunTimerEventLoop] whileTrue: [self handleTimerEvent]
>>     ] on: Error do:[:ex|
>>         "Clear out the process so it does't get killed"
>>         TimerEventLoop := nil.
>>         "Launch the old-style interrupt watcher"
>>         self startTimerInterruptWatcher.
>>         "And pass the exception on"
>>         ex pass.
>>     ].! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:32'!
>> scheduleDelay: aDelay
>>     "Private. Schedule this Delay."
>>     aDelay beingWaitedOn: true.
>>     ActiveDelay ifNil:[
>>         ActiveDelay := aDelay
>>     ] ifNotNil:[
>>         aDelay resumptionTime < ActiveDelay resumptionTime ifTrue:[
>>             SuspendedDelays add: ActiveDelay.
>>             ActiveDelay := aDelay.
>>         ] ifFalse: [SuspendedDelays add: aDelay].
>>     ].
>> ! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 10:18'!
>> startTimerEventLoop
>>     "Start the timer event loop"
>>     "Delay startTimerEventLoop"
>>     self stopTimerEventLoop.
>>     self stopTimerInterruptWatcher.
>>     AccessProtect := Semaphore forMutualExclusion.
>>     ActiveDelayStartTime := Time millisecondClockValue.
>>     SuspendedDelays :=         Heap withAll: (SuspendedDelays
>> ifNil:[#()])
>>             sortBlock: [:d1 :d2 | d1 resumptionTime <= d2
>> resumptionTime].
>>     TimingSemaphore := Semaphore new.
>>     RunTimerEventLoop := true.
>>     TimerEventLoop := [self runTimerEventLoop] newProcess.
>>     TimerEventLoop priority: Processor timingPriority.
>>     TimerEventLoop resume.
>>     TimingSemaphore signal. "get going"
>> ! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:32'!
>> startTimerInterruptWatcher
>>     "Reset the class variables that keep track of active Delays and
>> re-start the timer interrupt watcher process. Any currently scheduled
>> delays are forgotten."
>>     "Delay startTimerInterruptWatcher"
>>     | p |
>>     self stopTimerEventLoop.
>>     self stopTimerInterruptWatcher.
>>     TimingSemaphore := Semaphore new.
>>     AccessProtect := Semaphore forMutualExclusion.
>>     SuspendedDelays :=         SortedCollection sortBlock:            
>> [:d1 :d2 | d1 resumptionTime <= d2 resumptionTime].
>>     ActiveDelay := nil.
>>     p := [self timerInterruptWatcher] newProcess.
>>     p priority: Processor timingPriority.
>>     p resume.
>> ! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 21:26'!
>> stopTimerEventLoop
>>     "Stop the timer event loop"
>>     RunTimerEventLoop := false.
>>     TimingSemaphore signal.
>>     TimerEventLoop := nil.! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 21:32'!
>> stopTimerInterruptWatcher
>>     "Reset the class variables that keep track of active Delays and
>> re-start the timer interrupt watcher process. Any currently scheduled
>> delays are forgotten."
>>     "Delay startTimerInterruptWatcher"
>>     self primSignal: nil atMilliseconds: 0.
>>     TimingSemaphore ifNotNil:[TimingSemaphore terminateProcess].! !
>>
>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:33'!
>> unscheduleDelay: aDelay
>>     "Private. Unschedule this Delay."
>>     ActiveDelay == aDelay ifTrue: [
>>         SuspendedDelays isEmpty ifTrue:[
>>             ActiveDelay := nil.
>>         ] ifFalse: [
>>             ActiveDelay := SuspendedDelays removeFirst.
>>         ]
>>     ] ifFalse:[
>>         SuspendedDelays remove: aDelay ifAbsent: [].
>>     ].
>>     aDelay beingWaitedOn: false.! !
>>
>> !Delay class methodsFor: 'class initialization' stamp: 'ar 7/11/2007
>> 18:16'!
>> initialize
>>     "Delay initialize"
>>     self startTimerEventLoop.! !
>>
>> Delay initialize!
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>


Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

Andreas.Raab
In reply to this post by Damien Cassou-3
Damien Cassou wrote:
>> Is there any chance that this patch goes to 3.10?
>
> Chance would be greater if unit tests were included.

Good luck with that. I tried for a couple of hours to find a reliable
way of creating this problem with not even so much as a hint of being
able to make it happen. The problem is that on any local machine you're
never completely independent from the time source of that machine and if
you are dependent on the time source you are in sync with Delay and
everything will be fine. You need an independent source of events and
I've yet to find someone who shows me how to write unit tests across
multiple machines (and no, running multiple images on the same machine
doesn't work because your process scheduler uses the same time source
that your image uses so it's not independent).

Cheers,
   - Andreas

Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

Bert Freudenberg
In reply to this post by Andreas.Raab
On Jul 24, 2007, at 9:45 , Andreas Raab wrote:

> Hi -
>
> We recently had some "fun" chasing server lockups (with truly awful  
> uptimes of about a day or less before things went downhill) and  
> were finally able to track a huge portion of it down to problems  
> with Delay.

Yay!

- Bert -



Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

Bert Freudenberg
In reply to this post by Damien Cassou-3
On Jul 24, 2007, at 10:16 , Damien Cassou wrote:

> Hi,
>
> 2007/7/24, Janko Mivšek <[hidden email]>:
>> That's very important patch and very interesting to me too,  
>> because I'm
>> just deciding to put some of my public Aida/Web websites from VW to
>> Squeak and I was afraid of such issues as one you just solved.
>>
>> Is there any chance that this patch goes to 3.10?
>
> Chance would be greater if unit tests were included.

No. It just takes a couple of people filing this in and using the  
image for a while. Preferably on servers. And then reporting their  
findings.

- Bert -



Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

Edgar J. De Cleene



El 7/24/07 5:40 AM, "Bert Freudenberg" <[hidden email]> escribió:

> No. It just takes a couple of people filing this in and using the
> image for a while. Preferably on servers. And then reporting their
> findings.
>
> - Bert -

Damien and Janko
I watching your finds.
This is the last week of vacations, so on next Monday when students come , I
put they to test on Mac , Windows XP and 98 and I hope on Umbuntu also.

Edgar



Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

Schwab,Wilhelm K
In reply to this post by Andreas.Raab
Andreas,

Congratulations and thanks.  I am also interested in the "all
callstacks" feature of your VM.  Years ago, I hacked Dolphin to do that
on break in runtime sessions, and later did the same thing in Squeak
when one of my early projects was suffering a GUI lockup.  Thanks to the
callstacks, I was able to fix it.  I _tried_ to make some associated VM
patches available, but there was no interest that I could detect.
Please put your solution in the mainstream VM.  I would also suggest
adding a VM menu command, or (what I did) simply changing the existing
one-callstack command to dump all callstacks instead.

My fixes to the VM mostly involved getting the right number of new lines
between the stacks by moving the crs so they are added only when needed.
 I forget the terminology, but one iterates over more objects than are
in fact associated with a waiting process, leading to spurious new lines
in the output.  I might even be able to find the code if it would help.
But my hunch is you have already encountered and fixed it.

Bill



Wilhelm K. Schwab, Ph.D.
University of Florida
Department of Anesthesiology
PO Box 100254
Gainesville, FL 32610-0254

Email: [hidden email]
Tel: (352) 846-1285
FAX: (352) 392-7029


Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

Jon Hylands
In reply to this post by Andreas.Raab
On Tue, 24 Jul 2007 01:17:59 -0700, Andreas Raab <[hidden email]>
wrote:

> You can apply the fix yourself; it works in all Squeak versions that I'm
> aware of (and if not, you'll find out really quickly ;-) This is just
> the kind of thing for which I wanted to see some sort of "standard
> package" for so that people across various Squeak versions can benefit
> from it.

I tried filing it into my 3.8 (#6665) image, and the following variables
are undeclared:

TimerEventLoop
ScheduledDelay
FinishedDelay

Perhaps there was supposed to be a class definition included?

Later,
Jon

--------------------------------------------------------------
   Jon Hylands      [hidden email]      http://www.huv.com/jon

  Project: Micro Raptor (Small Biped Velociraptor Robot)
           http://www.huv.com/blog

Reply | Threaded
Open this post in threaded view
|

RE: Delay and Server reliability

Gary Chambers-4
And RunTimerEventLoop.

Good work though. May explain a few mysteries!
(using quite a lot of processes with Delays...)

-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of Jon
Hylands
Sent: 24 July 2007 1:17 pm
To: The general-purpose Squeak developers list
Subject: Re: Delay and Server reliability


On Tue, 24 Jul 2007 01:17:59 -0700, Andreas Raab <[hidden email]>
wrote:

> You can apply the fix yourself; it works in all Squeak versions that
> I'm
> aware of (and if not, you'll find out really quickly ;-) This is just
> the kind of thing for which I wanted to see some sort of "standard
> package" for so that people across various Squeak versions can benefit
> from it.

I tried filing it into my 3.8 (#6665) image, and the following variables are
undeclared:

TimerEventLoop
ScheduledDelay
FinishedDelay

Perhaps there was supposed to be a class definition included?

Later,
Jon

--------------------------------------------------------------
   Jon Hylands      [hidden email]      http://www.huv.com/jon

  Project: Micro Raptor (Small Biped Velociraptor Robot)
           http://www.huv.com/blog


Reply | Threaded
Open this post in threaded view
|

RE: Delay and Server reliability

Ron Teitelbaum
In reply to this post by Andreas.Raab
Hi Andreas,

This is Terrific!!  Thank you for doing it!!

Ron

> -----Original Message-----
> From: [hidden email] [mailto:squeak-dev-
> [hidden email]] On Behalf Of Andreas Raab
> Sent: Tuesday, July 24, 2007 3:46 AM
> To: The general-purpose Squeak developers list
> Subject: Delay and Server reliability
>
> Hi -
>
> We recently had some "fun" chasing server lockups (with truly awful
> uptimes of about a day or less before things went downhill) and were
> finally able to track a huge portion of it down to problems with Delay.
> The effect we were seeing on our servers was that the system would
> randomly lock up and either go down to 0% CPU or 100% CPU.
>
> After poking it with a USR1 signal (which, in our VMs is hooked up such
> that it prints all the call stacks in the image; it's a life-safer if
> you need to debug these issues) we usually found that all processes were
> waiting on Delay's AccessProtect (0%) or alternatively found that a
> particular process (the event tickler) would sit in a tight loop
> swallowing repeated errors complaining that "this delay is already
> scheduled".
>
> After hours and hours of testing, debugging, and a little stroke of luck
> we finally found out that all of these issues were caused by the fact
> that Delay's internal structures are updated by the calling process
> (insertion into and removal from SuspendedDelays) which renders the
> process susceptible to being terminated in the midst of updating these
> structures.
>
> If you look at the code, this is obviously an issue because if (for
> example) the calling process gets terminated while it's resorting
> SuspendedDelays the result is unpredictable. This is in particular an
> issue because the calling process is often running at a relatively low
> priority so interruption by other, high-priority processes is a common
> case. And if any of these higher priority processes kills the one that
> just happens to execute SortedCollection>>remove: anything can happen -
> from leaving a later delay in front of an earlier one (one of the cases
> we had indicated that this was just what had happened) to errors when
> doing the next insert/remove ("trying to evaluate a block that is
> already evaluated") to many more weirdnesses. Unfortunately, it is
> basically impossible to recreate this problem under any kind of
> controlled circumstances, mostly because you need a source of events
> that is truly independent from your time source.
>
> As a consequence of our findings we rewrote Delay to deal with these
> issues properly and, having deployed the changes about ten days ago on
> our servers, all of these sources of problems simply vanished. We
> haven't had a single server problem which we couldn't attribute to our
> own stupidity (such as running out of disk space ;-)
>
> The changes will in particular be helpful to you if you:
> * run network servers
> * fork processes to handle network requests
> * terminate these processes explicitly (on error conditions for example)
> * use Semaphore>>waitTimeoutMsecs: (all socket functions use this)
>
> If you have seen random, unexplained lockups of your server (0% CPU load
> while being locked up is a dead giveaway[*]) I'd recommend using the
> attached changes (which work best on top of a VM with David Lewis' 64bit
> fixes applied) and see if that helps. For us, they made the difference
> between running the server in Squeak and rewriting it in Java.
>
> I've also filed this as http://bugs.squeak.org/view.php?id=6576
>
> [*] The 0% CPU lockups have sometimes been attributed to issues with
> Linux wait functions. After having seen the havoc that Delay wrecks on
> the system I don't buy these explanations any longer. A much simpler
> (and more likely) explanation is that Delay went wild.
>
> Cheers,
>    - Andreas


Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

Andreas.Raab
In reply to this post by Jon Hylands
Ouch. You are right. Here is a variant with the class definition included.

Cheers,
   - Andreas

Jon Hylands wrote:

> On Tue, 24 Jul 2007 01:17:59 -0700, Andreas Raab <[hidden email]>
> wrote:
>
>> You can apply the fix yourself; it works in all Squeak versions that I'm
>> aware of (and if not, you'll find out really quickly ;-) This is just
>> the kind of thing for which I wanted to see some sort of "standard
>> package" for so that people across various Squeak versions can benefit
>> from it.
>
> I tried filing it into my 3.8 (#6665) image, and the following variables
> are undeclared:
>
> TimerEventLoop
> ScheduledDelay
> FinishedDelay
>
> Perhaps there was supposed to be a class definition included?
>
> Later,
> Jon
>
> --------------------------------------------------------------
>    Jon Hylands      [hidden email]      http://www.huv.com/jon
>
>   Project: Micro Raptor (Small Biped Velociraptor Robot)
>            http://www.huv.com/blog
>
>



SafeDelay-2.cs (9K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

Jon Hylands
On Tue, 24 Jul 2007 09:03:44 -0700, Andreas Raab <[hidden email]>
wrote:

> Ouch. You are right. Here is a variant with the class definition included.

Thanks, that installs much better. I'll let you know how it works once I
start testing again (I'm kinda down with pneumonia right now) - my server
uses a bunch of processes and a lot of delays. In the current
configuration, it is all on one machine, so I probably wouldn't run into
the issue right now, but I also run with part of the server running on my
PC, and the other part running on a gumstix, and they do socket
communications in both directions. This may possibly explain why I have
seen my gumstix system stop responding on occasion.

Later,
Jon

--------------------------------------------------------------
   Jon Hylands      [hidden email]      http://www.huv.com/jon

  Project: Micro Raptor (Small Biped Velociraptor Robot)
           http://www.huv.com/blog

Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

Steven W Riggins
In reply to this post by Andreas.Raab
Could this be related to why monticello goes to sleep and never wakes  
up until you wiggle the mouse?

Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

stephane ducasse
In reply to this post by Andreas.Raab
Yes.
I imagine the pain you got to chase it....
This kind of bug huge pain.

Stef

On 24 juil. 07, at 10:17, Andreas Raab wrote:

> You can apply the fix yourself; it works in all Squeak versions  
> that I'm aware of (and if not, you'll find out really quickly ;-)  
> This is just the kind of thing for which I wanted to see some sort  
> of "standard package" for so that people across various Squeak  
> versions can benefit from it.
>
> Cheers,
>   - Andreas
>
> Janko Mivšek wrote:
>> Hi Andreas,
>> That's very important patch and very interesting to me too,  
>> because I'm just deciding to put some of my public Aida/Web  
>> websites from VW to Squeak and I was afraid of such issues as one  
>> you just solved.
>> Is there any chance that this patch goes to 3.10?
>> Best regards
>> Janko
>> Andreas Raab wrote:
>>> Hi -
>>>
>>> We recently had some "fun" chasing server lockups (with truly  
>>> awful uptimes of about a day or less before things went downhill)  
>>> and were finally able to track a huge portion of it down to  
>>> problems with Delay. The effect we were seeing on our servers was  
>>> that the system would randomly lock up and either go down to 0%  
>>> CPU or 100% CPU.
>>>
>>> After poking it with a USR1 signal (which, in our VMs is hooked  
>>> up such that it prints all the call stacks in the image; it's a  
>>> life-safer if you need to debug these issues) we usually found  
>>> that all processes were waiting on Delay's AccessProtect (0%) or  
>>> alternatively found that a particular process (the event tickler)  
>>> would sit in a tight loop swallowing repeated errors complaining  
>>> that "this delay is already scheduled".
>>>
>>> After hours and hours of testing, debugging, and a little stroke  
>>> of luck we finally found out that all of these issues were caused  
>>> by the fact that Delay's internal structures are updated by the  
>>> calling process (insertion into and removal from SuspendedDelays)  
>>> which renders the process susceptible to being terminated in the  
>>> midst of updating these structures.
>>>
>>> If you look at the code, this is obviously an issue because if  
>>> (for example) the calling process gets terminated while it's  
>>> resorting SuspendedDelays the result is unpredictable. This is in  
>>> particular an issue because the calling process is often running  
>>> at a relatively low priority so interruption by other, high-
>>> priority processes is a common case. And if any of these higher  
>>> priority processes kills the one that just happens to execute  
>>> SortedCollection>>remove: anything can happen - from leaving a  
>>> later delay in front of an earlier one (one of the cases we had  
>>> indicated that this was just what had happened) to errors when  
>>> doing the next insert/remove ("trying to evaluate a block that is  
>>> already evaluated") to many more weirdnesses. Unfortunately, it  
>>> is basically impossible to recreate this problem under any kind  
>>> of controlled circumstances, mostly because you need a source of  
>>> events that is truly independent from your time source.
>>>
>>> As a consequence of our findings we rewrote Delay to deal with  
>>> these issues properly and, having deployed the changes about ten  
>>> days ago on our servers, all of these sources of problems simply  
>>> vanished. We haven't had a single server problem which we  
>>> couldn't attribute to our own stupidity (such as running out of  
>>> disk space ;-)
>>>
>>> The changes will in particular be helpful to you if you:
>>> * run network servers
>>> * fork processes to handle network requests
>>> * terminate these processes explicitly (on error conditions for  
>>> example)
>>> * use Semaphore>>waitTimeoutMsecs: (all socket functions use this)
>>>
>>> If you have seen random, unexplained lockups of your server (0%  
>>> CPU load while being locked up is a dead giveaway[*]) I'd  
>>> recommend using the attached changes (which work best on top of a  
>>> VM with David Lewis' 64bit fixes applied) and see if that helps.  
>>> For us, they made the difference between running the server in  
>>> Squeak and rewriting it in Java.
>>>
>>> I've also filed this as http://bugs.squeak.org/view.php?id=6576
>>>
>>> [*] The 0% CPU lockups have sometimes been attributed to issues  
>>> with Linux wait functions. After having seen the havoc that Delay  
>>> wrecks on the system I don't buy these explanations any longer. A  
>>> much simpler (and more likely) explanation is that Delay went wild.
>>>
>>> Cheers,
>>>   - Andreas
>>>
>>>
>>> --------------------------------------------------------------------
>>> ----
>>>
>>> 'From Croquet1.0beta of 11 April 2006 [latest update: #1] on 23  
>>> July 2007 at 11:53:23 pm'!
>>> "Change Set:        SafeDelay
>>> Date:            23 July 2007
>>> Author:            Andreas Raab
>>>
>>> This change set fixes a set of severe problems with concurrent  
>>> use of Delay. Previously, many of the delay-internal structures  
>>> were modified by the calling process which made it susceptible to  
>>> being terminated in the middle of manipulating these structures  
>>> and leave Delay (and consequently the entire system) in an  
>>> inconsistent state.
>>>
>>> This change set fixes this problem by moving *all* manipulation  
>>> of Delay's internal structures out of the calling process. As a  
>>> side-effect it also removes the requirement of Delays being  
>>> limited to SmallInteger range; the new code has no limitation on  
>>> the duration of a delay.
>>>
>>> No tests are provided since outside of true asynchronous  
>>> environments (networks) it is basically impossible to recreate  
>>> the situation reliably."!
>>>
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:24'!
>>> activate
>>>     "Private!! Make the receiver the Delay to be awoken when the  
>>> next timer interrupt occurs. This method should only be called  
>>> from a block protected by the AccessProtect semaphore."
>>>     TimerEventLoop ifNotNil:[^nil].
>>>     ActiveDelay := self.
>>>     ActiveDelayStartTime := Time millisecondClockValue.
>>>     ActiveDelayStartTime > resumptionTime ifTrue:[
>>>         ActiveDelay signalWaitingProcess.
>>>         SuspendedDelays isEmpty ifTrue:[
>>>             ActiveDelay := nil.
>>>             ActiveDelayStartTime := nil.
>>>         ] ifFalse:[SuspendedDelays removeFirst activate].
>>>     ] ifFalse:[
>>>         TimingSemaphore initSignals.
>>>         Delay primSignal: TimingSemaphore atMilliseconds:  
>>> resumptionTime.
>>>     ].! !
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
>>> schedule
>>>     "Private!! Schedule this Delay, but return immediately rather  
>>> than waiting. The receiver's semaphore will be signalled when its  
>>> delay duration has elapsed."
>>>
>>>     beingWaitedOn ifTrue: [self error: 'This Delay has already  
>>> been scheduled.'].
>>>
>>>     TimerEventLoop ifNotNil:[^self scheduleEvent].
>>>     AccessProtect critical: [
>>>         beingWaitedOn := true.
>>>         resumptionTime := Time millisecondClockValue +  
>>> delayDuration.
>>>         ActiveDelay == nil
>>>             ifTrue: [self activate]
>>>             ifFalse: [
>>>                 resumptionTime < ActiveDelay resumptionTime
>>>                     ifTrue: [
>>>                         SuspendedDelays add: ActiveDelay.
>>>                         self activate]
>>>                     ifFalse: [SuspendedDelays add: self]]].
>>> ! !
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 22:33'!
>>> scheduleEvent
>>>     "Schedule this delay"
>>>     resumptionTime := Time millisecondClockValue + delayDuration.
>>>     AccessProtect critical:[
>>>         ScheduledDelay := self.
>>>         TimingSemaphore signal.
>>>     ].! !
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'!
>>> unschedule
>>>     "Unschedule this Delay. Do nothing if it wasn't scheduled."
>>>
>>>     | done |
>>>     TimerEventLoop ifNotNil:[^self unscheduleEvent].
>>>     AccessProtect critical: [
>>>         done := false.
>>>         [done] whileFalse:
>>>             [SuspendedDelays remove: self ifAbsent: [done := true]].
>>>         ActiveDelay == self ifTrue: [
>>>             SuspendedDelays isEmpty
>>>                 ifTrue: [
>>>                     ActiveDelay := nil.
>>>                     ActiveDelayStartTime := nil]
>>>                 ifFalse: [
>>>                     SuspendedDelays removeFirst activate]]].
>>> ! !
>>>
>>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:56'!
>>> unscheduleEvent
>>>     AccessProtect critical:[
>>>         FinishedDelay := self.
>>>         TimingSemaphore signal.
>>>     ].! !
>>>
>>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
>>> beingWaitedOn
>>>     "Answer whether this delay is currently scheduled, e.g.,  
>>> being waited on"
>>>     ^beingWaitedOn! !
>>>
>>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'!
>>> beingWaitedOn: aBool
>>>     "Indicate whether this delay is currently scheduled, e.g.,  
>>> being waited on"
>>>     beingWaitedOn := aBool! !
>>>
>>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 20:56'!
>>> delayDuration
>>>     ^delayDuration! !
>>>
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007  
>>> 10:35'!
>>> handleTimerEvent
>>>     "Handle a timer event; which can be either:
>>>         - a schedule request (ScheduledDelay notNil)
>>>         - an unschedule request (FinishedDelay notNil)
>>>         - a timer signal (not explicitly specified)
>>>     We check for timer expiry every time we get a signal."
>>>     | nextTick |
>>>     "Wait until there is work to do."
>>>     TimingSemaphore wait.
>>>
>>>     "Process any schedule requests"
>>>     ScheduledDelay ifNotNil:[
>>>         "Schedule the given delay"
>>>         self scheduleDelay: ScheduledDelay.
>>>         ScheduledDelay := nil.
>>>     ].
>>>
>>>     "Process any unschedule requests"
>>>     FinishedDelay ifNotNil:[
>>>         self unscheduleDelay: FinishedDelay.
>>>         FinishedDelay := nil.
>>>     ].
>>>
>>>     "Check for clock wrap-around."
>>>     nextTick := Time millisecondClockValue.
>>>     nextTick < ActiveDelayStartTime ifTrue: [
>>>         "clock wrapped"
>>>         self saveResumptionTimes.
>>>         self restoreResumptionTimes.
>>>     ].
>>>     ActiveDelayStartTime := nextTick.
>>>
>>>     "Signal any expired delays"
>>>     [ActiveDelay notNil and:[
>>>         Time millisecondClockValue >= ActiveDelay  
>>> resumptionTime]] whileTrue:[
>>>             ActiveDelay signalWaitingProcess.
>>>             SuspendedDelays isEmpty                 ifTrue:  
>>> [ActiveDelay := nil]                 ifFalse:[ActiveDelay :=  
>>> SuspendedDelays removeFirst].
>>>         ].
>>>
>>>     "And signal when the next request is due. We sleep at most  
>>> 1sec here
>>>     as a soft busy-loop so that we don't accidentally miss signals."
>>>     nextTick := Time millisecondClockValue + 1000.
>>>     ActiveDelay ifNotNil:[nextTick := nextTick min: ActiveDelay  
>>> resumptionTime].
>>>     nextTick := nextTick min: SmallInteger maxVal.
>>>
>>>     "Since we have processed all outstanding requests, reset the  
>>> timing semaphore so
>>>     that only new work will wake us up again. Do this RIGHT  
>>> BEFORE setting the next
>>>     wakeup call from the VM because it is only signaled once so  
>>> we mustn't miss it."
>>>     TimingSemaphore initSignals.
>>>     Delay primSignal: TimingSemaphore atMilliseconds: nextTick.
>>> ! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007  
>>> 09:04'!
>>> runTimerEventLoop
>>>     "Run the timer event loop."
>>>     [
>>>         [RunTimerEventLoop] whileTrue: [self handleTimerEvent]
>>>     ] on: Error do:[:ex|
>>>         "Clear out the process so it does't get killed"
>>>         TimerEventLoop := nil.
>>>         "Launch the old-style interrupt watcher"
>>>         self startTimerInterruptWatcher.
>>>         "And pass the exception on"
>>>         ex pass.
>>>     ].! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007  
>>> 22:32'!
>>> scheduleDelay: aDelay
>>>     "Private. Schedule this Delay."
>>>     aDelay beingWaitedOn: true.
>>>     ActiveDelay ifNil:[
>>>         ActiveDelay := aDelay
>>>     ] ifNotNil:[
>>>         aDelay resumptionTime < ActiveDelay resumptionTime ifTrue:[
>>>             SuspendedDelays add: ActiveDelay.
>>>             ActiveDelay := aDelay.
>>>         ] ifFalse: [SuspendedDelays add: aDelay].
>>>     ].
>>> ! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007  
>>> 10:18'!
>>> startTimerEventLoop
>>>     "Start the timer event loop"
>>>     "Delay startTimerEventLoop"
>>>     self stopTimerEventLoop.
>>>     self stopTimerInterruptWatcher.
>>>     AccessProtect := Semaphore forMutualExclusion.
>>>     ActiveDelayStartTime := Time millisecondClockValue.
>>>     SuspendedDelays :=         Heap withAll: (SuspendedDelays  
>>> ifNil:[#()])
>>>             sortBlock: [:d1 :d2 | d1 resumptionTime <= d2  
>>> resumptionTime].
>>>     TimingSemaphore := Semaphore new.
>>>     RunTimerEventLoop := true.
>>>     TimerEventLoop := [self runTimerEventLoop] newProcess.
>>>     TimerEventLoop priority: Processor timingPriority.
>>>     TimerEventLoop resume.
>>>     TimingSemaphore signal. "get going"
>>> ! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007  
>>> 22:32'!
>>> startTimerInterruptWatcher
>>>     "Reset the class variables that keep track of active Delays  
>>> and re-start the timer interrupt watcher process. Any currently  
>>> scheduled delays are forgotten."
>>>     "Delay startTimerInterruptWatcher"
>>>     | p |
>>>     self stopTimerEventLoop.
>>>     self stopTimerInterruptWatcher.
>>>     TimingSemaphore := Semaphore new.
>>>     AccessProtect := Semaphore forMutualExclusion.
>>>     SuspendedDelays :=         SortedCollection  
>>> sortBlock:             [:d1 :d2 | d1 resumptionTime <= d2  
>>> resumptionTime].
>>>     ActiveDelay := nil.
>>>     p := [self timerInterruptWatcher] newProcess.
>>>     p priority: Processor timingPriority.
>>>     p resume.
>>> ! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007  
>>> 21:26'!
>>> stopTimerEventLoop
>>>     "Stop the timer event loop"
>>>     RunTimerEventLoop := false.
>>>     TimingSemaphore signal.
>>>     TimerEventLoop := nil.! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007  
>>> 21:32'!
>>> stopTimerInterruptWatcher
>>>     "Reset the class variables that keep track of active Delays  
>>> and re-start the timer interrupt watcher process. Any currently  
>>> scheduled delays are forgotten."
>>>     "Delay startTimerInterruptWatcher"
>>>     self primSignal: nil atMilliseconds: 0.
>>>     TimingSemaphore ifNotNil:[TimingSemaphore terminateProcess].! !
>>>
>>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007  
>>> 22:33'!
>>> unscheduleDelay: aDelay
>>>     "Private. Unschedule this Delay."
>>>     ActiveDelay == aDelay ifTrue: [
>>>         SuspendedDelays isEmpty ifTrue:[
>>>             ActiveDelay := nil.
>>>         ] ifFalse: [
>>>             ActiveDelay := SuspendedDelays removeFirst.
>>>         ]
>>>     ] ifFalse:[
>>>         SuspendedDelays remove: aDelay ifAbsent: [].
>>>     ].
>>>     aDelay beingWaitedOn: false.! !
>>>
>>> !Delay class methodsFor: 'class initialization' stamp: 'ar  
>>> 7/11/2007 18:16'!
>>> initialize
>>>     "Delay initialize"
>>>     self startTimerEventLoop.! !
>>>
>>> Delay initialize!
>>>
>>>
>>> --------------------------------------------------------------------
>>> ----
>>>
>>>
>
>
>


Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

Bert Freudenberg
In reply to this post by Steven W Riggins
On Jul 24, 2007, at 18:30 , Steven W Riggins wrote:

> Could this be related to why monticello goes to sleep and never  
> wakes up until you wiggle the mouse?

I hope so (although you mean squeaksource).

- Bert -




Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

timrowledge
In reply to this post by Bert Freudenberg

On 24-Jul-07, at 24-Jul;1:40 AM, Bert Freudenberg wrote:

> On Jul 24, 2007, at 10:16 , Damien Cassou wrote:
>
>> Hi,
>>
>> 2007/7/24, Janko Mivšek <[hidden email]>:
>>> That's very important patch and very interesting to me too,  
>>> because I'm
>>> just deciding to put some of my public Aida/Web websites from VW to
>>> Squeak and I was afraid of such issues as one you just solved.
>>>
>>> Is there any chance that this patch goes to 3.10?
>>
>> Chance would be greater if unit tests were included.
>
> No. It just takes a couple of people filing this in and using the  
> image for a while. Preferably on servers. And then reporting their  
> findings.

Whilst it may be effectively impossible to do unit tests to cover the  
problem this code is intended to fix it would be quite nice to have  
some Delay tests in the image (caveat - I've just checked my  
*working* image which is a sophie development image and thus 3.8  
based) to make sure that the Delay and related classes work the way  
we expect after this fix is applied. T'would be tragic to find that  
it stops a major lockup but breaks some small routine function.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: BPB: Branch on Program Bug



Reply | Threaded
Open this post in threaded view
|

Re: Delay and Server reliability

Steven W Riggins
In reply to this post by Bert Freudenberg

On Jul 24, 2007, at 9:34 AM, Bert Freudenberg wrote:

> On Jul 24, 2007, at 18:30 , Steven W Riggins wrote:
>
>> Could this be related to why monticello goes to sleep and never  
>> wakes up until you wiggle the mouse?
>
> I hope so (although you mean squeaksource).
>

Yeah I suppose, I only ever see MC, not the back end, except when I  
am wiggling the mouse!

If this fixes the squeaksource bug, I will personally buy dinner for  
whomever was involved with fixing this, unless it's like 100 people  
or something.

Dinner is much cheaper than the nearly 4 LCD screens I have smashed  
when trying to check something in 5 mins before I had to leave the  
house. :)

Reply | Threaded
Open this post in threaded view
|

RE: Delay and Server reliability

Gary Chambers-4
In reply to this post by Bert Freudenberg
+1 (not nice to have to rebuild...)

-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of Bert
Freudenberg
Sent: 24 July 2007 5:34 pm
To: The general-purpose Squeak developers list
Subject: Re: Delay and Server reliability


On Jul 24, 2007, at 18:30 , Steven W Riggins wrote:

> Could this be related to why monticello goes to sleep and never
> wakes up until you wiggle the mouse?

I hope so (although you mean squeaksource).

- Bert -





12