Hi -
We recently had some "fun" chasing server lockups (with truly awful uptimes of about a day or less before things went downhill) and were finally able to track a huge portion of it down to problems with Delay. The effect we were seeing on our servers was that the system would randomly lock up and either go down to 0% CPU or 100% CPU. After poking it with a USR1 signal (which, in our VMs is hooked up such that it prints all the call stacks in the image; it's a life-safer if you need to debug these issues) we usually found that all processes were waiting on Delay's AccessProtect (0%) or alternatively found that a particular process (the event tickler) would sit in a tight loop swallowing repeated errors complaining that "this delay is already scheduled". After hours and hours of testing, debugging, and a little stroke of luck we finally found out that all of these issues were caused by the fact that Delay's internal structures are updated by the calling process (insertion into and removal from SuspendedDelays) which renders the process susceptible to being terminated in the midst of updating these structures. If you look at the code, this is obviously an issue because if (for example) the calling process gets terminated while it's resorting SuspendedDelays the result is unpredictable. This is in particular an issue because the calling process is often running at a relatively low priority so interruption by other, high-priority processes is a common case. And if any of these higher priority processes kills the one that just happens to execute SortedCollection>>remove: anything can happen - from leaving a later delay in front of an earlier one (one of the cases we had indicated that this was just what had happened) to errors when doing the next insert/remove ("trying to evaluate a block that is already evaluated") to many more weirdnesses. Unfortunately, it is basically impossible to recreate this problem under any kind of controlled circumstances, mostly because you need a source of events that is truly independent from your time source. As a consequence of our findings we rewrote Delay to deal with these issues properly and, having deployed the changes about ten days ago on our servers, all of these sources of problems simply vanished. We haven't had a single server problem which we couldn't attribute to our own stupidity (such as running out of disk space ;-) The changes will in particular be helpful to you if you: * run network servers * fork processes to handle network requests * terminate these processes explicitly (on error conditions for example) * use Semaphore>>waitTimeoutMsecs: (all socket functions use this) If you have seen random, unexplained lockups of your server (0% CPU load while being locked up is a dead giveaway[*]) I'd recommend using the attached changes (which work best on top of a VM with David Lewis' 64bit fixes applied) and see if that helps. For us, they made the difference between running the server in Squeak and rewriting it in Java. I've also filed this as http://bugs.squeak.org/view.php?id=6576 [*] The 0% CPU lockups have sometimes been attributed to issues with Linux wait functions. After having seen the havoc that Delay wrecks on the system I don't buy these explanations any longer. A much simpler (and more likely) explanation is that Delay went wild. Cheers, - Andreas SafeDelay.cs (8K) Download Attachment |
Hi Andreas,
That's very important patch and very interesting to me too, because I'm just deciding to put some of my public Aida/Web websites from VW to Squeak and I was afraid of such issues as one you just solved. Is there any chance that this patch goes to 3.10? Best regards Janko Andreas Raab wrote: > Hi - > > We recently had some "fun" chasing server lockups (with truly awful > uptimes of about a day or less before things went downhill) and were > finally able to track a huge portion of it down to problems with Delay. > The effect we were seeing on our servers was that the system would > randomly lock up and either go down to 0% CPU or 100% CPU. > > After poking it with a USR1 signal (which, in our VMs is hooked up such > that it prints all the call stacks in the image; it's a life-safer if > you need to debug these issues) we usually found that all processes were > waiting on Delay's AccessProtect (0%) or alternatively found that a > particular process (the event tickler) would sit in a tight loop > swallowing repeated errors complaining that "this delay is already > scheduled". > > After hours and hours of testing, debugging, and a little stroke of luck > we finally found out that all of these issues were caused by the fact > that Delay's internal structures are updated by the calling process > (insertion into and removal from SuspendedDelays) which renders the > process susceptible to being terminated in the midst of updating these > structures. > > If you look at the code, this is obviously an issue because if (for > example) the calling process gets terminated while it's resorting > SuspendedDelays the result is unpredictable. This is in particular an > issue because the calling process is often running at a relatively low > priority so interruption by other, high-priority processes is a common > case. And if any of these higher priority processes kills the one that > just happens to execute SortedCollection>>remove: anything can happen - > from leaving a later delay in front of an earlier one (one of the cases > we had indicated that this was just what had happened) to errors when > doing the next insert/remove ("trying to evaluate a block that is > already evaluated") to many more weirdnesses. Unfortunately, it is > basically impossible to recreate this problem under any kind of > controlled circumstances, mostly because you need a source of events > that is truly independent from your time source. > > As a consequence of our findings we rewrote Delay to deal with these > issues properly and, having deployed the changes about ten days ago on > our servers, all of these sources of problems simply vanished. We > haven't had a single server problem which we couldn't attribute to our > own stupidity (such as running out of disk space ;-) > > The changes will in particular be helpful to you if you: > * run network servers > * fork processes to handle network requests > * terminate these processes explicitly (on error conditions for example) > * use Semaphore>>waitTimeoutMsecs: (all socket functions use this) > > If you have seen random, unexplained lockups of your server (0% CPU load > while being locked up is a dead giveaway[*]) I'd recommend using the > attached changes (which work best on top of a VM with David Lewis' 64bit > fixes applied) and see if that helps. For us, they made the difference > between running the server in Squeak and rewriting it in Java. > > I've also filed this as http://bugs.squeak.org/view.php?id=6576 > > [*] The 0% CPU lockups have sometimes been attributed to issues with > Linux wait functions. After having seen the havoc that Delay wrecks on > the system I don't buy these explanations any longer. A much simpler > (and more likely) explanation is that Delay went wild. > > Cheers, > - Andreas > > > ------------------------------------------------------------------------ > > 'From Croquet1.0beta of 11 April 2006 [latest update: #1] on 23 July 2007 at 11:53:23 pm'! > "Change Set: SafeDelay > Date: 23 July 2007 > Author: Andreas Raab > > This change set fixes a set of severe problems with concurrent use of Delay. Previously, many of the delay-internal structures were modified by the calling process which made it susceptible to being terminated in the middle of manipulating these structures and leave Delay (and consequently the entire system) in an inconsistent state. > > This change set fixes this problem by moving *all* manipulation of Delay's internal structures out of the calling process. As a side-effect it also removes the requirement of Delays being limited to SmallInteger range; the new code has no limitation on the duration of a delay. > > No tests are provided since outside of true asynchronous environments (networks) it is basically impossible to recreate the situation reliably."! > > > !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:24'! > activate > "Private!! Make the receiver the Delay to be awoken when the next timer interrupt occurs. This method should only be called from a block protected by the AccessProtect semaphore." > TimerEventLoop ifNotNil:[^nil]. > ActiveDelay := self. > ActiveDelayStartTime := Time millisecondClockValue. > ActiveDelayStartTime > resumptionTime ifTrue:[ > ActiveDelay signalWaitingProcess. > SuspendedDelays isEmpty ifTrue:[ > ActiveDelay := nil. > ActiveDelayStartTime := nil. > ] ifFalse:[SuspendedDelays removeFirst activate]. > ] ifFalse:[ > TimingSemaphore initSignals. > Delay primSignal: TimingSemaphore atMilliseconds: resumptionTime. > ].! ! > > !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'! > schedule > "Private!! Schedule this Delay, but return immediately rather than waiting. The receiver's semaphore will be signalled when its delay duration has elapsed." > > beingWaitedOn ifTrue: [self error: 'This Delay has already been scheduled.']. > > TimerEventLoop ifNotNil:[^self scheduleEvent]. > AccessProtect critical: [ > beingWaitedOn := true. > resumptionTime := Time millisecondClockValue + delayDuration. > ActiveDelay == nil > ifTrue: [self activate] > ifFalse: [ > resumptionTime < ActiveDelay resumptionTime > ifTrue: [ > SuspendedDelays add: ActiveDelay. > self activate] > ifFalse: [SuspendedDelays add: self]]]. > ! ! > > !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 22:33'! > scheduleEvent > "Schedule this delay" > resumptionTime := Time millisecondClockValue + delayDuration. > AccessProtect critical:[ > ScheduledDelay := self. > TimingSemaphore signal. > ].! ! > > !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'! > unschedule > "Unschedule this Delay. Do nothing if it wasn't scheduled." > > | done | > TimerEventLoop ifNotNil:[^self unscheduleEvent]. > AccessProtect critical: [ > done := false. > [done] whileFalse: > [SuspendedDelays remove: self ifAbsent: [done := true]]. > ActiveDelay == self ifTrue: [ > SuspendedDelays isEmpty > ifTrue: [ > ActiveDelay := nil. > ActiveDelayStartTime := nil] > ifFalse: [ > SuspendedDelays removeFirst activate]]]. > ! ! > > !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:56'! > unscheduleEvent > AccessProtect critical:[ > FinishedDelay := self. > TimingSemaphore signal. > ].! ! > > !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'! > beingWaitedOn > "Answer whether this delay is currently scheduled, e.g., being waited on" > ^beingWaitedOn! ! > > !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'! > beingWaitedOn: aBool > "Indicate whether this delay is currently scheduled, e.g., being waited on" > beingWaitedOn := aBool! ! > > !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 20:56'! > delayDuration > ^delayDuration! ! > > > !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 10:35'! > handleTimerEvent > "Handle a timer event; which can be either: > - a schedule request (ScheduledDelay notNil) > - an unschedule request (FinishedDelay notNil) > - a timer signal (not explicitly specified) > We check for timer expiry every time we get a signal." > | nextTick | > "Wait until there is work to do." > TimingSemaphore wait. > > "Process any schedule requests" > ScheduledDelay ifNotNil:[ > "Schedule the given delay" > self scheduleDelay: ScheduledDelay. > ScheduledDelay := nil. > ]. > > "Process any unschedule requests" > FinishedDelay ifNotNil:[ > self unscheduleDelay: FinishedDelay. > FinishedDelay := nil. > ]. > > "Check for clock wrap-around." > nextTick := Time millisecondClockValue. > nextTick < ActiveDelayStartTime ifTrue: [ > "clock wrapped" > self saveResumptionTimes. > self restoreResumptionTimes. > ]. > ActiveDelayStartTime := nextTick. > > "Signal any expired delays" > [ActiveDelay notNil and:[ > Time millisecondClockValue >= ActiveDelay resumptionTime]] whileTrue:[ > ActiveDelay signalWaitingProcess. > SuspendedDelays isEmpty > ifTrue: [ActiveDelay := nil] > ifFalse:[ActiveDelay := SuspendedDelays removeFirst]. > ]. > > "And signal when the next request is due. We sleep at most 1sec here > as a soft busy-loop so that we don't accidentally miss signals." > nextTick := Time millisecondClockValue + 1000. > ActiveDelay ifNotNil:[nextTick := nextTick min: ActiveDelay resumptionTime]. > nextTick := nextTick min: SmallInteger maxVal. > > "Since we have processed all outstanding requests, reset the timing semaphore so > that only new work will wake us up again. Do this RIGHT BEFORE setting the next > wakeup call from the VM because it is only signaled once so we mustn't miss it." > TimingSemaphore initSignals. > Delay primSignal: TimingSemaphore atMilliseconds: nextTick. > ! ! > > !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 09:04'! > runTimerEventLoop > "Run the timer event loop." > [ > [RunTimerEventLoop] whileTrue: [self handleTimerEvent] > ] on: Error do:[:ex| > "Clear out the process so it does't get killed" > TimerEventLoop := nil. > "Launch the old-style interrupt watcher" > self startTimerInterruptWatcher. > "And pass the exception on" > ex pass. > ].! ! > > !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:32'! > scheduleDelay: aDelay > "Private. Schedule this Delay." > aDelay beingWaitedOn: true. > ActiveDelay ifNil:[ > ActiveDelay := aDelay > ] ifNotNil:[ > aDelay resumptionTime < ActiveDelay resumptionTime ifTrue:[ > SuspendedDelays add: ActiveDelay. > ActiveDelay := aDelay. > ] ifFalse: [SuspendedDelays add: aDelay]. > ]. > ! ! > > !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 10:18'! > startTimerEventLoop > "Start the timer event loop" > "Delay startTimerEventLoop" > self stopTimerEventLoop. > self stopTimerInterruptWatcher. > AccessProtect := Semaphore forMutualExclusion. > ActiveDelayStartTime := Time millisecondClockValue. > SuspendedDelays := > Heap withAll: (SuspendedDelays ifNil:[#()]) > sortBlock: [:d1 :d2 | d1 resumptionTime <= d2 resumptionTime]. > TimingSemaphore := Semaphore new. > RunTimerEventLoop := true. > TimerEventLoop := [self runTimerEventLoop] newProcess. > TimerEventLoop priority: Processor timingPriority. > TimerEventLoop resume. > TimingSemaphore signal. "get going" > ! ! > > !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:32'! > startTimerInterruptWatcher > "Reset the class variables that keep track of active Delays and re-start the timer interrupt watcher process. Any currently scheduled delays are forgotten." > "Delay startTimerInterruptWatcher" > | p | > self stopTimerEventLoop. > self stopTimerInterruptWatcher. > TimingSemaphore := Semaphore new. > AccessProtect := Semaphore forMutualExclusion. > SuspendedDelays := > SortedCollection sortBlock: > [:d1 :d2 | d1 resumptionTime <= d2 resumptionTime]. > ActiveDelay := nil. > p := [self timerInterruptWatcher] newProcess. > p priority: Processor timingPriority. > p resume. > ! ! > > !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 21:26'! > stopTimerEventLoop > "Stop the timer event loop" > RunTimerEventLoop := false. > TimingSemaphore signal. > TimerEventLoop := nil.! ! > > !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 21:32'! > stopTimerInterruptWatcher > "Reset the class variables that keep track of active Delays and re-start the timer interrupt watcher process. Any currently scheduled delays are forgotten." > "Delay startTimerInterruptWatcher" > self primSignal: nil atMilliseconds: 0. > TimingSemaphore ifNotNil:[TimingSemaphore terminateProcess].! ! > > !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:33'! > unscheduleDelay: aDelay > "Private. Unschedule this Delay." > ActiveDelay == aDelay ifTrue: [ > SuspendedDelays isEmpty ifTrue:[ > ActiveDelay := nil. > ] ifFalse: [ > ActiveDelay := SuspendedDelays removeFirst. > ] > ] ifFalse:[ > SuspendedDelays remove: aDelay ifAbsent: []. > ]. > aDelay beingWaitedOn: false.! ! > > !Delay class methodsFor: 'class initialization' stamp: 'ar 7/11/2007 18:16'! > initialize > "Delay initialize" > self startTimerEventLoop.! ! > > Delay initialize! > > > ------------------------------------------------------------------------ > > -- Janko Mivšek AIDA/Web Smalltalk Web Application Server http://www.aidaweb.si |
Hi,
2007/7/24, Janko Mivšek <[hidden email]>: > That's very important patch and very interesting to me too, because I'm > just deciding to put some of my public Aida/Web websites from VW to > Squeak and I was afraid of such issues as one you just solved. > > Is there any chance that this patch goes to 3.10? Chance would be greater if unit tests were included. -- Damien Cassou |
In reply to this post by Janko Mivšek
You can apply the fix yourself; it works in all Squeak versions that I'm
aware of (and if not, you'll find out really quickly ;-) This is just the kind of thing for which I wanted to see some sort of "standard package" for so that people across various Squeak versions can benefit from it. Cheers, - Andreas Janko Mivšek wrote: > Hi Andreas, > > That's very important patch and very interesting to me too, because I'm > just deciding to put some of my public Aida/Web websites from VW to > Squeak and I was afraid of such issues as one you just solved. > > Is there any chance that this patch goes to 3.10? > > Best regards > Janko > > Andreas Raab wrote: >> Hi - >> >> We recently had some "fun" chasing server lockups (with truly awful >> uptimes of about a day or less before things went downhill) and were >> finally able to track a huge portion of it down to problems with >> Delay. The effect we were seeing on our servers was that the system >> would randomly lock up and either go down to 0% CPU or 100% CPU. >> >> After poking it with a USR1 signal (which, in our VMs is hooked up >> such that it prints all the call stacks in the image; it's a >> life-safer if you need to debug these issues) we usually found that >> all processes were waiting on Delay's AccessProtect (0%) or >> alternatively found that a particular process (the event tickler) >> would sit in a tight loop swallowing repeated errors complaining that >> "this delay is already scheduled". >> >> After hours and hours of testing, debugging, and a little stroke of >> luck we finally found out that all of these issues were caused by the >> fact that Delay's internal structures are updated by the calling >> process (insertion into and removal from SuspendedDelays) which >> renders the process susceptible to being terminated in the midst of >> updating these structures. >> >> If you look at the code, this is obviously an issue because if (for >> example) the calling process gets terminated while it's resorting >> SuspendedDelays the result is unpredictable. This is in particular an >> issue because the calling process is often running at a relatively low >> priority so interruption by other, high-priority processes is a common >> case. And if any of these higher priority processes kills the one that >> just happens to execute SortedCollection>>remove: anything can happen >> - from leaving a later delay in front of an earlier one (one of the >> cases we had indicated that this was just what had happened) to errors >> when doing the next insert/remove ("trying to evaluate a block that is >> already evaluated") to many more weirdnesses. Unfortunately, it is >> basically impossible to recreate this problem under any kind of >> controlled circumstances, mostly because you need a source of events >> that is truly independent from your time source. >> >> As a consequence of our findings we rewrote Delay to deal with these >> issues properly and, having deployed the changes about ten days ago on >> our servers, all of these sources of problems simply vanished. We >> haven't had a single server problem which we couldn't attribute to our >> own stupidity (such as running out of disk space ;-) >> >> The changes will in particular be helpful to you if you: >> * run network servers >> * fork processes to handle network requests >> * terminate these processes explicitly (on error conditions for example) >> * use Semaphore>>waitTimeoutMsecs: (all socket functions use this) >> >> If you have seen random, unexplained lockups of your server (0% CPU >> load while being locked up is a dead giveaway[*]) I'd recommend using >> the attached changes (which work best on top of a VM with David Lewis' >> 64bit fixes applied) and see if that helps. For us, they made the >> difference between running the server in Squeak and rewriting it in Java. >> >> I've also filed this as http://bugs.squeak.org/view.php?id=6576 >> >> [*] The 0% CPU lockups have sometimes been attributed to issues with >> Linux wait functions. After having seen the havoc that Delay wrecks on >> the system I don't buy these explanations any longer. A much simpler >> (and more likely) explanation is that Delay went wild. >> >> Cheers, >> - Andreas >> >> >> ------------------------------------------------------------------------ >> >> 'From Croquet1.0beta of 11 April 2006 [latest update: #1] on 23 July >> 2007 at 11:53:23 pm'! >> "Change Set: SafeDelay >> Date: 23 July 2007 >> Author: Andreas Raab >> >> This change set fixes a set of severe problems with concurrent use of >> Delay. Previously, many of the delay-internal structures were modified >> by the calling process which made it susceptible to being terminated >> in the middle of manipulating these structures and leave Delay (and >> consequently the entire system) in an inconsistent state. >> >> This change set fixes this problem by moving *all* manipulation of >> Delay's internal structures out of the calling process. As a >> side-effect it also removes the requirement of Delays being limited to >> SmallInteger range; the new code has no limitation on the duration of >> a delay. >> >> No tests are provided since outside of true asynchronous environments >> (networks) it is basically impossible to recreate the situation >> reliably."! >> >> >> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:24'! >> activate >> "Private!! Make the receiver the Delay to be awoken when the next >> timer interrupt occurs. This method should only be called from a block >> protected by the AccessProtect semaphore." >> TimerEventLoop ifNotNil:[^nil]. >> ActiveDelay := self. >> ActiveDelayStartTime := Time millisecondClockValue. >> ActiveDelayStartTime > resumptionTime ifTrue:[ >> ActiveDelay signalWaitingProcess. >> SuspendedDelays isEmpty ifTrue:[ >> ActiveDelay := nil. >> ActiveDelayStartTime := nil. >> ] ifFalse:[SuspendedDelays removeFirst activate]. >> ] ifFalse:[ >> TimingSemaphore initSignals. >> Delay primSignal: TimingSemaphore atMilliseconds: resumptionTime. >> ].! ! >> >> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'! >> schedule >> "Private!! Schedule this Delay, but return immediately rather than >> waiting. The receiver's semaphore will be signalled when its delay >> duration has elapsed." >> >> beingWaitedOn ifTrue: [self error: 'This Delay has already been >> scheduled.']. >> >> TimerEventLoop ifNotNil:[^self scheduleEvent]. >> AccessProtect critical: [ >> beingWaitedOn := true. >> resumptionTime := Time millisecondClockValue + delayDuration. >> ActiveDelay == nil >> ifTrue: [self activate] >> ifFalse: [ >> resumptionTime < ActiveDelay resumptionTime >> ifTrue: [ >> SuspendedDelays add: ActiveDelay. >> self activate] >> ifFalse: [SuspendedDelays add: self]]]. >> ! ! >> >> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 22:33'! >> scheduleEvent >> "Schedule this delay" >> resumptionTime := Time millisecondClockValue + delayDuration. >> AccessProtect critical:[ >> ScheduledDelay := self. >> TimingSemaphore signal. >> ].! ! >> >> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'! >> unschedule >> "Unschedule this Delay. Do nothing if it wasn't scheduled." >> >> | done | >> TimerEventLoop ifNotNil:[^self unscheduleEvent]. >> AccessProtect critical: [ >> done := false. >> [done] whileFalse: >> [SuspendedDelays remove: self ifAbsent: [done := true]]. >> ActiveDelay == self ifTrue: [ >> SuspendedDelays isEmpty >> ifTrue: [ >> ActiveDelay := nil. >> ActiveDelayStartTime := nil] >> ifFalse: [ >> SuspendedDelays removeFirst activate]]]. >> ! ! >> >> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:56'! >> unscheduleEvent >> AccessProtect critical:[ >> FinishedDelay := self. >> TimingSemaphore signal. >> ].! ! >> >> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'! >> beingWaitedOn >> "Answer whether this delay is currently scheduled, e.g., being >> waited on" >> ^beingWaitedOn! ! >> >> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'! >> beingWaitedOn: aBool >> "Indicate whether this delay is currently scheduled, e.g., being >> waited on" >> beingWaitedOn := aBool! ! >> >> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 20:56'! >> delayDuration >> ^delayDuration! ! >> >> >> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 10:35'! >> handleTimerEvent >> "Handle a timer event; which can be either: >> - a schedule request (ScheduledDelay notNil) >> - an unschedule request (FinishedDelay notNil) >> - a timer signal (not explicitly specified) >> We check for timer expiry every time we get a signal." >> | nextTick | >> "Wait until there is work to do." >> TimingSemaphore wait. >> >> "Process any schedule requests" >> ScheduledDelay ifNotNil:[ >> "Schedule the given delay" >> self scheduleDelay: ScheduledDelay. >> ScheduledDelay := nil. >> ]. >> >> "Process any unschedule requests" >> FinishedDelay ifNotNil:[ >> self unscheduleDelay: FinishedDelay. >> FinishedDelay := nil. >> ]. >> >> "Check for clock wrap-around." >> nextTick := Time millisecondClockValue. >> nextTick < ActiveDelayStartTime ifTrue: [ >> "clock wrapped" >> self saveResumptionTimes. >> self restoreResumptionTimes. >> ]. >> ActiveDelayStartTime := nextTick. >> >> "Signal any expired delays" >> [ActiveDelay notNil and:[ >> Time millisecondClockValue >= ActiveDelay resumptionTime]] >> whileTrue:[ >> ActiveDelay signalWaitingProcess. >> SuspendedDelays isEmpty ifTrue: >> [ActiveDelay := nil] ifFalse:[ActiveDelay := >> SuspendedDelays removeFirst]. >> ]. >> >> "And signal when the next request is due. We sleep at most 1sec here >> as a soft busy-loop so that we don't accidentally miss signals." >> nextTick := Time millisecondClockValue + 1000. >> ActiveDelay ifNotNil:[nextTick := nextTick min: ActiveDelay >> resumptionTime]. >> nextTick := nextTick min: SmallInteger maxVal. >> >> "Since we have processed all outstanding requests, reset the >> timing semaphore so >> that only new work will wake us up again. Do this RIGHT BEFORE >> setting the next >> wakeup call from the VM because it is only signaled once so we >> mustn't miss it." >> TimingSemaphore initSignals. >> Delay primSignal: TimingSemaphore atMilliseconds: nextTick. >> ! ! >> >> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 09:04'! >> runTimerEventLoop >> "Run the timer event loop." >> [ >> [RunTimerEventLoop] whileTrue: [self handleTimerEvent] >> ] on: Error do:[:ex| >> "Clear out the process so it does't get killed" >> TimerEventLoop := nil. >> "Launch the old-style interrupt watcher" >> self startTimerInterruptWatcher. >> "And pass the exception on" >> ex pass. >> ].! ! >> >> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:32'! >> scheduleDelay: aDelay >> "Private. Schedule this Delay." >> aDelay beingWaitedOn: true. >> ActiveDelay ifNil:[ >> ActiveDelay := aDelay >> ] ifNotNil:[ >> aDelay resumptionTime < ActiveDelay resumptionTime ifTrue:[ >> SuspendedDelays add: ActiveDelay. >> ActiveDelay := aDelay. >> ] ifFalse: [SuspendedDelays add: aDelay]. >> ]. >> ! ! >> >> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 10:18'! >> startTimerEventLoop >> "Start the timer event loop" >> "Delay startTimerEventLoop" >> self stopTimerEventLoop. >> self stopTimerInterruptWatcher. >> AccessProtect := Semaphore forMutualExclusion. >> ActiveDelayStartTime := Time millisecondClockValue. >> SuspendedDelays := Heap withAll: (SuspendedDelays >> ifNil:[#()]) >> sortBlock: [:d1 :d2 | d1 resumptionTime <= d2 >> resumptionTime]. >> TimingSemaphore := Semaphore new. >> RunTimerEventLoop := true. >> TimerEventLoop := [self runTimerEventLoop] newProcess. >> TimerEventLoop priority: Processor timingPriority. >> TimerEventLoop resume. >> TimingSemaphore signal. "get going" >> ! ! >> >> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:32'! >> startTimerInterruptWatcher >> "Reset the class variables that keep track of active Delays and >> re-start the timer interrupt watcher process. Any currently scheduled >> delays are forgotten." >> "Delay startTimerInterruptWatcher" >> | p | >> self stopTimerEventLoop. >> self stopTimerInterruptWatcher. >> TimingSemaphore := Semaphore new. >> AccessProtect := Semaphore forMutualExclusion. >> SuspendedDelays := SortedCollection sortBlock: >> [:d1 :d2 | d1 resumptionTime <= d2 resumptionTime]. >> ActiveDelay := nil. >> p := [self timerInterruptWatcher] newProcess. >> p priority: Processor timingPriority. >> p resume. >> ! ! >> >> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 21:26'! >> stopTimerEventLoop >> "Stop the timer event loop" >> RunTimerEventLoop := false. >> TimingSemaphore signal. >> TimerEventLoop := nil.! ! >> >> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 21:32'! >> stopTimerInterruptWatcher >> "Reset the class variables that keep track of active Delays and >> re-start the timer interrupt watcher process. Any currently scheduled >> delays are forgotten." >> "Delay startTimerInterruptWatcher" >> self primSignal: nil atMilliseconds: 0. >> TimingSemaphore ifNotNil:[TimingSemaphore terminateProcess].! ! >> >> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 22:33'! >> unscheduleDelay: aDelay >> "Private. Unschedule this Delay." >> ActiveDelay == aDelay ifTrue: [ >> SuspendedDelays isEmpty ifTrue:[ >> ActiveDelay := nil. >> ] ifFalse: [ >> ActiveDelay := SuspendedDelays removeFirst. >> ] >> ] ifFalse:[ >> SuspendedDelays remove: aDelay ifAbsent: []. >> ]. >> aDelay beingWaitedOn: false.! ! >> >> !Delay class methodsFor: 'class initialization' stamp: 'ar 7/11/2007 >> 18:16'! >> initialize >> "Delay initialize" >> self startTimerEventLoop.! ! >> >> Delay initialize! >> >> >> ------------------------------------------------------------------------ >> >> > |
In reply to this post by Damien Cassou-3
Damien Cassou wrote:
>> Is there any chance that this patch goes to 3.10? > > Chance would be greater if unit tests were included. Good luck with that. I tried for a couple of hours to find a reliable way of creating this problem with not even so much as a hint of being able to make it happen. The problem is that on any local machine you're never completely independent from the time source of that machine and if you are dependent on the time source you are in sync with Delay and everything will be fine. You need an independent source of events and I've yet to find someone who shows me how to write unit tests across multiple machines (and no, running multiple images on the same machine doesn't work because your process scheduler uses the same time source that your image uses so it's not independent). Cheers, - Andreas |
In reply to this post by Andreas.Raab
On Jul 24, 2007, at 9:45 , Andreas Raab wrote:
> Hi - > > We recently had some "fun" chasing server lockups (with truly awful > uptimes of about a day or less before things went downhill) and > were finally able to track a huge portion of it down to problems > with Delay. Yay! - Bert - |
In reply to this post by Damien Cassou-3
On Jul 24, 2007, at 10:16 , Damien Cassou wrote:
> Hi, > > 2007/7/24, Janko Mivšek <[hidden email]>: >> That's very important patch and very interesting to me too, >> because I'm >> just deciding to put some of my public Aida/Web websites from VW to >> Squeak and I was afraid of such issues as one you just solved. >> >> Is there any chance that this patch goes to 3.10? > > Chance would be greater if unit tests were included. No. It just takes a couple of people filing this in and using the image for a while. Preferably on servers. And then reporting their findings. - Bert - |
El 7/24/07 5:40 AM, "Bert Freudenberg" <[hidden email]> escribió: > No. It just takes a couple of people filing this in and using the > image for a while. Preferably on servers. And then reporting their > findings. > > - Bert - Damien and Janko I watching your finds. This is the last week of vacations, so on next Monday when students come , I put they to test on Mac , Windows XP and 98 and I hope on Umbuntu also. Edgar |
In reply to this post by Andreas.Raab
Andreas,
Congratulations and thanks. I am also interested in the "all callstacks" feature of your VM. Years ago, I hacked Dolphin to do that on break in runtime sessions, and later did the same thing in Squeak when one of my early projects was suffering a GUI lockup. Thanks to the callstacks, I was able to fix it. I _tried_ to make some associated VM patches available, but there was no interest that I could detect. Please put your solution in the mainstream VM. I would also suggest adding a VM menu command, or (what I did) simply changing the existing one-callstack command to dump all callstacks instead. My fixes to the VM mostly involved getting the right number of new lines between the stacks by moving the crs so they are added only when needed. I forget the terminology, but one iterates over more objects than are in fact associated with a waiting process, leading to spurious new lines in the output. I might even be able to find the code if it would help. But my hunch is you have already encountered and fixed it. Bill Wilhelm K. Schwab, Ph.D. University of Florida Department of Anesthesiology PO Box 100254 Gainesville, FL 32610-0254 Email: [hidden email] Tel: (352) 846-1285 FAX: (352) 392-7029 |
In reply to this post by Andreas.Raab
On Tue, 24 Jul 2007 01:17:59 -0700, Andreas Raab <[hidden email]>
wrote: > You can apply the fix yourself; it works in all Squeak versions that I'm > aware of (and if not, you'll find out really quickly ;-) This is just > the kind of thing for which I wanted to see some sort of "standard > package" for so that people across various Squeak versions can benefit > from it. I tried filing it into my 3.8 (#6665) image, and the following variables are undeclared: TimerEventLoop ScheduledDelay FinishedDelay Perhaps there was supposed to be a class definition included? Later, Jon -------------------------------------------------------------- Jon Hylands [hidden email] http://www.huv.com/jon Project: Micro Raptor (Small Biped Velociraptor Robot) http://www.huv.com/blog |
And RunTimerEventLoop.
Good work though. May explain a few mysteries! (using quite a lot of processes with Delays...) -----Original Message----- From: [hidden email] [mailto:[hidden email]] On Behalf Of Jon Hylands Sent: 24 July 2007 1:17 pm To: The general-purpose Squeak developers list Subject: Re: Delay and Server reliability On Tue, 24 Jul 2007 01:17:59 -0700, Andreas Raab <[hidden email]> wrote: > You can apply the fix yourself; it works in all Squeak versions that > I'm > aware of (and if not, you'll find out really quickly ;-) This is just > the kind of thing for which I wanted to see some sort of "standard > package" for so that people across various Squeak versions can benefit > from it. I tried filing it into my 3.8 (#6665) image, and the following variables are undeclared: TimerEventLoop ScheduledDelay FinishedDelay Perhaps there was supposed to be a class definition included? Later, Jon -------------------------------------------------------------- Jon Hylands [hidden email] http://www.huv.com/jon Project: Micro Raptor (Small Biped Velociraptor Robot) http://www.huv.com/blog |
In reply to this post by Andreas.Raab
Hi Andreas,
This is Terrific!! Thank you for doing it!! Ron > -----Original Message----- > From: [hidden email] [mailto:squeak-dev- > [hidden email]] On Behalf Of Andreas Raab > Sent: Tuesday, July 24, 2007 3:46 AM > To: The general-purpose Squeak developers list > Subject: Delay and Server reliability > > Hi - > > We recently had some "fun" chasing server lockups (with truly awful > uptimes of about a day or less before things went downhill) and were > finally able to track a huge portion of it down to problems with Delay. > The effect we were seeing on our servers was that the system would > randomly lock up and either go down to 0% CPU or 100% CPU. > > After poking it with a USR1 signal (which, in our VMs is hooked up such > that it prints all the call stacks in the image; it's a life-safer if > you need to debug these issues) we usually found that all processes were > waiting on Delay's AccessProtect (0%) or alternatively found that a > particular process (the event tickler) would sit in a tight loop > swallowing repeated errors complaining that "this delay is already > scheduled". > > After hours and hours of testing, debugging, and a little stroke of luck > we finally found out that all of these issues were caused by the fact > that Delay's internal structures are updated by the calling process > (insertion into and removal from SuspendedDelays) which renders the > process susceptible to being terminated in the midst of updating these > structures. > > If you look at the code, this is obviously an issue because if (for > example) the calling process gets terminated while it's resorting > SuspendedDelays the result is unpredictable. This is in particular an > issue because the calling process is often running at a relatively low > priority so interruption by other, high-priority processes is a common > case. And if any of these higher priority processes kills the one that > just happens to execute SortedCollection>>remove: anything can happen - > from leaving a later delay in front of an earlier one (one of the cases > we had indicated that this was just what had happened) to errors when > doing the next insert/remove ("trying to evaluate a block that is > already evaluated") to many more weirdnesses. Unfortunately, it is > basically impossible to recreate this problem under any kind of > controlled circumstances, mostly because you need a source of events > that is truly independent from your time source. > > As a consequence of our findings we rewrote Delay to deal with these > issues properly and, having deployed the changes about ten days ago on > our servers, all of these sources of problems simply vanished. We > haven't had a single server problem which we couldn't attribute to our > own stupidity (such as running out of disk space ;-) > > The changes will in particular be helpful to you if you: > * run network servers > * fork processes to handle network requests > * terminate these processes explicitly (on error conditions for example) > * use Semaphore>>waitTimeoutMsecs: (all socket functions use this) > > If you have seen random, unexplained lockups of your server (0% CPU load > while being locked up is a dead giveaway[*]) I'd recommend using the > attached changes (which work best on top of a VM with David Lewis' 64bit > fixes applied) and see if that helps. For us, they made the difference > between running the server in Squeak and rewriting it in Java. > > I've also filed this as http://bugs.squeak.org/view.php?id=6576 > > [*] The 0% CPU lockups have sometimes been attributed to issues with > Linux wait functions. After having seen the havoc that Delay wrecks on > the system I don't buy these explanations any longer. A much simpler > (and more likely) explanation is that Delay went wild. > > Cheers, > - Andreas |
In reply to this post by Jon Hylands
Ouch. You are right. Here is a variant with the class definition included.
Cheers, - Andreas Jon Hylands wrote: > On Tue, 24 Jul 2007 01:17:59 -0700, Andreas Raab <[hidden email]> > wrote: > >> You can apply the fix yourself; it works in all Squeak versions that I'm >> aware of (and if not, you'll find out really quickly ;-) This is just >> the kind of thing for which I wanted to see some sort of "standard >> package" for so that people across various Squeak versions can benefit >> from it. > > I tried filing it into my 3.8 (#6665) image, and the following variables > are undeclared: > > TimerEventLoop > ScheduledDelay > FinishedDelay > > Perhaps there was supposed to be a class definition included? > > Later, > Jon > > -------------------------------------------------------------- > Jon Hylands [hidden email] http://www.huv.com/jon > > Project: Micro Raptor (Small Biped Velociraptor Robot) > http://www.huv.com/blog > > SafeDelay-2.cs (9K) Download Attachment |
On Tue, 24 Jul 2007 09:03:44 -0700, Andreas Raab <[hidden email]>
wrote: > Ouch. You are right. Here is a variant with the class definition included. Thanks, that installs much better. I'll let you know how it works once I start testing again (I'm kinda down with pneumonia right now) - my server uses a bunch of processes and a lot of delays. In the current configuration, it is all on one machine, so I probably wouldn't run into the issue right now, but I also run with part of the server running on my PC, and the other part running on a gumstix, and they do socket communications in both directions. This may possibly explain why I have seen my gumstix system stop responding on occasion. Later, Jon -------------------------------------------------------------- Jon Hylands [hidden email] http://www.huv.com/jon Project: Micro Raptor (Small Biped Velociraptor Robot) http://www.huv.com/blog |
In reply to this post by Andreas.Raab
Could this be related to why monticello goes to sleep and never wakes
up until you wiggle the mouse? |
In reply to this post by Andreas.Raab
Yes.
I imagine the pain you got to chase it.... This kind of bug huge pain. Stef On 24 juil. 07, at 10:17, Andreas Raab wrote: > You can apply the fix yourself; it works in all Squeak versions > that I'm aware of (and if not, you'll find out really quickly ;-) > This is just the kind of thing for which I wanted to see some sort > of "standard package" for so that people across various Squeak > versions can benefit from it. > > Cheers, > - Andreas > > Janko Mivšek wrote: >> Hi Andreas, >> That's very important patch and very interesting to me too, >> because I'm just deciding to put some of my public Aida/Web >> websites from VW to Squeak and I was afraid of such issues as one >> you just solved. >> Is there any chance that this patch goes to 3.10? >> Best regards >> Janko >> Andreas Raab wrote: >>> Hi - >>> >>> We recently had some "fun" chasing server lockups (with truly >>> awful uptimes of about a day or less before things went downhill) >>> and were finally able to track a huge portion of it down to >>> problems with Delay. The effect we were seeing on our servers was >>> that the system would randomly lock up and either go down to 0% >>> CPU or 100% CPU. >>> >>> After poking it with a USR1 signal (which, in our VMs is hooked >>> up such that it prints all the call stacks in the image; it's a >>> life-safer if you need to debug these issues) we usually found >>> that all processes were waiting on Delay's AccessProtect (0%) or >>> alternatively found that a particular process (the event tickler) >>> would sit in a tight loop swallowing repeated errors complaining >>> that "this delay is already scheduled". >>> >>> After hours and hours of testing, debugging, and a little stroke >>> of luck we finally found out that all of these issues were caused >>> by the fact that Delay's internal structures are updated by the >>> calling process (insertion into and removal from SuspendedDelays) >>> which renders the process susceptible to being terminated in the >>> midst of updating these structures. >>> >>> If you look at the code, this is obviously an issue because if >>> (for example) the calling process gets terminated while it's >>> resorting SuspendedDelays the result is unpredictable. This is in >>> particular an issue because the calling process is often running >>> at a relatively low priority so interruption by other, high- >>> priority processes is a common case. And if any of these higher >>> priority processes kills the one that just happens to execute >>> SortedCollection>>remove: anything can happen - from leaving a >>> later delay in front of an earlier one (one of the cases we had >>> indicated that this was just what had happened) to errors when >>> doing the next insert/remove ("trying to evaluate a block that is >>> already evaluated") to many more weirdnesses. Unfortunately, it >>> is basically impossible to recreate this problem under any kind >>> of controlled circumstances, mostly because you need a source of >>> events that is truly independent from your time source. >>> >>> As a consequence of our findings we rewrote Delay to deal with >>> these issues properly and, having deployed the changes about ten >>> days ago on our servers, all of these sources of problems simply >>> vanished. We haven't had a single server problem which we >>> couldn't attribute to our own stupidity (such as running out of >>> disk space ;-) >>> >>> The changes will in particular be helpful to you if you: >>> * run network servers >>> * fork processes to handle network requests >>> * terminate these processes explicitly (on error conditions for >>> example) >>> * use Semaphore>>waitTimeoutMsecs: (all socket functions use this) >>> >>> If you have seen random, unexplained lockups of your server (0% >>> CPU load while being locked up is a dead giveaway[*]) I'd >>> recommend using the attached changes (which work best on top of a >>> VM with David Lewis' 64bit fixes applied) and see if that helps. >>> For us, they made the difference between running the server in >>> Squeak and rewriting it in Java. >>> >>> I've also filed this as http://bugs.squeak.org/view.php?id=6576 >>> >>> [*] The 0% CPU lockups have sometimes been attributed to issues >>> with Linux wait functions. After having seen the havoc that Delay >>> wrecks on the system I don't buy these explanations any longer. A >>> much simpler (and more likely) explanation is that Delay went wild. >>> >>> Cheers, >>> - Andreas >>> >>> >>> -------------------------------------------------------------------- >>> ---- >>> >>> 'From Croquet1.0beta of 11 April 2006 [latest update: #1] on 23 >>> July 2007 at 11:53:23 pm'! >>> "Change Set: SafeDelay >>> Date: 23 July 2007 >>> Author: Andreas Raab >>> >>> This change set fixes a set of severe problems with concurrent >>> use of Delay. Previously, many of the delay-internal structures >>> were modified by the calling process which made it susceptible to >>> being terminated in the middle of manipulating these structures >>> and leave Delay (and consequently the entire system) in an >>> inconsistent state. >>> >>> This change set fixes this problem by moving *all* manipulation >>> of Delay's internal structures out of the calling process. As a >>> side-effect it also removes the requirement of Delays being >>> limited to SmallInteger range; the new code has no limitation on >>> the duration of a delay. >>> >>> No tests are provided since outside of true asynchronous >>> environments (networks) it is basically impossible to recreate >>> the situation reliably."! >>> >>> >>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:24'! >>> activate >>> "Private!! Make the receiver the Delay to be awoken when the >>> next timer interrupt occurs. This method should only be called >>> from a block protected by the AccessProtect semaphore." >>> TimerEventLoop ifNotNil:[^nil]. >>> ActiveDelay := self. >>> ActiveDelayStartTime := Time millisecondClockValue. >>> ActiveDelayStartTime > resumptionTime ifTrue:[ >>> ActiveDelay signalWaitingProcess. >>> SuspendedDelays isEmpty ifTrue:[ >>> ActiveDelay := nil. >>> ActiveDelayStartTime := nil. >>> ] ifFalse:[SuspendedDelays removeFirst activate]. >>> ] ifFalse:[ >>> TimingSemaphore initSignals. >>> Delay primSignal: TimingSemaphore atMilliseconds: >>> resumptionTime. >>> ].! ! >>> >>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'! >>> schedule >>> "Private!! Schedule this Delay, but return immediately rather >>> than waiting. The receiver's semaphore will be signalled when its >>> delay duration has elapsed." >>> >>> beingWaitedOn ifTrue: [self error: 'This Delay has already >>> been scheduled.']. >>> >>> TimerEventLoop ifNotNil:[^self scheduleEvent]. >>> AccessProtect critical: [ >>> beingWaitedOn := true. >>> resumptionTime := Time millisecondClockValue + >>> delayDuration. >>> ActiveDelay == nil >>> ifTrue: [self activate] >>> ifFalse: [ >>> resumptionTime < ActiveDelay resumptionTime >>> ifTrue: [ >>> SuspendedDelays add: ActiveDelay. >>> self activate] >>> ifFalse: [SuspendedDelays add: self]]]. >>> ! ! >>> >>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 22:33'! >>> scheduleEvent >>> "Schedule this delay" >>> resumptionTime := Time millisecondClockValue + delayDuration. >>> AccessProtect critical:[ >>> ScheduledDelay := self. >>> TimingSemaphore signal. >>> ].! ! >>> >>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:55'! >>> unschedule >>> "Unschedule this Delay. Do nothing if it wasn't scheduled." >>> >>> | done | >>> TimerEventLoop ifNotNil:[^self unscheduleEvent]. >>> AccessProtect critical: [ >>> done := false. >>> [done] whileFalse: >>> [SuspendedDelays remove: self ifAbsent: [done := true]]. >>> ActiveDelay == self ifTrue: [ >>> SuspendedDelays isEmpty >>> ifTrue: [ >>> ActiveDelay := nil. >>> ActiveDelayStartTime := nil] >>> ifFalse: [ >>> SuspendedDelays removeFirst activate]]]. >>> ! ! >>> >>> !Delay methodsFor: 'private' stamp: 'ar 7/10/2007 21:56'! >>> unscheduleEvent >>> AccessProtect critical:[ >>> FinishedDelay := self. >>> TimingSemaphore signal. >>> ].! ! >>> >>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'! >>> beingWaitedOn >>> "Answer whether this delay is currently scheduled, e.g., >>> being waited on" >>> ^beingWaitedOn! ! >>> >>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 21:49'! >>> beingWaitedOn: aBool >>> "Indicate whether this delay is currently scheduled, e.g., >>> being waited on" >>> beingWaitedOn := aBool! ! >>> >>> !Delay methodsFor: 'public' stamp: 'ar 7/10/2007 20:56'! >>> delayDuration >>> ^delayDuration! ! >>> >>> >>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 >>> 10:35'! >>> handleTimerEvent >>> "Handle a timer event; which can be either: >>> - a schedule request (ScheduledDelay notNil) >>> - an unschedule request (FinishedDelay notNil) >>> - a timer signal (not explicitly specified) >>> We check for timer expiry every time we get a signal." >>> | nextTick | >>> "Wait until there is work to do." >>> TimingSemaphore wait. >>> >>> "Process any schedule requests" >>> ScheduledDelay ifNotNil:[ >>> "Schedule the given delay" >>> self scheduleDelay: ScheduledDelay. >>> ScheduledDelay := nil. >>> ]. >>> >>> "Process any unschedule requests" >>> FinishedDelay ifNotNil:[ >>> self unscheduleDelay: FinishedDelay. >>> FinishedDelay := nil. >>> ]. >>> >>> "Check for clock wrap-around." >>> nextTick := Time millisecondClockValue. >>> nextTick < ActiveDelayStartTime ifTrue: [ >>> "clock wrapped" >>> self saveResumptionTimes. >>> self restoreResumptionTimes. >>> ]. >>> ActiveDelayStartTime := nextTick. >>> >>> "Signal any expired delays" >>> [ActiveDelay notNil and:[ >>> Time millisecondClockValue >= ActiveDelay >>> resumptionTime]] whileTrue:[ >>> ActiveDelay signalWaitingProcess. >>> SuspendedDelays isEmpty ifTrue: >>> [ActiveDelay := nil] ifFalse:[ActiveDelay := >>> SuspendedDelays removeFirst]. >>> ]. >>> >>> "And signal when the next request is due. We sleep at most >>> 1sec here >>> as a soft busy-loop so that we don't accidentally miss signals." >>> nextTick := Time millisecondClockValue + 1000. >>> ActiveDelay ifNotNil:[nextTick := nextTick min: ActiveDelay >>> resumptionTime]. >>> nextTick := nextTick min: SmallInteger maxVal. >>> >>> "Since we have processed all outstanding requests, reset the >>> timing semaphore so >>> that only new work will wake us up again. Do this RIGHT >>> BEFORE setting the next >>> wakeup call from the VM because it is only signaled once so >>> we mustn't miss it." >>> TimingSemaphore initSignals. >>> Delay primSignal: TimingSemaphore atMilliseconds: nextTick. >>> ! ! >>> >>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 >>> 09:04'! >>> runTimerEventLoop >>> "Run the timer event loop." >>> [ >>> [RunTimerEventLoop] whileTrue: [self handleTimerEvent] >>> ] on: Error do:[:ex| >>> "Clear out the process so it does't get killed" >>> TimerEventLoop := nil. >>> "Launch the old-style interrupt watcher" >>> self startTimerInterruptWatcher. >>> "And pass the exception on" >>> ex pass. >>> ].! ! >>> >>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 >>> 22:32'! >>> scheduleDelay: aDelay >>> "Private. Schedule this Delay." >>> aDelay beingWaitedOn: true. >>> ActiveDelay ifNil:[ >>> ActiveDelay := aDelay >>> ] ifNotNil:[ >>> aDelay resumptionTime < ActiveDelay resumptionTime ifTrue:[ >>> SuspendedDelays add: ActiveDelay. >>> ActiveDelay := aDelay. >>> ] ifFalse: [SuspendedDelays add: aDelay]. >>> ]. >>> ! ! >>> >>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/11/2007 >>> 10:18'! >>> startTimerEventLoop >>> "Start the timer event loop" >>> "Delay startTimerEventLoop" >>> self stopTimerEventLoop. >>> self stopTimerInterruptWatcher. >>> AccessProtect := Semaphore forMutualExclusion. >>> ActiveDelayStartTime := Time millisecondClockValue. >>> SuspendedDelays := Heap withAll: (SuspendedDelays >>> ifNil:[#()]) >>> sortBlock: [:d1 :d2 | d1 resumptionTime <= d2 >>> resumptionTime]. >>> TimingSemaphore := Semaphore new. >>> RunTimerEventLoop := true. >>> TimerEventLoop := [self runTimerEventLoop] newProcess. >>> TimerEventLoop priority: Processor timingPriority. >>> TimerEventLoop resume. >>> TimingSemaphore signal. "get going" >>> ! ! >>> >>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 >>> 22:32'! >>> startTimerInterruptWatcher >>> "Reset the class variables that keep track of active Delays >>> and re-start the timer interrupt watcher process. Any currently >>> scheduled delays are forgotten." >>> "Delay startTimerInterruptWatcher" >>> | p | >>> self stopTimerEventLoop. >>> self stopTimerInterruptWatcher. >>> TimingSemaphore := Semaphore new. >>> AccessProtect := Semaphore forMutualExclusion. >>> SuspendedDelays := SortedCollection >>> sortBlock: [:d1 :d2 | d1 resumptionTime <= d2 >>> resumptionTime]. >>> ActiveDelay := nil. >>> p := [self timerInterruptWatcher] newProcess. >>> p priority: Processor timingPriority. >>> p resume. >>> ! ! >>> >>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 >>> 21:26'! >>> stopTimerEventLoop >>> "Stop the timer event loop" >>> RunTimerEventLoop := false. >>> TimingSemaphore signal. >>> TimerEventLoop := nil.! ! >>> >>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 >>> 21:32'! >>> stopTimerInterruptWatcher >>> "Reset the class variables that keep track of active Delays >>> and re-start the timer interrupt watcher process. Any currently >>> scheduled delays are forgotten." >>> "Delay startTimerInterruptWatcher" >>> self primSignal: nil atMilliseconds: 0. >>> TimingSemaphore ifNotNil:[TimingSemaphore terminateProcess].! ! >>> >>> !Delay class methodsFor: 'timer process' stamp: 'ar 7/10/2007 >>> 22:33'! >>> unscheduleDelay: aDelay >>> "Private. Unschedule this Delay." >>> ActiveDelay == aDelay ifTrue: [ >>> SuspendedDelays isEmpty ifTrue:[ >>> ActiveDelay := nil. >>> ] ifFalse: [ >>> ActiveDelay := SuspendedDelays removeFirst. >>> ] >>> ] ifFalse:[ >>> SuspendedDelays remove: aDelay ifAbsent: []. >>> ]. >>> aDelay beingWaitedOn: false.! ! >>> >>> !Delay class methodsFor: 'class initialization' stamp: 'ar >>> 7/11/2007 18:16'! >>> initialize >>> "Delay initialize" >>> self startTimerEventLoop.! ! >>> >>> Delay initialize! >>> >>> >>> -------------------------------------------------------------------- >>> ---- >>> >>> > > > |
In reply to this post by Steven W Riggins
On Jul 24, 2007, at 18:30 , Steven W Riggins wrote:
> Could this be related to why monticello goes to sleep and never > wakes up until you wiggle the mouse? I hope so (although you mean squeaksource). - Bert - |
In reply to this post by Bert Freudenberg
On 24-Jul-07, at 24-Jul;1:40 AM, Bert Freudenberg wrote: > On Jul 24, 2007, at 10:16 , Damien Cassou wrote: > >> Hi, >> >> 2007/7/24, Janko Mivšek <[hidden email]>: >>> That's very important patch and very interesting to me too, >>> because I'm >>> just deciding to put some of my public Aida/Web websites from VW to >>> Squeak and I was afraid of such issues as one you just solved. >>> >>> Is there any chance that this patch goes to 3.10? >> >> Chance would be greater if unit tests were included. > > No. It just takes a couple of people filing this in and using the > image for a while. Preferably on servers. And then reporting their > findings. Whilst it may be effectively impossible to do unit tests to cover the problem this code is intended to fix it would be quite nice to have some Delay tests in the image (caveat - I've just checked my *working* image which is a sophie development image and thus 3.8 based) to make sure that the Delay and related classes work the way we expect after this fix is applied. T'would be tragic to find that it stops a major lockup but breaks some small routine function. tim -- tim Rowledge; [hidden email]; http://www.rowledge.org/tim Strange OpCodes: BPB: Branch on Program Bug |
In reply to this post by Bert Freudenberg
On Jul 24, 2007, at 9:34 AM, Bert Freudenberg wrote: > On Jul 24, 2007, at 18:30 , Steven W Riggins wrote: > >> Could this be related to why monticello goes to sleep and never >> wakes up until you wiggle the mouse? > > I hope so (although you mean squeaksource). > Yeah I suppose, I only ever see MC, not the back end, except when I am wiggling the mouse! If this fixes the squeaksource bug, I will personally buy dinner for whomever was involved with fixing this, unless it's like 100 people or something. Dinner is much cheaper than the nearly 4 LCD screens I have smashed when trying to check something in 5 mins before I had to leave the house. :) |
In reply to this post by Bert Freudenberg
+1 (not nice to have to rebuild...)
-----Original Message----- From: [hidden email] [mailto:[hidden email]] On Behalf Of Bert Freudenberg Sent: 24 July 2007 5:34 pm To: The general-purpose Squeak developers list Subject: Re: Delay and Server reliability On Jul 24, 2007, at 18:30 , Steven W Riggins wrote: > Could this be related to why monticello goes to sleep and never > wakes up until you wiggle the mouse? I hope so (although you mean squeaksource). - Bert - |
Free forum by Nabble | Edit this page |