Some more digging revealed a pending VW fix that might be related - it
purports to fix a hole in Semaphore>>critical: and I wonder whether it might apply to Squeak as well. Current Squeak implementation: critical: mutuallyExcludedBlock | blockValue | self wait. [blockValue _ mutuallyExcludedBlock value] ensure: [self signal]. ^blockValue Current VisualWorks implementation: critical: mutuallyExcludedBlock self waitIfCurtailedSignal. ^mutuallyExcludedBlock ensure: [self signal] Proposed fix (see http://tinyurl.com/9ed25): critical: mutuallyExcludedBlock | recover | recover := false. ^[ | result | self waitIfCurtailedSignal. "The dynamic translator creates code that can't be interrupted here (I think)." recover := true. result := mutuallyExcludedBlock value. recover := false. "The dynamic translator creates code that can't be interrupted here (I think)." self signal. result ] ifCurtailed: [ recover ifTrue: [ recover := false. self signal]]. The fix has some test code in its comment (again, see the tinyurl) that I'm going to run now, but I was wondering whether people could comment on the applicability to Squeak of this fix. In VW, it seems to make a difference... On 1/5/06, Cees De Groot <[hidden email]> wrote: > Below a stack trace we've been seeing a couple of times lately in our > wxSqueak project (which is mostly 3.8). It seems that SuspendedDelays > has a nil element somewhere. However, the code in Delay seems to be > safe against that and I can't see any code elsewhere in the image that > messes around with delays... > > Has anyone ever seen this? > > UndefinedObject(Object)>>doesNotUnderstand: #resumptionTime > [] in Delay class>>DoIt {[:t2 :t3 | t2 resumptionTime <= t3 resumptionTime]} > SortedCollection>>indexForInserting: > SortedCollection>>add: > [] in Delay>>schedule {[beingWaitedOn := true. resumptionTime := Time > millisecondClockValue + dela...]} > [] in Semaphore>>critical: {[t2 := t1 value]} > BlockContext>>ensure: > Semaphore>>critical: > Delay>>schedule > Delay>>wait > TUDispatchingDatagramChannel(TUDatagramChannel)>>retransmissionRun > [] in TUDispatchingDatagramChannel(TUDatagramChannel)>>start {[self > retransmissionRun]} > BlockContext>>repeat > [] in TUDispatchingDatagramChannel(TUDatagramChannel)>>start {[[self > retransmissionRun] repeat]} > [] in BlockContext>>newProcess {[self value. Processor terminateActive]} > |
Well, the test code which I reproduce below throws an error very
quickly on Squeak. When I patch #critical:, the test runs out just fine. Should this be adopted? 100 timesRepeat: [ | s procs | s := Semaphore forMutualExclusion. procs := (1 to: 50) collect: [:i | [Processor yield. [s critical: []] repeat] forkAt: Processor activeProcess priority - 1]. procs do: [:p | (Delay forMilliseconds: 2) wait. p terminate]. (Delay forMilliseconds: 100) wait. s isSignaled ifFalse: [self error: 'Failure to release critical semaphore']]. On 1/23/06, Cees De Groot <[hidden email]> wrote: > Some more digging revealed a pending VW fix that might be related - it > purports to fix a hole in Semaphore>>critical: and I wonder whether it > might apply to Squeak as well. > |
And here is another funny one. Two colleagues of mine where looking
whether our trouble was related to SortedCollection in combination with Semaphore (these are the primary objects involved with the original issue, and we have another couple of bug reports which shows sorted collections exhibiting funny behavior even though the are protected by Semaphores). They whipped this up: 10 timesRepeat: [ | collection threadCount semaphore safeguard | collection := SortedCollection new. safeguard := Semaphore forMutualExclusion. threadCount := 50. semaphore := Semaphore new. semaphore signal. (1 to: threadCount) do: [:each | Transcript show: each asString;cr. semaphore signal. [1 to: 250000 do: [:x | safeguard critical: [collection add: x; remove: x]]] fork. semaphore wait]. semaphore wait. self assert: collection isEmpty] This code reliably crashes in the "remove: x" mentioning that the object is not present in the collection. Which, looking at the code, is impossible. Is Semaphore broken? Or are we overlooking something in our test? |
By the way - you may need to run this one more than once. It doesn't
happen on every run. On 1/23/06, Cees De Groot <[hidden email]> wrote: > And here is another funny one. |
Discussing this with Radoslav Hodničák on the Parcplace.net #smalltalk
channel turned up that this could be due to blocks not being real closures in Squeak. Sending #fixTemps to the block that has [:x |...] seems to prevent the crash. "<rh> process 1 is invoked with x=1, runs into critical, adds 1 to collection, get preempted. process 2 is invokes with x=2 (x is now 2 for all processes), runs into critical block, stops, processes 1 now tries to remove 2 and fails" which makes sense. |
Urk! I thought Smalltalk had evolved to the point where blocks were
proper closures... this will definitely bite me as I have many years of functional programming language intuition, and the subtle distinction here will certainly lead me astray. #fixTemps doesn't do what I expect either. Is this just Squeak? Do other Smalltalks get it right? Are there plans to fix this in Squeak? ../Dave |
On 1/23/06, Dave Mason <[hidden email]> wrote:
> Is this just Squeak? There are Smalltalks that have real closures. But I can't give an exhaustive list so I cannot answer this in the positive. > to fix this in Squeak? > Yup, in 3.9a it already has been fixed IIRC. |
On 23.01.2006, at 16:06, Cees De Groot wrote: > On 1/23/06, Dave Mason <[hidden email]> wrote: >> Is this just Squeak? > > There are Smalltalks that have real closures. But I can't give an > exhaustive list so I cannot answer this in the positive. > >> to fix this in Squeak? >> > Yup, in 3.9a it already has been fixed IIRC. > More like one step towards it.... you can install an optional compiler from SqueakMap to have full closures... so there will be runtime support in 3.9, but you need to install the additional compiler package to make use of them. Marcus |
In reply to this post by Dave Mason-4
> Is this just Squeak? Do other Smalltalks get it right?
VW of course gets it right. I don't know about others. > Are there plans to fix this in Squeak? #compileUseNewCompiler #compileBlocksAsColsures The problem is that this makes all blocks (which are not inlined like #ifTrue:) much slower. This is a real issue with Seaside as well. In general as soon as you loop and use and html render block, you are in need of #fixTemps. |
In reply to this post by Cees De Groot
On 23 janv. 06, at 16:06, Cees De Groot wrote: > On 1/23/06, Dave Mason <[hidden email]> wrote: >> Is this just Squeak? > > There are Smalltalks that have real closures. But I can't give an > exhaustive list so I cannot answer this in the positive. VisualWorks VisualAge Dolphin Smalltalk/X GnuSmalltalk should have full blockclosures > >> to fix this in Squeak? >> > Yup, in 3.9a it already has been fixed IIRC. > > |
In reply to this post by Cees De Groot
Cees De Groot wrote:
> Should this be adopted? The VW fix does seem overly complex for Squeak - the main issue is that in Semaphore>>critical: we can be interrupted between the following two lines: self wait. [blockValue := mutuallyExcludedBlock value] ensure:[self signal]. Simply moving #wait inside the ensure'd block will cure the problem (for very, very specific and nitpicky reasons that I'm not going to explain in detail unless someone actually wants to know ;-) Cheers, - Andreas |
Er, if we're messing with this, and with rescheduling processes when
you change priorities, do we want to consider how Processor>>yield behaves? Should it allow a lower priority process to run if there are no processes runable at the same priority? On 23-Jan-06, at 7:54 PM, Andreas Raab wrote: > Cees De Groot wrote: >> Should this be adopted? > > The VW fix does seem overly complex for Squeak - the main issue is > that in Semaphore>>critical: we can be interrupted between the > following two lines: > > self wait. > [blockValue := mutuallyExcludedBlock value] ensure:[self signal]. > > Simply moving #wait inside the ensure'd block will cure the problem > (for very, very specific and nitpicky reasons that I'm not going to > explain in detail unless someone actually wants to know ;-) > > Cheers, > - Andreas > -- ======================================================================== === John M. McIntosh <[hidden email]> 1-800-477-2659 Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com ======================================================================== === |
On 23-Jan-06, at 8:19 PM, John M McIntosh wrote: > Er, if we're messing with this, and with rescheduling processes > when you change priorities, do we want to consider how > Processor>>yield behaves? Should it allow a lower > priority process to run if there are no processes runable at the > same priority? If you mean 'no >other< processes' then it would be almost completely pointless since at the very next opportunity the higher priority process would preempt the low priority one. In the case of 'no processes' then obviously a lower priority process would get a turn anyway. tim -- tim Rowledge; [hidden email]; http://www.rowledge.org/tim Useful random insult:- Cackles a lot, but I ain't seen no eggs yet. |
Well right now it only checks for processes in the same priority
group, doesn't consider ones higher, mind they should run at some point. Also doesn't consider any lower, so they'll never run, which is different from will run *a bit*. Howwcwe I will accept that run *a bit* might be a worthless exercise? Still if the intent is to yield it seems that is a poor choice of words if it never yields to lower priority processes. On 23-Jan-06, at 9:36 PM, tim Rowledge wrote: > > On 23-Jan-06, at 8:19 PM, John M McIntosh wrote: > >> Er, if we're messing with this, and with rescheduling processes >> when you change priorities, do we want to consider how >> Processor>>yield behaves? Should it allow a lower >> priority process to run if there are no processes runable at the >> same priority? > If you mean 'no >other< processes' then it would be almost > completely pointless since at the very next opportunity the higher > priority process would preempt the low priority one. In the case of > 'no processes' then obviously a lower priority process would get a > turn anyway. > > tim > -- > tim Rowledge; [hidden email]; http://www.rowledge.org/tim > Useful random insult:- Cackles a lot, but I ain't seen no eggs yet. > > > -- ======================================================================== === John M. McIntosh <[hidden email]> 1-800-477-2659 Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com ======================================================================== === |
In reply to this post by Andreas.Raab
On 1/24/06, Andreas Raab <[hidden email]> wrote:
> Simply moving #wait inside the ensure'd block will cure the problem (for > very, very specific and nitpicky reasons that I'm not going to explain > in detail unless someone actually wants to know ;-) > I do. I have a customer who is seeing error reports because of funny values turning up in collections that are supposed to be protected by critical sections. And the funny values turning up are very much the sort of stuff you get when that protection fails, so, yes, I desperately want to know about every possibility, no matter how remote (apart from that, I'm a curious guy ;-)) |
Ah, well, all right. The trick isn't in fixing the old problem - the
trick is in avoiding the new problem. Moving the call to #wait inside the ensured block certainly will fix the old problem of the semaphore not being signaled when the process gets terminated at the wrong point but how do you avoid the semaphore being signaled erranously because the process gets aborted after entering the ensured block and *before* we end up in #wait? (the same problem in reverse and just as bad) The main trick here is that we really check for interrupts *after* executing primitives (and not before or during) which means that because (and only because!) wait is a primitive its call will not be interrupted (this btw, is one of the reasons why message tally sometimes associates a lot of time with a "wrong" primitive call - it cannot interrupt it before the primitive is executed and therefore sees the "wrong" method being run). Which means that if you were to change #wait to call a method #primitiveWait it would no longer be safe (possible interruption after the primitive check in #wait). It might be fun to try this out! Hope this gets you the gist of it (it's really late here). Cheers, - Andreas Cees De Groot wrote: > On 1/24/06, Andreas Raab <[hidden email]> wrote: > >>Simply moving #wait inside the ensure'd block will cure the problem (for >>very, very specific and nitpicky reasons that I'm not going to explain >>in detail unless someone actually wants to know ;-) >> > > I do. I have a customer who is seeing error reports because of funny > values turning up in collections that are supposed to be protected by > critical sections. And the funny values turning up are very much the > sort of stuff you get when that protection fails, so, yes, I > desperately want to know about every possibility, no matter how remote > > (apart from that, I'm a curious guy ;-)) > > |
On 1/24/06, Andreas Raab <[hidden email]> wrote:
> Hope this gets you the gist of it (it's really late here). > Yeah, I think I do. Thanks for the explanation. |
In reply to this post by johnmci
On 23-Jan-06, at 10:59 PM, John M McIntosh wrote: > Well right now it only checks for processes in the same priority > group, doesn't consider ones higher, mind they should run at some > point. Exactly - you don't need to worry about the higher priority processes since they get a chance to take over at every checkInterrupts. Fiddling with yield won't achieve anything interesting. If a cpu sharing scheduler is wanted it would take a quite different chunk of code. tim -- tim Rowledge; [hidden email]; http://www.rowledge.org/tim Strange OpCodes: CM: Circulate Memory |
In reply to this post by Cees De Groot
Well, I have "proof" now:
http://www.cdegroot.com/blog/wp-content/uploads/screenshot.png shows a screenshot with two debuggers that seem to be inside the same critical section... The VM's dump stack results in: SortedCollection>>add: [] in Delay>>schedule {[beingWaitedOn := true. resumptionTime := Time millisecondClockValue + dela...]} [] in Semaphore>>critical: {[[self wait] ifCurtailed: [self signal]. recover := true. result := mutu...]} BlockContext>>ifCurtailed: Semaphore>>critical: Delay>>schedule Delay>>wait WorldState>>interCyclePause: WorldState>>doOneCycleFor: PasteUpMorph>>doOneCycle [] in Project class>>spawnNewProcess {[[World doOneCycle. Processor yield. false] whileFalse. nil]} [] in BlockContext>>newProcess {[self value. Processor terminateActive]} It *seems* that this only happens in wxSqueak. I have SysInternals' process explorer running, and the Squeak VM is running a bunch of threads but I'm not sure what these threads do. But apart from the guess that something like this could happen when multiple native threads start executing Squeak code, I'm at a loss what other cause could be behind this... HEEELP!!! ;-) |
I'll just keep spitting out info until someone gets a brilliant idea ;-)
http://elec.tric.nl/~cg/delay2.dump.txt (Unix LF terminated format) I created a little bit of code that dumps all processes when something goes wrong, and this dump has two processes inside the critical section of Delay>>schedule. One starts on line 453 ('a Process in Process>>suspend'). One starts on line 90 ('a Process in nil'). If you look at it, there's one funny thing. I added an #on:do: handler around the actual addition to the "SuspendedDelays add:..." bit in Delay>>schedule: .... [SuspendedDelays add: ActiveDelay] on: Error do: [:ex | Processor dumpOnFile: 'delay2.dump'. self halt] If you follow the stack frames from line 90, you'll encounter this #on:do:. However, if you do the same with the other process, there's no stack frame for the exception handler. the Delay>>schedule frame directly calls the SortedCollection>>add: frame. I checked, double checked, and checked again - this happens every time I get this bug in my image (available on request ;-)), it happens currently quite soon after I fire up the wx-based program, and always one of the processes - which can be different processes - misses this stack frame. Could it be that I have somehow two copies of Delay in my image? Which share the ScheduledDelays collection but not the Semaphore? This is sooo crazy, it must have an easy solution :-) |
Free forum by Nabble | Edit this page |