Smalltalk › Squeak › Squeak - Dev

Re: nil in SuspendedDelays???

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

28 messages Options

Cees De Groot

Re: nil in SuspendedDelays???

Some more digging revealed a pending VW fix that might be related - it
purports to fix a hole in Semaphore>>critical: and I wonder whether it
might apply to Squeak as well.

Current Squeak implementation:
critical: mutuallyExcludedBlock
| blockValue |
self wait.
[blockValue _ mutuallyExcludedBlock value]
ensure: [self signal].
^blockValue

Current VisualWorks implementation:
critical: mutuallyExcludedBlock
self waitIfCurtailedSignal.
^mutuallyExcludedBlock
ensure: [self signal]

Proposed fix (see http://tinyurl.com/9ed25):
critical: mutuallyExcludedBlock
| recover |
recover := false.
^[
| result |
self waitIfCurtailedSignal.
"The dynamic translator creates code that can't be
interrupted here (I think)."
recover := true.
result := mutuallyExcludedBlock value.
recover := false.
"The dynamic translator creates code that can't be
interrupted here (I think)."
self signal.
result
] ifCurtailed: [
recover ifTrue: [
recover := false.
self signal]].

The fix has some test code in its comment (again, see the tinyurl)
that I'm going to run now, but I was wondering whether people could
comment on the applicability to Squeak of this fix. In VW, it seems to
make a difference...

On 1/5/06, Cees De Groot <[hidden email]> wrote:

> Below a stack trace we've been seeing a couple of times lately in our
> wxSqueak project (which is mostly 3.8). It seems that SuspendedDelays
> has a nil element somewhere. However, the code in Delay seems to be
> safe against that and I can't see any code elsewhere in the image that
> messes around with delays...
>
> Has anyone ever seen this?
>
> UndefinedObject(Object)>>doesNotUnderstand: #resumptionTime
> [] in Delay class>>DoIt {[:t2 :t3 | t2 resumptionTime <= t3 resumptionTime]}
> SortedCollection>>indexForInserting:
> SortedCollection>>add:
> [] in Delay>>schedule {[beingWaitedOn := true. resumptionTime := Time
> millisecondClockValue + dela...]}
> [] in Semaphore>>critical: {[t2 := t1 value]}
> BlockContext>>ensure:
> Semaphore>>critical:
> Delay>>schedule
> Delay>>wait
> TUDispatchingDatagramChannel(TUDatagramChannel)>>retransmissionRun
> [] in TUDispatchingDatagramChannel(TUDatagramChannel)>>start {[self
> retransmissionRun]}
> BlockContext>>repeat
> [] in TUDispatchingDatagramChannel(TUDatagramChannel)>>start {[[self
> retransmissionRun] repeat]}
> [] in BlockContext>>newProcess {[self value. Processor terminateActive]}
>

Cees De Groot

Re: nil in SuspendedDelays???

Well, the test code which I reproduce below throws an error very
quickly on Squeak. When I patch #critical:, the test runs out just
fine.

Should this be adopted?

100 timesRepeat: [
| s procs |
s := Semaphore forMutualExclusion.
procs := (1 to: 50) collect: [:i |
[Processor yield. [s critical: []]
repeat] forkAt: Processor activeProcess priority - 1].
procs do: [:p |
(Delay forMilliseconds: 2) wait.
p terminate].
(Delay forMilliseconds: 100) wait.
s isSignaled ifFalse: [self error:
'Failure to release critical semaphore']].

On 1/23/06, Cees De Groot <[hidden email]> wrote:
> Some more digging revealed a pending VW fix that might be related - it
> purports to fix a hole in Semaphore>>critical: and I wonder whether it
> might apply to Squeak as well.
>

Cees De Groot

Re: nil in SuspendedDelays???

And here is another funny one. Two colleagues of mine where looking
whether our trouble was related to SortedCollection in combination
with Semaphore (these are the primary objects involved with the
original issue, and we have another couple of bug reports which shows
sorted collections exhibiting funny behavior even though the are
protected by Semaphores).

They whipped this up:

10 timesRepeat: [
| collection threadCount semaphore safeguard |
collection := SortedCollection new.

safeguard := Semaphore forMutualExclusion.

threadCount := 50.
semaphore := Semaphore new.
semaphore signal.

(1 to: threadCount) do: [:each |
Transcript show: each asString;cr.
semaphore signal.
[1 to: 250000 do: [:x |
safeguard critical: [collection add: x; remove: x]]] fork.
semaphore wait].

semaphore wait.

self assert: collection isEmpty]

This code reliably crashes in the "remove: x" mentioning that the
object is not present in the collection. Which, looking at the code,
is impossible.

Is Semaphore broken? Or are we overlooking something in our test?

Cees De Groot

Re: nil in SuspendedDelays???

By the way - you may need to run this one more than once. It doesn't
happen on every run.

On 1/23/06, Cees De Groot <[hidden email]> wrote:
> And here is another funny one.

Cees De Groot

Re: nil in SuspendedDelays???

Discussing this with Radoslav Hodničák on the Parcplace.net #smalltalk
channel turned up that this could be due to blocks not being real
closures in Squeak. Sending #fixTemps to the block that has [:x |...]
seems to prevent the crash.

"<rh> process 1 is invoked with x=1, runs into critical, adds 1 to
collection, get preempted. process 2 is invokes with x=2 (x is now 2
for all processes), runs into critical block, stops, processes 1 now
tries to remove 2 and fails"

which makes sense.

Dave Mason-4

Proper closures (was Re: nil in SuspendedDelays???)

Urk! I thought Smalltalk had evolved to the point where blocks were
proper closures... this will definitely bite me as I have many years of
functional programming language intuition, and the subtle distinction
here will certainly lead me astray. #fixTemps doesn't do what I expect
either.

Is this just Squeak? Do other Smalltalks get it right? Are there plans
to fix this in Squeak?

../Dave

Cees De Groot

Re: Proper closures (was Re: nil in SuspendedDelays???)

On 1/23/06, Dave Mason <[hidden email]> wrote:
> Is this just Squeak?

There are Smalltalks that have real closures. But I can't give an
exhaustive list so I cannot answer this in the positive.

> to fix this in Squeak?
>
Yup, in 3.9a it already has been fixed IIRC.

Marcus Denker

Re: Proper closures (was Re: nil in SuspendedDelays???)

On 23.01.2006, at 16:06, Cees De Groot wrote:

> On 1/23/06, Dave Mason <[hidden email]> wrote:
>> Is this just Squeak?
>
> There are Smalltalks that have real closures. But I can't give an
> exhaustive list so I cannot answer this in the positive.
>
>> to fix this in Squeak?
>>
> Yup, in 3.9a it already has been fixed IIRC.
>

More like one step towards it.... you can install an optional
compiler from SqueakMap to have full closures... so there
will be runtime support in 3.9, but you need to install
the additional compiler package to make use of them.

Marcus

Philippe Marschall

Re: Proper closures (was Re: nil in SuspendedDelays???)

In reply to this post by Dave Mason-4

> Is this just Squeak? Do other Smalltalks get it right?
VW of course gets it right. I don't know about others.

> Are there plans to fix this in Squeak?
#compileUseNewCompiler
#compileBlocksAsColsures
The problem is that this makes all blocks (which are not inlined like
#ifTrue:) much slower.

This is a real issue with Seaside as well. In general as soon as you
loop and use and html render block, you are in need of #fixTemps.

stéphane ducasse-2

Re: Proper closures (was Re: nil in SuspendedDelays???)

In reply to this post by Cees De Groot

On 23 janv. 06, at 16:06, Cees De Groot wrote:

> On 1/23/06, Dave Mason <[hidden email]> wrote:
>> Is this just Squeak?
>
> There are Smalltalks that have real closures. But I can't give an
> exhaustive list so I cannot answer this in the positive.

VisualWorks
VisualAge
Dolphin
Smalltalk/X
GnuSmalltalk
should have full blockclosures

>
>> to fix this in Squeak?
>>
> Yup, in 3.9a it already has been fixed IIRC.
>
>

Andreas.Raab

Re: nil in SuspendedDelays???

In reply to this post by Cees De Groot

Cees De Groot wrote:
> Should this be adopted?

The VW fix does seem overly complex for Squeak - the main issue is that
in Semaphore>>critical: we can be interrupted between the following two
lines:

self wait.
[blockValue := mutuallyExcludedBlock value] ensure:[self signal].

Simply moving #wait inside the ensure'd block will cure the problem (for
very, very specific and nitpicky reasons that I'm not going to explain
in detail unless someone actually wants to know ;-)

Cheers,
- Andreas

johnmci

Re: nil in SuspendedDelays???

Er, if we're messing with this, and with rescheduling processes when
you change priorities, do we want to consider how Processor>>yield
behaves? Should it allow a lower
priority process to run if there are no processes runable at the same
priority?

On 23-Jan-06, at 7:54 PM, Andreas Raab wrote:

> Cees De Groot wrote:
>> Should this be adopted?
>
> The VW fix does seem overly complex for Squeak - the main issue is
> that in Semaphore>>critical: we can be interrupted between the
> following two lines:
>
> self wait.
> [blockValue := mutuallyExcludedBlock value] ensure:[self signal].
>
> Simply moving #wait inside the ensure'd block will cure the problem
> (for very, very specific and nitpicky reasons that I'm not going to
> explain in detail unless someone actually wants to know ;-)
>
> Cheers,
> - Andreas
>

--
========================================================================
===
John M. McIntosh <[hidden email]> 1-800-477-2659
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
========================================================================
===

timrowledge

Re: nil in SuspendedDelays???

On 23-Jan-06, at 8:19 PM, John M McIntosh wrote:

> Er, if we're messing with this, and with rescheduling processes
> when you change priorities, do we want to consider how
> Processor>>yield behaves? Should it allow a lower
> priority process to run if there are no processes runable at the
> same priority?
If you mean 'no >other< processes' then it would be almost completely
pointless since at the very next opportunity the higher priority
process would preempt the low priority one. In the case of 'no
processes' then obviously a lower priority process would get a turn
anyway.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- Cackles a lot, but I ain't seen no eggs yet.

johnmci

Re: nil in SuspendedDelays???

Well right now it only checks for processes in the same priority
group, doesn't consider ones higher, mind they should run at some point.
Also doesn't consider any lower, so they'll never run, which is
different from will run *a bit*.
Howwcwe I will accept that run *a bit* might be a worthless exercise?
Still if the intent is to yield it seems that is a poor choice of
words if it never yields to lower
priority processes.

On 23-Jan-06, at 9:36 PM, tim Rowledge wrote:

>
> On 23-Jan-06, at 8:19 PM, John M McIntosh wrote:
>
>> Er, if we're messing with this, and with rescheduling processes
>> when you change priorities, do we want to consider how
>> Processor>>yield behaves? Should it allow a lower
>> priority process to run if there are no processes runable at the
>> same priority?
> If you mean 'no >other< processes' then it would be almost
> completely pointless since at the very next opportunity the higher
> priority process would preempt the low priority one. In the case of
> 'no processes' then obviously a lower priority process would get a
> turn anyway.
>
> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> Useful random insult:- Cackles a lot, but I ain't seen no eggs yet.
>
>
>

Cees De Groot

Re: nil in SuspendedDelays???

In reply to this post by Andreas.Raab

On 1/24/06, Andreas Raab <[hidden email]> wrote:
> Simply moving #wait inside the ensure'd block will cure the problem (for
> very, very specific and nitpicky reasons that I'm not going to explain
> in detail unless someone actually wants to know ;-)
>
I do. I have a customer who is seeing error reports because of funny
values turning up in collections that are supposed to be protected by
critical sections. And the funny values turning up are very much the
sort of stuff you get when that protection fails, so, yes, I
desperately want to know about every possibility, no matter how remote

(apart from that, I'm a curious guy ;-))

Andreas.Raab

Re: nil in SuspendedDelays???

Ah, well, all right. The trick isn't in fixing the old problem - the
trick is in avoiding the new problem. Moving the call to #wait inside
the ensured block certainly will fix the old problem of the semaphore
not being signaled when the process gets terminated at the wrong point
but how do you avoid the semaphore being signaled erranously because the
process gets aborted after entering the ensured block and *before* we
end up in #wait? (the same problem in reverse and just as bad)

The main trick here is that we really check for interrupts *after*
executing primitives (and not before or during) which means that because
(and only because!) wait is a primitive its call will not be interrupted
(this btw, is one of the reasons why message tally sometimes associates
a lot of time with a "wrong" primitive call - it cannot interrupt it
before the primitive is executed and therefore sees the "wrong" method
being run). Which means that if you were to change #wait to call a
method #primitiveWait it would no longer be safe (possible interruption
after the primitive check in #wait). It might be fun to try this out!

Hope this gets you the gist of it (it's really late here).

Cheers,
- Andreas

Cees De Groot wrote:

> On 1/24/06, Andreas Raab <[hidden email]> wrote:
>
>>Simply moving #wait inside the ensure'd block will cure the problem (for
>>very, very specific and nitpicky reasons that I'm not going to explain
>>in detail unless someone actually wants to know ;-)
>>
>
> I do. I have a customer who is seeing error reports because of funny
> values turning up in collections that are supposed to be protected by
> critical sections. And the funny values turning up are very much the
> sort of stuff you get when that protection fails, so, yes, I
> desperately want to know about every possibility, no matter how remote
>
> (apart from that, I'm a curious guy ;-))
>
>

Cees De Groot

Re: nil in SuspendedDelays???

On 1/24/06, Andreas Raab <[hidden email]> wrote:
> Hope this gets you the gist of it (it's really late here).
>
Yeah, I think I do. Thanks for the explanation.

timrowledge

Re: nil in SuspendedDelays???

In reply to this post by johnmci

On 23-Jan-06, at 10:59 PM, John M McIntosh wrote:

> Well right now it only checks for processes in the same priority
> group, doesn't consider ones higher, mind they should run at some
> point.
Exactly - you don't need to worry about the higher priority processes
since they get a chance to take over at every checkInterrupts.

Fiddling with yield won't achieve anything interesting. If a cpu
sharing scheduler is wanted it would take a quite different chunk of
code.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: CM: Circulate Memory

Cees De Groot

Re: nil in SuspendedDelays???

In reply to this post by Cees De Groot

Well, I have "proof" now:
http://www.cdegroot.com/blog/wp-content/uploads/screenshot.png shows a
screenshot with two debuggers that seem to be inside the same critical
section...

The VM's dump stack results in:

SortedCollection>>add:
[] in Delay>>schedule {[beingWaitedOn := true. resumptionTime := Time
millisecondClockValue + dela...]}
[] in Semaphore>>critical: {[[self wait] ifCurtailed: [self signal].
recover := true. result := mutu...]}
BlockContext>>ifCurtailed:
Semaphore>>critical:
Delay>>schedule
Delay>>wait
WorldState>>interCyclePause:
WorldState>>doOneCycleFor:
PasteUpMorph>>doOneCycle
[] in Project class>>spawnNewProcess {[[World doOneCycle. Processor
yield. false] whileFalse. nil]}
[] in BlockContext>>newProcess {[self value. Processor terminateActive]}

It *seems* that this only happens in wxSqueak. I have SysInternals'
process explorer running, and the Squeak VM is running a bunch of
threads but I'm not sure what these threads do. But apart from the
guess that something like this could happen when multiple native
threads start executing Squeak code, I'm at a loss what other cause
could be behind this...

HEEELP!!! ;-)

Cees De Groot

Re: nil in SuspendedDelays???

I'll just keep spitting out info until someone gets a brilliant idea ;-)

http://elec.tric.nl/~cg/delay2.dump.txt (Unix LF terminated format)

I created a little bit of code that dumps all processes when something
goes wrong, and this dump has two processes inside the critical
section of Delay>>schedule.

One starts on line 453 ('a Process in Process>>suspend').
One starts on line 90 ('a Process in nil').

If you look at it, there's one funny thing. I added an #on:do: handler
around the actual addition to the "SuspendedDelays add:..." bit in
Delay>>schedule:

....
[SuspendedDelays add: ActiveDelay] on: Error do: [:ex |
Processor dumpOnFile: 'delay2.dump'. self halt]

If you follow the stack frames from line 90, you'll encounter this
#on:do:. However, if you do the same with the other process, there's
no stack frame for the exception handler. the Delay>>schedule frame
directly calls the SortedCollection>>add: frame.

I checked, double checked, and checked again - this happens every time
I get this bug in my image (available on request ;-)), it happens
currently quite soon after I fire up the wx-based program, and always
one of the processes - which can be different processes - misses this
stack frame.

Could it be that I have somehow two copies of Delay in my image? Which
share the ScheduledDelays collection but not the Semaphore?

This is sooo crazy, it must have an easy solution :-)