More information about Dolphin bug

Geoff
Here is some more information about the bug I reported November 24.  First,
the symptom is much worse than it seems.  After you use Ctrl-Break to get out
of it, your image is not restored to normal.  You can no longer use any
delays at all.  Second, the cause of the bug is cumulative.  The code shown
below is just one way to make it happen.  It can also happen as a result of
using one image for doing a lot of miscellaneous development work involving
processes.

This code is not intended as any kind of benchmark or other such test.  It's
only to show this bug.  It was the result of spending a lot of time finding a
way to make the bug repeatable.  When I posted it November 24, I may have
given the impression that this bug was just an interesting quirk.  But it's a
serious bug which makes Dolphin unreliable for doing development work
involving large numbers of processes.  I hope Object Arts is giving it high
priority and will soon be able to explain what caused it.

Here is the code again to make it happen.  Be sure to discard the image after
running this in a workspace:

        | procs |
        procs := OrderedCollection new.
        50 timesRepeat: [
                procs add: [
                        50 timesRepeat: [Processor sleep: 1].
                ] fork.
        ].
        procs do: [:proc |  proc terminate].
        Processor sleep: 100.
        Processor sleep: 100.



Re: More information about Dolphin bug

Chris Uppal-2
I can't tell you exactly what's going wrong, but here are a few things to
help keep you working, and a step towards an explanation of what's happening.

First, in case you didn't already know: when your script locks up Dolphin,
you can break out of it by pressing <CONTROL><BREAK> and then choosing
"terminate" in the resulting walkback.

That leaves the system in a state where any other Delay>>wait or
Processor>>sleep: will lock up again.  This is because the AccessProtect
semaphore, used internally by the Delay class, has been left in an improper
state.  To get back to a normal state you can try the following (which works
reliably for me).

You need to define a new method on the class side of Delay:

================
prod2
 "Private - Forcibly reset the mutex to its 'normal' state"

 AccessProtect set.
================

You can now execute (Delay prod2) a few times to wake up any processes which
are wrongly still sleeping on it.  There may be more than one, so I execute
it several times, though I don't know whether that's necessary.  That will
also leave the Semaphore in the correct 'signalled' state.  It'd probably be
a good idea to execute (Delay prod) too.

It should now be possible to execute (Processor sleep: 1) without any ill
effects.  If so then your image is probably healthy again.

Now, as to what's going wrong.  I don't have the complete story, but the
lines:
================
procs := OrderedCollection new.
50 timesRepeat: [
 procs add: [
 50 timesRepeat: [Processor sleep: 1].
 ] fork.
].
procs do: [:proc | proc terminate].
procs := nil. [:proc | ] value: nil.
================
reliably leave the Delay class's AccessProtect semaphore with a few extra
signals (typically about half-a-dozen on this machine).  Once that has
happened it will let more than one process execute the critical section at
the same time, and sooner or later this will result in the Delay stuff
breaking.
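
To illustrate why extra signals matter, here is a minimal sketch that uses a
private mutex rather than Delay's own AccessProtect.  The variable names are
made up for the illustration, and the stray signal is added by hand, so this
only mimics the corrupted state; it is not what the Delay class actually
does.  It assumes the usual Semaphore class>>forMutualExclusion,
Semaphore>>critical: and Processor yield, and deliberately avoids any
delays:
================
| mutex done inside maxInside |
mutex := Semaphore forMutualExclusion.
mutex signal. "Hypothetical: add one stray signal by hand"
done := Semaphore new.
inside := 0.
maxInside := 0.
2 timesRepeat: [
 [mutex critical: [
  inside := inside + 1.
  maxInside := maxInside max: inside.
  Processor yield. "Let the other process try to enter as well"
  inside := inside - 1].
 done signal] fork].
2 timesRepeat: [done wait]. "Wait for both processes to finish"
maxInside
================
Displaying the result should show whether both processes were ever inside
the critical section together.  I would expect 2 with the stray signal, and
1 if the (mutex signal) line is removed.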

In fact, executing the following (all at one go) will leave the Semaphore
with one extra signal (which doesn't immediately cause visible problems, but
it will allow them to occur later since the Delay class is no longer
threadsafe):
================
p1 := [Processor sleep: 1000000] fork.
p2 := [Processor sleep: 1000000] fork.
p1 terminate.
p2 terminate.
================
This leaves the AccessProtect Semaphore with 2 signals against it instead of
1.

(Blair, if you haven't already solved the problem by the time you read this,
then you'll be relieved to know that it reproduces reliably on both my
slowish W98 machine and my fastish Win2K machine.)

It is necessary to have at least 2 processes for this to "work".  I put a
bit of tracing in, and it appears that the first process enters the critical
section in Delay>>wait OK, and that everything's still OK after it has been
interrupted by the #terminate.  The second process doesn't get into the
critical section at all, but unless it is started, and #terminated, the
problem doesn't occur.

Given that, I *suspect* that the problem is being caused when the second
process attempts to enter the critical section, but is interrupted before
it's really got into it.  The Semaphore gets signalled, so I think that the
underlying call to Semaphore>>wait:ret: must have started and then
"returned" WAIT_OBJECT_0.  But the VM apparently hadn't managed to decrement
its signal-count by the time the process was interrupted.  The net effect
is then to increment its signal-count wrongly.  If I'm right, then it looks
as if it requires a VM fix.

BTW, three suggestions for OA:

1)   Should Process>>sleep: use an #ensure: block to clear its Delay
object when it is terminated?  As it is, the Delay instance continues to
hang around after termination, which doesn't cause malfunctions, as such,
but is untidy and hangs on to unnecessary references.

2)   It'd be nice if the Process Monitor showed counts of the
active/dead/etc processes in its caption.

3)   Can you arrange for the Process Monitor to preserve the selection and
scroll position when it refreshes itself, please?  (Otherwise it's impossible
to terminate processes a long way down the list.)

    -- chris



Re: More information about Dolphin bug

Bill Schwab-2
Chris,

> 3)   Can you arrange for the Process Monitor to preserve the selection and
> scroll position when it refreshes itself please ?  (Otherwise it's
> impossible to terminate processes a long way down the list.)

As a workaround, can you simply slow down the refresh rate to give yourself
time to get there?  IIRC, there's an option to set the rate.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.



Re: More information about Dolphin bug

Chris Uppal-2
Bill,
> As a workaround, can you simply slow down the refresh rate to give
> yourself time to get there?  IIRC, there's an option to set the rate.

True, or even to pause it altogether.

Cheers.

> Bill

    -- chris