Idle panic?

Bill Schwab-2
Hi Blair,

Re that serial communications app, switching away from Delay to the
overlapped #sleep: helped in one area and has added a wrinkle (well, more
like a really deep crease<g>).  Here's the error that happens early on:

a GPFault('Invalid access to memory location. Reading 0xFFFFFFFF, IP
0x1000426C (C:\WINDOWS\SYSTEM\DOLPHINVM993.DLL)'):
 'Invalid access to memory location. Reading 0xFFFFFFFF, IP 0x1000426C
(C:\WINDOWS\SYSTEM\DOLPHINVM993.DLL)'

ProcessorScheduler>>gpFault:
[] in ProcessorScheduler>>vmi:list:no:with:
BlockClosure>>ifCurtailed:
ProcessorScheduler>>vmi:list:no:with:
ProcessorScheduler>>primUnwindInterrupt
[] in ProcessorScheduler>>vmi:list:no:with:
[] in BlockClosure>>ifCurtailed:
BlockClosure>>ifCurtailed:
ProcessorScheduler>>vmi:list:no:with:
ProcessorScheduler>>primUnwindInterrupt
[] in ProcessorScheduler>>vmi:list:no:with:
[] in BlockClosure>>ifCurtailed:
BlockClosure>>ifCurtailed:
ProcessorScheduler>>vmi:list:no:with:
[] in MerlinCommunications(EarMonitorCommunications)>>next
[] in Mutex>>critical:
BlockClosure>>ensure:
Mutex>>critical:
MerlinCommunications(EarMonitorCommunications)>>next
...
BlockClosure>>on:do:
[] in BlockClosure>>newProcess

This almost looks like I'm reading using a bad handle or buffer, or
something along those lines.  It's possible that somebody snuck past my
attempts at thread synchronization to read from the port before it was
opened, or to read from the buffer before it was full, etc.  However, I'd
expect to see another level or two of messages in the walkback if that were
the case.

The other thing that's interesting is that very similar walkbacks start
appearing after this, at about 60 times per second on the machine where I
was able to capture this particular view of things.  More interesting is
that it seems to be adding layers of #primUnwindInterrupt between the #next
call that fails and the ultimate GPF that crashed the app (no dump sadly).
By that I mean that the first call goes through one layer, the second two,
and so on; at least that's the way it looks from the first few walkbacks set
up side by side.

Does this sound like an idle panic?  Assuming so for the moment, the Wiki
states that this happens when no process is runnable.  You mentioned the
idle process as one not to block on an overlapped call; are other threads
immune to causing the problem directly?  I guess I'm asking whether I'd have
had to somehow inject one of my sleeps into the idler, if the VM is in fact
panicking?  If the VM is panicking and sending an interrupt to the thread
that's trying to sleep, what would be the effect?

If it's not a VM panic, any thoughts on what it might be other than a dirty
read?

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: Idle panic?

Blair McGlashan
Bill

You wrote in message news:8u51lr$q0pc$[hidden email]...
>
> Re that serial communications app, switching away from Delay to the
> overlapped #sleep: helped in one area and has added a wrinkle (well, more
> like a really deep crease<g>).
>...
> Does this sound like an idle panic?  ...

Not really, but it isn't possible to tell from the walkback because it
contains no data. We need to see the numbers of the original interrupts.
This sort of information can be gathered from the VM's own dump (from the raw
stack content dump), which you could force from your SessionManager's
logError: method, rather than writing a simple stack trace log. At 60 times
per second (!) that dump file would get pretty big, pretty fast, so you
might want to configure the dump to have shorter stack and walkback entries,
or to exit immediately. Another alternative would be to add the
DisableGPFTrap reg key:

    HKLM\Software\Object Arts\Dolphin Smalltalk\3.0\DisableGPFTrap


The value is unimportant.

Disabling the GPF trap will cause the VM to create a dump and exit without
attempting to recover from the access violations. The GPF trap is very
useful in development, especially when doing external interfacing work, but
sometimes less helpful in a runtime app.

You can test out whether the GPF trap is correctly disabled or not by
deliberately causing one, e.g.:

    (ExternalAddress fromInteger: 1) dwordAtOffset: 0

>...Assuming so for the moment, the Wiki
> states that this happens when no process is runnable.  You mentioned the
> idle process as one not to block on an overlapped call; are other threads
> immune to causing the problem directly?  I guess I'm asking whether I'd have
> had to somehow inject one of my sleeps into the idler, if the VM is in fact
> panicking?  If the VM is panicking and sending an interrupt to the thread
> that's trying to sleep, what would be the effect?

That depends on the application's response to having both the idle
process and, more importantly, the main process abruptly terminated and new
ones started. If it causes the same thing to happen again then a rapidly
degenerating spiral is created. In quite a lot of situations where I have
inadvertently created an "idle panic" situation (e.g. by inserting erroneous
code in the idle loop), it has not been recoverable because each newly
started process repeats the errors of its forebears and never learns from
their mistakes :-).

>
> If it's not a VM panic, any thoughts on what it might be other than a dirty
> read?

Is it possible that callbacks are arriving on other OS threads? This might
happen if you've overlapped something else which generates callbacks (BTW in
4.0 the VM will intercept such foreign-thread calls and route them back to
the VM's main thread). The VM crashdump will reveal whether a callback is
occurring on a worker thread.

Regards

Blair


Re: Idle panic?

Bill Schwab-2
Blair,

> > Does this sound like an idle panic?  ...
>
> Not really, but it isn't possible to tell from the walkback because it
> contains no data. We need to see the numbers of the original interrupts.

Quite reasonable, and not unexpected.


> This sort of information can be gathered from the VM's own dump (from the raw
> stack content dump), which you could force from your SessionManager's
> logError: method, rather than writing a simple stack trace log. At 60 times
> per second (!) that dump file would get pretty big, pretty fast,

Yup - that's why it's set up the way it is.


> so you
> might want to configure the dump to have shorter stack and walkback entries,

I like to have them set to full to catch the rare "out of the blue" crash,
but I can of course shorten them for this purpose.


> or to exit immediately. Another alternative would be to add the
> DisableGPFTrap reg key:
>
>     HKLM\Software\Object Arts\Dolphin Smalltalk\3.0\DisableGPFTrap
>
> The value is unimportant.

Sounds easy enough.  I might start here.


> Is it possible that callbacks are arriving on other OS threads?

I wondered about that too, or at least about the issue of synchronization
with the message queue.  I'm doing a lot of serial I/O; the relevant calls
are _not_ overlapped, but one thought was to queue a deferred action to read
and signal a semaphore.
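
The deferred-action idea might look something like the following workspace sketch. It assumes that `SessionManager inputState queueDeferredAction:` is available for handing a block to the main process, and it uses a literal byte array as a stand-in for the real serial read; neither detail comes from the app itself.

```smalltalk
| dataReady bytes |
dataReady := Semaphore new.
[
    "Ask the main process to perform the read between window messages.
     ASSUMPTION: InputState>>queueDeferredAction: is available."
    SessionManager inputState queueDeferredAction: [
        bytes := #[1 2 3].    "stand-in for the real, non-overlapped serial read"
        dataReady signal].
    "Block this background process until the deferred read has run."
    dataReady wait.
    "bytes can now be handed to the protocol code."
] fork.
```

The wait is done in a forked process so that the main process stays free to service the deferred action.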


> This might
> happen if you've overlapped something else which generates callbacks (BTW in
> 4.0 the VM will intercept such foreign-thread calls and route them back to
> the VM's main thread).

I'm impressed!!  How can you tell when to do it, or do you simply route all
callbacks like this?


> The VM crashdump will reveal whether a callback is
> occurring on a worker thread.

I doubt this is the cause, but, one never knows.  It's easy enough to get a
dump to see.  One change that I made this morning was to convert my Delay
references (just the few in this app) to Processor sleep: sends, converting
to milliseconds where needed.  The result is that it's now trivial to change
the type of delay.
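
A similar single switch point could also be kept in application code rather than in the scheduler. The sketch below is an assumption-laden illustration, not the app's actual code: `pause:` and the `useOverlappedSleep` instance variable are hypothetical names, and it assumes `Processor sleep:` is the overlapped implementation while `Delay` is the ordinary Smalltalk one.

```smalltalk
"Hypothetical method on the application's communications class."
pause: milliseconds
    "Every wait in the app goes through here, so the kind of delay can be
     changed in one place. useOverlappedSleep is a hypothetical Boolean
     instance variable set at startup."
    useOverlappedSleep
        ifTrue: [Processor sleep: milliseconds]
        ifFalse: [(Delay forMilliseconds: milliseconds) wait]
```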

As I type, a copy with Delay-based times is running downstairs - and has
been for well over an hour (a record<g>).  One factor that has shaken out is
that it took me a while to apply your terminate/mutex patch.  It turns out
that just a couple of weeks before I re-discovered that patch, I had "fixed"
a deadlock in this app by removing a critical section protecting serial port
handles.  I could never find the other participant in the "deadlock", which
now appears to have been caused by the mutex's being left locked.  With the
patch, I was able to restore the critical section and not "deadlock".  It's
possible that the whiz-bang serial cards are more sensitive to the problem,
or maybe simply by having more threads running around, there were more
opportunities to get into trouble.  Anyway, I'm glad to have the critical
section back.
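
As a rough picture of what such a critical section does, here is a minimal workspace sketch. The variable names are hypothetical and a Boolean stands in for the real serial port handle; the only point is that the reader and the shutdown code both go through the same Mutex.

```smalltalk
| portMutex portOpen reader closer |
portMutex := Mutex new.
portOpen := true.    "stand-in for a valid serial port handle"
reader := [
    portMutex critical: [
        "Only touch the port while holding the mutex and while it is open."
        portOpen ifTrue: ["read from the serial handle here"]]] fork.
closer := [
    portMutex critical: [
        "close the serial handle here"
        portOpen := false]] fork.
```

If a process is terminated while inside `critical:` and the mutex is not released, every later caller blocks, which would look much like the "deadlock" described above.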

The short-term bad news is that the machine that gave the dump is involved
in a liver transplant at the moment.  The other machine is at our disposal,
so I can put the overlapped delays back into the app and try to get a dump
from it.

Thanks for your help!!!

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: Idle panic?

Bill Schwab-2
Blair,

I'm starting to think I copied an incorrect file at one point; a combination
I thought I had tried before is now running very nicely.  The delay-based
app with the critical section protecting shutdown and with the new serial
card  =:0  has been running for 4+ hours.  One explanation for all of this
might be that I put the ailing overlapped-delay executable on the machine
sooner than I thought.  At this point, there are conflicting needs to "just
let it run" (to see how it does over time) and to experiment with different
versions to find out what the DLL happened.  There are a large number of
variables, and the software runs in an environment that's not easily
controlled.

One other wrinkle: I was running around with a trashed image and/or change
log for a few days.  This first became noticeable during a package save.
The simplest explanation for what happened is an altered change log.
Obviously, a fried image could generate bad executables; could an altered
change log affect deployment?  The good news is that I found a stable backup
(a little further back than I'd have liked) and filed stuff out of various
damaged images to get pretty much everything back as it was intended to be.

Maybe the best thing to do is to let this app run and then experiment on the
machine that's currently involved with the liver transplant.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: Idle panic?

Bill Schwab-2
Blair,

The bad news: it's official; there really is a problem.  The vendor seems to
be taking responsibility for it, and I suspect they will figure it out if it
is their fault.

Hopefully good news: you'll be glad to know that I just fielded an NT
machine.  It's a P3 with 128MB of RAM running NT4 sp 6.  My thinking is that
it will hopefully be more likely to work, or at least more likely to gripe
at me for whatever I might be doing wrong in my code.

The machine arrived with some virus checking software that I didn't want
running on a data collecting machine, so I uninstalled it.  The interesting
part was that the uninstall program saw fit to remove some things that IE
needed.  After copying a DLL from another machine, using Netscape(!!) to
download IE 5.5, and upgrading, I was able to get it going again.  The IE
installer wouldn't run w/o the repairs, so this is sorta a strike against
the zero admin initiative.  Ok, 2k's system backup/restore would help, but,
I fear they have too many dependencies for something that's used to install
system critical components.

The hardware installation went smoothly.

Also, this was the first real test of my (hopefully) NT-friendly installers.
It pretty much worked as planned, though in the interest of saving time and
risk, I admit to cheating a little by first installing Dolphin on the
machine to get the VM registered ;)    The higher level stuff (no cheating
possible there) worked nicely.

The app's startup speed is definitely tied to mouse movement over the splash
screen, suggesting that I'll probably end up hacking the idleNT loop like I
did for 9x.  This particular app seems ok in idle otherwise, though some of
the other apps might get into trouble.  Before I hacked the idler, they
would always run, but, with some long pauses.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]