Any suggestions for how to debug this?

Bill Schwab
Hello all,

One of my apps does a lot of serial communications; we even had to shop
around for ways to add more serial ports for at least a couple of
installations.  The evolving picture is that things work fine over native
serial ports, and don't do so well over the added ports.

Conclusive testing has been complicated by a lack of excess hardware (slowly
getting fixed) and a chaotic environment in which equipment is sometimes
disconnected at no extra charge :)  We tried and "fired" one card vendor;
despite the chaos, we were able to get clear evidence that sending data over
their card would lock up a Win9x machine in short order.  In contrast, that
same machine running over native serial ports goes for weeks at a time as
long as its power cords are left in place.  That card's vendor seemed
completely unwilling and/or unable to help, so enter card vendor number two.

I am working on a similar test with the new card; so far, I've demonstrated
failure with the card and will soon switch to trying to show success on
native ports.  At first, I'll leave the card in place and simply not use it.
This vendor was kind enough to loan me an extra card to make the testing
easier.

The failure mode of the first card on Win9x was a complete vapor lock of the
machine: frozen screen, complete with visible but non-moving mouse cursor,
and a reset as the only recourse.  The new card on Win9x does a little better;
the app will freeze, but it's possible to interact with the machine and,
among other things, run a debug viewer.  The machine eventually locks up
when my app is terminated via the task list, but, it's a big improvement
over "Don't know, it just quit :(".

When the app hangs on Win9x with the new card, the output on the debug
viewer is a rapid stream of messages from the card's driver.  Of course,
that doesn't necessarily mean that it caused the problem.  This condition
can arise almost any time; it seems to take several hours to a couple of
days, though I've seen it happen as soon as two hours.

The new card on NT has a different failure mode: a blue screen.  So far,
this hasn't happened sooner than two days, and has taken up to a week to
appear.  One of our ongoing studies requires lots of serial ports and
reliable data collection.  We've been able to get that by running on NT and
simply remembering to reboot before each case.

There are some indications that the system gets "tired" after a couple of
days; it's hard to explain, so I won't try until I can quantify it.  If
there's anything to it, it might be due to memory or resource leaks, either
in my code or the driver.  One theory that I've kicked around is that the
Win9x rapid failures could be the fault of the drivers, and the longer term
NT meltdown might be caused by some kind of starvation that could turn out
to be my doing.

When I refer to days and weeks, it's important to keep in mind that it's
unclear whether the time interval is important, or whether two days or one
week was simply the time required to get to a certain number of
connect/disconnect cycles on one of the devices, or to reach some other
threshold.  There's probably more that I should say, but, I can't think of
it right now.
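
One way to separate "elapsed time" from "number of cycles" is to have the app leave a trail that survives a lockup or blue screen.  What follows is only a rough sketch in Python, not code from the app: the class name, the event names and the log format are all made up for illustration.  Each event bumps a counter and appends one line that is forced to disk, so the last complete line shows how far the counts had gotten when the machine died.

```python
import os
import time


class CheckpointLog:
    """Append-only counter log, forced to disk after every line.

    Hypothetical helper: event names must not contain spaces or '='.
    """

    def __init__(self, path):
        self.path = path
        self.start = time.time()
        self.counts = {}

    def record(self, event):
        """Count one event (e.g. 'connect') and write a durable checkpoint."""
        self.counts[event] = self.counts.get(event, 0) + 1
        elapsed = time.time() - self.start
        fields = " ".join(f"{k}={v}" for k, v in sorted(self.counts.items()))
        with open(self.path, "a") as f:
            f.write(f"{elapsed:.1f} {fields}\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to write the line out now


def last_checkpoint(path):
    """Return (elapsed_seconds, counts) from the last complete line, or None."""
    with open(path) as f:
        text = f.read()
    # Everything after the final newline is a partial line (or empty); drop it.
    complete = text.split("\n")[:-1]
    if not complete:
        return None
    elapsed, *pairs = complete[-1].split()
    counts = {k: int(v) for k, v in (p.split("=") for p in pairs)}
    return float(elapsed), counts
```

After a few failures, comparing the final checkpoints would show whether the crashes cluster around a similar cycle count or a similar elapsed time.  Syncing on every event is slow, so this only makes sense for infrequent events such as connects and disconnects.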

Suggestions for debugging tools and/or strategies would be most appreciated.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: Any suggestions for how to debug this?

Bob Jarvis
Does your app make much use of COM?  I've got an app (written in VB)
which makes heavy use of COM, and is very unreliable on Win9x but runs
fine on NT.

NT - in my experience applications, no matter how badly written or ill-
behaved, don't crash NT unless you're intentionally using some of the
extremely low-level stuff in the Driver Development Kit, and my guess
is that you're not.  If you're getting a BSOD (Blue Screen Of Death)
it's very likely a driver problem.  What's the failure cause listed on
the BSOD?  IRQL_NOT_LESS_OR_EQUAL or something like that?  What's
the PC at when the machine goes down, and what's in the module list
shown on the BSOD?  That's the first thing (and sometimes the ONLY
thing) you should look at.  Also, make sure the machine is set up to
record a crash dump.  To do this, go to Control Panel, open the
System applet, click on the Startup/Shutdown tab, and make sure
the "Write debugging information to:" checkbox is checked, and that
there's a filename (commonly %SystemRoot%\MEMORY.DMP) in the text box.


Re: Any suggestions for how to debug this?

Bill Schwab-2
Bob,

> Does your app make much use of COM?  I've got an app (written in VB)
> which makes heavy use of COM, and is very unreliable on Win9x but runs
> fine on NT.

It makes some use of COM, but, not much.  It's no different in that respect
than any of the other apps that work fine.


> NT - in my experience applications, no matter how badly written or ill-
> behaved, don't crash NT unless you're intentionally using some of the
> extremely low-level stuff in the Driver Development Kit, and my guess
> is that you're not.

Not directly at least.  I do open and close serial ports, but, that's all
through the Windows API.


> If you're getting a BSOD (Blue Screen Of Death)
> it's very likely a driver problem.  What's the failure cause listed on
> the BSOD?  IRQL_NOT_LESS_OR_EQUAL or something like that?

Sadly, I don't remember.  But, another one will appear :(  and I'll take
note.


>  and what's in the module list
> shown on the BSOD?

I recorded this on one of the early crashes, and the vendor seemed to think
it was their problem, and were not terribly interested in the text.  What
they really want to do is reproduce in their lab; but, my suspicion is that
there's no way to generate the failure w/o the external devices and lots of
other stuff connected to them.

Interestingly, they wanted to go after the problem in the Win9x driver first
because it's easier to reproduce.


> That's the first thing (and sometimes the ONLY
> thing) you should look at.  Also, make sure the machine is set up to
> record a crash dump.  To do this, go to Control Panel, open the
> System applet, click on the Startup/Shutdown tab, and make sure
> the "Write debugging information to:" checkbox is checked, and that
> there's a filename (commonly %SystemRoot%\MEMORY.DMP) in the text box.

That's a new one!  Thanks!

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: Any suggestions for how to debug this?

Frank Sergeant
In reply to this post by Bill Schwab
"Bill Schwab" <[hidden email]> wrote in message
news:949p1k$49e$[hidden email]...
> The evolving picture is that things work fine over native
> serial ports, and don't do so well over the added ports.

This might be related to the add-on ports using their own drivers as opposed
to the standard drivers.  I've done some low-level serial port coding (in my
DOS days) and there are some tricky sequences involved to guarantee nothing
will go wrong.  There are several FAQs on the net about serial ports that go
into this, not that that would do much good unless you were working on the
driver code directly.

It is just hell troubleshooting something where you have to wait 2 days to 2
weeks for the problem to manifest itself.  Eliminating that delay is one of
my favorite things about the Linux "stress test" I've mentioned earlier for
testing PC hardware.  Would there be any way to set up a serial port
"stresser"?  Perhaps set up a machine whose only job is to send a particular
sequence (or a variety of sequences) over and over to the target machine?
The point would be to get a quicker yes or no answer as to whether the
serial port was working and whether a change you make has an effect on the
problem.
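
To make the "stresser" idea a bit more concrete, here is a minimal sketch in Python.  It is not tied to any particular serial library: `port` is assumed to be any object with `write(bytes)` and `read(n)` methods, and the frame layout (sequence number, a predictable pattern, a CRC-32) is just something picked for illustration.  The in-memory `Loopback` class stands in for a real port wired to a loopback plug or to a peer that echoes what it receives.

```python
import struct
import zlib

PAYLOAD_LEN = 32
FRAME_LEN = 4 + PAYLOAD_LEN + 4  # sequence number + pattern + CRC-32


def make_frame(seq):
    """Build one fixed-size test frame whose contents depend only on seq."""
    payload = bytes((seq + i) & 0xFF for i in range(PAYLOAD_LEN))
    body = struct.pack(">I", seq) + payload
    return body + struct.pack(">I", zlib.crc32(body))


def check_frame(frame, expected_seq):
    """Return None if the frame is as expected, else a short description."""
    if len(frame) != FRAME_LEN:
        return f"short frame: {len(frame)} bytes"
    body = frame[:-4]
    (crc,) = struct.unpack(">I", frame[-4:])
    if zlib.crc32(body) != crc:
        return "CRC mismatch"
    (seq,) = struct.unpack(">I", body[:4])
    if seq != expected_seq:
        return f"expected seq {expected_seq}, got {seq}"
    return None


def run_stress(port, count):
    """Write `count` frames to `port`, reading each one back and checking it.

    Returns a list of (seq, problem) tuples; an empty list means no errors.
    """
    failures = []
    for seq in range(count):
        port.write(make_frame(seq))
        problem = check_frame(port.read(FRAME_LEN), seq)
        if problem:
            failures.append((seq, problem))
    return failures


class Loopback:
    """In-memory stand-in for a port with a loopback plug (for trying it out)."""

    def __init__(self):
        self.buf = b""

    def write(self, data):
        self.buf += data

    def read(self, n):
        out, self.buf = self.buf[:n], self.buf[n:]
        return out
```

Calling `run_stress(Loopback(), 1000)` exercises the logic without any hardware.  Against a real port you would want a read timeout, and some way to resynchronize after a short read, which this sketch leaves out.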

Well, there are lots of reasons why the following ideas won't work in your
situation, but I'll toss them out just in case they help or lead to another
idea:

Use a separate machine as a serial port server.  Stuff it with ports and let
it communicate to the main processing machine which would need only one
serial port (or ethernet, etc.).  This serial port server might even be a
very low-end machine.  It could run DOS.

Or, as above, but use Linux on the serial port server.  Either of these could be
done with no or very low disk space (just a floppy) and modest RAM for Linux
(say 4 MB, but 8 MB is better) or 1 MB on a DOS machine.

Run the serial ports on the Windows machine in a DOS box so they talk to the
hardware more directly (use DOS drivers where perhaps the interrupt
enabling/disabling sequences are better solved and tweakable by you if
necessary (I can supply some serial port code if it helps, if you wind up
working at the DOS level)).  This might work better on a W9x machine than on
NT.  Then, figure how to communicate between the DOS serial port front-end
and your main Dolphin application (and tell me how you do it!).  (Various
multi-serial port vendors advertise to the Linux community.  I wonder if any
of those cards work more reliably than the ones you've been using?  Although,
it is probably entirely a matter of the drivers and not the actual serial
ports causing the problem.)
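
For the separate serial port server idea, the server would need some way to funnel traffic from many ports down the one link to the main machine.  Here is one possible sketch in Python; the framing (one byte of port number, two bytes of length, then the data) is purely an assumption for illustration and has no error checking of its own.

```python
import struct

# Hypothetical header: port number (0-255), payload length (0-65535).
HEADER = struct.Struct(">BH")


def encode(port_id, data):
    """Wrap bytes received on one serial port for sending over the shared link."""
    return HEADER.pack(port_id, len(data)) + data


class Demux:
    """Reassemble (port_id, data) pairs from the shared link on the main machine."""

    def __init__(self):
        self.pending = b""

    def feed(self, chunk):
        """Add received bytes; return a list of complete (port_id, data) pairs."""
        self.pending += chunk
        out = []
        while len(self.pending) >= HEADER.size:
            port_id, length = HEADER.unpack_from(self.pending)
            end = HEADER.size + length
            if len(self.pending) < end:
                break  # wait for the rest of this frame
            out.append((port_id, self.pending[HEADER.size:end]))
            self.pending = self.pending[end:]
        return out
```

The main application would call `feed()` with whatever arrives on its single port (or socket) and dispatch each pair by port number.  If the shared link is itself a serial line, a checksum per frame would probably be worth adding.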

Well, that's all that comes to mind.  Good luck.  I'll look forward to
reports of how it all resolves.


-- Frank
[hidden email]