50 hours and counting

Bill Schwab-2
Hello all,

Acting on some good advice offered in this group, I've been working on a
problem I described earlier with system lockups when using a board that adds
serial ports.  Among lots of little steps, there is one very noteworthy
thing happening: an ME box is (so far) working!  It's been up for over 50
hours now - not unprecedented on 9x, but, HIGHLY unusual.  I'm gradually
cycling 9x boxes through a torture-test in my office and "active duty"; it's
slow because of safety rules and logistics in general.

Let's assume for a moment that something about ME was the answer.  Then the
one thing in all of this that _really_ bothers me is the relative fragility
of the NT machines that I've been (fairly recently) using.  The one doing
the same job as the ME box fails (blue screen) every couple of days, but,
it's predictable, and serves well as long as we reboot it before we really
need it.  Another NT box, doing a different job, blue screened on me not too
long ago.

Watching the ME box do as well as it has so far has me wondering whether the
NT machines are properly patched.  What should I have installed in the way
of service packs?

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: 50 hours and counting

Blair McGlashan
Bill

You wrote in message news:989bvg$12e8n$[hidden email]...
>
> Acting on some good advice offered in this group, I've been working on a
> problem I described earlier with system lockups when using a board that
> adds serial ports.  Among lots of little steps, there is one very
> noteworthy thing happening: an ME box is (so far) working!  It's been up
> for over 50 hours now - not unprecedented on 9x, but, HIGHLY unusual.
> I'm gradually cycling 9x boxes through a torture-test in my office and
> "active duty"; it's slow because of safety rules and logistics in general.
>
> Let's assume for a moment that something about ME was the answer.  Then
> the one thing in all of this that _really_ bothers me is the relative
> fragility of the NT machines that I've been (fairly recently) using.  The
> one doing the same job as the ME box fails (blue screen) every couple of
> days, but, it's predictable, and serves well as long as we reboot it
> before we really need it.  Another NT box, doing a different job, blue
> screened on me not too long ago.

I would still take that as an indication of problems with device drivers. NT
can certainly run reliably (I used to run it continuously for weeks on end,
although Andy liked to reboot regularly, especially when there was a Y in the
day :-)), but it is very easy to compromise it with poor drivers: they
essentially run as if they are part of the OS, and if they misbehave
they can trample over its memory. One can't blame Microsoft for the poor
drivers, although one can blame them for the design. Under Win2K they have
addressed this to some extent by having driver signing and certification,
and it certainly seems to have helped a great deal.

>
> Watching the ME box do as well as it has so far has me wondering whether
> the NT machines are properly patched.  What should I have installed in the
> way of service packs?

The ME box won't be sharing the same device drivers, so if that is indeed
the problem then perhaps the ME ones are better for the h/w you are using.
Regarding service packs, I think SP6a is the latest, but as of SP3 onwards
it was pretty stable.

Regards

Blair


Re: 50 hours and counting [COM hanging]

David Simmons
"Blair McGlashan" <[hidden email]> wrote in message
news:98bel1$18au3$[hidden email]...
> Bill
>
> You wrote in message news:989bvg$12e8n$[hidden email]...
> >
> > Acting on some good advice offered in this group, I've been working on a
> > problem I described earlier with system lockups when using a board that
> > adds serial ports.  Among lots of little steps, there is one very
> > noteworthy thing happening: an ME box is (so far) working!  It's been up
> > for over 50 hours now - not unprecedented on 9x, but, HIGHLY unusual.
> > I'm gradually cycling 9x boxes through a torture-test in my office and
> > "active duty"; it's slow because of safety rules and logistics in
> > general.
> >
> > Let's assume for a moment that something about ME was the answer.  Then
> > the one thing in all of this that _really_ bothers me is the relative
> > fragility of the NT machines that I've been (fairly recently) using.
> > The one doing the same job as the ME box fails (blue screen) every
> > couple of days, but, it's predictable, and serves well as long as we
> > reboot it before we really need it.  Another NT box, doing a different
> > job, blue screened on me not too long ago.
>
> I would still take that as an indication of problems with device drivers.

I too suggest that the problem is either a device driver or something
related to a device driver such as a bad registry setup.

I've been using Win95/NT (for NT stability) followed by Win98 (for its
explorer UI) and then Win2K (for both stability and UI -- Win2k is
distinctly superior to its predecessors) for many years now.

I don't play computer games, but I'm told that with the advent of DirectX8
for Win2k you can now play most games on it -- which had apparently been a
major reason for sticking with Win98.

Our primary Win2K domain servers get rebooted maybe once every six months or
year when we add service packs or upgrade to the next version. We install
little or nothing on the domain servers other than required/stock elements
for IIS etc -- we especially avoid installing things like MS-Office
components or ANY Outlook server stuff both of which, in my experience, have
a marked effect on stability.

OTOH, my dev boxes often need rebooting every week or two because of some
spoo that was caused by:

  o  a device driver
  o  windows media services
  o  visual studio debugger or intellisense scanner
        runs amok and dies leaving hooks everywhere
  o  some stupidly written installer that "thinks" it
        requires a reboot.

** For the paranoid or those with painful install related woes/experience.
When installing new software, it is wise to reboot your machine and *then*
install the new software as the *first* thing you do after:

  o  your machine has rebooted
  o  you've logged on
  o  you've opened the task manager and observed that the
        CPU indicator is steadily saying that the "System
        Idle Process" is running at 99%. I.e., your machine
        is now in steady state idle mode and nothing else
        is happening.
  o  if you're really paranoid then here are some additional thoughts:
     o  back up your registry and system directory before any install
     o  partition your drive to have an emergency mini (1GB) partition
        with a Win2K/NT system you can use to repair your primary system.
     o  organize your boot drive to be 2-4GB so you can perform
        partition backups of the system before installing new
        and possibly suspicious drivers. Then get yourself a
        copy of BOTH "Partition Magic" and "Partition Commander".

**
In other words, it is much faster to copy and restore a partition than to
try and repair a system. Or worse, have to re-install and recover it. Since
big drives are so inexpensive these days, there is little reason not to have
a small system partition and safety backups. Furthermore, Win2K NTFS
supports mount points and hard links just like unix so even if you have
physically different drive partitions you can make it appear to all software
as if it was just one big logical (single letter) file system.
**

One more thing, if your application is COM based and is using apartment
threading you can also hang. This was an area that I had to put special code
into the AOS VM threading system to address in our multi-threaded smalltalk
debugger for SmalltalkAgents.

My first ugly experience with this was in 1996, while working on QKS
Smalltalk for Win32 (SmalltalkAgents). I would regularly have my Eudora Mail
package or Internet Explorer apparently hang when a QKS Smalltalk thread was
suspended -- which led me to deep kernel debugging and SoftICE to figure
out the problem.

In my direct experience, it has been a historical problem on all Win32
versions [noting that Win2000 is specially architected to minimize the
issue].

I see this problem on my current Win2K dev box nearly every day or two when
I launch internet explorer 5.5 with .net spoo and visual studio 7 crap
installed. For some reason I've not bothered to chase down, IE will
sometimes leave a zombie process around when you close its last window. That
zombie process causes COM hanging so that when I save my QKS Smalltalk
environment/image it will hang until I kill the IE process, at which
instant the QKS Smalltalk process continues running again as if nothing had
happened.

Note, this is on Win2K where Microsoft has worked hard to minimize the
hanging problem. On any other Win32 system it is much worse. But on Win2K it
seems to affect the COM Initialization but not necessarily a COM component.
I am not aware of Microsoft ever documenting or commenting on this
problem.

For many years, until Win2K, it was my "fun" demo of how to bring any
(supposedly process safe NT) Win32 box to its knees. I.e., you could build
an errant thread that initialized itself as a COM apartment-threaded thread
and then suspended itself, whereupon many aspects of the machine (such as
any use of IE or an in-proc web-browser-view) would suddenly hang ;-).

Adjunct Comment: "Windows Task Manager" is your friend for killing such
applications.
================

The problem is that cross-thread COM calls to apartment threaded components
must go through the thread message send system to make their calls.
They apparently do so in a rather stupid broadcast mode (until Win2000,
where it is unclear what they're doing).

Basically once a thread gets com/ole initialized, there is a
secret/private/invisible/hidden window that COM installs to manage message
sends. If the thread doesn't process messages in a timely fashion this
com-window doesn't get serviced, which means that anything which sends it an
inter-thread message will hang.

** the rules/mechanism is a little different on Win2k but the basic problem
still exists **

The net result is that if you have any non-worker COM initialized threads
[i.e., threads with an event queue] that are not performing an event loop
then they block ALL cross-thread/intra-process COM activity for a CALLING
component.

That said, the EVILLY BAD code in RichEdit (1/2/3), among other crap, causes
any thread in which it has a window to become a COM apartment-model
initialized thread.

So if you have some COM thread that is not responsive, anywhere in your
system, and you have a rich edit component (window) for a given (smalltalk)
thread, it will result in that thread appearing to be frozen/dead.

Generally speaking, if you have any activity that results in a threaded com
call where some com-enabled thread in your entire system is not responding
(looping) in a timely fashion processing messages then ALL apartment
threaded components are at risk.
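
To make the failure mode above concrete, here is a toy model in Python. It
is not real COM: the Apartment class, its inbox queue, and the timeout are
all assumptions made up for illustration. The inbox stands in for the hidden
window, and call() stands in for a synchronous cross-thread call. The
timeout exists only so the demo terminates; the real thing would simply sit
there.

```python
import queue
import threading


class Apartment:
    """Toy stand-in for an apartment-threaded COM thread (not real COM)."""

    def __init__(self, pumps_messages):
        self.inbox = queue.Queue()      # plays the role of COM's hidden window
        self.pumps_messages = pumps_messages
        self.stop = threading.Event()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        if not self.pumps_messages:
            self.stop.wait()            # "suspended": has a queue, never services it
            return
        while not self.stop.is_set():
            try:
                func, reply = self.inbox.get(timeout=0.05)
            except queue.Empty:
                continue
            reply.put(func())           # the call runs on the owning thread

    def call(self, func, timeout=0.5):
        """Synchronous cross-thread call; the timeout only lets the demo end."""
        reply = queue.Queue(maxsize=1)
        self.inbox.put((func, reply))
        try:
            return reply.get(timeout=timeout)
        except queue.Empty:
            return "HUNG"

    def shutdown(self):
        self.stop.set()
        self.thread.join()


healthy = Apartment(pumps_messages=True)
stuck = Apartment(pumps_messages=False)
healthy_result = healthy.call(lambda: "ok")
stuck_result = stuck.call(lambda: "ok")
print(healthy_result, stuck_result)     # prints: ok HUNG
healthy.shutdown()
stuck.shutdown()
```

The caller into the thread that never pumps its queue gets nothing back,
which is the hang as seen from the calling side.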

-- Dave Simmons [www.qks.com / www.smallscript.com]
  "Effectively solving a problem begins with how you express it."



Re: 50 hours and counting [COM hanging]

David Simmons
I forgot to mention that another significant source of Windows box
instability is mixing memory with different timing characteristics (or just
poor quality memory from different manufacturers). Depending on the
manufacturer and bios settings of your motherboard this can lead to
instability which is much more exposed under an NT/2K kernel.

I was badly bitten by this problem with WinNT 3.5 and 4.0 early on.

-- Dave Simmons [www.qks.com / www.smallscript.com]
  "Effectively solving a problem begins with how you express it."


MS printing problem (Re: 50 hours and counting [COM hanging])

Bruce Samuelson
In reply to this post by David Simmons
David Simmons wrote:

<big snip>

> I'm am not aware of Microsoft ever documenting or commenting on this
> problem.

OK, this is *way* off topic, but I encountered another Microsoft problem
for which I've never read documentation. The format of Word 97 documents
is dependent on the resolution at which you print them. Change the
resolution, and formatting properties such as tab alignment, word wrap
and pagination may change. This happened to me recently, and two friends
confirmed that it happened to them. Just take any heavily formatted
document you have, go to the appropriate printer dialog, change the
resolution from, say, 600 to 300 dpi, and watch what happens. I suspect
their formatting algorithm handles round off errors in a brain damaged
way. The net result is that documents can become misformatted with
printer upgrades, and collaborative work on documents leads to subtle
errors. Send your resume to a recruiter, and the format can change.
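
As a purely hypothetical illustration of that round-off suspicion (this is
not Word's actual algorithm, which I don't know), here is a Python sketch in
which each character advance is rounded to whole device pixels before lines
are filled. The 5.35 pt advance and the 6 inch line width are assumed values
chosen to show the effect.

```python
from fractions import Fraction


def chars_per_line(advance_pt, line_width_in, dpi):
    """Characters that fit when each advance is rounded to whole device pixels."""
    advance_px = round(advance_pt * dpi / 72)   # 72 points per inch
    line_px = line_width_in * dpi
    return line_px // advance_px


advance = Fraction(535, 100)    # assumed 5.35 pt advance per character
fit_300 = chars_per_line(advance, 6, 300)
fit_600 = chars_per_line(advance, 6, 600)
print(fit_300, fit_600)         # prints: 81 80
```

One character of difference per line is enough to move a word wrap, and
from there a page break.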

I encountered this error for Word 97 and I think Excel 97 on Win NT. I
don't know whether it exists in newer MS products.


Re: 50 hours and counting [COM hanging]

Bill Schwab-2
In reply to this post by David Simmons
Dave,

> I forgot to mention that another significant source of Windows box
> instability is mixing memory with different timing characteristics (or
> just poor quality memory from different manufacturers). Depending on the
> manufacturer and bios settings of your motherboard this can lead to
> instability which is much more exposed under an NT/2K kernel.
>
> I was badly bitten by this problem with WinNT 3.5 and 4.0 early on.

Interesting.  I'll keep this one in mind!

Re your COM hanging suggestion, I wish you had been around a couple of years
ago :)  I (almost certainly) ran into that problem with a commercial app
that managed to hang my apps; the details are a blur now, but, we were able
to get around it by installing an update to the commercial app.

Thanks!

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: 50 hours and counting

Bill Schwab-2
In reply to this post by Blair McGlashan
Blair,

> I would still take that as an indication of problems with device drivers.

I've made some progress toward reproducing the problem outside of a thoracic
OR.  A Win95 machine choked up while trying to read from an emulator for one
of the monitors.  It took days to happen, but, I turned up the pace (a lot)
to see if the failure will happen sooner.  If I can make something that
turns ugly, the vendor will be able to (and I believe will) turn their
hardware debuggers on the problem.


> NT can certainly run reliably (I used to run it continuously for weeks
> on end, although Andy liked to reboot regularly especially when there was
> a Y in the day :-)), but it is very easy to compromise it with poor
> drivers as essentially run as if they are part of the OS and if they
> misbehave then they can trample over its memory. One can't blame
> Microsoft for the poor drivers, although one can blame them for the
> design. Under Win2K they have addressed this to some extent by having
> driver signing and certification, and it certainly seems to have helped a
> great deal.

It probably is drivers, but, if it is my fault, I would expect either: (1)
memory/resource leak; (2) threading.  In my early days with making serial
port calls from background threads, I caused some ugly system crashes (on
Win95) by having two threads that felt responsible for closing the port.
That was fixed easily enough by closing from an #ensure: block in one of the
threads.
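
For anyone who hasn't met #ensure:, the single-owner idea looks roughly like
this in Python, where try/finally plays the same role. This is only a
sketch: FakePort is a made-up stand-in for a serial port handle (not a real
API), and the names are hypothetical. The point is that exactly one thread
closes the port, and every other thread only signals.

```python
import threading
import time


class FakePort:
    """Hypothetical stand-in for a serial port handle; counts close attempts."""

    def __init__(self):
        self.close_calls = 0
        self.is_open = True

    def read(self):
        if not self.is_open:
            raise IOError("read on closed port")
        return b"\x00"

    def close(self):
        self.close_calls += 1
        self.is_open = False


def reader(port, stop, received):
    """The reader thread is the only owner: it alone closes the port."""
    try:
        while not stop.is_set():
            received.append(port.read())
            stop.wait(0.01)
    finally:
        port.close()        # runs on normal exit and on exceptions alike


port = FakePort()
stop = threading.Event()
received = []
worker = threading.Thread(target=reader, args=(port, stop, received))
worker.start()
time.sleep(0.05)
stop.set()                  # other threads only signal; they never call close()
worker.join()
print(port.close_calls)     # prints: 1
```

With a single owner there is no second close to race against, whether the
reader exits normally or dies on an error.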


> > Watching the ME box do as well as it has so far has me wondering whether
> > the NT machines are properly patched.  What should I have installed in
> > the way of service packs?
>
> The ME box won't be sharing the same device drivers,

They won't share with NT, but, they would with other 9x machines (is that
correct??).   A secondary concern of mine is that the ME boxes are doing a
lot better than the Win95 machines; but, ...


> so if that is indeed
> the problem then perhaps the ME ones are better for the h/w you are
> using.

this is perhaps all the more true for the Win95 machines.


> Regarding service packs, I think SP6a is the latest, but as of SP3 onwards
> it was pretty stable.

Thanks!

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]