Hi Blair,
Re that serial communications app, switching away from Delay to the overlapped #sleep: helped in one area and has added a wrinkle (well, more like a really deep crease<g>). Here's the error that happens early on:

a GPFault('Invalid access to memory location. Reading 0xFFFFFFFF, IP 0x1000426C (C:\WINDOWS\SYSTEM\DOLPHINVM993.DLL)'): 'Invalid access to memory location. Reading 0xFFFFFFFF, IP 0x1000426C (C:\WINDOWS\SYSTEM\DOLPHINVM993.DLL)'
ProcessorScheduler>>gpFault:
[] in ProcessorScheduler>>vmi:list:no:with:
BlockClosure>>ifCurtailed:
ProcessorScheduler>>vmi:list:no:with:
ProcessorScheduler>>primUnwindInterrupt
[] in ProcessorScheduler>>vmi:list:no:with:
[] in BlockClosure>>ifCurtailed:
BlockClosure>>ifCurtailed:
ProcessorScheduler>>vmi:list:no:with:
ProcessorScheduler>>primUnwindInterrupt
[] in ProcessorScheduler>>vmi:list:no:with:
[] in BlockClosure>>ifCurtailed:
BlockClosure>>ifCurtailed:
ProcessorScheduler>>vmi:list:no:with:
[] in MerlinCommunications(EarMonitorCommunications)>>next
[] in Mutex>>critical:
BlockClosure>>ensure:
Mutex>>critical:
MerlinCommunications(EarMonitorCommunications)>>next
...
BlockClosure>>on:do:
[] in BlockClosure>>newProcess

This almost looks like I'm reading using a bad handle or buffer, or something along those lines. It's possible that somebody snuck past my attempts at thread synchronization to read from the port before it was opened, or to read from the buffer before it was full, etc. However, I'd expect to see another level or two of messages in the walkback if that were the case.

The other thing that's interesting is that very similar walkbacks start appearing after this, at about 60 times per second on the machine where I was able to capture this particular view of things. More interesting is that it seems to be adding layers of #primUnwindInterrupt between the #next call that fails and the ultimate GPF that crashed the app (no dump, sadly). By that I mean that the first call goes through one layer, the second two, and so on; at least that's the way it looks from the first few walkbacks set up side by side.

Does this sound like an idle panic? Assuming so for the moment, the Wiki states that this happens when no process is runnable. You mentioned the idle process as one not to block on an overlapped call; are other threads immune to causing the problem directly? I guess I'm asking whether I'd have had to somehow inject one of my sleeps into the idler, if the VM is in fact panicking? If the VM is panicking and sending an interrupt to the thread that's trying to sleep, what would be the effect?

If it's not a VM panic, any thoughts on what it might be other than a dirty read?

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]
Bill
You wrote in message news:8u51lr$q0pc$[hidden email]...
>
> Re that serial communications app, switching away from Delay to the
> overlapped #sleep: helped in one area and has added a wrinkle (well, more
> like a really deep crease<g>).
>...
> Does this sound like an idle panic? ...

Not really, but it isn't possible to tell from the walkback because it contains no data. We need to see the numbers of the original interrupts. This sort of information can be gathered from the VM's own dump (from the raw stack content dump), which you could force from your SessionManager's logError: method, rather than writing a simple stack trace log. At 60 times per second (!) that dump file would get pretty big, pretty fast, so you might want to configure the dump to have shorter stack and walkback entries, or to exit immediately.

Another alternative would be to add the DisableGPFTrap reg key:

    HKLM\Software\Object Arts\Dolphin Smalltalk\3.0\DisableGPFTrap

The value is unimportant. Disabling the GPF trap will cause the VM to create a dump and exit without attempting to recover from the access violations. The GPF trap is very useful in development, especially when doing external interfacing work, but sometimes less helpful in a runtime app. You can test whether the GPF trap is correctly disabled by deliberately causing one, e.g.:

    (ExternalAddress fromInteger: 1) dwordAtOffset: 0

>...Assuming so for the moment, the Wiki
> states that this happens when no process is runnable. You mentioned the
> idle process as one not to block on an overlapped call; are other threads
> immune to causing the problem directly? I guess I'm asking whether I'd have
> had to somehow inject one of my sleeps into the idler, if the VM is in fact
> panicking? If the VM is panicking and sending an interrupt to the thread
> that's trying to sleep, what would be the effect?

That depends on how the application responds to having both the idle process and, more importantly, the main process abruptly terminated and new ones started. If it causes the same thing to happen again then a rapidly degenerating spiral is created. In quite a lot of situations where I have inadvertently created an "idle panic" situation (e.g. by inserting erroneous code in the idle loop), it has not been recoverable because each newly started process repeats the errors of its forebears and never learns from their mistakes :-).

>
> If it's not a VM panic, any thoughts on what it might be other than a dirty
> read?

Is it possible that callbacks are arriving on other OS threads? This might happen if you've overlapped something else which generates callbacks (BTW in 4.0 the VM will intercept such foreign-thread calls and route them back to the VM's main thread). The VM crashdump will reveal whether a callback is occurring on a worker thread.

Regards

Blair
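The suggestion above to force the VM's own dump from logError: might look roughly like the following. This is only a sketch under stated assumptions: the override would live in an application-specific SessionManager subclass (not named in the thread), and VMLibrary>>crashDump: is assumed to be present in the image; it should be checked there before relying on it.

```smalltalk
"Method source for a hypothetical logError: override in the application's
 own SessionManager subclass. VMLibrary>>crashDump: is an assumption here
 and should be verified in the image being used."
logError: anException
	"Ask the VM to write its own dump, including the raw stack content,
	 rather than logging only a simple stack trace."
	VMLibrary default crashDump: anException description
```

At the error rates described in the thread, the dump settings would likely need shortening, or the app set to exit after the first dump, to keep the file manageable.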
Blair,
> > Does this sound like an idle panic? ...
>
> Not really, but it isn't possible to tell from the walkback because it
> contains no data. We need to see the numbers of the original interrupts.

Quite reasonable, and not unexpected.

> This sort of information can be gathered from the VM's own dump (from the raw
> stack content dump), which you could force from your SessionManager's
> logError: method, rather than writing a simple stack trace log. At 60 times
> per second (!) that dump file would get pretty big, pretty fast,

Yup - that's why it's set up the way it is.

> so you
> might want to configure the dump to have shorter stack and walkback entries,

I like to have them set to full to catch the rare "out of the blue" crash, but I can of course shorten them for this purpose.

> or to exit immediately. Another alternative would be to add the
> DisableGPFTrap reg key:
>
> HKLM\Software\Object Arts\Dolphin Smalltalk\3.0\DisableGPFTrap
>
> The value is unimportant.

Sounds easy enough. I might start here.

> Is it possible that callbacks are arriving on other OS threads?

I wondered about that too, or at least about the issue of synchronization with the message queue. I'm doing a lot of serial I/O; the relevant calls are _not_ overlapped, but one thought was to queue a deferred action to read and signal a semaphore.

> This might
> happen if you've overlapped something else which generates callbacks (BTW in
> 4.0 the VM will intercept such foreign-thread calls and route them back to
> the VM's main thread).

I'm impressed!! How can you tell when to do it, or do you simply route all callbacks like this?

> The VM crashdump will reveal whether a callback is
> occurring on a worker thread.

I doubt this is the cause, but one never knows. It's easy enough to get a dump to see.

One change that I made this morning was to convert my Delay references (just the few in this app) to Processor sleep: sends, converting to milliseconds where needed. The result is that it's now trivial to change the type of delay. As I type, a copy with Delay-based times is running downstairs - and has been for well over an hour (a record<g>).

One factor that has shaken out is that it took me a while to apply your terminate/mutex patch. It turns out that just a couple of weeks before I re-discovered that patch, I had "fixed" a deadlock in this app by removing a critical section protecting serial port handles. I could never find the other participant in the "deadlock", which now appears to have been caused by the mutex's being left locked. With the patch, I was able to restore the critical section and not "deadlock". It's possible that the whiz-bang serial cards are more sensitive to the problem, or maybe simply by having more threads running around, there were more opportunities to get into trouble. Anyway, I'm glad to have the critical section back.

The short-term bad news is that the machine that gave the dump is involved in a liver transplant at the moment. The other machine is at our disposal, so I can put the overlapped delays back into the app and try to get a dump from it.

Thanks for your help!!!

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]
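For readers following along, the Delay-to-sleep: conversion described in this post amounts to something like the sketch below. The two-second interval is made up for illustration, and which mechanism actually sits behind Processor sleep: (Delay-based or overlapped) depends on the image and is not shown here.

```smalltalk
"Before: the call site builds its own Delay, here expressed in seconds."
(Delay forSeconds: 2) wait.

"After: the call site sends sleep: to Processor, with the interval
 converted to milliseconds."
Processor sleep: 2000.
```

With every pause routed through the one selector, switching the type of delay becomes a change in a single method rather than at each call site.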
Blair,
I'm starting to think I copied an incorrect file at one point; a combination I thought I had tried before is now running very nicely. The delay-based app with the critical section protecting shutdown and with the new serial card =:0 has been running for 4+ hours. One explanation for all of this might be that I put the ailing overlapped-delay executable on the machine sooner than I thought.

At this point, there are conflicting needs to "just let it run" (to see how it does over time) and to experiment with different versions to find out what the DLL happened. There are a large number of variables, and the software runs in an environment that's not easily controlled.

One other wrinkle: I was running around with a trashed image and/or change log for a few days. This first became noticeable during a package save. The simplest explanation for what happened is an altered change log. Obviously, a fried image could generate bad executables; could an altered change log affect deployment? The good news is that I found a stable backup (a little further back than I'd have liked) and filed stuff out of various damaged images to get pretty much everything back as it was intended to be.

Maybe the best thing to do is to let this app run and then experiment on the machine that's currently involved with the liver transplant.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]
Blair,
The bad news: it's official; there really is a problem. The vendor seems to be taking responsibility for it, and I suspect they will figure it out if it is their fault.

Hopefully good news: you'll be glad to know that I just fielded an NT machine. It's a P3 with 128MB of RAM running NT4 SP6. My thinking is that it will hopefully be more likely to work, or at least more likely to gripe at me for whatever I might be doing wrong in my code.

The machine arrived with some virus checking software that I didn't want running on a data collecting machine, so I uninstalled it. The interesting part was that the uninstall program saw fit to remove some things that IE needed. After copying a DLL from another machine, using Netscape(!!) to download IE 5.5, and upgrading, I was able to get it going again. The IE installer wouldn't run w/o the repairs, so this is sorta a strike against the zero admin initiative. Ok, 2k's system backup/restore would help, but I fear they have too many dependencies for something that's used to install system critical components.

The hardware installation went smoothly. Also, this was the first real test of my (hopefully) NT-friendly installers. It pretty much worked as planned, though in the interest of saving time and risk, I admit to cheating a little by first installing Dolphin on the machine to get the VM registered ;) The higher level stuff (no cheating possible there) worked nicely.

The app's startup speed is definitely tied to mouse movement over the splash screen, suggesting that I'll probably end up hacking the idleNT loop like I did for 9x. This particular app seems ok in idle otherwise, though some of the other apps might get into trouble. Before I hacked the idler, they would always run, but with some long pauses.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]