How does this affect STB of simple nonview objects? I'm writing a
LiveJournal client application and all the journal entries are stored as STB objects on the file system. |
Jerome Chan <[hidden email]> wrote in message
news:[hidden email]...
> How does this affect STB of simple nonview objects? I'm writing a
> LiveJournal client application and all the journal entries are stored as
> STB objects on the file system.

I think (I am sure he can clarify) that Bill encountered this problem in a non-view STB object. I have not personally encountered this issue with non-view STB objects. However, I don't use my "problem machine" anymore, so it has not seen many non-view STB objects from me. On my Windows 2000 machine I have not yet encountered this issue in either view or non-view STB objects in either D4 or D5.

Chris |
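For readers who want to check plain (non-view) objects themselves, a round-trip loop along the following lines may help. It is a minimal sketch, not the stress package used elsewhere in this thread: it assumes Object>>binaryStoreBytes and Object class>>fromBinaryStoreBytes: are present in your Dolphin image, it works in memory rather than through files, and the sample data and loop count are made up for illustration.

```smalltalk
"Hypothetical round-trip check; entries, loop count and messages are illustrative only."
| entries failures bytes restored |
entries := (1 to: 10000) collect: [:i | 'journal entry ' , i printString].
failures := 0.
100 timesRepeat: [
	[bytes := entries binaryStoreBytes.
	restored := Object fromBinaryStoreBytes: bytes.
	"A damaged String shows up as a mismatch..."
	restored = entries ifFalse: [failures := failures + 1]]
		on: Error
		do: [:ex |
			"...while a damaged class name is more likely to raise an error on load."
			failures := failures + 1]].
Transcript
	show: failures printString , ' failed round trips out of 100';
	cr
```

A count of zero on one machine says little about another, given how machine-dependent the reports in this thread are.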
In reply to this post by Christopher J. Demers
Christopher J. Demers <[hidden email]> wrote in message
news:ao7pah$kdc03$[hidden email]...
> I just made an EXE in D5 on my W2K machine that contains my STB stress code
> (with Andy's change). I do not get any errors on W2K or NT with the D5 EXE.
> I will try running it over the weekend to stress it.

I was just getting ready to let the EXE rip over the weekend, and less than a few seconds after I started it I got a corruption. This is a D5 EXE running on my Windows NT machine. It looks like a method name has become corrupt; perhaps it was supposed to be #selectionRange: . At the moment I can't get the program to loop more than a few times without some kind of STB corruption. Just this afternoon the same exact program did 100 loops with no errors on this computer. I will run the stress test on my 2K computer this weekend and see if I get any errors.

Based on the results so far it looks like STB (even in D5) is not safe on certain machines. While this is bad, it does not seem to be a common problem (unless it is under-reported).

Here is the dump:

************************** Dolphin Virtual Machine Dump Report ***************************

23:15:03 PM, 10/11/02: TextEdit does not understand #'|Y( ctionRange:'

*----> VM Context <----*
Process: {07250004:suspended frame 072505BD, priority 5, callbacks 0 last failure 0:nil, FPE mask 3, thread nil}
Active Method: RuntimeSessionManager>>logError:
IP: 070DE03F (15)
SP: 07250480
BP: 07250458 (261)
ActiveFrame: {0725045C: cf 07250441, sp 07250470, bp 07250458, ip 5, STBErrorSessionManager(RuntimeSessionManager)>>logError:}
New Method: VMLibrary>>dump:path:stackDepth:walkbackDepth:
Message Selector: #dump:path:stackDepth:walkbackDepth:

*----> Stack <----*
[07250480: 271]-->50
[0725047C: 270]-->60
[07250478: 269]-->nil
[07250474: 268]-->'TextEdit does not understand #'|\oF\(\07\ctionRange:''
[07250470: 267]-->a VMLibrary
[0725046C: 266]-->59933228
[07250468: 265]-->RuntimeSessionManager>>logError:
[07250464: 264]-->59933240
[07250460: 263]-->8
[0725045C: 262]-->59933216
[07250458: 261]-->a MessageNotUnderstood
[07250454: 260]-->a STBErrorSessionManager
[07250450: 259]-->59933214
[0725044C: 258]-->SessionManager>>unhandledException:
[07250448: 257]-->59933224
[07250444: 256]-->7
[07250440: 255]-->59933202
[0725043C: 254]-->a MessageNotUnderstood
[07250438: 253]-->a STBErrorSessionManager
[07250434: 252]-->59933200
[07250430: 251]-->SessionManager>>onUnhandledError:
[0725042C: 250]-->59933210
[07250428: 249]-->3
[07250424: 248]-->59933188
[07250420: 247]-->a MessageNotUnderstood
[0725041C: 246]-->a STBErrorSessionManager
[07250418: 245]-->59933188
[07250414: 244]-->Error>>defaultAction
[07250410: 243]-->59933196
[0725040C: 242]-->8
... <210 slots omitted> ...
[072500C0: 31]-->a MethodContext
[072500BC: 30]-->BlockClosure>>ifCurtailed:
[072500B8: 29]-->59932772
[072500B4: 28]-->20
[072500B0: 27]-->59932750
[072500AC: 26]-->59932746
[072500A8: 25]-->BlockClosure>>ensure:
[072500A4: 24]-->59932758
[072500A0: 23]-->7
[0725009C: 22]-->59932734
[07250098: 21]-->nil
[07250094: 20]-->[] @ 34 in ExceptionHandlerAbstract>>try:
[07250090: 19]-->[] @ 15 in ExceptionHandlerAbstract>>try:
[0725008C: 18]-->a MethodContext
[07250088: 17]-->ExceptionHandlerAbstract>>try:
[07250084: 16]-->59932742
[07250080: 15]-->42
[0725007C: 14]-->59932724
[07250078: 13]-->59932720
[07250074: 12]-->BlockClosure>>on:do:
[07250070: 11]-->59932732
[0725006C: 10]-->10
[07250068: 9]-->59932708
[07250064: 8]-->[] @ 12 in BlockClosure>>newProcess
[07250060: 7]-->ProcessTermination
[0725005C: 6]-->[] @ 8 in InputState>>forkMain
[07250058: 5]-->[] @ 6 in BlockClosure>>newProcess
[07250054: 4]-->BlockClosure>>newProcess
[07250050: 3]-->59932716
[0725004C: 2]-->20
[07250048: 1]-->0
<Bottom of stack>

*----> Stack Back Trace <----*
{0725045C: cf 07250441, sp 07250470, bp 07250458, ip 5, STBErrorSessionManager(RuntimeSessionManager)>>logError:}
{07250440: cf 07250425, sp 07250450, bp 0725043C, ip 4, STBErrorSessionManager(SessionManager)>>unhandledException:}
{07250424: cf 07250409, sp 07250434, bp 07250420, ip 4, STBErrorSessionManager(SessionManager)>>onUnhandledError:}
{07250408: cf 072503F1, sp 07250418, bp 07250408, ip 5, MessageNotUnderstood(Error)>>defaultAction}
{072503F0: cf 072503DD, sp 07250400, bp 070EF4E0, ip 57, MessageNotUnderstood(Exception)>>_propagateFrom:}
{072503DC: cf 072503C1, sp 072503EC, bp 072503D8, ip 6, MessageNotUnderstood(Exception)>>_propagate}
{072503C0: cf 072503A9, sp 072503D0, bp 072503C0, ip 12, MessageNotUnderstood(Exception)>>signal}
{072503A8: cf 07250389, sp 072503B8, bp 072503A0, ip 13, MessageNotUnderstood class>>receiver:message:}
{07250388: cf 0725036D, sp 07250398, bp 07250384, ip 5, TextEdit(Object)>>doesNotUnderstand:}
{0725036C: cf 07250355, sp 0725037C, bp 0725036C, ip 6, MessageSend(MessageSendAbstract)>>value}
{07250354: cf 07250339, sp 07250364, bp 070EF940, ip 9, [] in MessageSequence(MessageSequenceAbstract)>>value}
{07250338: cf 07250319, sp 07250350, bp 07250330, ip 15, OrderedCollection>>do:}
{07250318: cf 072502FD, sp 07250328, bp 07250314, ip 4, MessageSequence>>messagesDo:}
{072502FC: cf 072502E9, sp 0725030C, bp 070EF940, ip 13, MessageSequence(MessageSequenceAbstract)>>value}
{072502E8: cf 072502CD, sp 072502F8, bp 072502E4, ip 3, TextEdit(View)>>state:}
{072502CC: cf 072502B9, sp 072502DC, bp 0728DB70, ip 61, TextEdit(STBViewProxy)>>restoreView}
{072502B8: cf 0725029D, sp 072502C8, bp 070EF2B0, ip 70, [] in DialogView(STBViewProxy)>>restoreView}
{0725029C: cf 0725027D, sp 072502B4, bp 07250294, ip 15, OrderedCollection>>do:}
{0725027C: cf 07250269, sp 0725028C, bp 070EF2B0, ip 72, DialogView(STBViewProxy)>>restoreView}
{07250268: cf 0725024D, sp 07250278, bp 07250264, ip 3, DialogView(STBViewProxy)>>restoreTopView}
{0725024C: cf 07250235, sp 0725025C, bp 0725024C, ip 6, MessageSend(MessageSendAbstract)>>value}
{07250234: cf 07250219, sp 07250244, bp 070EF470, ip 12, [] in STBInFiler>>evaluateDeferredActions}
{07250218: cf 072501F9, sp 07250230, bp 07250210, ip 15, OrderedCollection>>do:}
{072501F8: cf 072501E5, sp 07250208, bp 070EF470, ip 14, STBInFiler>>evaluateDeferredActions}
{072501E4: cf 072501C9, sp 072501F4, bp 072501E0, ip 6, STBInFiler>>next}
{072501C8: cf 072501A9, sp 072501D8, bp 072501C0, ip 18, ResourceSTBByteArrayAccessor>>loadWithContext:}
{072501A8: cf 07250185, sp 072501B8, bp 072501A4, ip 4, ViewResource(Resource)>>loadWithContext:}
{07250184: cf 07250159, sp 0725019C, bp 07250170, ip 40, STBErrorSessionManager>>main}
{07250158: cf 07250141, sp 07250168, bp 07250158, ip 6, STBErrorSessionManager(SessionManager)>>mainLoopStarted}
{07250140: cf 0725012D, sp 07250150, bp 070E2978, ip 9, [] in STBErrorSessionManager(SessionManager)>>forkMain}
{0725012C: cf 07250109, sp 0725013C, bp 07250120, ip 32, InputState>>loopWhile:}
{07250108: cf 072500F5, sp 07250118, bp 070E2B38, ip 12, InputState>>mainLoop}
{072500F4: cf 072500E1, sp 07250104, bp 070E29B0, ip 13, [] in InputState>>forkMain}
{072500E0: cf 072500CD, sp 072500F0, bp 070E2AC8, ip 11, ExceptionHandler(ExceptionHandlerAbstract)>>markAndTry}
{072500CC: cf 072500B1, sp 072500DC, bp 070E2A90, ip 21, [] in ExceptionHandler(ExceptionHandlerAbstract)>>try:}
{072500B0: cf 0725009D, sp 072500C8, bp 070E2B00, ip 17, BlockClosure>>ifCurtailed:}
{0725009C: cf 0725007D, sp 072500AC, bp 07250094, ip 4, BlockClosure>>ensure:}
{0725007C: cf 07250069, sp 0725008C, bp 070E2A90, ip 39, ExceptionHandler(ExceptionHandlerAbstract)>>try:}
{07250068: cf 07250049, sp 07250078, bp 07250060, ip 7, BlockClosure>>on:do:}
{07250048: cf 00000001, sp 07250058, bp 070E29E8, ip 17, [] in BlockClosure>>newProcess}
<Bottom of stack>

***** End of dump ***** |
In reply to this post by Christopher J. Demers
Hi,
I have run the test package on a machine with:
- Dolphin 4.013
- NT 4 Workstation SP6a
- 200 MHz Pentium Pro with 128 MB of RAM.

I ran the test 5 times, each time getting 0 errors/100. I am not in a position to be able to leave the test running continuously, but it appears this machine is not affected. I was also able to run it 5 times without seeing the error on my Win2k development machine.

Thanks,
Steve
==========
Steve Waring
[hidden email]
http://www.dolphinharbor.org/dh/harbor/steve.html |
In reply to this post by Christopher J. Demers
Christopher et al,
Just some vague musings, as I can't say I ever remember seeing this problem (never used NT but did use 2000 for 18 months or so).

Did you notice that the corruption changed during the course of the dump generation -

> 23:15:03 PM, 10/11/02: TextEdit does not understand #'|Y( ctionRange:'

but

> [07250474: 268]-->'TextEdit does not understand #'|\oF\(\07\ctionRange:''

That looks similar to the sort of thing that happens if you don't retain a Smalltalk reference to a String but expect a memory address to remain valid. This could also fit in with the intermittent nature of the problem - as long as a garbage cycle doesn't allow the unreferenced memory to be freed you will be ok. It might also explain the OS difference - maybe the NT memory (re)allocation strategy is different to Win2000 and the freed memory will take longer to be reallocated.

Ian |
In reply to this post by Christopher J. Demers
In article <ao83vl$k4ki3$[hidden email]>,
"Christopher J. Demers" <[hidden email]> wrote:

> Jerome Chan <[hidden email]> wrote in message
> news:[hidden email]...
> > How does this affect STB of simple nonview objects? I'm writing a
> > LiveJournal client application and all the journal entries are stored as
> > STB objects on the file system.
>
> I think (I am sure he can clarify) that Bill encountered this problem in a
> non-view STB object. I have not personally encountered this issue with
> non-view STB objects. However I don't use my "problem machine" anymore so
> it has not seen many non-view STB objects from me. On my Windows 2000
> machine I have not yet encountered this issue in either view or non-view STB
> objects in either D4 or D5.
>
> Chris

In my test cases, I create 10,000 objects, store them in a file and read them back, and I have had no corruptions. This is Dolphin version 5 on XP (Home); I don't know the patch number. |
In reply to this post by Christopher J. Demers
Chris, Andy,

> > I have just finished a 2 hour run under NT4 SP6 on an Athlon 1800+ m/c with
> > no errors. I ran three processes, two saving/loading the CHB view as per
>
> I wonder if you have an older less advanced machine you could try?

I was wondering about that myself. Ian's point about GC activity might be more relevant on slower machines????

> The machine that I can always make it happen on has a Pentium 300Mhz CPU and 128
> MB RAM.
>
> Bill: What kind of machine were you running that gave you the error?

With 2k sp2, a 500 MHz Celeron - a Fujitsu pen tablet. The NT machine that I used for development and quickly "patched" with 2k to avoid view composer problems is (I think) a P3.

> > I'm not sure where to go from here. I notice that the corrupt STB file you
> > sent me had a filename that indicated that it was generated on 30/1/2001
> > which is well back into the days of D4. Do you have an example from the
> > recent D5 failures, e.g. can you run your test again using a fresh D5 image
> > and NT and get the error handler to save out the file?
>
> I can't install D5 on my NT machine because I have SP4 rather than SP6. The
> D5 install program doesn't even let me install it. I only run D5 on W2K and
> I have never experienced the STB problem there, however I never experienced
> it with D4 on that machine (Pentium Pro 200 Mhz with 128 MB ram) either. If
> I thought this problem would never happen in D5 I would not worry. However
> Bill has reported seeing the exact same kind of corruption in D5 on W2K that
> I am seeing with D4 on NT. If I understand Bill's posting it sounds like
> the issue is rare with D5 on W2K.

Very rare (but not rare enough sadly), at least with the machines that I've encountered. The sample size is small, but it seems to be worse on 2k sp1 than on sp2.

> Even if the corruption is rare I am just
> worried that after I release my program I am going to get some calls from
> end users whose files are corrupt.
>
> I just made an EXE in D5 on my W2K machine that contains my STB stress code
> (with Andy's change). I do not get any errors on W2K or NT with the D5 EXE.
> I will try running it over the weekend to stress it.
>
> In a virgin D4 image I loaded a package with only one dialog class and a
> view. I was able to get 18 corruptions out of 100 tries. Then I ran it a
> few more times and got 2/100, 0/100, 0/100, 10/100. I waited between a few
> seconds and a few minutes between the tests.

That's interesting.

> > How many Dolphin processes are running in the image that gives the errors?
> > Does it fail in a fresh image? Is your NT box doing anything else at the
> > time (are there any unusual services running that we wouldn't have here).
>
> The process monitor shows only the standard 5 Dolphin processes. I usually
> have the following running: AOL IM, MS Outlook, MS Outlook Express (for
> news), McAfee virus scanner and sometimes IE. I am not running any servers
> on the NT box.

Chris has an alibi but I'm busted :) The exact number of processes is impossible to predict because it will depend on what the machine is doing at the time. Ten is probably a good guess. I'm not sure about services on the NT/development box, and it's been all but reformatted, so it's probably impossible to tell now. I'll look at the pen tablets rather than trying to give a quick answer.

> > Without being able to get it to fail here it is going to be tricky to go
> > much further. I'll leave it going over the weekend but I suspect it's not
> > going to fail.
>
> I understand.

Ditto.

> I am not sure if you want to try D4 on NT in hopes of being
> able to reproduce the problem and apply the fix to D5 since it seems to be
> the same problem but at a lower frequency.

That seems like an excellent idea. At worst, it might find/fix a problem in D4, and I really suspect it is in fact the same bug, simply masked by calmer waters in 2k.

> However I don't want you to
> waste your time on this either. It seems that something about D5 has
> improved the issue over D4. I guess the real test is to see what the real
> world frequency of this issue is. Perhaps this corruption will only occur
> once every 10 years in D5, and Bill just got "lucky". I guess the best
> thing is for people to just report this here if it happens to them. If Bill
> or someone else runs into this again with D5 on W2K then I think it becomes
> a more pressing issue.

Unfortunately, I have seen it multiple times on 2k, but I took the sp1 instances as a sign that I needed to apply the service pack. It now seems that service packs don't fix it.

> If anyone wants to try, here is a D4 package that includes a Dialog with a view
> that I used for my tests above. The example code is in the package comment
> (it has Andy's fix) but I suggest not saving the image after the test or
> running it in a trash image.
> http://www.mitchellscientific.com/smalltalk/Dolphin4/4ByteSTBErrorExample.pac

I'll grab it and give it a shot on one of the pen tablets.

> I truly enjoy using Dolphin Smalltalk. Thanks for
> bringing it to life!

Well said!

Have a good one,
Bill
--
Wilhelm K. Schwab, Ph.D.
[hidden email] |
In reply to this post by Christopher J. Demers
Chris,

> > My most recent observation was fixed by rebooting the offending machine. I
> > was under some pressure to fix it, so I can't claim to have tried everything
> > along the way. A previous encounter (perhaps on 2k sp1 though, and on a
> > different machine) also seemed to require a reboot to fix. Could the
> > machines be suffering some kind of heap fragmentation causing more active
> > memory management and therefore making the problem more likely to appear???
>
> I am interested that you say rebooting seemed to fix the problem. It seems
> that there are two layers to this problem. The root problem is the
> corruption of STB data. However where you probably notice the problem is
> when you read back from the STB data. I assume that you either got rid of
> the corrupt STB files or fixed them. Were newly created STB files becoming
> corrupt?

I'm sending STB data across sockets, and it is not entirely clear which side is at fault. However, I don't recall ever seeing this problem on 9x (although it's possible that the machines simply crashed when it happened???), and I still have 9x machines in use. Just the other week, I had a situation in which a Win95 machine could talk to my 2k server, and my 2k development machine could not. Also, I've had several rounds of rebooting pen tablets w/o rebooting the server fixing the problem. The most recent instance was WinME server, win2k sp2 client, client can't connect due to corrupt data, reboot client, all is well - the server hadn't changed.

Lumping all of that together, it suggests a problem reading back data, though I agree that this is not entirely consistent with the logical assumption that D4/NT/VC saves broken view resources. I _think_ I've even gone so far as to save a pac with a broken resource for Blair to inspect. He sent back a fixed copy of the resource (his own idea, try getting that kind of service from MS!!). If I'm remembering correctly, that's evidence that the written data can be corrupt.

> Are you experiencing a situation where the program runs fine, happily
> writing to and reading from STB files for a period of time, and then starts
> frequently trashing STB files (even after deleting previously damaged files
> and restarting the program), and continues to do so until it is rebooted?

Well, it's over a network, but that seems to be the idea. The (relatively new) win2k pen tablets were originally being rebooted somewhat more often due to the D4 print/shutdown problem that now appears to be resolved in D5. One could argue that as they run longer, we're starting to see problems (n=1 on that one though).

> That sounds worse than I thought it was. I thought you had just stumbled
> upon one instance of STB corruption in D5 on W2K. Is STB corruption on this
> machine a reoccurring situation?

It's recurring across a group of four sister machines that have seen a mix of service packs. As they standardize on sp2, it will get easier to narrow it down. There was also an incident on the P4 2k server after 4+ months of continuous up time. It's _possible_ that the latter problem originated elsewhere, but I doubt it. This is definitely not one isolated cosmic ray :(

> I asked this in my reply to Andy, but in case you don't see it: What kind of
> computer causes this problem under W2K? Mine is a 300 MHz Pentium with 128
> MB RAM that causes trouble under NT.

I had NT problems with a P3 that's likely not too much faster. The offending 2k machines are 500 MHz Celerons.

Have a good one,
Bill
--
Wilhelm K. Schwab, Ph.D.
[hidden email] |
In reply to this post by Christopher J. Demers
Chris,

> > How does this affect STB of simple nonview objects? I'm writing a
> > LiveJournal client application and all the journal entries are stored as
> > STB objects on the file system.
>
> I think (I am sure he can clarify) that Bill encountered this problem in a
> non-view STB object.

I wouldn't call the objects simple, but non-view yes, and they get STB'd into byte arrays, encrypted, and shoved through a socket. At first I suspected the encryption/decryption, or maybe unintended side effects of SocketReadStream (#position will become misleading after #readPage), but none of that seems to be relevant. I basically use only #nextPutAll: and #next: with socket streams, and the corruption always strikes in the first four bytes of a class name, the same as the NT/D4/VC problems that have been reported by a few different users.

Have a good one,
Bill
--
Wilhelm K. Schwab, Ph.D.
[hidden email] |
In reply to this post by Ian Bartholomew-18
Ian Bartholomew <[hidden email]> wrote in message
news:DuSp9.722$Lm1.79078@stones...
> Did you notice that the corruption changed during the course of the dump
> generation -
>
> > 23:15:03 PM, 10/11/02: TextEdit does not understand #'|Y( ctionRange:'
>
> but
>
> > [07250474: 268]-->'TextEdit does not understand #'|\oF\(\07\ctionRange:''

Is it really changing, or is this some sort of escaping? Or perhaps it is escaping because it is changing?

> That looks similar to the sort of thing that happens if you don't retain a
> Smalltalk reference to a String but expect a memory address to remain valid.
> This could also fit in with the intermittent nature of the problem - as long
> as a garbage cycle doesn't allow the unreferenced memory to be freed you
> will be ok. It might also explain the OS difference - maybe the NT memory
> (re)allocation strategy is different to Win2000 and the freed memory will
> take longer to be reallocated.

I figure it is something like that. If it were a string address changing, would it make sense that only the first four bytes would be garbled, or should the corruption location and range be more random?

Chris |
Chris,

> Is it really changing, or is this some sort of escaping? Or perhaps it is
> escaping because it is changing?

I don't know how valid this is, but converting back into bytes you get:

'|Y( ctionRange:' asByteArray
#[124 89 40 7 99 116 105 111 110 82 97 110 103 101 58]

'|\oF\(\07\ctionRange:' asByteArray
#[124 92 111 70 92 40 92 48 55 92 99 116 105 111 110 82 97 110 103 101 58]

I've added the extra spacing as I thought it might indicate that something (not Dolphin) is using the \..\ as an escape sequence. It is purely speculation though.

> I figure it is something like that. If it were a string address changing
> would it make sense that only the first four bytes would be garbled or
> should be corruption location and range be more random?

Yes, that's a good point and one that I have no answer for. I expect there could be a few ways of explaining it (perhaps memory allocation is done in 32 byte chunks?) but, as above, it would just be pure speculation.

Have you (or Bill) got an example of an STB that is corrupted? It might be useful to examine the contents to try and determine:

a) If the problem is with the writing or the reading of the STB.
b) The extent of the corruption and if it has any overall format (every 32 bytes for example)

FWIW, I did have a half-hearted attempt at forcing the problem on my 2000 machine by generating a lot of objects during the STB encoding and decoding operation, to try and cause garbage collections, but nothing untoward happened.

Regards
Ian |
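The two byte sequences quoted in this post can be compared mechanically. The workspace sketch below only restates the values already given above; the variable names are made up for illustration, and it draws no conclusion about which component introduced the difference.

```smalltalk
"Workspace sketch: compare the two byte sequences quoted in the post above."
| first second firstDifference sameTail |
first := #[124 89 40 7 99 116 105 111 110 82 97 110 103 101 58].
second := #[124 92 111 70 92 40 92 48 55 92 99 116 105 111 110 82 97 110 103 101 58].

"Index of the first position where the two sequences disagree."
firstDifference := (1 to: (first size min: second size))
	detect: [:i | (first at: i) ~= (second at: i)]
	ifNone: [nil].

"Do both end with the same 11 bytes, i.e. 'ctionRange:'?"
sameTail := (first copyFrom: 5 to: 15) = (second copyFrom: 11 to: 21).

Transcript
	show: 'first difference at index ' , firstDifference printString;
	cr;
	show: 'shared tail: ' , sameTail printString;
	cr
```

With the values as quoted, the sequences agree only on the leading 124 and first differ at index 2, while the trailing 'ctionRange:' bytes are identical in both.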
In reply to this post by Bill Schwab-2
Bill Schwab <[hidden email]> wrote in message
news:ao2cqm$hs79j$[hidden email]...
>
> Remember the binary filing problem (typically showed up when saving view
> resources) that corrupts some/all of the first four bytes of a class name,
> and was (in my limited experience with it) fairly common on NT? I've now
> seen it (with STB, not in the VC) on win2k sp2. Any ideas? A reproducible
> case would be terrific :)

Ian's message got me thinking about problems with pointers to strings again. I looked back at STBOutFiler>>writeInstanceVariables: which had previously seemed interesting to me due to its use of yourAddress asExternalAddress.

I remmed this line:
    stream next: basicSize putAll: objectToSave yourAddress asExternalAddress startingAt: 1]
and unremmed the line above it:
    1 to: basicSize do: [:i | stream nextPut: (objectToSave basicAt: i) asInteger]]

It looks like OA was trying to optimize STB performance. The two lines seem to accomplish the same thing. I can't claim this as a fix for sure due to the random nature of this problem; however, after this change I have not had an STB corruption. I will keep testing. Anyone else experiencing this problem (who feels brave) can try this "fix". Let us know if it actually fixes anything. Be aware that this change might make the STB system slower. Be careful.

Chris |
Chris,

> Ian's message got me thinking about problems with pointers to strings again.
> I looked back at STBOutFiler>>writeInstanceVariables: which had previously
> seemed interesting to me due to its use of yourAddress asExternalAddress.
>
> I remmed this line:
> stream next: basicSize putAll: objectToSave yourAddress asExternalAddress
> startingAt: 1]
> and unremmed the line above it:
> 1 to: basicSize do: [:i | stream nextPut: (objectToSave basicAt: i)
> asInteger]]

Fix or not, congratulations on a good piece of detective work.

Question for OA: when was this change made? I suppose all the answer could do is disprove it as a fix, but even that's good to know.

> It looks like OA was trying to optimize STB performance. The two lines seem to
> accomplish the same thing. I can't claim this as a fix for sure due to the
> random nature of this problem, however after this change I have not had an
> STB corruption. I will keep testing. Anyone else experiencing this problem
> (that feels brave) can try this "fix". Let us know if it actually fixes
> anything. Be aware that this change might make the STB system slower. Be
> careful.

A small blip in speed would be better than random failures. Hopefully OA will have a chance to comment on this before I regain consciousness :) Speed changes in specific cases will be easy enough to measure, and perhaps there is a compromise solution, such as hanging onto one of the external addresses (the result of #yourAddress looks like it could be at risk), and/or making sure the right one owns the memory.

If all goes well, Chris gets the Jolt Cola engraved mouse pad, and all of us will get a confirmed bug fix. However, as planned, I built up an NT machine today. It's a 366/64MB Celeron, with everything but the network card working. The latter is proving **very** stubborn. Any NT4 gurus out there have some advice? The machine is currently at sp1 (which might actually be a good thing for bug hunting on D4??). Another option might be to put service packs on a CD. Here's hoping the work will have been for nothing :)

Have a good one,
Bill
--
Wilhelm K. Schwab, Ph.D.
[hidden email] |
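On the point that speed changes are easy to measure: one possible way is to time a batch of STB saves of byte objects before and after editing STBOutFiler>>writeInstanceVariables:. This is only a sketch under assumptions, namely that Time class>>millisecondsToRun: and Object>>binaryStoreBytes exist in your image; the sample data and repeat count are hypothetical and not taken from anyone's test in this thread.

```smalltalk
"Timing sketch; run once with the original method and once with the changed one."
| sample elapsed |
"Strings are byte objects, so saving them exercises the isBytes branch."
sample := (1 to: 10000) collect: [:i | 'journal entry ' , i printString].
elapsed := Time millisecondsToRun: [100 timesRepeat: [sample binaryStoreBytes]].
Transcript
	show: '100 STB saves took ' , elapsed printString , ' ms';
	cr
```

Comparing the two figures on the same machine gives a rough idea of the cost of the byte-at-a-time loop; it says nothing about whether the corruption itself is fixed.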
Bill Schwab <[hidden email]> wrote in message
news:aofl7p$lq59g$[hidden email]...
> Fix or not, congratulations on a good piece of detective work.

Thanks!

> Question for OA: when was this change made? I suppose all the answer could
> do is disprove it as a fix, but even that's good to know.

I am curious as well. I am glad the old code was still there as it made the change easy. I looked at a Dolphin 98 image and was surprised to see that the code looked similar to that in D4 and D5, so I assume the change was made quite a while ago.

> A small blip in speed would be better than random failures. Hopefully OA
> will have a chance to comment on this before I regain consciousness :)
..

I agree. I am curious where the real problem is, and if it may have broader implications. Perhaps the STB corruption is merely a symptom of some more serious memory reference problem. However, Dolphin does seem stable on my problem machine, with the STB corruptions being the only exception.

> If all goes well, Chris gets the Jolt Cola engraved mouse pad, and all of us
> will get a confirmed bug fix. However, as planned, I built up an NT machine
> today. It's a 366/64MB celeron, with everything but the network card
> working. The latter is proving **very** stubborn. Any NT4 gurus out there
> have some advice? The machine is currently at sp1 (which might actually be
> a good thing for bug hunting on D4??). Another option might be to put
> service packs on a CD. Here's hoping the work will have been for nothing :)

Ah, the joys of networking with NT! I had a heck of a time getting my network card working on my NT box. I think I needed at least SP3. Also, if memory serves me correctly, there was some silly thing that had to be done every time a network driver (or perhaps even setting) change was made. I think I had to rerun the service pack or something like that. I don't remember the details since it has been a while, and hopefully I will never have to do that again. ;)

I implemented my "fix" in D5. I generated a test EXE and ran it with a loop size of 100 on my NT "problem" machine without any errors. My previous D5 EXE with the original code continues to run into STB corruption. This is encouraging; I am feeling more confident about the "fix" but will continue to test.

Chris |
In reply to this post by Bill Schwab-2
"Bill Schwab" <[hidden email]> wrote in message
news:ao2cqm$hs79j$[hidden email]...
> Hello all,

I am sorry to chime in this late, but here are a few data points we have observed:

a) In our case corruption seems to occur when communicating some byte data from an external library (STB is not involved). We cannot guarantee that there is no error in our DLL, but we did a lot of checking there.
b) It seems to us that the error started occurring when we switched our app from Dolphin 98 to Dolphin 3.0. At the same time we made changes to the app, but the DLL that Dolphin talks to remained the same.
c) Frequency: it seems to happen once in 1500 hrs of work. More load or stress on the machine may provoke it.
d) Versions: DS 3.x, OS Win 95.

rush |
In reply to this post by Christopher J. Demers
Chris,

> I am curious as well. I am glad the old code was still there as it made the
> change easy. I looked at a Dolphin 98 image and was surprised to see that
> the code looked similar to that in D4 and D5, so I assume the change was
> made quite a while ago.

<somebodyHasToSayIt>
I hope the XP/don't comment/code formatting is irrelevant/tests can catch it all crowd is listening. Sometimes I fear that the increasing body of _totally_ uncommented code will do what C++ and Java couldn't.
</somebodyHasToSayIt>

> Ah the joys of networking with NT! I had a heck of a time getting my
> network card working on my NT box. I think I needed at least SP3. Also if
> memory serves me correctly there was some silly thing that had to be done
> every time a network driver (or perhaps even setting) change was made. I
> think I had to rerun the service pack or something like that. I don't
> remember the details since it has been a while, and hopefully I will never
> have to do that again. ;)

Ouch! I might just go the CD route with service packs; it sounds like you're saying I'd have to start that way for the first few packs. Besides, I'm tempted to start with it unpatched because we're trying to make it break.

> I implemented my "fix" in D5. Generated a test EXE and ran it with a loop
> size of 100 on my NT "problem" machine without any errors. My previous D5
> EXE with the original code continues to run into STB corruption. This is
> encouraging, I am feeling more confident about the "fix" but will continue
> to test.

Thanks!!

Bill
--
Wilhelm K. Schwab, Ph.D.
[hidden email] |
In reply to this post by Christopher J. Demers
Chris,
What does your NT box think of the following? What do _you_ think of it? Hopefully it preserves the speed boost w/o the bug?? Note that I could be holding on to the wrong object, so feel encouraged to tweak as you see fit.

Have a good one,
Bill

!STBOutFiler methodsFor!

writeInstanceVariables: objectToSave
    "Private - Dump the instance variables of the <Object> argument to the binary stream."

    | class basicSize notGarbage |
    #stbFix.
    basicSize := objectToSave basicSize.
    self writeInteger: basicSize.
    class := objectToSave basicClass.
    class isBytes
        ifTrue: [
            " 1 to: basicSize do: [:i | stream nextPut: (objectToSave basicAt: i) asInteger]]"
            stream next: basicSize putAll: ( notGarbage := objectToSave yourAddress ) asExternalAddress startingAt: 1]
        ifFalse: [1 to: (class instSize+basicSize) do: [:i | self basicNextPut: (objectToSave instVarAt: i)]]! !

!STBOutFiler categoriesFor: #writeInstanceVariables:!operations!private! !

--
Wilhelm K. Schwab, Ph.D.
[hidden email] |
In reply to this post by Ian Bartholomew-18
Ian,
> Have you (or Bill) got an example of a STB that is corrupted. It might be
> useful to examine the contents to try and determine
>
> a) If the problem is with the writing or the reading of the STB.

A sample that I might be able to find (it would take some work) is the resource that I sent to Blair some time ago. He fixed up the class name in the debugger and sent the result back to me. It seemed ok other than the one four byte region. Blair, does that sound familiar/correct? Obviously that corruption occurred on writing.

I've thought some more about the read/write issue, and the bottom line is that it can be quite hard to tell in the offending system. In the recent incident, it's very possible that the error occurred on the pen tablet during a write, got sent to the ME "server", and blew up there, with the error being reported on the pen tablet. I probably should find a way to make the logs more clear, but it's generally been sufficient as-is. In confusing cases, I typically just set breakpoints on both sides and wait for something to trip, but that obviously won't work with a rare runtime problem.

So, making a long story short, at this point I have no evidence that refutes Chris' proposed fix, at least not w/o doing more analysis to prove that it's a refutation :) My hunch is he found it.

> b) The extent of the corruption and if it has any overall format (every 32
> bytes for example)

I think it's not periodic, but since people have reported string corruption in view captions, etc., it's possible that Blair and I missed something else in that resource he fixed.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email] |
In reply to this post by rush
> I am sorry to chime in this late, but here are a few data points we have
> observed:
>
> a) in our case corruption seems to occur when communicating some byte data
> from external library (stb is not involved). We can not guarantee that there
> is no error in our dll, but we did a lot of checking there.
> b) it seems to us that error started occurring when we switched our app from
> dolphin 98 to dolphin 3.0 . At the same time we did changes to the app, but
> the dll that dolphin talks to remained the same.
> c) frequency: it seems that it happens once in 1500 hrs of work. more load
> or stress on the machine may provoke it
> d) versions: DS 3.x , os win 95.

Is this one of those "the customer is happy or just plain stuck w/ 95 so you have to support it" situations? I'm a little surprised that you're still using 3.x though; any particular reason, or is the app just that old?

Does the 1500 hrs include off hours, or is it all time when the app is doing real work? Either way, that's a fairly frequent problem, at least transplanting it into my environment - 1500 hours of total machine time goes by pretty fast around here.

Let's assume for the moment that my extrapolation of Chris' fix is correct. Then this makes me wonder if the image should be scoured for 'yourAddress asExternalAddress', ideally by browsing for references to both selectors or using the SmalltalkParser on all methods so as not to miss variants due to white space. Searching for that exact string in my image finds four methods, including the original form of Chris' STB method.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email] |
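The white space concern in the post above can be illustrated with a small workspace sketch. This is not the SmalltalkParser approach Bill mentions, only a plain token comparison on a source string. It assumes the ANSI subStrings message (splitting on white space) is available on String in the image; the sample source text and the variable names are made up for illustration.

```smalltalk
"Sketch only: detect 'yourAddress' immediately followed by 'asExternalAddress'
 even when the two selectors are separated by a line break or extra blanks.
 Assumes String>>subStrings splits on white space (ANSI protocol)."
| source tokens found |
source := 'stream next: basicSize putAll: objectToSave yourAddress
		asExternalAddress startingAt: 1'.
tokens := source subStrings.
found := false.
1 to: tokens size - 1 do: [:i |
	((tokens at: i) = 'yourAddress'
		and: [(tokens at: i + 1) = 'asExternalAddress'])
			ifTrue: [found := true]].
found
```

An exact-string search for 'yourAddress asExternalAddress' would miss this sample because of the line break, while the token comparison would not. It would still miss parenthesised forms such as '(x yourAddress) asExternalAddress', which is where browsing references to both selectors, or a real parser, is the more thorough option.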
"Bill Schwab" <[hidden email]> wrote in message
news:aoh5fh$mhvp9$[hidden email]...
>
> Is this one of those "the customer is happy or just plain stuck w/ 95 so you
> have to support it" situations? I'm a little surprised that you're still
> using 3.x though; any particular reason, or is the app just that old?

The app does have some years on its shoulders, and it does not get updated very often. (I tend to grow gray hair in my beard when releasing new versions in production ;) When we last updated, Dolphin 4.0 was available, but we chose to stay with 3.0 because the app shares a part of the codebase with a Dolphin applet, and as far as I recall, using D4 would have made this more difficult. This, and the fact that every new VM was harder and harder to install on ancient machines, made us stick with D3.

On the other hand, the machines running the app are owned by some 50-60 separate legal entities, so changing them in a coordinated and controlled way is a difficult political process :) Anyway, I hope that we will have a new release under D5 and Win XP by the end of this year.

> Does the 1500 hrs include off hours, or is it all time when the app is doing
> real work? Either way, that's a fairly frequent problem, at least
> transplanting it into my environment - 1500 hours of total machine time goes
> by pretty fast around here.

Well, I calculated it on the basis of one incident per 50-60 machines, running 6 hours a day for 5 workdays. The numbers are not exact, more a ballpark figure. I have a fix in the app that catches the corrupted data and disconnects, so the bug does not critically affect the app, but it is creepy.

> Let's assume for the moment that my
> extrapolation of Chris' fix is correct. Then this makes me wonder if the
> image should be scoured for 'yourAddress asExternalAddress', ideally by
> browsing for references to both selectors or using the SmalltalkParser on
> all methods so as not to miss variants due to white space. Searching for
> that exact string in my image finds four methods, including the original
> form of Chris' STB method.

I'll try to search for it. But if OA could nail this one that would be great, since I do not see why the "yourAddress asExternalAddress" expression is wrong. I also hope that, along with Chris's great finding, the information that we did not notice the bug in D98 but did in D3, which introduced a new VM, could be of some help to OA (like going back and thinking about what changed in those times).

rush |
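The ballpark in the post above can be reproduced as simple arithmetic. The sketch below assumes the figure means roughly one incident per working week across the whole fleet (the period is an assumption; the post does not state it), with each machine running 6 hours a day for 5 workdays. The variable names are illustrative only.

```smalltalk
"Hypothetical workspace snippet: machine-hours accumulated per incident,
 assuming one incident per working week across a fleet of 50 to 60 machines."
| hoursPerDay workdays lowFleet highFleet |
hoursPerDay := 6.
workdays := 5.
lowFleet := 50.
highFleet := 60.
Array
	with: lowFleet * hoursPerDay * workdays
	with: highFleet * hoursPerDay * workdays
```

With 50 machines the product is 1500 machine-hours, matching the "once in 1500 hrs" figure quoted earlier; with 60 machines it is 1800, which fits the description of the number as a ballpark.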