The four byte bug lives


Re: The four byte bug lives

Jerome Chan
How does this affect STB of simple nonview objects? I'm writing a
LiveJournal client application and all the journal entries are stored as
STB objects on the file system.



Re: The four byte bug lives

Christopher J. Demers
Jerome Chan <[hidden email]> wrote in message
news:[hidden email]...
> How does this affect STB of simple nonview objects? I'm writing a
> LiveJournal client application and all the journal entries are stored as
> STB objects on the file system.

I think (I am sure he can clarify) that Bill encountered this problem in a
non-view STB object.  I have not personally encountered this issue with
non-view STB objects.  However I don't use my "problem machine" anymore so
it has not seen many non-view STB objects from me.  On my Windows 2000
machine I have not yet encountered this issue in either view or non-view STB
objects in either D4 or D5.

Chris



Re: The four byte bug lives

Christopher J. Demers
In reply to this post by Christopher J. Demers
Christopher J. Demers <[hidden email]> wrote in message
news:ao7pah$kdc03$[hidden email]...

> I just made an EXE in D5 on my W2K machine that contains my STB stress code
> (with Andy's change).  I do not get any errors on W2K or NT with the D5 EXE.
> I will try running it over the weekend to stress it.

I was just getting ready to let the EXE rip over the weekend and less than a
few seconds after I started it I got a corruption.  This is a D5 EXE running
on my Windows NT machine.  It looks like a method name has become corrupt,
perhaps it was supposed to be #selectionRange: .  At the moment I can't get
the program to loop more than a few times without some kind of STB
corruption.  Just this afternoon the same exact program did 100 loops with
no errors on this computer.

I will run the stress test on my 2K computer this weekend and see if I get
any errors.  Based on the results so far it looks like STB (even in D5) is
not safe on certain machines.  While this is bad, it does not seem to be a
common problem (unless it is under-reported).

Here is the dump:
************************** Dolphin Virtual Machine Dump Report
***************************

23:15:03 PM, 10/11/02: TextEdit does not understand #'|Y( ctionRange:'

*----> VM Context <----*
Process: {07250004:suspended frame 072505BD, priority 5, callbacks 0
last failure 0:nil, FPE mask 3, thread nil}
Active Method: RuntimeSessionManager>>logError:
IP: 070DE03F (15)
SP: 07250480
BP: 07250458 (261)
ActiveFrame: {0725045C: cf 07250441, sp 07250470, bp 07250458, ip 5,
STBErrorSessionManager(RuntimeSessionManager)>>logError:}

New Method: VMLibrary>>dump:path:stackDepth:walkbackDepth:
Message Selector: #dump:path:stackDepth:walkbackDepth:

*----> Stack <----*
[07250480: 271]-->50
[0725047C: 270]-->60
[07250478: 269]-->nil
[07250474: 268]-->'TextEdit does not understand #'|\oF\(\07\ctionRange:''
[07250470: 267]-->a VMLibrary
[0725046C: 266]-->59933228
[07250468: 265]-->RuntimeSessionManager>>logError:
[07250464: 264]-->59933240
[07250460: 263]-->8
[0725045C: 262]-->59933216
[07250458: 261]-->a MessageNotUnderstood
[07250454: 260]-->a STBErrorSessionManager
[07250450: 259]-->59933214
[0725044C: 258]-->SessionManager>>unhandledException:
[07250448: 257]-->59933224
[07250444: 256]-->7
[07250440: 255]-->59933202
[0725043C: 254]-->a MessageNotUnderstood
[07250438: 253]-->a STBErrorSessionManager
[07250434: 252]-->59933200
[07250430: 251]-->SessionManager>>onUnhandledError:
[0725042C: 250]-->59933210
[07250428: 249]-->3
[07250424: 248]-->59933188
[07250420: 247]-->a MessageNotUnderstood
[0725041C: 246]-->a STBErrorSessionManager
[07250418: 245]-->59933188
[07250414: 244]-->Error>>defaultAction
[07250410: 243]-->59933196
[0725040C: 242]-->8
...
<210 slots omitted>
...
[072500C0: 31]-->a MethodContext
[072500BC: 30]-->BlockClosure>>ifCurtailed:
[072500B8: 29]-->59932772
[072500B4: 28]-->20
[072500B0: 27]-->59932750
[072500AC: 26]-->59932746
[072500A8: 25]-->BlockClosure>>ensure:
[072500A4: 24]-->59932758
[072500A0: 23]-->7
[0725009C: 22]-->59932734
[07250098: 21]-->nil
[07250094: 20]-->[] @ 34 in ExceptionHandlerAbstract>>try:
[07250090: 19]-->[] @ 15 in ExceptionHandlerAbstract>>try:
[0725008C: 18]-->a MethodContext
[07250088: 17]-->ExceptionHandlerAbstract>>try:
[07250084: 16]-->59932742
[07250080: 15]-->42
[0725007C: 14]-->59932724
[07250078: 13]-->59932720
[07250074: 12]-->BlockClosure>>on:do:
[07250070: 11]-->59932732
[0725006C: 10]-->10
[07250068: 9]-->59932708
[07250064: 8]-->[] @ 12 in BlockClosure>>newProcess
[07250060: 7]-->ProcessTermination
[0725005C: 6]-->[] @ 8 in InputState>>forkMain
[07250058: 5]-->[] @ 6 in BlockClosure>>newProcess
[07250054: 4]-->BlockClosure>>newProcess
[07250050: 3]-->59932716
[0725004C: 2]-->20
[07250048: 1]-->0
<Bottom of stack>

*----> Stack Back Trace <----*
{0725045C: cf 07250441, sp 07250470, bp 07250458, ip 5,
STBErrorSessionManager(RuntimeSessionManager)>>logError:}
{07250440: cf 07250425, sp 07250450, bp 0725043C, ip 4,
STBErrorSessionManager(SessionManager)>>unhandledException:}
{07250424: cf 07250409, sp 07250434, bp 07250420, ip 4,
STBErrorSessionManager(SessionManager)>>onUnhandledError:}
{07250408: cf 072503F1, sp 07250418, bp 07250408, ip 5,
MessageNotUnderstood(Error)>>defaultAction}
{072503F0: cf 072503DD, sp 07250400, bp 070EF4E0, ip 57,
MessageNotUnderstood(Exception)>>_propagateFrom:}
{072503DC: cf 072503C1, sp 072503EC, bp 072503D8, ip 6,
MessageNotUnderstood(Exception)>>_propagate}
{072503C0: cf 072503A9, sp 072503D0, bp 072503C0, ip 12,
MessageNotUnderstood(Exception)>>signal}
{072503A8: cf 07250389, sp 072503B8, bp 072503A0, ip 13,
MessageNotUnderstood class>>receiver:message:}
{07250388: cf 0725036D, sp 07250398, bp 07250384, ip 5,
TextEdit(Object)>>doesNotUnderstand:}
{0725036C: cf 07250355, sp 0725037C, bp 0725036C, ip 6,
MessageSend(MessageSendAbstract)>>value}
{07250354: cf 07250339, sp 07250364, bp 070EF940, ip 9, [] in
MessageSequence(MessageSequenceAbstract)>>value}
{07250338: cf 07250319, sp 07250350, bp 07250330, ip 15,
OrderedCollection>>do:}
{07250318: cf 072502FD, sp 07250328, bp 07250314, ip 4,
MessageSequence>>messagesDo:}
{072502FC: cf 072502E9, sp 0725030C, bp 070EF940, ip 13,
MessageSequence(MessageSequenceAbstract)>>value}
{072502E8: cf 072502CD, sp 072502F8, bp 072502E4, ip 3,
TextEdit(View)>>state:}
{072502CC: cf 072502B9, sp 072502DC, bp 0728DB70, ip 61,
TextEdit(STBViewProxy)>>restoreView}
{072502B8: cf 0725029D, sp 072502C8, bp 070EF2B0, ip 70, [] in
DialogView(STBViewProxy)>>restoreView}
{0725029C: cf 0725027D, sp 072502B4, bp 07250294, ip 15,
OrderedCollection>>do:}
{0725027C: cf 07250269, sp 0725028C, bp 070EF2B0, ip 72,
DialogView(STBViewProxy)>>restoreView}
{07250268: cf 0725024D, sp 07250278, bp 07250264, ip 3,
DialogView(STBViewProxy)>>restoreTopView}
{0725024C: cf 07250235, sp 0725025C, bp 0725024C, ip 6,
MessageSend(MessageSendAbstract)>>value}
{07250234: cf 07250219, sp 07250244, bp 070EF470, ip 12, [] in
STBInFiler>>evaluateDeferredActions}
{07250218: cf 072501F9, sp 07250230, bp 07250210, ip 15,
OrderedCollection>>do:}
{072501F8: cf 072501E5, sp 07250208, bp 070EF470, ip 14,
STBInFiler>>evaluateDeferredActions}
{072501E4: cf 072501C9, sp 072501F4, bp 072501E0, ip 6, STBInFiler>>next}
{072501C8: cf 072501A9, sp 072501D8, bp 072501C0, ip 18,
ResourceSTBByteArrayAccessor>>loadWithContext:}
{072501A8: cf 07250185, sp 072501B8, bp 072501A4, ip 4,
ViewResource(Resource)>>loadWithContext:}
{07250184: cf 07250159, sp 0725019C, bp 07250170, ip 40,
STBErrorSessionManager>>main}
{07250158: cf 07250141, sp 07250168, bp 07250158, ip 6,
STBErrorSessionManager(SessionManager)>>mainLoopStarted}
{07250140: cf 0725012D, sp 07250150, bp 070E2978, ip 9, [] in
STBErrorSessionManager(SessionManager)>>forkMain}
{0725012C: cf 07250109, sp 0725013C, bp 07250120, ip 32,
InputState>>loopWhile:}
{07250108: cf 072500F5, sp 07250118, bp 070E2B38, ip 12,
InputState>>mainLoop}
{072500F4: cf 072500E1, sp 07250104, bp 070E29B0, ip 13, [] in
InputState>>forkMain}
{072500E0: cf 072500CD, sp 072500F0, bp 070E2AC8, ip 11,
ExceptionHandler(ExceptionHandlerAbstract)>>markAndTry}
{072500CC: cf 072500B1, sp 072500DC, bp 070E2A90, ip 21, [] in
ExceptionHandler(ExceptionHandlerAbstract)>>try:}
{072500B0: cf 0725009D, sp 072500C8, bp 070E2B00, ip 17,
BlockClosure>>ifCurtailed:}
{0725009C: cf 0725007D, sp 072500AC, bp 07250094, ip 4,
BlockClosure>>ensure:}
{0725007C: cf 07250069, sp 0725008C, bp 070E2A90, ip 39,
ExceptionHandler(ExceptionHandlerAbstract)>>try:}
{07250068: cf 07250049, sp 07250078, bp 07250060, ip 7,
BlockClosure>>on:do:}
{07250048: cf 00000001, sp 07250058, bp 070E29E8, ip 17, [] in
BlockClosure>>newProcess}
<Bottom of stack>

***** End of dump *****



Re: The four byte bug lives

Steve Alan Waring
In reply to this post by Christopher J. Demers
Hi,

I have run the test package on a machine with:

 - Dolphin 4.013
 - NT 4 Workstation SP6a
 - 200 MHz Pentium Pro with 128 MB of RAM.

I ran the test 5 times, each time getting 0 errors/100.

I am not in a position to be able to leave the test running continuously,
but it appears this machine is not affected. I was also able to run it 5
times without seeing the error on my Win2k development machine.

Thanks,
Steve

==========
Steve Waring
[hidden email]
http://www.dolphinharbor.org/dh/harbor/steve.html



Re: The four byte bug lives

Ian Bartholomew-18
In reply to this post by Christopher J. Demers
Christopher et al,

Just some vague musings as I can't say I ever remember seeing this problem
(never used NT but did use 2000 for 18 months or so).

Did you notice that the corruption changed during the course of the dump
generation -

> 23:15:03 PM, 10/11/02: TextEdit does not understand #'|Y( ctionRange:'

but

> [07250474: 268]-->'TextEdit does not understand #'|\oF\(\07\ctionRange:''

That looks similar to the sort of thing that happens if you don't retain a
Smalltalk reference to a String but expect a memory address to remain valid.
This could also fit in with the intermittent nature of the problem - as long
as a garbage cycle doesn't allow the unreferenced memory to be freed you
will be ok.  It might also explain the OS difference - maybe the NT memory
(re)allocation strategy is different to Win2000 and the freed memory will
take longer to be reallocated.

Ian



Re: The four byte bug lives

Jerome Chan
In reply to this post by Christopher J. Demers
In article <ao83vl$k4ki3$[hidden email]>,
 "Christopher J. Demers" <[hidden email]> wrote:

> Jerome Chan <[hidden email]> wrote in message
> news:[hidden email]...
> > How does this affect STB of simple nonview objects? I'm writing a
> > LiveJournal client application and all the journal entries are stored as
> > STB objects on the file system.
>
> I think (I am sure he can clarify) that Bill encountered this problem in a
> non-view STB object.  I have not personally encountered this issue with
> non-view STB objects.  However I don't use my "problem machine" anymore so
> it has not seen many non-view STB objects from me.  On my Windows 2000
> machine I have not yet encountered this issue in either view or non-view STB
> objects in either D4 or D5.
>
> Chris
>
>


In my test cases, I created 10,000 objects, stored them in a file, and read
them back without any corruption.

Version 5
XP (Home), patch level unknown.
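
A minimal sketch of this kind of round-trip check is below. It assumes Dolphin's Object>>binaryStoreBytes and Object class>>fromBinaryStoreBytes:, and it goes through byte arrays rather than files to keep the example short. The sample entry (an Array holding two Strings and an Integer) is made up for illustration.

```smalltalk
"Sketch only: round-trip many simple objects through STB and count mismatches.
 Assumes Object>>binaryStoreBytes and Object class>>fromBinaryStoreBytes:.
 The sample entry is hypothetical."
| failures |
failures := 0.
1 to: 10000 do: [:i |
    | original bytes restored |
    original := Array with: 'entry ', i printString with: i with: 'some journal text'.
    bytes := original binaryStoreBytes.
    "A corrupt class name usually surfaces as an error on load, so trap it."
    restored := [Object fromBinaryStoreBytes: bytes] on: Error do: [:e | e return: nil].
    restored = original ifFalse: [failures := failures + 1]].
Transcript show: failures printString, ' failures out of 10000'; cr
```

A clean run on one machine says little about another, given how machine-dependent the reports in this thread are.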



Re: The four byte bug lives

Bill Schwab
In reply to this post by Christopher J. Demers
Chris, Andy,

> > I have just finished a 2 hour run under NT4 SP6 on an Athlon 1800+ m/c with
> > no errors. I ran three processes, two saving/loading the CHB view as per
>
> I wonder if you have an older less advanced machine you could try?

I was wondering about that myself.  Ian's point about GC activity might be
more relevant on slower machines????


> The machine that I can always make it happen on has a Pentium 300 MHz CPU
> and 128 MB RAM.
>
> Bill: What kind of machine were you running that gave you the error?

With 2k sp2, a 500 MHz Celeron - a Fujitsu pen tablet.  The NT machine that
I used for development and quickly "patched" with 2k to avoid view composer
problems is (I think) a P3.


> > I'm not sure where to go from here. I notice that the corrupt STB file you
> > sent me had a filename that indicated that it was generated on 30/1/2001
> > which is well back into the days of D4. Do you have an example from the
> > recent D5 failures, e.g. can you run your test again using a fresh D5 image
> > and NT and get the error handler to save out the file?
>
> I can't install D5 on my NT machine because I have SP4 rather than SP6.  The
> D5 install program doesn't even let me install it.  I only run D5 on W2K and
> I have never experienced the STB problem there, however I never experienced
> it with D4 on that machine (Pentium Pro 200 MHz with 128 MB RAM) either.  If
> I thought this problem would never happen in D5 I would not worry.  However
> Bill has reported seeing the exact same kind of corruption in D5 on W2K that
> I am seeing with D4 on NT.  If I understand Bill's posting it sounds like
> the issue is rare with D5 on W2K.

Very rare (but not rare enough sadly), at least with the machines that I've
encountered.  The sample size is small, but it seems to be worse on 2k sp1
than on sp2.


>  Even if the corruption is rare I am just
> worried that after I release my program I am going to get some calls from
> end users whose files are corrupt.
>
> I just made an EXE in D5 on my W2K machine that contains my STB stress code
> (with Andy's change).  I do not get any errors on W2K or NT with the D5 EXE.
> I will try running it over the weekend to stress it.
>
> In a virgin D4 image I loaded a package with only one dialog class and a
> view.  I was able to get 18 corruptions out of 100 tries.  Then I ran it a
> few more times and got 2/100, 0/100,0/100, 10/100.  I waited between a few
> seconds and a few minutes between the tests.

That's interesting.


> > How many Dolphin processes are running in the image that gives the errors?
> > Does it fail in a fresh image? Is your NT box doing anything else at the
> > time (are there any unusual services running that we wouldn't have here).
>
> The process monitor shows only the standard 5 Dolphin processes.  I usually
> have the following running: AOL IM, MS Outlook, MS Outlook Express (for
> news), McAfee virus scanner and sometimes IE.  I am not running any servers
> on the NT box.

Chris has an alibi but I'm busted :)  The exact number of processes is
impossible to predict because it will depend on what the machine is doing at
the time.  Ten is probably a good guess.  I'm not sure about services on the
NT/development box, and it's been all but reformatted, so it's probably
impossible to tell now.  I'll look at the pen tablets rather than trying to
give a quick answer.


> > Without being able to get it to fail here it is going to be tricky to go
> > much further. I'll leave it going over the weekend but I suspect it's not
> > going to fail.
>
> I understand.

Ditto.


>  I am not sure if you want to try D4 on NT in hopes of being
> able to reproduce the problem and apply the fix to D5 since it seems to be
> the same problem but at a lower frequency.

That seems like an excellent idea.  At worst, it might find/fix a problem in
D4, and I really suspect it is in fact the same bug, simply masked by calmer
waters in 2k.


> However I don't want you to
> waste your time on this either.  It seems that something about D5 has
> improved the issue over D4.  I guess the real test is to see what the real
> world frequency of this issue is.  Perhaps this corruption will only occur
> once every 10 years in D5, and Bill just got "lucky".  I guess the best
> thing is for people to just report this here if it happens to them.  If Bill
> or someone else runs into this again with D5 on W2K then I think it becomes
> a more pressing issue.

Unfortunately, I have seen it multiple times on 2k, but I took the sp1
instances as a sign that I needed to apply the service pack.  It now seems
that service packs don't fix it.



> If anyone wants to try, here is a D4 package that includes a Dialog with a
> view that I used for my tests above.  The example code is in the package
> comment (it has Andy's fix) but I suggest not saving the image after the
> test or running it in a trash image.
>
> http://www.mitchellscientific.com/smalltalk/Dolphin4/4ByteSTBErrorExample.pac

I'll grab it and give it a shot on one of the pen tablets.



> I truly enjoy using Dolphin Smalltalk.  Thanks for
> bringing it to life!

Well said!

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]



Re: The four byte bug lives

Bill Schwab
In reply to this post by Christopher J. Demers
Chris,

> > My most recent observation was fixed by rebooting the offending machine.  I
> > was under some pressure to fix it, so I can't claim to have tried everything
> > along the way.  A previous encounter (perhaps on 2k sp1 though, and on a
> > different machine) also seemed to require a reboot to fix.  Could the
> > machines be suffering some kind of heap fragmentation causing more active
> > memory management and therefore making the problem more likely to appear???
>
> I am interested that you say rebooting seemed to fix the problem.  It seems
> that there are two layers to this problem.  The root problem is the
> corruption of STB data.  However where you probably notice the problem is
> when you read back from the STB data.  I assume that you either got rid of
> the corrupt STB files or fixed them.  Were newly created STB files becoming
> corrupt?

I'm sending STB data across sockets, and it is not entirely clear which side
is at fault.  However, I don't recall ever seeing this problem on 9x
(although it's possible that the machines simply crashed when it
happened???), and I still have 9x machines in use.  Just the other week, I
had a situation in which a Win95 machine could talk to my 2k server, and my
2k development machine could not.  Also, I've had several rounds of
rebooting pen tablets w/o rebooting the server fixing the problem.  The most
recent instance was WinME server, win2k sp2 client, client can't connect due
to corrupt data, reboot client, all is well - the server hadn't changed.

Lumping all of that together, it suggests a problem reading back data,
though I agree that this is not entirely consistent with the logical
assumption that D4/NT/VC saves broken view resources.  I _think_ I've even
gone so far as to save a pac with a broken resource for Blair to inspect.
He sent back a fixed copy of the resource (his own idea, try getting that
kind of service from MS!!).  If I'm remembering correctly, that's evidence
that the written data can be corrupt.


> Are you experiencing a situation where the program runs fine, happily
> writing to and reading from STB files for a period of time, and then starts
> frequently trashing STB files (even after deleting previously damaged files
> and restarting the program), and continues to do so until it is rebooted?

Well, it's over a network, but that seems to be the idea.  The (relatively
new) win2k pen tablets were originally being rebooted somewhat more often
due to the D4 print/shutdown problem that now appears to be resolved in D5.
One could argue that as they run longer, we're starting to see problems (n=1
on that one though).


> That sounds worse than I thought it was.  I thought you had just stumbled
> upon one instance of STB corruption in D5 on W2K.  Is STB corruption on this
> machine a recurring situation?

It's recurring across a group of four sister machines that have seen a mix
of service packs.  As they standardize on sp2, it will get easier to narrow
it down.  There was also an incident on the P4 2k server after 4+ months of
continuous up time.  It's _possible_ that the latter problem originated
elsewhere, but I doubt it.

This is definitely not one isolated cosmic ray :(


> I asked this in my reply to Andy, but in case you don't see it:  What kind of
> computer causes this problem under W2K?  Mine is a 300 MHz Pentium with 128
> MB RAM that causes trouble under NT.

I had NT problems with a P3 that's likely not too much faster.  The
offending 2k machines are 500 MHz celerons.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]



Re: The four byte bug lives

Bill Schwab
In reply to this post by Christopher J. Demers
Chris,

> > How does this affect STB of simple nonview objects? I'm writing a
> > LiveJournal client application and all the journal entries are stored as
> > STB objects on the file system.
>
> I think (I am sure he can clarify) that Bill encountered this problem in a
> non-view STB object.

I wouldn't call the objects simple, but non-view yes, and they get STB'd
into byte arrays, encrypted, and shoved through a socket.  At first I
suspected the encryption/decryption, or maybe unintended side effects of
SocketReadStream (#position will become misleading after #readPage), but,
none of that seems to be relevant.  I basically use only #nextPutAll: and
#next: with socket streams, and the corruption always strikes in the first
four bytes of a class name, the same as the NT/D4/VC problems that have been
reported by a few different users.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]



Re: The four byte bug lives

Christopher J. Demers
In reply to this post by Ian Bartholomew-18
Ian Bartholomew <[hidden email]> wrote in message
news:DuSp9.722$Lm1.79078@stones...
> Did you notice that the corruption changed during the course of the dump
> generation -
>
> > 23:15:03 PM, 10/11/02: TextEdit does not understand #'|Y( ctionRange:'
>
> but
>
> > [07250474: 268]-->'TextEdit does not understand #'|\oF\(\07\ctionRange:''

Is it really changing, or is this some sort of escaping?  Or perhaps it is
escaping because it is changing?

> That looks similar to the sort of thing that happens if you don't retain a
> Smalltalk reference to a String but expect a memory address to remain valid.
> This could also fit in with the intermittent nature of the problem - as long
> as a garbage cycle doesn't allow the unreferenced memory to be freed you
> will be ok.  It might also explain the OS difference - maybe the NT memory
> (re)allocation strategy is different to Win2000 and the freed memory will
> take longer to be reallocated.

I figure it is something like that.  If it were a string address changing,
would it make sense that only the first four bytes would be garbled, or
should the corruption location and range be more random?

Chris



Re: The four byte bug lives

Ian Bartholomew-18
Chris,

> Is it really changing, or is this some sort of escaping?  Or perhaps it is
> escaping because it is changing?

I don't know how valid this is, but converting back into bytes you get:

'|Y( ctionRange:' asByteArray
#[124          89           40           7
99 116 105 111 110 82 97 110 103 101 58]

'|\oF\(\07\ctionRange:' asByteArray
#[124             92 111 70 92           40              92 48 55 92
99 116 105 111 110 82 97 110 103 101 58]

I've added the extra spacing as I thought it might indicate that something
(not Dolphin) is using the \..\ as an escape sequence.   It is purely
speculation though.

> I figure it is something like that.  If it were a string address changing
> would it make sense that only the first four bytes would be garbled or
> should the corruption location and range be more random?

Yes, that's a good point and one that I have no answer for.  I expect there
could be a few ways of explaining it (perhaps memory allocation is done in
32 byte chunks?) but, as above, it would just be pure speculation.

Have you (or Bill) got an example of an STB that is corrupted?  It might be
useful to examine the contents to try and determine:

a) If the problem is with the writing or the reading of the STB.
b) The extent of the corruption and if it has any overall format (every 32
bytes for example)
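
For (b), something along these lines would show the extent at a glance. It is only a sketch: the two byte arrays are hypothetical stand-ins for a known-good STB and its corrupted counterpart, and the damage is simulated by hand.

```smalltalk
"Sketch only: list the 1-based offsets at which two equal-length byte arrays differ.
 Both arrays are hypothetical stand-ins for a good and a corrupted STB."
| good bad diffs |
good := 'TextEdit' asByteArray.
bad := good copy.
bad at: 2 put: 0; at: 3 put: 255.    "simulate damage for the demonstration"
diffs := (1 to: good size) select: [:i | (good at: i) ~= (bad at: i)].
Transcript show: 'differing offsets: ', diffs printString; cr
```

With real data, offsets that always cluster within four bytes of the start of a class name would match what has been reported, while a regular stride would point somewhere else.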

FWIW, I did make a half-hearted attempt at forcing the problem on my 2000
machine by generating a lot of objects during the STB encoding and decoding
operations, to try and cause garbage collections, but nothing untoward
happened.

Regards
    Ian



Re: The four byte bug lives (Perhaps a FIX!?!?!)

Christopher J. Demers
In reply to this post by Bill Schwab-2
Bill Schwab <[hidden email]> wrote in message
news:ao2cqm$hs79j$[hidden email]...
>
> Remember the binary filing problem (typically showed up when saving view
> resources) that corrupts some/all of the first four bytes of a class name,
> and was (in my limited experience with it) fairly common on NT?  I've now
> seen it (with STB, not in the VC) on win2k sp2.  Any ideas?  A reproducible
> case would be terrific :)

Ian's message got me thinking about problems with pointers to strings again.
I looked back at STBOutFiler>>writeInstanceVariables: which had previously
seemed interesting to me due to its use of yourAddress asExternalAddress.

I commented out this line:
    stream next: basicSize putAll: objectToSave yourAddress asExternalAddress startingAt: 1]
and uncommented the line above it:
    1 to: basicSize do: [:i | stream nextPut: (objectToSave basicAt: i) asInteger]]

It looks like OA was trying to optimize STB performance.  The two lines seem to
accomplish the same thing.  I can't claim this as a fix for sure due to the
random nature of this problem, however after this change I have not had an
STB corruption.  I will keep testing.  Anyone else experiencing this problem
(that feels brave) can try this "fix".  Let us know if it actually fixes
anything.  Be aware that this change might make the STB system slower.  Be
careful.
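
For anyone who wants a quick write-side check before and after the change, here is a rough sketch. It assumes Object>>binaryStoreBytes, and it also assumes that STB output for an unchanged object is byte-for-byte repeatable, which I have not verified. The sample object is made up, and it only exercises the non-view path, so it is no substitute for the view-based test package.

```smalltalk
"Sketch only: serialize the same object 100 times and compare each result
 with the first. Assumes Object>>binaryStoreBytes and repeatable STB output
 for an unchanged object. The sample object is hypothetical."
| sample reference corruptions |
sample := OrderedCollection new.
sample add: 'selectionRange:'; add: #(1 2 3); add: 42.
reference := sample binaryStoreBytes.
corruptions := 0.
100 timesRepeat: [
    sample binaryStoreBytes = reference ifFalse: [corruptions := corruptions + 1]].
Transcript show: corruptions printString, '/100 corrupt'; cr
```

A nonzero count would suggest the bytes are already wrong when written, rather than being damaged on the way back in.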

Chris



Re: The four byte bug lives (Perhaps a FIX!?!?!)

Bill Schwab-2
Chris,

> Ian's message got me thinking about problems with pointers to strings again.
> I looked back at STBOutFiler>>writeInstanceVariables: which had previously
> seemed interesting to me due to its use of yourAddress asExternalAddress.
>
> I remmed this line:
> stream next: basicSize putAll: objectToSave yourAddress asExternalAddress
> startingAt: 1]
> and unremmed the line above it:
> 1 to: basicSize do: [:i | stream nextPut: (objectToSave basicAt: i)
> asInteger]]

Fix or not, congratulations on a good piece of detective work.

Question for OA: when was this change made?  I suppose all the answer could
do is disprove it as a fix, but even that's good to know.


> It looks like OA was trying to optimize STB performance.  The two lines seem to
> accomplish the same thing.  I can't claim this as a fix for sure due to the
> random nature of this problem, however after this change I have not had an
> STB corruption.  I will keep testing.  Anyone else experiencing this problem
> (that feels brave) can try this "fix".  Let us know if it actually fixes
> anything.  Be aware that this change might make the STB system slower.  Be
> careful.

A small blip in speed would be better than random failures.  Hopefully OA
will have a chance to comment on this before I regain consciousness :)
Speed changes in specific cases will be easy enough to measure, and perhaps
there is a compromise solution, such as hanging onto one of the external
addresses (the result of #yourAddress looks like it could be at risk),
and/or making sure the right one owns the memory.

If all goes well, Chris gets the Jolt Cola engraved mouse pad, and all of us
will get a confirmed bug fix.  However, as planned, I built up an NT machine
today.  It's a 366/64MB celeron, with everything but the network card
working.  The latter is proving **very** stubborn.  Any NT4 gurus out there
have some advice?  The machine is currently at sp1 (which might actually be
a good thing for bug hunting on D4??).  Another option might be to put
service packs on a CD.  Here's hoping the work will have been for nothing :)

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]



Re: The four byte bug lives (Perhaps a FIX!?!?!)

Christopher J. Demers
Bill Schwab <[hidden email]> wrote in message
news:aofl7p$lq59g$[hidden email]...

> Fix or not, congratulations on a good piece of detective work.
Thanks!

> Question for OA: when was this change made?  I suppose all the answer could
> do is disprove it as a fix, but even that's good to know.

I am curious as well.  I am glad the old code was still there as it made the
change easy.  I looked at a Dolphin 98 image and was surprised to see that
the code looked similar to that in D4 and D5, so I assume the change was
made quite a while ago.

> A small blip in speed would be better than random failures.  Hopefully OA
> will have a chance to comment on this before I regain consciousness :)
..

I agree.  I am curious where the real problem is, and if it may have broader
implications.  Perhaps the STB corruption is merely a symptom of some more
serious memory reference problem.  However Dolphin does seem stable on my
problem machine with the STB corruptions being the only exception.

> If all goes well, Chris gets the Jolt Cola engraved mouse pad, and all of us
> will get a confirmed bug fix.  However, as planned, I built up an NT machine
> today.  It's a 366/64MB celeron, with everything but the network card
> working.  The latter is proving **very** stubborn.  Any NT4 gurus out there
> have some advice?  The machine is currently at sp1 (which might actually be
> a good thing for bug hunting on D4??).  Another option might be to put
> service packs on a CD.  Here's hoping the work will have been for nothing :)

Ah the joys of networking with NT!  I had a heck of a time getting my
network card working on my NT box.  I think I needed at least SP3.  Also if
memory serves me correctly there was some silly thing that had to be done
every time a network driver (or perhaps even setting) change was made.  I
think I had to rerun the service pack or something like that.  I don't
remember the details since it has been a while, and hopefully I will never
have to do that again. ;)

I implemented my "fix" in D5.  Generated a test EXE and ran it with a loop
size of 100 on my NT "problem" machine without any errors.  My previous D5
EXE with the original code continues to run into STB corruption.  This is
encouraging, I am feeling more confident about the "fix" but will continue
to test.

Chris



Re: The four byte bug lives

rush
In reply to this post by Bill Schwab-2
"Bill Schwab" <[hidden email]> wrote in message
news:ao2cqm$hs79j$[hidden email]...
> Hello all,

I am sorry to chime in this late, but here are a few data points we have
observed:

a) In our case corruption seems to occur when communicating some byte data
from an external library (STB is not involved). We cannot guarantee that there
is no error in our DLL, but we did a lot of checking there.
b) It seems to us that the error started occurring when we switched our app from
Dolphin 98 to Dolphin 3.0. At the same time we made changes to the app, but
the DLL that Dolphin talks to remained the same.
c) Frequency: it seems to happen about once in 1500 hours of work. More load
or stress on the machine may provoke it.
d) Versions: DS 3.x, OS Win 95.

rush



Re: The four byte bug lives (Perhaps a FIX!?!?!)

Bill Schwab-2
In reply to this post by Christopher J. Demers
Chris,

> I am curious as well.  I am glad the old code was still there as it made the
> change easy.  I looked at a Dolphin 98 image and was surprised to see that
> the code looked similar to that in D4 and D5, so I assume the change was
> made quite a while ago.

<somebodyHasToSayIt>
I hope the XP/don't comment/code formatting is irrelevant/tests can catch it
all crowd is listening.  Sometimes I fear that the increasing body of
_totally_ uncommented code will do what C++ and Java couldn't.
</somebodyHasToSayIt>

> Ah the joys of networking with NT!  I had a heck of a time getting my
> network card working on my NT box.  I think I needed at least SP3.  Also if
> memory serves me correctly there was some silly thing that had to be done
> every time a network driver (or perhaps even setting) change was made.  I
> think I had to rerun the service pack or something like that.  I don't
> remember the details since it has been a while, and hopefully I will never
> have to do that again. ;)

Ouch!  I might just go the CD route with service packs; it sounds like
you're saying I'd have to start that way for the first few packs.  Besides,
I'm tempted to start with it unpatched because we're trying to make it
break.


> I implemented my "fix" in D5.  Generated a test EXE and ran it with a loop
> size of 100 on my NT "problem" machine without any errors.  My previous D5
> EXE with the original code continues to run into STB corruption.  This is
> encouraging, I am feeling more confident about the "fix" but will continue
> to test.

Thanks!!

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: The four byte bug lives (Perhaps a FIX!?!?!)

Bill Schwab-2
In reply to this post by Christopher J. Demers
Chris,

What does your NT box think of the following?  What do _you_ think of it?
Hopefully it preserves the speed boost w/o the bug??  Note that I could be
holding on to the wrong object, so feel encouraged to tweak as you see fit.

Have a good one,

Bill

!STBOutFiler methodsFor!

writeInstanceVariables: objectToSave
 "Private - Dump the instance variables of the <Object> argument to the binary stream."

 | class basicSize notGarbage |
 #stbFix.
 basicSize := objectToSave basicSize.
 self writeInteger: basicSize.
 class := objectToSave basicClass.
 class isBytes
  ifTrue: [
   "1 to: basicSize do: [:i | stream nextPut: (objectToSave basicAt: i) asInteger]]"
   stream next: basicSize putAll: (notGarbage := objectToSave yourAddress) asExternalAddress startingAt: 1]
  ifFalse: [1 to: (class instSize+basicSize) do: [:i | self basicNextPut: (objectToSave instVarAt: i)]]! !
!STBOutFiler categoriesFor: #writeInstanceVariables:!operations!private! !

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: The four byte bug lives

Bill Schwab-2
In reply to this post by Ian Bartholomew-18
Ian,

> Have you (or Bill) got an example of a STB that is corrupted.  It might be
> useful to examine the contents to try and determine
>
> a) If the problem is with the writing or the reading of the STB.

A sample that I might be able to find (it would take some work) is the
resource that I sent to Blair some time ago.  He fixed up the class name in
the debugger and sent the result back to me.  It seemed ok other than the
one four byte region.  Blair, does that sound familiar/correct?

Obviously that corruption occurred on writing.  I've thought some more about
the read/write issue, and the bottom line is that it can be quite hard to
tell in the offending system.  In the recent incident, it's very possible
that the error occurred on the pen tablet during a write, got sent to the ME
"server", blew up there, with the error being reported on the pen tablet.  I
probably should find a way to make the logs more clear, but it's generally
been sufficient as-is.  In confusing cases, I typically just set breakpoints
on both sides and wait for something to trip, but that obviously won't work
with a rare runtime problem.

So, making a long story short, at this point I have no evidence that refutes
Chris' proposed fix, at least not w/o doing more analysis to prove that it's
a refutation :)  My hunch is he found it.


> b) The extent of the corruption and if it has any overall format (every 32
> bytes for example)

I think it's not periodic, but since people have reported string corruption
in view captions, etc., it's possible that Blair and I missed something else
in that resource he fixed.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: The four byte bug lives

Bill Schwab-2
In reply to this post by rush
> I am sorry to chime in this late, but here are a few data points we have
> observed:
>
> a) In our case corruption seems to occur when communicating some byte data
> from an external library (STB is not involved). We cannot guarantee that
> there is no error in our DLL, but we did a lot of checking there.
> b) It seems to us that the error started occurring when we switched our app
> from Dolphin 98 to Dolphin 3.0. At the same time we made changes to the app,
> but the DLL that Dolphin talks to remained the same.
> c) Frequency: it seems to happen once in 1500 hrs of work. More load
> or stress on the machine may provoke it.
> d) Versions: DS 3.x, OS Win 95.

Is this one of those "the customer is happy or just plain stuck w/ 95 so you
have to support it" situations?  I'm a little surprised that you're still
using 3.x though; any particular reason, or is the app just that old?

Does the 1500 hrs include off hours, or is it all time when the app is doing
real work?  Either way, that's a fairly frequent problem, at least
transplanting it into my environment - 1500 hours of total machine time goes
by pretty fast around here.  Let's assume for the moment that my
extrapolation of Chris' fix is correct.  Then this makes me wonder if the
image should be scoured for 'yourAddress asExternalAddress', ideally by
browsing for references to both selectors or using the SmalltalkParser on
all methods so as not to miss variants due to white space.  Searching for
that exact string in my image finds four methods, including the original
form of Chris' STB method.
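
A rough workspace sketch of such a scan follows. It is deliberately coarser than browsing references: it lists every method whose source mentions both selectors anywhere, so white space between them cannot hide a match, but each hit still needs to be read by hand. It assumes that Class class>>allBehaviorsDo:, Behavior>>selectors, Behavior>>compiledMethodAt: and CompiledMethod>>getSource are available in the image; substitute whatever your version provides.

```smalltalk
"Sketch only: collect class -> selector pairs for methods whose source
 mentions both yourAddress and asExternalAddress.
 The enumeration and source-access selectors used here are assumptions
 about the image, not something confirmed in this thread."
| hits source |
hits := OrderedCollection new.
Class allBehaviorsDo: [:behavior |
	behavior selectors do: [:selector |
		source := (behavior compiledMethodAt: selector) getSource.
		(source notNil
			and: [(source indexOfSubCollection: 'yourAddress') > 0
			and: [(source indexOfSubCollection: 'asExternalAddress') > 0]])
				ifTrue: [hits add: behavior -> selector]]].
hits
```

Inspecting the resulting collection gives candidates only; a method may use the two selectors in unrelated expressions.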

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: The four byte bug lives

rush
"Bill Schwab" <[hidden email]> wrote in message
news:aoh5fh$mhvp9$[hidden email]...
>
> Is this one of those "the customer is happy or just plain stuck w/ 95 so you
> have to support it" situations?  I'm a little surprised that you're still
> using 3.x though; any particular reason, or is the app just that old?

The app does have some years on its shoulders, and it does not get updated
very often. (I tend to grow gray hair in my beard when releasing new
versions in production ;) When we last updated, Dolphin 4.0 was already
available, but we chose to stay with 3.0 because the app shares a part of
the codebase with a Dolphin applet, and as far as I recall, using D4 would
have made this more difficult. This, and the fact that every new VM was harder
and harder to install on ancient machines, made us stick with D3.

On the other hand, the machines running the app are owned by some 50-60
separate legal entities, so changing them in a coordinated and controlled way
is a difficult political process :)

Anyway, I hope that we will have a new release under D5 and Win XP by the end
of this year.

> Does the 1500 hrs include off hours, or is it all time when the app is doing
> real work?  Either way, that's a fairly frequent problem, at least
> transplanting it into my environment - 1500 hours of total machine time goes
> by pretty fast around here.

Well, I calculated it on the basis of one incident per 50-60 machines, running
6 hours a day for 5 workdays. The numbers are not exact, more a ballpark
figure. I have a fix in the app that catches the corrupted data and
disconnects, so the bug does not critically influence the app, but it is creepy.
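
A rough check of that ballpark, assuming the figure means machine-hours accumulated by the whole fleet in one working week (the variable names are illustrative only):

```smalltalk
"Workspace sketch: machine-hours per working week for a fleet of
 50 and of 60 machines, each running 6 hours a day for 5 workdays."
| hoursPerDay workdays lowFleet highFleet |
hoursPerDay := 6.
workdays := 5.
lowFleet := 50.
highFleet := 60.
Array
	with: lowFleet * hoursPerDay * workdays
	with: highFleet * hoursPerDay * workdays
```

Displaying the result gives 1500 and 1800 machine-hours, so under that assumption the estimate amounts to roughly one incident per week across all machines.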

>  Let's assume for the moment that my
> extrapolation of Chris' fix is correct.  Then this makes me wonder if the
> image should be scoured for 'yourAddress asExternalAddress', ideally by
> browsing for references to both selectors or using the SmalltalkParser on
> all methods so as not to miss variants due to white space.  Searching for
> that exact string in my image finds four methods, including the original
> form of Chris' STB method.

I'll try to search for it. But if OA could nail this one that would be great,
since I do not see why the "yourAddress asExternalAddress" expression is wrong.
I also hope that, along with Chris' great finding, the info that we did not
notice the bug in D98 but did in D3, which introduced a new VM, could be of
some help to OA (like going back and thinking about what changed in those times).

rush

