The four byte bug lives

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
72 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Andy Bower
Christopher,

We can't get a test case for this problem here (even under NT) so we'll have
to rely on your continued sleuthing I'm afraid. That comment swap code that
you've found obviously was an attempt to speed thing up and has been there
since before 1997. It is a shabby piece of work and your proposed fixed
should probably be reinstated into the image regardless of whether it is a
true fix for the four byte overwrite issue or not. I doubt it makes any real
speed difference but we should test this first anyway.

The thing that puzzles us about your fix is that it is not that piece of
code (#writeInstanceVariables) that is responsible for writing out the
corrupted class names but, rather, #writeClass:withPrefix:. Anyway, we await
the results of your further testing with interest.

Best Regards,

Andy Bower
Dolphin Support
http://www.object-arts.com
---
Are you trying too hard?
http://www.object-arts.com/Relax.htm
---

"Christopher J. Demers" <[hidden email]> wrote in
message news:aoffjn$lug31$[hidden email]...
> Bill Schwab <[hidden email]> wrote in message
> news:ao2cqm$hs79j$[hidden email]...
> >
> > Remember the binary filing problem (typically showed up when saving view
> > resources) that corrupts some/all of the first four bytes of a class
name,
> > and was (in my limited experience with it) fairly common on NT?  I've
now
> > seen it (with STB, not in the VC) on win2k sp2.  Any ideas?  A
> reproduceable
> > case would be terrific :)
>
> Ian's message got me thinking about problems with pointers to strings
again.

> I looked back at STBOutFiler<<writeInstanceVariables: which had previously
> seemed interesting to me due to its use of yourAddress asExternalAddress.
>
> I remmed this line:
> stream next: basicSize putAll: objectToSave yourAddress asExternalAddress
> startingAt: 1]
> and unremmed the line above it:
> 1 to: basicSize do: [:i | stream nextPut: (objectToSave basicAt: i)
> asInteger]]
>
> It looks like OA was trying optimize STB performance.  The two lines seem
to
> accomplish the same thing.  I can't claim this as a fix for sure due to
the
> random nature of this problem, however after this change I have not had an
> STB corruption.  I will keep testing.  Anyone else experiencing this
problem
> (that feels brave) can try this "fix".  Let us know if it actually fixes
> anything.  Be aware that this change might make the STB system slower.  Be
> carefull.
>
> Chris
>
>


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Bill Schwab-2
Andy,

> The thing that puzzles us about your fix is that it is not that piece of
> code (#writeInstanceVariables) that is responsible for writing out the
> corrupted class names but, rather, #writeClass:withPrefix:. Anyway, we
await
> the results of your further testing with interest.

The code does run though (I set a breakpoint to verify that my tweaked
version was excercising the branch that I altered), so if there is something
wrong with "yourAddress asExternalAddress", then it might make a difference.
Or, is "yourAddress asExternalAddress" guaranteed to be safe?

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Andy Bower
Bill,

> The code does run though (I set a breakpoint to verify that my tweaked
> version was excercising the branch that I altered), so if there is
something
> wrong with "yourAddress asExternalAddress", then it might make a
difference.
> Or, is "yourAddress asExternalAddress" guaranteed to be safe?

Well I wouldn't ever like to use the word "guaranteed" :-) but providing the
object in question doesn't go out of scope and get GC'd while that external
address is used there shouldn't be a probelm. I can't see how that could
happen with the code in question since objectToSave is a parameter and must
be in scope until at least the end of the method.

Best Regards,

Andy Bower
Dolphin Support
http://www.object-arts.com
---
Are you trying too hard?
http://www.object-arts.com/Relax.htm
---


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Bill Schwab-2
Andy,

> > The code does run though (I set a breakpoint to verify that my tweaked
> > version was excercising the branch that I altered), so if there is
> something
> > wrong with "yourAddress asExternalAddress", then it might make a
> difference.
> > Or, is "yourAddress asExternalAddress" guaranteed to be safe?
>
> Well I wouldn't ever like to use the word "guaranteed" :-) but providing
the
> object in question doesn't go out of scope and get GC'd while that
external
> address is used there shouldn't be a probelm. I can't see how that could
> happen with the code in question since objectToSave is a parameter and
must
> be in scope until at least the end of the method.

Understood, but what about the object returned by #yourAddress?  I'm
wondering if it gets away, perhaps only when it is or isn't already in some
particular format.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Chris Uppal-3
In reply to this post by Andy Bower
Andy,

> object in question doesn't go out of scope and get GC'd while that
external
> address is used there shouldn't be a probelm. I can't see how that could
> happen with the code in question since objectToSave is a parameter and
must
> be in scope until at least the end of the method.

Isn't there at least the theoretical possibility that the object will be
moved by GC before or while the bytes from its "external address" are being
copied to the stream ?

I haven't been able to find any circumstances when byte objects do change
their external address, running D5 over W2K, but maybe something happens
differently on NT ?

    -- chris


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Christopher J. Demers
In reply to this post by Bill Schwab-2
Bill Schwab <[hidden email]> wrote in message
news:aoh3c3$ls177$[hidden email]...
> Chris,
>
> What does your NT box think of the following?  What do _you_ think of it?
> Hopefully it preserves the speed boost w/o the bug??  Note that I could be
> holding on to the wrong object, so feel encouraged to tweak as you see
fit.
...

No luck with this code.  I tried it both as you wrote it, and with an
another variable doing them same thing to the result of asExternalAddress as
well.  The corruption occurred with both versions.  Then I want back to the
code that does not use asExternalAddress and did not have any corruptions.
I think there is something wrong with using yourAddress  and/or
asExternalAddress in this way on certain machines.

I ran a D5 EXE with my fix on my NT machine last night.  I had it looping
100,000 times sending a view back and forth to STB.  It eventually consumed
all my system resources (either there was a leak, or my half second delay
did not let it GC), but it did not cause any STB corruptions.  So as you may
have noticed I am not using quotes around fix any more.  I will keep
testing, but I think the fix works.  I would encourage you to give it a shot
and see if it alleviates your problems.

Chris


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Christopher J. Demers
In reply to this post by Andy Bower
Andy Bower <[hidden email]> wrote in message
news:aohav2$mb4og$[hidden email]...
>
> We can't get a test case for this problem here (even under NT) so we'll
have
> to rely on your continued sleuthing I'm afraid. That comment swap code
that
> you've found obviously was an attempt to speed thing up and has been there
> since before 1997. It is a shabby piece of work and your proposed fixed
> should probably be reinstated into the image regardless of whether it is a
> true fix for the four byte overwrite issue or not. I doubt it makes any
real
> speed difference but we should test this first anyway.

I generated a Dolphin 5 EXE with my fix, and told it to run a big loop of
the same tests that were causing corruptions for me before.  I did not see
any STB corruptions.  The program eventually exhausted my system resources
overnight, but that is another matter and since I don't normally try to open
and close a view 100,000 times in a row I don't much care. ;)  I feel quite
confident that not using yourAddress and/or asExternalAddress is the fix for
the STB problems on my computer.  I only wish I had discovered this sooner.
I would suggest making this change to Dolphin if there is not much downside
as it will improve the overall reliability of Dolphin on different
platforms.  This may not be as important for people developing "in house"
solutions, but I think it is important for anyone shipping a Dolphin based
product to end users.

> The thing that puzzles us about your fix is that it is not that piece of
> code (#writeInstanceVariables) that is responsible for writing out the
> corrupted class names but, rather, #writeClass:withPrefix:. Anyway, we
await
> the results of your further testing with interest.

I wondered about that too and it kept me from looking at that method sooner,
here are my thoughts:  My most common corruption before was
IdentityDictionary.  I also saw corruption in the display name of a tab in
the class browser once, and lately I have seen it in method names.  The
later two seem like they certainly could be in instance variable values on
some level.  I think the first item is as well, since it is contained in an
STBIdentityDictionaryProxy which has an instance variable for the class.  I
just looked at the code, and I am just as puzzled as you are, because I
think the class should still be routed to #writeClass:withPrefix: even if it
is in an instance variable.  Well I can't explain why, but the fix I
described appears to have eliminated the STB corruption problem for me on my
problem computer.

Chris


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Bill Schwab-2
In reply to this post by Chris Uppal-3
Chris,

> I haven't been able to find any circumstances when byte objects do change
> their external address, running D5 over W2K, but maybe something happens
> differently on NT ?

But it also happens on 2k, just a lot less often.  As an aside, a different
machine (sp 3 if memory serves) got into trouble this afternoon.  I've
pretty much decided to install a hot fix based on your proposed patch.

Thanks for providing the option!

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Bill Schwab-2
In reply to this post by Christopher J. Demers
Chris,

> No luck with this code.  I tried it both as you wrote it, and with an
> another variable doing them same thing to the result of asExternalAddress
as
> well.  The corruption occurred with both versions.  Then I want back to
the
> code that does not use asExternalAddress and did not have any corruptions.
> I think there is something wrong with using yourAddress  and/or
> asExternalAddress in this way on certain machines.

Thanks - I'm just about to create the hot fix that I mentioned moments ago.
I was going to use your _fix_<g> as written anyway, but now it's pretty
obviously the right thing to do.

> I ran a D5 EXE with my fix on my NT machine last night.  I had it looping
> 100,000 times sending a view back and forth to STB.  It eventually
consumed
> all my system resources (either there was a leak, or my half second delay
> did not let it GC), but it did not cause any STB corruptions.  So as you
may
> have noticed I am not using quotes around fix any more.  I will keep
> testing, but I think the fix works.  I would encourage you to give it a
shot
> and see if it alleviates your problems.

Will do - very shortly.

THANKS!!!!

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Bill Schwab-2
In reply to this post by Christopher J. Demers
Andy, Chris,

> > The thing that puzzles us about your fix is that it is not that piece of
> > code (#writeInstanceVariables) that is responsible for writing out the
> > corrupted class names but, rather, #writeClass:withPrefix:. Anyway, we
> await
> > the results of your further testing with interest.
>
> I wondered about that too and it kept me from looking at that method
sooner,
> here are my thoughts:  My most common corruption before was
> IdentityDictionary.  I also saw corruption in the display name of a tab in
> the class browser once, and lately I have seen it in method names.  The
> later two seem like they certainly could be in instance variable values on
> some level.  I think the first item is as well, since it is contained in
an
> STBIdentityDictionaryProxy which has an instance variable for the class.
I
> just looked at the code, and I am just as puzzled as you are, because I
> think the class should still be routed to #writeClass:withPrefix: even if
it
> is in an instance variable.  Well I can't explain why, but the fix I
> described appears to have eliminated the STB corruption problem for me on
my
> problem computer.

I sorta hate to mention it, but Davorin has raised a question of the same
kind of corruption absent STB.  Could the problem be something more generic
with #yourAddress or #asExternalAddress?  Maybe your problematic NT machine
could spend some time looking for objects that move around after the address
is grabbed, as you theorized might be happening.  Perhaps we should create
something similar to Davorin's DLL (maybe even the same DLL if he's willing
and able to share it with you) and make LOTS of calls???  Failing that,
perhaps we could rig up a DLL that even just copies a string from one buffer
to another, and have yourAddress asExternalAddress figure in the call;
hopefully the corruption would show itself by looking at the results after
the call.

Another possibility (wild guess only) would be a problem w/ hashing.
Davorin, do you use dictionaries in your problematic case?  Do you explictly
obtain addresses?  If so, using which messages to what kinds of objects?
Feel free to ask yourself any questions I'm forgetting :)

Ok, on to that hot fix.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Christopher J. Demers
Bill Schwab <[hidden email]> wrote in message
news:aoi8l3$m0ugm$[hidden email]...
>
> I sorta hate to mention it, but Davorin has raised a question of the same
> kind of corruption absent STB.  Could the problem be something more
generic
> with #yourAddress or #asExternalAddress?  Maybe your problematic NT
machine
> could spend some time looking for objects that move around after the
address
...

This thread has turned into a bit of a monster, so I may be having trouble
following it or missing messages.  I saw a post from "rush" that sounds like
what you are referring to.  Is "rush" Davorin, or am I missing some posts?

If someone wants to write some simple test code I would be happy to run it
on my "problem machine" as my time permits.  However I probably won't have
time to write the test code for this myself.  I have already spent more time
working on this STB corruption issue than I expected to.  Now that is seems
to be resolved I have to get back to slaying some other dragons.

I would have thought that if this corruption issue could happen in regard to
interfacing with external DLL's that the image itself would be a good deal
less stable.  As I said I don't work in Dolphin on my problem computer any
more, but while I was kicking the development environment around to
experiment with the STB corruptions I did not experience any instability.
Surely Dolphin calls a good many API functions in DLL's during normal use.
However this world is full of strange things, so who really knows?

Chris


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Bill Schwab-2
Chris,

> This thread has turned into a bit of a monster, so I may be having trouble
> following it or missing messages.  I saw a post from "rush" that sounds
like
> what you are referring to.  Is "rush" Davorin, or am I missing some posts?

I hope so :)


> If someone wants to write some simple test code I would be happy to run it
> on my "problem machine" as my time permits.  However I probably won't have
> time to write the test code for this myself.  I have already spent more
time
> working on this STB corruption issue than I expected to.  Now that is
seems
> to be resolved I have to get back to slaying some other dragons.

Fair enough.  I'll throw something together, but like you, I'm over quota on
this for the week.  Well, the real truth is that I have to give a talk on
Thursday, and it's not entirely done.  I'll try to write something next
week.


> I would have thought that if this corruption issue could happen in regard
to
> interfacing with external DLL's that the image itself would be a good deal
> less stable.  As I said I don't work in Dolphin on my problem computer any
> more, but while I was kicking the development environment around to
> experiment with the STB corruptions I did not experience any instability.
> Surely Dolphin calls a good many API functions in DLL's during normal use.
> However this world is full of strange things, so who really knows?

Understood, but one failure every 1500 hours is more than worth pursuing.
The horse is already dead, so I say we practice our backhand  :)

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

rush
"Bill Schwab" <[hidden email]> wrote in message
news:aoihtc$mnn55$[hidden email]...
> > what you are referring to.  Is "rush" Davorin, or am I missing some
posts?
>
> I hope so :)

It's me rush == Davorin Rusevljan, and my appologies to everyone for
confusion about my identity. I thought that since there is another Davorin
also posting here (Mestric), using my nick name would be a good thing. (not
so sure anymore).

rush


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Blair McGlashan
In reply to this post by Bill Schwab-2
"Bill Schwab" <[hidden email]> wrote in message
news:aohfk1$mm4tf$[hidden email]...

> Andy,
>
> > > The code does run though (I set a breakpoint to verify that my tweaked
> > > version was excercising the branch that I altered), so if there is
> > something
> > > wrong with "yourAddress asExternalAddress", then it might make a
> > difference.
> > > Or, is "yourAddress asExternalAddress" guaranteed to be safe?
> >
> > Well I wouldn't ever like to use the word "guaranteed" :-) but providing
> the
> > object in question doesn't go out of scope and get GC'd while that
> external
> > address is used there shouldn't be a probelm. I can't see how that could
> > happen with the code in question since objectToSave is a parameter and
> must
> > be in scope until at least the end of the method.
>
> Understood, but what about the object returned by #yourAddress?  I'm
> wondering if it gets away, perhaps only when it is or isn't already in
some
> particular format.

Its a SmallInteger representation of the address. But think it through, even
if it were a non-immediate object (i.e. one that consumed memory subject to
GC) then it must remain in scope for the required time because it is the
receiver of the #asExternalAddress method.

Regards

Blair


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

rush
In reply to this post by Christopher J. Demers
"Christopher J. Demers" <[hidden email]> wrote in
message news:aoig26$mgmqp$[hidden email]...
> on my "problem machine" as my time permits.  However I probably won't have
> time to write the test code for this myself.  I have already spent more
time
> working on this STB corruption issue than I expected to.  Now that is
seems
> to be resolved I have to get back to slaying some other dragons.
> I would have thought that if this corruption issue could happen in regard
to
> interfacing with external DLL's that the image itself would be a good deal
> less stable.  As I said I don't work in Dolphin on my problem computer any
> more, but while I was kicking the development environment around to
> experiment with the STB corruptions I did not experience any instability.
> Surely Dolphin calls a good many API functions in DLL's during normal use.
> However this world is full of strange things, so who really knows?

Chirs, do not get me wrong, I am, and I think also everyone else is very
gratefull to your finding and effort you have putted into it. And certanly
no one finds you obligated by any means to debug it any more. My writing is
more oriented towards OA, to provide them some information in hope that it
can be of some help to them.

What I am concerned about is that I suspect that your fix has curred a
simptom of a problem, but that cause is still laying somewhere. I say
suspect since there is possibility that the bug is in our dll or in our
smalltalk code, and that our problem is completely unrelated to the problem
you have decribed and fixed. However here are the reasons that make me
suspect that this is another manifestation of the same problem (none of them
I consider a proof but a hint or a lead):

a) 4 bytes corruption in both cases.
b) it does not happen allways, but under prolonged period
c) the problem seems to be exagerated by slower or more stressed machines
d) I am not sure that I understand why the code you have replaced is wrong
and how does it exactly generate 4 bytes corruption.
e) I am not sure that dialog box text corruption presented by Ian(?) is
related to the stb also?

Another thing that leads me to suspect the subtle and rare VM problem is
that we noticed the problem for the first time started when we switched the
VM (98 to 3.x). Another theory can be that In our program we use same "anti
pattern" as OA in the image, and the new Dolphin just provoked it.

I have looked at our code again, and the place where we are getting the
wrong results is something like this:

ByteArray fromAddress: self theKrypt buffer length: len

sometimes, the byte array obtained this way has first 4 bytes corrupted.
again it is possible that the bug is in our dll, but:

a) it worked ok in dolphin 98.
b) the dll does the crc checks on the data several times, and once very
close before the buffer is returned.

we can not completely exclude that we do not currupt the buffer from our
smalltalk code but I would not rate the probability of the bug beeing there
above 15-20%

I have looked at the #fromAddress:length: code and it is like:

^anAddress asExternalAddress
    replaceBytesOf: (self new: anInteger)
    from: 1 to anInteger startingAt:1

-----

I think we have a few unused per incident support tickets, and if OA thinks
this is enough data to get into it, I would be more than happy to spend
them. If not I'll wait another occassion, when more data will surface.
Finding and solving rare and stubbron bugs is patience and endurance game,
but I think not giving up on such problems and solving them even after a few
years, greatly improves the quality of the product. (something like aging
with vine). But my gut feeling tells me this is a right time to squash this
one.

rush


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Blair McGlashan
In reply to this post by Bill Schwab-2
"Bill Schwab" <[hidden email]> wrote in message
news:aoi8l3$m0ugm$[hidden email]...
> Andy, Chris,
>
> > > The thing that puzzles us about your fix is that it is not that piece
of
> > > code (#writeInstanceVariables) that is responsible for writing out the
> > > corrupted class names but, rather, #writeClass:withPrefix:. Anyway, we
> > await
> > > the results of your further testing with interest.
> >
> > I wondered about that too and it kept me from looking at that method
> sooner,
> > here are my thoughts:  My most common corruption before was
> > IdentityDictionary.  I also saw corruption in the display name of a tab
in
> > the class browser once, and lately I have seen it in method names.  The
> > later two seem like they certainly could be in instance variable values
on
> > some level.  I think the first item is as well, since it is contained in
> an
> > STBIdentityDictionaryProxy which has an instance variable for the class.
> I
> > just looked at the code, and I am just as puzzled as you are, because I
> > think the class should still be routed to #writeClass:withPrefix: even
if
> it
> > is in an instance variable.  Well I can't explain why, but the fix I
> > described appears to have eliminated the STB corruption problem for me
on
> my
> > problem computer.
>
> I sorta hate to mention it, but Davorin has raised a question of the same
> kind of corruption absent STB.  Could the problem be something more
generic
> with #yourAddress or #asExternalAddress?

Whoa there. The evidence by no means suggests any such general issue. What
evidence is there to draw the conclusion that Davorin's problem is related?
He mentions that he started to experience corruption in an application that
does a lot of FFI work when he moved from D2.1 to D3. This is far more
likely to be due to a buffer overrun in the external code; the memory
subsystem in 3.x and later is much "tighter", and with less slop such
problems are more likely to show up. Furthermore Dolphin runs thousands of
API calls per second, because it does most things from the image rather than
the VM. The #yourAddress and #asExternalAddress methods are very heavily
used.

Regarding the STB issue, lets look at the evidence there and try and apply
some logical thought rather than jumping to conclusions. There is a distinct
pattern, judging from peoples reports. The first four bytes of certain
strings, very often class names, get corrupted. We can't be sure this is
limited to strings, it could be any byte object, since it would be far less
visible in ByteArrays, and anyway those are less common. It doesn't appear
to affect "pointer" objects, because if it did this would probably crash the
system, although that might not happen if the corruption is caused by
another valid Oop being written.

What is particularly distinctive about this corruption, is that it always
seems to be of the first four bytes. We can visually see that it doesn't go
any further than that. We can conclude that it doesn't start any earlier,
because the visible part of the string object is preceeded by a header that
includes the size of the string. If the header were being corrupted, then
the size of the string would very likely get corrupted, and this would be
obvious since it would either be truncated, or include extra junk at the
end.

Ian suggested that perhaps the memory in question was being prematurely
re-used. A quite reasonable theory, but on consideration it doesn't fit the
observed evidence. If this were the case, then it is unlikely that we would
see the corruption that consistently occurs. Firstly it seems very unlikely
that the size of the new object would be precisely the same as the one who's
memory it is overwriting, and secondly why would the corruption always be
limited to that first four bytes? Ian was mislead by the dump output because
in the one case the raw string has been printed, where as in the other it is
showing octal/hex (can't remember which) escape sequences. This is because
the VM's crash dump code uses C++ iostreams, and presumably previous output
has left the stream in a different mode when the stack object is printed.

A more plausible explanation would be that the string is itself re-using
memory that is still in use by another object. If this were a pointer object
and its first field was updated, then this might conceivably explain the
corruption that has been observed.

Finally let us consider Chris' proposed fix. Unfortunately we can't be
absolutely sure that this fixes the problem, because we cannot reliably
produce it (in fact I have not been able to reproduce it at all yet).
Therefore we have to rely on statistical evidence, and he seems confident
that it has stopped occurring. However, given that it is an intermittent
fault, and we haven't been able to identify a reasonable cause, we can't be
sure that this isn't because we have changed the dynamics of the situation,
and just hidden whatever problem exists. In particular the fact that most of
the corrupted strings that have been observed have been the class names that
are not even written by this code, is suggestive. Remember that those class
names are written from a string which is a copy of the class name symbol,
and that is copied again into a byte array that is then written. No one has
reported that the original class name has been permanently corrupted.
Therefore it seems very unlikely that the code in question could be directly
responsible for the corruption of the class name objects - those names are
not even in existence at any time it runs, and they have already been
written to the stream.

So why does it appear to correct the problem (if it really does)? At the
moment its just guess work, but we could speculate that it indicates a
possible issue with the stream block write operation. The change switches
back to using #nextPut: on a byte-by-byte basis, which would certainly
change how the stream is used. However the intermittent nature of the fault
does not really support this theory.

In summary I don't think we yet have enough to go on. A hotfix is
undoubtedly useful, but I won't be satisfied until we can produce a
reproduceable case that we can reliably correct, or failing that a
reasonably explanation of what is wrong and why the fix works.

Regards

Blair


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Bill Schwab-2
Blair,

> Whoa there. The evidence by no means suggests any such general issue. What
> evidence is there to draw the conclusion that Davorin's problem is
related?
> He mentions that he started to experience corruption in an application
that
> does a lot of FFI work when he moved from D2.1 to D3. This is far more
> likely to be due to a buffer overrun in the external code; the memory
> subsystem in 3.x and later is much "tighter", and with less slop such
> problems are more likely to show up. Furthermore Dolphin runs thousands of
> API calls per second, because it does most things from the image rather
than
> the VM. The #yourAddress and #asExternalAddress methods are very heavily
> used.

With respect, such leaps of illogic (which I freely admit is the case - it's
legal, in fact encouraged in brainstorming sessions, which is what this has
been) have uncovered bugs in Dolphin, and in my code too on several
occaisions.  Granted, #yourAddress and #asExternalAddress are heavily used.
BUT, the concatenation of them occurs only four times that I found with a
naive search - that same combination is at the heart of something that Chris
found suspicious, so I ran with it just as Chris ran with a suggestion from
Ian.  I'll write the test unless Davorin has something ready made, and Chris
has agreed to run it on a machine that has (thankfully) exhibited an ability
to reproduce a long standing bug.

That's not meant to trivialize the other excellent analysis that you've
provided.  Here's hoping you'll soon put it all together and patch the real
problem.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Blair McGlashan
In reply to this post by Blair McGlashan
"Blair McGlashan" <[hidden email]> wrote in message
news:aoja9i$l82g0$[hidden email]...
> [.... snip...]
> [re the fix]
> So why does it appear to correct the problem (if it really does)? At the
> moment its just guess work, but we could speculate that it indicates a
> possible issue with the stream block write operation. The change switches
> back to using #nextPut: on a byte-by-byte basis, which would certainly
> change how the stream is used. However the intermittent nature of the
fault
> does not really support this theory.

We have identified a definite bug in WriteStream>>next:putAll:startingAt:,
or more specifically the fact that FileStream does not override it. This can
be seen by running the attached SUnit. As you will see any block write to a
file stream through #next:putAll:startingAt: can fail to get written out if
it fits within a "page" (8192 bytes), and if the write in question was the
first and last to that page. The reason is simple: The
WriteStream>>next:putAll:startingAt: implementation optimizes the case where
a block fits within the "collection", by using a block transfer operation.
Where the write is too large, it goes through #nextPut:. FileStreams are
implemented as a sort of write through cache onto a page of a file. Correct
updating relies on a dirty flag to know whether the current page has been
updated and needs to be written back when a new page is loaded (or the file
stream closed). If any block write to a FileStream occurs when the dirty
flag is not yet set, and the write fits within the current page, and no
further writes occur to the page that set the dirty flag, then the page will
not get written out, and the updates are lost.

I have not been able to come up with a scenario which would explain how the
bug could create the sort of corruption seen. However, there is no doubt
that this is a real bug, and Chris' change (which is essentially to fall
back on #nextPut: rather than the #next:putAll:startingAt:) would work
around this bug, and this is a possible explanation as for why his patch
works.

I have also attached a patch for this so that those who are experiencing the
problem can try it out. Of course you'll need to use the original STB code,
or this patch will not get exercised.

Regards

Blair

------------------------------------------
"Filed out from Dolphin Smalltalk XP 2002 release 5.10"!

TestCase subclass: #FileStreamTest2
 instanceVariableNames: ''
 classVariableNames: ''
 poolDictionaries: ''
 classInstanceVariableNames: ''!
!FileStreamTest2 methodsFor!

failureLog
 "dialect-specific"
 "VA - System errorLog"
 "VW, Dolphin - Transcript"

 ^Transcript
!

isLogging
 ^true!

testNextPutAllStartingAt
 | stream name data r written |
 name := File temporaryFilename.
 r := Random new.
 #(0 1 512 4095 4096 4097 8191 8192 8193 16383 16384 16385) do:
   [:i |
   data := ((1 to: i) collect: [:each | (r next * 256) truncated])
asByteArray.
   stream := FileStream write: name text: false.

   [stream
    next: data size
    putAll: data
    startingAt: 1] ensure: [stream close].
   stream := FileStream read: name text: false.
   written := [stream contents] ensure: [stream close].
   self
    assert: data = written
    description: 'Failed on write of ' , i printString , ' bytes'
    resumable: true]! !
!FileStreamTest2 categoriesFor: #failureLog!Accessing!public! !
!FileStreamTest2 categoriesFor: #isLogging!Accessing!public! !
!FileStreamTest2 categoriesFor: #testNextPutAllStartingAt!public!unit tests!
!

"---------------------------------------------------------------------------
-----
    And the patch - run the test first before installing this to see the
failure
    The eventual fix will not be quite like this, because there is a
refactoring opportunity here,
    but this is the simplest, lowest risk, fix.
----------------------------------------------------------------------------
-----"!

!FileStream methodsFor!

next: size putAll: aSequenceableCollection startingAt: start
 "Append countInteger elements of aSequenceableCollection from startInteger
 to the receiver. Answer aSequenceableCollection."

 | stop |
 stop := position + size.
 stop <= writeLimit
  ifTrue:
   ["The block fits into the current collection and can be
   written directly"

   collection
    replaceFrom: position + 1
    to: stop
    with: aSequenceableCollection
    startingAt: start.
   position := stop.
   self beDirty]
  ifFalse:
   ["The block will not fit within the current collection, so
   fall back on the slow way"

   start to: start + size - 1 do: [:i | self nextPut:
(aSequenceableCollection at: i)]].
 ^aSequenceableCollection! !
!FileStream categoriesFor: #next:putAll:startingAt:!accessing!public! !


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Chris Uppal-3
In reply to this post by Christopher J. Demers
Christopher,

> If someone wants to write some simple test code I would be happy to run it
> on my "problem machine" as my time permits.  However I probably won't have
> time to write the test code for this myself.  I have already spent more
time
> working on this STB corruption issue than I expected to.  Now that is
seems
> to be resolved I have to get back to slaying some other dragons.

Could you (when you have time) try rewriting the offending bit to:

 class isBytes
    ifTrue: [| was |
       was := objectToSave yourAddress.
       stream next: basicSize putAll: was asExternalAddress startingAt: 1.
       self assert: [was = objectToSave yourAddress]]
    ifFalse: ....

and see if the test throws assertion failures at the same time as the
errors.

Probably a waste of time, but you never know...

    -- chris


Reply | Threaded
Open this post in threaded view
|

Re: The four byte bug lives (Perhaps a FIX!?!?!)

Chris Uppal-3
In reply to this post by Bill Schwab-2
Bill,

> Thanks for providing the option!

You're thanking the wrong "Chris" I'm afraid.  But I'm sure that Christopher
J Demers took the will for the deed.

Whereas I am just:

    -- chris


1234