Sockets race conditions

rush
I know that D6 is scheduled to have a new Sockets implementation, but maybe
this would still be of some use for D6 as an interesting example, and of
course for the old implementation. So here is something I have occasionally
experienced under heavier usage of sockets. This is under D3.06, but I think
sockets have not changed significantly since.

Occasionally Socket>>receiveByteArray would raise an exception with socket
error code 0, which is not a valid socket error code. Sniffing the network
traffic showed that the other side of the TCP connection did not close the
socket, and that there is no apparent reason why the connection would go
belly up. After some debugging it appeared that:

Socket>>basicReceiveByteArray: anInteger
 "Private - Reads anInteger bytes from the socket.
 Answers a ByteArray representing the bytes read."

 | bytesReceived byteArray grijeska |
 byteArray := ByteArray new: anInteger.
 bytesReceived := WSockLibrary default
  recv: self asParameter
  buf: byteArray
  len: byteArray size
  flags: 0.

 bytesReceived > 0 ifTrue:
  [ "Success."
  ^byteArray copyFrom: 1 to: bytesReceived ].
 bytesReceived = 0 ifTrue:
  [ "Socket has been closed."
  SocketClosed signal ].
 "Some other error."
 self error.
----------
receives -1 from the socket call, but the subsequent

SocketAbstract>>error
 "Private - Throw a SocketError exception.
 We MUST do the wsaGetLastError here rather than leaving it to the SocketError class.
 Otherwise it is possible (especially with loading classes from STC files) that the
 last error is lost by the time it is fished out by SocketError."

 | err |

 err := WSockLibrary default wsaGetLastError.

 SocketError signalWith: err.
----------

gets 0 into err, which stands for "no error", i.e. the last operation was OK.
My bet is that the original error was a WOULDBLOCK one, but because of some
race condition the error code was reset to 0 before it was read. The
consequence is that an exception is raised on the socket, while the
appropriate action would probably be to retry the read. As far as I am aware,
there is no other user-level socket operation going on at that time.

Reading wsaGetLastError a little earlier is possible, but I am not sure it
would alleviate the situation. Maybe protecting the recv call and the
getLastError with a critical section would help, but I am also not completely
sure this would protect me from something that is happening in the Dolphin VM
or the Windows socket implementation. I am actually inclined to interpret
socket error 0 as WOULDBLOCK.
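
For illustration, the critical-section idea might look roughly like the
following. This is an untested sketch, not a proposed patch: it assumes that
BlockClosure>>critical evaluates the block with Process switching disabled,
and it reads the error code inside the same block as the recv call instead of
going through SocketAbstract>>error. It would not help if the code is being
cleared by the VM or by Winsock itself.

```smalltalk
Socket>>basicReceiveByteArray: anInteger
	"Private - Sketch only. Issue recv and read the Winsock error code as one
	uninterruptible step, so that no other Smalltalk Process can make a
	socket call between the two."

	| bytesReceived byteArray err |
	byteArray := ByteArray new: anInteger.
	err := 0.
	"ASSUMPTION: #critical disables Process switching for the duration of the block."
	[bytesReceived := WSockLibrary default
				recv: self asParameter
				buf: byteArray
				len: byteArray size
				flags: 0.
	bytesReceived < 0 ifTrue: [err := WSockLibrary default wsaGetLastError]] critical.

	bytesReceived > 0 ifTrue: [^byteArray copyFrom: 1 to: bytesReceived].
	bytesReceived = 0 ifTrue: [^SocketClosed signal].
	"Some other error; err was captured immediately after the failing call."
	^SocketError signalWith: err
```

If err still comes back as 0 with this in place, that would suggest the reset
is not caused by another Smalltalk Process.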

rush
--
http://www.templatetamer.com/


Re: Sockets race conditions

rush
"rush" <[hidden email]> wrote in message
news:cmd3rk$les$[hidden email]...
> I know that D6 is scheduled to have a new Sockets implementation, but maybe
> this would still be of some use for D6 as interesting example, and of course
> for old implementation. So here is something I have occasionally experienced
> under more heavy usage of sockets.This is under D3.06 but I think sockets
> have not changed significantly since.

Just to add: I have checked, and 5.1 has the same relevant socket code, so it
would have the same problems.

rush
--
http://www.templatetamer.com/


Re: Sockets race conditions

Chris Uppal-3
In reply to this post by rush
rush wrote:

> [details snipped]

[I'm not sure how strongly relevant this is -- only tangentially, perhaps --
but you've reminded me of a point I've been meaning to raise for some time,
so...]

There's a general problem with external interfacing with libraries that use the
"errno" or GetLastError() style of error reporting.  Even if those libraries
are carefully designed so that the status flag is thread-safe (thread-local)
the way Dolphin multiplexes its Processes onto a single OS thread makes it
difficult or impossible to avoid the risk that some Process will pre-emptively
see or overwrite an error intended for some other Process.

I have exactly that problem (and no solution) in JNIPort.  There, one is
expected to check after every call into Java to see if an exception was thrown
(so I do), but there's a risk, theoretically at least, that the wrong Process
will see the exception.

A related problem is that when handling a callback from Java into Smalltalk, I
have to make some "global" (to Smalltalk) state changes for the duration of
the callback before anything is "allowed" to talk to Java again.  And in a
similar way to the above, there's the possibility that once the Dolphin VM is
in charge again (in the callback) it will schedule some other Process /before/
the Smalltalk code that is actually handling the callback itself, and that that
Process will make use of Java before the necessary changes have been put in
place. (Or similarly at the end of the callback, may make use of the global
state after it has been returned to "normal" but before the VM has really
returned from the callback)

At least the above are theoretical possibilities according to my understanding
of how the VM works.  I have to admit that I've not been able to make either of
them manifest in practice (and I've tried pretty hard) so maybe there's
something I'm missing.

If not then I think what's needed is some way to mark an external call so that
when it returns the calling Process is initially non-interruptible.  Similarly
it should be possible to mark an ExternalCallback so that the VM enters it in a
state where the handling process is initially non-interruptible (and has a way
to return to that state to clean up at the end of the callback, before
returning to the VM).

    -- chris


Re: Sockets race conditions

rush
"Chris Uppal" <[hidden email]> wrote in message
news:[hidden email]...
> [details snipped]

Yes, that also seems to be a related (or similar) problem. Anyway, in the case
I am seeing, it seems there is no other Smalltalk Process that might clear the
error status. I cannot be 100% sure at this moment, but let's say I am 95%
sure that when this happens there is only one Socket>>receiveByteArray call
outstanding (blocked on read), and no other socket operations are issued
during that time by my program.

What seems like a possible candidate for clearing the error status is the

WinAsyncSocket>>wsaEvent: message wParam: wParam lParam: lParam

Windows message handler in the Sockets implementation. Maybe some of these
events, under some circumstances, cause the error to be cleared. It seems that
many of these messages are delivered late in the problem case; i.e. the socket
has already consumed the received data, and many notifications arrive
afterwards.

I am currently testing a workaround that raises WOULDBLOCK when the error is
0, and it seems to solve the problem without bad side effects, but I will
conduct more testing. As far as I understand, it is a pretty safe workaround
for sockets, since if there really is an error on the socket, subsequent calls
will fail anyway. If it actually was WOULDBLOCK, raising WOULDBLOCK is the
right thing to do anyway. The only concern is that under some circumstances it
might cause the socket to lag until a new wsaEvent for read is generated. But
since there generally seems to be a surplus of those messages, and in my case
I more or less always have something coming in on the socket, it seems this
will not be an issue. If it were, periodically unblocking the socket processes
"just in case" would probably do the trick.
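
In outline, the workaround amounts to something like the sketch below. The
only hard fact in it is that WSAEWOULDBLOCK is 10035 in Winsock; it assumes
that SocketError class>>signalWith: treats a substituted 10035 exactly as it
would one reported by Winsock itself, which I have not verified in every
version.

```smalltalk
SocketAbstract>>error
	"Private - Sketch of the workaround. An error code of 0 after a failed
	call is treated as WSAEWOULDBLOCK, so that the caller waits and retries
	instead of treating the connection as broken."

	| err |
	err := WSockLibrary default wsaGetLastError.
	"ASSUMPTION: 0 here means the real code was lost, not that the call succeeded."
	err = 0 ifTrue: [err := 10035	"WSAEWOULDBLOCK"].
	SocketError signalWith: err
```

A genuine failure would still surface on the next socket call, which is why
the substitution looks safe for sockets.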

rush
--
http://www.templatetamer.com/


Re: Sockets race conditions

Schwab,Wilhelm K
In reply to this post by Chris Uppal-3
Chris,

> There's a general problem with external interfacing with libraries that use the
> "errno" or GetLastError() style of error reporting.  Even if those libraries
> are carefully designed so that the status flag is thread-safe (thread-local)
> the way the Dolphin multiplexes its Processes onto a single OS thread makes it
> difficult or impossible to avoid the risk that some Process will pre-emptively
> see or overwrite an error intended for some other Process.
>
> I have exactly that problem (and no solution) in JNIPort.  There, one is
> expected to check after every call into Java to see if an exception was thrown
> (so I do), but there's a risk, theoretically at least, that the wrong Process
> will see the exception.

I don't know whether you have this option available to you, but I wrote
a very simple wrapper DLL around a library that forced me to do such
error checking.  In my case, I was having problems not so much because
anything over-wrote the error, but because the error was stored per OS
thread, and Dolphin was potentially[*] giving me a different OS thread
to check for the error, in which case it didn't work at all.

My wrapper DLL looks suspiciously like what (IMHO) the wrapped library
should have been exporting.  It is a handful of functions that make the
"real" call, check the error status, and return that into a buffer
provided by Dolphin.  It works regardless of which threads Dolphin uses
to make the calls.

[*] IIRC, typically

Have a good one,

Bill


--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: Sockets race conditions

Chris Uppal-3
Bill,

> I don't know whether you have this option available to you, but I wrote
> a very simple wrapper DLL around a library that forced me to do such
> error checking.

Mmmyess...  I /could/ do that, but it isn't really the solution I want.

For one thing, it doesn't solve the problem with callbacks.

For another thing there are some 250 methods that would need wrapping in this
way...

But mainly, I think that this is a general problem and needs a general solution
(not necessarily the one I suggested).  I mean if the line is "Dolphin has
excellent abilities to interface with external code, all you have to do is
write a wrapper DLL for it" then something's wrong...


> My wrapper DLL looks suspiciously like what (IMHO) the wrapped library
> should have been exporting.

I think it's fair not to expect Dolphin to be able to compensate for every
oddly/wrongly designed external library.  If the library's design is bad enough
then creating a sensible wrapper for it may be the only feasible approach.  But
in the cases I described I don't think the external library /is/ badly designed
(at least, not in this way ;-) so I'd like to be able to connect to it without
"messing".

    -- chris


Re: Sockets race conditions

Schwab,Wilhelm K
Chris,

> Mmmyess...  I /could/ do that, but it isn't really the solution I want.
>
> For one thing, it doesn't solve the problem with callbacks.
>
> For another thing there are some 250 methods that would need wrapping in this
> way...

I would consider that impractical, to say the least.  In my case, I
needed one callback (trivial thing too, I have *no* idea why they didn't
provide a default - more bad design), and only a few functions.
However, it saved my bacon at least in this one case.


> But mainly, I think that this is a general problem and needs a general solution
> (not necessarily the one I suggested).  I mean if the line is "Dolphin has
> excellent abilities to interface with external code, all you have to do is
> write a wrapper DLL for it" then something's wrong...

Indeed.  IIRC, Blair was intending to allow D6 to associate an OS thread
with each Process (presumably that makes overlapped calls) to avoid the
problem I had.  But that does not sound ideal either, unless...

Blair, would it make sense to have an overlapSameThread (hopefully you
can think of a better name) call type that would signal the VM to begin
associating a particular OS thread with the calling Process?  That might
allow most other Processes to benefit from pooling.  Put another way, it
would punish only systems that make calls to libraries with thread
affinities.


> I think it's fair not to expect Dolphin to be able to compensate for every
> oddly/wrongly designed external library.  If the library's design is bad enough
> then creating a sensible wrapper for it may be the only feasible approach.  But
> in the cases I described I don't think the external library /is/ badly designed
> (at least, not in this way ;-) so I'd like to be able to connect to it without
> "messing".

Reasonable on both counts.  However, I wish that "Writing Solid Code"
were required reading somewhere.  We would have less trouble if that
were the case.  Maybe we should have it printed on leaflets and drop
them over Redmond :)  Yes, I know it is an MS Press book, which only
adds to the satire :(

Have a good one,

Bill


--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: Sockets race conditions

jas
In reply to this post by Chris Uppal-3
Chris Uppal wrote:

> rush wrote:
>
>
>>[details snipped]
>
>
> [I'm not sure how strongly relevant this is -- only tangentially, perhaps --
> but you've reminded me of a point I've been meaning to raise for some time,
> so...]
>
> There's a general problem with external interfacing with libraries that use the
> "errno" or GetLastError() style of error reporting.  Even if those libraries
> are carefully designed so that the status flag is thread-safe (thread-local)
> the way the Dolphin multiplexes its Processes onto a single OS thread makes it


It seems to me that we have a limited number of possibilities here:

1.  Dolphin always runs on the same single OS thread.
2.  Dolphin runs on a single OS thread at a time,
     but which one (which OS thread) is effectively random.
3.  Dolphin runs on more than one OS thread, but the mapping
     from Dolphin process to OS thread is not fixed.
4.  Dolphin maps Processes 1-1 onto OS threads.

Seems the actual situation must be (3).
{Because (4) would, if true, be a well known feature,
  and because (2) would, it appears, make it impossible
  to receive and/or react to (at least the) conditions
  reported to the originating, but now wrong, thread;
  and because (1) would mean that all of Dolphin was
  blocked until the external call completed
}

Even for case (3), Dolphin should always have the correct
lastError *somewhere*, if, as you said, the external library
is storing it thread-local.  And Dolphin needs only to
deliver said flag to the associated (initiating) process.

Meaning Dolphin must record that "expectedInfo" on
"thisOSThread" goes to "thisProcess" (the running
process which is about to cause such info to
be expected), before the external call is issued.
At that point, the lastError processing is safe,
because any overlapped call will occur on a
different OS thread.


> difficult or impossible to avoid the risk that some Process will pre-emptively
> see or overwrite an error intended for some other Process.


Presumably, each process has a unique "lastError"
slot, into which the arriving per-OS-thread flag
is stored, by way of the mapping above?


> I have exactly that problem (and no solution) in JNIPort.  There, one is
> expected to check after every call into Java to see if an exception was thrown
> (so I do), but there's a risk, theoretically at least, that the wrong Process
> will see the exception.


I don't quite see where this specific concern originates?


>
> A related problem is that when handling a callback from Java into Smalltalk, I
> have to change some "global" (to Smalltalk) state changes for the duration of
> the callback before anything is "allowed" to talk to Java again.  And in a
> similar way to the above, there's the possibility that once the Dolphin VM is
> in charge again (in the callback) it will schedule some other Process /before/
> the Smalltalk code that is actually handling the callback itself, and that that
> Process will make use of Java before the necessary changes have been put in
> place. (Or similarly at the end of the callback, may make use of the global
> state after it has been returned to "normal" but before the VM has really
> returned from the callback)


Sounds like a possible circular embrace to me -
you're allowing reentry, but you aren't reentrant.

St -> Java
         \
          ---- calls back to St
                               \
                                ---- changes Global state
                                     *after which* Java is
                                     allowed to be reentered.

Is this what you mean?
If so, then you either need
to "giantLock" the whole St->Java bridge:


St -> critical:
         [ isJavaBlocked?
                 /\
EWouldBlock <- Y  N -> beJavaBlocked
         ]
          \
           Java
               \
                -> callback St
                             /
       Change Global State <-
        \
         -> critical: [beJavaBlockedNot]
                                      /
                           proceed <-

===

or ensure that the entirety of the Java->St callback
runs to completion (NEVER yields)
and runs at higher priority than anything else.


>
> At least the above are theoretical possibilities according to my understanding
> of how the VM works.  I have to admit that I've not been able to make either of
> them manifest in practise (and I've tried pretty hard) so maybe there's
> something I'm missing.

Reentrancy is nearly impossible to test for,
without special hardware.  Usually, the best
you can do is "soak test" under known load,
for periods which are long enough to have
encountered known historical failures
with some statistically defensible comfort.

(Murphy's law, and corollaries, apply more
  strongly in this area than perhaps anywhere
  else.)

>
> If not then I think what's needed is some way to mark an external call so that
> when it returns the calling Process is initially non-interruptible.


Probably too late - if you're coordinating w/r/t other
processes, you'll need to protect the entry-into and
returning-from parts of the external call.

Once the external call completes a transition
back to the caller, things had better already be
in the necessary (process) state - i.e. the same
state as before the call.

Because there is no way to prevent another process
from having *already* breached the safety of the
thing you're concerned about, by the time this call
has even begun the returning-from transition.

{ The exception to this is if the "special stuff"
   is entirely local to the caller, and is *only*
   referenced *after* the external call.
}


> Similarly
> it should be possible to mark an ExternalCallback so that the VM enters it in a
> state where the handling process is initially non-interruptible (and has a way
> to return to that state to clean up at the end of the callback, before
> returning to the VM).


Right.


Regards,

-cstb


Re: Sockets race conditions

Chris Uppal-3
In reply to this post by Schwab,Wilhelm K
Bill,

> [...]I wish that "Writing Solid Code"
> were required reading somewhere.  We would have less trouble if that
> were the case.  Maybe we should have it printed on leaflets and drop
> them over Redmond :)  Yes, I know it is an MS Press book, which only
> adds to the satire :(

Whisper it, but I've never read that.  Not even on my bookshelf waiting to be
read.  I've heard that it's pretty good, but frankly I find the provenance
off-putting...

    -- chris


Re: Sockets race conditions

Chris Uppal-3
In reply to this post by jas
cstb wrote:

> It seems to me that we have a limited number of possibilities here:
> [...]
> Seems the actual situation must be (3).

Well, we don't have to guess -- Blair has gone over this in some detail before.

The following is my understanding.  Corrections and amplifications from anyone
very much appreciated...

All Dolphin Processes share one OS-thread.  Always the same thread.  Hence all
Smalltalk code is running (as far as the OS knows) as one thread -- call that
the "Smalltalk thread".

When you issue an external call that has been marked "overlapped" the Dolphin
VM will execute that call on a separate OS-thread.  That thread is either
started specially for the occasion, or is found in a pool of threads that are
used for this purpose.  Thus for overlapped calls there is no "thread
affinity".  For non-overlapped calls (the bulk of them) there is perfect thread
affinity since the external call will always happen on the Smalltalk thread.

While an external call is outstanding, if it is not overlapped, all Dolphin
processes are blocked (since they all run on the same thread, and that thread
is doing something else).  If it /is/ overlapped, then other Dolphin Processes
can proceed (running on the Smalltalk thread), and only the issuing Process is
blocked (by the VM) until such time as the thread for the overlapped call
signals the VM that it has completed.  At which time the VM passes the response
to the Smalltalk thread where the calling Process is unblocked.  The overlapped
call thread then falls back into the thread pool, where it will die if it is
not fished out and reused in a small time (a few seconds, I think).  The
overlapped call mechanism is basically a slightly hacky way of allowing one to
issue slow external calls without blocking the whole image, but without the
very considerable complexity (and probably slowness) of using OS-threads for
Smalltalk Processes.

On the Smalltalk thread, the VM schedules the Processes preemptively, taking
due note of Process priorities.  (I don't know what the algorithm is).  This is
/not/ like the classic mostly-non-preemptive scheduling in (as I understand it)
ST-80 and VW.  (And a bloody good thing too, IMO).

The Dolphin VM uses a few more OS-threads internally (Windows Task Manager
shows that it uses 5 threads before any overlapped calls are issued), but I
don't know what for.  I think Blair mentioned something to do with the
"background" garbage collector once, but I'm not clear on why it needs an
OS-thread when it isn't actually a classic "background" gc algorithm (it still
halts the VM).  Maybe the COM server stuff needs an OS-thread or two as well.
Just guessing...

That architecture does make it difficult to interact with external libraries
that use the common errno-style, or GetLastError()-style, way of reporting
errors (and other data) back to the caller.  If such a library is badly
designed then it'll use a single global variable to hold the data.  Such
libraries are becoming rare, and these days the data is normally held
thread-local.  That's OS-thread-local, since OS-threads are the lingua-franca
of multiprocessing (in much the same way as files are the lingua-franca of data
persistence).  Since Dolphin is running several Processes on the same
OS-thread, and since they are preemptive, the possibility exists[*] that some
Process will issue an external call (not overlapped), that call will write an
error status (or an exception record, in the case of JNIPort) into some
thread-local place.  If the VM schedules another Process to run before the
calling Process reads that data, then there's a chance that the other Process
will issue another external call (to the same library) that will overwrite the
error flag incorrectly.

([*] or at least, it /might/ exist if my understanding's correct and I'm not
missing something -- I'm hoping Blair will pop up to clarify ;-)

OTOH, if the external call /is/ overlapped, then the error status will be saved
nicely, but in the thread-local storage associated with the helper thread that
is probably now back in the thread pool, or is otherwise unavailable.  I think
this second scenario is the one that is affecting Bill, and for which he needed
a helper DLL.  My own theoretical problems, and possibly the real ones that
'rush' is seeing too (though not necessarily) are from the first scenario.

The Dolphin VM does have the ability to run non-preemptively (see
BlockClosure>>critical), and my suggestion is that the above problems (for
non-overlapped calls) can be avoided by allowing us to tell the VM that it
should disable pre-emption before returning from some external calls, and
similarly that it should disable pre-emption before executing the Smalltalk
code for some ExternalCallbacks.


> >
> > A related problem is that when handling a callback from Java into
> > Smalltalk, I
> > have to change some "global" (to Smalltalk) state changes for the
> > duration of
> > the callback before anything is "allowed" to talk to Java again.
> [...]
> Sounds like a possible circular embrace to me -
> you're allowing reentry, but you aren't reentrant.

There are several possibilities for deadlock (see
   http://www.metagnostic.org/DolphinSmalltalk/JNIPort/threading.html
for lots of detail, if you are interested).  And putting a "giant lock" around
something that might issue a callback would make it worse.  What I need to do
(at least theoretically) is something like:

    Dolphin calls Java
    {
        Java calls back into Dolphin
        {
            The Dolphin VM disables pre-emption then
            calls my (Smalltalk) code.

            My code executes, changes the
            global pointer, and then re-enables
            pre-emption.

            .... stuff happens, including handling
            the request, but also including other
            Processes running, until...

            My code disables pre-emption,
            changes the global pointer back,
            and then returns to the VM

            The VM clears the no-preemption state (which
            has no immediate effect since no Smalltalk
            code is executing).
        }
        Dolphin returns to Java.
    }
    Java returns to Dolphin, where the normal, preemptive,
    scheduling of Smalltalk code continues.

(BTW, it's not obvious from the above that it's safe for other Processes that
were using the global pointer before the beginning of the sequence to be
allowed to continue, using a different pointer, while the callback is
executing.  Or, similarly, to start an operation during the callback and
complete it after the sequence has finished.   In fact it is safe since the
system's largely stateless (the JNI external library is designed correctly in
this sense), so provided I take care never to cache a local copy of the global
pointer (and take a couple of other precautions that are only relevant to JNI),
I /think/ it should be OK.)

    -- chris


Re: Sockets race conditions

Schwab,Wilhelm K
In reply to this post by jas
> Presumably, each process has a unique "lastError"
> slot, into which the arriving per-OS-thread flag
> is stored, by way of the mapping above?

My understanding is that it indeed does this, but lastError is not
enough for all cases, as there are libraries that do similar things with
their own flags.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]


Re: Sockets race conditions

Schwab,Wilhelm K
In reply to this post by Chris Uppal-3
Chris,

>>[...]I wish that "Writing Solid Code"
>>were required reading somewhere.  We would have less trouble if that
>>were the case.  Maybe we should have it printed on leaflets and drop
>>them over Redmond :)  Yes, I know it is an MS Press book, which only
>>adds to the satire :(
>
>
> Whisper it, but I've never read that.  Not even on my bookshelf waiting to be
> read.  I've heard that it's pretty good, but frankly I find the provenance
> off-putting...

Actually, it's the sub-title ("Microsoft's techniques for writing
bug-free C programs") that gets me.  However, just because they
apparently don't read it, is no reason for you to ignore it :)  It might
help that the author gives up some nice dirt on the early years of Word,
all with a focus on how to learn from the problems they had. Better yet,
it is an excellent book.  The disassembler (or is it a CPU emulator?? -
something like that) discussion alone is worth the price of the book.

Anything that goes from candy machines to how to avoid the problems
we've been discussing in this thread is a must-read, right?

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]