Strange socket behavior

Hi Guys -

I debugged a *really* interesting problem today. For some reason, our
Croquet sessions failed seemingly at random with socket timeouts in strange
places. The main clue we had was that it was somehow related to a rather
large space being replicated over a rather slow line (a DSL uplink as
the source for replication).

Tracking this down into its gory details, I ended up with a test case
like this:

   data := ByteArray new: 10000000.
   socket := Socket newTCP.
   socket connectTo: 'myHost' port: myPort.
   socket sendData: data count: count.
   socket sendData: 'Hello' count: 5.

When I did this over a slow uplink it would *reliably* time out on the
second sendData:count: call. But why? Simply put, because the Windows
sockets interface doesn't function the way I *thought* it would. I
had expected the Windows send() call to accept only a "TCP packet size"
worth of data, but it turns out it takes *everything*, right down to the
last byte, in the first call. This means the first sendData: call
returns immediately while the OS is still chugging along trying to
get the data out to the interface. The next sendData: call then waits
for a response within the default ConnectionTimeOut, which is shorter
than the time needed to complete the previous send.

Why is this relevant? I believe pretty much all code we currently have
is written under the assumption that the primitive will only accept
"reasonable" amounts of data. Any code that pushes large amounts of data
and expects the socket interface to handle it will be affected by this
problem. I also suspect that other platforms may show similar behavior,
so some testing is in order. If you have had random unexplained timeouts
when sending large data buffers over slow lines, splitting them up into
smaller ones as a workaround may be just the ticket until I have fixed
this problem in the VM, i.e., made the VM take only "reasonable" amounts
of data in each call so that the caller can rest assured that the
timeout values are meaningful.
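
The workaround of splitting a large buffer can be sketched as follows. This is an illustration in Python rather than Squeak, and the helper name `send_in_chunks`, the 4096-byte chunk size, and the 5-second timeout are assumptions made up for the example, not values taken from the VM:

```python
import socket
import threading

def send_in_chunks(sock, data, chunk_size=4096, timeout=5.0):
    """Send data in bounded pieces so each timeout only covers one chunk.

    Hypothetical helper; chunk_size and timeout are illustrative values.
    Returns the number of bytes handed to the OS.
    """
    sock.settimeout(timeout)
    view = memoryview(data)
    sent = 0
    while sent < len(view):
        # send() may accept fewer bytes than offered; advance by what it took.
        sent += sock.send(view[sent:sent + chunk_size])
    return sent

def demo():
    """Push 1 MB through a local socket pair and count what arrives."""
    a, b = socket.socketpair()
    received = []

    def drain():
        total = 0
        while True:
            chunk = b.recv(65536)
            if not chunk:  # empty read means the sender closed its end
                break
            total += len(chunk)
        received.append(total)

    reader = threading.Thread(target=drain)
    reader.start()
    sent = send_in_chunks(a, bytes(1_000_000))
    a.close()  # signals end of stream to the reader
    reader.join()
    b.close()
    return sent, received[0]
```

Calling `demo()` returns the number of bytes sent and the number the reader saw. The local socket pair only stands in for a real, slow connection.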

I would also be interested in what other platforms do. Basically, the
question is whether the primitive returns immediately in a single call,
consuming all the data, or whether it will loop in
Socket>>sendData:count:. If you have evidence either way, please
post your results to VM-dev (incl. the precise version of your OS).

   - Andreas

Re: Strange socket behavior


On 2-Oct-06, at 8:01 PM, Andreas Raab wrote:

>   data := ByteArray new: 10000000.
>   socket := Socket newTCP.
>   socket connectTo: 'myHost' port: myPort.
>   socket sendData: data count: count.
>   socket sendData: 'Hello' count: 5.

You mean an example like

   serverAddr := NetNameResolver addressForName: 'localhost' timeout:
   count := 10000000.
   data := ByteArray new: count.
   socket := Socket newTCP.
   socket connectTo: serverAddr port: myPort.
   socket sendData: data count: count.
   socket sendData: 'Hello' count: 5.

man send

Depending on your flavor of Unix, it may or may not allow you to grab
10,000,000 bytes of storage.
If it does, then

    If no messages space is available at the socket to hold the
    message to be transmitted, then send() normally blocks, unless
    the socket has been placed in non-blocking I/O mode.  The
    select(2) call may be used to determine when it is possible to
    send more data.

We don't run the socket in non-blocking mode, so the socket will block
if there isn't space to transmit the message. I note that 10MB will
block and time out if I don't read the data on the server, but at 100K
it will cheerfully accept the data and say it sent the bytes. How much
it will accept before blocking depends on the window size etc. Btw, my
home gigabit intranet (RFC 1323) is likely configured to allow lots of
bytes in flight, so other networks might abort at 100K. In this case
the socket won't block until I've sent the agreed window size of data,
which is > 100K.
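
One way to see how much a given platform accepts in a single call is to put a socket into non-blocking mode and offer it a large buffer while nobody is reading. A minimal sketch in Python, using a local socket pair as a stand-in for a real connection (the helper name `probe_single_send` is hypothetical, and the number it reports depends entirely on the OS and its buffer settings):

```python
import socket

def probe_single_send(payload_size=10_000_000):
    """Return how many bytes one non-blocking send() accepts when no one reads."""
    a, b = socket.socketpair()
    try:
        a.setblocking(False)
        try:
            # In non-blocking mode send() takes what fits and returns that count.
            accepted = a.send(bytes(payload_size))
        except BlockingIOError:
            # The OS had no room for even a single byte.
            accepted = 0
        return accepted
    finally:
        a.close()
        b.close()
```

Running `probe_single_send()` on different machines gives a rough feel for the "how much before it blocks" figure, though a real TCP connection over a slow line will behave differently from a local pair.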

We have no idea if the data has been received by the other side yet,  
and if sending to Mars we still have many minutes to wait.

One problem people have encountered in the past is sending, oh say, 64K
and then closing the socket on a slow connection. The socket *lingers*
around open for the linger time, a few seconds after the close
request, to flush any data, but on a slow connection this linger time
is insufficient to ensure all the data is transmitted before the
close. If your model is send 10MB, then close the socket, I suspect
it's terminating before the data is fully sent on Unix-based machines.
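
For reference, the linger time is a per-socket option. The following Python sketch shows how it could be lengthened; it assumes the POSIX `struct linger` layout of two C ints (Windows differs), and the helper names `set_linger` and `get_linger` are made up for the example:

```python
import socket
import struct

def set_linger(sock, seconds):
    """Ask close() to wait up to `seconds` for unsent data to drain.

    Assumes the POSIX struct linger layout of two C ints
    (l_onoff, l_linger); Windows uses a different layout.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, seconds))

def get_linger(sock):
    """Read back (l_onoff, l_linger) under the same layout assumption."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize("ii"))
    return struct.unpack("ii", raw)
```

A longer linger time only gives the OS more time to flush its buffer. It still does not tell the sender whether the remote side received everything.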

Btw, sqSocketSendDone on the Unix platforms checks whether sending
more data would block; if not, it says the send is done. That isn't
quite true, because "send done" doesn't mean all the bytes have been
sent and delivered to the remote host. That is a different question.

sendDone ~= messageDelivered
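
If the sender really needs to know that the data arrived, the usual answer is an acknowledgement at the application level. A minimal sketch in Python, assuming a made-up protocol in which the receiver reads until end of stream and replies with an 8-byte count (`send_and_confirm` and `confirming_peer` are hypothetical names, not part of Squeak or the VM):

```python
import socket
import struct
import threading

def send_and_confirm(sock, data, timeout=5.0):
    """Send data, half-close, then wait for the peer to report its byte count."""
    sock.settimeout(timeout)
    sock.sendall(data)
    sock.shutdown(socket.SHUT_WR)  # no more data from us; peer sees EOF
    reply = b""
    while len(reply) < 8:
        part = sock.recv(8 - len(reply))
        if not part:
            break
        reply += part
    if len(reply) < 8:
        return False  # peer went away without confirming
    (count,) = struct.unpack("!Q", reply)
    return count == len(data)

def confirming_peer(sock):
    """Hypothetical receiver: read until EOF, then send back the total as 8 bytes."""
    total = 0
    while True:
        chunk = sock.recv(65536)
        if not chunk:
            break
        total += len(chunk)
    sock.sendall(struct.pack("!Q", total))
    sock.close()

def demo():
    """Run sender and receiver over a local socket pair."""
    a, b = socket.socketpair()
    t = threading.Thread(target=confirming_peer, args=(b,))
    t.start()
    ok = send_and_confirm(a, bytes(500_000))
    t.join()
    a.close()
    return ok
```

Here `demo()` returns `True` only once the peer has reported receiving every byte, which is the "messageDelivered" half of the distinction above.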

John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.