I would be extremely grateful for a reproducible case for the following Socket issue


I would be extremely grateful for a reproducible case for the following Socket issue

Eliot Miranda-2

On 8/13/2011 13:42, Levente Uzonyi wrote:

> Socket's readSemaphore is losing signals with CogVMs on Linux. We 
> found several cases (RFB, PostgreSQL) where processes are stuck in the 
> following method: 
> Socket >> waitForDataIfClosed: closedBlock 
>     "Wait indefinitely for data to arrive.  This method will block until 
>     data is available or the socket is closed." 
>     [ 
>         (self primSocketReceiveDataAvailable: socketHandle) 
>             ifTrue: [^self]. 
>         self isConnected 
>             ifFalse: [^closedBlock value]. 
>         self readSemaphore wait ] repeat 
> When we inspect the contexts, the process is waiting for the 
> readSemaphore, but evaluating (self primSocketReceiveDataAvailable: 
> socketHandle) yields true. Signaling the readSemaphore makes the 
> process run again. As a workaround we replaced #wait with 
> #waitTimeoutMSecs: and all our problems disappeared. 
> The interpreter VM doesn't seem to have this bug, so I guess the bug 
> was introduced by the changes to aio.c.

Oh, interesting. We know this problem fairly well and have always worked 
around it by changing the wait in the above to a "waitTimeoutMSecs: 500", 
which turns it into a soft busy loop. It would be interesting to see if 
there's a bug in Cog which causes this. FWIW, here is the relevant portion: 

             "Soft 500ms busy loop - to protect against AIO probs; 
             occasionally, VM-level AIO fails to trip the semaphore" 
             self readSemaphore waitTimeoutMSecs: 500. 

Cheers, 
   - Andreas 
_,,,^..^,,,_
best, Eliot
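
For reference, the complete workaround method might look like this (a sketch only; it combines the waitForDataIfClosed: body from Levente's report with the waitTimeoutMSecs: 500 fragment quoted above, so details may differ from the code actually shipped):

Socket >> waitForDataIfClosed: closedBlock
	"Wait for data to arrive.  Poll the read semaphore with a 500 ms
	 timeout so that a lost signal costs at most half a second instead
	 of hanging the process indefinitely."
	[
		(self primSocketReceiveDataAvailable: socketHandle)
			ifTrue: [^self].
		self isConnected
			ifFalse: [^closedBlock value].
		"Soft 500ms busy loop - to protect against AIO problems;
		 occasionally, VM-level AIO fails to trip the semaphore"
		self readSemaphore waitTimeoutMSecs: 500 ] repeat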



Re: [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Levente Uzonyi
Hi Eliot,

What sort of reproducibility are you looking for? Is it enough if it
happens once every few hours or do you need something that you can trigger
on demand?

Levente



Re: [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Eliot Miranda-2
Hi Levente,

On Wed, Mar 23, 2016 at 11:31 AM, Levente Uzonyi <[hidden email]> wrote:
Hi Eliot,

What sort of reproducibility are you looking for? Is it enough if it happens once every few hours or do you need something that you can trigger on demand?

I'll take every few hours, but I'd prefer "in under 30 minutes".  Getting warm and fuzzy feelings when trying to prove a negative with something that takes hours to run is very difficult.  Let's say you have a case which reproduces in 8 hours 50% of the time.  To reach 99% confidence level in a fix I'd have to run it for 8 * (50 log: 2) hours without seeing it reproduce, right?  That's nearly 2 days; it could take weeks to fix :-(
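
Spelling out that arithmetic (a sketch; it assumes each 8-hour run reproduces the bug independently with probability 0.5):

	"After n clean runs, the chance an unfixed bug went unseen is 0.5 raisedTo: n."
	runsNeeded := (50 log: 2) ceiling.	"6 runs, residual probability about 1/64"
	hoursNeeded := 8 * runsNeeded.	"48 hours, i.e. roughly two days"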

Levente


_,,,^..^,,,_
best, Eliot



Re: I would be extremely grateful for a reproducible case for the following Socket issue

John McKeon
In reply to this post by Eliot Miranda-2
I remember using RFB to connect to a Mac from a Windows machine at work, and that problem would crop up quite regularly; nowhere near an eight-hour wait. I will see if I can give you a more definite way of reproducing it when I am back at my desk tomorrow.

Re: [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Stephan Eggermont-3
In reply to this post by Eliot Miranda-2
On 23-03-16 19:45, Eliot Miranda wrote:
> I'll take every few hours, but I'd prefer "in under 30 minutes".
> Getting warm and fuzzy feelings when trying to prove a negative with
> something that takes hours to run is very difficult. Let's say you
> have a case which reproduces in 8 hours 50% of the time. To reach 99%
> confidence level in a fix I'd have to run it for 8 * (50 log: 2) hours
> without seeing it reproduce, right? That's nearly 2 days; it could
> take weeks to fix :-(
> Levente

Is this something where throwing a lot of hardware at it would speed
things up?
Run it for a few minutes on hundreds of VMs?

Stephan




Re: [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Levente Uzonyi
In reply to this post by Eliot Miranda-2
Hi Eliot,

The snippet below, evaluated from a workspace, triggered the issue in less
than a minute for me, three times in a row.
Both processes will halt if #sloppyWaitForDataIfClosed: doesn't return
within a second. If you send #dataAvailable to the socket, you'll find
that it has data ready to be read, but its readSemaphore has no signal.

Levente


Socket compile: 'sloppyWaitForDataIfClosed: closedBlock

	[(socketHandle ~~ nil
	  and: [self primSocketReceiveDataAvailable: socketHandle]) ifTrue:
		[^self].
	 self isConnected ifFalse:
		[^closedBlock value].
	 self readSemaphore wait] repeat'
classified: 'waiting'.

[
	listenerSocket := Socket newTCP.
	listenerSocket listenOn: 0 backlogSize: 4 interface: #[127 0 0 1].
	clientSocket := Socket newTCP.
	clientSocket connectTo: #[127 0 0 1] port: listenerSocket localPort.
	clientSocket waitForConnectionFor: 1.
	self assert: clientSocket isConnected.
	serverSocket := listenerSocket waitForAcceptFor: 1.
	self assert: serverSocket isConnected ]
	ensure: [ listenerSocket destroy ].

serverProcess := [
	| shouldRun buffer bytesReceived waitDuration |
	shouldRun := true.
	buffer := ByteString new: 10.
	waitDuration := 1 second.
	[
		[ serverSocket sloppyWaitForDataIfClosed: [ shouldRun := false ] ]
			valueWithin: waitDuration
			onTimeout: [ self halt ].
		buffer atAllPut: (Character value: 0).
		bytesReceived := serverSocket receiveDataInto: buffer.
		self assert: bytesReceived = 4.
		self assert: (buffer first: 4) = 'PING'.
		serverSocket sendData: 'PONG' ] repeat ] newProcess.
clientProcess := [
	| shouldRun buffer bytesReceived waitDuration |
	shouldRun := true.
	buffer := ByteString new: 10.
	waitDuration := 1 second.
	[
		clientSocket sendData: 'PING'.
		[ clientSocket sloppyWaitForDataIfClosed: [ shouldRun := false ] ]
			valueWithin: waitDuration
			onTimeout: [ self halt ].
		buffer atAllPut: (Character value: 0).
		bytesReceived := clientSocket receiveDataInto: buffer.
		self assert: bytesReceived = 4.
		self assert: (buffer first: 4) = 'PONG' ] repeat ] newProcess.
clientProcess priority: 39; resume.
serverProcess priority: 39; resume.

"Evaluate these after debugging:
clientSocket destroy.
serverSocket destroy."




Re: [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Eliot Miranda-2
Hi Levente,

On Thu, Mar 24, 2016 at 2:16 AM, Levente Uzonyi <[hidden email]> wrote:
Hi Eliot,

The snippet below, evaluated from a workspace, triggered the issue in less than a minute for me, three times in a row.
Both processes will halt if #sloppyWaitForDataIfClosed: doesn't return within a second. If you send #dataAvailable to the socket, you'll find that it has data ready to be read, but its readSemaphore has no signal.

Levente



Fabulous, thank you!  Replace the self halts with e.g. self assert: (clientSocket dataAvailable = (clientSocket readSemaphore excessSignals > 0)), and we even have a test.  I have work to do tomorrow, but I hope to be able to debug this soon.  I'll add kqueue and epoll support when I fix it.

(Stephan, interesting suggestion to throw hardware at the problem, thank you).
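
Concretely, the client side's wait would then read something like this (a sketch; it just substitutes the suggested assertion for the halt in Levente's snippet, with the rest of the loop unchanged):

	[ clientSocket sloppyWaitForDataIfClosed: [ shouldRun := false ] ]
		valueWithin: waitDuration
		onTimeout: [
			"The invariant 'data available implies a pending semaphore signal'
			 should hold here; the bug is exactly that it sometimes does not."
			self assert: (clientSocket dataAvailable
				= (clientSocket readSemaphore excessSignals > 0)) ].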








--
_,,,^..^,,,_
best, Eliot



Re: [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Levente Uzonyi
In reply to this post by Levente Uzonyi
Hi Holger,

The snippet doesn't seem to trigger the issue when strace is attached to
the process, but it usually runs into the issue shortly after I stop strace.

Levente

On Thu, 24 Mar 2016, Holger Freyther wrote:

>
>
>> On 24 Mar 2016, at 10:16, Levente Uzonyi <[hidden email]> wrote:
>>
>> Hi Eliot,
>>
>> The snippet below, evaluated from a workspace, triggered the issue in less than a minute for me, three times in a row.
>> Both processes will halt if #sloppyWaitForDataIfClosed: doesn't return within a second. If you send #dataAvailable to the socket, you'll find that it has data ready to be read, but its readSemaphore has no signal.
>
> does it happen if you have strace -o output.txt -f on the VM as well?
>
> kind regards
> holger


Re: [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Eliot Miranda-2
In reply to this post by Eliot Miranda-2
Hi Holger,

    let's collaborate on this.  You and I and lots of others want this to be done.  But in fixing the bug below my priorities are to
a) deliver a fix promptly, and
b) write the code so that kqueue, epoll or select will be used based on the HAVE_KQUEUE & HAVE_EPOLL defines.

So if I deliver a fix, don't think that I'm against having a proper event-driven or idling VM; I just don't have it in my head yet and want the bug fixed and kqueue/epoll available asap.

On Thu, Mar 24, 2016 at 11:42 PM, Holger Freyther <[hidden email]> wrote:


> On 25 Mar 2016, at 00:35, Eliot Miranda <[hidden email]> wrote:
>


Dear Eliot,


> Fabulous, thank you!  replace the self halts with e.g. self assert: (clientSocket dataAvailable = (clientSocket readSemaphore excessSignals > 0)), and we even have a test.  I have work to do tomorrow but hope I should be able to debug this soon.  I'll add kqueue and epoll support when I fix it.

before going to kqueue, could you have a look at my hack to create a completely idle VM? Once the VM has moved to kqueue, it seems one cannot get a SIGIO for the "kqueue descriptor"[1] itself, and I don't know whether SIGIO keeps working for the individual file descriptors.

holger


[1] https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_event.c#L1735




--
_,,,^..^,,,_
best, Eliot



Re: [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Eliot Miranda-2
In reply to this post by Levente Uzonyi
Hi Levente,

    the first thing to report is that this isn't a Cog-specific bug.  It also reproduces using the trunk VM sources (4.15.3) and a freshly built Linux VM with a 4.6 image.








--
_,,,^..^,,,_
best, Eliot



Re: [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Levente Uzonyi
Hi Eliot,

AFAIK the workaround came from the Qwaq project, and it had been there
long before it was added to the Trunk.

Levente