I would be extremely grateful for a reproducible case for the following Socket issue

Eliot Miranda-2

On 8/13/2011 13:42, Levente Uzonyi wrote:

> Socket's readSemaphore is losing signals with CogVMs on Linux. We 
> found several cases (RFB, PostgreSQL) when processes are stuck in the 
> following method: 
> Socket >> waitForDataIfClosed: closedBlock 
>     "Wait indefinitely for data to arrive.  This method will block until 
>     data is available or the socket is closed." 
>     [ 
>         (self primSocketReceiveDataAvailable: socketHandle) 
>             ifTrue: [^self]. 
>         self isConnected 
>             ifFalse: [^closedBlock value]. 
>         self readSemaphore wait ] repeat 
> When we inspect the contexts, the process is waiting for the 
> readSemaphore, but evaluating (self primSocketReceiveDataAvailable: 
> socketHandle) yields true. Signaling the readSemaphore makes the 
> process run again. As a workaround we replaced #wait with 
> #waitTimeoutMSecs: and all our problems disappeared. 
> The interpreter VM doesn't seem to have this bug, so I guess the bug 
> was introduced with the changes to aio.c.

Oh, interesting. We know this problem fairly well and have always worked 
around it by changing the wait in the above to a "waitTimeoutMSecs: 500", 
which turns it into a soft busy loop. It would be interesting to see if 
there's a bug in Cog which causes this. FWIW, here is the relevant portion: 

             "Soft 500ms busy loop - to protect against AIO probs; 
             occasionally, VM-level AIO fails to trip the semaphore" 
             self readSemaphore waitTimeoutMSecs: 500. 
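
In context, the whole method with the workaround applied reads roughly like this (a sketch based on the loop quoted above; the timed wait bounds the damage of a lost signal to half a second):

Socket >> waitForDataIfClosed: closedBlock
    "Wait for data to arrive.  This method blocks until data is
    available or the socket is closed."
    [
        (self primSocketReceiveDataAvailable: socketHandle)
            ifTrue: [^self].
        self isConnected
            ifFalse: [^closedBlock value].
        "Soft 500ms busy loop - to protect against AIO probs;
        occasionally, VM-level AIO fails to trip the semaphore"
        self readSemaphore waitTimeoutMSecs: 500 ] repeat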

Cheers, 
   - Andreas 
_,,,^..^,,,_
best, Eliot

Re: [squeak-dev] Re: [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Eliot Miranda-2
Hi Levente,

On Wed, Mar 23, 2016 at 11:31 AM, Levente Uzonyi <[hidden email]> wrote:
Hi Eliot,

What sort of reproducibility are you looking for? Is it enough if it happens once every few hours or do you need something that you can trigger on demand?

I'll take every few hours, but I'd prefer "in under 30 minutes".  Getting warm and fuzzy feelings when trying to prove a negative with something that takes hours to run is very difficult.  Let's say you have a case which reproduces in 8 hours 50% of the time.  To reach a 99% confidence level in a fix I'd have to run it for 8 * (100 log: 2) hours without seeing it reproduce, right?  That's over 2 days; it could take weeks to fix :-(
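
(A quick workspace check of that arithmetic, assuming each 8-hour run independently reproduces the bug with probability 0.5:)

| hoursPerRun confidence runsNeeded |
hoursPerRun := 8.
confidence := 0.99.
"Each clean run halves the chance the bug survived undetected, so we
need (0.5 raisedTo: runsNeeded) to fall below 1 - confidence."
runsNeeded := ((1 - confidence) log / 0.5 log) ceiling.
runsNeeded * hoursPerRun  "7 runs * 8 hours = 56 hours, i.e. over 2 days"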

Levente


_,,,^..^,,,_
best, Eliot

Re: [squeak-dev] [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Holger Freyther

> On 24 Mar 2016, at 10:16, Levente Uzonyi <[hidden email]> wrote:
>
> Hi Eliot,
>
> The snippet below, evaluated from a workspace, triggered the issue in less than a minute for me, three times in a row.
> Both processes will halt if #sloppyWaitForDataIfClosed: doesn't return within a second. If you send #dataAvailable to the socket, you'll find that it has data ready to be read, but its readSemaphore has no signal.

Does it happen if you run the VM under strace -o output.txt -f as well?

kind regards
        holger

Re: [squeak-dev] Re: [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Eliot Miranda-2
In reply to this post by Eliot Miranda-2
Hi Levente,

On Thu, Mar 24, 2016 at 2:16 AM, Levente Uzonyi <[hidden email]> wrote:
Hi Eliot,

The snippet below, evaluated from a workspace, triggered the issue in less than a minute for me, three times in a row.
Both processes will halt if #sloppyWaitForDataIfClosed: doesn't return within a second. If you send #dataAvailable to the socket, you'll find that it has data ready to be read, but its readSemaphore has no signal.

Levente


Socket compile: 'sloppyWaitForDataIfClosed: closedBlock

        [(socketHandle ~~ nil
          and: [self primSocketReceiveDataAvailable: socketHandle]) ifTrue:
                [^self].
         self isConnected ifFalse:
                [^closedBlock value].
         self readSemaphore wait] repeat'
classified: 'waiting'.

[
        listenerSocket := Socket newTCP.
        listenerSocket listenOn: 0 backlogSize: 4 interface: #[127 0 0 1].
        clientSocket := Socket newTCP.
        clientSocket connectTo: #[127 0 0 1] port: listenerSocket localPort.
        clientSocket waitForConnectionFor: 1.
        self assert: clientSocket isConnected.
        serverSocket := listenerSocket waitForAcceptFor: 1.
        self assert: serverSocket isConnected ]
        ensure: [ listenerSocket destroy ].

serverProcess := [
        | shouldRun buffer bytesReceived waitDuration |
        shouldRun := true.
        buffer := ByteString new: 10.
        waitDuration := 1 second.
        [
                [ serverSocket sloppyWaitForDataIfClosed: [ shouldRun := false ] ]
                        valueWithin: waitDuration
                        onTimeout: [ self halt ].
                buffer atAllPut: (Character value: 0).
                bytesReceived := serverSocket receiveDataInto: buffer.
                self assert: bytesReceived = 4.
                self assert: (buffer first: 4) = 'PING'.
                serverSocket sendData: 'PONG' ] repeat ] newProcess.
clientProcess := [
        | shouldRun buffer bytesReceived waitDuration |
        shouldRun := true.
        buffer := ByteString new: 10.
        waitDuration := 1 second.
        [
                clientSocket sendData: 'PING'.
                [ clientSocket sloppyWaitForDataIfClosed: [ shouldRun := false ] ]
                        valueWithin: waitDuration
                        onTimeout: [ self halt ].
                buffer atAllPut: (Character value: 0).
                bytesReceived := clientSocket receiveDataInto: buffer.
                self assert: bytesReceived = 4.
                self assert: (buffer first: 4) = 'PONG' ] repeat ] newProcess.
clientProcess priority: 39; resume.
serverProcess priority: 39; resume.

"Evaluate these after debugging:
clientSocket destroy.
serverSocket destroy."

Fabulous, thank you!  Replace the self halts with e.g. self assert: (clientSocket dataAvailable = (clientSocket readSemaphore excessSignals > 0)), and we even have a test.  I have work to do tomorrow but hope to be able to debug this soon.  I'll add kqueue and epoll support when I fix it.
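
Concretely, each onTimeout: block in the snippet would become something like this (untested):

[ clientSocket sloppyWaitForDataIfClosed: [ shouldRun := false ] ]
        valueWithin: waitDuration
        onTimeout: [
                "The bug manifests as data available on the socket
                while the read semaphore holds no signal."
                self assert: clientSocket dataAvailable
                        = (clientSocket readSemaphore excessSignals > 0) ].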

(Stephan, interesting suggestion to throw hardware at the problem, thank you).

On Wed, 23 Mar 2016, Eliot Miranda wrote:
[snip]

--
_,,,^..^,,,_
best, Eliot

Re: [squeak-dev] [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Holger Freyther

> On 25 Mar 2016, at 00:35, Eliot Miranda <[hidden email]> wrote:
>


Dear Eliot,


> Fabulous, thank you!  Replace the self halts with e.g. self assert: (clientSocket dataAvailable = (clientSocket readSemaphore excessSignals > 0)), and we even have a test.  I have work to do tomorrow but hope to be able to debug this soon.  I'll add kqueue and epoll support when I fix it.

Before going to kqueue, could you have a look at my hack to create a completely idle VM? Once the VM has moved to kqueue, it seems one cannot get a SIGIO for the "kqueue descriptor"[1] itself, and I don't know whether SIGIO keeps working for the individual file descriptors.

holger


[1] https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_event.c#L1735



Re: [squeak-dev] [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Eliot Miranda-2
Hi Holger,

    let's collaborate on this.  You and I and lots of others want this done.  But in fixing the bug below my priorities are to
a) deliver a fix promptly, and
b) write the code so that kqueue, epoll, or select is used based on the HAVE_KQUEUE and HAVE_EPOLL defines.

So if I deliver a fix, don't think that I'm against having a proper event-driven or idling VM; I just don't have it in my head yet, and I want the bug fixed and kqueue/epoll available asap.

On Thu, Mar 24, 2016 at 11:42 PM, Holger Freyther <[hidden email]> wrote:


[snip]

--
_,,,^..^,,,_
best, Eliot

Re: [squeak-dev] Re: [Vm-dev] I would be extremely grateful for a reproducible case for the following Socket issue

Eliot Miranda-2
In reply to this post by Eliot Miranda-2
Hi Levente,

    the first thing to report is that this isn't a Cog-specific bug.  It also reproduces with the trunk VM sources (4.15.3) and a freshly built Linux VM with a 4.6 image.

On Thu, Mar 24, 2016 at 2:16 AM, Levente Uzonyi <[hidden email]> wrote:
[snip]


--
_,,,^..^,,,_
best, Eliot