Socket's readSemaphore is losing signals with Cog on Linux


Levente Uzonyi-2
 
Hi,

Socket's readSemaphore is losing signals with CogVMs on Linux. We found
several cases (RFB, PostgreSQL) where processes get stuck in the following
method:

Socket >> waitForDataIfClosed: closedBlock
    "Wait indefinitely for data to arrive.  This method will block until
    data is available or the socket is closed."

    [
        (self primSocketReceiveDataAvailable: socketHandle)
            ifTrue: [^self].
        self isConnected
            ifFalse: [^closedBlock value].
        self readSemaphore wait ] repeat

When we inspect the contexts, the process is waiting on the
readSemaphore, but evaluating (self primSocketReceiveDataAvailable:
socketHandle) yields true. Signaling the readSemaphore makes the process
run again. As a workaround we replaced #wait with #waitTimeoutMSecs:
and all our problems disappeared.

The interpreter VM doesn't seem to have this bug, so I guess the bug was
introduced with the changes of aio.c.


Cheers,
Levente

Re: Socket's readSemaphore is losing signals with Cog on Linux

Andreas.Raab
 
On 8/13/2011 13:42, Levente Uzonyi wrote:

> Socket's readSemaphore is losing signals with CogVMs on linux. We
> found several cases (RFB, PostgreSQL) when processes are stuck in the
> following method:
>
> Socket >> waitForDataIfClosed: closedBlock
>     "Wait indefinitely for data to arrive.  This method will block until
>     data is available or the socket is closed."
>
>     [
>         (self primSocketReceiveDataAvailable: socketHandle)
>             ifTrue: [^self].
>         self isConnected
>             ifFalse: [^closedBlock value].
>         self readSemaphore wait ] repeat
>
> When we inspect the contexts, the process is waiting for the
> readSemaphore, but evaluating (self primSocketReceiveDataAvailable:
> socketHandle) yields true. Signaling the readSemaphore makes the
> process running again. As a workaround we replaced #wait with
> #waitTimeoutMSecs: and all our problems disappeared.
>
> The interpreter VM doesn't seem to have this bug, so I guess the bug
> was introduced with the changes of aio.c.

Oh, interesting. We know this problem fairly well and have always worked
around it by changing the wait in the above to a "waitTimeoutMSecs: 500",
which turns it into a soft busy loop. It would be interesting to see if
there's a bug in Cog which causes this. FWIW, here is the relevant portion:

             "Soft 500ms busy loop - to protect against AIO probs;
             occasionally, VM-level AIO fails to trip the semaphore"
             self readSemaphore waitTimeoutMSecs: 500.
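
For reference, a minimal sketch of the whole method with the workaround applied. It assumes the only change is swapping #wait for the timed wait quoted above; the 500 ms value comes from the snippet, and everything else is the listing from the original report, unchanged.

```smalltalk
Socket >> waitForDataIfClosed: closedBlock
    "Wait for data to arrive. Polls at least every 500 ms, so a lost
    semaphore signal delays the process instead of blocking it forever."

    [
        (self primSocketReceiveDataAvailable: socketHandle)
            ifTrue: [^self].
        self isConnected
            ifFalse: [^closedBlock value].
        "Soft 500ms busy loop: wake up on a signal or on timeout, then re-check"
        self readSemaphore waitTimeoutMSecs: 500 ] repeat
```

The timeout does not fix the lost signal. It only bounds how long the process can stall, because the loop re-checks primSocketReceiveDataAvailable: on every pass whether or not the semaphore was signaled.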

Cheers,
   - Andreas


Re: Socket's readSemaphore is losing signals with Cog on Linux

Levente Uzonyi-2
 
On Sun, 14 Aug 2011, Andreas Raab wrote:

>
> On 8/13/2011 13:42, Levente Uzonyi wrote:
>> Socket's readSemaphore is losing signals with CogVMs on linux. We found
>> several cases (RFB, PostgreSQL) when processes are stuck in the following
>> method:
>>
>> Socket >> waitForDataIfClosed: closedBlock
>>     "Wait indefinitely for data to arrive.  This method will block until
>>     data is available or the socket is closed."
>>
>>     [
>>         (self primSocketReceiveDataAvailable: socketHandle)
>>             ifTrue: [^self].
>>         self isConnected
>>             ifFalse: [^closedBlock value].
>>         self readSemaphore wait ] repeat
>>
>> When we inspect the contexts, the process is waiting for the readSemaphore,
>> but evaluating (self primSocketReceiveDataAvailable: socketHandle) yields
>> true. Signaling the readSemaphore makes the process running again. As a
>> workaround we replaced #wait with #waitTimeoutMSecs: and all our problems
>> disappeared.
>>
>> The interpreter VM doesn't seem to have this bug, so I guess the bug was
>> introduced with the changes of aio.c.
>
> Oh, interesting. We know this problem fairly well and have always worked
> around by changing the wait in the above to a "waitTimeoutMSecs: 500" which
> turns it into a soft busy loop. It would be interesting to see if there's a
> bug in Cog which causes this. FWIW, here is the relevant portion:
>
>            "Soft 500ms busy loop - to protect against AIO probs;
>            occasionally, VM-level AIO fails to trip the semaphore"
>            self readSemaphore waitTimeoutMSecs: 500.
>
> Cheers,
>  - Andreas

It took a while for us to realize that _this_ bug is responsible for our
problems. With RFB we found that when the bug happens, which is every few
hours, the server stops accepting input from the client while it still
sends the changes of the view. We thought it was a side effect of some
changes in recent Squeak versions and we just didn't care about it.
Restarting the RFB client can be done in a second.
With PostgreSQL we thought that our Postgres V3 client had a bug. Our old
system uses the Postgres V2 client, Seaside 2.8, Squeak 3.9 and the
interpreter VM, and it hasn't had such a problem for years.
We recently started migrating it to Postgres V3, a custom web framework,
Squeak 4.2 and CogVM.
The main differences between these systems are interpreter VM vs. CogVM and
Postgres V2 vs. V3. We assumed that Cog is identical from this POV and
tried to debug the postgres protocol, but when I saw where the processes
got stalled I remembered your email from 2009 when you mentioned that you
had a similar bug [1].
So I'm pretty sure this bug is Cog specific. Reproducing it seems to be
pretty hard, so a code review (with sufficient knowledge :)) is more
likely to help solve this issue.


Levente

[1] http://lists.squeakfoundation.org/pipermail/vm-dev/2009-May/002619.html

Re: Socket's readSemaphore is losing signals with Cog on Linux

Colin Putney-3
 
On Sun, Aug 14, 2011 at 12:44 PM, Levente Uzonyi <[hidden email]> wrote:

> So I'm pretty sure this bug is Cog specific. Reproducing it seems to be
> pretty hard, so a code review (with sufficient knowledge :)) is more likely
> to help solving this issue.

FWIW, I can reproduce it fairly easily using the Xtreams test suite.
Running XTSocketReadingWritingTest almost always has several tests
time out because of it.

Colin

Re: Socket's readSemaphore is losing signals with Cog on Linux

Eliot Miranda-2
In reply to this post by Levente Uzonyi-2
 
Thanks, Levente (and Colin for the reproducible case).  I should be able to look at this towards the end of the week.  Anyone else who wants to eyeball aio.c in the Cog branch against aio.c in the trunk vm is most welcome.


--
best,
Eliot


Re: Socket's readSemaphore is losing signals with Cog on Linux

Igor Stasenko

On 15 August 2011 21:10, Eliot Miranda <[hidden email]> wrote:
>
> Thanks, Levente (and Colin for the reproducible case).  I should be able to look at this towards the end of the week.  Anyone else who wants to eyeball aio.c in the Cog branch against aio.c in the trunk vm is most welcome.
>

There are multiple places to look:
sqExternalSemaphores.c is one of them.
The socket plugin is currently perhaps the only plugin that uses that code
to signal semaphores from a non-VM thread.


> On Sun, Aug 14, 2011 at 12:44 PM, Levente Uzonyi <[hidden email]> wrote:
>>
>> On Sun, 14 Aug 2011, Andreas Raab wrote:
>>
>>>
>>> On 8/13/2011 13:42, Levente Uzonyi wrote:
>>>>
>>>> Socket's readSemaphore is losing signals with CogVMs on linux. We found several cases (RFB, PostgreSQL) when processes are stuck in the following method:
>>>>
>>>> Socket >> waitForDataIfClosed: closedBlock
>>>>    "Wait indefinitely for data to arrive.  This method will block until
>>>>    data is available or the socket is closed."
>>>>
>>>>    [
>>>>        (self primSocketReceiveDataAvailable: socketHandle)
>>>>            ifTrue: [^self].
>>>>        self isConnected
>>>>            ifFalse: [^closedBlock value].
>>>>        self readSemaphore wait ] repeat
>>>>
>>>> When we inspect the contexts, the process is waiting for the readSemaphore, but evaluating (self primSocketReceiveDataAvailable: socketHandle) yields true. Signaling the readSemaphore makes the process running again. As a workaround we replaced #wait with #waitTimeoutMSecs: and all our problems disappeared.
>>>>
>>>> The interpreter VM doesn't seem to have this bug, so I guess the bug was introduced with the changes of aio.c.
>>>
>>> Oh, interesting. We know this problem fairly well and have always worked around by changing the wait in the above to a "waitTimeoutMSecs: 500" which turns it into a soft busy loop. It would be interesting to see if there's a
>>
>> It took a while for us to realize that _this_ bug is responsible for our problems. With RFB we found that the server doesn't accept input from the client, while it's still sending the changes of the view when the bug happens, which is every few hours. We thought that it's the side effect of some changes in recent Squeak versions and we just didn't care about it. Restarting the RFB client can be done in a second.
>> With PostgreSQL we thought that our Postgres V3 client has a bug. Our old system uses Postgres V2 client, Seaside 2.8, Squeak 3.9 and interpreter VM and it didn't have such problem for years.
>> We recently started migrating it to Postgres V3, Custom web framework, Squeak 4.2 and CogVM.
>> The main differences between these system are interpreter VM - CogVM and Postgres V2 - V3. We assumed that Cog is identical from this POV and tried to debug the postgres protocol, but when I saw where the processes got stalled I remembered your email from 2009 when you mentioned that you had a similar bug [1].
>> So I'm pretty sure this bug is Cog specific. Reproducing it seems to be pretty hard, so a code review (with sufficient knowledge :)) is more likely to help solving this issue.
>>
>>
>> Levente
>>
>> [1] http://lists.squeakfoundation.org/pipermail/vm-dev/2009-May/002619.html
>>
>>> bug in Cog which causes this. FWIW, here is the relevant portion:
>>>
>>>           "Soft 500ms busy loop - to protect against AIO probs;
>>>           occasionally, VM-level AIO fails to trip the semaphore"
>>>           self readSemaphore waitTimeoutMSecs: 500.
>>>
>>> Cheers,
>>>  - Andreas
>>>
>>>
>
>
>
> --
> best,
> Eliot
>
>



--
Best regards,
Igor Stasenko AKA sig.

Re: Socket's readSemaphore is losing signals with Cog on Linux

Henrik Sperre Johansen
In reply to this post by Colin Putney-3
 
On 15.08.2011 19:14, Colin Putney wrote:

>
> On Sun, Aug 14, 2011 at 12:44 PM, Levente Uzonyi<[hidden email]>  wrote:
>
>> So I'm pretty sure this bug is Cog specific. Reproducing it seems to be
>> pretty hard, so a code review (with sufficient knowledge :)) is more likely
>> to help solving this issue.
> FWIW, I can reproduce it fairly easily using the Xtreams test suite.
> Running XTSocketReadingWritingTest almost always has several tests
> time out because of it.
>
> Colin
FWIW, I ran the XTSocketReadingWritingTest often enough to see errors in
the results, and interrupted it while it looked to be hanging (this on
Windows).
In the latest Pharo images, I got the "'Not enough space for external
objects, set a larger size at startup!'" error message I added for
attempts to adjust the ExternalSemaphoreTable size at runtime.
If the default behaviour in your image is to just silently increase the
size, or not increase it at all, that may explain the lost signals.

Cheers,
Henry

Re: Socket's readSemaphore is losing signals with Cog on Linux

Henrik Sperre Johansen
 
On 15.08.2011 23:14, Henrik Sperre Johansen wrote:

>
> On 15.08.2011 19:14, Colin Putney wrote:
>>
>> On Sun, Aug 14, 2011 at 12:44 PM, Levente Uzonyi<[hidden email]>  wrote:
>>
>>> So I'm pretty sure this bug is Cog specific. Reproducing it seems to be
>>> pretty hard, so a code review (with sufficient knowledge :)) is more
>>> likely
>>> to help solving this issue.
>> FWIW, I can reproduce it fairly easily using the Xtreams test suite.
>> Running XTSocketReadingWritingTest almost always has several tests
>> time out because of it.
>>
>> Colin
> FWIW, I ran the XTSocketReadingWritingTest often enough to see errors
> in the results, and interrupted it while it looked to be hanging (this
> in Windows).
> In latest Pharo images, I got the "'Not enough space for external
> objects, set a larger size at startup!'" error message I added for
> trying to adjust ExternalSemaphoreTable size at runtime.
> If default behaviour in your image is to just silently increase
> size/not increase it at all, that may explain the lost signals.
>
> Cheers,
> Henry
http://code.google.com/p/pharo/issues/detail?id=4655

With this, I can run Xtreams tests as much as I want without timeouts.
(On Windows at least)
There are still occasional failures, afaict they're due to threading
bugs in the tests and not signal losses though.

This could of course be totally unrelated to the lost signals Levente and
others are seeing on Unix, but it'd be interesting to hear whether they still
happen with changes in place equivalent to the ones described in the issue.

Cheers,
Henry

PS. At least on Windows, independently of the above, if I manually set
maxExternalObjects to > 4095 (i.e. its real size is 8192), I inevitably
run into
"CreateThread() failed (8) - Not enough storage is available to process
this command" errors in the output console if I run the
XTSocketReadingWritingTest...


Re: Socket's readSemaphore is losing signals with Cog on Linux

Igor Stasenko

On 16 August 2011 16:30, Henrik Sperre Johansen
<[hidden email]> wrote:

>
> On 15.08.2011 23:14, Henrik Sperre Johansen wrote:
>>
>> On 15.08.2011 19:14, Colin Putney wrote:
>>>
>>> On Sun, Aug 14, 2011 at 12:44 PM, Levente Uzonyi<[hidden email]>  wrote:
>>>
>>>> So I'm pretty sure this bug is Cog specific. Reproducing it seems to be
>>>> pretty hard, so a code review (with sufficient knowledge :)) is more
>>>> likely
>>>> to help solving this issue.
>>>
>>> FWIW, I can reproduce it fairly easily using the Xtreams test suite.
>>> Running XTSocketReadingWritingTest almost always has several tests
>>> time out because of it.
>>>
>>> Colin
>>
>> FWIW, I ran the XTSocketReadingWritingTest often enough to see errors in
>> the results, and interrupted it while it looked to be hanging (this in
>> Windows).
>> In latest Pharo images, I got the "'Not enough space for external objects,
>> set a larger size at startup!'" error message I added for trying to adjust
>> ExternalSemaphoreTable size at runtime.
>> If default behaviour in your image is to just silently increase size/not
>> increase it at all, that may explain the lost signals.
>>
>> Cheers,
>> Henry
>
> http://code.google.com/p/pharo/issues/detail?id=4655
>
Is it already integrated in image(s)?

> With this, I can run Xtreams tests as much as I want without timeouts. (On
> Windows at least)
> There are still occasional failures, afaict they're due to threading bugs in
> the tests and not signal losses though.
>


> Could of course be totally unrelated to the lost signals Levente and others
> are seeing on Unix, but it'd be interesting to hear if that still happened
> with changes in place equivalent to the ones described in issue.
>
> Cheers,
> Henry
>
> PS. At least on Windows, independently of the above, if I manually set
> maxExternalObjects to > 4095 (Ie its real size is 8192), I inevitably run
> into
> "CreateThread() failed (8) - Not enough storage is available to process this
> command" errors in the output console if I run the
> XTSocketReadingWritingTest...
>
>
Is there another hard limit? Like the size of a VM table for socket/file handles?


--
Best regards,
Igor Stasenko AKA sig.

Re: Socket's readSemaphore is losing signals with Cog on Linux

Henrik Sperre Johansen
 

On Aug 16, 2011, at 4:00 49PM, Igor Stasenko wrote:


> On 16 August 2011 16:30, Henrik Sperre Johansen
> <[hidden email]> wrote:
>
>> On 15.08.2011 23:14, Henrik Sperre Johansen wrote:
>>
>> http://code.google.com/p/pharo/issues/detail?id=4655
>
> Is it already integrated in image(s)?

As I wrote it a couple of hours ago, I doubt it.
One half is integrated in 1.4 (raising an error if you try to allocate so many external objects that the size would have to be adjusted); the other half is in neither 1.3 nor 1.4 (trying a GC to free slots before growing beyond the current max size).

>> Could of course be totally unrelated to the lost signals Levente and others
>> are seeing on Unix, but it'd be interesting to hear if that still happened
>> with changes in place equivalent to the ones described in issue.
>>
>> Cheers,
>> Henry
>
> Is there another hard limit? Like  size of VM table for socket/file handles?

Huh?

File handles don't use external objects at all that I'm aware of.
Didn't even know the VM had a special table for them.

Sockets use 3 semaphores each, registered in the external objects table, and these are not cleaned up until the Sockets are either explicitly destroyed or finalized.
The Xtreams tests used 3 Sockets per test, with no explicit destruction in tearDown, so with 87 tests the default 512-entry externalObjects table filled up rather quickly with no finalization happening.

There is no hard limit in Cog per se, except one really shouldn't be adjusting the maxExternalObjects size after startup, as that CAN lead to lost signals (thoroughly documented in the code).
The absolute maximum size of the table is 64k in the current image format (stored in the image header), as indicated in the comment in the maxExternalObjects: method.
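
A rough back-of-the-envelope check of the numbers above, written as a workspace snippet. It assumes every one of the 87 tests really allocates 3 sockets and that nothing is finalized between tests; the variable names are only illustrative.

```smalltalk
| tests socketsPerTest semaphoresPerSocket tableSize needed |
tests := 87.
socketsPerTest := 3.
semaphoresPerSocket := 3.
tableSize := 512.    "default size of the external objects table"

"Worst case: no socket is destroyed or finalized during the run"
needed := tests * socketsPerTest * semaphoresPerSocket.    "783"
needed > tableSize    "true, so the table overflows before the suite finishes"
```

Under those assumptions the table is exhausted after 57 tests, since 57 * 9 = 513.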

>> PS. At least on Windows, independently of the above, if I manually set
>> maxExternalObjects to > 4095 (Ie its real size is 8192), I inevitably run
>> into
>> "CreateThread() failed (8) - Not enough storage is available to process this
>> command" errors in the output console if I run the
>> XTSocketReadingWritingTest...

Tried replicating this on OSX, and couldn't do it, so I guess it is Windows-specific?

Cheers,
Henry

Re: Socket's readSemaphore is losing signals with Cog on Linux

Igor Stasenko

On 16 August 2011 18:33, Henrik Johansen <[hidden email]> wrote:

>
>
> On Aug 16, 2011, at 4:00 49PM, Igor Stasenko wrote:
>
> On 16 August 2011 16:30, Henrik Sperre Johansen
> <[hidden email]> wrote:
>
> On 15.08.2011 23:14, Henrik Sperre Johansen wrote:
>
> http://code.google.com/p/pharo/issues/detail?id=4655
>
> Is it already integrated in image(s)?
>
> As I wrote it a couple of hours ago, I doubt it.
> One half is integrated in 1.4 (raising errors if you try to allocate enough external objects that you'd have to adjust the size), other half in neither 1.3 nor 1.4. (try a GC to free slots before growing beyond current max size).
>
> Could of course be totally unrelated to the lost signals Levente and others
>
> are seeing on Unix, but it'd be interesting to hear if that still happened
>
> with changes in place equivalent to the ones described in issue.
>
> Cheers,
>
> Henry
>
>
> Is there another hard limit? Like  size of VM table for socket/file handles?
>
> Huh?
> File handles don't use external objects at all that I'm aware of.
> Didn't even know the VM had a special table for them.

Seems it's not for files/sockets.
I remember seeing some plugin that keeps a separate table of
low-level data structures,
while exposing them to the image side as 'handles' (which are simply
indexes into that table).

> Sockets use 3 semaphores registered in the externalobjects table each, not cleaned up until Sockets are either explicity destroyed or finalized.
> Xtreams tests used 3 Sockets per test, with no explicit destruction in tearDown, so with 87 tests the default 512 externalObjects table filled up rather quickly with no finalization happening.
> There is no hard limit in Cog per se, except one really shouldn't be adjusting the maxExternalObjects size after startup, as that CAN lead to lost signals. (thoroughly documented in the code)
> max-max size of the table is 64k in current image format (stored in image header), as indicated in the comment in maxExternalObjects: method.
>
> PS. At least on Windows, independently of the above, if I manually set
>
> maxExternalObjects to > 4095 (Ie its real size is 8192), I inevitably run
>
> into
>
> "CreateThread() failed (8) - Not enough storage is available to process this
>
> command" errors in the output console if I run the
>
> XTSocketReadingWritingTest...
>
> Tried replicating this on OSX, and couldn't do it, so I guess it is Windows-specific?

On Windows, the socket plugin creates a separate OS thread for every new socket
(not sure about Cog, but in Squeak it is like that).

So, if you open too many sockets, you can quickly hit the hard limit
on the number of OS threads :) Which, if I remember correctly, is around
1024.

I think the tests should be rewritten to release resources as soon as they
are no longer needed,
because there's actually little that can be done otherwise: it is more an
OS limitation (and partly a limitation of the socket plugin
implementation).
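
A minimal sketch of what releasing sockets in tearDown could look like. The class name, category, the 'sockets' instance variable and the newSocket helper are all hypothetical and not taken from the Xtreams tests; only Socket class>>newTCP and Socket>>destroy are existing Squeak methods.

```smalltalk
"Hypothetical test case, for illustration only"
TestCase subclass: #SocketCleanupExampleTest
    instanceVariableNames: 'sockets'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Examples-Sockets'

SocketCleanupExampleTest >> setUp
    super setUp.
    sockets := OrderedCollection new

SocketCleanupExampleTest >> newSocket
    "Create a socket and remember it so that tearDown can release it"
    ^sockets add: Socket newTCP

SocketCleanupExampleTest >> tearDown
    "Destroy every socket explicitly instead of waiting for finalization,
    so its external semaphores are released as soon as the test ends"
    sockets do: [:each | each destroy].
    sockets := nil.
    super tearDown
```

With this shape, at most one test's worth of sockets is alive at any time, no matter how many tests the suite contains.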

> Cheers,
> Henry
>



--
Best regards,
Igor Stasenko AKA sig.