Folks -

Just as a follow-up to this note, I now have proof that we're losing semaphore signals occasionally. What I was able to detect was that when running Forums over a period of 20 hours we lost 2 out of 421355 signals. We'll have the follow-on discussion on vm-dev since I don't think most people here are interested in discussing the possibilities of how this could happen and what to do about it. Please send any follow-ups to vm-dev (and vm-dev only).

Cheers,
- Andreas

Andreas Raab wrote:
> John M McIntosh wrote:
>> Er, so given we don't have a thread-safe signalSemaphoreWithIndex code base (on purpose), I wonder how many signals per second you are doing, and whether you are perhaps overflowing the semaphoresUseBufferA/B table? Assuming you are saying you do the signalSemaphoreWithIndex() and you never see that over in the image?
>
> I cannot prove any of this because it's so unreliable, but I don't think that's the problem. An overflow like you are describing is only possible if you overflow before the VM (not the image!) gets to the next interrupt check. If that were the case (for example because we're spending too much time in some primitive like BitBlt) I believe we'd be seeing this problem more reliably than we do.
>
> Also, the Windows VM actually replaces signalSemaphoreWithIndex with a version that *is* thread-safe in the proxy interface, since this used to be an issue in the past. It is still possible to overflow the semaphores, but not to have two threads competing when signaling (i.e., overwriting entries because threads are executing on different cores).
>
> Perhaps most importantly, the last place where I've seen this happen was in a callback, which means the signaling code was running from the main thread.
> There is of course a possibility that something else entirely goes wrong (random corruption of the semaphore index, for example), but I haven't had the time to investigate this - I was more interested in finding a suitable workaround for the release ;-)
>
> Cheers,
> - Andreas
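[The thread-safe replacement Andreas mentions isn't shown in the thread. A minimal sketch of the idea - serialize appends to the pending-signal buffer behind a mutex so concurrent signalers can't overwrite each other's entries - might look like this. All names (`pendingSignals`, `drainPendingSignals`, the buffer size) are hypothetical, not the actual Windows VM code:]

```c
#include <assert.h>
#include <pthread.h>

#define SIGNAL_BUFFER_SIZE 256

/* Hypothetical stand-in for the VM's pending-signal buffer. */
static int pendingSignals[SIGNAL_BUFFER_SIZE];
static int pendingCount = 0;
static pthread_mutex_t signalLock = PTHREAD_MUTEX_INITIALIZER;

/* Thread-safe signal: only one thread at a time may append, so two
   signalers on different cores can no longer clobber each other's
   entries. Returns 0 on success, -1 if the buffer is full - overflow
   is still possible, but now it is at least detectable. */
int signalSemaphoreWithIndexThreadSafe(int index) {
    int ok = -1;
    pthread_mutex_lock(&signalLock);
    if (pendingCount < SIGNAL_BUFFER_SIZE) {
        pendingSignals[pendingCount++] = index;
        ok = 0;
    }
    pthread_mutex_unlock(&signalLock);
    return ok;
}

/* Interrupt-check side: take the number of pending entries and reset
   the buffer, under the same lock. The real VM would walk the entries
   and signal each semaphore in the image. */
int drainPendingSignals(void) {
    pthread_mutex_lock(&signalLock);
    int n = pendingCount;
    pendingCount = 0;
    pthread_mutex_unlock(&signalLock);
    return n;
}
```

[Note the trade-off: a lock in the signaling path is safe but means a signaler can briefly block; the double-buffer scheme discussed later in the thread tries to avoid that, at the cost of the race John points out.]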
So is this on Windows, or Unix? And how did you measure that? I think you said you had a VM that does proper locking of the queues? Versus the code I wrote.

On 5-May-09, at 5:52 PM, Andreas Raab wrote:
> Folks -
>
> Just as a follow-up to this note, I now have proof that we're losing semaphore signals occasionally. What I was able to detect was that when running Forums over a period of 20 hours we lost 2 out of 421355 signals. We'll have the follow-on discussion on vm-dev since I don't think most people here are interested in discussing the possibilities of how this could happen and what to do about it. Please send any follow-ups to vm-dev (and vm-dev only).
>
> Cheers,
> - Andreas

--
===========================================================================
John M. McIntosh <[hidden email]>   Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================
John M McIntosh wrote:
> So is this on Windows, or Unix?

Windows, Qwaq Forums client.

> And how did you measure that?

Our original issue was that Python apps in Forums would "stop working" after running for hours. The way these apps work is by calling from Forums to Python, with callback facilities that allow Python code to invoke methods inside Forums. When the apps stopped working, I could observe that it happened while a callback was being executed, i.e., from the Python side everything was set up and the VM had entered the interpreter loop again. Except that the Python callback semaphore wasn't signaled.

I then changed that code to use waitTimeout: and counted the number of times the callback semaphore was signaled (i.e., didn't time out) vs. the number of times we had callback data waiting. These numbers should be exactly the same, and they weren't. Since all of this is code that is under our control, I am 100% certain that we've been calling signalSemaphoreWithIndex() and that the signal wasn't delivered to the image. And obviously it's not a common event (2 out of 400k callbacks missed the signal).

> I think you said you had a VM that does proper locking of the queues?

Yes. I don't think that's the problem. Right now my theory is that we're indeed overflowing the VM's semaphore buffer because a Python callout may take a long, long time. What may happen is that over that period of time the (few) sockets generate multiple semaphore signals, which overflows the VM's buffer, and then there is no room left in the buffer when the callback executes. If that's true, then I should be able to recreate the problem by calling an OS-level sleep() function via FFI (i.e., block the main interpreter loop) while performing heavy network activity and see if that overflows the VM's buffer.
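[Andreas's detection technique lives in Smalltalk (waitTimeout: plus two counters), but the check itself - signals posted must equal signals delivered - is easy to express. As a rough single-process analogue in C using a POSIX semaphore (the function name is made up; in the real system the two counts are kept on opposite sides of the VM boundary):]

```c
#include <assert.h>
#include <semaphore.h>

/* Post `expected` signals, then drain the semaphore with non-blocking
   waits, counting how many were actually delivered. In Andreas's test,
   a mismatch between the two counts is the proof of a lost signal. */
int countDeliveredSignals(int expected) {
    sem_t sem;
    sem_init(&sem, 0, 0);
    for (int i = 0; i < expected; i++)
        sem_post(&sem);          /* stands in for signalSemaphoreWithIndex() */
    int delivered = 0;
    while (sem_trywait(&sem) == 0)
        delivered++;             /* stands in for a wait that didn't time out */
    sem_destroy(&sem);
    return delivered;
}
```

[Within one process and one thread nothing can be lost, so the counts match; the point of the sketch is only the accounting scheme, not the failure itself.]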
And if that's indeed the case, then I think there are two actions to take. One is to fix the Windows sockets code to not do that ;-) (i.e., not signal an already-signaled semaphore a gazillion times). The other is to keep track of the number of signals on a particular semaphore instead of keeping an entry in the buffer each time the semaphore is signaled (which would completely solve this type of problem in general).

The next step for me will be to instrument our Python callback facilities to keep track of the time that passes between entering Python and getting back to Forums, and see if that correlates. Plus doing the sleep() test via FFI to see how long this needs to take before we overflow the VM buffer.

Cheers,
- Andreas

> On 5-May-09, at 5:52 PM, Andreas Raab wrote:
>
>> Folks -
>>
>> Just as a follow-up to this note, I now have proof that we're losing semaphore signals occasionally. What I was able to detect was that when running Forums over a period of 20 hours we lost 2 out of 421355 signals. We'll have the follow-on discussion on vm-dev since I don't think most people here are interested in discussing the possibilities of how this could happen and what to do about it. Please send any follow-ups to vm-dev (and vm-dev only).
>>
>> Cheers,
>> - Andreas
>
> --
> ===========================================================================
> John M. McIntosh <[hidden email]>   Twitter: squeaker68882
> Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
> ===========================================================================
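[Andreas's second suggestion - count signals per semaphore instead of queuing one buffer entry per signal - could look roughly like this. A sketch only; the array size and all names are invented, and the real VM would have to integrate this with its interrupt check:]

```c
#include <assert.h>

#define MAX_SEMAPHORES 4096

/* One pending-signal counter per external semaphore index. Repeated
   signals on the same semaphore just bump its counter, so the
   structure can never overflow however often a socket semaphore is
   signaled during a long Python callout. */
static unsigned int pendingSignalCount[MAX_SEMAPHORES];

void signalSemaphoreWithIndexCounted(int index) {
    if (index > 0 && index < MAX_SEMAPHORES)
        pendingSignalCount[index]++;
}

/* Interrupt-check side: return the accumulated count for one index
   and reset it. The VM would loop over all indices and signal each
   semaphore that many times. */
unsigned int takePendingSignals(int index) {
    if (index <= 0 || index >= MAX_SEMAPHORES)
        return 0;
    unsigned int n = pendingSignalCount[index];
    pendingSignalCount[index] = 0;
    return n;
}
```

[The trade-off is that the interrupt check now scans the whole table instead of a short buffer; a real implementation would likely keep a "dirty" list of indices signaled since the last check.]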
On 5-May-09, at 7:09 PM, Andreas Raab wrote:
> Yes. I don't think that's the problem. Right now my theory is that we're indeed overflowing the VM's semaphore buffer because a Python callout may take a long, long time. I think what happens then might be that over the period of time the (few) sockets generate multiple semaphore signals which overflows the VM's buffer and then there is no room left in the buffer when the callback executes.

Well, given the

    if (foo->semaphoresUseBufferA) {
        if (foo->semaphoresToSignalCountA < SemaphoresToSignalSize) {
            foo->semaphoresToSignalCountA += 1;
            foo->semaphoresToSignalA[foo->semaphoresToSignalCountA] = index;
        }
    } else {
        if (foo->semaphoresToSignalCountB < SemaphoresToSignalSize) {
            foo->semaphoresToSignalCountB += 1;
            foo->semaphoresToSignalB[foo->semaphoresToSignalCountB] = index;
        }
    }

you could just code up the else to print diagnostic data when foo->semaphoresToSignalCountA IS equal to SemaphoresToSignalSize.

Also, in checkForInterrupts, at

    foo->semaphoresUseBufferA = !foo->semaphoresUseBufferA;

is there exposure on multi-CPU machines for semaphoresUseBufferA to be NOTed while another thread is evaluating "if (foo->semaphoresUseBufferA)"? Oh, likely there is... Maybe we need a volatile on semaphoresUseBufferA? But I'm not sure that will work, since this is the traditional race condition on a shared memory location between two CPUs.

> If that's true then I should be able to recreate the problem by calling an OS-level sleep() function via FFI (i.e., block the main interpreter loop) while performing heavy network activity and see if that overflows the VM's buffer.
> And if that's indeed the case then I think there are two actions to take: One is to fix the Windows sockets code to not do that ;-) (i.e., not signal an already-signaled semaphore a gazillion times) but also to keep track of the number of signals on a particular semaphore instead of keeping an entry in the buffer each time the semaphore is signaled (which would completely solve this type of problem in general).

--
===========================================================================
John M. McIntosh <[hidden email]>   Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================