Hi -
I'm not sure anyone cares about this (most likely Seaside users), but class Socket has some issues with clock rollover, which can show up on high-load servers with enough uptime and load.

The problem is caused by Socket>>deadlineSecs:, which, if it runs at the time of the clock rollover, computes a deadline that can never be reached. If you've seen connections that seemingly at random would not time out, there is a chance you have been affected by this type of issue.

Bug (and fix) are posted at http://bugs.squeak.org/view.php?id=7343

Cheers,
  - Andreas
A quick inquiry to tell when rollover happens:

{SmallInteger maxVal // (24*3600*1000) -> 'days'.
 SmallInteger maxVal \\ (24*3600*1000) // (3600*1000) -> 'hours'.
 SmallInteger maxVal \\ (3600*1000) // (60*1000) -> 'minutes'.
 SmallInteger maxVal \\ (60*1000) // (1000) -> 'seconds'.
 SmallInteger maxVal \\ (1000) -> 'milliseconds'.}
"=> {12->'days' . 10->'hours' . 15->'minutes' . 41->'seconds' . 823->'milliseconds'}"

In a Pharo image, I see 112 senders of #millisecondClockValue and only 5 of #millisecondsSince:. That leaves room for future rollover-proof improvements... First on the list, there are some senders of #deadlineSecs: in HTTPSocket.

Nicolas

2009/4/28 Andreas Raab <[hidden email]>:
> Hi -
>
> I'm not sure anyone cares about this (most likely Seaside users), but class
> Socket has some issues with clock-rollover which can be seen on high load
> servers that have enough uptime and load.
>
> The problem is caused by Socket>>deadlineSecs: which at the time of the
> clock rollover computes a deadline that can never be reached. If you've seen
> connections that wouldn't time out apparently at random there is a chance
> you have been affected by this type of issue.
>
> Bug (and fix) are posted at http://bugs.squeak.org/view.php?id=7343
>
> Cheers,
> - Andreas
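For readers without an image handy, the same breakdown can be checked outside Smalltalk. A minimal Python sketch, assuming the 32-bit Squeak SmallInteger maxVal of 2^30 - 1:

```python
# Break SmallInteger maxVal (2**30 - 1 on a 32-bit Squeak VM) into
# days/hours/minutes/seconds/milliseconds, mirroring the Smalltalk
# expression above.
MAX_VAL = 2**30 - 1  # 1073741823

ms = MAX_VAL
days, ms = divmod(ms, 24 * 3600 * 1000)
hours, ms = divmod(ms, 3600 * 1000)
minutes, ms = divmod(ms, 60 * 1000)
seconds, millis = divmod(ms, 1000)

print(days, hours, minutes, seconds, millis)
# → 12 10 15 41 823: the millisecond clock wraps after roughly 12.4 days
```

So any server image that stays up for about twelve and a half days will see a rollover of #millisecondClockValue.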
In reply to this post by Andreas.Raab
Hi Andreas,
I wonder why you adopt a soft busy loop like this:

	"Wait in a soft busy-loop (500ms) to avoid msecs-rollover issues"
	self writeSemaphore waitTimeoutMSecs:
		(deadline - Time millisecondClockValue min: 500 max: 0)

Wouldn't this one do the trick, since Delay itself can handle rollover?

	self writeSemaphore waitTimeoutMSecs:
		(msecsDelta - (Time millisecondsSince: startTime) min: msecsDelta max: 0).

Keeping this (deadline := startTime + msecsDelta) is a bad idea, since it can be a LargeInteger, and (deadline - Time millisecondClockValue) will stay > 0 forever in this case. By replacing this piece of code, I think I do not need the rollover protection.

Besides, since startTime := Time millisecondClockValue, (Time millisecondsSince: startTime) is guaranteed >= 0 and <= SmallInteger maxVal. Thus, the protection can be:

	self writeSemaphore waitTimeoutMSecs:
		(msecsDelta - (Time millisecondsSince: startTime) max: 0).

And then, since a process swap should not happen in a forward branch, a msecsEllapsed variable could also be used:

waitForSendDoneFor: timeout
	"Wait up until the given deadline for the current send operation to
	complete. Return true if it completes by the deadline, false if not."

	| startTime msecsDelta msecsEllapsed sendDone |
	startTime := Time millisecondClockValue.
	msecsDelta := (timeout * 1000) truncated.
	[self isConnected & (sendDone := self primSocketSendDone: socketHandle) not
		"Connection end and final data can happen fast, so test in this order"
		and: [(msecsEllapsed := Time millisecondsSince: startTime) < msecsDelta]]
			whileTrue: [
				self writeSemaphore waitTimeoutMSecs: msecsDelta - msecsEllapsed].
	^ sendDone

Unless I missed something, this code looks simpler.

Nicolas

2009/4/28 Andreas Raab <[hidden email]>:
> Hi -
>
> I'm not sure anyone cares about this (most likely Seaside users), but class
> Socket has some issues with clock-rollover which can be seen on high load
> servers that have enough uptime and load.
>
> The problem is caused by Socket>>deadlineSecs: which at the time of the
> clock rollover computes a deadline that can never be reached. If you've seen
> connections that wouldn't time out apparently at random there is a chance
> you have been affected by this type of issue.
>
> Bug (and fix) are posted at http://bugs.squeak.org/view.php?id=7343
>
> Cheers,
> - Andreas
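The failure mode being discussed is easy to demonstrate with modular clock arithmetic. A Python sketch (the 2^30 wrap period and the helper name are assumptions made for illustration, not actual VM code):

```python
WRAP = 2**30  # assumed rollover period of the millisecond clock

def ms_since(start, now):
    """Rollover-safe elapsed milliseconds, in the spirit of
    Time millisecondsSince:."""
    return (now - start) % WRAP

start = WRAP - 100   # clock value 100 ms before the wrap
now = 50             # clock value 150 ms later, after the wrap

# Deadline-based test (broken): the deadline lies beyond the clock's
# maximum value, so the wrapped clock never reaches it and the wait
# never times out.
deadline = start + 5000
print(now >= deadline)       # → False, and stays False forever

# Elapsed-based test (correct): 150 ms have elapsed.
print(ms_since(start, now))  # → 150
```

This is exactly why comparing elapsed time against msecsDelta survives the rollover while comparing the clock against a precomputed deadline does not.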
Nicolas Cellier wrote:
> I wonder why you adopt a soft busy loop like this:
> 	"Wait in a soft busy-loop (500ms) to avoid msecs-rollover issues"
> 	self writeSemaphore waitTimeoutMSecs:
> 		(deadline - Time millisecondClockValue min: 500 max: 0)

Heh. You are asking about my dirty little secrets ;-) The reason for writing the loop that way is that I have (mounting) evidence that at times we may lose signals originating from external semaphores. We have seen situations where external signals apparently weren't delivered even though the condition was clearly set on the underlying socket. This is not only true for sockets; we have seen other situations where external semaphore signals apparently were not delivered. Unfortunately, I have not been able to fully understand this problem or when and why it happens. All I can say at this point is that there is evidence to suggest that at times signals may not be delivered.

The code is written the way you see it purely as a defensive measure against such effects - it had actually been rewritten that way before I fixed the rollover problems, and it just fit in nicely with the rest of it.

> And then, since Process swap should not happen in a forward branch, a
> msecsEllapsed variable could also be used:
>
> waitForSendDoneFor: timeout
> 	"Wait up until the given deadline for the current send operation to
> 	complete. Return true if it completes by the deadline, false if not."
>
> 	| startTime msecsDelta msecsEllapsed sendDone |
> 	startTime := Time millisecondClockValue.
> 	msecsDelta := (timeout * 1000) truncated.
> 	[self isConnected & (sendDone := self primSocketSendDone: socketHandle) not
> 		"Connection end and final data can happen fast, so test in this order"
> 		and: [(msecsEllapsed := Time millisecondsSince: startTime) < msecsDelta]]
> 			whileTrue: [
> 				self writeSemaphore waitTimeoutMSecs: msecsDelta - msecsEllapsed].
> 	^ sendDone
>
> Unless I missed something, this code look simpler.

Yes, that looks indeed much nicer. I don't know if you want to include the upper wait limit - it somewhat depends on whether you think you might be losing signals or not.

Cheers,
  - Andreas
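Andreas's defensive pattern - never block longer than a fixed cap, then re-poll the real condition - can be sketched in Python. The function name and the 500 ms default mirror the discussion; this is an illustration of the idea, not the actual Socket code:

```python
import threading
import time

def wait_capped(condition_met, sem, deadline_ms, cap_ms=500):
    """Wait for condition_met() to become true before deadline_ms
    (a time.monotonic()-based deadline in milliseconds), but never
    block on the semaphore for more than cap_ms at a time. Even if
    a wake-up signal is lost, the loop re-polls the condition at
    least every cap_ms, so it degrades to a soft busy-wait instead
    of hanging forever."""
    while not condition_met():
        remaining = deadline_ms - time.monotonic() * 1000
        if remaining <= 0:
            return False  # deadline passed, condition never held
        sem.acquire(timeout=min(remaining, cap_ms) / 1000)
    return True
```

The price is at most one spurious wake-up per cap interval; the benefit is that a lost semaphore signal can delay progress by at most cap_ms rather than indefinitely.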
On 28-Apr-09, at 6:24 PM, Andreas Raab wrote:

> Nicolas Cellier wrote:
>> I wonder why you adopt a soft busy loop like this:
>> 	"Wait in a soft busy-loop (500ms) to avoid msecs-rollover issues"
>> 	self writeSemaphore waitTimeoutMSecs:
>> 		(deadline - Time millisecondClockValue min: 500 max: 0)
>
> Heh. You are asking about my dirty little secrets ;-) The reason for
> writing the loop that way is that I have (mounting) evidence that at
> times we may lose signals originating from external semaphores. We
> have seen situations where external signals apparently weren't
> delivered even though the condition was clearly set on the
> underlying socket. This is not only true for sockets; we have seen
> other situations where external semaphore signals apparently were
> not delivered.

Er, so given that we don't have a thread-safe signalSemaphoreWithIndex code base (on purpose), I wonder how many signals per second you are doing, and whether you are perhaps overflowing the semaphoresUseBufferA/B table? Assuming you are saying you do the signalSemaphoreWithIndex() and you never see that over in the image?

--
===========================================================================
John M. McIntosh <[hidden email]> Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
===========================================================================
John M McIntosh wrote:
> Er, so given we don't have a thread safe signalSemaphoreWithIndex code
> base (on purpose) I wonder how many signals per second are you doing and
> are you perhaps overflowing the semaphoresUseBufferA/B table? Assuming
> you are saying you do the signalSemaphoreWithIndex() and you never see
> that over in the image?

I cannot prove any of this because it's so unreliable, but I don't think that's the problem. An overflow like you are describing is only possible if you overflow before the VM (not the image!) gets to the next interrupt check. If that were the case (for example because we're spending too much time in some primitive like BitBlt), I believe we'd be seeing this problem more reliably than we do.

Also, the Windows VM actually replaces signalSemaphoreWithIndex with a version that *is* thread-safe in the proxy interface, since this used to be an issue in the past. It is still possible to overflow the semaphore table, but not to lose entries because two threads compete when signaling (i.e., overwriting each other's entries because the threads are executing on different cores).

Perhaps most importantly, the last place where I've seen this happen was in a callback, which means the signaling code was running from the main thread. There is of course a possibility that something else entirely goes wrong (random corruption of the semaphore index, for example), but I haven't had the time to investigate this - I was more interested in finding a suitable workaround for the release ;-)

Cheers,
  - Andreas
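The overwrite hazard under discussion - two signalers snapshotting the same buffer position before either writes - can be simulated deterministically. A Python sketch with a hand-interleaved "race" (the buffer layout and names are assumptions for illustration, not the VM's actual semaphoresUseBufferA/B structures):

```python
# Simulate a non-thread-safe signal buffer: each signaler reads the
# tail index, stores its semaphore index, then publishes a new tail.
# Interleaving two read-modify-write sequences by hand shows how one
# signal is silently overwritten.
buffer = [0] * 8
tail = 0

def signal_racy(sem_index, observed_tail):
    """One half of the race: write using a stale tail snapshot."""
    global tail
    buffer[observed_tail] = sem_index   # store the signal request
    tail = observed_tail + 1            # publish the new tail

# Both "threads" snapshot tail == 0 before either one writes.
snapshot_a = tail
snapshot_b = tail
signal_racy(101, snapshot_a)  # thread A records semaphore 101
signal_racy(202, snapshot_b)  # thread B overwrites slot 0

print(buffer[:tail])  # → [202]: the signal for semaphore 101 is lost
```

With a thread-safe version (an atomic read-and-increment of the tail), each signaler would get a distinct slot and both entries would survive, leaving only genuine table overflow as a loss mode.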
On 28-Apr-09, at 6:46 PM, Andreas Raab wrote:

> John M McIntosh wrote:
>> Er, so given we don't have a thread safe signalSemaphoreWithIndex
>> code base (on purpose) I wonder how many signals per second are you
>> doing and are you perhaps overflowing the semaphoresUseBufferA/B
>> table?
>
> I cannot prove any of this because it's so unreliable but I don't
> think that's the problem. An overflow like you are describing is
> only possible if you overflow before the VM (not the image!) gets to
> the next interrupt check. If that were the case (for example because
> we're spending too much time in some primitive like BitBlt) I
> believe we'd be seeing this problem more reliably than we do.

Ok, well, people are welcome then to look at the semaphoresUseBufferA/B logic (is there an off-by-one error there?) and consider the issues with multi-CPU machines, and whether there are any exposures to losing a signal, versus overflowing the table. Frankly, I wonder about the safety of foo->semaphoresUseBufferA.

> Also, the Windows VM actually replaces signalSemaphoreWithIndex with
> a version that *is* thread-safe in the proxy interface since this
> used to be an issue in the past. It is still possible to overflow
> the semaphores but not that you're competing between two threads
> when signaling (i.e., overwriting entries because threads are
> executing on different cores).

Er, maybe you could buy those Windows boxes and accidentally run Linux or FreeBSD on them, and pretend they are Windows machines. I mean, if they are stuffed away in some ISP's rack/cage, who would know what they run anyway? However, you have to prove you are not losing interrupts from somewhere in the bowels of the Windows socket code, heh?

> Perhaps most importantly, the last place where I've seen this happen
> was in a callback which means the signaling code was running from
> the main thread. There is of course a possibility something
> completely else goes wrong (random corruption of the semaphore index
> for example) but I haven't had the time to investigate this.

No doubt you then have to follow the trail from synchronousSignal() and confirm in your mind that it does reach the Smalltalk object...

--
John M. McIntosh <[hidden email]> Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
In reply to this post by Andreas.Raab
Folks -
Just as a follow-up to this note: I now have proof that we're losing semaphore signals occasionally. Running the forums over a period of 20 hours, I was able to detect that we lost 2 out of 421355 signals.

We'll have the follow-on discussion on vm-dev, since I don't think most people here are interested in discussing the possibilities of how this could happen and what to do about it. Please send any follow-ups to vm-dev (and vm-dev only).

Cheers,
  - Andreas

Andreas Raab wrote:
> John M McIntosh wrote:
>> Er, so given we don't have a thread safe signalSemaphoreWithIndex code
>> base (on purpose) I wonder how many signals per second are you doing
>> and are you perhaps overflowing the semaphoresUseBufferA/B table?
>
> I cannot prove any of this because it's so unreliable but I don't think
> that's the problem. An overflow like you are describing is only possible
> if you overflow before the VM (not the image!) gets to the next
> interrupt check. If that were the case (for example because we're
> spending too much time in some primitive like BitBlt) I believe we'd be
> seeing this problem more reliably than we do.
>
> Also, the Windows VM actually replaces signalSemaphoreWithIndex with a
> version that *is* thread-safe in the proxy interface since this used to
> be an issue in the past.
>
> Perhaps most importantly, the last place where I've seen this happen was
> in a callback which means the signaling code was running from the main
> thread. There is of course a possibility something completely else goes
> wrong (random corruption of the semaphore index for example) but I
> haven't had the time to investigate this - I was more interested in
> finding a suitable workaround for the release ;-)