Hi -
I'm not sure anyone cares about this (most likely Seaside users), but class Socket has some issues with clock rollover, which can show up on high-load servers with enough uptime and load.

The problem is caused by Socket>>deadlineSecs:, which, if it runs at the time of the clock rollover, computes a deadline that can never be reached. If you've seen connections that seemingly at random would not time out, there is a chance you have been affected by this type of issue.

Bug (and fix) are posted at http://bugs.squeak.org/view.php?id=7343

Cheers,
  - Andreas
A quick inquiry to tell when rollover happens:

{SmallInteger maxVal // (24*3600*1000) -> 'days'.
 SmallInteger maxVal \\ (24*3600*1000) // (3600*1000) -> 'hours'.
 SmallInteger maxVal \\ (3600*1000) // (60*1000) -> 'minutes'.
 SmallInteger maxVal \\ (60*1000) // (1000) -> 'seconds'.
 SmallInteger maxVal \\ (1000) -> 'milliseconds'.}
"=> {12->'days' . 10->'hours' . 15->'minutes' . 41->'seconds' . 823->'milliseconds'}"

In a Pharo image, I see 112 senders of #millisecondClockValue and only 5 of #millisecondsSince:. That leaves room for future rollover-proof improvements... First on the list, there are some senders of #deadlineSecs: in HTTPSocket.

Nicolas

2009/4/28 Andreas Raab <[hidden email]>:
> Hi -
>
> I'm not sure anyone cares about this (most likely Seaside users), but class
> Socket has some issues with clock-rollover which can be seen on high load
> servers that have enough uptime and load.
>
> The problem is caused by Socket>>deadlineSecs: which at the time of the
> clock rollover computes a deadline that can never be reached. If you've seen
> connections that wouldn't time out apparently at random there is a chance
> you have been affected by this type of issue.
>
> Bug (and fix) are posted at http://bugs.squeak.org/view.php?id=7343
>
> Cheers,
> - Andreas
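For readers without an image handy, the same breakdown can be checked outside Smalltalk. A minimal Python sketch, assuming the 32-bit Squeak SmallInteger maxVal of 2^30 - 1:

```python
# Break SmallInteger maxVal (2**30 - 1 on a 32-bit Squeak VM) into
# days/hours/minutes/seconds/milliseconds, mirroring the Smalltalk
# expression above.
MAX_VAL = 2**30 - 1  # 1073741823

ms = MAX_VAL
days, ms = divmod(ms, 24 * 3600 * 1000)
hours, ms = divmod(ms, 3600 * 1000)
minutes, ms = divmod(ms, 60 * 1000)
seconds, millis = divmod(ms, 1000)

print(days, hours, minutes, seconds, millis)
# → 12 10 15 41 823: the millisecond clock wraps after roughly 12.4 days
```

So any server image that stays up for about twelve and a half days will see a rollover of #millisecondClockValue.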
In reply to this post by Andreas.Raab
Hi Andreas,
I wonder why you adopt a soft busy loop like this:

	"Wait in a soft busy-loop (500ms) to avoid msecs-rollover issues"
	self writeSemaphore waitTimeoutMSecs:
		(deadline - Time millisecondClockValue min: 500 max: 0)

Wouldn't this one do the trick, since Delay itself can handle rollover?

	self writeSemaphore waitTimeoutMSecs:
		(msecsDelta - (Time millisecondsSince: startTime) min: msecsDelta max: 0).

Keeping this (deadline := startTime + msecsDelta) is a bad idea, since it can be a LargeInteger, and (deadline - Time millisecondClockValue) will stay > 0 forever in this case. By replacing this piece of code, I think I do not need the rollover protection.

Besides, since startTime := Time millisecondClockValue, (Time millisecondsSince: startTime) is guaranteed >= 0 and <= SmallInteger maxVal. Thus, the protection can be:

	self writeSemaphore waitTimeoutMSecs:
		(msecsDelta - (Time millisecondsSince: startTime) max: 0).

And then, since a process swap should not happen in a forward branch, a msecsEllapsed variable could also be used:

waitForSendDoneFor: timeout
	"Wait up until the given deadline for the current send operation to
	complete. Return true if it completes by the deadline, false if not."

	| startTime msecsDelta msecsEllapsed sendDone |
	startTime := Time millisecondClockValue.
	msecsDelta := (timeout * 1000) truncated.
	[self isConnected & (sendDone := self primSocketSendDone: socketHandle) not
		"Connection end and final data can happen fast, so test in this order"
		and: [(msecsEllapsed := Time millisecondsSince: startTime) < msecsDelta]]
			whileTrue: [
				self writeSemaphore waitTimeoutMSecs: msecsDelta - msecsEllapsed].
	^ sendDone

Unless I missed something, this code looks simpler.

Nicolas

2009/4/28 Andreas Raab <[hidden email]>:
> Hi -
>
> I'm not sure anyone cares about this (most likely Seaside users), but class
> Socket has some issues with clock-rollover which can be seen on high load
> servers that have enough uptime and load.
>
> The problem is caused by Socket>>deadlineSecs: which at the time of the
> clock rollover computes a deadline that can never be reached. If you've seen
> connections that wouldn't time out apparently at random there is a chance
> you have been affected by this type of issue.
>
> Bug (and fix) are posted at http://bugs.squeak.org/view.php?id=7343
>
> Cheers,
> - Andreas
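The failure mode being discussed is easy to demonstrate with modular clock arithmetic. A Python sketch (the 2^30 wrap period and the helper name are assumptions made for illustration, not actual VM code):

```python
WRAP = 2**30  # assumed rollover period of the millisecond clock

def ms_since(start, now):
    """Rollover-safe elapsed milliseconds, in the spirit of
    Time millisecondsSince:."""
    return (now - start) % WRAP

start = WRAP - 100   # clock value 100 ms before the wrap
now = 50             # clock value 150 ms later, after the wrap

# Deadline-based test (broken): the deadline lies beyond the clock's
# maximum value, so the wrapped clock never reaches it and the wait
# never times out.
deadline = start + 5000
print(now >= deadline)       # → False, and stays False forever

# Elapsed-based test (correct): 150 ms have elapsed.
print(ms_since(start, now))  # → 150
```

This is exactly why comparing elapsed time against msecsDelta survives the rollover while comparing the clock against a precomputed deadline does not.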
Nicolas Cellier wrote:
> I wonder why you adopt a soft busy loop like this:
> 	"Wait in a soft busy-loop (500ms) to avoid msecs-rollover issues"
> 	self writeSemaphore waitTimeoutMSecs:
> 		(deadline - Time millisecondClockValue min: 500 max: 0)

Heh. You are asking about my dirty little secrets ;-) The reason for writing the loop that way is that I have (mounting) evidence that at times we may lose signals originating from external semaphores. We have seen situations where external signals apparently weren't delivered even though the condition was clearly set on the underlying socket. This is not only true for sockets; we have seen other situations where external semaphore signals apparently were not delivered. Unfortunately, I have not been able to fully understand this problem or when and why it happens. All I can say at this point is that there is evidence to suggest that at times signals may not be delivered.

The code is written the way you see it purely as a defensive measure against such effects - it had actually been rewritten that way before I fixed the rollover problems, and it just fit in nicely with the rest of it.

> And then, since Process swap should not happen in a forward branch, a
> msecsEllapsed variable could also be used:
>
> waitForSendDoneFor: timeout
> 	"Wait up until the given deadline for the current send operation to
> 	complete. Return true if it completes by the deadline, false if not."
>
> 	| startTime msecsDelta msecsEllapsed sendDone |
> 	startTime := Time millisecondClockValue.
> 	msecsDelta := (timeout * 1000) truncated.
> 	[self isConnected & (sendDone := self primSocketSendDone: socketHandle) not
> 		"Connection end and final data can happen fast, so test in this order"
> 		and: [(msecsEllapsed := Time millisecondsSince: startTime) < msecsDelta]]
> 			whileTrue: [
> 				self writeSemaphore waitTimeoutMSecs: msecsDelta - msecsEllapsed].
> 	^ sendDone
>
> Unless I missed something, this code look simpler.

Yes, that looks indeed much nicer. I don't know if you want to include the upper wait limit - it somewhat depends on whether you think you might be losing signals or not.

Cheers,
  - Andreas
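Andreas's defensive pattern - never block longer than a fixed cap, then re-poll the real condition - can be sketched in Python. The function name and the 500 ms default mirror the discussion; this is an illustration of the idea, not the actual Socket code:

```python
import threading
import time

def wait_capped(condition_met, sem, deadline_ms, cap_ms=500):
    """Wait for condition_met() to become true before deadline_ms
    (a time.monotonic()-based deadline in milliseconds), but never
    block on the semaphore for more than cap_ms at a time. Even if
    a wake-up signal is lost, the loop re-polls the condition at
    least every cap_ms, so it degrades to a soft busy-wait instead
    of hanging forever."""
    while not condition_met():
        remaining = deadline_ms - time.monotonic() * 1000
        if remaining <= 0:
            return False  # deadline passed, condition never held
        sem.acquire(timeout=min(remaining, cap_ms) / 1000)
    return True
```

The price is at most one spurious wake-up per cap interval; the benefit is that a lost semaphore signal can delay progress by at most cap_ms rather than indefinitely.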
On 28-Apr-09, at 6:24 PM, Andreas Raab wrote:

> Nicolas Cellier wrote:
>> I wonder why you adopt a soft busy loop like this:
>> 	"Wait in a soft busy-loop (500ms) to avoid msecs-rollover issues"
>> 	self writeSemaphore waitTimeoutMSecs:
>> 		(deadline - Time millisecondClockValue min: 500 max: 0)
>
> Heh. You are asking about my dirty little secrets ;-) The reason for
> writing the loop that way is that I have (mounting) evidence that at
> times we may lose signals originating from external semaphores. We
> have seen situations where external signals apparently weren't
> delivered even though the condition was clearly set on the
> underlying socket. This is not only true for sockets; we have seen
> other situations where external semaphore signals apparently were
> not delivered.

Er, so given that we don't have a thread-safe signalSemaphoreWithIndex code base (on purpose), I wonder how many signals per second you are doing, and whether you are perhaps overflowing the semaphoresUseBufferA/B table? Assuming you are saying you do the signalSemaphoreWithIndex() and you never see that over in the image?

--
===========================================================================
John M. McIntosh <[hidden email]> Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
===========================================================================
John M McIntosh wrote:
> Er, so given we don't have a thread safe signalSemaphoreWithIndex code
> base (on purpose) I wonder how many signals per second are you doing and
> are you perhaps overflowing the semaphoresUseBufferA/B table? Assuming
> you are saying you do the signalSemaphoreWithIndex() and you never see
> that over in the image?

I cannot prove any of this because it's so unreliable, but I don't think that's the problem. An overflow like you are describing is only possible if you overflow before the VM (not the image!) gets to the next interrupt check. If that were the case (for example because we're spending too much time in some primitive like BitBlt), I believe we'd be seeing this problem more reliably than we do.

Also, the Windows VM actually replaces signalSemaphoreWithIndex with a version that *is* thread-safe in the proxy interface, since this used to be an issue in the past. It is still possible to overflow the semaphore table, but not to lose entries because two threads compete when signaling (i.e., overwriting each other's entries because the threads are executing on different cores).

Perhaps most importantly, the last place where I've seen this happen was in a callback, which means the signaling code was running from the main thread. There is of course a possibility that something else entirely goes wrong (random corruption of the semaphore index, for example), but I haven't had the time to investigate this - I was more interested in finding a suitable workaround for the release ;-)

Cheers,
  - Andreas
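The overwrite hazard under discussion - two signalers snapshotting the same buffer position before either writes - can be simulated deterministically. A Python sketch with a hand-interleaved "race" (the buffer layout and names are assumptions for illustration, not the VM's actual semaphoresUseBufferA/B structures):

```python
# Simulate a non-thread-safe signal buffer: each signaler reads the
# tail index, stores its semaphore index, then publishes a new tail.
# Interleaving two read-modify-write sequences by hand shows how one
# signal is silently overwritten.
buffer = [0] * 8
tail = 0

def signal_racy(sem_index, observed_tail):
    """One half of the race: write using a stale tail snapshot."""
    global tail
    buffer[observed_tail] = sem_index   # store the signal request
    tail = observed_tail + 1            # publish the new tail

# Both "threads" snapshot tail == 0 before either one writes.
snapshot_a = tail
snapshot_b = tail
signal_racy(101, snapshot_a)  # thread A records semaphore 101
signal_racy(202, snapshot_b)  # thread B overwrites slot 0

print(buffer[:tail])  # → [202]: the signal for semaphore 101 is lost
```

With a thread-safe version (an atomic read-and-increment of the tail), each signaler would get a distinct slot and both entries would survive, leaving only genuine table overflow as a loss mode.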
On 28-Apr-09, at 6:46 PM, Andreas Raab wrote:

> John M McIntosh wrote:
>> Er, so given we don't have a thread safe signalSemaphoreWithIndex
>> code base (on purpose) I wonder how many signals per second are you
>> doing and are you perhaps overflowing the semaphoresUseBufferA/B
>> table?
>
> I cannot prove any of this because it's so unreliable but I don't
> think that's the problem. An overflow like you are describing is
> only possible if you overflow before the VM (not the image!) gets to
> the next interrupt check. If that were the case (for example because
> we're spending too much time in some primitive like BitBlt) I
> believe we'd be seeing this problem more reliably than we do.

Ok, well, people are welcome then to look at the semaphoresUseBufferA/B logic (is there an off-by-one error there?) and consider the issues with multi-CPU machines, and whether there are any exposures to losing a signal, versus overflowing the table. Frankly, I wonder about the safety of foo->semaphoresUseBufferA.

> Also, the Windows VM actually replaces signalSemaphoreWithIndex with
> a version that *is* thread-safe in the proxy interface since this
> used to be an issue in the past. It is still possible to overflow
> the semaphores but not that you're competing between two threads
> when signaling (i.e., overwriting entries because threads are
> executing on different cores).

Er, maybe you could buy those Windows boxes and accidentally run Linux or FreeBSD on them, and pretend they are Windows machines. I mean, if they are stuffed away in some ISP's rack/cage, who would know what they run anyway? However, you have to prove you are not losing interrupts from somewhere in the bowels of the Windows socket code, heh?

> Perhaps most importantly, the last place where I've seen this happen
> was in a callback which means the signaling code was running from
> the main thread. There is of course a possibility something
> completely else goes wrong (random corruption of the semaphore index
> for example) but I haven't had the time to investigate this.

No doubt you then have to follow the trail from synchronousSignal() and confirm in your mind that it does reach the Smalltalk object...

--
John M. McIntosh <[hidden email]> Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
In reply to this post by Andreas.Raab
Folks -
Just as a follow-up to this note: I now have proof that we're losing semaphore signals occasionally. Running the forums over a period of 20 hours, I was able to detect that we lost 2 out of 421355 signals.

We'll have the follow-on discussion on vm-dev, since I don't think most people here are interested in discussing the possibilities of how this could happen and what to do about it. Please send any follow-ups to vm-dev (and vm-dev only).

Cheers,
  - Andreas

Andreas Raab wrote:
> John M McIntosh wrote:
>> Er, so given we don't have a thread safe signalSemaphoreWithIndex code
>> base (on purpose) I wonder how many signals per second are you doing
>> and are you perhaps overflowing the semaphoresUseBufferA/B table?
>
> I cannot prove any of this because it's so unreliable but I don't think
> that's the problem. An overflow like you are describing is only possible
> if you overflow before the VM (not the image!) gets to the next
> interrupt check. If that were the case (for example because we're
> spending too much time in some primitive like BitBlt) I believe we'd be
> seeing this problem more reliably than we do.
>
> Also, the Windows VM actually replaces signalSemaphoreWithIndex with a
> version that *is* thread-safe in the proxy interface since this used to
> be an issue in the past.
>
> Perhaps most importantly, the last place where I've seen this happen was
> in a callback which means the signaling code was running from the main
> thread. There is of course a possibility something completely else goes
> wrong (random corruption of the semaphore index for example) but I
> haven't had the time to investigate this - I was more interested in
> finding a suitable workaround for the release ;-)