Folks -

Just as a follow-up to this note, I now have proof that we're losing semaphore signals occasionally. What I was able to detect was that when running Forums over a period of 20 hours we lost 2 out of 421355 signals. We'll have the follow-on discussion on vm-dev since I don't think most people here are interested in discussing the possibilities of how this could happen and what to do about it. Please send any follow-ups to vm-dev (and vm-dev only).

Cheers,
- Andreas

Andreas Raab wrote:
> John M McIntosh wrote:
>> Er, so given we don't have a thread-safe signalSemaphoreWithIndex code base (on purpose), I wonder how many signals per second you are doing, and whether you are perhaps overflowing the semaphoresUseBufferA/B table? Assuming you are saying you do the signalSemaphoreWithIndex() and you never see that over in the image?
>
> I cannot prove any of this because it's so unreliable, but I don't think that's the problem. An overflow like you are describing is only possible if you overflow before the VM (not the image!) gets to the next interrupt check. If that were the case (for example because we're spending too much time in some primitive like BitBlt) I believe we'd be seeing this problem more reliably than we do.
>
> Also, the Windows VM actually replaces signalSemaphoreWithIndex with a version that *is* thread-safe in the proxy interface, since this used to be an issue in the past. It is still possible to overflow the semaphores, but not to have two threads competing when signaling (i.e., overwriting entries because threads are executing on different cores).
>
> Perhaps most importantly, the last place where I've seen this happen was in a callback, which means the signaling code was running from the main thread.
> There is of course a possibility that something else entirely goes wrong (random corruption of the semaphore index, for example), but I haven't had the time to investigate this - I was more interested in finding a suitable workaround for the release ;-)
>
> Cheers,
> - Andreas
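[The thread-safe replacement Andreas mentions isn't shown in the thread. A minimal sketch of the idea - serialize appends to the pending-signal buffer behind a mutex so concurrent signalers can't overwrite each other's entries - might look like this. All names (`pendingSignals`, `drainPendingSignals`, the buffer size) are hypothetical, not the actual Windows VM code:]

```c
#include <assert.h>
#include <pthread.h>

#define SIGNAL_BUFFER_SIZE 256

/* Hypothetical stand-in for the VM's pending-signal buffer. */
static int pendingSignals[SIGNAL_BUFFER_SIZE];
static int pendingCount = 0;
static pthread_mutex_t signalLock = PTHREAD_MUTEX_INITIALIZER;

/* Thread-safe signal: only one thread at a time may append, so two
   signalers on different cores can no longer clobber each other's
   entries. Returns 0 on success, -1 if the buffer is full - overflow
   is still possible, but now it is at least detectable. */
int signalSemaphoreWithIndexThreadSafe(int index) {
    int ok = -1;
    pthread_mutex_lock(&signalLock);
    if (pendingCount < SIGNAL_BUFFER_SIZE) {
        pendingSignals[pendingCount++] = index;
        ok = 0;
    }
    pthread_mutex_unlock(&signalLock);
    return ok;
}

/* Interrupt-check side: take the number of pending entries and reset
   the buffer, under the same lock. The real VM would walk the entries
   and signal each semaphore in the image. */
int drainPendingSignals(void) {
    pthread_mutex_lock(&signalLock);
    int n = pendingCount;
    pendingCount = 0;
    pthread_mutex_unlock(&signalLock);
    return n;
}
```

[Note the trade-off: a lock in the signaling path is safe but means a signaler can briefly block; the double-buffer scheme discussed later in the thread tries to avoid that, at the cost of the race John points out.]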
So is this on Windows, or Unix? And how did you measure that? I think you said you had a VM that does proper locking of the queues? Versus the code I wrote.

On 5-May-09, at 5:52 PM, Andreas Raab wrote:
> Folks -
>
> Just as a follow-up to this note, I now have proof that we're losing semaphore signals occasionally. What I was able to detect was that when running Forums over a period of 20 hours we lost 2 out of 421355 signals. We'll have the follow-on discussion on vm-dev since I don't think most people here are interested in discussing the possibilities of how this could happen and what to do about it. Please send any follow-ups to vm-dev (and vm-dev only).
>
> Cheers,
> - Andreas

--
===========================================================================
John M. McIntosh <[hidden email]>   Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================
John M McIntosh wrote:
> So is this on Windows, or Unix?

Windows, Qwaq Forums client.

> And how did you measure that?

Our original issue was that Python apps in Forums would "stop working" after running for hours. The way these apps work is by calling from Forums to Python, with callback facilities that allow Python code to invoke methods inside Forums. When the apps stopped working, I could observe that it happened while a callback was being executed, i.e., from the Python side everything was set up and the VM had entered the interpreter loop again. Except that the Python callback semaphore wasn't signaled.

I then changed that code to use waitTimeout: and counted the number of times the callback semaphore was signaled (i.e., didn't time out) vs. the number of times we had callback data waiting. These numbers should be exactly the same, and they weren't. Since all of this is code that is under our control, I am 100% certain that we've been calling signalSemaphoreWithIndex() and that the signal wasn't delivered to the image. And obviously it's not a common event (2 out of 400k callbacks missed the signal).

> I think you said you had a VM that does proper locking of the queues?

Yes. I don't think that's the problem. Right now my theory is that we're indeed overflowing the VM's semaphore buffer because a Python callout may take a long, long time. What may happen is that over that period of time the (few) sockets generate multiple semaphore signals, which overflows the VM's buffer, and then there is no room left in the buffer when the callback executes. If that's true, then I should be able to recreate the problem by calling an OS-level sleep() function via FFI (i.e., block the main interpreter loop) while performing heavy network activity and see if that overflows the VM's buffer.
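[Andreas's detection technique lives in Smalltalk (waitTimeout: plus two counters), but the check itself - signals posted must equal signals delivered - is easy to express. As a rough single-process analogue in C using a POSIX semaphore (the function name is made up; in the real system the two counts are kept on opposite sides of the VM boundary):]

```c
#include <assert.h>
#include <semaphore.h>

/* Post `expected` signals, then drain the semaphore with non-blocking
   waits, counting how many were actually delivered. In Andreas's test,
   a mismatch between the two counts is the proof of a lost signal. */
int countDeliveredSignals(int expected) {
    sem_t sem;
    sem_init(&sem, 0, 0);
    for (int i = 0; i < expected; i++)
        sem_post(&sem);          /* stands in for signalSemaphoreWithIndex() */
    int delivered = 0;
    while (sem_trywait(&sem) == 0)
        delivered++;             /* stands in for a wait that didn't time out */
    sem_destroy(&sem);
    return delivered;
}
```

[Within one process and one thread nothing can be lost, so the counts match; the point of the sketch is only the accounting scheme, not the failure itself.]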
And if that's indeed the case, then I think there are two actions to take. One is to fix the Windows sockets code to not do that ;-) (i.e., not signal an already-signaled semaphore a gazillion times). The other is to keep track of the number of signals on a particular semaphore instead of keeping an entry in the buffer each time the semaphore is signaled (which would completely solve this type of problem in general).

The next step for me will be to instrument our Python callback facilities to keep track of the time that passes between entering Python and getting back to Forums, and see if that correlates. Plus doing the sleep() test via FFI to see how long this needs to take before we overflow the VM buffer.

Cheers,
- Andreas

> On 5-May-09, at 5:52 PM, Andreas Raab wrote:
>
>> Folks -
>>
>> Just as a follow-up to this note, I now have proof that we're losing semaphore signals occasionally. What I was able to detect was that when running Forums over a period of 20 hours we lost 2 out of 421355 signals. We'll have the follow-on discussion on vm-dev since I don't think most people here are interested in discussing the possibilities of how this could happen and what to do about it. Please send any follow-ups to vm-dev (and vm-dev only).
>>
>> Cheers,
>> - Andreas
>
> --
> ===========================================================================
> John M. McIntosh <[hidden email]>   Twitter: squeaker68882
> Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
> ===========================================================================
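[Andreas's second suggestion - count signals per semaphore instead of queuing one buffer entry per signal - could look roughly like this. A sketch only; the array size and all names are invented, and the real VM would have to integrate this with its interrupt check:]

```c
#include <assert.h>

#define MAX_SEMAPHORES 4096

/* One pending-signal counter per external semaphore index. Repeated
   signals on the same semaphore just bump its counter, so the
   structure can never overflow however often a socket semaphore is
   signaled during a long Python callout. */
static unsigned int pendingSignalCount[MAX_SEMAPHORES];

void signalSemaphoreWithIndexCounted(int index) {
    if (index > 0 && index < MAX_SEMAPHORES)
        pendingSignalCount[index]++;
}

/* Interrupt-check side: return the accumulated count for one index
   and reset it. The VM would loop over all indices and signal each
   semaphore that many times. */
unsigned int takePendingSignals(int index) {
    if (index <= 0 || index >= MAX_SEMAPHORES)
        return 0;
    unsigned int n = pendingSignalCount[index];
    pendingSignalCount[index] = 0;
    return n;
}
```

[The trade-off is that the interrupt check now scans the whole table instead of a short buffer; a real implementation would likely keep a "dirty" list of indices signaled since the last check.]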
On 5-May-09, at 7:09 PM, Andreas Raab wrote:
> Yes. I don't think that's the problem. Right now my theory is that we're indeed overflowing the VM's semaphore buffer because a Python callout may take a long, long time. I think what happens then might be that over the period of time the (few) sockets generate multiple semaphore signals which overflows the VM's buffer and then there is no room left in the buffer when the callback executes.

Well, given the

    if (foo->semaphoresUseBufferA) {
        if (foo->semaphoresToSignalCountA < SemaphoresToSignalSize) {
            foo->semaphoresToSignalCountA += 1;
            foo->semaphoresToSignalA[foo->semaphoresToSignalCountA] = index;
        }
    } else {
        if (foo->semaphoresToSignalCountB < SemaphoresToSignalSize) {
            foo->semaphoresToSignalCountB += 1;
            foo->semaphoresToSignalB[foo->semaphoresToSignalCountB] = index;
        }
    }

you could just code up the else to print diagnostic data when foo->semaphoresToSignalCountA IS equal to SemaphoresToSignalSize.

Also, in checkForInterrupts, at

    foo->semaphoresUseBufferA = !foo->semaphoresUseBufferA;

is there exposure on multi-CPU machines for semaphoresUseBufferA to be NOTed while another thread is evaluating "if (foo->semaphoresUseBufferA)"? Oh, likely there is... Maybe we need a volatile on semaphoresUseBufferA? But I'm not sure that will work, since this is the traditional race condition on a shared memory location between two CPUs.

> If that's true then I should be able to recreate the problem by calling an OS-level sleep() function via FFI (i.e., block the main interpreter loop) while performing heavy network activity and see if that overflows the VM's buffer.
> And if that's indeed the case then I think there are two actions to take: One is to fix the Windows sockets code to not do that ;-) (i.e., not signal an already-signaled semaphore a gazillion times) but also to keep track of the number of signals on a particular semaphore instead of keeping an entry in the buffer each time the semaphore is signaled (which would completely solve this type of problem in general).

--
===========================================================================
John M. McIntosh <[hidden email]>   Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================