On 8/13/2011 13:42, Levente Uzonyi wrote:
> Socket's readSemaphore is losing signals with CogVMs on linux. We
> found several cases (RFB, PostgreSQL) when processes are stuck in the
> following method:
>
> Socket >> waitForDataIfClosed: closedBlock
> 	"Wait indefinitely for data to arrive. This method will block until
> 	data is available or the socket is closed."
>
> 	[
> 	(self primSocketReceiveDataAvailable: socketHandle)
> 		ifTrue: [^self].
> 	self isConnected
> 		ifFalse: [^closedBlock value].
> 	self readSemaphore wait ] repeat
>
> When we inspect the contexts, the process is waiting for the
> readSemaphore, but evaluating (self primSocketReceiveDataAvailable:
> socketHandle) yields true. Signaling the readSemaphore makes the
> process runnable again. As a workaround we replaced #wait with
> #waitTimeoutMSecs: and all our problems disappeared.
>
> The interpreter VM doesn't seem to have this bug, so I guess the bug
> was introduced with the changes to aio.c.

Oh, interesting. We know this problem fairly well and have always worked around it by changing the wait in the above to a "waitTimeoutMSecs: 500", which turns it into a soft busy loop. It would be interesting to see if there's a bug in Cog which causes this. FWIW, here is the relevant portion:

	"Soft 500ms busy loop - to protect against AIO probs; occasionally, VM-level AIO fails to trip the semaphore"
	self readSemaphore waitTimeoutMSecs: 500.

Cheers,
  - Andreas

_,,,^..^,,,_
best, Eliot |
Hi Eliot,

What sort of reproducibility are you looking for? Is it enough if it happens once every few hours, or do you need something that you can trigger on demand?

Levente |
Hi Levente,
On Wed, Mar 23, 2016 at 11:31 AM, Levente Uzonyi <[hidden email]> wrote:
> Hi Eliot,

I'll take every few hours, but I'd prefer "in under 30 minutes". Getting warm and fuzzy feelings when trying to prove a negative with something that takes hours to run is very difficult. Let's say you have a case which reproduces in 8 hours 50% of the time. To reach 99% confidence level in a fix I'd have to run it for 8 * (50 log: 2) hours without seeing it reproduce, right? That's nearly 2 days; it could take weeks to fix :-(

> Levente

_,,,^..^,,,_
best, Eliot |
Hi Eliot,

The snippet below, evaluated from a workspace, triggered the issue in less than a minute for me, three times in a row.
Both processes will halt if #sloppyWaitForDataIfClosed: doesn't return within a second. If you send #dataAvailable to the socket, you'll find that it has data ready to be read, but its readSemaphore has no signal.

Levente

Socket compile: 'sloppyWaitForDataIfClosed: closedBlock

	[(socketHandle ~~ nil
		and: [self primSocketReceiveDataAvailable: socketHandle]) ifTrue: [^self].
	self isConnected ifFalse: [^closedBlock value].
	self readSemaphore wait] repeat'
	classified: 'waiting'.

[
	listenerSocket := Socket newTCP.
	listenerSocket listenOn: 0 backlogSize: 4 interface: #[127 0 0 1].
	clientSocket := Socket newTCP.
	clientSocket connectTo: #[127 0 0 1] port: listenerSocket localPort.
	clientSocket waitForConnectionFor: 1.
	self assert: clientSocket isConnected.
	serverSocket := listenerSocket waitForAcceptFor: 1.
	self assert: serverSocket isConnected ]
	ensure: [ listenerSocket destroy ].

serverProcess := [
	| shouldRun buffer bytesReceived waitDuration |
	shouldRun := true.
	buffer := ByteString new: 10.
	waitDuration := 1 second.
	[
		[ serverSocket sloppyWaitForDataIfClosed: [ shouldRun := false ] ]
			valueWithin: waitDuration
			onTimeout: [ self halt ].
		buffer atAllPut: (Character value: 0).
		bytesReceived := serverSocket receiveDataInto: buffer.
		self assert: bytesReceived = 4.
		self assert: (buffer first: 4) = 'PING'.
		serverSocket sendData: 'PONG' ] repeat ] newProcess.
clientProcess := [
	| shouldRun buffer bytesReceived waitDuration |
	shouldRun := true.
	buffer := ByteString new: 10.
	waitDuration := 1 second.
	[
		clientSocket sendData: 'PING'.
		[ clientSocket sloppyWaitForDataIfClosed: [ shouldRun := false ] ]
			valueWithin: waitDuration
			onTimeout: [ self halt ].
		buffer atAllPut: (Character value: 0).
		bytesReceived := clientSocket receiveDataInto: buffer.
		self assert: bytesReceived = 4.
		self assert: (buffer first: 4) = 'PONG' ] repeat ] newProcess.
clientProcess priority: 39; resume.
serverProcess priority: 39; resume.

"Evaluate these after debugging:
clientSocket destroy.
serverSocket destroy."

On Wed, 23 Mar 2016, Eliot Miranda wrote:

> Hi Levente,
>
> On Wed, Mar 23, 2016 at 11:31 AM, Levente Uzonyi <[hidden email]> wrote:
> Hi Eliot,
>
> What sort of reproducibility are you looking for? Is it enough if it happens once every few hours or do you need something that you can trigger on demand?
>
> I'll take every few hours, but I'd prefer "in under 30 minutes". Getting warm and fuzzy feelings when trying to prove a negative with something that takes hours to run is very difficult. Let's say you have a case which reproduces in 8 hours 50% of the time. To reach 99% confidence level in a fix I'd have to run it for 8 * (50 log: 2) hours without seeing it reproduce, right? That's nearly 2 days; it could take weeks to fix :-(
>
> Levente
>
> _,,,^..^,,,_
> best, Eliot |
> On 24 Mar 2016, at 10:16, Levente Uzonyi <[hidden email]> wrote:
>
> Hi Eliot,
>
> The snippet below, evaluated from a workspace, triggered the issue in less than a minute for me, three times in a row.
> Both processes will halt if #sloppyWaitForDataIfClosed: doesn't return within a second. If you send #dataAvailable to the socket, you'll find that it has data ready to be read, but its readSemaphore has no signal.

does it happen if you have strace -o output.txt -f on the VM as well?

kind regards
	holger |
In reply to this post by Levente Uzonyi
Hi Levente,
On Thu, Mar 24, 2016 at 2:16 AM, Levente Uzonyi <[hidden email]> wrote:
> Hi Eliot,

Fabulous, thank you! Replace the self halts with e.g. self assert: (clientSocket dataAvailable = (clientSocket readSemaphore excessSignals > 0)), and we even have a test. I have work to do tomorrow, but I hope to be able to debug this soon. I'll add kqueue and epoll support when I fix it. (Stephan, interesting suggestion to throw hardware at the problem, thank you.)
_,,,^..^,,,_ best, Eliot |
> On 25 Mar 2016, at 00:35, Eliot Miranda <[hidden email]> wrote:
>
> Fabulous, thank you! replace the self halts with e.g. self assert: (clientSocket dataAvailable = (clientSocket readSemaphore excessSignals > 0)), and we even have a test. I have work to do tomorrow but hope I should be able to debug this soon. I'll add kqueue and epoll support when I fix it.

Dear Eliot,

before going to kqueue, could you have a look at my hack to create a completely idle VM? Once the VM has moved to kqueue, it seems one cannot get a SIGIO for the "kqueue descriptor"[1] itself, and I don't know if SIGIO remains working for the individual file descriptors.

holger

[1] https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_event.c#L1735 |
In reply to this post by Holger Freyther
Hi Holger,

The snippet doesn't seem to trigger the issue when strace is attached to the process. But it usually runs into the issue shortly after I stop strace.

Levente

On Thu, 24 Mar 2016, Holger Freyther wrote:

>> On 24 Mar 2016, at 10:16, Levente Uzonyi <[hidden email]> wrote:
>>
>> Hi Eliot,
>>
>> The snippet below, evaluated from a workspace, triggered the issue in less than a minute for me, three times in a row.
>> Both processes will halt if #sloppyWaitForDataIfClosed: doesn't return within a second. If you send #dataAvailable to the socket, you'll find that it has data ready to be read, but its readSemaphore has no signal.
>
> does it happen if you have strace -o output.txt -f on the VM as well?
>
> kind regards
> holger |
In reply to this post by Holger Freyther
Hi Holger,
let's collaborate on this. You and I and lots of others want this to be done. But in fixing the bug below my priorities are to

a) deliver a fix promptly, and
b) write the code so that either kqueue or epoll or select will be used, based on the HAVE_KQUEUE & HAVE_EPOLL defines.

So if I deliver a fix, don't think that I'm against having a proper event-driven or idling VM; I just don't have it in my head yet, and I want the bug fixed and kqueue/epoll available asap.

On Thu, Mar 24, 2016 at 11:42 PM, Holger Freyther <[hidden email]> wrote:
_,,,^..^,,,_ best, Eliot |
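[Editorial note: point (b), selecting the polling backend at build time via HAVE_KQUEUE & HAVE_EPOLL, might look roughly like the sketch below. This is hypothetical: the real aio.c structure and its function names are for the VM authors to define.]

```c
#include <assert.h>
#include <string.h>

/* Compile-time backend selection driven by the configure-time defines
   HAVE_KQUEUE / HAVE_EPOLL, falling back to select().  Illustrative
   sketch only; aioBackendName is an invented function. */
#if defined(HAVE_KQUEUE)
# include <sys/event.h>
static const char *aioBackendName(void) { return "kqueue"; }
#elif defined(HAVE_EPOLL)
# include <sys/epoll.h>
static const char *aioBackendName(void) { return "epoll"; }
#else
# include <sys/select.h>
static const char *aioBackendName(void) { return "select"; }
#endif
```

A build would pass -DHAVE_EPOLL (Linux) or -DHAVE_KQUEUE (BSD/macOS) from configure; with neither define, the portable select() path remains.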
In reply to this post by Levente Uzonyi
Hi Levente,

the first thing to report is that this isn't a Cog-specific bug. It also reproduces using the trunk VM sources (4.15.3) and a freshly built linux VM with a 4.6 image.

On Thu, Mar 24, 2016 at 2:16 AM, Levente Uzonyi <[hidden email]> wrote:
> Hi Eliot,

_,,,^..^,,,_
best, Eliot |
Hi Eliot,

AFAIK the workaround came from the Qwaq project, and it had been there long before it was added to the Trunk.

Levente |
In reply to this post by Levente Uzonyi
Hi Levente,
On Thu, Mar 24, 2016 at 2:16 AM, Levente Uzonyi <[hidden email]> wrote:
> Hi Eliot,

Thanks very much. In Cog the bug was that the sqAtomicOps.h implementation of atomic increment had been changed to use an intrinsic that operated on 32-bit or 64-bit quantities, but not on the 16-bit quantities that sqExternalSemaphores.c uses to coordinate the multiple-writers/single-reader signalSemaphoreWithIndex scheme. There may still be occasions in which signals are lost, but at least the snippet below runs for several minutes without a problem. Dave, it would be great to know why it fails in the Interpreter VM on unix. The platform subsystem is quite different, so it's odd that we see a similar symptom.

> Socket compile: 'sloppyWaitForDataIfClosed: closedBlock

_,,,^..^,,,_
best, Eliot |
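[Editorial note: one way such a 32-bit/16-bit mismatch can go wrong is that widening the operand makes the increment read and write the whole 32-bit word, so a wrap of the low 16-bit counter carries into its neighbour. A small demonstration; the field names and the packed layout are illustrative, not the actual sqAtomicOps.h/sqExternalSemaphores.c code, and the buggy path assumes a little-endian machine:]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two adjacent 16-bit counters, standing in for the kind of packed
   counters a multiple-writers/single-reader scheme keeps. */
struct counters {
    uint16_t requests;
    uint16_t responses;
};

/* Buggy: an "atomic increment" that only understands 32-bit operands
   reads and writes the entire 32-bit word containing the field.  On a
   little-endian machine, a wrap of the low 16 bits carries into the
   neighbouring counter. */
static void buggyWideIncrement(struct counters *c)
{
    uint32_t wide;
    memcpy(&wide, c, sizeof wide);
    wide += 1;                  /* meant to bump c->requests only */
    memcpy(c, &wide, sizeof wide);
}

/* Correct: a genuinely 16-bit atomic increment (GCC/Clang builtin)
   touches only the intended field. */
static void atomicIncrement16(uint16_t *p)
{
    __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);
}
```

The buggy path also fails to be atomic with respect to concurrent writers of the neighbouring field, which is exactly the kind of coordination sqExternalSemaphores.c relies on.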
On Mon, Mar 28, 2016 at 03:46:22PM -0700, Eliot Miranda wrote:
> Hi Levente,
>
> Thanks very much. In Cog the bug was that the sqAtomicOps.h implementation
> for atomic increment had been changed to use an intrinsic that operated on
> 32-bit or 64-bit quantities, but not on the 16-bit quantities that
> sqExternalSemaphores.c used to coordinate the multiple writers/single
> reader signalSemaphoreWithIndex scheme. There may still be occasions in
> which signals are lost, but at least the below runs for several minutes
> without a problem. Dave, it would be great to know why it fails in the
> Interpreter VM on unix. The platform subsystem is quite different, so its
> odd that we see a similar symptom.

Hi Eliot,

I set up Levente's test with an interpreter VM, and I'll see if I can get any further insight. On my first test, it ran for maybe 5 or 10 minutes before encountering a halt. A second test ran for about 10 minutes or so, and now seems to have locked up my image, but with the VM running normally according to /proc/<pid>/status. I'm not sure what to make of that, but I'll try it a few more times and see if there is any repeatable pattern.

The image that I am using for this is a 4.6 image updated to more or less current trunk level, so I'm not sure how stable it is.

Dave

> Socket compile: 'sloppyWaitForDataIfClosed: closedBlock
> [snip]
>
> --
> _,,,^..^,,,_
> best, Eliot |