On 8/13/2011 13:42, Levente Uzonyi wrote:
> Socket's readSemaphore is losing signals with CogVMs on linux. We
> found several cases (RFB, PostgreSQL) when processes are stuck in the
> following method:
>
> Socket >> waitForDataIfClosed: closedBlock
> 	"Wait indefinitely for data to arrive. This method will block until
> 	data is available or the socket is closed."
>
> 	[
> 	(self primSocketReceiveDataAvailable: socketHandle)
> 		ifTrue: [^self].
> 	self isConnected
> 		ifFalse: [^closedBlock value].
> 	self readSemaphore wait ] repeat
>
> When we inspect the contexts, the process is waiting for the
> readSemaphore, but evaluating (self primSocketReceiveDataAvailable:
> socketHandle) yields true. Signaling the readSemaphore makes the
> process run again. As a workaround we replaced #wait with
> #waitTimeoutMSecs: and all our problems disappeared.
>
> The interpreter VM doesn't seem to have this bug, so I guess the bug
> was introduced with the changes of aio.c.

Oh, interesting. We know this problem fairly well and have always worked around it by changing the wait in the above to a "waitTimeoutMSecs: 500", which turns it into a soft busy loop. It would be interesting to see if there's a bug in Cog which causes this. FWIW, here is the relevant portion:

	"Soft 500ms busy loop - to protect against AIO probs; occasionally,
	 VM-level AIO fails to trip the semaphore"
	self readSemaphore waitTimeoutMSecs: 500.

Cheers,
  - Andreas

_,,,^..^,,,_
best, Eliot
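For reference, a sketch of how #waitForDataIfClosed: reads with Andreas's workaround folded in, reconstructed from the two code fragments quoted above rather than taken from the actual Qwaq/Trunk source:

Socket >> waitForDataIfClosed: closedBlock
	"Wait for data to arrive. Instead of blocking indefinitely on the
	 readSemaphore, wake up every 500ms and re-check the socket, so a
	 lost VM-level signal costs at most half a second."

	[
	(self primSocketReceiveDataAvailable: socketHandle)
		ifTrue: [^self].
	self isConnected
		ifFalse: [^closedBlock value].
	"Soft 500ms busy loop - to protect against AIO probs; occasionally,
	 VM-level AIO fails to trip the semaphore"
	self readSemaphore waitTimeoutMSecs: 500 ] repeat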
Hi Eliot,
What sort of reproducibility are you looking for? Is it enough if it happens once every few hours, or do you need something that you can trigger on demand?

Levente
Hi Levente,
On Wed, Mar 23, 2016 at 11:31 AM, Levente Uzonyi <[hidden email]> wrote:
> Hi Eliot,
>
> What sort of reproducibility are you looking for? Is it enough if it
> happens once every few hours or do you need something that you can
> trigger on demand?

I'll take every few hours, but I'd prefer "in under 30 minutes". Getting warm and fuzzy feelings when trying to prove a negative with something that takes hours to run is very difficult. Let's say you have a case which reproduces in 8 hours 50% of the time. To reach 99% confidence level in a fix I'd have to run it for 8 * (50 log: 2) hours without seeing it reproduce, right? That's nearly 2 days; it could take weeks to fix :-(

> Levente

_,,,^..^,,,_
best, Eliot
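A back-of-the-envelope workspace sketch of that arithmetic, assuming independent 8-hour runs that each hit a still-present bug with probability 0.5 (the expression is illustrative and not from the thread itself):

	| confidence runsNeeded |
	confidence := 0.99.
	"After n clean 8-hour runs, the chance of having missed a still-present bug
	 is 0.5 raisedTo: n, so n must be at least the log base 2 of 1 / (1 - confidence)."
	runsNeeded := ((1 / (1 - confidence)) log: 2) ceiling.	"7"
	runsNeeded * 8	"56 hours, on the order of Eliot's 'nearly 2 days'"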
In reply to this post by Eliot Miranda-2
I remember using RFB to connect to a Mac from a Windows machine at work, and that problem would crop up quite regularly, nowhere near an eight-hour wait. I will see if I can give you a more definite way of reproducing it when I am back at my desk tomorrow.
In reply to this post by Eliot Miranda-2
On 23-03-16 19:45, Eliot Miranda wrote:
> I'll take every few hours, but I'd prefer "in under 30 minutes".
> Getting warm and fuzzy feelings when trying to prove a negative with
> something that takes hours to run is very difficult. Let's say you
> have a case which reproduces in 8 hours 50% of the time. To reach 99%
> confidence level in a fix I'd have to run it for 8 * (50 log: 2) hours
> without seeing it reproduce, right? That's nearly 2 days; it could
> take weeks to fix :-(
>
> Levente

Is this something where throwing a lot of hardware at it would speed things up? Run for a few minutes on hundreds of VMs?

Stephan
In reply to this post by Eliot Miranda-2
Hi Eliot,
The snippet below, evaluated from a workspace, triggered the issue in less than a minute for me, three times in a row.
Both processes will halt if #sloppyWaitForDataIfClosed: doesn't return within a second. If you send #dataAvailable to the socket, you'll find that it has data ready to be read, but its readSemaphore has no signal.

Levente

Socket
	compile: 'sloppyWaitForDataIfClosed: closedBlock
	[(socketHandle ~~ nil
		and: [self primSocketReceiveDataAvailable: socketHandle])
			ifTrue: [^self].
	self isConnected
		ifFalse: [^closedBlock value].
	self readSemaphore wait] repeat'
	classified: 'waiting'.

[
	listenerSocket := Socket newTCP.
	listenerSocket listenOn: 0 backlogSize: 4 interface: #[127 0 0 1].
	clientSocket := Socket newTCP.
	clientSocket connectTo: #[127 0 0 1] port: listenerSocket localPort.
	clientSocket waitForConnectionFor: 1.
	self assert: clientSocket isConnected.
	serverSocket := listenerSocket waitForAcceptFor: 1.
	self assert: serverSocket isConnected ]
		ensure: [ listenerSocket destroy ].

serverProcess := [
	| shouldRun buffer bytesReceived waitDuration |
	shouldRun := true.
	buffer := ByteString new: 10.
	waitDuration := 1 second.
	[
		[ serverSocket sloppyWaitForDataIfClosed: [ shouldRun := false ] ]
			valueWithin: waitDuration
			onTimeout: [ self halt ].
		buffer atAllPut: (Character value: 0).
		bytesReceived := serverSocket receiveDataInto: buffer.
		self assert: bytesReceived = 4.
		self assert: (buffer first: 4) = 'PING'.
		serverSocket sendData: 'PONG' ] repeat ] newProcess.

clientProcess := [
	| shouldRun buffer bytesReceived waitDuration |
	shouldRun := true.
	buffer := ByteString new: 10.
	waitDuration := 1 second.
	[
		clientSocket sendData: 'PING'.
		[ clientSocket sloppyWaitForDataIfClosed: [ shouldRun := false ] ]
			valueWithin: waitDuration
			onTimeout: [ self halt ].
		buffer atAllPut: (Character value: 0).
		bytesReceived := clientSocket receiveDataInto: buffer.
		self assert: bytesReceived = 4.
		self assert: (buffer first: 4) = 'PONG' ] repeat ] newProcess.

clientProcess priority: 39; resume.
serverProcess priority: 39; resume.

"Evaluate these after debugging:
clientSocket destroy.
serverSocket destroy."

On Wed, 23 Mar 2016, Eliot Miranda wrote:

> Hi Levente,
>
> On Wed, Mar 23, 2016 at 11:31 AM, Levente Uzonyi <[hidden email]> wrote:
>> Hi Eliot,
>>
>> What sort of reproducibility are you looking for? Is it enough if it happens once every few hours or do you need something that you can trigger on demand?
>
> I'll take every few hours, but I'd prefer "in under 30 minutes". Getting warm and fuzzy feelings when trying to prove a negative with something that takes hours to run is very difficult. Let's say you have
> a case which reproduces in 8 hours 50% of the time. To reach 99% confidence level in a fix I'd have to run it for 8 * (50 log: 2) hours without seeing it reproduce, right? That's nearly 2 days; it could
> take weeks to fix :-(
>
>> Levente
>
> _,,,^..^,,,_
> best, Eliot
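When one of those halts fires, expressions along these lines, evaluated in the debugger against the variables from the snippet above, show the inconsistent state Levente describes: data ready to be read, yet no pending signal on the read semaphore.

	clientSocket dataAvailable.			"true: the socket reports data waiting"
	clientSocket readSemaphore excessSignals.	"0: yet no signal was delivered"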
Hi Levente,
On Thu, Mar 24, 2016 at 2:16 AM, Levente Uzonyi <[hidden email]> wrote:
> Hi Eliot,

Fabulous, thank you! Replace the self halts with e.g.

	self assert: (clientSocket dataAvailable = (clientSocket readSemaphore excessSignals > 0))

and we even have a test. I have work to do tomorrow but hope to be able to debug this soon. I'll add kqueue and epoll support when I fix it.

(Stephan, interesting suggestion to throw hardware at the problem, thank you.)
_,,,^..^,,,_
best, Eliot
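For concreteness, this is roughly what the client-side timeout block from Levente's snippet looks like with Eliot's suggested assertion in place of the halt; the server-side block changes the same way, and the variable names are the ones from the snippet:

	[ clientSocket sloppyWaitForDataIfClosed: [ shouldRun := false ] ]
		valueWithin: waitDuration
		onTimeout: [
			"The invariant that should hold: data is available exactly when the read
			 semaphore carries a signal. The bug shows up as data with zero excess signals."
			self assert: clientSocket dataAvailable = (clientSocket readSemaphore excessSignals > 0) ]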
In reply to this post by Levente Uzonyi
Hi Holger,
The snippet doesn't seem to trigger the issue when strace is attached to the process. But it usually runs into the issue shortly after I stop strace.

Levente

On Thu, 24 Mar 2016, Holger Freyther wrote:

>> On 24 Mar 2016, at 10:16, Levente Uzonyi <[hidden email]> wrote:
>>
>> Hi Eliot,
>>
>> The snippet below, evaluated from a workspace, triggered the issue in less than a minute for me, three times in a row.
>> Both processes will halt if #sloppyWaitForDataIfClosed: doesn't return within a second. If you send #dataAvailable to the socket, you'll find that it has data ready to be read, but its readSemaphore has no signal.
>
> does it happen if you have strace -o output.txt -f on the VM as well?
>
> kind regards
> holger
In reply to this post by Eliot Miranda-2
Hi Holger,
let's collaborate on this. You and I and lots of others want this to be done. But in fixing the bug below my priorities are to
a) deliver a fix promptly, and
b) write the code so that kqueue, epoll or select will be used based on the HAVE_KQUEUE and HAVE_EPOLL defines.

So if I deliver a fix, don't think that I'm against having a proper event-driven or idling VM; I just don't have it in my head yet, and I want the bug fixed and kqueue/epoll available asap.

On Thu, Mar 24, 2016 at 11:42 PM, Holger Freyther <[hidden email]> wrote:
_,,,^..^,,,_
best, Eliot
In reply to this post by Levente Uzonyi
Hi Levente,

the first thing to report is that this isn't a Cog-specific bug. It also reproduces using the trunk VM sources (4.15.3) and a freshly built Linux VM with a 4.6 image.

On Thu, Mar 24, 2016 at 2:16 AM, Levente Uzonyi <[hidden email]> wrote:
> Hi Eliot,

_,,,^..^,,,_
best, Eliot
Hi Eliot,
AFAIK the workaround came from the Qwaq project, and it had been there long before it was added to the Trunk.

Levente