On 8/13/2011 13:42, Levente Uzonyi wrote:
> Socket's readSemaphore is losing signals with CogVMs on Linux. We
> found several cases (RFB, PostgreSQL) where processes are stuck in the
> following method:
>
> Socket >> waitForDataIfClosed: closedBlock
>     "Wait indefinitely for data to arrive. This method will block until
>     data is available or the socket is closed."
>
>     [
>     (self primSocketReceiveDataAvailable: socketHandle)
>         ifTrue: [^self].
>     self isConnected
>         ifFalse: [^closedBlock value].
>     self readSemaphore wait ] repeat
>
> When we inspect the contexts, the process is waiting on the
> readSemaphore, but evaluating (self primSocketReceiveDataAvailable:
> socketHandle) yields true. Signaling the readSemaphore makes the
> process runnable again. As a workaround we replaced #wait with
> #waitTimeoutMSecs: and all our problems disappeared.
>
> The interpreter VM doesn't seem to have this bug, so I guess the bug
> was introduced with the changes to aio.c.

Oh, interesting. We know this problem fairly well and have always worked
around it by changing the wait in the above to a "waitTimeoutMSecs: 500",
which turns it into a soft busy loop. It would be interesting to see if
there's a bug in Cog which causes this. FWIW, here is the relevant portion:

    "Soft 500ms busy loop - to protect against AIO probs; occasionally,
    VM-level AIO fails to trip the semaphore"
    self readSemaphore waitTimeoutMSecs: 500.

Cheers,
  - Andreas

_,,,^..^,,,_
best, Eliot
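[Editor's note: combining the two fragments quoted above, the workaround amounts to the following patched method. This is a sketch assembled from the quotes; the exact code shipped in any given image may differ.]

    Socket >> waitForDataIfClosed: closedBlock
        "Wait indefinitely for data to arrive. This method will block until
        data is available or the socket is closed."

        [
        (self primSocketReceiveDataAvailable: socketHandle)
            ifTrue: [^self].
        self isConnected
            ifFalse: [^closedBlock value].
        "Soft 500ms busy loop - to protect against AIO probs; occasionally,
        VM-level AIO fails to trip the semaphore"
        self readSemaphore waitTimeoutMSecs: 500 ] repeat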
Hi Levente,
On Wed, Mar 23, 2016 at 11:31 AM, Levente Uzonyi <[hidden email]> wrote:
> Hi Eliot,
> Levente

I'll take every few hours, but I'd prefer "in under 30 minutes". Getting
warm and fuzzy feelings when trying to prove a negative with something
that takes hours to run is very difficult. Let's say you have a case
which reproduces in 8 hours 50% of the time. To reach a 98% confidence
level in a fix I'd have to run it for 8 * (50 log: 2) hours, about 45
hours, without seeing it reproduce, right? That's nearly 2 days; it
could take weeks to fix :-(

_,,,^..^,,,_
best, Eliot
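[Editor's note: spelled out as workspace arithmetic, a sketch in which the 8-hour run length and 50% reproduction rate are the figures assumed in the message above:]

    "Hours of clean running needed before trusting a fix. Each run that
    fails to reproduce the bug multiplies the residual doubt by the miss
    probability, so for confidence c we need log base (1/miss) of
    (1 / (1 - c)) runs."
    | runHours missProbability confidence runsNeeded |
    runHours := 8.
    missProbability := 0.5.   "chance a single run fails to reproduce"
    confidence := 0.98.
    runsNeeded := (1 / (1 - confidence)) log: (1 / missProbability).
    runHours * runsNeeded   "about 45 hours, i.e. nearly two days"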
> On 24 Mar 2016, at 10:16, Levente Uzonyi <[hidden email]> wrote:
>
> Hi Eliot,
>
> The snippet below, evaluated from a workspace, triggered the issue in
> less than a minute for me, three times in a row. Both processes will
> halt if #sloppyWaitForDataIfClosed: doesn't return within a second. If
> you send #dataAvailable to the socket, you'll find that it has data
> ready to be read, but its readSemaphore has no signal.

Does it happen if you run strace -o output.txt -f on the VM as well?

kind regards
	holger
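[Editor's note: Levente's snippet itself is not preserved in this archive. For concreteness, here is a hypothetical reconstruction of that kind of repro. Only the selector #sloppyWaitForDataIfClosed: and the one-second timeout come from the description above; the socket setup, port, and payload are assumptions.]

    "Hypothetical repro sketch, NOT Levente's original code. A writer
    process floods a local connection; the reader does a sloppy wait on
    the readSemaphore and halts if the wait times out even though data
    is available, i.e. the wakeup signal was lost."
    | listener serverSide clientSide |
    listener := Socket newTCP.
    listener listenOn: 54321 backlogSize: 4.
    clientSide := Socket newTCP.
    clientSide connectTo: NetNameResolver localHostAddress port: 54321.
    serverSide := listener waitForAcceptFor: 5.
    [[serverSide sendData: 'x'] repeat] fork.
    [[(clientSide readSemaphore waitTimeoutMSecs: 1000)
        ifTrue: [clientSide dataAvailable
            ifTrue: [self halt: 'data available, but no signal']].
    clientSide receiveData] repeat] fork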
Hi Levente,
On Thu, Mar 24, 2016 at 2:16 AM, Levente Uzonyi <[hidden email]> wrote:
> Hi Eliot,

Fabulous, thank you! Replace the self halts with e.g.

    self assert: (clientSocket dataAvailable
        = (clientSocket readSemaphore excessSignals > 0))

and we even have a test. I have work to do tomorrow, but I hope to be
able to debug this soon. I'll add kqueue and epoll support when I fix
it. (Stephan, interesting suggestion to throw hardware at the problem,
thank you.)
_,,,^..^,,,_
best, Eliot
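[Editor's note: Eliot's assert turns the repro into a checkable invariant. A minimal sketch of such a test follows, assuming the hypothetical client/server pair from the reconstruction above; the method name, fixture variables, and the 5 ms settling pause are all invented.]

    testReadSemaphoreTracksAvailableData
        "Sketch: whenever data is available to read, the readSemaphore
        should carry at least one pending signal. The short pause gives
        the VM's aio machinery a chance to deliver the signal before we
        check."
        1000 timesRepeat: [
            serverSocket sendData: 'x'.
            (Delay forMilliseconds: 5) wait.
            self assert: (clientSocket dataAvailable
                = (clientSocket readSemaphore excessSignals > 0)).
            clientSocket dataAvailable ifTrue: [clientSocket receiveData]]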
> On 25 Mar 2016, at 00:35, Eliot Miranda <[hidden email]> wrote:
>
> Fabulous, thank you! Replace the self halts with e.g. self assert:
> (clientSocket dataAvailable = (clientSocket readSemaphore
> excessSignals > 0)), and we even have a test. I have work to do
> tomorrow, but I hope to be able to debug this soon. I'll add kqueue
> and epoll support when I fix it.

Dear Eliot,

before going to kqueue, could you have a look at my hack to create a
completely idle VM? Once the VM has moved to kqueue, it seems one cannot
get a SIGIO for the "kqueue descriptor" [1] itself, and I don't know
whether SIGIO keeps working for the individual file descriptors.

	holger

[1] https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_event.c#L1735
Hi Holger,
let's collaborate on this. You and I and lots of others want this to be
done. But in fixing the bug below, my priorities are to

a) deliver a fix promptly, and
b) write the code so that either kqueue, epoll or select will be used,
   based on the HAVE_KQUEUE and HAVE_EPOLL defines.

So if I deliver a fix, don't think that I'm against having a proper
event-driven or idling VM; I just don't have it in my head yet, and I
want the bug fixed and kqueue/epoll available asap.

On Thu, Mar 24, 2016 at 11:42 PM, Holger Freyther <[hidden email]> wrote:
_,,,^..^,,,_
best, Eliot
Hi Levente,

the first thing to report is that this isn't a Cog-specific bug: it also
reproduces using the trunk VM sources (4.15.3) and a freshly built Linux
VM with a 4.6 image.

On Thu, Mar 24, 2016 at 2:16 AM, Levente Uzonyi <[hidden email]> wrote:
> Hi Eliot,

_,,,^..^,,,_
best, Eliot