Hi,
I’m getting deadlocks in recursive calls of #critical: on the Seaside transactionMutex (which is a TransientRecursionLock). The #lockingProcess stored by the mutex is no longer the same object as ‘Processor activeProcess’, though it should be (see below). Hence, the mutex blocks access to the critical section and deadlocks, because the #lockingProcess object no longer exists. It actually seems as if the result of ‘Processor activeProcess’ changes even though we are still in the same thread.

I found this by finally getting a good trace of a long-standing bug that sometimes leaves a Seaside gem blocked outside of a transaction, eventually getting the gem killed off by the stone due to a lostOTView. The typical log sequence found in the Seaside gem log is quoted below. The gem log would contain a log entry for a commit failure, but the object log would not, which means the commit to the object log never happened. Furthermore, the gem is left outside a transaction when that happens, eventually getting the #lostOT signal from the stone. This log is from a 2.4.4.1 stone, but I get the same results on a 3.1.0.6 stone (I have not tried 3.2 at this time).
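For context, the check that a recursion lock performs in #critical: boils down to process identity, roughly like the following simplified sketch (not the actual GemStone/GLASS source; the semaphore instance variable is assumed to be created with one excess signal):

    TransientRecursionLock >> critical: aBlock
        "Simplified sketch: re-entry is granted only when the current process
         is the very same object that already holds the lock."
        Processor activeProcess == lockingProcess
            ifTrue: [ ^ aBlock value ].            "recursive entry by the same process"
        semaphore wait.                            "otherwise wait for the lock"
        lockingProcess := Processor activeProcess.
        ^ aBlock ensure: [
            lockingProcess := nil.
            semaphore signal ]

If ‘Processor activeProcess’ answers a different object than the one stored in lockingProcess, the recursive entry falls through to the wait and the process blocks on itself.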
The exact point where the process gets blocked is the call to #doTransaction: in the #saveLogEntry: method that logs the commit failure. So the gem log got the log entry on file, but the transaction that commits it to the object log is blocked and never proceeds.

Inside #doTransaction:, we first make a recursive call to #critical: on the transactionMutex (‘recursive’ because it happens inside the Seaside request cycle, which is protected by the same mutex). That call should never block, because it is a RecursionLock that allows recursive entry; it blocks nevertheless, because the #lockingProcess object and the ‘Processor activeProcess’ object are no longer the same.

I added some transcript logging at several points in the request handling and in the TransientRecursionLock>>#critical: method that demonstrates what happens:

- the ‘START’ entry prints the label of the activeProcess at the beginning of the request handling, inside the #seasideProcessRequestWithRetry:resultBlock: method.
- the ‘CONTINUE’ entry prints the label of the activeProcess right after the Seaside processing (i.e. right after result := (resultBlock value: aNativeRequest) ).
- the indented parts print the labels of all processes (i.e. Processor allProcesses), to show which processes exist at the time of the log entry just before them.

What we see is that the oop of the activeProcess at START and at CONTINUE is not the same. Right after the ‘Commit Failure’, this process wants to commit to the object log but is blocked by the semaphore, because its activeProcess now has oop 1022696705 whereas it was 1020803073 when it first entered the critical section. So this recursive call is being blocked by itself, because the semaphore thinks it is another process.
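In essence, the tracing boils down to comparing the identity of the active process before and after the Seaside processing. A simplified sketch (the actual tracing also prints the labels of Processor allProcesses; result, resultBlock and aNativeRequest are the temporaries/arguments of the surrounding #seasideProcessRequestWithRetry:resultBlock: method):

    | before after |
    before := Processor activeProcess.
    Transcript show: 'START ', before asOop printString; cr.
    result := resultBlock value: aNativeRequest.
    after := Processor activeProcess.
    Transcript show: 'CONTINUE ', after asOop printString; cr.
    before == after
        ifFalse: [ Transcript show: 'active process changed during request handling'; cr ]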
Mind that this ‘switching’ of Processor activeProcess objects does not always happen, and it only becomes a problem when there is a recursive call on the transactionMutex. These conditions do pop up from time to time in production, though. I just expose them a bit faster by randomly returning false from the doCommitTransaction method, which forces the Seaside code into a few more retries per request, so I can reproduce the problem easily.

The question is, of course, why is this happening? The implementation of TransientRecursionLock>>critical: relies on the same thread always returning the same object for ‘Processor activeProcess’, while it seems it does not.

Hope someone can shed some light on this issue...

greetings,
Johan
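A minimal sketch of the kind of commit-failure hack described above, assuming the normal path simply answers the result of System commitTransaction (FailureCounter is a hypothetical class variable, and a simple counter stands in for a real random source):

    doCommitTransaction
        "Reproduction hack (sketch only): report a spurious commit failure on
         every third call so the retry / commit-failure-logging path is
         exercised far more often than in normal operation."
        FailureCounter := (FailureCounter ifNil: [ 0 ]) + 1.
        FailureCounter \\ 3 = 0 ifTrue: [ ^ false ].
        ^ System commitTransaction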
A bit of follow-up: I suspect this is because of the #call:/#answer logic of Seaside.

When the request processing of Seaside returns, it might have processed a call or answer, which manipulates the call stack and seems to induce a change of the process. At least, when I process requests without call/answer, no such thing happens. Dale, you probably can tell what goes on here :)

Working my way through this… I can imagine a couple of workarounds/fixes, but I would like to pinpoint the exact issue first.

Johan
Sorry for talking in partial mails like this :)
But yes, it’s obvious that the continuation logic is creating new processes and that this is the cause.

I fixed it by setting the #lockingProcess of the mutex back to ‘Processor activeProcess’ when we return from the Seaside request handling, right before we try committing and possibly need to commit object log entries when the commit fails. So, in #seasideProcessRequestWithRetry:resultBlock: :

    ...
    [ result := (resultBlock value: aNativeRequest) ]
        on: (WARetryHttpRequest, SafelyPerformBlockRequiringAbort, DALGsSemanticConflictException)
        do: [ :ex |
            retryRequest := true.
            retryException := ex ].
    self transactionMutex lockingProcess: Processor activeProcess.
    ...

The reason this does not happen too often is the combination of conditions… and the result is nasty… but the fix seems easy ;-)

Johan
Created an issue for better tracking: https://github.com/GsDevKit/Seaside31/issues/71
Johan,
I'm glad that you "talked this out" ... I'm trying to wrap my brain around the "combination of conditions". I think this problem might be partially related to Issue #64 [1] as well ... it involves commit errors while logging commit conflicts, too, which is a common theme.

The part that is mind-boggling to me is how the activeProcess is changed at this point in the stack ... I seem to recall that there should be a process fork at the session level ... hold on a second ... I just looked, and there used to be (in 2.8?) a fork at the session level, and that fork insulated the rest of the handling stack (which is pretty much unchanged since 2.8) from process perturbations caused by continuations ... with the introduction of partial continuations in 3.0(?) the fork was eliminated ... so these days the process can (and apparently does) change all the time during `(resultBlock value: aNativeRequest)`, and as you point out this doesn't matter unless a commit conflict occurs during the handling of a continuation ...

... and your proposed fix looks like a perfectly good solution ... so I will take a look at your bug report where you propose alternate solutions and comment there ...

Dale

[1] https://github.com/GsDevKit/Seaside31/issues/64
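For illustration only, the kind of session-level insulation described above might look roughly like this (a sketch, not the actual 2.8 code; result, resultBlock and aNativeRequest are the surrounding method's temporaries/arguments): the continuation-sensitive work runs in a forked process, so the outer process that holds the transactionMutex never changes identity.

    | done |
    done := Semaphore new.
    [ [ result := resultBlock value: aNativeRequest ]
        ensure: [ done signal ] ] fork.
    done wait.
    "Even if a partial continuation swaps the GsProcess inside the forked block,
     the outer process (and thus the mutex's lockingProcess) is untouched."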
From what I understand, the "root of all evil" is in the WAPartialContinuation class doing calls to GsProcess>>partialContinuationFromLevel:to: and GsProcess>>installPartialContinuation:atLevel:value:. I assume (because there is a primitive call there) that this switches the GsProcess object for the active process. So it never passes through any fork or other process switching in the ProcessorScheduler, from what I can make of it.

Yes, glad I finally found this one… those crashes have been bothering me for years already!!

Johan
Yep, that's my take on it ...
Thanks guys, this helps. Way too deep for me to follow properly, but this has been bothering us. We disabled the object log a long time ago (by overriding addToLog to do nothing), in the hope that we'd get fewer problems.
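A minimal sketch of that workaround, assuming (as in GLASS) that the instance-side #addToLog on ObjectLogEntry is the single place where entries reach the object log:

    addToLog
        "Workaround sketch: turn object-log entries into no-ops so nothing
         ever needs to be committed to the object log."
        ^ self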