[Glass] Recursion deadlock with TransientRecursionLock

Hi,

I’m getting deadlocks in recursive calls of #critical: on the Seaside transactionMutex (which is a TransientRecursionLock). The #lockingProcess stored by the mutex is no longer the same object as ‘Processor activeProcess’, though it should be (see below).
Hence, the mutex blocks access to the critical section and deadlocks, because the #lockingProcess object no longer exists. It seems as if the object answered by ‘Processor activeProcess’ changes even though we are still in the same thread.

I found this by finally getting a good trace on a long-standing bug that sometimes blocks a Seaside gem outside of a transaction, eventually getting the gem killed off by the stone due to a lostOTView. The typical log sequence found in the Seaside gem log is quoted below. The gem log would contain a log entry for a commit failure, but the object log would not, which means the commit to the object log never happened. Furthermore, the gem is left outside a transaction when that happens, eventually getting the #lostOT signal from the stone. This log is from a 2.4.4.1 stone, but I get the same results with a 3.1.0.6 stone (I have not tried 3.2 yet).

failure
Read-Write Conflicts...
Write-Write Conflicts...
    3856175617
Write-Dependency Conflicts...
Write-ReadLock Conflicts...
Write-WriteLock Conflicts...
Rc-Write-Write Conflicts...
Synchronized-Commit Conflicts...
----------- Commit failure - retrying LOG ENTRY: aSymbolDictionary-----------
-----------------------------------------------------
GemStone: Error         Nonfatal
When running outside of a transaction the session failed to respond
quickly enough to a request from Stone to abort.  All object caches
have been cleared.  --- 
Error Category: 231169 [GemStone] Number: 3031 Arg Count: 0 Context : 7964271617

Now executing the following command saved from "iferr 1":
   where
==> 1 ProcessorScheduler >> _reapEvents: @63 line 65   [GsMethod 10209537]
2 ProcessorScheduler >> _findReadyProcess @2 line 7   [GsMethod 10204417]
3 ProcessorScheduler >> _reschedule @2 line 8   [GsMethod 10207489]
4 GsProcess >> _wait @2 line 3   [GsMethod 4503553]
5 Delay >> wait @23 line 23   [GsMethod 11755009]
6 SimpleBlock in Executed Code @19 line 24   [GsMethod 3879968001]
7 GsProcess >> _startPart2 @15 line 17   [GsMethod 4501249]
8 GsProcess >> _start @1 line 9   [GsMethod 4501761]
  [GsProcess 7964271617]
topaz 1> [268 sz:0 cls: 68097 Boolean] true
topaz 1> 
[Info]: Logging out at 03/27/15 14:47:52 CET

The exact point where the process gets blocked is the call to #doTransaction: in the #saveLogEntry: method that logs the commit failure. So, the gem log got the log entry on file, but the transaction that commits to the object log is blocked and never proceeds.
Inside #doTransaction:, we first make a recursive call to #critical: on the transactionMutex (‘recursive’ because it happens inside the Seaside request cycle, which is protected by the same mutex). This should never block, because the mutex is a recursion lock that allows recursive entry; it does block, however, because the #lockingProcess object and the ‘Processor activeProcess’ object are no longer the same.
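
For reference, the pattern behind a recursion lock’s #critical: is roughly the following (a minimal sketch of the idea, not the actual GemStone source; the lockingProcess and semaphore instance variables, and the semaphore starting out with one available signal, are assumptions):

critical: aBlock
	"Let the process that already holds the lock re-enter; block everyone else.
	 Assumes semaphore starts out with one available signal."
	| active |
	active := Processor activeProcess.
	lockingProcess == active
		ifTrue: [ ^ aBlock value ].	"recursive entry by the same process object"
	semaphore wait.
	lockingProcess := active.
	^ aBlock ensure: [
		lockingProcess := nil.
		semaphore signal ]

If ‘Processor activeProcess’ answers a different object on the inner call, the identity check fails and the process ends up waiting on the semaphore it already holds, which is exactly the deadlock described here.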

I added some transcript logging at several points in the request handling and in the TransientRecursionSemaphore>>#critical: method that demonstrates what happens.

- the ‘START’ entry prints the label of the activeProcess at the beginning of the request handling, inside the #seasideProcessRequestWithRetry:resultBlock: method.
- the ‘CONTINUING’ entry prints the label of the activeProcess right after the Seaside processing (i.e. right after result := (resultBlock value: aNativeRequest) ).
- the indented parts print the labels of all processes (i.e. Processor allProcesses), to show which processes exist at the same time as the log entry just before them. (A rough sketch of this instrumentation follows the list.)
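
Something along these lines (a sketch only; Processor activeProcess, Processor allProcesses and #asOop are standard, but the label formatting and the Transcript destination are my own):

logProcesses: aLabel
	"Print the active process and all known processes with their oops."
	| active |
	active := Processor activeProcess.
	Transcript show: 'active [oop=' , active asOop printString , '] ' , aLabel; cr.
	Processor allProcesses do: [ :each |
		Transcript show: '    [oop=' , each asOop printString , ']'; cr ]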

What we see is that the oop of the activeProcess at START and at CONTINUING is not the same.
Right after the ‘Commit failure’, this process wants to commit to the object log but is blocked by the semaphore, because its activeProcess now has oop 1022696705 while it was 1020803073 when it first entered the critical section. So, this recursive call is blocked by itself, because the semaphore thinks it is another process.

--transcript--'yield for process 1020803073'
--transcript--'(priority=15) active [oop=1020803073] START  /Yesplan?_s=ChMaN_wjn2Vs6KYg&_k=D5lOZWUbexZzdMjo&g=aZVW4S9mhfdwKZH5&277='
--transcript--'(priority=1) waiting on a Delay [oop=1007921665]'
--transcript--'(priority=15) waiting on a Delay [oop=1007922177]'
--transcript--'(priority=15) active [oop=1020803073]'
--transcript--'(priority=15) active [oop=1022696705] CONTINUING  /Yesplan?_s=ChMaN_wjn2Vs6KYg&_k=D5lOZWUbexZzdMjo&g=aZVW4S9mhfdwKZH5&277='
--transcript--'(priority=1) waiting on a Delay [oop=1007921665]'
--transcript--'(priority=15) waiting on a Delay [oop=1007922177]'
--transcript--'(priority=15) active [oop=1022696705]'
----------- Commit failure - retrying LOG ENTRY: aSymbolDictionary-----------
--transcript--'want to commit to object log by 1022696705'
--transcript--'block for process 1022696705 by 1020803073'
--transcript--'(priority=1) waiting on a Delay [oop=1007921665]'
--transcript--'(priority=15) waiting on a Delay [oop=1007922177]'
--transcript--'(priority=15) active [oop=1022696705]'

Mind that this ‘switching’ of the Processor activeProcess object does not always happen, and it only becomes a problem when there is a recursive call on the transactionMutex. However, these conditions do pop up from time to time in production. I just expose them a bit faster by randomly returning false from the doCommitTransaction method, which forces the Seaside code into a few more retries per request, so I can reproduce the problem easily.

The problem is, of course: why is this happening? The implementation of TransientRecursionLock>>critical: relies on the same thread always answering the same object for ‘Processor activeProcess’, while it seems it does not.
Hope someone can shed some light on this issue...

greetings,
Johan

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Re: [Glass] Recursion deadlock with TransientRecursionLock

A bit of follow-up: I suspect this is caused by the #call:/#answer logic of Seaside.
When the request processing of Seaside returns, it might have processed a call or an answer, which manipulates the call stack and seems to induce a change of the process.

At least, when I process requests without call/answer, no such thing happens.

Dale, you can probably tell what goes on here :)
I’m working my way through this… I can imagine a couple of workarounds/fixes, but I would like to pinpoint the exact issue first.

Johan


Re: [Glass] Recursion deadlock with TransientRecursionLock

Sorry for talking in partial mails like this :)
But yes, it’s obvious that the continuation logic is creating new processes and that this is the cause.

I fixed it by setting the #lockingProcess of the mutex back to ‘Processor activeProcess’ when we return from the Seaside request handling, right before we try to commit and possibly need to commit object log entries when the commit fails. So, in #seasideProcessRequestWithRetry:resultBlock:

...
[ result := (resultBlock value: aNativeRequest) ]
	on: (WARetryHttpRequest, SafelyPerformBlockRequiringAbort, DALGsSemanticConflictException)
	do: [ :ex |
		retryRequest := true.
		retryException := ex ].
self transactionMutex lockingProcess: Processor activeProcess.
...
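
This assumes the mutex exposes a #lockingProcess: setter; if it does not, a trivial accessor on the lock class is enough (a sketch, assuming the instance variable is simply named lockingProcess):

lockingProcess: aGsProcess
	"Re-associate the lock with the (possibly new) active process object."
	lockingProcess := aGsProcess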

The reason this does not happen too often is the combination of conditions… and the result is nasty… but the fix seems easy ;-)

Johan

Re: [Glass] Recursion deadlock with TransientRecursionLock

Created an issue for better tracking: https://github.com/GsDevKit/Seaside31/issues/71

Re: [Glass] Recursion deadlock with TransientRecursionLock

Johan,

I'm glad that you "talked this out" ... I'm trying to wrap my brain around the "combination of conditions". I think that this problem might be partially related to Issue #64[1] as well ... it involves commit errors while logging commit conflicts, too, which is a common theme.

The part that is mind-boggling to me is how the activeProcess gets changed at this point in the stack ... I seem to recall that there should be a process fork at the session level

... hold on a second ...

I just looked and there used to be (in 2.8?) a fork at the session level and that fork insulated the rest of the handling stack (which is pretty much unchanged since 2.8)  from process perturbations caused by continuations ... with the introduction of partial continuations that occurred in 3.0(?) the fork was eliminated ... so these days the process can (and apparently does) change all of the time during `(resultBlock value: aNativeRequest)` and as you point out this doesn't matter unless a commit conflict occurs during the handling of a continuation ....
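
Roughly, that kind of insulation amounts to running the handling in a forked process and waiting on it; a minimal sketch (not the historical code; only standard block fork and Semaphore behaviour is assumed, and where exactly it was spliced into the request handling is an assumption):

| result done |
done := Semaphore new.
[ result := resultBlock value: aNativeRequest.
  done signal ] fork.
done wait.
"Continuation capture/installation only ever replaces the forked process, so the
 outer process, the one registered with the transactionMutex, keeps its identity.
 Error handling across the fork is omitted in this sketch."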

... and your proposed fix looks like a perfectly good solution ...

So I will take a look at your bug report where you propose alternate solutions and comment there ....

Dale

[1] https://github.com/GsDevKit/Seaside31/issues/64

Re: [Glass] Recursion deadlock with TransientRecursionLock


On 13 Apr 2015, at 21:53, Dale Henrichs via Glass <[hidden email]> wrote:

I just looked and there used to be (in 2.8?) a fork at the session level and that fork insulated the rest of the handling stack (which is pretty much unchanged since 2.8)  from process perturbations caused by continuations ... with the introduction of partial continuations that occurred in 3.0(?) the fork was eliminated ... so these days the process can (and apparently does) change all of the time during `(resultBlock value: aNativeRequest)` and as you point out this doesn't matter unless a commit conflict occurs during the handling of a continuation ....

From what I understand, the "root of all evil" is in the WAPartialContinuation class, which calls GsProcess>>partialContinuationFromLevel:to: and GsProcess>>installPartialContinuation:atLevel:value:
I assume (because there is a primitive call there) that this switches the GsProcess object for the active process. So it never passes through any fork or other process switching in the ProcessorScheduler, from what I can make of it.
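
A quick way to verify that around any suspect piece of code is a plain identity check (nothing non-standard here apart from where you put it):

| before after |
before := Processor activeProcess.
"... evaluate the code suspected of capturing or installing a partial continuation ..."
after := Processor activeProcess.
before == after ifFalse: [
	Transcript show: 'active process switched: ' ,
		before asOop printString , ' -> ' , after asOop printString; cr ]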

Yes, glad I finally found this one… those crashes have been bothering me for years already!!

Johan

Re: [Glass] Recursion deadlock with TransientRecursionLock

Yep, that's my take on it ...

Re: [Glass] Recursion deadlock with TransientRecursionLock

Thanks guys, this helps. Way too deep for me to follow properly, but this has been bothering us.

We disabled the object log a long time ago, in the hope that we'd get fewer problems (by overriding addToLog to do nothing).
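
For completeness, that switch-off boils down to a no-op override of #addToLog (a sketch; which class implements #addToLog in a given image is an assumption on my part, in GLASS it is typically the object log entry class):

addToLog
	"Disabled: do not record this entry in the persistent object log."
	^ self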
