Issue 223 in glassdb: "A fatal network protocol error occurred on the Gem to Stone network" [2.4.4.1]

glassdb
Status: Accepted
Owner: [hidden email]
Labels: Type-Defect Priority-Medium GLASS-Server Version-1.0-beta.8.1  
Milestone-1.0-beta.8.6

New issue 223 by [hidden email]: "A fatal network protocol error  
occurred on the Gem to Stone network" [2.4.4.1]
http://code.google.com/p/glassdb/issues/detail?id=223

Johan Brichau and I have seen this error on a Mac.

In my case I saw the error while doing a seaside30 load into a virgin
2.4.4.1 extent on both Mac and Linux.

The Linux error is tied to Issue 222: when I start the stone with the
-s option on Linux, the protocol error goes away.

Unfortunately, the -s option for the stone doesn't fix the Mac-based  
problem ... I am still characterizing the problem.

glassdb

Comment #1 on issue 223 by [hidden email]: "A fatal network protocol  
error occurred on the Gem to Stone network" [2.4.4.1]
http://code.google.com/p/glassdb/issues/detail?id=223

By setting:

GEM_HALT_ON_ERROR=4034;

in system.conf (or gem.conf), the gem stops when error 4034 occurs, so one
can attach gdb and get a stack (on the Mac). On Linux the stack is dumped to
the log file.

Here is the C stack for Linux:


Thread 1 (Thread 0x7f555ccba720 (LWP 28086)):
#0  0x00007f555c8ad48d in waitpid () from /lib/libpthread.so.0
#1  0x00007f555b5ff332 in forkAndWait (
#2  0x00007f555b5ff4fc in HostPrintCStack ()
#3  0x00007f555b635955 in HostPrintCStack_notifyStone() ()
#4  0x00007f555b5ff855 in HostCallDebuggerMsg (
#5  0x00007f555b5ff8eb in HostCallDebuggerMsg_fl (
#6  0x00007f555b4886a1 in GemSupErr(char const*, unsigned long, int, int, ...)
#7  0x00007f555b61cc1f in StnCallProcessMessage(RepWorkspaceType*) ()
#8  0x00007f555b61d142 in StnCallConverseWithStn_(RepWorkspaceType*, GscMsgBufRefSType*) ()
#9  0x00007f555b50f44a in StnCallUpdateSharedCounter(WorkEntrySType*, int, int, long) ()
#10 0x00007f555b5951a6 in SysPrimUpdatePersistentCounter(IntStateSType*, omObjSType**) () at intloopamd64.asmm4:6258
#11 0x00007f555b54c49c in IntLpBCLoop () at intloopam64.m4:1
#12 0x00007f555b53f829 in IntLpSupControlLoop(IntStateSType*) ()
#13 0x00007f555b53237d in IntContinue(IntStateSType*, omObjSType**, omObjSType**, int, unsigned int, GciErrSType*) ()
#14 0x00007f555b47deb5 in GemDoContinue(unsigned long, unsigned long*, unsigned long, int, GciErrSType*) ()
#15 0x00007f555b4556e6 in dispatchLoop(IntStateSType*, LgcStateSType*, RpcStateSType*) ()
#16 0x00007f555b457497 in GemDoRpcLoop(IntStateSType*, LgcStateSType*, IntGciActStateSType*) ()
#17 0x00007f555b45241e in doConnect() ()
#18 0x00007f555b452fea in dispatchCommand() ()
#19 0x00007f555b45365f in Gdbg(char const*, int) ()
#20 0x00007f555b45818f in gemMain ()
#21 0x00000000004015d7 in main (argc=4, argv=0x7fff60efced8)

As mentioned, starting the stone on Linux with the -s flag appears to avoid  
the problem.

glassdb

Comment #2 on issue 223 by [hidden email]: "A fatal network protocol  
error occurred on the Gem to Stone network" [2.4.4.1]
http://code.google.com/p/glassdb/issues/detail?id=223

Here's the stack from Johan's Mac:


Attaching to process 13344.
Reading symbols for shared libraries . done
Reading symbols for shared libraries ....warning: Could not find object file "/export/orpheus2/users/buildgss/gs64/244x-1/fast42/omdebug.o" - no debug information available for "/export/orpheus2/users/buildgss/gs64/244x-1/src/omdebug.c".

warning: Could not find object file "/export/orpheus2/users/buildgss/gs64/244x-1/fast42/hostdebugmt.o" - no debug information available for "/export/orpheus2/users/buildgss/gs64/244x-1/src/hostdebugmt.c".

warning: Could not find object file "/export/orpheus2/users/buildgss/gs64/244x-1/fast42/hostdebug.o" - no debug information available for "/export/orpheus2/users/buildgss/gs64/244x-1/src/hostdebug.c".

. done
0x00007fff835e8fca in __semwait_signal ()
(gdb) bt
#0  0x00007fff835e8fca in __semwait_signal ()
#1  0x00007fff8360fd56 in pthread_join ()
#2  0x00000001003fc3a5 in RepOobThreadType::joinThread ()
#3  0x00000001003feaeb in StnCallLogout_ ()
#4  0x00000001002eb68b in StnCallReportFatalError ()
#5  0x0000000100417c76 in HostPrintCStack_notifyStone ()
#6  0x00000001003e3afe in HostCallDebuggerMsg ()
#7  0x00000001003e3b72 in HostCallDebuggerMsg_fl ()
#8  0x0000000100258854 in GemSupErr ()
#9  0x00000001003fed41 in StnCallProcessMessage ()
#10 0x0000000100319e23 in IntLpSupProcessInterruptFlag ()
#11 0x0000000100335444 in .LuncacheCheck_Interrupts2 ()
#12 0x00000001003207f8 in IntLpSupControlLoop ()
#13 0x0000000100314c8d in IntExecuteMethod ()
#14 0x000000010024ea82 in executeFromCtx ()
#15 0x000000010024ec60 in GemDoExecuteFromContext ()
#16 0x0000000100204db6 in gciCall5Args ()
#17 0x0000000100206348 in GciExecuteFromContextDbg ()
#18 0x000000010000c759 in TpAuxDispatchCmd ()
#19 0x0000000100005969 in processCommand ()
#20 0x00000001000077d8 in topazMain ()
#21 0x0000000100001424 in start ()
(gdb)

Note that, unlike the Linux stack, there is no SysPrimUpdatePersistentCounter()
function on this stack, so it appears to be a different problem...

glassdb

Comment #3 on issue 223 by [hidden email]: "A fatal network protocol  
error occurred on the Gem to Stone network" [2.4.4.1]
http://code.google.com/p/glassdb/issues/detail?id=223

Here's the stack from my Mac ... very similar to Johan's Mac stack:

0x00007fff827d01a6 in poll ()
(gdb) where
#0  0x00007fff827d01a6 in poll ()
#1  0x00000001003e50c1 in HostMilliSleep ()
#2  0x00000001003fce03 in RepOobThreadType::requestThrStop ()
#3  0x00000001003fe715 in StnCallLogout_ ()
#4  0x00000001002eb68b in StnCallReportFatalError ()
#5  0x0000000100417c76 in HostPrintCStack_notifyStone ()
#6  0x00000001003e3afe in HostCallDebuggerMsg ()
#7  0x00000001003e3b72 in HostCallDebuggerMsg_fl ()
#8  0x0000000100258854 in GemSupErr ()
#9  0x00000001003fed41 in StnCallProcessMessage ()
#10 0x0000000100319e23 in IntLpSupProcessInterruptFlag ()
#11 0x0000000100335444 in .LuncacheCheck_Interrupts2 ()
#12 0x00000001003207f8 in IntLpSupControlLoop ()
#13 0x0000000100315a4c in IntContinue ()
#14 0x0000000100248f96 in GemDoContinue ()
#15 0x000000010022294b in dispatchLoop ()
#16 0x0000000100224b01 in GemDoRpcLoop ()
#17 0x000000010021edf8 in doConnect ()
#18 0x0000000100220761 in dispatchCommand ()
#19 0x0000000100220d5f in Gdbg ()
#20 0x0000000100225553 in gemMain ()
#21 0x0000000100000c30 in main ()
(gdb)


glassdb
Updates:
        Labels: bugid-40330

Comment #4 on issue 223 by [hidden email]: "A fatal network protocol  
error occurred on the Gem to Stone network" [2.4.4.1]
http://code.google.com/p/glassdb/issues/detail?id=223

This is an instance of GS/64 server bug 40330, introduced in version 2.3.0,  
fixed in versions 2.5 and 3.0 of the server product:

Bug Number:      40330
Bug Comment:     protocol errors from StnCallUpdateSharedCounter
Mail From:       bille
Date:            Thu Feb 10 2011 10:16


BugNote Comment:

Version:          2.3.0, 2.3.1, 2.3.1.1, 2.3.1.2, 2.3.1.4, 2.3.1.5,  
2.3.1.6, 2.3.1.7, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.4.1, 2.4.4.2,  
2.4.4.3, 2.4.4.4

Platform:         All platforms

BugNote Title:    Changing shared counters can trigger gsErrStnNetProtocol error

BugNote Text:

Large-scale applications making heavy use of shared counters may occasionally
hit #gsErrStnNetProtocol errors (error 4034).  Affected methods include:

System>>_updateSharedCounterAt:by:withOpCode:
System>>persistentCounterAt:put:
System>>persistentCounterAt:incrementBy:
System>>persistentCounterAt:decrementBy:

Workaround:

No workaround.
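
Since the BugNote gives an explicit list of affected versions, here is a quick way to check whether a given server version string appears on it. This is only a minimal sketch in Python: the function name is made up for illustration, and the version list is copied verbatim from the BugNote above. It does not query a running stone and it is not a workaround.

```python
# Versions listed as affected in the BugNote for bug 40330
# (copied from the thread; 2.5 and 3.0 are reported as fixed).
AFFECTED_VERSIONS = frozenset({
    "2.3.0", "2.3.1", "2.3.1.1", "2.3.1.2", "2.3.1.4", "2.3.1.5",
    "2.3.1.6", "2.3.1.7", "2.4.0", "2.4.1", "2.4.2", "2.4.3", "2.4.4",
    "2.4.4.1", "2.4.4.2", "2.4.4.3", "2.4.4.4",
})


def is_listed_as_affected(version: str) -> bool:
    """Return True if the version string appears in the BugNote list.

    Hypothetical helper: it does an exact string match only, so a
    version that is not on the list is simply reported as not listed.
    """
    return version.strip() in AFFECTED_VERSIONS


if __name__ == "__main__":
    # The version named in this issue's title.
    print(is_listed_as_affected("2.4.4.1"))
```

An exact match is used rather than a range comparison because the BugNote enumerates versions one by one and skips some numbers (there is no 2.3.1.3 in the list, for example).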