SIGSEGV in topaz


SIGSEGV in topaz

Ken Treis
I'm sure this is my fault, one way or another.

We're running 2.4.4.1 on Linux, and we're getting a segfault in our FastCGI handler when we try to access a certain collection of our models. I was able to run it in topazl.slow and attach with gdb, and it gives me the following backtrace:

#0  0x00007eff6b93b21d in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007eff6b93b0bc in sleep () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x0000000000681ad8 in HostSleep (seconds=3, exitIfInterrupted=0)
    at /export/toronto3/users/buildgss/244x-1/src/hostunixmt.c:617
#3  0x00000000006cb230 in waitForDebuggerImpl (sleeptime=129600)
    at /export/toronto3/users/buildgss/244x-1/src/hostdebug.c:70
#4  0x00000000006cb53f in HostCoreDump (explicitlyRequested=0)
    at /export/toronto3/users/buildgss/244x-1/src/hostdebug.c:114
#5  0x00000000006c8421 in sigCoreExit (quitSignaled=0, actualSig=11, info=0x7fffbf7e8c70, context=0x7fffbf7e8b40)
    at /export/toronto3/users/buildgss/244x-1/src/hostunix.c:1136
#6  0x00000000006c8dea in HostFaultHandler (sig=11, info=0x7fffbf7e8c70, context=0x7fffbf7e8b40)
    at /export/toronto3/users/buildgss/244x-1/src/hostunix.c:1572
#7  <signal handler called>
#8  0x00000000004f9ec0 in om::FetchOop (obj=0x7eff61fd6e30, offset=3)
    at /export/toronto3/users/buildgss/244x-1/src/om.c:3661
#9  0x0000000000578429 in IntLpSupPrim32 (iS=0x8c3a00, ARStackPtr=0x7eff67aa1460)
    at /export/toronto3/users/buildgss/244x-1/src/intloopsup.c:791
#10 0x0000000000593a9c in IntLpBCLoop () at intloopam64.m4:1
#11 0x0000000000577267 in IntLpSupControlLoop (iS=0x8c3a00)
    at /export/toronto3/users/buildgss/244x-1/src/intloopsup.c:7071
#12 0x000000000056fab4 in IntSendMsg (iS=0x8c3a00, receiver=0x8c3a38, selector=0x126d4e0, numArgs=0, argsArrayH=0x0, 
    flags=1) at /export/toronto3/users/buildgss/244x-1/src/interp.c:952
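In a backtrace like this one, frames #0 through #6 are the fault handler itself; the frame of interest is the first one below gdb's `<signal handler called>` marker (here, `om::FetchOop`). A minimal sketch of pulling that frame out of saved gdb output follows. The function name `faulting_frame` and the regular expression are illustrative assumptions, not part of GemStone or gdb, and the pattern only covers the frame layout shown in this post.

```python
import re

# Matches the start of a gdb frame line such as
#   "#8  0x00000000004f9ec0 in om::FetchOop (obj=..., offset=3)"
# The address part is optional because gdb omits it for some frames.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(\S+)")


def faulting_frame(backtrace: str):
    """Return (frame_number, function_name) for the first frame listed
    after gdb's '<signal handler called>' marker, or None if the marker
    (or a frame after it) is not present."""
    seen_marker = False
    for line in backtrace.splitlines():
        line = line.strip()
        if "<signal handler called>" in line:
            seen_marker = True
            continue
        match = FRAME_RE.match(line)
        if match and seen_marker:
            return int(match.group(1)), match.group(2)
    return None


# A few frames copied from the backtrace above.
sample = """\
#5  0x00000000006c8421 in sigCoreExit (quitSignaled=0, actualSig=11, info=0x7fffbf7e8c70, context=0x7fffbf7e8b40)
#6  0x00000000006c8dea in HostFaultHandler (sig=11, info=0x7fffbf7e8c70, context=0x7fffbf7e8b40)
#7  <signal handler called>
#8  0x00000000004f9ec0 in om::FetchOop (obj=0x7eff61fd6e30, offset=3)
#9  0x0000000000578429 in IntLpSupPrim32 (iS=0x8c3a00, ARStackPtr=0x7eff67aa1460)
"""

print(faulting_frame(sample))
```

Continuation lines (the `at /export/...` source locations) do not start with `#`, so they are skipped without special handling.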

I suspect this means there's a problem in my extent, but I'm looking for confirmation on that. We moved our application to a new server a couple of days ago, and maybe something went wrong in the copy.

So now I'm wondering:

1. If my suspicions are anywhere near correct
2. Why I can still make Smalltalk full backups without problems
3. What I did wrong to cause this, and why the problem didn't show up until a couple days after the move

The crash is 100% reproducible; I was even able to reproduce on a completely separate stone when I copied the troublesome extent.

Any pointers would be much appreciated.

--
Ken Treis
Miriam Technologies, Inc.


Re: SIGSEGV in topaz

Dale Henrichs
Ken,

The output printed before the C stack trace contains additional information that would help with debugging.

With regard to the corruption, you should run an object audit. If the audit is clean then there is no explicit object corruption. If there are object audit errors we'll go from there...

Dale

----- Original Message -----
| From: "Ken Treis" <[hidden email]>
| To: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Friday, May 11, 2012 4:09:10 PM
| Subject: [GS/SS Beta] SIGSEGV in topaz

Re: SIGSEGV in topaz

Dale Henrichs
A little more info about running the object audit:

  You can check for corruption in your database by
  running an object audit (see section 8.2 of the System
  Administration Guide[1]). If you follow these steps
  you won't need to run the object audit as single
  user (see the docs):

    - expire sessions
    - mfc (markForCollection)
    - reclaimAll
    - object audit

----- Original Message -----
| From: "Dale Henrichs" <[hidden email]>
| To: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Saturday, May 12, 2012 9:39:27 AM
| Subject: Re: [GS/SS Beta] SIGSEGV in topaz

Re: SIGSEGV in topaz

Ken Treis
Hi Dale,

Thanks for the pointers. I was able to use the object audit to find the offending object, which was an OrderedCollection with an invalid pointer:

> Object 14408533249, of class 92673, at 1-based offset 4, references nonexistent object 5192651470943554048


> topaz 1> send @92673 name
> OrderedCollection
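When an audit reports many such lines, it can help to extract the numbers mechanically and look at the dangling reference in hexadecimal. The sketch below is only that: the function name `parse_audit_line` and the regular expression are assumptions based on the single message quoted above (other audit messages may be worded differently), and it makes no attempt to interpret how GemStone encodes OOPs.

```python
import re

# Pattern derived from the one audit message quoted in this post;
# it is an assumption that other "nonexistent object" lines look the same.
AUDIT_RE = re.compile(
    r"Object (\d+), of class (\d+), at 1-based offset (\d+), "
    r"references nonexistent object (\d+)"
)


def parse_audit_line(line: str):
    """Return the referencing object, its class, the offset, and the
    dangling target (decimal and hex), or None if the line does not match."""
    match = AUDIT_RE.search(line)
    if match is None:
        return None
    obj, cls, offset, target = (int(group) for group in match.groups())
    return {
        "object": obj,
        "class": cls,
        "offset": offset,
        "target": target,
        # 64-bit value rendered as 0x followed by 16 hex digits
        "target_hex": f"{target:#018x}",
    }


line = ("Object 14408533249, of class 92673, at 1-based offset 4, "
        "references nonexistent object 5192651470943554048")
report = parse_audit_line(line)
print(report["object"], report["offset"], report["target_hex"])
```

For the line above, the target comes out as 0x4810010040000a00, which is far larger than the referencing object's own identifier.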

I was able to replace element 4 with nil, and now the object audit finds no problems and my gems have stopped crashing.

Now, the bigger question -- is there any way I can tell what went wrong to cause this? I have a backup from May 10 that doesn't have this problem (the OC only has 2 elements), and if I replay transaction logs from that backup through May 11th, the invalid OOP gets added to the OC.  I've looked at logs for that time interval, but my untrained eyes aren't seeing anything out of the ordinary.

The model that contains this OC only adds newly instantiated objects, so it seems like the problem must have happened in the gem or between the gem and the stone. But maybe there are other possibilities: SPC corruption? Hardware problem?


Ken

On May 12, 2012, at 9:41 AM, Dale Henrichs wrote:

> A little more info about running the object audit:
>
>  You can check for corruption in your database by
>  running an object audit (see section 8.2 of the System
>  Administration Guide[1]). If you follow these steps
>  you won't need to run the object audit as single
>  user (see the docs):
>
>    - expire sessions
>    - mfc
>    - reclaimAll
>    - object audit
>

--
Ken Treis
Miriam Technologies, Inc.


Re: SIGSEGV in topaz

Dale Henrichs
Ken,

Good news that you were able to repair the extent after the audit. Even better news that you can reproduce the problem ... that is, if you could supply us with the backup and tranlogs. We're definitely interested in putting this under our microscopes.

If sharing the extent and tranlogs isn't feasible, then we'll see if we can work out another way to characterize the problem.

Dale

----- Original Message -----
| From: "Ken Treis" <[hidden email]>
| To: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Monday, May 14, 2012 12:14:52 PM
| Subject: Re: [GS/SS Beta] SIGSEGV in topaz

Re: SIGSEGV in topaz

Dale Henrichs
Ken,

Thanks for sharing your db and tranlogs. We were able to reproduce the problem by restoring from backup and replaying tranlogs, but the bad oop was actually in the tranlog itself, so the act of restoring was not the root cause.

Allen looked through the object manager bugs that have been fixed since 2.4.4.1 and it looks like there is only one fix that might apply:

  Bug42138 - in-memory GC failure in continuations tests

The bugfix is in 2.4.5.1 and it is probably worth upgrading to 2.4.5.1 on the off chance that this is the root cause.

Dale


----- Original Message -----
| From: "Dale Henrichs" <[hidden email]>
| To: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Monday, May 14, 2012 12:50:58 PM
| Subject: Re: [GS/SS Beta] SIGSEGV in topaz