Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

On Tue, Sep 8, 2015 at 4:26 PM, Dale Henrichs <[hidden email]> wrote:

Mariano,

Sorry for the delay, but I'm back in the office today. What we would like to do is capture the args that are being used for the primitive, so replacing the `memOnlyBool` block logic in the listInstances:.... method with the following will help us get them:

Hi Dale, no worries, thanks for pushing!

memOnlyBool
    ifFalse: [
      scanBlk := [ :scanSetThisTime | | ret sKind |
      sKind := (directoryString ifNotNil:[ 2 ] ifNil:[ 0 ]).
      ret := self
        _scanPomWithMaxThreads: maxThreads
        waitForLock: 60
        pageBufSize: 8
        percentCpuActiveLimit: aPercentage
        identSet: scanSetThisTime
        limit: aSmallInt
        scanKind: sKind
        toDirectory: directoryString ].
ret ifNil: [
   Transcript cr; show: '_scanPomWithMaxThreads failure: ',
                maxThreads printString, ' ',
                aPercentage printString, ' ',
                scanSetThisTime printString, ' ',
                aSmallInt printString, ' ',
                sKind printString, ' ',
                directoryString printString ].
    ret ].

This doesn't compile because 'sKind' is declared inside 'scanBlk' and 'scanSetThisTime' is the argument to that closure, but the `ret ifNil:` logging refers to both from outside the block. Since this problem was related to temp vars, I am not sure which fix is the correct one.

Let me know,

We thought the problem might have been related to the method temp reference for `(directoryString ifNotNil:[ 2 ] ifNil:[ 0 ])`, but since the prim is still failing with that expression inlined there must be a different (less obvious) failure mechanism.

Dale

On 09/01/2015 11:45 AM, Mariano Martinez Peck wrote:

OK then. Perfect. Let me know.
Thanks!

On Tue, Sep 1, 2015 at 3:28 PM, Dale Henrichs <[hidden email]> wrote:

On 9/1/15 10:59 AM, Mariano Martinez Peck wrote:

On Tue, Sep 1, 2015 at 2:14 PM, Dale Henrichs <[hidden email]> wrote:

Could you arrange to get a stack trace from your most recent error and a listing of the method that you used ... I want to make sure that we understand the failure mechanism ... if it is related to block temps then it is fixed in 3.2.x, but if it is not related to block temps then it could be present in later versions of GemStone and we'll want to characterize the problem .... Obviously, this particular call doesn't reproduce very frequently (I wasn't able to make it break with trivial examples) so there is likely to be something a little more complex going on ...

Dale, the exception I get is the one I originally shared with you, and you came to the same conclusion as I did.

What I can offer is this: I log the error (continuation) in the object log and then provide you with a web user for our app. From there I can let you open a kind of Seaside debugger/inspector, which will be much richer than a plain string stack, and you can at least print/inspect from there. I cannot send you the extent because it's quite big.

If you think this is OK, I need to ask you to share the login info only with the GemTalk engineer. Since the site is in use (with a working extent), I must recover from backup, so the system will be running with a "broken" extent for a while. That is no problem as long as it is only for a couple of hours, or 1-2 days at most. So if we do this, I would appreciate it if you let me know when you (or the engineer) would be available to take a look.

Let me know if you want this.

Thanks for the offer ... we might want to instrument up the method a bit more instead of looking at a continuation ... so I will get back to you ... I won't be in the office until Thursday, and that's when I will talk things over with the engineer ...

Dale

--

Mariano
http://marianopeck.wordpress.com

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

Just rename the temps to ones that compile:)

This time around we are not suspecting that blockClosures and block temps are the problem. We are just trying to get the args to the primitive call when it fails, so we can trace things further in the C code and try to determine the code path that leads to a nil return value ...

Dale

On 09/08/2015 12:49 PM, Mariano Martinez Peck wrote:

On Tue, Sep 8, 2015 at 4:26 PM, Dale Henrichs <[hidden email]> wrote:

Mariano,

Sorry for the delay, but I'm back in the office today. What we would like to do is capture the args that are being used for the primitive, so replacing the `memOnlyBool` block logic in the listInstances:.... method with the following will help us get them:

Hi Dale, no worries, thanks for pushing!

memOnlyBool
    ifFalse: [
      scanBlk := [ :scanSetThisTime | | ret sKind |
      sKind := (directoryString ifNotNil:[ 2 ] ifNil:[ 0 ]).
      ret := self
        _scanPomWithMaxThreads: maxThreads
        waitForLock: 60
        pageBufSize: 8
        percentCpuActiveLimit: aPercentage
        identSet: scanSetThisTime
        limit: aSmallInt
        scanKind: sKind
        toDirectory: directoryString ].
ret ifNil: [
   Transcript cr; show: '_scanPomWithMaxThreads failure: ',
                maxThreads printString, ' ',
                aPercentage printString, ' ',
                scanSetThisTime printString, ' ',
                aSmallInt printString, ' ',
                sKind printString, ' ',
                directoryString printString ].
    ret ].

This doesn't compile because 'sKind' is declared inside 'scanBlk' and 'scanSetThisTime' is the argument to that closure, but the `ret ifNil:` logging refers to both from outside the block. Since this problem was related to temp vars, I am not sure which fix is the correct one.

Let me know,

We thought the problem might have been related to the method temp reference for `(directoryString ifNotNil:[ 2 ] ifNil:[ 0 ])`, but since the prim is still failing with that expression inlined there must be a different (less obvious) failure mechanism.

Dale


GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

OK Dale, I found out what the problem was: the printing code should have been placed inside the scan block. Anyway, I did that, and it still did not work, because the gem was crashing and so I couldn't see the log from GemTools. I then replaced Transcript show: with "GsFile gciLogServer: " and now I got the log:

--LIST-FAILURE--_scanPomWithMaxThreads failure: 1 95 anIdentitySet( FaSecurityAdjustedClosingPriceRecord) 0 0 nil

That doesn't look wrong, does it?

Cheers,

On Tue, Sep 8, 2015 at 5:03 PM, Dale Henrichs <[hidden email]> wrote:

Just rename the temps to ones that compile:)

This time around we are not suspecting that blockClosures and block temps are the problem. We are just trying to get the args to the primitive call when it fails, so we can trace things further in the C code and try to determine the code path that leads to a nil return value ...

Dale


GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

Thanks Mariano - yeah the args look okay - At this point, I'm suspicious that we're running out of memory during the scan and not failing "gracefully", but no evidence of that quite yet ...

Dale

On 09/08/2015 02:00 PM, Mariano Martinez Peck wrote:

OK Dale, I found out what the problem was: the printing code should have been placed inside the scan block. Anyway, I did that, and it still did not work, because the gem was crashing and so I couldn't see the log from GemTools. I then replaced Transcript show: with "GsFile gciLogServer: " and now I got the log:

--LIST-FAILURE--_scanPomWithMaxThreads failure: 1 95 anIdentitySet( FaSecurityAdjustedClosingPriceRecord) 0 0 nil

That doesn't look wrong, does it?

Cheers,


GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

Mariano,

I just talked with engineering and they concur that this is likely to be a malloc failure and that this area of the code has been substantially reworked in recent releases to attempt to reduce the amount of RAM consumed during list instances ...

So for 3.1.0.6, you might try this operation with more RAM available or perhaps just adding more swap space will allow the malloc to complete ... running statmon with a 1 second interval and looking at the heap consumption of the gem, might show growth and a "sudden decline" when the malloc fails ...

Dale
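The "growth followed by a sudden decline" pattern described above can be checked mechanically once the gem's heap readings are in a list. The following is only a minimal Python sketch under stated assumptions: it assumes the readings were already extracted by hand from the statmon data (it does not read statmon files), and the function name, the 50% threshold, and the sample values are all invented for illustration.

```python
def find_sudden_drops(samples, min_drop_fraction=0.5):
    """Return the indices where a reading fell sharply versus the previous one.

    A drop counts as "sudden" when the value decreases by at least
    min_drop_fraction of the previous reading (an arbitrary, assumed threshold).
    """
    drops = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev > 0 and (prev - cur) / prev >= min_drop_fraction:
            drops.append(i)
    return drops


# Hypothetical per-second heap readings in MB. These are NOT from the thread.
heap_mb = [120, 480, 910, 1500, 2300, 150, 160]

drops = find_sudden_drops(heap_mb)
for i in drops:
    print(f"sample {i}: {heap_mb[i - 1]} MB -> {heap_mb[i]} MB")
```

A reported index marks the second at which the heap collapsed; the readings just before it would be the ones to compare against the memory available on the machine.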

On 09/08/2015 02:51 PM, Dale Henrichs wrote:

Thanks Mariano - yeah the args look okay - At this point, I'm suspicious that we're running out of memory during the scan and not failing "gracefully", but no evidence of that quite yet ...

Dale


GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

On Tue, Sep 8, 2015 at 7:00 PM, Dale Henrichs <[hidden email]> wrote:

Mariano,

I just talked with engineering and they concur that this is likely to be a malloc failure and that this area of the code has been substantially reworked in recent releases to attempt to reduce the amount of RAM consumed during list instances ...

So for 3.1.0.6, you might try this operation with more RAM available or perhaps just adding more swap space will allow the malloc to complete ... running statmon with a 1 second interval and looking at the heap consumption of the gem, might show growth and a "sudden decline" when the malloc fails ...

Hi Dale,

Just for the record, I tried with this scenario:

[marianopeck@quuveserver1 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           8014         388        6850         359         775        7205
Swap:         16639           0       16639

And still didn't work. Note that I have 7GB of RAM free. At the end, when the system crashed, this was the resulting state:

[marianopeck@quuveserver1 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           8014         338        1316         973        6359        6639
Swap:         16639           0       16639

Anyway, no problem, I would assume this is a problem in 3.1.0.6 and hopefully I will never need to list instances / migrate this class until I am in 3.2/3.3...

Thanks for the effort!

Dale


GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

On 09/09/2015 06:24 AM, Mariano Martinez Peck wrote:

>
>
> Hi Dale,
>
> Just for the record, I tried with this scenario:
>
> [marianopeck@quuveserver1 ~]$ free -m
>               total        used        free      shared  buff/cache   available
> Mem:           8014         388        6850         359         775        7205
> Swap:         16639           0       16639
>
> And still didn't work. Note that I have 7GB of RAM free. At the end,
> when the system crashed, this was the resulting state:
>
> [marianopeck@quuveserver1 ~]$ free -m
>               total        used        free      shared  buff/cache   available
> Mem:           8014         338        1316         973        6359        6639
> Swap:         16639           0       16639
>
>
> Anyway, no problem, I would assume this is a problem in 3.1.0.6 and
> hopefully I will never need to list instances / migrate this class
> until I am in 3.2/3.3...
>
> Thanks for the effort!
>

I'm not sure that I can interpret the `free -m` numbers correctly. Are
you confirming that this was a near-out-of-RAM situation?
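
One way to read the two `free -m` snapshots quoted above is to diff them column by column. The following is a minimal sketch in Python, not something used in the thread; the helper name `parse_free` is made up for illustration, and it assumes the procps layout with an `available` column, as shown in the quoted output.

```python
def parse_free(text):
    """Parse `free -m` output into {row_label: {column: MiB}}.

    Assumes the procps layout with a header row that includes 'available'.
    """
    lines = [ln.split() for ln in text.strip().splitlines()]
    header = lines[0]
    rows = {}
    for parts in lines[1:]:
        label = parts[0].rstrip(":")
        values = [int(v) for v in parts[1:]]
        # The Swap row has only three values; zip stops at the shorter list.
        rows[label] = dict(zip(header, values))
    return rows


# The two snapshots quoted in the message above (values in MiB).
BEFORE = """\
total used free shared buff/cache available
Mem: 8014 388 6850 359 775 7205
Swap: 16639 0 16639
"""

AFTER = """\
total used free shared buff/cache available
Mem: 8014 338 1316 973 6359 6639
Swap: 16639 0 16639
"""

before = parse_free(BEFORE)
after = parse_free(AFTER)

# Change in each Mem column between the two snapshots.
mem_delta = {col: after["Mem"][col] - before["Mem"][col] for col in before["Mem"]}
swap_used_delta = after["Swap"]["used"] - before["Swap"]["used"]

for col, change in mem_delta.items():
    print(f"{col:>10}: {change:+d} MiB")
print(f"swap used: {swap_used_delta:+d} MiB")
```

On these numbers, `free` fell by about 5.5 GB while `buff/cache` grew by roughly the same amount, `available` dropped by only 566 MiB, and swap was untouched. Read that way, the snapshots do not by themselves look like a machine that ran out of RAM, although they say nothing about limits inside the gem process.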

We've got an engineer pursuing the "out of memory" scenario and looking
for a smoking gun in the code for 3.1.0.6, so that we can be assured
that we don't have an existing bug in 3.2/3.3...

Thank you for your help in tracking this down ...

Dale

GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

On Wed, Sep 9, 2015 at 3:36 PM, Dale Henrichs <[hidden email]> wrote:

I'm not sure that I can interpret the `free -m` numbers correctly. Are you confirming that this was a near-out-of-RAM situation?

I am saying that I cannot make it work even with 7GB of RAM free/available...and I also have plenty of swap space from what I can tell.

Sounds like this should be plenty of RAM to list 66MM objects (66154585 instances to be accurate). But maybe I am wrong...
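
As a rough back-of-the-envelope check of that intuition, the arithmetic can be sketched in Python. The 8 bytes per reference is an assumption made here for illustration (one 64-bit reference per instance); the thread does not say how much memory list instances actually needs per object.

```python
# Instance count reported in the thread.
INSTANCES = 66_154_585

# Assumption for illustration only: one 64-bit reference per listed instance.
BYTES_PER_REF = 8

raw_bytes = INSTANCES * BYTES_PER_REF
raw_mib = raw_bytes / 2**20

print(f"{raw_bytes} bytes = {raw_mib:.1f} MiB just for the references")
```

Under that assumption the bare list of references is around 505 MiB, well below the free RAM reported above. This only counts the result set itself, not any working memory used while scanning, and it does not account for per-process limits.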

We've got an engineer pursuing the "out of memory" scenario and looking for a smoking gun in the code for 3.1.0.6, so that we can be assured that we don't have an existing bug in 3.2/3.3...

Thank you for your help in tracking this down ...

No problem. It usually takes me some time because I must stop everything, restore from backup, modify the listMethod via topaz with SystemUser, then run the update code... then as soon as it fails I must revert again to the "corrected" extent so that the system is not in a bad state for long... But still, if this is of help to you by any means, I am happy to continue trying.

My offer is still valid if you want to enter via web and use a "code workspace" I have, or the Seaside debugger, etc. I could also log the exception in the object log if you want. Or I could temporarily open a port in the firewall in case you want remote GemTools (but that is very, very slow). But as I said, we should coordinate the date for this.

Cheers,

Dale

Mariano
http://marianopeck.wordpress.com


GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

On 09/09/2015 11:47 AM, Mariano Martinez Peck wrote:

I am saying that I cannot make it work even with 7GB of RAM free/available...and I also have plenty of swap space from what I can tell.

Sounds like this should be plenty of RAM to list 66MM objects (66154585 instances to be accurate). But maybe I am wrong...

No problem. It usually takes me some time because I must stop everything, restore from backup, modify the listMethod via topaz with SystemUser, then run the update code... then as soon as it fails I must revert again to the "corrected" extent so that the system is not in a bad state for long... But still, if this is of help to you by any means, I am happy to continue trying.

My offer is still valid if you want to enter via web and use a "code workspace" I have, or the Seaside debugger, etc. I could also log the exception in the object log if you want. Or I could temporarily open a port in the firewall in case you want remote GemTools (but that is very, very slow). But as I said, we should coordinate the date for this.

Well, we are still guessing, and the problem with looking at the Smalltalk stack is that all of the interesting things are happening in a separate OS process running C code ... having 7GB of free memory does not immediately rule out a "memory problem" - we had a list instances bug that used way too much memory ... a statmon run using 1 second sampling should allow you to see whether or not the gem's memory consumption is rising during the list instances run (we expect it to return to normal after the failure) ...

We _are_ still guessing because we have not found a smoking gun yet .... can you tell me whether there is a time delay that causes the error to be raised, or does it happen "immediately" ... the guys are reading code here and we haven't found anything yet ...

Dale


GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

In reply to this post by GLASS mailing list

On 09/09/2015 06:24 AM, Mariano Martinez Peck wrote:

On Tue, Sep 8, 2015 at 7:00 PM, Dale Henrichs <[hidden email]> wrote:

Mariano,

I just talked with engineering and they concur that this is likely to be a malloc failure and that this area of the code has been substantially reworked in recent releases to attempt to reduce the amount of RAM consumed during list instances ...

So for 3.1.0.6, you might try this operation with more RAM available or perhaps just adding more swap space will allow the malloc to complete ... running statmon with a 1 second interval and looking at the heap consumption of the gem, might show growth and a "sudden decline" when the malloc fails ...

Hi Dale,

Just for the record, I tried with this scenario:

[marianopeck@quuveserver1 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           8014         388        6850         359         775        7205
Swap:         16639           0       16639

And still didn't work. Note that I have 7GB of RAM free. At the end, when the system crashed, this was the resulting state:

[marianopeck@quuveserver1 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           8014         338        1316         973        6359        6639
Swap:         16639           0       16639

Anyway, no problem, I would assume this is a problem in 3.1.0.6 and hopefully I will never need to list instances / migrate this class until I am in 3.2/3.3...

Okay, we've read code and to sorta confirm your experience, we _do not_ return a nil when the malloc fails ... So we're reading more code, but our suspicion now is that you are running out of TOC and the "normal" failure mechanisms aren't being triggered ... to help confirm this suspicion we think that you can try two independent things:

1. trigger an in-vm scavenge before making a call and/or
2. bump up the TOC for that particular vm and see if you can find a size that works ...

The journey continues...

Dale


GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

Hi Dale,

Ok, I increased the SPC to 2GB and set a TOC of 1.8GB. Now the code update DOES WORK and does not crash anymore.

However, the result is again 2 metaclasses / 2 classes for the same class. So I think we are dealing with 2 problems:

1) The listInstances call was clearly failing because of the TOC size, as you just found out.

2) The kind of code refactoring I needed does not seem to be performed correctly by Monticello. The way to solve this was to do the manual steps that James and Martin recommended at the very beginning of this thread. That change also avoided the migration, and so avoided the listInstances issue too.

So... I think those are the 2 problems and conclusions. I don't think we should continue investigating. Thoughts?

Thank you very much for continuing to dig into this, and thanks to the engineers as well.

On Fri, Sep 11, 2015 at 2:03 PM, Dale Henrichs <[hidden email]> wrote:

Okay, we've read code and to sorta confirm your experience, we _do not_ return a nil when the malloc fails ... So we're reading more code, but our suspicion now is that you are running out of TOC and the "normal" failure mechanisms aren't being triggered ... to help confirm this suspicion we think that you can try two independent things:

1. trigger an in-vm scavenge before making a call and/or
2. bump up the TOC for that particular vm and see if you can find a size that works ...

The journey continues...

Dale

Mariano
http://marianopeck.wordpress.com


GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

Okay ... now that the bug is characterized we'll be able to determine if it exists in older versions or not ... the code in this area has been reworked for 3.2+ ...

Which brings us to the second problem ... since I am entering the bug sweep, it will be worth creating a test case to produce the "2 metaclasses / 2 classes for the same class" and I plan to do that (if I can) and then see if there is a reasonable resolution (not sure:) ...

Dale

On 09/11/2015 11:31 AM, Mariano Martinez Peck wrote:

Hi Dale,

Ok, I increased the SPC to 2GB and set a TOC of 1.8GB. Now the code update DOES WORK and does not crash anymore.

However, the result is again 2 metaclasses / 2 classes for the same class. So I think we are dealing with 2 problems:

1) The listInstances call was clearly failing because of the TOC size, as you just found out.

2) The kind of code refactoring I needed does not seem to be performed correctly by Monticello. The way to solve this was to do the manual steps that James and Martin recommended at the very beginning of this thread. That change also avoided the migration, and so avoided the listInstances issue too.

So... I think those are the 2 problems and conclusions. I don't think we should continue investigating. Thoughts?

Thank you very much for continuing to dig into this, and thanks to the engineers as well.


GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

On Fri, Sep 11, 2015 at 4:06 PM, Dale Henrichs <[hidden email]> wrote:

Okay ... now that the bug is characterized we'll be able to determine if it exists in older versions or not ... the code in this area has been reworked for 3.2+ ...

Indeed.

Which brings us to the second problem ... since I am entering the bug sweep, it will be worth creating a test case to produce the "2 metaclasses / 2 classes for the same class" and I plan to do that (if I can) and then see if there is a reasonable resolution (not sure:) ...

Yes! I will see if I can reproduce that too today. Basically, I had this:

Object
- FaSecurityClosingPriceRecord (no instances)
- SpecialSuperclass
- - FaSecurityClosingPriceRecord2 (many instances)
- - - FSCPR2a (instances)
- - - FSCPR2b (instances)

and then I committed a monticello change with this:

Object
- SpecialSuperclass
- - FaSecurityClosingPriceRecord (many instances....and note there is no 2 at the end)
- - - FSCPR2a (instances)
- - - FSCPR2b (instances)

I will see if I can reproduce it too using dummy classes.
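
To make the shape of that change easier to see, the two hierarchies above can be written down as name-to-superclass maps and compared. This is only an illustrative Python sketch using the class names from this message; it compares classes purely by name, and whether Monticello sees the change this way is an assumption, not something established in the thread.

```python
# Hierarchy before the change: class name -> superclass name.
before = {
    "FaSecurityClosingPriceRecord": "Object",
    "SpecialSuperclass": "Object",
    "FaSecurityClosingPriceRecord2": "SpecialSuperclass",
    "FSCPR2a": "FaSecurityClosingPriceRecord2",
    "FSCPR2b": "FaSecurityClosingPriceRecord2",
}

# Hierarchy after the Monticello commit described above.
after = {
    "SpecialSuperclass": "Object",
    "FaSecurityClosingPriceRecord": "SpecialSuperclass",
    "FSCPR2a": "FaSecurityClosingPriceRecord",
    "FSCPR2b": "FaSecurityClosingPriceRecord",
}

# Names that disappear or appear when comparing by name only.
removed = sorted(set(before) - set(after))
added = sorted(set(after) - set(before))

# Names present on both sides whose superclass changed.
reparented = {
    name: (before[name], after[name])
    for name in sorted(before.keys() & after.keys())
    if before[name] != after[name]
}

print("removed:", removed)
print("added:", added)
for name, (old_super, new_super) in reparented.items():
    print(f"{name}: {old_super} -> {new_super}")
```

Seen by name alone, the change looks like "FaSecurityClosingPriceRecord2 was deleted and FaSecurityClosingPriceRecord moved under SpecialSuperclass". Nothing in that view says that the many instances of the old FaSecurityClosingPriceRecord2 are meant to belong to the class that now carries the reused name, which is the situation the subject line describes.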

Cheers,

Dale

Mariano
http://marianopeck.wordpress.com


GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

Excellent! There's a bug for that[1] ... if you can reproduce it ..

Dale

[1] https://github.com/GsDevKit/GsDevKit/issues/74

On 09/11/2015 12:10 PM, Mariano Martinez Peck wrote:

On Fri, Sep 11, 2015 at 4:06 PM, Dale Henrichs <[hidden email]> wrote:

Okay ... now that the bug is characterized we'll be able to determine if it exists in older versions or not ... the code in this area has been reworked for 3.2+ ...

Indeed.

Which brings us to the second problem ... since I am entering the bug sweep, it will be worth creating a test case to produce the "2 metaclasses / 2 classes for the same class" and I plan to do that (if I can) and then see if there is a reasonable resolution (not sure:) ...

Yes! I will see if I can reproduce that too today. Basically, I had this:

Object
- FaSecurityClosingPriceRecord (no instances)
- SpecialSuperclass
- - FaSecurityClosingPriceRecord2 (many instances)
- - - FSCPR2a (instances)
- - - FSCPR2b (instances)

and then I committed a monticello change with this:

Object
- SpecialSuperclass
- - FaSecurityClosingPriceRecord (many instances....and note there is no 2 at the end)
- - - FSCPR2a (instances)
- - - FSCPR2b (instances)

I will see if I can reproduce it too using dummy classes.

Cheers,

Dale


GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

On Fri, Sep 11, 2015 at 4:17 PM, Dale Henrichs <[hidden email]> wrote:

Excellent! There's a bug for that[1] ... if you can reproduce it ..

"Challengeeeee ..... Accepted!!!" like Barney hahahaha.

Ok...will see if I can reproduce it.

Dale

[1] https://github.com/GsDevKit/GsDevKit/issues/74

Mariano
http://marianopeck.wordpress.com


GLASS mailing list

Re: Grrrr cannot migrate (class rename with subclasses and with a name of a deleted class)

Ok...it seems I am able to reproduce the bug. I have added all the steps and details in the issue tracker.

Let me know!

Cheers,

On Fri, Sep 11, 2015 at 4:28 PM, Mariano Martinez Peck <[hidden email]> wrote:

On Fri, Sep 11, 2015 at 4:17 PM, Dale Henrichs <[hidden email]> wrote:

Excellent! There's a bug for that[1] ... if you can reproduce it ..

"Challengeeeee ..... Accepted!!!" like Barney hahahaha.
Ok...will see if I can reproduce it.

Dale

[1] https://github.com/GsDevKit/GsDevKit/issues/74

Mariano
http://marianopeck.wordpress.com
