loading new code base stalls and fails

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

loading new code base stalls and fails

Johan Brichau-2
Hi,

Until now I was always able to install a new version of the codebase of our application in an extent using a Metacello load and having auto-migrate on.

The update I am currently doing seems almost impossible for Gemstone to swallow. The load runs for hours and the gem eventually goes out of temporary object space (even with the autocommit wrapped around it).
What is even more strange is that the gem is hardly taking any CPU power. It just runs from time to time but seems a sitting duck 90% of the time.

I tried turning auto migrate off, use the GsDeployer>>deploy, or load even a single package only.
The only thing that seems to work is to upgrade incrementally: i.e. selecting an intermediate version of each package of the project and thus slowly moving the codebase to the latest version.

Now... I have quite a bunch of packages... and although I suspect the problem is only with the packages that carry data model classes (because of migration), it does not feel comfortable.

So, am I missing anything I can try to alleviate the loading? I think I can manage migrating once everything is loaded (i.e. using GsDeployer>>doBulkMigrate), but it needs to load first :-(

Any ideas?
Johan
Reply | Threaded
Open this post in threaded view
|

Re: loading new code base stalls and fails

Dale Henrichs
Johan,

It does sound like migration is at the root of your performance problem during loads, but it is probably prudent to verify that.

Since the cpu is pretty much idle, we need to find out if disk i/o is the pinch point ... presumably the migration is causing the entire repository to be loaded into the SPC multiple times and disk i/o would be the limiting factor ...

So I'd like to verify that that is what the system is doing before going too far.

If disk i/o is the bottleneck, then you should see a high 'io wait' percentage in top, also we should get some stack samples from the vm to see what is happening in the image while the cpu is idle. You can use `kill -USR1 <pid>` to force stack traces to be written to the gem log. Do this multiple times to a) flush the i/o to the log file and b) get multiple samples for comparison ...

With this information, we should be able to move forward in coming up with better load solutions...

Dale

----- Original Message -----
| From: "Johan Brichau" <[hidden email]>
| To: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Tuesday, June 28, 2011 10:38:01 AM
| Subject: [GS/SS Beta] loading new code base stalls and fails
|
| Hi,
|
| Until now I was always able to install a new version of the codebase
| of our application in an extent using a Metacello load and having
| auto-migrate on.
|
| The update I am currently doing seems almost impossible for Gemstone
| to swallow. The load runs for hours and the gem eventually goes out
| of temporary object space (even with the autocommit wrapped around
| it).
| What is even more strange is that the gem is hardly taking any CPU
| power. It just runs from time to time but seems a sitting duck 90%
| of the time.
|
| I tried turning auto migrate off, use the GsDeployer>>deploy, or load
| even a single package only.
| The only thing that seems to work is to upgrade incrementally: i.e.
| selecting an intermediate version of each package of the project and
| thus slowly moving the codebase to the latest version.
|
| Now... I have quite a bunch of packages... and although I suspect the
| problem is only with the packages that carry data model classes
| (because of migration), it does not feel comfortable.
|
| So, am I missing anything I can try to alleviate the loading? I think
| I can manage migrating once everything is loaded (i.e. using
| GsDeployer>>doBulkMigrate), but it needs to load first :-(
|
| Any ideas?
| Johan
Reply | Threaded
Open this post in threaded view
|

Re: loading new code base stalls and fails

Johan Brichau-2
Hi Dale,

I made some progress today but that raised also some new questions :-)

First off, I pulled some stack traces from a running process. However, it took far less time today than my earlier attempts. This is somewhat strange (because I tried multiple times over the weekend), but probably you are right that i/o wait was causing the delays. I'm running the migration on Mac OS X and don't have any good indicators for i/o wait on there. However, I pulled some stack traces and attach them anyway.

Next, I noticed that the extent I am upgrading still had a lot of uncollected garbage objects.
After an MFC and a long wait for the actual reclaim, the size shrunk to 500M and I managed to do the migration as follows:
- start a topaz with at least 100M temp obj space
- execute a load using metacello
- tried to execute the #doBulkMigrate, but got an error (see below)
- executed #doBulkMigrate again and it works
(in another trial, I opened up gemtools for the last step and did a metacello load with automigrate on: that worked too)

So, in the end, I think the migration was fine (the system works) but I do have another set of questions:

The error I got was the 'An attempt to execute a method that requires an abort or prevents further commits would result in lost data'.
I think that the load in gemtools, which I tried afterwards, triggered the migration also. Is that correct?
But is it 'normal' that I might have to do the bulkMigrate twice?
And is it correct that classhistories are kept when doing a load via monticello or metacello?

thanks for the help!
Johan




On 28 Jun 2011, at 20:32, Dale Henrichs wrote:

> Johan,
>
> It does sound like migration is at the root of your performance problem during loads, but it is probably prudent to verify that.
>
> Since the cpu is pretty much idle, we need to find out if disk i/o is the pinch point ... presumably the migration is causing the entire repository to be loaded into the SPC multiple times and disk i/o would be the limiting factor ...
>
> So I'd like to verify that that is what the system is doing before going too far.
>
> If disk i/o is the bottleneck, then you should see a high 'io wait' percentage in top, also we should get some stack samples from the vm to see what is happening in the image while the cpu is idle. You can use `kill -USR1 <pid>` to force stack traces to be written to the gem log. Do this multiple times to a) flush the i/o to the log file and b) get multiple samples for comparison ...
>
> With this information, we should be able to move forward in coming up with better load solutions...
>
> Dale
>
> ----- Original Message -----
> | From: "Johan Brichau" <[hidden email]>
> | To: "GemStone Seaside beta discussion" <[hidden email]>
> | Sent: Tuesday, June 28, 2011 10:38:01 AM
> | Subject: [GS/SS Beta] loading new code base stalls and fails
> |
> | Hi,
> |
> | Until now I was always able to install a new version of the codebase
> | of our application in an extent using a Metacello load and having
> | auto-migrate on.
> |
> | The update I am currently doing seems almost impossible for Gemstone
> | to swallow. The load runs for hours and the gem eventually goes out
> | of temporary object space (even with the autocommit wrapped around
> | it).
> | What is even more strange is that the gem is hardly taking any CPU
> | power. It just runs from time to time but seems a sitting duck 90%
> | of the time.
> |
> | I tried turning auto migrate off, use the GsDeployer>>deploy, or load
> | even a single package only.
> | The only thing that seems to work is to upgrade incrementally: i.e.
> | selecting an intermediate version of each package of the project and
> | thus slowly moving the codebase to the latest version.
> |
> | Now... I have quite a bunch of packages... and although I suspect the
> | problem is only with the packages that carry data model classes
> | (because of migration), it does not feel comfortable.
> |
> | So, am I missing anything I can try to alleviate the loading? I think
> | I can manage migrating once everything is loaded (i.e. using
> | GsDeployer>>doBulkMigrate), but it needs to load first :-(
> |
> | Any ideas?
> | Johan


trace1.txt (43K) Download Attachment
trace2.txt (49K) Download Attachment
trace3.txt (47K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: loading new code base stalls and fails

Dale Henrichs
Johan,

>From your stack traces, it definitely appears that class creation is the culprit, but no migration operations appear in the stacks ... instead we see `ClassOrganizer new` as the probable culprit ... You will also see that the call that is being made is #cachedOrganizer and that's because creating new ClassOrganizer's is very expensive ... on the other hand, if you weren't seeing it take too long this result may not mean too much (also note that a different class is involved in each case) ...

Did you happen to play with bulkMigrate before rerunning your tests? The bulkMigrate algorithm is based upon migrating instances of classes with a non-empty class history and after it is done, it cleans up the class history...

The error involving 'An attempt to execute a method that requires an abort or prevents further commits would result in lost data' is probably being thrown, because someone wants to do an allInstances call somewhere ... I've wrapped the allInstances calls that I know about with a call that tests for autoCommit ... if autoCommit is true then we commit and run the allInstances, if autoCommit fails we let the allInstances call fail ...

This kind of problem is why I haven't been able to trumpet the advantages of using bulkMigrate ... there are certain cases where some packages (like Metacello which is being used during the load) may need the autoMigrate to be run in order to keep on functioning, so there are a bunch of caveats/warnings (some still to be defined) surrounding the use of bulkMigrate ...

The fact that you had to run the bulkMigrate twice means that some initializers may not have been run at all (those in line to be executing following the one that failed)...I'm pretty sure that the two problems are primarily with:
 
  - new class version for instances of objects on stack
  - initializers failing because they perform allInstances
    (Magritte and Pier are the allInstances problem childs if I recall)...

The class histories are not bee cleaned up when doing a normal (autoMigrate == true) load using Metacello and Monticello.

In the end I think that we have to have some code changes/procedural changes to make using the bulkMigrate practical...

Dale
----- Original Message -----
| From: "Johan Brichau" <[hidden email]>
| To: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Wednesday, June 29, 2011 12:27:06 PM
| Subject: Re: [GS/SS Beta] loading new code base stalls and fails
|
| Hi Dale,
|
| I made some progress today but that raised also some new questions
| :-)
|
| First off, I pulled some stack traces from a running process.
| However, it took far less time today than my earlier attempts. This
| is somewhat strange (because I tried multiple times over the
| weekend), but probably you are right that i/o wait was causing the
| delays. I'm running the migration on Mac OS X and don't have any
| good indicators for i/o wait on there. However, I pulled some stack
| traces and attach them anyway.
|
| Next, I noticed that the extent I am upgrading still had a lot of
| uncollected garbage objects.
| After an MFC and a long wait for the actual reclaim, the size shrunk
| to 500M and I managed to do the migration as follows:
| - start a topaz with at least 100M temp obj space
| - execute a load using metacello
| - tried to execute the #doBulkMigrate, but got an error (see below)
| - executed #doBulkMigrate again and it works
| (in another trial, I opened up gemtools for the last step and did a
| metacello load with automigrate on: that worked too)
|
| So, in the end, I think the migration was fine (the system works) but
| I do have another set of questions:
|
| The error I got was the 'An attempt to execute a method that requires
| an abort or prevents further commits would result in lost data'.
| I think that the load in gemtools, which I tried afterwards,
| triggered the migration also. Is that correct?
| But is it 'normal' that I might have to do the bulkMigrate twice?
| And is it correct that classhistories are kept when doing a load via
| monticello or metacello?
|
| thanks for the help!
| Johan
|
|
|
|
| On 28 Jun 2011, at 20:32, Dale Henrichs wrote:
|
| > Johan,
| >
| > It does sound like migration is at the root of your performance
| > problem during loads, but it is probably prudent to verify that.
| >
| > Since the cpu is pretty much idle, we need to find out if disk i/o
| > is the pinch point ... presumably the migration is causing the
| > entire repository to be loaded into the SPC multiple times and
| > disk i/o would be the limiting factor ...
| >
| > So I'd like to verify that that is what the system is doing before
| > going too far.
| >
| > If disk i/o is the bottleneck, then you should see a high 'io wait'
| > percentage in top, also we should get some stack samples from the
| > vm to see what is happening in the image while the cpu is idle.
| > You can use `kill -USR1 <pid>` to force stack traces to be written
| > to the gem log. Do this multiple times to a) flush the i/o to the
| > log file and b) get multiple samples for comparison ...
| >
| > With this information, we should be able to move forward in coming
| > up with better load solutions...
| >
| > Dale
| >
| > ----- Original Message -----
| > | From: "Johan Brichau" <[hidden email]>
| > | To: "GemStone Seaside beta discussion"
| > | <[hidden email]>
| > | Sent: Tuesday, June 28, 2011 10:38:01 AM
| > | Subject: [GS/SS Beta] loading new code base stalls and fails
| > |
| > | Hi,
| > |
| > | Until now I was always able to install a new version of the
| > | codebase
| > | of our application in an extent using a Metacello load and having
| > | auto-migrate on.
| > |
| > | The update I am currently doing seems almost impossible for
| > | Gemstone
| > | to swallow. The load runs for hours and the gem eventually goes
| > | out
| > | of temporary object space (even with the autocommit wrapped
| > | around
| > | it).
| > | What is even more strange is that the gem is hardly taking any
| > | CPU
| > | power. It just runs from time to time but seems a sitting duck
| > | 90%
| > | of the time.
| > |
| > | I tried turning auto migrate off, use the GsDeployer>>deploy, or
| > | load
| > | even a single package only.
| > | The only thing that seems to work is to upgrade incrementally:
| > | i.e.
| > | selecting an intermediate version of each package of the project
| > | and
| > | thus slowly moving the codebase to the latest version.
| > |
| > | Now... I have quite a bunch of packages... and although I suspect
| > | the
| > | problem is only with the packages that carry data model classes
| > | (because of migration), it does not feel comfortable.
| > |
| > | So, am I missing anything I can try to alleviate the loading? I
| > | think
| > | I can manage migrating once everything is loaded (i.e. using
| > | GsDeployer>>doBulkMigrate), but it needs to load first :-(
| > |
| > | Any ideas?
| > | Johan
|
|
Reply | Threaded
Open this post in threaded view
|

Re: loading new code base stalls and fails

Johan Brichau-2
Hi Dale,

Coming back on this migration issue...

On 30 Jun 2011, at 22:15, Dale Henrichs wrote:

> Did you happen to play with bulkMigrate before rerunning your tests? The bulkMigrate algorithm is based upon migrating instances of classes with a non-empty class history and after it is done, it cleans up the class history...

Yes, I did. I thought that cleaning those things up might shrink the repository some more.

However, I have now migrated several extents and I notice serious differences in time-to-migrate which is absolutely not related to the difference in data size.
In addition, I often get *very weird* class structures after a load:
- classes that changed to another superclass (thus inheritance hierarchies that do not correspond to the code in the mc repository)
- classes that appear twice in the browser

In the end, I manage to resolve these. But it's not a comfortable feeling.
I'm running glass 1.0-beta8.6 and gemtools 1.0.beta8.4 (which I believe are the latest ones),
But I'm running the metacello load from a topaz session anyway...

> The error involving 'An attempt to execute a method that requires an abort or prevents further commits would result in lost data' is probably being thrown, because someone wants to do an allInstances call somewhere ... I've wrapped the allInstances calls that I know about with a call that tests for autoCommit ... if autoCommit is true then we commit and run the allInstances, if autoCommit fails we let the allInstances call fail ...

The one that is failing is the one inside #doBulkMigrate itself.

> The fact that you had to run the bulkMigrate twice means that some initializers may not have been run at all (those in line to be executing following the one that failed)...I'm pretty sure that the two problems are primarily with:
>  - new class version for instances of objects on stack
>  - initializers failing because they perform allInstances
>    (Magritte and Pier are the allInstances problem childs if I recall)...

Hm... I'm not sure if I understand this one.
When I'm doing the #doBulkMigrate, I'm just starting a new transaction. So the call to the initializers should be happening inside the flow of #doBulkMigrate, no?
But I don't understand how that can happen...

Johan
ps: will you be coming to ESUG this year ?

Reply | Threaded
Open this post in threaded view
|

Re: loading new code base stalls and fails

Dale Henrichs
Johan,

I will be coming to ESUG!


I'd like to characterize the problems that you are seeing ...

First, I'd like to see a stack from the 'Attempt to execute...' error to understand where it is being thrown.

Second, I'd like to start analyzing the performance issues ... it would be helpful if you run statmon while doing the build migrate runs:

  $GEMSTONE/statmonitor <stonename> -rAzq -h 1 -i 1 -u 0

You can mark progress through your tests by incrementing a cache stat:

  System _sessionCacheStatAt: 25 put: 1

Change the value of the cache stat (increment by one) for each expression (of interest) in your script so that we can correlate system peroformance information with the expressions being performed by your script ..

After a run, stop the statmonitor process and ship my the statmonitor file and I should be able to start understading what is going on ...

Dale

----- Original Message -----
| From: "Johan Brichau" <[hidden email]>
| To: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Friday, July 8, 2011 9:17:30 AM
| Subject: Re: [GS/SS Beta] loading new code base stalls and fails
|
| Hi Dale,
|
| Coming back on this migration issue...
|
| On 30 Jun 2011, at 22:15, Dale Henrichs wrote:
|
| > Did you happen to play with bulkMigrate before rerunning your
| > tests? The bulkMigrate algorithm is based upon migrating instances
| > of classes with a non-empty class history and after it is done, it
| > cleans up the class history...
|
| Yes, I did. I thought that cleaning those things up might shrink the
| repository some more.
|
| However, I have now migrated several extents and I notice serious
| differences in time-to-migrate which is absolutely not related to
| the difference in data size.
| In addition, I often get *very weird* class structures after a load:
| - classes that changed to another superclass (thus inheritance
| hierarchies that do not correspond to the code in the mc repository)
| - classes that appear twice in the browser
|
| In the end, I manage to resolve these. But it's not a comfortable
| feeling.
| I'm running glass 1.0-beta8.6 and gemtools 1.0.beta8.4 (which I
| believe are the latest ones),
| But I'm running the metacello load from a topaz session anyway...
|
| > The error involving 'An attempt to execute a method that requires
| > an abort or prevents further commits would result in lost data' is
| > probably being thrown, because someone wants to do an allInstances
| > call somewhere ... I've wrapped the allInstances calls that I know
| > about with a call that tests for autoCommit ... if autoCommit is
| > true then we commit and run the allInstances, if autoCommit fails
| > we let the allInstances call fail ...
|
| The one that is failing is the one inside #doBulkMigrate itself.
|
| > The fact that you had to run the bulkMigrate twice means that some
| > initializers may not have been run at all (those in line to be
| > executing following the one that failed)...I'm pretty sure that
| > the two problems are primarily with:
| >  - new class version for instances of objects on stack
| >  - initializers failing because they perform allInstances
| >    (Magritte and Pier are the allInstances problem childs if I
| >    recall)...
|
| Hm... I'm not sure if I understand this one.
| When I'm doing the #doBulkMigrate, I'm just starting a new
| transaction. So the call to the initializers should be happening
| inside the flow of #doBulkMigrate, no?
| But I don't understand how that can happen...
|
| Johan
| ps: will you be coming to ESUG this year ?
|
|
Reply | Threaded
Open this post in threaded view
|

Re: loading new code base stalls and fails

Johan Brichau-2
Hi Dale,

Sorry for the delay in responding to this one.
Holidays surrounded by a lot of work have made this one slip by.

I am sending you where I am getting the #rtErrAbortWouldLoseData error.

On 11 Jul 2011, at 22:35, Dale Henrichs wrote:

> I'd like to characterize the problems that you are seeing ...
>
> First, I'd like to see a stack from the 'Attempt to execute...' error to understand where it is being thrown.

In attachment bulkMigrateStacktrace.txt (I did a copy stack in a gemtools debugger):




And a screenshot that shows it's the #listInstance: method in the second run of the loop inside bulkMigrate:



 


bulkMigrateStackTrace.txt (7K) Download Attachment
Screen Shot 2011-08-08 at 18.18.55.png (125K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: loading new code base stalls and fails

Dale Henrichs
Johan,

No problem ... the stack trace and screen shot did the trick ... I've submitted Issue 286 ... you can probably get away with moving the commit inside the loop unless you want to finish your migrate in one transaction ... then all instances will need to be grabbed before doing the migrates ...

Thanks,

Dale

----- Original Message -----
| From: "Johan Brichau" <[hidden email]>
| To: "GemStone Seaside beta discussion" <[hidden email]>
| Sent: Monday, August 8, 2011 9:27:40 AM
| Subject: Re: [GS/SS Beta] loading new code base stalls and fails
|
| Hi Dale,
|
| Sorry for the delay in responding to this one.
| Holidays surrounded by a lot of work have made this one slip by.
|
| I am sending you where I am getting the #rtErrAbortWouldLoseData
| error.
|
| On 11 Jul 2011, at 22:35, Dale Henrichs wrote:
|
| > I'd like to characterize the problems that you are seeing ...
| >
| > First, I'd like to see a stack from the 'Attempt to execute...'
| > error to understand where it is being thrown.
|
| In attachment bulkMigrateStacktrace.txt (I did a copy stack in a
| gemtools debugger):
|
|
|
|
| And a screenshot that shows it's the #listInstance: method in the
| second run of the loop inside bulkMigrate:
|
|
|
| [image/png:Screen Shot 2011-08-08 at 18.18.55.png]
|
|
|  
|
|
Reply | Threaded
Open this post in threaded view
|

Re: loading new code base stalls and fails

Johan Brichau-2
Hi Dale,

Thanks!
 
Doing an intermediate commit when the error appeared was exactly how I was working around the problem.

Johan

On 08 Aug 2011, at 19:04, Dale Henrichs wrote:

> Johan,
>
> No problem ... the stack trace and screen shot did the trick ... I've submitted Issue 286 ... you can probably get away with moving the commit inside the loop unless you want to finish your migrate in one transaction ... then all instances will need to be grabbed before doing the migrates ...
>
> Thanks,
>
> Dale
>
> ----- Original Message -----
> | From: "Johan Brichau" <[hidden email]>
> | To: "GemStone Seaside beta discussion" <[hidden email]>
> | Sent: Monday, August 8, 2011 9:27:40 AM
> | Subject: Re: [GS/SS Beta] loading new code base stalls and fails
> |
> | Hi Dale,
> |
> | Sorry for the delay in responding to this one.
> | Holidays surrounded by a lot of work have made this one slip by.
> |
> | I am sending you where I am getting the #rtErrAbortWouldLoseData
> | error.
> |
> | On 11 Jul 2011, at 22:35, Dale Henrichs wrote:
> |
> | > I'd like to characterize the problems that you are seeing ...
> | >
> | > First, I'd like to see a stack from the 'Attempt to execute...'
> | > error to understand where it is being thrown.
> |
> | In attachment bulkMigrateStacktrace.txt (I did a copy stack in a
> | gemtools debugger):
> |
> |
> |
> |
> | And a screenshot that shows it's the #listInstance: method in the
> | second run of the loop inside bulkMigrate:
> |
> |
> |
> | [image/png:Screen Shot 2011-08-08 at 18.18.55.png]
> |
> |
> |  
> |
> |