Migrating and Updating Databases
I would like to post some experiences I had while migrating a GemStone database.

The customer database my tests are based on has a size of about 420 GB. The database was copied to our reference system - an old ThinkPad W520 (i7-based) with 16 GB of RAM and ONE SSD - and the tests were done on this machine. The stone is working with an 8 GB shared page cache.

Between v70 and v71 of our product there were several changes to the domain model we are developing. The model is defined by 197 domain classes. In v71, 39 of these classes were changed, and these changes are the reason why 119,000,000 objects had to be migrated. One class had 66,000,000 instances, another one 49,000,000 instances, and the remaining classes have around 4,000,000 instances each.

*** The original way ***

The old, traditional way had been written in the early days of this product, when databases were not that big and migration speed was not that critical. It worked more or less the following way (shame on me):

a) Scan the repository for ONE (!) changed class.

b) For each instance, do a migration and on demand (when memory runs low) make a commit.

This was ok in the past: I could start the update process on Saturday and finish the update remotely on Sunday. Now the database has become too large, and this way of updating it would take from Thursday 11:00 to Monday afternoon - so more or less 4 days!

*** Repository Scanning ***

The next evolution of this approach was:

a) Now ONE repository scan (for ALL changed classes) is done, using fastAllInstances and GsBitmap instances.

b) For each instance, do a migration and on demand make a commit.

With this step the multiple scans of the repository have been removed, and the largest cost is now the execution of the base migration code. But for 119 million objects this still takes a long time. I did not run a full test, but an initial test over some hours suggested that this would take around 2 days.
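The gain from a single scan can be pictured outside GemStone as well. The following is an illustrative Python sketch, not GemStone code - fastAllInstances and GsBitmap do the real work inside the stone - but it shows why bucketing all changed classes in one pass beats one full scan per class:

```python
from collections import defaultdict

def scan_once(all_objects, changed_classes):
    # One pass over the whole object population, collecting the
    # instances of EVERY changed class at the same time.
    buckets = defaultdict(list)
    for obj in all_objects:              # the single expensive scan
        name = type(obj).__name__
        if name in changed_classes:
            buckets[name].append(obj)
    return buckets

# With 39 changed classes, the old approach costs 39 full repository
# scans; this costs one scan plus a cheap per-object dispatch.
buckets = scan_once([1, 2.5, "a", 3, "b"], {"int", "str"})
```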
*** Indices ***

More than satisfied with the benefits of ONE scan, I had to look at the migration code. The base migration code was generated by our code generator, and I did not want to change that (because it is general and covers all model versions), but actually knowing the specific model version I am migrating from would cut the code to be executed to 1/4 of the original code. So there would be possibilities for enhancements here.

So, what about starting multiple processes and doing step (b) in parallel? I stored the GsBitmap in page order on disc, and that file came to around 600 MB of data. Then I wrote processes to do the migration in parallel based on that GsBitmap file ... and it did not work. Commit conflicts over and over. No way to go ... speed was pretty bad. Actually, only one process was running more or less without problems - the other processes sometimes did a little work, but most of the time they did an abort transaction.

So these conflicts had to have a common cause somewhere. As a first step I decided to remove ALL indices used in the database. I was lucky that this application has an execution path to find all used indices, to remove them, to rebuild them, etc. The script to remove all indices was started before the migration (and it took at least 1-2 hours).

Then I started the parallel migration code, and now it was working. The i7 has 8 execution threads, so I started 8 of these processes and they worked without problems. The topaz scripts were started with "-t 500000", and that fit very well to the machine above: 100% usage of the available RAM and minimal swapping. The code itself uses a sliding transaction size (from 1 to a maximum of 20,000 objects between commits; this limit is adapted according to conflicts/successes) - but the logs showed that the processes were working at the upper value of 20,000 objects per commit.
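The sliding transaction size described above can be pictured as a small feedback controller. This Python sketch is my own illustration (the real code is GemStone Smalltalk, and its exact grow/shrink rules are not given above; the factors here are assumptions): grow the batch while commits succeed, back off on a conflict, clamped to the 1..20000 range from the text:

```python
class BatchSizer:
    # Sliding commit batch size: grow on successful commits, shrink
    # on commit conflicts, clamped to [1, 20000]. The grow/shrink
    # factors are illustrative assumptions.
    def __init__(self, lo=1, hi=20000):
        self.lo, self.hi, self.size = lo, hi, lo

    def on_success(self):
        self.size = min(self.hi, self.size * 2)

    def on_conflict(self):
        self.size = max(self.lo, self.size // 4)

sizer = BatchSizer()
for _ in range(20):          # with the indices gone, no conflicts...
    sizer.on_success()
# ...so the batch settles at the upper bound of 20000 objects per
# commit, matching what the logs showed.
```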
So, to summarize:

a) Scanning the objects with fastAllInstances in ONE scan (1-2 hours)
b) Removing the indices (1-2 hours)
c) Running the migration code in 8 tasks (8 hours)
d) Scanning the objects with fastAllInstances in ONE scan (1-2 hours) - to double-check
e) Cleaning the history
f) Building the indices (3 hours)

So now I am at 17 hours, and that is ok. I think that (b) and (f) could also be done in parallel execution mode.

*** Workload ***

Removing indices in concurrent tasks leads to very strange exception errors, so I gave that up. Creating indices in concurrent tasks works - so the 3 hours above can be reduced to 40 minutes, and the overall time is now 15 hours.

*** Equal Workload up to the end ***

The next point that showed up in this work was that the cost of creating the individual indices varies very much, so some tasks have much more to do than others ... and the parallelization idea is not carried through to the end (index creation tasks: 37 minutes for the longest against 11 minutes for the fastest). So rearranging this work could still improve the time needed to create the indices.

Marten
_______________________________________________
Glass mailing list
[hidden email]
https://lists.gemtalksystems.com/mailman/listinfo/glass
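Marten's closing point - rearranging the index-creation work so every task finishes at roughly the same time - is a classic multiprocessor scheduling problem. A hedged Python sketch of one standard heuristic, longest-processing-time-first, using made-up per-index cost estimates (real costs would have to be measured or estimated from collection sizes):

```python
import heapq

def lpt_assign(task_costs, n_workers):
    # Longest-processing-time-first: hand each task, largest first,
    # to the currently least-loaded worker.
    heap = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    for cost in sorted(task_costs, reverse=True):
        load, w = heapq.heappop(heap)
        assignment[w].append(cost)
        heapq.heappush(heap, (load + cost, w))
    return assignment

# Hypothetical per-index build times in minutes, spread over 3 workers:
plan = lpt_assign([37, 30, 25, 20, 15, 11], 3)
makespan = max(sum(tasks) for tasks in plan)   # the slowest worker
```

With these numbers the loads come out as 48, 45, and 45 minutes - far closer together than the 37-versus-11-minute spread reported above.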
Marten,
Thank you very much for describing your experience. At each step it seems to me like you made good choices and exhibited a strong understanding of how GemStone works and can be optimized. I appreciate that you arrived at something that is adequate but recognize that more could be done if necessary.

James

> On Dec 20, 2019, at 3:56 AM, Marten Feldtmann via Glass <[hidden email]> wrote:
> ...
Thanks for sharing your story ... impressive improvements ....
I see that you had "very strange exception errors" while attempting to concurrently remove indexes ... could you share more details about the errors you hit? In theory you should be able to do concurrent index removal without errors, so perhaps you ran into some fixable bugs?

Dale

Sent from my iPhone

> On Dec 20, 2019, at 4:01 AM, Marten Feldtmann via Glass <[hidden email]> wrote:
> ...
This is the error I get from my tasks which are removing indices. The code works when it is running alone. My working theory is that the RcConflictSet in the IndexManager instance is the reason for this failure ... but I have no proof of that.

The commitTransaction in frame 16 is executed just after I remove all indices from ONE structure. The removeIndex code is a copy of the createIndex code, so I think that all tasks are working on their own (not shared) data.
ERROR 2261 , a InternalError occurred (error 2261), The object with object ID 1480033747713 is corrupt. Reason: 'CorruptObj, FetchObjId fetch past end' (InternalError)
topaz > exec iferr 1 : where
==> 1 InternalError (AbstractException) >> _signalFromPrimitive: @6 line 15
2 DepListTable >> depListBucketFor: @1 line 1
3 DepListTable >> _add: @3 line 8
4 DepListTable (Object) >> perform:withArguments: @1 line 1
5 LogEntry >> redo @2 line 5
6 RedoLog >> _redoOperationsForEntries: @6 line 4
7 DepListTable (Object) >> _abortAndReplay: @15 line 20
8 DepListTable >> _resolveRcConflictsWith: @3 line 10
9 System class >> _resolveRcConflicts @21 line 26
10 System class >> _resolveRcConflictsForCommit: @4 line 8
11 [] in System class >> _localCommit: @28 line 42
12 ExecBlock0 (ExecBlock) >> onException:do: @2 line 66
13 System class >> _localCommit: @16 line 44
14 SessionMethodTransactionBoundaryPolicy (TransactionBoundaryDefaultPolicy) >> commit: @3 line 3
15 System class >> _commit: @8 line 16
16 System class >> commitTransaction @5 line 7
17 [] in WCATIServiceClass class >> removeAllIndices:total: @59 line 28
18 SortedCollection (Collection) >> do: @6 line 10
19 WCATIServiceClass class >> removeAllIndices:total: @35 line 24
20 Executed Code @2 line 1
21 GsNMethod class >> _gsReturnToC @1 line 1

topaz 1> commit
ERROR 2249 , a TransactionError occurred (error 2249), Further commits have been disabled for this session because: 'CorruptObj, FetchObjId fetch past end'. This session must logout. (TransactionError)

topaz > exec iferr 1 : where
==> 1 TransactionError (AbstractException) >> _signalFromPrimitive: @6 line 15
2 System class >> _primitiveCommit: @1 line 1
3 [] in System class >> _localCommit: @21 line 30
4 ExecBlock0 (ExecBlock) >> onException:do: @2 line 66
5 System class >> _localCommit: @9 line 31
6 SessionMethodTransactionBoundaryPolicy (TransactionBoundaryDefaultPolicy) >> commit: @3 line 3
7 System class >> _commit: @8 line 16
8 System class >> _gciCommit @5 line 5
9 GsNMethod class >> _gsReturnToC @1 line 1

topaz 1> doit
Great article Marten, thank you for posting. This would make a great
experience report to present at ESUG.

Norm

On 12/20/2019 3:56 AM, Marten Feldtmann via Glass wrote:
> ...
Hey Marten,

What's the actual GemStone call made to remove the indexes before you did the commit causing the error? Also, can you describe the type and paths of the various indexes on that collection?

I see that Dale has gone ahead and filed bug 48485 on this issue -- we may need this info to track down the problem.

------------------------------------------------------------------------
Bill Erickson
GemTalk Systems Engineering
15220 NW Greenbrier Parkway #240, Beaverton OR 97006
------------------------------------------------------------------------

On Fri, Dec 20, 2019 at 7:50 AM Marten Feldtmann via Glass <[hidden email]> wrote:
Marten,

Thanks for the stack ...

For completeness, could you share the code in removeAllIndices:total: with me? I would like to reproduce this problem, so the number of concurrent processes doing removal would be useful as well ... Concurrent bugs are always difficult to track down (and reproduce), so the more information I have the better.

I've submitted a bug: "48485 'CorruptObj, FetchObjId fetch past end' during concurrent index removal" to track this problem.

Finally, I am curious if you have tried `IndexManager removeAllIndexes`? It is intended for use in the case where you are removing all of the indexes in the system ... Instead of removing the individual objects participating in the indexes (which is what is done by the standard remove index code), the index data structures (btrees, etc.) are simply dropped on the floor --- the objects participating in an index are removed directly from the dependency lists, so it should be quite a bit faster than removing each individual index from its collection ...

Dale

On 12/20/19 7:45 AM, Marten Feldtmann wrote:
> ...
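The difference Dale describes can be illustrated with a toy model (plain Python, not the actual IndexManager implementation): per-index removal unregisters every participating entry one at a time, while the removeAllIndexes-style path wipes the dependency bookkeeping in a single pass and discards the index structures wholesale.

```python
def remove_index(index, dep_lists):
    # Standard path: unregister every participating object id from
    # its dependency list, one entry at a time -- cost grows with
    # the total number of index entries.
    for oid in index["oids"]:
        dep_lists[oid].discard(index["name"])

def remove_all_indexes(indexes, dep_lists):
    # removeAllIndexes-style path: clear each dependency list once
    # and drop the index structures on the floor.
    for deps in dep_lists.values():
        deps.clear()
    indexes.clear()

indexes = [{"name": "byName", "oids": [1, 2]},
           {"name": "byDate", "oids": [2, 3]}]
dep_lists = {1: {"byName"}, 2: {"byName", "byDate"}, 3: {"byDate"}}
remove_all_indexes(indexes, dep_lists)
```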