Smalltalk › Gemtalk › GLASS

[Glass] [3.1.0.4] Tranlog Restore Seems to Hang

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

6 messages Options

Ken Treis

[Glass] [3.1.0.4] Tranlog Restore Seems to Hang

We had a cloud server crash (problem on the underlying hardware), and after things were fixed I find myself unable to restore certain transaction logs. It goes something like this:

1. startstone -N -R seaside.

2. topaz -l

successful login
topaz 1> run
SystemRepository restoreStatus
%
Restoring from transaction log files, restored to 20/08/13 18:26:21 UTC, file 926 record 1596051, nextFileId 926, oldest fileId 926
topaz 1> run
SystemRepository restoreFromCurrentLogs
%

3. In the stone log, I see:

--- 08/22/13 23:56:27 UTC ---

Opened a transaction log file for read_nolocks.

filename = /opt/gemstone/GemStone64Bit3.1.0.4-x86_64.Linux//seaside/data/tranlog926.dbf

Restoring from current log directory to end of logs

and then nothing else happens. After an hour, I get impatient. Load average is low, and the only process using any appreciable CPU is pgsvrmain. So I kill pgsvrmain and let the stone crash.

4. I repeat 1 and 2, and restoreStatus shows me that no progress has been made. I'm still at the same record (1596051).

I've seen this happen in 3 different scenarios:

(A) After restoring a Smalltalk full backup.

(B) After restoring a filesystem backup.

(C) During normal startup after the crash, when there are just a couple of records to replay since the pre-crash checkpoint. I didn't confirm that the record ID wasn't changing at this point, but the behavior was identical to the other two).

All three hang at different record IDs, so I can't blame a single bad record. Also, in (A) both page and object audits came back OK. I haven't done audits on the other scenario.

Is this a known issue in 3.1.0.4? Anything else I should be checking/trying?

--
Ken Treis
Miriam Technologies, Inc.

(866) 652-2040 x221

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Dale Henrichs-3

Re: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

Ken,

I've forwarded to the server guys to see if they have suggestions.

Dale

From: "Ken Treis" <[hidden email]>
To: [hidden email]
Sent: Thursday, August 22, 2013 5:30:12 PM
Subject: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

We had a cloud server crash (problem on the underlying hardware), and after things were fixed I find myself unable to restore certain transaction logs. It goes something like this:

1. startstone -N -R seaside.
2. topaz -l

successful login
topaz 1> run
SystemRepository restoreStatus
%
Restoring from transaction log files, restored to 20/08/13 18:26:21 UTC, file 926 record 1596051, nextFileId 926, oldest fileId 926
topaz 1> run
SystemRepository restoreFromCurrentLogs
%

3. In the stone log, I see:

--- 08/22/13 23:56:27 UTC ---
Opened a transaction log file for read_nolocks.
filename = /opt/gemstone/GemStone64Bit3.1.0.4-x86_64.Linux//seaside/data/tranlog926.dbf
Restoring from current log directory to end of logs

and then nothing else happens. After an hour, I get impatient. Load average is low, and the only process using any appreciable CPU is pgsvrmain. So I kill pgsvrmain and let the stone crash.

4. I repeat 1 and 2, and restoreStatus shows me that no progress has been made. I'm still at the same record (1596051).

I've seen this happen in 3 different scenarios:

(A) After restoring a Smalltalk full backup.
(B) After restoring a filesystem backup.
(C) During normal startup after the crash, when there are just a couple of records to replay since the pre-crash checkpoint. I didn't confirm that the record ID wasn't changing at this point, but the behavior was identical to the other two).

All three hang at different record IDs, so I can't blame a single bad record. Also, in (A) both page and object audits came back OK. I haven't done audits on the other scenario.

Is this a known issue in 3.1.0.4? Anything else I should be checking/trying?

--
Ken Treis
Miriam Technologies, Inc.
(866) 652-2040 x221

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Dale Henrichs-3

Re: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

In reply to this post by Ken Treis

Ken,

In earlier versions we've seen hangs but they were always accompanied by "Recovery waiting for gcGems to reclaim dead" in the stone log, so this appears to possibly be a different variant of that problem...

First off, it would be easier if we could get copies of your extents and tranlogs so that we could debug the hang here ...

If that's not feasible, then we'd like to gather stats, stacks and logs:

1. start the stone with -N -R
2. fire up statmon
3. do your restore from current logs
4. after you see the hand, use $GEMSTONE/bin/pstack to get stacks
from the stoned process ... (take several samples with an intervening
delay)

The wrap up the stats, stacks and stone.log and send it to me ... if you can share extents and tranlogs, let me know and I'll give you the details ftp details...

Dale

From: "Ken Treis" <[hidden email]>
To: [hidden email]
Sent: Thursday, August 22, 2013 5:30:12 PM
Subject: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

We had a cloud server crash (problem on the underlying hardware), and after things were fixed I find myself unable to restore certain transaction logs. It goes something like this:

1. startstone -N -R seaside.
2. topaz -l

successful login
topaz 1> run
SystemRepository restoreStatus
%
Restoring from transaction log files, restored to 20/08/13 18:26:21 UTC, file 926 record 1596051, nextFileId 926, oldest fileId 926
topaz 1> run
SystemRepository restoreFromCurrentLogs
%

3. In the stone log, I see:

--- 08/22/13 23:56:27 UTC ---
Opened a transaction log file for read_nolocks.
filename = /opt/gemstone/GemStone64Bit3.1.0.4-x86_64.Linux//seaside/data/tranlog926.dbf
Restoring from current log directory to end of logs

and then nothing else happens. After an hour, I get impatient. Load average is low, and the only process using any appreciable CPU is pgsvrmain. So I kill pgsvrmain and let the stone crash.

4. I repeat 1 and 2, and restoreStatus shows me that no progress has been made. I'm still at the same record (1596051).

I've seen this happen in 3 different scenarios:

(A) After restoring a Smalltalk full backup.
(B) After restoring a filesystem backup.
(C) During normal startup after the crash, when there are just a couple of records to replay since the pre-crash checkpoint. I didn't confirm that the record ID wasn't changing at this point, but the behavior was identical to the other two).

All three hang at different record IDs, so I can't blame a single bad record. Also, in (A) both page and object audits came back OK. I haven't done audits on the other scenario.

Is this a known issue in 3.1.0.4? Anything else I should be checking/trying?

--
Ken Treis
Miriam Technologies, Inc.
(866) 652-2040 x221

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Dale Henrichs-3

Re: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

Ken,

We ran the restore and it turns out that the restore was not really hung, but running very very slowly ... The system is actually doing a reclaim of the dead objects. If you watch the stone stat deadNotReclaimedObjs, you'll see that it is slowly working its way toward zero when the restore appears to be hung ... when deadNotReclaimedObjs hits zero things get moving again ... your repository is a lot bigger than the SPC, so that is one reason by the dead object reclaim is slow ...

Apparently tranlog930 is missing from the set of files that you sent us ... and in our case it took about 5 hours to hit that point, but we were using a 4G SPC.

If you've got tranlog930 you could ship it to us so that we can ensure that isn't something else wrong....but at least the "hang" is explained ...

Let me know if you have any other questions.
Dale

From: "Dale K. Henrichs" <[hidden email]>
To: "Ken Treis" <[hidden email]>
Cc: [hidden email]
Sent: Friday, August 23, 2013 9:21:03 AM
Subject: Re: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

Ken,

In earlier versions we've seen hangs but they were always accompanied by "Recovery waiting for gcGems to reclaim dead" in the stone log, so this appears to possibly be a different variant of that problem...

First off, it would be easier if we could get copies of your extents and tranlogs so that we could debug the hang here ...

If that's not feasible, then we'd like to gather stats, stacks and logs:

1. start the stone with -N -R
2. fire up statmon
3. do your restore from current logs
4. after you see the hand, use $GEMSTONE/bin/pstack to get stacks
from the stoned process ... (take several samples with an intervening
delay)

The wrap up the stats, stacks and stone.log and send it to me ... if you can share extents and tranlogs, let me know and I'll give you the details ftp details...

Dale

From: "Ken Treis" <[hidden email]>
To: [hidden email]
Sent: Thursday, August 22, 2013 5:30:12 PM
Subject: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

We had a cloud server crash (problem on the underlying hardware), and after things were fixed I find myself unable to restore certain transaction logs. It goes something like this:

1. startstone -N -R seaside.
2. topaz -l

successful login
topaz 1> run
SystemRepository restoreStatus
%
Restoring from transaction log files, restored to 20/08/13 18:26:21 UTC, file 926 record 1596051, nextFileId 926, oldest fileId 926
topaz 1> run
SystemRepository restoreFromCurrentLogs
%

3. In the stone log, I see:

--- 08/22/13 23:56:27 UTC ---
Opened a transaction log file for read_nolocks.
filename = /opt/gemstone/GemStone64Bit3.1.0.4-x86_64.Linux//seaside/data/tranlog926.dbf
Restoring from current log directory to end of logs

and then nothing else happens. After an hour, I get impatient. Load average is low, and the only process using any appreciable CPU is pgsvrmain. So I kill pgsvrmain and let the stone crash.

4. I repeat 1 and 2, and restoreStatus shows me that no progress has been made. I'm still at the same record (1596051).

I've seen this happen in 3 different scenarios:

(A) After restoring a Smalltalk full backup.
(B) After restoring a filesystem backup.
(C) During normal startup after the crash, when there are just a couple of records to replay since the pre-crash checkpoint. I didn't confirm that the record ID wasn't changing at this point, but the behavior was identical to the other two).

All three hang at different record IDs, so I can't blame a single bad record. Also, in (A) both page and object audits came back OK. I haven't done audits on the other scenario.

Is this a known issue in 3.1.0.4? Anything else I should be checking/trying?

--
Ken Treis
Miriam Technologies, Inc.
(866) 652-2040 x221

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Ken Treis

Re: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

<base href="x-msg://8138/">Hi Dale,

Thank you very much for looking into this for me. Chalk another one up to my impatience.

During normal operation, reclaim is performed incrementally and asynchronously, right? Is there any way to delay the reclaim until after tranlogs have been replayed? Our site was down during the long replay process, and I would have gladly given up some more disk space to get things back online more quickly.

Ken Treis
Miriam Technologies, Inc.

(866) 652-2040 x221

On Aug 24, 2013, at 5:51 PM, Dale K. Henrichs wrote:

Ken,

We ran the restore and it turns out that the restore was not really hung, but running very very slowly ... The system is actually doing a reclaim of the dead objects. If you watch the stone stat deadNotReclaimedObjs, you'll see that it is slowly working its way toward zero when the restore appears to be hung ... when deadNotReclaimedObjs hits zero things get moving again ... your repository is a lot bigger than the SPC, so that is one reason by the dead object reclaim is slow ...

Apparently tranlog930 is missing from the set of files that you sent us ... and in our case it took about 5 hours to hit that point, but we were using a 4G SPC.

If you've got tranlog930 you could ship it to us so that we can ensure that isn't something else wrong....but at least the "hang" is explained ...

Let me know if you have any other questions.
Dale

From: "Dale K. Henrichs" <[hidden email]>
To: "Ken Treis" <[hidden email]>
Cc: [hidden email]
Sent: Friday, August 23, 2013 9:21:03 AM
Subject: Re: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

Ken,

In earlier versions we've seen hangs but they were always accompanied by "Recovery waiting for gcGems to reclaim dead" in the stone log, so this appears to possibly be a different variant of that problem...

First off, it would be easier if we could get copies of your extents and tranlogs so that we could debug the hang here ...

If that's not feasible, then we'd like to gather stats, stacks and logs:

1. start the stone with -N -R
2. fire up statmon
3. do your restore from current logs
4. after you see the hand, use $GEMSTONE/bin/pstack to get stacks
from the stoned process ... (take several samples with an intervening
delay)

The wrap up the stats, stacks and stone.log and send it to me ... if you can share extents and tranlogs, let me know and I'll give you the details ftp details...

Dale

From: "Ken Treis" <[hidden email]>
To: [hidden email]
Sent: Thursday, August 22, 2013 5:30:12 PM
Subject: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

We had a cloud server crash (problem on the underlying hardware), and after things were fixed I find myself unable to restore certain transaction logs. It goes something like this:

1. startstone -N -R seaside.
2. topaz -l

successful login
topaz 1> run
SystemRepository restoreStatus
%
Restoring from transaction log files, restored to 20/08/13 18:26:21 UTC, file 926 record 1596051, nextFileId 926, oldest fileId 926
topaz 1> run
SystemRepository restoreFromCurrentLogs
%

3. In the stone log, I see:

--- 08/22/13 23:56:27 UTC ---
Opened a transaction log file for read_nolocks.
filename = /opt/gemstone/GemStone64Bit3.1.0.4-x86_64.Linux//seaside/data/tranlog926.dbf
Restoring from current log directory to end of logs

and then nothing else happens. After an hour, I get impatient. Load average is low, and the only process using any appreciable CPU is pgsvrmain. So I kill pgsvrmain and let the stone crash.

4. I repeat 1 and 2, and restoreStatus shows me that no progress has been made. I'm still at the same record (1596051).

I've seen this happen in 3 different scenarios:

(A) After restoring a Smalltalk full backup.
(B) After restoring a filesystem backup.
(C) During normal startup after the crash, when there are just a couple of records to replay since the pre-crash checkpoint. I didn't confirm that the record ID wasn't changing at this point, but the behavior was identical to the other two).

All three hang at different record IDs, so I can't blame a single bad record. Also, in (A) both page and object audits came back OK. I haven't done audits on the other scenario.

Is this a known issue in 3.1.0.4? Anything else I should be checking/trying?

--
Ken Treis
Miriam Technologies, Inc.
(866) 652-2040 x221

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Dale Henrichs-3

Re: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

Ken,

Well, you are right to be suspicious and we should let you know how to monitor progress .. internally we know that we are doing something "unexpected" during restore, so we are planning on putting a message in the stone log.

There's really no way for you bypass the dead reclaim ... we need to make sure that our bookkeeping for the dead objects is correct ... based on this incident we have discussed the possibility of potential optimizations for the future, but there is nothing to be done now other than to be patient:)

Dale

From: "Ken Treis" <[hidden email]>
To: "Dale K. Henrichs" <[hidden email]>
Cc: [hidden email]
Sent: Monday, August 26, 2013 12:13:11 PM
Subject: Re: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

Hi Dale,

Thank you very much for looking into this for me. Chalk another one up to my impatience.

During normal operation, reclaim is performed incrementally and asynchronously, right? Is there any way to delay the reclaim until after tranlogs have been replayed? Our site was down during the long replay process, and I would have gladly given up some more disk space to get things back online more quickly.

--
Ken Treis
Miriam Technologies, Inc.
(866) 652-2040 x221

On Aug 24, 2013, at 5:51 PM, Dale K. Henrichs wrote:

Ken,

We ran the restore and it turns out that the restore was not really hung, but running very very slowly ... The system is actually doing a reclaim of the dead objects. If you watch the stone stat deadNotReclaimedObjs, you'll see that it is slowly working its way toward zero when the restore appears to be hung ... when deadNotReclaimedObjs hits zero things get moving again ... your repository is a lot bigger than the SPC, so that is one reason by the dead object reclaim is slow ...

Apparently tranlog930 is missing from the set of files that you sent us ... and in our case it took about 5 hours to hit that point, but we were using a 4G SPC.

If you've got tranlog930 you could ship it to us so that we can ensure that isn't something else wrong....but at least the "hang" is explained ...

Let me know if you have any other questions.
Dale

From: "Dale K. Henrichs" <[hidden email]>
To: "Ken Treis" <[hidden email]>
Cc: [hidden email]
Sent: Friday, August 23, 2013 9:21:03 AM
Subject: Re: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

Ken,

In earlier versions we've seen hangs but they were always accompanied by "Recovery waiting for gcGems to reclaim dead" in the stone log, so this appears to possibly be a different variant of that problem...

First off, it would be easier if we could get copies of your extents and tranlogs so that we could debug the hang here ...

If that's not feasible, then we'd like to gather stats, stacks and logs:

1. start the stone with -N -R
2. fire up statmon
3. do your restore from current logs
4. after you see the hand, use $GEMSTONE/bin/pstack to get stacks
from the stoned process ... (take several samples with an intervening
delay)

The wrap up the stats, stacks and stone.log and send it to me ... if you can share extents and tranlogs, let me know and I'll give you the details ftp details...

Dale

From: "Ken Treis" <[hidden email]>
To: [hidden email]
Sent: Thursday, August 22, 2013 5:30:12 PM
Subject: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

We had a cloud server crash (problem on the underlying hardware), and after things were fixed I find myself unable to restore certain transaction logs. It goes something like this:

1. startstone -N -R seaside.
2. topaz -l

successful login
topaz 1> run
SystemRepository restoreStatus
%
Restoring from transaction log files, restored to 20/08/13 18:26:21 UTC, file 926 record 1596051, nextFileId 926, oldest fileId 926
topaz 1> run
SystemRepository restoreFromCurrentLogs
%

3. In the stone log, I see:

--- 08/22/13 23:56:27 UTC ---
Opened a transaction log file for read_nolocks.
filename = /opt/gemstone/GemStone64Bit3.1.0.4-x86_64.Linux//seaside/data/tranlog926.dbf
Restoring from current log directory to end of logs

and then nothing else happens. After an hour, I get impatient. Load average is low, and the only process using any appreciable CPU is pgsvrmain. So I kill pgsvrmain and let the stone crash.

4. I repeat 1 and 2, and restoreStatus shows me that no progress has been made. I'm still at the same record (1596051).

I've seen this happen in 3 different scenarios:

(A) After restoring a Smalltalk full backup.
(B) After restoring a filesystem backup.
(C) During normal startup after the crash, when there are just a couple of records to replay since the pre-crash checkpoint. I didn't confirm that the record ID wasn't changing at this point, but the behavior was identical to the other two).

All three hang at different record IDs, so I can't blame a single bad record. Also, in (A) both page and object audits came back OK. I haven't done audits on the other scenario.

Is this a known issue in 3.1.0.4? Anything else I should be checking/trying?

--
Ken Treis
Miriam Technologies, Inc.
(866) 652-2040 x221

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass