Slow Reclaims during Recovery

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Slow Reclaims during Recovery

Ken Treis
I made a major mistake yesterday and had to restore from a backup copy of the extent file. Thankfully, the copy was quite recent, so I expected recovery to be more or less instant because there were so few transactions to restore.

But.. recovery took over 9 hours:

    Recovery waited 0.630 seconds for freeFrameCount to recover.
    Recovery took 34749.732 seconds
      6.770 seconds spent waiting for free frames in shared page cache
      34350.070 seconds spent in readThread waiting for free buffers

I watched the DeadNotReclaimedObjs stat during this time (thanks Dale for pointing me to that), and I watched it slowly reclaim 65 million dead objects. Once the reclaim was done, things came alive really quickly. I'd finished a MFC earlier in the day, so apparently we had a long list of un-reclaimed objects.

While the whole world was waiting for reclaim, the server appeared to be idle. Load average was about 0.01, GemStone processes were using very little CPU and doing no measurable I/O. Even after the entire 9-hour process, the reclaim gem had only accumulated 33 seconds of CPU time.

During this agonizing process, I read up on tuning reclaim, and I fully expected to find that my system had a non-zero #sleepTimeBetweenReclaimMs:

#sleepTimeBetweenReclaimMs
The minimum amount of time in milliseconds that the process will sleep between reclaims, even when work is scheduled. The default is 0 milliseconds, maximum 3600000.

... because that's what it was acting like. But no, after recovery finished, I checked and found that parameter to be 0.

What else could cause reclaim-during-recovery to run so slowly? The reclaim gem didn't seem to be working any harder than it does during normal operation. I kind of appreciate that it doesn't over-tax the system normally, though but during recovery I want it to work its tail off.

This is a GLASS app, so we have a small SPC (2 GB) but our extent is large (77 GB). But again, I didn't see that this work was either processor- or I/O-bound. We're running on an i2.xlarge EC2 instance with SSD storage, so we can pull all of that data through the small SPC pretty quickly. MFCs take us less than 20 minutes. 

--
Ken Treis
Miriam Technologies, Inc.
(866) 652-2040 x221


_______________________________________________
GemStone-Smalltalk mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/gemstone-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: Slow Reclaims during Recovery

Dale Henrichs-3
Ken,

I'm pretty sure that there are some known bugs in this area with regards to slow recovery ... and we do have fixes for these bugs, but I'm not sure which version they have been fixed in, but if they are fixed they should be in 3.2.1....

I'll check with the guys in engineering to see if they recognize this problem, but there are some folks on vacation so I won't necessarily be getting quick responses:)

Dale


On Fri, Jul 18, 2014 at 9:29 AM, Ken Treis <[hidden email]> wrote:
I made a major mistake yesterday and had to restore from a backup copy of the extent file. Thankfully, the copy was quite recent, so I expected recovery to be more or less instant because there were so few transactions to restore.

But.. recovery took over 9 hours:

    Recovery waited 0.630 seconds for freeFrameCount to recover.
    Recovery took 34749.732 seconds
      6.770 seconds spent waiting for free frames in shared page cache
      34350.070 seconds spent in readThread waiting for free buffers

I watched the DeadNotReclaimedObjs stat during this time (thanks Dale for pointing me to that), and I watched it slowly reclaim 65 million dead objects. Once the reclaim was done, things came alive really quickly. I'd finished a MFC earlier in the day, so apparently we had a long list of un-reclaimed objects.

While the whole world was waiting for reclaim, the server appeared to be idle. Load average was about 0.01, GemStone processes were using very little CPU and doing no measurable I/O. Even after the entire 9-hour process, the reclaim gem had only accumulated 33 seconds of CPU time.

During this agonizing process, I read up on tuning reclaim, and I fully expected to find that my system had a non-zero #sleepTimeBetweenReclaimMs:

#sleepTimeBetweenReclaimMs
The minimum amount of time in milliseconds that the process will sleep between reclaims, even when work is scheduled. The default is 0 milliseconds, maximum 3600000.

... because that's what it was acting like. But no, after recovery finished, I checked and found that parameter to be 0.

What else could cause reclaim-during-recovery to run so slowly? The reclaim gem didn't seem to be working any harder than it does during normal operation. I kind of appreciate that it doesn't over-tax the system normally, though but during recovery I want it to work its tail off.

This is a GLASS app, so we have a small SPC (2 GB) but our extent is large (77 GB). But again, I didn't see that this work was either processor- or I/O-bound. We're running on an i2.xlarge EC2 instance with SSD storage, so we can pull all of that data through the small SPC pretty quickly. MFCs take us less than 20 minutes. 

--
Ken Treis
Miriam Technologies, Inc.
<a href="tel:%28866%29%20652-2040%20x221" value="+18666522040" target="_blank">(866) 652-2040 x221


_______________________________________________
GemStone-Smalltalk mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/gemstone-smalltalk



_______________________________________________
GemStone-Smalltalk mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/gemstone-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: Slow Reclaims during Recovery

Dale Henrichs-3
In reply to this post by Ken Treis
Ken,

It is indeed a known bug:

"Well observed on the impact of reclaimSleepTime. This parameter is not supposed to impact active reclaim, as Bill notes; it should only apply when reclaim is not busy and ready to go to sleep. We've reproduced the problem, however; it turns out that there is a bug that causes this to have an impact similar to sleepTimeBetweenReclaimMs, and so, significantly slows down reclaim. This has been reported as bug 44292."

The bug _has_ been fixed in 3.2.1 and is queued up for release in 3.1.0.7, but we do not have a release date for 3.1.0.7. Here's a bugnote[1] describing the bug and workarounds...

Dale



On Fri, Jul 18, 2014 at 9:29 AM, Ken Treis <[hidden email]> wrote:
I made a major mistake yesterday and had to restore from a backup copy of the extent file. Thankfully, the copy was quite recent, so I expected recovery to be more or less instant because there were so few transactions to restore.

But.. recovery took over 9 hours:

    Recovery waited 0.630 seconds for freeFrameCount to recover.
    Recovery took 34749.732 seconds
      6.770 seconds spent waiting for free frames in shared page cache
      34350.070 seconds spent in readThread waiting for free buffers

I watched the DeadNotReclaimedObjs stat during this time (thanks Dale for pointing me to that), and I watched it slowly reclaim 65 million dead objects. Once the reclaim was done, things came alive really quickly. I'd finished a MFC earlier in the day, so apparently we had a long list of un-reclaimed objects.

While the whole world was waiting for reclaim, the server appeared to be idle. Load average was about 0.01, GemStone processes were using very little CPU and doing no measurable I/O. Even after the entire 9-hour process, the reclaim gem had only accumulated 33 seconds of CPU time.

During this agonizing process, I read up on tuning reclaim, and I fully expected to find that my system had a non-zero #sleepTimeBetweenReclaimMs:

#sleepTimeBetweenReclaimMs
The minimum amount of time in milliseconds that the process will sleep between reclaims, even when work is scheduled. The default is 0 milliseconds, maximum 3600000.

... because that's what it was acting like. But no, after recovery finished, I checked and found that parameter to be 0.

What else could cause reclaim-during-recovery to run so slowly? The reclaim gem didn't seem to be working any harder than it does during normal operation. I kind of appreciate that it doesn't over-tax the system normally, though but during recovery I want it to work its tail off.

This is a GLASS app, so we have a small SPC (2 GB) but our extent is large (77 GB). But again, I didn't see that this work was either processor- or I/O-bound. We're running on an i2.xlarge EC2 instance with SSD storage, so we can pull all of that data through the small SPC pretty quickly. MFCs take us less than 20 minutes. 

--
Ken Treis
Miriam Technologies, Inc.
<a href="tel:%28866%29%20652-2040%20x221" value="+18666522040" target="_blank">(866) 652-2040 x221


_______________________________________________
GemStone-Smalltalk mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/gemstone-smalltalk



_______________________________________________
GemStone-Smalltalk mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/gemstone-smalltalk
Reply | Threaded
Open this post in threaded view
|

Re: Slow Reclaims during Recovery

Ken Treis
Great, thanks Dale. I'll drop the reclaimSleepTime to 1 as the workaround suggests.

--
Ken Treis
Miriam Technologies, Inc.
(866) 652-2040 x221

On Jul 18, 2014, at 11:20 AM, Dale Henrichs <[hidden email]> wrote:

Ken,

It is indeed a known bug:

"Well observed on the impact of reclaimSleepTime. This parameter is not supposed to impact active reclaim, as Bill notes; it should only apply when reclaim is not busy and ready to go to sleep. We've reproduced the problem, however; it turns out that there is a bug that causes this to have an impact similar to sleepTimeBetweenReclaimMs, and so, significantly slows down reclaim. This has been reported as bug 44292."

The bug _has_ been fixed in 3.2.1 and is queued up for release in 3.1.0.7, but we do not have a release date for 3.1.0.7. Here's a bugnote[1] describing the bug and workarounds...

Dale



On Fri, Jul 18, 2014 at 9:29 AM, Ken Treis <[hidden email]> wrote:
I made a major mistake yesterday and had to restore from a backup copy of the extent file. Thankfully, the copy was quite recent, so I expected recovery to be more or less instant because there were so few transactions to restore.

But.. recovery took over 9 hours:

    Recovery waited 0.630 seconds for freeFrameCount to recover.
    Recovery took 34749.732 seconds
      6.770 seconds spent waiting for free frames in shared page cache
      34350.070 seconds spent in readThread waiting for free buffers

I watched the DeadNotReclaimedObjs stat during this time (thanks Dale for pointing me to that), and I watched it slowly reclaim 65 million dead objects. Once the reclaim was done, things came alive really quickly. I'd finished a MFC earlier in the day, so apparently we had a long list of un-reclaimed objects.

While the whole world was waiting for reclaim, the server appeared to be idle. Load average was about 0.01, GemStone processes were using very little CPU and doing no measurable I/O. Even after the entire 9-hour process, the reclaim gem had only accumulated 33 seconds of CPU time.

During this agonizing process, I read up on tuning reclaim, and I fully expected to find that my system had a non-zero #sleepTimeBetweenReclaimMs:

#sleepTimeBetweenReclaimMs
The minimum amount of time in milliseconds that the process will sleep between reclaims, even when work is scheduled. The default is 0 milliseconds, maximum 3600000.

... because that's what it was acting like. But no, after recovery finished, I checked and found that parameter to be 0.

What else could cause reclaim-during-recovery to run so slowly? The reclaim gem didn't seem to be working any harder than it does during normal operation. I kind of appreciate that it doesn't over-tax the system normally, though but during recovery I want it to work its tail off.

This is a GLASS app, so we have a small SPC (2 GB) but our extent is large (77 GB). But again, I didn't see that this work was either processor- or I/O-bound. We're running on an i2.xlarge EC2 instance with SSD storage, so we can pull all of that data through the small SPC pretty quickly. MFCs take us less than 20 minutes. 

--
Ken Treis
Miriam Technologies, Inc.
<a href="tel:%28866%29%20652-2040%20x221" value="+18666522040" target="_blank">(866) 652-2040 x221


_______________________________________________
GemStone-Smalltalk mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/gemstone-smalltalk




_______________________________________________
GemStone-Smalltalk mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/gemstone-smalltalk