Maximum value of -stackpages VM parameter?

Maximum value of -stackpages VM parameter?

Phil B
 
In trying to troubleshoot an issue, I needed to bump up the stackpages parameter.  On 64-bit Linux, a value of 600 worked but 1000 segfaulted so I was just wondering what the limit(s) are for it?

Re: Maximum value of -stackpages VM parameter?

Eliot Miranda-2
 
Hi Phil,


There are no explicit limits.  The seg fault you're seeing is a result of the stack pages being allocated on the C stack.  When the number is too high, the C stack overflows and boom.

A word to the wise: with too high a value, scavenging performance falls (stack pages are implicitly roots into new space), and become performance falls (all activations in stack space are scanned post-become to avoid a read barrier on inst var fetch).

The default value was 192, a value chosen to exceed Qwaq server process usage, but both at Cadence and in Spur profiling we found that was not a good value and pulled it back to 64 (IIRC).

I'm curious as to why you are exploring such high values.
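As an aside, the stack-page count can also be requested from within the image rather than on the command line. A hedged sketch, assuming Squeak's `Smalltalk vmParameterAt:put:` (Pharo spells it `Smalltalk vm parameterAt:put:`) and writable parameter 43, the "desired number of stack pages"; IIRC the new value takes effect at the next image startup, not immediately:

```smalltalk
"Sketch: parameter 43 is the desired number of stack pages (0 = VM default),
 parameter 42 is the number currently available.  Receiver and selector vary
 by dialect; verify against your image before relying on this."
Smalltalk vmParameterAt: 43 put: 128.    "request 128 pages from next startup"
Smalltalk vmParameterAt: 42.             "number of pages currently available"
```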

Re: Maximum value of -stackpages VM parameter?

Phil B
 
Eliot,

Thanks for the info, that's good to know.  I probably should have been explicit that I am only bumping it up this high to troubleshoot a rather annoying startup bug in my code.  When it crashes as a result of the stack overflow, the trace is pretty useless (IIRC, about half a page of INVALID REFERENCE), so I'm mostly flying blind.  Bumping up the limit is allowing me to get a better view of where things are going wrong, and I plan to drop back down once I've resolved it.

Thanks,
Phil


Re: Maximum value of -stackpages VM parameter?

Eliot Miranda-2
 
Hi Phil,

A better way to debug this would be to set a breakpoint in the scavenger and print the stack on every GC.  Stack overflow in a language like Smalltalk, where activations are objects, means that the heap grows as the stack grows.  (The stack pages in the stack zone can be seen as an allocation cache for the most recent activations, reducing the pressure on the GC.)  So if you run under gdb (lldb on the Mac) and print the stack in each GC, you should be able to at least see where the infinite recursion is coming from before the system runs out of memory:

(gdb) b doScavenge
Breakpoint 1 at NNNN
(gdb) commands 1
call printStackCallStackOf(framePointer)
end
(gdb) run myimage.image

You can use
(gdb) call pushOutputFile("stack.log")
to get the VM to send subsequent output to a file, and
(gdb) call popOutputFile()
to close the log.



Re: Maximum value of -stackpages VM parameter?

Phil B
 
Eliot,

Thanks for the tip, I'll give that a shot.  Also, is it possible to check the amount of stack usage from the image?  (I.e., something reasonably fast, just to get a rough idea of where things stand.)

Phil


On Jun 12, 2017 7:24 PM, "Eliot Miranda" <[hidden email]> wrote:
 
Hi Phil,

On Jun 12, 2017, at 2:25 PM, Phil B <[hidden email]> wrote:

Eliot,

Thanks for the info, that's good to know.  I probably should have been explicit in that I am only bumping it up this high to troubleshoot a rather annoying startup bug in my code. When it crashes as a result of the stack overflow the trace is pretty useless (iirc, about 1/2 a page of INVALID REFERENCE so I'm mostly flying blind.)  Bumping up the limit is allowing me to get a better view of where things are going wrong and I plan to drop back once I've resolved it.

A better way to debug thus will be to set a breakpoint in the scavenger and the GC on every GC.  Stack overflow in a language like Smalltalk where activations are objects means that the heap grows as the stack grows.  (The stack pages in the stack zone can be seen as an allocation cache for the most recent activations, reducing the pressure on the GC).  So if run under gdb (lldb on Mac) and you print the stack in each GC you should be able to at least see where the infinite recursion is coming from before the system runs out of memory:

(gdb) b doScavenge
breakpoint 1 set at NNNN
(gdb) commands 1
call printStackCallStackOf(framePointer)
end
(gdb) run myimage.image

You can use
(gdb) call pushOutputFile("stack.log")
to get the vm to send subsequent output to a file and 
(gdb) call popOutputFile()
to close the log.


Thanks,
Phil

On Jun 12, 2017 4:43 PM, "Eliot Miranda" <[hidden email]> wrote:

Hi Phil,


> On Jun 12, 2017, at 12:50 PM, Phil B <[hidden email]> wrote:
>
> In trying to troubleshoot an issue, I needed to bump up the stackpages parameter.  On 64-bit Linux, a value of 600 worked but 1000 segfaulted so I was just wondering what the limit(s) are for it?

There are no explicit limits.  The set fault you're seeing is as a result of the stack pages being allocated on the c stack.  When the number is high the stack overflows and boom.

A word to the wise: too high a value and scavenging performance falls (stack pages are implicitly roots into new space), and become performance falls (all activations in stack space are scanned post become to avoid a read barrier on inst var fetch).

The default value was 192, a value chosen to exceed qwaq server process usage, but both at Cadence and in Spur profiling we found that was not a good value and pulled it back to 64 (IIRC).

I'm curious as to why are you exploring such high values.



Reply | Threaded
Open this post in threaded view
|

Re: Maximum value of -stackpages VM parameter?

Eliot Miranda-2
 
Hi Phil,

    Via vmParameterAt: you can access:

60 number of stack page overflows since startup (read-only; Cog VMs only)

A stack page overflow occurs when a computation that sends deeply enough fills a page with activations and, to continue, needs to extend onto a fresh page.  It is expected that this number will be high.

61 number of stack page divorces since startup (read-only; Cog VMs only)

A stack page divorce occurs when either a stack overflow or a process switch requires a new page but all pages are in use, so the least recently used page is "divorced": its activations are converted into context objects on the heap, emptying the page and allowing its reuse.  This is the number we'd like to keep low by upping the number of stack pages, but not so much that we slow down the GC.

68 the average number of live stack pages when scanned by GC (at scavenge/gc/become et al)
69 the maximum number of live stack pages when scanned by GC (at scavenge/gc/become et al)

These two (sorry, just noticed they're not in the method comment, at least in Squeak) can be used to monitor how many stack pages are in use as the system runs.

From these two we can tell whether a large number of pages leads to a high load on the scavenger scanning stack pages.  If the average is low while the number of stack pages is high, then the application's usage pattern is insensitive to the number of stack pages, and one can increase that number without seeing much GC overhead.  But I expect this is unlikely; these two were added to monitor GC performance at Cadence, and indeed we saw that increasing the number of stack pages in use also increases the average number of stack pages in use at GC time.

That said, in my current VMMaker image, in this session merely used for browsing, I see

#42 50 number of stack pages available   (default)
#43 0 desired number of stack pages (i.e. select default)

#60 89,370 number of stack page overflows since startup
#61 0 number of stack page divorces since startup

#68 11.35 the average number of live stack pages when scanned by scavenge/gc/become
#69 16 the maximum number of live stack pages when scanned by scavenge/gc/become

So in normal development use it looks like stack page use is minimal.
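The statistics above can be gathered in one go from a workspace. A minimal sketch, assuming Squeak's `Smalltalk vmParameterAt:` (Pharo spells it `Smalltalk vm parameterAt:`); the parameter indices are those listed in this message:

```smalltalk
"Sketch: print the stack-page statistics discussed above to the Transcript.
 Indices per this thread: 42 pages available, 60 overflows, 61 divorces,
 68 average live pages at GC, 69 maximum live pages at GC."
#((42 'stack pages available')
  (60 'page overflows since startup')
  (61 'page divorces since startup')
  (68 'average live pages at GC')
  (69 'maximum live pages at GC'))
    do: [:pair |
        Transcript
            print: (Smalltalk vmParameterAt: pair first);
            nextPutAll: ' '; nextPutAll: pair second; cr].
```

Polling this periodically while the application runs (rather than once at startup) gives the rough, fast view of stack usage Phil asked about, without attaching gdb.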

--
_,,,^..^,,,_
best, Eliot

Re: Maximum value of -stackpages VM parameter?

Phil B
 
Excellent, I think this gives me the tools I'll need to get to the bottom of the problem.

Thanks,
Phil
