Image freezing problem


Image freezing problem

Adrian Lienhard
Hi,

We recently ran into the image freezing problem, where connecting
with VNC and moving the mouse brings the image back to life. When
it is stuck, Squeak hogs the CPU; memory consumption is stable.

There have been various reports matching the described symptoms, e.g.,
http://lists.squeakfoundation.org/pipermail/squeak-dev/2006-August/107618.html.

When I saw Andreas' delay issue, it looked obvious that this was
exactly the same problem. But the fix does not resolve it.
So, after quite a lot of hours of debugging, here is an explanation
of what is going on (kudos to Marcus, who helped me put the
last pieces of the puzzle together).

The problem occurred when producing a PDF of around 800 pages in a
background process (priority 25). In production, this process runs
alongside a Seaside application (running at priority 30) and
involves accessing the DB (PostgreSQL). However, I managed to come up
with a reproducible setup that involves neither Seaside requests nor
DB access. All there is: a CPU- and memory-intensive (~130MB)
background process which does some logging to the file system.
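
Roughly, that load can be approximated with a doit like the following (a
sketch only: the file name and the loop are made up, the real job is the
PDF renderer; the important bits are the low priority and the steady
allocation of short-lived objects):

	| log |
	log := FileStream forceNewFileNamed: 'churn.log'.
	[1 to: 1000000 do: [:i |
		String new: 4096.	"churn new space with short-lived garbage"
		i \\ 10000 = 0
			ifTrue: [log nextPutAll: 'still running'; cr; flush]].
	 log close] forkAt: 25	"same priority as the PDF job"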


Setup:
-----------------------------------------

Problem appeared in production with an older VM, so I decided to
build a new one with the most recent fixes:

- VM built from VMM 3.8b6 from SM + VmUpdates-100[1-6]-dtl.1.cs + JMM-VmUpdates32bitclean.2.cs
- Squeak 3.7 image + Andreas' delay fix
- gcc 2.95.4, Linux 2.4.18-686-smp #1 SMP i686
- Squeak is started headful in VNC

For tracing, I added the TraceMessageSends instrumentation from John
and extended it with some more data. (BTW, I vote to integrate this
by default, as it proved, together with the signaling facility, very useful!)


Explanation for the cause:
-----------------------------------------

The following message trace was started when the image got stuck and
was stopped again a few seconds after Squeak continued working
normally:

http://www.netstyle.ch/dropbox/messageTraceFile-4.txt.gz (4MB)

The extended format of the trace is: proc, ctxt, statIncrGCs,  
statTenures, statFullGCs, ioMSecs

From this trace, the problem looks to be explainable as follows:
- The time elapsed between two calls to WorldState>interCyclePause:
is longer than 20ms. The cause for this is not the UI process using
up a lot of time, but that there are 3 incr. GCs consuming 6 to 8ms
each within each UI cycle.
- Therefore, the UI process does not wait on the delay in
WorldState>interCyclePause:.
- Because the UI main loop only does a yield (see Process
class>>spawnNewProcess), the UI process stays runnable all the
time, as there is no other process with p = 40.
- Therefore, no process with p < 40 has a chance to be activated
(only higher-priority ones, which we find in the trace). This also
explains why we see 100% CPU usage while the UI still responds immediately.

So, while the UI process runs, the incr. GCs use up most of the time  
and keep the UI process active. There is just not enough space  
requested to trigger a full GC.
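
For reference, the pause logic in WorldState>interCyclePause: is roughly
the following (paraphrased from memory, not the exact 3.7 source; the
point is that the Delay is only waited on when the cycle finished early):

	interCyclePause: milliSecs
		"Sketch: sleep only for whatever is left of the per-cycle budget."
		| wait |
		lastCycleTime ifNotNil:
			[wait := lastCycleTime + milliSecs - Time millisecondClockValue.
			 (wait > 0 and: [wait <= milliSecs])
				ifTrue: [(Delay forMilliseconds: wait) wait]].
		lastCycleTime := Time millisecondClockValue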

Now, why does moving the mouse make it run again? I have no idea...
my guess is that the behavior triggered by a mouse move event somehow
forces a full GC. In the trace we see that after the 107th full GC
there are far fewer incr. GCs, and hence it becomes much more
likely that the UI process pauses again.


How could we fix this?
-----------------------------------------
a) Simply increase the 20ms pause defined by MinCycleLapse (at least
for production systems) or tweak the "pause logic". As a test I
defined MinCycleLapse to be 40ms and could not reproduce the problem
anymore (see the doit sketch after this list).

b) In production, suspend the UI process and only turn it on again  
when you need it (we do this via HTTP). This should also improve  
performance a bit. At best this is a workaround.

c) Tune the GC policies, as they are far from optimal for today's
systems (as John has suggested a couple of times). It seems, though,
that this cannot be guaranteed to fix the problem, but it should make
it less likely to happen(?).
I also took GC statistics while running the process (only part of it
and not while it was stuck):
- http://www.netstyle.ch/dropbox/gcStats-1.txt.gz: standard
configuration (was very slow)
- http://www.netstyle.ch/dropbox/gcStats-2.txt.gz: runActive with
setGCBiasToGrowGCLimit: 16*1026*1024

d) ?
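
(Regarding a): a quick way to try a different value on a running image,
assuming MinCycleLapse is still a plain class variable of WorldState as
in 3.7, is a one-liner like

	"double the minimum UI cycle length from 20ms to 40ms"
	WorldState classPool at: #MinCycleLapse put: 40.

The cleaner route is of course to change wherever MinCycleLapse gets
initialized, but the doit is handy for experimenting on a live server.)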


I'd be interested in getting feedback on:
- whether the explanation sounds plausible
- whether the fix (e.g., a)) works for other people that have this problem
- what may be a good fix

Cheers,
Adrian

___________________
Adrian Lienhard
www.adrian-lienhard.ch




Re: Image freezing problem

Andreas.Raab
Adrian Lienhard wrote:
> When seeing Andreas' delay issue, it looked just obvious that it's
> exactly this same problem. But the fix does not resolve it.

I didn't expect it to. If the Delay problem hits, VNC connections
typically no longer work (which makes the mystery all the more interesting).

> For tracing, I added the TraceMessageSends instrumentation from John and
> extended it with some more data. (BTW, I vote to integrate this by
> default as it proved, with the signaling facility, very useful!)

I think we need a "production Squeak" list or somesuch. We have a few
more of these things in our images and VMs that are immensely helpful
for tracking down issues like this and that sounds just like another one
that a production server really needs to have.

> The following message trace was started when the image got stuck and it
> was stopped again a few seconds after Squeak continued working normal
> again:
>
> http://www.netstyle.ch/dropbox/messageTraceFile-4.txt.gz (4MB)
>
> The extended format of the trace is: proc, ctxt, statIncrGCs,
> statTenures, statFullGCs, ioMSecs
>
>  From this trace, the problem looks to be explainable as follows:
> - The time elapsed between two calls to WorldState>interCyclePause: is
> longer than 20ms. The cause for this is not the UI process using up a
> lot of time, but that there are 3 incr. GCs consuming 6 to 8ms each
> within each UI cycle.

Is this a low-end server? 6-8ms is a *lot* of time being spent in IGC -
we typically see 6-8ms only after increasing the default GC parameters
(# of allocations between GCs, various thresholds) by roughly a factor
of ten. Also, the above means that effectively all or almost all of the
time is spent in those GCs which seems doubtful unless you have a busy
UI (we had this problem once when we left a Transcript open, which is a
terribly bad idea on a server image).

> - Therefore, the UI process does not wait on the delay in
> WorldState>interCyclePause:
> - Because the UI main loop only does a yield (see Process
> class>>spawnNewProcess) the UI process therefore stays runnable all the
> time as there is no other process with p = 40.
> - Therefore, no process with p < 40 has a chance to be activated (only
> higher ones, which we find in the trace). This also explains why we see
> 100% CPU usage, but still the UI responds immediately.

This sounds like a reasonable explanation.

> Now, why does moving the mouse make it run again? I have no idea... my
> guess is that the triggered behavior of a mouse move event somehow
> forces a full GC. In the trace we see that when the 107th full GC is
> done, there are much fewer incr. GCs later on. Hence, it is much more
> likely that the UI process pauses again.

Tenuring might fix it, too. And it may just be that wiggling the
mouse creates the bit of extra garbage needed to make the VM tenure.

> How could we fix this?
> -----------------------------------------
> a) Simply increase the 20ms pause defined by MinCycleLapse (at least for
> production systems) or tweak the "pause logic". As a test I defined
> MinCycleLapse to be 40ms. I could not reproduce the problem anymore.
>
> b) In production, suspend the UI process and only turn it on again when
> you need it (we do this via HTTP). This should also improve performance
> a bit. At best this is a workaround.
>
> c) Tune the GC policies as they are far from optimal for today's systems
> (as John has suggested a couple of times). It seems, though, that this
> cannot guarantee to fix the problem but it should make it less likely to
> happen(?).

d) Don't use processes that run below user scheduling priority. To be
honest, I'm not sure why you'd be running anything below UI priority on
a server.

e) Make a preference "lowerPerformance" (or call it "headlessMode" if
you wish :^) and have the effect be that in interCyclePause: you
*always* wait for a certain amount of time (20-50ms). This will ensure
that your UI process can't possibly eat up all of your CPU.
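
A minimal sketch of what this could look like (the preference flag is
invented for illustration, the body is a paraphrase of the existing
pause logic rather than the exact method, and lastCycleTime stands for
whatever timestamp the current implementation keeps):

	interCyclePause: milliSecs
		| wait |
		wait := milliSecs.	"server mode: always give up the CPU for the full budget"
		((Preferences valueOfFlag: #lowerPerformance) not and: [lastCycleTime notNil])
			ifTrue: [wait := lastCycleTime + milliSecs - Time millisecondClockValue].
		(wait > 0 and: [wait <= milliSecs])
			ifTrue: [(Delay forMilliseconds: wait) wait].
		lastCycleTime := Time millisecondClockValue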

> I'd be interested in getting feedback on:
> - whether the explanation sounds plausible

It does. There is, however, the question of what could possibly generate
enough load on the garbage collector to run IGCs that take 7ms on
average, and to run three of them in a single UI cycle.

> - whether the fix (e.g., a)) works for other people that have this
> problem.
> - what may be a good fix

I'd probably go with option e) above since it ensures that there is
always time left for the lower-priority processes (and you don't have to
change other code). Everything else seems risky, since you can't say for
sure that there isn't anything that keeps the UI in a busy-loop.

Cheers,
   - Andreas


Re: Image freezing problem

johnmci
In reply to this post by Adrian Lienhard

On Jul 26, 2007, at 10:06 AM, Adrian Lienhard wrote:

>
> c) Tune the GC policies as they are far from optimal for today's  
> systems (as John has suggested a couple of times). It seems,  
> though, that this cannot guarantee to fix the problem but it should  
> make it less likely to happen(?).
> I also took GC statistics while running the process (only part of  
> it and not when it was stuck):
> - http://www.netstyle.ch/dropbox/gcStats-1.txt.gz: standard  
> configuration (was very slow)
> - http://www.netstyle.ch/dropbox/gcStats-2.txt.gz: runActive with  
> setGCBiasToGrowGCLimit: 16*1026*1024


I'll note that the standard VMMaker has a bug in the setGCBiasToGrow logic.
You must load the

JMM-fixBiasToGrow.1.cs

change set, which is found in the Mac OS/vm/specialChangeSets/
directory in the SVN tree.
Without the change set this logic will not work properly.

Plus, of course, you have to do
Smalltalk setGCBiasToGrow: 1.

in order to turn the feature on; setting just the
setGCBiasToGrowGCLimit: won't turn it on.
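
For completeness, turning the feature on together with a limit then
looks like this (the 16MB figure, written out as 16*1024*1024, is just
an example value):

	Smalltalk setGCBiasToGrow: 1.	"turn the bias-to-grow logic on"
	Smalltalk setGCBiasToGrowGCLimit: 16 * 1024 * 1024.	"once growth hits this limit, a full GC is done"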

--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================




Re: Image freezing problem

Adrian Lienhard

On Jul 26, 2007, at 20:25 , John M McIntosh wrote:

>
> On Jul 26, 2007, at 10:06 AM, Adrian Lienhard wrote:
>
>>
>> c) Tune the GC policies as they are far from optimal for today's  
>> systems (as John has suggested a couple of times). It seems,  
>> though, that this cannot guarantee to fix the problem but it  
>> should make it less likely to happen(?).
>> I also took GC statistics while running the process (only part of  
>> it and not when it was stuck):
>> - http://www.netstyle.ch/dropbox/gcStats-1.txt.gz: standard  
>> configuration (was very slow)
>> - http://www.netstyle.ch/dropbox/gcStats-2.txt.gz: runActive with  
>> setGCBiasToGrowGCLimit: 16*1026*1024
>
>
> I'll note the standard VMMaker has a bug in it for setting the  
> setGCBiasToGrow logic.
> You must load the
>
> JMM-fixBiasToGrow.1.cs

I did not load this, but apparently the same version of the method is  
in the changeset VMMaker-tpr.58, which I got from loading the most  
recent SM package.

> change set which is found in the Mac OS/vm/specialChangeSets/ directory in the SVN tree
> Without the change set this logic will not work properly.
>
> Plus of course you have to do
> Smalltalk setGCBiasToGrow: 1.
> in order to turn the feature on, setting just the setGCBiasToGrowGCLimit: won't turn it on.

yes, I did that.

Adrian



Re: Image freezing problem

johnmci
In reply to this post by Adrian Lienhard

On Jul 26, 2007, at 10:06 AM, Adrian Lienhard wrote:

>
> c) Tune the GC policies as they are far from optimal for today's  
> systems (as John has suggested a couple of times). It seems,  
> though, that this cannot guarantee to fix the problem but it should  
> make it less likely to happen(?).
> I also took GC statistics while running the process (only part of  
> it and not when it was stuck):
> - http://www.netstyle.ch/dropbox/gcStats-1.txt.gz: standard  
> configuration (was very slow)

Yes, this is the classic: I think we are going to do a full GC, but
let's first see if an IGC will get back just one more byte. Oh, it did;
fine, let's run until the next memory allocation and repeat. I'm sure a
full GC will happen, someday, or maybe 10,000 IGC iterations later.


> - http://www.netstyle.ch/dropbox/gcStats-2.txt.gz: runActive with  
> setGCBiasToGrowGCLimit: 16*1026*1024


Statistical data is always good.

Let's see, I have some observations and random thoughts.

I note you have a memory allocation of 100MB which grows to 150MB.
You should probably look at the grow/shrink threshold values; they
likely need to be adjusted.

When it is running I see:
45,042 marked objects - the number of objects marked when doing the
IGC, which is all live objects in new space plus references from old
to new space.
4,001 swept objects - the number of objects swept (visited linearly
after the GC mark).
4,000 is the allocation count.
639 is the survivor count.
4 milliseconds is the IGC time.

Note that later this actually jumps to 86K objects; likely you have the
issue where a large array in old space has been marked as a root and
you are iterating over it looking for items in young space. The
tenuring on each IGC helps with that.

I believe Tim is up for fixing this problem, just slip him *lots* of
Euros first.

In theory you could have a much larger allocation and tenure count;
Andreas has some data on larger values. Likely 40,000 allocations and
tenuring at 35,000 come to mind?
{Well, actually, I might take this back, given your marking issue.}
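
If you want to experiment with larger values, the knobs are (if I
remember the numbering right; check the comment in
SystemDictionary>>vmParameterAt: before trusting this) VM parameters
5 and 6:

	"assumed numbering: 5 = allocations between IGCs, 6 = survivor tenuring threshold"
	Smalltalk vmParameterAt: 5 put: 40000.
	Smalltalk vmParameterAt: 6 put: 35000.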

However, the other thing is that 45,042 objects to mark seems a bit
high. I've used the theory for Sophie that if the marking count is 2x
the number of allocations then we should tenure. Note that your 45,042
marked objects are the result of only 639 survivors, but I think you
are doing this already? Still, you could consider how many allocations
versus marked objects is acceptable, since the 2x might not be correct:
it follows a pattern of 312, 364, 552, 45K, 48K, 44K, 45K, 44K, 44K and
then repeats, so you might be tenuring too early?
Then again, maybe not, since a full GC doesn't return any memory.


In JMM memory policy >> calculateGoals I have...

	statMarkCount > (statAllocationCount * 2)
		ifTrue: [[Smalltalk forceTenure] on: Error do: [:ex | ]].
			"Tenure if we think too much root table marking is going on"

Also, I can't say whether on your machine marking 45K objects should
take 4 milliseconds; I don't know if that is a reasonable value. Of
course, if you increase the allocation limit, then the marking count
will go up in relation to that. Other people might have to think about
whether marking only 10,000 objects per ms seems reasonable. In
checking some other examples I have, it does however seem in line with
another sample of mine.


setGCBiasToGrowGCLimit:

should be set to some value that won't interfere much with the
workload. If, for example, you always expect to grab 40MB of memory to
perform a task, setting it to 16MB is less valuable than setting it to
50MB.
Keep in mind, of course, that once you hit the 50MB limit you will need
to do a full GC, and what is the time impact of that on what you are doing?
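
For example, for a job that routinely grabs around 40MB, something like:

	Smalltalk setGCBiasToGrowGCLimit: 50 * 1024 * 1024.	"enough headroom for a whole run before the forced full GC"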

--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================




Re: Image freezing problem

Adrian Lienhard
In reply to this post by Andreas.Raab

On Jul 26, 2007, at 19:28 , Andreas Raab wrote:

>> For tracing, I added the TraceMessageSends instrumentation from  
>> John and extended it with some more data. (BTW, I vote to  
>> integrate this by default as it proved, with the signaling  
>> facility, very useful!)
>
> I think we need a "production Squeak" list or somesuch. We have a  
> few more of these things in our images and VMs that are immensely  
> helpful for tracking down issues like this and that sounds just  
> like another one that a production server really needs to have.

yep, I agree

>> The following message trace was started when the image got stuck  
>> and it was stopped again a few seconds after Squeak continued  
>> working normal again:
>> http://www.netstyle.ch/dropbox/messageTraceFile-4.txt.gz (4MB)
>> The extended format of the trace is: proc, ctxt, statIncrGCs,  
>> statTenures, statFullGCs, ioMSecs
>>  From this trace, the problem looks to be explainable as follows:
>> - The time elapsed between two calls to  
>> WorldState>interCyclePause: is longer than 20ms. The cause for  
>> this is not the UI process using up a lot of time, but that there  
>> are 3 incr. GCs consuming 6 to 8ms each within each UI cycle.
>
> Is this a low-end server? 6-8ms is a *lot* of time being spent in  
> IGC - we typically see 6-8ms only after increasing the default GC

It's a quite old server with 2 x 1.4 GHz CPUs.

> parameters (# of allocations between GCs, various thresholds) by  
> roughly a factor of ten. Also, the above means that effectively all  
> or almost all of the time is spent in those GCs which seems doubtful

Let's see. This is an excerpt from the very beginning of the trace,  
showing one UI cycle. No other process was active in this period.
The numbers indicate: proc, ctxt, statIncrGCs, statTenures,  
statFullGCs, ioMSecs


1145108440 1189307120 58988 5419 106 796524 WorldState>interCyclePause:
....
1145108440 1189307696 58988 5419 106 796524 EventSensor>nextEventFromQueue
1145108440 1189302556 58989 5419 106 796531 SharedQueue>isEmpty
....
1145108440 1189304704 58989 5419 106 796537 UserInputEvent>transformedBy:
1145108440 1189304084 58990 5419 106 796544 UserInputEvent>transformedBy:
....
1145108440 1189307680 58990 5419 106 796547 HandMorph>noticeMouseOver:event:
1145108440 1189303900 58991 5419 106 796555 MouseOverHandler>noticeMouseOver:event:
...
1145108440 1189307120 58991 5419 106 796556 WorldState>interCyclePause:

- The elapsed time between the start and end of the cycle is 796556 - 796524 = 32 ms.
- There have been 3 IGCs, consuming 7 + 7 + 8 = 22 ms.

Now, the last figure may be debatable, as I just looked at the
difference between the clock values of the two messages between which
statIncrGCs was incremented.
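
A crude way to cross-check this without the message tracer would be to
sample the VM's own GC counters around a single UI cycle, e.g.
(assuming the usual VM parameter numbering, where 9 and 10 are the
incremental GC count and the total milliseconds spent in them):

	| igcs igcMs t |
	igcs := Smalltalk vmParameterAt: 9.	"incremental GCs since startup"
	igcMs := Smalltalk vmParameterAt: 10.	"total ms spent in incremental GCs"
	t := Time millisecondClockValue.
	World doOneCycle.
	Transcript
		show: 'cycle: ', (Time millisecondClockValue - t) printString, ' ms, ';
		show: ((Smalltalk vmParameterAt: 9) - igcs) printString, ' IGCs, ';
		show: ((Smalltalk vmParameterAt: 10) - igcMs) printString, ' ms in IGC'; cr.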

> unless you have a busy UI (we had this problem once when we left a  
> Transcript open, which is a terribly bad idea on a server image).

No no, the UI is not busy; a workspace is open and maybe a code
browser (indeed, leaving the Transcript open is deadly).

[...]

>> How could we fix this?
>> -----------------------------------------
>> a) Simply increase the 20ms pause defined by MinCycleLapse (at  
>> least for production systems) or tweak the "pause logic". As a  
>> test I defined MinCycleLapse to be 40ms. I could not reproduce the  
>> problem anymore.
>> b) In production, suspend the UI process and only turn it on again  
>> when you need it (we do this via HTTP). This should also improve  
>> performance a bit. At best this is a workaround.
>> c) Tune the GC policies as they are far from optimal for today's  
>> systems (as John has suggested a couple of times). It seems,  
>> though, that this cannot guarantee to fix the problem but it  
>> should make it less likely to happen(?).
>
> d) Don't use processes that run below user scheduling priority. To  
> be honest, I'm not sure why you'd be running anything below UI  
> priority on a server.

Hm, Kom starts processes with priority 30, I guess to keep the UI
responsive. The long-running process I have here should be of lower
priority than the process handling the Seaside requests. Actually, I
don't care how long it takes, as long as it does not slow down or
even block the rest.
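
Concretely, the layering here is: UI at 40 (Processor
userSchedulingPriority), the Kom/Seaside handlers at 30
(userBackgroundPriority), and the report job below both, so it only
gets the CPU when the server is otherwise idle. Something like
(generatePdf stands for the real entry point):

	[self generatePdf] forkAt: 25	"below the Kom handlers at 30, far below the UI at 40"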

> e) Make a preference "lowerPerformance" (or call it "headlessMode"  
> if you wish :^) and have the effect be that in intercyclePause: you  
> *always* wait for a certain amount of time (20-50ms). This will  
> ensure that your UI process can't possibly eat up all of your CPU.

Yes, that sounds like a reasonable fix (apart from the fact that we get
yet another preference lookup in the UI loop...).

Cheers,
Adrian

>
>> I'd be interested in getting feedback on:
>> - whether the explanation sounds plausible
>
> It does. There is however a question what could possibly generate  
> enough load on the garbage collector to run IGCs that take on  
> average 7ms and run three of them in a single UI cycle.
>
>> - whether the fix (e.g., a)) works for other people that have
>> this problem.
>> - what may be a good fix
>
> I'd probably go with option e) above since it ensures that there is  
> always time left for the lower priority processes (and you don't  
> have to change other code). Everything else seems risky since you  
> can't say for sure if there isn't anything that keeps the UI in a  
> busy-loop.
>
> Cheers,
>   - Andreas
>



Re: Image freezing problem

johnmci

On Jul 26, 2007, at 2:40 PM, Adrian Lienhard wrote:

> Let's see. This is an excerpt from the very beginning of the trace,  
> showing one UI cycle. No other process was active in this period.
> The numbers indicate: proc, ctxt, statIncrGCs, statTenures,  
> statFullGCs, ioMSecs
>
>
> 1145108440 1189307120 58988 5419 106 796524 WorldState>interCyclePause:
> ....
> 1145108440 1189307696 58988 5419 106 796524 EventSensor>nextEventFromQueue
> 1145108440 1189302556 58989 5419 106 796531 SharedQueue>isEmpty
> ....
> 1145108440 1189304704 58989 5419 106 796537 UserInputEvent>transformedBy:
> 1145108440 1189304084 58990 5419 106 796544 UserInputEvent>transformedBy:
> ....
> 1145108440 1189307680 58990 5419 106 796547 HandMorph>noticeMouseOver:event:
> 1145108440 1189303900 58991 5419 106 796555 MouseOverHandler>noticeMouseOver:event:
> ...
> 1145108440 1189307120 58991 5419 106 796556 WorldState>interCyclePause:
>
> - The elapsed time between the start and end of the cycle is 796556 - 796524 = 32 ms.
> - There have been 3 IGCs, consuming 7 + 7 + 8 = 22 ms.

Keep in mind, of course, that logging message sends *is* expensive and
will make things *lots* slower.
I'd have to look, but the IGC ms clock times are recorded in the VM
before/after the GC work, so any logging should not interfere
with that data.
--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================




Re: Image freezing problem

Adrian Lienhard
I created a new Mantis report with a fix.
See: http://bugs.squeak.org/view.php?id=6581

Adrian

