Hi,
We recently ran into the image freezing problem where connecting with VNC and
moving the mouse would bring the image alive again. While it is stuck, Squeak
hogs the CPU; memory consumption is stable. There have been various reports
matching the described symptoms, e.g.,
http://lists.squeakfoundation.org/pipermail/squeak-dev/2006-August/107618.html.

When seeing Andreas' delay issue, it seemed obvious that it's exactly this same
problem. But the fix does not resolve it. So, after many hours of debugging,
here is an explanation of what is going on (kudos to Marcus, who helped me with
putting the last pieces of the puzzle together).

The problem occurred when producing a PDF of around 800 pages in a background
process (priority 25). In production, this process runs alongside serving a
Seaside application (running with priority 30) and involves accessing the DB
(PostgreSQL). However, I managed to come up with a reproducible setup without
involving Seaside requests and without accessing the DB. All it takes is a CPU-
and memory-intensive (~130MB) background process which does some logging to the
file system.

Setup:
-----------------------------------------
The problem appeared in production with an older VM, so I decided to build a
new one with the most recent fixes:

- VM built from VMM 3.8b6 from SM + VmUpdates-100[1-6]-dtl.1.cs +
  JMM-VmUpdates32bitclean.2.cs
- Squeak 3.7 image + Andreas' delay fix
- gcc 2.95.4, Linux 2.4.18-686-smp #1 SMP i686
- Squeak is started headful in VNC

For tracing, I added the TraceMessageSends instrumentation from John and
extended it with some more data. (BTW, I vote to integrate this by default as
it proved, with the signaling facility, very useful!)

Explanation for the cause:
-----------------------------------------
The following message trace was started when the image got stuck and was
stopped again a few seconds after Squeak continued working normally again:

http://www.netstyle.ch/dropbox/messageTraceFile-4.txt.gz (4MB)

The extended format of the trace is: proc, ctxt, statIncrGCs, statTenures,
statFullGCs, ioMSecs

From this trace, the problem looks to be explainable as follows:

- The time elapsed between two calls to WorldState>interCyclePause: is longer
  than 20ms. The cause for this is not the UI process using up a lot of time,
  but that there are 3 incr. GCs consuming 6 to 8ms each within each UI cycle.
- Therefore, the UI process does not wait on the delay in
  WorldState>interCyclePause:
- Because the UI main loop only does a yield (see Process
  class>>spawnNewProcess), the UI process stays runnable all the time, as
  there is no other process with p = 40.
- Therefore, no process with p < 40 has a chance to be activated (only higher
  ones, which we find in the trace). This also explains why we see 100% CPU
  usage, but still the UI responds immediately.

So, while the UI process runs, the incr. GCs use up most of the time and keep
the UI process active. There is just not enough space requested to trigger a
full GC.

Now, why does moving the mouse make it run again? I have no idea... my guess is
that the behavior triggered by a mouse move event somehow forces a full GC. In
the trace we see that when the 107th full GC is done, there are much fewer
incr. GCs later on. Hence, it is much more likely that the UI process pauses
again.

How could we fix this?
-----------------------------------------
a) Simply increase the 20ms pause defined by MinCycleLapse (at least for
   production systems) or tweak the "pause logic".
   As a test I defined MinCycleLapse to be 40ms (a rough sketch follows at the
   end of this mail). I could not reproduce the problem anymore.

b) In production, suspend the UI process and only turn it on again when you
   need it (we do this via HTTP). This should also improve performance a bit.
   At best this is a workaround.

c) Tune the GC policies as they are far from optimal for today's systems (as
   John has suggested a couple of times). It seems, though, that this cannot
   guarantee to fix the problem, but it should make it less likely to
   happen(?). I also took GC statistics while running the process (only part
   of it and not when it was stuck):
   - http://www.netstyle.ch/dropbox/gcStats-1.txt.gz: standard configuration
     (was very slow)
   - http://www.netstyle.ch/dropbox/gcStats-2.txt.gz: runActive with
     setGCBiasToGrowGCLimit: 16*1026*1024

d) ?

I'd be interested in getting feedback on:
- whether the explanation sounds plausible
- whether the fix (e.g., a)) works for other people that have this problem
- what may be a good fix

Cheers,
Adrian

___________________
Adrian Lienhard
www.adrian-lienhard.ch
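P.S. The test change from a), roughly. This assumes MinCycleLapse is a class
variable of WorldState (as in the 3.7 image); poking it from a workspace is
enough for a quick test, a permanent change would rather go into the code:

	"Raise the minimum pause between two UI cycles from 20ms to 40ms."
	WorldState classPool at: #MinCycleLapse put: 40.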
Adrian Lienhard wrote:
> When seeing Andreas' delay issue, it seemed obvious that it's exactly this
> same problem. But the fix does not resolve it.

I didn't expect it to. If the Delay problem hits, VNC connections typically no
longer work (which makes the mystery all the more interesting).

> For tracing, I added the TraceMessageSends instrumentation from John and
> extended it with some more data. (BTW, I vote to integrate this by default
> as it proved, with the signaling facility, very useful!)

I think we need a "production Squeak" list or somesuch. We have a few more of
these things in our images and VMs that are immensely helpful for tracking
down issues like this, and that sounds just like another one that a production
server really needs to have.

> The following message trace was started when the image got stuck and was
> stopped again a few seconds after Squeak continued working normally again:
>
> http://www.netstyle.ch/dropbox/messageTraceFile-4.txt.gz (4MB)
>
> The extended format of the trace is: proc, ctxt, statIncrGCs, statTenures,
> statFullGCs, ioMSecs
>
> From this trace, the problem looks to be explainable as follows:
> - The time elapsed between two calls to WorldState>interCyclePause: is
>   longer than 20ms. The cause for this is not the UI process using up a lot
>   of time, but that there are 3 incr. GCs consuming 6 to 8ms each within
>   each UI cycle.

Is this a low-end server? 6-8ms is a *lot* of time being spent in IGC - we
typically see 6-8ms only after increasing the default GC parameters (# of
allocations between GCs, various thresholds) by roughly a factor of ten. Also,
the above means that effectively all or almost all of the time is spent in
those GCs, which seems doubtful unless you have a busy UI (we had this problem
once when we left a Transcript open, which is a terribly bad idea on a server
image).

> - Therefore, the UI process does not wait on the delay in
>   WorldState>interCyclePause:
> - Because the UI main loop only does a yield (see Process
>   class>>spawnNewProcess), the UI process stays runnable all the time, as
>   there is no other process with p = 40.
> - Therefore, no process with p < 40 has a chance to be activated (only
>   higher ones, which we find in the trace). This also explains why we see
>   100% CPU usage, but still the UI responds immediately.

This sounds like a reasonable explanation.

> Now, why does moving the mouse make it run again? I have no idea... my guess
> is that the behavior triggered by a mouse move event somehow forces a full
> GC. In the trace we see that when the 107th full GC is done, there are much
> fewer incr. GCs later on. Hence, it is much more likely that the UI process
> pauses again.

Tenuring might fix it, too. And it may just be that your wiggling the mouse
creates the bit of extra garbage to make the VM tenure.

> How could we fix this?
> -----------------------------------------
> a) Simply increase the 20ms pause defined by MinCycleLapse (at least for
>    production systems) or tweak the "pause logic". As a test I defined
>    MinCycleLapse to be 40ms. I could not reproduce the problem anymore.
>
> b) In production, suspend the UI process and only turn it on again when you
>    need it (we do this via HTTP). This should also improve performance a
>    bit. At best this is a workaround.
>
> c) Tune the GC policies as they are far from optimal for today's systems (as
>    John has suggested a couple of times). It seems, though, that this cannot
>    guarantee to fix the problem, but it should make it less likely to
>    happen(?).
d) Don't use processes that run below user scheduling priority. To be honest,
I'm not sure why you'd be running anything below UI priority on a server.

e) Make a preference "lowerPerformance" (or call it "headlessMode" if you wish
:^) and have the effect be that in interCyclePause: you *always* wait for a
certain amount of time (20-50ms). This will ensure that your UI process can't
possibly eat up all of your CPU. (A rough sketch follows at the end of this
mail.)

> I'd be interested in getting feedback on:
> - whether the explanation sounds plausible

It does. There is however the question of what could possibly generate enough
load on the garbage collector to run IGCs that take on average 7ms, and to run
three of them in a single UI cycle.

> - whether the fix (e.g., a)) works for other people that have this problem.
> - what may be a good fix

I'd probably go with option e) above since it ensures that there is always
time left for the lower priority processes (and you don't have to change other
code). Everything else seems risky since you can't say for sure if there isn't
anything that keeps the UI in a busy-loop.

Cheers,
  - Andreas
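P.S. To illustrate e), a rough sketch only: the real interCyclePause: does more
bookkeeping, and #lowerPerformance is just an assumed preference name that
would still have to be registered:

	interCyclePause: milliSecs
		"If the (assumed) lowerPerformance preference is set, always sleep
		for the full pause, no matter how long the last cycle took."
		(Preferences valueOfFlag: #lowerPerformance)
			ifTrue: [^ (Delay forMilliseconds: milliSecs) wait].
		"...otherwise fall through to the existing pause-only-if-the-cycle-finished-early logic..."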
In reply to this post by Adrian Lienhard
On Jul 26, 2007, at 10:06 AM, Adrian Lienhard wrote:

> c) Tune the GC policies as they are far from optimal for today's systems (as
> John has suggested a couple of times). It seems, though, that this cannot
> guarantee to fix the problem but it should make it less likely to happen(?).
> I also took GC statistics while running the process (only part of it and not
> when it was stuck):
> - http://www.netstyle.ch/dropbox/gcStats-1.txt.gz: standard configuration
>   (was very slow)
> - http://www.netstyle.ch/dropbox/gcStats-2.txt.gz: runActive with
>   setGCBiasToGrowGCLimit: 16*1026*1024

I'll note the standard VMMaker has a bug in it for setting the setGCBiasToGrow
logic. You must load the JMM-fixBiasToGrow.1.cs change set, which is found in
the Mac OS/vm/specialChangeSets/ directory in the SVN tree. Without the change
set this logic will not work properly.

Plus of course you have to do

	Smalltalk setGCBiasToGrow: 1.

in order to turn the feature on; setting just the setGCBiasToGrowGCLimit:
won't turn it on.

--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.   http://www.smalltalkconsulting.com
===========================================================================
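P.S. Put together, turning the feature on takes both sends (the limit here is
only an example value):

	Smalltalk setGCBiasToGrow: 1.                     "turn the bias-to-grow policy on"
	Smalltalk setGCBiasToGrowGCLimit: 16 * 1024 * 1024.  "full GC once growth passes this limit"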
On Jul 26, 2007, at 20:25 , John M McIntosh wrote:

> On Jul 26, 2007, at 10:06 AM, Adrian Lienhard wrote:
>
>> c) Tune the GC policies as they are far from optimal for today's systems
>> (as John has suggested a couple of times). It seems, though, that this
>> cannot guarantee to fix the problem but it should make it less likely to
>> happen(?).
>> I also took GC statistics while running the process (only part of it and
>> not when it was stuck):
>> - http://www.netstyle.ch/dropbox/gcStats-1.txt.gz: standard configuration
>>   (was very slow)
>> - http://www.netstyle.ch/dropbox/gcStats-2.txt.gz: runActive with
>>   setGCBiasToGrowGCLimit: 16*1026*1024
>
> I'll note the standard VMMaker has a bug in it for setting the
> setGCBiasToGrow logic.
> You must load the
>
> JMM-fixBiasToGrow.1.cs

I did not load this, but apparently the same version of the method is in the
changeset VMMaker-tpr.58, which I got from loading the most recent SM package.

> change set which is found in the Mac OS/vm/specialChangeSets/ directory in
> the SVN tree.
> Without the change set this logic will not work properly.
>
> Plus of course you have to do
> Smalltalk setGCBiasToGrow: 1.

yes, I did that.

Adrian

> in order to turn the feature on, setting just the setGCBiasToGrowGCLimit:
> won't turn it on.
>
> --
> ===========================================================================
> John M. McIntosh <[hidden email]>
> Corporate Smalltalk Consulting Ltd.   http://www.smalltalkconsulting.com
> ===========================================================================
In reply to this post by Adrian Lienhard
On Jul 26, 2007, at 10:06 AM, Adrian Lienhard wrote:

> c) Tune the GC policies as they are far from optimal for today's systems (as
> John has suggested a couple of times). It seems, though, that this cannot
> guarantee to fix the problem but it should make it less likely to happen(?).
> I also took GC statistics while running the process (only part of it and not
> when it was stuck):
> - http://www.netstyle.ch/dropbox/gcStats-1.txt.gz: standard configuration
>   (was very slow)

Yes, this is the classic: I think we are going to do a full GC, but let's
first see if an IGC will get back just one more byte. Oh, it did, fine, let's
run until the next memory allocation and repeat. I'm sure a full GC will
happen, someday, or maybe 10,000 IGC iterations later.

> - http://www.netstyle.ch/dropbox/gcStats-2.txt.gz: runActive with
>   setGCBiasToGrowGCLimit: 16*1026*1024

Statistical data is always good. Let's see, I have some observations and
random thoughts.

I note you have a memory allocation of 100MB and grow to 150MB. You should
probably look at the grow/shrink threshold values; they likely need to be
adjusted.

When it is running I see:
- 45,042 marked objects: the number of objects marked when doing the IGC,
  which is all live objects in new space plus references from old space to
  new space
- 4,001 swept objects: the number of objects swept (visited linearly after the
  GC mark)
- 4,000 is the allocation count
- 639 is the survivor count
- 4 milliseconds is the IGC time

Note that later this actually jumps to 86K objects; likely you have the issue
where a large array in old space has been marked as a root and you are
iterating over it looking for items in young space. The tenuring on each IGC
helps that. I believe Tim is up for fixing this problem, just slip him *lots*
of euros first.

In theory you could have a much larger allocation and tenure count; Andreas
has some data on larger values. Likely 40,000 and tenure at 35,000 would come
to mind? {Well, actually I might take this back, given your marking issue.}

However, the other thing is that 45,042 objects to mark seems a bit high. I've
used the theory for Sophie that if marking is 2x the number of allocations
then we should tenure. Note that your 45,042 marked objects are a result of
only 639 survivors, but I think you are doing this already? Still, you could
consider how many allocations versus marked objects is acceptable, since the
2x might not be correct: it follows a pattern of 312, 364, 552, 45K, 48K, 44K,
45K, 44K, 44K and then repeats, so you might be tenuring too early? Then
again, maybe not, since a full GC doesn't return any memory.

In the JMM memory policy >> calculateGoals I have:

	(statMarkCount) > (statAllocationCount * 2)
		ifTrue: [[Smalltalk forceTenure] on: Error do: [:ex | ]].
	"Tenure if we think too much root table marking is going on"

Also, I can't say whether marking 45K objects on your machine should take 4
milliseconds; I don't know if that is a reasonable value. Of course, if you
increase the allocation limit, then the marking count will go up in
relationship to that. Other people might have to think about whether marking
only 10,000 objects per ms seems reasonable. In checking some other examples I
have, it does however seem in line with another sample.

setGCBiasToGrowGCLimit: should be set to some value that won't interfere much
with the workload. If for example you always expect to grab 40MB of memory to
perform a task, setting it to 16MB is less valuable than setting it to 50MB.
Keep in mind of course that once you hit the 50MB limit you will need to do a
full GC, and what is the time impact of that on what you are doing?

--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.   http://www.smalltalkconsulting.com
===========================================================================
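P.S. The kind of tuning discussed above can be tried from a workspace along
these lines. A sketch only: the parameter indices are assumed to be the usual
interpreter VM ones (5 = allocations between incremental GCs, 6 = tenuring
threshold), and the values are just the ones floated above:

	Smalltalk vmParameterAt: 5 put: 40000.   "allocations between incremental GCs"
	Smalltalk vmParameterAt: 6 put: 35000.   "tenure once more objects than this survive an IGC"
	Smalltalk forceTenure.                   "or tenure by hand when root marking gets excessive"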
In reply to this post by Andreas.Raab
On Jul 26, 2007, at 19:28 , Andreas Raab wrote:

>> For tracing, I added the TraceMessageSends instrumentation from John and
>> extended it with some more data. (BTW, I vote to integrate this by default
>> as it proved, with the signaling facility, very useful!)
>
> I think we need a "production Squeak" list or somesuch. We have a few more
> of these things in our images and VMs that are immensely helpful for
> tracking down issues like this, and that sounds just like another one that a
> production server really needs to have.

yep, I agree

>> The following message trace was started when the image got stuck and was
>> stopped again a few seconds after Squeak continued working normally again:
>> http://www.netstyle.ch/dropbox/messageTraceFile-4.txt.gz (4MB)
>> The extended format of the trace is: proc, ctxt, statIncrGCs, statTenures,
>> statFullGCs, ioMSecs
>> From this trace, the problem looks to be explainable as follows:
>> - The time elapsed between two calls to WorldState>interCyclePause: is
>>   longer than 20ms. The cause for this is not the UI process using up a lot
>>   of time, but that there are 3 incr. GCs consuming 6 to 8ms each within
>>   each UI cycle.
>
> Is this a low-end server? 6-8ms is a *lot* of time being spent in IGC - we
> typically see 6-8ms only after increasing the default GC parameters (# of
> allocations between GCs, various thresholds) by roughly a factor of ten.

It's a quite old server with 2 x 1.4 GHz CPUs.

> Also, the above means that effectively all or almost all of the time is
> spent in those GCs, which seems doubtful

Let's see. This is an excerpt from the very beginning of the trace, showing
one UI cycle. No other process was active in this period. The numbers
indicate: proc, ctxt, statIncrGCs, statTenures, statFullGCs, ioMSecs

1145108440 1189307120 58988 5419 106 796524 WorldState>interCyclePause:
....
1145108440 1189307696 58988 5419 106 796524 EventSensor>nextEventFromQueue
1145108440 1189302556 58989 5419 106 796531 SharedQueue>isEmpty
....
1145108440 1189304704 58989 5419 106 796537 UserInputEvent>transformedBy:
1145108440 1189304084 58990 5419 106 796544 UserInputEvent>transformedBy:
....
1145108440 1189307680 58990 5419 106 796547 HandMorph>noticeMouseOver:event:
1145108440 1189303900 58991 5419 106 796555 MouseOverHandler>noticeMouseOver:event:
...
1145108440 1189307120 58991 5419 106 796556 WorldState>interCyclePause:

- The elapsed time between the start and end of the cycle is
  796556 - 796524 = 32 ms
- There have been 3 IGCs, consuming 7 + 7 + 8 = 22 ms

Now, the last conclusion may be debatable, as I just looked at the difference
between the clock values of the two messages between which statIncrGCs was
increased.

> unless you have a busy UI (we had this problem once when we left a
> Transcript open, which is a terribly bad idea on a server image).

no no, the UI is not busy, a workspace is open and maybe a code browser
(indeed, leaving the Transcript open is deadly).

[...]

>> How could we fix this?
>> -----------------------------------------
>> a) Simply increase the 20ms pause defined by MinCycleLapse (at least for
>>    production systems) or tweak the "pause logic". As a test I defined
>>    MinCycleLapse to be 40ms. I could not reproduce the problem anymore.
>>
>> b) In production, suspend the UI process and only turn it on again when you
>>    need it (we do this via HTTP). This should also improve performance a
>>    bit. At best this is a workaround.
>> c) Tune the GC policies as they are far from optimal for today's systems
>>    (as John has suggested a couple of times). It seems, though, that this
>>    cannot guarantee to fix the problem but it should make it less likely to
>>    happen(?).
>
> d) Don't use processes that run below user scheduling priority. To be
> honest, I'm not sure why you'd be running anything below UI priority on a
> server.

hm, Kom starts processes with priority 30, I guess to keep the UI responsive.
The long-running process I have here should be of lower priority than the
process handling the Seaside requests. Actually, I don't care how long it
takes, as long as it does not slow down or even block the rest.

> e) Make a preference "lowerPerformance" (or call it "headlessMode" if you
> wish :^) and have the effect be that in interCyclePause: you *always* wait
> for a certain amount of time (20-50ms). This will ensure that your UI
> process can't possibly eat up all of your CPU.

yes, sounds like a reasonable fix (apart from the fact that we get yet another
preference lookup in the UI loop...)

Cheers,
Adrian

>> I'd be interested in getting feedback on:
>> - whether the explanation sounds plausible
>
> It does. There is however a question what could possibly generate enough
> load on the garbage collector to run IGCs that take on average 7ms and run
> three of them in a single UI cycle.
>
>> - whether the fix (e.g., a)) works for other people that have this problem.
>> - what may be a good fix
>
> I'd probably go with option e) above since it ensures that there is always
> time left for the lower priority processes (and you don't have to change
> other code). Everything else seems risky since you can't say for sure if
> there isn't anything that keeps the UI in a busy-loop.
>
> Cheers,
>   - Andreas
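P.S. For completeness, the background job mentioned above is started below the
priority Kom uses (30), roughly along these lines (#generateReport is only a
stand-in name for the real PDF job):

	[self generateReport] forkAt: 25.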
On Jul 26, 2007, at 2:40 PM, Adrian Lienhard wrote:

> Let's see. This is an excerpt from the very beginning of the trace, showing
> one UI cycle. No other process was active in this period. The numbers
> indicate: proc, ctxt, statIncrGCs, statTenures, statFullGCs, ioMSecs
>
> 1145108440 1189307120 58988 5419 106 796524 WorldState>interCyclePause:
> ....
> 1145108440 1189307696 58988 5419 106 796524 EventSensor>nextEventFromQueue
> 1145108440 1189302556 58989 5419 106 796531 SharedQueue>isEmpty
> ....
> 1145108440 1189304704 58989 5419 106 796537 UserInputEvent>transformedBy:
> 1145108440 1189304084 58990 5419 106 796544 UserInputEvent>transformedBy:
> ....
> 1145108440 1189307680 58990 5419 106 796547 HandMorph>noticeMouseOver:event:
> 1145108440 1189303900 58991 5419 106 796555 MouseOverHandler>noticeMouseOver:event:
> ...
> 1145108440 1189307120 58991 5419 106 796556 WorldState>interCyclePause:
>
> - The elapsed time between the start and end of the cycle is
>   796556 - 796524 = 32 ms
> - There have been 3 IGCs, consuming 7 + 7 + 8 = 22 ms

Keep in mind of course that logging message sends *is* expensive and will make
things *lots* slower. I'd have to look, but the IGC ms clock times are
recorded in the VM before/after the GC work, so any logging would not
interfere with that data.

--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.   http://www.smalltalkconsulting.com
===========================================================================
I created a new Mantis report with a fix.
See: http://bugs.squeak.org/view.php?id=6581

Adrian

On Jul 27, 2007, at 00:05 , John M McIntosh wrote:

> On Jul 26, 2007, at 2:40 PM, Adrian Lienhard wrote:
>
>> Let's see. This is an excerpt from the very beginning of the trace, showing
>> one UI cycle. No other process was active in this period. The numbers
>> indicate: proc, ctxt, statIncrGCs, statTenures, statFullGCs, ioMSecs
>>
>> 1145108440 1189307120 58988 5419 106 796524 WorldState>interCyclePause:
>> ....
>> 1145108440 1189307696 58988 5419 106 796524 EventSensor>nextEventFromQueue
>> 1145108440 1189302556 58989 5419 106 796531 SharedQueue>isEmpty
>> ....
>> 1145108440 1189304704 58989 5419 106 796537 UserInputEvent>transformedBy:
>> 1145108440 1189304084 58990 5419 106 796544 UserInputEvent>transformedBy:
>> ....
>> 1145108440 1189307680 58990 5419 106 796547 HandMorph>noticeMouseOver:event:
>> 1145108440 1189303900 58991 5419 106 796555 MouseOverHandler>noticeMouseOver:event:
>> ...
>> 1145108440 1189307120 58991 5419 106 796556 WorldState>interCyclePause:
>>
>> - The elapsed time between the start and end of the cycle is
>>   796556 - 796524 = 32 ms
>> - There have been 3 IGCs, consuming 7 + 7 + 8 = 22 ms
>
> Keep in mind of course that logging message sends *is* expensive and will
> make things *lots* slower.
> I'd have to look, but the IGC ms clock times are recorded in the VM
> before/after the GC work, so any logging would not interfere with that data.
>
> --
> ===========================================================================
> John M. McIntosh <[hidden email]>
> Corporate Smalltalk Consulting Ltd.   http://www.smalltalkconsulting.com
> ===========================================================================