Fwd: [Pharo-fuel] Possible collections problem


Fwd: [Pharo-fuel] Possible collections problem

Max Leske
 
Hi

I'm forwarding this because I'd like to rule out a VM problem. Short summary:
I fork a Squeak image and then serialize objects with Fuel. In roughly 40% of cases the fork suddenly locks up and consumes 100% CPU. The trace I most often see with gdb in that case is the one with
"#0  0x08060453 in updatePointersInRootObjectsFromto ()" at the top.

The object being processed when the lockup occurs is always of class TimeStamp, although that doesn't necessarily mean anything. Maybe it's more about the number of objects.

I'm working on 32-bit Debian and I can reproduce the problem with SqueakVM 4.4.7-2364 and 4.0.3-2202 (the newer ones won't run because of glibc). I haven't tried Cog yet.

I also just checked that the problem occurs even if I don't serialize any timestamps (nor Process, Delay, Monitor, Semaphore; just to be sure).
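
For reference, the serialization step in the child boils down to roughly the following (a sketch only: the graph and the file name are placeholders, and the fork itself is done with one of the OSProcess #forkSqueak methods):

"Sketch of what the forked child does. graphToSerialize and the file name
are placeholders; FLSerializer class>>serialize:toFileNamed: is Fuel's
convenience entry point."
FLSerializer
        serialize: graphToSerialize
        toFileNamed: '/tmp/graph.fuel'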

So if anyone has a clue as to what might be going on I'd really appreciate the help.

Cheers,
Max

Begin forwarded message:

From: Mariano Martinez Peck <[hidden email]>
Subject: Re: [Pharo-fuel] Possible collections problem
Date: 15 May 2013 16:53:10 CEST
To: The Fuel Project <[hidden email]>
Reply-To: The Fuel Project <[hidden email]>

I cannot see anything in particular; it's mostly GC activity.
So I wouldn't spend more time trying to debug it. I would first try the non-large collections. Then I would try with the latest Cog and the latest StackVM.



On Wed, May 15, 2013 at 11:47 AM, Max Leske <[hidden email]> wrote:
I've had several forks hanging around just now. Apart from one, all of them were locked. I attached gdb and generated the C stack for each of them. Not sure if there's anything really interesting in there, although clearly a lot of time is spent in GC and in object creation. That doesn't have to mean anything though.

I haven't yet tried your suggestion Mariano.

Cheers,
Max


[Thread debugging using libthread_db enabled]
0x08060453 in updatePointersInRootObjectsFromto ()

#0  0x08060453 in updatePointersInRootObjectsFromto ()
#1  0x08060a77 in mapPointersInObjectsFromto ()
#2  0x08060bb0 in incCompBody ()
#3  0x08065fa7 in incrementalGC ()
#4  0x080661a4 in sufficientSpaceAfterGC ()
#5  0x08069420 in primitiveNew ()
#6  0x0806de15 in interpret ()
#7  0x08073dfe in main ()



[Thread debugging using libthread_db enabled]
0x08060453 in updatePointersInRootObjectsFromto ()

#0  0x08060453 in updatePointersInRootObjectsFromto ()
#1  0x08060a77 in mapPointersInObjectsFromto ()
#2  0x08060bb0 in incCompBody ()
#3  0x08065fa7 in incrementalGC ()
#4  0x080661a4 in sufficientSpaceAfterGC ()
#5  0x08069420 in primitiveNew ()
#6  0x0806de15 in interpret ()
#7  0x08073dfe in main ()




[Thread debugging using libthread_db enabled]
0x08060453 in updatePointersInRootObjectsFromto ()

#0  0x08060453 in updatePointersInRootObjectsFromto ()
#1  0x08060a77 in mapPointersInObjectsFromto ()
#2  0x08060bb0 in incCompBody ()
#3  0x08065fa7 in incrementalGC ()
#4  0x080661a4 in sufficientSpaceAfterGC ()
#5  0x0806fed2 in clone ()
#6  0x08070095 in primitiveClone ()
#7  0x0806de15 in interpret ()
#8  0x08073dfe in main ()


[Thread debugging using libthread_db enabled]
0x08060453 in updatePointersInRootObjectsFromto ()

#0  0x08060453 in updatePointersInRootObjectsFromto ()
#1  0x08060a77 in mapPointersInObjectsFromto ()
#2  0x08060bb0 in incCompBody ()
#3  0x08065fa7 in incrementalGC ()
#4  0x080661a4 in sufficientSpaceAfterGC ()
#5  0x08069270 in primitiveNewWithArg ()
#6  0x0806de15 in interpret ()
#7  0x08073dfe in main ()


[Thread debugging using libthread_db enabled]
0xb76f0f68 in select () from /lib/libc.so.6

#0  0xb76f0f68 in select () from /lib/libc.so.6
#1  0x08070880 in aioPoll ()
#2  0xb762419e in ?? () from /usr/lib/squeak/4.0.3-2202//so.vm-display-X11
#3  0x08073595 in ioRelinquishProcessorForMicroseconds ()
#4  0x08061f24 in primitiveRelinquishProcessor ()
#5  0x0806de15 in interpret ()
#6  0x08073dfe in main ()


[Thread debugging using libthread_db enabled]
0x08060453 in updatePointersInRootObjectsFromto ()

#0  0x08060453 in updatePointersInRootObjectsFromto ()
#1  0x08060a77 in mapPointersInObjectsFromto ()
#2  0x08060bb0 in incCompBody ()
#3  0x08065fa7 in incrementalGC ()
#4  0x080661a4 in sufficientSpaceAfterGC ()
#5  0x08069420 in primitiveNew ()
#6  0x0806de15 in interpret ()
#7  0x08073dfe in main ()



[Thread debugging using libthread_db enabled]
0x08064e7e in markAndTrace ()

#0  0x08064e7e in markAndTrace ()
#1  0x0806593a in markPhase ()
#2  0x08065f60 in incrementalGC ()
#3  0x080661a4 in sufficientSpaceAfterGC ()
#4  0x0806fed2 in clone ()
#5  0x08070095 in primitiveClone ()
#6  0x0806de15 in interpret ()
#7  0x08073dfe in main ()




On 15.05.2013, at 13:59, Mariano Martinez Peck <[hidden email]> wrote:

Ok. So the first thing you should try is to replace the uses of LargeIdentityDictionary with IdentityDictionary, and LargeIdentitySet with IdentitySet.
If the problem disappears, then yes, there is something wrong with the large collections. If there is a problem with them, try updating the VM, since they use a particular primitive.
Let us know!


On Tue, May 14, 2013 at 9:29 AM, Max Leske <[hidden email]> wrote:

On 14.05.2013, at 13:52, Mariano Martinez Peck <[hidden email]> wrote:

Hi Max. Question, are you able to reproduce the problem?

Yes, but not "on purpose". The situation usually happens once or twice a day and then with consistent log entries. That's why I want to use gdb the next time it happens.



On Tue, Apr 30, 2013 at 3:57 PM, Max Leske <[hidden email]> wrote:
Hi guys

I have a problem serializing a graph. Sometimes (not always) the image will consume close to 100% CPU and stop responding. I was able to pin the problem down a bit:
- it always fails in FLIteratingCluster>>registerIndexesOn: when called from FLFixedObjectCluster with class TimeStamp (this might not actually be relevant but it's consistent)
- the problem *might* be in FLLargeIdentityDictionary>>at:put: (or further up the stack)

I've done extensive logging to a file, but even with flushing after every write the results are not consistent. Sometimes the image locks up after leaving #at:put:, sometimes somewhere in the middle of it or in #registerIndexesOn: (but remember: the logging might not be precise).

It's probably not the size of the objects in the cluster (the graph is big but not overly large), since there are other clusters with more objects.

What I did find is that the #grow operation for HashedCollections can be *very* slow, up to 20 seconds or more; at other times the snapshot runs through in no time.
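
For example, a crude workspace timing along these lines (the element count is arbitrary; run it once with IdentityDictionary and once with Fuel's FLLargeIdentityDictionary, assuming the latter can be instantiated with #new in a workspace) makes the growing cost visible:

"Fill an identity dictionary with many fresh objects, forcing repeated
#grow operations, and time it. timeToRun answers milliseconds here."
| dict time |
dict := IdentityDictionary new.
time := [ 1 to: 200000 do: [ :i | dict at: Object new put: i ] ] timeToRun.
Transcript show: 'fill time: ', time printString; cr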

So here's my theory: there might be a VM problem with HashedCollections.
Now, the VM is a rather old one and I haven't had a chance to test this with a newer one (but I'll probably have to). The version is Squeak 4.0.3-2202 running on 32-bit Debian Squeeze.

I'll try some more but if anyone has any ideas I'd be very happy :)

Cheers,
Max


Re: Fwd: [Pharo-fuel] Possible collections problem

Frank Shearar-3

On 16 May 2013 08:29, Max Leske <[hidden email]> wrote:

>
> Hi
>
> I'm forwarding this because I'd like to rule out a VM problem. Short summary:
> I fork a Squeak image and then serialize objects with Fuel. In roughly 40% of cases the fork suddenly locks up and consumes 100% CPU. The trace I most often see with gdb in that case is the one with
> "#0  0x08060453 in updatePointersInRootObjectsFromto ()" at the top.
>
> The object being processed when the lockup occurs is always of class TimeStamp, although that doesn't necessarily mean anything. Maybe it's more about the number of objects.
>
> I'm working on 32-bit Debian and I can reproduce the problem with SqueakVM 4.4.7-2364 and 4.0.3-2202 (the newer ones won't run because of glibc). I haven't tried Cog yet.

If it's convenient, running `bundle install; rake build` in a checkout
of https://github.com/frankshearar/squeak-ci will build you a
very-latest Interpreter VM (4.10.2.2614; executable at
target/Squeak-4.10.2.2614-src-32/bld/squeak.sh) that will link against
whatever glibc you have. (It works happily on my Ubuntu Lucid machine,
which has glibc 2.11.) It works on OS X and Linux, but not on FreeBSD.

frank


Re: Fwd: [Pharo-fuel] Possible collections problem

David T. Lewis
In reply to this post by Max Leske
 
First, to check that I understand the problem:

- You are forking an image using OSProcess, and the failures are occurring in
the child VM/image that you created using one of the #forkSqueak methods.

- You are serializing some objects in the child image so that they can be
written to a serialized output file. You are doing this in the background
child VM/image so that your main image can continue doing whatever it was
doing without being impacted by the serialization process.

- If you do this 10 times, it will fail about 4 times out of the 10 on average.

From the stack traces, it looks like your VM is hanging up in a normal
garbage collection sweep that is happening as a side effect of allocating
a new object. The first thing that comes to mind is that your VM is hanging
up while trying to allocate more memory than the OS can provide to it, and
it appears to be stuck while waiting for memory to be provided by the OS. I'm
suggesting this because I've seen similar symptoms when I intentionally try
to allocate very large object memories, so perhaps there is something
in your runtime environment (OS settings) that limits available memory.
This is a guess only, but it might give you some ideas of where to look.

This might be a VM problem, or it might be a symptom of some other problem
that just puts the VM and OS into a condition where they are unexpectedly
trying to use a very large amount of memory.

If possible it would be good to try to rule out the OSProcess forkImage as
a possible contributing factor. If you were to do the serialization 10 times
in the main image, rather than in the background child image, would it
still fail about 4 times out of 10? This could be a factor in memory use.
Although forkSqueak is very memory efficient, if the child and parent
images do a lot of different things (such as try to serialize a bunch of
stuff in the child image), then the OS will be forced to map in enough
real memory to satisfy both of the images.
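
To make that concrete, something along these lines run in the main image, with no fork, would be a fair comparison (the Fuel call and graphToSerialize are placeholders for whatever you actually serialize):

"Exercise the serializer repeatedly without forking, to see whether the
hang still shows up roughly 4 times out of 10."
1 to: 10 do: [ :i |
        FLSerializer
                serialize: graphToSerialize
                toFileNamed: '/tmp/graph-', i printString, '.fuel' ]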

Another thing to check, if possible, is to see if the OS is doing a
lot of disk swapping at the time of the failure. If you see all of the
system memory in use, along with a lot of disk activity, then you are
very likely looking at a case of the VM trying to allocate more memory
than the system can reasonably provide.
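
A related check from inside the image itself (in both the parent and the forked child) is to log how much free space the VM reports around the serialization; if that number collapses just before the hang, it points at runaway allocation. A minimal sketch:

"Smalltalk garbageCollect answers the number of bytes available after a full GC."
Transcript show: 'free before: ', Smalltalk garbageCollect printString; cr.
"... run the serialization here ..."
Transcript show: 'free after: ', Smalltalk garbageCollect printString; cr.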

FWIW, here is what your VM is doing at the time of the "hang up". This
involves scanning object memory through memEnd, which might be a very
large value for one reason or another:

ObjectMemory>>updatePointersInRootObjectsFrom: memStart to: memEnd
    "update pointers in root objects"
    | oop |
    <inline: false>
    1 to: rootTableCount do: [:i |
            oop := rootTable at: i.
            ((self oop: oop isLessThan: memStart)
                    or: [self oop: oop isGreaterThanOrEqualTo: memEnd])
                ifTrue: ["Note: must not remap the fields of any object twice!"
                    "remap this oop only if not in the memory range
                    covered below"
                    self remapFieldsAndClassOf: oop]]
                       
HTH,

Dave



Re: [Pharo-fuel] [Vm-dev] Fwd: Possible collections problem

Max Leske
In reply to this post by Frank Shearar-3

I nearly went insane because of Ruby. Debian might be stable but you never get the new stuff… The result was several hours of trying to run "bundle install; rake build" until I finally just tried running "build_interpreter_vm.sh" and HURRAY! I had a working 4.12.4-2729 version.

So thanks Frank :)

The problem persists with this version too. During debugging I noticed that the image call stack was inside the weak finalization process nearly all the time (finalization -> GC, finalization -> GC, etc.). So what I'll try now is to kill the weak finalization process in the forked image. A preliminary test was successful and I'll now see how that works in production (it's an internal application, so not that much of a problem).

BTW: The graph I serialize does not contain any weak references, I double checked.

Cheers,
Max




Re: [Pharo-fuel] [Vm-dev] Fwd: Possible collections problem

Frank Shearar-3

On 16 May 2013 18:55, Max Leske <[hidden email]> wrote:
>
> I nearly went insane because of Ruby. Debian might be stable but you never get the new stuff… The result was several hours of trying to run "bundle install; rake build" until I finally just tried running "build_interpreter_vm.sh" and HURRAY! I had a working 4.12.4-2729 version.
>
> So thanks Frank :)

RVM is your friend. It's the helping hand while all around you is
gibbering madness. Although I'm told that RVM is severe overkill and
that things like rbenv are much lighter/better - they don't patch
chdir, for instance. Ah well. But never, ever rely on what your distro
gives you.

I'm glad I left build_interpreter_vm.sh in. It's actually legacy cruft
that I'd meant to ditch, but maybe I should keep it around for those
folks who don't want or need to install a Ruby just to get an up-to-date
interpreter VM.

frank


Re: [Pharo-fuel] [Vm-dev] Fwd: Possible collections problem

David T. Lewis
In reply to this post by Max Leske
 
On Thu, May 16, 2013 at 07:55:39PM +0200, Max Leske wrote:
>
> I nearly went insane because of Ruby. Debian might be stable but you never get the new stuff… The result was several hours of trying to run "bundle install; rake build" until I finally just tried running "build_interpreter_vm.sh" and HURRAY! I had a working 4.12.4-2729 version.
>

Instructions for building a standard Interpreter VM for Unix, including generation
of the VM sources, can be found at http://squeakvm.org/unix/devel.html. It's worth
reading this just to get an overview of the process (regardless of whether you use
a Ruby script or any other procedure).

If you prefer just compiling from a known-good tarball, the latest official release
(which should be fine for your purposes) is on squeakvm.org at
http://squeakvm.org/unix/release/Squeak-4.10.2.2614-src.tar.gz.

Dave


Re: [Pharo-fuel] [Vm-dev] Fwd: Possible collections problem

Frank Shearar-3

On 16 May 2013 20:01, David T. Lewis <[hidden email]> wrote:

> Instructions for building a standard Interpreter VM for Unix, including generation
> of the VM sources, can be found at http://squeakvm.org/unix/devel.html. It's worth
> reading this just to get an overview of the process (regardless of whether you use
> a Ruby script or any other procedure).
>
> If you prefer just compiling from a known-good tarball, the latest official release
> (which should be fine for your purposes) is on squeakvm.org at
> http://squeakvm.org/unix/release/Squeak-4.10.2.2614-src.tar.gz.

That's exactly what build_interpreter_vm.sh and rake build do - pull
down that particular tarball, and run the make in platforms/unix. The
advantage of the rake build is that, since Ruby is a far superior
scripting language to shell, it'll do the right thing in a bunch of
different places, depending on your OS.

Ah, actually, build_interpreter_vm.sh builds _the latest_ VM (as in
the output of build.squeak.org's InterpreterVM job). rake build builds
_the latest released VM_.

frank


Re: [Pharo-fuel] [Vm-dev] Fwd: Possible collections problem

David T. Lewis
 
On Thu, May 16, 2013 at 08:06:39PM +0100, Frank Shearar wrote:

> That's exactly what build_interpreter_vm.sh and rake build do - pull
> down that particular tarball, and run the make in platforms/unix. The
> advantage of the rake build is that, since Ruby is a far superior
> scripting language to shell, it'll do the right thing in a bunch of
> different places, depending on your OS.
>
> Ah, actually, build_interpreter_vm.sh builds _the latest_ VM (as in
> the output of build.squeak.org's InterpreterVM job). rake build builds
> _the latest released VM_.
>

Yes, I understand. But I also want people to know how to find the information
on squeakvm.org. After all, it only takes a couple of minutes to read it, and
spending a few minutes to learn the basics can save a lot of time and frustration
if something does not work as expected.

Dave


Re: [Pharo-fuel] [Vm-dev] Fwd: Possible collections problem

Max Leske
In reply to this post by David T. Lewis

Thanks for the suggestions Dave. I wanted to get some more information before replying; took me longer than I thought though.

I was able to pin the problem down to somewhere around a couple of thousand objects. I'll sketch my findings roughly: I had two collections, both stored in instance variables. I could serialize both of them separately but not together. I could also serialize one of them plus a subset of the other. The actual elements didn't seem to matter as much as the number of elements serialized. This worked, for example:

arrayToSerialize := {
        coll1 first: 180. "coll1 has 200 elements"
        coll2 }

As did this:

arrayToSerialize := {
        coll1 last: 20.
        coll2 }

While this didn't:

arrayToSerialize := {
        coll1 first: 180. "coll1 has 200 elements"
        coll1 last: 20.
        coll2 }

But then again, this did:

arrayToSerialize := { coll2 }

As did this:

arrayToSerialize := { coll1 }

Excluding elements that referenced each other in coll1 and coll2 didn't make a difference either.
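
For reference, here's a minimal sketch of the serialization call these experiments boil down to (the file name is just a placeholder; as far as I know this is Fuel's plain convenience API):

	FLSerializer serialize: arrayToSerialize toFileNamed: 'graph.fuel'.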


I have no clue at all what the problem might be here. My only (far-fetched) guess is that maybe one of the objects was somehow corrupt (although that should have triggered exceptions, or at least made serialization impossible in all cases, especially when not forking).

Then, during testing, I messed up my snapshot and had to load a new one and tadaaa: the problem's gone… Just weird… I just hope this never happens again. If it does, you'll hear from me :)



What I wrote below now seems obsolete. I'm leaving it in for reference.

Cheers and thanks a lot for your help!
Max


On 16.05.2013, at 14:33, David T. Lewis <[hidden email]> wrote:

> First to check to see if I understand the problem:
>
> - You are forking an image using OSProcess, and the failures are occurring in
> the child VM/image that you created using one of the #forkSqueak methods.

Yes, precisely. I'm forking headless by the way.

>
> - You are serializing some objects in the child image so that they can be
> written to a serialized output file. You are doing this in the background
> child VM/image so that your main image can continue doing whatever it was
> doing without being impacted by the serializing process.

Yes.

>
> - If you do this 10 times, it will fail about 4 times out of the 10 on average.

Yes, very roughly.

>
> From the stack traces, it looks like your VM is hanging up in a normal
> garbage collection sweep that is happening as a side effect of allocating
> a new object. The first thing that comes to mind is that your VM is hanging
> up while trying to allocate more memory than the OS can provide to it, and
> it appears to be stuck while waiting for memory to be provided by the OS. I'm
> suggesting this because I've seen similar symptoms when I intentionally try
> to allocate very large object memories, so perhaps there may be something
> in your runtime environment (OS settings) that limits available memory.
> This is a guess only, but it might give you some ideas of where to look.

I used vmstat to look at possible swap (paging) problems. There wasn't a single swap operation and there's ample memory.

>
> This might be a VM problem, or it might be a symptom of some other problem
> that just puts the VM and OS into a condition where they are unexpectedly
> trying to use a very large amount of memory.

That's what I hope, but I'm pretty clueless.

>
> If possible it would be good to try to rule out the OSProcess forkImage as
> a possible contributing factor. If you were to do the serialization 10 times
> in the main image, rather than in the background child image, would it
> still fail about 4 times out of 10? This could be a factor in memory use.

Serializing without forking always succeeds.
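
To be explicit, the no-fork control looks roughly like this (file name again a placeholder; overwriting the same file each run is fine for this test):

	10 timesRepeat: [
		FLSerializer serialize: arrayToSerialize toFileNamed: 'control.fuel' ].

It has never hung so far; only the forked child does.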

> Although forkSqueak is very memory efficient, if the child and parent
> images do a lot of different things (such as try to serialize a bunch of
> stuff in the child image), then the OS will be forced to map in enough
> real memory to satisfy both of the images.

As pointed out above, there wasn't a single swap. htop also shows swap usage at roughly 20% and there's always memory available.

>
> Another thing to check, if possible, is to see if the OS is doing a
> lot of disk swapping at the time of the failure. If you see all of the
> system memory in use, along with a lot of disk activity, then you are
> very likely looking at a case of the VM trying to allocate more memory
> than the system can reasonably provide.

(Again, see above)
I/O is also pretty low. Nothing special there.

>
> FWIW, here is what your VM is doing at the time of the "hang up". This
> involves scanning object memory through memEnd, which might be a very
> large value for one reason or another:
>
> ObjectMemory>>updatePointersInRootObjectsFrom: memStart to: memEnd
>    "update pointers in root objects"
>    | oop |
>    <inline: false>
>    1 to: rootTableCount do: [:i |
>            oop := rootTable at: i.
>            ((self oop: oop isLessThan: memStart)
>                    or: [self oop: oop isGreaterThanOrEqualTo: memEnd])
>                ifTrue: ["Note: must not remap the fields of any object twice!"
>                    "remap this oop only if not in the memory range
>                    covered below"
>                    self remapFieldsAndClassOf: oop]]

I was able to look at two forked processes, one that didn't terminate (call it 1) and one that did (call it 2). What I saw was that 1 was doing GCs like crazy right from the start (like the stacks I posted) while 2 showed a lot of other calls too (e.g. interpret()) and spent (in my opinion) much less time in GC.

I tried two additional things:
1. After forking I kill the weak finalization process. This seems to have improved the situation but isn't a fix.
2. I modified the GC options as follows:

        "Limit incremental GC activity to run every 100k allocations"
        SmalltalkImage current vmParameterAt: 5 put: 100000. "allocations between GCs (default: 4000)"

        "Limit tenuring threshold to only tenure w/ > 25k survivors"
        SmalltalkImage current vmParameterAt: 6 put: 25000. "tenuring threshold (default: 2000)"

        "Tell the VM that we'd rather grow than spinning in tight GC loops"
        SmalltalkImage current gcBiasToGrow: true. "default: false"

        "Tell the VM to do a fullGC for good measure if the above growth exceeded x MB"
        SmalltalkImage current gcBiasToGrowLimit: 8*1024*1024. "default: 0"

This did not have any effect (as far as I can tell) on the problem.
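
In case it matters: the read variant of the same call can be used to double check that the child really picks these settings up after the fork (just a sanity check, it doesn't change the outcome):

	SmalltalkImage current vmParameterAt: 5. "should answer 100000"
	SmalltalkImage current vmParameterAt: 6. "should answer 25000"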

>
> HTH,
>
> Dave
>
> On Thu, May 16, 2013 at 09:29:42AM +0200, Max Leske wrote:
>>
>> Hi
>>
>> I'm forwarding this because I'd like to rule out a VM problem. Short summary:
>> I fork a squeak image and then serialize objects with Fuel. In roughly 40% of the cases the fork suddenly locks and consumes 100% CPU. The trace I most often see with gdb in that case is the one with
>> "#0  0x08060453 in updatePointersInRootObjectsFromto ()" at the top.
>>
>> The object processed when the lockup occufrs is always of class Timestamp, although that doesn't necessarily mean anything. Maybe it's more about the number of objects.
>>
>> I'm working on Debian, 32-bit and I can reproduce the problem with SqueakVM 4.4.7-2364 and 4.0.3-2202 (the newer ones wont run because of glibc). I haven't tried Cog yet.
>>
>> I also just checked that the problem occurs even if I don't serialize any timestamps (nor Process, Delay, Monitor, Semaphore; just to be sure).
>>
>> So if anyone has a clue as to what might be going on I'd really appreciate the help.
>>
>> Cheers,
>> Max
>>
>> Begin forwarded message:
>>
>>> From: Mariano Martinez Peck <[hidden email]>
>>> Subject: Re: [Pharo-fuel] Possible collections problem
>>> Date: 15. Mai 2013 16:53:10 MESZ
>>> To: The Fuel Project <[hidden email]>
>>> Reply-To: The Fuel Project <[hidden email]>
>>>
>>> I cannot see anything in particular. Too many GC stuff.
>>> So I wouldn't spend more time trying to debug. I would try the none large collections. Then I would try with latest Cog and latest StackVM.
>>>
>>>
>>>
>>> On Wed, May 15, 2013 at 11:47 AM, Max Leske <[hidden email]> wrote:
>>> I've had several forks hanging around just now. Apart from one all of these were locked. I attached gdb and generated the c stack for all of them. Not sure if there's anything really interesting in there although clearly a lot of time is spent in GC and with creation of objects. That doesn't have to mean anything though.
>>>
>>> I haven't yet tried your suggestion Mariano.
>>>
>>> Cheers,
>>> Max
>>>
>>>
>>> [Thread debugging using libthread_db enabled]
>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>
>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>> #2  0x08060bb0 in incCompBody ()
>>> #3  0x08065fa7 in incrementalGC ()
>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>> #5  0x08069420 in primitiveNew ()
>>> #6  0x0806de15 in interpret ()
>>> #7  0x08073dfe in main ()
>>>
>>>
>>>
>>> [Thread debugging using libthread_db enabled]
>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>
>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>> #2  0x08060bb0 in incCompBody ()
>>> #3  0x08065fa7 in incrementalGC ()
>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>> #5  0x08069420 in primitiveNew ()
>>> #6  0x0806de15 in interpret ()
>>> #7  0x08073dfe in main ()
>>>
>>>
>>>
>>>
>>> [Thread debugging using libthread_db enabled]
>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>
>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>> #2  0x08060bb0 in incCompBody ()
>>> #3  0x08065fa7 in incrementalGC ()
>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>> #5  0x0806fed2 in clone ()
>>> #6  0x08070095 in primitiveClone ()
>>> #7  0x0806de15 in interpret ()
>>> #8  0x08073dfe in main ()
>>>
>>>
>>> [Thread debugging using libthread_db enabled]
>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>
>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>> #2  0x08060bb0 in incCompBody ()
>>> #3  0x08065fa7 in incrementalGC ()
>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>> #5  0x08069270 in primitiveNewWithArg ()
>>> #6  0x0806de15 in interpret ()
>>> #7  0x08073dfe in main ()
>>>
>>>
>>> [Thread debugging using libthread_db enabled]
>>> 0xb76f0f68 in select () from /lib/libc.so.6
>>>
>>> #0  0xb76f0f68 in select () from /lib/libc.so.6
>>> #1  0x08070880 in aioPoll ()
>>> #2  0xb762419e in ?? () from /usr/lib/squeak/4.0.3-2202//so.vm-display-X11
>>> #3  0x08073595 in ioRelinquishProcessorForMicroseconds ()
>>> #4  0x08061f24 in primitiveRelinquishProcessor ()
>>> #5  0x0806de15 in interpret ()
>>> #6  0x08073dfe in main ()
>>>
>>>
>>> [Thread debugging using libthread_db enabled]
>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>
>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>> #2  0x08060bb0 in incCompBody ()
>>> #3  0x08065fa7 in incrementalGC ()
>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>> #5  0x08069420 in primitiveNew ()
>>> #6  0x0806de15 in interpret ()
>>> #7  0x08073dfe in main ()
>>>
>>>
>>>
>>> [Thread debugging using libthread_db enabled]
>>> 0x08064e7e in markAndTrace ()
>>>
>>> #0  0x08064e7e in markAndTrace ()
>>> #1  0x0806593a in markPhase ()
>>> #2  0x08065f60 in incrementalGC ()
>>> #3  0x080661a4 in sufficientSpaceAfterGC ()
>>> #4  0x0806fed2 in clone ()
>>> #5  0x08070095 in primitiveClone ()
>>> #6  0x0806de15 in interpret ()
>>> #7  0x08073dfe in main ()
>>>
>>>
>>>
>>>
>>> On 15.05.2013, at 13:59, Mariano Martinez Peck <[hidden email]> wrote:
>>>
>>>> Ok. So, first thing you should try, is to replace the uses of LargeIdentityDictionary with IdentityDictionary. And LargeIdentitySet with IdentitySet.
>>>> If the problem disappears, then yes, there is something wrong with LargeCollections. If there is a problem with them, try updating VM, since they use a particular primitive.
>>>> Let us know!
>>>>
>>>>
>>>> On Tue, May 14, 2013 at 9:29 AM, Max Leske <[hidden email]> wrote:
>>>>
>>>> On 14.05.2013, at 13:52, Mariano Martinez Peck <[hidden email]> wrote:
>>>>
>>>>> Hi Max. Question, are you able to reproduce the problem?
>>>>
>>>> Yes, but not "on purpose". The situation usually happens once or twice a day and then with consistent log entries. That's why I want to use gdb the next time it happens.
>>>>
>>>>>
>>>>>
>>>>> On Tue, Apr 30, 2013 at 3:57 PM, Max Leske <[hidden email]> wrote:
>>>>> Hi guys
>>>>>
>>>>> I have a problem serializing a graph. Sometimes (not always) the image will consume +/- 100% CPU and stop responding. I was able to pin the problem down a bit:
>>>>> - fails always in FLIteratingCluster>>registerIndexesOn: when called from FLFixedObjectCluster with class TimeStamp (this might not actually be relevan but it's consistent)
>>>>> - the problem *might* be in FLLargeIdentityDictionary>>at:put: (or further up the stack)
>>>>>
>>>>> I've done excessive logging to a file but even with flushing after every write the results are not consistent. Sometimes the image locks after leaving #at:put: sometimes it does somewhere in the middle or in #registerIndexesOn: (but remember: the logging might not be precise).
>>>>>
>>>>> It's probably not the size of the objects in the cluster (the graph is big but not overly large), since there are other clusters with more objects.
>>>>>
>>>>> What I did find is that the #grow operation for HashedCollections can be *very* slow, up to 20 seconds or more, at other times the snapshot runs through within no time.
>>>>>
>>>>> So here's my theory: There migth be a VM problem with HashedCollections.
>>>>> Now, the VM is a rather old one and I haven't had the possibility to test this with a newer one (but I'll probably have to). The version is Squeak4.0.3-2202 running on 32-bit Debian Squeeze.
>>>>>
>>>>> I'll try some more but if anyone has any ideas I'd be very happy :)
>>>>>
>>>>> Cheers,
>>>>> Max
>>>>> _______________________________________________
>>>>> Pharo-fuel mailing list
>>>>> [hidden email]
>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Mariano
>>>>> http://marianopeck.wordpress.com
>>>>> _______________________________________________
>>>>> Pharo-fuel mailing list
>>>>> [hidden email]
>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>
>>>>
>>>> _______________________________________________
>>>> Pharo-fuel mailing list
>>>> [hidden email]
>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Mariano
>>>> http://marianopeck.wordpress.com
>>>> _______________________________________________
>>>> Pharo-fuel mailing list
>>>> [hidden email]
>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>
>>>
>>> _______________________________________________
>>> Pharo-fuel mailing list
>>> [hidden email]
>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>
>>>
>>>
>>>
>>> --
>>> Mariano
>>> http://marianopeck.wordpress.com
>>> _______________________________________________
>>> Pharo-fuel mailing list
>>> [hidden email]
>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>
>
>
> _______________________________________________
> Pharo-fuel mailing list
> [hidden email]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel


Re: [Pharo-fuel] [Vm-dev] Fwd: Possible collections problem

Max Leske

Here I go again… Shortly after sending the last e-mail the problem was back (obviously…). The good news is: I've found a workaround. The bad news: this is the workaround:

[ (Delay forSeconds: 5) wait.
        "serialize stuff in child" ] withImageFork: 'spoon'.

Any ideas? I'm happy to have a workaround but I'd really like to resolve the issue. If, for instance, this were a problem with OSProcess, then the delay (if there's no way around it) should be in OSProcess and not in my code.

Cheers,
Max



On 22.05.2013, at 19:22, Max Leske <[hidden email]> wrote:

> Thanks for the suggestions Dave. I wanted to get some more information before replying; took me longer than I thought though.
>
> I was able to pin down the problem around a couple of thousand objects. I'll sketch my findings roughly: I had two collections, both stored in an instance variable. I could serialize them both separately but not together. I could also serialize one of them and a subset of the other. The actual elements didn't seem to matter as much as the number of elements serialized. This worked for example:
>
> arrayToSerialize := {
> coll1 first: 180. "coll1 has 200 elements"
> coll2 }
>
> As did this:
>
> arrayToSerialize := {
> coll1 last: 20.
> coll2 }
>
> While this didn't:
>
> arrayToSerialize := {
> coll1 first: 180. "coll1 has 200 elements"
> coll1 last: 20.
> coll2 }
>
> But then again, this did:
>
> arrayToSerialize := { coll2 }
>
> As did this:
>
> arrayToSerialize := { coll1 }
>
> excluding elements that referenced each other in coll1 and coll2 didn't matter either.
>
>
> I have no clue at all, what might be the problem here. My only (far fetched) guess is that maybe one of the objects was somehow corrupt (although that should have somehow triggered exceptions or at least make serialization impossible in all cases, especially when not forking).
>
> Then, during testing, I messed up my snapshot and had to load a new one and tadaaa: the problem's gone… Just weird… I just hope, this never happens again. If it does, you'll hear from me :)
>
>
>
> What I wrote below now seems obsolete. I'm leaving it in for reference.
>
> Cheers and thanks a lot for your help!
> Max
>
>
> On 16.05.2013, at 14:33, David T. Lewis <[hidden email]> wrote:
>
>> First to check to see if I understand the problem:
>>
>> - You are forking an image using OSProcess, and the failures are occuring
>> the child VM/image that you created using one of the #forkSqueak methods.
>
> Yes, precisely. I'm forking headless by the way.
>
>>
>> - You are serializing some object in the child image so that they can be
>> written to a serialized output file. You are doing this in the background
>> child VM/image so that your main image can continue doing whatever it was
>> doing without being impacted by the serializing process.
>
> Yes.
>
>>
>> - If you do this 10 times, it will fail about 4 times out of the 10 on average.
>
> Yes, very roughly.
>
>>
>> From the stack traces, it looks like your VM is hanging up in a normal
>> garbage collection sweep that is happening as a side effect of allocating
>> a new object. The first thing that comes to mind is that your VM is hanging
>> up while trying to allocate more memory than the OS can provide to it, and
>> it appears to be stuck while waiting memory to be provided by the OS. I'm
>> suggesting this because I've seen similar symptoms when I intentially try
>> to allocate very large object memories, so perhaps there may be something
>> in your runtime environment (OS settings) that limit available memory.
>> This is a guess only, but it might give you some ideas of where to look.
>
> I used vmstat to look at possible swap (paging) problems. There wasn't a single swap operation and there's ample memory.
>
>>
>> This might be a VM problem, or it might be a symptom of some other problem
>> that just puts the VM and OS into a condition where they are unexpectedly
>> trying to use a very large amount of memory.
>
> That's what I hope, but I'm pretty clueless.
>
>>
>> If possible it would be good to try to rule out the OSProcess forkImage as
>> a possible contributing factor. If you were to do the serialization 10 times
>> in the main image, rather than in the background child image, would it
>> still fail about 4 times out of 10? This could be a factor in memory use.
>
> Serializing without forking always succeeds.
>
>> Although forkSqueak is very memory efficient, if the child and parent
>> images do a lot of different things (such as try to serialize a bunch of
>> stuff in the child image), then the OS will be forced to map in enough
>> real memory to satisfy both of the images.
>
> As pointed out above, there wasn't a single swap. htop also shows swap usage at roughly 20% and there's always memory available.
>
>>
>> Another thing to check, if possible, is to see if the OS is doing a
>> lot of disk swapping at the time of the failure. If you see all of the
>> system memory in use, along with a lot of disk activity, then you are
>> very likely look at a case of the VM trying to allocate more memory
>> than the system can reasonable provide.
>
> (Again, see above)
> I/O is also pretty low. Nothing special there.
>
>>
>> FWIW, here is what your VM is doing at the time of the "hang up". This
>> involves scanning object memory through memEnd, which might be a very
>> large value for one reason or another:
>>
>> ObjectMemory>>updatePointersInRootObjectsFrom: memStart to: memEnd
>>   "update pointers in root objects"
>>   | oop |
>>   <inline: false>
>>   1 to: rootTableCount do: [:i |
>>           oop := rootTable at: i.
>>           ((self oop: oop isLessThan: memStart)
>>                   or: [self oop: oop isGreaterThanOrEqualTo: memEnd])
>>               ifTrue: ["Note: must not remap the fields of any object twice!"
>>                   "remap this oop only if not in the memory range
>>                   covered below"
>>                   self remapFieldsAndClassOf: oop]]
>
> I was able to look at two forked processes where one didn't terminate (call it 1) and the other (call it 2) did. What I saw was that 1 was doing GC's like crazy right from the start (like the stacks I posted) while 2 showed a lot of other calls too (e.g. interpret()) and spent (in my oppinion) much less time in GC.
>
> I tried two additional things:
> 1. after forking I kill the weak finalization process. This seems to have bettered the problem but isn't a fix
> 2. I modified the GC options the following way:
>
> "Limit incremental GC activity to run every 100k allocations"
> SmalltalkImage current vmParameterAt: 5 put: 100000. "allocations between GCs (default: 4000)"
>
> "Limit tenuring threshold to only tenure w/ > 25k survivors"
> SmalltalkImage current vmParameterAt: 6 put: 25000. "tenuring threshold (default: 2000)"
>
> "Tell the VM that we'd rather grow than spinning in tight GC loops"
> SmalltalkImage current gcBiasToGrow: true. "default: false"
>
> "Tell the VM to do a fullGC for good measure if the above growth exceeded x MB"
> SmalltalkImage current gcBiasToGrowLimit: 8*1024*1024. "default: 0"
>
> This did not have any effect (as far as I can tell) on the problem.
>
>>
>> HTH,
>>
>> Dave
>>
>> On Thu, May 16, 2013 at 09:29:42AM +0200, Max Leske wrote:
>>>
>>> Hi
>>>
>>> I'm forwarding this because I'd like to rule out a VM problem. Short summary:
>>> I fork a squeak image and then serialize objects with Fuel. In roughly 40% of the cases the fork suddenly locks and consumes 100% CPU. The trace I most often see with gdb in that case is the one with
>>> "#0  0x08060453 in updatePointersInRootObjectsFromto ()" at the top.
>>>
>>> The object processed when the lockup occufrs is always of class Timestamp, although that doesn't necessarily mean anything. Maybe it's more about the number of objects.
>>>
>>> I'm working on Debian, 32-bit and I can reproduce the problem with SqueakVM 4.4.7-2364 and 4.0.3-2202 (the newer ones wont run because of glibc). I haven't tried Cog yet.
>>>
>>> I also just checked that the problem occurs even if I don't serialize any timestamps (nor Process, Delay, Monitor, Semaphore; just to be sure).
>>>
>>> So if anyone has a clue as to what might be going on I'd really appreciate the help.
>>>
>>> Cheers,
>>> Max
>>>
>>> Begin forwarded message:
>>>
>>>> From: Mariano Martinez Peck <[hidden email]>
>>>> Subject: Re: [Pharo-fuel] Possible collections problem
>>>> Date: 15. Mai 2013 16:53:10 MESZ
>>>> To: The Fuel Project <[hidden email]>
>>>> Reply-To: The Fuel Project <[hidden email]>
>>>>
>>>> I cannot see anything in particular. Too many GC stuff.
>>>> So I wouldn't spend more time trying to debug. I would try the none large collections. Then I would try with latest Cog and latest StackVM.
>>>>
>>>>
>>>>
>>>> On Wed, May 15, 2013 at 11:47 AM, Max Leske <[hidden email]> wrote:
>>>> I've had several forks hanging around just now. Apart from one all of these were locked. I attached gdb and generated the c stack for all of them. Not sure if there's anything really interesting in there although clearly a lot of time is spent in GC and with creation of objects. That doesn't have to mean anything though.
>>>>
>>>> I haven't yet tried your suggestion Mariano.
>>>>
>>>> Cheers,
>>>> Max
>>>>
>>>>
>>>> [Thread debugging using libthread_db enabled]
>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>
>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>> #2  0x08060bb0 in incCompBody ()
>>>> #3  0x08065fa7 in incrementalGC ()
>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>> #5  0x08069420 in primitiveNew ()
>>>> #6  0x0806de15 in interpret ()
>>>> #7  0x08073dfe in main ()
>>>>
>>>>
>>>>
>>>> [Thread debugging using libthread_db enabled]
>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>
>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>> #2  0x08060bb0 in incCompBody ()
>>>> #3  0x08065fa7 in incrementalGC ()
>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>> #5  0x08069420 in primitiveNew ()
>>>> #6  0x0806de15 in interpret ()
>>>> #7  0x08073dfe in main ()
>>>>
>>>>
>>>>
>>>>
>>>> [Thread debugging using libthread_db enabled]
>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>
>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>> #2  0x08060bb0 in incCompBody ()
>>>> #3  0x08065fa7 in incrementalGC ()
>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>> #5  0x0806fed2 in clone ()
>>>> #6  0x08070095 in primitiveClone ()
>>>> #7  0x0806de15 in interpret ()
>>>> #8  0x08073dfe in main ()
>>>>
>>>>
>>>> [Thread debugging using libthread_db enabled]
>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>
>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>> #2  0x08060bb0 in incCompBody ()
>>>> #3  0x08065fa7 in incrementalGC ()
>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>> #5  0x08069270 in primitiveNewWithArg ()
>>>> #6  0x0806de15 in interpret ()
>>>> #7  0x08073dfe in main ()
>>>>
>>>>
>>>> [Thread debugging using libthread_db enabled]
>>>> 0xb76f0f68 in select () from /lib/libc.so.6
>>>>
>>>> #0  0xb76f0f68 in select () from /lib/libc.so.6
>>>> #1  0x08070880 in aioPoll ()
>>>> #2  0xb762419e in ?? () from /usr/lib/squeak/4.0.3-2202//so.vm-display-X11
>>>> #3  0x08073595 in ioRelinquishProcessorForMicroseconds ()
>>>> #4  0x08061f24 in primitiveRelinquishProcessor ()
>>>> #5  0x0806de15 in interpret ()
>>>> #6  0x08073dfe in main ()
>>>>
>>>>
>>>> [Thread debugging using libthread_db enabled]
>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>
>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>> #2  0x08060bb0 in incCompBody ()
>>>> #3  0x08065fa7 in incrementalGC ()
>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>> #5  0x08069420 in primitiveNew ()
>>>> #6  0x0806de15 in interpret ()
>>>> #7  0x08073dfe in main ()
>>>>
>>>>
>>>>
>>>> [Thread debugging using libthread_db enabled]
>>>> 0x08064e7e in markAndTrace ()
>>>>
>>>> #0  0x08064e7e in markAndTrace ()
>>>> #1  0x0806593a in markPhase ()
>>>> #2  0x08065f60 in incrementalGC ()
>>>> #3  0x080661a4 in sufficientSpaceAfterGC ()
>>>> #4  0x0806fed2 in clone ()
>>>> #5  0x08070095 in primitiveClone ()
>>>> #6  0x0806de15 in interpret ()
>>>> #7  0x08073dfe in main ()
>>>>
>>>>
>>>>
>>>>
>>>> On 15.05.2013, at 13:59, Mariano Martinez Peck <[hidden email]> wrote:
>>>>
>>>>> Ok. So, first thing you should try, is to replace the uses of LargeIdentityDictionary with IdentityDictionary. And LargeIdentitySet with IdentitySet.
>>>>> If the problem disappears, then yes, there is something wrong with LargeCollections. If there is a problem with them, try updating VM, since they use a particular primitive.
>>>>> Let us know!
>>>>>
>>>>>
>>>>> On Tue, May 14, 2013 at 9:29 AM, Max Leske <[hidden email]> wrote:
>>>>>
>>>>> On 14.05.2013, at 13:52, Mariano Martinez Peck <[hidden email]> wrote:
>>>>>
>>>>>> Hi Max. Question, are you able to reproduce the problem?
>>>>>
>>>>> Yes, but not "on purpose". The situation usually happens once or twice a day and then with consistent log entries. That's why I want to use gdb the next time it happens.
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 30, 2013 at 3:57 PM, Max Leske <[hidden email]> wrote:
>>>>>> Hi guys
>>>>>>
>>>>>> I have a problem serializing a graph. Sometimes (not always) the image will consume +/- 100% CPU and stop responding. I was able to pin the problem down a bit:
>>>>>> - fails always in FLIteratingCluster>>registerIndexesOn: when called from FLFixedObjectCluster with class TimeStamp (this might not actually be relevan but it's consistent)
>>>>>> - the problem *might* be in FLLargeIdentityDictionary>>at:put: (or further up the stack)
>>>>>>
>>>>>> I've done excessive logging to a file but even with flushing after every write the results are not consistent. Sometimes the image locks after leaving #at:put: sometimes it does somewhere in the middle or in #registerIndexesOn: (but remember: the logging might not be precise).
>>>>>>
>>>>>> It's probably not the size of the objects in the cluster (the graph is big but not overly large), since there are other clusters with more objects.
>>>>>>
>>>>>> What I did find is that the #grow operation for HashedCollections can be *very* slow, up to 20 seconds or more, at other times the snapshot runs through within no time.
>>>>>>
>>>>>> So here's my theory: There migth be a VM problem with HashedCollections.
>>>>>> Now, the VM is a rather old one and I haven't had the possibility to test this with a newer one (but I'll probably have to). The version is Squeak4.0.3-2202 running on 32-bit Debian Squeeze.
>>>>>>
>>>>>> I'll try some more but if anyone has any ideas I'd be very happy :)
>>>>>>
>>>>>> Cheers,
>>>>>> Max
>>>>>> _______________________________________________
>>>>>> Pharo-fuel mailing list
>>>>>> [hidden email]
>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Mariano
>>>>>> http://marianopeck.wordpress.com
>>>>>> _______________________________________________
>>>>>> Pharo-fuel mailing list
>>>>>> [hidden email]
>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pharo-fuel mailing list
>>>>> [hidden email]
>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Mariano
>>>>> http://marianopeck.wordpress.com
>>>>> _______________________________________________
>>>>> Pharo-fuel mailing list
>>>>> [hidden email]
>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>
>>>>
>>>> _______________________________________________
>>>> Pharo-fuel mailing list
>>>> [hidden email]
>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Mariano
>>>> http://marianopeck.wordpress.com
>>>> _______________________________________________
>>>> Pharo-fuel mailing list
>>>> [hidden email]
>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>
>>
>>
>> _______________________________________________
>> Pharo-fuel mailing list
>> [hidden email]
>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>


Re: [Pharo-fuel] [Vm-dev] Fwd: Possible collections problem

Max Leske

And again I find myself staring at multiple never-terminating forks… The workaround did make the situation better but apparently doesn't prevent the problem completely.

Max


On 23.05.2013, at 14:31, Max Leske <[hidden email]> wrote:

> Here I go again… Shortly after sending the last e-mail the problem was back (obviously…). The good news is: I've found a workaround. The bad news: this is the work around:
>
> [ (Delay forSeconds: 5) wait.
> "serialize stuff in child" ] withImageFork: 'spoon'.
>
> Any ideas? I'm happy to have a workaround but I'd really like to resolve the issue. If, for instance, this were a problem with OSProcess, then the delay (if there's no way around it) should be in OSProcess and not in my code.
>
> Cheers,
> Max
>
>
>
> On 22.05.2013, at 19:22, Max Leske <[hidden email]> wrote:
>
>> Thanks for the suggestions Dave. I wanted to get some more information before replying; took me longer than I thought though.
>>
>> I was able to pin down the problem around a couple of thousand objects. I'll sketch my findings roughly: I had two collections, both stored in an instance variable. I could serialize them both separately but not together. I could also serialize one of them and a subset of the other. The actual elements didn't seem to matter as much as the number of elements serialized. This worked for example:
>>
>> arrayToSerialize := {
>> coll1 first: 180. "coll1 has 200 elements"
>> coll2 }
>>
>> As did this:
>>
>> arrayToSerialize := {
>> coll1 last: 20.
>> coll2 }
>>
>> While this didn't:
>>
>> arrayToSerialize := {
>> coll1 first: 180. "coll1 has 200 elements"
>> coll1 last: 20.
>> coll2 }
>>
>> But then again, this did:
>>
>> arrayToSerialize := { coll2 }
>>
>> As did this:
>>
>> arrayToSerialize := { coll1 }
>>
>> excluding elements that referenced each other in coll1 and coll2 didn't matter either.
>>
>>
>> I have no clue at all, what might be the problem here. My only (far fetched) guess is that maybe one of the objects was somehow corrupt (although that should have somehow triggered exceptions or at least make serialization impossible in all cases, especially when not forking).
>>
>> Then, during testing, I messed up my snapshot and had to load a new one and tadaaa: the problem's gone… Just weird… I just hope, this never happens again. If it does, you'll hear from me :)
>>
>>
>>
>> What I wrote below now seems obsolete. I'm leaving it in for reference.
>>
>> Cheers and thanks a lot for your help!
>> Max
>>
>>
>> On 16.05.2013, at 14:33, David T. Lewis <[hidden email]> wrote:
>>
>>> First to check to see if I understand the problem:
>>>
>>> - You are forking an image using OSProcess, and the failures are occuring
>>> the child VM/image that you created using one of the #forkSqueak methods.
>>
>> Yes, precisely. I'm forking headless by the way.
>>
>>>
>>> - You are serializing some object in the child image so that they can be
>>> written to a serialized output file. You are doing this in the background
>>> child VM/image so that your main image can continue doing whatever it was
>>> doing without being impacted by the serializing process.
>>
>> Yes.
>>
>>>
>>> - If you do this 10 times, it will fail about 4 times out of the 10 on average.
>>
>> Yes, very roughly.
>>
>>>
>>> From the stack traces, it looks like your VM is hanging up in a normal
>>> garbage collection sweep that is happening as a side effect of allocating
>>> a new object. The first thing that comes to mind is that your VM is hanging
>>> up while trying to allocate more memory than the OS can provide to it, and
>>> it appears to be stuck while waiting memory to be provided by the OS. I'm
>>> suggesting this because I've seen similar symptoms when I intentially try
>>> to allocate very large object memories, so perhaps there may be something
>>> in your runtime environment (OS settings) that limit available memory.
>>> This is a guess only, but it might give you some ideas of where to look.
>>
>> I used vmstat to look at possible swap (paging) problems. There wasn't a single swap operation and there's ample memory.
>>
>>>
>>> This might be a VM problem, or it might be a symptom of some other problem
>>> that just puts the VM and OS into a condition where they are unexpectedly
>>> trying to use a very large amount of memory.
>>
>> That's what I hope, but I'm pretty clueless.
>>
>>>
>>> If possible it would be good to try to rule out the OSProcess forkImage as
>>> a possible contributing factor. If you were to do the serialization 10 times
>>> in the main image, rather than in the background child image, would it
>>> still fail about 4 times out of 10? This could be a factor in memory use.
>>
>> Serializing without forking always succeeds.
>>
>>> Although forkSqueak is very memory efficient, if the child and parent
>>> images do a lot of different things (such as try to serialize a bunch of
>>> stuff in the child image), then the OS will be forced to map in enough
>>> real memory to satisfy both of the images.
>>
>> As pointed out above, there wasn't a single swap. htop also shows swap usage at roughly 20% and there's always memory available.
>>
>>>
>>> Another thing to check, if possible, is to see if the OS is doing a
>>> lot of disk swapping at the time of the failure. If you see all of the
>>> system memory in use, along with a lot of disk activity, then you are
>>> very likely look at a case of the VM trying to allocate more memory
>>> than the system can reasonable provide.
>>
>> (Again, see above)
>> I/O is also pretty low. Nothing special there.
>>
>>>
>>> FWIW, here is what your VM is doing at the time of the "hang up". This
>>> involves scanning object memory through memEnd, which might be a very
>>> large value for one reason or another:
>>>
>>> ObjectMemory>>updatePointersInRootObjectsFrom: memStart to: memEnd
>>>  "update pointers in root objects"
>>>  | oop |
>>>  <inline: false>
>>>  1 to: rootTableCount do: [:i |
>>>          oop := rootTable at: i.
>>>          ((self oop: oop isLessThan: memStart)
>>>                  or: [self oop: oop isGreaterThanOrEqualTo: memEnd])
>>>              ifTrue: ["Note: must not remap the fields of any object twice!"
>>>                  "remap this oop only if not in the memory range
>>>                  covered below"
>>>                  self remapFieldsAndClassOf: oop]]
>>
>> I was able to look at two forked processes where one didn't terminate (call it 1) and the other (call it 2) did. What I saw was that 1 was doing GC's like crazy right from the start (like the stacks I posted) while 2 showed a lot of other calls too (e.g. interpret()) and spent (in my oppinion) much less time in GC.
>>
>> I tried two additional things:
>> 1. after forking I kill the weak finalization process. This seems to have bettered the problem but isn't a fix
>> 2. I modified the GC options the following way:
>>
>> "Limit incremental GC activity to run every 100k allocations"
>> SmalltalkImage current vmParameterAt: 5 put: 100000. "allocations between GCs (default: 4000)"
>>
>> "Limit tenuring threshold to only tenure w/ > 25k survivors"
>> SmalltalkImage current vmParameterAt: 6 put: 25000. "tenuring threshold (default: 2000)"
>>
>> "Tell the VM that we'd rather grow than spinning in tight GC loops"
>> SmalltalkImage current gcBiasToGrow: true. "default: false"
>>
>> "Tell the VM to do a fullGC for good measure if the above growth exceeded x MB"
>> SmalltalkImage current gcBiasToGrowLimit: 8*1024*1024. "default: 0"
>>
>> This did not have any effect (as far as I can tell) on the problem.
>>
>>>
>>> HTH,
>>>
>>> Dave
>>>
>>> On Thu, May 16, 2013 at 09:29:42AM +0200, Max Leske wrote:
>>>>
>>>> Hi
>>>>
>>>> I'm forwarding this because I'd like to rule out a VM problem. Short summary:
>>>> I fork a squeak image and then serialize objects with Fuel. In roughly 40% of the cases the fork suddenly locks and consumes 100% CPU. The trace I most often see with gdb in that case is the one with
>>>> "#0  0x08060453 in updatePointersInRootObjectsFromto ()" at the top.
>>>>
>>>> The object processed when the lockup occufrs is always of class Timestamp, although that doesn't necessarily mean anything. Maybe it's more about the number of objects.
>>>>
>>>> I'm working on Debian, 32-bit and I can reproduce the problem with SqueakVM 4.4.7-2364 and 4.0.3-2202 (the newer ones wont run because of glibc). I haven't tried Cog yet.
>>>>
>>>> I also just checked that the problem occurs even if I don't serialize any timestamps (nor Process, Delay, Monitor, Semaphore; just to be sure).
>>>>
>>>> So if anyone has a clue as to what might be going on I'd really appreciate the help.
>>>>
>>>> Cheers,
>>>> Max
>>>>
>>>> Begin forwarded message:
>>>>
>>>>> From: Mariano Martinez Peck <[hidden email]>
>>>>> Subject: Re: [Pharo-fuel] Possible collections problem
>>>>> Date: 15. Mai 2013 16:53:10 MESZ
>>>>> To: The Fuel Project <[hidden email]>
>>>>> Reply-To: The Fuel Project <[hidden email]>
>>>>>
>>>>> I cannot see anything in particular. Too many GC stuff.
>>>>> So I wouldn't spend more time trying to debug. I would try the none large collections. Then I would try with latest Cog and latest StackVM.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 15, 2013 at 11:47 AM, Max Leske <[hidden email]> wrote:
>>>>> I've had several forks hanging around just now. Apart from one all of these were locked. I attached gdb and generated the c stack for all of them. Not sure if there's anything really interesting in there although clearly a lot of time is spent in GC and with creation of objects. That doesn't have to mean anything though.
>>>>>
>>>>> I haven't yet tried your suggestion Mariano.
>>>>>
>>>>> Cheers,
>>>>> Max
>>>>>
>>>>>
>>>>> [Thread debugging using libthread_db enabled]
>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>
>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>>> #2  0x08060bb0 in incCompBody ()
>>>>> #3  0x08065fa7 in incrementalGC ()
>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>>> #5  0x08069420 in primitiveNew ()
>>>>> #6  0x0806de15 in interpret ()
>>>>> #7  0x08073dfe in main ()
>>>>>
>>>>>
>>>>>
>>>>> [Thread debugging using libthread_db enabled]
>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>
>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>>> #2  0x08060bb0 in incCompBody ()
>>>>> #3  0x08065fa7 in incrementalGC ()
>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>>> #5  0x08069420 in primitiveNew ()
>>>>> #6  0x0806de15 in interpret ()
>>>>> #7  0x08073dfe in main ()
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> [Thread debugging using libthread_db enabled]
>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>
>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>>> #2  0x08060bb0 in incCompBody ()
>>>>> #3  0x08065fa7 in incrementalGC ()
>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>>> #5  0x0806fed2 in clone ()
>>>>> #6  0x08070095 in primitiveClone ()
>>>>> #7  0x0806de15 in interpret ()
>>>>> #8  0x08073dfe in main ()
>>>>>
>>>>>
>>>>> [Thread debugging using libthread_db enabled]
>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>
>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>>> #2  0x08060bb0 in incCompBody ()
>>>>> #3  0x08065fa7 in incrementalGC ()
>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>>> #5  0x08069270 in primitiveNewWithArg ()
>>>>> #6  0x0806de15 in interpret ()
>>>>> #7  0x08073dfe in main ()
>>>>>
>>>>>
>>>>> [Thread debugging using libthread_db enabled]
>>>>> 0xb76f0f68 in select () from /lib/libc.so.6
>>>>>
>>>>> #0  0xb76f0f68 in select () from /lib/libc.so.6
>>>>> #1  0x08070880 in aioPoll ()
>>>>> #2  0xb762419e in ?? () from /usr/lib/squeak/4.0.3-2202//so.vm-display-X11
>>>>> #3  0x08073595 in ioRelinquishProcessorForMicroseconds ()
>>>>> #4  0x08061f24 in primitiveRelinquishProcessor ()
>>>>> #5  0x0806de15 in interpret ()
>>>>> #6  0x08073dfe in main ()
>>>>>
>>>>>
>>>>> [Thread debugging using libthread_db enabled]
>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>
>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>>> #2  0x08060bb0 in incCompBody ()
>>>>> #3  0x08065fa7 in incrementalGC ()
>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>>> #5  0x08069420 in primitiveNew ()
>>>>> #6  0x0806de15 in interpret ()
>>>>> #7  0x08073dfe in main ()
>>>>>
>>>>>
>>>>>
>>>>> [Thread debugging using libthread_db enabled]
>>>>> 0x08064e7e in markAndTrace ()
>>>>>
>>>>> #0  0x08064e7e in markAndTrace ()
>>>>> #1  0x0806593a in markPhase ()
>>>>> #2  0x08065f60 in incrementalGC ()
>>>>> #3  0x080661a4 in sufficientSpaceAfterGC ()
>>>>> #4  0x0806fed2 in clone ()
>>>>> #5  0x08070095 in primitiveClone ()
>>>>> #6  0x0806de15 in interpret ()
>>>>> #7  0x08073dfe in main ()
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 15.05.2013, at 13:59, Mariano Martinez Peck <[hidden email]> wrote:
>>>>>
>>>>>> Ok. So, first thing you should try, is to replace the uses of LargeIdentityDictionary with IdentityDictionary. And LargeIdentitySet with IdentitySet.
>>>>>> If the problem disappears, then yes, there is something wrong with LargeCollections. If there is a problem with them, try updating VM, since they use a particular primitive.
>>>>>> Let us know!
>>>>>>
>>>>>>
>>>>>> On Tue, May 14, 2013 at 9:29 AM, Max Leske <[hidden email]> wrote:
>>>>>>
>>>>>> On 14.05.2013, at 13:52, Mariano Martinez Peck <[hidden email]> wrote:
>>>>>>
>>>>>>> Hi Max. Question, are you able to reproduce the problem?
>>>>>>
>>>>>> Yes, but not "on purpose". The situation usually happens once or twice a day and then with consistent log entries. That's why I want to use gdb the next time it happens.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 30, 2013 at 3:57 PM, Max Leske <[hidden email]> wrote:
>>>>>>> Hi guys
>>>>>>>
>>>>>>> I have a problem serializing a graph. Sometimes (not always) the image will consume +/- 100% CPU and stop responding. I was able to pin the problem down a bit:
>>>>>>> - fails always in FLIteratingCluster>>registerIndexesOn: when called from FLFixedObjectCluster with class TimeStamp (this might not actually be relevan but it's consistent)
>>>>>>> - the problem *might* be in FLLargeIdentityDictionary>>at:put: (or further up the stack)
>>>>>>>
>>>>>>> I've done excessive logging to a file but even with flushing after every write the results are not consistent. Sometimes the image locks after leaving #at:put: sometimes it does somewhere in the middle or in #registerIndexesOn: (but remember: the logging might not be precise).
>>>>>>>
>>>>>>> It's probably not the size of the objects in the cluster (the graph is big but not overly large), since there are other clusters with more objects.
>>>>>>>
>>>>>>> What I did find is that the #grow operation for HashedCollections can be *very* slow, up to 20 seconds or more, at other times the snapshot runs through within no time.
>>>>>>>
>>>>>>> So here's my theory: There migth be a VM problem with HashedCollections.
>>>>>>> Now, the VM is a rather old one and I haven't had the possibility to test this with a newer one (but I'll probably have to). The version is Squeak4.0.3-2202 running on 32-bit Debian Squeeze.
>>>>>>>
>>>>>>> I'll try some more but if anyone has any ideas I'd be very happy :)
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Max
>>>>>>> _______________________________________________
>>>>>>> Pharo-fuel mailing list
>>>>>>> [hidden email]
>>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Mariano
>>>>>>> http://marianopeck.wordpress.com
>>>>>>> _______________________________________________
>>>>>>> Pharo-fuel mailing list
>>>>>>> [hidden email]
>>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pharo-fuel mailing list
>>>>>> [hidden email]
>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Mariano
>>>>>> http://marianopeck.wordpress.com
>>>>>> _______________________________________________
>>>>>> Pharo-fuel mailing list
>>>>>> [hidden email]
>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pharo-fuel mailing list
>>>>> [hidden email]
>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Mariano
>>>>> http://marianopeck.wordpress.com
>>>>> _______________________________________________
>>>>> Pharo-fuel mailing list
>>>>> [hidden email]
>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>
>>>
>>>
>>> _______________________________________________
>>> Pharo-fuel mailing list
>>> [hidden email]
>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>
>


Re: [Pharo-fuel] [Vm-dev] Fwd: Possible collections problem

David T. Lewis
 
I have no idea what is going on here, but given that it seems to be happening
in the forked image only, and given that there is some indication that it is
associated with the finalization process, I am inclined to look for something
that might be associated with objects that somehow reference external resources.
I don't have anything specific in mind, just a general hunch.

An image/vm started with #forkSqueak is essentially an exact copy of the original
object memory, except that the X11 session connected to the display has been
restarted with a new connection. All of the other objects and external resources,
including file handles, are the same insofar as Unix fork() semantics support
them. For example, a Squeak file stream connected to a Unix file stream remains
connected to the same stream at the OS level.

I can't think of any way in which this might affect object serialization, but
there must be something.

For purposes of debugging the problem, perhaps you could try using a SmartReferenceStream
for the serialization, and see if the symptoms change. That might give a clue
as to whether the issue is associated with traversing the object memory in
the forked vm/image (hence all serialization methods would have the same
problem), or if it is different with Fuel versus SmartReferenceStream (hence
something associated with the objects being serialized).
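
Something roughly like this would do for a first comparison (untested sketch; in the image the class is SmartRefStream, though plain ReferenceStream should be enough for this test, and the file name is arbitrary):

	| rs |
	rs := ReferenceStream fileNamed: 'graph.ref'.
	rs nextPut: arrayToSerialize.	"the same graph you serialize with Fuel"
	rs close.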

Interesting problem ... wish I had the answer.

Dave


On Fri, May 24, 2013 at 03:16:31PM +0200, Max Leske wrote:

>
> And again I find myself staring at multiple never terminating forks… The workaround did make the situation better but apparently doesn't prevent the problem completely.
>
> Max
>
>
> On 23.05.2013, at 14:31, Max Leske <[hidden email]> wrote:
>
> > Here I go again… Shortly after sending the last e-mail the problem was back (obviously…). The good news is: I've found a workaround. The bad news: this is the work around:
> >
> > [ (Delay forSeconds: 5) wait.
> > "serialize stuff in child" ] withImageFork: 'spoon'.
> >
> > Any ideas? I'm happy to have a workaround but I'd really like to resolve the issue. If, for instance, this were a problem with OSProcess, then the delay (if there's no way around it) should be in OSProcess and not in my code.
> >
> > Cheers,
> > Max
> >
> >
> >
> > On 22.05.2013, at 19:22, Max Leske <[hidden email]> wrote:
> >
> >> Thanks for the suggestions Dave. I wanted to get some more information before replying; took me longer than I thought though.
> >>
> >> I was able to pin down the problem around a couple of thousand objects. I'll sketch my findings roughly: I had two collections, both stored in an instance variable. I could serialize them both separately but not together. I could also serialize one of them and a subset of the other. The actual elements didn't seem to matter as much as the number of elements serialized. This worked for example:
> >>
> >> arrayToSerialize := {
> >> coll1 first: 180. "coll1 has 200 elements"
> >> coll2 }
> >>
> >> As did this:
> >>
> >> arrayToSerialize := {
> >> coll1 last: 20.
> >> coll2 }
> >>
> >> While this didn't:
> >>
> >> arrayToSerialize := {
> >> coll1 first: 180. "coll1 has 200 elements"
> >> coll1 last: 20.
> >> coll2 }
> >>
> >> But then again, this did:
> >>
> >> arrayToSerialize := { coll2 }
> >>
> >> As did this:
> >>
> >> arrayToSerialize := { coll1 }
> >>
> >> excluding elements that referenced each other in coll1 and coll2 didn't matter either.
> >>
> >>
> >> I have no clue at all, what might be the problem here. My only (far fetched) guess is that maybe one of the objects was somehow corrupt (although that should have somehow triggered exceptions or at least make serialization impossible in all cases, especially when not forking).
> >>
> >> Then, during testing, I messed up my snapshot and had to load a new one and tadaaa: the problem's gone… Just weird… I just hope, this never happens again. If it does, you'll hear from me :)
> >>
> >>
> >>
> >> What I wrote below now seems obsolete. I'm leaving it in for reference.
> >>
> >> Cheers and thanks a lot for your help!
> >> Max
> >>
> >>
> >> On 16.05.2013, at 14:33, David T. Lewis <[hidden email]> wrote:
> >>
> >>> First to check to see if I understand the problem:
> >>>
> >>> - You are forking an image using OSProcess, and the failures are occuring
> >>> the child VM/image that you created using one of the #forkSqueak methods.
> >>
> >> Yes, precisely. I'm forking headless by the way.
> >>
> >>>
> >>> - You are serializing some object in the child image so that they can be
> >>> written to a serialized output file. You are doing this in the background
> >>> child VM/image so that your main image can continue doing whatever it was
> >>> doing without being impacted by the serializing process.
> >>
> >> Yes.
> >>
> >>>
> >>> - If you do this 10 times, it will fail about 4 times out of the 10 on average.
> >>
> >> Yes, very roughly.
> >>
> >>>
> >>> From the stack traces, it looks like your VM is hanging up in a normal
> >>> garbage collection sweep that is happening as a side effect of allocating
> >>> a new object. The first thing that comes to mind is that your VM is hanging
> >>> up while trying to allocate more memory than the OS can provide to it, and
> >>> it appears to be stuck while waiting memory to be provided by the OS. I'm
> >>> suggesting this because I've seen similar symptoms when I intentially try
> >>> to allocate very large object memories, so perhaps there may be something
> >>> in your runtime environment (OS settings) that limit available memory.
> >>> This is a guess only, but it might give you some ideas of where to look.
> >>
> >> I used vmstat to look at possible swap (paging) problems. There wasn't a single swap operation and there's ample memory.
> >>
> >>>
> >>> This might be a VM problem, or it might be a symptom of some other problem
> >>> that just puts the VM and OS into a condition where they are unexpectedly
> >>> trying to use a very large amount of memory.
> >>
> >> That's what I hope, but I'm pretty clueless.
> >>
> >>>
> >>> If possible it would be good to try to rule out the OSProcess forkImage as
> >>> a possible contributing factor. If you were to do the serialization 10 times
> >>> in the main image, rather than in the background child image, would it
> >>> still fail about 4 times out of 10? This could be a factor in memory use.
> >>
> >> Serializing without forking always succeeds.
> >>
> >>> Although forkSqueak is very memory efficient, if the child and parent
> >>> images do a lot of different things (such as try to serialize a bunch of
> >>> stuff in the child image), then the OS will be forced to map in enough
> >>> real memory to satisfy both of the images.
> >>
> >> As pointed out above, there wasn't a single swap. htop also shows swap usage at roughly 20% and there's always memory available.
> >>
> >>>
> >>> Another thing to check, if possible, is to see if the OS is doing a
> >>> lot of disk swapping at the time of the failure. If you see all of the
> >>> system memory in use, along with a lot of disk activity, then you are
> >>> very likely look at a case of the VM trying to allocate more memory
> >>> than the system can reasonable provide.
> >>
> >> (Again, see above)
> >> I/O is also pretty low. Nothing special there.
> >>
> >>>
> >>> FWIW, here is what your VM is doing at the time of the "hang up". This
> >>> involves scanning object memory through memEnd, which might be a very
> >>> large value for one reason or another:
> >>>
> >>> ObjectMemory>>updatePointersInRootObjectsFrom: memStart to: memEnd
> >>>  "update pointers in root objects"
> >>>  | oop |
> >>>  <inline: false>
> >>>  1 to: rootTableCount do: [:i |
> >>>          oop := rootTable at: i.
> >>>          ((self oop: oop isLessThan: memStart)
> >>>                  or: [self oop: oop isGreaterThanOrEqualTo: memEnd])
> >>>              ifTrue: ["Note: must not remap the fields of any object twice!"
> >>>                  "remap this oop only if not in the memory range
> >>>                  covered below"
> >>>                  self remapFieldsAndClassOf: oop]]
> >>
> >> I was able to look at two forked processes where one didn't terminate (call it 1) and the other (call it 2) did. What I saw was that 1 was doing GC's like crazy right from the start (like the stacks I posted) while 2 showed a lot of other calls too (e.g. interpret()) and spent (in my oppinion) much less time in GC.
> >>
> >> I tried two additional things:
> >> 1. after forking I kill the weak finalization process. This seems to have lessened the problem but isn't a fix
> >> 2. I modified the GC options the following way:
> >>
> >> "Limit incremental GC activity to run every 100k allocations"
> >> SmalltalkImage current vmParameterAt: 5 put: 100000. "allocations between GCs (default: 4000)"
> >>
> >> "Limit tenuring threshold to only tenure w/ > 25k survivors"
> >> SmalltalkImage current vmParameterAt: 6 put: 25000. "tenuring threshold (default: 2000)"
> >>
> >> "Tell the VM that we'd rather grow than spinning in tight GC loops"
> >> SmalltalkImage current gcBiasToGrow: true. "default: false"
> >>
> >> "Tell the VM to do a fullGC for good measure if the above growth exceeded x MB"
> >> SmalltalkImage current gcBiasToGrowLimit: 8*1024*1024. "default: 0"
> >>
> >> This did not have any effect (as far as I can tell) on the problem.
> >>
> >>>
> >>> HTH,
> >>>
> >>> Dave
> >>>
> >>> On Thu, May 16, 2013 at 09:29:42AM +0200, Max Leske wrote:
> >>>>
> >>>> Hi
> >>>>
> >>>> I'm forwarding this because I'd like to rule out a VM problem. Short summary:
> >>>> I fork a squeak image and then serialize objects with Fuel. In roughly 40% of the cases the fork suddenly locks and consumes 100% CPU. The trace I most often see with gdb in that case is the one with
> >>>> "#0  0x08060453 in updatePointersInRootObjectsFromto ()" at the top.
> >>>>
> >>>> The object processed when the lockup occurs is always of class Timestamp, although that doesn't necessarily mean anything. Maybe it's more about the number of objects.
> >>>>
> >>>> I'm working on Debian, 32-bit, and I can reproduce the problem with SqueakVM 4.4.7-2364 and 4.0.3-2202 (the newer ones won't run because of glibc). I haven't tried Cog yet.
> >>>>
> >>>> I also just checked that the problem occurs even if I don't serialize any timestamps (nor Process, Delay, Monitor, Semaphore; just to be sure).
> >>>>
> >>>> So if anyone has a clue as to what might be going on I'd really appreciate the help.
> >>>>
> >>>> Cheers,
> >>>> Max
> >>>>
> >>>> Begin forwarded message:
> >>>>
> >>>>> From: Mariano Martinez Peck <[hidden email]>
> >>>>> Subject: Re: [Pharo-fuel] Possible collections problem
> >>>>> Date: 15. Mai 2013 16:53:10 MESZ
> >>>>> To: The Fuel Project <[hidden email]>
> >>>>> Reply-To: The Fuel Project <[hidden email]>
> >>>>>
> >>>>> I cannot see anything in particular. Too much GC stuff.
> >>>>> So I wouldn't spend more time trying to debug. I would try the non-large collections. Then I would try with the latest Cog and the latest StackVM.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, May 15, 2013 at 11:47 AM, Max Leske <[hidden email]> wrote:
> >>>>> I've had several forks hanging around just now. Apart from one all of these were locked. I attached gdb and generated the c stack for all of them. Not sure if there's anything really interesting in there although clearly a lot of time is spent in GC and with creation of objects. That doesn't have to mean anything though.
> >>>>>
> >>>>> I haven't yet tried your suggestion Mariano.
> >>>>>
> >>>>> Cheers,
> >>>>> Max
> >>>>>
> >>>>>
> >>>>> [Thread debugging using libthread_db enabled]
> >>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>
> >>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
> >>>>> #2  0x08060bb0 in incCompBody ()
> >>>>> #3  0x08065fa7 in incrementalGC ()
> >>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
> >>>>> #5  0x08069420 in primitiveNew ()
> >>>>> #6  0x0806de15 in interpret ()
> >>>>> #7  0x08073dfe in main ()
> >>>>>
> >>>>>
> >>>>>
> >>>>> [Thread debugging using libthread_db enabled]
> >>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>
> >>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
> >>>>> #2  0x08060bb0 in incCompBody ()
> >>>>> #3  0x08065fa7 in incrementalGC ()
> >>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
> >>>>> #5  0x08069420 in primitiveNew ()
> >>>>> #6  0x0806de15 in interpret ()
> >>>>> #7  0x08073dfe in main ()
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> [Thread debugging using libthread_db enabled]
> >>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>
> >>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
> >>>>> #2  0x08060bb0 in incCompBody ()
> >>>>> #3  0x08065fa7 in incrementalGC ()
> >>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
> >>>>> #5  0x0806fed2 in clone ()
> >>>>> #6  0x08070095 in primitiveClone ()
> >>>>> #7  0x0806de15 in interpret ()
> >>>>> #8  0x08073dfe in main ()
> >>>>>
> >>>>>
> >>>>> [Thread debugging using libthread_db enabled]
> >>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>
> >>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
> >>>>> #2  0x08060bb0 in incCompBody ()
> >>>>> #3  0x08065fa7 in incrementalGC ()
> >>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
> >>>>> #5  0x08069270 in primitiveNewWithArg ()
> >>>>> #6  0x0806de15 in interpret ()
> >>>>> #7  0x08073dfe in main ()
> >>>>>
> >>>>>
> >>>>> [Thread debugging using libthread_db enabled]
> >>>>> 0xb76f0f68 in select () from /lib/libc.so.6
> >>>>>
> >>>>> #0  0xb76f0f68 in select () from /lib/libc.so.6
> >>>>> #1  0x08070880 in aioPoll ()
> >>>>> #2  0xb762419e in ?? () from /usr/lib/squeak/4.0.3-2202//so.vm-display-X11
> >>>>> #3  0x08073595 in ioRelinquishProcessorForMicroseconds ()
> >>>>> #4  0x08061f24 in primitiveRelinquishProcessor ()
> >>>>> #5  0x0806de15 in interpret ()
> >>>>> #6  0x08073dfe in main ()
> >>>>>
> >>>>>
> >>>>> [Thread debugging using libthread_db enabled]
> >>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>
> >>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
> >>>>> #2  0x08060bb0 in incCompBody ()
> >>>>> #3  0x08065fa7 in incrementalGC ()
> >>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
> >>>>> #5  0x08069420 in primitiveNew ()
> >>>>> #6  0x0806de15 in interpret ()
> >>>>> #7  0x08073dfe in main ()
> >>>>>
> >>>>>
> >>>>>
> >>>>> [Thread debugging using libthread_db enabled]
> >>>>> 0x08064e7e in markAndTrace ()
> >>>>>
> >>>>> #0  0x08064e7e in markAndTrace ()
> >>>>> #1  0x0806593a in markPhase ()
> >>>>> #2  0x08065f60 in incrementalGC ()
> >>>>> #3  0x080661a4 in sufficientSpaceAfterGC ()
> >>>>> #4  0x0806fed2 in clone ()
> >>>>> #5  0x08070095 in primitiveClone ()
> >>>>> #6  0x0806de15 in interpret ()
> >>>>> #7  0x08073dfe in main ()
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 15.05.2013, at 13:59, Mariano Martinez Peck <[hidden email]> wrote:
> >>>>>
> >>>>>> Ok. So, first thing you should try, is to replace the uses of LargeIdentityDictionary with IdentityDictionary. And LargeIdentitySet with IdentitySet.
> >>>>>> If the problem disappears, then yes, there is something wrong with LargeCollections. If there is a problem with them, try updating VM, since they use a particular primitive.
> >>>>>> Let us know!
> >>>>>>
> >>>>>>
> >>>>>> On Tue, May 14, 2013 at 9:29 AM, Max Leske <[hidden email]> wrote:
> >>>>>>
> >>>>>> On 14.05.2013, at 13:52, Mariano Martinez Peck <[hidden email]> wrote:
> >>>>>>
> >>>>>>> Hi Max. Question, are you able to reproduce the problem?
> >>>>>>
> >>>>>> Yes, but not "on purpose". The situation usually happens once or twice a day and then with consistent log entries. That's why I want to use gdb the next time it happens.
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Apr 30, 2013 at 3:57 PM, Max Leske <[hidden email]> wrote:
> >>>>>>> Hi guys
> >>>>>>>
> >>>>>>> I have a problem serializing a graph. Sometimes (not always) the image will consume +/- 100% CPU and stop responding. I was able to pin the problem down a bit:
> >>>>>>> - always fails in FLIteratingCluster>>registerIndexesOn: when called from FLFixedObjectCluster with class TimeStamp (this might not actually be relevant but it's consistent)
> >>>>>>> - the problem *might* be in FLLargeIdentityDictionary>>at:put: (or further up the stack)
> >>>>>>>
> >>>>>>> I've done excessive logging to a file but even with flushing after every write the results are not consistent. Sometimes the image locks after leaving #at:put: sometimes it does somewhere in the middle or in #registerIndexesOn: (but remember: the logging might not be precise).
> >>>>>>>
> >>>>>>> It's probably not the size of the objects in the cluster (the graph is big but not overly large), since there are other clusters with more objects.
> >>>>>>>
> >>>>>>> What I did find is that the #grow operation for HashedCollections can be *very* slow, up to 20 seconds or more, at other times the snapshot runs through within no time.
> >>>>>>>
> >>>>>>> So here's my theory: There might be a VM problem with HashedCollections.
> >>>>>>> Now, the VM is a rather old one and I haven't had the possibility to test this with a newer one (but I'll probably have to). The version is Squeak4.0.3-2202 running on 32-bit Debian Squeeze.
> >>>>>>>
> >>>>>>> I'll try some more but if anyone has any ideas I'd be very happy :)
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Max
> >>>>>>> _______________________________________________
> >>>>>>> Pharo-fuel mailing list
> >>>>>>> [hidden email]
> >>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Mariano
> >>>>>>> http://marianopeck.wordpress.com
> >>>>>>> _______________________________________________
> >>>>>>> Pharo-fuel mailing list
> >>>>>>> [hidden email]
> >>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
> >>>>>>
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Pharo-fuel mailing list
> >>>>>> [hidden email]
> >>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Mariano
> >>>>>> http://marianopeck.wordpress.com
> >>>>>> _______________________________________________
> >>>>>> Pharo-fuel mailing list
> >>>>>> [hidden email]
> >>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Pharo-fuel mailing list
> >>>>> [hidden email]
> >>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Mariano
> >>>>> http://marianopeck.wordpress.com
> >>>>> _______________________________________________
> >>>>> Pharo-fuel mailing list
> >>>>> [hidden email]
> >>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
> >>>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Pharo-fuel mailing list
> >>> [hidden email]
> >>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
> >>
> >

Re: [Pharo-fuel] [Vm-dev] Fwd: Possible collections problem

Max Leske
In reply to this post by Max Leske

>
> I have no idea what is going on here, but given that it seems to be happening
> in the forked image only, and given that there is some indication that it is
> associated with the finalization process, I am inclined to look for something
> that might be associated with objects that somehow reference external resources.
> I don't have anything specific in mind, just a general hunch.

We haven't been able to find any shared resources like files etc.
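
One quick way to double-check for lingering file handles from within the forked image might be something like the following (a minimal sketch only; it assumes StandardFileStream instances cover the external streams of interest):

  "list file streams that still look open in this image; #closed answers true once the file ID is gone"
  | openStreams |
  openStreams := StandardFileStream allInstances reject: [:each | each closed].
  openStreams do: [:each | Transcript show: each printString; cr].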

>
> An image/vm started with #forkSqueak is essentially an exact copy of the original
> object memory, except that the X11 session connected to the display has been
> restarted with a new connection. All of the other objects and external resources,
> including file handles, are the same insofar as Unix fork() semantics support
> them. For example, a Squeak file stream connection to a unix file stream remains
> connected to the same stream at the OS level.
>
> I can't think of any way in which this might affect object serialization, but
> there must be something.
>
> For purposes of debugging the problem, perhaps you could try using a SmartReferenceStream
> for the serialization, and see if the symptoms change.

We did try ReferenceStream and SmartRefStream, both with inconclusive results. SmartRefStream did hang multiple times, so there *might* be a correlation.
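
For concreteness, the swap amounts to roughly the following (a minimal sketch; the FLSerializer convenience message, the file names and the arrayToSerialize variable are assumptions rather than the exact code in use):

  | file ref |
  "Fuel serialization of the graph"
  FLSerializer serialize: arrayToSerialize toFileNamed: 'graph.fuel'.
  "the same graph through SmartRefStream for comparison"
  file := FileStream newFileNamed: 'graph.obj'.
  ref := SmartRefStream on: file.
  [ref nextPut: arrayToSerialize] ensure: [ref close].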

> That might give a clue
> as to whether the issue is associated with traversing the object memory in
> the forked vm/image (hence all serialization methods would have the same
> problem), or if it is different with Fuel versus SmartReferenceStream (hence
> something associated with the objects being serialized).
>
> Interesting problem ... wish I had the answer.
>
> Dave

We've made an interesting (and somewhat disturbing) discovery. To clarify: we're forking a headful image headless. For debugging purposes we now reconnect the fork to the X display. The forked processes still hang from time to time. But we have forks that have been running at 100% CPU for 35 minutes (serialization usually takes between 40 seconds and 2 minutes, depending on load), and when we then look at them via VNC and perform any UI action (like clicking), the image suddenly snaps out of its weird state, finishes serialization within 5 to 10 seconds and terminates normally.

This sounds to us like the image is caught in a loop and needs the UI process to interrupt it. We're going to try to simulate input events and see if that helps. But maybe you have an idea where this might come from? Unfortunately we haven't yet been able to figure out where the image is stuck before the interrupt.

Thanks a lot for your thoughts Dave, I really appreciate the help!

Cheers,
Max

>
>
> On 24.05.2013, at 15:16, Max Leske <[hidden email]> wrote:
>
>> And again I find myself staring at multiple never terminating forks… The workaround did make the situation better but apparently doesn't prevent the problem completely.
>>
>> Max
>>
>>
>> On 23.05.2013, at 14:31, Max Leske <[hidden email]> wrote:
>>
>>> Here I go again… Shortly after sending the last e-mail the problem was back (obviously…). The good news is: I've found a workaround. The bad news: this is the work around:
>>>
>>> [ (Delay forSeconds: 5) wait.
>>> "serialize stuff in child" ] withImageFork: 'spoon'.
>>>
>>> Any ideas? I'm happy to have a workaround but I'd really like to resolve the issue. If, for instance, this were a problem with OSProcess, then the delay (if there's no way around it) should be in OSProcess and not in my code.
>>>
>>> Cheers,
>>> Max
>>>
>>>
>>>
>>> On 22.05.2013, at 19:22, Max Leske <[hidden email]> wrote:
>>>
>>>> Thanks for the suggestions Dave. I wanted to get some more information before replying; took me longer than I thought though.
>>>>
>>>> I was able to pin down the problem around a couple of thousand objects. I'll sketch my findings roughly: I had two collections, both stored in an instance variable. I could serialize them both separately but not together. I could also serialize one of them and a subset of the other. The actual elements didn't seem to matter as much as the number of elements serialized. This worked for example:
>>>>
>>>> arrayToSerialize := {
>>>> coll1 first: 180. "coll1 has 200 elements"
>>>> coll2 }
>>>>
>>>> As did this:
>>>>
>>>> arrayToSerialize := {
>>>> coll1 last: 20.
>>>> coll2 }
>>>>
>>>> While this didn't:
>>>>
>>>> arrayToSerialize := {
>>>> coll1 first: 180. "coll1 has 200 elements"
>>>> coll1 last: 20.
>>>> coll2 }
>>>>
>>>> But then again, this did:
>>>>
>>>> arrayToSerialize := { coll2 }
>>>>
>>>> As did this:
>>>>
>>>> arrayToSerialize := { coll1 }
>>>>
>>>> excluding elements that referenced each other in coll1 and coll2 didn't matter either.
>>>>
>>>>
>>>> I have no clue at all, what might be the problem here. My only (far fetched) guess is that maybe one of the objects was somehow corrupt (although that should have somehow triggered exceptions or at least make serialization impossible in all cases, especially when not forking).
>>>>
>>>> Then, during testing, I messed up my snapshot and had to load a new one and tadaaa: the problem's gone… Just weird… I just hope, this never happens again. If it does, you'll hear from me :)
>>>>
>>>>
>>>>
>>>> What I wrote below now seems obsolete. I'm leaving it in for reference.
>>>>
>>>> Cheers and thanks a lot for your help!
>>>> Max
>>>>
>>>>
>>>> On 16.05.2013, at 14:33, David T. Lewis <[hidden email]> wrote:
>>>>
>>>>> First to check to see if I understand the problem:
>>>>>
>>>>> - You are forking an image using OSProcess, and the failures are occurring in
>>>>> the child VM/image that you created using one of the #forkSqueak methods.
>>>>
>>>> Yes, precisely. I'm forking headless by the way.
>>>>
>>>>>
>>>>> - You are serializing some objects in the child image so that they can be
>>>>> written to a serialized output file. You are doing this in the background
>>>>> child VM/image so that your main image can continue doing whatever it was
>>>>> doing without being impacted by the serializing process.
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>> - If you do this 10 times, it will fail about 4 times out of the 10 on average.
>>>>
>>>> Yes, very roughly.
>>>>
>>>>>
>>>>> From the stack traces, it looks like your VM is hanging up in a normal
>>>>> garbage collection sweep that is happening as a side effect of allocating
>>>>> a new object. The first thing that comes to mind is that your VM is hanging
>>>>> up while trying to allocate more memory than the OS can provide to it, and
>>>>> it appears to be stuck while waiting for memory to be provided by the OS. I'm
>>>>> suggesting this because I've seen similar symptoms when I intentionally try
>>>>> to allocate very large object memories, so perhaps there may be something
>>>>> in your runtime environment (OS settings) that limits available memory.
>>>>> This is a guess only, but it might give you some ideas of where to look.
>>>>
>>>> I used vmstat to look at possible swap (paging) problems. There wasn't a single swap operation and there's ample memory.
>>>>
>>>>>
>>>>> This might be a VM problem, or it might be a symptom of some other problem
>>>>> that just puts the VM and OS into a condition where they are unexpectedly
>>>>> trying to use a very large amount of memory.
>>>>
>>>> That's what I hope, but I'm pretty clueless.
>>>>
>>>>>
>>>>> If possible it would be good to try to rule out the OSProcess forkImage as
>>>>> a possible contributing factor. If you were to do the serialization 10 times
>>>>> in the main image, rather than in the background child image, would it
>>>>> still fail about 4 times out of 10? This could be a factor in memory use.
>>>>
>>>> Serializing without forking always succeeds.
>>>>
>>>>> Although forkSqueak is very memory efficient, if the child and parent
>>>>> images do a lot of different things (such as try to serialize a bunch of
>>>>> stuff in the child image), then the OS will be forced to map in enough
>>>>> real memory to satisfy both of the images.
>>>>
>>>> As pointed out above, there wasn't a single swap. htop also shows swap usage at roughly 20% and there's always memory available.
>>>>
>>>>>
>>>>> Another thing to check, if possible, is to see if the OS is doing a
>>>>> lot of disk swapping at the time of the failure. If you see all of the
>>>>> system memory in use, along with a lot of disk activity, then you are
>>>>> very likely looking at a case of the VM trying to allocate more memory
>>>>> than the system can reasonably provide.
>>>>
>>>> (Again, see above)
>>>> I/O is also pretty low. Nothing special there.
>>>>
>>>>>
>>>>> FWIW, here is what your VM is doing at the time of the "hang up". This
>>>>> involves scanning object memory through memEnd, which might be a very
>>>>> large value for one reason or another:
>>>>>
>>>>> ObjectMemory>>updatePointersInRootObjectsFrom: memStart to: memEnd
>>>>> "update pointers in root objects"
>>>>> | oop |
>>>>> <inline: false>
>>>>> 1 to: rootTableCount do: [:i |
>>>>>         oop := rootTable at: i.
>>>>>         ((self oop: oop isLessThan: memStart)
>>>>>                 or: [self oop: oop isGreaterThanOrEqualTo: memEnd])
>>>>>             ifTrue: ["Note: must not remap the fields of any object twice!"
>>>>>                 "remap this oop only if not in the memory range
>>>>>                 covered below"
>>>>>                 self remapFieldsAndClassOf: oop]]
>>>>
>>>> I was able to look at two forked processes where one didn't terminate (call it 1) and the other (call it 2) did. What I saw was that 1 was doing GCs like crazy right from the start (like the stacks I posted) while 2 showed a lot of other calls too (e.g. interpret()) and spent (in my opinion) much less time in GC.
>>>>
>>>> I tried two additional things:
>>>> 1. after forking I kill the weak finalization process. This seems to have lessened the problem but isn't a fix
>>>> 2. I modified the GC options the following way:
>>>>
>>>> "Limit incremental GC activity to run every 100k allocations"
>>>> SmalltalkImage current vmParameterAt: 5 put: 100000. "allocations between GCs (default: 4000)"
>>>>
>>>> "Limit tenuring threshold to only tenure w/ > 25k survivors"
>>>> SmalltalkImage current vmParameterAt: 6 put: 25000. "tenuring threshold (default: 2000)"
>>>>
>>>> "Tell the VM that we'd rather grow than spinning in tight GC loops"
>>>> SmalltalkImage current gcBiasToGrow: true. "default: false"
>>>>
>>>> "Tell the VM to do a fullGC for good measure if the above growth exceeded x MB"
>>>> SmalltalkImage current gcBiasToGrowLimit: 8*1024*1024. "default: 0"
>>>>
>>>> This did not have any effect (as far as I can tell) on the problem.
>>>>
>>>>>
>>>>> HTH,
>>>>>
>>>>> Dave
>>>>>
>>>>> On Thu, May 16, 2013 at 09:29:42AM +0200, Max Leske wrote:
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I'm forwarding this because I'd like to rule out a VM problem. Short summary:
>>>>>> I fork a squeak image and then serialize objects with Fuel. In roughly 40% of the cases the fork suddenly locks and consumes 100% CPU. The trace I most often see with gdb in that case is the one with
>>>>>> "#0  0x08060453 in updatePointersInRootObjectsFromto ()" at the top.
>>>>>>
>>>>>> The object processed when the lockup occurs is always of class Timestamp, although that doesn't necessarily mean anything. Maybe it's more about the number of objects.
>>>>>>
>>>>>> I'm working on Debian, 32-bit, and I can reproduce the problem with SqueakVM 4.4.7-2364 and 4.0.3-2202 (the newer ones won't run because of glibc). I haven't tried Cog yet.
>>>>>>
>>>>>> I also just checked that the problem occurs even if I don't serialize any timestamps (nor Process, Delay, Monitor, Semaphore; just to be sure).
>>>>>>
>>>>>> So if anyone has a clue as to what might be going on I'd really appreciate the help.
>>>>>>
>>>>>> Cheers,
>>>>>> Max
>>>>>>
>>>>>> Begin forwarded message:
>>>>>>
>>>>>>> From: Mariano Martinez Peck <[hidden email]>
>>>>>>> Subject: Re: [Pharo-fuel] Possible collections problem
>>>>>>> Date: 15. Mai 2013 16:53:10 MESZ
>>>>>>> To: The Fuel Project <[hidden email]>
>>>>>>> Reply-To: The Fuel Project <[hidden email]>
>>>>>>>
>>>>>>> I cannot see anything in particular. Too much GC stuff.
>>>>>>> So I wouldn't spend more time trying to debug. I would try the non-large collections. Then I would try with the latest Cog and the latest StackVM.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 15, 2013 at 11:47 AM, Max Leske <[hidden email]> wrote:
>>>>>>> I've had several forks hanging around just now. Apart from one all of these were locked. I attached gdb and generated the c stack for all of them. Not sure if there's anything really interesting in there although clearly a lot of time is spent in GC and with creation of objects. That doesn't have to mean anything though.
>>>>>>>
>>>>>>> I haven't yet tried your suggestion Mariano.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Max
>>>>>>>
>>>>>>>
>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>>>
>>>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>>>>> #2  0x08060bb0 in incCompBody ()
>>>>>>> #3  0x08065fa7 in incrementalGC ()
>>>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>>>>> #5  0x08069420 in primitiveNew ()
>>>>>>> #6  0x0806de15 in interpret ()
>>>>>>> #7  0x08073dfe in main ()
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>>>
>>>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>>>>> #2  0x08060bb0 in incCompBody ()
>>>>>>> #3  0x08065fa7 in incrementalGC ()
>>>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>>>>> #5  0x08069420 in primitiveNew ()
>>>>>>> #6  0x0806de15 in interpret ()
>>>>>>> #7  0x08073dfe in main ()
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>>>
>>>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>>>>> #2  0x08060bb0 in incCompBody ()
>>>>>>> #3  0x08065fa7 in incrementalGC ()
>>>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>>>>> #5  0x0806fed2 in clone ()
>>>>>>> #6  0x08070095 in primitiveClone ()
>>>>>>> #7  0x0806de15 in interpret ()
>>>>>>> #8  0x08073dfe in main ()
>>>>>>>
>>>>>>>
>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>>>
>>>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>>>>> #2  0x08060bb0 in incCompBody ()
>>>>>>> #3  0x08065fa7 in incrementalGC ()
>>>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>>>>> #5  0x08069270 in primitiveNewWithArg ()
>>>>>>> #6  0x0806de15 in interpret ()
>>>>>>> #7  0x08073dfe in main ()
>>>>>>>
>>>>>>>
>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>> 0xb76f0f68 in select () from /lib/libc.so.6
>>>>>>>
>>>>>>> #0  0xb76f0f68 in select () from /lib/libc.so.6
>>>>>>> #1  0x08070880 in aioPoll ()
>>>>>>> #2  0xb762419e in ?? () from /usr/lib/squeak/4.0.3-2202//so.vm-display-X11
>>>>>>> #3  0x08073595 in ioRelinquishProcessorForMicroseconds ()
>>>>>>> #4  0x08061f24 in primitiveRelinquishProcessor ()
>>>>>>> #5  0x0806de15 in interpret ()
>>>>>>> #6  0x08073dfe in main ()
>>>>>>>
>>>>>>>
>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>>>
>>>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
>>>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
>>>>>>> #2  0x08060bb0 in incCompBody ()
>>>>>>> #3  0x08065fa7 in incrementalGC ()
>>>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
>>>>>>> #5  0x08069420 in primitiveNew ()
>>>>>>> #6  0x0806de15 in interpret ()
>>>>>>> #7  0x08073dfe in main ()
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [Thread debugging using libthread_db enabled]
>>>>>>> 0x08064e7e in markAndTrace ()
>>>>>>>
>>>>>>> #0  0x08064e7e in markAndTrace ()
>>>>>>> #1  0x0806593a in markPhase ()
>>>>>>> #2  0x08065f60 in incrementalGC ()
>>>>>>> #3  0x080661a4 in sufficientSpaceAfterGC ()
>>>>>>> #4  0x0806fed2 in clone ()
>>>>>>> #5  0x08070095 in primitiveClone ()
>>>>>>> #6  0x0806de15 in interpret ()
>>>>>>> #7  0x08073dfe in main ()
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 15.05.2013, at 13:59, Mariano Martinez Peck <[hidden email]> wrote:
>>>>>>>
>>>>>>>> Ok. So, first thing you should try, is to replace the uses of LargeIdentityDictionary with IdentityDictionary. And LargeIdentitySet with IdentitySet.
>>>>>>>> If the problem disappears, then yes, there is something wrong with LargeCollections. If there is a problem with them, try updating VM, since they use a particular primitive.
>>>>>>>> Let us know!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, May 14, 2013 at 9:29 AM, Max Leske <[hidden email]> wrote:
>>>>>>>>
>>>>>>>> On 14.05.2013, at 13:52, Mariano Martinez Peck <[hidden email]> wrote:
>>>>>>>>
>>>>>>>>> Hi Max. Question, are you able to reproduce the problem?
>>>>>>>>
>>>>>>>> Yes, but not "on purpose". The situation usually happens once or twice a day and then with consistent log entries. That's why I want to use gdb the next time it happens.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Apr 30, 2013 at 3:57 PM, Max Leske <[hidden email]> wrote:
>>>>>>>>> Hi guys
>>>>>>>>>
>>>>>>>>> I have a problem serializing a graph. Sometimes (not always) the image will consume +/- 100% CPU and stop responding. I was able to pin the problem down a bit:
>>>>>>>>> - always fails in FLIteratingCluster>>registerIndexesOn: when called from FLFixedObjectCluster with class TimeStamp (this might not actually be relevant but it's consistent)
>>>>>>>>> - the problem *might* be in FLLargeIdentityDictionary>>at:put: (or further up the stack)
>>>>>>>>>
>>>>>>>>> I've done excessive logging to a file but even with flushing after every write the results are not consistent. Sometimes the image locks after leaving #at:put: sometimes it does somewhere in the middle or in #registerIndexesOn: (but remember: the logging might not be precise).
>>>>>>>>>
>>>>>>>>> It's probably not the size of the objects in the cluster (the graph is big but not overly large), since there are other clusters with more objects.
>>>>>>>>>
>>>>>>>>> What I did find is that the #grow operation for HashedCollections can be *very* slow, up to 20 seconds or more, at other times the snapshot runs through within no time.
>>>>>>>>>
>>>>>>>>> So here's my theory: There might be a VM problem with HashedCollections.
>>>>>>>>> Now, the VM is a rather old one and I haven't had the possibility to test this with a newer one (but I'll probably have to). The version is Squeak4.0.3-2202 running on 32-bit Debian Squeeze.
>>>>>>>>>
>>>>>>>>> I'll try some more but if anyone has any ideas I'd be very happy :)
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Max
>>>>>>>>> _______________________________________________
>>>>>>>>> Pharo-fuel mailing list
>>>>>>>>> [hidden email]
>>>>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Mariano
>>>>>>>>> http://marianopeck.wordpress.com
>>>>>>>>> _______________________________________________
>>>>>>>>> Pharo-fuel mailing list
>>>>>>>>> [hidden email]
>>>>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Pharo-fuel mailing list
>>>>>>>> [hidden email]
>>>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mariano
>>>>>>>> http://marianopeck.wordpress.com
>>>>>>>> _______________________________________________
>>>>>>>> Pharo-fuel mailing list
>>>>>>>> [hidden email]
>>>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Pharo-fuel mailing list
>>>>>>> [hidden email]
>>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Mariano
>>>>>>> http://marianopeck.wordpress.com
>>>>>>> _______________________________________________
>>>>>>> Pharo-fuel mailing list
>>>>>>> [hidden email]
>>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pharo-fuel mailing list
>>>>> [hidden email]
>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
>>>>
>>>
>>
>

Re: [Pharo-fuel] [Vm-dev] Fwd: Possible collections problem

David T. Lewis
 
On Thu, May 30, 2013 at 12:24:21PM +0200, Max Leske wrote:

>
> >
> > I have no idea what is going on here, but given that it seems to be happening
> > in the forked image only, and given that there is some indication that it is
> > associated with the finalization process, I am inclined to look for something
> > that might be associated with objects that somehow reference external resources.
> > I don't have anything specific in mind, just a general hunch.
>
> We haven't been able to find any shared resources like files etc.
>
> >
> > An image/vm started with #forkSqueak is essentially an exact copy of the original
> > object memory, except that the X11 session connected to the display has been
> > restarted with a new connection. All of the other objects and external resources,
> > including file handles, are the same insofar as Unix fork() semantics support
> > them. For example, a Squeak file stream connection to a unix file stream remains
> > connected to the same stream at the OS level.
> >
> > I can't think of any way in which this might affect object serialization, but
> > there must be something.
> >
> > For purposes of debugging the problem, perhaps you could try using a SmartReferenceStream
> > for the serialization, and see if the symptoms change.
>
> We did try ReferenceStream and SmartRefStream, both with inconclusive results. SmartRefStream did hang multiple times, so there *might* be a correlation.
>
> > That might give a clue
> > as to whether the issue is associated with traversing the object memory in
> > the forked vm/image (hence all serialization methods would have the same
> > problem), or if it is different with Fuel versus SmartReferenceStream (hence
> > something associated with the objects being serialized).
> >
> > Interesting problem ... wish I had the answer.
> >
> > Dave
>
> We've made an interesting (and somewhat disturbing) discovery. To clarify: we're forking a headful image headless. For debugging purposes we now reconnect the fork to the X display. The forked processes still hang from time to time. But we have forks that have been running at 100% CPU for 35 minutes (serialization usually takes between 40 seconds and 2 minutes, depending on load), and when we then look at them via VNC and perform any UI action (like clicking), the image suddenly snaps out of its weird state, finishes serialization within 5 to 10 seconds and terminates normally.
>
> This sounds to us like the image is caught in a loop and needs the UI process to interrupt it. We're going to try to simulate input events and see if that helps. But maybe you have an idea where this might come from? Unfortunately we haven't yet been able to figure out where the image is stuck before the interrupt.
>
> Thanks a lot for your thoughts Dave, I really appreciate the help!
>

A couple more debugging ideas:

Maybe you can get more clues by looking at a stack dump from the
stuck process. You can ask the VM to print all process stacks on receipt
of a unix signal like this (in this example I'm using SIGUSR1):

  OSProcess accessor setPrintAllStacksOnSigUsr1.
  child := OSProcess thisOSProcess forkHeadlessSqueakAndDoThenQuit: [
      "do the serialization stuff"
      (Delay forSeconds: 10) wait ].

  "wait for child to get stuck"
  (Delay forSeconds: 5) wait.

  "send SIGUSR1 to the forked squeak; you can do this from the unix console of course"
  OSProcess accessor primSendSigusr1To: child pid.
  OSProcess accessor clearPrintAllStacksOnSigUsr1.
  child inspect.

Perhaps the issue is associated with the headless display (X11 disconnected)
and does not have anything to do with the forked image per se. If so, it should
be possible to reproduce the problem in a non-forked image by turning off the
X display while serialization is being done:

  OSProcess thisOSProcess decapitate.
  "Do serialization stuff" (Delay forSeconds: 5) wait.
  OSProcess thisOSProcess recapitate.

Try that about 10 times and see if it ever gets stuck.
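
Spelled out, the ten-run check could look roughly like this (a sketch only; the FLSerializer call and the arrayToSerialize variable stand in for whatever is actually being serialized):

  1 to: 10 do: [:i |
      OSProcess thisOSProcess decapitate.  "disconnect the X display"
      [FLSerializer serialize: arrayToSerialize toFileNamed: 'run-', i printString, '.fuel']
          ensure: [OSProcess thisOSProcess recapitate].  "always reconnect the display"
      Transcript show: 'run ', i printString, ' done'; cr].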

Dave


> Cheers,
> Max
>
> >
> >
> > On 24.05.2013, at 15:16, Max Leske <[hidden email]> wrote:
> >
> >> And again I find myself staring at multiple never terminating forks… The workaround did make the situation better but apparently doesn't prevent the problem completely.
> >>
> >> Max
> >>
> >>
> >> On 23.05.2013, at 14:31, Max Leske <[hidden email]> wrote:
> >>
> >>> Here I go again… Shortly after sending the last e-mail the problem was back (obviously…). The good news is: I've found a workaround. The bad news: this is the work around:
> >>>
> >>> [ (Delay forSeconds: 5) wait.
> >>> "serialize stuff in child" ] withImageFork: 'spoon'.
> >>>
> >>> Any ideas? I'm happy to have a workaround but I'd really like to resolve the issue. If, for instance, this were a problem with OSProcess, then the delay (if there's no way around it) should be in OSProcess and not in my code.
> >>>
> >>> Cheers,
> >>> Max
> >>>
> >>>
> >>>
> >>> On 22.05.2013, at 19:22, Max Leske <[hidden email]> wrote:
> >>>
> >>>> Thanks for the suggestions Dave. I wanted to get some more information before replying; took me longer than I thought though.
> >>>>
> >>>> I was able to pin down the problem around a couple of thousand objects. I'll sketch my findings roughly: I had two collections, both stored in an instance variable. I could serialize them both separately but not together. I could also serialize one of them and a subset of the other. The actual elements didn't seem to matter as much as the number of elements serialized. This worked for example:
> >>>>
> >>>> arrayToSerialize := {
> >>>> coll1 first: 180. "coll1 has 200 elements"
> >>>> coll2 }
> >>>>
> >>>> As did this:
> >>>>
> >>>> arrayToSerialize := {
> >>>> coll1 last: 20.
> >>>> coll2 }
> >>>>
> >>>> While this didn't:
> >>>>
> >>>> arrayToSerialize := {
> >>>> coll1 first: 180. "coll1 has 200 elements"
> >>>> coll1 last: 20.
> >>>> coll2 }
> >>>>
> >>>> But then again, this did:
> >>>>
> >>>> arrayToSerialize := { coll2 }
> >>>>
> >>>> As did this:
> >>>>
> >>>> arrayToSerialize := { coll1 }
> >>>>
> >>>> excluding elements that referenced each other in coll1 and coll2 didn't matter either.
> >>>>
> >>>>
> >>>> I have no clue at all, what might be the problem here. My only (far fetched) guess is that maybe one of the objects was somehow corrupt (although that should have somehow triggered exceptions or at least make serialization impossible in all cases, especially when not forking).
> >>>>
> >>>> Then, during testing, I messed up my snapshot and had to load a new one and tadaaa: the problem's gone… Just weird… I just hope, this never happens again. If it does, you'll hear from me :)
> >>>>
> >>>>
> >>>>
> >>>> What I wrote below now seems obsolete. I'm leaving it in for reference.
> >>>>
> >>>> Cheers and thanks a lot for your help!
> >>>> Max
> >>>>
> >>>>
> >>>> On 16.05.2013, at 14:33, David T. Lewis <[hidden email]> wrote:
> >>>>
> >>>>> First to check to see if I understand the problem:
> >>>>>
> >>>>> - You are forking an image using OSProcess, and the failures are occurring in
> >>>>> the child VM/image that you created using one of the #forkSqueak methods.
> >>>>
> >>>> Yes, precisely. I'm forking headless by the way.
> >>>>
> >>>>>
> >>>>> - You are serializing some objects in the child image so that they can be
> >>>>> written to a serialized output file. You are doing this in the background
> >>>>> child VM/image so that your main image can continue doing whatever it was
> >>>>> doing without being impacted by the serializing process.
> >>>>
> >>>> Yes.
> >>>>
> >>>>>
> >>>>> - If you do this 10 times, it will fail about 4 times out of the 10 on average.
> >>>>
> >>>> Yes, very roughly.
> >>>>
> >>>>>
> >>>>> From the stack traces, it looks like your VM is hanging up in a normal
> >>>>> garbage collection sweep that is happening as a side effect of allocating
> >>>>> a new object. The first thing that comes to mind is that your VM is hanging
> >>>>> up while trying to allocate more memory than the OS can provide to it, and
> >>>>> it appears to be stuck while waiting for memory to be provided by the OS. I'm
> >>>>> suggesting this because I've seen similar symptoms when I intentionally try
> >>>>> to allocate very large object memories, so perhaps there may be something
> >>>>> in your runtime environment (OS settings) that limits available memory.
> >>>>> This is a guess only, but it might give you some ideas of where to look.
> >>>>
> >>>> I used vmstat to look at possible swap (paging) problems. There wasn't a single swap operation and there's ample memory.
> >>>>
> >>>>>
> >>>>> This might be a VM problem, or it might be a symptom of some other problem
> >>>>> that just puts the VM and OS into a condition where they are unexpectedly
> >>>>> trying to use a very large amount of memory.
> >>>>
> >>>> That's what I hope, but I'm pretty clueless.
> >>>>
> >>>>>
> >>>>> If possible it would be good to try to rule out the OSProcess forkImage as
> >>>>> a possible contributing factor. If you were to do the serialization 10 times
> >>>>> in the main image, rather than in the background child image, would it
> >>>>> still fail about 4 times out of 10? This could be a factor in memory use.
> >>>>
> >>>> Serializing without forking always succeeds.
> >>>>
> >>>>> Although forkSqueak is very memory efficient, if the child and parent
> >>>>> images do a lot of different things (such as try to serialize a bunch of
> >>>>> stuff in the child image), then the OS will be forced to map in enough
> >>>>> real memory to satisfy both of the images.
> >>>>
> >>>> As pointed out above, there wasn't a single swap. htop also shows swap usage at roughly 20% and there's always memory available.
> >>>>
> >>>>>
> >>>>> Another thing to check, if possible, is to see if the OS is doing a
> >>>>> lot of disk swapping at the time of the failure. If you see all of the
> >>>>> system memory in use, along with a lot of disk activity, then you are
> >>>>> very likely looking at a case of the VM trying to allocate more memory
> >>>>> than the system can reasonably provide.
> >>>>
> >>>> (Again, see above)
> >>>> I/O is also pretty low. Nothing special there.
> >>>>
> >>>>>
> >>>>> FWIW, here is what your VM is doing at the time of the "hang up". This
> >>>>> involves scanning object memory through memEnd, which might be a very
> >>>>> large value for one reason or another:
> >>>>>
> >>>>> ObjectMemory>>updatePointersInRootObjectsFrom: memStart to: memEnd
> >>>>> "update pointers in root objects"
> >>>>> | oop |
> >>>>> <inline: false>
> >>>>> 1 to: rootTableCount do: [:i |
> >>>>>         oop := rootTable at: i.
> >>>>>         ((self oop: oop isLessThan: memStart)
> >>>>>                 or: [self oop: oop isGreaterThanOrEqualTo: memEnd])
> >>>>>             ifTrue: ["Note: must not remap the fields of any object twice!"
> >>>>>                 "remap this oop only if not in the memory range
> >>>>>                 covered below"
> >>>>>                 self remapFieldsAndClassOf: oop]]
> >>>>
> >>>> I was able to look at two forked processes where one didn't terminate (call it 1) and the other (call it 2) did. What I saw was that 1 was doing GCs like crazy right from the start (like the stacks I posted) while 2 showed a lot of other calls too (e.g. interpret()) and spent (in my opinion) much less time in GC.
> >>>>
> >>>> I tried two additional things:
> >>>> 1. after forking I kill the weak finalization process. This seems to have lessened the problem but isn't a fix
> >>>> 2. I modified the GC options the following way:
> >>>>
> >>>> "Limit incremental GC activity to run every 100k allocations"
> >>>> SmalltalkImage current vmParameterAt: 5 put: 100000. "allocations between GCs (default: 4000)"
> >>>>
> >>>> "Limit tenuring threshold to only tenure w/ > 25k survivors"
> >>>> SmalltalkImage current vmParameterAt: 6 put: 25000. "tenuring threshold (default: 2000)"
> >>>>
> >>>> "Tell the VM that we'd rather grow than spinning in tight GC loops"
> >>>> SmalltalkImage current gcBiasToGrow: true. "default: false"
> >>>>
> >>>> "Tell the VM to do a fullGC for good measure if the above growth exceeded x MB"
> >>>> SmalltalkImage current gcBiasToGrowLimit: 8*1024*1024. "default: 0"
> >>>>
> >>>> This did not have any effect (as far as I can tell) on the problem.
> >>>>
> >>>>>
> >>>>> HTH,
> >>>>>
> >>>>> Dave
> >>>>>
> >>>>> On Thu, May 16, 2013 at 09:29:42AM +0200, Max Leske wrote:
> >>>>>>
> >>>>>> Hi
> >>>>>>
> >>>>>> I'm forwarding this because I'd like to rule out a VM problem. Short summary:
> >>>>>> I fork a squeak image and then serialize objects with Fuel. In roughly 40% of the cases the fork suddenly locks and consumes 100% CPU. The trace I most often see with gdb in that case is the one with
> >>>>>> "#0  0x08060453 in updatePointersInRootObjectsFromto ()" at the top.
> >>>>>>
> >>>>>> The object processed when the lockup occurs is always of class Timestamp, although that doesn't necessarily mean anything. Maybe it's more about the number of objects.
> >>>>>>
> >>>>>> I'm working on Debian, 32-bit, and I can reproduce the problem with SqueakVM 4.4.7-2364 and 4.0.3-2202 (the newer ones won't run because of glibc). I haven't tried Cog yet.
> >>>>>>
> >>>>>> I also just checked that the problem occurs even if I don't serialize any timestamps (nor Process, Delay, Monitor, Semaphore; just to be sure).
> >>>>>>
> >>>>>> So if anyone has a clue as to what might be going on I'd really appreciate the help.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Max
> >>>>>>
> >>>>>> Begin forwarded message:
> >>>>>>
> >>>>>>> From: Mariano Martinez Peck <[hidden email]>
> >>>>>>> Subject: Re: [Pharo-fuel] Possible collections problem
> >>>>>>> Date: 15. Mai 2013 16:53:10 MESZ
> >>>>>>> To: The Fuel Project <[hidden email]>
> >>>>>>> Reply-To: The Fuel Project <[hidden email]>
> >>>>>>>
> >>>>>>> I cannot see anything in particular. Too much GC stuff.
> >>>>>>> So I wouldn't spend more time trying to debug. I would try the non-large collections. Then I would try with the latest Cog and the latest StackVM.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, May 15, 2013 at 11:47 AM, Max Leske <[hidden email]> wrote:
> >>>>>>> I've had several forks hanging around just now. Apart from one all of these were locked. I attached gdb and generated the c stack for all of them. Not sure if there's anything really interesting in there although clearly a lot of time is spent in GC and with creation of objects. That doesn't have to mean anything though.
> >>>>>>>
> >>>>>>> I haven't yet tried your suggestion Mariano.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Max
> >>>>>>>
> >>>>>>>
> >>>>>>> [Thread debugging using libthread_db enabled]
> >>>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>>>
> >>>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
> >>>>>>> #2  0x08060bb0 in incCompBody ()
> >>>>>>> #3  0x08065fa7 in incrementalGC ()
> >>>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
> >>>>>>> #5  0x08069420 in primitiveNew ()
> >>>>>>> #6  0x0806de15 in interpret ()
> >>>>>>> #7  0x08073dfe in main ()
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> [Thread debugging using libthread_db enabled]
> >>>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>>>
> >>>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
> >>>>>>> #2  0x08060bb0 in incCompBody ()
> >>>>>>> #3  0x08065fa7 in incrementalGC ()
> >>>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
> >>>>>>> #5  0x08069420 in primitiveNew ()
> >>>>>>> #6  0x0806de15 in interpret ()
> >>>>>>> #7  0x08073dfe in main ()
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> [Thread debugging using libthread_db enabled]
> >>>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>>>
> >>>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
> >>>>>>> #2  0x08060bb0 in incCompBody ()
> >>>>>>> #3  0x08065fa7 in incrementalGC ()
> >>>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
> >>>>>>> #5  0x0806fed2 in clone ()
> >>>>>>> #6  0x08070095 in primitiveClone ()
> >>>>>>> #7  0x0806de15 in interpret ()
> >>>>>>> #8  0x08073dfe in main ()
> >>>>>>>
> >>>>>>>
> >>>>>>> [Thread debugging using libthread_db enabled]
> >>>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>>>
> >>>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
> >>>>>>> #2  0x08060bb0 in incCompBody ()
> >>>>>>> #3  0x08065fa7 in incrementalGC ()
> >>>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
> >>>>>>> #5  0x08069270 in primitiveNewWithArg ()
> >>>>>>> #6  0x0806de15 in interpret ()
> >>>>>>> #7  0x08073dfe in main ()
> >>>>>>>
> >>>>>>>
> >>>>>>> [Thread debugging using libthread_db enabled]
> >>>>>>> 0xb76f0f68 in select () from /lib/libc.so.6
> >>>>>>>
> >>>>>>> #0  0xb76f0f68 in select () from /lib/libc.so.6
> >>>>>>> #1  0x08070880 in aioPoll ()
> >>>>>>> #2  0xb762419e in ?? () from /usr/lib/squeak/4.0.3-2202//so.vm-display-X11
> >>>>>>> #3  0x08073595 in ioRelinquishProcessorForMicroseconds ()
> >>>>>>> #4  0x08061f24 in primitiveRelinquishProcessor ()
> >>>>>>> #5  0x0806de15 in interpret ()
> >>>>>>> #6  0x08073dfe in main ()
> >>>>>>>
> >>>>>>>
> >>>>>>> [Thread debugging using libthread_db enabled]
> >>>>>>> 0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>>>
> >>>>>>> #0  0x08060453 in updatePointersInRootObjectsFromto ()
> >>>>>>> #1  0x08060a77 in mapPointersInObjectsFromto ()
> >>>>>>> #2  0x08060bb0 in incCompBody ()
> >>>>>>> #3  0x08065fa7 in incrementalGC ()
> >>>>>>> #4  0x080661a4 in sufficientSpaceAfterGC ()
> >>>>>>> #5  0x08069420 in primitiveNew ()
> >>>>>>> #6  0x0806de15 in interpret ()
> >>>>>>> #7  0x08073dfe in main ()
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> [Thread debugging using libthread_db enabled]
> >>>>>>> 0x08064e7e in markAndTrace ()
> >>>>>>>
> >>>>>>> #0  0x08064e7e in markAndTrace ()
> >>>>>>> #1  0x0806593a in markPhase ()
> >>>>>>> #2  0x08065f60 in incrementalGC ()
> >>>>>>> #3  0x080661a4 in sufficientSpaceAfterGC ()
> >>>>>>> #4  0x0806fed2 in clone ()
> >>>>>>> #5  0x08070095 in primitiveClone ()
> >>>>>>> #6  0x0806de15 in interpret ()
> >>>>>>> #7  0x08073dfe in main ()
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 15.05.2013, at 13:59, Mariano Martinez Peck <[hidden email]> wrote:
> >>>>>>>
> >>>>>>>> Ok. So, first thing you should try, is to replace the uses of LargeIdentityDictionary with IdentityDictionary. And LargeIdentitySet with IdentitySet.
> >>>>>>>> If the problem disappears, then yes, there is something wrong with LargeCollections. If there is a problem with them, try updating VM, since they use a particular primitive.
> >>>>>>>> Let us know!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, May 14, 2013 at 9:29 AM, Max Leske <[hidden email]> wrote:
> >>>>>>>>
> >>>>>>>> On 14.05.2013, at 13:52, Mariano Martinez Peck <[hidden email]> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Max. Question, are you able to reproduce the problem?
> >>>>>>>>
> >>>>>>>> Yes, but not "on purpose". The situation usually happens once or twice a day and then with consistent log entries. That's why I want to use gdb the next time it happens.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Apr 30, 2013 at 3:57 PM, Max Leske <[hidden email]> wrote:
> >>>>>>>>> Hi guys
> >>>>>>>>>
> >>>>>>>>> I have a problem serializing a graph. Sometimes (not always) the image will consume +/- 100% CPU and stop responding. I was able to pin the problem down a bit:
> >>>>>>>>> - always fails in FLIteratingCluster>>registerIndexesOn: when called from FLFixedObjectCluster with class TimeStamp (this might not actually be relevant but it's consistent)
> >>>>>>>>> - the problem *might* be in FLLargeIdentityDictionary>>at:put: (or further up the stack)
> >>>>>>>>>
> >>>>>>>>> I've done excessive logging to a file but even with flushing after every write the results are not consistent. Sometimes the image locks after leaving #at:put: sometimes it does somewhere in the middle or in #registerIndexesOn: (but remember: the logging might not be precise).
> >>>>>>>>>
> >>>>>>>>> It's probably not the size of the objects in the cluster (the graph is big but not overly large), since there are other clusters with more objects.
> >>>>>>>>>
> >>>>>>>>> What I did find is that the #grow operation for HashedCollections can be *very* slow, up to 20 seconds or more, at other times the snapshot runs through within no time.
> >>>>>>>>>
> >>>>>>>>> So here's my theory: There might be a VM problem with HashedCollections.
> >>>>>>>>> Now, the VM is a rather old one and I haven't had the possibility to test this with a newer one (but I'll probably have to). The version is Squeak4.0.3-2202 running on 32-bit Debian Squeeze.
> >>>>>>>>>
> >>>>>>>>> I'll try some more but if anyone has any ideas I'd be very happy :)
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Max
> >>>>>>>>> _______________________________________________
> >>>>>>>>> Pharo-fuel mailing list
> >>>>>>>>> [hidden email]
> >>>>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Mariano
> >>>>>>>>> http://marianopeck.wordpress.com
> >>>>>>>>> _______________________________________________
> >>>>>>>>> Pharo-fuel mailing list
> >>>>>>>>> [hidden email]
> >>>>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> Pharo-fuel mailing list
> >>>>>>>> [hidden email]
> >>>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Mariano
> >>>>>>>> http://marianopeck.wordpress.com
> >>>>>>>> _______________________________________________
> >>>>>>>> Pharo-fuel mailing list
> >>>>>>>> [hidden email]
> >>>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
> >>>>>>>
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> Pharo-fuel mailing list
> >>>>>>> [hidden email]
> >>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Mariano
> >>>>>>> http://marianopeck.wordpress.com
> >>>>>>> _______________________________________________
> >>>>>>> Pharo-fuel mailing list
> >>>>>>> [hidden email]
> >>>>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
> >>>>>>
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Pharo-fuel mailing list
> >>>>> [hidden email]
> >>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-fuel
> >>>>
> >>>
> >>
> >