Playing with the VM Limits, crash on many processes


Playing with the VM Limits, crash on many processes

Holger Freyther
Hi all,

I created the attached torture test to get a feeling for how many processes I
can create and whether my planned approach would work. With about 100,000 processes
I ran into a crash inside the GC. A GST compiled without support for the
generational GC does not seem to crash.

Is this test just hitting the limit on the number of objects that the GC can
properly manage? I will now build with GC_DEBUG and see what we are hitting.

regards
        holger





With the generational GC:
#2  0x0013df6e in abort () at abort.c:92
#3  0x007e1615 in oldspace_sigsegv_handler (fault_address=0x10, serious=0) at
../../libgst/oop.c:942
#4  0x008331b4 in sigsegv_handler (sig=11, sc=...) at
../../../sigsegv/src/handler-unix.c:134
#5  <signal handler called>
#6  0x007e1242 in scanned_fields_in (object=<value optimized out>,
flags=<value optimized out>)
    at ../../libgst/oop.c:1940
#7  0x007e286d in _gst_copy_an_oop (oop=<value optimized out>) at
../../libgst/oop.c:2079
#8  0x007e2b58 in scan_grey_pages () at ../../libgst/oop.c:1847
#9  0x007e38fc in copy_oops () at ../../libgst/oop.c:1755
#10 _gst_scavenge () at ../../libgst/oop.c:1229
#11 0x007e3e5c in _gst_alloc_obj (size=20, p_oop=0xbf8abd6c) at
../../libgst/oop.c:772

Without the generational GC:



The backtrace



_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

Attachment: ParallelTtest.st (488 bytes)

Re: Playing with the VM Limits, crash on many processes

Paolo Bonzini-2
On 11/20/2010 06:30 PM, Holger Hans Peter Freyther wrote:
> I created the attached torture test to get a feeling for how many processes I
> can create and whether my planned approach would work. With about 100,000 processes
> I ran into a crash inside the GC. A GST compiled without support for the
> generational GC does not seem to crash.
>
> Is this test just hitting the limit on the number of objects that the GC can
> properly manage?

It's certainly a heavy stress test, but it shouldn't crash the VM.  Thanks!

Paolo


Re: Playing with the VM Limits, crash on many processes

Holger Freyther
On 11/20/2010 06:32 PM, Paolo Bonzini wrote:

>
> It's certainly a heavy stress test, but it shouldn't crash the VM.  Thanks!
>
GC_DEBUG didn't help, so I am now running under valgrind and see some issues in
dict.c. It appears that init_runtime_objects is called before
_gst_init_dictionary is called, or at least before the dictionary is initialized. I
am not sure what the right way to resolve this is, though.








Re: Playing with the VM Limits, crash on many processes

Holger Freyther
On 11/20/2010 07:14 PM, Holger Hans Peter Freyther wrote:
> On 11/20/2010 06:32 PM, Paolo Bonzini wrote:
>
>>
>> It's certainly a heavy stress test, but it shouldn't crash the VM.  Thanks!
>>
> GC_DEBUG didn't help, so I am now running under valgrind and see some issues in
> dict.c. It appears that init_runtime_objects is called before
> _gst_init_dictionary is called, or at least before the dictionary is initialized. I
> am not sure what the right way to resolve this is, though.

Okay, not quite true, but somehow it is not initialized. I will keep
digging to get valgrind working on gst...


Re: Playing with the VM Limits, crash on many processes

Paolo Bonzini-2
In reply to this post by Holger Freyther
On 11/20/2010 06:30 PM, Holger Hans Peter Freyther wrote:
> I created the attached torture test to get a feeling for how many processes I
> can create and whether my planned approach would work. With about 100,000 processes
> I ran into a crash inside the GC. A GST compiled without support for the
> generational GC does not seem to crash.

How much time does it take to crash?  Does it happen even without the
printNl?

Paolo


Re: Playing with the VM Limits, crash on many processes

Holger Freyther
On 11/21/2010 11:48 AM, Paolo Bonzini wrote:
> On 11/20/2010 06:30 PM, Holger Hans Peter Freyther wrote:
>> I created the attached torture test to get a feeling for how many processes I
>> can create and whether my planned approach would work. With about 100,000 processes
>> I ran into a crash inside the GC. A GST compiled without support for the
>> generational GC does not seem to crash.
>
> How much time does it take to crash?  Does it happen even without the printNl?

It crashes without the printNl, but it needs the call to Delay wait. I can create
the Delay once per process and it still crashes. It also needs a lot
of processes to force the crash. It crashes within 30 seconds or so.

I am going to try your approach with GDB, watchpoints and continuing a couple
of times and see if I can get it to crash and have gdb right there. I might
also try reverse debugging...



Re: Playing with the VM Limits, crash on many processes

Holger Freyther
On 11/21/2010 01:14 PM, Holger Hans Peter Freyther wrote:

> I am going to try your approach with GDB, watchpoints and continuing a couple
> of times and see if I can get it to crash and have gdb right there. I might
> also try reverse debugging...

Reverse debugging works better than it did in GDB 7.0 and 7.1, but it is too slow
to be usable. I will just let it run anyway, to see if it can be helpful.


Re: Playing with the VM Limits, crash on many processes

Paolo Bonzini-2
In reply to this post by Holger Freyther
On 11/21/2010 01:14 PM, Holger Hans Peter Freyther wrote:

> On 11/21/2010 11:48 AM, Paolo Bonzini wrote:
>> On 11/20/2010 06:30 PM, Holger Hans Peter Freyther wrote:
>>> I created the attached torture test to get a feeling for how many processes I
>>> can create and whether my planned approach would work. With about 100,000 processes
>>> I ran into a crash inside the GC. A GST compiled without support for the
>>> generational GC does not seem to crash.
>>
>> How much time does it take to crash?  Does it happen even without the printNl?
>
> It crashes without the printNl, but it needs the call to Delay wait. I can create
> the Delay once per process and it still crashes. It also needs a lot
> of processes to force the crash. It crashes within 30 seconds or so.

Ok, reproduced.  Here is a more deterministic testcase:

Object subclass: Scheduler [
     MutexSem := Semaphore forMutualExclusion.
     TimeoutSem := Semaphore new.

     Scheduler class >> step [ TimeoutSem wait ]
     Scheduler class >> kick [ MutexSem critical: [TimeoutSem signal] ]
]

Eval [
     [[Scheduler step] repeat] forkAt: Processor userInterruptPriority.
     1 to: 100000 do: [:thread_nr |
         [ | id | id := thread_nr.
             id \\ 1000 == 0 ifTrue: [id printNl].
             20 timesRepeat: [Scheduler kick].
         ] fork.
     ].

     Semaphore new wait.
]

where the Scheduler class is a heavily butchered version of Delay. :)

Interestingly, inlining the two methods in the Eval makes the testcase
work, so it's probably something related to contexts.

Paolo


Re: Playing with the VM Limits, crash on many processes

Paolo Bonzini-2
On 11/21/2010 04:10 PM, Paolo Bonzini wrote:
> where the Scheduler class is a heavily butchered version of Delay. :)
>
> Interestingly, inlining the two methods in the Eval makes the testcase
> work, so it's probably something related to contexts.

It's a memory corruption due to running out-of-memory and not detecting it.

FWIW, here are my debugging steps:

1) after some fruitless attempts to get to the point of corruption with
gdb, I added this patch

diff --git a/libgst/oop.c b/libgst/oop.c
index f5b885b..4c15f57 100644
--- a/libgst/oop.c
+++ b/libgst/oop.c
@@ -1076,6 +1076,7 @@ _gst_global_gc (int next_allocation)
    int old_limit;

    _gst_mem.numGlobalGCs++;
+  _gst_mem.numScavenges = 0;

    old_limit = _gst_mem.old->heap_limit;
    _gst_mem.old->heap_limit = 0;
@@ -2032,10 +2033,10 @@ _gst_copy_an_oop (OOP oop)
        obj = OOP_TO_OBJ (oop);
        pData = (OOP *) obj;

-#if defined(GC_DEBUG_OUTPUT)
-      printf (">Copy ");
+      if  (_gst_mem.numGlobalGCs == 20 && _gst_mem.numScavenges == 249) {
+      printf (">Copy %p ", ((gst_object)0x7ffff6dc87a0)->objClass);
        _gst_display_oop (oop);
-#endif
+      }

  #if defined (GC_DEBUGGING)
        if UNCOMMON (!IS_INT (obj->objSize))

I easily got the numbers (20/249/0x7ffff6dc87a0) from the breakpoints I
was using in gdb.  The debugging output wasn't too long and had

 >Copy 0x7fc75b361920 0x7fc75f495300   0x7ffff7268010  ...
 >Copy (nil)   ...

which showed that OOP 0x7fc75f495300 was being copied at the time of the
corruption.
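
The trick in the patch above is that resetting numScavenges at every global GC
makes the pair (global GC count, scavenge count) name one specific moment in
the run, so output can be limited to that moment. A minimal standalone sketch
of the same idea follows; the struct and function names are hypothetical and
are not part of libgst.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical counters, standing in for the _gst_mem fields in the patch. */
struct gc_counters
{
  unsigned long global_gcs;   /* number of global GCs so far */
  unsigned long scavenges;    /* scavenges since the last global GC */
};

/* Called once per global GC: bump the outer counter, restart the inner one. */
void
note_global_gc (struct gc_counters *c)
{
  assert (c != NULL);
  c->global_gcs++;
  c->scavenges = 0;
}

/* Called once per scavenge. */
void
note_scavenge (struct gc_counters *c)
{
  assert (c != NULL);
  c->scavenges++;
}

/* Print a trace line only at the chosen (global GC, scavenge) moment.
   Returns 1 if something was printed, 0 otherwise. */
int
trace_copy (const struct gc_counters *c, unsigned long want_gc,
            unsigned long want_scavenge, const void *oop)
{
  if (c->global_gcs != want_gc || c->scavenges != want_scavenge)
    return 0;
  fprintf (stderr, ">Copy %p\n", (void *) oop);
  return 1;
}
```

With want_gc = 20 and want_scavenge = 249 this prints only during the one
scavenge where the corruption was observed, which keeps the output short.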


2) I put a breakpoint on the call to _gst_display_oop, conditional on
printing the OOP that I got from the debugging output.


3) At the breakpoint, I put a watchpoint on *(void **)0x7ffff6dc87a0.  I
remembered hardware watchpoints didn't work so I used a software one.
HW watchpoints indeed didn't work because the corruption happened in
kernel mode (due to one mmap overwriting another):

Watchpoint 3: *(void **)0x7ffff6dc87a0

Old value = (void *) 0x23
New value = (void *) 0x0
0x0000003bda0dfffa in mmap64 () at ../sysdeps/unix/syscall-template.S:82
82 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0  0x0000003bda0dfffa in mmap64 ()
#1  0x00007ffff7d684ac in anon_mmap_commit (base=<value optimized out>,
     size=<value optimized out>) at ../../libgst/sysdep/posix/mem.c:227
#2  0x00007ffff7d6684b in heap_sbrk_internal (hdp=0x7fffd6d82000,
     size=262144) at ../../libgst/heap.c:235
#3  0x00007ffff7d66692 in _gst_heap_sbrk (hd=0x7fffd6d83000 "@",
     size=262144) at ../../libgst/heap.c:187
(gdb) up 3
#3  0x00007ffff7d66692 in _gst_heap_sbrk (hd=0x7fffd6d83000 "@",
     size=262144) at ../../libgst/heap.c:187
187  return heap_sbrk_internal (hdp, size);
(gdb) p hdp
$5 = (struct heap *) 0x7fffd6d82000
(gdb) p *$
$6 = {areasize = 536870912, base = 0x7fffd6d82000 "",
   breakval = 0x7ffff6dc3000 "",  top = 0x7ffff6dc3000 ""}
(gdb) p hdp->breakval - hdp->base
$7 = 537137152

So the heap had overflowed: breakval - base is 537137152 bytes, which exceeds
areasize (536870912) by 266240 bytes.
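
The watchpoint firing inside mmap64 with the value going from 0x23 to 0x0 is
what one would expect if a fixed-address mapping was placed over pages that
were already in use. The following small program is only an illustration of
that kernel behaviour, not GST code; it assumes Linux or another system with
MAP_ANONYMOUS and MAP_FIXED, and the function name is made up.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write MARKER into a fresh page, map a second anonymous page on top of it
   with MAP_FIXED, and return what the first word reads as afterwards.
   Returns -1 if either mmap call fails. */
int
value_after_fixed_remap (int marker)
{
  size_t page = (size_t) sysconf (_SC_PAGESIZE);

  /* First mapping: stands in for memory that belongs to someone else. */
  int *victim = mmap (NULL, page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (victim == MAP_FAILED)
    return -1;
  victim[0] = marker;
  assert (victim[0] == marker);

  /* Second mapping at the same address.  MAP_FIXED silently replaces the
     old pages with new zero-filled ones; no user-space store ever runs. */
  void *again = mmap (victim, page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  if (again == MAP_FAILED)
    {
      munmap (victim, page);
      return -1;
    }

  int seen = victim[0];
  munmap (victim, page);
  return seen;
}
```

Calling value_after_fixed_remap (0x23) returns 0: the old contents are gone
even though the program never wrote to the page a second time.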

Trivial patch follows:

diff --git a/libgst/heap.c b/libgst/heap.c
index 25d7f50..1f64fb2 100644
--- a/libgst/heap.c
+++ b/libgst/heap.c
@@ -218,6 +218,18 @@ heap_sbrk_internal (struct heap * hdp,
      }
    else if (hdp->breakval + size > hdp->top)
      {
+      if (hdp->breakval - hdp->base + size > hdp->areasize)
+        {
+          if (hdp->breakval - hdp->base == hdp->areasize)
+            {
+              /* FIXME: a library should never exit!  */
+              fprintf (stderr, "gst: out of memory allocating %d bytes\n",
+                       size);
+              exit (1);
+            }
+          size = hdp->areasize - (hdp->breakval - hdp->base);
+        }
+
        moveto = PAGE_ALIGN (hdp->breakval + size);
        mapbytes = moveto - hdp->top;
        mapto = _gst_osmem_commit (hdp->top, mapbytes);
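
For the FIXME in the patch, one possible direction is to report the failure to
the caller instead of calling exit. The sketch below shows the same bounds
check on a toy heap; struct toy_heap and toy_heap_grow are hypothetical names
that only borrow the field names seen in the gdb output, and this is not the
libgst implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct heap; only the fields needed for the check. */
struct toy_heap
{
  size_t areasize;   /* total bytes reserved for this heap */
  char *base;        /* start of the reserved area */
  char *breakval;    /* current end of the used part */
};

/* Grow the heap by up to SIZE bytes without ever passing base + areasize.
   On success, return the previous break and store the number of bytes
   actually granted in *GRANTED (it may be less than SIZE).  When the heap
   is already full, store 0 and return NULL instead of exiting. */
char *
toy_heap_grow (struct toy_heap *h, size_t size, size_t *granted)
{
  size_t used = (size_t) (h->breakval - h->base);
  assert (used <= h->areasize);

  /* Compare against the remaining room rather than computing used + size,
     so a huge request cannot wrap around. */
  size_t room = h->areasize - used;
  if (room == 0)
    {
      *granted = 0;
      return NULL;
    }
  if (size > room)
    size = room;

  char *old_break = h->breakval;
  h->breakval += size;
  *granted = size;
  return old_break;
}
```

As in the patch, a request that does not fit is clamped to whatever is left,
and only a completely full heap is treated as out of memory; the caller then
decides how to handle the NULL.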

Paolo


Re: Playing with the VM Limits, crash on many processes

Holger Freyther
On 11/21/2010 08:14 PM, Paolo Bonzini wrote:
> On 11/21/2010 04:10 PM, Paolo Bonzini wrote:
>> where the Scheduler class is a heavily butchered version of Delay. :)
>>
>> Interestingly, inlining the two methods in the Eval makes the testcase
>> work, so it's probably something related to contexts.
>
> It's a memory corruption due to running out-of-memory and not detecting it.
>

Thanks, I am just back from a concert. I was suspecting OOM as well and have now
created a testcase which allocates a BigObject and crashes too, but you
were faster.

What do you propose as a proper resolution? Is there some kind of exception
and Context we could pre-allocate and then raise it? Maybe reserve some more
heap for the OOM case?
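
To make the "reserve some more heap" idea concrete, here is a rough sketch of
an accounting scheme where normal allocations stop short of the real limit and
the last part is only released once an out-of-memory condition is being
handled. All names are hypothetical and nothing here reflects how GST
actually works.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical accounting for a heap with an emergency reserve. */
struct reserve_heap
{
  size_t areasize;    /* hard limit in bytes */
  size_t reserve;     /* bytes held back for out-of-memory handling */
  size_t used;        /* bytes handed out so far */
  bool emergency;     /* true while the OOM handler is running */
};

/* Account for an allocation of SIZE bytes.  Normal requests may only use
   areasize - reserve; once EMERGENCY is set, the reserve is available too.
   Returns false when the request does not fit. */
bool
reserve_heap_alloc (struct reserve_heap *h, size_t size)
{
  assert (h->reserve <= h->areasize);

  size_t limit = h->emergency ? h->areasize : h->areasize - h->reserve;
  if (h->used > limit || size > limit - h->used)
    return false;

  h->used += size;
  return true;
}
```

A caller would set emergency when reserve_heap_alloc first fails, raise the
error using memory from the reserve, and clear the flag again once enough has
been freed.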

z.
