Hi Rob,
On Mon, Jul 19, 2010 at 3:10 AM, Rob Withers <[hidden email]> wrote:
Eliot,
While the heap corruption might be a bug in Cog, it might also be heap corruption from external code (e.g. objects passed through FFI calls to external code that writes past those objects' bounds).
There's a leak checker in Cog (see the -leakcheck argument in platforms/unix/vm/sqUnixMain.c) that can help you localise this. It's best to distrust your own code before you distrust the VM, simply because assuming it's the VM can blind you to potential bugs in your own code or in other parts of the system. The goal here is a reproducible case. If you get a reproducible case that doesn't use any external code, then the bug is in the VM.
HTH,
Eliot
Hi Eliot,
Got home from my new job and started looking into this. This morning I found that I had a button bar that was stepping, and part of the step was a Smalltalk garbageCollect to force collection before checking for instances. It may be something I don't need to do anymore; however, it helps expose this seg fault. Both stack dumps were in the garbageCollect. I removed the button bar, uploaded the image, and ran it. CPU% dropped from 33% to 2%. I let it run all day. At some point it exited, for an unknown reason; it was gone when I returned tonight.
I have reinstated the button bar, to help this bug occur, and uploaded it to the server.
Now I just need to enable -leakcheck. From sqUnixMain.c it looks like it takes an argument. What is that argument?
Thanks,
Rob
From: [hidden email]
Sent: Monday, July 19, 2010 1:47 PM
To: [hidden email]
Cc: [hidden email]
Subject: Re: Cog segmentation fault on Linux
Hi Eliot,
(I forgot to CC the mailing list; added.)
I made a few things happen.
First, I found that the argument to -leakcheck is an integer that gets masked to determine whether to leak check an incremental or full GC. I made the call with '-leakcheck 7'.
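As a rough illustration of what "masked" means here, the sketch below tests individual bits of the integer to decide which kind of collection gets checked. The bit names and values are hypothetical and chosen only for this example; the actual assignments are defined in the VM sources and may differ.

```c
/* Hypothetical bit assignments, for illustration only. The real values
   are defined in the VM sources and may not match these. */
enum {
    LEAK_CHECK_FULL_GC        = 1 << 0,  /* assumed: check after a full GC */
    LEAK_CHECK_INCREMENTAL_GC = 1 << 1   /* assumed: check after an incremental GC */
};

/* Returns nonzero if the mask requests a leak check for the given kind
   of collection, i.e. if the corresponding bit is set. */
static int should_check(int mask, int gc_kind_bit)
{
    return (mask & gc_kind_bit) != 0;
}
```

Under these assumed bit values, a mask of 7 has both bits set, so should_check(7, LEAK_CHECK_FULL_GC) and should_check(7, LEAK_CHECK_INCREMENTAL_GC) are both nonzero, while a mask of 2 would select only the incremental case.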
Second, I added a -leakcheck section to the COGVM section:
#if COGVM
  else if (!strcmp(argv[0], "-leakcheck")) {
    extern sqInt checkForLeaks;
    checkForLeaks = atoi(argv[1]);
    return 2;
  }
I compiled and ran it. I am unsure where any output from the leak checker goes. If it goes to stdout or stderr, I forget the magic incantation to redirect these to files. I think it is '2> stderr.txt 1> stdout.txt' for /bin/sh. Is that right?
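The handler quoted above returns 2, which reads as "this option consumed two argv entries" (the flag and its value). A minimal standalone sketch of that pattern follows, under stated assumptions: parse_vm_option and parse_all_options are hypothetical names, the local checkForLeaks is a stand-in for the VM's extern sqInt variable, and the argc check is an addition that the quoted snippet does not have.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the VM's checkForLeaks (declared extern sqInt in the quoted
   snippet); using a plain long here is an assumption for this sketch. */
static long checkForLeaks = 0;

/* Handle the option at argv[0]. Returns the number of argv entries
   consumed, or 0 if the option is unknown or its value is missing.
   Hypothetical helper, not part of the VM. */
static int parse_vm_option(int argc, char **argv)
{
    if (argc >= 2 && !strcmp(argv[0], "-leakcheck")) {
        checkForLeaks = atoi(argv[1]);
        return 2;   /* the flag plus its value */
    }
    return 0;
}

/* Walk the argument vector, advancing by whatever each handler consumed.
   Returns the index of the first argument that is not a known option. */
static int parse_all_options(int argc, char **argv)
{
    int i = 0;
    while (i < argc) {
        int used = parse_vm_option(argc - i, argv + i);
        if (used == 0)
            break;
        i += used;
    }
    return i;
}
```

Calling parse_all_options on the arguments "-leakcheck", "7", "echat-server.image" would set checkForLeaks to 7 and return 2, the index of the image name. A flag that returns 1, like -noheartbeat elsewhere in this thread, would fit the same loop.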
When I run it, it runs (the new image with the stepping button bar, which takes 30% CPU). When I send 'kill -USR1 <pid>', it seg faults every time. This may or may not be the original seg fault; could it be the leak checker?
The only stuff I am doing that calls out of the image is socket stuff. The VM may or may not be in the middle of such a call when it seg faults. I will work to turn off all the socket activity and see if it still seg faults.
Am I activating the leak checker correctly?
Regards,
Rob
From: [hidden email]
Sent: Monday, July 19, 2010 9:22 PM
To: [hidden email]
Subject: Re: [Vm-dev] Re: Cog segmentation fault on Linux
Hey Eliot,
It looks like this command line argument, -leakcheck, is for the STACKVM, not the COGVM. Is this an issue?
Thanks,
Rob
#if STACKVM
  else if (!strcmp(argv[0], "-eden")) {
    extern sqInt desiredEdenBytes;
    desiredEdenBytes = strtobkm(argv[1]);
    return 2;
  }
  else if (!strcmp(argv[0], "-leakcheck")) {
    extern sqInt checkForLeaks;
    checkForLeaks = atoi(argv[1]);
    return 2;
  }
  else if (!strcmp(argv[0], "-stackpages")) {
    extern sqInt desiredNumStackPages;
    desiredNumStackPages = atoi(argv[1]);
    return 2;
  }
  else if (!strcmp(argv[0], "-breaksel")) {
    extern void setBreakSelector(char *);
    setBreakSelector(argv[1]);
    return 2;
  }
  else if (!strcmp(argv[0], "-noheartbeat")) {
    extern sqInt suppressHeartbeatFlag;
    suppressHeartbeatFlag = 1;
    return 1;
  }
#endif /* STACKVM */
From: [hidden email]
Sent: Monday, July 19, 2010 9:19 PM
To: [hidden email]
Cc: [hidden email]
Subject: [Vm-dev] Re: Cog segmentation fault on
Linux
Hey Eliot,
Here is what I have found. I never saw any output from the leak checker. I was able to generate seg faults in the original echat-server.image, which is doing socket stuff, AND I was able to generate them in a looping GC image. In the original echat-server image, I have a listening socket, I have a "Vat" which has installed a subclass of Process and is looping, and I am running the RFB server. In the looping GC image, I turned off my listening socket, the Vat is not running, and I stopped the RFB server. I run headless and I supply a script to run. I supplied the following script:
[Smalltalk garbageCollect] repeat.
It took a few attempts (8 attempts) but I eventually seg faulted.
I have attached the logfiles for both the echat-server scenario and the looping GC scenario. Search for #SIGUSR1 for each process dump section. Search for #SEGFAULT to find the section at the bottom that seg faulted. Search for #PREVSTACK to find Processes in the SEGFAULT sections that have garbage in them and what the corresponding stack in a previous healthy section was doing.
Note that of these corrupted Processes, #PREVSTACK (Delay class>handleTimerEvent) and #PREVSTACK (EventSensor>eventTickler) are bad in both scenarios.
HTH,
Rob
From: [hidden email]
Sent: Tuesday, July 20, 2010 4:51 AM
To: [hidden email]
Cc: [hidden email]
Subject: Re: [Vm-dev] Re: Cog segmentation fault on
Linux
Attachment: stdout-files.zip (6K)
Eliot,
I am trying to narrow down what may be causing this. I took my looping GC image and shut down more processes where I could, including the eventTickler. My script is:
Sensor shutDown.
VatTPManager stop.
[[Smalltalk garbageCollect] repeat] fork.
My Processes are:
"timerEventLoop Process - priority 80"
"lowSpaceWatcher Process - priority 60"
"finalization Process - priority 50"
"UIProcess - priority 40"
"user Process - priority 40 - [[Smalltalk garbageCollect] repeat] fork."
"idle Process - priority 10"
As always, the active stack on seg fault is the user Process doing a Smalltalk garbageCollect.
At this point I decided to see what gdb would tell me. I ran the following command:
gdb --args lib/squeak/3.9-7/squeak -leakcheck 7 -vm-display-null -vm-sound-null echat-server-off.image garbageCollect.sq
It loaded symbols. I then issued the 'run' command. It runs squeak, but the resulting process doesn't accumulate CPU time, as if it isn't really running. I tried sending a USR1 to that pid, but it doesn't output anything. I issued the run command again, but gdb seems to think it is already running. ps aux agrees, as there is an entry for squeak. It just isn't doing anything. Am I using gdb wrong? Is the stack paused? Here are the results of 'bt':
(gdb) bt
#0  0xf7ffd430 in __kernel_vsyscall ()
#1  0x4b3c22f6 in nanosleep () from /lib/libpthread.so.0
#2  0x0805c958 in tickerSleepCycle (ignored=0x0) at /home1/vawhigso/public_html/squeakelib/Cog/platforms/unix/vm/sqUnixHeartbeat.c:375
#3  0x4b3ba832 in start_thread () from /lib/libpthread.so.0
#4  0x4b315e0e in clone () from /lib/libc.so.6
Thanks for any help,
Rob
From: [hidden email]
Sent: Tuesday, July 20, 2010 6:14 AM
To: [hidden email]
Cc: [hidden email]
Subject: Re: [Vm-dev] Re: Cog segmentation fault on
Linux