Image freeze because handleTimerEvent and Seaside process gone?!


Image freeze because handleTimerEvent and Seaside process gone?!

Adrian Lienhard
Hi all,

We've been experiencing an "interesting" problem: the image freezes and no longer responds to HTTP requests after it has been running for days.

Here is some basic information about our setup:

Squeak VM 4.0.3-2202 compiled with gcc 4.3.2
PharoCore 1.1
OS Debian Lenny amd64 (CPUs are 4 Intel Xeon E5530 2.40GHz)

- We have never seen the problem with the Squeak VM 3.9-9 and Squeak 3.9 on the identical machine and with the same application source (modulo some adaptations to make it run on Pharo).
- We run the VM with -mmap 512m -vm-sound-null -vm-display-null, and the UI process is suspended (Project uiProcess suspend)
- VM does not hog the CPU and memory usage is normal
- The mean time between failures is several weeks and we haven't managed to reproduce the problem
- The application mainly serves HTTP requests. When the image does not receive requests for some time, we send it a STOP signal; when a request comes in, it is sent a CONT signal.
- lsof shows
        TCP *:9093 (LISTEN)
        TCP server:9093->server:46930 (CLOSE_WAIT)
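For illustration, the idle parking described above can be sketched with plain kill(1). This is only a stand-in, not our actual supervisor script: `sleep` plays the role of the VM process, and the timings are hypothetical.

```shell
# Sketch of the STOP/CONT idle handling described above.
# 'sleep' stands in for the squeak VM process; pid and timing are hypothetical.
sleep 60 &
VM_PID=$!

kill -STOP "$VM_PID"                          # park the image while no requests arrive
sleep 0.2                                     # give the kernel a moment to deliver the signal
STOPPED=$(ps -o state= -p "$VM_PID" | tr -d ' ')
echo "after STOP: $STOPPED"                   # 'T' means the process is stopped

kill -CONT "$VM_PID"                          # wake it when a request comes in
RUNNING=$(ps -o state= -p "$VM_PID" | tr -d ' ')
echo "after CONT: $RUNNING"

kill "$VM_PID" 2>/dev/null                    # clean up the stand-in
wait "$VM_PID" 2>/dev/null || true
```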

Below is a GDB backtrace and the Smalltalk stacks from an image that was frozen (the VM had been running for almost 100 hours):

=============================================================
(gdb) bt
#0  0x08072020 in ?? ()
#1  <signal handler called>
#2  0xb766f5e0 in malloc () from /lib/libc.so.6
#3  <function called from gdb>
#4  0xb76c50c8 in select () from /lib/libc.so.6
#5  0x08071063 in aioPoll ()
#6  0xb778bb8d in ?? () from /usr/lib/squeak/4.0.3-2202//so.vm-display-null
#7  0x000003e8 in ?? ()
#8  0x997b5a34 in ?? ()
#9  0xbfe7cb28 in ?? ()
#10 0x08074575 in ioRelinquishProcessorForMicroseconds ()
Backtrace stopped: frame did not save the PC

(gdb) call printCallStack()
-1719969228 >idleProcess
-1719969320 >startUp
-1740134028 BlockClosure>newProcess
$3 = -1755344892

(gdb) call (int) printAllStacks()
Process
-1719969228 >idleProcess
-1719969320 >startUp
-1740134028 BlockClosure>newProcess

Process
-1740113860 >finalizationProcess
-1740113952 >restartFinalizationProcess
-1740113532 BlockClosure>newProcess

Process
-1740134424 SmalltalkImage>lowSpaceWatcher
-1740134516 SmalltalkImage>installLowSpaceWatcher
-1740134300 BlockClosure>newProcess

Process
-1719451488 Delay>wait
-1719451580 BlockClosure>ifCurtailed:
-1719451704 Delay>wait
-1719451796 InputEventPollingFetcher>waitForInput
-1740126940 InputEventFetcher>eventLoop
-1740127032 InputEventFetcher>installEventLoop
-1740126816 BlockClosure>newProcess

Process
-1719557780 UnixOSProcessAccessor>grimReaperProcess
-1740113624 BlockClosure>repeat
-1740113716 UnixOSProcessAccessor>grimReaperProcess
-1740117340 BlockClosure>newProcess

[omitted many newlines between output above]
=============================================================
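For reference, a dump like the one above can also be produced non-interactively. This is a sketch only: the pid is a hypothetical placeholder, and the actual attach line is commented out because it requires a live VM and ptrace rights.

```shell
# Sketch: batch the same gdb calls shown above against a hung VM.
# VM_PID is a hypothetical placeholder; replace it with the real squeakvm pid.
VM_PID=12345

cat > /tmp/dump-stacks.gdb <<'EOF'
bt
call printCallStack()
call (int) printAllStacks()
detach
quit
EOF

# Attach and dump (run this against the live, frozen VM):
# gdb -batch -p "$VM_PID" -x /tmp/dump-stacks.gdb > /tmp/vm-stacks.txt

cat /tmp/dump-stacks.gdb
```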

What is striking in the above process listing is that two processes are missing: the handleTimerEvent process and the Seaside process (that is, the TCP listener loop). How come these processes vanished?

This may be related to Pharo or to the Squeak VM.

Has anybody else seen this problem? Any idea how to debug/fix this issue is very much appreciated!

Cheers,
Adrian


CCed to pharo-dev since this may be related to Pharo; please respond on the squeak-vm list




Fwd: [Vm-dev] Image freeze because handleTimerEvent and Seaside process gone?!

Mariano Martinez Peck


---------- Forwarded message ----------
From: David T. Lewis <[hidden email]>
Date: Tue, Dec 7, 2010 at 2:06 AM
Subject: Re: [Vm-dev] Image freeze because handleTimerEvent and Seaside process gone?!
To: Squeak Virtual Machine Development Discussion <[hidden email]>, [hidden email]



On Mon, Dec 06, 2010 at 12:33:59PM -0800, Andreas Raab wrote:
>
> At a guess, I'd say it's either one of two issues:
>
> 1) Your STOP/CONT handling. This sounds suspicious and it could affect
> the timer handling. I'm assuming that the issue happens after receiving
> the CONT signal, no? If you can, you might want to a) make sure that you
> only get the STOP signal when the VM is in ioRelinquish() and not (for
> example) currently executing the delay process and b) consider to dump
> the call stacks whenever the VM gets the CONT signal to see what the
> status is.
>
> 2) Some set of incomplete process/delay/semaphore changes in Pharo. One
> of the problems with processes and delays is that this part of the
> system reacts very badly to random "cleaning". I.e., changing "foo ==
> nil" to "foo isNil" can have dramatic effects (since it introduces a
> suspension point) with just the kind of weird issue you're seeing.

Actually #2 does seem like a likely culprit. I found a Pharo 1.1 image
and loaded the CommandShell and OSProcess test suites. The CommandShell
tests put a heavy load on process switching, and are rather timing
dependent. On Pharo 1.1 I get intermittent and non-reproducible errors
and test failures, and I can't get a clean run of the test suite. The
errors seem to be different each time.

On Pharo 1.1.1 and 1.2 I can get clean runs of the CommandShell/OSProcess
tests, so I think there must be some issues in Pharo 1.1. If you are
using PharoCore 1.1 now and have the option of moving to Pharo 1.1.1
or 1.2, I suspect you may see the problems go away.

Dave


>
> With regards to these processes not being printed, that's a side effect
> of how printAllStacks gathers the processes - it will not print
> suspended processes which explains why the UI process doesn't print and
> most likely handleTimerEvent is suspended in a debugger.
>
> Depending on how important this issue is you can also try to dissect the
> object memory itself. If you call writeImageFile (or is it
> writeImageFileIO?) from gdb it will dump the .image file and you can use
> the simulator to look at it more closely. Most likely you'll be able to
> find the processes and look at their stacks.
>
> Cheers,
>   - Andreas
>

Re: Fwd: [Vm-dev] Image freeze because handleTimerEvent and Seaside process gone?!

Adrian Lienhard
The changes between 1.1 and 1.1.1 are the issues in [1]. None seems related... did I miss something?

One change that I don't understand, although it probably is unrelated, is in [2]:

LargePositiveInteger removeSelector: #=!
LargePositiveInteger removeSelector: #bitAnd:!
LargePositiveInteger removeSelector: #bitOr:!
LargePositiveInteger removeSelector: #bitShift:!
LargePositiveInteger removeSelector: #bitXor:!
LargePositiveInteger removeSelector: #'~='!

Why would one want to remove these primitive calls from large integers?

Cheers,
Adrian

[1] http://code.google.com/p/pharo/issues/list?can=1&q=Milestone%3D1.1.1&colspec=ID+Type+Status+Summary+Milestone+Difficulty&cells=tiles
[2] http://code.google.com/p/pharo/issues/attachmentText?id=2912&aid=-2442931684430823333&name=NecessaryImageChangesForCogToWork.Pharo1.1.cs&token=4a16b7709abc303c3826e5be2743eeb7




Re: Fwd: [Vm-dev] Image freeze because handleTimerEvent and Seaside process gone?!

Lukas Renggli
> One change that I don't understand, although it probably is unrelated, is in [2]:
>
> LargePositiveInteger removeSelector: #=!
> LargePositiveInteger removeSelector: #bitAnd:!
> LargePositiveInteger removeSelector: #bitOr:!
> LargePositiveInteger removeSelector: #bitShift:!
> LargePositiveInteger removeSelector: #bitXor:!
> LargePositiveInteger removeSelector: #'~='!
>
> Why would one want to remove these primitive calls from large integers?

AFAIK, Cog needs those.

Lukas

--
Lukas Renggli
www.lukas-renggli.ch


Re: Fwd: [Vm-dev] Image freeze because handleTimerEvent and Seaside process gone?!

Stéphane Ducasse

On Dec 7, 2010, at 12:41 PM, Lukas Renggli wrote:

>> One change that I don't understand, although it probably is unrelated, is in [2]:
>>
>> LargePositiveInteger removeSelector: #=!
>> LargePositiveInteger removeSelector: #bitAnd:!
>> LargePositiveInteger removeSelector: #bitOr:!
>> LargePositiveInteger removeSelector: #bitShift:!
>> LargePositiveInteger removeSelector: #bitXor:!
>> LargePositiveInteger removeSelector: #'~='!
>>
>> Why would one want to remove these primitive calls from large integers?
>
> AFAIK, Cog needs those.

In float?
But apparently Eliot removed them for Cog in LargePositiveInteger.
Can somebody check that point?



Re: Fwd: [Vm-dev] Image freeze because handleTimerEvent and Seaside process gone?!

David T. Lewis
In reply to this post by Adrian Lienhard
The symptoms that I see are intermittent, and it's hard for me to
say what the root cause might be. It might be worthwhile to see
if you can reproduce my results. If so, it might get you closer to
a reproducible test case.

What I did was take a Pharo core 1.1, 1.1.1, and 1.2 image, and
in each image, I loaded these two packages (ignoring the MVC
warnings in CommandShell):

  http://squeaksource.com/OSProcess/OSProcess-dtl.59.mcz
  http://squeaksource.com/OSProcess/CommandShell-dtl.49.mcz

On Pharo 1.1, I get intermittent failures and errors in the
CommandShell and OSProcess tests, but the problems seem to be
resolved in 1.1.1 and 1.2.

These tests are timing and machine dependent to some extent, so
I am not sure if you will see the same symptoms.

HTH,
Dave



Re: Fwd: [Vm-dev] Image freeze because handleTimerEvent and Seaside process gone?!

Eliot Miranda-2
In reply to this post by Lukas Renggli


On Tue, Dec 7, 2010 at 3:41 AM, Lukas Renggli <[hidden email]> wrote:
> One change that I don't understand, although it probably is unrelated, is in [2]:
>
> LargePositiveInteger removeSelector: #=!
> LargePositiveInteger removeSelector: #bitAnd:!
> LargePositiveInteger removeSelector: #bitOr:!
> LargePositiveInteger removeSelector: #bitShift:!
> LargePositiveInteger removeSelector: #bitXor:!
> LargePositiveInteger removeSelector: #'~='!
>
> Why would one want to remove these primitive calls from large integers?

AFAIK, Cog needs those.

That's right.  Those methods used SmallInteger primitives (7, 8, 14, 15, 16 & 17), and in Cog the SmallInteger primitives only work on SmallIntegers, not on up-to-64-bit LargeIntegers; i.e., they don't bother to test the receiver for being a SmallInteger and hence crash if installed on LargeInteger.  I probably made a mistake in deciding to save a tag test, but I did.  Instead, these methods should use the relevant LargeInteger primitives.  The issue is that I think those are missing in the standard VM.  See primitives 27, 28, 34, 35, 36 & 37 (primitiveEqualLargeIntegers, primitiveNotEqualLargeIntegers, primitiveBitAndLargeIntegers, primitiveBitOrLargeIntegers, primitiveBitXorLargeIntegers, primitiveBitShiftLargeIntegers) in the Cog VM and perhaps the standard VM.  If they're missing in the standard VM, it should be fine to redefine those methods to use those primitive numbers.  If they're defined differently, someone needs to check that using those numbers works in the standard VM.

HTH
Eliot
