I have a headless image that was running for a couple of days. Now it is not responding anymore. The image was still running, but the sockets for HTTP and VNC were closed, and restarting the image just brings it up without any sockets opened.
Injecting something via a script does not work either. I copied the image to my desktop and tried a few things, but without success. What would be a good way to get a glimpse of what is causing the problem? I had a Scheduler running that saved the image every hour, and the image itself issues an HTTP request to the outside world every 10 minutes.

thanks,
Norbert |
On 07.12.2011 10:47, Norbert Hartl wrote:
> What would be a good way to get a glimpse of what is causing problems?

Could it be related to http://forum.world.st/Too-many-semaphores-image-blocked-td3871970i20.html#a3925711 ?

Cheers,
Henry |
On 07.12.2011 at 12:26, Henrik Sperre Johansen wrote:
> Could it be related to
> http://forum.world.st/Too-many-semaphores-image-blocked-td3871970i20.html#a3925711 ?

Could it be that the warning about the external semaphores is issued, but their usage keeps growing anyway, so that the image gets saved with such a shortage of semaphores that it blocks on the next start?

Norbert |
In reply to this post by NorbertHartl
The worst thing a computer can do is grind to a halt and leave no trace of why it happened. In Dolphin, I had some success rigging deployed images (runtime session managers) to, on control-break, log callstacks for all non-dead processes. Conditions in which two threads were each waiting on the other to do something were fairly obvious from the callstacks.
________________________________________
From: [hidden email] [[hidden email]] on behalf of Norbert Hartl [[hidden email]]
Sent: Wednesday, December 07, 2011 4:47 AM
To: [hidden email]
Subject: [Pharo-project] How to resurrect an unrepsonsive image? |
In reply to this post by NorbertHartl
Probably you have a crash.dmp or a PharoDebug.log which tells you what is happening. Look at the backtrace to see what causes the error, and then maybe there's some way to help you fix it.
Cheers,
Javier.
On Wed, Dec 7, 2011 at 10:47 AM, Norbert Hartl <[hidden email]> wrote:
> I have a headless image that was running for a couple of days. Now it is not responding anymore. The image was running still but the sockets for http and vnc were closed and restarting the image just brings it up without any sockets opened.

Lic. Javier Pimás
Ciudad de Buenos Aires |
That assumes there is an error. Another (even more frustrating) failure arises when an image does nothing. It is very helpful to be able to get callstacks in that scenario.
From: [hidden email] [[hidden email]] on behalf of Javier Pimás [[hidden email]]
Sent: Wednesday, December 07, 2011 1:52 PM
To: [hidden email]
Subject: Re: [Pharo-project] How to resurrect an unrepsonsive image? |
In reply to this post by melkyades
The image didn't crash, and there is nothing in the debug log hinting at what could be the cause. The image just stopped serving the network and couldn't be restarted.

Norbert
|
In reply to this post by Schwab,Wilhelm K
On Wed, Dec 7, 2011 at 10:55 AM, Schwab,Wilhelm K <[hidden email]> wrote:
On Mac and Linux the Cog VM responds to SIGUSR1 by dumping all stacks to crash.dmp.
best, Eliot |
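For readers outside the Smalltalk world, the "dump all stacks on SIGUSR1" idea can be sketched in Python, whose standard `faulthandler` module offers a similar hook on Unix. This is only an analogy and not how the Cog VM implements it; the function names and the reuse of the file name crash.dmp are assumptions made for illustration.

```python
import faulthandler
import os
import signal
import tempfile
import threading


def dump_stacks_on_sigusr1(dump_path):
    """Arrange for SIGUSR1 to write every thread's stack to dump_path.

    Returns the open file object; faulthandler writes to its file
    descriptor, so it must stay open while the handler is registered.
    """
    dump_file = open(dump_path, "w")
    faulthandler.register(signal.SIGUSR1, file=dump_file, all_threads=True)
    return dump_file


def demo():
    """Block a worker thread, send ourselves SIGUSR1, return the dump text."""
    started = threading.Event()
    gate = threading.Event()

    def blocked_worker():
        started.set()
        gate.wait()  # stands in for a process stuck waiting on a semaphore

    worker = threading.Thread(target=blocked_worker)
    worker.start()
    started.wait()  # the worker is now inside blocked_worker

    dump_path = os.path.join(tempfile.mkdtemp(), "crash.dmp")
    dump_file = dump_stacks_on_sigusr1(dump_path)
    try:
        # Stand-in for `kill -USR1 <pid>` sent from a shell; the process survives.
        signal.raise_signal(signal.SIGUSR1)
    finally:
        faulthandler.unregister(signal.SIGUSR1)
        dump_file.close()
        gate.set()  # let the worker finish
        worker.join()

    with open(dump_path) as f:
        return f.read()
```

Calling `demo()` returns the text of the dump, in which the blocked worker shows up with its own stack. That is the property that makes a silent hang diagnosable.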
Eliot,
Nice! Do you dump *all* stacks, or only those for (in Dolphin lingo) the non-dead processes? In a very active image at least, Dolphin can have a fair number of processes in various stages of being removed from the system, so it was useful to suppress the stacks of the ones that were not going to come back.

Bill

From: [hidden email] [[hidden email]] on behalf of Eliot Miranda [[hidden email]]
Sent: Wednesday, December 07, 2011 4:49 PM
To: [hidden email]
Subject: Re: [Pharo-project] How to resurrect an unrepsonsive image?
|
On Wed, Dec 7, 2011 at 2:30 PM, Schwab,Wilhelm K <[hidden email]> wrote:
It prints the active process, all processes waiting on Semaphores and/or Mutexes and the runnable processes, in that order.
best, Eliot |
In reply to this post by NorbertHartl
Maybe then you can kill it manually and see which processes were alive in there, and which ones that should have been were not?
On Wed, Dec 7, 2011 at 8:09 PM, Norbert Hartl <[hidden email]> wrote:
Lic. Javier Pimás Ciudad de Buenos Aires |
On 08.12.2011 at 00:34, Javier Pimás wrote:
> maybe then you can kill it manually and see which processes were living there,.. and which that should be were not?

Sorry, but I don't understand what you're saying.

Norbert
|
In reply to this post by Eliot Miranda-2
Eliot,
can you take a look at the attached crash.dmp file? My knowledge of the internals is limited, so I'm not a good candidate for getting a feeling for what could have gone wrong. If I had to guess, I would think there is a deadlock in ExternalSemaphoreTable. It looks to me as if the outgoing network connect tries to register an external object at the same time as snapshot:andQuit: tries to clear the external objects. But I know nothing about how this works.

Norbert

On 07.12.2011 at 22:49, Eliot Miranda wrote:
crash.dmp (17K) Download Attachment |
In reply to this post by NorbertHartl
On 08 Dec 2011, at 10:50, Norbert Hartl wrote:
> Sorry, but I don't understand what your saying.

Run your image on Linux or Mac OS X using a Miranda Banda Cog VM, find the process ID of the VM, and kill it with SIGUSR1, which will not actually kill it but will write the crash.dmp file:

[sven@voyager:~/smalltalk]$ ps auxww | grep Cog
sven 4943 3.5 1.0 1237260 40852 ?? R 11:03AM 0:01.03 /Users/sven/apps/Smalltalk/Virtual-Machines/Cog.app/Contents/MacOS/Croquet -psn_0_1200421
[sven@voyager:~/smalltalk]$ kill -s SIGUSR1 4943
[sven@voyager:~/smalltalk]$ ls *.dmp
crash.dmp
[sven@voyager:~/smalltalk]$ less crash.dmp

Sven |
Sven,

On 08.12.2011 at 11:06, Sven Van Caekenberghe wrote:
> Run your image on Linux or Mac OS X using a Miranda Banda Cog VM, find the process ID of the VM, and kill it with SIGUSR1, which will not actually kill it but will write the crash.dmp file:

I didn't get that Javier meant the same as Eliot. What should be done is pretty clear to me after dealing with Unix for 20 years ;) But thank you very much for explaining. I didn't know about the -s switch in kill. I always just do kill -USR1 ...

Norbert |
On 08 Dec 2011, at 11:20, Norbert Hartl wrote:
> I didn't get that Javier meant the same as Eliot. What should be done is pretty clear to me after dealing 20 years with unix ;) But thank you very much for explaining. I didn't know about the -s switch in kill. I always just do kill -USR1 ...

I was thinking you were one of those Windows guys ;-)

What he meant was: are there any threads doing anything unusual (as compared to a clean image)? But of course, if this was a busy server, there would be many threads and sockets open.

Anyway: the issue was, as you probably read, that you lost a semaphore every time you saved the image, and you do that every hour…

But that won't help in recovering the image.

Sven |
On 08.12.2011 at 11:32, Sven Van Caekenberghe wrote:
> I was thinking you were one of those Windows guys ;-)

shame on you ;)

> What he meant was: are there any threads doing anything unusual (as compared to a clean image).
> But of course, if this was a busy server, there would of course be many threads and sockets open.

The image wasn't that busy. But I need some time to investigate what is normal in an image before I can judge oddities in my other images.

> Anyway: this issue was, as you probably read, that you lost a semaphore every time you saved the image, since you do that every hour…
>
> But that won't help in recovering the image.

I lost all the data, but the data wasn't that important. Having spent all my time with GemStone, I got out of the habit of taking extra measures to keep my data. So from now on I'm a Fuel user: I just write out the data using Fuel before saving an image, so it is easy to reconstruct. My quick take on journaling with manual replay :) Recovering the image was not important for me, but for Pharo. I just wanted to take part in solving a problem that is hard to debug.

thanks,
Norbert |
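The "write the data out before saving the image" habit can be sketched as follows. Fuel is a Pharo serializer; this Python version uses `pickle` purely as a stand-in, and `write_out`, `read_back`, `save_with_backup` and the `save_image` callback are hypothetical names, not part of Fuel or Pharo.

```python
import os
import pickle
import tempfile


def write_out(data, path):
    """Serialize data to path via a temp file and rename, so a crash
    mid-write cannot leave a truncated backup behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            pickle.dump(data, tmp)
        os.replace(tmp_path, path)  # swap the finished file into place
    except BaseException:
        os.unlink(tmp_path)
        raise


def read_back(path):
    """Reload previously written data (the 'manual replay' step)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def save_with_backup(data, backup_path, save_image):
    """Write the data out first, then run the (hypothetical) image save."""
    write_out(data, backup_path)
    save_image()
```

The ordering is the point: if the save step hangs or fails, as it did in this thread, the data written just before it can still be read back into a fresh image.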
In reply to this post by NorbertHartl
Hi Norbert,
On Thu, Dec 8, 2011 at 2:04 AM, Norbert Hartl <[hidden email]> wrote:
OK, as a favour to you. Next time you do the leg work. But all the info you need is in the dump. First thing, the active process is the idle process:

Process 0x8f1d2a8 priority 10
0xbff60000 M ProcessorScheduler class>idleProcess 64373036: a(n) ProcessorScheduler class
0x8f929b0 s [] in ProcessorScheduler class>startUp
0x8f1d248 s [] in BlockClosure>newProcess

(you can see from the C stack trace that the VM is looping doing primitiveRelinquishProcessor)

The next process is the finalization process, which has nothing to do.

Then Process 0x8d73268 priority 20 is trying to do a connect but is blocked in a critical section trying to register one or other of the Socket's semaphores.

Then Process 0x8f23b98 priority 20 is trying to do a snapshot and is blocked in a critical section doing SmalltalkImage>clearExternalObjects.

So my guesses are either that
a) something terminated a process that was in the critical section for registering external objects and the semaphore protecting it is missing a signal, or
b) there is a bug in the code and that if a process is in the critical section registering an external object then SmalltalkImage>clearExternalObjects may lock up.

But you can read Smalltalk stack traces as well as I. Just look at them. There are only 11 of them.

HTH
Eliot
|
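Guess (a), a mutual-exclusion semaphore left "missing a signal", can be illustrated with a small Python sketch. It does not reproduce the Pharo code path; `threading.Semaphore(1)` merely stands in for `Semaphore forMutualExclusion`, an exception stands in for process termination, and all function names are made up for the example.

```python
import threading


def enter_and_die(guard):
    """Take the guard, then 'die' inside the critical section without releasing it."""
    guard.acquire()
    raise RuntimeError("terminated inside the critical section")


def enter_and_die_safely(guard):
    """Same failure, but the release sits in a finally block (compare #ensure:)."""
    guard.acquire()
    try:
        raise RuntimeError("terminated inside the critical section")
    finally:
        guard.release()


def guard_is_free_after(body):
    """Run body in a thread against a fresh guard; report whether the guard can still be taken."""
    guard = threading.Semaphore(1)  # analogue of Semaphore forMutualExclusion

    def run():
        try:
            body(guard)
        except RuntimeError:
            pass  # the worker is gone; only the guard's state matters now

    worker = threading.Thread(target=run)
    worker.start()
    worker.join()
    # A timeout keeps the demo from hanging the way the image did.
    got_it = guard.acquire(timeout=0.2)
    if got_it:
        guard.release()
    return got_it
```

With `enter_and_die` the guard is never released, so every later attempt to enter the critical section would wait forever, which matches the two blocked priority-20 processes in the dump. With `enter_and_die_safely` the guard stays usable.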
On 08.12.2011 at 19:08, Eliot Miranda wrote:
> Hi Norbert,

Eliot,

thanks for looking into this. As you can see from my guessing, I did read the stack traces. And it seems I saw the same as you did :) That is the reason I was asking. I'm just not too familiar with most of the code involved, and I expected that you are. These guesses of yours go somewhat beyond what I could have guessed myself, and they are very helpful.
Norbert
|
In reply to this post by Eliot Miranda-2
Eliot,
It would be nice to have the #name of each process in the dump. I often end up with many similar or identical processes, and the name is very useful in sorting out which among them has (or have) strayed.

Re guess (a), the unsignaled semaphore: does that perhaps suggest a missing #ensure: block? Another possibility (just asking) is that we are using a semaphore for mutual exclusion where a mutex would be a better choice. When I started using threads, I had a robust mutex class "from the start", and the differences between a mutual exclusion semaphore and a mutex were striking.

Re guess (b), the lockup in #clearExternalObjects: Norbert mentioned saving the image in connection with this. Saving the image is a very "main thread" activity, and as such, there might be a need to queue a deferred action rather than invoke the code from a background thread.

I was going to add something about my reservations about our weak collections, which IMHO must be thread safe and self-cleaning, and are not in Squeak/Pharo. Even in Dolphin's earliest docs, weak collections and finalization were one topic, for good reason. Toward that end, I looked at ExternalSemaphoreTable, expecting to find it subclassed from, or using, a weak collection of some type. What I found is a #forMutualExclusion semaphore in a situation where I would use a Mutex.

This looks like a matter of evolution and timing. Squeak dates back to an era before structured exception handling and improvements like Mutex. Dolphin got started after Squeak, with either a two or fifteen year head start, depending on how one wants to call it. Dolphin was built from the ground up to have weak collections, finalization, and a full set of process synchronization tools, making Mutex a part of the toolkit when its sockets were written. We have Mutex now, and probably should be using it more widely than we do.

Bill

From: [hidden email] [[hidden email]] on behalf of Eliot Miranda [[hidden email]]
Sent: Thursday, December 08, 2011 1:08 PM
To: [hidden email]
Subject: Re: [Pharo-project] How to resurrect an unrepsonsive image?
|
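Bill does not spell out which differences between a mutual-exclusion semaphore and a mutex he found striking. One commonly cited difference is re-entrancy: a mutex remembers its owner, so the owner can enter again, while a counting semaphore cannot tell who holds it. The sketch below assumes that reading and uses Python's `threading.RLock` and `threading.Semaphore(1)` as rough stand-ins for Mutex and `Semaphore forMutualExclusion`; the helper name is hypothetical.

```python
import threading


def nested_entry_succeeds(lock):
    """Enter a critical section on lock, then try to enter it again from the same thread."""
    with lock:
        # Time-limited so the semaphore case reports failure instead of hanging.
        entered_again = lock.acquire(timeout=0.2)
        if entered_again:
            lock.release()
        return entered_again
```

`nested_entry_succeeds(threading.RLock())` is true, while `nested_entry_succeeds(threading.Semaphore(1))` is false: with the semaphore, code that re-enters its own critical section blocks on itself, which from the outside looks just like the hang discussed in this thread.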