How to resurrect an unresponsive image?

NorbertHartl
I have a headless image that was running for a couple of days. Now it is not responding anymore. The image was still running, but the sockets for HTTP and VNC were closed, and restarting the image just brings it up without any sockets opened.
Injecting something via script does not work either. I copied the image to my desktop and tried a few things, but without success.
What would be a good way to get a glimpse of what is causing the problems?
I had a Scheduler running that saved the image every hour. And the image itself issues HTTP requests to the outside world every 10 minutes.

thanks,

Norbert

Re: How to resurrect an unresponsive image?

Henrik Sperre Johansen
On 07.12.2011 10:47, Norbert Hartl wrote:
> I have a headless image that was running for a couple of days. Now it is not responding anymore. The image was still running, but the sockets for HTTP and VNC were closed, and restarting the image just brings it up without any sockets opened.
> Injecting something via script does not work either. I copied the image to my desktop and tried a few things, but without success.
> What would be a good way to get a glimpse of what is causing the problems?
> I had a Scheduler running that saved the image every hour. And the image itself issues HTTP requests to the outside world every 10 minutes.
>
> thanks,
>
> Norbert
Could it be related to
http://forum.world.st/Too-many-semaphores-image-blocked-td3871970i20.html#a3925711 
?

Cheers,
Henry


Re: How to resurrect an unresponsive image?

NorbertHartl

On 07.12.2011 at 12:26, Henrik Sperre Johansen wrote:

> On 07.12.2011 10:47, Norbert Hartl wrote:
>> I have a headless image that was running for a couple of days. Now it is not responding anymore. The image was still running, but the sockets for HTTP and VNC were closed, and restarting the image just brings it up without any sockets opened.
>> Injecting something via script does not work either. I copied the image to my desktop and tried a few things, but without success.
>> What would be a good way to get a glimpse of what is causing the problems?
>> I had a Scheduler running that saved the image every hour. And the image itself issues HTTP requests to the outside world every 10 minutes.
>>
>> thanks,
>>
>> Norbert
> Could it be related to
> http://forum.world.st/Too-many-semaphores-image-blocked-td3871970i20.html#a3925711 ?
>
Yes, it could. My image was a #13315 image, so it made a good candidate for the problem. Furthermore, I got the external semaphore message in my development image a few times. There I could just restart the image. So I don't really understand what happened. If I'm able to restart my development image and solve the problem, then what makes my server image incapable of doing the same?
Could it be that there is a warning about the external semaphores, but their usage keeps growing anyway, so that the image is saved with such a shortage of semaphores that a fresh start of the image blocks?

Norbert




Re: How to resurrect an unresponsive image?

Schwab,Wilhelm K
In reply to this post by NorbertHartl
The worst thing a computer can do is grind to a halt and leave no trace of why it happened.  In Dolphin, I had some success rigging deployed images (runtime session managers) to, on control-break, log callstacks for all non-dead processes.  Conditions in which two threads were each waiting on the other to do something were fairly obvious from the callstacks.
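The idea of logging call stacks for every process that is still alive can be sketched outside Smalltalk as well. The following is a minimal Python illustration, not Dolphin or Pharo code: `dump_live_thread_stacks` and the `stuck_worker` thread are hypothetical names made up for this sketch, and Python threads stand in for Smalltalk processes.

```python
import sys
import threading
import traceback


def dump_live_thread_stacks():
    """Return one formatted stack trace per thread that is still alive."""
    frames = sys._current_frames()  # maps thread id -> topmost frame
    report = []
    for thread in threading.enumerate():  # enumerate() lists only live threads
        frame = frames.get(thread.ident)
        if frame is None:
            continue
        stack = "".join(traceback.format_stack(frame))
        report.append(f"Thread {thread.name!r}:\n{stack}")
    return report


# Hypothetical scenario: a worker that waits on something nobody provides.
started = threading.Event()
release = threading.Event()


def stuck_worker():
    started.set()
    release.wait()  # blocks here until someone calls release.set()


worker = threading.Thread(target=stuck_worker, name="stuck-worker", daemon=True)
worker.start()
started.wait()  # make sure the worker is inside stuck_worker before dumping

report = dump_live_thread_stacks()
print("\n".join(report))

release.set()  # let the worker finish so the sketch exits cleanly
worker.join()
```

A thread that is waiting on another shows up with its wait call at the bottom of its trace, which is what makes mutual waits fairly easy to spot.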




Re: How to resurrect an unresponsive image?

melkyades
In reply to this post by NorbertHartl
You probably have a crash.dmp or a PharoDebug.log which tells you what is happening. Look at the backtrace to see what causes the error, and then maybe there's some way to help you fix it.

Cheers,
Javier.




--
Lic. Javier Pimás
Ciudad de Buenos Aires

Re: How to resurrect an unresponsive image?

Schwab,Wilhelm K
That assumes there is an error.  Another (even more frustrating) failure arises when an image does nothing. It is very helpful to be able to get callstacks in that scenario.






Re: How to resurrect an unresponsive image?

NorbertHartl
In reply to this post by melkyades
The image didn't crash and in the debug log there is nothing hinting what could be the cause. The image just stopped serving the network and couldn't be restarted.

Norbert


Re: How to resurrect an unresponsive image?

Eliot Miranda-2
In reply to this post by Schwab,Wilhelm K


On Wed, Dec 7, 2011 at 10:55 AM, Schwab,Wilhelm K <[hidden email]> wrote:
> That assumes there is an error.  Another (even more frustrating) failure arises when an image does nothing. It is very helpful to be able to get callstacks in that scenario.

On Mac and Linux the Cog VM responds to SIGUSR1 by dumping all stacks to crash.dmp.
 








--
best,
Eliot


Re: How to resurrect an unresponsive image?

Schwab,Wilhelm K
Eliot,

Nice!  Do you dump *all* stacks, or only those (in Dolphin lingo) for the non-dead processes?  In a very active image at least, Dolphin can have a fair number of processes in various stages of being removed from the system, so it was useful to suppress stacks for the ones that were not going to come back.

Bill






Re: How to resurrect an unresponsive image?

Eliot Miranda-2


On Wed, Dec 7, 2011 at 2:30 PM, Schwab,Wilhelm K <[hidden email]> wrote:
> Eliot,
>
> Nice!  Do you dump *all* stacks, or only those (in Dolphin lingo) for the non-dead processes?  In a very active image at least, Dolphin can have a fair number of processes in various stages of being removed from the system, so it was useful to suppress stacks for the ones that were not going to come back.

It prints the active process, all processes waiting on Semaphores and/or Mutexes and the runnable processes, in that order.
 





--
best,
Eliot


Re: How to resurrect an unresponsive image?

melkyades
In reply to this post by NorbertHartl
Maybe then you can kill it manually and see which processes were living there, and which ones that should be there were not?




--
Lic. Javier Pimás
Ciudad de Buenos Aires

Re: How to resurrect an unresponsive image?

NorbertHartl

On 08.12.2011 at 00:34, Javier Pimás wrote:

> maybe then you can kill it manually and see which processes were living there, and which ones that should be there were not?

Sorry, but I don't understand what you're saying.

Norbert


Re: How to resurrect an unresponsive image?

NorbertHartl
In reply to this post by Eliot Miranda-2
Eliot,

can you take a look at the attached crash.dmp file? My knowledge of the internals is limited, so I'm not a good candidate to get a feeling for what could have gone wrong.
If I had to guess, I would think there is a deadlock in ExternalSemaphoreTable. It looks to me as if the outgoing network connect tries to register an external object at the same time as snapshot:andQuit: tries to clear the external objects. But I know nothing about how this works.

Norbert






Attachment: crash.dmp (17K)

Re: How to resurrect an unresponsive image?

Sven Van Caekenberghe
In reply to this post by NorbertHartl

On 08 Dec 2011, at 10:50, Norbert Hartl wrote:

> Sorry, but I don't understand what you're saying.

Run your image on Linux or Mac OS X using a Miranda Banda Cog VM, find the process ID of the VM, and kill it with SIGUSR1, which will not actually kill it but will write the crash.dmp file:

[sven@voyager:~/smalltalk]$ ps auxww | grep Cog
sven            4943   3.5  1.0  1237260  40852   ??  R    11:03AM   0:01.03 /Users/sven/apps/Smalltalk/Virtual-Machines/Cog.app/Contents/MacOS/Croquet -psn_0_1200421
[sven@voyager:~/smalltalk]$ kill -s SIGUSR1 4943
[sven@voyager:~/smalltalk]$ ls *.dmp
crash.dmp
[sven@voyager:~/smalltalk]$ less crash.dmp
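
The same "signal it, but do not kill it" pattern can be tried without a Cog VM at hand. The sketch below is only an analogy under stated assumptions: it uses Python's `faulthandler` module in a child process as a stand-in for the VM, it needs a Unix-like system where SIGUSR1 exists, and the names `CHILD_SOURCE` and `dump_stacks_of_child` are invented for the illustration.

```python
import os
import signal
import subprocess
import sys
import tempfile
import time

# Child program: a stand-in for the unresponsive VM. It registers a SIGUSR1
# handler that writes the stacks of all threads to the file given as argv[1].
CHILD_SOURCE = """
import faulthandler, signal, sys, time
dump_file = open(sys.argv[1], "w")
faulthandler.register(signal.SIGUSR1, file=dump_file, all_threads=True)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def dump_stacks_of_child():
    """Start the child, send it SIGUSR1, and return (dump text, still running?)."""
    with tempfile.TemporaryDirectory() as tmp:
        dump_path = os.path.join(tmp, "crash.dmp")
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD_SOURCE, dump_path],
            stdout=subprocess.PIPE,
            text=True,
        )
        try:
            child.stdout.readline()  # wait until the handler is installed
            os.kill(child.pid, signal.SIGUSR1)  # like `kill -s USR1 <pid>`
            dump_text = ""
            deadline = time.monotonic() + 5
            while time.monotonic() < deadline:
                with open(dump_path) as f:
                    dump_text = f.read()
                if "most recent call first" in dump_text:
                    time.sleep(0.1)  # give the handler time to finish writing
                    with open(dump_path) as f:
                        dump_text = f.read()
                    break
                time.sleep(0.05)
            still_running = child.poll() is None  # the signal did not kill it
        finally:
            child.kill()
            child.wait()
            child.stdout.close()
        return dump_text, still_running


dump_text, still_running = dump_stacks_of_child()
print(dump_text)
print("child survived the signal:", still_running)
```

As in the transcript above, the process keeps running after the signal; only the dump file appears.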

Sven




Re: How to resurrect an unresponsive image?

NorbertHartl
Sven,

On 08.12.2011 at 11:06, Sven Van Caekenberghe wrote:

>
> On 08 Dec 2011, at 10:50, Norbert Hartl wrote:
>
>> Sorry, but I don't understand what you're saying.
>
> Run your image on Linux or Mac OS X using a Miranda Banda Cog VM, find the process ID of the VM, and kill it with SIGUSR1, which will not actually kill it but will write the crash.dmp file:
>
> [sven@voyager:~/smalltalk]$ ps auxww | grep Cog
> sven            4943   3.5  1.0  1237260  40852   ??  R    11:03AM   0:01.03 /Users/sven/apps/Smalltalk/Virtual-Machines/Cog.app/Contents/MacOS/Croquet -psn_0_1200421
> [sven@voyager:~/smalltalk]$ kill -s SIGUSR1 4943
> [sven@voyager:~/smalltalk]$ ls *.dmp
> crash.dmp
> [sven@voyager:~/smalltalk]$ less crash.dmp
>
I didn't get that Javier meant the same as Eliot. What should be done is pretty clear to me after dealing 20 years with unix ;) But thank you very much for explaining. I didn't know about the -s switch in kill. I always just do kill -USR1 ...

Norbert

Re: How to resurrect an unresponsive image?

Sven Van Caekenberghe

On 08 Dec 2011, at 11:20, Norbert Hartl wrote:

> I didn't get that Javier meant the same as Eliot. What should be done is pretty clear to me after dealing 20 years with unix ;) But thank you very much for explaining. I didn't know about the -s switch in kill. I always just do kill -USR1 ...

I was thinking you were one of those Windows guys ;-)

What he meant was: are there any threads doing anything unusual (as compared to a clean image)?
But of course, if this was a busy server, there would be many threads and sockets open.

Anyway: this issue was, as you probably read, that you lost a semaphore every time you saved the image, since you do that every hour…

But that won't help in recovering the image.

Sven





Re: How to resurrect an unresponsive image?

NorbertHartl

On 08.12.2011 at 11:32, Sven Van Caekenberghe wrote:

>
> On 08 Dec 2011, at 11:20, Norbert Hartl wrote:
>
>> I didn't get that Javier meant the same as Eliot. What should be done is pretty clear to me after dealing 20 years with unix ;) But thank you very much for explaining. I didn't know about the -s switch in kill. I always just do kill -USR1 ...
>
> I was thinking you were one of those Windows guys ;-)
>
shame on you ;)

> What he meant was: are there any threads doing anything unusual (as compared to a clean image).
> But of course, if this was a busy server, there would of course be many threads and sockets open.
>
The image wasn't that busy. But I need some time to investigate what is normal in an image before I can judge about oddities in my other images.

> Anyway: this issue was, as you probably read, that you lost a semaphore every time you saved the image, since you do that every hour…
>
> But that won't help in recovering the image.

I lost all the data, but the data wasn't that important. Having worked with GemStone all the time, I got out of the habit of taking extra measures to keep my data.  So from now on I'm a Fuel user. I just write out the data using Fuel before saving an image so it is easy to reconstruct. My quick take on journaling with manual replay :)
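
The "write the data out before every save" habit can be sketched as follows. This is not Fuel: Python's `pickle` stands in for the serializer, and `write_data_snapshot`, `load_data_snapshot` and the file name `data.snapshot` are hypothetical names chosen for the illustration. The write goes to a temporary file first, so a crash in the middle of writing does not destroy the previous copy.

```python
import os
import pickle
import tempfile


def write_data_snapshot(data, path):
    """Serialize data to path, replacing any older snapshot atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reached the disk
        os.replace(tmp_path, path)  # atomic rename over the old snapshot
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_data_snapshot(path):
    """Read back what write_data_snapshot stored."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical application data, written out just before the image is saved.
original = {"requests_served": 1234, "last_poll": "2011-12-07T10:40:00"}
with tempfile.TemporaryDirectory() as tmp:
    snapshot_path = os.path.join(tmp, "data.snapshot")
    write_data_snapshot(original, snapshot_path)
    restored = load_data_snapshot(snapshot_path)

print(restored == original)
```

If the image itself becomes unusable, the data can then be loaded into a fresh image from the last snapshot file.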

Trying to recover the image was not important for me, but for Pharo. I just wanted to participate in solving a problem that is hard to debug.

thanks,

Norbert

Re: How to resurrect an unresponsive image?

Eliot Miranda-2
In reply to this post by NorbertHartl
Hi Norbert,

On Thu, Dec 8, 2011 at 2:04 AM, Norbert Hartl <[hidden email]> wrote:
> Eliot,
>
> can you take a look at the attached crash.dmp file? My knowledge of the internals is limited, so I'm not a good candidate to get a feeling for what could have gone wrong.
> If I had to guess, I would think there is a deadlock in ExternalSemaphoreTable. It looks to me as if the outgoing network connect tries to register an external object at the same time as snapshot:andQuit: tries to clear the external objects. But I know nothing about how this works.

OK, as a favour to you.  Next time you do the leg work. But all the info you need is in the dump.  First thing, the active process is the idle process:
 
Process  0x8f1d2a8 priority 10
0xbff60000 M ProcessorScheduler class>idleProcess 64373036: a(n) ProcessorScheduler class
 0x8f929b0 s [] in ProcessorScheduler class>startUp
 0x8f1d248 s [] in BlockClosure>newProcess

(you can see from the C stack trace that the VM is looping doing primitiveRelinquishProcessor)

The next process is the finalization process, which has nothing to do.

Then Process  0x8d73268 priority 20 is trying to do a connect but is blocked in a critical section trying to register one or other of the Socket's semaphores.

Then Process  0x8f23b98 priority 20 is trying to do a snapshot and is blocked in a critical section doing SmalltalkImage>clearExternalObjects.

So my guesses are either that
    a) something terminated a process that was in the critical section for registering external objects and the semaphore protecting it is missing a signal, or
    b) there is a bug in the code and that if a process is in the critical section registering an external object then SmalltalkImage>clearExternalObjects may lock up.

But you can read Smalltalk stack traces as well as I.  Just look at them. There are only 11 of them.
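
Guess (a) can be reproduced in miniature. The sketch below is a Python analogy, not the Pharo code: a `threading.Semaphore(1)` plays the role of the mutual-exclusion semaphore, an exception plays the role of the process being terminated, and the function names are made up for the illustration. The `try`/`finally` variant corresponds to what an `ensure:` block would give in Smalltalk.

```python
import threading


def register_without_cleanup(lock):
    """Enter the critical section and die inside it; the signal is lost."""
    lock.acquire()
    raise RuntimeError("terminated inside the critical section")


def register_with_cleanup(lock):
    """Same failure, but the semaphore is signalled on the way out."""
    lock.acquire()
    try:
        raise RuntimeError("terminated inside the critical section")
    finally:
        lock.release()


def run_in_thread(fn, lock):
    """Run fn(lock) in its own thread and swallow the simulated failure."""
    def target():
        try:
            fn(lock)
        except RuntimeError:
            pass
    thread = threading.Thread(target=target)
    thread.start()
    thread.join()


# Case 1: the semaphore is left without its signal; every later waiter blocks.
leaky_lock = threading.Semaphore(1)
run_in_thread(register_without_cleanup, leaky_lock)
later_waiter_blocked = not leaky_lock.acquire(timeout=0.2)

# Case 2: cleanup runs even though the holder failed, so later waiters proceed.
safe_lock = threading.Semaphore(1)
run_in_thread(register_with_cleanup, safe_lock)
later_waiter_proceeded = safe_lock.acquire(timeout=0.2)

print("blocked after lost signal:", later_waiter_blocked)
print("proceeds with cleanup:", later_waiter_proceeded)
```

Without the timeout, the waiter in case 1 would hang forever, which matches the picture of two processes both parked in a critical section while the VM idles.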

HTH
Eliot



Re: How to resurrect an unresponsive image?

NorbertHartl

On 08.12.2011 at 19:08, Eliot Miranda wrote:

> OK, as a favour to you.  Next time you do the leg work. But all the info you need is in the dump.  First thing, the active process is the idle process:
>
> Process  0x8f1d2a8 priority 10
> 0xbff60000 M ProcessorScheduler class>idleProcess 64373036: a(n) ProcessorScheduler class
>  0x8f929b0 s [] in ProcessorScheduler class>startUp
>  0x8f1d248 s [] in BlockClosure>newProcess
>
> (you can see from the C stack trace that the VM is looping doing primitiveRelinquishProcessor)
>
> The next process is the finalization process, which has nothing to do.
>
> Then Process  0x8d73268 priority 20 is trying to do a connect but is blocked in a critical section trying to register one or other of the Socket's semaphores.
>
> Then Process  0x8f23b98 priority 20 is trying to do a snapshot and is blocked in a critical section doing SmalltalkImage>clearExternalObjects.

Eliot,

thanks for looking into this. As you can see from my guessing, I did read the stack traces. And it seems I saw the same as you did :)

> So my guesses are either that
>     a) something terminated a process that was in the critical section for registering external objects and the semaphore protecting it is missing a signal, or
>     b) there is a bug in the code and that if a process is in the critical section registering an external object then SmalltalkImage>clearExternalObjects may lock up.

This is the reason I was asking. I'm just not very familiar with most of the code involved, and I expected you to be. These guesses of yours go beyond what I could have guessed myself, and they are very helpful.

> But you can read Smalltalk stack traces as well as I.  Just look at them. There are only 11 of them.

I was fishing for some extra bits of information. And I'm glad I'm not too shy to ask stupid questions or do some random nagging. Sorry for bothering you with this.

Norbert






Re: How to resurrect an unresponsive image?

Schwab,Wilhelm K
In reply to this post by Eliot Miranda-2
Eliot,

It would be nice to have the #name of each process in the dump.  I often end up with many similar/identical processes, and the name is very useful in sorting out which among them has/have strayed.

Re guess (a - unsignaled semaphore), does that perhaps suggest a missing #ensure: block?  Another possibility (just asking) is that we are using a semaphore for mutual exclusion when a mutex would be a better choice?  When I started using threads, I had a robust mutex class "from the start", and the differences between a mutual exclusion semaphore and a mutex were striking.
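To illustrate the #ensure: question in language-neutral terms, here is a minimal Python sketch in which `try/finally` plays the role of `#ensure:`. All names are hypothetical, and `threading.Semaphore(1)` is only a stand-in for a mutual-exclusion semaphore. It does not claim that the Pharo code in question lacks an #ensure: block; it only shows what such an omission would look like.

```python
import threading

# Stand-in for a mutual-exclusion semaphore: one permit, no owner tracking.
table_lock = threading.Semaphore(1)


def register_unprotected(action):
    """Acquire, run, release; the release is skipped if action raises."""
    table_lock.acquire()
    action()
    table_lock.release()


def register_protected(action):
    """Same, but try/finally (the analogue of #ensure:) always releases."""
    table_lock.acquire()
    try:
        action()
    finally:
        table_lock.release()


def failing_action():
    raise RuntimeError("interrupted inside the critical section")


def lock_is_free() -> bool:
    """Probe without blocking: take the permit if available, then return it."""
    if table_lock.acquire(blocking=False):
        table_lock.release()
        return True
    return False
```

After `register_protected(failing_action)` raises, `lock_is_free()` still answers True. After `register_unprotected(failing_action)` raises, the permit is never returned, and every later caller would wait forever.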

Re guess (b - lockup in #clearExternalObjects), Norbert mentioned saving the image in connection with this.  Saving the image is a very "main thread" activity, and as such, there might be a need to queue a deferred action vs. invoking the code from a background thread.

I was going to add something about my reservations on our weak collections, which (IMHO) must be thread safe and self-cleaning, and are not in Squeak/Pharo.  Even in Dolphin's earliest docs, weak collections and finalization were one topic, for good reason.  Toward that end, I looked at ExternalSemaphoreTable, expecting to find it subclassed or using a weak collection of some type.  What I found is a #forMutualExclusion semaphore in a situation where I would use a Mutex.

This looks like a matter of evolution and timing.  Squeak dates back to an era before structured exception handling and improvements like Mutex.  Dolphin got started after Squeak, with either a two- or fifteen-year head start, depending on how one wants to call it.  Dolphin was built from the ground up to have weak collections, finalization, and a full set of process synchronization tools, making Mutex a part of the toolkit when its sockets were written.

We have Mutex now, and probably should be using it more widely than we do.
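One difference often cited between the two is re-entrancy: a mutex tracks its owner, so the owning process can enter again, while a mutual-exclusion semaphore blocks even its current holder. A rough Python sketch of that difference follows, using `threading.RLock` as a stand-in for Mutex and `threading.Semaphore(1)` as a stand-in for a #forMutualExclusion semaphore. The mapping is an analogy, not a statement about the Pharo implementation, and `nested_entry_succeeds` is a hypothetical name.

```python
import threading


def nested_entry_succeeds(lock, timeout: float = 0.2) -> bool:
    """Hold `lock`, then try to take it again from the same thread.

    Returns True if the second acquire succeeds (re-entrant lock) and
    False if it would have blocked (the timeout stands in for a hang).
    """
    with lock:
        got_it = lock.acquire(timeout=timeout)
        if got_it:
            lock.release()
        return got_it
```

Here `nested_entry_succeeds(threading.RLock())` answers True, while `nested_entry_succeeds(threading.Semaphore(1))` answers False after the timeout.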

Bill




From: [hidden email] [[hidden email]] on behalf of Eliot Miranda [[hidden email]]
Sent: Thursday, December 08, 2011 1:08 PM
To: [hidden email]
Subject: Re: [Pharo-project] How to resurrect an unrepsonsive image?

Hi Norbert,

On Thu, Dec 8, 2011 at 2:04 AM, Norbert Hartl <[hidden email]> wrote:
Eliot,

can you take a look at the attached crash.dmp file? My knowledge of the internals is limited, so I'm not a good candidate for getting a feeling for what could have gone wrong.
If I had to guess, I would think there is a deadlock in ExternalSemaphoreTable. It looks to me as if the outgoing network connect tries to register an external object at the same time as snapshot:andQuit: tries to clear the external objects. But I know nothing about how this works.

OK, as a favour to you.  Next time you do the leg work. But all the info you need is in the dump.  First thing, the active process is the idle process:
 
Process  0x8f1d2a8 priority 10
0xbff60000 M ProcessorScheduler class>idleProcess 64373036: a(n) ProcessorScheduler class
 0x8f929b0 s [] in ProcessorScheduler class>startUp
 0x8f1d248 s [] in BlockClosure>newProcess

(you can see from the C stack trace that the VM is looping doing primitiveRelinquishProcessor)
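For dumps with many processes, a small script can list each process with its priority and top frame. The sketch below is Python and assumes only the layout visible in the excerpt above: a `Process <address> priority <n>` header followed by frame lines of the form `<address> <flag> <description>`. Real crash.dmp files contain other sections (the C stack trace, for instance) that this ignores, and `summarize_processes` is a hypothetical name.

```python
import re

# Patterns inferred from the quoted excerpt only; treat them as assumptions.
HEADER = re.compile(r"^Process\s+(0x[0-9a-fA-F]+)\s+priority\s+(\d+)")
FRAME = re.compile(r"^\s*0x[0-9a-fA-F]+\s+\S\s+(.*)$")


def summarize_processes(dump_text: str):
    """Return [(address, priority, top_frame_or_None), ...] in dump order."""
    processes = []
    for line in dump_text.splitlines():
        header = HEADER.match(line)
        if header:
            processes.append([header.group(1), int(header.group(2)), None])
            continue
        frame = FRAME.match(line)
        # Only the first frame after a header is kept as the top frame.
        if frame and processes and processes[-1][2] is None:
            processes[-1][2] = frame.group(1).strip()
    return [tuple(p) for p in processes]
```

Fed the four lines quoted above, it yields a single entry for process 0x8f1d2a8 at priority 10 whose top frame is the idleProcess line.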

The next process is the finalization process, which has nothing to do.

Then Process  0x8d73268 priority 20 is trying to do a connect but is blocked in a critical section trying to register one or other of the Socket's semaphores.

Then Process  0x8f23b98 priority 20 is trying to do a snapshot and is blocked in a critical section doing SmalltalkImage>clearExternalObjects.

So my guesses are either that
    a) something terminated a process that was in the critical section for registering external objects and the semaphore protecting it is missing a signal, or
    b) there is a bug in the code such that if a process is in the critical section registering an external object then SmalltalkImage>clearExternalObjects may lock up.
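Guess (a) can be reproduced in miniature. The following Python sketch is an analogy, not the Pharo code: a thread takes the only permit of a semaphore and ends without giving it back, after which two later callers, named "connect" and "snapshot" here purely to mirror the two blocked processes in the dump, both fail to enter. The timeout stands in for blocking forever, and all names are hypothetical.

```python
import threading


def simulate(timeout: float = 0.2) -> list:
    """Return the names of the callers that could not enter the critical section."""
    lock = threading.Semaphore(1)  # one permit, like a mutual-exclusion semaphore
    blocked = []

    def terminated_holder():
        # Takes the permit and ends without signalling it back (guess a).
        lock.acquire()

    def try_enter(name: str):
        # A real waiter would block forever; the timeout makes that observable.
        if lock.acquire(timeout=timeout):
            lock.release()
        else:
            blocked.append(name)

    holder = threading.Thread(target=terminated_holder)
    holder.start()
    holder.join()

    waiters = [
        threading.Thread(target=try_enter, args=(name,))
        for name in ("connect", "snapshot")
    ]
    for waiter in waiters:
        waiter.start()
    for waiter in waiters:
        waiter.join()
    return sorted(blocked)
```

Calling `simulate()` returns both names, which matches the shape of the dump: the holder is gone, and everyone who arrives later waits on a signal that never comes.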

But you can read Smalltalk stack traces as well as I.  Just look at them. There are only 11 of them.

HTH
Eliot


Norbert


