seaside fastcgi gems that no longer accept connections

Johan Brichau-2
Hi,

I wonder if anyone has investigated the issue that a seaside gem sometimes stops accepting fastcgi connections?
The gem is still running but it is no longer accepting connections (fastcgi). We also get the same behavior with service gems we are starting (for processing WAGemStoneServiceTask).

At times, we also get gems that are terminated by the stone because they did not respond quickly enough to the sigAbort. Because this also happens when there is no load on the server, I assume the gem is simply blocked in those cases as well.

I have tried to investigate by sending kill -USR1 to the gem, but that does not tell me much (i.e. I only get the Smalltalk stack, which shows it is in the _reapEvent method).

Any clues to how to investigate and solve such problems are appreciated.

best regards,
Johan

Re: seaside fastcgi gems that no longer accept connections

NorbertHartl

On 16.12.2011 at 13:46, Johan Brichau wrote:

> Any clues to how to investigate and solve such problems are appreciated.

I only have a workaround. Do you remember asking me why I send a fake FastCGI request in monit? This is the reason. You can at least detect the problem and restart the gem until it is resolved. You need to put

# Empty FastCGI request
   if failed port 6001
     # Send FastCGI packet: version 1 (0x01), cmd FCGI_GET_VALUES (0x09)
     # padding 8 bytes (0x08), followed by 8xNULLs padding
     send "\0x01\0x09\0x00\0x00\0x00\0x00\0x08\0x00\0x00\0x00\0x00\0x00\0x00\0x00\0x00\0x00"
     # Expect FastCGI packet: version 1 (0x01), resp FCGI_GET_VALUES_RESULT (0x0A)
     expect "\0x01\0x0A"
     timeout 5 seconds
   then restart

in your monit configuration. 
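
For anyone who wants to run the same probe by hand, here is a minimal Python sketch of what the rule above sends and expects. The function names and the default host are assumptions made for illustration; only the byte layout and port 6001 come from the monit rule. It builds the 16-byte FCGI_GET_VALUES packet and checks that the reply starts with version 1 and type FCGI_GET_VALUES_RESULT.

```python
import socket
import struct

FCGI_VERSION_1 = 1
FCGI_GET_VALUES = 9
FCGI_GET_VALUES_RESULT = 10

# FastCGI record header: version, type, requestId (2 bytes),
# contentLength (2 bytes), paddingLength, reserved.
FCGI_HEADER = struct.Struct("!BBHHBx")


def build_probe() -> bytes:
    """Return the same 16 bytes the monit rule sends (hypothetical helper)."""
    header = FCGI_HEADER.pack(FCGI_VERSION_1, FCGI_GET_VALUES, 0, 0, 8)
    return header + b"\x00" * 8  # 8 bytes of padding, no content


def is_get_values_result(reply: bytes) -> bool:
    """True if the reply starts with version 1 and type 0x0A."""
    return (
        len(reply) >= 2
        and reply[0] == FCGI_VERSION_1
        and reply[1] == FCGI_GET_VALUES_RESULT
    )


def probe(host: str = "127.0.0.1", port: int = 6001, timeout: float = 5.0) -> bool:
    """Send the probe and report whether the gem answered in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(build_probe())
            reply = sock.recv(8)
    except OSError:  # refused, reset, or timed out
        return False
    return is_get_values_result(reply)
```

Calling `probe()` against a gem that has stopped accepting connections should return False once the timeout expires, which is the condition monit reacts to.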

Norbert

Re: seaside fastcgi gems that no longer accept connections

Johan Brichau-2
Hi Norbert,

I forgot to mention it, but we are doing the same thing: we started after you mentioned it to me.

However, I'm not comfortable with the problem itself. Since we get better performance using sticky sessions, a user notices when his gem has died, because monit takes some time to detect it.

Johan

Re: seaside fastcgi gems that no longer accept connections

Johan Brichau-2
In reply to this post by NorbertHartl
Norbert,

Are you also seeing gems die with the error 'failed to respond quickly enough to a sigabort'?

Of course, monit notices that too and restarts the gem, but I wonder if the cause could be the same, i.e. the process that should respond to the request is no longer running (maybe?).

Re: seaside fastcgi gems that no longer accept connections

NorbertHartl

On 16.12.2011 at 15:00, Johan Brichau wrote:

> Norbert,
>
> Are you also getting the error that a gem dies because of the 'failed to respond quickly enough to a sigabort' ?
>
Johan,

I did a quick scan of the logs and found nothing. Lately I haven't experienced many gem hiccups. To be honest, I haven't looked closely for some time because they occur so rarely. Sorry.

Norbert

Re: seaside fastcgi gems that no longer accept connections

Dale Henrichs
In reply to this post by Johan Brichau-2
Johan,

I think that part of the problem is that you might have a heavily loaded machine....

In the standard startSeaside30Adaptor script there is a static exception handler installed to handle the SigAbort message from the stone, and a thread forked at the lowest priority that wakes up at 30-second intervals to give an otherwise idle vm a chance to handle the sigAbort. If the only thread in a vm is waiting on an accept (the standard situation for a web server), then sigAborts might be ignored ...

The fact that your gems aren't responding to the sigAbort means that the sigAbort thread isn't getting a chance to run in a timely manner ... Assuming that all of the threads are alive and well, this sort of situation can occur if the machine is heavily loaded ... I don't think that Linux is necessarily completely fair in its scheduling algorithm, and on busy systems with lots of processes these restless processes can end up at the bottom of the list and not get run as often as they should.

This same thing could be affecting your service vm, since I think they sit on a restless thread, too.

For fastcgi, if you poke around in FSGsSocketServer there aren't too many places that aren't covered by an exception handler of some sort that logs things directly to the gem log ... however ....

...if you look at FSConnection>>handleFSError:, the errors that are protected there are only logged to the stone log and not to the gem log, so it is worth taking a peek at your stone log to see if any error messages show up there ...

Another scenario would be that the accept connections are somehow not accepting connections and not logging an error and not signalling the gateway semaphore ... once 10 of these guys got in this situation, the gem would "stop accepting connections"

GsSocket>>listen:acceptingWith: will spin in an infinite loop trying to do a SocketAccept so it is not inconceivable that an error in that sequence could essentially plug up the works ...

The USR1 signal only lists the active process, so we don't know the state of the idle threads ...

It's possible that we could put a trigger into the sigabort thread ... for example, check the contents of a file on disk and, if the file is not empty, run some Smalltalk code to poke around in the image:

  WAFastCGIAdaptor default

That gets you to the adaptor instance. Its server instance variable is an instance of FSGsSocketServer, so one could poke around in that guy and check the status of the gatewaySemaphore ... There's an #addConnection: call in FSGsSocketServer when a connection is made (when accept triggers) that does nothing right now, but you could hang onto the last 15 connections for analysis purposes, just to see ...
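
To make the shape of that idea concrete, here is a rough sketch in plain Python rather than Smalltalk. Nothing in it is GemStone or Seaside API: `remember`, `trigger_requested` and `periodic_check` are hypothetical names. It keeps only the 15 most recent connections and dumps them when a trigger file on disk is non-empty.

```python
import os
from collections import deque

# Bounded history: appending a 16th entry silently drops the oldest one.
recent_connections = deque(maxlen=15)


def remember(connection) -> None:
    """Would be called from the hook that fires when an accept succeeds."""
    recent_connections.append(connection)


def trigger_requested(path: str) -> bool:
    """True if the trigger file exists and is not empty."""
    try:
        return os.path.getsize(path) > 0
    except OSError:  # a missing file means no trigger
        return False


def periodic_check(path: str, dump) -> bool:
    """Would be called from the periodic low-priority loop."""
    if not trigger_requested(path):
        return False
    dump(list(recent_connections))
    with open(path, "w"):
        pass  # truncate the file so the dump fires only once per request
    return True
```

Writing anything into the trigger file (for example with `echo x > file`) would then cause one dump on the next wake-up of the loop.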

If we learn a few things from that we can move on from there...

Dale

Re: seaside fastcgi gems that no longer accept connections

Johan Brichau-2
Hi Dale,

I managed to trace the unresponsiveness problem of my service vm.
It turns out that the service vm stopped processing tasks whenever a service task hit a commit conflict. This put the entire service vm's transactional state in 'conflict' mode, prohibiting any new 'begin' from proceeding without an abort first.

After putting the correct transactional handlers in the service task block, I got things rolling correctly.

Would it be worthwhile to include an abort in the service vm before processing any task? (similar to how the seaside gems abort when a request comes in)
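
As a toy illustration of why an abort before each task would help, here is a small Python model. It is not GemStone code: `ToySession` is a hypothetical stand-in whose only rule is the one described above, namely that once a commit has hit a conflict, every later commit is refused until an abort.

```python
class ConflictError(Exception):
    """Raised by the toy session when a commit cannot succeed."""


class ToySession:
    """Hypothetical stand-in for a vm's transactional state."""

    def __init__(self):
        self.in_conflict = False
        self.committed = []
        self._pending = []

    def abort(self):
        # Discard uncommitted changes and clear the conflict state.
        self._pending.clear()
        self.in_conflict = False

    def write(self, value):
        self._pending.append(value)

    def commit(self, simulate_conflict=False):
        if simulate_conflict:
            self.in_conflict = True
        if self.in_conflict:
            raise ConflictError("commit refused until abort")
        self.committed.extend(self._pending)
        self._pending.clear()


def conflicting_task(session):
    session.write("a")
    session.commit(simulate_conflict=True)


def normal_task(session):
    session.write("b")
    session.commit()


def run_all(abort_first):
    """Run both tasks on one session and return (results, committed)."""
    session = ToySession()
    results = []
    for task in (conflicting_task, normal_task):
        if abort_first:
            session.abort()  # start every task from a clean state
        try:
            task(session)
            results.append(True)
        except ConflictError:
            results.append(False)
    return results, session.committed
```

In this model `run_all(False)` fails both tasks, because the second one inherits the conflict state, while `run_all(True)` lets the second task commit normally.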

Once I discovered this commit conflict, I was able to fix it quickly. However, finding out that these commit conflicts were occurring was really difficult. I only found them in the service gem log after doing a "kill -USR1", which flushed the log and printed the commit conflict messages before printing the stack. Afterwards, I also made a Smalltalk mistake in the abort handling code I was writing (a DNU), but the error never showed up anywhere (no log entry, no gem crash, nada). Is it because the tasks are processed in forked processes that such errors mysteriously disappear?

I hope I'm making sense (I have been fiddling with this problem for too long today). Now I just want to know why I had not seen these commit conflicts popping up before ;-)

cheers
Johan

Re: seaside fastcgi gems that no longer accept connections

Johan Brichau-2

On 29 Dec 2011, at 20:40, Johan Brichau wrote:

> Now I just want to know why I had not seen these commit conflicts popping up before ;-)

Hm.. I think I nailed that one too:
- there was an error handler that caught any exception in the service task to make it an object log entry
- but making an object log entry requires a commit
- but the vm is in a conflict state

(did not verify, but it makes sense)

Mind boggling sometimes… ;-)

cheers
Johan