[Glass] out of resources


otto
Hi,

With GS 3.1 we're running out of semaphores and shared memory because
the system resources are not freed up.

When we run ipcs, we get lists of shared memory segments (about 1GB
each) that report no attached processes. When we use ipcrm -m <id>,
the memory is freed.

ipcs -s shows a long list of semaphore arrays. When using ipcs -s -i
<array id>, we see that the referenced processes are dead.
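A minimal sketch of the semaphore check described above, for anyone who wants to automate it. It assumes the util-linux layout of `ipcs -s -i <id>`, where the output ends in a table whose fifth column is the pid, and a Linux /proc filesystem. The helper names `sem_pids` and `dead_pids` are made up for illustration and are not GemStone tools.

```shell
#!/bin/sh
# Extract the pid column from `ipcs -s -i <semid>` output read on stdin.
# Assumption: data rows have exactly five numeric-leading fields
# (semnum value ncount zcount pid), as printed by util-linux ipcs.
sem_pids() {
    awk 'NF == 5 && $1 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ && $5 > 0 { print $5 }' | sort -un
}

# Print every pid read on stdin that no longer has a /proc entry (Linux only).
dead_pids() {
    while read -r pid; do
        [ -d "/proc/$pid" ] || echo "$pid"
    done
}

# Usage sketch: list dead pids referenced by each semaphore array.
# for id in $(ipcs -s | awk '$1 ~ /^0x/ { print $2 }'); do
#     ipcs -s -i "$id" | sem_pids | dead_pids
# done
```

Both helpers only read; nothing is removed.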

This happens on our jenkins machines where we start & stop GS a lot.
(Jobs running tests restore from a built GS backup.) We think we are
using stopstone properly (with waitstone to make sure it is stopped,
etc.). We need to investigate properly and make sure the jenkins jobs
do this in the way we are expecting.

I was hoping someone could give us some ideas on this. Perhaps there's a
GS flag or an OS setting that we're missing. Your insights are
appreciated.

Thanks
Otto
_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass

Re: [Glass] out of resources

Dale Henrichs-3
Otto,

The way that we use shared memory resources, we have things set up so that the resources are deallocated when the last process detaches from the cache.

So I would look around the system for rogue stoned, topaz or gem processes that may be hung or refusing to quit. If you find some hanging around, note their process ids and track down their log files. There should be information there as to why they are hung.

Another thing is that if you 'kill -9' the shrpc monitor process, the shared memory segment will not be told to clean up when the last process detaches, and the shared memory/semaphores will be left around.

While not recommended, it is "safe" to kill -9 the stoned process as a last resort. You will lose any transactions that are in progress and will need to restore from tranlogs on restart, but kill -9 on stoned should not corrupt the db. The shrpc monitor process is really the only process that it is not "safe" to use kill -9 on, but even then the db will not be corrupted: as with killing the stone, you will lose any transactions in progress AND you will leave shared memory resources around to be cleaned up manually.

Dale



Re: [Glass] out of resources

otto
Thanks for the response

> The way that we use shared memory resources, we have things setup so that
> the resources are deallocated when the last process detaches from the cache.
>
> So I would look around the system for rogue stoned, topaz or gem processes
> that may be hung/refusing to quit ... if you find some hanging around note
> their process ids and track down their log file...there should be
> information there as to why they are hung ...

There are no rogue stoned, topaz, gem or shared cache monitor
processes hanging around. The output from ipcs -m tells me that there
are no processes attached, e.g.:

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x6a0180a5 819200     wonka      660        1059602432 0
0x16018585 1146881    wonka      660        1059602432 0
0x950180a5 163844     wonka      660        1059602432 0
0xfd0180a5 458757     wonka      660        1059602432 0

The command

ipcrm -m 819200

frees up the memory instantly (the docs say that it frees up only
after the last process detaches), which is consistent with the nattch
count of 0.
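A minimal sketch of how the manual cleanup above could be scripted, assuming the column layout shown in the ipcs -m output (shmid in column 2, owner in column 3, nattch in column 6). The function name `orphaned_shmids` is hypothetical. It only prints candidate ids; the removal step is left commented out so the list can be reviewed first.

```shell
#!/bin/sh
# Print the shmid of every shared memory segment owned by the given user
# that has no attached processes (nattch == 0).
# Reads `ipcs -m` output on stdin, so it can be checked against saved output.
orphaned_shmids() {
    awk -v owner="$1" '$1 ~ /^0x/ && $3 == owner && $6 == 0 { print $2 }'
}

# Dry run: show which segments would be removed.
# ipcs -m | orphaned_shmids wonka

# Removal, only after reviewing the dry-run list:
# ipcs -m | orphaned_shmids wonka | while read -r id; do ipcrm -m "$id"; done
```

Filtering on owner as well as nattch keeps the script away from segments that belong to other users or services on the same machine.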

> Another thing is if you 'kill -9' the shrpc monitor process, the shared
> memory segment will not be told to clean up when the last process detaches
> and the shared memory/semaphores will be left around.

No shrpc process around

> While not recommended, it is "safe" to kill -9 the stoned process as a last
> resort, you will lose any transactions that are in progress and will need to
> restore from tranlogs on restart, but kill -9 on stoned should not corrupt
> the db  ... the shrpc monitor process is really the only process that it is
> not "safe" to use kill -9 on but even then the db will not be corrupted,
> like killing the stone, you will lose any transactions in progress AND you
> will leave shared memory resources around to be cleaned up manually ...

AFAIK we are not killing processes. But we should have a very close
look to make sure.

So can I conclude from your response that it is unexpected for these
resources not to be freed up?

Thanks
Otto

Re: [Glass] out of resources

Dale Henrichs-3
You are correct ... in normal operation the resources should be cleaned up ... 

Another thing to consider is that if you are running on Linux and get low on swap space, Linux will start killing off processes. In recent versions of 3.x we have made the necessary system calls so that the OS won't kill the important GemStone processes, but I'm not exactly sure which version these calls were put in. If you look in the system logs, there should be messages listing the pids that have been killed.

Dale



Re: [Glass] out of resources

nrgiii
In reply to this post by otto
Otto,

Yes, this is unexpected.  You should only see stuck shared memory segments and semaphore arrays if you kill -9 the SPC monitor process.

You should see if you can reproduce it:

  1. ipcs -a to verify there are no IPC resources in use
  2. start the stone with startstone
  3. stop the stone with stopstone (or however you normally shut down GemStone)
  4. Repeat ipcs -a to see if the resources were released.  They should be.
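The four steps above can be sketched as a before/after comparison. This is only an outline under assumptions: `count_ipc` and `snapshot` are hypothetical helper names, the owner is taken from column 3 of the ipcs output, and the startstone/stopstone lines are placeholders to be replaced with whatever invocation your jobs actually use.

```shell
#!/bin/sh
# Count IPC objects owned by a user in `ipcs -m` / `ipcs -s` output on stdin.
# Data rows are recognised by their hexadecimal key in column 1.
count_ipc() {
    awk -v owner="$1" '$1 ~ /^0x/ && $3 == owner { n++ } END { print n + 0 }'
}

# Count shared memory segments plus semaphore arrays for a user right now.
snapshot() {
    { ipcs -m; ipcs -s; } | count_ipc "$1"
}

# Usage sketch (commented out because it starts and stops a stone):
# before=$(snapshot "$USER")
# ...your startstone invocation here...
# ...your stopstone invocation here...
# after=$(snapshot "$USER")
# [ "$before" = "$after" ] || echo "leaked IPC objects: before=$before after=$after"
```

Running the same comparison around a Jenkins job should show which job leaves resources behind.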

Norm


Re: [Glass] out of resources

otto
Thanks. Not doing that explicitly, so it may be something else. Still looking...


Re: [Glass] out of resources

otto
In reply to this post by Dale Henrichs-3
> Another thing to consider is that if you are running on linux and get low on
> swap space, linux will start killing off processes ... in recent versions of
> 3.x we have made the necessary system calls so that the os won't kill the
> important gemstone processes, but I'm not exactly sure what version these
> calls were put in ... if you look in the system logs, there should be
> messages listing the pids that have been killed ...

This may well be the issue. We've had a run-in with the OOM killer
because we're overcommitting memory on the VMs. We're busy fixing
that. But I have not yet found a log on the VM that tells me anything.
We've only seen the VM host kill the VM so far.

We're getting

Write to /proc/349/oom_score_adj failed with EACCES , linux user does
not have CAP_SYS_RESOURCE
No server process protection from OOM killer

in the pcmon and pagemanager logs. I think this happens once the
resources are unavailable.
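A small sketch for inspecting the value that the log message refers to, assuming a Linux /proc filesystem. `show_oom_adj` is a hypothetical helper; it takes pids as arguments and only reads /proc/<pid>/oom_score_adj. Lowering that value is what requires CAP_SYS_RESOURCE, so a protected process would be expected to show a negative number, while 0 is the kernel default.

```shell
#!/bin/sh
# Print "<pid> <oom_score_adj>" for each pid given as an argument,
# or "<pid> unavailable" if the process or the /proc file does not exist.
show_oom_adj() {
    for pid in "$@"; do
        f="/proc/$pid/oom_score_adj"
        if [ -r "$f" ]; then
            printf '%s %s\n' "$pid" "$(cat "$f")"
        else
            printf '%s unavailable\n' "$pid"
        fi
    done
}

# Usage sketch: pass the pids of the running GemStone processes,
# for example the pid that appears in the log message.
# show_oom_adj 349
```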

Re: [Glass] out of resources

nrgiii
That sounds right.  The OOM killer is effectively a kill -9 and the SPC
monitor is probably the biggest memory hog on your system since it
creates and owns the shared page cache.

Norm


Re: [Glass] out of resources

Martin McClure-5
In reply to this post by otto
On 05/08/2014 09:27 AM, Otto Behrens wrote:

>
> This may well be the issue. We've had a run in with the OOM killer
> because we're over committing memory on the VM's. We're busy fixing
> that. But I could not find a log on the VM that tells me something
> yet. We've only seen the VM host kill the VM so far.
>
> We're getting
>
> Write to /proc/349/oom_score_adj failed with EACCES , linux user does
> not have CAP_SYS_RESOURCE
> No server process protection from OOM killer
>
> in the pcmon and pagemanager logs. I think this happens once the
> resources are unavailable.

These messages are output at startup of the process if we are unable to
set up protection from the OOM killer. They do not necessarily mean that
the OOM killer has killed the process, though that seems likely given
the symptoms.

If the OOM killer has been killing processes, the system log
(/var/log/syslog or /var/log/messages, depending on the Linux distro)
should contain entries showing which processes were killed and when.
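A minimal sketch of that log search. The patterns are the phrases the Linux kernel commonly logs when the OOM killer runs; exact wording varies by kernel version, so treat them as an assumption and widen the search if nothing matches. `oom_events` is a hypothetical helper name.

```shell
#!/bin/sh
# Filter log text on stdin down to lines that look like OOM killer activity.
oom_events() {
    grep -Ei 'invoked oom-killer|out of memory|killed process'
}

# Usage sketch: check whichever system log exists, plus the kernel ring buffer.
# for f in /var/log/syslog /var/log/messages; do
#     [ -r "$f" ] && oom_events < "$f"
# done
# dmesg | oom_events
```

Rotated logs (for example syslog.1) may need checking too if the kills happened some time ago.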

Regards,

-Martin

Re: [Glass] out of resources

otto
Thanks for all the answers. Not finding anything useful in the logs so
far. Still hunting...
