Hi,
With GS 3.1 we're running out of semaphores and shared memory because the system resources are not freed. When we run ipcs, we get lists of shared memory segments (about 1GB each) that report no attached processes. When we use ipcrm -m <id>, the memory is freed. ipcs -s shows a long list of semaphore arrays, and ipcs -s -i <array id> shows that the referenced processes are dead.

This happens on our Jenkins machines, where we start and stop GS a lot. (Jobs running tests restore from a built GS backup.) We think we are using stopstone properly (with waitstone to make sure it is stopped, etc.), but we need to investigate and make sure the Jenkins jobs do this the way we expect.

I was hoping someone could give us some ideas on this. Perhaps there's a GS flag or an OS setting that we're missing. Your insights are appreciated.

Thanks
Otto
_______________________________________________
Glass mailing list
[hidden email]
http://lists.gemtalksystems.com/mailman/listinfo/glass
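A quick way to spot the leaked segments Otto describes is to filter the ipcs -m listing for segments whose nattch count is zero. This is a sketch against the usual Linux ipcs -m column layout (key, shmid, owner, perms, bytes, nattch, status); the owner argument is whatever user runs your GemStone cache.

```shell
# Print the shmid of every shared memory segment owned by a given user
# that has no attached processes (nattch == 0).
orphaned_segments() {
    # expects `ipcs -m` style output on stdin
    awk -v owner="$1" '$3 == owner && $6 == 0 { print $2 }'
}

# On a live system:
#   ipcs -m | orphaned_segments "$USER"
```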
Otto,

The way we use shared memory resources, things are set up so that the resources are deallocated when the last process detaches from the cache. So I would look around the system for rogue stoned, topaz or gem processes that may be hung or refusing to quit. If you find some hanging around, note their process ids and track down their log files; there should be information there as to why they are hung.
Another thing: if you kill -9 the shrpc monitor process, the shared memory segment will not be told to clean up when the last process detaches, and the shared memory/semaphores will be left around.
While not recommended, it is "safe" to kill -9 the stoned process as a last resort. You will lose any transactions that are in progress and will need to restore from tranlogs on restart, but kill -9 on stoned should not corrupt the db. The shrpc monitor process is really the only process that is not "safe" to kill -9, but even then the db will not be corrupted; as with killing the stone, you will lose any transactions in progress, AND you will leave shared memory resources around to be cleaned up manually.
Dale
Thanks for the response.

> The way we use shared memory resources, things are set up so that the
> resources are deallocated when the last process detaches from the cache.
> So I would look around the system for rogue stoned, topaz or gem processes
> that may be hung or refusing to quit.

There are no rogue stoned, topaz, gem or shared cache monitor processes hanging around. The output from ipcs -m tells me that there are no processes attached, e.g.:

------ Shared Memory Segments --------
key        shmid    owner  perms  bytes       nattch  status
0x6a0180a5 819200   wonka  660    1059602432  0
0x16018585 1146881  wonka  660    1059602432  0
0x950180a5 163844   wonka  660    1059602432  0
0xfd0180a5 458757   wonka  660    1059602432  0

The command

ipcrm -m 819200

frees up the memory instantly (the docs say it is freed only after the last process detaches), which is consistent with the nattch count of 0.

> Another thing: if you kill -9 the shrpc monitor process, the shared memory
> segment will not be told to clean up when the last process detaches, and
> the shared memory/semaphores will be left around.

No shrpc process around.

> While not recommended, it is "safe" to kill -9 the stoned process as a
> last resort ...

AFAIK we are not killing processes. But we should have a very close look to make sure.

What I can conclude from your response is that it is unexpected, then, that these resources are not freed up?
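Once rogue processes have been ruled out, cleaning up the dead segments can be scripted rather than done one ipcrm at a time. A hedged sketch: it only prints the ipcrm commands unless you pass --force, since running ipcrm against a segment that a live cache is still using would be destructive.

```shell
# Remove (or just list) every 0-nattch shared memory segment for a user.
# Dry-run by default; pass --force as the second argument to execute.
cleanup_orphans() {
    # expects `ipcs -m` output on stdin, e.g. `ipcs -m | cleanup_orphans wonka`
    owner=$1
    mode=${2:-dry-run}
    awk -v o="$owner" '$3 == o && $6 == 0 { print $2 }' |
    while read -r shmid; do
        if [ "$mode" = "--force" ]; then
            ipcrm -m "$shmid"
        else
            echo "would run: ipcrm -m $shmid"
        fi
    done
}
```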
Thanks
Otto
You are correct; in normal operation the resources should be cleaned up.

Another thing to consider: if you are running on Linux and get low on swap space, Linux will start killing off processes. In recent versions of 3.x we have made the necessary system calls so that the OS won't kill the important GemStone processes, but I'm not exactly sure which version these calls were put in. If you look in the system logs, there should be messages listing the pids that have been killed.
Dale
Otto,
Yes, this is unexpected. You should only see stuck shared memory segments and semaphore arrays if you kill -9 the SPC monitor process. You should see if you can reproduce it:

ipcs -a to verify there are no IPC resources in use
start the stone with startstone
stop the stone with stopstone (or however you normally shut down GemStone)
Repeat ipcs -a to see if the resources were released. They should be.
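Norm's reproduce check can be wrapped in a soak loop for the Jenkins case. A sketch only: it assumes startstone/stopstone/waitstone are on the PATH and $GEMSTONE is configured, and the stone name plus the DataCurator/swordfish login are placeholders to replace with your site's values.

```shell
# Start and stop the stone repeatedly, checking after each cycle that no
# 0-nattch shared memory segments are left behind.
soak_test() {
    stone=$1
    runs=${2:-5}
    for i in $(seq "$runs"); do
        startstone "$stone" || return 1
        waitstone "$stone"                         # block until the stone is up
        stopstone "$stone" DataCurator swordfish   # placeholder login
        leftover=$(ipcs -m | awk -v o="$USER" '$3 == o && $6 == 0' | wc -l | tr -d ' ')
        if [ "$leftover" -ne 0 ]; then
            echo "run $i left $leftover segment(s) behind"
            return 1
        fi
    done
    echo "all $runs runs clean"
}
```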
Norm

On 5/8/14, 8:32, Otto Behrens wrote:
> What I can conclude from your response is that it is unexpected, then,
> that these resources are not freed up?
Thanks. Not doing that explicitly, so it may be something else. Still looking...
On Thu, May 8, 2014 at 6:14 PM, Norm Green <[hidden email]> wrote:
> Otto,
>
> Yes, this is unexpected. You should only see stuck shared memory segments
> and semaphore arrays if you kill -9 the SPC monitor process.
>
> You should see if you can reproduce it:
>
> ipcs -a to verify there are no IPC resources in use
> start the stone with startstone
> stop the stone with stopstone (or however you normally shut down GemStone)
> Repeat ipcs -a to see if the resources were released. They should be.
>
> Norm
> Another thing to consider: if you are running on Linux and get low on swap
> space, Linux will start killing off processes ... in recent versions of 3.x
> we have made the necessary system calls so that the OS won't kill the
> important GemStone processes ... if you look in the system logs, there
> should be messages listing the pids that have been killed ...

This may well be the issue. We've had a run-in with the OOM killer because we're overcommitting memory on the VMs. We're busy fixing that. But I could not find a log on the VM that tells me anything yet; we've only seen the VM host kill the VM so far.

We're getting

Write to /proc/349/oom_score_adj failed with EACCES , linux user does not have CAP_SYS_RESOURCE
No server process protection from OOM killer

in the pcmon and pagemanager logs. I think this happens once the resources are unavailable.
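The EACCES message refers to Linux's per-process OOM score adjustment in /proc. A small inspection sketch (pid 349 in the log is just whatever the cache monitor happened to get; the helper takes any /proc/<pid> path and degrades gracefully when the file is missing):

```shell
# Read a process's oom_score_adj: -1000 fully shields it from the OOM
# killer, +1000 makes it the preferred victim. Writing a negative value
# requires root or CAP_SYS_RESOURCE, hence the EACCES in the logs.
read_oom_adj() {
    cat "$1/oom_score_adj" 2>/dev/null || echo "unavailable"
}

# Inspect the current shell:
#   read_oom_adj /proc/$$
# As root, a process could be shielded with:
#   echo -1000 > /proc/<pid>/oom_score_adj
```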
That sounds right. The OOM killer is effectively a kill -9, and the SPC monitor is probably the biggest memory hog on your system, since it creates and owns the shared page cache.

Norm

On 5/8/14, 9:27, Otto Behrens wrote:
> We're getting
>
> Write to /proc/349/oom_score_adj failed with EACCES , linux user does
> not have CAP_SYS_RESOURCE
> No server process protection from OOM killer
>
> in the pcmon and pagemanager logs. I think this happens once the
> resources are unavailable.
On 05/08/2014 09:27 AM, Otto Behrens wrote:
> We're getting
>
> Write to /proc/349/oom_score_adj failed with EACCES , linux user does
> not have CAP_SYS_RESOURCE
> No server process protection from OOM killer
>
> in the pcmon and pagemanager logs. I think this happens once the
> resources are unavailable.

These messages are output at startup of the process if we are unable to set up protection from the OOM killer. They do not necessarily mean that the OOM killer has killed the process, though that seems likely given the symptoms.

If the OOM killer has been killing processes, the system log (/var/log/syslog or /var/log/messages, depending on the Linux distro) should contain entries showing which processes were killed and when.

Regards,

-Martin
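For the log hunt Martin suggests, the kernel's OOM kills show up with a few recognizable phrases. A grep sketch; the patterns reflect typical kernel messages, and the exact wording varies by kernel version:

```shell
# Print OOM-killer related lines from kernel/system log text on stdin.
scan_oom() {
    grep -Ei 'out of memory|oom-killer|killed process'
}

# Usage:
#   dmesg | scan_oom
#   scan_oom < /var/log/syslog
```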
Thanks for all the answers. Not finding anything useful in the logs so far. Still hunting...

On Thu, May 8, 2014 at 7:12 PM, Martin McClure <[hidden email]> wrote:
> If the OOM killer has been killing processes, the system log
> (/var/log/syslog or /var/log/messages, depending on the Linux distro)
> should contain entries showing which processes were killed and when.