Guidance with unresponsive Seaside Images needed

Guidance with unresponsive Seaside Images needed

jtuchel
Hi there,

We're running a Seaside / Glorp based application in production.

From time to time, an image freezes. It slows down noticeably over two or three requests, comes to a standstill, and no longer responds to HTTP requests.
There is a background task in the image that touches a file every second. This process continues to work.

When this happens, there is no CPU or RAM shortage on the server (Ubuntu Linux). The only obvious thing is that top and htop show the SHR memory rises towards or above 1 Gigabyte.

So my theory is that we're leaking memory and/or running out of newspace, so that there is no chance to either create new objects or throw an exception. The image can remain in that state for hours without exiting, dumping or anything.

So here's my question: how do I start analysing this? What can I do about it? How can I at least prove or disprove my thesis about newspace? I added a log entry, written frequently, that lists the available memory segments.

It's a cross-packaged headless image on Linux. No remote debugging.

If I understand correctly, VAST by default grows the newspace by 256 K each time it has to grow it. That seems very small for our scenario. When Glorp reads an object, it produces two copies of it: one to work with and one that keeps the initial values (for rollback and change detection), and our users read a few thousand objects during their work. So maybe we just need to tweak some of the memory settings. But I don't know what values would be useful.

Of course there is a chance we produce memory leaks in our code. Maybe we do something wrong in our external calls. But we get no exceptions, so it's something that silently makes the situation worse and worse over time. How do I start finding out?

Any hints, tips, or guidance are highly appreciated.

Joachim

--
You received this message because you are subscribed to the Google Groups "VA Smalltalk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/va-smalltalk.
For more options, visit https://groups.google.com/d/optout.

Re: Guidance with unresponsive Seaside Images needed

Seth Berman
Hello Joachim,

So the image can't really be frozen if there is a background task that continues to work. A frozen image implies the VM interpreter is no longer making forward progress.
I'm trying to decode what you wrote, but it sounds like the Smalltalk HTTP server process is going along just fine until 2 or 3 requests come in (at the same time?), which grinds everything to a halt in that HTTP process.
If this is true, it sounds like lock contention on some external resource, possibly some circular lock dependency leading to deadlock.
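
As an illustration of what a circular lock dependency looks like, here is a minimal sketch in Python (not VA Smalltalk; the function and the names `request-1` and `request-2` are hypothetical). Two workers take the same two locks in opposite order. The timeout is only there so the demo terminates; with plain blocking acquires both workers would wait forever while the rest of the program keeps running.

```python
import threading

def circular_wait_demo(timeout=0.2):
    """Two workers take two locks in opposite order; report who got the second lock."""
    lock_a, lock_b = threading.Lock(), threading.Lock()
    both_hold_first = threading.Barrier(2)  # both workers own their first lock
    both_tried = threading.Barrier(2)       # nobody releases until both have tried
    got_second = {}

    def worker(name, first, second):
        with first:
            both_hold_first.wait()
            # The other worker owns `second` and is waiting for `first`.
            acquired = second.acquire(timeout=timeout)
            got_second[name] = acquired
            if acquired:
                second.release()
            both_tried.wait()

    threads = [
        threading.Thread(target=worker, args=("request-1", lock_a, lock_b)),
        threading.Thread(target=worker, args=("request-2", lock_b, lock_a)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second

if __name__ == "__main__":
    print(circular_wait_demo())
```

Neither worker obtains its second lock, which is the pattern to look for in the stack traces of the stuck request handlers.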

If it were me, I would want to use some separate background task (which you say still seems to work) to monitor what is going on in the HTTP Server task (or other spawned tasks). 
I would likely add detection for service rate falling to 0 (or some threshold) and, when this happens, turn on per process stack tracing to see what the processes are up to.
I've never used it, but perhaps Process Peek (http://www.instantiations.com/resources/goodies.html) can be of assistance in this area.
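
The monitoring idea can be sketched roughly as follows. This is Python rather than VA Smalltalk, and every name in it (`ServiceRateMonitor`, `dump_all_stacks`, `watchdog`) is hypothetical; it only shows the shape of the approach: count completed requests, sample the count from a separate background task, and dump every thread's stack once the rate drops to the threshold.

```python
import sys
import threading
import traceback

class ServiceRateMonitor:
    """Counts completed requests between samples."""

    def __init__(self, threshold=0):
        self._lock = threading.Lock()
        self._completed = 0
        self.threshold = threshold  # a rate at or below this counts as stalled

    def request_completed(self):
        # Called by the request handler after each response is sent.
        with self._lock:
            self._completed += 1

    def take_sample(self):
        """Return the completions since the previous sample and reset the counter."""
        with self._lock:
            count, self._completed = self._completed, 0
        return count

    def is_stalled(self, count):
        return count <= self.threshold

def dump_all_stacks():
    """Return one string with the current stack of every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    parts = []
    for ident, frame in sys._current_frames().items():
        parts.append(f"--- {names.get(ident, ident)} ---\n")
        parts.extend(traceback.format_stack(frame))
    return "".join(parts)

def watchdog(monitor, interval, report, stop):
    """Background loop: sample every `interval` seconds until `stop` is set."""
    while not stop.wait(interval):
        if monitor.is_stalled(monitor.take_sample()):
            report(dump_all_stacks())
```

A real version would also need to know whether requests are arriving at all, since an idle server has a completion rate of zero too.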

Your understanding of the GC may be a little off, so I will weigh in. Most people don't know how it works anyway, so don't feel bad :)
In our VM, newspace is a logical space composed of 2 halves (segments), and these don't grow. In reality, you can create new EsMemorySegments from Smalltalk and they may be marked as newspace, but this is not the same thing and not relevant here.
The default size for each segment half is 256KB (512KB total newspace), but the default ini files that we ship actually specify 2MB halves (4MB total newspace).
Newspace is the default location where all objects are created, unless otherwise specified or unless the object to be allocated can't fit, in which case it is allocated in old space.
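
To put the sizes above next to the workload from the original question, here is a back-of-the-envelope sketch in Python. The two copies per object and the "few thousand objects" come from the question; the 200 bytes per object is purely an assumed figure for illustration, and the function name is hypothetical.

```python
KB = 1024
MB = 1024 * KB

def times_half_fills(objects_read, bytes_per_object, half_size, copies_per_object=2):
    """How many times the allocated bytes would fill one newspace half."""
    allocated = objects_read * bytes_per_object * copies_per_object
    return allocated // half_size

if __name__ == "__main__":
    # 5000 objects, two copies each, 200 bytes per copy (assumed)
    print(times_half_fills(5000, 200, 256 * KB))  # VM default half size
    print(times_half_fills(5000, 200, 2 * MB))    # half size from the shipped ini files
```

Under these assumptions the whole read fills a 256KB half only a handful of times and does not even fill a 2MB half once, which fits the point that newspace size is unlikely to explain a standstill.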

Oldspace is a logical space composed of many segments. More segments are allocated as needed and reclaimed if detectable. So in this sense, only oldspace grows, and it will keep growing until memory is exhausted. Some of the other VM options in the ini files dictate how and when the old space grows. If you're always running up against the max process memory barrier, then tweaking these correctly becomes extremely important.
But from what you have described, I don't think GC is going to be your issue.
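
One way to check from outside the image whether the process really is approaching a memory limit is to sample the kernel's own numbers over time. The sketch below is Python and assumes a Linux host; it reads `/proc/<pid>/statm`, whose third field is the shared resident page count that `top` reports as SHR. The function names are hypothetical.

```python
import os

# Field order of /proc/<pid>/statm, all values in pages.
STATM_FIELDS = ("size", "resident", "shared", "text", "lib", "data", "dt")

def parse_statm(text, page_size):
    """Convert one line of /proc/<pid>/statm into a dict of byte counts."""
    pages = [int(value) for value in text.split()]
    return {name: count * page_size for name, count in zip(STATM_FIELDS, pages)}

def read_process_memory(pid):
    """Read the current memory figures of a running process (Linux only)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/statm") as statm:
        return parse_statm(statm.read(), page_size)
```

Logging `read_process_memory(pid)` once a minute from a cron job or a small script would show whether memory climbs steadily before a standstill or stays flat.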

Hope this helps

-- Seth

In reply to Joachim Tuchel's message of Monday, November 14, 2016 at 11:50:29 PM UTC-5.

Re: Guidance with unresponsive Seaside Images needed

jtuchel
Hi Seth,

as always, your answers are excellent! Thanks for taking the time to write this.

So you are basically saying that "running out of new space" is impossible; it could just be that scavenging happens more often when you produce lots of new objects. This could slow the image down, but cannot make it come to a standstill.

Your theory (which clearly is based on far more experience with and knowledge of memory management than I ever want to have) basically is that the image standstills are almost certainly not related to a shortage of memory. Just to be sure: if an image runs out of old space, can I be sure to get some walkback, dump, exit with return code or something? If so, I would follow your suggestion and concentrate on the HTTP server / WASstServerAdaptor corner of things and hope to find something in this area.

...which reminds me of a bug we've come to live with. In our dev images, we encounter a situation almost daily where the image won't respond to Seaside requests, but the dev tools (debugger, browser, inspector etc.) continue to work normally. All you have to do is restart the WASstServerAdaptor and things are back to normal. This has become so "natural" during debugging/development that it
a) is a matter of muscle memory to resolve it during development
b) didn't come to my mind when we had these standstills in our production images.

Of course I should be looking there first. Stupid me.

So here is my second thanks for giving me this impulse. I need to find out if that is what is happening. I will probably start by trying to understand this in our dev image first. Since it happens regularly, it shouldn't be too hard to at least see the error situation ;-)

Again: thanks a lot!

Joachim

In reply to Seth Berman's message of Tuesday, November 15, 2016 at 17:24:23 UTC+1.

Re: Guidance with unresponsive Seaside Images needed

Seth Berman
Hi Joachim,

You're welcome. Here are some additional memory management points that might be helpful to know.

1. Running out of newspace happens all the time. That's what triggers a scavenge (copying the live objects from one semi-space segment to the other).
2. Scavenging is happening all the time and is very quick. It is a stop-the-world algorithm, so the VM halts while a scavenge is done. This means the image technically does come to a standstill, but the time is so small that it would be hard to measure without high-resolution timers in the VM.
3. Newspace objects that survive a threshold number of scavenges are moved to the tenure segment within logical old space.
4. In the current production VM, the threshold is static: something like 10 scavenges survived before an object is tenured to old space. In the new VMs, it's adaptive, so it adjusts with the allocation rate.
5. A full garbage collection is triggered when a new object allocation can't be satisfied, even after a scavenge.
6. Full GC is expensive and will halt the image. A gross approximation would be something like 1 sec / GB of memory.
7. If GC is taking more time than this, then most likely your program is up against the process memory limits and the algorithms are having trouble freeing every last byte for your allocation request.
8. Yes, you will get an OutOfMemory exception triggered if an object allocation cannot be satisfied.
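
The scavenge and tenuring points above can be made concrete with a toy model. This is a Python sketch for illustration only, not how the VA Smalltalk VM is implemented: `ToyHeap`, `Obj` and the explicit `live` flag are hypothetical, a real collector finds live objects by tracing references, and the case where an allocation still does not fit after a scavenge (which would trigger a full GC) is reduced to placing the object in old space.

```python
from dataclasses import dataclass

@dataclass(eq=False)
class Obj:
    size: int
    live: bool = True   # stand-in for "still reachable"
    survived: int = 0   # number of scavenges survived so far

class ToyHeap:
    def __init__(self, half_size, tenure_threshold=10):
        self.half_size = half_size
        self.tenure_threshold = tenure_threshold
        self.active = []    # objects in the active newspace half
        self.old = []       # tenured objects (old space)
        self.scavenges = 0

    def used(self):
        return sum(obj.size for obj in self.active)

    def allocate(self, size):
        obj = Obj(size)
        if size > self.half_size:
            self.old.append(obj)      # cannot fit in newspace at all
            return obj
        if self.used() + size > self.half_size:
            self.scavenge()           # newspace is full: this is the trigger
        if self.used() + size > self.half_size:
            self.old.append(obj)      # simplification: a real VM would run a full GC
            return obj
        self.active.append(obj)
        return obj

    def scavenge(self):
        """Copy live objects to the other half; tenure the long-lived ones."""
        self.scavenges += 1
        survivors = []
        for obj in self.active:
            if not obj.live:
                continue              # dead objects are simply not copied
            obj.survived += 1
            if obj.survived >= self.tenure_threshold:
                self.old.append(obj)
            else:
                survivors.append(obj)
        self.active = survivors

if __name__ == "__main__":
    heap = ToyHeap(half_size=1000, tenure_threshold=2)
    keeper = heap.allocate(100)
    for _ in range(30):
        heap.allocate(100).live = False   # short-lived garbage
    print(heap.scavenges, keeper in heap.old)
```

In this model short-lived garbage costs nothing at scavenge time because it is never copied, while an object that stays live is copied a few times and then moves to old space for good.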

In reply to Joachim Tuchel's message of Wednesday, November 16, 2016 at 2:50:41 AM UTC-5.