Low-space signals in production environments

Andreas.Raab
Hi Guys -

I am rather confused about the current behavior of Squeak in
the case of memory allocation failure. In my use case I have incoming
network requests which are handled at high I/O priority and need to
allocate memory based on the size of the request. Given a malformed
request, this can easily lead to an allocation failure which really
should raise an error, be caught and be done with.

However, there doesn't seem to be a way of handling low-space conditions
by the client. In the case of an allocation failure, all that appears
to be happening is that the low-space semaphore is being signaled with
the obvious assumption that the low-space watcher will preempt the
running process, make some space and continue. But equally obviously
this just can't work if the running process is at a higher priority than
the low-space process and since the running process recurses directly
into #basicNew: again this will bring your system to a screeching halt.

Since I can't possibly be the first person who noticed that (or at least
I really hope I'm not) my question is, how do people deal with that
situation in their production systems? I have never seen the issue
discussed but I would expect that it has come up on some Seaside or
other network-related lists.

Right now I'm just thinking to do something like signaling an
OutOfMemory error which as its default action would signal the lowspace
condition, leaving the client with the option to handle the request
differently if needed.
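
A minimal sketch of that idea, in Squeak Smalltalk. OutOfMemory is a hypothetical class here, not something taken from the image under discussion, requestSize is a placeholder, and Smalltalk signalLowSpace is assumed to be the image-side method that signals the low-space semaphore. The three pieces are meant to be filed in or evaluated separately.

```smalltalk
"1. Hypothetical error class (evaluate once)."
Error subclass: #OutOfMemory
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Kernel-Exceptions'.

"2. Method to add to OutOfMemory: with no handler installed,
    fall back to signalling the low-space semaphore."
defaultAction
    Smalltalk signalLowSpace

"3. Client code in the request handler. requestSize is a placeholder.
    This only has an effect once the failure path of basicNew:
    signals OutOfMemory, which is the change being proposed."
| requestSize buffer |
requestSize := 100 * 1024 * 1024.
buffer := [ByteArray new: requestSize]
    on: OutOfMemory
    do: [:ex | ex return: nil].
buffer ifNil: [Transcript show: 'request rejected: out of memory'; cr].
```

The point of the design is that a client which installs no handler sees the same behaviour as today, while a client which does install one can drop the request and carry on.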

Cheers,
   - Andreas


Re: Low-space signals in production environments

johnmci
If you set Smalltalk setGCBiasToGrow: 1
you may get different behavior, assuming your VM supports that and you
noted the issue with Slang dropping the code I talked about last
month.  The code was fixed, not Slang, although I don't think Tim has
put the fix into VMMaker yet (hint hint).

Really of course the VM must signal low space at some point if it
can't grow, and the low-space process must then run instead of
something else, do whatever it needs to, and then allow the process
that is doing the memory allocation to run and try again.

I recall in VW it would attempt the allocation, if it failed it would  
mutter things to the memory policy to allocate this much more memory,  
with a % of extra slack, then retry with possible failure, the key  
being the process asking for the memory would be waiting for the VM  
to adjust the memory footprint. One failure case in the past was  
setting the % too low so that other processes would chew up the newly  
allocated memory leaving you without the memory you just asked for.  
Of course if the memory requested will push you over the memoryLimit  
set for the VM, nothing will help.

On Feb 10, 2007, at 10:40 AM, Andreas Raab wrote:

> Hi Guys -
>
> I am just being very confused about the current behavior of Squeak  
> in the case of memory allocation failure. In my use case I have  
> incoming network requests which are handled at high I/O priority  
> and need to allocate memory based on the size of the request. Given  
> a malformed request, this can easily lead to an allocation failure  
> which really should raise an error, be caught and be done with.
>
> However, there doesn't seem to be a way of handling low-space  
> conditions  by the client. In the case of an allocation failure,  
> all that appears to be happening is that the low-space semaphore is  
> being signaled with the obvious assumption that the low-space  
> watcher will preempt the running process, make some space and  
> continue. But equally obviously this just can't work if the running  
> process is at a higher priority than the low-space process and  
> since the running process recurses directly into #basicNew: again  
> this will bring your system to a screeching halt.
>
> Since I can't possibly be the first person who noticed that (or at  
> least I really don't hope I am) my question is, how do people deal  
> with that situation in their production systems? I have never seen  
> the issue discussed but I would expect that it has come up on some  
> Seaside or other network-related lists.
>
> Right now I'm just thinking to do something like signaling an  
> OutOfMemory error which as its default action would signal the  
> lowspace condition, leaving the client with the option to handle  
> the request differently if needed.
>
> Cheers,
>   - Andreas
>

--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================




Re: Low-space signals in production environments

timrowledge

On 10-Feb-07, at 9:05 PM, John M McIntosh wrote:

> If you set Smalltalk setGCBiasToGrow: 1
> you may get different behavior, assume your vm supports that and  
> you noted the issue with SLANG dropping the code I talked about  
> last month.  The code was fixed, not slang, although I don't think
> Tim has put the fix in to VMMaker yet (hint hint).

Money for time, hint, hint.

>
> Really of course the VM must signal low space at some point if it  
> can't grow, and the process must run instead of something else do  
> whatever, then allow the process that is doing the memory  
> allocation to run and try again.

Exactly. I'm a bit surprised to read of any process - except possibly
the tick etc. - running at a higher priority than the lowspace handler.
That seems a bit daft; any process that could possibly allocate
memory should be running at a lower priority, unless perhaps one
provides some sort of process locking mechanism.

Aside from in-image issues of policy to decide on whether to try to  
grow or not, there are VM complications in the limit set in some  
cases as well as a particularly egregious situation where the VM will  
steal a substantial chunk of the memory that is thought to be  
available in order to do a bit of gc work. It can go so far as to  
leave just a few bytes for the allocator which typically isn't a  
happy place to end up.

It *is* possible to make a memory resilient system. The ancient  
ActiveBook system built from Eliot's BHH was routinely tested by  
running it down to a few hundred bytes of free memory and a dozen or  
so free oops (this was an OT system) and it always recovered cleanly.

It takes tim, which takes money.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: OI: Vey




Re: Low-space signals in production environments

johnmci
In reply to this post by Andreas.Raab
Also see my notes from
        Subject: Re: lowspace signalling and handling issues
        Date: May 3, 2005 7:56:09 PM PDT (CA)

 > I've taken it down to 32K on a 512MB image via that code that
 > allocates links...
 > Grinds away until freespace goes under 98 bytes (can't allocate a
 > context record).

but there was no interest in sticking those changes into the VM.
I would have to hunt for the bits. It removed some complication between
the low memory signal, the upper boundary check and doing a GC.
However I think it was simpler to turn on the bias-to-grow logic.



--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================




Re: Low-space signals in production environments

Andreas.Raab
In reply to this post by timrowledge
tim Rowledge wrote:

>> Really of course the VM must signal low space at some point if it
>> can't grow, and the process must run instead of something else do
>> whatever, then allow the process that is doing the memory allocation
>> to run and try again.
>
> Exactly. I'm a bit surprised to read of any process - except possibly
> the tick etc- running at a higher priority than the lowspace handler.
> That seems a bit daft; any process that could possibly allocate memory
> should be running at a lower priority, unless perhaps one provides some
> sort of process locking mechanism.

There seem to be two misunderstandings here. For one thing, the lowspace
watcher runs at lowIOPriority and there are *plenty* of processes
running at that priority or higher.
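
The priority problem can be reproduced in a workspace without exhausting memory. The following is only a sketch: a plain Semaphore stands in for the low-space semaphore, the first process stands in for the low-space watcher, and the second for a request handler at high I/O priority. None of the names come from the image's actual low-space code.

```smalltalk
| sem log |
sem := Semaphore new.
log := OrderedCollection new.

"Stand-in for the low-space watcher: waits at lowIOPriority."
[sem wait.
 log add: 'watcher ran'] forkAt: Processor lowIOPriority.

"Stand-in for the allocating handler: signals, then keeps running."
[sem signal.
 log add: 'semaphore signalled'.
 log add: 'handler finished'] forkAt: Processor highIOPriority.

log
```

Evaluated from the UI process, the log should read 'semaphore signalled', 'handler finished', 'watcher ran': the signal only makes the watcher runnable, and it does not get the processor until the higher-priority process is done. If the handler loops on the allocation instead of finishing, the watcher never runs at all.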

Secondly, even when signaling the low space semaphore (which can be
seen as a *hint* to the system that we're in trouble with respect to
memory), the outcome of an allocation ought to be either that the memory
was allocated, or that an error is raised. What sense does it make for
Behavior>>basicNew: to signal the lowspace semaphore? The result is that
you can lock up the system as simply as here:

    [Array new: SmallInteger maxVal] forkAt: Processor lowIOPriority.

And what's the point of that? At least I would expect if the allocation
within basicNew: fails we get a proper error condition. But
side-effecting by signaling the lowspace semaphore? What good does that
do? In particular considering that the lowspace semaphore can't really
do anything because it doesn't even know which process got interrupted!
Sorry, but this seems Just Wrong(tm).

> Aside from in-image issues of policy to decide on whether to try to grow
> or not, there are VM complications in the limit set in some cases as
> well as a particularly egregious situation where the VM will steal a
> substantial chunk of the memory that is thought to be available in order
> to do a bit of gc work. It can go so far as to leave just a few bytes
> for the allocator which typically isn't a happy place to end up.

Not really. Whether the VM is capable of allocating memory or not is a
binary decision. There is nothing complicated about it. Whether it can
recover from a failed allocation is of course a different question but
that's why we have the red zone which triggers a low space condition
when we enter it - the red zone is still sufficient to do a variety of
things. But when allocation fails, it fails, there is no policy. It fails.

Cheers,
   - Andreas


Re: Low-space signals in production environments

David T. Lewis
On Sun, Feb 11, 2007 at 02:08:29AM -0800, Andreas Raab wrote:
> In particular considering that the lowspace semaphore can't really
> do anything because it doesn't even know which process got interrupted!

Andreas,

Does your image have the fix from Mantis 1041?

"Under certain conditions the low space watcher was unable to determine the
correct process to suspend following a low space signal. These changes permit
the VM to remember the identity of the process that caused the low space
condition, and to report it to the image through a primitive."

Low space notification was badly broken for quite a while, including 3.8
images, but should be somewhat less broken after applying this change.
This might affect Squeakland or OLPC images, I'm not sure.

Dave




Re: Low-space signals in production environments

johnmci
In reply to this post by Andreas.Raab

On Feb 11, 2007, at 2:08 AM, Andreas Raab wrote:

>
>
> Not really. Whether the VM is capable of allocating memory or not  
> is a binary decision. There is nothing complicated about it.  
> Whether it can recover from a failed allocation is of course a  
> different question but that's why we have the red zone which  
> triggers a low space condition when we enter it - the red zone is  
> still sufficient to do a variety of things. But when allocation  
> fails, it fails, there is no policy. It fails.

Well you get to invent new Policy.

My comment about having basicNew tell the VM there is a problem and  
then retry with failure *after* something has been done, seemed  
fairly reasonable.

Your example of Array new: SmallInteger maxVal would of course fail
because the Policy (TM) would look at, say, currently used memory +
(SmallInteger maxVal) > ceiling of the Mac Carbon VM (which by default is
512MB), thus you're toast. Some complications exist because, once you
are in failure mode, is that because of an extraordinary request or
are you hitting the maximum ceiling? One could of course have a chunk
of reserved memory that one could free (couple of MB?). Still, in the
case of a recursive runaway process it's difficult to provide enough
time for the developer to do something.
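
The reserved-chunk idea might look roughly like this. It is a sketch under assumptions: OutOfMemory is the hypothetical error proposed earlier in the thread and does not exist unless the allocation failure path is changed to signal it, the 2MB reserve size is a guess, and requestSize is a placeholder. In a real system the reserve would live in a class variable rather than a workspace temporary.

```smalltalk
| reserve requestSize buffer |
reserve := ByteArray new: 2 * 1024 * 1024.   "emergency reserve, size is a guess"
requestSize := 100 * 1024 * 1024.            "placeholder for the request size"
buffer := [ByteArray new: requestSize]
    on: OutOfMemory
    do: [:ex |
        reserve := nil.              "drop the only reference to the reserve"
        Smalltalk garbageCollect.    "reclaim it so recovery code has room"
        ex return: nil].
buffer
```

Freeing the reserve does not make the oversized request succeed. It only buys the recovery code some room to work in, and the reserve has to be allocated again once the situation has been dealt with.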

One could even determine which processes *must* run, versus user  
processes which could be halted at the time of allocation failure.

>
> Cheers,
>   - Andreas
>

--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================




Re: Low-space signals in production environments

Andreas.Raab
In reply to this post by David T. Lewis
David T. Lewis wrote:
> On Sun, Feb 11, 2007 at 02:08:29AM -0800, Andreas Raab wrote:
>> In particular considering that the lowspace semaphore can't really
>> do anything because it doesn't even know which process got interrupted!
>
> Does your image have the fix from Mantis 1041?

No, but that doesn't really matter. My point was that a low-priority
process has no chance to ever interrupt a higher-priority process. And I
doubt your fix changes that.

Cheers,
   - Andreas




Re: Low-space signals in production environments

timrowledge

On 11-Feb-07, at 12:15 PM, Andreas Raab wrote:

> David T. Lewis wrote:
>> On Sun, Feb 11, 2007 at 02:08:29AM -0800, Andreas Raab wrote:
>>> In particular considering that the lowspace semaphore can't  
>>> really do anything because it doesn't even know which process got  
>>> interrupted!
>> Does your image have the fix from Mantis 1041?
>
> No, but that doesn't really matter. My point was that a low-priority
> process has no chance to ever interrupt a higher-priority process.
> And I doubt your fix changes that.

No, it simply does a somewhat better job of guessing which process  
might be the problem.

We could, as I'm pretty sure we have discussed, find some way to  
include the oop of the process that caused the allocation problem in  
the semaphore more directly, which would improve things a touch more  
by avoiding the possibility of race conditions. The real problem with  
identifying the *actually* problematic process is that the allocation  
request that triggers a lowspace may well not be part of the actual  
space hog. Suspending the wrong process and letting others -
including maybe the monster - run simply leads to more trouble.

If the lowspace handler suspended all other processes it would  
obviate some of the problems. If we wanted to interact with users as  
part of the handler we might have to permit some other process to  
start or resume, perhaps under some restrictions. If simply doing a  
gc solved the space problem then we could simply allow everything  
else to resume. And we should remove direct in-vm calls to gc  
wherever possible so that in-image code can apply more flexible  
policies.

Having a very high priority process to handle low space conditions  
seems like a plausible idea to me.
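
As a rough sketch of what that could look like, with a plain Semaphore standing in for the real low-space semaphore. The choice of Processor timingPriority is an assumption made for illustration, not a recommendation, since a full GC at that priority would hold up the timer process as well.

```smalltalk
| lowSpace handled watcher |
lowSpace := Semaphore new.
handled := 0.

"Watcher loop: wait for the signal, collect, count the event."
watcher := [[lowSpace wait.
             Smalltalk garbageCollect.
             handled := handled + 1] repeat] newProcess.
watcher priority: Processor timingPriority.
watcher resume.

"Stand-in for an allocation failure inside a high I/O priority process."
[lowSpace signal] forkAt: Processor highIOPriority.

watcher terminate.   "clean up the demo process"
handled
```

Because the watcher outranks the signalling process, it should run as soon as the semaphore is signalled rather than after the offender has finished, which is the property missing when the watcher sits at lowIOPriority.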


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
A bug in the code is worth two in the documentation.




Re: Low-space signals in production environments

timrowledge
In reply to this post by Andreas.Raab

On 11-Feb-07, at 2:08 AM, Andreas Raab wrote:

>
> [snip] that's why we have the red zone which triggers a low space  
> condition when we enter it - the red zone is still sufficient to do  
> a variety of things.

This turns out to be incorrect.


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Disclaimer:  Any errors in spelling, tact, or fact are transmission  
errors.




Re: Low-space signals in production environments

Andreas.Raab
tim Rowledge wrote:
> On 11-Feb-07, at 2:08 AM, Andreas Raab wrote:
>
>> [snip] that's why we have the red zone which triggers a low space
>> condition when we enter it - the red zone is still sufficient to do a
>> variety of things.
>
> This turns out to be incorrect.

In which way? Is it too small? We used to execute it regularly in the
old days before the VM would grow memory dynamically, so there is a
chance that this hasn't been executed in a while and needs some
adjustment. Still, the basic underlying principle of giving advance
warning and having the image react to *that* as opposed to an actual
allocation failure seems sound to me.

Cheers,
   - Andreas


Re: Low-space signals in production environments

timrowledge

On 11-Feb-07, at 12:56 PM, Andreas Raab wrote:

> tim Rowledge wrote:
>> On 11-Feb-07, at 2:08 AM, Andreas Raab wrote:
>>> [snip] that's why we have the red zone which triggers a low space  
>>> condition when we enter it - the red zone is still sufficient to  
>>> do a variety of things.
>> This turns out to be incorrect.
>
> In which way? Is it too small? We used to execute it regularly in  
> the old days before the VM would grow memory dynamically, so there  
> is a chance that this hasn't been executed in a while and needs  
> some adjustment. Still, the basic underlying principle of giving  
> advanced warning and have the image react to *that* as opposed to  
> an actual allocation failure seems sound to me.
>
> Cheers,
>   - Andreas
>
We collectively discussed this a while back:-
http://lists.squeakfoundation.org/pipermail/vm-dev/2005-May/000213.html

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- A one-bit brain with a parity error.