Hi
I've been doing some performance work lately in Seaside. Long story short, Seaside (and I guess AIDA too) spends most of its rendering time in WriteStream (#nextPutAll:, #nextPut:).

The way WriteStream >> #pastEndPut: behaves is not really ideal for Seaside. It grows the underlying collection by just enough to accommodate the argument collection (or 20, whichever is bigger). Now imagine the following not very unlikely scenario. You start with a 4k buffer and put on average a 10 element collection (remember all those tags are put individually) until you have a 16k response. You allocate more than a thousand intermediary collections to get there.

What would be better suited for Seaside is doubling the required size. In the worst case that would mean wasting 50% of memory, but it would make the overhead of creating intermediary collections logarithmic. In the given example that would take us only three allocations to get there.

Now I do realize there are other applications for Pharo where this strategy is not ideal, and this is not a killer for us. I just wanted to shed some light on this and ask whether other projects are in a similar situation.

To get a feel for how allocation limited Seaside is: avoiding one allocation of a 16k ByteArray per request can make a difference in throughput between 10 Mbyte/s and 30 Mbyte/s (see "[Progress Report] Zinc HTTP Components"). If anybody knows a way to make allocation of large young space objects faster (Set GC Bias to Grow?, #vmParameterAt:put:?) I'd like to hear it.

Cheers
Philippe
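A rough sketch of what such a doubling policy could look like, written as an illustrative override of #growTo: (this is not the actual Pharo implementation; the collection and writeLimit instance variables are the ones the existing WriteStream code quoted later in this thread uses):

    growTo: anInteger
        "Grow the underlying collection to hold at least anInteger elements,
        but always at least double its current size, so the number of
        intermediary allocations stays logarithmic in the final size."
        | oldSize newSize |
        oldSize := collection size.
        newSize := anInteger max: oldSize * 2.
        collection := (collection class new: newSize)
            replaceFrom: 1 to: oldSize with: collection startingAt: 1.
        writeLimit := newSize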
> Hi
>
> I've been doing some performance work lately in Seaside. Long story short, Seaside (and I guess AIDA too) spends most of its rendering time in WriteStream (#nextPutAll:, #nextPut:).
>
> The way WriteStream >> #pastEndPut: behaves is not really ideal for Seaside. It grows the underlying collection by just enough to accommodate the argument collection (or 20, whichever is bigger). Now imagine the following not very unlikely scenario. You start with a 4k buffer and put on average a 10 element collection (remember all those tags are put individually) until you have a 16k response. You allocate more than a thousand intermediary collections to get there.
> What would be better suited for Seaside is doubling the required size. In the worst case that would mean wasting 50% of memory, but it would make the overhead of creating intermediary collections logarithmic. In the given example that would take us only three allocations to get there.
> Now I do realize there are other applications for Pharo where this strategy is not ideal, and this is not a killer for us. I just wanted to shed some light on this and ask whether other projects are in a similar situation.

Thanks for the info.

Do you have an idea in which cases allocating more would be a real problem? Because in VW there were some patterns:
- turning , into nextPut: to avoid exactly that underlying string allocation
- preallocation: OrderedCollection new: instead of relying on its growing behaviour

Final question: maybe we should be able to plug in the stream behaviour we want. That way the Seaside people can get the speed out of it. I think that having fast web apps is important.

> To get a feel for how allocation limited Seaside is: avoiding one allocation of a 16k ByteArray per request can make a difference in throughput between 10 Mbyte/s and 30 Mbyte/s (see "[Progress Report] Zinc HTTP Components"). If anybody knows a way to make allocation of large young space objects faster (Set GC Bias to Grow?, #vmParameterAt:put:?) I'd like to hear it.
>
> Cheers
> Philippe
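For readers who don't know the two VW patterns mentioned above, a small illustrative sketch; the strings and sizes are made up:

    | greeting names |
    "repeated concatenation allocates a fresh intermediate string at every step..."
    greeting := 'Hello' , ', ' , 'world' , '!'.

    "...whereas streaming builds the result in a single growable buffer"
    greeting := (WriteStream on: (String new: 32))
        nextPutAll: 'Hello';
        nextPutAll: ', ';
        nextPutAll: 'world';
        nextPut: $!;
        contents.

    "and preallocating avoids most growing when a rough size is known up front"
    names := OrderedCollection new: 1000.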
In reply to this post by Philippe Marschall-2-3
On 01.01.2011 14:39, Levente Uzonyi wrote:
> On Sat, 1 Jan 2011, Philippe Marschall wrote:
>> Hi
>>
>> I've been doing some performance work lately in Seaside. Long story short, Seaside (and I guess AIDA too) spends most of its rendering time in WriteStream (#nextPutAll:, #nextPut:).
>>
>> The way WriteStream >> #pastEndPut: behaves is not really ideal for Seaside. It grows the underlying collection by just enough to accommodate the argument collection (or 20, whichever is bigger). Now
>
> No it doesn't. The code in #pastEndPut: is
>
>     collection := collection grownBy: ((collection size max: 20) min: 1000000).
>
> so the argument of #grownBy: is at least 20, at most 1000000, and it's the size of the internal buffer (collection) if that's between 20 and 1000000. So in most practical cases it's the value of [collection size]. #grownBy: allocates a new collection that has the size of the collection's size + the argument, which is 2 * self size in the most common case. So the size grows exponentially till it reaches 1000000 and then linearly (with a relatively high constant 1000000). This guarantees that the cost of #nextPut: is constant in most cases.

You're right, I misread the code; that's the behaviour for #nextPut:. However #nextPutAll: does:

    newEnd := position + aCollection size.
    newEnd > writeLimit ifTrue: [self growTo: newEnd + 10].

and #growTo: then does:

    newSize := anInteger + (oldSize // 4 max: 20).
    grownCollection := collection class new: newSize.

aCollection is the argument collection, not the underlying collection. So that's the required size plus ten, plus the max of a fourth of the current size and 20.

>> imagine the following not very unlikely scenario. You start with a 4k buffer and put on average a 10 element collection (remember all those tags are put individually) until you have a 16k response. You allocate more than a thousand intermediary collections to get there.
>
> So just 3: one for 4k (the initial), one for 8k and one for 16k.

Yes.

>> What would be better suited for Seaside is doubling the required size.
>
> That's exactly what's happening in most cases.

In the #nextPut: case, not in the #nextPutAll: case.

>> In the worst case that would mean wasting 50% of memory but it would make the overhead of creating intermediary collections logarithmic. In
>
> The overhead is constant (amortized) in most cases.
>
>> the given example that would take us only three allocations to get there.
>> Now I do realize there are other applications for Pharo where this strategy is not ideal and this is not a killer for us. I just wanted to shed some light on this and ask whether other projects are in a similar situation.
>>
>> To get a feel for how allocation limited Seaside is: avoiding one allocation of a 16k ByteArray per request can make a difference in throughput between 10 Mbyte/s and 30 Mbyte/s (see "[Progress Report] Zinc HTTP Components"). If anybody knows a way to make allocation of large young space objects faster (Set GC Bias to Grow?, #vmParameterAt:put:?) I'd like to hear it.
>
> IIRC CogVM's GC is different from SqueakVM's GC in the sense that it is activated when a predefined amount of data is allocated, not when the number of allocations since the last GC reaches a predefined limit. So you have to tune the GC differently on Cog and non-Cog VMs.

Do you know where I can find more information about this?

Cheers
Philippe
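To put a rough number on the difference discussed above, the following workspace snippet counts how often the quoted #nextPutAll:/#growTo: policy has to grow a 4k buffer on its way to a 16k response, under the assumption (mine, for illustration) that roughly 10 elements are written per #nextPutAll:. It comes out to a handful of roughly 25% growth steps, compared with two steps under plain doubling:

    | size grows |
    size := 4096.
    grows := 0.
    [ size < 16384 ] whileTrue: [
        "newEnd is about size + 10, #growTo: receives newEnd + 10,
        and then adds (oldSize // 4 max: 20) on top of that"
        size := size + 20 + ((size // 4) max: 20).
        grows := grows + 1 ].
    grows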
In reply to this post by Stéphane Ducasse
On 01.01.2011 15:07, Stéphane Ducasse wrote:
>> Hi
>>
>> I've been doing some performance work lately in Seaside. Long story short, Seaside (and I guess AIDA too) spends most of its rendering time in WriteStream (#nextPutAll:, #nextPut:).
>>
>> The way WriteStream >> #pastEndPut: behaves is not really ideal for Seaside. It grows the underlying collection by just enough to accommodate the argument collection (or 20, whichever is bigger). Now imagine the following not very unlikely scenario. You start with a 4k buffer and put on average a 10 element collection (remember all those tags are put individually) until you have a 16k response. You allocate more than a thousand intermediary collections to get there.
>> What would be better suited for Seaside is doubling the required size. In the worst case that would mean wasting 50% of memory, but it would make the overhead of creating intermediary collections logarithmic. In the given example that would take us only three allocations to get there.
>> Now I do realize there are other applications for Pharo where this strategy is not ideal, and this is not a killer for us. I just wanted to shed some light on this and ask whether other projects are in a similar situation.
>
> Thanks for the info.
> Do you have an idea in which cases allocating more would be a real problem?

Whenever you're memory constrained, I guess. When doubling would put you past the size limit of the image.

> Because in VW there were some patterns:
> turning , into nextPut: to avoid exactly that underlying string allocation.

I don't get that.

> preallocation: OrderedCollection new: instead of relying on its growing behaviour.

Right, but you never know how big the response is going to be before actually rendering it. Otherwise you could just do Array/String new: beforehand.

> Final question: maybe we should be able to plug in the stream behaviour we want. That way the Seaside people can get the speed out of it. I think that having fast web apps is important.

Oh, it's already way faster than it was previously ;-) And let's not forget: even when you run Seaside, you will have other code, like a database driver, that will need streams as well. Whatever you plug in has to work with whatever else you have in your image.

Cheers
Philippe
In reply to this post by Philippe Marschall-2-3
"Philippe Marschall"<[hidden email]> wrote:
>> preallocation: OrderedCollection new: instead of relying on its growing behaviour.
>
> Right, but you never know how big the response is going to be before actually rendering it. Otherwise you could just do Array/String new: beforehand.

I'm not quite clear about the context here so this may not be relevant, but one technique that paid off for us in terms of managing socket buffers was to start recycling them. The idea is that if you generally allocate, let's say, a 32K buffer for every socket you open, rather than throwing the buffer away when you're done with the socket, you just put it aside and let the next socket pick up and reuse the same instance. For us the main concern was a server being hammered by transient connection requests. Occasional buffer allocation could be handled by the new space scavenger just fine, but a flood of such requests made the buffers spill over to the old space, and throwing them away got expensive quickly.

Anyway, I realize that your concern is different here, but the same technique might help avoid growing in the most common cases. If there's a reasonable upper bound on most responses (e.g. 32K), you could simply preallocate that size always. If the request fits, great, no growing; if not, it will grow. However, you'd probably have the same sort of problem as we had above if you simply threw each buffer away. Recycling should help make it manageable.

We use the same technique in Xtreams, where we need internal buffers quite frequently. Check out the RecyclingCenter class there if you want. It's a bit more generic because we need to accommodate different collection classes, so you'll probably want something different, but the basics should be similar. One aspect to note is that the recycling pool can be quite small. You don't need to recycle every instance created, you only need to be able to handle the "overflow" under load. The idea is that if you are under load and the number of concurrently used buffers fluctuates between 100 and 110, you don't need the recycling pool size to be 110, but only 10. It only needs to retain the temporarily unused buffers long enough for those to get picked up again. When the load subsides, it's fine to let the remaining 100 buffers GC; they can be allocated again quickly when the next wave hits. At the same time it keeps the memory load reasonable in the quiet periods.

HTH,

Martin
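A minimal sketch of such a recycling pool, assuming a class with instance variables free, lock, bufferSize and poolSize; all names and numbers here are illustrative, not the actual Xtreams RecyclingCenter:

    initialize
        bufferSize := 32768.    "one size that fits most responses"
        poolSize := 10.         "only needs to cover the fluctuation under load"
        free := OrderedCollection new: poolSize.
        lock := Semaphore forMutualExclusion

    take
        "Answer a recycled buffer if one is available, otherwise a fresh one."
        lock critical: [ free isEmpty ifFalse: [ ^ free removeLast ] ].
        ^ ByteArray new: bufferSize

    recycle: aBuffer
        "Keep the buffer for the next taker, unless the pool is already full."
        lock critical: [ free size < poolSize ifTrue: [ free addLast: aBuffer ] ]

A server would #take a buffer when a connection comes in and #recycle: it once the response has been sent.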
On 04. 01. 2011 04:25, [hidden email] wrote:
> "Philippe Marschall"<[hidden email]> wrote: >>> preallocation OrderedCollection new: instead of relying on its growing behavior. >> >> Right, but you never know how big the response is going to before >> actually rendering. Otherwise you could just do Array/String new: >> beforehand. > > I'm not quite clear about the context here so this may not be relevant, but one technique that paid of for us in terms of managing socket buffers was to start recycling them. The idea is that if you generally allocate let's say 32K buffer for every socket you open, rather then throwing the buffer away when you're done with the socket, just put it aside and let the next socket pick up and reuse the same instance. Yep, this is a technique used in Swazoo too. When you nextPutAll: to Swazoo's HTTPResponse, you actually write directly to the SwazooBuffer, which has such recyclable collection. Main reason to introduce those buffers was exactly to reduce garbage collection activity to minimum, which is very important when serving (or uploading) large content. Second reason is that network packets must be packaged for network performance into a reasonably big packets and not into a lot of small ones after each nextPutAll:. Best regards Janko -- Janko Mivšek AIDA/Web Smalltalk Web Application Server http://www.aidaweb.si |
In reply to this post by mkobetic
On 04.01.2011 04:25, [hidden email] wrote:
> "Philippe Marschall"<[hidden email]> wrote: >>> preallocation OrderedCollection new: instead of relying on its growing behavior. >> >> Right, but you never know how big the response is going to before >> actually rendering. Otherwise you could just do Array/String new: >> beforehand. > > I'm not quite clear about the context here so this may not be relevant, but one technique that paid of for us in terms of managing socket buffers was to start recycling them. The idea is that if you generally allocate let's say 32K buffer for every socket you open, rather then throwing the buffer away when you're done with the socket, just put it aside and let the next socket pick up and reuse the same instance. For us the main concern was a server being hammered by transient connection requests. Occasional buffer allocation could be handled by new space scavenger just fine, but a flood of such requests made the buffers spill over to the old space and throwing them away got expensive quickly. Well yes and no. There are several issues. The first is that the growing policy of #nextPut: and #nextPutAll: is different. I believe #nextPut: has it right. That shows in general where you build a huge collection using Streams, one example is Seaside response generation but there are others. Partially related to that is that allocating huge collections is expensive. So Seaside/the server should reuse a buffer for response rendering. The problem is that this needs server support. If you have a 64k buffer and use 90% of that you don't want to send #contents to this because the server can handle only ByteArray/String. And you don't want the server having to do ByteArray/String conversion. AJP [1] has an experimental implementation of this and very happy with the results. When we have a streaming server all of this is not really a problem because we can/could directly render into the socket buffer. Issue 591 contains some clean up to make this approach more general/portable. > Anyway, I realize that your concern is different here, but the same technique might help growing in most common cases. If there's a reasonable upper bound on most responses (e.g. 32K), you could simply preallocate that size always. If the request fits, great, no growing, if not, it will grow. However you'd probably have the same sort of problem as we had above if you simply threw each buffer away. Recycling should help making it manageable. > > We use the same technique in Xtreams, where we need internal buffers quite frequently. Check out the RecyclingCenter class there if you want. It's a bit more generic because we need to accommodate different collection classes, so you'll probably want something different, but the basics should be similar. One aspect to note is that the recycling pool can be quite small. You don't need to recycle every instance created, you only need to be able to handle the "overflow" under load. The idea is that if you are under load and the number of concurrently used buffers fluctuates between 100 and 110, you don't need the recycling pool size to be 110, but only 10. It only needs to retain the temporarily unused buffers long enough for those to get picked up again. When the load subsides, it's fine to let the remaining 100 buffers GC, they can be allocated again quickly when next wave hits. At the same time it keeps the memory load reasonable in the quiet periods. No offense but such things quickly become like your own memory manager. Wasn't this GC stuff supposed to free (pun) me from this? 
[1] http://www.squeaksource.com/ajp

Cheers
Philippe
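A minimal sketch of the buffer-reuse idea described above, assuming a hypothetical server hook #writeResponse:from:to: (the buffer size and all names are made up for illustration):

    | buffer stream |
    buffer := String new: 65536.    "kept and reused across requests instead of reallocated"
    stream := WriteStream on: buffer.
    "render the response into the reused buffer"
    stream nextPutAll: '<html><body>hello</body></html>'.
    "hand the server the raw buffer plus the number of valid characters,
    instead of copying everything out with #contents;
    'server' and #writeResponse:from:to: are placeholders, not a real API"
    server writeResponse: buffer from: 1 to: stream position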