Hi Eliot,

You wrote "it would be able to prove that the result of findKeyOrNil: would be an index in range". Could you explain a bit how this can be done? The only explanation I can think of is that #findKeyOrNil: always sends 'self basicAt: location' before returning the value of location. So if that message doesn't fail, the sender of #findKeyOrNil: won't fail either, provided it just uses the result as an argument to #basicAt:. Would this be the reasoning of the compiler/VM?

Thanks,
Mark

From: [hidden email] [mailto:[hidden email]] On Behalf Of Eliot Miranda

On Fri, Oct 10, 2008 at 1:03 PM, Martin McClure <[hidden email]> wrote:

[excellent analysis snipped]
Not only that, it would use an at: that avoids the bounds check and the isInteger check, because it would be able to prove that the result of findKeyOrNil: is an index in range. So the speed-ups would be even higher than those achieved by this experiment.

A Self-style VM can still be implemented to target bytecode and have the VM generate the machine code. The VM needs additional bytecodes to express things like a non-bounds-checking at:, etc. But you and I know this. We've been talking about it for a while now. I'm looking forward to working on this in Cog.

Eliot
Qwaq, Inc.

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Sorry for going off-topic again, but speaking of findKeyOrNil: (and its cousin findElementOrNil: in Set), I have a hard time seeing the reasoning behind this part:

    pass := pass + 1.
    pass > 2 ifTrue: [^self grow findElementOrNil: anObject]

As far as I can tell, this will only trigger if:
1. anObject is not part of the Set/Dictionary.
2. There are no nil slots in the Set/Dictionary (i.e. it's entirely full).

It prevents an infinite loop if the two conditions were ever met, but I can't see how that can ever be the case:
- The basic add methods all have a fullCheck, which increases space if an add would otherwise lead to an entirely full dictionary.
- The senders of noCheckAdd: that I could find are only called by methods which ensure the size is bigger than the number of elements added with noCheckAdd:.

Is there another way to add elements/keys which would necessitate this step?

I could somewhat understand the reasoning behind it if the check were pass > 1 (though then the pass variable wouldn't really be needed at all). In that case the underlying assumption could be that if you somehow reached the last element of the set for an object whose initialIndex belonged to another object, and all subsequent elements until the end of the set were full, the most probable reason would be a high collision rate, and a grow would reduce the number of collisions, speeding up subsequent operations. With the new hash functions, which, if I've understood right, have a very low collision rate, even that is an extremely rare case.

Cheers,
Henry

Mark Plas wrote:
> [full quote of the previous exchange snipped]
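The two-pass scan Henrik is questioning can be sketched as follows. This is a Python sketch of the open-addressing logic under discussion, not the actual VisualWorks code; the function and parameter names are made up:

```python
def find_element_or_nil(table, an_object, initial_index):
    """Linear probe with wraparound over an open-addressing table.

    Pass 1 scans from the hash-derived start index to the end of the
    table; pass 2 wraps around and scans from the beginning. Only if
    both passes fail (table entirely full, object absent) would the
    real code grow and retry -- the case Henrik argues is unreachable
    because fullCheck never lets the table fill up completely.
    """
    n = len(table)
    # Pass 1: initial index to end of table.
    for i in range(initial_index, n):
        if table[i] is None or table[i] == an_object:
            return i
    # Pass 2: wrap around, scan from the start.
    for i in range(0, initial_index):
        if table[i] is None or table[i] == an_object:
            return i
    # Both passes failed: stand-in for ^self grow findElementOrNil: anObject
    return None
```

With at least one nil slot guaranteed, the wraparound pass always terminates at a nil or a match, which is exactly why the `pass > 2` branch looks dead.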
Henrik Johansen wrote:
> but speaking of findKeyOrNil: (and its cousin findElementOrNil: in Set),
> I have a hard time seeing the reasoning behind the part:
>     pass := pass + 1.
>     pass > 2 ifTrue: [^self grow findElementOrNil: anObject]

In pass one the search starts somewhere in 'the middle' of the dictionary and proceeds scanning to the end of the dict. In pass two the search wraps around to the start and then proceeds to scan the whole dict (which is more than needed). When both passes fail, the dict is grown and the search is retried using recursion.

> As far as I can tell, this will only trigger if:
> 1. anObject is not part of the Set/Dictionary.
> 2. There are no nil slots in the Set/Dictionary (i.e. it's entirely full).

Correct.

> It prevents an infinite loop if the 2 conditions were ever met, but I
> can't see how that can ever be the case. [...]
> Is there another way to add elements/keys which would necessitate this
> step?

See Dictionary>>add: for an example. Browse senders of #findKeyOrNil: to find others.

> I could somewhat understand a reasoning behind it if the check was
> pass > 1 [...] the underlying assumption could be that if you somehow
> reached the last element of the set for an object whose initialIndex
> was another object, and all subsequent elements until the end of the
> set were full, [...]

The dictionary treats its content as a circular buffer: if a search for a hash hits the end of the dict, it wraps around to the beginning to find a slot for that hash 'bucket'. What you say only seems to hold if the dictionary did not do this wrapping trick, but grew instead.

More in general: I find it quite typical for Smalltalk programs to do a lot of the kind of 'double' work you point out. In Smalltalk the emphasis is on code readability and correctness, but it seems that as a result a lot of this duplicative work gets hidden so deeply that it is hard to find (and hence remove). In my opinion the VM should remove those inefficiencies (but it seems I'll need to wait several decades before programming tools reach that level)...

R
-
My point exactly.
The second scan never scans to the end of the dict, since the set of keys is never entirely full: add: performs a fullCheck, and the callers of noCheckAdd: never fill the Set entirely. Hence the scan either finds the key in one of the slots before the original index, or it finds a nil if the key isn't in the dictionary. Thus the whole pass > 2 test is really not needed... (unless, as I said, there's some way of adding keys I'm not aware of).

Reinout Heeck wrote:
> [full quote of the previous message snipped]
In reply to this post by Joachim Geidel
Hello,
I tried to simulate what would happen if the copied values are not stored in the copiedValues field, but as indexed instance variables. Class "TestA" corresponds to the current BlockClosure class with three instance variables: in the case of one copied value we store it directly in field "b" of TestA; in the case of multiple values we store them in an Array. Class "TestB" has two instance variables and stores the values in indexed variables.

The timings are, of course, only indicative, as this is Smalltalk code and not VM code. Except for the case of one copied value (similar results), the numbers suggest that the indexed-variables variant may be the better alternative. Note that this is also relevant for full-copying blocks. Note too that today we should not only focus on performance, but also on energy efficiency, the ability to run in limited space (embedded applications), and so on, so I consider any gain worthwhile.

Regards,
michel

===

a) no copied value

Time millisecondsToRun: [400000000 timesRepeat: [TestA basicNew]]      9381
Time millisecondsToRun: [400000000 timesRepeat: [TestB basicNew: 0]]   8941

AllocationProfiler profile: [400000000 timesRepeat: [TestA basicNew]]
    36258 scavenges, 1 incGCs, 9599999400 bytes
AllocationProfiler profile: [400000000 timesRepeat: [TestB basicNew: 0]]
    30212 scavenges, 1 incGCs, 7999999280 bytes

b) 1 copied value

Time millisecondsToRun: [400000000 timesRepeat: [TestA basicNew b: #a]]                 13329
Time millisecondsToRun: [400000000 timesRepeat: [(TestB basicNew: 1) at: 1 put: #a]]    12742

AllocationProfiler profile: [400000000 timesRepeat: [TestA basicNew b: #a]]
    36258 scavenges, 1 incGCs, 9599999400 bytes
AllocationProfiler profile: [400000000 timesRepeat: [(TestB basicNew: 1) at: 1 put: #a]]
    36258 scavenges, 1 incGCs, 9599999400 bytes

c) 2 copied values

Time millisecondsToRun: [400000000 timesRepeat: [
    | a |
    a := Array basicNew: 2.
    a at: 1 put: #a; at: 2 put: #b.
    TestA basicNew b: a]]                                                               23409
Time millisecondsToRun: [400000000 timesRepeat: [
    (TestB basicNew: 2) at: 1 put: #a; at: 2 put: #b]]                                  14228

AllocationProfiler profile: [400000000 timesRepeat: [
    | a |
    a := Array basicNew: 2.
    a at: 1 put: #a; at: 2 put: #b.
    TestA basicNew b: a]]
    66324 scavenges, 1 incGCs, 17599999384 bytes
AllocationProfiler profile: [400000000 timesRepeat: [
    (TestB basicNew: 2) at: 1 put: #a; at: 2 put: #b]]
    42253 scavenges, 0 incGCs, 11199999160 bytes

d) 4 copied values

Time millisecondsToRun: [400000000 timesRepeat: [
    | a |
    a := Array basicNew: 4.
    a at: 1 put: #a; at: 2 put: #b; at: 3 put: #c; at: 4 put: #d.
    TestA basicNew b: a]]                                                               31158
Time millisecondsToRun: [400000000 timesRepeat: [
    (TestB basicNew: 4) at: 1 put: #a; at: 2 put: #b; at: 3 put: #c; at: 4 put: #d]]    22184

AllocationProfiler profile: [400000000 timesRepeat: [
    | a |
    a := Array basicNew: 4.
    a at: 1 put: #a; at: 2 put: #b; at: 3 put: #c; at: 4 put: #d.
    TestA basicNew b: a]]
    78298 scavenges, 1 incGCs, 20800000000 bytes
AllocationProfiler profile: [400000000 timesRepeat: [
    (TestB basicNew: 4) at: 1 put: #a; at: 2 put: #b]]
    54229 scavenges, 1 incGCs, 14399999532 bytes
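The two object layouts Michel is comparing can be sketched like this. This is a Python analogy with hypothetical class names (the real BlockClosure layout lives in the VM); the point is the allocation count, not the absolute cost:

```python
class TestA:
    """Closure-like object whose copied values live in a separate Array
    referenced from one fixed field (the copiedValues scheme): storing
    more than one value costs two allocations, the object plus the array."""
    def __init__(self, values):
        self.outer_context = None
        self.method = None
        self.copied_values = list(values)   # the extra Array allocation

class TestB:
    """Closure-like object holding copied values in indexed slots
    (variable-sized object): always a single allocation."""
    def __init__(self, n):
        self.outer_context = None
        self.method = None
        self._slots = [None] * n            # stands in for indexed instvars

    def at_put(self, i, value):             # Smalltalk at:put: is 1-based
        self._slots[i - 1] = value

    def at(self, i):
        return self._slots[i - 1]
```

Halving the number of allocations for multi-value closures is exactly what shows up in Michel's scavenge and byte counts for the two- and four-value cases.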
In reply to this post by Mark Plas
On Mon, Oct 13, 2008 at 12:28 AM, Mark Plas <[hidden email]> wrote:
Look at your inlined at:

    ...
    length := self basicSize.
    ...
    index := (length > 4095 and: [aHashValue <= 16383])
                ifTrue: [aHashValue * (length // 4095 + 1) \\ length + 1]
                ifFalse: [aHashValue \\ length + 1].
    ...
    [(index := index + 1) > length

From these statements an optimizing compiler can infer that index is always within 1 to self basicSize, i.e.

    r := exp \\ length   =>   r >= 0 && r < length, or failure

hence

    index >= 1 && index <= length && length == self basicSize

and hence

    index >= 1 && index <= self basicSize

or some such chain of reasoning. An SSA-style optimizer would accommodate such variable range analysis relatively easily.
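The inference Eliot sketches is ordinary interval arithmetic over integer ranges. A toy version (entirely hypothetical; a real SSA-based optimizer tracks ranges per SSA value, but the arithmetic is the same) might look like:

```python
class Interval:
    """Closed integer interval [lo, hi] for a toy value-range analysis."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def mod(self, length_hi):
        # r := exp \\ length  =>  0 <= r < length  (for positive length)
        return Interval(0, length_hi - 1)

    def add(self, k):
        # adding a constant shifts both bounds
        return Interval(self.lo + k, self.hi + k)

    def within(self, lo, hi):
        return self.lo >= lo and self.hi <= hi

# Model: index := aHashValue \\ length + 1, where length == self basicSize
# and basicSize is at most N for this table.
N = 4096
hash_range = Interval(0, 16383)          # any non-negative hash value
index = hash_range.mod(N).add(1)         # [0, N-1] + 1  =  [1, N]
assert index.within(1, N)                # bounds check provably redundant
```

Since the interval [1, N] lies inside the valid index range of the receiver, the compiler can emit a non-bounds-checking at: for this send, which is precisely the optimization under discussion.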
Hi Eliot,

I had never heard of "SSA-style compilers". I thought that only humans would be able to do this kind of code analysis, but it seems the basic ideas for it already existed in the '80s (according to Wikipedia).

Thanks for your answer,
Mark

From: Eliot Miranda [mailto:[hidden email]]

[full quote of the previous exchange snipped]
In reply to this post by Michel Tilman
Hi Michel,
An interesting experiment, and it sounds like it would not be too hard to implement. The way I see it, these are the steps to take to get the VM to the next level of performance:

1) inlining methods
2) more efficient block closures using your technique
3) extra code analysis to avoid type checks and in-bounds checks

Now, if Cincom could put this into their 7.7 release...

Mark

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Michel Tilman
Sent: maandag 13 oktober 2008 20:23
To: [hidden email]
Subject: Re: [vwnc] VM optimization: code inlining?

[Michel's benchmark message snipped]
Thanks Mark,
That was indeed one of my goals: it seemed to me that this would be fairly easy to implement. In fact, you can get rid of more data (though not of the overhead of the closure object itself). For instance, the "outerContext" field is not needed in many cases and could be dynamic, and if we really set our minds to it we could even get rid of the "method" field by making each closure an instance of a block-specific subclass of BlockClosure, though I suspect this would impact more than one or two things.

The code analysis is interesting (a while ago I intended to work on SSA-type representations, though more in the context of parallelization, but I have not found the time yet). It seems to me that it is not without its own problems (type info and inference, the (fixed?) semantics of arithmetic operators, ...), but that also makes it interesting...

Regards,
michel

On 14 Oct 2008, at 10:12, Mark Plas wrote:
> Hi Michel,
>
> An interesting experiment that sounds like it would not be too hard
> to implement.
> [rest of message snipped]
In reply to this post by Mark Plas
Mark Plas wrote:
> To Martin: You wrote "since shared variable access is quite slow
> compared to instvar access.", so I changed my method to use an instvar,
> but it was a bit slower than the version with a shared variable. What
> would be the reason for this?

Interesting. I'm not sure why you would have seen a slowdown. In general, instvar access is quite a bit faster than shared variable access:

test
    ^Time microsecondsToRun: [10000000 timesRepeat: [| a | a := instVar]]

runs in about 38 milliseconds, whereas

test
    ^Time microsecondsToRun: [10000000 timesRepeat: [| a | a := SharedVar]]

takes about 90 milliseconds. If one takes into account that an empty loop takes about 30 milliseconds, this is 8 milliseconds vs. 60 milliseconds, almost a full order of magnitude difference. If I accounted for the common overhead of the assignment (bytecode <4D> store local 1; pop), the difference for just accessing the variable would probably be more than an order of magnitude.

Within the past year I have been able to make measurable speed-ups in real-world programs by caching the value of frequently accessed shared variables in instvars.

Regards,
-Martin
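An analogous trick exists in Python, where a module-global name lookup in a hot loop costs more than a local one; caching the global in a local variable mirrors Martin's technique of caching a shared variable in an instvar. A hedged sketch (the names are made up, and only the equivalence of results is asserted, not the timing):

```python
GLOBAL_VALUE = 42   # plays the role of a Smalltalk shared variable

def sum_uncached(n):
    total = 0
    for _ in range(n):
        total += GLOBAL_VALUE      # global lookup on every iteration
    return total

def sum_cached(n):
    cached = GLOBAL_VALUE          # fetch once, like copying a shared
    total = 0                      # variable into an instvar
    for _ in range(n):
        total += cached            # cheap local access in the loop
    return total
```

Both functions compute the same result; the cached version simply pays the expensive lookup once instead of n times, which is the whole of Martin's optimization.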
Hi Martin,
I've found what the problem is. This is the way I implemented the #at: method using an inst var:

at: key
    | result |
    result := self at: key ifAbsent: [undefinedValue].
    result == undefinedValue ifTrue: [self keyNotFoundErrorFor: #at: index: key].
    ^result

When you inspect the bytecodes of the method, it requires a copying block for the [undefinedValue]. When using a shared variable the copying block isn't required and a clean block is used instead. When I change the method to this:

at: key
    | result |
    result := self at: key ifAbsent: undefinedValue.
    result == undefinedValue ifTrue: [self keyNotFoundErrorFor: #at: index: key].
    ^result

(removing the block for ifAbsent:), it is clearly faster. I suppose you implemented it this way when you did your experiments? The results then are:

1) #at: with shared var               --> 4.2 sec
2) #at: with inst var & copying block --> 4.3 sec
3) #at: with inst var & no block      --> 3.9 sec

Mystery solved!

Mark

-----Original Message-----
From: Martin McClure [mailto:[hidden email]]
Sent: woensdag 15 oktober 2008 0:01
To: Mark Plas
Cc: [hidden email]
Subject: Re: [vwnc] VM optimization: code inlining?

[Martin's message snipped]
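A loose analogy to the copying-block vs no-block distinction exists in Python: a callable default that closes over self must allocate and carry a closure, while a free-standing sentinel value needs no closure at all. The classes and names below are hypothetical, purely to illustrate the shape of Mark's fix:

```python
_ABSENT = object()   # module-level sentinel, like a shared variable

class Dict1:
    """at: via a callable default that captures self -- the analogue
    of [undefinedValue] compiling to a copying block."""
    def __init__(self):
        self._data = {}
        self._undefined = object()   # per-instance sentinel (instvar)

    def at(self, key):
        absent = lambda: self._undefined   # closure over self, allocated per call
        result = self._data.get(key, _ABSENT)
        if result is _ABSENT:
            result = absent()
        if result is self._undefined:
            raise KeyError(key)
        return result

class Dict2(Dict1):
    """at: via a plain sentinel value -- the analogue of passing
    undefinedValue directly, with no block at all."""
    def at(self, key):
        result = self._data.get(key, _ABSENT)  # no closure anywhere
        if result is _ABSENT:
            raise KeyError(key)
        return result
```

The behavioral contract is identical; Dict2 just skips the per-call closure, which is why Mark's third variant came out fastest.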
In reply to this post by Mark Plas
Hmmm. It seems my emails don't reach the vwnc list anymore?
-----Original Message-----
From: Mark Plas
Sent: dinsdag 14 oktober 2008 10:12
To: 'Michel Tilman'; [hidden email]
Subject: RE: [vwnc] VM optimization: code inlining?

[re-send of the earlier messages, snipped]
In reply to this post by Mark Plas
Mark Plas wrote:
> Hi Martin,
>
> I've found what the problem is. This is the way I implemented the #at:
> method using an inst var:
>
> at: key
>     | result |
>     result := self at: key ifAbsent: [undefinedValue].
>     result == undefinedValue ifTrue: [self keyNotFoundErrorFor: #at: index: key].
>     ^result
>
> When you inspect the bytecodes of the method, it requires a copying
> block for the [undefinedValue]. When using a shared variable the
> copying block isn't required and a clean block is used instead.

I hadn't thought about this case, in which you must use a copying block referencing "self" before you can access an instvar from the block.

> I suppose you implemented it this way when you did your experiments?

No, I actually did it the way you did for the intermediate experiment, then inlined at:ifAbsent: for the second improved version. I didn't compare instvar to shared var; I just used an instvar because "I know it's faster" (one of those dangerous things to "know"). That probably explains why my first improved version was still creating objects and doing (fewer) scavenges: the block was still copying, but must have been creating smaller closures.

Thanks for pointing this out.

Regards,
-Martin