Hi,
I think that most of you have seen the tinyBenchmarks results:
http://fbanados.wordpress.com/2011/02/10/a-tinybenchmark/

In order to understand why we come last, I've made some benchmarks.

(***WARNING*** my VM was compiled without the generational GC in order to use Valgrind)

1) Simple optimized bytecodes:
========================

x timesRepeat: [ 1 + 1 ]

[1] source code line number 1
[3] push 100
[5] dup stack top
[7] dup stack top
[9] push 1
    send 1 args message #>=
[11] pop and if false jump to 21
[13] push 1
    send 1 args message #-
[15] push 1
[17] push 1
    send 1 args message #+
[19] pop stack top
    jump to 7
[21] pop stack top
[23] return stack top

This only executes optimized bytecodes and never sends a real message; I chose it to stress the bytecode decoder, and it never triggers a GC.

I've also done it with other optimized messages like at: and at:put:.

We are up to 3 times faster than Cog.

2) Some optimized message sends:
==========================

SmallInteger [
    selfReturn [
        ^ self
    ]

    literalReturn [
        ^ Object
    ]
]

x timesRepeat: [ 1 selfReturn ]  or  x timesRepeat: [ 1 literalReturn ]

Here we stress another part of the VM, _gst_message_send:
1) for "selfReturn" a full lookup is done the first time
2) after that, the method cache does its work

In _gst_message_send_internal the message is optimized too: it directly returns self or the literal. We never create a context and never trigger a GC either.

Here again we are faster than Cog.

3) Simple context activation:
=====================

SmallInteger [
    foo [
        ^ 1+1
    ]
]

Again we stress _gst_message_send_internal, but this time the message is really sent. What's the difference:
- a context is allocated
- and recycled
- GC is never called

Here again we are faster than Cog.

4) Now here comes the problem:
=======================

SmallInteger [
    foo: anInteger time: aTimeInteger [
        anInteger > 0 ifTrue: [
            ^ self foo: anInteger - 1 time: aTimeInteger
        ].

        ObjectMemory quit.
    ]
]

Here another part of the VM is stressed: context activation. Contexts are not recycled here, and this is the problem.

1) a GC is called => 76% of the total execution time, which seems to be the problem

2) when gst runs out of free chunks with long recursions, it crashes: empty_context_stack

3) an OOP table entry is allocated every time, so gst can also run low on OOPs and trigger a GC

I hope these tiny simple benchmarks will help the gst community ;-)

Cheers,
Gwen
_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
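[Editorial note: the cases above can be timed from gst in the style Paolo uses later in this thread. This is a minimal sketch, not from the original message; the iteration count (1000000) is an arbitrary choice, and selfReturn is the SmallInteger extension defined above.]

```smalltalk
"Sketch of a timing harness for cases 1 and 2 above.
 The iteration count is arbitrary; selfReturn is the
 extension method defined in this message."
SmallInteger [
    selfReturn [ ^ self ]
]

"case 1: optimized bytecodes only, no real message send"
(Time millisecondsToRun: [ 1000000 timesRepeat: [ 1 + 1 ] ]) printNl.

"case 2: optimized send, answered from the method cache"
(Time millisecondsToRun: [ 1000000 timesRepeat: [ 1 selfReturn ] ]) printNl.
```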
Sorry, I forgot the bench file ;-)
Cheers,
Gwen

On Mon, Feb 14, 2011 at 11:50 AM, Gwenaël Casaccio <[hidden email]> wrote:
> [...]

test.st (2K) Download Attachment
On 02/14/2011 11:51 AM, Gwenaël Casaccio wrote:
>> 4) Now here comes the problem:
>> =======================
>>
>> SmallInteger [
>>     foo: anInteger time: aTimeInteger [
>>         anInteger > 0 ifTrue: [
>>             ^ self foo: anInteger - 1 time: aTimeInteger
>>         ].
>>
>>         ObjectMemory quit.
>>     ]
>> ]

You're calling this with anInteger = 90000, and in this case I do expect GC to be responsible for the bad performance. However, the numbers should be very different for, say, a depth of 50 as in your microbenchmark:

(Time millisecondsToRun: [ 1000000 timesRepeat: [ 5 recursionWithReturn: 50 ] ]) printNl.

How do gst/Cog/Squeak compare in this case?

Also, your benchmarks are missing one very important case, namely array access. I believe this is the cause of the slowdown in the bytecode benchmark, especially since you proved that everything else is faster. :)  This cannot really be helped, because it's due to the object table.

Paolo
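[Editorial note: the array-access case Paolo suggests could be sketched as below. The array size (100), indices, and iteration count are arbitrary choices, not from the thread; on gst every at:/at:put: goes through the object table, which is the cost he refers to.]

```smalltalk
"Sketch of the missing array-access microbenchmark.
 Size and iteration count are arbitrary choices."
| a |
a := Array new: 100.
1 to: 100 do: [ :i | a at: i put: i ].
(Time millisecondsToRun: [
    1000000 timesRepeat: [ a at: 50 put: (a at: 49) + 1 ] ]) printNl.
```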