Hi,
I'm working on profiling squeak/etoys for the olpc project. It seems to me that bytecodes cases in the interpret function (in /platform/unix/src/vm/interp.c) are not ordered in respect to the probability of their execution. Here is a graph for reference (more games in Etoys have the same pattern): http://www.bodhidharma.info/instructions.pdf What I did so far is changing the `256 cases switch' statement to an array of function pointers like (*exec_bytecode_funcs)[bytecode_id]( ...shared data... ); The process was automated with a python script and that needed a little bit of code cleaning; after some testing I couldn't trigger any sort of bug in the new code. if we assume a constant time T both for the execution of every [if !(right_case) jump next_case] in the old code and for the function pointer dereference in the new code there is a 10000% gain for the task of calling the right action for a given bytecode (also given the distribution showed in the graph linked above). In my tests this is a 20/30% win over the interpret routine timings for different games in etoys. I appreciate any comments on this work. Maybe it could be done better than this ? Thanks, Riccardo |
Riccardo Lucchese wrote:
> Hi, > > I'm working on profiling squeak/etoys for the olpc project. > > It seems to me that bytecodes cases in the interpret function > (in /platform/unix/src/vm/interp.c) are not ordered in respect to > the probability of their execution. > > Here is a graph for reference (more games in Etoys have > the same pattern): > http://www.bodhidharma.info/instructions.pdf > > What I did so far is changing the `256 cases switch' statement to an array > of function pointers like (*exec_bytecode_funcs)[bytecode_id]( > ...shared data... ); > The process was automated with a python script and that needed a little bit > of code cleaning; after some testing I couldn't trigger any sort of > bug in the new code. > > > if we assume a constant time T both for the execution of every > [if !(right_case) jump next_case] in the old code and for the > function pointer dereference in the new code there is a 10000% gain > for the task of calling the right action for a given bytecode > (also given the distribution showed in the graph linked above). > > In my tests this is a 20/30% win over the interpret routine timings > for different games > in etoys. > Is that 20/30 % overall speedup ? If so, it is great! Karl > I appreciate any comments on this work. > Maybe it could be done better than this ? > > Thanks, > Riccardo > > > |
On Sat, Apr 19, 2008 at 1:48 PM, karl <[hidden email]> wrote:
> > Is that 20/30 % overall speedup ? If so, it is great! > > Karl > I'm working on one function at time as they step up in the profiling results. Cannot say what the overall improvements exactly is; in the patched code execution time is dominated by primitiveWarpBits instead of the interpret function. As those two functions are largely the most cpu intensive ones I think that the overall speedup is quite sensible. Anybody already working on performances in squeak ?; I'd really like to help in this part of the platform ! Riccardo |
In reply to this post by Riccardo Lucchese
Hi!
Quoting Riccardo Lucchese <[hidden email]>: > Hi, > > I'm working on profiling squeak/etoys for the olpc project. > > It seems to me that bytecodes cases in the interpret function > (in /platform/unix/src/vm/interp.c) are not ordered in respect to > the probability of their execution. > Since the switch-case statement in C is compiled into a jump table (http://en.wikipedia.org/wiki/Jump_table) the order of the branches shouldn't affect the speed of execution (you can even start with the default branch). > Here is a graph for reference (more games in Etoys have > the same pattern): > http://www.bodhidharma.info/instructions.pdf > > What I did so far is changing the `256 cases switch' statement to an array > of function pointers like (*exec_bytecode_funcs)[bytecode_id]( > ...shared data... ); Actually it should be slower, because function calls used to be slower than simple jumps or arithmetic instructions. > The process was automated with a python script and that needed a little bit > of code cleaning; after some testing I couldn't trigger any sort of > bug in the new code. > AFAIK the interp.c is generated with VMMaker, so for a new vm, you have to run your script again. A pure Squeak solution might be better. > > if we assume a constant time T both for the execution of every > [if !(right_case) jump next_case] in the old code and for the > function pointer dereference in the new code there is a 10000% gain > for the task of calling the right action for a given bytecode > (also given the distribution showed in the graph linked above). This gain didn't come because of the jump table. > > In my tests this is a 20/30% win over the interpret routine timings > for different games > in etoys. Are you sure that the performance gain came from these changes? If so, then the only reason for the speedup i can think of is that the most common bytecodes' code are scattered across memory pages and there are more page faults with the switch-case implementation. I wonder if you could check how many page faults does the two implementations have. Cheers, Levente > > I appreciate any comments on this work. > Maybe it could be done better than this ? > > Thanks, > Riccardo > > |
In reply to this post by Riccardo Lucchese
Riccardo Lucchese wrote:
> On Sat, Apr 19, 2008 at 1:48 PM, karl <[hidden email]> wrote: > >> Is that 20/30 % overall speedup ? If so, it is great! >> >> Karl >> >> > > I'm working on one function at time as they step up in the profiling results. > > Cannot say what the overall improvements exactly is; in the patched code > execution time is dominated by primitiveWarpBits instead of the > interpret function. > As those two functions are largely the most cpu intensive ones I think that > the overall speedup is quite sensible. > > > Anybody already working on performances in squeak ?; I'd really > like to help in this part of the platform ! > I think the VM list is where primitive and plugin questions are mostly addressed: http://lists.squeakfoundation.org/mailman/listinfo/vm-dev Karl |
In reply to this post by Levente Uzonyi-2
2008/4/19 Levente Uzonyi <[hidden email]>:
> Hi! > > > Quoting Riccardo Lucchese <[hidden email]>: > > > > Hi, > > > > I'm working on profiling squeak/etoys for the olpc project. > > > > It seems to me that bytecodes cases in the interpret function > > (in /platform/unix/src/vm/interp.c) are not ordered in respect to > > the probability of their execution. > > > > > > Since the switch-case statement in C is compiled into a jump table > (http://en.wikipedia.org/wiki/Jump_table) the order of the branches > shouldn't affect the speed of execution (you can even start with the default > branch). > > > > > Here is a graph for reference (more games in Etoys have > > the same pattern): > > http://www.bodhidharma.info/instructions.pdf > > > > What I did so far is changing the `256 cases switch' statement to an array > > of function pointers like (*exec_bytecode_funcs)[bytecode_id]( > > ...shared data... ); > > > > Actually it should be slower, because function calls used to be slower than > simple jumps or arithmetic instructions. > > > > > The process was automated with a python script and that needed a little > bit > > of code cleaning; after some testing I couldn't trigger any sort of > > bug in the new code. > > > > > > AFAIK the interp.c is generated with VMMaker, so for a new vm, you have to > run your script again. A pure Squeak solution might be better. > > Moreover, there are already a script which does exacly same what Riccardo did. See gnuify step, which transforms interp.c to gnu-interp.c and replaces all cases with jump labels and converts code which uses jump table. The reason why it fails, can be that you did some modifications to code, which makes this script fail to find exact patterns in generated sources. But if you following common procedure for generating VM (VMMaker -> make), it should work ok. > > > > > if we assume a constant time T both for the execution of every > > [if !(right_case) jump next_case] in the old code and for the > > function pointer dereference in the new code there is a 10000% gain > > for the task of calling the right action for a given bytecode > > (also given the distribution showed in the graph linked above). > > > > This gain didn't come because of the jump table. > > > > > > > In my tests this is a 20/30% win over the interpret routine timings > > for different games > > in etoys. > > > > Are you sure that the performance gain came from these changes? If so, then > the only reason for the speedup i can think of is that the most common > bytecodes' code are scattered across memory pages and there are more page > faults with the switch-case implementation. I wonder if you could check how > many page faults does the two implementations have. > > Cheers, > Levente > > > > > > > > I appreciate any comments on this work. > > Maybe it could be done better than this ? > > > > Thanks, > > Riccardo > > > > > > > > > > > -- Best regards, Igor Stasenko AKA sig. |
In reply to this post by Levente Uzonyi-2
> Since the switch-case statement in C is compiled into a jump table
> (http://en.wikipedia.org/wiki/Jump_table) the order of the branches > shouldn't affect the speed of execution (you can even start with the default > branch). Really many thanks for the pointer!, didn't know this one! > > The process was automated with a python script and that needed a little > bit > > of code cleaning; after some testing I couldn't trigger any sort of > > bug in the new code. > > > > > > AFAIK the interp.c is generated with VMMaker, so for a new vm, you have to > run your script again. A pure Squeak solution might be better. I don't understand what you mean. I just patched the c file and rebuild squeak (interp.c doesn't get generated at every build.) > Are you sure that the performance gain came from these changes? If so, then > the only reason for the speedup i can think of is that the most common > bytecodes' code are scattered across memory pages and there are more page > faults with the switch-case implementation. I wonder if you could check how > many page faults does the two implementations have. Is there some automated test that I can run to verify just the vm troughput ? A cachegrind on that test would give more interesting result. Thanks, Riccardo |
2008/4/19 Riccardo Lucchese <[hidden email]>:
> > Since the switch-case statement in C is compiled into a jump table > > (http://en.wikipedia.org/wiki/Jump_table) the order of the branches > > shouldn't affect the speed of execution (you can even start with the default > > branch). > > Really many thanks for the pointer!, didn't know this one! > > > > > > The process was automated with a python script and that needed a little > > bit > > > of code cleaning; after some testing I couldn't trigger any sort of > > > bug in the new code. > > > > > > > > > > AFAIK the interp.c is generated with VMMaker, so for a new vm, you have to > > run your script again. A pure Squeak solution might be better. > I don't understand what you mean. I just patched the c file and > rebuild squeak (interp.c doesn't get generated > at every build.) > > > > Are you sure that the performance gain came from these changes? If so, then > > the only reason for the speedup i can think of is that the most common > > bytecodes' code are scattered across memory pages and there are more page > > faults with the switch-case implementation. I wonder if you could check how > > many page faults does the two implementations have. > > Is there some automated test that I can run to verify just the vm troughput ? > A cachegrind on that test would give more interesting result. > I suggest you, before proceeding further you should read here http://wiki.squeak.org/squeak/2105 and here: http://squeakvm.org > Thanks, > Riccardo > > -- Best regards, Igor Stasenko AKA sig. |
> I suggest you, before proceeding further you should read here
> http://wiki.squeak.org/squeak/2105 > and here: http://squeakvm.org > Thanks for the links. I'm just pulling the source being used on the olpc platform. There is not such a test-tool to measure troughput? Thanks, Riccardo |
2008/4/19 Riccardo Lucchese <[hidden email]>:
> > I suggest you, before proceeding further you should read here > > http://wiki.squeak.org/squeak/2105 > > and here: http://squeakvm.org > > > Thanks for the links. I'm just pulling the source being used > on the olpc platform. > > > There is not such a test-tool to measure troughput? > Most smalltalkers measure throughput with following st code: 0 tinyBenchmarks :) There also some of the macro benchmarks, if you search a mail there was a discussion concerning them lately. > > Thanks, > Riccardo > > -- Best regards, Igor Stasenko AKA sig. |
>
> Most smalltalkers measure throughput with following st code: > > 0 tinyBenchmarks > mm.. how do I run it ? :) ... couldn't found any pointer in 40mins of googling. I never worked with st and that single line is driving me crazy :) |
Open transcript then do it on:
10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr] I run this before any mac carbon VM is shipped and compare the results to previous runs to ensure the compiler hasn't broken the expected assembler optimization. On Apr 19, 2008, at 9:22 AM, Riccardo Lucchese wrote: >> >> Most smalltalkers measure throughput with following st code: >> >> 0 tinyBenchmarks >> > > mm.. how do I run it ? :) > > ... couldn't found any pointer in 40mins of googling. > I never worked with st and that single line is driving me crazy :) > -- = = = ======================================================================== John M. McIntosh <[hidden email]> Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com = = = ======================================================================== |
In reply to this post by Riccardo Lucchese
On 19-Apr-08, at 7:22 AM, Riccardo Lucchese wrote: >> >> AFAIK the interp.c is generated with VMMaker, so for a new vm, you >> have to >> run your script again. A pure Squeak solution might be better. > I don't understand what you mean. I just patched the c file and > rebuild squeak (interp.c doesn't get generated > at every build.) The VM code exists in two parts a) the handwritten C (or C++ in some places) that implements the platform specific bits of the system such as the interface to the OS filing system, the IO etc. There are also some VM plugins with handwritten parts. b) the Slang code implemented in Smalltalk that get translated into C by the VMMaker tool. VMMaker is discussed at http://wiki.squeak.org/squeak/VMMaker and related links. It translates the relevant classes, writes the translated code to files, copies some of the handwritten files to suit the platform make setup and creates some config files. Each platform then has its own idiosyncratic (not to mention frequently idiotic) process to follow to compile and link that bunch into a working VM and (possibly, depending on details not relevant here) plugins. The central VM file is interp.c and you really shouldn't waste your time playing with it. It is barely readable, poorly structured and should be considered a temporary intermediate file; would you bother trying to read the intermediate language file(s) some compilers produce? The exception to this is when you have some really simple problem in the compiler - like complaints about a type mismatch for example - that you can very quickly edit in interp.c to see if it can be made to compile, after which you have to make the proper fix in the Slang code. Again, interp.c is a temporary file that should never ever be saved in any source control system. And yes, I know that it frequently is, and that is just plain bugfuck stupid. With regard to your original message about the bytecode dispatch 'switch' statement - well I don't think I've seen a compiler do something that dumb in a long time. A jump or branch table is the normal case and as has been mentioned the 'gnuify' step in the makefiles of most platforms even improves upon that when possible. One could consider building a table of function pointers and indexing into it with the bytecode - but that would put the table outside the run of the code and would require bytecode routines to be functions etc etc. In the current system the bytecode 'functions' are textually inlined into the main interpret() function as you would see by examing the VMMaker related code. It is distressingly easy to come up with 'optimisations' that completely ruin performance. Further, the bytecode dispatch time is unlikely to figure anywhere in typical performance profiles. Would anyone really characterise the performance of a system on the basis of how fast it can do trivial loops? Or is it more sensible to have decent benchmarks that reflect real usage? It is important to measure the right things the right way under the right circumstances if you want to uncover meaningful facts. In the early days of RISC development the designers would explain about how they ran benchmarks to find out what a cpu really spent its time doing and then they designed the architecture to optimise for those activities - no Test Left, Shift Mask and Dim the Lights instruction for us! - so that all would be wonderful. It was rather slyly pointed out that this could only result in a cpu that could run diagnostics and Eratosthenes Sieve *really* fast and left everything else to exception handling tim -- tim Rowledge; [hidden email]; http://www.rowledge.org/tim Futuristic: It will only run on a next generation supercomputer. |
In reply to this post by Igor Stasenko
On Apr 19, 2008, at 8:36 AM, Igor Stasenko wrote: > There also some of the macro benchmarks, if you search a mail there > was a discussion concerning them lately. Smalltalk macroBenchmarks A quick check showed the squeak image at http://ftp.squeak.org/3.4/ contained the code, and it was deleted in later versions, not sure when, but someone could look. I note standardTime: has to be changed the tieDown _ ByteArray new: spaceLeft - 1e7. "Leave exactly 10MB free" should be removed, it's intent was to pre-allocate memory but this might depending on the VM allocate 512MB or 1024MB of virtual memory because of how the VMs work now, and not have the desired effect it had when VM's had fixed memory sizes eight or so years back. -- = = = ======================================================================== John M. McIntosh <[hidden email]> Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com = = = ======================================================================== |
In reply to this post by Riccardo Lucchese
On Apr 19, 2008, at 1:28 AM, Riccardo Lucchese wrote: > Hi, > > I'm working on profiling squeak/etoys for the olpc project. Well I'll note since you are working with a specific hardware setup it's worth considering looking very carefully at the compiler optimization for the CPU in question. I'd guess the compiler setting used by the build process is fairly generic and does not take advantage of specific platform optimizations, thus producing safe and lousy code to make a 80386 happy... oh say see http://lists.laptop.org/pipermail/devel/2007-November/007430.html http://wiki.laptop.org/go/Geode_optimization_effort http://gcc.gnu.org/ml/gcc-patches/2006-08/msg00452.html Also setting -DUSE_INLINE_MEMORY_ACCESSORS or NOT setting in the compiler options to change the code usage for accessors in interp.c might also have a different result. since that changes now memory accessing is done. The other area is variable accessing, you'll note the foo structure is built and the variables are sorted by usage, I did this to guess better at cache line optimization versus a random distribution, or sorted alphabetically. However I don't think anyone has done a careful study to see how best the variables should be arranged. Mind when I did this on the powerpc I was *unable* to see any difference. Lastly ArraysToGlobalStruct-JMM.1.cs in the squeak mac source tree is used to modify VMMaker to put or not put arrays in the global structure foo. In the past depending on the compiler would produce good or horrible code for accessing an array that was part of a structure. You would need to look at this change set, the current VMMaker code, then see if using it makes a difference in the assembler code generated. Oh and -fgcse Fiddling with that might produce different things depending on the compiler.... -- = = = ======================================================================== John M. McIntosh <[hidden email]> Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com = = = ======================================================================== |
In reply to this post by Levente Uzonyi-2
Levente Uzonyi wrote:
>> It seems to me that bytecodes cases in the interpret function >> (in /platform/unix/src/vm/interp.c) are not ordered in respect to >> the probability of their execution. > > Since the switch-case statement in C is compiled into a jump table > (http://en.wikipedia.org/wiki/Jump_table) the order of the branches > shouldn't affect the speed of execution (you can even start with the > default branch). In theory. In practice, the effect of branch-mispredictions can be quite noticable: http://lists.squeakfoundation.org/pipermail/squeak-dev/2003-July/061996.html Cheers, - Andreas |
In reply to this post by johnmci
Hi!
> Open transcript then do it on: > > 10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr] What I'm doing is: - run startsqueak - dragging a trasncript window from the tools bar to the 'desktop' - write the line you suggested me * wait for the magic, but nothing happens :) Shame on me... mm how do kids actually sort it out ? :) Thanks you for your help. I'd just like to see some numbers and than shutup if I'm all wrong with my initial assumptions. Riccardo |
Hi!
Try pressing alt + d (if you have an alt key) or open the menu and select "do it" Levente Quoting Riccardo Lucchese <[hidden email]>: > Hi! > >> Open transcript then do it on: >> >> 10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr] > What I'm doing is: > - run startsqueak > - dragging a trasncript window from the tools bar to the 'desktop' > - write the line you suggested me > * wait for the magic, but nothing happens :) > > Shame on me... mm how do kids actually sort it out ? :) > > Thanks you for your help. I'd just like to see some numbers and than > shutup if > I'm all wrong with my initial assumptions. > > Riccardo > > |
In reply to this post by Riccardo Lucchese
>>>>> "Riccardo" == Riccardo Lucchese <[hidden email]> writes:
Riccardo> Hi! >> Open transcript then do it on: >> >> 10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr] Riccardo> What I'm doing is: Riccardo> - run startsqueak Riccardo> - dragging a trasncript window from the tools bar to the 'desktop' Riccardo> - write the line you suggested me Riccardo> * wait for the magic, but nothing happens :) Riccardo> Shame on me... mm how do kids actually sort it out ? :) Transcript is not Workspace. Transcript is where you see things. Workspace is where you type things, then use do-it or print-it or debug-it. -- Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095 <[hidden email]> <URL:http://www.stonehenge.com/merlyn/> Perl/Unix/security consulting, Technical writing, Comedy, etc. etc. See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training! |
2008/4/20 Randal L. Schwartz <[hidden email]>:
> >>>>> "Riccardo" == Riccardo Lucchese <[hidden email]> writes: > > Riccardo> Hi! > > >> Open transcript then do it on: > >> > >> 10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr] > Riccardo> What I'm doing is: > Riccardo> - run startsqueak > Riccardo> - dragging a trasncript window from the tools bar to the 'desktop' > Riccardo> - write the line you suggested me > Riccardo> * wait for the magic, but nothing happens :) > > Riccardo> Shame on me... mm how do kids actually sort it out ? :) > > Transcript is not Workspace. > > Transcript is where you see things. > > Workspace is where you type things, then use do-it or print-it or debug-it. > or any other place where you can enter text and press 'doit' :) Inspector and debugger related to context, so you can send messages to inst vars or use method's temps or arguments in expression. But it makes little difference for expressions which is context free , like '0 tinyBenchmarks' > -- > Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095 > <[hidden email]> <URL:http://www.stonehenge.com/merlyn/> > Perl/Unix/security consulting, Technical writing, Comedy, etc. etc. > See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training! > > -- Best regards, Igor Stasenko AKA sig. |
Free forum by Nabble | Edit this page |