Smalltalk › Squeak › Squeak - Dev

[squeak-dev] squeak profiling

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

23 messages Options

Riccardo Lucchese

[squeak-dev] squeak profiling

Hi,

I'm working on profiling squeak/etoys for the olpc project.

It seems to me that bytecodes cases in the interpret function
(in /platform/unix/src/vm/interp.c) are not ordered in respect to
the probability of their execution.

Here is a graph for reference (more games in Etoys have
the same pattern):
http://www.bodhidharma.info/instructions.pdf

What I did so far is changing the `256 cases switch' statement to an array
of function pointers like (*exec_bytecode_funcs)[bytecode_id](
...shared data... );
The process was automated with a python script and that needed a little bit
of code cleaning; after some testing I couldn't trigger any sort of
bug in the new code.

if we assume a constant time T both for the execution of every
[if !(right_case) jump next_case] in the old code and for the
function pointer dereference in the new code there is a 10000% gain
for the task of calling the right action for a given bytecode
(also given the distribution showed in the graph linked above).

In my tests this is a 20/30% win over the interpret routine timings
for different games
in etoys.

I appreciate any comments on this work.
Maybe it could be done better than this ?

Thanks,
Riccardo

Karl-19

Re: [squeak-dev] squeak profiling

Riccardo Lucchese wrote:

> Hi,
>
> I'm working on profiling squeak/etoys for the olpc project.
>
> It seems to me that bytecodes cases in the interpret function
> (in /platform/unix/src/vm/interp.c) are not ordered in respect to
> the probability of their execution.
>
> Here is a graph for reference (more games in Etoys have
> the same pattern):
> http://www.bodhidharma.info/instructions.pdf
>
> What I did so far is changing the `256 cases switch' statement to an array
> of function pointers like (*exec_bytecode_funcs)[bytecode_id](
> ...shared data... );
> The process was automated with a python script and that needed a little bit
> of code cleaning; after some testing I couldn't trigger any sort of
> bug in the new code.
>
>
> if we assume a constant time T both for the execution of every
> [if !(right_case) jump next_case] in the old code and for the
> function pointer dereference in the new code there is a 10000% gain
> for the task of calling the right action for a given bytecode
> (also given the distribution showed in the graph linked above).
>
> In my tests this is a 20/30% win over the interpret routine timings
> for different games
> in etoys.
>

Is that 20/30 % overall speedup ? If so, it is great!

Karl
> I appreciate any comments on this work.
> Maybe it could be done better than this ?
>
> Thanks,
> Riccardo
>
>
>

Riccardo Lucchese

Re: [squeak-dev] squeak profiling

On Sat, Apr 19, 2008 at 1:48 PM, karl <[hidden email]> wrote:
>
> Is that 20/30 % overall speedup ? If so, it is great!
>
> Karl
>

I'm working on one function at time as they step up in the profiling results.

Cannot say what the overall improvements exactly is; in the patched code
execution time is dominated by primitiveWarpBits instead of the
interpret function.
As those two functions are largely the most cpu intensive ones I think that
the overall speedup is quite sensible.

Anybody already working on performances in squeak ?; I'd really
like to help in this part of the platform !

Riccardo

Levente Uzonyi-2

Re: [squeak-dev] squeak profiling

In reply to this post by Riccardo Lucchese

Hi!

Quoting Riccardo Lucchese <[hidden email]>:

> Hi,
>
> I'm working on profiling squeak/etoys for the olpc project.
>
> It seems to me that bytecodes cases in the interpret function
> (in /platform/unix/src/vm/interp.c) are not ordered in respect to
> the probability of their execution.
>

Since the switch-case statement in C is compiled into a jump table
(http://en.wikipedia.org/wiki/Jump_table) the order of the branches
shouldn't affect the speed of execution (you can even start with the
default branch).

> Here is a graph for reference (more games in Etoys have
> the same pattern):
> http://www.bodhidharma.info/instructions.pdf
>
> What I did so far is changing the `256 cases switch' statement to an array
> of function pointers like (*exec_bytecode_funcs)[bytecode_id](
> ...shared data... );

Actually it should be slower, because function calls used to be slower
than simple jumps or arithmetic instructions.

> The process was automated with a python script and that needed a little bit
> of code cleaning; after some testing I couldn't trigger any sort of
> bug in the new code.
>

AFAIK the interp.c is generated with VMMaker, so for a new vm, you
have to run your script again. A pure Squeak solution might be better.

>
> if we assume a constant time T both for the execution of every
> [if !(right_case) jump next_case] in the old code and for the
> function pointer dereference in the new code there is a 10000% gain
> for the task of calling the right action for a given bytecode
> (also given the distribution showed in the graph linked above).

This gain didn't come because of the jump table.

>
> In my tests this is a 20/30% win over the interpret routine timings
> for different games
> in etoys.

Are you sure that the performance gain came from these changes? If so,
then the only reason for the speedup i can think of is that the most
common bytecodes' code are scattered across memory pages and there are
more page faults with the switch-case implementation. I wonder if you
could check how many page faults does the two implementations have.

Cheers,
Levente

>
> I appreciate any comments on this work.
> Maybe it could be done better than this ?
>
> Thanks,
> Riccardo
>
>

Karl-19

Re: [squeak-dev] squeak profiling

In reply to this post by Riccardo Lucchese

Riccardo Lucchese wrote:

> On Sat, Apr 19, 2008 at 1:48 PM, karl <[hidden email]> wrote:
>
>> Is that 20/30 % overall speedup ? If so, it is great!
>>
>> Karl
>>
>>
>
> I'm working on one function at time as they step up in the profiling results.
>
> Cannot say what the overall improvements exactly is; in the patched code
> execution time is dominated by primitiveWarpBits instead of the
> interpret function.
> As those two functions are largely the most cpu intensive ones I think that
> the overall speedup is quite sensible.
>
>
> Anybody already working on performances in squeak ?; I'd really
> like to help in this part of the platform !
>

Your effort is greatly appreciated, A faster VM is a good thing.
I think the VM list is where primitive and plugin questions are mostly
addressed:
http://lists.squeakfoundation.org/mailman/listinfo/vm-dev
Karl

Igor Stasenko

Re: [squeak-dev] squeak profiling

In reply to this post by Levente Uzonyi-2

2008/4/19 Levente Uzonyi <[hidden email]>:

> Hi!
>
>
> Quoting Riccardo Lucchese <[hidden email]>:
>
>
> > Hi,
> >
> > I'm working on profiling squeak/etoys for the olpc project.
> >
> > It seems to me that bytecodes cases in the interpret function
> > (in /platform/unix/src/vm/interp.c) are not ordered in respect to
> > the probability of their execution.
> >
> >
>
> Since the switch-case statement in C is compiled into a jump table
> (http://en.wikipedia.org/wiki/Jump_table) the order of the branches
> shouldn't affect the speed of execution (you can even start with the default
> branch).
>
>
>
> > Here is a graph for reference (more games in Etoys have
> > the same pattern):
> > http://www.bodhidharma.info/instructions.pdf
> >
> > What I did so far is changing the `256 cases switch' statement to an array
> > of function pointers like (*exec_bytecode_funcs)[bytecode_id](
> > ...shared data... );
> >
>
> Actually it should be slower, because function calls used to be slower than
> simple jumps or arithmetic instructions.
>
>
>
> > The process was automated with a python script and that needed a little
> bit
> > of code cleaning; after some testing I couldn't trigger any sort of
> > bug in the new code.
> >
> >
>
> AFAIK the interp.c is generated with VMMaker, so for a new vm, you have to
> run your script again. A pure Squeak solution might be better.
>
>

Moreover, there are already a script which does exacly same what Riccardo did.
See gnuify step, which transforms interp.c to gnu-interp.c and
replaces all cases with jump labels
and converts code which uses jump table.
The reason why it fails, can be that you did some modifications to
code, which makes this script fail to find exact patterns in generated
sources.
But if you following common procedure for generating VM (VMMaker ->
make), it should work ok.

>
> >
> > if we assume a constant time T both for the execution of every
> > [if !(right_case) jump next_case] in the old code and for the
> > function pointer dereference in the new code there is a 10000% gain
> > for the task of calling the right action for a given bytecode
> > (also given the distribution showed in the graph linked above).
> >
>
> This gain didn't come because of the jump table.
>
>
>
> >
> > In my tests this is a 20/30% win over the interpret routine timings
> > for different games
> > in etoys.
> >
>
> Are you sure that the performance gain came from these changes? If so, then
> the only reason for the speedup i can think of is that the most common
> bytecodes' code are scattered across memory pages and there are more page
> faults with the switch-case implementation. I wonder if you could check how
> many page faults does the two implementations have.
>
> Cheers,
> Levente
>
>
>
>
> >
> > I appreciate any comments on this work.
> > Maybe it could be done better than this ?
> >
> > Thanks,
> > Riccardo
> >
> >
> >
>
>
>
>
>

--
Best regards,
Igor Stasenko AKA sig.

Riccardo Lucchese

Re: [squeak-dev] squeak profiling

In reply to this post by Levente Uzonyi-2

> Since the switch-case statement in C is compiled into a jump table
> (http://en.wikipedia.org/wiki/Jump_table) the order of the branches
> shouldn't affect the speed of execution (you can even start with the default
> branch).

Really many thanks for the pointer!, didn't know this one!

> > The process was automated with a python script and that needed a little
> bit
> > of code cleaning; after some testing I couldn't trigger any sort of
> > bug in the new code.
> >
> >
>
> AFAIK the interp.c is generated with VMMaker, so for a new vm, you have to
> run your script again. A pure Squeak solution might be better.
I don't understand what you mean. I just patched the c file and
rebuild squeak (interp.c doesn't get generated
at every build.)

> Are you sure that the performance gain came from these changes? If so, then
> the only reason for the speedup i can think of is that the most common
> bytecodes' code are scattered across memory pages and there are more page
> faults with the switch-case implementation. I wonder if you could check how
> many page faults does the two implementations have.

Is there some automated test that I can run to verify just the vm troughput ?
A cachegrind on that test would give more interesting result.

Thanks,
Riccardo

Igor Stasenko

Re: [squeak-dev] squeak profiling

2008/4/19 Riccardo Lucchese <[hidden email]>:

> > Since the switch-case statement in C is compiled into a jump table
> > (http://en.wikipedia.org/wiki/Jump_table) the order of the branches
> > shouldn't affect the speed of execution (you can even start with the default
> > branch).
>
> Really many thanks for the pointer!, didn't know this one!
>
>
>
> > > The process was automated with a python script and that needed a little
> > bit
> > > of code cleaning; after some testing I couldn't trigger any sort of
> > > bug in the new code.
> > >
> > >
> >
> > AFAIK the interp.c is generated with VMMaker, so for a new vm, you have to
> > run your script again. A pure Squeak solution might be better.
> I don't understand what you mean. I just patched the c file and
> rebuild squeak (interp.c doesn't get generated
> at every build.)
>

>
> > Are you sure that the performance gain came from these changes? If so, then
> > the only reason for the speedup i can think of is that the most common
> > bytecodes' code are scattered across memory pages and there are more page
> > faults with the switch-case implementation. I wonder if you could check how
> > many page faults does the two implementations have.
>
> Is there some automated test that I can run to verify just the vm troughput ?
> A cachegrind on that test would give more interesting result.
>

I suggest you, before proceeding further you should read here
http://wiki.squeak.org/squeak/2105
and here: http://squeakvm.org

> Thanks,
> Riccardo
>
>

--
Best regards,
Igor Stasenko AKA sig.

Riccardo Lucchese

Re: [squeak-dev] squeak profiling

> I suggest you, before proceeding further you should read here
> http://wiki.squeak.org/squeak/2105
> and here: http://squeakvm.org
>
Thanks for the links. I'm just pulling the source being used
on the olpc platform.

There is not such a test-tool to measure troughput?

Thanks,
Riccardo

Igor Stasenko

Re: [squeak-dev] squeak profiling

2008/4/19 Riccardo Lucchese <[hidden email]>:

> > I suggest you, before proceeding further you should read here
> > http://wiki.squeak.org/squeak/2105
> > and here: http://squeakvm.org
> >
> Thanks for the links. I'm just pulling the source being used
> on the olpc platform.
>
>
> There is not such a test-tool to measure troughput?
>

Most smalltalkers measure throughput with following st code:

0 tinyBenchmarks

:)
There also some of the macro benchmarks, if you search a mail there
was a discussion concerning them lately.

>
> Thanks,
> Riccardo
>
>

--
Best regards,
Igor Stasenko AKA sig.

Riccardo Lucchese

Re: [squeak-dev] squeak profiling

>
> Most smalltalkers measure throughput with following st code:
>
> 0 tinyBenchmarks
>

mm.. how do I run it ? :)

... couldn't found any pointer in 40mins of googling.
I never worked with st and that single line is driving me crazy :)

johnmci

Re: [squeak-dev] squeak profiling

Open transcript then do it on:

10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr]

I run this before any mac carbon VM is shipped and compare the results
to previous runs to
ensure the compiler hasn't broken the expected assembler optimization.

On Apr 19, 2008, at 9:22 AM, Riccardo Lucchese wrote:

>>
>> Most smalltalkers measure throughput with following st code:
>>
>> 0 tinyBenchmarks
>>
>
> mm.. how do I run it ? :)
>
> ... couldn't found any pointer in 40mins of googling.
> I never worked with st and that single line is driving me crazy :)
>

--
=
=
=
========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
=
=
=
========================================================================

timrowledge

Re: [squeak-dev] squeak profiling

In reply to this post by Riccardo Lucchese

On 19-Apr-08, at 7:22 AM, Riccardo Lucchese wrote:
>>
>> AFAIK the interp.c is generated with VMMaker, so for a new vm, you
>> have to
>> run your script again. A pure Squeak solution might be better.
> I don't understand what you mean. I just patched the c file and
> rebuild squeak (interp.c doesn't get generated
> at every build.)

The VM code exists in two parts
a) the handwritten C (or C++ in some places) that implements the
platform specific bits of the system such as the interface to the OS
filing system, the IO etc. There are also some VM plugins with
handwritten parts.
b) the Slang code implemented in Smalltalk that get translated into C
by the VMMaker tool.

VMMaker is discussed at http://wiki.squeak.org/squeak/VMMaker and
related links. It translates the relevant classes, writes the
translated code to files, copies some of the handwritten files to suit
the platform make setup and creates some config files. Each platform
then has its own idiosyncratic (not to mention frequently idiotic)
process to follow to compile and link that bunch into a working VM and
(possibly, depending on details not relevant here) plugins.

The central VM file is interp.c and you really shouldn't waste your
time playing with it. It is barely readable, poorly structured and
should be considered a temporary intermediate file; would you bother
trying to read the intermediate language file(s) some compilers
produce? The exception to this is when you have some really simple
problem in the compiler - like complaints about a type mismatch for
example - that you can very quickly edit in interp.c to see if it can
be made to compile, after which you have to make the proper fix in the
Slang code. Again, interp.c is a temporary file that should never ever
be saved in any source control system. And yes, I know that it
frequently is, and that is just plain bugfuck stupid.

With regard to your original message about the bytecode dispatch
'switch' statement - well I don't think I've seen a compiler do
something that dumb in a long time. A jump or branch table is the
normal case and as has been mentioned the 'gnuify' step in the
makefiles of most platforms even improves upon that when possible. One
could consider building a table of function pointers and indexing into
it with the bytecode - but that would put the table outside the run of
the code and would require bytecode routines to be functions etc etc.
In the current system the bytecode 'functions' are textually inlined
into the main interpret() function as you would see by examing the
VMMaker related code. It is distressingly easy to come up with
'optimisations' that completely ruin performance. Further, the
bytecode dispatch time is unlikely to figure anywhere in typical
performance profiles. Would anyone really characterise the performance
of a system on the basis of how fast it can do trivial loops? Or is it
more sensible to have decent benchmarks that reflect real usage?

It is important to measure the right things the right way under the
right circumstances if you want to uncover meaningful facts. In the
early days of RISC development the designers would explain about how
they ran benchmarks to find out what a cpu really spent its time doing
and then they designed the architecture to optimise for those
activities - no Test Left, Shift Mask and Dim the Lights instruction
for us! - so that all would be wonderful. It was rather slyly pointed
out that this could only result in a cpu that could run diagnostics
and Eratosthenes Sieve *really* fast and left everything else to
exception handling

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Futuristic: It will only run on a next generation supercomputer.

johnmci

Re: [squeak-dev] squeak profiling

In reply to this post by Igor Stasenko

On Apr 19, 2008, at 8:36 AM, Igor Stasenko wrote:

> There also some of the macro benchmarks, if you search a mail there
> was a discussion concerning them lately.

Smalltalk macroBenchmarks

A quick check showed the squeak image at
http://ftp.squeak.org/3.4/

contained the code, and it was deleted in later versions, not sure
when, but someone could look.

I note standardTime:
has to be changed

the
tieDown _ ByteArray new: spaceLeft - 1e7. "Leave exactly 10MB free"

should be removed, it's intent was to pre-allocate memory but this
might depending on the VM
allocate 512MB or 1024MB of virtual memory because of how the VMs work
now, and not
have the desired effect it had when VM's had fixed memory sizes eight
or so years back.

--
=
=
=
========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
=
=
=
========================================================================

johnmci

Re: [squeak-dev] squeak profiling

In reply to this post by Riccardo Lucchese

On Apr 19, 2008, at 1:28 AM, Riccardo Lucchese wrote:

> Hi,
>
> I'm working on profiling squeak/etoys for the olpc project.

Well I'll note since you are working with a specific hardware setup
it's worth considering looking very
carefully at the compiler optimization for the CPU in question. I'd
guess the compiler setting used
by the build process is fairly generic and does not take advantage of
specific platform optimizations,
thus producing safe and lousy code to make a 80386 happy...

oh say see
http://lists.laptop.org/pipermail/devel/2007-November/007430.html
http://wiki.laptop.org/go/Geode_optimization_effort
http://gcc.gnu.org/ml/gcc-patches/2006-08/msg00452.html

Also setting -DUSE_INLINE_MEMORY_ACCESSORS or NOT setting
in the compiler options to change the code usage for accessors in
interp.c might also have a different result.
since that changes now memory accessing is done.

The other area is variable accessing, you'll note the foo structure is
built and the variables are sorted by
usage, I did this to guess better at cache line optimization versus a
random distribution, or sorted alphabetically.
However I don't think anyone has done a careful study to see how best
the variables should be arranged. Mind
when I did this on the powerpc I was *unable* to see any difference.

Lastly
ArraysToGlobalStruct-JMM.1.cs
in the squeak mac source tree is used to modify VMMaker to put or not
put arrays in the global structure foo.
In the past depending on the compiler would produce good or horrible
code for accessing an array that
was part of a structure.

You would need to look at this change set, the current VMMaker code,
then see if using it makes a
difference in the assembler code generated.

Oh and -fgcse

Fiddling with that might produce different things depending on the
compiler....

--
=
=
=
========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
=
=
=
========================================================================

Andreas.Raab

[squeak-dev] Re: squeak profiling

In reply to this post by Levente Uzonyi-2

Levente Uzonyi wrote:
>> It seems to me that bytecodes cases in the interpret function
>> (in /platform/unix/src/vm/interp.c) are not ordered in respect to
>> the probability of their execution.
>
> Since the switch-case statement in C is compiled into a jump table
> (http://en.wikipedia.org/wiki/Jump_table) the order of the branches
> shouldn't affect the speed of execution (you can even start with the
> default branch).

In theory. In practice, the effect of branch-mispredictions can be quite
noticable:

http://lists.squeakfoundation.org/pipermail/squeak-dev/2003-July/061996.html

Cheers,
- Andreas

Riccardo Lucchese

Re: [squeak-dev] squeak profiling

In reply to this post by johnmci

Hi!

> Open transcript then do it on:
>
> 10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr]
What I'm doing is:
- run startsqueak
- dragging a trasncript window from the tools bar to the 'desktop'
- write the line you suggested me
* wait for the magic, but nothing happens :)

Shame on me... mm how do kids actually sort it out ? :)

Thanks you for your help. I'd just like to see some numbers and than shutup if
I'm all wrong with my initial assumptions.

Riccardo

Levente Uzonyi-2

Re: [squeak-dev] squeak profiling

Hi!

Try pressing alt + d (if you have an alt key) or open the menu and
select "do it"

Levente

Quoting Riccardo Lucchese <[hidden email]>:

> Hi!
>
>> Open transcript then do it on:
>>
>> 10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr]
> What I'm doing is:
> - run startsqueak
> - dragging a trasncript window from the tools bar to the 'desktop'
> - write the line you suggested me
> * wait for the magic, but nothing happens :)
>
> Shame on me... mm how do kids actually sort it out ? :)
>
> Thanks you for your help. I'd just like to see some numbers and than
> shutup if
> I'm all wrong with my initial assumptions.
>
> Riccardo
>
>

Randal L. Schwartz

Re: [squeak-dev] squeak profiling

In reply to this post by Riccardo Lucchese

>>>>> "Riccardo" == Riccardo Lucchese <[hidden email]> writes:

Riccardo> Hi!
>> Open transcript then do it on:
>>
>> 10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr]
Riccardo> What I'm doing is:
Riccardo> - run startsqueak
Riccardo> - dragging a trasncript window from the tools bar to the 'desktop'
Riccardo> - write the line you suggested me
Riccardo> * wait for the magic, but nothing happens :)

Riccardo> Shame on me... mm how do kids actually sort it out ? :)

Transcript is not Workspace.

Transcript is where you see things.

Workspace is where you type things, then use do-it or print-it or debug-it.

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<[hidden email]> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

Igor Stasenko

Re: [squeak-dev] squeak profiling

2008/4/20 Randal L. Schwartz <[hidden email]>:

> >>>>> "Riccardo" == Riccardo Lucchese <[hidden email]> writes:
>
> Riccardo> Hi!
>
> >> Open transcript then do it on:
> >>
> >> 10 timesRepeat: [Transcript show: 1 tinyBenchmarks printString;cr]
> Riccardo> What I'm doing is:
> Riccardo> - run startsqueak
> Riccardo> - dragging a trasncript window from the tools bar to the 'desktop'
> Riccardo> - write the line you suggested me
> Riccardo> * wait for the magic, but nothing happens :)
>
> Riccardo> Shame on me... mm how do kids actually sort it out ? :)
>
> Transcript is not Workspace.
>
> Transcript is where you see things.
>
> Workspace is where you type things, then use do-it or print-it or debug-it.
>

Actually there are little difference between transcript and workspace
or any other place where you can enter text and press 'doit' :)
Inspector and debugger related to context, so you can send messages to
inst vars or use method's temps or arguments in expression.
But it makes little difference for expressions which is context free ,
like '0 tinyBenchmarks'

> --
> Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
> <[hidden email]> <URL:http://www.stonehenge.com/merlyn/>
> Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
> See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
>
>

--
Best regards,
Igor Stasenko AKA sig.