Strongtalk and Exupery

Strongtalk and Exupery

Bryce Kampjes

Hi David,
It's great to see that the Strongtalk VM has been open
sourced. Hopefully, it will be an asset to the community.

Does the Strongtalk mailing list have a publicly available archive? One
that doesn't require a yahoo sign-on? It would make it much easier for
interested outsiders to follow what's going on.

How does Strongtalk compare with current Java Hotspot VMs? They are
also available with source for study (though not open sourced).

I'm the primary author of Exupery, another attempt at fast execution
technology for Smalltalk. Exupery is written in Smalltalk. The
original design was to combine Self's dynamic inlining with a strong
optimising compiler. For that goal, I don't think we can afford to
write in anything less productive than Smalltalk. That is still the
goal, but it's a long way off; Exupery is currently moving towards a
1.0 without full method inlining and without a strong optimiser. All
the needed high-risk features are there.

Compile time is not the key issue for a dynamic compiler, pauses
are. Compile time only becomes critical if you are stopping execution
to compile. Exupery doesn't. Because Exupery is normal Smalltalk like
everything else, pausing execution to compile is tricky. The trade-offs to allow
Exupery to be easily written in Smalltalk are the same as those
required to allow long compile times for high grade optimisations.

If you or other Strongtalkers are interested in talking about
compiler design, please feel free to join Exupery's mailing list.
Don't worry if you don't have time to study the source or play with
it. Sharing experience would be valuable. Exupery is now about 4
years old; revisiting the design decisions with knowledgeable people
would be useful, especially in an archived list. Exupery is another
chance to keep the ideas and vision alive, if not the C++.

The Exupery mailing list is here:

  http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/exupery

Exupery's tiny benchmarks are:

  1,176,335,439 bytecodes/sec; 16,838,438 sends/sec

and with the interpreter:

    228,367,528 bytecodes/sec;  7,241,760 sends/sec

That makes Exupery currently much slower than Strongtalk for sends but
about the same speed for bytecodes, comparing against the numbers
provided by Gilad via Dan's post to squeak-dev. Such a comparison is
not really fair, as relative performance varies greatly with
architecture. Exupery is best on P4s, OK on Athlons, and least
impressive on Pentium-Ms.
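
For reference, the two sets of figures above work out to roughly a 5.2x speedup over the interpreter for bytecodes and 2.3x for sends. The sketch below is only arithmetic on the numbers quoted in this post; the variable names are illustrative.

```python
# tinyBenchmarks figures quoted above (bytecodes/sec, sends/sec).
exupery = {"bytecodes": 1_176_335_439, "sends": 16_838_438}
interpreter = {"bytecodes": 228_367_528, "sends": 7_241_760}

# Speedup of Exupery-compiled code over the interpreter, per metric.
speedup = {metric: exupery[metric] / interpreter[metric] for metric in exupery}

for metric, factor in speedup.items():
    print(f"{metric}: {factor:.2f}x")
```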

The bytecode performance is the most interesting to me. Exupery does
not yet do dynamic method inlining, which explains Strongtalk's lead on
send performance. Message inlining is not necessary for a 1.0. That
the bytecode numbers are so close is interesting, given that I know
Exupery's weaknesses. Exupery uses a colouring coalescing register allocator
but also lives with Squeak's object memory and could do with a bit
more tuning. Based on reading the Self papers, I'm guessing Strongtalk's
object memory is much cleaner and better designed for speed. Did
the Strongtalk team stop tuning for bytecode performance after they
passed VisualWorks?

Exupery has also recently been ported to Win32 and Solaris 10 x86.
Both ports were done by other people. Pre-built VMs will be available
for both platforms in a few days.

Bryce


RE: Strongtalk and Exupery

David Griswold-2
Hi Bryce,

I applaud what you are trying to do, and it sounds very interesting.  If you
can make it work with the compiler written in Smalltalk, that would be great;
that is certainly the long-term goal for me too.  And you are more than
welcome to pick my brain about Strongtalk, if it would help you.  My only
goal here is to help speed up Smalltalk, however that happens.

Since you may be interested, I have responded in detail below:

> -----Original Message-----
> From: [hidden email]
> [mailto:[hidden email]]On Behalf Of Bryce
> Kampjes
> Sent: Wednesday, September 20, 2006 2:42 PM
> To: [hidden email];
> [hidden email]
> Subject: [Vm-dev] Strongtalk and Exupery
> [...]
> Does the Strongtalk mailing list have a publicly available archive? One
> that doesn't require a yahoo sign-on? It would make it much easier for
> interested outsiders to follow what's going on.

Sorry!  I didn't realize the archives weren't public.  They are now.  The
list is moving to Google in the next day anyhow.

> How does Strongtalk compare with current Java Hotspot VMs? They are
> also available with source for study (though not open sourced).

Certainly the Java Hotspot VMs are descendants of the Strongtalk VM, but
they have been basically rewritten, and are definitely not just tweaked
Smalltalk under the covers.  For one thing, the languages are different;
Java has untagged immediates of various sizes, and Java has guaranteed
implementation type information available, unlike Strongtalk.  Although
inlining choices are done differently, in some ways I actually like the way
Strongtalk does it more, but unfortunately I can't talk about the exact
differences.  But the Java VM is not what I would call a type-feedback VM
anymore, and Strongtalk is.

For another thing, the Java VMs are fully internally multi-threaded, which
is a lot of work (and a *huge* amount of testing) that hasn't been done for
Strongtalk.

Another issue is that the downside of having all the implementation type
information in Java is that it has to be validated before you can trust it,
so class loading becomes a gigantic nightmare.  Strongtalk doesn't have to
deal with any of that, since it doesn't assume anything about static
implementation types at all (other than for the hardcoded boolean messages).

Another issue is that the Java VMs have on-stack replacement so that
compiled methods are used immediately even for active contexts.  That isn't
there yet in Strongtalk.

And of course Smalltalk is a smaller, simpler, better language :-).

> I'm the primary author of Exupery, another attempt at fast execution
> technology for Smalltalk. Exupery is written in Smalltalk. The
> original design was to combine Self's dynamic inlining with a strong
> optimising compiler. For that the goal, I don't think we can afford to
> write in anything less productive than Smalltalk.  That is still the
> goal but it's a long way off, Exupery is currently moving towards a
> 1.0 without full method inlining and without a strong optimiser. All
> the needed high risk features are there.
> Compile time is not the key issue for a dynamic compiler, pauses
> are. Compile time only becomes critical if you are stopping execution
> to compile. Exupery doesn't. Being normal Smalltalk like everything
> else pausing execution to compile is tricky. The trade offs to allow
> Exupery to be easily written in Smalltalk are the same as those
> required to allow long compile times for high grade optimisations.

It was for a similar reason that I forked off the Java Server VM at Sun.
Good inlining and a good code generator are synergistic, so I wanted a
really good code generator.  But I got my #ss handed to me because of the
difficulty of making it work.

Part of the problem is that it is more important than you might think for
the compiler to be fast.  A compiler that does really good register
allocation is likely to be more than a factor of 2 slower than a fast JIT,
when you do inlining.  Here is the important point: once you do inlining,
the average size of the methods you compile becomes much larger, and
register allocation is highly non-linear.

Like you we moved to background compilation, which gets rid of pauses, but
the time it takes for the program to get up to speed is still significantly
affected by having a slower compiler.  The problem isn't just that the
optimized code becomes available later, it is also that the compiler is
chewing up CPU in the meantime, so until it is available you are running
much slower code *and* are also getting fewer time slices.  Now that
multiprocessors are really here on the desktop, though, this might become
less of an issue.

Another factor that interacts with the above issue is that if you don't
compile the method eagerly, you end up getting other spurious compiles later
because the unoptimized code is still running, setting off invocation
counters for called methods that are already scheduled to be inlined, etc.
So a background compiler ends up compiling more methods.  Theoretically this
is still happening a bit in Strongtalk because on-stack replacement isn't
there, which has a similar effect, but it certainly isn't noticeable.

But the constraints in our case were that it had to work well in *all*
situations, especially for short lived Java programs.  They can end before
the compiler ever finishes.  So that is why there are two Java HotSpot VMs.

For your case, the constraints aren't nearly so strict, since your audience
can select itself for applications where the startup speed doesn't matter,
and you probably won't be running things like tiny dynamically-loaded
applets.  So hopefully it won't be a problem for you.

> If you, or other Strongtalkers are interested in talking about
> compiler design please feel free to join Exupery's mailing list.
> Don't worry if you don't have time to study the source or play with
> it. Sharing experience would be valuable. Exupery is now about 4
> years old, revisiting the design decisions with knowledgeable people
> would be useful, especially in an archived list. Exupery is another
> chance to keep the ideas and vision alive, if not the C++.
> The Exupery mailing list is here:
>
>   http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/exupery
>
> Exupery's tiny benchmarks are:
>
>   1,176,335,439 bytecodes/sec; 16,838,438 sends/sec
>
> and with the interpreter:
>
>     228,367,528 bytecodes/sec;  7,241,760 sends/sec
>
> Which makes it currently much slower than Strongtalk for sends but the
> same speed for bytecodes. That's comparing against the numbers
> provided by Gilad via Dan's post to squeak-dev. Such a comparison is
> not fair as relative performance does vary greatly with
> architecture. Exupery is best on P4's, ok on Athlons, and least
> impressive on Pentium-Ms.
>
> The bytecode performance is the most interesting to me. Exupery does
> not yet do dynamic method inlining which explains Strongtalks strong
> send performance. Message inlining is not necessary for a 1.0. That
> the bytecode numbers are so close, and I know Exupery's weaknesses, is
> interesting. Exupery uses a colouring coalescing register allocator
> but also lives with Squeak's object memory and could do with a bit
> more tuning. I'm guessing Strongtalk's object memory is much cleaner
> and better designed for speed based on reading the Self papers. Did
> the Strongtalk team stop tuning for bytecode performance after they
> passed VisualWorks?

I'm not sure what those bytecode performance #s mean; I don't know how Gilad
did those measurements.  The bytecodes in Strongtalk are not one-to-one with
other Smalltalks.  It doesn't sound like it is an apples-to-apples comparison,
since you quote the ratio of bytecodes-to-sends under the Squeak interpreter
as 32 and Dan quoted 44; they should be the same.  We should do some proper
benchmarks.  There are lots of benchmarks in Strongtalk if you want to try
them.  Look for classes matching *Benchmark*.
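
The ratio of 32 mentioned above can be checked directly from the interpreter figures in Bryce's post. This is a small sketch using only those two quoted numbers; the figure of 44 comes from Dan's post, whose raw numbers are not reproduced in this thread, so it is not recomputed here.

```python
# Squeak interpreter figures from Bryce's post (bytecodes/sec, sends/sec).
interp_bytecodes_per_sec = 228_367_528
interp_sends_per_sec = 7_241_760

# Ratio of the two reported rates for the same interpreter.
ratio = interp_bytecodes_per_sec / interp_sends_per_sec
print(round(ratio))  # 32
```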

The notion of sends/second performance in Strongtalk does not make sense.
An inlined send takes 0 time, so depending on how the code is written, an
arbitrarily high send/sec number can apply.  For example, when you really
totally factor your Smalltalk code, always use instance variable access
methods, and use lots of non-pure blocks, you can get really massive
speedups in Strongtalk.  My Dictionary implementation is written that way,
and when I ported it to VisualWorks (a while ago), Strongtalk was 35 *times*
as fast, and the code uses only SmallIntegers, Associations, and Arrays.
Almost all the sends and blocks are optimized completely away.

So for me, Strongtalk isn't so much about absolute bytecode performance, as
it is about being able to write all the control structures and blocks and
sends that I want, and be confident that I pay basically no price for
factoring overhead.  It is a really cool feeling!

> Exupery has also recently been ported to Win 32 and Solaris 10 x86.
> Both ports were done by other people. Pre-built VMs will be available
> for both platforms in a few days.
>
> Bryce
>

That sounds great!  Hopefully there will be technology transfer both ways!
Cheers,
Dave




RE: Strongtalk and Exupery

David Griswold-2
In reply to this post by Bryce Kampjes
Hi Bryce,

I realized I didn't quite fully address a couple of issues:

> The bytecode performance is the most interesting to me. Exupery does
> not yet do dynamic method inlining which explains Strongtalks strong
> send performance. Message inlining is not necessary for a 1.0. That
> the bytecode numbers are so close, and I know Exupery's weaknesses, is
> interesting. Exupery uses a colouring coalescing register allocator
> but also lives with Squeak's object memory and could do with a bit
> more tuning. I'm guessing Strongtalk's object memory is much cleaner
> and better designed for speed based on reading the Self papers. Did
> the Strongtalk team stop tuning for bytecode performance after they
> passed VisualWorks?

We still have to figure out exactly what is meant by 'bytecode', and for
what benchmark, but I'll try to guess a definition for what you are
basically talking about: the performance of the generated code for primitive
operations, independent of the effect of sends and any inlining.

Although I haven't yet seen a benchmark I would trust, in that respect
Exupery probably has a better code generator.  The Strongtalk one is
virtually untuned, and does just a few basic optimizations.  You have to
realize that Strongtalk had only just gotten running: we got it fairly
stable, tuned it for a few benchmarks, and it was frozen at that point.  Robert
Griesemer, who wrote the code generator, was already working on a better one
to replace it.  That work was frozen mostly done, but it needs to be
finished and put in place (the new compiler was running, and I believe can
actually be turned on, but it was only just starting to work for anything
bigger than snippets).  So the interesting thing is that Strongtalk is getting its
performance in spite of a very simple compiler.  Even the new compiler
wouldn't be doing anything as fancy as you are.

If you want to take full advantage of a better code generator like yours, it
really helps to have inlining.  Sends are so much more frequent in Smalltalk
than in C++, that there isn't much to do between sends, on average.  So you
should really want something like type-feedback; it would magnify the
benefits of your nice optimizations.

- I want to qualify something I said:  I said "An inlined send takes 0
time".  That is often true, but not always.  The call itself obviously takes
0 time, but the class check can't always be removed.  But often it is, and
both the class check and the call can be eliminated (the class check only
has to be done once per receiver(s) per inlined nmethod).

-Dave

> -----Original Message-----
> From: [hidden email]
> [mailto:[hidden email]]On Behalf Of Bryce
> Kampjes
> Sent: Wednesday, September 20, 2006 2:42 PM
> To: [hidden email];
> [hidden email]




RE: Strongtalk and Exupery

Bryce Kampjes

Hi David,
The bytecode benchmark is a prime number sieve. It uses #at: and
#at:put:. The send benchmark is a simple recursive Fibonacci function.
Both are just measures of how quickly they execute, neither really
measures the actual bytecodes or sends performed. They are the old
tinyBenchmarks. I'd guess everyone ran the same code for these
benchmarks.
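
As a rough picture of what the two tinyBenchmarks exercise, here is a Python sketch of a sieve that only does indexed reads and writes, and of a naive recursive Fibonacci. It is written from the description above rather than copied from the Squeak source, so the function names, the default size, and the exact loop shape are assumptions.

```python
def sieve_benchmark(size=8190, iterations=1):
    """Bytecode-heavy benchmark: count primes with a sieve, using only
    indexed reads and writes (the analogue of #at: and #at:put:)."""
    count = 0
    for _ in range(iterations):
        count = 0
        flags = [True] * (size + 1)  # index 0 unused, to mimic 1-based arrays
        for i in range(1, size + 1):
            if flags[i]:
                prime = i + 1        # slot i stands for the number i + 1
                k = i + prime
                while k <= size:
                    flags[k] = False  # strike out multiples of prime
                    k += prime
                count += 1
    return count


def fib_benchmark(n):
    """Send-heavy benchmark: naive recursion. The extra +1 makes the
    result equal to the number of calls performed."""
    if n < 2:
        return 1
    return fib_benchmark(n - 1) + fib_benchmark(n - 2) + 1
```

Timing either function gives an elapsed-time figure; converting that to "bytecodes/sec" or "sends/sec" needs an assumed count of operations per run, which is why the reported rates are only indirect measures.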

I 100% agree that inlining is the right way to optimise common sends
and block execution. I'd just rather finish debugging Exupery and
get it fully working without inlining, then add inlining. Inlining
will add another case to think about when debugging. Debugging full
method inlining (1) will be much easier if the compiler is bug free
first.

My rough long term plan is:
   1.0: The minimum necessary to be useful.
   2.0: Inlining
   3.0: SSA optimisation

A strong reason for not doing inlining in 1.0 is it will reduce scope
creep. If inlining is not in 1.0 then finishing 1.0 is more important.


I'd also not be surprised if Strongtalk is faster than Exupery for
bytecode performance. I'm guessing that Strongtalk's integer
arithmetic and #at: performance are better. Squeak uses 1 for its
integer tag, so in general it takes 3 instructions to detag then retag
and 2 clocks latency (this can often be optimised to 1
instruction and 1 clock latency). I'm guessing Strongtalk uses 0 for
its integer tag.
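
The cost difference between the two tagging schemes can be sketched as follows. This is a simplified model, assuming a one-bit tag in the low bit and ignoring overflow checks; the function names and the shift width for the zero-tag scheme are illustrative assumptions, not taken from either VM.

```python
# Scheme A: low bit 1 marks a small integer (the tag value described for Squeak).
def tag1(v):
    return (v << 1) | 1

def untag1(t):
    return t >> 1

def add_tag1_general(a, b):
    # Detag both operands, add, then retag the result.
    return (((a >> 1) + (b >> 1)) << 1) | 1

def add_tag1_optimised(a, b):
    # (2x+1) + (2y+1) - 1 == 2(x+y) + 1, so a single adjustment suffices.
    return a + b - 1

# Scheme B: low bit 0 marks a small integer (assumed one-bit shift here).
def tag0(v):
    return v << 1

def untag0(t):
    return t >> 1

def add_tag0(a, b):
    # 2x + 2y == 2(x+y): tagged values add directly, no adjustment at all.
    return a + b
```

For example, `untag1(add_tag1_optimised(tag1(20), tag1(22)))` and `untag0(add_tag0(tag0(20), tag0(22)))` both give 42, but only the zero-tag version needs no extra instruction around the add.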

Squeak uses a remembered set for its write barrier, which requires
checking if the object is in the remembered set, and checking if the
object is in new-space before adding it. Strongtalk might be using a
card marking table just requiring a single store.
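
The two write barrier styles can be contrasted with a toy model. This is a sketch under stated assumptions: addresses are plain integers, new-space is everything at or above a boundary address, and the card size, class names, and `store` method are all hypothetical rather than taken from Squeak or Strongtalk.

```python
CARD_SHIFT = 9  # hypothetical card size of 512 bytes


class RememberedSetBarrier:
    """Records old objects that point into new-space, after two checks."""

    def __init__(self, new_space_start):
        self.new_space_start = new_space_start  # assumed: young objects live above this
        self.remembered = set()

    def store(self, heap, holder_addr, value_addr):
        heap[holder_addr] = value_addr
        holder_is_old = holder_addr < self.new_space_start
        value_is_young = value_addr >= self.new_space_start
        # Check 1: is the holder already remembered? Check 2: is the value young?
        if holder_is_old and holder_addr not in self.remembered and value_is_young:
            self.remembered.add(holder_addr)


class CardMarkingBarrier:
    """Marks the card containing the holder on every store, with no checks."""

    def __init__(self, heap_size):
        self.cards = bytearray((heap_size >> CARD_SHIFT) + 1)

    def store(self, heap, holder_addr, value_addr):
        heap[holder_addr] = value_addr
        self.cards[holder_addr >> CARD_SHIFT] = 1  # unconditional mark
```

The card-marking version does less work per store but leaves the collector to scan marked cards later, while the remembered set pays for its checks up front.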

Squeak stores the size of an object in one of two places. So to get
the size to range check you first need to figure out where it's
stored. I'm guessing that the size for an array is stored at a fixed
location in Strongtalk.
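
The extra work in the range check can be sketched like this. The layouts below are deliberately simplified and hypothetical (an optional extra size word versus a single fixed field); they illustrate the branch described above and do not reproduce either VM's actual header format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TwoPlaceHeader:
    """Hypothetical layout: small sizes live in the base header, large
    sizes in an optional extra header word."""
    inline_size: int                       # meaningful only without an extra word
    extra_size_word: Optional[int] = None


def size_two_places(header):
    # A test is needed before the size can even be loaded.
    if header.extra_size_word is not None:
        return header.extra_size_word
    return header.inline_size


@dataclass
class FixedHeader:
    """Hypothetical layout: the size is always in the same field."""
    size: int


def size_fixed(header):
    return header.size  # single load, no test


def checked_at(elements, size, index):
    """1-based range-checked read, in the spirit of #at:."""
    if not 1 <= index <= size:
        raise IndexError(index)
    return elements[index - 1]
```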

My assumptions about Strongtalk's object memory are based on reading
the papers from the Self project.

None of these things really matters to Squeak while it's running as an
interpreter because most of the time is spent recovering from branch
mispredicts or waiting for memory leaving plenty of time available to
hide the inefficiencies above.


One way to get around a slow compiler would be to save the code cache
beside the image. All relocation is done in Smalltalk, so doing this
shouldn't be too hard. But figuring out how to get around a slow compiler
can wait until after the compiler has become useful. There are several
answers including writing a faster register allocator (2) or being the
third compiler.

Bryce

(1) Exupery can already inline primitives. It uses primitive inlining
to optimise #at: and #at:put:. This is one reason why Exupery has
PICs. They are a way to get type information for primitive calls.
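
A polymorphic inline cache can be pictured as a small per-call-site table from receiver class to method. The sketch below is a generic illustration in Python, not Exupery's implementation; the class name, the entry limit, and the use of `getattr` as the "full lookup" are assumptions.

```python
class PolymorphicInlineCache:
    """Per-call-site cache mapping receiver classes to looked-up methods."""

    def __init__(self, selector, max_entries=4):
        self.selector = selector
        self.max_entries = max_entries
        self.entries = []  # (class, method) pairs seen at this call site

    def send(self, receiver, *args):
        cls = type(receiver)
        for cached_cls, method in self.entries:
            if cached_cls is cls:
                return method(receiver, *args)  # cache hit
        method = getattr(cls, self.selector)    # miss: do the full lookup
        if len(self.entries) < self.max_entries:
            self.entries.append((cls, method))
        return method(receiver, *args)

    def observed_classes(self):
        """The type information a compiler could read back from the cache."""
        return [cls for cls, _ in self.entries]
```

After `pic = PolymorphicInlineCache("__getitem__")` has handled a list and a tuple receiver, `pic.observed_classes()` reports both classes, which is the kind of information a compiler can use to decide which primitive to inline.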

(2) Having a coalescing register allocator makes unnecessary moves
free. This is helpful to hide working on a two operand machine from
the compiler front end. There may be some work to make Exupery perform
well without its register allocator.


RE: Strongtalk and Exupery

David Griswold-2
Hi Bryce,

> -----Original Message-----
> From: [hidden email]
>
> Hi David,
> The bytecode benchmark is a prime number sieve. It uses #at: and
> #at:put:. The send benchmark is a simple recursive Fibonacci function.
> Both are just measures of how quickly they execute, neither really
> measures the actual bytecodes or sends performed. They are the old
> tinyBenchmarks. I'd guess everyone ran the same code for these
> benchmarks.

That's fine, it's just that we need to actually run these benchmarks right,
with different architectures, clock speeds etc. I don't think we know the
relative performance yet.

> I 100% agree that inlining is the right way to optimise common sends
> and block execution. [...]

Ok, I was just trying to say that in Smalltalk, a mediocre compiler with
optimistic inlining is better than a great compiler without inlining.  As
long as you are headed in the direction of optimistic inlining, we are in
agreement.

I just want to re-emphasize the importance of "optimistic", which implies
the ability to deoptimize, not just the ability to inline.  Inlining the
common case non-optimistically (i.e. with an 'else' clause containing the
non-common case) is not nearly as good, since after those two cases merge
you can't assume anything, whereas with optimism the rest of the code can
assume the common case was taken, providing much more information for
optimization (e.g. if the common case returns a SmallInteger, that is known
in subsequent code, whereas without deoptimization, the subsequent code
can't assume anything about the return value, regardless of inlining).
Sorry if you already understood this, I couldn't tell from your post.
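
The difference can be sketched with a toy example that doubles a value twice. This is a simplified model, assuming a side-effect-free method so that "deoptimizing" can just re-run the unoptimized version from the start; a real VM has to rebuild interpreter state in the middle of a method. All names here are hypothetical.

```python
class Deoptimize(Exception):
    """Raised when an optimistic assumption fails."""


def generic_double(x):
    """Stand-in for a full dynamic send of a 'double' message."""
    return x + x


def interpreted(x):
    """Unoptimized code: two full sends."""
    return generic_double(generic_double(x))


def compiled_non_optimistic(x, stats):
    # First send inlined, with an else branch for the uncommon case.
    stats["checks"] += 1
    r = x + x if type(x) is int else generic_double(x)
    # The two paths have merged, so nothing is known about r: check again.
    stats["checks"] += 1
    return r + r if type(r) is int else generic_double(r)


def compiled_optimistic(x, stats):
    # One guard; on failure we leave the compiled code entirely.
    stats["checks"] += 1
    if type(x) is not int:
        raise Deoptimize
    r = x + x     # x is known to be an int here
    return r + r  # so r is known to be an int too: no second check


def run_optimistic(x, stats):
    try:
        return compiled_optimistic(x, stats)
    except Deoptimize:
        return interpreted(x)  # fall back to the unoptimized code
```

With an integer argument the optimistic version performs one class check where the non-optimistic version performs two, and the gap widens as more code follows the guard.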

The reason I am pointing this out is that the machinery for deoptimization
is the hard part.  That is really the big advantage of the Strongtalk VM:
it provides all that infrastructure.  I just want to make sure you are
taking that into consideration.

> I'd also not be surprised if Strongtalk is faster than Exupery for
> bytecode performance. I'm guessing that Strongtalk's integer
> arithmetic and #at: performance are better. Squeak uses 1 for it's
> integer tag so in general it takes 3 instructions to detag then retag
> and 2 clocks latency (this can be optimised often be optimised to 1
> instruction and 1 clock latency). I'm guessing Strongtalk uses 0 for
> it's integer tag.

Yes.

> Squeak uses a remembered set for it's write barrier which requires
> checking if the object is in the remembered set, and checking if the
> object is in new-space before adding it. Strongtalk might be using a
> card marking table just requiring a single store.

Yes, Strongtalk uses card marking; I think it is two instructions.  It is
Urs Hölzle's write barrier, so it is probably the same as in Self.

> Squeak stores the size of an object in one of two places. So to get
> the size to range check you first need to figure out where it's
> stored. I'm guessing that the size for an array is stored at a fixed
> location in Strongtalk.

Yes.

> My assumptions about Strongtalk's object memory are based on reading
> the papers from the Self project.
>
> None of these things really matters to Squeak while it's running as an
> interpreter because most of the time is spent recovering from branch
> mispredicts or waiting for memory leaving plenty of time available to
> hide the inefficiencies above.
>
>
> One way to get around a slow compiler would be to save the code cache
> beside the image. All relocation is done in Smalltalk, so doing this
> shouldn't be too hard. But figuring out how get around a slow compiler
> can wait until after the compiler has become useful. There are several
> answers including writing a faster register allocator (2) or being the
> third compiler.

Yes, I have always wanted to be able to save the code.  We only have the
inlining DB right now, which doesn't avoid the compilation overhead on each
run.

-Dave