Hi David,

It's great to see that the Strongtalk VM has been open sourced. Hopefully, it will be an asset to the community. Does the Strongtalk mailing list have a publicly available archive? One that doesn't require a Yahoo sign-on? It would make it much easier for interested outsiders to follow what's going on.

How does Strongtalk compare with current Java HotSpot VMs? They are also available with source for study (though not open sourced).

I'm the primary author of Exupery, another attempt at fast execution technology for Smalltalk. Exupery is written in Smalltalk. The original design was to combine Self's dynamic inlining with a strong optimising compiler. For that goal, I don't think we can afford to write in anything less productive than Smalltalk. That is still the goal, but it's a long way off; Exupery is currently moving towards a 1.0 without full method inlining and without a strong optimiser. All the needed high-risk features are there.

Compile time is not the key issue for a dynamic compiler; pauses are. Compile time only becomes critical if you are stopping execution to compile, and Exupery doesn't. Since Exupery is normal Smalltalk like everything else, pausing execution to compile would be tricky. The trade-offs that allow Exupery to be easily written in Smalltalk are the same as those required to allow long compile times for high-grade optimisations.

If you, or other Strongtalkers, are interested in talking about compiler design please feel free to join Exupery's mailing list. Don't worry if you don't have time to study the source or play with it; sharing experience would be valuable. Exupery is now about 4 years old, and revisiting the design decisions with knowledgeable people would be useful, especially in an archived list. Exupery is another chance to keep the ideas and vision alive, if not the C++.
The Exupery mailing list is here:

    http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/exupery

Exupery's tiny benchmarks are:

    1,176,335,439 bytecodes/sec; 16,838,438 sends/sec

and with the interpreter:

    228,367,528 bytecodes/sec; 7,241,760 sends/sec

which makes it currently much slower than Strongtalk for sends but the same speed for bytecodes. That's comparing against the numbers provided by Gilad via Dan's post to squeak-dev. Such a comparison is not fair, as relative performance does vary greatly with architecture: Exupery is best on P4s, OK on Athlons, and least impressive on Pentium Ms.

The bytecode performance is the most interesting to me. Exupery does not yet do dynamic method inlining, which explains Strongtalk's strong send performance. Message inlining is not necessary for a 1.0. That the bytecode numbers are so close, given that I know Exupery's weaknesses, is interesting. Exupery uses a colouring coalescing register allocator, but it also lives with Squeak's object memory and could do with a bit more tuning. I'm guessing Strongtalk's object memory is much cleaner and better designed for speed, based on reading the Self papers. Did the Strongtalk team stop tuning for bytecode performance after they passed VisualWorks?

Exupery has also recently been ported to Win32 and Solaris 10 x86. Both ports were done by other people. Pre-built VMs will be available for both platforms in a few days.

Bryce
Hi Bryce,
I applaud what you are trying to do, and it sounds very interesting. If you can make it work with the compiler written in Smalltalk that would be great; that is certainly the long-term goal for me too. And you are more than welcome to pick my brain about Strongtalk, if it would help you. My only goal here is to help speed up Smalltalk, however that happens. Since you may be interested, I have responded in detail below:

> -----Original Message-----
> From: [hidden email]
> [mailto:[hidden email]]On Behalf Of Bryce Kampjes
> Sent: Wednesday, September 20, 2006 2:42 PM
> To: [hidden email]; [hidden email]
> Subject: [Vm-dev] Strongtalk and Exupery
> [...]
> Does the Strongtalk mailing list have a publicly available archive? One
> that doesn't require a yahoo sign-on? It would make it much easier for
> interested outsiders to follow what's going on.

Sorry! I didn't realize the archives weren't public. They are now. The list is moving to Google in the next day anyhow.

> How does Strongtalk compare with current Java Hotspot VMs? They are
> also available with source for study (though not open sourced).

Certainly the Java HotSpot VMs are descendants of the Strongtalk VM, but they have been basically rewritten, and are definitely not just tweaked Smalltalk under the covers. For one thing, the languages are different; Java has untagged immediates of various sizes, and Java has guaranteed implementation type information available, unlike Strongtalk. Inlining choices are also done differently; in some ways I actually like the way Strongtalk does it more, but unfortunately I can't talk about the exact differences. The Java VM is not what I would call a type-feedback VM anymore, and Strongtalk is. For another thing, the Java VMs are fully internally multi-threaded, which is a lot of work (and a *huge* amount of testing) that hasn't been done for Strongtalk.
Another issue is that the downside of having all the implementation type information in Java is that it has to be validated before you can trust it, so class loading becomes a gigantic nightmare. Strongtalk doesn't have to deal with any of that, since it doesn't assume anything about static implementation types at all (other than for the hardcoded boolean messages). Another difference is that the Java VMs have on-stack replacement, so that compiled methods are used immediately even for active contexts. That isn't there yet in Strongtalk. And of course Smalltalk is a smaller, simpler, better language :-).

> I'm the primary author of Exupery, another attempt at fast execution
> technology for Smalltalk. Exupery is written in Smalltalk. The
> original design was to combine Self's dynamic inlining with a strong
> optimising compiler. For that goal, I don't think we can afford to
> write in anything less productive than Smalltalk. That is still the
> goal but it's a long way off, Exupery is currently moving towards a
> 1.0 without full method inlining and without a strong optimiser. All
> the needed high risk features are there.
>
> Compile time is not the key issue for a dynamic compiler, pauses
> are. Compile time only becomes critical if you are stopping execution
> to compile. Exupery doesn't. Being normal Smalltalk like everything
> else, pausing execution to compile is tricky. The trade offs to allow
> Exupery to be easily written in Smalltalk are the same as those
> required to allow long compile times for high grade optimisations.

It was for a similar reason that I forked off the Java Server VM at Sun. Good inlining and a good code generator are synergistic, so I wanted a really good code generator. But I got my #ss handed to me because of the difficulty of making it work. Part of the problem is that it is more important than you might think for the compiler to be fast.
A compiler that does really good register allocation is likely to be more than a factor of 2 slower than a fast JIT, when you do inlining. Here is the important point: once you do inlining, the average size of the methods you compile becomes much larger, and register allocation is highly non-linear.

Like you, we moved to background compilation, which gets rid of pauses, but the time it takes for the program to get up to speed is still significantly affected by having a slower compiler. The problem isn't just that the optimized code becomes available later; it is also that the compiler is chewing up CPU in the meantime, so until it is available you are running much slower code *and* are also getting fewer time slices. Now that multiprocessors are really here on the desktop, though, this might become less of an issue.

Another factor that interacts with the above issue is that if you don't compile the method eagerly, you end up getting other spurious compiles later, because the unoptimized code is still running, setting off invocation counters for called methods that are already scheduled to be inlined, etc. So a background compiler ends up compiling more methods. Theoretically this is still happening a bit in Strongtalk because on-stack replacement isn't there, which has a similar effect, but it certainly isn't noticeable.

But the constraints in our case were that it had to work well in *all* situations, especially for short-lived Java programs. They can end before the compiler ever finishes. So that is why there are two Java HotSpot VMs. For your case, the constraints aren't nearly so strict, since your audience can select itself for applications where the startup speed doesn't matter, and you probably won't be running things like tiny dynamically-loaded applets. So hopefully it won't be a problem for you.

> If you, or other Strongtalkers are interested in talking about
> compiler design please feel free to join Exupery's mailing list.
> Don't worry if you don't have time to study the source or play with
> it. Sharing experience would be valuable. Exupery is now about 4
> years old, revisiting the design decisions with knowledgeable people
> would be useful, especially in an archived list. Exupery is another
> chance to keep the ideas and vision alive, if not the C++.
>
> The Exupery mailing list is here:
>
> http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/exupery
>
> Exupery's tiny benchmarks are:
>
> 1,176,335,439 bytecodes/sec; 16,838,438 sends/sec
>
> and with the interpreter:
>
> 228,367,528 bytecodes/sec; 7,241,760 sends/sec
>
> Which makes it currently much slower than Strongtalk for sends but the
> same speed for bytecodes. That's comparing against the numbers
> provided by Gilad via Dan's post to squeak-dev. Such a comparison is
> not fair as relative performance does vary greatly with
> architecture. Exupery is best on P4's, ok on Athlons, and least
> impressive on Pentium-Ms.
>
> The bytecode performance is the most interesting to me. Exupery does
> not yet do dynamic method inlining which explains Strongtalk's strong
> send performance. Message inlining is not necessary for a 1.0. That
> the bytecode numbers are so close, and I know Exupery's weaknesses, is
> interesting. Exupery uses a colouring coalescing register allocator
> but also lives with Squeak's object memory and could do with a bit
> more tuning. I'm guessing Strongtalk's object memory is much cleaner
> and better designed for speed based on reading the Self papers. Did
> the Strongtalk team stop tuning for bytecode performance after they
> passed VisualWorks?

I'm not sure what those bytecode performance numbers mean; I don't know how Gilad did those measurements. The bytecodes in Strongtalk are not one-to-one with other Smalltalks. It doesn't sound like it is an apples-to-apples comparison, since you quote the ratio of bytecodes-to-sends under the Squeak interpreter as 32 and Dan quoted 44; they should be the same.
We should do some proper benchmarks. There are lots of benchmarks in Strongtalk if you want to try them; look for classes matching *Benchmark*.

The notion of sends/second performance in Strongtalk does not make sense. An inlined send takes 0 time, so depending on how the code is written, an arbitrarily high sends/sec number can apply. For example, when you really totally factor your Smalltalk code, always use instance variable access methods, and use lots of non-pure blocks, you can get really massive speedups in Strongtalk. My Dictionary implementation is written that way, and when I ported it to VisualWorks (a while ago), Strongtalk was 35 *times* as fast, and the code uses only SmallIntegers, Associations, and Arrays. Almost all the sends and blocks are optimized completely away.

So for me, Strongtalk isn't so much about absolute bytecode performance as it is about being able to write all the control structures and blocks and sends that I want, and be confident that I pay basically no price for factoring overhead. It is a really cool feeling!

> Exupery has also recently been ported to Win 32 and Solaris 10 x86.
> Both ports were done by other people. Pre-built VMs will be available
> for both platforms in a few days.
>
> Bryce

That sounds great! Hopefully there will be technology transfer both ways!

Cheers,
Dave
Hi Bryce,
I realized I didn't quite fully address a couple of issues:

> The bytecode performance is the most interesting to me. Exupery does
> not yet do dynamic method inlining which explains Strongtalk's strong
> send performance. Message inlining is not necessary for a 1.0. That
> the bytecode numbers are so close, and I know Exupery's weaknesses, is
> interesting. Exupery uses a colouring coalescing register allocator
> but also lives with Squeak's object memory and could do with a bit
> more tuning. I'm guessing Strongtalk's object memory is much cleaner
> and better designed for speed based on reading the Self papers. Did
> the Strongtalk team stop tuning for bytecode performance after they
> passed VisualWorks?

We still have to figure out exactly what is meant by 'bytecode', and for what benchmark, but I'll try to guess a definition for what you are basically talking about: the performance of the generated code for primitive operations, independent of the effect of sends and any inlining. Although I haven't yet seen a benchmark I would trust, in that respect Exupery probably has a better code generator. The Strongtalk one is virtually untuned, and does just a few basic optimizations.

You have to realize that Strongtalk had only just been gotten running: we had just made it fairly stable and tuned it for a few benchmarks, and it was frozen at that point. Robert Griesemer, who wrote the code generator, was already working on a better one to replace it, and that work was frozen mostly done, but it needs to be finished and put in place (the new compiler was running, and I believe it can actually be turned on, but it was just starting to work for anything bigger than snippets). So the interesting thing is that Strongtalk is getting its performance in spite of a very simple compiler. Even the new compiler wouldn't be doing anything as fancy as you are.

If you want to take full advantage of a better code generator like yours, it really helps to have inlining.
Sends are so much more frequent in Smalltalk than in C++ that there isn't much to do between sends, on average. So you should really want something like type-feedback; it would magnify the benefits of your nice optimizations.

I also want to qualify something I said: I said "An inlined send takes 0 time". That is often true, but not always. The call itself obviously takes 0 time, but the class check can't always be removed. Often it can be, though, and then both the class check and the call are eliminated (the class check only has to be done once per receiver(s) per inlined nmethod).

-Dave
Hi David,

The bytecode benchmark is a prime number sieve; it uses #at: and #at:put:. The send benchmark is a simple recursive Fibonacci function. Both are just measures of how quickly they execute; neither really measures the actual bytecodes or sends performed. They are the old tinyBenchmarks. I'd guess everyone ran the same code for these benchmarks.

I 100% agree that inlining is the right way to optimise common sends and block execution. I'd just rather finish debugging Exupery and get it fully working without inlining, then add inlining. Inlining will add another case to think about when debugging; debugging full method inlining (1) will be much easier if the compiler is bug free first. My rough long-term plan is:

    1.0: The minimum necessary to be useful.
    2.0: Inlining
    3.0: SSA optimisation

A strong reason for not doing inlining in 1.0 is that it will reduce scope creep. If inlining is not in 1.0, then finishing 1.0 is more important.

I'd also not be surprised if Strongtalk is faster than Exupery for bytecode performance. I'm guessing that Strongtalk's integer arithmetic and #at: performance are better. Squeak uses 1 for its integer tag, so in general it takes 3 instructions to detag then retag, with 2 clocks of latency (this can often be optimised to 1 instruction and 1 clock of latency). I'm guessing Strongtalk uses 0 for its integer tag. Squeak uses a remembered set for its write barrier, which requires checking if the object is in the remembered set, and checking if the object is in new-space, before adding it. Strongtalk might be using a card-marking table, requiring just a single store. Squeak stores the size of an object in one of two places, so to get the size to range check you first need to figure out where it's stored. I'm guessing that the size of an array is stored at a fixed location in Strongtalk. My assumptions about Strongtalk's object memory are based on reading the papers from the Self project.
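The integer-tagging arithmetic Bryce describes can be sketched as follows. The encodings are assumptions taken from the discussion, not actual VM code: with Squeak's tag of 1, a SmallInteger v is stored as 2*v + 1, so naive addition detags both operands and retags the result (which a compiler can often fold into a single add-and-adjust); with a tag of 0, as guessed for Strongtalk, v is stored as 2*v and tagged addition is a single machine add with no fixup.

```c
#include <stdint.h>

typedef intptr_t oop;   /* a tagged machine word, as VMs use */

/* Tag = 1, Squeak-style: tagged = 2*v + 1. */
static oop tag1(intptr_t v)          { return (v << 1) | 1; }
static intptr_t untag1(oop o)        { return o >> 1; }
/* Naive form: detag, add, retag -- three operations. */
static oop addTag1(oop a, oop b)     { return tag1(untag1(a) + untag1(b)); }
/* Folded form: (2a+1) + (2b+1) - 1 = 2(a+b)+1 -- one adjusted add. */
static oop addTag1Fast(oop a, oop b) { return a + b - 1; }

/* Tag = 0, Strongtalk-style per the discussion: tagged = 2*v. */
static oop tag0(intptr_t v)          { return v << 1; }
/* The tags cancel: a plain add preserves the tag, no fixup at all. */
static oop addTag0(oop a, oop b)     { return a + b; }
```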
None of these things really matters to Squeak while it's running as an interpreter, because most of the time is spent recovering from branch mispredicts or waiting for memory, leaving plenty of time to hide the inefficiencies above.

One way to get around a slow compiler would be to save the code cache beside the image. All relocation is done in Smalltalk, so doing this shouldn't be too hard. But figuring out how to get around a slow compiler can wait until after the compiler has become useful. There are several answers, including writing a faster register allocator (2) or being the third compiler.

Bryce

(1) Exupery can already inline primitives. It uses primitive inlining to optimise #at: and #at:put:. This is one reason why Exupery has PICs: they are a way to get type information for primitive calls.

(2) Having a coalescing register allocator makes unnecessary moves free. This is helpful for hiding the two-operand machine from the compiler front end. There may be some work needed to make Exupery perform well without its register allocator.
Hi Bryce,
> -----Original Message-----
> From: [hidden email]
>
> Hi David,
>
> The bytecode benchmark is a prime number sieve. It uses #at: and
> #at:put:. The send benchmark is a simple recursive Fibonacci function.
> Both are just measures of how quickly they execute, neither really
> measures the actual bytecodes or sends performed. They are the old
> tinyBenchmarks. I'd guess everyone ran the same code for these
> benchmarks.

That's fine; it's just that we need to actually run these benchmarks properly, with different architectures, clock speeds, etc. I don't think we know the relative performance yet.

> I 100% agree that inlining is the right way to optimise common sends
> and block execution. [...]

OK, I was just trying to say that in Smalltalk, a mediocre compiler with optimistic inlining is better than a great compiler without inlining. As long as you are headed in the direction of optimistic inlining, we are in agreement. I just want to re-emphasize the importance of "optimistic", which implies the ability to deoptimize, not just the ability to inline. Inlining the common case non-optimistically (i.e. with an 'else' clause containing the non-common case) is not nearly as good, since after those two cases merge you can't assume anything, whereas with optimism the rest of the code can assume the common case was taken, providing much more information for optimization (e.g. if the common case returns a SmallInteger, that is known in subsequent code, whereas without deoptimization the subsequent code can't assume anything about the return value, regardless of inlining). Sorry if you already understood this; I couldn't tell from your post.

The reason I am pointing this out is that the machinery for deoptimization is the hard part. That is really the big advantage of the Strongtalk VM: it provides all that infrastructure. I just want to make sure you are taking that into consideration.
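The contrast Dave draws between the two inlining styles can be sketched in C. This is a hedged illustration with invented names, not Strongtalk code: the non-optimistic version keeps an else branch, so after the merge point the compiler knows nothing about the result; the optimistic version guards with a deoptimization trap, so everything after the guard may assume the common case (here, that both operands and the result are SmallIntegers).

```c
enum Tag { SMALL_INT, OTHER };
typedef struct { enum Tag tag; long val; } Value;

/* Stub for the full message send on the uncommon path. */
static Value genericPlus(Value a, Value b) {
    Value r = { OTHER, a.val + b.val };
    return r;
}
/* Stub for rebuilding interpreter state from the optimized frame. */
static void deoptimize(void) {}

/* Non-optimistic: inline the common case but keep the slow path.
 * After the two branches merge, r could be anything, so every later
 * use of r must re-check its tag. */
static Value plusMerged(Value a, Value b) {
    if (a.tag == SMALL_INT && b.tag == SMALL_INT) {
        Value r = { SMALL_INT, a.val + b.val };
        return r;
    }
    return genericPlus(a, b);   /* merge point erases type information */
}

/* Optimistic: trap out on a miss instead of merging. Past the guard the
 * result is statically a SmallInteger, so downstream code compiles with
 * no further checks. */
static long plusOptimistic(Value a, Value b) {
    if (a.tag != SMALL_INT || b.tag != SMALL_INT) deoptimize();
    return a.val + b.val;       /* known SmallInteger from here on */
}
```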
> I'd also not be surprised if Strongtalk is faster than Exupery for
> bytecode performance. I'm guessing that Strongtalk's integer
> arithmetic and #at: performance are better. Squeak uses 1 for its
> integer tag so in general it takes 3 instructions to detag then retag
> and 2 clocks latency (this can often be optimised to 1 instruction
> and 1 clock latency). I'm guessing Strongtalk uses 0 for its integer
> tag.

Yes.

> Squeak uses a remembered set for its write barrier which requires
> checking if the object is in the remembered set, and checking if the
> object is in new-space before adding it. Strongtalk might be using a
> card marking table just requiring a single store.

Yes, Strongtalk uses card marking; I think it is two instructions. It is Urs Hölzle's write barrier, so it is probably the same as in Self.

> Squeak stores the size of an object in one of two places. So to get
> the size to range check you first need to figure out where it's
> stored. I'm guessing that the size for an array is stored at a fixed
> location in Strongtalk.

Yes.

> My assumptions about Strongtalk's object memory are based on reading
> the papers from the Self project.
>
> None of these things really matters to Squeak while it's running as an
> interpreter because most of the time is spent recovering from branch
> mispredicts or waiting for memory, leaving plenty of time available to
> hide the inefficiencies above.
>
> One way to get around a slow compiler would be to save the code cache
> beside the image. All relocation is done in Smalltalk, so doing this
> shouldn't be too hard. But figuring out how to get around a slow
> compiler can wait until after the compiler has become useful. There
> are several answers including writing a faster register allocator (2)
> or being the third compiler.

Yes, I have always wanted to be able to save the code. We only have the inlining DB right now, which doesn't avoid the compilation overhead on each run.

-Dave
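The two write-barrier styles compared in this exchange can be sketched as follows. The layouts and card size are assumptions for illustration, not actual Squeak or Strongtalk code: the card-marking barrier is an unconditional byte store into a table indexed by the slot address (roughly a shift plus a store, matching "two instructions"), while the remembered-set barrier must branch on several conditions before recording anything.

```c
#include <stdint.h>

#define CARD_SHIFT 9                       /* 512-byte cards, an assumed size */
static uint8_t cardTable[4096];

/* Card marking: mark the card covering the updated slot, unconditionally. */
static void cardMarkBarrier(void *slotAddr) {
    cardTable[((uintptr_t)slotAddr >> CARD_SHIFT) % sizeof cardTable] = 1;
}

typedef struct Obj { struct Obj *field; int isYoung; int inRememberedSet; } Obj;
static Obj *rememberedSet[128];
static int rsCount = 0;

/* Remembered set, Squeak-style per the discussion: record an old object
 * storing a pointer to new-space, but only if it isn't recorded already. */
static void rememberedSetBarrier(Obj *holder, Obj *newValue) {
    holder->field = newValue;
    if (!holder->isYoung && newValue->isYoung && !holder->inRememberedSet) {
        holder->inRememberedSet = 1;
        rememberedSet[rsCount++] = holder;
    }
}
```

The difference the thread is pointing at is visible in the shapes: the card-marking path has no branches at all, while the remembered-set path has three tests on its fast path.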