Folks -
I just read Eliot's most recent blog post about the Cog VM and it reminded me how difficult it must be for others to see where this project stands. So here is a bit of an update on the current status:

As you may recall, we started this project in spring this year by hiring Eliot for the express purpose of building us a new VM that would speed up execution of our products. We decided to structure the work into stages, at the end of each of which there would be a tangible deliverable, i.e., a new VM that could be run and benchmarked.

The first stage in this process is what we call the "Closure VM". It is nothing more (and nothing less) than a Squeak VM with closures and the required support (compiler, decompiler, debugger, etc.). Given past experience, we had originally expected this stage to cost us some speed (we estimated up to 20%), since closure support has a cost which at that stage is hard to offset with other improvements. However, thanks to a truly ingenious bit of engineering by Eliot in the design of the closure implementation, the resulting speed difference was negligible. Since there was no speed penalty, we decided to jump ship earlier than we originally anticipated, and the Closure VM has been the regular shipping VM with Qwaq products since September this year.

The second stage in the process is the "Stack VM". It is a Closure VM that executes on the native stack and transparently maps contexts from and to stack frames as required. The VM itself is still an interpreter, so any speed improvements come purely from the more efficient organization of the stack layout (no allocations, overlapping frames, etc.). For those of you who have been around long enough, it is equivalent to what Anthony Hannan did a few years ago, except that it hides the existence of the native stack entirely and gives the programmer the naive view of just dealing with linked frames (contexts). Eliot's original expectations for the resulting speedups were a little higher than we've seen in practice, but they are in line with the results that Anthony got: approx. 30% improvement across the board in macro benchmarks. The work on the Stack VM was completed last month; we are currently rolling it out internally, and the next product release will ship with the Stack VM.

The third stage, which has just begun, is what we call the "Simple JIT VM" (well, really it doesn't have a name yet, I just made it up ;-) Its focus is send performance, as we see send performance as the single biggest current bottleneck. It will sport a very simple JIT with inline caches, the idea being to bring send performance up to the point where it's no longer the single biggest bottleneck, then measure performance again and figure out what the next best target is. I am not going to speculate on performance (we have been wrong every single step of the way ;-) but both Eliot and I do think that we'll see some nice improvements in application performance here.

The fourth stage is a bit more speculative at this point, because the concrete direction depends on what the results of stage 3 show the new bottleneck to be. We have various candidates lined up: very high on the list is a delayed code generator, which can dramatically improve code quality. Next to it are changes in the object format, moving to a unified 32/64-bit header model, which would dramatically simplify some tests for inline caching, primitives, etc. However, since this work is driven by product performance, it is possible (albeit unlikely at this point) that the focus might shift towards FFI speed or float inlining. There is no shortage of possible directions; the main issue will be to figure out what the bottlenecks at that point are and how to address them most efficiently.

Stage four won't be the end of it, but from where we are, this is how far we've planned at this point. And if you want to know all the gory details about the stuff Eliot's working on, please do check out his blog at:

http://cogblog.mirandabanda.org/

Cheers,
  - Andreas
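[An aside for readers new to the technique named in stage 3: a monomorphic inline cache lets each send site remember the last receiver class it dispatched on and the method that lookup produced, so the full method lookup only runs again on a class mismatch. Below is a minimal Smalltalk sketch of the idea; the names (sendSite, cachedClass, cachedMethod) are made up for illustration, and Cog itself patches the equivalent check directly into the generated machine code rather than interpreting it like this.]

    dispatch: selector to: receiver through: sendSite
        "Sketch of a monomorphic inline cache: each send site remembers
         one (class -> method) pair."
        sendSite cachedClass == receiver class ifFalse:
            ["cache miss: do the full lookup once and remember the result"
             sendSite
                cachedClass: receiver class;
                cachedMethod: (receiver class lookupSelector: selector)].
        ^sendSite cachedMethod   "cache hit: no lookup at all"

[Since most send sites in practice see only one receiver class, the hit path above runs almost every time, which is why send performance improves so much.]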
Hi Andreas,
thanks for sharing. This was really informative and interesting. From the reading it sounds damn impressive.

Norbert

On Tue, 2008-12-16 at 19:30 -0800, Andreas Raab wrote:
> Folks -
>
> I just read Eliot's most recent blog post about the Cog VM and it
> reminded me how difficult it must be for others to see where this
> project stands. So here is a bit of an update on the current status:
<snip>
In reply to this post by Andreas.Raab
Hi Andreas, Eliot,
Thank you very much for this effort, for funding it, and for making it available to the community!

> ... it is possible (albeit unlikely at this point) that the focus
> might shift towards FFI speed or float inlining....

Can you tell a bit more about "float inlining"? I guess you're talking about immediate (unboxed) floats, right? That could mean no longer needing to do plugins for numerical stuff. I would love to have that!

Cheers,
Juan Vuletich
In reply to this post by Andreas.Raab
On Tue, Dec 16, 2008 at 10:30 PM, Andreas Raab <[hidden email]> wrote:
> Folks -
<snip>

This sounds great, and many thanks to Qwaq for funding this development... it should be a great contribution to the Squeak community.

- Stephen
In reply to this post by NorbertHartl
On Tue, 2008-12-16 at 19:30 -0800, Andreas Raab wrote:
> Folks -
>
> I just read Eliot's most recent blog post about the Cog VM and it
> reminded me how difficult it must be for others to see where this
> project stands. So here is a bit of an update on the current status:

I've been following Eliot's posts with great interest; it's great that you are doing this, thanks a lot!

Many times I wanted to test some of the things Eliot talks about, and now is no exception: I'd really love to use Bochs from Squeak as Eliot is doing. What are the chances that we can get our hands on that? I feel like I'm asking too much, so go ahead and just say NO if that's the case.

thanks!
richie
Hi Gerardo,
I will make an effort to push out what I have to my web site before the end of the year, and before the end of January at the latest. Send me a nagging message once a week and you may find it'll arrive on the web site earlier than not.
Eliot Miranda wrote:
> Hi Gerardo,
>
> I will make an effort to push out what I have to my web site
> before the end of the year, and before the end of January at the
> latest. Send me a nagging message once a week and you may find it'll
> arrive on the web site earlier than not.

heh, I won't nag you... not weekly at least :)

thanks a lot!
richie
In reply to this post by Juan Vuletich-4
Hi Juan,
On Wed, Dec 17, 2008 at 5:31 AM, Juan Vuletich <[hidden email]> wrote:
> Hi Andreas, Eliot,
<snip>

I tried to post a reply yesterday but hit the 100k limit, and the list moderator refused to let my reply in anyway.

There are two things here. Yes, one is doing immediate floats in a 64-bit VM, which can produce floating-point that runs at half SmallInteger speed, perhaps three times faster than boxed floats.
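[To illustrate what "immediate" means here, a deliberately simplified sketch of one possible 64-bit tagging scheme follows; the tag value and encoding are invented for illustration and are not necessarily what Cog would use. The low three bits of every word carry a tag, and a Float whose sign and top exponent bits are all zero can be stored right in the tagged word instead of in a heap object:]

    encodeImmediateFloat: ieeeBits
        "ieeeBits is the raw 64-bit IEEE-754 pattern as an unsigned Integer.
         Answer the tagged word, or nil if the value must stay boxed.
         Hypothetical tag: 2r111 marks an immediate Float."
        (ieeeBits bitShift: -61) = 0 ifFalse: [^nil].  "top 3 bits must be free"
        ^(ieeeBits bitShift: 3) bitOr: 2r111

    decodeImmediateFloat: taggedWord
        "Recover the IEEE-754 bit pattern by dropping the 3-bit tag."
        ^taggedWord bitShift: -3

[In this naive version every negative float would stay boxed; a real scheme rearranges the bits so that the common cases qualify. The payoff is that arithmetic on immediate floats allocates nothing, which is where the speedup over boxed floats comes from.]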
But much more interesting is an adaptive optimization/speculative inlining scheme which aims to map floating-point operations down to the processor's floating-point registers. Here's an abstract from my AOStA design sketch (that I tried to post yesterday but was bounced) that describes how this might be done. The basic idea for an adaptive optimizer is to use two levels, one bytecode-to-bytecode and another bytecode-to-machine-code. The bytecode-to-bytecode level is written in Smalltalk and is clever (does type analysis, inlining, etc.). The bytecode-to-machine-code level is not clever, but is processor-specific.
The bytecode-to-bytecode optimizer targets bytecodes, a little like the special selectors, that define optimized operations from which the bytecode-to-machine-code compiler can generate fast code. Conceptually this fast bytecode runs in OptimizedContexts, but the virtual machine and bytecode-to-machine-code compiler arrange that it actually runs on a native stack in native machine registers, including the FPU. With that said, the following might make sense:
3.5 Initial Floating-Point Unboxing Scheme

While it should be a goal to unbox floats in pointer instances, this sketch ignores that possibility for now. Smalltalk imposes no restriction on the type of object stored in a pointer instance variable. Therefore any unboxing scheme needs to be per-instance, not just per-class (although one could imagine a scheme that used anonymous behaviors to distinguish instances of a class that contained unboxed data from instances of the same class that did not). At least in the HPS memory manager such flexibility poses a problem, and I would like to make immediate progress. So this unboxing scheme only handles unboxing within an OptimizedContext, being rather analogous to a floating-point co-processor unit.
An OptimizedContext has two stacks, one for normal objects and one for raw data. The raw stack is organized as a number of slots, each large enough to hold the largest floating-point format supported by the Smalltalk VM. The size of an OptimizedContext's raw stack is zero by default. If non-zero, its size is defined by information in the context's OptimizedMethod, e.g. either a field in the header, or some initial bytecode (analogous to pushCopiedValues at the start of a copying block) that specifies the number of slots. The stack can be implemented as a pair of instance variables in OptimizedContext that are normally nil, but otherwise contain a suitably large ByteArray and a raw stack pointer. Whenever an OptimizedMethod that specifies a non-empty raw stack is activated, the initial contents are undefined and the stack pointer is 0 (1-relative), i.e. there is no support for floating-point arguments. It is assumed that inlining will reduce the demand for floating-point parameter passing enough for it to be lived without.
A set of in-line primitives can access the raw stack as IEEE floating-point data, moving values between the raw stack and the pointer stack or object fields. The primitive set would be extended to support unboxed access to fields in pointer instances if and when required. On Smalltalks with different-sized floating-point classes (VisualWorks supports 32-bit Float and 64-bit Double) the primitive set may provide access to each float size. Here we sketch only a set for 64-bit Double floating-point values. If the set handles multiple sizes of data, each slot can still hold only one instance of a smaller value.
The set of raw stack primitives is stack-based because it is much easier to map a stack-based addressing scheme with a finite-sized stack onto a register set than it is to map a register-based scheme onto a stack, and the infamous x86 floating-point processor, which is stack-based, is likely to remain an important target for users of this system.
The raw stack could also be used to optimize integer arithmetic, supporting arithmetic on untagged 8, 16, 32 and 64-bit widths as in Java. Rather than waste time specifying this I'll leave open the possibility of adding a set of bytecodes to allow 64-bit arithmetic and conversion to and from tagged and boxed SmallIntegers and LargeIntegers on the normal stack.
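[As a concrete reading of the sketch above, the raw stack might look something like the following in Squeak Smalltalk. All names, the superclass choice, and the ByteArray doubleAt:/doubleAt:put: accessors are assumptions made for illustration; this is not the actual AOStA code.]

    MethodContext subclass: #OptimizedContext
        instanceVariableNames: 'rawStack rawSP'
        classVariableNames: ''
        category: 'AOStA-Sketch'

    activateRawStackOfSize: numSlots
        "Lazily set up the raw stack on activation: one 8-byte slot per
         entry, enough for a 64-bit IEEE Double. rawSP is 1-relative and
         starts empty (0)."
        rawStack := ByteArray new: numSlots * 8.
        rawSP := 0

    pushRawDouble: aFloat
        "Sketch of an in-line primitive moving a float onto the raw stack."
        rawSP := rawSP + 1.
        rawStack doubleAt: rawSP - 1 * 8 + 1 put: aFloat

    popRawDouble
        "Move the top raw value back to a (boxed) Float on the pointer stack."
        | value |
        value := rawStack doubleAt: rawSP - 1 * 8 + 1.
        rawSP := rawSP - 1.
        ^value

[The bytecode-to-machine-code compiler would then map the hottest of these slots onto floating-point registers rather than actually touching a ByteArray.]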
The paragraph about the x86 FPU is way out of date. So imagine that the bytecode-to-machine-code compiler maps some portion of the raw data stack onto the XMM registers. Either it or the bytecode-to-bytecode compiler can do a usage-frequency analysis to arrange that the most frequently used stack slots get mapped to the registers.
If this can work then yes, plugins could become a thing of the past. However, I doubt very much that Qwaq will fund me to do this. Right now, with a Squeak VM that is 10 to 20 times slower than VisualWorks' VM, Qwaq Forums spends roughly 2/3rds of its time executing Smalltalk. The bulk of the rest of the time is in OpenGL. The Cog JIT I'm working on now should be able to reach VisualWorks VM speeds, and hence the 66.6% should become no more than, say, 6% of entire execution time, with, say, 90% of the time in OpenGL.
A second-stage JIT doing adaptive optimization/speculative inlining could probably improve performance by another factor of three. But that would produce only a 4% improvement in Qwaq Forums performance. Yes, it might allow Qwaq to rewrite all their C plugin code in Smalltalk and get the same performance from Smalltalk code that would then be easier to maintain and enhance, etc. But where is the return on investment (ROI)?
The system would not be measurably faster for Qwaq Forums. The maintainability/extensibility benefits are intangible and hard to sell to investors. Hence I don't see Qwaq funding this, and if it were my call and my money I'd probably make the same decision. However, we haven't even begun to discuss this inside Qwaq, so you never know.
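[The back-of-the-envelope behind those numbers, for anyone who wants to check them: it is just Amdahl's law applied as a workspace snippet, using the rough figures from the post rather than measurements.]

    | smalltalkShare speedup newTotal |
    "After the Cog JIT, Smalltalk is ~6% of Qwaq Forums' run time; the
     rest (mostly OpenGL) is unaffected by a faster Smalltalk JIT."
    smalltalkShare := 0.06.
    "A second-stage adaptive optimizer makes the Smalltalk part ~3x faster."
    speedup := 3.
    newTotal := 1 - smalltalkShare + (smalltalkShare / speedup).
    newTotal   "=> 0.96, i.e. the whole application gets only ~4% faster"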
best
Eliot
Hi Eliot,
Eliot Miranda wrote:
> I tried to post a reply yesterday but hit the 100k limit and the list
> moderator refused to let my reply in anyway.

If you, Eliot, have a limit to what you can say in this list, something is really wrong!

> There are two things here. Yes, one is doing immediate floats in a
> 64-bit VM, which can produce floating-point that runs at half
> SmallInteger speed, perhaps three times faster than boxed floats.

Yes! That would already be great.

> But much more interesting is an adaptive optimization/speculative
> inlining scheme which aims to map floating-point operations down to
> the processor's floating-point registers.
> ... big snip ...

Thanks for all the detail.

> If this can work then yes, plugins could become a thing of the past.
> However, I doubt very much that Qwaq will fund me to do this.
> <snip>
> The system would not be measurably faster for Qwaq Forums. The
> maintainability/extensibility benefits are intangible and hard to
> sell to investors. Hence I don't see Qwaq funding this, and if it
> were my call and my money I'd probably make the same decision.
> However, we haven't even begun to discuss this inside Qwaq, so you
> never know.

I see. I can see a great use for such technology in Morphic 3, my project for redesigning Morphic. It makes heavy use of floating point for all the rendering. It currently uses a plugin, but that makes it harder to extend and enhance. Maybe some day it could become a reason to actually do what you describe, and a means to get funding for it.

Thanks for your detailed answer.

Cheers,
Juan Vuletich
In reply to this post by Andreas.Raab
On Tue, Dec 16, 2008 at 07:30:30PM -0800, Andreas Raab wrote:
> Folks -
>
> I just read Eliot's most recent blog post about the Cog VM and it
> reminded me how difficult it must be for others to see where this
> project stands. So here is a bit of an update on the current status:

Andreas,

Thanks very much for providing this overview. Very helpful and much appreciated.

Dave