> Anyway, we seem to agree that incremental improvements are precisely
> what gets you usable near-term pink-plane results, and I think that
> was the gist of Marcus' message, too.
>
> - Bert -

This is the "anticipation of value" effect Kent Beck mentions in his
books. Unit tests help you make incremental improvements and anticipate
value. For a company the value could be delivered features, hence
customer satisfaction and $$; for an open source community it would be
motivation for volunteers/users. It's good that we talk about our
volunteer economy: we are not a large number of people, so we must
embrace and understand efficiency in this regard.

Sebastian
In reply to this post by Bert Freudenberg
Bert Freudenberg wrote:
> What we will never know is if the first Jitter had been incrementally
> improved rather than being abandoned like all its successors, it may
> have surpassed the current interpreter performance by far. The
> downside is that it would inherently be much more complex - the
> interpreter strikes a nice balance here.

The first jitter had the advantage of being cross-platform, but as I
had predicted it only improved performance by around 50%, which is less
than what we got by making the interpreter better. The second jitter
(known as Jitter3) was actually finished as far as I can tell. There
was still some stuff to be done, but it was no worse in that regard
than Unicode or Traits support in current Squeak. There simply was not
enough interest for it to be adopted.

And since it was finished, you can download it (get Squeak 2.3) and run
some benchmarks (on Linux machines, at least). Here are the numbers on
this machine (3GHz Pentium 4):

  2.3 image and normal 2.3 VM -  62,500,000 bytecodes/sec;  4,591,325 sends/sec
  2.3 image and Jitter3 VM    - 100,000,000 bytecodes/sec; 10,494,459 sends/sec
  3.9 image and 3.7 VM        - 160,602,258 bytecodes/sec;  7,292,693 sends/sec

The Jitter3 numbers varied wildly, but even the more stable numbers for
the normal 2.3 VM are very suspect. The problem is that the old image
didn't expect such a fast machine and doesn't seem to loop enough
times. But the point is that the code is out there and we can actually
run it.

-- Jecel
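(For anyone who wants to reproduce these numbers: they are the output
of the tiny benchmarks that ship with the image. In a workspace, print
the result of

  0 tinyBenchmarks

which answers a string of the form '... bytecodes/sec; ... sends/sec',
exactly as quoted above; the method exists at least as far back as the
2.3 image used here.)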
In reply to this post by Bert Freudenberg
> 2.3 image and normal 2.3 VM -  62,500,000 bytecodes/sec;  4,591,325 sends/sec
> 2.3 image and Jitter3 VM    - 100,000,000 bytecodes/sec; 10,494,459 sends/sec
> 3.9 image and 3.7 VM        - 160,602,258 bytecodes/sec;  7,292,693 sends/sec
>
> The Jitter3 numbers varied wildly, but even the more stable numbers for
> the normal 2.3 VM are very suspect. The problem is that the old image
> didn't expect such a fast machine and doesn't seem to loop enough times.
> But the point is that the code is out there and we can actually run it.

Just increase the numbers in #tinyBenchmarks (for the #benchFib
argument, incrementing by one roughly doubles the run time, so
incrementing by 4 should be more than enough).

Paolo
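A sketch of what that change looks like, assuming the 2.3-era method
hard-codes its loop counts; the constants here (10, 26, and the 500000
bytecodes-per-unit factor) are taken from later images and may not
match 2.3 exactly, so treat them as placeholders:

  tinyBenchmarks
      "Report approximate bytecodes/sec and sends/sec.
       Counts bumped for fast machines: 10 -> 20 and 26 -> 30."
      | t1 t2 r |
      t1 := Time millisecondsToRun: [20 benchmark].
      t2 := Time millisecondsToRun: [r := 30 benchFib].
      ^ ((20 * 500000 * 1000) // t1) printString, ' bytecodes/sec; ',
        ((r * 1000) // t2) printString, ' sends/sec'

(Later images sidestep the problem by doubling the counts in a loop
until each benchmark runs for at least a second.)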
In reply to this post by Bert Freudenberg
Jecel Assumpcao Jr wrote:
> And since it was finished, you can download it (get Squeak 2.3) and run
> some benchmarks (on Linux machines, at least). Here are the numbers on
> this machine (3GHz Pentium 4):
>
> 2.3 image and normal 2.3 VM -  62,500,000 bytecodes/sec;  4,591,325 sends/sec
> 2.3 image and Jitter3 VM    - 100,000,000 bytecodes/sec; 10,494,459 sends/sec
> 3.9 image and 3.7 VM        - 160,602,258 bytecodes/sec;  7,292,693 sends/sec
>
> The Jitter3 numbers varied wildly, but even the more stable numbers for
> the normal 2.3 VM are very suspect. The problem is that the old image
> didn't expect such a fast machine and doesn't seem to loop enough times.

Easy. Just copy and paste #tinyBenchmarks, #benchFib, and #benchmark to
Integer. I did this in Squeak1.1.image (this was the baseline of my
benchmark comparison) so it should work fine with 2.3.

Cheers,
  - Andreas
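For reference, here is #benchFib as found in later Squeak images; the
versions copied back into 1.1 or 2.3 should be essentially identical,
but verify against your image:

  benchFib
      "Handy send-heavy benchmark; answers the number of calls made,
       so (result * 1000) // milliseconds gives sends per second."
      ^ self < 2
          ifTrue: [1]
          ifFalse: [(self - 1) benchFib + (self - 2) benchFib + 1]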
In reply to this post by Jecel Assumpcao Jr
Hi Jecel,

> > downside is that it would inherently be much more complex - the
> > interpreter strikes a nice balance here.
>
> The first jitter had the advantage of being cross-platform, but as I
> had predicted it only improved performance by around 50%, which is
> less than what we got by making the interpreter better. The second
> jitter (known as Jitter3) was actually finished as far as I can tell.
> There was still some stuff to be done, but it was no worse in that
> regard than Unicode or Traits support in current Squeak. There simply
> was not enough interest for it to be adopted.
>
> And since it was finished, you can download it (get Squeak 2.3) and run
> some benchmarks (on Linux machines, at least). Here are the numbers on
> this machine (3GHz Pentium 4):
>
> 2.3 image and normal 2.3 VM -  62,500,000 bytecodes/sec;  4,591,325 sends/sec
> 2.3 image and Jitter3 VM    - 100,000,000 bytecodes/sec; 10,494,459 sends/sec
> 3.9 image and 3.7 VM        - 160,602,258 bytecodes/sec;  7,292,693 sends/sec
>
> The Jitter3 numbers varied wildly, but even the more stable numbers for
> the normal 2.3 VM are very suspect. The problem is that the old image
> didn't expect such a fast machine and doesn't seem to loop enough times.
> But the point is that the code is out there and we can actually run it.
>
> -- Jecel

Thanks for informing us about the state of the jitter. Those extra
3 million sends/sec are a nice slice of about 30% better send
performance. Could that jitter work in tandem with the current
interpreter? For an updated and ported jitter, on let's say 3.10, would
you expect it to add roughly that 30% in sends?

Cheers,

Sebastian Sastre
Hi folks!
"Sebastian Sastre" <[hidden email]> wrote: > > > downside is that it would inherently be much more complex - the > > > interpreter strikes a nice balance here. > > > > The first jitter had the advantage of being cross-platform, > > but as I had > > predicted it only improved performance by around 50% which is > > less than > > what we got by making the interpreter better. The second jitter (known > > as Jitter3) was actually finished as far as I can tell. [SNIP] > thanks for informing about the state of jitter. Those 3 million > sends/sec more represents a nice pie of 30% better sends/sec. That jitter can > make team work with current interpreter? An updated and ported jitter, for lets > say 3dot10, you will say it sum performance near that 30% in sends? If I remember correctly Jitter3 was actually a VM written in C++, Ian can correct me if I am wrong. And if my memory serves correctly it didn't show as impressive numbers in "real world scenarios" as compared to benchmarks simply due to the fact that most real world Squeak apps spend 50% of their time in primitives and those aren't affected. (modulo my memory failing me of course) My perception in all this is that Exupery is the *clearly* most promising speed technology we have available and it is getting faster and faster all the time. And it works together with the normal VM. The numbers Exupery shows today with normal Squeak are very impressive and the numbers Exupery shows in Huemul are even faster - due to some improvements that Guillermo could make due to the fact that he has full control of the design of the VM, if I understand it correctly. Fonc/cola/coke can also turn out to be really fast - but that is a whole new platform, not really "Squeak". regards, Göran |
[hidden email] wrote:
>> Thanks for informing us about the state of the jitter. Those extra
>> 3 million sends/sec are a nice slice of about 30% better send
>> performance. Could that jitter work in tandem with the current
>> interpreter? For an updated and ported jitter, on let's say 3.10,
>> would you expect it to add roughly that 30% in sends?
>
> If I remember correctly, Jitter3 was actually a VM written in C++; Ian
> can correct me if I am wrong. And if my memory serves correctly, it
> didn't show numbers as impressive in "real world scenarios" as in
> benchmarks, simply because most real-world Squeak apps spend 50% of
> their time in primitives, and those aren't affected.
>
> (modulo my memory failing me, of course)

Yes, that's my recollection too. J3 had portions that were compiled
into native code (I remember that there were a few macros that one had
to implement on each platform) but the main reason why it didn't show
much real-world improvement was (IIRC) that it didn't do context
mapping and inline caches. Those are the places that *really* make a
difference (context mapping, of course, is hard without fixing blocks
to be strictly LIFO).

> My perception in all this is that Exupery is *clearly* the most
> promising speed technology we have available, and it is getting faster
> all the time. And it works together with the normal VM.

One of my problems with Exupery is that I've only seen claims about
bytecode speed, and if you know where the time goes in a real-life
environment then you know it ain't bytecodes. In other words, it seems
to me that Exupery is optimizing the least significant portion of the
VM. I'd be rather more impressed if it did double the send speed.

> The numbers Exupery shows today with normal Squeak are very impressive,
> and the numbers it shows in Huemul are even faster - due to some
> improvements that Guillermo could make because he has full control of
> the design of the VM, if I understand it correctly.

Based on which benchmarks? Can I run them on Windows?

> Fonc/cola/coke can also turn out to be really fast - but that is a
> whole new platform, not really "Squeak".

True.

Cheers,
  - Andreas
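A toy model of what an inline cache buys, in plain Smalltalk; this is
purely illustrative, since real (polymorphic) inline caches are stubs
in generated machine code at each send site. The Dictionary stands in
for a per-send-site cache keyed on the receiver's class:

  | cache sendSite |
  cache := Dictionary new.   "receiver class -> previously looked-up method"
  sendSite := [:rcvr |
      "hit: skip the full method lookup; miss: look it up once and remember"
      rcvr
          withArgs: #()
          executeMethod: (cache
              at: rcvr class
              ifAbsentPut: [rcvr class lookupSelector: #printString])].
  sendSite value: 3.       "miss; caches the method found for SmallInteger"
  sendSite value: 'hi'.    "miss for String; a second entry makes it 'polymorphic'"
  sendSite value: 4.       "hit; no lookup needed"

Context mapping is the harder half, as noted above: keeping activation
records on the native stack only works out if block contexts behave
strictly LIFO.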
On Wed, Apr 2, 2008 at 4:36 AM, Andreas Raab <[hidden email]> wrote:
> One of my problems with Exupery is that I've only seen claims about
> bytecode speed, and if you know where the time goes in a real-life
> environment then you know it ain't bytecodes. In other words, it seems
> to me that Exupery is optimizing the least significant portion of the
> VM. I'd be rather more impressed if it did double the send speed.

I share similar views. In the purest of OO systems, everything is a
message send except the things that are not, and the only things that
are not are primitives. Message sends translate directly into some sort
of canned set of machine code, and you apply your compiling and
optimizing dexterity to the primitives. Optimization is done through
PICs and recursive inlining of primitives and recompiling/optimizing.
Any method that isn't a primitive is just doing message sends, but any
method could potentially get compiled into a primitive through the
process of inlining.

Of course, Exupery isn't trying to reinvent the world either. And I'm
sure aspects of the Exupery compiler could be leveraged in a system
that was closer to this kind of purity.

- Stephen
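A tiny illustration of that purity in Smalltalk itself: even control
flow is, conceptually, message sending that a clever compiler may later
collapse into primitive code:

  | max |
  max := (3 > 4)            "a send of #> to 3"
      ifTrue: [3]           "a send of #ifTrue:ifFalse: to false"
      ifFalse: [4].
  max                       "=> 4"

In practice Squeak's bytecode compiler already inlines #ifTrue:ifFalse:
into conditional branch bytecodes - exactly the kind of
send-to-primitive collapsing described above.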
In reply to this post by Andreas.Raab
Andreas Raab writes:
> One of my problems with Exupery is that I've only seen claims about
> bytecode speed, and if you know where the time goes in a real-life
> environment then you know it ain't bytecodes. In other words, it seems
> to me that Exupery is optimizing the least significant portion of the
> VM. I'd be rather more impressed if it did double the send speed.

Then be impressed. Exupery has had double Squeak's send performance
since March 2005.

  http://people.squeakfoundation.org/person/willembryce/diary.html?start=23

That's done by using polymorphic inline caches, which are also used to
drive dynamic primitive inlining. It is true that further send
performance gains are not planned before 1.0. Doubling send performance
should be enough to provide a practical performance improvement. It's
better to solve all the problems standing in the way of a practical
performance improvement before starting work on full method inlining,
which should provide serious send performance.

Here are the current benchmarks:

Executing Code
==============
  arithmaticLoopBenchmark 1397 compiled  138 ratio: 10.122
  bytecodeBenchmark       2183 compiled  435 ratio:  5.017
  sendBenchmark           1657 compiled  741 ratio:  2.236
  doLoopsBenchmark        1100 compiled  813 ratio:  1.353
  pointCreation            988 compiled  968 ratio:  1.021
  largeExplorers           729 compiled  780 ratio:  0.935
  compilerBenchmark        529 compiled  480 ratio:  1.102
  Cumulative Time     1113.161 compiled 538.355 ratio 2.068

Compile Time
============
  ExuperyBenchmarks>>arithmeticLoop                199ms
  SmallInteger>>benchmark                          791ms
  InstructionStream>>interpretExtension:in:for:  14266ms
  Average                                       1309.515

The bottom two executing-code benchmarks are macro benchmarks. They
compile a few methods based on a profile run, then re-run the
benchmark.

There are several primitives that are inlined into the main interpret()
loop in the interpreter but require full worst-case dispatching in
Exupery. They'll need to be implemented to prevent slowdowns in code
that uses them. Also, there are a few limitations that can cause
Exupery to produce poorly performing code in some situations. There are
also bugs: the last release would run for about an hour of development
before crashing. These are the issues that are currently being worked
on.

Here are the benchmarks from the 0.13 release:

  arithmaticLoopBenchmark 1397 compiled  138 ratio: 10.122
  bytecodeBenchmark       2183 compiled  435 ratio:  5.017
  sendBenchmark           1657 compiled  741 ratio:  2.236
  doLoopsBenchmark        1100 compiled  813 ratio:  1.353
  pointCreation            988 compiled  968 ratio:  1.021
  largeExplorers           729 compiled  780 ratio:  0.935
  compilerBenchmark        529 compiled  480 ratio:  1.102
  Cumulative Time     1113.161 compiled 538.355 ratio 2.068

  ExuperyBenchmarks>>arithmeticLoop                199ms
  SmallInteger>>benchmark                          791ms
  InstructionStream>>interpretExtension:in:for:  14266ms
  Average                                       1309.515

The major gains are in the compilerBenchmark macro benchmark and in
compilation time, both due to work on the register allocator.

Exupery from the beginning has been an attempt to combine serious
optimisation with full method inlining, similar to Self, while having
the entire compiler written in Smalltalk. It's an ambitious goal that's
best tackled in smaller steps.

Bryce
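(To read the tables: each row is interpreted milliseconds, compiled
milliseconds, and their quotient, e.g. arithmaticLoopBenchmark gives
1397 / 138, roughly 10.1x. Ratios below 1.0, like largeExplorers'
0.935, mean the compiled run was actually slower than the interpreter.)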
I had the wrong benchmarks for 0.13. This post fixes my copy/paste
error. Thanks, Göran.

[hidden email] writes:
> Then be impressed. Exupery has had double Squeak's send performance
> since March 2005.
>
>   http://people.squeakfoundation.org/person/willembryce/diary.html?start=23
>
[SNIP]
> Here are the benchmarks from the 0.13 release:

Corrected numbers:

  arithmaticLoopBenchmark 1396 compiled  128 ratio: 10.906
  bytecodeBenchmark       2111 compiled  460 ratio:  4.589
  sendBenchmark           1637 compiled  668 ratio:  2.451
  doLoopsBenchmark        1081 compiled  715 ratio:  1.512
  pointCreation           1245 compiled 1317 ratio:  0.945
  largeExplorers           728 compiled  715 ratio:  1.018
  compilerBenchmark        483 compiled  489 ratio:  0.988
  Cumulative Time         1125 compiled  537 ratio  2.093

  ExuperyBenchmarks>>arithmeticLoop                 249ms
  SmallInteger>>benchmark                          1112ms
  InstructionStream>>interpretExtension:in:for:  113460ms
  Average                                        3155.360

> The major gains are in the compilerBenchmark macro benchmark and in
> compilation time, both due to work on the register allocator.
>
> Exupery from the beginning has been an attempt to combine serious
> optimisation with full method inlining, similar to Self, while having
> the entire compiler written in Smalltalk. It's an ambitious goal
> that's best tackled in smaller steps.
>
> Bryce
In reply to this post by Bryce Kampjes
[hidden email] wrote:
> Andreas Raab writes:
> > One of my problems with Exupery is that I've only seen claims about
> > bytecode speed, and if you know where the time goes in a real-life
> > environment then you know it ain't bytecodes. In other words, it
> > seems to me that Exupery is optimizing the least significant portion
> > of the VM. I'd be rather more impressed if it did double the send
> > speed.
>
> Then be impressed. Exupery has had double Squeak's send performance
> since March 2005.
>
>   http://people.squeakfoundation.org/person/willembryce/diary.html?start=23

That's pretty impressive.

> That's done by using polymorphic inline caches, which are also used to
> drive dynamic primitive inlining. It is true that further send
> performance gains are not planned before 1.0. Doubling send
> performance should be enough to provide a practical performance
> improvement. It's better to solve all the problems standing in the way
> of a practical performance improvement before starting work on full
> method inlining, which should provide serious send performance.

Indeed. So what's in the way of practical performance improvement at
this point? I was quite surprised that in your corrected benchmarks the
two that were macros wouldn't show any improvement:

>   bytecodeBenchmark       2111 compiled 460 ratio: 4.589
>   sendBenchmark           1637 compiled 668 ratio: 2.451
>   [...]
>   largeExplorers           728 compiled 715 ratio: 1.018
>   compilerBenchmark        483 compiled 489 ratio: 0.988

With sends 2.5x faster I would expect *some* noticeable improvement.
Any ideas what the problem is?

Cheers,
  - Andreas
In reply to this post by Bryce Kampjes
> Andreas Raab writes:
> > One of my problems with Exupery is that I've only seen claims about
> > bytecode speed, and if you know where the time goes in a real-life
> > environment then you know it ain't bytecodes. In other words, it
> > seems to me that Exupery is optimizing the least significant portion
> > of the VM. I'd be rather more impressed if it did double the send
> > speed.
>
> Then be impressed. Exupery has had double Squeak's send performance
> since March 2005.
>
>   http://people.squeakfoundation.org/person/willembryce/diary.html?start=23

Way back when we were playing around with J3, we put together the
macroBenchmark methods as a more realistic indication of performance
than the various micro benchmarks. Are they still available? If not,
they can be run in older images (3.4 has them and runs them even now).
They have the minor downside of changing with tweaks to the image, but
they have the upside of being quite realistic and absolutely comparable
from VM to VM. I highly recommend them to anyone serious about
performance evaluation.

- Dan

PS: As a reminder, here is what they do...

1: Decompile, pretty-print, and compile a bunch of methods. Does not
   install in classes, so does not flush the cache.
2: Build morphic tiles for all methods over 800 bytes (;-). Does no
   display.
3: Translate the interpreter with inlining. Does not include any
   plugins.
4: Run the context step simulator. 200 iterations printing pi and
   15 factorial.
5: Run the InterpreterSimulator for 150,000 bytecodes. Will only run
   if you have mini.image in your directory.
6: Open 10 browsers and close them. Includes browsing to a specific
   method.
7: Play a game of FreeCell with display, while running the
   MessageTally. Thanks to Bob Arning for the clever part of this one.
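If the macroBenchmark methods have gone missing from newer images, a
harness in the same spirit is easy to sketch. The selector names below
are placeholders for illustration - check a 3.4 image for the real
ones - but the timing pattern is the standard one:

  | report |
  report := WriteStream on: String new.
  #(macroBenchmark1 macroBenchmark2 macroBenchmark3) do: [:sel |
      report
          nextPutAll: sel; nextPutAll: ': ';
          print: (Time millisecondsToRun: [0 perform: sel]);
          nextPutAll: ' ms'; cr].
  Transcript show: report contents

Absolute times mean little across machines, but with the same image and
workload the numbers are directly comparable from VM to VM, which is
the point above.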
Dan Ingalls wrote:
> Way back when we were playing around with J3, we put together the
> macroBenchmark methods as a more realistic indication of performance
> than the various micro benchmarks. Are they still available? If not,
> they can be run in older images (3.4 has them and runs them even now).
> They have the minor downside of changing with tweaks to the image, but
> they have the upside of being quite realistic and absolutely comparable
> from VM to VM. I highly recommend them to anyone serious about
> performance evaluation.

I have resurrected most of them for Croquet.

Cheers,
  - Andreas
In reply to this post by Andreas.Raab
Andreas Raab writes:
> [hidden email] wrote:
>
> Indeed. So what's in the way of practical performance improvement at
> this point? I was quite surprised that in your corrected benchmarks
> the two that were macros wouldn't show any improvement:
>
>   bytecodeBenchmark       2111 compiled 460 ratio: 4.589
>   sendBenchmark           1637 compiled 668 ratio: 2.451
>   [...]
>   largeExplorers           728 compiled 715 ratio: 1.018
>   compilerBenchmark        483 compiled 489 ratio: 0.988
>
> With sends 2.5x faster I would expect *some* noticeable improvement.
> Any ideas what the problem is?

Here are the latest benchmarks; there's a 10% gain for the compiler
benchmark. There's a bigger loss for largeExplorers, but I think that's
triggered by compiling more methods and thus hitting a missing
optimisation that previous benchmark runs didn't hit.

  arithmaticLoopBenchmark 1397 compiled  138 ratio: 10.122
  bytecodeBenchmark       2183 compiled  435 ratio:  5.017
  sendBenchmark           1657 compiled  741 ratio:  2.236
  doLoopsBenchmark        1100 compiled  813 ratio:  1.353
  pointCreation            988 compiled  968 ratio:  1.021
  largeExplorers           729 compiled  780 ratio:  0.935
  compilerBenchmark        529 compiled  480 ratio: *1.102*
  Cumulative Time     1113.161 compiled 538.355 ratio 2.068

  ExuperyBenchmarks>>arithmeticLoop                199ms
  SmallInteger>>benchmark                          791ms
  InstructionStream>>interpretExtension:in:for:  14266ms
  Average                                       1309.515

There are many reasons why there isn't a larger gain. One is that only
a few methods are being compiled. There was a register allocation
issue. String>>at: and at:put: are not yet compiled, so they take the
slow path through to the interpreter. The same goes for ^ true,
^ false, and ^ nil. All are used by those macro benchmarks.

The problem I'm working on now is that the stack is loaded into
registers when Exupery enters a context, then saved back into the
context when it leaves. This is particularly inefficient if the
registers get spilled to the C stack, as then they're copied into
registers only to be immediately copied to memory at a different
location. This makes real-world sends potentially worse than the send
benchmark.

With earlier benchmarks, compilation was slowing down object allocation
because allocation is inlined into the main interpreter loop, but
Exupery was doing a full worst-case send to get to the primitive. (1)

To benefit from PICs, both the sending and receiving methods must be
compiled. That may not be happening as much as it should be. This can
be blocked by some of the few missing bytecodes. Stack duplication is
the only serious missing bytecode.

The generated code is still relatively sloppy. Exupery doesn't handle
SIB-byte addressing modes, so it can't access literal indirections,
which are used heavily to access interpreter state. Temporaries are not
stored in registers; that waits until after the stack register
improvements are finished.

The visible progress since the last release is that the compiler
benchmark now shows a 10% gain and compiling interpretExtension:in:for:
is about 9 times faster.

Bryce

(1) It wasn't creating the context, but it was going through the PIC
and then dropping into a helper method which ran through the
interpreter's send code.
[hidden email] wrote:
> There are many reasons why there isn't a larger gain. One is that only
> a few methods are being compiled.

Why not compile everything to see what the baseline gain is? This is
really the number I'm interested in - compile the entire image, run the
macro benchmarks, and compare that. Not only would it be a good stress
test, but for some applications (like our Croquet servers) it would be
entirely reasonable, even desirable.

Cheers,
  - Andreas
Andreas Raab writes:
> [hidden email] wrote:
> > There are many reasons why there isn't a larger gain. One is that
> > only a few methods are being compiled.
>
> Why not compile everything to see what the baseline gain is? This is
> really the number I'm interested in - compile the entire image, run
> the macro benchmarks, and compare that. Not only would it be a good
> stress test, but for some applications (like our Croquet servers) it
> would be entirely reasonable, even desirable.

The short answer is that it's not ready yet. That's what I'm working
on. Exupery normally crashes after about an hour of real use.
Reliability is improving, but it's not yet ready for real use except in
very controlled ways.

The benchmarks I showed are for driving development; they must be
reasonably quick so they can easily be run during development. Total
test time should be only a few minutes. I'd say that if both macro
benchmarks were providing over 10% performance gains, then reliability
would be the top priority rather than one of the top priorities.

Compiling the entire image is probably overkill and may eat a few
hundred megabytes of RAM. Bytecodes are a very dense format; machine
code isn't. Also, Exupery at the moment needs to compile a method once
for each receiver. This is only required when compiling primitives such
as #at: and #new, where it's useful to specialise the compiled code for
each different receiver. It would be easy to re-use the same compiled
code for multiple receivers, but this hasn't been done yet. It hasn't
been a problem when compilation is driven from a profiler.

Bryce