Attached is #primitiveApplyToFromTo as it compiles and runs using the
win32 Squeak 3.7.1 build environment here; I'm testing it with the 3.8 and 3.9 images.

The performance is compared against plain #do: (apples vs. apples) and #to:do: (apples vs. oranges) further down below.

Compared to #do: with a block, #applyTo:from:to: with a block is definitely faster (smaller hidden constants), and there is also at least one item waiting for optimization (and for me having time to do that).

I'd say that the factor can arrive at about 2 (#do: vs. #applyTo:from:to:). But the primitive will not outperform an inlined #to:do:; this was clear from the beginning. As Jon wrote, the primitive is good for "fixing" the standard enumeration methods so they all have (more or less) identical performance.

The current implementation is for non-Strings only (as dictated by an atCache parameter) but can of course be extended to work with Strings. Another item that's missing is returning self from the primitive; this is currently handled by falling into the shadow, which then returns (and some superfluous bytecode cycles are burned).

Example method in SequenceableCollection:

	applyTo: aBlock from: index to: limit
		<primitive: 164>
		[index <= limit] whileTrue:
			[aBlock value: (self basicAt: index).
			 thisContext tempAt: 2 put: index + 1]

The shadow code is exactly what is done by the primitive. As discussed earlier, the primitive can be called from #commonSend and from #commonReturn and in both cases might decide to exit with primitiveFail.

The overrides for #commonSend and #commonReturn are not attached, but if you want them, ask me.

Any questions?

/Klaus

--------------------

| time array sum |
array := (1 to: 10565520) collect: [:none | 1].
Smalltalk garbageCollect.
time := Time millisecondsToRun: [
	sum := array sum].
sum.
time
=> 9093

--------------------

| time array sum |
array := (1 to: 10565520) collect: [:none | 1].
Smalltalk garbageCollect.
time := Time millisecondsToRun: [
	sum := 0.
	array do: [:each | sum := sum + each]].
sum.
time
=> 4764

--------------------

| time array sum |
array := (1 to: 10565520) collect: [:none | 1].
Smalltalk garbageCollect.
time := Time millisecondsToRun: [
	sum := 0.
	array applyTo: [:each | sum := sum + each]
		from: 1 to: array size].
sum.
time
=> 3499

--------------------

| time array sum |
array := (1 to: 10565520) collect: [:none | 1].
Smalltalk garbageCollect.
time := Time millisecondsToRun: [
	sum := 0.
	1 to: array size do: [:index | sum := sum + (array at: index)]].
sum.
time
=> 2089

--------------------

[Attachment: PrimitiveApplyFromTo-kwl.1.cs (3K)]
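For orientation, the reported timings work out to the following speedup factors. This is a quick back-of-the-envelope check of the numbers above, not part of the original post:

```python
# Reported wall-clock times (ms) for summing a 10,565,520-element array,
# taken from the four benchmarks in the post above.
times_ms = {
    "sum":              9093,  # generic #sum
    "do:":              4764,  # #do: with a block
    "applyTo:from:to:": 3499,  # the new primitive
    "to:do:":           2089,  # inlined #to:do: loop
}

# Factor the primitive gains over plain #do: (Klaus expects ~2 after
# further optimization; the measured factor here is ~1.36).
do_vs_apply = times_ms["do:"] / times_ms["applyTo:from:to:"]
print(round(do_vs_apply, 2))      # 1.36

# The inlined #to:do: remains fastest, as stated in the post.
apply_vs_todo = times_ms["applyTo:from:to:"] / times_ms["to:do:"]
print(round(apply_vs_todo, 2))    # 1.67
```

So the measured gain of the primitive over #do: on this run is about 36%, with the predicted factor of 2 still depending on the pending optimizations.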
Where is the other half of the changes? What you sent isn't enough to
generate a functioning VM. I'm particularly curious how you've dealt with the loop inlining; it seems there is some code missing which does that.

Cheers,
  - Andreas

Klaus D. Witzel wrote:
> Attached is #primitiveApplyToFromTo as it compiles and runs using the
> win32 Squeak 3.7.1 build environment here; I'm testing it with the 3.8
> and 3.9 images.
> [...]
> Any questions?
>
> /Klaus
In reply to this post by Klaus D. Witzel
The previously missing #returnReceiver case is now solved and the new .cs
posted to http://bugs.impara.de/view.php?id=4925. The Mantis report also includes the overrides for #commonSend and #commonReturn.

Andreas, #commonSend and #commonReturn are the same as what I emailed you last night.

It would be interesting to compare the performance figures for platforms other than win32.

/Klaus
In reply to this post by Klaus D. Witzel
At the moment no primitives are re-entered. We don't have primitive contexts. Changing that is a large design change even if the code change is small. Normal primitives can use shadowing because there is no way to capture their execution half way through. In this case that is not true. Shadowing does not fully hide the primitive, as it may need to be re-entered later.

This primitive will slow down sends and returns on some architectures, but probably not all. Out of order execution is great at hiding cost behind other delays. I wouldn't be surprised if there is no cost on either a PPC or an x86, but there will be on some chips. Any optimisation of the common case send code will increase the performance loss caused by this primitive. (1)

It could crash if you save an image on a VM with the primitive then load it on a VM without the primitive. This will only happen if one of the primitives is in an active context. Looking at internalActivateNewMethod, the PC will be set to its initial value, but that will cause the loop to begin again, which could be problematic too.

What happens if you step back into the primitiveApplyToFromTo method from a debugger? So execution entered via the interpreter and used the primitive, then the debugger (or any tool that simulates bytecode execution) re-enters the primitive method.

There is maintenance risk if the shadow implementation and the real implementation get out of sync, because the bugs may only occur when switching from running with the primitive to running without the primitive. This will definitely happen if you move an image to an older VM, but could also happen if you improve the primitive so it can be re-entered as bytecode.

Also, a primitive failing does not necessarily mean that it can be replaced by the back-up code. In many cases the method code after a primitive handles a different set of conditions. Have a look at Object>>size. In general, no execution engine can assume that it can ignore a primitive.

Bryce

(1) Have a look at commonReturn. The simple case, when a method or block is returning directly to its caller, can be simplified. The general case needs to handle any unwind blocks that might be walked over while exiting, which the common case will not do. Also, the common case could be coded without the loops which risk branch mispredicts when exiting.
Hi Bryce,
reading your comments below I'm under the impression that you have not understood the novel concept (which is, BTW, old) but, of course, I may be mistaken. And I appreciate your comments and thoughts very much :) In the below I offer discussing some issues off list but I'd happily discuss all issues here.

On Fri, 15 Sep 2006 23:44:47 +0200, you wrote:

> At the moment no primitives are re-entered.

primitiveApplyToFromTo is always called from the *beginning* with the same context. No pc is maintained. No sp is maintained. Of course both quantities are defined for the context, but not touched in any way.

> We don't have primitive contexts.

This is not true; all primitives have their arguments in the context and also leave their result there, since the Blue Book.

> Changing that is a large design change even if the code change is
> small.

There was no design change: either executeNewMethod is called, which in turn calls activateNewMethod then newActiveContext, or primitiveApplyToFromTo is called between activateNewMethod and newActiveContext. The flow of control is 100% pure[tm] Blue Book and untouched.

> Normal primitives can use shadowing because there is no way to capture
> their execution half way through.

No, there is no half way through. primitiveApplyToFromTo is always full way through. It is not so that, say, 50% of its code is covered by the first invocation and the rest by the next invocation. The coverage is always 100%.

> In this case that is not true. Shadowing does not fully hide the
> primitive as it may need to be re-entered later.

Absolutely not. primitiveApplyToFromTo can send, say, the first 10 values to the block and then may fail. Thereafter its shadow can, if it wants, send, say, another 10 values to the block.

Your comment shows me that the code of primitiveApplyToFromTo is not easy to understand. This must be my fault; we can discuss this off list to any level you want.
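The hand-over behaviour Klaus describes — the primitive can fail at any iteration and the shadow simply continues, because the loop index lives in the context's temps — can be modelled in a few lines. This is a hypothetical Python sketch, not the actual Slang code; the function names and the `budget` failure trigger are invented for illustration:

```python
def primitive_apply(seq, block, ctx, budget):
    """Model of primitiveApplyToFromTo: apply block to seq[index..limit]
    (1-based, inclusive). The loop index is kept in the context dict,
    as the real primitive keeps it in the method's temps, so a failure
    partway through loses no progress. `budget` is an invented knob
    that forces a primitiveFail after that many iterations."""
    while ctx["index"] <= ctx["limit"]:
        if budget == 0:
            return False                 # primitiveFail -> shadow code runs
        block(seq[ctx["index"] - 1])
        ctx["index"] += 1                # like: thisContext tempAt: 2 put: index + 1
        budget -= 1
    return True

def shadow_apply(seq, block, ctx):
    """The bytecode shadow: the identical loop, picking up wherever the
    primitive left the index."""
    while ctx["index"] <= ctx["limit"]:
        block(seq[ctx["index"] - 1])
        ctx["index"] += 1

# The "primitive" handles the first 10 elements, fails, and the shadow
# finishes; the observable effect equals one uninterrupted enumeration.
total = []
ctx = {"index": 1, "limit": 20}
if not primitive_apply(list(range(1, 21)), total.append, ctx, budget=10):
    shadow_apply(list(range(1, 21)), total.append, ctx)
print(total == list(range(1, 21)))   # True
```

The point of the model is that there is no saved pc or sp: the only state shared between primitive and shadow is the index temp, which is why the primitive is "always full way through" on each invocation.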
> This primitive will slow down sends and returns on some architectures
> but probably not all.

On which, Bryce?

> Out of order execution is great at hiding cost behind other delays.

primitiveApplyToFromTo concatenates (the primitive implementation of) Stream>>#next and BlockContext>>#value:; what is out of order with this?

> I wouldn't be surprised if there is no cost on either a PPC or an x86
> but there will be on some chips.

Which chips, Bryce?

> Any optimisation of the common case send code will increase the
> performance loss caused by this primitive. (1)

Until now, you have not shown any performance loss caused by this primitive. So what's this about?

> It could crash if you save an image on a VM with the primitive then
> load it on a VM without the primitive.

It cannot crash. It can start on VM-a *with* primitiveApplyToFromTo support, then, before its block ends, the image can be written to a snapshot, then resumed on VM-b *without* primitiveApplyToFromTo support, and vice versa. Let's walk through this off list.

> This will only happen if one of the primitives is in an active
> context. Looking at internalActivateNewMethod the PC will be set to
> its initial value but that will cause the loop to begin again which
> could be problematic too.

Every loop begins again because of the long jump backwards. But in primitiveApplyToFromTo there is no jump backwards, no loop. And there is nothing which causes the loop to begin again. As I wrote earlier, it is

	(*marker*) [index < limit] whileTrue: ...

and there exists no jump back to the (*marker*) position. I repeat: there is no jump.

> What happens if you step back into the primitiveApplyToFromTo method
> from a debugger?

The same as what happens when the block returns (I mean: indistinguishable, invariant).

> So execution entered via the interpreter and used the primitive then
> the debugger (or any tool that simulates bytecode execution) re-enters
> the primitive method.

This is supported.
All the debugger must know is that it cannot step through the primitive (easy). But it can step through the shadow, and then, when in the shadow (or in the block) one does "proceed", it is again the same as when the block returns (indistinguishable, invariant).

> There is maintenance risk if the shadow implementation and the real
> implementation get out of sync because the bugs may only occur when
> switching from running with the primitive to running without the
> primitive.

You mean like with the shadow of any other primitive? Sure, this always was so and will always be so. Two pieces of software which implement the same function, but one transcending the other, have always this problem.

> This will definitely happen if you move an image to an older VM

No, see the case with VM-a and VM-b above.

> but could also happen if you improve the primitive so it can be
> re-entered as bytecode.

Huh? The primitive is not bytecode and it is not planned to make it such.

> Also a primitive failing does not necessarily mean that it can be
> replaced by the back-up code.

No, primitives must be replaceable by their shadow, or else every other existing shadowed primitive could crash the VM.

> In many cases the method code after a primitive handles a different
> set of conditions.

Why do you want me to handle conditions in primitiveApplyToFromTo which are not handled in the shadow? That would be a bit too much, if not crazy.

> Have a look at Object>>size.

Have a look at all the primitives that expect Integer indices but are passed Floats. This is business as usual. Back to our case here: if someone passes a Float to primitiveApplyToFromTo, then the shadow will attempt to index the receiver with a Float, like in (#(1) at: 0.5), so what?

> In general, no execution engine can assume that it can ignore a
> primitive.

Not so fast and not so general: primitiveApplyToFromTo can be ignored by every VM which was compiled *with* it and by every VM which was compiled *without* it.
Thank you very, very much Bryce :-)

/Klaus

> Bryce

I understand the following as a general comment on the current implementation of #commonReturn.

> 1) Have a look at commonReturn. The simple case when a method or
> block is returning directly to its caller can be simplified. The
> general case needs to handle any unwind blocks that might be walked
> over while exiting which the common case will not do. Also the common
> case could be coded without the loops which risk branch mispredicts
> when exiting.
What I missed until now was the "thisContext tempAt: 2 put: index + 1" in the shadow code. I still maintain that your version is overly clever and very likely to cause mass confusion and maintenance issues. do: loops are something that should be fully understandable by any Smalltalk programmer. Everyone is going to be seeing this in any walkback that involves a do:.

Normally, arguments are immutable in Squeak. Preserving that matters to me when reading code. Exupery doesn't care, as the bytecodes are identical. I do care, as a programmer, that normal invariants are held, especially in code that I'm likely to be glancing at continually when developing.

Now, the way to optimise an expression that uses thisContext on a system with Exupery, or without your VM patch, is to remove the use of thisContext. The shadow code is going to be much slower on systems that do not have the VM mod than the current implementation. My policy for use of thisContext and tempAt:put: is that neither will be compiled by Exupery; tempAt:put: should cause Exupery to de-optimise the context and then drop back into the interpreter (it doesn't yet). I'm trying to optimise the common case.

Re-calling a primitive from a return is close enough to re-entering it. That is something that hasn't been done. It is a major design change.

Your VM mod adds work to both common send and common return, which every send is going to have to execute. Do you have any measurements to show that any realistic code is spending enough time in do: loop overhead to justify slowing down all sends?

On my machine, I can do an interpreted message send every 291 clocks. You are adding at least 11 instructions to the send/return sequence, which would take four clocks to execute at the peak execution rate. The worst case time is about 42 clocks, which is the full latency costs plus two branch mispredicts (at 15 clocks each). The time estimates are based on an Athlon 64, though they will be very similar for other modern desktop CPUs (nuclear reactors in silicon), except the Pentium 4, where a mispredict costs 30 clocks.

The performance effects of the VM mod will depend on the architecture, the compiler, and how well the branch predictor manages on those two extra branches. Unfortunately, to prove there is not a speed loss, you are really going to need to test on many architectures and compilers under many different loads. Two branch mispredicts alone could cost 10% in send performance. Just on the x86, the costs are likely to differ between the Pentium 3, Pentium 4, Pentium M, Intel's Core, and the Athlon (where the XP may be different from the 64).

An out of order CPU may be able to hide the cost of the extra instructions behind the current flabby send and return code. An in order CPU will not be able to do this. So expect a greater performance loss on slower machines such as ARMs and other chips aimed at the hand-held and embedded market. Also, the risks of a speed drop on a Pentium M are greater than those on an Athlon; the Pentium M manages to execute more instructions per clock when interpreting.

I calculated the clocks for a send from the clock speed (2.2 GHz) and the sends/sec from tinyBenchmarks:

	232,515,894 bytecodes/sec; 7,563,509 sends/sec

Exupery's tiny benchmark numbers are:

	1,151,856,017 bytecodes/sec; 16,731,576 sends/sec

So for Exupery a common case send costs 132 clocks. There is still plenty of room to remove waste from that. At 300 clocks, the cost is about 15% worst case and 1% best case, without out of order execution being able to hide costs behind other delays. At 132 clocks, the numbers are much worse. There is a good chance that with more tuning Exupery's sends may be reduced to about 60 clocks without inlining. VisualWorks sends cost 30 clocks. Optimise sends, and the best optimisation will be to remove primitiveApplyToFromTo, if it's not a net loss in performance now.

I vote strongly that this patch is not included in the VM. I've used IBM Smalltalk and enjoy working on a system where do: is easy to understand. Please don't trade simplicity for an optimisation that risks slowing down more code than it speeds up. A do: that is trivial to understand and is free from magic is worth a lot.

Bryce
Hi Bryce,
on Sat, 16 Sep 2006 14:08:09 +0200, you wrote:

> What I missed until now was the "thisContext tempAt: 2 put: index + 1"
> in the shadow code.

Some clever Squeaker will present alternatives, I'm sure ;-)

> I still maintain that your version is overly clever and very likely to
> cause mass confusion and maintenance issues.

I personally know of no software developer who ever participated in a mass confusion, Bryce, and I can say that for the last 30 years. Your prediction is not believable.

Would you say that
- http://en.wikipedia.org/wiki/Duff's_device
caused mass confusion? It is overly clever as well.

> do: loops are something that should be fully understandable by any
> Smalltalk programmer. Everyone is going to be seeing this in any
> walkback that involves a do:.

Yes, that is a novelty; perhaps it should be called an innovation. But I cannot claim that I have invented it, this was done some time ago.

> Normally, arguments are immutable in Squeak.

Not at all! Pass an argument to another method which does a #become: on the argument, Bryce. This is an illusion; we are talking about Smalltalk.

How do you handle this situation in Exupery?

primitiveApplyToFromTo, for example, is robust: it does not do anything when an argument is #become:'ed behind its back to something else.

> Now, the way to optimise an expression that uses thisContext on a
> system with Exupery or without your VM patch is to remove the use of
> thisContext. The shadow code is going to be much slower on systems
> that do not have the VM mod than the current implementation.

This is the case for all primitive vs. shadow code comparisons. They must be slower. Have you ever seen the opposite?

> My policy for use of thisContext and tempAt:put: is that neither will
> be compiled by Exupery,

But #tempAt:put: is used by the debugger and friends (through clients)! Why blame the debugger that it *must* use #tempAt:put: - what would the Smalltalker do without it, have Java?
</grin>

> Re-calling a primitive from a return is close enough to re-entering
> it. That is something that hasn't been done. It is a major design
> change.

When you look at #commonReturn, all you see is that another context is declared to be activeContext. This is irrelevant of my change. This happens as-is since the Blue Book. But okay, you want it to be a design change; I can live with compliments ;-)

> Your VM mod adds work to both common send and common return which
> every send is going to have to execute.

Right. That's the price to pay. What do you expect when adding new functionality to the VM, that the system runs faster? If so, you must pay the price.

> Do you have any measurements to show that any realistic code is
> spending enough time in do: loop overhead to justify slowing down
> all sends?

No, but since you are insisting on this all the time, I expect that you post that.

> On my machine, I can do an interpreted message send every 291
> clocks. You are adding at least 11 instructions to the send return
> sequence which would take four clocks to execute at the peak execution
> rate ...

yes, I agree it has a price.

> The performance effects of the VM mod will depend on the architecture,
> the compiler, and how well the branch predictor manages on those two
> extra branches. Unfortunately, to prove there is not a speed loss, you
> are really going to need to test on many architectures and compilers
> under many different loads.

Since the VM is used on so many platforms, I do not see any problem getting this feedback.

> An out of order CPU may be able to hide the cost of the extra
> instructions behind the current flabby send and return code. An in
> order CPU will not be able to do this. So expect a greater performance
> loss on slower machines such as ARMs and other chips aimed at the
> hand-held and embedded market.
> Also the risks of a speed drop on a Pentium-M are greater than those
> on an Athlon; the Pentium-M manages to execute more instructions per
> clock when interpreting.

Bryce, aren't you overexaggerating when you blame young, innocent primitiveApplyToFromTo for performance loss out of all these technical reasons? A simple ABC analysis reveals that A bytecode routines, B interpreter primitives and C message sends have frequency A>>B>>C, with >> the usual much-greater-than.

So if you want to save the VM's performance, get rid of performance lost in bytecode routines, thereafter in interpreter primitives and thereafter in message sends - not first C, then B, then A. I think that this is what you aim for with Exupery?

> I calculated the clocks for a send from the clock speed 2.2 GHz and
> the sends/sec from tinyBenchmarks.
>
> 232,515,894 bytecodes/sec; 7,563,509 sends/sec
>
> Exupery's tiny benchmark numbers are:
>
> 1,151,856,017 bytecodes/sec; 16,731,576 sends/sec

Fascinating. Is this with or without primitiveApplyToFromTo compiled into the VM?

> So for Exupery a common case send costs 132 clocks. There is still
> plenty of room to remove waste from that. At 300 clocks, the cost is
> about 15% worst case and 1% best case without out of order execution
> being able to hide costs behind other delays. At 132 clocks, the
> numbers are much worse. There is a good chance that with more tuning
> Exupery's sends may be reduced to about 60 clocks without
> inlining. VisualWorks sends cost 30 clocks. Optimise sends and the
> best optimisation will be to remove primitiveApplyToFromTo if it's
> not a net loss in performance now.

Will be to remove, Bryce?

> I vote strongly that this patch is not included in the VM.

This was foreseeable ;-)

Thanks again, Bryce, it was a pleasure.

/Klaus
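The A>>C part of Klaus's frequency ordering can be loosely sanity-checked against the tinyBenchmarks figures quoted in the thread. Note that tinyBenchmarks measures bytecode and send throughput on different workloads, so treating their quotient as "bytecodes per send" is only a crude proxy; the interpretation here is mine, not Klaus's:

```python
# tinyBenchmarks figures quoted in the thread
interp_bytecodes_per_sec  = 232_515_894
interp_sends_per_sec      = 7_563_509
exupery_bytecodes_per_sec = 1_151_856_017
exupery_sends_per_sec     = 16_731_576

# Crude proxy for "bytecodes executed per send": if this ratio is large,
# bytecode dispatch (A) dominates message sends (C) in frequency.
interp_ratio  = interp_bytecodes_per_sec / interp_sends_per_sec      # ~31
exupery_ratio = exupery_bytecodes_per_sec / exupery_sends_per_sec    # ~69
print(round(interp_ratio), round(exupery_ratio))
```

Even as a rough proxy, both ratios are well above 1, which is the direction Klaus's A>>C claim needs; it says nothing about B (interpreter primitives), for which neither side posts numbers.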
Klaus D. Witzel writes:
> Hi Bryce,
>
> on Sat, 16 Sep 2006 14:08:09 +0200, you wrote:
>
> > What I missed until now was the "thisContext tempAt: 2 put: index + 1"
> > in the shadow code.
>
> Some clever Squeaker will present alternatives, I'm sure ;-)

The issue is that arguments are immutable. The performance issues from the #tempAt:put: could be got around by using a custom bytecode compiler that allows assignment to arguments. But that does nothing to make your implementation simpler to understand or verify.

> > I still maintain that your version is overly clever and very likely
> > to cause mass confusion and maintenance issues.
>
> I personally know of no software developer who ever participated in a
> mass confusion, Bryce, and I can say that for the last 30 years. Your
> prediction is not believable.
>
> Would you say that
> - http://en.wikipedia.org/wiki/Duff's_device
> caused mass confusion? It is overly clever as well.

Yes, it is overly clever for many uses, and I think Duff said so originally when he announced it. In this case, IBM's version of this primitive did cause me a large amount of confusion many years ago. It's not until this argument that I'm starting to understand what they did and why.

> > do: loops are something that should be fully understandable by any
> > Smalltalk programmer. Everyone is going to be seeing this in any
> > walkback that involves a do:.
>
> Yes, that is a novelty; perhaps it should be called an innovation. But
> I cannot claim that I have invented it, this was done some time ago.
>
> > Normally, arguments are immutable in Squeak.
>
> Not at all! Pass an argument to another method which does a #become:
> on the argument, Bryce. This is an illusion; we are talking about
> Smalltalk.
>
> How do you handle this situation in Exupery?
>
> primitiveApplyToFromTo, for example, is robust: it does not do
> anything when an argument is #become:'ed behind its back to something
> else.

Arguments are immutable. Try compiling "selector: a a := 1".
We allow deep access and the possibility to extend the language. This is a good thing. However, you have stepped into language modification from normal programming. Having the power to do that easily is a wonderful thing; using it is almost always mistaken.

Exupery needs to bail out if you modify the context too much. It makes no assumptions when execution is outside of the method, except that neither the PC nor the stack pointer are touched. When it is executing, it owns the context. Exupery is safe because it can always drop back to the interpreter.

The issue with maintainability is how likely the implementation is to remain correct and how hard it is to verify that it is correct. It was only this morning that I was tolerably confident that it is correct ignoring interrupts. I have not done the work required to have any confidence that it is correct if an interrupt occurs; it may be.

> > Now, the way to optimise an expression that uses thisContext on a
> > system with Exupery or without your VM patch is to remove the use of
> > thisContext. The shadow code is going to be much slower on systems
> > that do not have the VM mod than the current implementation.
>
> This is the case for all primitive vs. shadow code comparisons. They
> must be slower. Have you ever seen the opposite?

Your shadow do: will be much slower than a regular do:. It will definitely be much slower after compilation.

Now, there is no guarantee that the code after a primitive is shadow code. It is not always. Sometimes it handles cases that the primitive doesn't. The only way to know for sure is to study both carefully and fully understand what both do in all cases where they're used.

Your modified VM will be slower executing sends on some architectures and some compilers executing some loads. It does more work. From my calculations, the slow down should be between 1% and 15%. It is possible that the magic of modern hardware will hide the cost in some cases, but not all.
You are doing more work in the common case to speed up a special case.

> > My policy for use of thisContext and tempAt:put: is that neither
> > will be compiled by Exupery,
>
> But #tempAt:put: is used by the debugger and friends (through
> clients)! Why blame the debugger that it *must* use #tempAt:put: -
> what would the Smalltalker do without it, have Java? </grin>

I never said we should remove #tempAt:put:. It is a fine and useful tool. I just object to it being abused to break a language rule in code that will be read by many people. I also object to it because it is much slower than the equivalent bytecodes. Your shadow code will allow your implementation to run, but it will slow down images when running on VMs without your primitive.

> > Re-calling a primitive from a return is close enough to re-entering
> > it. That is something that hasn't been done. It is a major design
> > change.
>
> When you look at #commonReturn, all you see is that another context is
> declared to be activeContext. This is irrelevant of my change. This
> happens as-is since the Blue Book.

What does the Blue Book have to do with this?

> But okay, you want it to be a design change; I can live with
> compliments ;-)
>
> > Your VM mod adds work to both common send and common return which
> > every send is going to have to execute.
>
> Right. That's the price to pay. What do you expect when adding new
> functionality to the VM, that the system runs faster? If so, you must
> pay the price.

You are not adding new functionality. This change is purely an optimisation. Therefore it must speed up the system, not just part of it.

> > Do you have any measurements to show that any realistic code is
> > spending enough time in do: loop overhead to justify slowing down
> > all sends?
>
> No, but since you are insisting on this all the time, I expect that
> you post that.

I'm not proposing changing the VM or do:. I'm arguing for the status quo. The burden of proof is on you.
You're also proposing an optimisation with high maintenance costs that replaces simple code with very clever code, and that adds cost to message sends, which are a very common operation. Such changes should be considered guilty until proven innocent beyond any doubt.

> > On my machine, I can do an interpreted message send every 291
> > clocks. You are adding at least 11 instructions to the send return
> > sequence which would take four clocks to execute at the peak
> > execution rate
>
> ... yes, I agree it has a price.
>
> > The performance effects of the VM mod will depend on the
> > architecture, the compiler, and how well the branch predictor
> > manages on those two extra branches. Unfortunately, to prove there
> > is not a speed loss, you are really going to need to test on many
> > architectures and compilers under many different loads.
>
> Since the VM is used on so many platforms, I do not see any problem
> getting this feedback.

The VM is also a mature, slow moving piece of software that many people rely on. VM bugs are painful. VM changes are a very conservative thing. It will take several years for a new VM to enter normal use after it's been released.

What are you proposing here? That we release this change, then discover if it's good or not afterwards? That we add yet another optimisation that may have no noticeable benefit and which adds a high maintenance risk? For this change it's worse: it adds work to message sends. All high level code has to bear that.

We have too many optimisations that provide negligible gain already. The ifNotNil: bug earlier was a perfect example. There were two implementations; they got out of sync. And in a normal system both will be used. And while we're at it, class should be a standard primitive, not a bytecode, so that people can override it if they wish. The VM change for class is tiny: just reimplement the bytecode to do a send, then let that execute the primitive.
> > An out of order CPU may be able to hide the cost of the extra instructions behind the current flabby send and return code. An in-order CPU will not be able to do this. So expect a greater performance loss on slower machines such as ARMs and other chips aimed at the hand-held and embedded market. Also the risks of a speed drop on a Pentium-M are greater than those on an Athlon; the Pentium-M manages to execute more instructions per clock when interpreting.
>
> Bryce, aren't you exaggerating when you blame young, innocent primitiveApplyToFromTo for performance loss out of all these technical reasons. A simple ABC analysis reveals that, A bytecode routines, B interpreter primitives and C message sends have frequency A>>B>>C with >> the usual much greater than.

Where are the numbers? Where is the analysis? You are not optimising primitive execution. You are optimising do:. You can only gain noticeably if most of the time is spent in do: overhead, not in the work that is done, and not waiting on memory. primitiveApplyToFromTo is more subtle than most of the primitives in the VM. The only thing I can think of that is more subtle is exception handling. That provides useful functionality; primitiveApplyToFromTo does not. All primitiveApplyToFromTo can provide is performance. So all its costs, including performance costs, must be justified by performance arguments.

> So if you want to save the VM's performance, get rid of performance lost in bytecode routines, thereafter in interpreter primitives and thereafter in message sends - not first C then B then A. I think that this is what you aim for with Exupery?
>
> > I calculated the clocks for a send from the clock speed 2.2 GHz and the sends/sec from tinyBenchmarks.
> >
> > 232,515,894 bytecodes/sec; 7,563,509 sends/sec
> >
> > Exupery's tiny benchmark numbers are:
> >
> > 1,151,856,017 bytecodes/sec; 16,731,576 sends/sec
>
> Fascinating.
> Is this with or without primitiveApplyToFromTo compiled into the VM.

Without primitiveApplyToFromTo applied.

> > So for Exupery a common case send costs 132 clocks. There is still plenty of room to remove waste from that. At 300 clocks, the cost is about 15% worst case and 1% best case without out-of-order execution being able to hide costs behind other delays. At 132 clocks, the numbers are much worse. There is a good chance that with more tuning Exupery's sends may be reduced to about 60 clocks without inlining. VisualWorks sends cost 30 clocks. Optimise sends and the best optimisation will be to remove primitiveApplyToFromTo if it's not a net loss in performance now.
>
> Will be to remove, Bryce?

I don't understand what you're asking here. Personally, I'd solve the original problem by either leaving the current implementation of occurrencesOf: or just using count: as it stands. I'm concerned about making the core more complex, both in the image and in the VM, for no real gains and possibly a loss in performance.

Bryce
Hi Bryce,
I hope we both are now tired enough from your comparison of Exupery to the proposed primitiveApplyToFromTo. You have not shown a convincing argument, in the sense that any other primitive has to be (and always was) treated the same way by the VM and undergoes the same performance penalties under the various CPUs. If we were to believe you, only primitiveApplyToFromTo would slow down the CPU; the existing primitives wouldn't do that.

Of course you are absolutely correct in pointing your finger at the additional instructions performed in #commonSend and #commonReturn, and at the apparent incompatibility of the proposed primitiveApplyToFromTo with Exupery's compiling technique (I took your words for granted; I'm not working with Exupery but with the standard VMMaker classes, with which primitiveApplyToFromTo is not incompatible).

Your arguing for the status quo is way too conservative for a living community. There was a reason why people asked for speeding up enumeration of collections, and also why it was suggested to consider the IBM approach #apply:from:to:. Keeping the performance of #do: and friends at the unoptimized level was just the opposite of what the initiators of the original discussion were talking about; the question was for better solutions.

Honestly, I wish you good luck with Exupery. Always keep cool 8-) For my part, I have experienced the beauty, power and elegance of #commonSend and #commonReturn, the two most underestimated routines in a Smalltalk message-sending VM :)

/Klaus
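[For readers following the thread: the point of the IBM-style approach is that the standard enumeration methods could all funnel through the one primitive-backed method shown at the top of the thread. A rough sketch only - selector and primitive number taken from the earlier post, not tested against any particular image:

    do: aBlock
        "Sketch: SequenceableCollection>>do: delegating to the proposed
         primitive-backed enumeration method from the earlier post."
        self applyTo: aBlock from: 1 to: self size

collect:, select: and friends could be rebuilt on top of #applyTo:from:to: in the same way, which is what is meant above by giving all the enumeration methods (more or less) identical performance.]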
Klaus D. Witzel writes:
> Hi Bryce,
>
> I hope we both are now tired enough from your comparison of Exupery to the proposed primitiveApplyToFromTo.

I am tired of this argument but I would appreciate it if you would try to respond to my arguments. My main issue is that your change makes the system more complex in a way visible to everybody: both to normal programmers looking at their calls to do: in walkbacks, and to people working in the VM.

I am not comparing your primitiveApplyToFromTo primitive to Exupery. The 1% to 15% extra cost to sends was calculated against the interpreter now, not Exupery now, and definitely not what I think it could be. I did not provide the numbers for how much those additions to common send would cost Exupery now, or potentially in the future, but they should be easy to work out (hint: multiply by 2 to 4).

I even told you what it would take to avoid both Exupery and an unpatched VM incurring a performance cost when executing your shadow over a standard shadow. That cost is a language change. The code change is trivial: just find the check that stops the bytecode compiler from compiling an assignment to an argument and delete it. The restriction against changing an argument is not in either the VM or in Exupery. It is a language choice implemented at a higher level.

Yes, I do end up chasing bugs across both commonSend and commonReturn code too regularly. Yes, I have found a few of my own bugs due to subtleties in it involving both GCs and Squeak interrupts. The additions to commonSend and commonReturn add extra complexity to an already dangerously complex area.

> You have not shown a convincing argument, in the sense that any other primitive has to be (and always was) treated the same way by the VM and undergoes the same performance penalties under the various CPUs. If we were to believe you, only primitiveApplyToFromTo would slow down the CPU, the existing primitives wouldn't do that.
Do you feel that simplicity and personal mastery are not important for a system like Squeak? Your change reduces both. I am asking you to justify the loss of clarity and simplicity inside the image that will be seen by normal non-VM programmers. I am also asking you to provide a case for an optimisation - an optimisation that adds overhead to a critical part of the system: message sends and all returns. To justify the optimisation you need to prove that it gains more than it loses.

Now, maybe you consider it unreasonable to be asked to prove that your optimisation does not lead to a loss of performance for real programs. I however feel that optimisations must lead to a net improvement that is worth more than their ongoing costs. Proving that an optimisation will not lead to a net loss in real use, without Exupery, cannot be an unreasonable request. I am not asking you to prove that this primitive is worth the cost of development - merely that it provides enough of a speed improvement to cover the costs of living with it.

> Of course you are absolutely correct in pointing your finger at the additional instructions performed in #commonSend and #commonReturn, and at the apparent incompatibility of the proposed primitiveApplyToFromTo with Exupery's compiling technique (I took your words for granted; I'm not working with Exupery but with the standard VMMaker classes, with which primitiveApplyToFromTo is not incompatible).

Exupery will survive in a system with primitiveApplyToFromTo; it will just not compile your shadow method. But this is irrelevant to why this primitive is inappropriate to add to the VM. The numbers I chose to use were for interpreted code, not native compiled code. OK, I did give Exupery's numbers as well, and the figures for VisualWorks.
> Your arguing for the status quo is way too conservative for a living community, there was a reason why people asked for speeding up enumeration of collections and also why it was suggested to consider the IBM approach #apply:from:to:.

It was an idle discussion asking for reasons. A newbie rightfully wanted to know why it was implemented as it was. At the moment we don't know. We don't know that a count: implementation of occurrencesOf: would be too slow.

There are very good reasons for being conservative with the VM. Many people depend on it who are not VM hackers. It has an awkward release schedule. VM bugs are a right pain.

Bryce

P.S. And yes, I could outline where Exupery is weak for executing do:. But improving Exupery's do: performance does not look like it will improve either of the large benchmarks I'm working with. Improving #do: performance does not currently look like it will improve Exupery's practicality.

P.P.S. If non-VM hackers would like to contribute either to Exupery's development or to Squeak's performance, then working on developing a better benchmark suite would be useful. Both benchmarks and the argument for why they matter are important.
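[For concreteness, the count:-based occurrencesOf: being weighed here would presumably be no more than the following - a sketch, not benchmarked; #count: is the standard Collection enumeration:

    occurrencesOf: anObject
        "Answer how many of the receiver's elements are equal to anObject,
         routed through the generic #count: enumeration."
        ^self count: [:each | each = anObject]

The open question in the thread is only whether routing this through the generic enumeration machinery is too slow, not whether it is correct.]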
In reply to this post by Bryce Kampjes
On 16 Sep 2006, at 14:48, Bryce Kampjes wrote:
> I'm not proposing changing the VM or do:. I'm arguing for the status quo. The burden of proof is on you. You're also proposing an optimisation with high maintenance costs that replaces simple code with very clever code, that adds cost to message sends which are a very common operation. Such changes should be considered guilty until proven innocent beyond any doubt.

+1

- Bert -
In reply to this post by Klaus D. Witzel
Klaus D. Witzel wrote:
> Your arguing for the status quo is way too conservative for a living community, there was a reason why people asked for speeding up enumeration of collections and also why it was suggested to consider the IBM approach #apply:from:to:.

I don't think it's conservative at all, just practical. Changing the VM, by adding a new primitive, will affect all VM development from that point forward. IMHO, such a change requires a rock-solid case to support it. At this point it seems hard to justify the complexity and fragility (i.e. depending on primitive failure code running) of the proposed performance optimization. Introducing a new VM primitive has got to be way low down on the list of ways to meet a performance requirement.
In reply to this post by Bryce Kampjes
Hi Bryce,
earlier you wrote:

> Your shadow code will allow your implementation to run but it will slow down images when running on VMs without your primitive.

This is a good reason for objection, I agree.

On Sun, 17 Sep 2006 00:10:00 +0200, you wrote:

> Klaus D. Witzel writes:
> > I hope we both are now tired enough from your comparison of Exupery to the proposed primitiveApplyToFromTo.
>
> I am tired of this argument but I would appreciate it if you would try to respond to my arguments.

I have neither the time nor the patience to go into your Exupery realm when "just" discussing the suggested primitiveApplyToFromTo. There is not that much experience with Exupery. In particular, I'm not interested in discussing in this thread that Exupery won't support use of thisContext and peeking and poking of contexts. Clients for such objections are, for example:

- http://www.shaffer-consulting.com/david/Seaside/dandelionOutput/Seaside-Continuations/Continuation.html

Don't misunderstand, I just do not want to discuss these particular objections of yours in this thread. Another example: suddenly #class "should be a standard primitive not a bytecode". I do not want to discuss this in this thread, even if it continues with

> so that people can override it [#class] if they wish. The VM change for class is tiny, just reimplement the bytecode to do a send then let that execute the primitive.

I do not want to argue for or against the performance loss of such a change, nor for or against

> [th]is your change makes the system more complex in a visible way to everybody

My last two examples of what I do not want to discuss in this thread are

> Also the common case [of existing #commonReturn] could be coded without the loops which risk branch mispredict when exiting.

and

> That cost is a language change.

-----------

Sorry for no better news :(

/Klaus
In reply to this post by Yanni Chiu
Hi Yanni,
on Sun, 17 Sep 2006 04:48:12 +0200, you wrote:

> Klaus D. Witzel wrote:
>> Your arguing for the status quo is way too conservative for a living community, there was a reason why people asked for speeding up enumeration of collections and also why it was suggested to consider the IBM approach #apply:from:to:.
>
> I don't think it's conservative at all, just practical. Changing the VM, by adding a new primitive, will affect all VM development from that point forward. IMHO, such a change requires a rock-solid case to support it. At this point it seems hard to justify the complexity and fragility (i.e. depending on primitive failure code running) of the proposed performance optimization. Introducing a new VM primitive has got to be way low down on the list of ways to meet a performance requirement.

I fully disagree :) I'm used to planning software change for years to come, usually the next 3-5 releases (equiv. 3-5 years). Now, there will be a time when older VMs will no longer run the then-current .image, for example when 4.0 comes out. Not making use of the time in between is, literally speaking, the often-cited waste of time, viewed from a more pragmatic perspective.

Poor young primitiveApplyToFromTo cannot be adapted to in a minute. There is much work to be done and much feedback to be collected from all the various platforms, before it can be concluded that "this was a good idea, it performs as expected". This has to happen in the time before 4.0, for sure. Not doing anything but waiting for Godot has nothing to do with innovation, only with stagnation, IMO.

Thanks for your comment.

/Klaus
In reply to this post by Klaus D. Witzel
Let me summarise my arguments as you have obviously not understood them or refuse to respond to them.

1) Your change complicates the system for everyone. Adding that complexity will make it harder to make any future changes in the areas affected. One of those areas is the VM's message sending code, which is critical.

2) Your changes make the system more complex to all Squeak programmers when they are just doing normal development. Having nice readable walkbacks is important.

3) Your VM changes add costs to message sends. Message sends are more common than do: loops. Thus with both your image changes and your VM changes the system may be slower. This is an optimisation; to have any value your change must speed up the system, not slow it down. Based on my analysis the costs will be in the range of 1% to 15% of send speed.

4) Your image changes will slow #do: down for all VMs that do not have your primitive. I have told you how to avoid the performance costs here, but the problem is you're trying to sidestep a deliberate language restriction against changing arguments.

You are proposing an optimisation that you refuse to demonstrate does not slow the system down. To have a case you need to demonstrate that the gains to #do: will be greater than the losses on message sends. Optimisations must provide enough speed improvement to justify the cost of living with them, and preferably the cost of developing them. No-one has yet demonstrated that there is a practical performance problem with either our current occurrencesOf: or even an implementation that used count:.

Bryce
Hi Bryce,
on Sun, 17 Sep 2006 12:12:36 +0200, you wrote:

> Let me summarise my arguments as you have obviously not understood them or refuse to respond to them.

Hey, this is it, I appreciate this summary of yours :)

> 1) Your change complicates the system for everyone. Adding that complexity will make it harder to make any future changes in the areas affected. One of those areas is the VM's message sending code, which is critical.

Agreed, this is the price to pay. Perhaps the components of send/return, especially #executeNewMethod with its many #ifTrue:ifFalse: on primitiveIndex, and return's kangaroo, could use a brush-up ;-)

> 2) Your changes make the system more complex to all Squeak programmers when they are just doing normal development. Having nice readable walkbacks is important.

Agreed, the debugger needs change.

> 3) Your VM changes add costs to message sends. Message sends are more common than do: loops. Thus with both your image changes and your VM changes the system may be slower.

Agreed - would you say that we have as yet not considered amortization?

> This is an optimisation; to have any value your change must speed up the system, not slow it down. Based on my analysis the costs will be in the range of 1% to 15% of send speed.

I posted a comparison in the first message, but of course it didn't include the cost of all other message sends. I appreciate your analysis of 1% to 15%, but this is just a static, dry-run view. I do not expect the cost to come close to 1% on average. Amortization is required.

> 4) Your image changes will slow #do: down for all VMs that do not have your primitive.

I do not expect primitiveApplyToFromTo to be possible before 4.0, and I think that older VMs will not run 4.x images for other reasons anyway.

> I have told you how to avoid the performance costs here, but the problem is you're trying to sidestep a deliberate language restriction against changing arguments.

You perhaps misunderstood.
In Smalltalk every argument value can be changed behind your back, whether you like it or not. I do not see this as a language restriction, that's correct. Everything is an object.

> You are proposing an optimisation that you refuse to demonstrate does not slow the system down.

This has nothing to do with unwillingness. I wrote earlier:

> It would be interesting to compare the performance figures for platforms other than win32.

And I am considering how to measure the performance cost of #commonSend and #commonReturn.

> To have a case you need to demonstrate that the gains to #do: will be greater than the losses on message sends.

Sure, that's the idea.

> Optimisations must provide enough speed improvement to justify the cost of living with them and preferably the cost of developing them. No-one has yet demonstrated that there is a practical performance problem with either our current occurrencesOf: or even an implementation that used count:.

That would be ignoring the simple benchmarks of Jon's, with my follow-up, and the figures I posted in this thread.

> Bryce

Heh, 'twas the first dialog without the E...y word (since our first meeting in the #squeak channel ;-)

/Klaus
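[One crude image-side probe for raw send cost, in the spirit of the tinyBenchmarks figures quoted earlier - a sketch only; #yourself stands in for any cheap unary send, #to:do: is compiler-inlined so the loop itself is cheap, and absolute numbers will vary with VM and hardware:

    | n empty sends |
    n := 10000000.
    Smalltalk garbageCollect.
    empty := Time millisecondsToRun: [1 to: n do: [:i | ]].
    sends := Time millisecondsToRun: [1 to: n do: [:i | i yourself. i yourself]].
    sends - empty "milliseconds attributable to roughly 2*n unary sends"

Running this before and after patching #commonSend/#commonReturn would give a first estimate of the per-send overhead the patch adds.]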
In reply to this post by Klaus D. Witzel
On 17 Sep 2006, at 03:37, Klaus D. Witzel wrote:
> Not doing anything but waiting for Godot has nothing to do with innovation, only with stagnation, IMO.

Keep on discussing, implementing, benchmarking, etc. :) This might even convince people who would rather err on the conservative side. For example, I'd strongly oppose anything that reduces general send performance, whereas requiring a new VM for a new image to run at full speed is fine by me.

- Bert -
Hi Bert,
on Sun, 17 Sep 2006 16:17:10 +0200, you wrote:

> On 17 Sep 2006, at 03:37, Klaus D. Witzel wrote:
>
>> Not doing anything but waiting for Godot has nothing to do with innovation, only with stagnation, IMO.
>
> Keep on discussing, implementing, benchmarking, etc. :) This might even convince people who would rather err on the conservative side.

:)

> For example, I'd strongly oppose anything that reduces general send performance,

I am thinking about moving the glue (in #commonSend, #commonReturn) out of the way, to a place where performance cannot suffer, where the price is already paid: 0%.

> whereas requiring a new VM for a new image to run at full speed is fine by me.

That's perhaps the best way to distribute it: as an optional .mcz package (the user just recompiles the VM), together with the changes to the Collection hierarchy. So the burden is not on the VM maintainers (and other maintainers). Applying KISS.

/Klaus

> - Bert -
In reply to this post by Klaus D. Witzel
Klaus D. Witzel writes:
> Heh, 'twas the first dialog without the E...y word (since our first meeting in the #squeak channel ;-)

Exupery is not a swear word. If you're planning for something to be added with a 4.0 then Exupery should definitely be considered. It does exist. It does beat VisualWorks for the bytecode benchmark. It is probably as fast as Strongtalk for bytecode performance as well now (based on comparing the numbers Dan gave with the ones I got; not apples to apples on the same hardware). Exupery is relevant if we're talking about Squeak's performance over the next few years. Exupery is Squeak's and Smalltalk's best current hope of beating both Strongtalk's and Self's pioneering work on high-performance implementations.

The cost to Exupery should not be ignored if the community does want Exupery. I have not argued against your change because of this cost, though I do object to your statement earlier that Exupery should be ignored for now. You ask that this change be given a chance. I have invested over 2 days into analysing it and discussing it. Your claim that something I have worked on for about 4 years should be ignored is not pleasant. My investment in your change is not that much less than yours.

I have not argued that your change should be discounted solely because it will negatively affect performance when running with Exupery. I've even told you how the shadow can be implemented so it will not negatively impact performance on either the interpreter or with Exupery: just use bytecodes to assign to the argument, not tempAt:put:. Neither Exupery nor the interpreter knows the difference between an argument and a temporary; that distinction, while a language one, is made at a higher level.

If you want to discuss where there are opportunities to speed up Squeak that will not mess up the language or its implementation, consider this: Anthony Hannan's send performance work was rejected even though it did speed up sends and came with proper block closures.
Proper block closures are something we really should have. My work on Exupery provided me with the knowledge, and also the experience, of evaluating optimisations, but it is my experience with GemStone, VisualWorks, and IBM Smalltalk that provides my dislike of the costs of living with such optimisations and cleverness.

Bryce
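[To make the suggestion above concrete: if the compiler's check against storing into an argument were deleted (the language change Bryce describes), the shadow method from the original post could drop the thisContext tempAt:put: send and use an ordinary store bytecode. A sketch - it does not compile in a stock image precisely because of that check; primitive number as in the earlier post:

    applyTo: aBlock from: index to: limit
        <primitive: 164>
        "Shadow for the proposed primitive. Assigning to the argument
         index directly replaces the slow thisContext tempAt:put: send,
         so an unpatched VM (or Exupery) pays no extra cost."
        [index <= limit] whileTrue:
            [aBlock value: (self basicAt: index).
             index := index + 1]

The interpreter executes the same store bytecode either way; only the compile-time restriction is in the way.]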
In reply to this post by Klaus D. Witzel
Klaus D. Witzel writes:
> > whereas requiring a new VM for a new image to run at full speed is fine by me.
>
> That's perhaps the best way to distribute it, as an optional .mcz package (the user just recompiles the VM), together with the changes to the Collection hierarchy. So the burden is not on the VM maintainers (and other maintainers). Applying KISS.

Distributing it as an optional .mcz package is sensible.

Bryce