Hi All,
so the performance benefit of the special selectors is interesting. One thing the interpreter does on all relational special selectors #= #~= #< #<= #> #>= is statically predict SmallIntegers and/or Floats and look ahead for a following conditional branch, evaluating the branch immediately, avoiding reifying the result of the relational as either true or false and later testing for it, hence jumping directly on the condition codes.
e.g.

bytecodePrimLessThan
	| rcvr arg aBool |
	rcvr := self internalStackValue: 1.
	arg := self internalStackValue: 0.
	(self areIntegers: rcvr and: arg) ifTrue:
		["The C code can avoid detagging since tagged integers are still signed.
		  But this means the simulator must override to do detagging."
		 ^self cCode: [self booleanCheat: rcvr < arg]
			inSmalltalk: [self booleanCheat: (objectMemory integerValueOf: rcvr) < (objectMemory integerValueOf: arg)]].
	self initPrimCall.
	aBool := self primitiveFloatLess: rcvr thanArg: arg.
	self successful ifTrue: [^ self booleanCheat: aBool].
	messageSelector := self specialSelector: 2.
	argumentCount := 1.
	self normalSend

booleanCheat: cond
	"cheat the interpreter out of the pleasure of handling the next bytecode IFF it is a jump-on-boolean. Which it is, often enough when the current bytecode is something like bytecodePrimEqual"
	<inline: true>
	cond
		ifTrue: [self booleanCheatTrue]
		ifFalse: [self booleanCheatFalse]

booleanCheatFalse
	"cheat the interpreter out of the pleasure of handling the next bytecode IFF it is a jump-on-boolean. Which it is, often enough when the current bytecode is something like bytecodePrimEqual"
	| bytecode offset |
	<sharedCodeNamed: 'booleanCheatFalse' inCase: 179>
	bytecode := self fetchByte. "assume next bytecode is jumpIfFalse (99%)"
	self internalPop: 2.
	(bytecode < 160 and: [bytecode > 151]) ifTrue: "short jumpIfFalse"
		[^self jump: bytecode - 151].
	bytecode = 172 ifTrue: "long jumpIfFalse"
		[offset := self fetchByte.
		 ^self jump: offset].
	"not followed by a jumpIfFalse; undo instruction fetch and push boolean result"
	localIP := localIP - 1.
	self fetchNextBytecode.
	self internalPush: objectMemory falseObject

booleanCheatTrue
	"cheat the interpreter out of the pleasure of handling the next bytecode IFF it is a jump-on-boolean. Which it is, often enough when the current bytecode is something like bytecodePrimEqual"
	| bytecode |
	<sharedCodeNamed: 'booleanCheatTrue' inCase: 178>
	bytecode := self fetchByte. "assume next bytecode is jumpIfFalse (99%)"
	self internalPop: 2.
	(bytecode < 160 and: [bytecode > 151]) ifTrue: "short jumpIfFalse"
		[^self fetchNextBytecode].
	bytecode = 172 ifTrue: "long jumpIfFalse"
		[self fetchByte.
		 ^self fetchNextBytecode].
	"not followed by a jumpIfFalse; undo instruction fetch and push boolean result"
	localIP := localIP - 1.
	self fetchNextBytecode.
	self internalPush: objectMemory trueObject
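As an aside, you can see the pattern these methods look for in the loop's bytecode listing below: the jumpFalse: that follows the send of #<= is bytecode 16r9D = 157, which falls inside the short-jumpIfFalse range (152 to 159) tested by "bytecode < 160 and: [bytecode > 151]", so the cheat takes the fast path. A quick Workspace check of the arithmetic:

	16r9D.		"=> 157"
	157 - 151.	"=> 6; and 31 + 6 = 37, matching the '30 <9D> jumpFalse: 37' in the listing below"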
Until yesterday Cog didn't do this for jitted code. The conversation about using #= for testing got me motivated to implement it. Note that VisualWorks' HPS VM does something even more aggressive than booleanCheat:. So what are we talking about? Here's my micro-benchmark:

	Time millisecondsToRun: [1 to: 100000000 do: [:i| ]]

The bytecode compiler optimizes the inner loop to code equivalent to

	| i |
	i := 1.
	[i <= 100000000] whileTrue: [i := i + 1]

Here's the Squeak bytecode:

25 <76> pushConstant: 1
26 <68> popIntoTemp: 0
27 <10> pushTemp: 0
28 <20> pushConstant: 100000000
29 <B4> send: <=
30 <9D> jumpFalse: 37
31 <10> pushTemp: 0
32 <76> pushConstant: 1
33 <B0> send: +
34 <68> popIntoTemp: 0
35 <A3 F6> jumpTo: 27
37:

And here's the VisualWorks bytecode:

6 <4A> push 1
7 <4C> store local 0; pop
8 <67> loop head
9 <10> push local 0
10 <1C> push 100000000
11 <A4> send <=
12 <C4> jump false 18
13 <10> push local 0
14 <C8> push 1; send +
15 <4C> store local 0; pop
16 <E3 F6> jump 8
18:

Note that loopHead is a no-op that allows the HPS JIT to be one pass; it tells the JIT that there is a backward branch to this instruction, and hence that loopHead is a control-flow join, which means that the registers that cache the receiver and its base pointer (VW has an indirection from the object header to an object body) must be reloaded if needed.
So the current version of Cog generates the following code for the loop:

25 <76> pushConstant: 1
a1ea4: movl $0x00000003, %eax : B8 03 00 00 00
a1ea9: pushl %eax : 50
26 <68> popIntoTemp: 0
a1eaa: popl %eax : 58
a1eab: movl %eax, %ss:0xfffffff0(%ebp) : 89 45 F0
27 <10> pushTemp: 0
a1eae: movl %ss:0xfffffff0(%ebp), %eax : 8B 45 F0
a1eb1: pushl %eax : 50
28 <20> pushConstant: 100000000
a1eb2: pushl $0x0bebc201 : 68 01 C2 EB 0B
29 <B4> send: <=
a1eb7: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a1ebb: movl $0x0045a458=#<=, %ecx : B9 58 A4 45 00
a1ec0: call .+0xfff5e5cb (0x00000490=ceSend1Args) : E8 CB E5 F5 FF
IsSendCall:
a1ec5: pushl %edx : 52
30 <9D> jumpFalse: 37
a1ec6: popl %eax : 58
a1ec7: subl $0x00172270=false, %eax : 2D 70 22 17 00
IsObjectReference:
a1ecc: jz .+0x0000003d (0x000a1f0b=16rA1E50@BB) : 74 3D
a1ece: cmpl $0x00000008, %eax : 83 F8 08
a1ed1: jz .+0x0000000b (0x000a1ede=16rA1E50@8E) : 74 0B
a1ed3: addl $0x00172270=false, %eax : 05 70 22 17 00
IsObjectReference:
a1ed8: pushl %eax : 50
a1ed9: call .+0xfff5e902 (0x000007e0=ceSendMustBeBooleanTrampoline) : E8 02 E9 F5 FF
HasBytecodePC:
31 <10> pushTemp: 0
a1ede: movl %ss:0xfffffff0(%ebp), %eax : 8B 45 F0
a1ee1: pushl %eax : 50
32 <76> pushConstant: 1
a1ee2: movl $0x00000003, %eax : B8 03 00 00 00
a1ee7: pushl %eax : 50
33 <B0> send: +
a1ee8: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a1eec: movl $0x0045ae64=#+, %ecx : B9 64 AE 45 00
a1ef1: call .+0xfff5e59a (0x00000490=ceSend1Args) : E8 9A E5 F5 FF
IsSendCall:
a1ef6: pushl %edx : 52
34 <68> popIntoTemp: 0
a1ef7: popl %eax : 58
a1ef8: movl %eax, %ss:0xfffffff0(%ebp) : 89 45 F0
35 <A3 F6> jumpTo: 27
a1efb: movl %ds:0x0013aafc=&stackLimit, %eax : A1 FC AA 13 00
a1f00: cmpl %eax, %esp : 39 C4
a1f02: jnb .+0xffffffaa (0x000a1eae=16rA1E50@5E) : 73 AA
a1f04: call .+0xfff5e907 (0x00000810=ceCheckForInterruptsTrampoline) : E8 07 E9 F5 FF
HasBytecodePC:
a1f09: jmp .+0xffffffa3 (0x000a1eae=16rA1E50@5E) : EB A3
37: 0x000a1f0b/16rA1E50@BB:

And here's the primitive part of the SmallInteger>>#<= method; the SmallInteger>>#+ method is similar, so I'll omit it for brevity.
entry: (check that the receiver matches the inline cache)
00002498: movl %edx, %eax : 89 D0
0000249a: andl $0x00000001, %eax : 83 E0 01
0000249d: jnz .+0x00000010 (0x000024af=<=@37) : 75 10 (jump if the receiver is a SmallInteger, which in our case it is)
0000249f: movl %ds:(%edx), %eax : 8B 02
000024a1: shrl $0x0a, %eax : C1 E8 0A
000024a4: andl $0x0000007c, %eax : 83 E0 7C
000024a7: jnz .+0x00000006 (0x000024af=<=@37) : 75 06
000024a9: movl %ds:0xfffffffc(%edx), %eax : 8B 42 FC
000024ac: andl $0xfffffffc, %eax : 83 E0 FC
000024af: cmpl %ecx, %eax : 39 C8 (compare the receiver's tags against the inline cache ecx, which in our case will match)
000024b1: jnz .+0xffffffdf (0x00002492=<=@1A) : 75 DF
noCheckEntry: (SmallInteger primitive #<=)
000024b3: movl %ss:0x4(%esp), %eax : 8B 44 24 04
000024b7: movl %eax, %ecx : 89 C1
000024b9: andl $0x00000001, %eax : 83 E0 01
000024bc: jz .+0x00000014 (0x000024d2=<=@5A) : 74 14 (is the argument a SmallInteger, which in our case it will be)
000024be: cmpl %ecx, %edx : 39 CA
000024c0: jle .+0x00000008 (0x000024ca=<=@52) : 7E 08
000024c2: movl $0x00172270=false, %edx : BA 70 22 17 00 (answer the false object)
IsObjectReference:
000024c7: ret $0x0008 : C2 08 00
000024ca: movl $0x00172278=true, %edx : BA 78 22 17 00 (answer the true object)
IsObjectReference:
000024cf: ret $0x0008 : C2 08 00
000024d2:
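As an aside, the tagged constants in these listings are easy to check by hand, assuming the usual 32-bit Squeak encoding of a SmallInteger n as the oop 2n + 1:

	(1 bitShift: 1) + 1.			"=> 3, the 0x00000003 pushed for the constant 1"
	(100000000 bitShift: 1) + 1.	"=> 200000001, i.e. the 0x0bebc201 pushed for 100000000"

and 0x00172270/0x00172278 are simply the addresses of the false and true objects.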
You can see that the Cog code generator is extremely naive; it's generating RISC code: there are no compares against memory, only against registers, and so on. The question I want you to ask yourself is how much faster the loop will go if we add code to short-circuit the send of #<= and the comparison against true and false with a tag test and a direct comparison. I'll answer it below, but try to answer it yourself first. Basically we should be speeding up one of the loop's three main parts substantially, those parts being a) the compare and branch, [i <= 100000000] whileTrue:, b) the increment, i := i + 1, and c) the backward branch at the end of the while loop, which checks against the stackLimit so the VM can break out if there is an event (e.g. ctrl-.).
OK, so here's the code with the booleanCheat implemented in the JIT:

25 <76> pushConstant: 1
a52cc: movl $0x00000003, %eax : B8 03 00 00 00
a52d1: pushl %eax : 50
26 <68> popIntoTemp: 0
a52d2: popl %eax : 58
a52d3: movl %eax, %ss:0xfffffff0(%ebp) : 89 45 F0
27 <10> pushTemp: 0
a52d6: movl %ss:0xfffffff0(%ebp), %eax : 8B 45 F0
a52d9: pushl %eax : 50
28 <20> pushConstant: 100000000
a52da: pushl $0x0bebc201 : 68 01 C2 EB 0B
29 <B4> send: <=
a52df: movl %ss:(%esp), %esi : 8B 34 24
a52e2: movl %esi, %eax : 89 F0
a52e4: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a52e8: andl %edx, %eax : 23 C2
a52ea: andl $0x00000001, %eax : 83 E0 01 (are receiver and arg SmallIntegers, which they are)
a52ed: jz .+0x00000009 (0x000a52f8=16rA5278@80) : 74 09
a52ef: addl $0x00000008, %esp : 83 C4 08
a52f2: cmpl %esi, %edx : 39 F2 (compare directly)
a52f4: jnle .+0x00000056 (0x000a534c=16rA5278@D4) : 7F 56 (jump on condition codes)
a52f6: jmp .+0x00000027 (0x000a531f=16rA5278@A7) : EB 27 (jump past the send and the jumpFalse:)
a52f8: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a52fc: movl $0x0045a458, %ecx : B9 58 A4 45 00
a5301: call .+0xfff5b18a (0x00000490=ceSend1Args) : E8 8A B1 F5 FF
IsSendCall:
a5306: pushl %edx : 52
30 <9D> jumpFalse: 37
a5307: popl %eax : 58
a5308: subl $0x00172270=false, %eax : 2D 70 22 17 00
IsObjectReference:
a530d: jz .+0x0000003d (0x000a534c=16rA5278@D4) : 74 3D
a530f: cmpl $0x00000008, %eax : 83 F8 08
a5312: jz .+0x0000000b (0x000a531f=16rA5278@A7) : 74 0B
a5314: addl $0x00172270=false, %eax : 05 70 22 17 00
IsObjectReference:
a5319: pushl %eax : 50
a531a: call .+0xfff5b4c1 (0x000007e0=ceSendMustBeBooleanTrampoline) : E8 C1 B4 F5 FF
HasBytecodePC:
31 <10> pushTemp: 0
a531f: movl %ss:0xfffffff0(%ebp), %eax : 8B 45 F0
a5322: pushl %eax : 50
32 <76> pushConstant: 1
a5323: movl $0x00000003, %eax : B8 03 00 00 00
a5328: pushl %eax : 50
33 <B0> send: +
a5329: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a532d: movl $0x0045ae64=#+, %ecx : B9 64 AE 45 00
a5332: call .+0xfff5b159 (0x00000490=ceSend1Args) : E8 59 B1 F5 FF
IsSendCall:
a5337: pushl %edx : 52
34 <68> popIntoTemp: 0
a5338: popl %eax : 58
a5339: movl %eax, %ss:0xfffffff0(%ebp) : 89 45 F0
35 <A3 F6> jumpTo: 27
a533c: movl %ds:0x0013aafc=&stackLimit, %eax : A1 FC AA 13 00
a5341: cmpl %eax, %esp : 39 C4
a5343: jnb .+0xffffff91 (0x000a52d6=16rA5278@5E) : 73 91
a5345: call .+0xfff5b4c6 (0x00000810=ceCheckForInterruptsTrampoline) : E8 C6 B4 F5 FF
HasBytecodePC:
a534a: jmp .+0xffffff8a (0x000a52d6=16rA5278@5E) : EB 8A

So roughly 1/3 of the loop has been sped up substantially. How much speedup? 2.7%. 2.7 measly percent. A good 2-3 hours' work for 2.7 measly percent?!? i.e. 691 milliseconds fell to 672 milliseconds (and the measurements are nicely repeatable).
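To spell out the arithmetic behind that figure, a quick Workspace check using the numbers above:

	(691 - 672) / 691.0		"=> ~0.0275, i.e. about 2.7%"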
Well, one reason for the limited performance increase could be that the event check at the backward branch is being taken a lot and dominating costs. So what happens if I change the heartbeat from 2KHz to 20KHz? 672 falls to 664, so only 9 milliseconds or so is due to the stack limit event check.
Perhaps the x86 is so good at optimizing the code that it can't be sped up. Well, let's have a look at HPS's performance. It is *4* times faster: 168 ms vs 672 ms (that's *exactly* 4 times faster :) ). So what code does HPS generate? First, its back-end is less naive; it doesn't move intermediate results through registers all the time. But most significantly it is able to know that both the 100000000 and the + 1 are SmallIntegers, because it delays generating code until it gets to a send. So its code is a /lot/ better. It short-circuits both the compare-and-branch and the increment. Here it is (remember that Squeak has only one tagged type, SmallInteger with tag 1, whereas 32-bit VW has two tagged types, SmallInteger and Character with tags 3 and 1 respectively, so in VW the oop for 1 is 0x7):
6 <4A> push 1
7 <4C> store local 0; pop
nm@039: xor rTemp,rTemp ! 33 c0
nm@03b: movb $7,%al ! b0 07
nm@03d: mov rTemp,-0x10(bFrame) ! 89 45 f0
8 <67> loop head
9 <10> push local 0
nm@040: mov -0x10(bFrame),rReceiver ! 8b 5d f0
10 <1C> push 100000000
nm@043: mov $0x17d78403,rArg1 ! be 03 84 d7 17
11 <A4> send <=
nm@048: testb $2,%bl ! f6 c3 02 a.k.a. testb $2,rReceiver
nm@04b: jz nm@056 ! 74 09
nm@04d: cmp rArg1,rReceiver ! 3b de
nm@04f: jle nm@075 ! 7e 24
nm@051: jmp nm@0ab ! e9 55 00 00 00
nm@056: mov $0x16a8ec6c,rClass ! ba 6c ec a8 16
nm@05b: call _81080=send1Args ! e8 28 e7 6b ea
map: 0x3a: send1(26) #<=
vpc 12:
12 <C4> jump false 18
nm@060: cmp $0x16b33da4,rReceiver ! 81 fb a4 3d b3 16
nm@066: jz nm@0ab ! 74 43
nm@068: cmp $0x16da90e4,rReceiver ! 81 fb e4 90 da 16
nm@06e: jz nm@075 ! 74 05
nm@070: call _159cadd8 ! e8 6b 84 00 00
nm@075: mov -0x10(bFrame),rReceiver ! 8b 5d f0
nm@078: testb $2,%bl ! f6 c3 02 a.k.a. testb $2,rReceiver
nm@07b: jz nm@085 ! 74 08
13 <10> push local 0
14 <C8> push 1; send +
nm@07d: add $4,rReceiver ! 83 c3 04
nm@080: jno nm@092 ! 71 10
nm@082: sub $4,rReceiver ! 83 eb 04
nm@085: push $7 ! 6a 07
nm@087: pop rArg1 ! 5e
nm@088: mov $0x16bbab34,rClass ! ba 34 ab bb 16
nm@08d: call _81080=send1Args ! e8 f6 e6 6b ea
map: 0x22: send1(2) #+
15 <4C> store local 0; pop
nm@092: mov rReceiver,-0x10(bFrame) ! 89 5d f0
16 <E3 F6> jump 8
nm@095: cmp 0xfb190,bSP ! 3b 25 90 b1 0f 00
nm@09b: jae nm@040 ! 0f 83 9f ff ff ff
nm@0a1: call _80d30 ! e8 92 e3 6b ea
vpc 18:
nm@0a6: jmp nm@040 ! e9 95 ff ff ff
Interesting. VW executes the following instructions around the loop:

nm@040: mov -0x10(bFrame),rReceiver ! 8b 5d f0
nm@043: mov $0x17d78403,rArg1 ! be 03 84 d7 17
nm@048: testb $2,%bl ! f6 c3 02 a.k.a. testb $2,rReceiver
nm@04b: jz nm@056 ! 74 09
nm@04d: cmp rArg1,rReceiver ! 3b de
nm@04f: jle nm@075 ! 7e 24
nm@051: jmp nm@0ab ! e9 55 00 00 00
...
nm@07d: add $4,rReceiver ! 83 c3 04
nm@080: jno nm@092 ! 71 10
...
nm@092: mov rReceiver,-0x10(bFrame) ! 89 5d f0
nm@095: cmp 0xfb190,bSP ! 3b 25 90 b1 0f 00
nm@09b: jae nm@040 ! 0f 83 9f ff ff ff
12 instructions, one read and one write. Cog executes many more, quite a few of them reads and writes. How I itch to find the time to do delayed code generation/stack-to-register mapping in Cog...
best,
Eliot
On 19 November 2010 21:48, Eliot Miranda <[hidden email]> wrote:
> [snip]
> So what are we talking about? Here's my micro-benchmark:
> Time millisecondsToRun: [1 to: 100000000 do: [:i| ]]
> The bytecode compiler optimizes the inner loop to code equivalent to
> | i |
> i := 1.
> [i <= 100000000] whileTrue: [i := i + 1]

I think that with aggressive optimization this loop can be turned into a no-op, or more precisely into an instruction which sets i to 100000001 :)
But I wonder what kind of analysis should be applied to determine that the loop is bounded by 100000000, and that there are no side effects other than incrementing the counter. Btw, I think it is much easier to optimize the original 1 to: 100000000 do:, since you know the loop bounds beforehand and need only analyze whether the loop body has any side effects. In that way it is better to analyze the code at the AST level than at the bytecode level, since once you turn it into bytecode you lose the precious information that the loop has bounds, and have no choice but to strictly follow bytecode semantics. Of course, Cog has little choice here, since it can't operate on an AST, just on bytecode.

[snip]

> best,
> Eliot

--
Best regards,
Igor Stasenko AKA sig.
In reply to this post by Eliot Miranda-2
Hi All,
so you'll probably have seen that I actually got round to implementing delayed code generation/stack-to-register mapping in Cog, specifically in the StackToRegisterMappingCogit. So let me show you the generated code for comparison with that in my earlier message above. The scheme implemented by StackToRegisterMappingCogit is to defer creating machine code until compiling a bytecode that consumes operands; any bytecode that merely produces an operand (a stack push) is handled by pushing a descriptor of that operand onto a "simulation stack" in the JIT, which holds operand descriptors until they're needed and functions as a simple register allocator. When compiling a bytecode that consumes operands, such as a send, a store or a return, the JIT examines the descriptors on the simulation stack and so has type information, in particular whether an operand is a SmallInteger or not. The VM also implements a register-based calling convention for arg counts 0 and 1 (self and up to 1 arg passed in registers), which has the effect of avoiding the stack for high-dynamic-frequency arithmetic and accessing primitives.
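To make the descriptor idea concrete, here is a toy Workspace sketch; it assumes nothing about the real Cogit's classes or selectors (the block names are invented for illustration, and it "emits" strings rather than machine code). The point is only that type information survives until the consumer of the operands is compiled:

	| simStack genPush genSendLE |
	simStack := OrderedCollection new.
	"A push bytecode merely records a descriptor (kind -> value) on the simulation stack."
	genPush := [:kind :val | simStack addLast: kind -> val].
	"Only a consumer, here the send of #<=, inspects the descriptors, so it can see
	 that the argument is a SmallInteger constant and inline the comparison."
	genSendLE :=
		[:stack | | rcvr arg |
		 arg := stack removeLast.
		 rcvr := stack removeLast.
		 (arg key == #constant and: [arg value isInteger])
			ifTrue: ['emit tag test on receiver, cmp against ', arg value printString, ', jump on the condition codes']
			ifFalse: ['emit a full send of #<=']].

	genPush value: #temp value: 0.				"27 <10> pushTemp: 0"
	genPush value: #constant value: 100000000.	"28 <20> pushConstant: 100000000"
	genSendLE value: simStack					"print-it: the fast, inlined case is chosen"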
So here's the actual generated code. The microbenchmark is again [1 to: 100000000 do: [:i| ]] timeToRun, and my earlier message above shows what the SimpleStackBasedCogit generates for it. Here's the code the StackToRegisterMappingCogit generates for the loop:
25 <76> pushConstant: 1
86138: movl $0x00000003, %eax : B8 03 00 00 00
26 <68> popIntoTemp: 0
8613d: movl %eax, -16(%ebp) : 89 45 F0
27 <10> pushTemp: 0
28 <20> pushConstant: 100000000
29 <B4> send: <=
30 <9D> jumpFalse: 37
86140: movl -16(%ebp), %edx : 8B 55 F0
86143: movl %edx, %eax : 89 D0
86145: andl $0x00000001, %eax : 83 E0 01
86148: jz .+0x0000000a (0x00086154=16r86090@C4) : 74 0A
8614a: cmpl $0x0bebc201, %edx : 81 FA 01 C2 EB 0B
86150: jnle .+0x00000058 (0x000861aa=16r86090@11A) : 7F 58
86152: jmp .+0x00000022 (0x00086176=16r86090@E6) : EB 22
(29 <B4> send: <= if i is not a SmallInteger; args passed in edx and esi, result in edx)
86154: movl $0x0bebc201, %esi : BE 01 C2 EB 0B
86159: movl $0x00448c9c=#<=, %ecx : B9 9C 8C 44 00
8615e: call .+0xfff7a30d (0x00000470=ceSend1Args) : E8 0D A3 F7 FF
IsSendCall:
(30 <9D> jumpFalse: 37 if #<= is actually sent)
86163: movl %edx, %eax : 89 D0
86165: subl $0x00162270=false, %eax : 2D 70 22 16 00
8616a: jz .+0x0000003e (0x000861aa=16r86090@11A) : 74 3E
8616c: cmpl $0x00000008, %eax : 83 F8 08
8616f: jz .+0x00000005 (0x00086176=16r86090@E6) : 74 05
86171: call .+0xfff7a7e2 (0x00000958=ceSendMustBeBooleanAddFalseTrampoline) : E8 E2 A7 F7 FF
HasBytecodePC:
31 <10> pushTemp: 0
32 <76> pushConstant: 1
33 <B0> send: +
86176: movl -16(%ebp), %edx : 8B 55 F0
86179: movl %edx, %eax : 89 D0
8617b: andl $0x00000001, %eax : 83 E0 01
8617e: jz .+0x00000008 (0x00086188=16r86090@F8) : 74 08
86180: addl $0x00000002, %edx : 83 C2 02
86183: jno .+0x00000012 (0x00086197=16r86090@107) : 71 12
(33 <B0> send: + if either i is not a SmallInteger or if the addition overflows; args passed in edx and esi)
86185: subl $0x00000002, %edx : 83 EA 02
86188: movl $0x00000003, %esi : BE 03 00 00 00
8618d: movl $0x004496a8=#+, %ecx : B9 A8 96 44 00
86192: call .+0xfff7a2d9 (0x00000470=ceSend1Args) : E8 D9 A2 F7 FF
IsSendCall:
34 <68> popIntoTemp: 0
86197: movl %edx, -16(%ebp) : 89 55 F0
35 <A3 F6> jumpTo: 27
8619a: movl %ds:0x0013aafc=&stackLimit, %eax : A1 FC AA 13 00
8619f: cmpl %eax, %esp : 39 C4
861a1: jnb .+0xffffff9d (0x00086140=16r86090@B0) : 73 9D
861a3: call .+0xfff7a920 (0x00000ac8=ceCheckForInterruptsTrampoline) : E8 20 A9 F7 FF
HasBytecodePC:
861a8: jmp .+0xffffff96 (0x00086140=16r86090@B0) : EB 96

So now Cog is evaluating the following instructions round the loop:
86140: movl -16(%ebp), %edx : 8B 55 F0
86143: movl %edx, %eax : 89 D0
86145: andl $0x00000001, %eax : 83 E0 01
86148: jz .+0x0000000a (0x00086154=16r86090@C4) : 74 0A
8614a: cmpl $0x0bebc201, %edx : 81 FA 01 C2 EB 0B
86150: jnle .+0x00000058 (0x000861aa=16r86090@11A) : 7F 58
...
86176: movl -16(%ebp), %edx : 8B 55 F0
86179: movl %edx, %eax : 89 D0
8617b: andl $0x00000001, %eax : 83 E0 01
8617e: jz .+0x00000008 (0x00086188=16r86090@F8) : 74 08
86180: addl $0x00000002, %edx : 83 C2 02
86183: jno .+0x00000012 (0x00086197=16r86090@107) : 71 12
...
86197: movl %edx, -16(%ebp) : 89 55 F0
8619a: movl %ds:0x0013aafc=&stackLimit, %eax : A1 FC AA 13 00
8619f: cmpl %eax, %esp : 39 C4
861a1: jnb .+0xffffff9d (0x00086140=16r86090@B0) : 73 9D

which compares well with those executed by VW:
nm@040: mov -0x10(bFrame),rReceiver ! 8b 5d f0
nm@043: mov $0x17d78403,rArg1 ! be 03 84 d7 17
nm@048: testb $2,%bl ! f6 c3 02 a.k.a. testb $2,rReceiver
nm@04b: jz nm@056 ! 74 09
nm@04d: cmp rArg1,rReceiver ! 3b de
nm@04f: jle nm@075 ! 7e 24
nm@051: jmp nm@0ab ! e9 55 00 00 00
...
nm@075: mov -0x10(bFrame),rReceiver ! 8b 5d f0
nm@078: testb $2,%bl ! f6 c3 02 a.k.a. testb $2,rReceiver
nm@07b: jz nm@085 ! 74 08
nm@07d: add $4,rReceiver ! 83 c3 04
nm@080: jno nm@092 ! 71 10
...
nm@092: mov rReceiver,-0x10(bFrame) ! 89 5d f0
nm@095: cmp 0xfb190,bSP ! 3b 25 90 b1 0f 00
nm@09b: jae nm@040 ! 0f 83 9f ff ff ff
VW's use of the non-destructive testb avoids a register copy, but that costs Cog only one extra instruction. I'm doing slightly better than VW in that I don't assign the limit 100000000 (0x0bebc201 in Squeak, 0x17d78403 in VW) to the arg register unless i is not a SmallInteger, and I'm able to generate short jumps (2 bytes vs 5 bytes). But these are angels on the heads of pins. The code generators are essentially equivalent.
best,
Eliot