Hi All,

so the performance benefit of the special selectors is interesting. One thing the interpreter does for all the relational special selectors #= #~= #< #<= #> #>= is statically predict SmallIntegers and/or Floats and look ahead for a following conditional branch. It evaluates the branch immediately, jumping directly on the processor's condition codes, and so avoids reifying the result of the relational as true or false only to test it again a bytecode later. e.g.
bytecodePrimLessThan
	| rcvr arg aBool |
	rcvr := self internalStackValue: 1.
	arg := self internalStackValue: 0.
	(self areIntegers: rcvr and: arg) ifTrue:
		["The C code can avoid detagging since tagged integers are still signed.
		  But this means the simulator must override to do detagging."
		 ^self cCode: [self booleanCheat: rcvr < arg]
			inSmalltalk: [self booleanCheat: (objectMemory integerValueOf: rcvr) < (objectMemory integerValueOf: arg)]].
	self initPrimCall.
	aBool := self primitiveFloatLess: rcvr thanArg: arg.
	self successful ifTrue: [^self booleanCheat: aBool].
	messageSelector := self specialSelector: 2.
	argumentCount := 1.
	self normalSend

booleanCheat: cond
	"cheat the interpreter out of the pleasure of handling the next bytecode IFF it is a jump-on-boolean. Which it is, often enough when the current bytecode is something like bytecodePrimEqual"
	<inline: true>
	cond
		ifTrue: [self booleanCheatTrue]
		ifFalse: [self booleanCheatFalse]

booleanCheatFalse
	"cheat the interpreter out of the pleasure of handling the next bytecode IFF it is a jump-on-boolean. Which it is, often enough when the current bytecode is something like bytecodePrimEqual"
	| bytecode offset |
	<sharedCodeNamed: 'booleanCheatFalse' inCase: 179>
	bytecode := self fetchByte. "assume next bytecode is jumpIfFalse (99%)"
	self internalPop: 2.
	(bytecode < 160 and: [bytecode > 151]) ifTrue: "short jumpIfFalse"
		[^self jump: bytecode - 151].
	bytecode = 172 ifTrue: "long jumpIfFalse"
		[offset := self fetchByte.
		 ^self jump: offset].
	"not followed by a jumpIfFalse; undo instruction fetch and push boolean result"
	localIP := localIP - 1.
	self fetchNextBytecode.
	self internalPush: objectMemory falseObject

booleanCheatTrue
	"cheat the interpreter out of the pleasure of handling the next bytecode IFF it is a jump-on-boolean. Which it is, often enough when the current bytecode is something like bytecodePrimEqual"
	| bytecode |
	<sharedCodeNamed: 'booleanCheatTrue' inCase: 178>
	bytecode := self fetchByte. "assume next bytecode is jumpIfFalse (99%)"
	self internalPop: 2.
	(bytecode < 160 and: [bytecode > 151]) ifTrue: "short jumpIfFalse"
		[^self fetchNextBytecode].
	bytecode = 172 ifTrue: "long jumpIfFalse"
		[self fetchByte.
		 ^self fetchNextBytecode].
	"not followed by a jumpIfFalse; undo instruction fetch and push boolean result"
	localIP := localIP - 1.
	self fetchNextBytecode.
	self internalPush: objectMemory trueObject
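To see the decoding concretely: in the bytecode for the benchmark below, the jumpFalse: at pc 30 is byte 16r9D = 157, which the (bytecode < 160 and: [bytecode > 151]) test recognizes as a short jumpIfFalse, so booleanCheatFalse jumps 157 - 151 = 6 bytes from pc 31 straight to pc 37, and neither true nor false is ever materialized.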
Until yesterday Cog didn't do this for jitted code. The conversation about using #= for testing got me motivated to implement it. Note that VisualWorks' HPS VM does something even more aggressive than booleanCheat:. So what are we talking about? Here's my micro-benchmark:

	Time millisecondsToRun: [1 to: 100000000 do: [:i| ]]

The bytecode compiler optimizes the inner loop to code equivalent to

	| i |
	i := 1.
	[i <= 100000000] whileTrue: [i := i + 1]

Here's the Squeak bytecode:

25 <76> pushConstant: 1
26 <68> popIntoTemp: 0
27 <10> pushTemp: 0
28 <20> pushConstant: 100000000
29 <B4> send: <=
30 <9D> jumpFalse: 37
31 <10> pushTemp: 0
32 <76> pushConstant: 1
33 <B0> send: +
34 <68> popIntoTemp: 0
35 <A3 F6> jumpTo: 27
37:

And here's the VisualWorks bytecode:

6 <4A> push 1
7 <4C> store local 0; pop
8 <67> loop head
9 <10> push local 0
10 <1C> push 100000000
11 <A4> send <=
12 <C4> jump false 18
13 <10> push local 0
14 <C8> push 1; send +
15 <4C> store local 0; pop
16 <E3 F6> jump 8
18:

Note that loopHead is a no-op that allows the HPS JIT to be one pass; it tells the JIT that there is a backward branch to this instruction, and hence that loopHead is a control-flow join, which means that the registers caching the receiver and its base pointer (VW has an indirection from the object header to the object body) must be reloaded if needed.
So the current version of Cog generates the following code for the loop:

25 <76> pushConstant: 1
a1ea4: movl $0x00000003, %eax : B8 03 00 00 00
a1ea9: pushl %eax : 50
26 <68> popIntoTemp: 0
a1eaa: popl %eax : 58
a1eab: movl %eax, %ss:0xfffffff0(%ebp) : 89 45 F0
27 <10> pushTemp: 0
a1eae: movl %ss:0xfffffff0(%ebp), %eax : 8B 45 F0
a1eb1: pushl %eax : 50
28 <20> pushConstant: 100000000
a1eb2: pushl $0x0bebc201 : 68 01 C2 EB 0B
29 <B4> send: <=
a1eb7: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a1ebb: movl $0x0045a458=#<=, %ecx : B9 58 A4 45 00
a1ec0: call .+0xfff5e5cb (0x00000490=ceSend1Args) : E8 CB E5 F5 FF
IsSendCall:
a1ec5: pushl %edx : 52
30 <9D> jumpFalse: 37
a1ec6: popl %eax : 58
a1ec7: subl $0x00172270=false, %eax : 2D 70 22 17 00
IsObjectReference:
a1ecc: jz .+0x0000003d (0x000a1f0b=16rA1E50@BB) : 74 3D
a1ece: cmpl $0x00000008, %eax : 83 F8 08
a1ed1: jz .+0x0000000b (0x000a1ede=16rA1E50@8E) : 74 0B
a1ed3: addl $0x00172270=false, %eax : 05 70 22 17 00
IsObjectReference:
a1ed8: pushl %eax : 50
a1ed9: call .+0xfff5e902 (0x000007e0=ceSendMustBeBooleanTrampoline) : E8 02 E9 F5 FF
HasBytecodePC:
31 <10> pushTemp: 0
a1ede: movl %ss:0xfffffff0(%ebp), %eax : 8B 45 F0
a1ee1: pushl %eax : 50
32 <76> pushConstant: 1
a1ee2: movl $0x00000003, %eax : B8 03 00 00 00
a1ee7: pushl %eax : 50
33 <B0> send: +
a1ee8: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a1eec: movl $0x0045ae64=#+, %ecx : B9 64 AE 45 00
a1ef1: call .+0xfff5e59a (0x00000490=ceSend1Args) : E8 9A E5 F5 FF
IsSendCall:
a1ef6: pushl %edx : 52
34 <68> popIntoTemp: 0
a1ef7: popl %eax : 58
a1ef8: movl %eax, %ss:0xfffffff0(%ebp) : 89 45 F0
35 <A3 F6> jumpTo: 27
a1efb: movl %ds:0x0013aafc=&stackLimit, %eax : A1 FC AA 13 00
a1f00: cmpl %eax, %esp : 39 C4
a1f02: jnb .+0xffffffaa (0x000a1eae=16rA1E50@5E) : 73 AA
a1f04: call .+0xfff5e907 (0x00000810=ceCheckForInterruptsTrampoline) : E8 07 E9 F5 FF
HasBytecodePC:
a1f09: jmp .+0xffffffa3 (0x000a1eae=16rA1E50@5E) : EB A3
37:
0x000a1f0b/16rA1E50@BB:

And here's the primitive part of the SmallInteger>>#<= method; the SmallInteger>>#+ method is similar, so I'll omit it for brevity.
entry: (check that the receiver matches the inline cache)
00002498: movl %edx, %eax : 89 D0
0000249a: andl $0x00000001, %eax : 83 E0 01
0000249d: jnz .+0x00000010 (0x000024af=<=@37) : 75 10 (jump if the receiver is a SmallInteger, which in our case it is)
0000249f: movl %ds:(%edx), %eax : 8B 02
000024a1: shrl $0x0a, %eax : C1 E8 0A
000024a4: andl $0x0000007c, %eax : 83 E0 7C
000024a7: jnz .+0x00000006 (0x000024af=<=@37) : 75 06
000024a9: movl %ds:0xfffffffc(%edx), %eax : 8B 42 FC
000024ac: andl $0xfffffffc, %eax : 83 E0 FC
000024af: cmpl %ecx, %eax : 39 C8 (compare the receiver's tags against the inline cache ecx, which in our case will match)
000024b1: jnz .+0xffffffdf (0x00002492=<=@1A) : 75 DF
noCheckEntry: (SmallInteger primitive #<=)
000024b3: movl %ss:0x4(%esp), %eax : 8B 44 24 04
000024b7: movl %eax, %ecx : 89 C1
000024b9: andl $0x00000001, %eax : 83 E0 01
000024bc: jz .+0x00000014 (0x000024d2=<=@5A) : 74 14 (is the argument a SmallInteger, which in our case it will be)
000024be: cmpl %ecx, %edx : 39 CA
000024c0: jle .+0x00000008 (0x000024ca=<=@52) : 7E 08
000024c2: movl $0x00172270=false, %edx : BA 70 22 17 00 (answer the false object)
IsObjectReference:
000024c7: ret $0x0008 : C2 08 00
000024ca: movl $0x00172278=true, %edx : BA 78 22 17 00 (answer the true object)
IsObjectReference:
000024cf: ret $0x0008 : C2 08 00
000024d2:
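To decode that entry sequence (as I read it): it computes a class tag for the receiver to compare against the inline cache in %ecx. The andl $0x00000001 tests the SmallInteger tag bit; failing that, the shrl $0x0a / andl $0x0000007c pair extracts the compact class index from the receiver's header; and if that is zero, the movl 0xfffffffc(%edx) / andl $0xfffffffc pair fetches the full class pointer from the word before the object, masking off its low tag bits. Whichever case applies leaves the receiver's class tag in %eax for the cmpl against %ecx.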
You can see that the Cog code generator is extremely naive; it is essentially generating RISC code: no compares against memory, only against registers, and so on. The question I want you to ask yourself is how much faster the loop will go if we add code to short-circuit the send of #<=, and the subsequent comparison against true and false, with a tag test and a direct comparison. Try to answer it as we go along; I'll give the result below. The loop has three main parts: a) the compare and branch, [i <= 100000000] whileTrue:; b) the increment, i := i + 1; and c) the backward branch at the end of the loop, which checks against the stackLimit so execution can break out if there is an event (e.g. ctrl-.). The short-circuit should substantially speed up part a), about a third of the loop.
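Before reading on, a back-of-envelope estimate (my own rough numbers, so treat them as such): in the listing above, part a) costs on the order of 20 instructions per iteration (the send sequence, the jitted primitive, and the true/false test), and a tag test plus direct compare should cut that to under 10. If part a) is about a third of the loop's cost and gets, say, 2.5x cheaper, Amdahl's law predicts roughly 1 / (2/3 + (1/3)/2.5) ≈ 1.25x overall, i.e. a win north of 20%, if instruction count were all that mattered. Keep that figure in mind.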
OK, so here's the code with the booleanCheat implemented in the JIT:

25 <76> pushConstant: 1
a52cc: movl $0x00000003, %eax : B8 03 00 00 00
a52d1: pushl %eax : 50
26 <68> popIntoTemp: 0
a52d2: popl %eax : 58
a52d3: movl %eax, %ss:0xfffffff0(%ebp) : 89 45 F0
27 <10> pushTemp: 0
a52d6: movl %ss:0xfffffff0(%ebp), %eax : 8B 45 F0
a52d9: pushl %eax : 50
28 <20> pushConstant: 100000000
a52da: pushl $0x0bebc201 : 68 01 C2 EB 0B
29 <B4> send: <=
a52df: movl %ss:(%esp), %esi : 8B 34 24
a52e2: movl %esi, %eax : 89 F0
a52e4: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a52e8: andl %edx, %eax : 23 C2
a52ea: andl $0x00000001, %eax : 83 E0 01 (are receiver and arg SmallIntegers, which they are)
a52ed: jz .+0x00000009 (0x000a52f8=16rA5278@80) : 74 09
a52ef: addl $0x00000008, %esp : 83 C4 08
a52f2: cmpl %esi, %edx : 39 F2 (compare directly)
a52f4: jnle .+0x00000056 (0x000a534c=16rA5278@D4) : 7F 56 (jump on condition codes)
a52f6: jmp .+0x00000027 (0x000a531f=16rA5278@A7) : EB 27 (jump past the send and the jumpFalse:)
a52f8: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a52fc: movl $0x0045a458, %ecx : B9 58 A4 45 00
a5301: call .+0xfff5b18a (0x00000490=ceSend1Args) : E8 8A B1 F5 FF
IsSendCall:
a5306: pushl %edx : 52
30 <9D> jumpFalse: 37
a5307: popl %eax : 58
a5308: subl $0x00172270=false, %eax : 2D 70 22 17 00
IsObjectReference:
a530d: jz .+0x0000003d (0x000a534c=16rA5278@D4) : 74 3D
a530f: cmpl $0x00000008, %eax : 83 F8 08
a5312: jz .+0x0000000b (0x000a531f=16rA5278@A7) : 74 0B
a5314: addl $0x00172270=false, %eax : 05 70 22 17 00
IsObjectReference:
a5319: pushl %eax : 50
a531a: call .+0xfff5b4c1 (0x000007e0=ceSendMustBeBooleanTrampoline) : E8 C1 B4 F5 FF
HasBytecodePC:
31 <10> pushTemp: 0
a531f: movl %ss:0xfffffff0(%ebp), %eax : 8B 45 F0
a5322: pushl %eax : 50
32 <76> pushConstant: 1
a5323: movl $0x00000003, %eax : B8 03 00 00 00
a5328: pushl %eax : 50
33 <B0> send: +
a5329: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a532d: movl $0x0045ae64=#+, %ecx : B9 64 AE 45 00
a5332: call .+0xfff5b159 (0x00000490=ceSend1Args) : E8 59 B1 F5 FF
IsSendCall:
a5337: pushl %edx : 52
34 <68> popIntoTemp: 0
a5338: popl %eax : 58
a5339: movl %eax, %ss:0xfffffff0(%ebp) : 89 45 F0
35 <A3 F6> jumpTo: 27
a533c: movl %ds:0x0013aafc=&stackLimit, %eax : A1 FC AA 13 00
a5341: cmpl %eax, %esp : 39 C4
a5343: jnb .+0xffffff91 (0x000a52d6=16rA5278@5E) : 73 91
a5345: call .+0xfff5b4c6 (0x00000810=ceCheckForInterruptsTrampoline) : E8 C6 B4 F5 FF
HasBytecodePC:
a534a: jmp .+0xffffff8a (0x000a52d6=16rA5278@5E) : EB 8A

So roughly 1/3 of the loop has been sped up substantially. How much speedup? 2.7%. 2.7 measly percent. A good 2-3 hours' work for 2.7 measly percent?!?! i.e. 691 milliseconds fell to 672 milliseconds (and the measurements are nicely repeatable).
Well, one reason for the limited performance increase could be that the event check at the backward branch trips a lot and dominates the cost. So what happens if I change the heartbeat from 2KHz to 20KHz? 672 falls to 664, so only 9 milliseconds or so is due to the stack-limit event check.
Perhaps the x86 is so good at optimizing the code that it can't be sped up? Well, let's have a look at HPS's performance. It is *4* times faster: 168 ms vs 672 ms (that's *exactly* 4 times faster :) ). So what code does HPS generate? First, its back-end is less naive; it doesn't move intermediate results through registers all the time. But most significantly, it is able to know that both the 100000000 and the + 1 are SmallIntegers, because it delays generating code until it gets to a send. So its code is a /lot/ better: it short-circuits both the compare-and-branch and the increment. Here it is (remember that Squeak has only one tagged type, SmallInteger with tag 1, while 32-bit VW has two tagged types, SmallInteger and Character, with tags 3 and 1 respectively, so in VW the SmallInteger 1 is encoded as 0x7):
6 <4A> push 1
7 <4C> store local 0; pop
nm@039: xor rTemp,rTemp ! 33 c0
nm@03b: movb $7,%al ! b0 07
nm@03d: mov rTemp,-0x10(bFrame) ! 89 45 f0
8 <67> loop head
9 <10> push local 0
nm@040: mov -0x10(bFrame),rReceiver ! 8b 5d f0
10 <1C> push 100000000
nm@043: mov $0x17d78403,rArg1 ! be 03 84 d7 17
11 <A4> send <=
nm@048: testb $2,%bl ! f6 c3 02 a.k.a. testb $2,rReceiver
nm@04b: jz nm@056 ! 74 09
nm@04d: cmp rArg1,rReceiver ! 3b de
nm@04f: jle nm@075 ! 7e 24
nm@051: jmp nm@0ab ! e9 55 00 00 00
nm@056: mov $0x16a8ec6c,rClass ! ba 6c ec a8 16
nm@05b: call _81080=send1Args ! e8 28 e7 6b ea
map: 0x3a: send1(26) #<= vpc 12:
12 <C4> jump false 18
nm@060: cmp $0x16b33da4,rReceiver ! 81 fb a4 3d b3 16
nm@066: jz nm@0ab ! 74 43
nm@068: cmp $0x16da90e4,rReceiver ! 81 fb e4 90 da 16
nm@06e: jz nm@075 ! 74 05
nm@070: call _159cadd8 ! e8 6b 84 00 00
nm@075: mov -0x10(bFrame),rReceiver ! 8b 5d f0
nm@078: testb $2,%bl ! f6 c3 02 a.k.a. testb $2,rReceiver
nm@07b: jz nm@085 ! 74 08
13 <10> push local 0
14 <C8> push 1; send +
nm@07d: add $4,rReceiver ! 83 c3 04
nm@080: jno nm@092 ! 71 10
nm@082: sub $4,rReceiver ! 83 eb 04
nm@085: push $7 ! 6a 07
nm@087: pop rArg1 ! 5e
nm@088: mov $0x16bbab34,rClass ! ba 34 ab bb 16
nm@08d: call _81080=send1Args ! e8 f6 e6 6b ea
map: 0x22: send1(2) #+
15 <4C> store local 0; pop
nm@092: mov rReceiver,-0x10(bFrame) ! 89 5d f0
16 <E3 F6> jump 8
nm@095: cmp 0xfb190,bSP ! 3b 25 90 b1 0f 00
nm@09b: jae nm@040 ! 0f 83 9f ff ff ff
nm@0a1: call _80d30 ! e8 92 e3 6b ea
vpc 18:
nm@0a6: jmp nm@040 ! e9 95 ff ff ff
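A quick decode of the tagged constants (arithmetic you can check yourself): Squeak encodes a SmallInteger n as (n << 1) | 1, which is why the constant 1 appears as movl $0x00000003 and the 100000000 as pushl $0x0bebc201 in the Cog listings above. VW encodes n as (n << 2) | 3, so 1 is 0x7 and 100000000 is 16r17D78403, exactly the literal loaded into rArg1. And since adding an untagged delta leaves the tag bits alone, adding the SmallInteger 1 to a tagged VW value is just add $4; on overflow the jno falls through, the add is undone with sub $4, and a real send is done.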
Interesting. VW executes the following instructions around the loop:

nm@040: mov -0x10(bFrame),rReceiver ! 8b 5d f0
nm@043: mov $0x17d78403,rArg1 ! be 03 84 d7 17
nm@048: testb $2,%bl ! f6 c3 02 a.k.a. testb $2,rReceiver
nm@04b: jz nm@056 ! 74 09
nm@04d: cmp rArg1,rReceiver ! 3b de
nm@04f: jle nm@075 ! 7e 24
nm@051: jmp nm@0ab ! e9 55 00 00 00
...
nm@07d: add $4,rReceiver ! 83 c3 04
nm@080: jno nm@092 ! 71 10
...
nm@092: mov rReceiver,-0x10(bFrame) ! 89 5d f0
nm@095: cmp 0xfb190,bSP ! 3b 25 90 b1 0f 00
nm@09b: jae nm@040 ! 0f 83 9f ff ff ff
12 instructions, one read and one write. Cog executes many more, quite a few of them reads and writes. How I itch to find the time to do delayed code generation/stack-to-register mapping in Cog...
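How such delayed code generation might look (a sketch with made-up selectors and classes, not Cog's actual design): the JIT keeps a simulated stack of operand descriptors, emits nothing for pushes, and decides how to marshal operands only on reaching a send, at which point constants are still visible as constants:

ssPushConstant: anOop
	"Defer the push: record a descriptor instead of emitting a pushl."
	simStack addLast: (SimStackEntry constant: anOop)

ssPushTemp: index
	"Likewise for temporaries; no machine code yet."
	simStack addLast: (SimStackEntry temp: index)

genSpecialSelectorComparison: aCondition
	"At the send both operands are still symbolic."
	| arg rcvr reg |
	arg := simStack removeLast.
	rcvr := simStack removeLast.
	(arg isConstant and: [objectMemory isIntegerObject: arg constant])
		ifTrue: ["the literal needs no tag check; test only the receiver"
			reg := self materializeIntoRegister: rcvr.
			self genSmallIntegerTagCheck: reg.
			self genCompare: reg withConstant: arg constant.
			self genJumpOnCondition: aCondition]
		ifFalse: ["no static knowledge; marshal both operands and emit the full send"
			self marshalOperands: rcvr and: arg.
			self genSendSpecial: aCondition numArgs: 1]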
best,
Eliot
On 19 November 2010 21:48, Eliot Miranda <[hidden email]> wrote:
> Time millisecondsToRun: [1 to: 100000000 do: [:i| ]]
> The bytecode compiler optimizes the inner loop to code equivalent to
> | i |
> i := 1.
> [i <= 100000000] whileTrue: [i := i + 1]
[snip]

I think that with aggressive optimization this loop can be turned into a no-op, or more precisely into an instruction which sets i to 100000001 :) But I wonder what kind of analysis should be applied to determine that the loop is bound to 100000000 and that there are no side effects other than incrementing the counter. Btw, I think it is much easier to optimize the original 1 to: 100000000 do:, since you know the loop bounds beforehand and need only analyze whether the loop body has any side effects. In that way it is better to analyze the code at the AST level than at the bytecode level, since once you turn it into bytecode you lose the precious information that the loop has bounds, and have no choice but to strictly follow the bytecode semantics. While of course Cog has not much choice, since it can't operate on an AST, just bytecode.

[snip]

--
Best regards, Igor Stasenko AKA sig.
Hi Igor,

On Fri, Nov 19, 2010 at 2:50 PM, Igor Stasenko <[hidden email]> wrote:
> I think that with aggressive optimization this loop can be turned into a no-op, or more precisely into an instruction which sets i to 100000001 :)

Are you missing the point, which is to try to generate efficient loop code, not to optimize empty loops? I stripped the example down to the smallest possible code (an empty loop) just to show how much better the VW code is and how limited the effect of merely doing the booleanCheat: is. But loops that /can't/ be optimized away are very common in the system, and right now Cog isn't doing a great job of making them go fast.
> But I wonder what kind of analysis should be applied to determine that the loop is bound to 100000000 and that there are no side effects other than incrementing the counter.

Imagine the loop had something meaningful in it, such as Set>>do:.

> In that way it is better to analyze the code at the AST level than at the bytecode level, since once you turn it into bytecode you lose the precious information that the loop has bounds.

Yes, but you're missing the point. One can still generate faster jitted code than Cog is producing now. My message was trying to show how much more there is to be gained from a better JIT code generator, not about higher-level optimization.
best,
Eliot
On 20 November 2010 01:43, Eliot Miranda <[hidden email]> wrote:
> Yes, but you're missing the point. One can still generate faster jitted code than Cog is producing now. My message was trying to show how much more there is to be gained from a better JIT code generator, not about higher-level optimization.

Yes, there is space for improvement. My point was that you have even more potential for optimization when dealing at the AST level rather than the bytecode level. And one of my intents is to create a compiler which could transform an AST directly to native code. For example, the loop bounds, since they are SmallInteger constants, could be turned into machine integers, so that code like the following:

32 <76> pushConstant: 1
a1ee2: movl $0x00000003, %eax : B8 03 00 00 00
a1ee7: pushl %eax : 50
33 <B0> send: +
a1ee8: movl %ss:0x4(%esp), %edx : 8B 54 24 04
a1eec: movl $0x0045ae64=#+, %ecx : B9 64 AE 45 00
a1ef1: call .+0xfff5e59a (0x00000490=ceSend1Args) : E8 9A E5 F5 FF
IsSendCall:
a1ef6: pushl %edx : 52
34 <68> popIntoTemp: 0
a1ef7: popl %eax : 58

could be replaced by just

inc %eax

and, if the loop counter is passed somewhere that can't use the optimized value, it can be turned back into a SmallInteger like so:

shl %eax,1
inc %eax

> best
> Eliot

--
Best regards, Igor Stasenko AKA sig.
Hi Igor,

On Sat, Nov 20, 2010 at 1:51 AM, Igor Stasenko <[hidden email]> wrote:
Yes, I agree. But I don't think AST vs bytecode really has anything to do with it; they can easily be transformed into each other (via the decompiler and compiler). The bytecode is a convenient form because it is compact and can be interpreted efficiently. The issue is *when* and *where* to spend the cycles trying to optimise aggressively. That's where performance counters come in. If one decorates the jitted code with e.g. a taken and untaken count at each conditional branch, then when these counters trip one suspends execution, examines the current call stack, collects concrete type information from the inline caches, and optimises several nested activations into a single large method that is worth optimising with traditional static techniques (good register allocation etc). If one tries to optimise everything, the system becomes unresponsive (see Craig Chambers' Self 2 compiler). If one defers optimization until finding a "hot spot", things work much better (see Urs Hölzle's Self 3 compiler, HotSpot et al).
So
- keep bytecode and an interpreter for compactness, portability, and the ability to always fall back on the interpreter (e.g. when the JIT runs out of memory during some tricky relinking operation);
- use a simple JIT to optimize code run more than once, one that does a reasonable job of stack-to-register mapping, implements PICs to collect type info, and maintains performance counters to collect block usage and invoke the aggressive optimizer;
- use a speculative inliner and an aggressive optimiser to inline code based on hot spots, basic-block counts, and PIC info, and optimize it using traditional techniques.

All of the above exists in various production VMs, though AFAIA not all in the same place. So the above is arguably a proven architecture. Hence it is my direction (and Marcus', and I hope yours too). I can send you my architectural sketch if you're interested.
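To make the branch-counter idea above concrete, here is a toy workspace snippet (purely illustrative, not VM code; the branch keys and threshold are made up) that tallies per-branch counts and records the moment one crosses the threshold, which is when the VM would suspend and invoke the optimizer:

| counters threshold tripped |
counters := Dictionary new.
threshold := 1000.
tripped := OrderedCollection new.
1 to: 5000 do: [:i | | branch |
	"a branch taken on even iterations, untaken on odd ones"
	branch := i even ifTrue: [#pc42taken] ifFalse: [#pc42untaken].
	counters at: branch put: (counters at: branch ifAbsent: [0]) + 1.
	(counters at: branch) = threshold ifTrue:
		["in the real VM: suspend, walk the stack, optimize"
		 tripped add: branch]].
tripped "=> an OrderedCollection(#pc42untaken #pc42taken)"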
best,
Eliot
On Sat, Nov 20, 2010 at 8:19 AM, Eliot Miranda <[hidden email]> wrote:
> The issue is *when* and *where* to spend the cycles trying to optimise aggressively. That's where performance counters come in.
[snip]
> All of the above exists in various production VMs, though AFAIA not all in the same place. So the above is arguably a proven architecture.

That would seem to describe Strongtalk pretty well, no? Or does it collect type and profile information directly from the interpreter?

Also, my impression was that Strongtalk, HotSpot et al did inlining *down* the stack. That is, it would find a method that was activated frequently, then inline whatever sends and block activations it performed to get a large, statically optimizable method. What you mention above is slightly different, IIUC, in that you're finding a hot spot based on counters in basic blocks, then looking up the stack to find a method that can be aggressively optimized. This sounds a bit like the trace-based inlining that Mozilla used for their Javascript, in that you're effectively choosing to optimize a particular path through the code rather than a particular method. Thoughts?

Colin
On Sat, Nov 20, 2010 at 9:32 AM, Colin Putney <[hidden email]> wrote:
> That would seem to describe Strongtalk pretty well, no? Or does it collect type and profile information directly from the interpreter?

We should check with Steve Rees, but I think you're right. As I understand it, the Strongtalk VM generated threaded-code operations at startup and jitted bytecode to threaded code. PICs and method-activation counters were in the threaded code. The optimizing compiler then examined the threaded code.
> Also, my impression was that Strongtalk, HotSpot et al did inlining *down* the stack.

Right. But it starts from the activation in which the counter trips. Finding out how far down to look is a heuristic. In Self 3, if it tripped within a block activation, or a send within a block activation (or some small number of activations down), it would choose the home context of that block. But I remember talking with David Griswold, and IIRC he said that in Strongtalk it always compiles just one method out, so that the inlining repeats incrementally (I think). IIRC he said this was the best heuristic, better than Self 3's.
> What you mention above is slightly different, IIUC, in that you're finding a hot spot based on counters in basic blocks, then looking up the stack to find a method that can be aggressively optimized.

Not really. It is the same with both the activation counters and the basic-block counters. If you look one method down at the caller (i.e. try to inline the method in which the counter tripped into its caller), then that caller may have other sends in it that one could profitably inline (e.g. if the basic-block counters say those sends are often executed). One can also try to inline sends made in the method where the counter tripped. Visualise it like a tree (the call graph) with a counter tripped at some node. One searches for a subgraph of the tree, guided by the counts and by heuristics that prevent looking at too much code, repeatedly looking at unoptimizable code, etc., and chooses to optimize the entire subgraph, mapping the current execution to the right point in the supermethod one produces. You can traverse the graph towards the root by following the call stack, and towards the leaves by following sends through PICs. So one can indeed explore a subgraph, not just a single path.
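For instance, if the counter trips in C on a call stack A -> B -> C, and C's PICs show its hot send goes mostly to D, the optimizer might select the subgraph B -> C -> D, compile it as a single supermethod, and map the live activation of C onto the corresponding point in that supermethod.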
> This sounds a bit like the trace-based inlining that Mozilla used for their Javascript, in that you're effectively choosing to optimize a particular path through the code rather than a particular method.

The trace-based thing looks like optimizing wearing blinkers to me.

best,
Eliot