ASM questions for tight loops


ASM questions for tight loops

Clément Bera-4
 
Hi Eliot, (sorry long mail but I think you will enjoy this discussion; read it)

Looking deeper into tight loop performance to get some boost out of Sista, I have questions and remarks.

First thing is when I compare 3 versions of array copy (code at the bottom of the mail).
Version [1] : copyArray : written in plain Smalltalk
Version [2] : fastCopyArray : sista generated code (only unchecked operations)
Version [3] : primCopyArray : simply call the machine code primitive

For small arrays everything is fine, sista code is good. But for large arrays I get the following results (here a copy of 1000 elements):
1) 85k/second
2) 390k/second
3) 2,000k/second

So the sista version is ~5x faster than pure Smalltalk code, but ~5x slower than the primitive.

My guess is that the reason for this is that the hand-written primitive uses a 7-instruction tight loop (2 additions, 1 read, 1 write, 2 jumps, 1 comparison), while the sista-generated code requires 6 temp reads, 1 temp write and a couple of cheap instructions in addition. We discussed earlier that register allocation might bring a 20% performance boost, but it seems tight loops are quite important, and in this case the performance difference is around 5x (500%).
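The cost difference can be sketched in C (a rough model, not the actual generated code; `volatile` is used here only to force the memory traffic that stack-allocated temps cause):

```c
#include <stddef.h>

/* Rough C model of the two code shapes.  The hand-written primitive
   keeps all loop state in registers; the current sista output reloads
   its temps from stack slots on every iteration. */

void copy_register_state(long *dst, const long *src, size_t n)
{
    for (size_t i = 0; i < n; i++)   /* i, dst, src live in registers */
        dst[i] = src[i];
}

void copy_spilled_state(long *dst, const long *src, size_t n)
{
    volatile size_t i = 0;           /* behaves like a stack temp:   */
    while (i < n) {                  /* read/written through memory  */
        dst[i] = src[i];
        i = i + 1;
    }
}
```

Both functions copy the same data; the second just pays a memory round-trip for the induction variable on every iteration, which is what a good register allocator would eliminate.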

So the big question is: should I add more macro-instructions (let's call them stubs) to the sista extended bytecode set, such as bytesEquals which bulk-compares the bytes of 2 objects, since the performance difference is bigger than expected, or should I wait for a better register allocation algorithm?

I believe a hand-written stub will always be faster than sista-generated code (since a couple of things can't be expressed at the bytecode level), but based on multiple benchmarks I think that if the better register allocator generates code with the same number of memory reads and writes (including temp access on the stack), the difference should be less than the 20% we expected.

It's not clear how many stubs would be needed. I have a prototype of array copy; other things such as indexOf or fillObject are needed. Maybe I can add only 3-4 for now to show massive speed-ups, like we did with bytesEquals.

Let's note that we could try to share the stubs with normal primitives (primStringReplace with the array copy stub, for example) to avoid too much maintenance cost. The problem is that normal primitives have their complexity in all the checks at the beginning, followed by a generic case (a copy that knows nothing about the operands), while sista stubs have no checks but their complexity is in generating better code when some operands are constants. So it's not really the same thing.

Second question is about the RTL code for tight loops. I looked into V8 and it seems that for tight loops they try to use a backward conditional jump instead of a forward conditional jump plus a backward unconditional jump. What do you think the difference would be if I did this for array copy (the copy is guaranteed not to be empty at this point; example below, Original vs New)? I wonder if conditional back jumps are slower than conditional forward jumps or something like that (I am always confused by branch performance, and since performance differs depending on caches, small benchmarks may not be relevant). What's your take on this?

And then the question is: should I generate that kind of loop from the sista back-end (I don't really like it; I prefer to limit the number of back jumps, else it makes some things harder)? Or I can do it in the stubs for the most performance-critical things and ignore it in other cases.

Original
instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.
jumpFinished jmpTarget: (jumpEmpty jmpTarget: cogit genPrimReturn).
Total 7 instructions in the loop
New
instr := cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
        cogit CmpR: startReg R: stopReg.
cogit JumpGreaterOrEqual: instr.
        jumpEmpty jmpTarget: cogit genPrimReturn.
Total 6 instructions in the loop
But a conditional back-jump.
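The two shapes are the classic test-at-the-top loop versus the rotated do-while layout. A C sketch (register names and condition directions are illustrative, not the exact RTL semantics):

```c
/* "Original" shape: test at the top, forward conditional exit plus an
   unconditional backward jump -- two jumps per iteration. */
void copy_top_test(long *dst, const long *src, long start, long stop)
{
    long i = start, j = 0;
    while (i <= stop) {      /* Cmp + forward JumpBelow ... back Jump */
        dst[i] = src[j];
        i++;
        j++;
    }
}

/* "New" shape: body first, then a single conditional backward jump, as
   in the rotated RTL above.  Only valid because the copy is known to
   be non-empty when the loop is entered. */
void copy_bottom_test(long *dst, const long *src, long start, long stop)
{
    long i = start, j = 0;
    do {
        dst[i] = src[j];
        i++;
        j++;
    } while (i <= stop);     /* Cmp + conditional back jump only */
}
```

Compilers routinely perform this loop rotation for exactly the reason given: it removes one jump per iteration from the hot path.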

NB: Eliot, I tried incrementing the pointer instead of the index and using 0-displacement reads/writes, and it made no performance difference on Intel. It saves a register in the loop, but we don't need it. It makes a difference in the inlined stub however, since other registers can be used for other things in the method where it's inlined.

[1] Smalltalk code
1 to: (y - x + 1) do: [ :i |
array2 at: y2 + i put: (array at: x + i) ].
[2] (sista bytecodes of the tight loop)
32 <45> pushTemp: 5
33 <46> pushTemp: 6
34 <F8 F3 87> smiLessOrEqual:
37 <EF 1B> jumpFalse: 66
39 <43> pushTemp: 3
40 <45> pushTemp: 5
41 <40> pushTemp: 0
42 <45> pushTemp: 5
43 <F8 10 88> pointerAt:
46 <F8 B8 8B> pointerAt:put:
49 <D8> pop
50 <45> pushTemp: 5
51 <E1 00 E8 01> pushConstant: 1
55 <F8 D0 87> smiAdd:
58 <D5> popIntoTemp: 5
59 <E1 FF E8 DE> pushConstant: -34
63 <F8 70 97> backjumpNoInterrupt
[3] primitive call
array2 replaceFrom: x to: y with: array startingAt: y2
Current tight loop RTL:
instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.
jumpFinished jmpTarget: (jumpEmpty jmpTarget: cogit genPrimReturn).

--
Clément Béra
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq

Re: ASM questions for tight loops

Eliot Miranda-2
 
Hi Clément,

On Sat, Jan 6, 2018 at 12:00 AM, Clément Bera <[hidden email]> wrote:
Hi Eliot, (sorry long mail but I think you will enjoy this discussion; read it)

Looking deeper into tight loop performance to get some boost out of Sista, I have questions and remarks.

First thing is when I compare 3 versions of array copy (code at the bottom of the mail).
Version [1] : copyArray : written in plain Smalltalk
Version [2] : fastCopyArray : sista generated code (only unchecked operations)
Version [3] : primCopyArray : simply call the machine code primitive

For small arrays everything is fine, sista code is good. But for large arrays I get the following results (here a copy of 1000 elements):
1) 85k/second
2) 390k/second
3) 2,000k/second

So the sista version is ~5x faster than pure Smalltalk code, but ~5x slower than the primitive.

My guess is that the reason for this is that the hand-written primitive uses a 7-instruction tight loop (2 additions, 1 read, 1 write, 2 jumps, 1 comparison), while the sista-generated code requires 6 temp reads, 1 temp write and a couple of cheap instructions in addition. We discussed earlier that register allocation might bring a 20% performance boost, but it seems tight loops are quite important, and in this case the performance difference is around 5x (500%).

This is for me very good news.  I thought there was potential in good register allocation and this confirms it.
 

So the big question is: should I add more macro-instructions (let's call them stubs) to the sista extended bytecode set, such as bytesEquals which bulk-compares the bytes of 2 objects, since the performance difference is bigger than expected, or should I wait for a better register allocation algorithm?

I think *very much* we should take the register allocation route.  If we go the other route we're following the well-worn path of systems like Python, where basic execution performance is poor and is solved by special-casing specific operations and implementing them with C primitives.  Eventually one gets to a system where all the performance-critical parts are in C, which has a couple of really bad effects.  First, nothing one does to increase basic performance has any effect, since all the time is spent in C, so one loses the impetus to improve basic language performance.  Second, much of the system is in large, monolithic, unchangeable chunks of C, so the system loses its dynamic-language flexibility.  We definitely want to move in the other direction.  So let's instead put in the effort to make register allocation work well.  We know what has to be done.  It isn't easy, but the long-term benefits are great.  We should be able to make many primitives entirely optional in the JIT.  [Brain fart: We may even be able to generate primitives for the Interpreter versions using our framework]

I believe a hand-written stub will always be faster than sista-generated code (since a couple of things can't be expressed at the bytecode level), but based on multiple benchmarks I think that if the better register allocator generates code with the same number of memory reads and writes (including temp access on the stack), the difference should be less than the 20% we expected.

Good, then I think we are in agreement.  We can live without stubs.
 
It's not clear how many stubs would be needed. I have a prototype of array copy; other things such as indexOf or fillObject are needed. Maybe I can add only 3-4 for now to show massive speed-ups, like we did with bytesEquals.

I don't think it's the right way to go.  Having these for performance comparison is good, but having them as general features of inline primitives is, I think, a long-term strategic error, as outlined above.
 
Let's note that we could try to share the stubs with normal primitives (primStringReplace with the array copy stub, for example) to avoid too much maintenance cost. The problem is that normal primitives have their complexity in all the checks at the beginning, followed by a generic case (a copy that knows nothing about the operands), while sista stubs have no checks but their complexity is in generating better code when some operands are constants. So it's not really the same thing.

Yes, this does seem interesting.  But medium-term interpreter performance isn't critical.  Let's get the JIT performance good first.
 
Second question is about the RTL code for tight loops. I looked into V8 and it seems that for tight loops they try to use a backward conditional jump instead of a forward conditional jump plus a backward unconditional jump. What do you think the difference would be if I did this for array copy (the copy is guaranteed not to be empty at this point; example below, Original vs New)? I wonder if conditional back jumps are slower than conditional forward jumps or something like that (I am always confused by branch performance, and since performance differs depending on caches, small benchmarks may not be relevant). What's your take on this?

That's a good question.  I don't know.  The current layout is dictated by the fact that backwards jumps are interrupt points, along with the bytecode-to-machine-code pc mapping.  The code looks like this, this being from SequenceableCollection>>indexOf:startingAt:

startIndex to: self size do: [ :index |
(self at: index) == anElement ifTrue: [ ^index ] ].
^0

The end of the loop's bytecodes are

61 <B0> send: +
62 <6A> popIntoTemp: 2
63 <A3 ED> jumpTo: 46
65 <75> pushConstant: 0
66 <7C> returnTop

is compiled into

000010f2: movl $0x00100028=#+, %ecx : B9 28 00 10 00 
000010f7: call .+0xfffff37c (0x00000478=ceSend1Args) : E8 7C F3 FF FF 
IsSendCall + bc 61/62:
000010fc: movl %edx, -28(%ebp) : 89 55 E4 
000010ff: movl %ds:0x80000='stackLimit', %eax : A1 00 00 08 00 
00001104: cmpl %eax, %esp : 39 C4 
00001106: jnb .+0xffffff5e (0x0000106a=indexOf:startingAt:@6A) : 0F 83 5E FF FF FF 
0000110c: call .+0xfffffa87 (0x00000b98=ceCheckForInterruptsTrampoline) : E8 87 FA FF FF 
IsRelativeCall:
HasBytecodePC bc 62/63:
00001111: jmp .+0xffffff54 (0x0000106a=indexOf:startingAt:@6A) : E9 54 FF FF FF 
00001116: movl $0x00000001, %edx : BA 01 00 00 00 
0000111b: movl %ebp, %esp : 89 EC 
0000111d: popl %ebp : 5D 
0000111e: ret $0x000c : C2 0C 00 

So the suspension point is the pc following the call of ceCheckForInterruptsTrampoline, which is 00001111.  If execution continues after an interruption the jump is taken and the loop continues, as required.  [I remember an early stage in Cog's development at Qwaq when I hadn't implemented this and loops were terminating every time there was an interrupt, such as when one wiggled the mouse.  The system almost worked; it was quite bizarre :-) ]

But notice that the conditional jump at 00001106 *is* backwards, as you desire.  It seems we're already doing the right thing, at least with the backwards jump.  What do we do with backwards jumps in SistaV1 inline primitives?


And then the question is: should I generate that kind of loop from the sista back-end (I don't really like it; I prefer to limit the number of back jumps, else it makes some things harder)? Or I can do it in the stubs for the most performance-critical things and ignore it in other cases.

Original
instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.
jumpFinished jmpTarget: (jumpEmpty jmpTarget: cogit genPrimReturn).
Total 7 instructions in the loop
New
instr := cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
        cogit CmpR: startReg R: stopReg.
cogit JumpGreaterOrEqual: instr.
        jumpEmpty jmpTarget: cogit genPrimReturn.
Total 6 instructions in the loop
But a conditional back-jump.

Yes, the latter looks better and matches what we're doing with the backwards jump.
 

NB: Eliot, I tried incrementing the pointer instead of the index and using 0-displacement reads/writes, and it made no performance difference on Intel. It saves a register in the loop, but we don't need it. It makes a difference in the inlined stub however, since other registers can be used for other things in the method where it's inlined.

In general we should always try to reduce register pressure.  So if we can write the primitive using incremented pointers instead of base registers and incremented indexes we should.  In some complex loop, once we have good register allocation, that extra register will be put to good use and we can expect that we would see increased performance.


[1] Smalltalk code
1 to: (y - x + 1) do: [ :i |
array2 at: y2 + i put: (array at: x + i) ].
[2] (sista bytecodes of the tight loop)
32 <45> pushTemp: 5
33 <46> pushTemp: 6
34 <F8 F3 87> smiLessOrEqual:
37 <EF 1B> jumpFalse: 66
39 <43> pushTemp: 3
40 <45> pushTemp: 5
41 <40> pushTemp: 0
42 <45> pushTemp: 5
43 <F8 10 88> pointerAt:
46 <F8 B8 8B> pointerAt:put:
49 <D8> pop
50 <45> pushTemp: 5
51 <E1 00 E8 01> pushConstant: 1
55 <F8 D0 87> smiAdd:
58 <D5> popIntoTemp: 5
59 <E1 FF E8 DE> pushConstant: -34
63 <F8 70 97> backjumpNoInterrupt

I'm confused by "59 <E1 FF E8 DE> pushConstant: -34"; what's that?  Are these actually extension bytes for the backwards jump?  Can't be, because the E8 wouldn't be there.  So what is this?

[3] primitive call
array2 replaceFrom: x to: y with: array startingAt: y2
Current tight loop RTL:
instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.
jumpFinished jmpTarget: (jumpEmpty jmpTarget: cogit genPrimReturn).

Note that if you were to maintain one pointer and the delta between the source and destination pointers, instead of a source pointer and a destination pointer, you'd only have to increment one register (for the one pointer) in the loop, which would save another instruction.
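That suggestion can be sketched in C (the cross-object pointer subtraction is technically not portable C, but it mirrors what the RTL would do with raw addresses):

```c
#include <stddef.h>

/* One running pointer plus the loop-invariant byte delta between the
   source and destination: only a single register is incremented
   inside the loop. */
void copy_with_delta(long *dst, const long *src, size_t n)
{
    ptrdiff_t delta = (char *)src - (char *)dst;  /* computed once */
    long *p = dst;
    long *end = dst + n;
    while (p < end) {
        *p = *(long *)((char *)p + delta);  /* source element via delta */
        p++;                                /* the only increment       */
    }
}
```

The delta lives in a register but is never modified, so the loop body shrinks to one read, one write, one increment, one compare-and-branch.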
 

This is exciting!!

_,,,^..^,,,_
best, Eliot

Re: ASM questions for tight loops

Levente Uzonyi
In reply to this post by Clément Bera-4
 
Hi Clément,

On Sat, 6 Jan 2018, Clément Bera wrote:

> [1] Smalltalk code
> 1 to: (y - x + 1) do: [ :i |
> array2 at: y2 + i put: (array at: x + i) ].

This is not equal to

> [3] primitive call
> array2 replaceFrom: x to: y with: array startingAt: y2

Because, for example, the value at index x in array will never be copied.

I suggest you compare the results with the original fallback code:

  | index repOff |
  repOff := y2 - x.
  index := x - 1.
  [(index := index + 1) <= y]
  whileTrue: [self at: index put: (replacement at: repOff + index)]

I just ran some benchmarks, and the primitive was 19x faster than the
fallback code.
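Levente's observation can be made concrete: version [1] reads source indices x+1 .. y+1, while the primitive in [3] reads x .. y. A small C sketch of the two index ranges (1-based, as in Smalltalk; the helper names are mine, for illustration only):

```c
/* Source index ranges read by the two versions, using Smalltalk's
   1-based indexing.  Version [1] evaluates (array at: x + i) for
   i = 1 .. (y - x + 1); version [3] reads indices x .. y directly. */
typedef struct { int first, last; } Range;

Range src_range_v1(int x, int y)
{
    return (Range){ x + 1, x + (y - x + 1) };  /* i.e. x+1 .. y+1 */
}

Range src_range_v3(int x, int y)
{
    return (Range){ x, y };
}
```

With x = 2 and y = 4, version [1] reads indices 3..5 while [3] reads 2..4, so the element at x is indeed never copied by [1].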

Levente

Re: ASM questions for tight loops

Clément Bera-4
In reply to this post by Eliot Miranda-2
 


On Sat, Jan 6, 2018 at 7:16 PM, Eliot Miranda <[hidden email]> wrote:
Hi Clément,

On Sat, Jan 6, 2018 at 12:00 AM, Clément Bera <[hidden email]> wrote:
Hi Eliot, (sorry long mail but I think you will enjoy this discussion; read it)

Looking deeper into tight loops performance to get some boost out of Sista, I have questions and remarks.

First thing is when I compare 3 versions of array copy (code at the bottom of the mail).
Version [1] : copyArray : written in plain Smalltalk
Version [2] : fastCopyArray : sista generated code (only unchecked operations)
Version [3] : primCopyArray : simply call the machine code primitive

For small arrays everything is fine, sista code is good. But for large arrays I get the following results (here a copy of 1000 elements):
1) 85k/second
2) 390k/second
3) 2,000k/second

So the sista version is ~5x faster than pure Smalltalk code, but ~5x slower than the primitive.

My guess is that the reason for this is that the hand-written primitive uses a 7-instruction tight loop (2 additions, 1 read, 1 write, 2 jumps, 1 comparison), while the sista-generated code requires 6 temp reads, 1 temp write and a couple of cheap instructions in addition. We discussed earlier that register allocation might bring a 20% performance boost, but it seems tight loops are quite important, and in this case the performance difference is around 5x (500%).

This is for me very good news.  I thought there was potential in good register allocation and this confirms it.
 

So the big question is: should I add more macro-instructions (let's call them stubs) to the sista extended bytecode set, such as bytesEquals which bulk-compares the bytes of 2 objects, since the performance difference is bigger than expected, or should I wait for a better register allocation algorithm?

I think *very much* we should take the register allocation route.  If we go the other route we're following the well-worn path of systems like Python, where basic execution performance is poor and is solved by special-casing specific operations and implementing them with C primitives.  Eventually one gets to a system where all the performance-critical parts are in C, which has a couple of really bad effects.  First, nothing one does to increase basic performance has any effect, since all the time is spent in C, so one loses the impetus to improve basic language performance.  Second, much of the system is in large, monolithic, unchangeable chunks of C, so the system loses its dynamic-language flexibility.  We definitely want to move in the other direction.  So let's instead put in the effort to make register allocation work well.  We know what has to be done.  It isn't easy, but the long-term benefits are great.  We should be able to make many primitives entirely optional in the JIT.  [Brain fart: We may even be able to generate primitives for the Interpreter versions using our framework]

I believe a hand-written stub will always be faster than sista-generated code (since a couple of things can't be expressed at the bytecode level), but based on multiple benchmarks I think that if the better register allocator generates code with the same number of memory reads and writes (including temp access on the stack), the difference should be less than the 20% we expected.

Good, then I think we are in agreement.  We can live without stubs.
 
It's not clear how many stubs would be needed. I have a prototype of array copy; other things such as indexOf or fillObject are needed. Maybe I can add only 3-4 for now to show massive speed-ups, like we did with bytesEquals.

I don't think it's the right way to go.  Having these for performance comparison is good, but having them as general features of inline primitives is, I think, a long-term strategic error, as outlined above.

Yeah, my main concern is with maintaining huge assembly code stubs. I did it for bytesEquals already, and it is hard to maintain (80 lines in total for the stub?). Having 5 of those is annoying.
 
 
Let's note that we could try to share the stubs with normal primitives (primStringReplace with the array copy stub, for example) to avoid too much maintenance cost. The problem is that normal primitives have their complexity in all the checks at the beginning, followed by a generic case (a copy that knows nothing about the operands), while sista stubs have no checks but their complexity is in generating better code when some operands are constants. So it's not really the same thing.

Yes, this does seem interesting.  But medium-term interpreter performance isn't critical.  Let's get the JIT performance good first.

I mean primitives in the JIT, like genPrimStringReplace, versus inline primitives.

I would love to have the interpreter primitives and the main interpreter loop generated from the JIT back-end, so switching from the interpreter primitives and the interpreter loop to jitted code would be cheap (same calling conventions, etc.). But let's do that later.
 
Second question is about the RTL code for tight loops. I looked into V8 and it seems that for tight loops they try to use a backward conditional jump instead of a forward conditional jump plus a backward unconditional jump. What do you think the difference would be if I did this for array copy (the copy is guaranteed not to be empty at this point; example below, Original vs New)? I wonder if conditional back jumps are slower than conditional forward jumps or something like that (I am always confused by branch performance, and since performance differs depending on caches, small benchmarks may not be relevant). What's your take on this?

That's a good question.  I don't know.  The current layout is dictated by the fact that backwards jumps are interrupt points, along with the bytecode-to-machine-code pc mapping.  The code looks like this, this being from SequenceableCollection>>indexOf:startingAt:

startIndex to: self size do: [ :index |
(self at: index) == anElement ifTrue: [ ^index ] ].
^0

The end of the loop's bytecodes are

61 <B0> send: +
62 <6A> popIntoTemp: 2
63 <A3 ED> jumpTo: 46
65 <75> pushConstant: 0
66 <7C> returnTop

is compiled into

000010f2: movl $0x00100028=#+, %ecx : B9 28 00 10 00 
000010f7: call .+0xfffff37c (0x00000478=ceSend1Args) : E8 7C F3 FF FF 
IsSendCall + bc 61/62:
000010fc: movl %edx, -28(%ebp) : 89 55 E4 
000010ff: movl %ds:0x80000='stackLimit', %eax : A1 00 00 08 00 
00001104: cmpl %eax, %esp : 39 C4 
00001106: jnb .+0xffffff5e (0x0000106a=indexOf:startingAt:@6A) : 0F 83 5E FF FF FF 
0000110c: call .+0xfffffa87 (0x00000b98=ceCheckForInterruptsTrampoline) : E8 87 FA FF FF 
IsRelativeCall:
HasBytecodePC bc 62/63:
00001111: jmp .+0xffffff54 (0x0000106a=indexOf:startingAt:@6A) : E9 54 FF FF FF 
00001116: movl $0x00000001, %edx : BA 01 00 00 00 
0000111b: movl %ebp, %esp : 89 EC 
0000111d: popl %ebp : 5D 
0000111e: ret $0x000c : C2 0C 00 

So the suspension point is the pc following the call of ceCheckForInterruptsTrampoline, which is 00001111.  If execution continues after an interruption the jump is taken and the loop continues, as required.  [I remember an early stage in Cog's development at Qwaq when I hadn't implemented this and loops were terminating every time there was an interrupt, such as when one wiggled the mouse.  The system almost worked; it was quite bizarre :-) ]

But notice that the conditional jump at 00001106 *is* backwards, as you desire.  It seems we're already doing the right thing, at least with the backwards jump.  What do we do with backwards jumps in SistaV1 inline primitives?

I have added backjumpNoInterrupt.
 


And then the question is: should I generate that kind of loop from the sista back-end (I don't really like it; I prefer to limit the number of back jumps, else it makes some things harder)? Or I can do it in the stubs for the most performance-critical things and ignore it in other cases.

Original
instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.
jumpFinished jmpTarget: (jumpEmpty jmpTarget: cogit genPrimReturn).
Total 7 instructions in the loop
New
instr := cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
        cogit CmpR: startReg R: stopReg.
cogit JumpGreaterOrEqual: instr.
        jumpEmpty jmpTarget: cogit genPrimReturn.
Total 6 instructions in the loop
But a conditional back-jump.

Yes, the latter looks better and matches what we're doing with the backwards jump.
 

NB: Eliot, I tried incrementing the pointer instead of the index and using 0-displacement reads/writes, and it made no performance difference on Intel. It saves a register in the loop, but we don't need it. It makes a difference in the inlined stub however, since other registers can be used for other things in the method where it's inlined.

In general we should always try to reduce register pressure.  So if we can write the primitive using incremented pointers instead of base registers and incremented indexes we should.  In some complex loop, once we have good register allocation, that extra register will be put to good use and we can expect that we would see increased performance.


[1] Smalltalk code
1 to: (y - x + 1) do: [ :i |
array2 at: y2 + i put: (array at: x + i) ].
[2] (sista bytecodes of the tight loop)
32 <45> pushTemp: 5
33 <46> pushTemp: 6
34 <F8 F3 87> smiLessOrEqual:
37 <EF 1B> jumpFalse: 66
39 <43> pushTemp: 3
40 <45> pushTemp: 5
41 <40> pushTemp: 0
42 <45> pushTemp: 5
43 <F8 10 88> pointerAt:
46 <F8 B8 8B> pointerAt:put:
49 <D8> pop
50 <45> pushTemp: 5
51 <E1 00 E8 01> pushConstant: 1
55 <F8 D0 87> smiAdd:
58 <D5> popIntoTemp: 5
59 <E1 FF E8 DE> pushConstant: -34
63 <F8 70 97> backjumpNoInterrupt

I'm confused by "59 <E1 FF E8 DE> pushConstant: -34"; what's that?  Are these actually extension bytes for the backwards jump?  Can't be, because the E8 wouldn't be there.  So what is this?

-34 is the distance of the backward jump, pushed on the stack. #backjumpNoInterrupt is an inlined primitive expecting the distance to be on the stack, pushed with a long-form pushInteger bytecode (the guaranteed long form simplifies the Cogit's scan for fixups and latestContinuation). That way no inline primitive requires extensions, and the callPrimitive bytecode does not use any extensions. I changed the inlined primitives from 6000 onwards to be jumps with different numbers of parameters (jumpReadOnly, jumpYoung, backwardJumpNotInterrupt, etc.). It works really well. I am finishing patches to Scorch and a performance evaluation to commit to VMMaker.
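If I read the SistaV1 extension scheme correctly (E1 being an ExtB extension whose signed argument is combined with the following E8 pushConstant argument as a high byte), the -34 decodes as follows. This is a sketch of my reading of the encoding, not a quote from the VM source:

```c
#include <stdint.h>

/* Decode "<E1 FF> <E8 DE>" -> pushConstant: -34, assuming the rule
   value = (signed extB) * 256 + argument byte.  The rule is my reading
   of the SistaV1 extension scheme. */
int decode_push_constant(uint8_t extB, uint8_t arg)
{
    return (int)(int8_t)extB * 256 + arg;  /* -1 * 256 + 0xDE */
}
```

decode_push_constant(0xFF, 0xDE) gives -256 + 222 = -34, matching the disassembly.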
 

[3] primitive call
array2 replaceFrom: x to: y with: array startingAt: y2
Current tight loop RTL:
instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.
jumpFinished jmpTarget: (jumpEmpty jmpTarget: cogit genPrimReturn).

Note that if you were to maintain one pointer and the delta between the source and destination pointers, instead of a source pointer and a destination pointer, you'd only have to increment one register (for the one pointer) in the loop, which would save another instruction.
 

Yeah, I wanted to do that at some point. It can be improved.

I did the simplest thing that could possibly work, and the performance is better than the C primitive, so I did not go any further. Remember that later we'll need to compare this performance with the sista-generated code. For byte copies it needs to be improved (it is slower than the C primitive).
 

_,,,^..^,,,_
best, Eliot



--
Clément Béra
Pharo consortium engineer
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq

Re: ASM questions for tight loops

Clément Bera-4
 
Note that for loops I tried to add backjumpNoInterrupt and backjumpAlwaysInterrupt to wrap things around at the Scorch level (to split a loop into inner and outer loops, with only the outer one being interrupted, to improve branch prediction), but I have not successfully implemented backjumpAlwaysInterrupt yet.

I use backjumpNoInterrupt for small loops with a fixed iteration count that are too big to be fully unrolled, and when I inline specific primitives such as array copies or atAllPut:. It makes a performance difference in those cases.

I can answer and discuss more about this on Monday. Busy week-end.

On Sat, Jan 6, 2018 at 9:11 PM, Clément Bera <[hidden email]> wrote:


On Sat, Jan 6, 2018 at 7:16 PM, Eliot Miranda <[hidden email]> wrote:
Hi Clément,

On Sat, Jan 6, 2018 at 12:00 AM, Clément Bera <[hidden email]> wrote:
Hi Eliot, (sorry long mail but I think you will enjoy this discussion; read it)

Looking deeper into tight loops performance to get some boost out of Sista, I have questions and remarks.

First thing is when I compare 3 versions or array copies (code at the bottom of mail).
Version [1] : copyArray : written in plain Smalltalk
Version [2] : fastCopyArray : sista generated code (only unchecked operations)
Version [3] : primCopyArray : simply call the machine code primitive

For small arrays everything is fine, sista code is good. But for large arrays I get the following results (here a copy of 1000 elements):
1) 85k/second
2) 390k/second
3) 2,000k/second

So the sista version is ~5x compared to pure smalltalk code, but ~5x slower to the primitive.

My guess is that the reason for this is that the hand written primitive is written with a 7 instructions tight loop (2 additions, 1 read, 1 write, 2 jumps, 1 comp), while the sista generated code requires 6 temp read, 1 temp write and a couple cheap instructions in addition. We discussed earlier than register allocation might bring a 20% performance boost, but it seems tight loops are quite important and in this case the performance difference is around 5x (500%).

This is for me very good news.  I thought there was potential in good register allocation and this confirms it.
 

So big question is that should I had more macro-instructions (let's call them stubs) in the sista extended bytecode set such as bytesEquals which bulkCompares the bytes of 2 objects, since performance difference is bigger than expected, or should I wait for better register allocation algorithm ? 

I think *very much* we should take the register allocation route.  If we go the other route we're following the well-worn path of systems like Python where basic execution performance is poor and is solved by special-casing specific operations and implementing them with C primitives.  Eventually one gets to a system where all the performance critical parts are in C, which has a couple of really bad effects.  First, nothing one does to increase basic performance has any effect since all the time is in C, so one looses the impetus to improve basic language performance.  Second, much of the system is in large mono;ethic and unchangeable chunks of C, so the system looses its dynamic language derived flexibility.  We definitely want to move in the other direction.  So let's instead put in the effort to make register allocation work well.  We know what has to be done.  It isn't easy but the long-term benefits are great.  We should be able to make many primitives entirely optional in the JIT.  [Brain fart: We may even be able to generate primitives for the Interpreter versions using our framework]

I believe hand-written stubs will always be faster than Sista-generated code (since a couple of things can't be expressed at the bytecode level), but based on multiple benchmarks I think that if the better register allocator generates code with the same number of memory reads/writes (including temp accesses on the stack), the difference should be less than the 20% we expected.

Good, then I think we are in agreement.  We can live without stubs.
 
It's not clear how many stubs would be needed. I have a prototype of array copy; other things such as indexOf or fillObject are needed. Maybe I can add only 3-4 for now to show massive speed-ups like we did with bytesEquals.

I don't think it's the right way to go.  Having these for performance comparison is good, but having them as general features of inline primitives is, I think, a long-term strategic error, as outlined above.

Yeah, my main concern is with maintaining huge assembly code stubs. I did it for byteEquals already, and it is hard to maintain (80 lines in total for the stub?). Having 5 of those is annoying.
 
 
Let's note that we could try to share the stubs with normal primitives (primStringReplace with the arraycopy stub, for example) to avoid too much maintenance cost. The problem is that normal primitives have their complexity in all the checks at the beginning, followed by a generic case (copy without knowing anything about the operands), while Sista stubs have no checks, but their complexity lies in generating better code when some operands are constants. So it's not really the same thing.

Yes, this does seem interesting.  But medium-term interpreter performance isn't critical.  Let's get the JIT performance good first.

I mean primitives in the JIT, like genPrimStringReplace, versus inline primitives.

I would love to have interpreter primitives and the main interpreter loop generated from the JIT back-end, so switching from the interpreter primitives and the interpreter loop to jitted code would be cheap (same calling conventions, etc.). But let's do that later.
 
Second question is about RTL code for tight loops. I looked into V8, and it seems that for tight loops they try to use a backward conditional jump instead of a forward conditional + backward unconditional jump. What do you think the difference would be if I did this for array copy (the copy is guaranteed not to be empty at this point; example below, Original vs New)? I wonder if conditional back jumps are slower than conditional forward jumps or something like that (I am always confused about branch performance, and performance differs depending on caches, so small benchmarks may not be relevant). What's your take on this?

That's a good question.  I don't know.  The current layout is dictated by the fact that backward jumps are interrupt points, along with bytecode-to-machine-code pc mapping.  The code looks like this, this being from SequenceableCollection>>indexOf:startingAt:

startIndex to: self size do: [ :index |
(self at: index) == anElement ifTrue: [ ^index ] ].
^0

The end of the loop's bytecodes are

61 <B0> send: +
62 <6A> popIntoTemp: 2
63 <A3 ED> jumpTo: 46
65 <75> pushConstant: 0
66 <7C> returnTop

is compiled into

000010f2: movl $0x00100028=#+, %ecx : B9 28 00 10 00 
000010f7: call .+0xfffff37c (0x00000478=ceSend1Args) : E8 7C F3 FF FF 
IsSendCall + bc 61/62:
000010fc: movl %edx, -28(%ebp) : 89 55 E4 
000010ff: movl %ds:0x80000='stackLimit', %eax : A1 00 00 08 00 
00001104: cmpl %eax, %esp : 39 C4 
00001106: jnb .+0xffffff5e (0x0000106a=indexOf:startingAt:@6A) : 0F 83 5E FF FF FF 
0000110c: call .+0xfffffa87 (0x00000b98=ceCheckForInterruptsTrampoline) : E8 87 FA FF FF 
IsRelativeCall:
HasBytecodePC bc 62/63:
00001111: jmp .+0xffffff54 (0x0000106a=indexOf:startingAt:@6A) : E9 54 FF FF FF 
00001116: movl $0x00000001, %edx : BA 01 00 00 00 
0000111b: movl %ebp, %esp : 89 EC 
0000111d: popl %ebp : 5D 
0000111e: ret $0x000c : C2 0C 00 

So the suspension point is the pc following the call of ceCheckForInterruptsTrampoline, which is 00001111.  If execution continues after interruption the jump is taken and the loop continues, as required.  [I remember an early stage in Cog's development at Qwaq when I hadn't implemented this and loops were terminating every time there was an interrupt, such as when one wiggled the mouse.  The system almost worked; it was quite bizarre :-) ].

But notice that the conditional jump at 00001106 *is* backwards as you desire.  It seems we're already doing the right thing, at least with the backwards jump.  What do we do with backwards jumps in SistaV1 inline primitives?

I have added backjumpNoInterrupt.
 


And then the question is: should I generate that kind of loop from the Sista back-end (I don't really like it; I prefer to limit the number of back-jumps, else it makes some things harder)? Otherwise I can do it in the stubs for the most performance-critical things and ignore it in other cases.

Original
instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.
jumpFinished jmpTarget: (jumpEmpty jmpTarget: cogit genPrimReturn).
Total 7 instructions in the loop
New
instr := cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit CmpR: startReg R: stopReg.
cogit JumpGreaterOrEqual: instr.
jumpEmpty jmpTarget: cogit genPrimReturn.
Total 6 instructions in the loop
But a conditional back-jump.

Yes, the latter looks better and matches what we're doing with the backwards jump.
 

NB: Eliot, I tried incrementing the pointer instead of the index and using 0-displacement reads/writes, and it made no performance difference on Intel. It saves a register in the loop, but we don't need it there. It makes a difference in the inlined stub, however, since other registers can be used for other things in the method where it's inlined.

In general we should always try to reduce register pressure.  So if we can write the primitive using incremented pointers instead of base registers and incremented indexes, we should.  In some complex loop, once we have good register allocation, that extra register will be put to good use and we can expect to see increased performance.


[1] Smalltalk code
1 to: (y - x + 1) do: [ :i |
array2 at: y2 + i put: (array at: x + i) ].
[2] (sista bytecodes of the tight loop)
32 <45> pushTemp: 5
33 <46> pushTemp: 6
34 <F8 F3 87> smiLessOrEqual:
37 <EF 1B> jumpFalse: 66
39 <43> pushTemp: 3
40 <45> pushTemp: 5
41 <40> pushTemp: 0
42 <45> pushTemp: 5
43 <F8 10 88> pointerAt:
46 <F8 B8 8B> pointerAt:put:
49 <D8> pop
50 <45> pushTemp: 5
51 <E1 00 E8 01> pushConstant: 1
55 <F8 D0 87> smiAdd:
58 <D5> popIntoTemp: 5
59 <E1 FF E8 DE> pushConstant: -34
63 <F8 70 97> backjumpNoInterrupt

I'm confused by "59 <E1 FF E8 DE> pushConstant: -34"; what's that?  Are these actually extension bytes for the backwards jump?  Can't be, because the E8 wouldn't be there.  So what is this?

-34 is the distance of the backward jump, pushed on the stack. #backjumpNoInterrupt is an inlined primitive expecting the distance to be on the stack, pushed with a long-form pushInteger bytecode (the guaranteed long form simplifies the Cogit's scan for fixups and latestContinuation). That way no inline primitive requires extensions, and the callPrimitive bytecode does not use any extensions. I changed the inlined primitives from 6000 onwards to be jumps with different numbers of parameters (jumpReadOnly, jumpYoung, backwardJumpNotInterrupt, etc.). It works really well. I am finishing patches to Scorch and evaluating performance, then I'll commit to VMMaker.
 

[3] primitive call
array2 replaceFrom: x to: y with: array startingAt: y2
Current tight loop RTL:
instr := cogit CmpR: startReg R: stopReg.
jumpFinished := cogit JumpBelow: 0.
cogit MoveXwr: repStartReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit AddCq: 1 R: repStartReg.
cogit Jump: instr.
jumpFinished jmpTarget: (jumpEmpty jmpTarget: cogit genPrimReturn).

Note that if you were to maintain one pointer and the delta between the source and destination pointers, instead of a source pointer and a destination pointer, you'd only have to increment one register (for the one pointer) in the loop, which would save another instruction.
 

Yeah I wanted to do that at some point. It can be improved.

I did the simplest thing that could possibly work, and the performance is better than the C primitive, so I did not go any further. Remember that later we'll need to compare this performance with the Sista-generated code. For byte copies it needs to be improved (it's slower than the C primitive).
 

_,,,^..^,,,_
best, Eliot



--
Clément Béra
Pharo consortium engineer
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq

