Debugging Spur32 genPrimitiveHighBit

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Debugging Spur32 genPrimitiveHighBit

Nicolas Cellier
 
Hi all,
I confirm that generated code for CLZ (LZCNT) is incorrect on IA32 arch.
The registers are swapped!

Here is an extract:

    0x5c26a9c: 83 e0 01        andl   $0x1, %eax
    0x5c26a9f: eb 11           jmp    0x5c26ab2
    0x5c26aa1: 90              nop    
    0x5c26aa2: 90              nop    
    0x5c26aa3: 90              nop    
    0x5c26aa4: 89 d0           movl   %edx, %eax
    0x5c26aa6: 83 e0 03        andl   $0x3, %eax
    0x5c26aa9: 75 f1           jne    0x5c26a9c
    0x5c26aab: 8b 02           movl   (%edx), %eax
    0x5c26aad: 25 ff ff 3f 00  andl   $0x3fffff, %eax           ; imm = 0x3FFFFF
    0x5c26ab2: 39 c8           cmpl   %ecx, %eax
    0x5c26ab4: 75 e0           jne    0x5c26a96
    0x5c26ab6: f3 0f bd d0     lzcntl %eax, %edx
    0x5c26aba: 74 0d           je     0x5c26ac9
    0x5c26abc: 35 1f 00 00 00  xorl   $0x1f, %eax
    0x5c26ac1: 89 c2           movl   %eax, %edx
    0x5c26ac3: d1 e2           shll   %edx
    0x5c26ac5: 83 c2 01        addl   $0x1, %edx
    0x5c26ac8: c3              retl  

What happens is that we count the leading zeros in $eax (TempReg) and store the result in $edx (ReceiverResultReg) ...

    0x5c26ab6: f3 0f bd d0     lzcntl %eax, %edx

We want the contrary!

$eax contains 1, presumably because we used it to check for SmallInteger tag bit:

    0x5c26a9c: 83 e0 01        andl   $0x1, %eax

So we invariably get 31 leading zeroes in $edx (but we will later overwrite the contents of $edx).

Then we interpret $eax as the result (thus 1 leading zero), bitInvert, and get 30 as the result for highBit, store that in $edx (shifted and tagged), and we're done... Err!

Obviously the code generation is wrong!
It did work when I first wrote it, and still work on x64 because we use $eax (TempReg) as both source and dest reg...

Though, I do not see what we are doing wrong:

concretizeClzRR
<inline: true>
| maskReg dest  |
maskReg := operands at: 0.
dest := operands at: 1.
machineCode
at: 0 put: 16rF3;
at: 1 put: 16r0F;
at: 2 put: 16rBD;
at: 3 put: (self mod: ModReg RM: dest RO: maskReg).
^4

and we invoke it like that:
cogit ClzR: srcReg R: destReg.

ClzR: reg1 R: reg2
"reg2 := reg1 countLeadingZeros"
<inline: true>
<returnTypeC: #'AbstractInstruction *'>
^self gen: ClzRR operand: reg1 operand: reg2

So it seems to me that all is in the correct order...
cogitIA32 likewise seems perfrectly correct:

static sqInt
genPrimitiveHighBit(void)
{
    AbstractInstruction *anInstruction11;
    AbstractInstruction *anInstruction2;
    AbstractInstruction *anInstruction4;
    AbstractInstruction *jumpNegativeReceiver;
    AbstractInstruction *jumpNegativeReceiver11;
    AbstractInstruction *jumpNegativeReceiver3;
    sqInt literal1;


        /* remove excess tag bits from the receiver oop */

        /* and use the abstract cogit facility for case of single tag-bit */
        /* begin genHighBitIn:ofSmallIntegerOopWithSingleTagBit: */
        if (((ceCheckLZCNT()) & (1U << 5)) != 0) {
                /* begin genHighBitClzIn:ofSmallIntegerOopWithSingleTagBit: */
                genoperandoperand(ClzRR, ReceiverResultReg, TempReg);
                if (!(setsConditionCodesFor(lastOpcode(), JumpZero))) {
                        /* begin checkQuickConstant:forInstruction: */
                        anInstruction2 = genoperandoperand(CmpCqR, 0, TempReg);
                }

                /* Note the nice bit trick below:
                   highBit_1based_of_small_int_value = (BytesPerWord * 8) - leadingZeroCout_of_oop - 1 toAccountForTagBit.
                   This is like 2 complements (- reg - 1) on (BytesPerWord * 8) log2 bits, or exactly a bit invert operation... */
                jumpNegativeReceiver3 = genConditionalBranchoperand(JumpZero, ((sqInt)0));
                /* begin checkLiteral:forInstruction: */
                literal1 = (BytesPerWord * 8) - 1;
                anInstruction11 = genoperandoperand(XorCwR, (BytesPerWord * 8) - 1, TempReg);
                jumpNegativeReceiver = jumpNegativeReceiver3;
                goto l10;
        }

which concretize in:

        case ClzRR:
                /* begin concretizeClzRR */
                maskReg = ((self_in_dispatchConcretize->operands))[0];
                dest = ((self_in_dispatchConcretize->operands))[1];
                ((self_in_dispatchConcretize->machineCode))[0] = 243;
                ((self_in_dispatchConcretize->machineCode))[1] = 15;
                ((self_in_dispatchConcretize->machineCode))[2] = 189;
                ((self_in_dispatchConcretize->machineCode))[3] = (modRMRO(self_in_dispatchConcretize, ModReg, dest, maskReg));
                return 4;

The order seems correct all the way down...
As a workaround, I could revert Eliot's optimization and force a
    cogit MoveR: ReceiverResultReg R: TempReg.
But I'd rather want to understand where's the problem...
Another pair of eyes may help!

Reply | Threaded
Open this post in threaded view
|

Re: Debugging Spur32 genPrimitiveHighBit

Nicolas Cellier
 
It seems that I miss-interpreted the order of mod:RM:RO:, it seems like O means output, not operand...

Le mar. 18 févr. 2020 à 22:53, Nicolas Cellier <[hidden email]> a écrit :
Hi all,
I confirm that generated code for CLZ (LZCNT) is incorrect on IA32 arch.
The registers are swapped!

Here is an extract:

    0x5c26a9c: 83 e0 01        andl   $0x1, %eax
    0x5c26a9f: eb 11           jmp    0x5c26ab2
    0x5c26aa1: 90              nop    
    0x5c26aa2: 90              nop    
    0x5c26aa3: 90              nop    
    0x5c26aa4: 89 d0           movl   %edx, %eax
    0x5c26aa6: 83 e0 03        andl   $0x3, %eax
    0x5c26aa9: 75 f1           jne    0x5c26a9c
    0x5c26aab: 8b 02           movl   (%edx), %eax
    0x5c26aad: 25 ff ff 3f 00  andl   $0x3fffff, %eax           ; imm = 0x3FFFFF
    0x5c26ab2: 39 c8           cmpl   %ecx, %eax
    0x5c26ab4: 75 e0           jne    0x5c26a96
    0x5c26ab6: f3 0f bd d0     lzcntl %eax, %edx
    0x5c26aba: 74 0d           je     0x5c26ac9
    0x5c26abc: 35 1f 00 00 00  xorl   $0x1f, %eax
    0x5c26ac1: 89 c2           movl   %eax, %edx
    0x5c26ac3: d1 e2           shll   %edx
    0x5c26ac5: 83 c2 01        addl   $0x1, %edx
    0x5c26ac8: c3              retl  

What happens is that we count the leading zeros in $eax (TempReg) and store the result in $edx (ReceiverResultReg) ...

    0x5c26ab6: f3 0f bd d0     lzcntl %eax, %edx

We want the contrary!

$eax contains 1, presumably because we used it to check for SmallInteger tag bit:

    0x5c26a9c: 83 e0 01        andl   $0x1, %eax

So we invariably get 31 leading zeroes in $edx (but we will later overwrite the contents of $edx).

Then we interpret $eax as the result (thus 1 leading zero), bitInvert, and get 30 as the result for highBit, store that in $edx (shifted and tagged), and we're done... Err!

Obviously the code generation is wrong!
It did work when I first wrote it, and still work on x64 because we use $eax (TempReg) as both source and dest reg...

Though, I do not see what we are doing wrong:

concretizeClzRR
<inline: true>
| maskReg dest  |
maskReg := operands at: 0.
dest := operands at: 1.
machineCode
at: 0 put: 16rF3;
at: 1 put: 16r0F;
at: 2 put: 16rBD;
at: 3 put: (self mod: ModReg RM: dest RO: maskReg).
^4

and we invoke it like that:
cogit ClzR: srcReg R: destReg.

ClzR: reg1 R: reg2
"reg2 := reg1 countLeadingZeros"
<inline: true>
<returnTypeC: #'AbstractInstruction *'>
^self gen: ClzRR operand: reg1 operand: reg2

So it seems to me that all is in the correct order...
cogitIA32 likewise seems perfrectly correct:

static sqInt
genPrimitiveHighBit(void)
{
    AbstractInstruction *anInstruction11;
    AbstractInstruction *anInstruction2;
    AbstractInstruction *anInstruction4;
    AbstractInstruction *jumpNegativeReceiver;
    AbstractInstruction *jumpNegativeReceiver11;
    AbstractInstruction *jumpNegativeReceiver3;
    sqInt literal1;


        /* remove excess tag bits from the receiver oop */

        /* and use the abstract cogit facility for case of single tag-bit */
        /* begin genHighBitIn:ofSmallIntegerOopWithSingleTagBit: */
        if (((ceCheckLZCNT()) & (1U << 5)) != 0) {
                /* begin genHighBitClzIn:ofSmallIntegerOopWithSingleTagBit: */
                genoperandoperand(ClzRR, ReceiverResultReg, TempReg);
                if (!(setsConditionCodesFor(lastOpcode(), JumpZero))) {
                        /* begin checkQuickConstant:forInstruction: */
                        anInstruction2 = genoperandoperand(CmpCqR, 0, TempReg);
                }

                /* Note the nice bit trick below:
                   highBit_1based_of_small_int_value = (BytesPerWord * 8) - leadingZeroCout_of_oop - 1 toAccountForTagBit.
                   This is like 2 complements (- reg - 1) on (BytesPerWord * 8) log2 bits, or exactly a bit invert operation... */
                jumpNegativeReceiver3 = genConditionalBranchoperand(JumpZero, ((sqInt)0));
                /* begin checkLiteral:forInstruction: */
                literal1 = (BytesPerWord * 8) - 1;
                anInstruction11 = genoperandoperand(XorCwR, (BytesPerWord * 8) - 1, TempReg);
                jumpNegativeReceiver = jumpNegativeReceiver3;
                goto l10;
        }

which concretize in:

        case ClzRR:
                /* begin concretizeClzRR */
                maskReg = ((self_in_dispatchConcretize->operands))[0];
                dest = ((self_in_dispatchConcretize->operands))[1];
                ((self_in_dispatchConcretize->machineCode))[0] = 243;
                ((self_in_dispatchConcretize->machineCode))[1] = 15;
                ((self_in_dispatchConcretize->machineCode))[2] = 189;
                ((self_in_dispatchConcretize->machineCode))[3] = (modRMRO(self_in_dispatchConcretize, ModReg, dest, maskReg));
                return 4;

The order seems correct all the way down...
As a workaround, I could revert Eliot's optimization and force a
    cogit MoveR: ReceiverResultReg R: TempReg.
But I'd rather want to understand where's the problem...
Another pair of eyes may help!

Reply | Threaded
Open this post in threaded view
|

Re: Debugging Spur32 genPrimitiveHighBit

Nicolas Cellier
 
mod is the mod field (2 bits) RM is the r/m field (3 bits), RO is the reg (or opcode) field (3 bits).
For LZCNT dest is in RO, and mask (source) is in r/m.
Same for BSR.
So I had it wrong... will fix ASAP

Le mar. 18 févr. 2020 à 23:37, Nicolas Cellier <[hidden email]> a écrit :
It seems that I miss-interpreted the order of mod:RM:RO:, it seems like O means output, not operand...

Le mar. 18 févr. 2020 à 22:53, Nicolas Cellier <[hidden email]> a écrit :
Hi all,
I confirm that generated code for CLZ (LZCNT) is incorrect on IA32 arch.
The registers are swapped!

Here is an extract:

    0x5c26a9c: 83 e0 01        andl   $0x1, %eax
    0x5c26a9f: eb 11           jmp    0x5c26ab2
    0x5c26aa1: 90              nop    
    0x5c26aa2: 90              nop    
    0x5c26aa3: 90              nop    
    0x5c26aa4: 89 d0           movl   %edx, %eax
    0x5c26aa6: 83 e0 03        andl   $0x3, %eax
    0x5c26aa9: 75 f1           jne    0x5c26a9c
    0x5c26aab: 8b 02           movl   (%edx), %eax
    0x5c26aad: 25 ff ff 3f 00  andl   $0x3fffff, %eax           ; imm = 0x3FFFFF
    0x5c26ab2: 39 c8           cmpl   %ecx, %eax
    0x5c26ab4: 75 e0           jne    0x5c26a96
    0x5c26ab6: f3 0f bd d0     lzcntl %eax, %edx
    0x5c26aba: 74 0d           je     0x5c26ac9
    0x5c26abc: 35 1f 00 00 00  xorl   $0x1f, %eax
    0x5c26ac1: 89 c2           movl   %eax, %edx
    0x5c26ac3: d1 e2           shll   %edx
    0x5c26ac5: 83 c2 01        addl   $0x1, %edx
    0x5c26ac8: c3              retl  

What happens is that we count the leading zeros in $eax (TempReg) and store the result in $edx (ReceiverResultReg) ...

    0x5c26ab6: f3 0f bd d0     lzcntl %eax, %edx

We want the contrary!

$eax contains 1, presumably because we used it to check for SmallInteger tag bit:

    0x5c26a9c: 83 e0 01        andl   $0x1, %eax

So we invariably get 31 leading zeroes in $edx (but we will later overwrite the contents of $edx).

Then we interpret $eax as the result (thus 1 leading zero), bitInvert, and get 30 as the result for highBit, store that in $edx (shifted and tagged), and we're done... Err!

Obviously the code generation is wrong!
It did work when I first wrote it, and still work on x64 because we use $eax (TempReg) as both source and dest reg...

Though, I do not see what we are doing wrong:

concretizeClzRR
<inline: true>
| maskReg dest  |
maskReg := operands at: 0.
dest := operands at: 1.
machineCode
at: 0 put: 16rF3;
at: 1 put: 16r0F;
at: 2 put: 16rBD;
at: 3 put: (self mod: ModReg RM: dest RO: maskReg).
^4

and we invoke it like that:
cogit ClzR: srcReg R: destReg.

ClzR: reg1 R: reg2
"reg2 := reg1 countLeadingZeros"
<inline: true>
<returnTypeC: #'AbstractInstruction *'>
^self gen: ClzRR operand: reg1 operand: reg2

So it seems to me that all is in the correct order...
cogitIA32 likewise seems perfrectly correct:

static sqInt
genPrimitiveHighBit(void)
{
    AbstractInstruction *anInstruction11;
    AbstractInstruction *anInstruction2;
    AbstractInstruction *anInstruction4;
    AbstractInstruction *jumpNegativeReceiver;
    AbstractInstruction *jumpNegativeReceiver11;
    AbstractInstruction *jumpNegativeReceiver3;
    sqInt literal1;


        /* remove excess tag bits from the receiver oop */

        /* and use the abstract cogit facility for case of single tag-bit */
        /* begin genHighBitIn:ofSmallIntegerOopWithSingleTagBit: */
        if (((ceCheckLZCNT()) & (1U << 5)) != 0) {
                /* begin genHighBitClzIn:ofSmallIntegerOopWithSingleTagBit: */
                genoperandoperand(ClzRR, ReceiverResultReg, TempReg);
                if (!(setsConditionCodesFor(lastOpcode(), JumpZero))) {
                        /* begin checkQuickConstant:forInstruction: */
                        anInstruction2 = genoperandoperand(CmpCqR, 0, TempReg);
                }

                /* Note the nice bit trick below:
                   highBit_1based_of_small_int_value = (BytesPerWord * 8) - leadingZeroCout_of_oop - 1 toAccountForTagBit.
                   This is like 2 complements (- reg - 1) on (BytesPerWord * 8) log2 bits, or exactly a bit invert operation... */
                jumpNegativeReceiver3 = genConditionalBranchoperand(JumpZero, ((sqInt)0));
                /* begin checkLiteral:forInstruction: */
                literal1 = (BytesPerWord * 8) - 1;
                anInstruction11 = genoperandoperand(XorCwR, (BytesPerWord * 8) - 1, TempReg);
                jumpNegativeReceiver = jumpNegativeReceiver3;
                goto l10;
        }

which concretize in:

        case ClzRR:
                /* begin concretizeClzRR */
                maskReg = ((self_in_dispatchConcretize->operands))[0];
                dest = ((self_in_dispatchConcretize->operands))[1];
                ((self_in_dispatchConcretize->machineCode))[0] = 243;
                ((self_in_dispatchConcretize->machineCode))[1] = 15;
                ((self_in_dispatchConcretize->machineCode))[2] = 189;
                ((self_in_dispatchConcretize->machineCode))[3] = (modRMRO(self_in_dispatchConcretize, ModReg, dest, maskReg));
                return 4;

The order seems correct all the way down...
As a workaround, I could revert Eliot's optimization and force a
    cogit MoveR: ReceiverResultReg R: TempReg.
But I'd rather want to understand where's the problem...
Another pair of eyes may help!

Reply | Threaded
Open this post in threaded view
|

Re: Debugging Spur32 genPrimitiveHighBit

Nicolas Cellier
 
I think that my confusion came from the order of specification, like this

mod operand1:reg operand2:r/m


versus mod: modReg RM: r/m RO: reg
Intel uses such mixture of middle endian in the documentation too...


I have also further fixed the rexw:r:x:b: for X64 encoding of R8-R15 in LZCNT and BSR (but did not regenerate, that can wait)

Le mer. 19 févr. 2020 à 00:51, Nicolas Cellier <[hidden email]> a écrit :
mod is the mod field (2 bits) RM is the r/m field (3 bits), RO is the reg (or opcode) field (3 bits).
For LZCNT dest is in RO, and mask (source) is in r/m.
Same for BSR.
So I had it wrong... will fix ASAP

Le mar. 18 févr. 2020 à 23:37, Nicolas Cellier <[hidden email]> a écrit :
It seems that I miss-interpreted the order of mod:RM:RO:, it seems like O means output, not operand...

Le mar. 18 févr. 2020 à 22:53, Nicolas Cellier <[hidden email]> a écrit :
Hi all,
I confirm that generated code for CLZ (LZCNT) is incorrect on IA32 arch.
The registers are swapped!

Here is an extract:

    0x5c26a9c: 83 e0 01        andl   $0x1, %eax
    0x5c26a9f: eb 11           jmp    0x5c26ab2
    0x5c26aa1: 90              nop    
    0x5c26aa2: 90              nop    
    0x5c26aa3: 90              nop    
    0x5c26aa4: 89 d0           movl   %edx, %eax
    0x5c26aa6: 83 e0 03        andl   $0x3, %eax
    0x5c26aa9: 75 f1           jne    0x5c26a9c
    0x5c26aab: 8b 02           movl   (%edx), %eax
    0x5c26aad: 25 ff ff 3f 00  andl   $0x3fffff, %eax           ; imm = 0x3FFFFF
    0x5c26ab2: 39 c8           cmpl   %ecx, %eax
    0x5c26ab4: 75 e0           jne    0x5c26a96
    0x5c26ab6: f3 0f bd d0     lzcntl %eax, %edx
    0x5c26aba: 74 0d           je     0x5c26ac9
    0x5c26abc: 35 1f 00 00 00  xorl   $0x1f, %eax
    0x5c26ac1: 89 c2           movl   %eax, %edx
    0x5c26ac3: d1 e2           shll   %edx
    0x5c26ac5: 83 c2 01        addl   $0x1, %edx
    0x5c26ac8: c3              retl  

What happens is that we count the leading zeros in $eax (TempReg) and store the result in $edx (ReceiverResultReg) ...

    0x5c26ab6: f3 0f bd d0     lzcntl %eax, %edx

We want the contrary!

$eax contains 1, presumably because we used it to check for SmallInteger tag bit:

    0x5c26a9c: 83 e0 01        andl   $0x1, %eax

So we invariably get 31 leading zeroes in $edx (but we will later overwrite the contents of $edx).

Then we interpret $eax as the result (thus 1 leading zero), bitInvert, and get 30 as the result for highBit, store that in $edx (shifted and tagged), and we're done... Err!

Obviously the code generation is wrong!
It did work when I first wrote it, and still work on x64 because we use $eax (TempReg) as both source and dest reg...

Though, I do not see what we are doing wrong:

concretizeClzRR
<inline: true>
| maskReg dest  |
maskReg := operands at: 0.
dest := operands at: 1.
machineCode
at: 0 put: 16rF3;
at: 1 put: 16r0F;
at: 2 put: 16rBD;
at: 3 put: (self mod: ModReg RM: dest RO: maskReg).
^4

and we invoke it like that:
cogit ClzR: srcReg R: destReg.

ClzR: reg1 R: reg2
"reg2 := reg1 countLeadingZeros"
<inline: true>
<returnTypeC: #'AbstractInstruction *'>
^self gen: ClzRR operand: reg1 operand: reg2

So it seems to me that all is in the correct order...
cogitIA32 likewise seems perfrectly correct:

static sqInt
genPrimitiveHighBit(void)
{
    AbstractInstruction *anInstruction11;
    AbstractInstruction *anInstruction2;
    AbstractInstruction *anInstruction4;
    AbstractInstruction *jumpNegativeReceiver;
    AbstractInstruction *jumpNegativeReceiver11;
    AbstractInstruction *jumpNegativeReceiver3;
    sqInt literal1;


        /* remove excess tag bits from the receiver oop */

        /* and use the abstract cogit facility for case of single tag-bit */
        /* begin genHighBitIn:ofSmallIntegerOopWithSingleTagBit: */
        if (((ceCheckLZCNT()) & (1U << 5)) != 0) {
                /* begin genHighBitClzIn:ofSmallIntegerOopWithSingleTagBit: */
                genoperandoperand(ClzRR, ReceiverResultReg, TempReg);
                if (!(setsConditionCodesFor(lastOpcode(), JumpZero))) {
                        /* begin checkQuickConstant:forInstruction: */
                        anInstruction2 = genoperandoperand(CmpCqR, 0, TempReg);
                }

                /* Note the nice bit trick below:
                   highBit_1based_of_small_int_value = (BytesPerWord * 8) - leadingZeroCout_of_oop - 1 toAccountForTagBit.
                   This is like 2 complements (- reg - 1) on (BytesPerWord * 8) log2 bits, or exactly a bit invert operation... */
                jumpNegativeReceiver3 = genConditionalBranchoperand(JumpZero, ((sqInt)0));
                /* begin checkLiteral:forInstruction: */
                literal1 = (BytesPerWord * 8) - 1;
                anInstruction11 = genoperandoperand(XorCwR, (BytesPerWord * 8) - 1, TempReg);
                jumpNegativeReceiver = jumpNegativeReceiver3;
                goto l10;
        }

which concretize in:

        case ClzRR:
                /* begin concretizeClzRR */
                maskReg = ((self_in_dispatchConcretize->operands))[0];
                dest = ((self_in_dispatchConcretize->operands))[1];
                ((self_in_dispatchConcretize->machineCode))[0] = 243;
                ((self_in_dispatchConcretize->machineCode))[1] = 15;
                ((self_in_dispatchConcretize->machineCode))[2] = 189;
                ((self_in_dispatchConcretize->machineCode))[3] = (modRMRO(self_in_dispatchConcretize, ModReg, dest, maskReg));
                return 4;

The order seems correct all the way down...
As a workaround, I could revert Eliot's optimization and force a
    cogit MoveR: ReceiverResultReg R: TempReg.
But I'd rather want to understand where's the problem...
Another pair of eyes may help!


=?UTF-8?B?Q2FwdHVyZSBk4oCZZcyBY3JhbiAyMDIwLTAyLTE5IGHMgCAwMi4xMy41Ny5wbmc=?= (30K) Download Attachment
=?UTF-8?B?Q2FwdHVyZSBk4oCZZcyBY3JhbiAyMDIwLTAyLTE5IGHMgCAwMi4xNy4wNi5wbmc=?= (18K) Download Attachment
=?UTF-8?B?Q2FwdHVyZSBk4oCZZcyBY3JhbiAyMDIwLTAyLTE5IGHMgCAwMi4yMi4yMS5wbmc=?= (42K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Debugging Spur32 genPrimitiveHighBit

Eliot Miranda-2
 
Hi Nicolas,

On Wed, Feb 19, 2020 at 12:10 AM Nicolas Cellier <[hidden email]> wrote:
 
I think that my confusion came from the order of specification, like this

mod operand1:reg operand2:r/m


versus mod: modReg RM: r/m RO: reg
Intel uses such mixture of middle endian in the documentation too...


I have also further fixed the rexw:r:x:b: for X64 encoding of R8-R15 in LZCNT and BSR (but did not regenerate, that can wait)


Don't worry abut it.  It's a weird;y complex ISA :-)  You;'ve fixed it and yesterday I gave it a fairly thorough test and it seems to be working fine. We could write some proper tests in terms of right shift until zero, comparing with highBit I suppose.
 
Le mer. 19 févr. 2020 à 00:51, Nicolas Cellier <[hidden email]> a écrit :
mod is the mod field (2 bits) RM is the r/m field (3 bits), RO is the reg (or opcode) field (3 bits).
For LZCNT dest is in RO, and mask (source) is in r/m.
Same for BSR.
So I had it wrong... will fix ASAP

Le mar. 18 févr. 2020 à 23:37, Nicolas Cellier <[hidden email]> a écrit :
It seems that I miss-interpreted the order of mod:RM:RO:, it seems like O means output, not operand...

Le mar. 18 févr. 2020 à 22:53, Nicolas Cellier <[hidden email]> a écrit :
Hi all,
I confirm that generated code for CLZ (LZCNT) is incorrect on IA32 arch.
The registers are swapped!

Here is an extract:

    0x5c26a9c: 83 e0 01        andl   $0x1, %eax
    0x5c26a9f: eb 11           jmp    0x5c26ab2
    0x5c26aa1: 90              nop    
    0x5c26aa2: 90              nop    
    0x5c26aa3: 90              nop    
    0x5c26aa4: 89 d0           movl   %edx, %eax
    0x5c26aa6: 83 e0 03        andl   $0x3, %eax
    0x5c26aa9: 75 f1           jne    0x5c26a9c
    0x5c26aab: 8b 02           movl   (%edx), %eax
    0x5c26aad: 25 ff ff 3f 00  andl   $0x3fffff, %eax           ; imm = 0x3FFFFF
    0x5c26ab2: 39 c8           cmpl   %ecx, %eax
    0x5c26ab4: 75 e0           jne    0x5c26a96
    0x5c26ab6: f3 0f bd d0     lzcntl %eax, %edx
    0x5c26aba: 74 0d           je     0x5c26ac9
    0x5c26abc: 35 1f 00 00 00  xorl   $0x1f, %eax
    0x5c26ac1: 89 c2           movl   %eax, %edx
    0x5c26ac3: d1 e2           shll   %edx
    0x5c26ac5: 83 c2 01        addl   $0x1, %edx
    0x5c26ac8: c3              retl  

What happens is that we count the leading zeros in $eax (TempReg) and store the result in $edx (ReceiverResultReg) ...

    0x5c26ab6: f3 0f bd d0     lzcntl %eax, %edx

We want the contrary!

$eax contains 1, presumably because we used it to check for SmallInteger tag bit:

    0x5c26a9c: 83 e0 01        andl   $0x1, %eax

So we invariably get 31 leading zeroes in $edx (but we will later overwrite the contents of $edx).

Then we interpret $eax as the result (thus 1 leading zero), bitInvert, and get 30 as the result for highBit, store that in $edx (shifted and tagged), and we're done... Err!

Obviously the code generation is wrong!
It did work when I first wrote it, and still work on x64 because we use $eax (TempReg) as both source and dest reg...

Though, I do not see what we are doing wrong:

concretizeClzRR
<inline: true>
| maskReg dest  |
maskReg := operands at: 0.
dest := operands at: 1.
machineCode
at: 0 put: 16rF3;
at: 1 put: 16r0F;
at: 2 put: 16rBD;
at: 3 put: (self mod: ModReg RM: dest RO: maskReg).
^4

and we invoke it like that:
cogit ClzR: srcReg R: destReg.

ClzR: reg1 R: reg2
"reg2 := reg1 countLeadingZeros"
<inline: true>
<returnTypeC: #'AbstractInstruction *'>
^self gen: ClzRR operand: reg1 operand: reg2

So it seems to me that all is in the correct order...
cogitIA32 likewise seems perfrectly correct:

static sqInt
genPrimitiveHighBit(void)
{
    AbstractInstruction *anInstruction11;
    AbstractInstruction *anInstruction2;
    AbstractInstruction *anInstruction4;
    AbstractInstruction *jumpNegativeReceiver;
    AbstractInstruction *jumpNegativeReceiver11;
    AbstractInstruction *jumpNegativeReceiver3;
    sqInt literal1;


        /* remove excess tag bits from the receiver oop */

        /* and use the abstract cogit facility for case of single tag-bit */
        /* begin genHighBitIn:ofSmallIntegerOopWithSingleTagBit: */
        if (((ceCheckLZCNT()) & (1U << 5)) != 0) {
                /* begin genHighBitClzIn:ofSmallIntegerOopWithSingleTagBit: */
                genoperandoperand(ClzRR, ReceiverResultReg, TempReg);
                if (!(setsConditionCodesFor(lastOpcode(), JumpZero))) {
                        /* begin checkQuickConstant:forInstruction: */
                        anInstruction2 = genoperandoperand(CmpCqR, 0, TempReg);
                }

                /* Note the nice bit trick below:
                   highBit_1based_of_small_int_value = (BytesPerWord * 8) - leadingZeroCout_of_oop - 1 toAccountForTagBit.
                   This is like 2 complements (- reg - 1) on (BytesPerWord * 8) log2 bits, or exactly a bit invert operation... */
                jumpNegativeReceiver3 = genConditionalBranchoperand(JumpZero, ((sqInt)0));
                /* begin checkLiteral:forInstruction: */
                literal1 = (BytesPerWord * 8) - 1;
                anInstruction11 = genoperandoperand(XorCwR, (BytesPerWord * 8) - 1, TempReg);
                jumpNegativeReceiver = jumpNegativeReceiver3;
                goto l10;
        }

which concretize in:

        case ClzRR:
                /* begin concretizeClzRR */
                maskReg = ((self_in_dispatchConcretize->operands))[0];
                dest = ((self_in_dispatchConcretize->operands))[1];
                ((self_in_dispatchConcretize->machineCode))[0] = 243;
                ((self_in_dispatchConcretize->machineCode))[1] = 15;
                ((self_in_dispatchConcretize->machineCode))[2] = 189;
                ((self_in_dispatchConcretize->machineCode))[3] = (modRMRO(self_in_dispatchConcretize, ModReg, dest, maskReg));
                return 4;

The order seems correct all the way down...
As a workaround, I could revert Eliot's optimization and force a
    cogit MoveR: ReceiverResultReg R: TempReg.
But I'd rather want to understand where's the problem...
Another pair of eyes may help!



--
_,,,^..^,,,_
best, Eliot