Hi all, I confirm that generated code for CLZ (LZCNT) is incorrect on IA32 arch. The registers are swapped! Here is an extract: 0x5c26a9c: 83 e0 01 andl $0x1, %eax 0x5c26a9f: eb 11 jmp 0x5c26ab2 0x5c26aa1: 90 nop 0x5c26aa2: 90 nop 0x5c26aa3: 90 nop 0x5c26aa4: 89 d0 movl %edx, %eax 0x5c26aa6: 83 e0 03 andl $0x3, %eax 0x5c26aa9: 75 f1 jne 0x5c26a9c 0x5c26aab: 8b 02 movl (%edx), %eax 0x5c26aad: 25 ff ff 3f 00 andl $0x3fffff, %eax ; imm = 0x3FFFFF 0x5c26ab2: 39 c8 cmpl %ecx, %eax 0x5c26ab4: 75 e0 jne 0x5c26a96 0x5c26ab6: f3 0f bd d0 lzcntl %eax, %edx 0x5c26aba: 74 0d je 0x5c26ac9 0x5c26abc: 35 1f 00 00 00 xorl $0x1f, %eax 0x5c26ac1: 89 c2 movl %eax, %edx 0x5c26ac3: d1 e2 shll %edx 0x5c26ac5: 83 c2 01 addl $0x1, %edx 0x5c26ac8: c3 retl What happens is that we count the leading zeros in $eax (TempReg) and store the result in $edx (ReceiverResultReg) ... 0x5c26ab6: f3 0f bd d0 lzcntl %eax, %edx We want the contrary! $eax contains 1, presumably because we used it to check for SmallInteger tag bit: 0x5c26a9c: 83 e0 01 andl $0x1, %eax So we invariably get 31 leading zeroes in $edx (but we will later overwrite the contents of $edx). Then we interpret $eax as the result (thus 1 leading zero), bitInvert, and get 30 as the result for highBit, store that in $edx (shifted and tagged), and we're done... Err! Obviously the code generation is wrong! It did work when I first wrote it, and still work on x64 because we use $eax (TempReg) as both source and dest reg... Though, I do not see what we are doing wrong: concretizeClzRR <inline: true> | maskReg dest | maskReg := operands at: 0. dest := operands at: 1. machineCode at: 0 put: 16rF3; at: 1 put: 16r0F; at: 2 put: 16rBD; at: 3 put: (self mod: ModReg RM: dest RO: maskReg). ^4 and we invoke it like that: cogit ClzR: srcReg R: destReg. ClzR: reg1 R: reg2 "reg2 := reg1 countLeadingZeros" <inline: true> <returnTypeC: #'AbstractInstruction *'> ^self gen: ClzRR operand: reg1 operand: reg2 So it seems to me that all is in the correct order... cogitIA32 likewise seems perfrectly correct: static sqInt genPrimitiveHighBit(void) { AbstractInstruction *anInstruction11; AbstractInstruction *anInstruction2; AbstractInstruction *anInstruction4; AbstractInstruction *jumpNegativeReceiver; AbstractInstruction *jumpNegativeReceiver11; AbstractInstruction *jumpNegativeReceiver3; sqInt literal1; /* remove excess tag bits from the receiver oop */ /* and use the abstract cogit facility for case of single tag-bit */ /* begin genHighBitIn:ofSmallIntegerOopWithSingleTagBit: */ if (((ceCheckLZCNT()) & (1U << 5)) != 0) { /* begin genHighBitClzIn:ofSmallIntegerOopWithSingleTagBit: */ genoperandoperand(ClzRR, ReceiverResultReg, TempReg); if (!(setsConditionCodesFor(lastOpcode(), JumpZero))) { /* begin checkQuickConstant:forInstruction: */ anInstruction2 = genoperandoperand(CmpCqR, 0, TempReg); } /* Note the nice bit trick below: highBit_1based_of_small_int_value = (BytesPerWord * 8) - leadingZeroCout_of_oop - 1 toAccountForTagBit. This is like 2 complements (- reg - 1) on (BytesPerWord * 8) log2 bits, or exactly a bit invert operation... */ jumpNegativeReceiver3 = genConditionalBranchoperand(JumpZero, ((sqInt)0)); /* begin checkLiteral:forInstruction: */ literal1 = (BytesPerWord * 8) - 1; anInstruction11 = genoperandoperand(XorCwR, (BytesPerWord * 8) - 1, TempReg); jumpNegativeReceiver = jumpNegativeReceiver3; goto l10; } which concretize in: case ClzRR: /* begin concretizeClzRR */ maskReg = ((self_in_dispatchConcretize->operands))[0]; dest = ((self_in_dispatchConcretize->operands))[1]; ((self_in_dispatchConcretize->machineCode))[0] = 243; ((self_in_dispatchConcretize->machineCode))[1] = 15; ((self_in_dispatchConcretize->machineCode))[2] = 189; ((self_in_dispatchConcretize->machineCode))[3] = (modRMRO(self_in_dispatchConcretize, ModReg, dest, maskReg)); return 4; The order seems correct all the way down... As a workaround, I could revert Eliot's optimization and force a cogit MoveR: ReceiverResultReg R: TempReg. But I'd rather want to understand where's the problem... Another pair of eyes may help! |
It seems that I miss-interpreted the order of mod:RM:RO:, it seems like O means output, not operand... Le mar. 18 févr. 2020 à 22:53, Nicolas Cellier <[hidden email]> a écrit :
|
ModR/M is a byte which encodes instruction operands https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf mod is the mod field (2 bits) RM is the r/m field (3 bits), RO is the reg (or opcode) field (3 bits). For LZCNT dest is in RO, and mask (source) is in r/m. Same for BSR. So I had it wrong... will fix ASAP Le mar. 18 févr. 2020 à 23:37, Nicolas Cellier <[hidden email]> a écrit :
|
I think that my confusion came from the order of specification, like this mod operand1:reg operand2:r/m https://www.felixcloutier.com/x86/lzcnt (same as Intel manuals) versus mod: modReg RM: r/m RO: reg Intel uses such mixture of middle endian in the documentation too... I have also further fixed the rexw:r:x:b: for X64 encoding of R8-R15 in LZCNT and BSR (but did not regenerate, that can wait) Le mer. 19 févr. 2020 à 00:51, Nicolas Cellier <[hidden email]> a écrit :
=?UTF-8?B?Q2FwdHVyZSBk4oCZZcyBY3JhbiAyMDIwLTAyLTE5IGHMgCAwMi4xMy41Ny5wbmc=?= (30K) Download Attachment =?UTF-8?B?Q2FwdHVyZSBk4oCZZcyBY3JhbiAyMDIwLTAyLTE5IGHMgCAwMi4xNy4wNi5wbmc=?= (18K) Download Attachment =?UTF-8?B?Q2FwdHVyZSBk4oCZZcyBY3JhbiAyMDIwLTAyLTE5IGHMgCAwMi4yMi4yMS5wbmc=?= (42K) Download Attachment |
Hi Nicolas, On Wed, Feb 19, 2020 at 12:10 AM Nicolas Cellier <[hidden email]> wrote:
Don't worry abut it. It's a weird;y complex ISA :-) You;'ve fixed it and yesterday I gave it a fairly thorough test and it seems to be working fine. We could write some proper tests in terms of right shift until zero, comparing with highBit I suppose.
_,,,^..^,,,_ best, Eliot |
Free forum by Nabble | Edit this page |