Oh! Many thanks for all the details! I thought about special selectors and remembered that #bitAnd: and #bitOr: were those special kind of "beasts" but I falsely assumed that #bitXor: was also a special selector... :( I should have looked at the code!
So that also explains the "strange" behavior of the 64bit image. Lots of the tests in the 32bit image fell into the LargeInteger "range" and then, executing the same code on the 64bit image, those numbers suddenly became considered SmallInteger... That explains it all and why some operations "suddenly" became so fast for no apparent reason !!!
Now, just for my personal curiosity, what is going to be the impact of a 64bit VM on such code? (Unfortunately, I am on Windows so I will have to wait...) ?
P.S. I should have persevered in my google searches as I just came across this *excellent* post : Arithmetic, inlined and special selectors
Thanks a lot!
Yahoo! Messenger: bstjean
"A standpoint is an intellectual horizon of radius zero". (A. Einstein)
From: Clément Bera <[hidden email]>
To: Benoit St-Jean <[hidden email]>; Pharo Development List <[hidden email]>
Cc: The General-purpose Squeak Developers List <[hidden email]>
Sent: Friday, February 10, 2017 5:14 AM
Subject: Re: [Pharo-dev] Integer arithmetic and bit counting performance in Squeak and Pharo
This is a lovely post. Thanks for posting about Smalltalk / Pharo / Squeak.
If I am correct your questions are exclusively for bitOr: and bitAnd:. I don't know what you are aware of and what you're not aware of, hence I am going to give you some details on how bitAnd: is executed (bitOr: is done exactly the same way), and hopefully there will be something you don't know yet that should help you understand what's going on.
Execution of bitAnd:
bitAnd: a special selector. If you run Smalltalk specialSelectors you get this array: #(#+ 1 #- 1 #< 1 #> 1 #<= 1 #>= 1 #= 1 #~= 1 #* 1 #/ 1 #\\ 1 #@ 1 #bitShift: 1 #// 1 #bitAnd: 1 #bitOr: 1 #at: 1 #at:put: 2 #size 0 #next 0 #nextPut: 1 #atEnd 0 #== 1 #class 0 #blockCopy: 1 #value 0 #value: 1 #do: 1 #new 0 #new: 1 #x 0 #y 0) which includes bitAnd: .
Special selectors are encoded in the bytecode differently to save memory, so they can be executed differently without inducing overhead. In Cog, bitAnd: is executed differently using type-prediction.
In the Stack Interpreter, the execution of bitAnd: is done as follows:
If both operands are SmallIntegers,
push the bitAnd: value of the 2 SmallIntegers
Try the primitive BitAnd for SmallIntegers (14)
push the result
perform the send.
In the JIT, bitAnd: sends are compiled differently that normal send, the compilation logic is as follows:
If both operands are SmallInteger literals
compute the result during compilation and use this value
If one of the 2 operands is a SmallInteger literal
generate 2 paths, a quick inlined path testing at runtime that the other operand is a SmallInteger, and if so, Ands the operands and use that result, else use the slow path performing a send
generate a normal send
primitive 14 -> Works for SmallIntegers and LargePositiveInteger fitting in an unsigned machine word. (Slower than direct SmallInteger bitAnd performed by the special selector)
primitive 34 -> Works for SmallIntegers and LargePositiveInteger fitting in an unsigned int64. (Slower than primitive 14 in 32 bits but covering more cases, equivalent in 64 bits)
primDigitBitAnd (in Integer) -> Works for any integer. (Slower than the primitive 14 and 34)
In 32 bits, SmallIntegers are signed 31bits integer.
In 64 bits, SmallIntegers are signed 61 bits integer.
Let's make some assertions based on previous information:
1) If you have the exact same bitAnd: implementation for #bitAnd: and #bitAnd2:, #bitAnd2: is slower because it is not a special selector.
2) Still with the exact same bitAnd: implementation, using the JIT, things like that:16r1 bitAnd: 1. "1 bits"
16r2 bitAnd: 2. "2 bit"
16r4 bitAnd: 4. "3 bit"
16r8 bitAnd: 3. "4 bit"
Are bitAnd: operations between SmallInteger constants and the result is unused. Hence this whole portion of code does not generate any native instructions.
On the other hand, things like that:16r1 bitAnd2: 1. "1 bits"
16r2 bitAnd2: 2. "2 bit"
16r4 bitAnd2: 4. "3 bit"
16r8 bitAnd2: 3. "4 bit"
Are sends, generating multiple native instructions including calls.
3) Because of the SmallInteger encoding,
16r200000000000000 bitAnd: 98490667780264321. "58 bit"
is compiled by the JIT to nothing in 64 bits but to a send in 32 bits (the JIT can't solve LargeInteger arithmetic at compilation time).
4) primitive 34 provides some speed-up over primitive 14 in 32 bits, covering bitAnd: operations from unsigned 33 bits integer to unsigned 64 bits integer, but makes no difference in 64 bits. In 64bits all the cases covered by primitive 34 are covered by primitive 14, so it should even slow down things.
Does this answer your questions ?
Don't hesitate to ask further questions.
If you want more details, I wrote something about special selectors a while ago: https://clementbera.wordpress.com/2014/08/12/arithmetic-inlined-and-special-selectors/ .
PS1: If I'm correct, you referenced my blog a couple times, thank you for doing this, I really appreciate that.
PS2: I enjoy chess myself, so if you have an open-source/MIT working algorithm at some point please send it to me, I will play against the AI and look at the code.
On Fri, Feb 10, 2017 at 8:18 AM, Benoit St-Jean via Pharo-dev <[hidden email]> wrote:
the upshot is that in 32-bits one should probably model a chess board bit mask with three or four SmallIntegers, for example 16 squares per integer, but in 64-bits one can use two SmallIntegers, with 32 squares per integer. You could construct a family of class that hide the representational details. A ChessBoardBitMap abstract class with subclasses for the 4 32-bit SmallIntegers representation, the 2 64-bit SmallIntegers representation, and the polymorphic single integer that will be either a SmallInteger or a LargePositiveInteger, depending on the board and the word size of the system. You can then play with the representation as your program evolves and as e.g. the Sista optimizer evolves, which should be abel to optimize away some of the overhead of using this "boxed" ChessBoardBitMap approach.
On Fri, Feb 10, 2017 at 4:21 AM, Benoit St-Jean via Pharo-dev <[hidden email]> wrote:
|Free forum by Nabble||Edit this page|