Re: [Pharo-dev] Integer arithmetic and bit counting performance in Squeak and Pharo

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: [Pharo-dev] Integer arithmetic and bit counting performance in Squeak and Pharo

Benoit St-Jean
Oh!  Many thanks for all the details!  I thought about special selectors and remembered that #bitAnd: and #bitOr: were those special kind of "beasts" but I falsely assumed that #bitXor: was also a special selector... :(  I should have looked at the code!

So that also explains the "strange" behavior of the 64bit image.  Lots of the tests in the 32bit image fell into the LargeInteger "range" and then, executing the same code on the 64bit image, those numbers suddenly became considered SmallInteger...  That explains it all and why some operations "suddenly" became so fast for no apparent reason !!!

Now, just for my personal curiosity, what is going to be the impact of a 64bit VM on such code? (Unfortunately, I am on Windows so I will have to wait...) ?

P.S. I should have persevered in my google searches as I just came across this *excellent* post :  Arithmetic, inlined and special selectors

:)

Thanks a lot!


 
-----------------
Benoît St-Jean
Yahoo! Messenger: bstjean
Twitter: @BenLeChialeux
Pinterest: benoitstjean
Instagram: Chef_Benito
IRC: lamneth
Blogue: endormitoire.wordpress.com
"A standpoint is an intellectual horizon of radius zero".  (A. Einstein)



From: Clément Bera <[hidden email]>
To: Benoit St-Jean <[hidden email]>; Pharo Development List <[hidden email]>
Cc: The General-purpose Squeak Developers List <[hidden email]>
Sent: Friday, February 10, 2017 5:14 AM
Subject: Re: [Pharo-dev] Integer arithmetic and bit counting performance in Squeak and Pharo

Hi,

This is a lovely post. Thanks for posting about Smalltalk / Pharo / Squeak.

  1. Why exactly does the override work (in 32bit images)?
  2. What changed so that things are different in Squeak 5.1 64bit image (overrides partially work)?
If I am correct your questions are exclusively for bitOr: and bitAnd:. I don't know what you are aware of and what you're not aware of, hence I am going to give you some details on how bitAnd: is executed (bitOr: is done exactly the same way), and hopefully there will be something you don't know yet that should help you understand what's going on. 

Execution of bitAnd:

bitAnd: a special selector. If you run Smalltalk specialSelectors you get this array:  #(#+ 1 #- 1 #< 1 #> 1 #<= 1 #>= 1 #= 1 #~= 1 #* 1 #/ 1 #\\ 1 #@ 1 #bitShift: 1 #// 1 #bitAnd: 1 #bitOr: 1 #at: 1 #at:put: 2 #size 0 #next 0 #nextPut: 1 #atEnd 0 #== 1 #class 0 #blockCopy: 1 #value 0 #value: 1 #do: 1 #new 0 #new: 1 #x 0 #y 0) which includes bitAnd: .

Special selectors are encoded in the bytecode differently to save memory, so they can be executed differently without inducing overhead. In Cog, bitAnd: is executed differently using type-prediction. 

In the Stack Interpreter, the execution of bitAnd: is done as follows:
    If both operands are SmallIntegers, 
        push the bitAnd: value of the 2 SmallIntegers
    else 
        Try the primitive BitAnd for SmallIntegers (14)
        on success
            push the result
        on failure
            perform the send.

In the JIT, bitAnd: sends are compiled differently that normal send, the compilation logic is as follows:
    If both operands are SmallInteger literals
        compute the result during compilation and use this value
    else
        If one of the 2 operands is a SmallInteger literal
            generate 2 paths, a quick inlined path testing at runtime that the other operand is a SmallInteger, and if so, Ands the operands and use that result, else use the slow path performing a send
        else
            generate a normal send

Primitives

primitive 14 -> Works for SmallIntegers and LargePositiveInteger fitting in an unsigned machine word. (Slower than direct SmallInteger bitAnd performed by the special selector)
primitive 34 -> Works for SmallIntegers and LargePositiveInteger fitting in an unsigned int64. (Slower than primitive 14 in 32 bits but covering more cases, equivalent in 64 bits)
primDigitBitAnd (in Integer) -> Works for any integer. (Slower than the primitive 14 and 34)

SmallInteger encoding

In 32 bits, SmallIntegers are signed 31bits integer.
In 64 bits, SmallIntegers are signed 61 bits integer.

Discussion

Let's make some assertions based on previous information:

1) If you have the exact same bitAnd: implementation for #bitAnd: and #bitAnd2:, #bitAnd2: is slower because it is not a special selector.

2) Still with the exact same bitAnd: implementation, using the JIT, things like that:
16r1 bitAnd: 1. "1 bits"
16r2 bitAnd: 2. "2 bit"
16r4 bitAnd: 4. "3 bit"
16r8 bitAnd: 3. "4 bit"
Are bitAnd: operations between SmallInteger constants and the result is unused. Hence this whole portion of code does not generate any native instructions.
On the other hand, things like that:
16r1 bitAnd2: 1. "1 bits"
16r2 bitAnd2: 2. "2 bit"
16r4 bitAnd2: 4. "3 bit"
16r8 bitAnd2: 3. "4 bit"
Are sends, generating multiple native instructions including calls.

3) Because of the SmallInteger encoding,
16r200000000000000 bitAnd: 98490667780264321. "58 bit"
is compiled by the JIT to nothing in 64 bits but to a send in 32 bits (the JIT can't solve LargeInteger arithmetic at compilation time).

4) primitive 34 provides some speed-up over primitive 14 in 32 bits, covering bitAnd: operations from unsigned 33 bits integer to unsigned 64 bits integer, but makes no difference in 64 bits. In 64bits all the cases covered by primitive 34 are covered by primitive 14, so it should even slow down things.

So...

Does this answer your questions ?
Don't hesitate to ask further questions. 

If you want more details, I wrote something about special selectors a while ago: https://clementbera.wordpress.com/2014/08/12/arithmetic-inlined-and-special-selectors/ .

Best,

Clement

PS1: If I'm correct, you referenced my blog a couple times, thank you for doing this, I really appreciate that.
PS2: I enjoy chess myself, so if you have an open-source/MIT working algorithm at some point please send it to me, I will play against the AI and look at the code.


On Fri, Feb 10, 2017 at 8:18 AM, Benoit St-Jean via Pharo-dev <[hidden email]> wrote:


---------- Forwarded message ----------
From: Benoit St-Jean <[hidden email]>
To: The General-purpose Squeak Developers List <[hidden email]>, "[hidden email]" <[hidden email]>
Cc: 
Date: Fri, 10 Feb 2017 07:18:02 +0000 (UTC)
Subject: Integer arithmetic and bit counting performance in Squeak and Pharo
For those interested in performance of bit operations and integer arithmetic in Squeak and Pharo :

Bit operations in general : https://endormitoire.wordpress .com/2017/02/10/bits-and- pieces/

Bit counting algorithms : https://endormitoire.wordpress .com/2017/02/10/1001001-sos/

Some questions, some results, some answers...
 
-----------------
Benoît St-Jean
Yahoo! Messenger: bstjean
Twitter: @BenLeChialeux
Pinterest: benoitstjean
Instagram: Chef_Benito
IRC: lamneth
Blogue: endormitoire.wordpress.com
"A standpoint is an intellectual horizon of radius zero".  (A. Einstein)






Reply | Threaded
Open this post in threaded view
|

Re: [Pharo-dev] Integer arithmetic and bit counting performance in Squeak and Pharo

Eliot Miranda-2
Hi Benoit,

   the upshot is that in 32-bits one should probably model a chess board bit mask with three or four SmallIntegers, for example 16 squares per integer, but in 64-bits one can use two SmallIntegers, with 32 squares per integer.  You could construct a family of class that hide the representational details.  A ChessBoardBitMap abstract class with subclasses for the 4 32-bit SmallIntegers representation, the 2 64-bit SmallIntegers representation, and the polymorphic single integer that will be either a SmallInteger or a LargePositiveInteger, depending on the board and the word size of the system.  You can then play with the representation as your program evolves and as e.g. the Sista optimizer evolves, which should be abel to optimize away some of the overhead of using this "boxed" ChessBoardBitMap approach.

On Fri, Feb 10, 2017 at 4:21 AM, Benoit St-Jean via Pharo-dev <[hidden email]> wrote:


---------- Forwarded message ----------
From: Benoit St-Jean <[hidden email]>
To: "Clément Bera" <[hidden email]>, Pharo Development List <[hidden email]>
Cc: The General-purpose Squeak Developers List <[hidden email]>
Date: Fri, 10 Feb 2017 12:21:33 +0000 (UTC)
Subject: Re: [Pharo-dev] Integer arithmetic and bit counting performance in Squeak and Pharo
Oh!  Many thanks for all the details!  I thought about special selectors and remembered that #bitAnd: and #bitOr: were those special kind of "beasts" but I falsely assumed that #bitXor: was also a special selector... :(  I should have looked at the code!

So that also explains the "strange" behavior of the 64bit image.  Lots of the tests in the 32bit image fell into the LargeInteger "range" and then, executing the same code on the 64bit image, those numbers suddenly became considered SmallInteger...  That explains it all and why some operations "suddenly" became so fast for no apparent reason !!!

Now, just for my personal curiosity, what is going to be the impact of a 64bit VM on such code? (Unfortunately, I am on Windows so I will have to wait...) ?

P.S. I should have persevered in my google searches as I just came across this *excellent* post :  Arithmetic, inlined and special selectors

:)

Thanks a lot!


 
-----------------
Benoît St-Jean
Yahoo! Messenger: bstjean
Twitter: @BenLeChialeux
Pinterest: benoitstjean
Instagram: Chef_Benito
IRC: lamneth
Blogue: endormitoire.wordpress.com
"A standpoint is an intellectual horizon of radius zero".  (A. Einstein)



From: Clément Bera <[hidden email]>
To: Benoit St-Jean <[hidden email]>; Pharo Development List <[hidden email]>
Cc: The General-purpose Squeak Developers List <[hidden email]>
Sent: Friday, February 10, 2017 5:14 AM
Subject: Re: [Pharo-dev] Integer arithmetic and bit counting performance in Squeak and Pharo

Hi,

This is a lovely post. Thanks for posting about Smalltalk / Pharo / Squeak.

  1. Why exactly does the override work (in 32bit images)?
  2. What changed so that things are different in Squeak 5.1 64bit image (overrides partially work)?
If I am correct your questions are exclusively for bitOr: and bitAnd:. I don't know what you are aware of and what you're not aware of, hence I am going to give you some details on how bitAnd: is executed (bitOr: is done exactly the same way), and hopefully there will be something you don't know yet that should help you understand what's going on. 

Execution of bitAnd:

bitAnd: a special selector. If you run Smalltalk specialSelectors you get this array:  #(#+ 1 #- 1 #< 1 #> 1 #<= 1 #>= 1 #= 1 #~= 1 #* 1 #/ 1 #\\ 1 #@ 1 #bitShift: 1 #// 1 #bitAnd: 1 #bitOr: 1 #at: 1 #at:put: 2 #size 0 #next 0 #nextPut: 1 #atEnd 0 #== 1 #class 0 #blockCopy: 1 #value 0 #value: 1 #do: 1 #new 0 #new: 1 #x 0 #y 0) which includes bitAnd: .

Special selectors are encoded in the bytecode differently to save memory, so they can be executed differently without inducing overhead. In Cog, bitAnd: is executed differently using type-prediction. 

In the Stack Interpreter, the execution of bitAnd: is done as follows:
    If both operands are SmallIntegers, 
        push the bitAnd: value of the 2 SmallIntegers
    else 
        Try the primitive BitAnd for SmallIntegers (14)
        on success
            push the result
        on failure
            perform the send.

In the JIT, bitAnd: sends are compiled differently that normal send, the compilation logic is as follows:
    If both operands are SmallInteger literals
        compute the result during compilation and use this value
    else
        If one of the 2 operands is a SmallInteger literal
            generate 2 paths, a quick inlined path testing at runtime that the other operand is a SmallInteger, and if so, Ands the operands and use that result, else use the slow path performing a send
        else
            generate a normal send

Primitives

primitive 14 -> Works for SmallIntegers and LargePositiveInteger fitting in an unsigned machine word. (Slower than direct SmallInteger bitAnd performed by the special selector)
primitive 34 -> Works for SmallIntegers and LargePositiveInteger fitting in an unsigned int64. (Slower than primitive 14 in 32 bits but covering more cases, equivalent in 64 bits)
primDigitBitAnd (in Integer) -> Works for any integer. (Slower than the primitive 14 and 34)

SmallInteger encoding

In 32 bits, SmallIntegers are signed 31bits integer.
In 64 bits, SmallIntegers are signed 61 bits integer.

Discussion

Let's make some assertions based on previous information:

1) If you have the exact same bitAnd: implementation for #bitAnd: and #bitAnd2:, #bitAnd2: is slower because it is not a special selector.

2) Still with the exact same bitAnd: implementation, using the JIT, things like that:
16r1 bitAnd: 1. "1 bits"
16r2 bitAnd: 2. "2 bit"
16r4 bitAnd: 4. "3 bit"
16r8 bitAnd: 3. "4 bit"
Are bitAnd: operations between SmallInteger constants and the result is unused. Hence this whole portion of code does not generate any native instructions.
On the other hand, things like that:
16r1 bitAnd2: 1. "1 bits"
16r2 bitAnd2: 2. "2 bit"
16r4 bitAnd2: 4. "3 bit"
16r8 bitAnd2: 3. "4 bit"
Are sends, generating multiple native instructions including calls.

3) Because of the SmallInteger encoding,
16r200000000000000 bitAnd: 98490667780264321. "58 bit"
is compiled by the JIT to nothing in 64 bits but to a send in 32 bits (the JIT can't solve LargeInteger arithmetic at compilation time).

4) primitive 34 provides some speed-up over primitive 14 in 32 bits, covering bitAnd: operations from unsigned 33 bits integer to unsigned 64 bits integer, but makes no difference in 64 bits. In 64bits all the cases covered by primitive 34 are covered by primitive 14, so it should even slow down things.

So...

Does this answer your questions ?
Don't hesitate to ask further questions. 

If you want more details, I wrote something about special selectors a while ago: https://clementbera.wordpress.com/2014/08/12/arithmetic-inlined-and-special-selectors/ .

Best,

Clement

PS1: If I'm correct, you referenced my blog a couple times, thank you for doing this, I really appreciate that.
PS2: I enjoy chess myself, so if you have an open-source/MIT working algorithm at some point please send it to me, I will play against the AI and look at the code.


On Fri, Feb 10, 2017 at 8:18 AM, Benoit St-Jean via Pharo-dev <[hidden email]> wrote:


---------- Forwarded message ----------
From: Benoit St-Jean <[hidden email]>
To: The General-purpose Squeak Developers List <[hidden email]>, "[hidden email]" <[hidden email]>
Cc: 
Date: Fri, 10 Feb 2017 07:18:02 +0000 (UTC)
Subject: Integer arithmetic and bit counting performance in Squeak and Pharo
For those interested in performance of bit operations and integer arithmetic in Squeak and Pharo :

Bit operations in general : https://endormitoire.wordpress .com/2017/02/10/bits-and- pieces/

Bit counting algorithms : https://endormitoire.wordpress .com/2017/02/10/1001001-sos/

Some questions, some results, some answers...
 
-----------------
Benoît St-Jean
Yahoo! Messenger: bstjean
Twitter: @BenLeChialeux
Pinterest: benoitstjean
Instagram: Chef_Benito
IRC: lamneth
Blogue: endormitoire.wordpress.com
"A standpoint is an intellectual horizon of radius zero".  (A. Einstein)








--
_,,,^..^,,,_
best, Eliot