in-line method lookup caching of booleans


in-line method lookup caching of booleans

Ben Coman
 
Just a brain twitch about something I'd like to understand better...

At http://www.mirandabanda.org/cogblog/2011/03/01/build-me-a-jit-as-fast-as-you-can/
it says... "[Monomorphic] inline caching depends on the fact that in
most programs at most send sites there is no polymorphism and that
most sends bind to *exactly_one* class of receiver over some usefully
long period of time. ... In Cog we implement inline caches in three
forms ... monomorphic inline cache ... closed polymorphic inline cache
... open polymorphic cache.  What’s nice is that while these sends are
increasingly expensive moving from monomorphic to megamorphic they are
also decreasingly common in roughly a 90%, 9% 0.9% ratio, at least for
typical Smalltalk programs"

First, I'm curious what is the relative performance of the three
different caches?

Second, I'm curious how Booleans are dealt with.  Boolean handling
must be fairly common, and at the Image level these are two different
classes, which pushes sends out of the monomorphic inline cache and may
have a significant impact on performance.

I started wondering if, under the hood, the VM could treat the two
booleans as one class and store in the instance which boolean it
actually is.  Then, for example, the #ifTrue:ifFalse: method of each
class would be compiled into a single method conditioned at the start
on "which boolean" as to which path is executed.  How feasible would
that be, and would it make much of a difference in the performance of
booleans?

Except then I realized that this situation is already bypassed by
common boolean methods being inlined by the compiler.  Still curious
if there are quick answers to my questions.

cheers -ben

Re: in-line method lookup caching of booleans

Clément Béra
 
Hi,

On Fri, Aug 3, 2018 at 4:28 AM, Ben Coman <[hidden email]> wrote:
 
[...] "What’s nice is that while these sends are
increasingly expensive moving from monomorphic to megamorphic they are
also decreasingly common in roughly a 90%, 9% 0.9% ratio, at least for
typical Smalltalk programs"

Note that in my experience the ratio was, each time I measured it in the Cog Simulator, 93%, 4%, 3% or something like that. Cliff Click told me recently he had the same feeling with Java code: sends are either mono- or megamorphic, but rarely poly (though in the JVM polymorphism is up to 4).


First, I'm curious what is the relative performance of the three
different caches?

I guess you can measure that with micro-benchmarks. Levente can help with that; I would write complete crap. Note that our polymorphism is up to 6 cases.
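
For instance, a minimal sketch of such a micro-benchmark (assuming Squeak/Pharo's #timeToRun; the arrays here are made up): #yourself is a real send, so it is monomorphic over the first array and served by a two-case closed PIC over the second.

| mono poly |
mono := Array new: 1024 withAll: 1. "one receiver class"
poly := (1 to: 1024) collect: [:i | i odd ifTrue: [i] ifFalse: [i printString]]. "two receiver classes"
{ mono. poly } collect:
    [:arr | [1 to: 1000 do: [:n | arr do: [:e | e yourself]]] timeToRun]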

In the case of Cog, I would expect monomorphic caches to be almost as fast as direct calls, polymorphic caches to be almost as fast as monomorphic, and megamorphic caches to be considerably slower. To be confirmed.

Note also that for some reason in Cog PICs are called ClosedPICs and megamorphic caches are called OpenPICs.


Second, I'm curious how Booleans are dealt with.  Boolean handling
must be fairly common, and at the Image level these are two different
classes, which pushes sends out of the monomorphic inline cache and may
have a significant impact on performance.


Control flow operations are inlined by the bytecode compiler, and they're the most critical performance-wise.
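
You can see the inlining from the image; for instance (assuming a dialect where CompiledMethod understands #symbolic, as Squeak does), a method written with #ifTrue:ifFalse: shows conditional jump bytecodes and no send of that selector:

(Magnitude >> #max:) symbolic
    "print it: a #> send followed by jumpFalse:/jumpTo:, and no #ifTrue:ifFalse: send"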

The VM fixes the addresses of specific objects (nil, true, false); become and become-forward don't work on those objects. The JIT can therefore generate the addresses of those objects as constants in machine code, which quickens the inlined control flow operations.
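
For example, the pinning is visible from the image in that the two-way become: fails for these objects while it works for ordinary ones (a sketch; the exact error reported varies by dialect):

[true become: false] on: Error do: [:e | e messageText] "fails: true and false are pinned"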

Non-inlined boolean operations are usually dealt with by PICs with two cases.

One issue I had in Sista was with early PIC promotion. Currently in Cog, if there's already a megamorphic cache for a selector, monomorphic caches are rewritten directly to the megamorphic cache rather than to a PIC and then the megamorphic cache. This is a problem typically with sends such as #isNil. In some cases #isNil is megamorphic and using such a cache is relevant. In other cases there are just a given type and UndefinedObject, and there the VM currently uses a megamorphic cache instead of a PIC. This is especially annoying since megamorphic caches don't provide any runtime type feedback. It's a similar two-case issue, just not a boolean one.
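
A hypothetical illustration of such a two-case site: the #isNil send below only ever sees Point and UndefinedObject, so a two-entry PIC would capture it exactly, yet with early promotion it can be linked straight to a preexisting megamorphic #isNil cache.

| values |
values := { 1@2. nil. 3@4. nil }.
values collect: [:v | v isNil] "this #isNil send site sees exactly two receiver classes"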
 
I started wondering if, under the hood, the VM could treat the two
booleans as one class and store in the instance which boolean it
actually is.  Then, for example, the #ifTrue:ifFalse: method of each
class would be compiled into a single method conditioned at the start
on "which boolean" as to which path is executed.  How feasible would
that be, and would it make much of a difference in the performance of
booleans?


Some VMs have booleans as primitive types or as immediate objects. Honestly, given our optimizations (the booleans can't move in memory, so their addresses can be used as constants), I don't think any of these approaches would make a significant difference in performance.

What might make a difference is to add some other boolean methods, such as #not, as primitive / inlined operations. It's a trade-off between system flexibility, boolean and non-boolean performance, etc.
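
For reference, #not is today an ordinary pair of one-line methods in the image (as in Squeak/Pharo), which is exactly the flexibility such inlining would trade away:

True >> not
    "Negation. Answer false, since the receiver is true."
    ^ false

False >> not
    "Negation. Answer true, since the receiver is false."
    ^ true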
 

Re: in-line method lookup caching of booleans

Eliot Miranda-2
 
Hi Clément, Hi Ben,

_,,,^..^,,,_ (phone)

On Aug 3, 2018, at 1:19 AM, Clément Béra <[hidden email]> wrote:

[...]

First, I'm curious what is the relative performance of the three
different caches?

Towards the bottom of the blog post is an example that measures relative performance.  The table I give is as follows, but the data are from a V3 image.  Ben, maybe you could run the example on Spur and see if things are much different.

homogenous immediate     2.8 nsecs
homogenous compact       8.6 nsecs
homogenous normal        8.5 nsecs
polymorphic             11.2 nsecs
megamorphic             16.7 nsecs



[...]

Note also that for some reason in Cog PICs are called ClosedPICs and megamorphic caches are called OpenPICs.

Here’s my explanation from the blog post:

In Cog we implement inline caches in three forms:

– a monomorphic inline cache; a load of a cache tag and a call of a target method whose prolog checks that the current receiver’s class matches the cache tag

– a closed polymorphic inline cache; a growable but finite jump table of class checks and jumps to method bodies (closed because it deals with a closed set of classes).

– an open polymorphic cache; a first-level method lookup probe specialized for a particular selector (open since it deals with an open set of classes).
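
As a toy, image-level model of those three linking states (nothing like Cog's actual data structures; the one-entry case plays the monomorphic role, and the six-case limit is the one Clément mentions above):

| pic openCache lookup send |
pic := OrderedCollection new. "closed PIC: (class -> method) cases"
openCache := IdentityDictionary new. "open PIC: per-selector probe table"
lookup := [:class | class lookupSelector: #printString].
send := [:rcvr | | cls |
    cls := rcvr class.
    (pic detect: [:case | case key == cls] ifNone: [nil])
        ifNotNil: [:case | case value] "cache hit"
        ifNil: [pic size < 6
            ifTrue: [(pic add: cls -> (lookup value: cls)) value] "grow the PIC"
            ifFalse: [openCache at: cls ifAbsentPut: [lookup value: cls]]]]. "megamorphic"
{ 42. 'foo'. $a } collect: [:each | (send value: each) selector] "all #printString"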


[...]

One issue I had in Sista was with early PIC promotion. Currently in Cog, if there's already a megamorphic cache for a selector, monomorphic caches are rewritten directly to the megamorphic cache rather than to a PIC and then the megamorphic cache. This is a problem typically with sends such as #isNil. In some cases #isNil is megamorphic and using such a cache is relevant. In other cases there are just a given type and UndefinedObject, and there the VM currently uses a megamorphic cache instead of a PIC. This is especially annoying since megamorphic caches don't provide any runtime type feedback. It's a similar two-case issue, just not a boolean one.

Hence we disable the optimisation in the Sista JIT, right?



Re: in-line method lookup caching of booleans

Eliot Miranda-2
In reply to this post by Ben Coman
 
Hi Ben,

On Thu, Aug 2, 2018 at 7:28 PM, Ben Coman <[hidden email]> wrote:
 
[...]

First, I'm curious what is the relative performance of the three
different caches?

here's a measurement for Spur:

| n null void homogenousImmediateA homogenousImmediateB homogenousNormal polymorphic megamorphic |
Smalltalk garbageCollect.
n := 1000000.
void := Array new: 1024 withAll: nil.
homogenousImmediateA := Array new: 1024 withAll: 1.
homogenousImmediateB := Array new: 1024 withAll: $1.
homogenousNormal := Array new: 1024 withAll: 1 / 2.
polymorphic := Array new: 1024.
1 to: polymorphic size by: 4 do:
    [:i|
     polymorphic
        at: i put: i;
        at: i + 1 put: i asFloat;
        at: i + 2 put: i / 1025;
        at: i + 3 put: i * 1.0s1 "scaled decimal"].
megamorphic := Array new: 1024.
1 to: megamorphic size by: 8 do:
    [:i|
     megamorphic
        at: i put: i;
        at: i + 1 put: i asFloat;
        at: i + 2 put: i / 1025;
        at: i + 3 put: i * 1.0s1; "scaled decimal"
        at: i + 4 put: i class;
        at: i + 5 put: i asFloat class;
        at: i + 6 put: (i / 1025) class;
        at: i + 7 put: (i * 1.0s1) class].
null := [1 to: 1024 * n do: [:k| | e | e := void at: k \\ 1024 + 1. e "yourself"]] timeToRun; timeToRun.
{ homogenousImmediateA. homogenousImmediateB. homogenousNormal. polymorphic. megamorphic } collect:
    [:a|
     "Smalltalk voidCogVMState."
     ([1 to: 1024 * n do: [:k| | e | e := a at: k \\ 1024 + 1. e yourself". e yourself"]] timeToRun; timeToRun) - null]

And results on my 2.3GHz Intel Core i7 MacMini (in a far from quiet state, so take these with a pinch of salt) are

64-bit: #(1510 1703 1612 1830 2451)
32-bit: #(2852 4133 2843 3534 5317)

or...
                      64-bit    32-bit
monomorphic to 1        1510      2852
monomorphic to $1       1703      4133
monomorphic to obj      1612      2843
closed polymorphic      1830      3534
open polymorphic        2451      5317

The sad thing is that Spur's monomorphic sends to immediates are very slow, because 32-bit Spur supports 31-bit SmallInteger immediates (for compatibility with 32-bit V3).  Were it to use the simple tagging scheme of two-bit tags, sends would look more like the 64-bit ones.

The difference between monomorphic sends to 1 and sends to $1 on 64-bit shows the variability in timing due to a heavily loaded machine.  There's no difference in the instructions executed.

Here's the monomorphic cache checking sequence generated for the prolog of each method in (32-bit) V3:

LstackOverflow: # first byte of code, aligned on an 8 byte boundary
        xorl %edx, %edx
Lfail:
        call ceMethodAbort0Args
        nop
entry: # aligned on an 8 byte boundary
        movl %edx, %eax
        andl $1, %eax # only immediate class is SmallInteger
        jnz Lcompare
        movl (%edx), %eax
        andl $0x1f000, %eax # if non-zero, class is compact & compact class index used as cache tag
        jnz Lcompare
        movl -4(%edx), %eax # normal class plus 2 bits of header type tag, eliminate header type tag
        andl $0xfffffffc, %eax
Lcompare:
        cmpl %ecx, %eax
        jnz Lfail
noCheckEntry:

Here's the monomorphic cache checking sequence generated for the prolog of each method in 32-bit Spur:

LstackOverflow: # first byte of code, aligned on an 8 byte boundary
        xorl %edx, %edx
Lfail:
        call ceMethodAbort0Args
        nop
Limmediate: # aligned on an 8 byte boundary
        andl $1, %eax # map SmallInteger to 1, Character to 0
        jmp Lcompare
        nop
        nop
        nop
entry: # aligned on an 8 byte boundary
        movl %edx, %eax
        andl $3, %eax # SmallInteger is both 2r01 and 2r11, Character is 2r10
        jnz Limmediate
        movl (%edx), %eax
        andl $0x3fffff, %eax # cache tag is class index field
Lcompare:
        cmpl %ecx, %eax
        jnz Lfail
noCheckEntry:

Here's the monomorphic cache checking sequence generated for the prolog of each method in 64-bit Spur:

LstackOverflow: # first byte of code, aligned on an 8 byte boundary
        xorq %rdx, %rdx
Lfail:
        call ceMethodAbort0Args
entry:
        movq %rdx, %rax
        andq $7, %rax # SmallInteger is 2r001, Character is 2r010, SmallFloat is 2r100
        jnz Lcompare
        movq (%rdx), %rax
        andq $0x3fffff, %rax # cache tag is class index field
Lcompare:
        cmpq %rcx, %rax
        jnz Lfail
noCheckEntry:

The 64-bit sequence is the cleanest.  But if 32-bit Spur had 30-bit SmallIntegers alongside its 30-bit Characters then that sequence would be equally simple.
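
In image terms the 64-bit probe computes its cache tag roughly like this (a sketch over hypothetical raw words; real image code never sees oop bits): a non-zero low-three-bit tag is itself the cache tag, otherwise the 22-bit class index is masked out of the first header word.

| cacheTagFor |
cacheTagFor := [:oop :headerWord |
    (oop bitAnd: 7) ~= 0
        ifTrue: [oop bitAnd: 7] "immediate: 2r001, 2r010 or 2r100"
        ifFalse: [headerWord bitAnd: 16r3FFFFF]]. "heap object: class index field"
cacheTagFor value: 16r29 value: 0 "an oop ending in 2r001 answers 1"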

I suppose something like this could be quicker for 32-bit Spur with 31-bit SmallIntegers.  It would mean that sends to odd SmallIntegers were slower than sends to even ones, but would be quicker for Characters and even SmallIntegers:

LstackOverflow: # first byte of code, aligned on an 8 byte boundary
        xorl %edx, %edx
        xorl %eax, %eax # necessary because %eax may have some other value than %edx bitAnd: $3 when branching to LstackOverflow
Lfail:
        cmpl $3, %eax
        jz Limmediate
        call ceMethodAbort0Args
Limmediate:
        andl $1, %eax # map odd SmallInteger (2r11) to 2r01
        jmp Lcompare
entry: # aligned on an 8 byte boundary
        movl %edx, %eax
        andl $3, %eax # SmallInteger is both 2r01 and 2r11, Character is 2r10. SmallInteger cache tag is 2r01
        jnz Lcompare
        movl (%edx), %eax
        andl $0x3fffff, %eax # cache tag is class index field
Lcompare:
        cmpl %ecx, %eax
        jnz Lfail
noCheckEntry:


Second, I'm curious how Booleans are dealt with.  Boolean handling
must be fairly common, and at the Image level these are two different
classes, which pushes sends out of the monomorphic inline cache and may
have a significant impact on performance.

I started wondering if, under the hood, the VM could treat the two
booleans as one class and store in the instance which boolean it
actually is.  Then, for example, the #ifTrue:ifFalse: method of each
class would be compiled into a single method conditioned at the start
on "which boolean" as to which path is executed.  How feasible would
that be, and would it make much of a difference in the performance of
booleans?

You can see that the closeness in performance between monomorphic and closed polymorphic sends means that handling true and false as instances of two different classes costs relatively little in performance.
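
A rough way to sense that cost from the image, in the style of the measurement above (a sketch; numbers will vary): #not is a real send that sees both True and False, i.e. a two-case closed PIC, while #ifTrue:ifFalse: compiles to jumps.

| bools |
bools := (1 to: 1024) collect: [:i | i odd]. "alternating true/false receivers"
{ [1 to: 1000 do: [:n | bools do: [:b | b not]]] timeToRun. "closed PIC send"
  [1 to: 1000 do: [:n | bools do: [:b | b ifTrue: [1] ifFalse: [2]]]] timeToRun } "inlined"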
 

_,,,^..^,,,_
best, Eliot

Re: in-line method lookup caching of booleans

Eliot Miranda-2
 
BTW, I have changes for the suggested speedup that are not yet complete.  The complication is closed PICs, which need to check for the odd SmallInteger receiver in a different place.  If you're burning with curiosity and want to try and make this work then find the changes I have so far attached...

On Fri, Aug 3, 2018 at 4:04 PM, Eliot Miranda <[hidden email]> wrote:
[...]

Attachment: AlternativeInlineCacheProbeFor32BitSpur.st (27K)

Re: in-line method lookup caching of booleans

Ben Coman
In reply to this post by Eliot Miranda-2
 
Hi Eliot, Thanks for the great detail in your reply.

On 4 August 2018 at 08:35, Eliot Miranda <[hidden email]> wrote:

>>> [...]
>>> And results on my 2.3GHz Intel Core i7 MacMini (in a far from quiet state, so take these with a pinch of salt) are
>>>
>>> 64-bit: #(1510 1703 1612 1830 2451)
>>> 32-bit: #(2852 4133 2843 3534 5317)
>>>
>>> or...
>>> 64-bit 32-bit
1. >>> monomorphic to 1 1510 2852
2. >>> monomorphic to $1 1703 4133
3. >>> monomorphic to obj 1612 2843
4. >>> closed polymorphic 1830 3534
5. >>> open polymorphic 2451 5317

>>> You can see that the closeness in performance between monomorphic and closed polymorphic sends means that handling true and false as instances of two different classes costs relatively little in performance.


I presume those descriptions are a direct match for each of these
descriptions from the V3 article.
1. homogenous immediate     2.8 nsecs
2. homogenous compact       8.6 nsecs
3. homogenous normal        8.5 nsecs
4. polymorphic             11.2 nsecs
5. megamorphic             16.7 nsecs

The article says about V3 "the cost of class access is a significant
percentage of overall send costs: 8.6 – 2.8 = 5.8"

That cost seems to be gone from Spur.  I presume because...
>>> # cache tag is class index field
whereas (IIUC) in V3 the class tag was a pointer that needed to be
looked up elsewhere.


So a random thought (I'm out of my depth, but curious...):
what is the cost (in machine code instructions) of distinguishing
between monomorphic and closed polymorphic caches?
If monomorphic and closed polymorphic performance is so close, could
it be faster to eliminate the monomorphic cache checks
and look up the polymorphic cache first?
I suppose it's a matter of the space taken up by a fixed-size polymorphic
cache.  Could the size of the polymorphic cache be dynamic?

And a provocative thought: while it seems *REALLY* counter-intuitive,
do we actually need immediates?
That would seem to eliminate two thirds of the 64-bit prolog.
But maybe I'm misled, if the stats above cover only the method lookup
and not getting hold of the values for calculation.

Or is an explicit check for immediates only required for certain
methods doing a direct calculation with the values?
Or perhaps Sista might be able to determine that immediate objects are
not common at a particular send site,
and avoid the immediate checks at that site (presuming that if an
immediate object turns up, a normal class lookup would still work?).


> BTW, I have changes for the suggested speedup that are not yet complete.  The complication is closed PICs which need to check for the odd SmallInteger receiver in a different place.  If you're burning with curiosity and want to try and make this work then find the changes I have so far attached...

I'm tempted, but I have a few other priorities atm.

cheers -ben