Just a brain twitch about something I'd like to understand better...

At http://www.mirandabanda.org/cogblog/2011/03/01/build-me-a-jit-as-fast-as-you-can/ it says: "[Monomorphic] inline caching depends on the fact that in most programs at most send sites there is no polymorphism and that most sends bind to *exactly_one* class of receiver over some usefully long period of time. ... In Cog we implement inline caches in three forms ... monomorphic inline cache ... closed polymorphic inline cache ... open polymorphic cache. What’s nice is that while these sends are increasingly expensive moving from monomorphic to megamorphic they are also decreasingly common in roughly a 90%, 9%, 0.9% ratio, at least for typical Smalltalk programs."

First, I'm curious what the relative performance of the three different caches is.

Second, I'm curious how Booleans are dealt with. Boolean handling must be fairly common, and at the Image level these are two different classes, which pushes such send sites out of the monomorphic inline cache, which may have a significant impact on performance. I started wondering whether under the hood the VM could treat the two booleans as one class and store in the instance "which-boolean" it actually is. Then, for example, the #ifTrue:ifFalse: method of each class would be compiled into a single method conditioned at the start on "which-boolean" to select which path is executed. How feasible would that be, and would it make much of a difference to the performance of booleans?

Except then I realized that this situation is already bypassed by common boolean methods being inlined by the compiler. Still, I'm curious if there are quick answers to my questions.

cheers -ben
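Ben's "which-boolean" idea can be made concrete with a small sketch (Python rather than Smalltalk, and purely illustrative; none of these names exist in the image or the VM):

```python
# Illustrative sketch (not VM code): a single hypothetical Boolean class
# whose instances record "which boolean" they are, so a send site for
# #ifTrue:ifFalse: sees exactly one receiver class and stays monomorphic.

class WhichBoolean:
    def __init__(self, which):
        self.which = which            # the "which-boolean" instance slot

    def if_true_if_false(self, true_block, false_block):
        # One compiled method serving both booleans, conditioned at the
        # start on the which-boolean slot, as Ben describes.
        return true_block() if self.which else false_block()

TRUE = WhichBoolean(True)
FALSE = WhichBoolean(False)
```

As the thread goes on to explain, the win would be small in practice: common boolean selectors are inlined by the bytecode compiler, and true/false have fixed addresses the JIT can compare against directly.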
Hi,
On Fri, Aug 3, 2018 at 4:28 AM, Ben Coman <[hidden email]> wrote:
Note that in my experience the ratio was more like 93%, 4%, 3% each time I measured it in the Cog Simulator. Cliff Click told me recently he had the same feeling with Java code: sends are either mono- or megamorphic, but rarely polymorphic (though in the JVM polymorphism is capped at 4 cases).
I guess you can measure that with micro benchmarks. Levente can help with that; I would write complete crap. Note that our polymorphism is up to 6 cases.

In the case of Cog, I would expect monomorphic caches to be almost as fast as direct calls, polymorphic caches to be almost as fast as monomorphic, and megamorphic caches to be considerably slower. To be confirmed. Note also that for some reason in Cog PICs are called ClosedPICs and megamorphic caches are called OpenPICs.
Control flow operations are inlined by the bytecode compiler, and they're the most critical performance-wise. The VM fixes the addresses of specific objects (nil, true, false); become and become-forward don't work on those objects. The JIT can therefore generate the addresses of those objects as constants in machine code, which allows it to quicken the inlined control flow operations. Non-inlined boolean operations are usually dealt with by PICs with 2 cases.

One issue I had in Sista was with early PIC promotion. Currently in Cog, if there's already a megamorphic cache for a selector, monomorphic caches are rewritten directly to the megamorphic cache, not to a PIC and then to a megamorphic cache. This is a problem typically with sends such as #isNil. In some cases the #isNil send site really is megamorphic and using such a cache is relevant. In other cases there are just one given type and UndefinedObject, yet the VM currently uses a megamorphic cache instead of a PIC. This was especially annoying since megamorphic caches don't provide any runtime type feedback. It's a similar two-case issue, just not a boolean one.

> I started wondering if under the hood the VM could treat the two booleans as one class ...

Some VMs have booleans as primitive types or as immediate objects. Honestly, given our optimization that the booleans can't move in memory and that we can use their addresses as constants, I don't think any of these optimizations would make a significant difference in performance. What might make a difference is to add some other boolean methods, such as #not, as primitive/inlined operations. It's a trade-off between system flexibility, boolean and non-boolean performance, etc.

> Except then I realized that this situation is already bypassed by ...
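The early-PIC-promotion behaviour Clément describes can be sketched as a send-site relinking policy (an illustrative Python model, not Cog's actual code; all names here are hypothetical):

```python
# Illustrative sketch of the relinking policy described above: once a
# megamorphic (open) cache exists for a selector, a monomorphic cache
# miss is relinked straight to it, skipping the polymorphic (closed PIC)
# stage -- and with it the per-site type feedback a closed PIC records.

megamorphic_caches = {}        # selector -> shared open cache, one per selector

def relink_on_miss(send_site, selector):
    """Choose the next cache form for a send site that just missed."""
    if selector in megamorphic_caches:
        # Early promotion: cheap to link, but an optimizer like Sista can
        # no longer see which classes this particular site receives.
        send_site["cache"] = megamorphic_caches[selector]
        send_site["kind"] = "megamorphic"
    else:
        # Normal progression: grow a per-site closed PIC, which doubles
        # as runtime type feedback.
        send_site["cache"] = {"cases": []}
        send_site["kind"] = "polymorphic"
    return send_site
```

This is why a two-class site like #isNil (one type plus UndefinedObject) can end up megamorphic: some other #isNil site already forced an open cache into existence for the selector.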
Hi Clément, Hi Ben,

From the blog post, the V3 send costs were:

    homogenous immediate    2.8 nsecs
    homogenous compact      8.6 nsecs
    homogenous normal       8.5 nsecs
    polymorphic            11.2 nsecs
    megamorphic            16.7 nsecs

"In Cog we implement inline caches in three forms:
– a monomorphic inline cache; a load of a cache tag and a call of a target method whose prolog checks that the current receiver’s class matches the cache tag
– a closed polymorphic inline cache; a growable but finite jump table of class checks and jumps to method bodies (closed because it deals with a closed set of classes)
– an open polymorphic cache; a first-level method lookup probe specialized for a particular selector (open since it deals with an open set of classes)."

_,,,^..^,,,_ (phone)
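The three cache forms quoted above can be modelled in a few lines (an illustrative Python sketch, not the VM's machine-code implementation; the class names and the full-lookup callback are hypothetical):

```python
# Illustrative models of Cog's three inline-cache forms. `lookup` stands
# in for the slow full method lookup; targets are plain functions.

class MonomorphicCache:
    """One cached class tag; hit calls the cached target directly."""
    def __init__(self, lookup):
        self.lookup = lookup
        self.tag = None               # cached receiver class
        self.target = None            # cached method

    def send(self, receiver, selector):
        cls = type(receiver)
        if cls is self.tag:
            return self.target(receiver)
        # Miss: in Cog a second distinct class would promote the site to
        # a closed PIC; this toy model simply relinks the cache.
        self.tag = cls
        self.target = self.lookup(cls, selector)
        return self.target(receiver)

class ClosedPIC:
    """Growable but finite jump table of (class, target) cases."""
    LIMIT = 6                          # thread says polymorphism is up to 6

    def __init__(self, lookup):
        self.lookup = lookup
        self.cases = []                # closed set of (class, target) pairs

    def send(self, receiver, selector):
        cls = type(receiver)
        for tag, target in self.cases:
            if cls is tag:
                return target(receiver)
        if len(self.cases) >= self.LIMIT:
            # In Cog the site would now be relinked to an open PIC
            # (megamorphic cache); here we just do a full lookup.
            return self.lookup(cls, selector)(receiver)
        self.cases.append((cls, self.lookup(cls, selector)))
        return self.cases[-1][1](receiver)

class OpenPIC:
    """First-level probe keyed by receiver class, for an open set."""
    def __init__(self, lookup):
        self.lookup = lookup
        self.probe = {}                # class -> target

    def send(self, receiver, selector):
        cls = type(receiver)
        if cls not in self.probe:
            self.probe[cls] = self.lookup(cls, selector)
        return self.probe[cls](receiver)
```

In the real JIT these are machine-code stubs patched in at the send site; the model only captures the lookup structure and the promotion order, which is what makes the mono/poly/mega cost ordering intuitive.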
In reply to this post by Ben Coman
Hi Ben,
On Thu, Aug 2, 2018 at 7:28 PM, Ben Coman <[hidden email]> wrote:
here's a measurement for Spur:

| n null void homogenousImmediateA homogenousImmediateB homogenousNormal polymorphic megamorphic |
Smalltalk garbageCollect.
n := 1000000.
void := Array new: 1024 withAll: nil.
homogenousImmediateA := Array new: 1024 withAll: 1.
homogenousImmediateB := Array new: 1024 withAll: $1.
homogenousNormal := Array new: 1024 withAll: 1 / 2.
polymorphic := Array new: 1024.
1 to: polymorphic size by: 4 do:
    [:i|
    polymorphic
        at: i put: i;
        at: i + 1 put: i asFloat;
        at: i + 2 put: i / 1025;
        at: i + 3 put: i * 1.0s1 "scaled decimal"].
megamorphic := Array new: 1024.
1 to: megamorphic size by: 8 do:
    [:i|
    megamorphic
        at: i put: i;
        at: i + 1 put: i asFloat;
        at: i + 2 put: i / 1025;
        at: i + 3 put: i * 1.0s1; "scaled decimal"
        at: i + 4 put: i class;
        at: i + 5 put: i asFloat class;
        at: i + 6 put: (i / 1025) class;
        at: i + 7 put: (i * 1.0s1) class].
null := [1 to: 1024 * n do: [:k| | e | e := void at: k \\ 1024 + 1. e "yourself"]] timeToRun; timeToRun.
{ homogenousImmediateA. homogenousImmediateB. homogenousNormal. polymorphic. megamorphic } collect:
    [:a|
    "Smalltalk voidCogVMState."
    ([1 to: 1024 * n do: [:k| | e | e := a at: k \\ 1024 + 1. e yourself". e yourself"]] timeToRun; timeToRun) - null]

And results on my 2.3GHz Intel Core i7 MacMini (in a far from quiet state, so take these with a pinch of salt) are

64-bit: #(1510 1703 1612 1830 2451)
32-bit: #(2852 4133 2843 3534 5317)

or...

                        64-bit  32-bit
monomorphic to 1          1510    2852
monomorphic to $1         1703    4133
monomorphic to obj        1612    2843
closed polymorphic        1830    3534
open polymorphic          2451    5317

The sad thing is that 32-bit Spur's monomorphic sends to immediates are very slow, because 32-bit Spur supports 31-bit SmallInteger immediates (for compatibility with 32-bit V3). Were it to use the simple tagging scheme of two-bit tags, sends would look more like the 64-bit ones. The difference between monomorphic sends to 1 and sends to $1 on 64-bit shows the variability in timing due to a heavily loaded machine.
There's no difference in the instructions executed. Here's the monomorphic cache checking sequence generated for the prolog of each method in (32-bit) V3:

LstackOverflow: # first byte of code, aligned on an 8 byte boundary
    xorl %edx, %edx
Lfail:
    call ceMethodAbort0Args
    nop
entry: # aligned on an 8 byte boundary
    movl %edx, %eax
    andl $1, %eax # only immediate class is SmallInteger
    jnz Lcompare
    movl (%edx), %eax
    andl $0x1f000, %eax # if non-zero, class is compact & compact class index used as cache tag
    jnz Lcompare
    movl -4(%edx), %eax # normal class plus 2 bits of header type tag, eliminate header type tag
    andl $0xfffffffc, %eax
Lcompare:
    cmpl %ecx, %eax
    jnz Lfail
noCheckEntry:

Here's the monomorphic cache checking sequence generated for the prolog of each method in 32-bit Spur:

LstackOverflow: # first byte of code, aligned on an 8 byte boundary
    xorl %edx, %edx
Lfail:
    call ceMethodAbort0Args
    nop
Limmediate: # aligned on an 8 byte boundary
    andl $1, %eax # map SmallInteger to 1, Character to 0
    jmp Lcompare
    nop
    nop
    nop
entry: # aligned on an 8 byte boundary
    movl %edx, %eax
    andl $3, %eax # SmallInteger is both 2r01 and 2r11, Character is 2r10
    jnz Limmediate
    movl (%edx), %eax
    andl $0x3fffff, %eax # cache tag is class index field
Lcompare:
    cmpl %ecx, %eax
    jnz Lfail
noCheckEntry:

Here's the monomorphic cache checking sequence generated for the prolog of each method in 64-bit Spur:

LstackOverflow: # first byte of code, aligned on an 8 byte boundary
    xorq %rdx, %rdx
Lfail:
    call ceMethodAbort0Args
entry:
    movq %rdx, %rax
    andq $7, %rax # SmallInteger is 2r001, Character is 2r010, SmallFloat is 2r100
    jnz Lcompare
    movq (%rdx), %rax
    andq $0x3fffff, %rax # cache tag is class index field
Lcompare:
    cmpq %rcx, %rax
    jnz Lfail
noCheckEntry:

The 64-bit sequence is the cleanest. But if 32-bit Spur had 30-bit SmallIntegers alongside its 30-bit Characters then that sequence would be equally simple. I suppose something like this could be quicker for 32-bit Spur with 31-bit SmallIntegers.
It would mean that sends to odd SmallIntegers were slower than sends to even ones, but it would be quicker for Characters and even SmallIntegers:

LstackOverflow: # first byte of code, aligned on an 8 byte boundary
    xorl %edx, %edx
    xorl %eax, %eax # necessary because %eax may have some other value than %edx bitAnd: $3 when branching to LstackOverflow
Lfail:
    cmpl $3, %eax
    jnz Limmediate
    call ceMethodAbort0Args
Limmediate:
    andl $1, %eax # map odd SmallInteger (2r11) to 2r01
    jmp Lcompare
entry: # aligned on an 8 byte boundary
    movl %edx, %eax
    andl $3, %eax # SmallInteger is both 2r01 and 2r11, Character is 2r10. SmallInteger cache tag is 2r01
    jnz Lcompare
    movl (%edx), %eax
    andl $0x3fffff, %eax # cache tag is class index field
Lcompare:
    cmpl %ecx, %eax
    jnz Lfail
noCheckEntry:

> Second, I'm curious how Booleans are dealt with. ...

You can see that the closeness in performance between monomorphic and closed polymorphic sends means that handling true and false as instances of two different classes costs relatively little in performance.

> Except then I realized that this situation is already bypassed by ...

_,,,^..^,,,_
best, Eliot
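The cache-tag computation in the prologs above can be restated as a short sketch (illustrative Python mirroring the constants of the 64-bit Spur sequence; the header layout is simplified, and none of this is the VM's actual code):

```python
# Illustrative model of the 64-bit Spur cache-tag computation: an
# immediate's tag bits (SmallInteger 2r001, Character 2r010, SmallFloat
# 2r100) serve directly as the cache tag; for a heap object the low
# 22 bits of the header word (the class index field) do.

TAG_MASK = 0x7               # low 3 bits of an oop (andq $7, %rax)
CLASS_INDEX_MASK = 0x3fffff  # 22-bit class index field (andq $0x3fffff, %rax)

def cache_tag_for(oop, header):
    tag = oop & TAG_MASK
    if tag != 0:
        return tag                    # immediate: tag bits are the cache tag
    return header & CLASS_INDEX_MASK  # heap object: class index from header

def monomorphic_hit(oop, header, cached_tag):
    # A monomorphic send then reduces to a single compare against the
    # tag cached at the send site, as in the Lcompare sequences above.
    return cache_tag_for(oop, header) == cached_tag
```

The 32-bit complication discussed in the thread is exactly that 31-bit SmallIntegers occupy two tag patterns (2r01 and 2r11), so the prolog needs an extra step to normalize them to one cache tag.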
On Fri, Aug 3, 2018 at 3:54 PM, Eliot Miranda <[hidden email]> wrote:
Oops!! The third prolog sequence is of course for 64-bit Spur, not 32-bit Spur.

_,,,^..^,,,_
best, Eliot
BTW, I have changes for the suggested speedup that are not yet complete. The complication is closed PICs, which need to check for the odd SmallInteger in a different place. If you're burning with curiosity and want to try and make this work, find the changes I have so far attached... On Fri, Aug 3, 2018 at 4:04 PM, Eliot Miranda <[hidden email]> wrote:
_,,,^..^,,,_
best, Eliot

Attachment: AlternativeInlineCacheProbeFor32BitSpur.st (27K)
BTW, I have changes for the suggested speedup that are not yet complete. The complication is closed PICs which need to check for the odd SmallInteger receiver in a different place. If you're burning with curiosity and want to try and make this work then find the changes I have so far attached... On Fri, Aug 3, 2018 at 5:35 PM, Eliot Miranda <[hidden email]> wrote:
_,,,^..^,,,_
best, Eliot

Attachment: AlternativeInlineCacheProbeFor32BitSpur.st (27K)
In reply to this post by Eliot Miranda-2
Hi Eliot,

Thanks for the great detail in your reply.

On 4 August 2018 at 08:35, Eliot Miranda <[hidden email]> wrote:
>
> BTW, I have changes for the suggested speedup that are not yet complete. The complication is closed PICs which need to check for the odd SmallInteger receiver in a different place. If you're burning with curiosity and want to try and make this work then find the changes I have so far attached...
>
>>>> First, I'm curious what is the relative performance of the three different caches ?
>>>
>>> here's a measurement for Spur:
>>> [...]
>>> And results on my 2.3GHz Intel Core i7 MacMini (in a far from quiet state, so take these with a pinch of salt) are
>>>
>>> 64-bit: #(1510 1703 1612 1830 2451)
>>> 32-bit: #(2852 4133 2843 3534 5317)
>>>
>>> or...
>>>                           64-bit  32-bit
>>> 1. monomorphic to 1         1510    2852
>>> 2. monomorphic to $1        1703    4133
>>> 3. monomorphic to obj       1612    2843
>>> 4. closed polymorphic       1830    3534
>>> 5. open polymorphic         2451    5317
>>>
>>> You can see that the closeness in performance between monomorphic and closed polymorphic sends means that handling true and false as instances of two different classes costs relatively little in performance.

I presume those descriptions are a direct match for each of these descriptions from the V3 article:

1. homogenous immediate   2.8 nsecs
2. homogenous compact     8.6 nsecs
3. homogenous normal      8.5 nsecs
4. polymorphic           11.2 nsecs
5. megamorphic           16.7 nsecs

The article says about V3 that "the cost of class access is a significant percentage of overall send costs: 8.6 – 2.8 = 5.8". That cost seems to be gone from Spur. I presume because...
>>> # cache tag is class index field

whereas (IIUC) in V3 the class tag was a pointer that needed to be looked up elsewhere.

So a random thought (I'm out of my depth, but curious...): what is the cost (in machine code instructions) of distinguishing between monomorphic and closed polymorphic caches? If monomorphic and closed polymorphic performance is so close, could it be faster to eliminate the monomorphic cache checks and look up the polymorphic cache first? I suppose it's a matter of the space taken up by a fixed-size polymorphic cache. Could the size of the polymorphic cache be dynamic?

And a provocative thought: while it seems *REALLY* counter-intuitive, do we actually need immediates? That would seem to eliminate two thirds of the 64-bit prolog. But maybe I'm misled, if the stats above cover only the method lookup and not getting hold of the values for calculation. Or is an explicit check for immediates only required for certain methods doing a direct calculation with the values? Or perhaps Sista might be able to determine that immediate objects are not common at a particular send site, and avoid the immediate checks at that site (presuming that if an immediate object turns up, a normal class lookup would still work?).

I presume here you mean 64-bit Spur...

>>> Here's the monomorphic cache checking sequence generated for the prolog of each method in 32-bit Spur:
>>>
>>> LstackOverflow: # first byte of code, aligned on an 8 byte boundary
>>>     xorq %rdx, %rdx
>>> Lfail:
>>>     call ceMethodAbort0Args
>>> entry:
>>>     movq %rdx, %rax
>>>     andq $7, %rax # SmallInteger is 2r001, Character is 2r010, SmallFloat is 2r100
>>>     jnz Lcompare
>>>     movq (%rdx), %rax
>>>     andq $0x3fffff, %rax # cache tag is class index field
>>> Lcompare:
>>>     cmpq %rcx, %rax
>>>     jnz Lfail
>>> noCheckEntry:
>>>
>>> The 64-bit sequence is the cleanest.

> BTW, I have changes for the suggested speedup that are not yet complete.
> The complication is closed PICs which need to check for the odd SmallInteger receiver in a different place. If you're burning with curiosity and want to try and make this work then find the changes I have so far attached...

I'm tempted, but I have a few other priorities atm.

cheers -ben