Hi all, particularly Clément and Eliot,

One of the most annoying limits of the bytecode is the number of arguments (< 16 in V3). This is not such a problem for pure Smalltalk, but it certainly is for FFI (FORTRAN 77 lacks structures, so existing code bases often have functions with many arguments). For scientific Smalltalk, some of those old FORTRAN libraries are still around nowadays (LAPACK is an example).

I patched the old Squeak compiler in Smallapack to work around this limitation (it was easy enough to pass a single Array and invoke FFI with many args). In the modern Pharo flavour this is more involved with the new OpalCompiler (it does not seem to be designed for extensibility, as it seems necessary to patch many pieces/subclasses for a single feature change...).

But we now have the Sista V1 bytecodes, which removed a lot of limitations (# inst vars, # literals, max jump offset, ...). Alas, I don't see a raised limit for the number of arguments (source: https://hal.inria.fr/hal-01088801/document "A bytecode set for adaptive optimization"): there is still a limit of 4 reserved bits in the compiled method header documented in the link above. Though there is an adjacent unused bit now...

In Squeak/Pharo, EncoderForSistaV1>>genSend:numArgs: suggests that the limit is 31 (sic):

    (nArgs < 0 or: [nArgs > 31]) ifTrue:
        [^self outOfRangeError: 'numArgs' index: nArgs range: 0 to: 31 "!!"].

or at least 2047 if we believe the encoding below:

    "234  11101010  i i i i i j j j  Send Literal Selector #iiiii (+ Extend A * 32) with jjj (+ Extend B * 8) Arguments"

Pharo also limits numArgs to 15 whatever the encoding, in CompiledMethod>>newBytes:trailerBytes:nArgs:nTemps:nStack:nLits:primitive:, but Squeak does not limit nArgs at all in EncoderForSistaV1>>computeMethodHeaderForNumArgs:numTemps:numLits:primitive:

So my questions:
- is that doc up-to-date?
- if so, couldn't we expand the limit to 31 args by using the unused bit?

Note: there is another unused bit in V3 (not adjacent), and the double extended (send) bytecode has room for 31 args in V3 too, since only the first 3 bits of the second byte encode the type of operation...
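(For the record, here is my quick arithmetic behind those two figures, assuming a single Extend B prefix byte for the send:)

    (2 raisedTo: 4) - 1.    "the 4 numArgs bits in the method header -> 15"
    7 + (255 * 8).          "byte 234: jjj (max 7) + ExtB (max 255 with one extension byte) * 8 -> 2047"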
On Tue, Jan 8, 2019 at 10:51 PM, Nicolas Cellier <[hidden email]> wrote:
Confirmed in VMMaker: the max num args is still 15 in the CompiledMethod header:

    StackInterpreter>>argumentCountOfMethodHeader: header
        <api>
        ^header >> MethodHeaderArgCountShift bitAnd: 16rF

I see that this decoding is shared whatever the bytecode set...
On Tue, Jan 8, 2019 at 11:07 PM, Nicolas Cellier <[hidden email]> wrote:
And the other unused bit (the flag, position 30, 0-based) is not always unused... See maybeFlagMethodAsInterpreted:

    realHeader := realHeader bitOr:
        (objectMemory integerObjectOf: 1 << MethodHeaderFlagBitPosition).

Why do we flag the interpreted methods?

I had the impression that the offset was off by 1:

    MethodHeaderFlagBitPosition := 28 + tagBits.

but we are being tricky, since integerObjectOf: will perform the missing 1-bit shift... This way, Newspeak, which uses both bits 29-30, has the right offset. Ouch. I now see no room left in that header...
In reply to this post by Nicolas Cellier
Hi Nicolas, On Tue, Jan 8, 2019 at 1:51 PM Nicolas Cellier <[hidden email]> wrote:
Agreed. There are VW users out there with autogenerated code that requires more than 15 arguments. Clément and I already have a design in mind, which is much more elegant than using the extra bit below. However, it does require that we change the maximum Context stack size, which is one reason (the other being lack of time) why we haven't implemented this so far.

In 2008 my closure design introduced indirection vectors for closed-over arguments, and among the five bytecodes added to implement it was the Create Array bytecode, which can do one of two things:

V3PlusClosures:
    138  10001010 jkkkkkkk  Push (Array new: kkkkkkk) (j = 0) or Pop kkkkkkk elements into: (Array new: kkkkkkk) (j = 1)
SistaV1:
    231  11100111 jkkkkkkk  Push (Array new: kkkkkkk) (j = 0) & Pop kkkkkkk elements into: (Array new: kkkkkkk) (j = 1)

This bytecode is used to create indirection vectors, and to create tuples of size <= 8, e.g.

    { thisContext method symbolic. 2. 3. 4. 5. 6. 7. 8 }

    89 <52> pushThisContext:
    90 <81> send: method
    91 <80> send: symbolic
    92 <E8 02> pushConstant: 2
    94 <E8 03> pushConstant: 3
    96 <E8 04> pushConstant: 4
    98 <E8 05> pushConstant: 5
    100 <E8 06> pushConstant: 6
    102 <E8 07> pushConstant: 7
    104 <E8 08> pushConstant: 8
    106 <E7 88> pop 8 into (Array new: 8)
    108 <5C> returnTop

We call this version of the bytecode the cons array bytecode. The other form, used to create indirection vectors, is the create array bytecode (cf. { thisContext method symbolic. 2. 3. 4. 5. 6. 7. 8. 9 }, which produces much more code but requires only 2 elements of stack depth).

There are also three bytecodes used to access indirection vectors:

V3PlusClosures:
    140  10001100 kkkkkkkk jjjjjjjj  Push Temp At kkkkkkkk In Temp Vector At: jjjjjjjj
    141  10001101 kkkkkkkk jjjjjjjj  Store Temp At kkkkkkkk In Temp Vector At: jjjjjjjj
    142  10001110 kkkkkkkk jjjjjjjj  Pop and Store Temp At kkkkkkkk In Temp Vector At: jjjjjjjj
SistaV1:
    251      11111011 kkkkkkkk sjjjjjjj  Push Temp At kkkkkkkk In Temp Vector At: jjjjjjj, s = 1 implies remote inst var access instead of remote temp vector access *
    252 (3)  11111100 kkkkkkkk sjjjjjjj  Store Temp At kkkkkkkk In Temp Vector At: jjjjjjj, s = 1 implies remote inst var access instead of remote temp vector access *
    253 (3)  11111101 kkkkkkkk sjjjjjjj  Pop and Store Temp At kkkkkkkk In Temp Vector At: jjjjjjj, s = 1 implies remote inst var access instead of remote temp vector access

So the insight is that if we pass arguments beyond 14 in an indirection vector we can have up to 15 + 127 = 142 arguments without needing any extra bits in a CompiledMethod header or range in a bytecode. We simply pop arguments beyond the 14th into an indirection vector, using the cons array bytecode. Yes, this is slow compared to "native" support, but such methods are extremely rare, and supporting them this way means we have less waste elsewhere. It will require some sophistication in the Decompiler, but otherwise seems quite simple.

With this design, as far as the VM is concerned the maximum argument count is still 15. Only the image need bother with how to record the argument count for a method that has 15 or more arguments, and indeed a method with 15 arguments can still use all 15 arguments without having to create an indirection vector.
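To make the marshalling concrete, here is a hand-written sketch (not actual compiler output; byte encodings and offsets omitted, selector name made up) of the caller side of a 16-argument send under this scheme:

    pushReceiver
    pushTemp: 0                    "arguments 1..14 pushed as usual"
    ...
    pushTemp: 13
    pushTemp: 14                   "the two excess arguments, 15 and 16"
    pushTemp: 15
    pop 2 into (Array new: 2)      "cons array bytecode builds the indirection vector"
    send: #someSixteenArgSelector  "a 15-argument send as far as the VM is concerned"

and the callee would then read its 15th and 16th formals with the remote temp bytecodes above (Push Temp At 0/1 In Temp Vector At: 14).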
This isolates the effects to the compiler (arguments beyond the 14th, in methods with more than 15 arguments, must be accessed using the indirection vector bytecodes above), but otherwise the changes are quite localized: indirection vector creation occurs immediately after normal argument marshalling and immediately before the send bytecode.

Does this design appeal to you? If it does, then we should discuss when and how it should be implemented. One thing would be to make the maximum size of a Context (defined at the image level by CompiledCode's LargeFrame class variable, but hard-coded into the VM) some kind of VM parameter, e.g. stored in the image header and read at start-up. It would be quite easy to add this. If we did so, we should also ensure the stack page size calculation allows for a stack page big enough for one or two huge frames. Note that the design also means that a large stack is needed only to *marshal* arguments, not to activate a method with many arguments, since the excess arguments are stored in an indirection vector.

P.S. Indeed we could use the scheme used for arbitrary-sized tuples to marshal the extra arguments, but this would affect code generation much more. Different code would have to be used to marshal each argument beyond 15; whereas using the cons array bytecode
_,,,^..^,,,_ best, Eliot
In reply to this post by Nicolas Cellier
Hi Nicolas, On Tue, Jan 8, 2019 at 2:30 PM Nicolas Cellier <[hidden email]> wrote:
This is for me, for profiling. Because Cog is a hybrid interpreter + JIT, it can use the policy of avoiding JITting a method until it is found in the method lookup cache, or (to make sure that doits run at full speed) evaluated by one of the withArgs:executeMethod: primitives. This means that the JIT doesn't waste time JITting huge methods that are only evaluated once on system startup. JITting is like a very slow interpretation, so it doesn't make sense to JIT a method that is simply a long sequence of statements if it is only used once; it will be much quicker and use much less space to simply interpret it once, instead of JITting it and then evaluating it once.

Further, the VM won't JIT anything with more than a certain number of literals anyway. This is a startup parameter and defaults to 60 literals: any method with more than 60 literals won't be JITted. But I wanted to be sure that this policy was not affecting any methods that are performance critical. So I added the facility to flag interpreted methods, so I can see which methods are executed that the VM chooses not to JIT. Back in 2009/2010 I did indeed use the facility to check, and found that my intuitions were correct and that the policy is a good one (although our including the selectors for inlined messages such as ifTrue:ifFalse: et al might mean we want to increase the limit on the number of literals for JITting a little, 65 perhaps?).
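In pseudo-Smalltalk, the policy reads roughly like this (my paraphrase of the above, with made-up names; not actual VMMaker code):

    shouldJITMethod: aMethod
        "Paraphrase of the policy described above: only JIT methods that are likely
         to be reused and that are not too literal-heavy. All names here are made up."
        aMethod numLiterals > maxLiteralCountForJIT ifTrue: [^false].    "defaults to 60"
        ^(methodLookupCache includes: aMethod)
            or: [self wasEvaluatedViaWithArgsExecuteMethod: aMethod]    "so doits run at full speed"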
_,,,^..^,,,_ best, Eliot
On Tue, Jan 8, 2019 at 2:39 PM Eliot Miranda <[hidden email]> wrote:
Note also that the interpreter counts backward branches and if it finds it is in a method that is looping a lot (default 20 iterations IIRC) it will JIT compile the method and switch to the JIT version of it on the 20th iteration. This avoids the VM getting stuck interpreting a method that loops forever.
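Again in pseudo-Smalltalk (my paraphrase of the mechanism, made-up names; not actual VMMaker code):

    onBackwardBranchIn: aMethod
        "Hypothetical sketch of the hot-loop heuristic: count backward branches while
         interpreting; once the method is clearly looping, JIT it and switch over."
        backwardBranchCount := backwardBranchCount + 1.
        backwardBranchCount >= 20 ifTrue:    "default threshold, per the description above"
            [self jitCompile: aMethod.
             self switchCurrentFrameToMachineCodeOf: aMethod]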
_,,,^..^,,,_ best, Eliot
In reply to this post by Eliot Miranda-2
On Tue, Jan 8, 2019 at 11:31 PM, Eliot Miranda <[hidden email]> wrote:
Didn't we discuss that already? The fact that we don't have super-fast calls is OK for me. But I prefer a single Array, because I will use the invokeWithArguments: FFI primitive.

    transformFFICallwithManyArguments: aPattern
        "Transform current parseNode into an invokeWithArguments: send suitable for the case of many arguments"
        ^BlockNode new
            temporaries: #();    "Just to avoid later problems with BlockAnalyzer"
            arguments: #()
            statements: (OrderedCollection with:
                (MessageNode new
                    receiver: (BlockNode new
                            temporaries: #();    "Just to avoid later problems with BlockAnalyzer"
                            arguments: #()
                            statements: (OrderedCollection with:
                                (MessageNode new
                                    receiver: (encoder encodeLiteral: encoder literals first)
                                    selector: #invokeWithArguments:
                                    arguments: aPattern arguments
                                    precedence: 3
                                    from: encoder))
                            returns: false
                            from: encoder)
                    selector: #ifError:
                    arguments: (Array with: (parseNode temporaries: parseNode temporaries; yourself))
                    precedence: 3
                    from: encoder) asReturnNode)
            returns: true
            from: encoder

to generate something like:

    <cdecl: long 'dtgexc_' (long* long* long* double* long* double* long* double* long* double* long* long* long* double* long* long*)>

    65 <8B 78 00> callPrimitive: 120
    68 <10> pushTemp: 0
    69 <8F 10 00 04> closureNumCopied: 1 numArgs: 0 bytes 73 to 76
    73 <23> pushConstant: <cdecl: long 'dtgexc_' (long* long* long* double* long* double* long* double* long* double* long* long* long* double* long* long*)>
    74 <10> pushTemp: 0
    75 <E2> send: invokeWithArguments:
    76 <7D> blockReturn
    77 <8F 00 00 03> closureNumCopied: 0 numArgs: 0 bytes 81 to 83
    81 <70> self
    82 <D4> send: externalCallFailed
    83 <7C> returnTop
    84 <E1> send: ifError:
    85 <7C> returnTop

If the first 14 arguments come individually and the excess arguments come in a separate Array, then I will have to remarshal all the arguments into a larger Array. Unless primitive 120 (primitiveCalloutToFFI) learns how to unmarshal by itself? Currently primitive 120 will fail (I could eventually remove the primitive number, but I don't remember if some image-side code depends on it...).

Note that I have a kind of SmalltalkPattern that will answer the already packed list of arguments if the number of arguments is > 15 (temp 0 above):

    SmalltalkPatternForCodeGeneration>>arguments
        ^arrayArgumentVariable
            ifNil: [argumentNodes asArray]
            ifNotNil: [Array with: arrayArgumentVariable]

    SmalltalkPattern>>bindArg: name encoder: encoder
        "Accumulate another argument. Check byte code limit and work around"
        | node |
        argumentNames size = 15 ifTrue:
            [arrayArgumentVariable := encoder resetArgumentsAsArray: #singleArrayOfArguments.
            argumentNodes := encoder rebindArguments: argumentNames asArray: arrayArgumentVariable pattern: self].
        argumentNames add: name.
        arrayArgumentVariable isNil
            ifTrue: [node := encoder bindTemp: name]
            ifFalse:
                [node := self buildAccessorForArgumentRank: argumentNames size encoder: encoder.
                encoder bind: name to: node].
        node nowHasDef nowHasRef.
        argumentNodes add: node.
        ^ node

For marshalling, I already use the BraceNode, which we did not yet care to optimize...

    SmallapackMessageNode>>sizeCodeForValue: encoder
        | total argSize |
        arguments size > 15 ifFalse: [^super sizeCodeForValue: encoder].
        receiver == NodeSuper ifTrue: [selector := selector copy "only necess for splOops"].
        total := selector sizeCode: encoder args: 1 super: receiver == NodeSuper.
        receiver == nil ifFalse: [total := total + (receiver sizeCodeForValue: encoder)].
        "re-use equalNode for creating the array argument... not a very clean hack"
        equalNode := BraceNode new elements: arguments.
        argSize := equalNode sizeCodeForValue: encoder.
        total := total + argSize.
        sizes := Array with: argSize.
        ^ total

    97 <70> self
    98 <75> pushConstant: 0
    99 <E0> send: cIntegerPointerOn:
    100 <82 4F> popIntoTemp: 15
    102 <70> self
    103 <43> pushLitVar: #Array=>Array
    104 <24> pushConstant: 16
    105 <E2> send: braceStream:
    106 <88> dup
    107 <70> self
    108 <10> pushTemp: 0
    109 <E5> send: cLogicalPointerOn:
    110 <C4> send: nextPut:
    111 <87> pop
    112 <88> dup
    113 <70> self
    114 <11> pushTemp: 1
    115 <E5> send: cLogicalPointerOn:
    116 <C4> send: nextPut:
    117 <87> pop
    118 <88> dup
    119 <70> self
    120 <12> pushTemp: 2
    121 <E0> send: cIntegerPointerOn:
    122 <C4> send: nextPut:
    123 <87> pop
    124 <88> dup
    125 <13> pushTemp: 3
    126 <C4> send: nextPut:
    127 <87> pop
    128 <88> dup
    129 <70> self
    130 <14> pushTemp: 4
    131 <E0> send: cIntegerPointerOn:
    132 <C4> send: nextPut:
    133 <87> pop
    134 <88> dup
    135 <15> pushTemp: 5
    136 <C4> send: nextPut:
    137 <87> pop
    138 <88> dup
    139 <70> self
    140 <16> pushTemp: 6
    141 <E0> send: cIntegerPointerOn:
    142 <C4> send: nextPut:
    143 <87> pop
    144 <88> dup
    145 <17> pushTemp: 7
    146 <C4> send: nextPut:
    147 <87> pop
    148 <88> dup
    149 <70> self
    150 <18> pushTemp: 8
    151 <E0> send: cIntegerPointerOn:
    152 <C4> send: nextPut:
    153 <87> pop
    154 <88> dup
    155 <19> pushTemp: 9
    156 <C4> send: nextPut:
    157 <87> pop
    158 <88> dup
    159 <70> self
    160 <1A> pushTemp: 10
    161 <E0> send: cIntegerPointerOn:
    162 <C4> send: nextPut:
    163 <87> pop
    164 <88> dup
    165 <1B> pushTemp: 11
    166 <C4> send: nextPut:
    167 <87> pop
    168 <88> dup
    169 <1C> pushTemp: 12
    170 <C4> send: nextPut:
    171 <87> pop
    172 <88> dup
    173 <1D> pushTemp: 13
    174 <C4> send: nextPut:
    175 <87> pop
    176 <88> dup
    177 <70> self
    178 <1E> pushTemp: 14
    179 <E0> send: cIntegerPointerOn:
    180 <C4> send: nextPut:
    181 <87> pop
    182 <88> dup
    183 <1F> pushTemp: 15
    184 <C4> send: nextPut:
    185 <87> pop
    186 <D6> send: braceArray
    187 <E1> send: #xtgexcWithwantq:wantz:n:a:lda:b:ldb:q:ldq:z:ldz:ifst:ilst:work:lwork:info: (1 arg)
    188 <87> pop
    189 <1F> pushTemp: 15
    190 <D8> send: getHandle
    191 <76> pushConstant: 1
    192 <E7> send: signedLongAt:
    193 <7C> returnTop

For unmarshalling, I did not care too much to optimize the bytecode either, because most use cases are just wrappers to FFI:

    SmalltalkPatternForCodeGeneration>>buildAccessorForArgumentRank: index encoder: encoder
        | indexNode |
        indexNode := encoder encodeLiteral: index.
        ^ MessageForTooManyArgNode new
            receiver: arrayArgumentVariable
            selector: #at:
            arguments: (Array with: indexNode)
            precedence: 3
            from: encoder

Note that the quick hacks that I already have could be transformed to the form you suggest (14 args + excess_array) without too much effort, if you think that primitiveCalloutToFFI (120) can be rewritten to handle 14 + excess...
After the existing legacy Compiler implementation in Squeak/Pharo, I have sketched the outline of an OpalCompiler variant for compiling methods/messages with more than 15 arguments. I'm using the single-array scheme, but this could easily be changed IMO to 14 + excess... Contrary to what I wrote in the comment, the VM does NOT crash, and I can implement methods and send messages like this:

    a1: a1 a2: a2 a3: a3 a4: a4 a5: a5
    a6: a6 a7: a7 a8: a8 a9: a9 a10: a10
    a11: a11 a12: a12 a13: a13 a14: a14 a15: a15
    a16: a16 a17: a17 a18: a18 a19: a19 a20: a20
        ^(a1+a2+a3+a4+a5+a6+a7+a8+a9+a10)
            + (a11+a12+a13+a14+a15+a16+a17+a18+a19+a20)

    testSend
        ^(self
            a1: 1 a2: 2 a3: 3 a4: 4 a5: 5
            a6: 6 a7: 7 a8: 8 a9: 9 a10: 10
            a11: 11 a12: 12 a13: 13 a14: 14 a15: 15
            a16: 16 a17: 17 a18: 18 a19: 19 a20: 20) = (20*21/2)

Playground:

    TestManyArgs new testSend. "-> true"

Find it here:

It would be nice to re-integrate these tricks in Pharo 8 once stable.

On Wed, Jan 9, 2019 at 1:46 AM, Nicolas Cellier <[hidden email]> wrote: