Re: [Vm-dev] Max number of method arguments


Pharo Smalltalk Developers mailing list
Hi,

I added an issue tracker entry so we can look at it later:

https://github.com/pharo-project/pharo/issues/2213

On 9 Jan 2019, at 22:26, Nicolas Cellier <[hidden email]> wrote:

Following the existing legacy Compiler implementation in Squeak/Pharo,
I sketched the outline of an Opal Compiler variant for compiling methods/messages with more than 15 arguments.
I'm using the single-array scheme, but this could easily be changed to 14+excess IMO...

Contrary to what I wrote in the comment, the VM does NOT crash, and I can implement methods and send messages like this:

a1: a1 a2: a2 a3: a3 a4: a4 a5: a5 a6: a6 a7: a7 a8: a8 a9: a9 a10: a10
a11: a11 a12: a12 a13: a13 a14: a14 a15: a15 a16: a16 a17: a17 a18: a18 a19: a19 a20: a20
    ^(a1+a2+a3+a4+a5+a6+a7+a8+a9+a10) + (a11+a12+a13+a14+a15+a16+a17+a18+a19+a20)

testSend
    ^(self
        a1: 1 a2: 2 a3: 3 a4: 4 a5: 5 a6: 6 a7: 7 a8: 8 a9: 9 a10: 10
        a11: 11 a12: 12 a13: 13 a14: 14 a15: 15 a16: 16 a17: 17 a18: 18 a19: 19 a20: 20) = (20*21/2)

Playground: TestManyArgs new testSend. -> true

Find it here:

It would be nice to re-integrate these tricks in Pharo8 once stable.

On Wed, 9 Jan 2019 at 01:46, Nicolas Cellier <[hidden email]> wrote:


On Tue, 8 Jan 2019 at 23:31, Eliot Miranda <[hidden email]> wrote:
 
Hi Nicolas,

On Tue, Jan 8, 2019 at 1:51 PM Nicolas Cellier <[hidden email]> wrote:
 
Hi all,
particularly Clement and Eliot,

One of the most annoying limits of the bytecode is the number of arguments (<16 in V3). It is not such an annoyance for pure Smalltalk, but it certainly is for FFI (FORTRAN 77 lacks structures, so existing code bases often have functions with many arguments).
For scientific Smalltalk, some of those old FORTRAN libraries are still around nowadays (LAPACK is an example).

Agreed.  There are VW users out there with autogenerated code that requires more than 15 arguments.  Clément and I already have a design in mind, which is much more elegant than using the extra bit below.  However, it does require that we change the maximum Context stack size, which is one reason (the other being lack of time) why we haven't implemented this so far.

In 2008 my closure design introduced indirection vectors for closed-over temporaries, and among the five bytecodes added to implement it was the create array bytecode, which can do one of two things:

V3PlusClosures:
138   10001010 jkkkkkkk Push (Array new: kkkkkkk) (j = 0)
or Pop kkkkkkk elements into: (Array new: kkkkkkk) (j = 1)

SistaV1:
231 11100111 jkkkkkkk Push (Array new: kkkkkkk) (j = 0)
& Pop kkkkkkk elements into: (Array new: kkkkkkk) (j = 1)

This bytecode is used to create indirection vectors, and to create tuples of size <= 8.  e.g. { thisContext method symbolic. 2. 3. 4. 5. 6. 7. 8 }
89 <52> pushThisContext: 
90 <81> send: method
91 <80> send: symbolic
92 <E8 02> pushConstant: 2
94 <E8 03> pushConstant: 3
96 <E8 04> pushConstant: 4
98 <E8 05> pushConstant: 5
100 <E8 06> pushConstant: 6
102 <E8 07> pushConstant: 7
104 <E8 08> pushConstant: 8
106 <E7 88> pop 8 into (Array new: 8)
108 <5C> returnTop

We call this version of the bytecode the cons array bytecode.  The other form, used to create indirection vectors, is the new array bytecode.

(cf. { thisContext method symbolic. 2. 3. 4. 5. 6. 7. 8. 9 }, which produces much more code but requires only 2 elements of stack depth).

There are also three bytecodes used to access indirection vectors:

V3PlusClosures:
140   10001100 kkkkkkkk jjjjjjjj Push Temp At kkkkkkkk In Temp Vector At: jjjjjjjj
141   10001101 kkkkkkkk jjjjjjjj Store Temp At kkkkkkkk In Temp Vector At: jjjjjjjj
142   10001110 kkkkkkkk jjjjjjjj Pop and Store Temp At kkkkkkkk In Temp Vector At: jjjjjjjj
SistaV1:
251 11111011 kkkkkkkk sjjjjjjj Push Temp At kkkkkkkk In Temp Vector At: jjjjjjj, s = 1 implies remote inst var access instead of remote temp vector access 
252 11111100 kkkkkkkk sjjjjjjj Store Temp At kkkkkkkk In Temp Vector At: jjjjjjj, s = 1 implies remote inst var access instead of remote temp vector access
253 11111101 kkkkkkkk sjjjjjjj Pop and Store Temp At kkkkkkkk In Temp Vector At: jjjjjjj, s = 1 implies remote inst var access instead of remote temp vector access
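
For concreteness, here is a minimal example (my own, not from the quoted spec) of the kind of code for which the compiler emits an indirection vector today: a temporary that is assigned inside an escaping block.

makeCounter
    "count is written inside the escaping block, so the closure compiler moves it
     into an indirection vector (created with the push (Array new: 1), j = 0, form
     of the bytecode above) and accesses it with the remote-temp bytecodes just listed."
    | count |
    count := 0.
    ^[ count := count + 1 ]

Sending #value repeatedly to the returned block answers 1, 2, 3, ... because each activation reads and writes the same vector slot.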

So the insight is that if we pass arguments beyond 14 in an indirection vector we can have up to 15 + 127 = 142 arguments without needing any extra bits in a CompiledMethod header or range in a bytecode.  We simply pop arguments beyond the 14th into an indirection vector, using the cons array bytecode.  Yes, this is slow compared to "native" support, but such methods are extremely rare, and supporting them this way means we have less waste elsewhere.  It will require some sophistication in the Decompiler, but otherwise seems quite simple.  

With this design, as far as the VM is concerned the maximum argument count is still 15.  Only the image needs to bother with how to record the argument count for a method that has 15 or more arguments, and indeed a method with 15 arguments can still use all 15 arguments without having to create an indirection vector.  This isolates the effects to the compiler (arguments beyond the 14th in methods with more than 15 arguments must be accessed using the indirection vector bytecodes above), but otherwise the changes are quite localized: indirection vector creation occurs immediately after normal argument marshaling and immediately before the send bytecode.
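
To make the call-site shape concrete, here is a sketch of how a 17-argument send could be compiled under this scheme (my own reading of the design above, not an actual implementation; mnemonics only, reusing the SistaV1 cons array bytecode exactly as in the tuple example):

push receiver
pushTemp: a1 ... pushTemp: a14          "first 14 arguments marshaled as usual"
pushTemp: a15
pushTemp: a16
pushTemp: a17
<E7 83> pop 3 into (Array new: 3)       "excess arguments become the indirection vector, the 15th stack entry"
send: #a1:a2:...:a17:                   "numArgs = 15 as far as the VM is concerned"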

Does this design appeal to you?  If it does, then we should discuss when and how it should be implemented.  One thing would be to make the maximum size of a Context, defined at the image level by CompiledCode's LargeFrame class variable, but hard coded into the VM, some kind of VM parameter, e.g. stored in the image header and read at start-up.  It would be quite easy to add this.  If we did so we should also ensure the stack page size calculation allows for a stack page big enough for one or two huge frames.  Note that the design also means that a large stack is needed only to *marshal* arguments, not to activate a method with many arguments, since the excess arguments are stored in an indirection vector.  
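
For reference on the frame-size constants mentioned above (my own note; values as found in current Squeak/Pharo images, so please double-check in yours):

CompiledCode classPool at: #SmallFrame.   "16 stack slots"
CompiledCode classPool at: #LargeFrame.   "56 stack slots, currently mirrored by a hard-coded VM constant"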

P.S. Indeed we could use the scheme used for arbitrary-sized tuples to marshal extra arguments, but this would affect code generation much more.  Different code would have to be used to marshal each argument beyond 15; whereas using the cons array bytecode


Didn't we discuss that already?
The fact that such calls are not super fast is OK for me.
But I prefer a single Array because I will use the invokeWithArguments: FFI primitive.

transformFFICallwithManyArguments: aPattern
    "Transform current parseNode into an invokeWithArguments: send suitable for the case of many arguments"
   
    ^BlockNode new
        temporaries: #() ; "Just to avoid later problems with BlockAnalyzer"
        arguments: #()
        statements: (OrderedCollection with: (MessageNode new
                    receiver: (BlockNode new
                            temporaries: #() ; "Just to avoid later problems with BlockAnalyzer"
                            arguments: #()
                            statements: (OrderedCollection
                                    with: (MessageNode new
                                            receiver: (encoder encodeLiteral: encoder literals first)
                                            selector: #invokeWithArguments:
                                            arguments: aPattern arguments
                                            precedence: 3
                                            from: encoder))
                            returns: false
                            from: encoder)
                    selector: #ifError:
                    arguments: (Array with: (parseNode temporaries: parseNode temporaries; yourself))
                    precedence: 3
                    from: encoder) asReturnNode)
        returns: true
        from: encoder

to generate something like:

<cdecl: long 'dtgexc_' (long* long* long* double* long* double* long* double* long* double* long* long* long* double* long* long*)>
65 <8B 78 00> callPrimitive: 120
68 <10> pushTemp: 0
69 <8F 10 00 04> closureNumCopied: 1 numArgs: 0 bytes 73 to 76
73     <23> pushConstant: <cdecl: long 'dtgexc_' (long* long* long* double* long* double* long* double* long* double* long* long* long* double* long* long*)>
74     <10> pushTemp: 0
75     <E2> send: invokeWithArguments:
76     <7D> blockReturn
77 <8F 00 00 03> closureNumCopied: 0 numArgs: 0 bytes 81 to 83
81     <70> self
82     <D4> send: externalCallFailed
83     <7C> returnTop
84 <E1> send: ifError:
85 <7C> returnTop
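
Read back as source (my paraphrase of the bytecode above, with hypothetical names: the <cdecl: ...> literal and the packed argument array held in temp 0), the fallback code after callPrimitive: 120 is roughly:

^[ externalFunctionLiteral invokeWithArguments: packedArguments ]
    ifError: [ self externalCallFailed ]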

If the first 14 arguments come individually and the excess arguments come in a separate Array, then I will have to re-marshal all the arguments into a single larger Array.
Unless primitive 120 (primitiveCalloutToFFI) learns how to unmarshal by itself?
Currently primitive 120 will fail (I could eventually remove the primitive number, but I don't remember whether some image-side code depends on it...)
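
For reference, image-side re-marshalling under a 14+excess split would look roughly like this (a sketch with hypothetical names, needed only if primitive 120 is not taught to unmarshal the excess Array itself):

| allArgs |
allArgs := Array new: 14 + excessArguments size.
allArgs replaceFrom: 1 to: 14 with: firstFourteenArguments startingAt: 1.       "the individually passed arguments"
allArgs replaceFrom: 15 to: allArgs size with: excessArguments startingAt: 1.   "the packed excess arguments"
^externalFunction invokeWithArguments: allArgs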

Note that I have a kind of SmalltalkPattern that will answer the already-packed list of arguments when the number of arguments is > 15 (temp 0 above):
SmalltalkPatternForCodeGeneration>>arguments
    ^arrayArgumentVariable
            ifNil: [argumentNodes asArray]
            ifNotNil: [Array with: arrayArgumentVariable].

SmalltalkPattern>>bindArg: name encoder: encoder
    "Accumulate another argument.
    Check byte code limit and work around"
    | node |
    argumentNames size = 15
        ifTrue: [arrayArgumentVariable := encoder resetArgumentsAsArray: #singleArrayOfArguments.
            argumentNodes := encoder
                        rebindArguments: argumentNames
                        asArray: arrayArgumentVariable
                        pattern: self].
    argumentNames add: name.
    arrayArgumentVariable isNil
        ifTrue: [node := encoder bindTemp: name]
        ifFalse: [node := self buildAccessorForArgumentRank: argumentNames size encoder: encoder.
            encoder bind: name to: node].
    node nowHasDef nowHasRef.
    argumentNodes add: node.
    ^ node

For marshalling, I already use the BraceNode, which we have not yet bothered to optimize...
SmallapackMessageNode>>sizeCodeForValue: encoder
    | total argSize |
    arguments size > 15 ifFalse: [^super sizeCodeForValue: encoder].
    receiver == NodeSuper
        ifTrue: [selector := selector copy "only necess for splOops"].
    total := selector
                sizeCode: encoder
                args: 1
                super: receiver == NodeSuper.
    receiver == nil
        ifFalse: [total := total + (receiver sizeCodeForValue: encoder)].
    "re-use equalNode for creating the array argument... not a very clean hack"
    equalNode := BraceNode new elements: arguments.
    argSize := equalNode sizeCodeForValue: encoder.
    total := total + argSize.
    sizes := Array with: argSize.
    ^ total

97 <70> self
98 <75> pushConstant: 0
99 <E0> send: cIntegerPointerOn:
100 <82 4F> popIntoTemp: 15
102 <70> self
103 <43> pushLitVar: #Array=>Array
104 <24> pushConstant: 16
105 <E2> send: braceStream:
106 <88> dup
107 <70> self
108 <10> pushTemp: 0
109 <E5> send: cLogicalPointerOn:
110 <C4> send: nextPut:
111 <87> pop
112 <88> dup
113 <70> self
114 <11> pushTemp: 1
115 <E5> send: cLogicalPointerOn:
116 <C4> send: nextPut:
117 <87> pop
118 <88> dup
119 <70> self
120 <12> pushTemp: 2
121 <E0> send: cIntegerPointerOn:
122 <C4> send: nextPut:
123 <87> pop
124 <88> dup
125 <13> pushTemp: 3
126 <C4> send: nextPut:
127 <87> pop
128 <88> dup
129 <70> self
130 <14> pushTemp: 4
131 <E0> send: cIntegerPointerOn:
132 <C4> send: nextPut:
133 <87> pop
134 <88> dup
135 <15> pushTemp: 5
136 <C4> send: nextPut:
137 <87> pop
138 <88> dup
139 <70> self
140 <16> pushTemp: 6
141 <E0> send: cIntegerPointerOn:
142 <C4> send: nextPut:
143 <87> pop
144 <88> dup
145 <17> pushTemp: 7
146 <C4> send: nextPut:
147 <87> pop
148 <88> dup
149 <70> self
150 <18> pushTemp: 8
151 <E0> send: cIntegerPointerOn:
152 <C4> send: nextPut:
153 <87> pop
154 <88> dup
155 <19> pushTemp: 9
156 <C4> send: nextPut:
157 <87> pop
158 <88> dup
159 <70> self
160 <1A> pushTemp: 10
161 <E0> send: cIntegerPointerOn:
162 <C4> send: nextPut:
163 <87> pop
164 <88> dup
165 <1B> pushTemp: 11
166 <C4> send: nextPut:
167 <87> pop
168 <88> dup
169 <1C> pushTemp: 12
170 <C4> send: nextPut:
171 <87> pop
172 <88> dup
173 <1D> pushTemp: 13
174 <C4> send: nextPut:
175 <87> pop
176 <88> dup
177 <70> self
178 <1E> pushTemp: 14
179 <E0> send: cIntegerPointerOn:
180 <C4> send: nextPut:
181 <87> pop
182 <88> dup
183 <1F> pushTemp: 15
184 <C4> send: nextPut:
185 <87> pop
186 <D6> send: braceArray
187 <E1> send: #xtgexcWithwantq:wantz:n:a:lda:b:ldb:q:ldq:z:ldz:ifst:ilst:work:lwork:info: (1 arg)
188 <87> pop
189 <1F> pushTemp: 15
190 <D8> send: getHandle
191 <76> pushConstant: 1
192 <E7> send: signedLongAt:
193 <7C> returnTop

For unmarshalling, I did not care too much to optimize the bytecode either, because most use cases are just wrappers to FFI:
SmalltalkPatternForCodeGeneration>>buildAccessorForArgumentRank: index encoder: encoder
    | indexNode |
    indexNode := encoder encodeLiteral: index.
    ^ MessageForTooManyArgNode new
        receiver: arrayArgumentVariable
        selector: #at:
        arguments: (Array with: indexNode)
        precedence: 3
        from: encoder
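
So, with the accessor above, a reference in the method body to (say) the 17th argument of a >15-argument method compiles as if the source read (the variable name is the one generated by the pattern earlier):

singleArrayOfArguments at: 17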

Note that the quick hacks that I already have could be transformed to the form you suggest (14 args + excess array) without too much effort, if you think that primitiveCalloutToFFI (120) can be rewritten to handle 14+excess...

I patched the old Squeak compiler in Smallapack to work around this limitation (it was easy enough to pass a single Array and invoke FFI with many args).
In the modern Pharo flavour, this is more involved with the new OpalCompiler (it does not seem to be designed for extensibility, as it seems necessary to patch many pieces/subclasses for a single feature change...).

But we now have the Sista V1 bytecodes, which removed a lot of limitations (number of inst vars, number of literals, max jump offset...). Alas, I don't see a modified limit for the number of arguments (source: https://hal.inria.fr/hal-01088801/document, "a bytecode set for adaptive optimization"): the compiled method header still reserves only 4 bits for the argument count, as documented in the link above.
Though there is an adjacent unused bit now...
In Squeak/Pharo, EncoderForSistaV1>>genSend:numArgs: suggests that the limit is 31 (sic):

    (nArgs < 0 or: [nArgs > 31]) ifTrue:
        [^self outOfRangeError: 'numArgs' index: nArgs range: 0 to: 31 "!!"].

or at least 2047 if we believe the code below:

    "234        11101010    i i i i i j j j    Send Literal Selector #iiiii (+ Extend A * 32) with jjj (+ Extend B * 8) Arguments"


Pharo also limits numArgs to 15, whatever the encoding, in CompiledMethod>>newBytes:trailerBytes:nArgs:nTemps:nStack:nLits:primitive:

But Squeak does not limit nArgs at all in
EncoderForSistaV1>>computeMethodHeaderForNumArgs:numTemps:numLits:primitive:

So my questions:
- is that doc up-to-date?
- if so, couldn't we expand the limit to 31 args by using the unused bit?

Note: there is another unused bit in V3 (not adjacent), and the double extended (send) bytecode has room for 31 args in V3 too, since only the first 3 bits of the second byte encode the type of operation...
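
As a quick check on that 31 (my arithmetic): with 3 bits of the second byte taken by the operation type, 5 bits are left for the argument count, so

(2 raisedTo: 5) - 1.   "31, the largest argument count encodable in the remaining 5 bits"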


--
_,,,^..^,,,_
best, Eliot