Hi All, I'm in the progress of rewriting the primitives of MiscPrimitivePlugin (mostly done, I'm stuck with the tests). My main goal was to add the necessary checks to make these primitives safer. But while I was doing that, I decided to optimize the code a bit by getting rid of a few repeated function calls, notably stackValue() calls, because those are pretty much the only ones that are called more than once with the same argument from these primitives. This kind of rewrite resulted in small, but measurable speedup (~5-15%), despite of the new checks being in place. Unfortunately, the generated assembly code for almost all interpeterProxy methods will be a function call (callq). IMHO the best solution would be if these calls were just macros, so that the compiler could optimize them away, but I don't know how to achieve that. So, I decided to check if the C compiler could do that for us,m and yes, gcc starting from 4.5 supports link time optimization (lto)[1]. The results are promising, performance is better, but it still won't optimize everything it could[2]. I wrote a small benchmark to see how lto and my rewrite affects the overhead of primitive calls: | string collation | string := ''. collation := (0 to: 255) asByteArray. [ 1 to: 10000000 do: [ :i | ByteString compare: string with: string collated: collation ] ] timeToRun. The results on my machine were; original: 16 function calls - 795 ms rewrite-no lto: 13 function calls - 762 ms rewrite-with lto: 10 function calls - 674 ms So, for now, I suggest we try to add -flto to CFLAGS and LDFLAGS when compiling with gcc 4.5+ to see how stable it is. Also, I would be happy if someone could point me to the direction to generate macros and use them instead of the function calls for interpeterProxy functions. Levente [1] https://gcc.gnu.org/wiki/LinkTimeOptimization [2] Generated assembly code for comparison Without lto: 00000000004ed310 <primitiveCompareString>: 4ed310: 41 55 push %r13 4ed312: 31 ff xor %edi,%edi 4ed314: 41 54 push %r12 4ed316: 55 push %rbp 4ed317: 53 push %rbx 4ed318: 48 83 ec 08 sub $0x8,%rsp 4ed31c: e8 2f 19 f8 ff callq 46ec50 <stackValue> 4ed321: bf 01 00 00 00 mov $0x1,%edi 4ed326: 48 89 c3 mov %rax,%rbx 4ed329: e8 22 19 f8 ff callq 46ec50 <stackValue> 4ed32e: bf 02 00 00 00 mov $0x2,%edi 4ed333: 48 89 c5 mov %rax,%rbp 4ed336: e8 15 19 f8 ff callq 46ec50 <stackValue> 4ed33b: 48 89 df mov %rbx,%rdi 4ed33e: 49 89 c4 mov %rax,%r12 4ed341: e8 4a d4 f3 ff callq 42a790 <isBytes> 4ed346: 48 85 c0 test %rax,%rax 4ed349: 74 0d je 4ed358 <primitiveCompareString+0x48> ... With lto: 00000000004fe4f0 <primitiveCompareString.35928>: 4fe4f0: 41 55 push %r13 4fe4f2: 41 54 push %r12 4fe4f4: 55 push %rbp 4fe4f5: 53 push %rbx 4fe4f6: 48 83 ec 08 sub $0x8,%rsp 4fe4fa: 48 8b 05 77 6b 32 00 mov 0x326b77(%rip),%rax # 825078 <stackPointer.7294> 4fe501: 48 8b 18 mov (%rax),%rbx 4fe504: 48 8b 68 08 mov 0x8(%rax),%rbp 4fe508: 4c 8b 60 10 mov 0x10(%rax),%r12 4fe50c: 48 89 df mov %rbx,%rdi 4fe50f: e8 bc fe fb ff callq 4be3d0 <isBytes> 4fe514: 48 85 c0 test %rax,%rax 4fe517: 74 0d je 4fe526 <primitiveCompareString.35928+0x36> ... So, gcc could optimize away the three stackvalue calls with 4 mov instructions, but failed to inline isBytes() and the rest of the functions. |
Hi Levente,
On Wed, Apr 18, 2018 at 2:54 PM, Levente Uzonyi <[hidden email]> wrote:
This is cool. Unfortunately, the generated assembly code for almost all interpeterProxy methods will be a function call (callq). Look at VMPluginCodeGenerator>>#generateInterpreterProxyFunctionDereference:on:indent:. This has to generate code to - declare interpreterProxy functions imported from the VM (for builtin plugins) - declare function pointers for - assign the function pointers from the interpreterProxy in the plugin's setInterpreter routine. So you would have to add a third option for builtin plugins which declared those functions as macros, or generated them as local static inlined functions. So, I decided to check if the C compiler could do that for us, and yes, gcc starting from 4.5 supports link time optimization (lto)[1]. The easiest thing would be to generate them as local inlined static functions. But I would only bother to do this for performance-critical primitives, for example by having plugin classes (such as LargeIntegersPlugin and MiscPrimitivePlugin) mark themselves as performance4-critical via a class-side method.
V nice. Also look at CogMethodConstants bindingOf: #PrimCallOnSmalltalkStack. The idea here is that if a primitive does not have a deep call chain (only contains local loops and simple argument checking, does not call allocation or GC, etc) then - the interpreter can implement it as a simple C function that takes its arguments as parameters of the function and returns the result object, or 0 on failure (0 not being an up), and - the JIT can call it directly from machine code, avoiding the slow stack switch Currently we don't use this at all. We used this only for hash multiply (see mcprimHashMultiply:), but superseded it with a JIT primitive. However, it could be used for several MiscPrimitivePlugin primitives. And of course, this is orthogonal to the improvements you're achieving through inlining argument marshaling. _,,,^..^,,,_ best, Eliot |
Free forum by Nabble | Edit this page |