Primitives, interpreterProxy, function calls, lto

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Primitives, interpreterProxy, function calls, lto

Levente Uzonyi
 
Hi All,

I'm in the progress of rewriting the primitives of MiscPrimitivePlugin
(mostly done, I'm stuck with the tests).
My main goal was to add the necessary checks to make these primitives
safer.
But while I was doing that, I decided to optimize the code a bit by
getting rid of a few repeated function calls, notably stackValue() calls,
because those are pretty much the only ones that are called more than
once with the same argument from these primitives.

This kind of rewrite resulted in small, but measurable speedup (~5-15%),
despite of the new checks being in place.

Unfortunately, the generated assembly code for almost all interpeterProxy
methods will be a function call (callq).
IMHO the best solution would be if these calls were just macros, so that
the compiler could optimize them away, but I don't know how to achieve
that.
So, I decided to check if the C compiler could do that for us,m and yes,
gcc starting from 4.5 supports link time optimization (lto)[1].
The results are promising, performance is better, but it still won't
optimize everything it could[2].

I wrote a small benchmark to see how lto and my rewrite affects the
overhead of primitive calls:

| string collation |
string := ''.
collation := (0 to: 255) asByteArray.
[ 1 to: 10000000 do: [ :i |
  ByteString compare: string with: string collated: collation ] ] timeToRun.

The results on my machine were;
original: 16 function calls - 795 ms
rewrite-no lto: 13 function calls - 762 ms
rewrite-with lto: 10 function calls - 674 ms

So, for now, I suggest we try to add -flto to CFLAGS and LDFLAGS when
compiling with gcc 4.5+ to see how stable it is.
Also, I would be happy if someone could point me to the direction to
generate macros and use them instead of the function calls for
interpeterProxy functions.

Levente

[1] https://gcc.gnu.org/wiki/LinkTimeOptimization
[2] Generated assembly code for comparison

Without lto:
00000000004ed310 <primitiveCompareString>:
   4ed310:       41 55                   push   %r13
   4ed312:       31 ff                   xor    %edi,%edi
   4ed314:       41 54                   push   %r12
   4ed316:       55                      push   %rbp
   4ed317:       53                      push   %rbx
   4ed318:       48 83 ec 08             sub    $0x8,%rsp
   4ed31c:       e8 2f 19 f8 ff          callq  46ec50 <stackValue>
   4ed321:       bf 01 00 00 00          mov    $0x1,%edi
   4ed326:       48 89 c3                mov    %rax,%rbx
   4ed329:       e8 22 19 f8 ff          callq  46ec50 <stackValue>
   4ed32e:       bf 02 00 00 00          mov    $0x2,%edi
   4ed333:       48 89 c5                mov    %rax,%rbp
   4ed336:       e8 15 19 f8 ff          callq  46ec50 <stackValue>
   4ed33b:       48 89 df                mov    %rbx,%rdi
   4ed33e:       49 89 c4                mov    %rax,%r12
   4ed341:       e8 4a d4 f3 ff          callq  42a790 <isBytes>
   4ed346:       48 85 c0                test   %rax,%rax
   4ed349:       74 0d                   je     4ed358 <primitiveCompareString+0x48>
...

With lto:
00000000004fe4f0 <primitiveCompareString.35928>:
   4fe4f0:       41 55                   push   %r13
   4fe4f2:       41 54                   push   %r12
   4fe4f4:       55                      push   %rbp
   4fe4f5:       53                      push   %rbx
   4fe4f6:       48 83 ec 08             sub    $0x8,%rsp
   4fe4fa:       48 8b 05 77 6b 32 00    mov    0x326b77(%rip),%rax        # 825078 <stackPointer.7294>
   4fe501:       48 8b 18                mov    (%rax),%rbx
   4fe504:       48 8b 68 08             mov    0x8(%rax),%rbp
   4fe508:       4c 8b 60 10             mov    0x10(%rax),%r12
   4fe50c:       48 89 df                mov    %rbx,%rdi
   4fe50f:       e8 bc fe fb ff          callq  4be3d0 <isBytes>
   4fe514:       48 85 c0                test   %rax,%rax
   4fe517:       74 0d                   je     4fe526 <primitiveCompareString.35928+0x36>
...

So, gcc could optimize away the three stackvalue calls with 4 mov
instructions, but failed to inline isBytes() and the rest of the
functions.
Reply | Threaded
Open this post in threaded view
|

Re: Primitives, interpreterProxy, function calls, lto

Eliot Miranda-2
 
Hi Levente,

On Wed, Apr 18, 2018 at 2:54 PM, Levente Uzonyi <[hidden email]> wrote:

Hi All,

I'm in the progress of rewriting the primitives of MiscPrimitivePlugin (mostly done, I'm stuck with the tests). My main goal was to add the necessary checks to make these primitives safer.
But while I was doing that, I decided to optimize the code a bit by getting rid of a few repeated function calls, notably stackValue() calls, because those are pretty much the only ones that are called more than once with the same argument from these primitives.

This kind of rewrite resulted in small, but measurable speedup (~5-15%), despite of the new checks being in place.

This is cool.
 
Unfortunately, the generated assembly code for almost all interpeterProxy methods will be a function call (callq).
IMHO the best solution would be if these calls were just macros, so that the compiler could optimize them away, but I don't know how to achieve that.

Look at VMPluginCodeGenerator>>#generateInterpreterProxyFunctionDereference:on:indent:.  This has to generate code to
- declare interpreterProxy functions imported from the VM (for builtin plugins)
- declare function pointers for 
- assign the function pointers from the interpreterProxy in the plugin's setInterpreter routine.

So you would have to add a third option for builtin plugins which declared those functions as macros, or generated them as local static inlined functions.
 
So, I decided to check if the C compiler could do that for us, and yes, gcc starting from 4.5 supports link time optimization (lto)[1].
The results are promising, performance is better, but it still won't optimize everything it could[2].

I wrote a small benchmark to see how lto and my rewrite affects the overhead of primitive calls:

| string collation |
string := ''.
collation := (0 to: 255) asByteArray.
[ 1 to: 10000000 do: [ :i |
        ByteString compare: string with: string collated: collation ] ] timeToRun.

The results on my machine were;
original: 16 function calls - 795 ms
rewrite-no lto: 13 function calls - 762 ms
rewrite-with lto: 10 function calls - 674 ms

So, for now, I suggest we try to add -flto to CFLAGS and LDFLAGS when compiling with gcc 4.5+ to see how stable it is.
Also, I would be happy if someone could point me to the direction to generate macros and use them instead of the function calls for interpeterProxy functions.

The easiest thing would be to generate them as local inlined static functions.  But I would only bother to do this for performance-critical primitives, for example by having plugin classes (such as LargeIntegersPlugin and MiscPrimitivePlugin) mark themselves as performance4-critical via a class-side method.


Levente

[1] https://gcc.gnu.org/wiki/LinkTimeOptimization
[2] Generated assembly code for comparison

Without lto:
00000000004ed310 <primitiveCompareString>:
  4ed310:       41 55                   push   %r13
  4ed312:       31 ff                   xor    %edi,%edi
  4ed314:       41 54                   push   %r12
  4ed316:       55                      push   %rbp
  4ed317:       53                      push   %rbx
  4ed318:       48 83 ec 08             sub    $0x8,%rsp
  4ed31c:       e8 2f 19 f8 ff          callq  46ec50 <stackValue>
  4ed321:       bf 01 00 00 00          mov    $0x1,%edi
  4ed326:       48 89 c3                mov    %rax,%rbx
  4ed329:       e8 22 19 f8 ff          callq  46ec50 <stackValue>
  4ed32e:       bf 02 00 00 00          mov    $0x2,%edi
  4ed333:       48 89 c5                mov    %rax,%rbp
  4ed336:       e8 15 19 f8 ff          callq  46ec50 <stackValue>
  4ed33b:       48 89 df                mov    %rbx,%rdi
  4ed33e:       49 89 c4                mov    %rax,%r12
  4ed341:       e8 4a d4 f3 ff          callq  42a790 <isBytes>
  4ed346:       48 85 c0                test   %rax,%rax
  4ed349:       74 0d                   je     4ed358 <primitiveCompareString+0x48>
...

With lto:
00000000004fe4f0 <primitiveCompareString.35928>:
  4fe4f0:       41 55                   push   %r13
  4fe4f2:       41 54                   push   %r12
  4fe4f4:       55                      push   %rbp
  4fe4f5:       53                      push   %rbx
  4fe4f6:       48 83 ec 08             sub    $0x8,%rsp
  4fe4fa:       48 8b 05 77 6b 32 00    mov    0x326b77(%rip),%rax        # 825078 <stackPointer.7294>
  4fe501:       48 8b 18                mov    (%rax),%rbx
  4fe504:       48 8b 68 08             mov    0x8(%rax),%rbp
  4fe508:       4c 8b 60 10             mov    0x10(%rax),%r12
  4fe50c:       48 89 df                mov    %rbx,%rdi
  4fe50f:       e8 bc fe fb ff          callq  4be3d0 <isBytes>
  4fe514:       48 85 c0                test   %rax,%rax
  4fe517:       74 0d                   je     4fe526 <primitiveCompareString.35928+0x36>
...

So, gcc could optimize away the three stackvalue calls with 4 mov instructions, but failed to inline isBytes() and the rest of the functions.


V nice.

Also look at CogMethodConstants bindingOf: #PrimCallOnSmalltalkStack.  The idea here is that if a primitive does not have a deep call chain (only contains local loops and simple argument checking, does not call allocation or GC, etc) then
- the interpreter can implement it as a simple C function that takes its arguments as parameters of the function and returns the result object, or 0 on failure (0 not being an up), and
- the JIT can call it directly from machine code, avoiding the slow stack switch

Currently we don't use this at all.  We used this only for hash multiply (see mcprimHashMultiply:), but superseded it with a JIT primitive.  However, it could be used for several MiscPrimitivePlugin primitives.  And of course, this is orthogonal to the improvements you're achieving through inlining argument marshaling.

_,,,^..^,,,_
best, Eliot