Smalltalk › Squeak › Squeak VM

Primitives, interpreterProxy, function calls, lto

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

2 messages Options

Levente Uzonyi

Primitives, interpreterProxy, function calls, lto

Hi All,

I'm in the progress of rewriting the primitives of MiscPrimitivePlugin
(mostly done, I'm stuck with the tests).
My main goal was to add the necessary checks to make these primitives
safer.
But while I was doing that, I decided to optimize the code a bit by
getting rid of a few repeated function calls, notably stackValue() calls,
because those are pretty much the only ones that are called more than
once with the same argument from these primitives.

This kind of rewrite resulted in small, but measurable speedup (~5-15%),
despite of the new checks being in place.

Unfortunately, the generated assembly code for almost all interpeterProxy
methods will be a function call (callq).
IMHO the best solution would be if these calls were just macros, so that
the compiler could optimize them away, but I don't know how to achieve
that.
So, I decided to check if the C compiler could do that for us,m and yes,
gcc starting from 4.5 supports link time optimization (lto)[1].
The results are promising, performance is better, but it still won't
optimize everything it could[2].

I wrote a small benchmark to see how lto and my rewrite affects the
overhead of primitive calls:

| string collation |
string := ''.
collation := (0 to: 255) asByteArray.
[ 1 to: 10000000 do: [ :i |
ByteString compare: string with: string collated: collation ] ] timeToRun.

The results on my machine were;
original: 16 function calls - 795 ms
rewrite-no lto: 13 function calls - 762 ms
rewrite-with lto: 10 function calls - 674 ms

So, for now, I suggest we try to add -flto to CFLAGS and LDFLAGS when
compiling with gcc 4.5+ to see how stable it is.
Also, I would be happy if someone could point me to the direction to
generate macros and use them instead of the function calls for
interpeterProxy functions.

Levente

[1] https://gcc.gnu.org/wiki/LinkTimeOptimization
[2] Generated assembly code for comparison

Without lto:
00000000004ed310 <primitiveCompareString>:
4ed310: 41 55 push %r13
4ed312: 31 ff xor %edi,%edi
4ed314: 41 54 push %r12
4ed316: 55 push %rbp
4ed317: 53 push %rbx
4ed318: 48 83 ec 08 sub $0x8,%rsp
4ed31c: e8 2f 19 f8 ff callq 46ec50 <stackValue>
4ed321: bf 01 00 00 00 mov $0x1,%edi
4ed326: 48 89 c3 mov %rax,%rbx
4ed329: e8 22 19 f8 ff callq 46ec50 <stackValue>
4ed32e: bf 02 00 00 00 mov $0x2,%edi
4ed333: 48 89 c5 mov %rax,%rbp
4ed336: e8 15 19 f8 ff callq 46ec50 <stackValue>
4ed33b: 48 89 df mov %rbx,%rdi
4ed33e: 49 89 c4 mov %rax,%r12
4ed341: e8 4a d4 f3 ff callq 42a790 <isBytes>
4ed346: 48 85 c0 test %rax,%rax
4ed349: 74 0d je 4ed358 <primitiveCompareString+0x48>
...

With lto:
00000000004fe4f0 <primitiveCompareString.35928>:
4fe4f0: 41 55 push %r13
4fe4f2: 41 54 push %r12
4fe4f4: 55 push %rbp
4fe4f5: 53 push %rbx
4fe4f6: 48 83 ec 08 sub $0x8,%rsp
4fe4fa: 48 8b 05 77 6b 32 00 mov 0x326b77(%rip),%rax # 825078 <stackPointer.7294>
4fe501: 48 8b 18 mov (%rax),%rbx
4fe504: 48 8b 68 08 mov 0x8(%rax),%rbp
4fe508: 4c 8b 60 10 mov 0x10(%rax),%r12
4fe50c: 48 89 df mov %rbx,%rdi
4fe50f: e8 bc fe fb ff callq 4be3d0 <isBytes>
4fe514: 48 85 c0 test %rax,%rax
4fe517: 74 0d je 4fe526 <primitiveCompareString.35928+0x36>
...

So, gcc could optimize away the three stackvalue calls with 4 mov
instructions, but failed to inline isBytes() and the rest of the
functions.

Eliot Miranda-2

Re: Primitives, interpreterProxy, function calls, lto

Hi Levente,

On Wed, Apr 18, 2018 at 2:54 PM, Levente Uzonyi <[hidden email]> wrote:

Hi All,

I'm in the progress of rewriting the primitives of MiscPrimitivePlugin (mostly done, I'm stuck with the tests). My main goal was to add the necessary checks to make these primitives safer.
But while I was doing that, I decided to optimize the code a bit by getting rid of a few repeated function calls, notably stackValue() calls, because those are pretty much the only ones that are called more than once with the same argument from these primitives.

This kind of rewrite resulted in small, but measurable speedup (~5-15%), despite of the new checks being in place.

This is cool.

Unfortunately, the generated assembly code for almost all interpeterProxy methods will be a function call (callq).
IMHO the best solution would be if these calls were just macros, so that the compiler could optimize them away, but I don't know how to achieve that.

Look at VMPluginCodeGenerator>>#generateInterpreterProxyFunctionDereference:on:indent:. This has to generate code to

- declare interpreterProxy functions imported from the VM (for builtin plugins)

- declare function pointers for

- assign the function pointers from the interpreterProxy in the plugin's setInterpreter routine.

So you would have to add a third option for builtin plugins which declared those functions as macros, or generated them as local static inlined functions.

So, I decided to check if the C compiler could do that for us, and yes, gcc starting from 4.5 supports link time optimization (lto)[1].
The results are promising, performance is better, but it still won't optimize everything it could[2].

I wrote a small benchmark to see how lto and my rewrite affects the overhead of primitive calls:

| string collation |
string := ''.
collation := (0 to: 255) asByteArray.
[ 1 to: 10000000 do: [ :i |
ByteString compare: string with: string collated: collation ] ] timeToRun.

The results on my machine were;
original: 16 function calls - 795 ms
rewrite-no lto: 13 function calls - 762 ms
rewrite-with lto: 10 function calls - 674 ms

So, for now, I suggest we try to add -flto to CFLAGS and LDFLAGS when compiling with gcc 4.5+ to see how stable it is.
Also, I would be happy if someone could point me to the direction to generate macros and use them instead of the function calls for interpeterProxy functions.

The easiest thing would be to generate them as local inlined static functions. But I would only bother to do this for performance-critical primitives, for example by having plugin classes (such as LargeIntegersPlugin and MiscPrimitivePlugin) mark themselves as performance4-critical via a class-side method.

Levente

[1] https://gcc.gnu.org/wiki/LinkTimeOptimization
[2] Generated assembly code for comparison

Without lto:
00000000004ed310 <primitiveCompareString>:
4ed310: 41 55 push %r13
4ed312: 31 ff xor %edi,%edi
4ed314: 41 54 push %r12
4ed316: 55 push %rbp
4ed317: 53 push %rbx
4ed318: 48 83 ec 08 sub $0x8,%rsp
4ed31c: e8 2f 19 f8 ff callq 46ec50 <stackValue>
4ed321: bf 01 00 00 00 mov $0x1,%edi
4ed326: 48 89 c3 mov %rax,%rbx
4ed329: e8 22 19 f8 ff callq 46ec50 <stackValue>
4ed32e: bf 02 00 00 00 mov $0x2,%edi
4ed333: 48 89 c5 mov %rax,%rbp
4ed336: e8 15 19 f8 ff callq 46ec50 <stackValue>
4ed33b: 48 89 df mov %rbx,%rdi
4ed33e: 49 89 c4 mov %rax,%r12
4ed341: e8 4a d4 f3 ff callq 42a790 <isBytes>
4ed346: 48 85 c0 test %rax,%rax
4ed349: 74 0d je 4ed358 <primitiveCompareString+0x48>
...

With lto:
00000000004fe4f0 <primitiveCompareString.35928>:
4fe4f0: 41 55 push %r13
4fe4f2: 41 54 push %r12
4fe4f4: 55 push %rbp
4fe4f5: 53 push %rbx
4fe4f6: 48 83 ec 08 sub $0x8,%rsp
4fe4fa: 48 8b 05 77 6b 32 00 mov 0x326b77(%rip),%rax # 825078 <stackPointer.7294>
4fe501: 48 8b 18 mov (%rax),%rbx
4fe504: 48 8b 68 08 mov 0x8(%rax),%rbp
4fe508: 4c 8b 60 10 mov 0x10(%rax),%r12
4fe50c: 48 89 df mov %rbx,%rdi
4fe50f: e8 bc fe fb ff callq 4be3d0 <isBytes>
4fe514: 48 85 c0 test %rax,%rax
4fe517: 74 0d je 4fe526 <primitiveCompareString.35928+0x36>
...

So, gcc could optimize away the three stackvalue calls with 4 mov instructions, but failed to inline isBytes() and the rest of the functions.

V nice.

Also look at CogMethodConstants bindingOf: #PrimCallOnSmalltalkStack. The idea here is that if a primitive does not have a deep call chain (only contains local loops and simple argument checking, does not call allocation or GC, etc) then

- the interpreter can implement it as a simple C function that takes its arguments as parameters of the function and returns the result object, or 0 on failure (0 not being an up), and

- the JIT can call it directly from machine code, avoiding the slow stack switch

Currently we don't use this at all. We used this only for hash multiply (see mcprimHashMultiply:), but superseded it with a JIT primitive. However, it could be used for several MiscPrimitivePlugin primitives. And of course, this is orthogonal to the improvements you're achieving through inlining argument marshaling.

_,,,^..^,,,_

best, Eliot