Smalltalk - Re: RoarVM: The Manycore SqueakVM

Posted by Igor Stasenko on Nov 06, 2010; 5:49pm
URL: https://forum.world.st/RoarVM-The-Manycore-SqueakVM-tp3025321p3030209.html

2010/11/6 Levente Uzonyi <[hidden email]>:

>
> On Sat, 6 Nov 2010, Igor Stasenko wrote:
>
>>
>> On 6 November 2010 17:26, Levente Uzonyi <[hidden email]> wrote:
>>>
>>> On Thu, 4 Nov 2010, Stefan Marr wrote:
>>>
>>>>
>>>> Hi Bert:
>>>>
>>>> On 04 Nov 2010, at 20:20, Bert Freudenberg wrote:
>>>>
>>>>>>> So RoarVM is about 4 times slower in sends, even more so for bytecodes.
>>>>>>> It needs 8 cores to be faster than the regular interpreter on a single core.
>>>>>>> So the good news is that it can beat the old interpreter :) But why is it so
>>>>>>> much slower than the normal interpreter?
>>>>>>
>>>>>> Well, on the one hand, we don't use stuff like the GCC labels-as-values
>>>>>> extension to have threaded interpretation, which should help quite a bit.
>>>>>> Then, the current implementation based on pthreads is quite a bit slower
>>>>>> than our version which uses plain Unix processes.
>>>>>> The GC is really not state of the art.
>>>>>> And all that adds up rather quickly I suppose...
>>>>>
>>>>> Hmm, that doesn't sound like it should make it 4x slower ...
>>>>
>>>> Do you know some numbers for the switch/case-based vs. the threaded
>>>> version on the standard VM?
>>>> How much do you typically gain by it?
>>>
>>> If threaded means gnuified (jump table instead of the linear search), then
>>> it gives ~2x speedup for the standard SqueakVM.
>>>
>> In my own experience it gives about 30%.
>
> Right, it depends on what we take into account. According to this mail: http://lists.squeakfoundation.org/pipermail/vm-dev/2010-January/003761.html
> tinyBenchmarks gives
> '248543689 bytecodes/sec; 8117987 sends/sec' without gnuification and
> '411244979 bytecodes/sec; 10560900 sends/sec' with gnuification.
>
> These aren't fully optimized VMs, so the difference may be smaller or larger with better optimizations. Anyway in this case in terms of bytecodes the difference is 65%, for sends it's 30%. So the general speedup is not 2x, but it's not 30% either.
>
> The actual performance difference may be greater depending on the bytecodes used (tinyBenchmarks uses only a few) and the compiler's capabilities. Btw, I wonder why gcc can't compile switch statements like this to jump tables by itself, without gnuification.
>

Maybe because C sucks? :)
But seriously, if you look at the many places where a switch
statement is used, only a few of them have an ordered set of cases,
and of those, only a few have a power-of-two number of cases.
So it's not worth the effort.

Another way would be to use a function table, but then the compiler
would have to be able to inline all the functions from that table, which
is also non-trivial from the compiler's perspective, since the calls are indirect.
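
To make the three dispatch styles discussed here concrete, a rough sketch
in C: a plain switch, a function table, and a GCC labels-as-values
("threaded") loop. The four opcodes, the Vm struct and the run_* names are
made up for illustration and have nothing to do with the real Squeak or
RoarVM sources. The threaded variant assumes the GCC/Clang extension and
falls back to the switch loop on other compilers.

```c
#include <stddef.h>

/* Hypothetical 4-instruction bytecode set, invented for this sketch. */
enum { OP_PUSH1, OP_ADD, OP_DOUBLE, OP_HALT, OP_COUNT };

/* 1. Plain switch: one shared dispatch point at the top of the loop. */
static long run_switch(const unsigned char *code)
{
    long acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_PUSH1:  acc = 1;   break;
        case OP_ADD:    acc += 1;  break;
        case OP_DOUBLE: acc *= 2;  break;
        case OP_HALT:   return acc;
        default:        return -1;          /* unknown opcode */
        }
    }
}

/* 2. Function table: every bytecode is an indirect call, so the
 *    interpreter state has to live in a struct the handlers can reach. */
typedef struct { const unsigned char *pc; long acc; int running; } Vm;
typedef void (*Handler)(Vm *);

static void op_push1(Vm *vm)  { vm->acc = 1; }
static void op_add(Vm *vm)    { vm->acc += 1; }
static void op_double(Vm *vm) { vm->acc *= 2; }
static void op_halt(Vm *vm)   { vm->running = 0; }

static const Handler handlers[OP_COUNT] = {
    op_push1, op_add, op_double, op_halt
};

static long run_table(const unsigned char *code)
{
    Vm vm = { code, 0, 1 };
    while (vm.running) {
        unsigned char op = *vm.pc++;
        if (op >= OP_COUNT) return -1;      /* unknown opcode */
        handlers[op](&vm);
    }
    return vm.acc;
}

/* 3. Labels as values: each handler is written with its own indirect
 *    jump to the next handler instead of going back to a shared switch. */
static long run_threaded(const unsigned char *code)
{
#if defined(__GNUC__)
    static void *const labels[OP_COUNT] = { &&push1, &&add, &&dbl, &&halt };
    long acc = 0;
#define DISPATCH() do {                              \
        unsigned char op = *code++;                  \
        if (op >= OP_COUNT) return -1;               \
        goto *labels[op];                            \
    } while (0)

    DISPATCH();
push1: acc = 1;  DISPATCH();
add:   acc += 1; DISPATCH();
dbl:   acc *= 2; DISPATCH();
halt:  return acc;
#undef DISPATCH
#else
    return run_switch(code);                /* no extension available */
#endif
}
```

For the program { OP_PUSH1, OP_ADD, OP_DOUBLE, OP_DOUBLE, OP_HALT } all
three loops return 8. Whether the threaded form buys 30% or 2x on a real
VM is something only a measurement on that VM can tell.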

>
> Levente
>

--
Best regards,
Igor Stasenko AKA sig.