RoarVM: The Manycore SqueakVM


Re: RoarVM: The Manycore SqueakVM

Stefan Marr

Hi Bert:

On 04 Nov 2010, at 21:49, Bert Freudenberg wrote:

>>>> Then, the current implementation based on pthreads is quite a bit slower then our version which uses plain Unix processes.
>
> Wait. What do you mean by "current version" vs. "our version"?
Just imprecise wording, mixed with my wish to actually experiment with that issue.
On the Tilera we are using the process model with Tilera specific libraries.
On x86 systems, we are using threads. However, that makes things quite a bit slower, and when I find the time, I will implement the necessary cross-process shared-memory libraries to get rid of the threads and use processes like on Tilera.


>>>> The GC is really not state of the art.
>>>> And all that adds up rather quickly I suppose...
>>>
>>> Hmm, that doesn't sound like it should make it 4x slower ...
>> Do you know some numbers for the switch/case-based vs. the threaded version on the standard VM?
>> How much do you typically gain by it?
>
> I don't really remember but it was well below 50%, more like 10%-20% I think.
Hm, ok, that doesn't sound like something I want to spend time on then.



>> One thing I forgot to mention in this context is the object table we use.
>> That is also something which is not exactly making the VM faster.
>
> Ah, yes. That could make quite a difference. You wouldn't be calling a function for each object access though I hope?
Well, that depends on what the compiler does with our code.
It is supposed to inline the accesses, and there are no virtual calls...

Best regards
Stefan


--
Stefan Marr
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://soft.vub.ac.be/~smarr
Phone: +32 2 629 2974
Fax:   +32 2 629 3525


Re: RoarVM: The Manycore SqueakVM

Levente Uzonyi-2
In reply to this post by Stefan Marr
 
On Thu, 4 Nov 2010, Stefan Marr wrote:

>
> Hi Bert:
>
> On 04 Nov 2010, at 20:20, Bert Freudenberg wrote:
>
>>>> So RoarVM is about 4 times slower in sends, even more so for bytecodes. It needs 8 cores to be faster the regular interpreter on a single core. To the good news is that it can beat the old interpreter :)  But why is it so much slower than the normal interpreter?
>>> Well, one the one hand, we don't use stuff like the GCC label-as-value extension to have threaded-interpretation, which should help quite a bit.
>>> Then, the current implementation based on pthreads is quite a bit slower then our version which uses plain Unix processes.
>>> The GC is really not state of the art.
>>> And all that adds up rather quickly I suppose...
>>
>> Hmm, that doesn't sound like it should make it 4x slower ...
> Do you know some numbers for the switch/case-based vs. the threaded version on the standard VM?
> How much do you typically gain by it?

If threaded means gnuified (jump table instead of the linear search), then
it gives ~2x speedup for the standard SqueakVM.


Levente

snip

Re: RoarVM: The Manycore SqueakVM

Igor Stasenko

On 6 November 2010 17:26, Levente Uzonyi <[hidden email]> wrote:

> snip
> If threaded means gnuified (jump table instead of the linear search), then
> it gives ~2x speedup for the standard SqueakVM.
>
In my own experience it gives 30%.

>
> Levente
>
> snip
>



--
Best regards,
Igor Stasenko AKA sig.

Re: RoarVM: The Manycore SqueakVM

Levente Uzonyi-2
 
On Sat, 6 Nov 2010, Igor Stasenko wrote:

> snip
>> If threaded means gnuified (jump table instead of the linear search), then
>> it gives ~2x speedup for the standard SqueakVM.
>>
> to my own experience it gives 30%
Right, it depends on what we take into account. According to this mail:
http://lists.squeakfoundation.org/pipermail/vm-dev/2010-January/003761.html
tinyBenchmarks gives
'248543689 bytecodes/sec; 8117987 sends/sec' without gnuification and
'411244979 bytecodes/sec; 10560900 sends/sec' with gnuification.

These aren't fully optimized VMs, so the difference may be smaller or
larger with better optimizations. Anyway in this case in terms of
bytecodes the difference is 65%, for sends it's 30%. So the general
speedup is not 2x, but it's not 30% either.

The actual performance difference may be greater depending on the used
bytecodes (tinyBenchmarks uses only a few) and the compiler's
capabilities. Btw I wonder why gcc can't compile switch statements like
this to jump tables by itself without gnuification.


Levente


Re: RoarVM: The Manycore SqueakVM

Igor Stasenko

2010/11/6 Levente Uzonyi <[hidden email]>:

> snip
> The actual performance difference may be greater depending on the used bytecodes (tinyBenchmarks uses only a few) and the compiler's capabilities. Btw I wonder why gcc can't compile switch statements like this to jump tables by itself without gnuification.
>
Maybe because C sucks? :)
But seriously, if you look at the numerous places where a switch
statement is used, only a few of them have an ordered set of cases,
and of those, only a few have a power-of-two number of cases.
So it's not worth the effort.

Another way would be to use a function table, but then the compiler
would have to be able to inline all the functions from that table,
which is also non-trivial from the compiler's perspective, since the call is indirect.




--
Best regards,
Igor Stasenko AKA sig.

Re: RoarVM: The Manycore SqueakVM

Stephen Pair
In reply to this post by Stefan Marr
 
On Wednesday, November 3, 2010, Stefan Marr <[hidden email]> wrote:

>
> Dear Smalltalk community:
>
>
> We are happy to announce, now officially, RoarVM: the first single-image
> manycore virtual machine for Smalltalk.
>
>
> The RoarVM supports the parallel execution of Smalltalk programs on x86
> compatible multicore systems and Tilera TILE64-based manycore systems. It is
> tested with standard Squeak 4.1 closure-enabled images, and with a stripped
> down version of a MVC-based Squeak 3.7 image. Support for Pharo 1.2 is
> currently limited to 1 core, but we are working on it.

What great news!

The main question that comes to mind for me is concurrency.  Does this
VM do anything special to preserve the concurrency semantics of
smalltalk processes scheduled on a single core?  As I'm sure most
people are aware, the existing squeak library isn't written with
thread safety in mind...and as such, even on a single core a naive
implementation can have issues when executing with concurrent
processes.  The solution is generally to try and isolate the objects
running in different, concurrent processes and use message passing and
replication as needed.  If this VM doesn't do anything in this regard,
I would expect that it would present an even greater risk of issues
stemming from concurrency and it would make it all the more important
to keep objects accessible in different processes cleanly separated.

Another question that comes to mind is how much of a performance hit
you might see from the architecture trying to maintain cache
consistency in the face of multiple processes simultaneously updating
shared memory.  Is that something you've found to be an issue?  Is
this something you would have to be careful about when crafting code?
And, if it is a problem, is it something where you'd need to be
concerned not just with shared objects, but also with shared pages
(ie. would you need some measure of control over pages being updated
from multiple concurrent processes to effectively deal with this
issue)?

Lastly, could you summarize how the design of this VM differs from the
designs of other multithreaded VMs that have been implemented in the
past?

It's exciting to see this kind of innovation happening in the Squeak community!

-Stephen

Re: RoarVM: The Manycore SqueakVM

Levente Uzonyi-2
In reply to this post by Igor Stasenko
 
On Sat, 6 Nov 2010, Igor Stasenko wrote:

>
> 2010/11/6 Levente Uzonyi <[hidden email]>:
>>
>> The actual performance difference may be greater depending on the used bytecodes (tinyBenchmarks uses only a few) and the compiler's capabilities. Btw I wonder why gcc can't compile switch statements like this to jump tables by itself without gnuification.
>>
> Maybe because C sucks? :)
> But if seriously, if you look into numerous cases where switch
> statement used, only few of them would have an ordered set of cases,
> and from them, only few will have a power of two number of cases.
> So, its not worth the effort.

When I learned C, my teacher said that the switch statement was designed
to be compiled to jump tables in most cases. I was a bit disappointed when
I realized that most compilers don't even try to do it.

I guess you're wrong about the jump table implementation. Here's a simple
example:

switch (i) {
  case 0: ...;
  case 1: ...;
  case 2: ...;
  default: ...;
}

It doesn't have power of two size, but it still can be implemented
as a jump table. You just need an "array", that has 3 slots with the
pointers of the instructions of the cases. The mapping from keys to
indices is simply identity. You just check the bounds and jump to the
"default" address if the key is not in the range of the array (between 0
and 2 in this case). If there are "holes" in the table, say:

switch (i) {
  case 0: ...;
  case 1: ...;
  case 3: ...;
  default :...;
}

then the slot at index 2 will contain the address of the "default" case.

If the keys have a different offset, or are multiplied by a number, then
the key to index mapping is a bit more complicated:

switch (i) {
  case 28: ...;
  case 20: ...;
  case 16: ...;
  default
}

The mapping is (key >> 2) - 4. Also note that the cases are not ordered,
but that doesn't matter. The array has pointers for keys 16, 20, 24 and
28, with the slot for 24 holding the default address.

I think these cases should be detected by the compiler. Also compilers
could generate perfect hash tables for all other cases to achieve similar
performance.


Levente

>
> Another way would be to use function table, but then compiler should
> be able to inline all functions from that table, which is also
> non-trivial from compiler perspective, since its indirect.
>

Re: RoarVM: The Manycore SqueakVM

Igor Stasenko

On 6 November 2010 20:30, Levente Uzonyi <[hidden email]> wrote:

> snip

I suppose it's related to the fact that with a small number of cases (say, up to 16)
it's not worth using an indirect jump, because with branch prediction you
can achieve the same or even better performance.
That's why I think modern compilers don't use jump tables.
Having more than 16 cases is an edge case, not to mention 256.


Btw, I was also under the impression that compilers generate jump tables for
case statements.




--
Best regards,
Igor Stasenko AKA sig.

Re: [squeak-dev] Re: [Vm-dev] RoarVM: The Manycore SqueakVM

Stefan Marr
In reply to this post by Stephen Pair

Hello Stephen:


On 06 Nov 2010, at 18:49, Stephen Pair wrote:
> The main question that comes to mind for me is concurrency.  Does this
> VM do anything special to preserve the concurrency semantics of
> smalltalk processes scheduled on a single core?  As I'm sure most
> people are aware, the existing squeak library isn't written with
> thread safety in mind...and as such, even on a single core a naive
> implementation can have issues when executing with concurrent
> processes.
The VM doesn't do anything special; you have to take care of ensuring the semantics
you want yourself. The idea is to preserve the standard Smalltalk programming model
without introducing new facilities.
Thus, what you get with the RoarVM is what you had before, i.e., processes + mutexes/semaphores,
and in addition your processes can be executed in parallel.

I agree that this is a very low-level programming model, but it is the right foundation
for us to experiment with ideas like Ly and Sly.
Ly aims to be a language in which you program by embracing non-determinism, and the goal is to
find ways to benefit from the parallelism and the natural non-determinism of the system.


> The solution is generally to try and isolate the objects
> running in different, concurrent processes and use message passing and
> replication as needed. If this VM doesn't do anything in this regard,
> I would expect that it would present an even greater risk of issues
> stemming from concurrency and it would make it all the more important
> to keep objects accessible in different processes cleanly separated.
There are several actor implementations for Smalltalk; I guess you could get them working on the RoarVM.
At the moment, there isn't any special VM support for these programming models. However, there are ideas and plans to enable language designers to build such languages efficiently by providing VM support for a flexible notion of encapsulation.

> Another question that comes to mind is how much of a performance hit
> you might see from the architecture trying to maintain cache
> consistency in the face of multiple processes simultaneously updating
> shared memory. Is that something you've found to be an issue?
Sure, that is a big issue. It is already problematic for performance on x86 systems with a few cores, and even worse on Tilera.

Just imagine a simple counter that needs to be updated atomically by 63 other cores...
There is no way to make such a counter scale on any system.
The only thing you can do is to not use such a counter. In 99% of the cases you don't need it anyway.

As Igor pointed out, if you want performance, you will avoid shared mutable state. The solution for such a counter would be to have local counters, and you synchronize only to get a global sum and only when it is really really necessary. But the optimal solution is very application specific.

> Is this something you would have to be careful about when crafting code?
> And, if it is a problem, is it something where you'd need to be
> concerned not just with shared objects, but also with shared pages
> (ie. would you need some measure of control over pages being updated
> from multiple concurrent processes to effectively deal with this
> issue)?
Locality is how I would name the problem, and together with the notion of encapsulation, it is something I am currently looking into.

A brief description of what I am up to can be found here: http://soft.vub.ac.be/~smarr/2010/07/doctoral-symposium-at-splash-2010/
Or even more fluffy here: http://soft.vub.ac.be/~smarr/2010/08/poster-at-splash10/



> Lastly, could you summarize how the design of this VM differs from the
> designs of other multithreaded VMs that have been implemented in the
> past?
Compared to your standard JVM or .NET CLR, the RoarVM provides a similar programming model, but the implementation is designed for experimentation on the TILE architecture. To reach that goal, the VM is kept simple. The heap is divided into number_of_cores parts, each owned by a single core, and each split into a read-mostly and a read-write heap.
That is an optimization for the TILE64 chip with its restricted caching scheme.
A few more details have been discussed on this thread already.

Best regards
Stefan


--
Stefan Marr
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://soft.vub.ac.be/~smarr
Phone: +32 2 629 2974
Fax:   +32 2 629 3525


Re: RoarVM: The Manycore SqueakVM

Levente Uzonyi-2
In reply to this post by Igor Stasenko
 
On Sat, 6 Nov 2010, Igor Stasenko wrote:

> I suppose its related to fact, that with small number of cases (like up to 16)
> its not worth using indirect jump, because with branch prediction you
> can achieve same or even better performance.
> That's why i think, modern compilers do not using jump tables.
> Having more than 16 number of cases, is an edge case. Not saying about 256.

I expect a compiler to generate the fastest possible code if I ask for
that. I think we agree that gcc doesn't do this.


Levente


Re: RoarVM: The Manycore SqueakVM

johnmci
 
The other cheerful thing the switch statement does is put in a range check.
This means a range check is done for each bytecode dispatch unless you gnuify.
Fifteen years back we actually altered the 68K binary to no-op out the range check,
which improved performance by a measurable amount.

--
===========================================================================
John M. McIntosh <[hidden email]>   Twitter:  squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================






Re: RoarVM: The Manycore SqueakVM

Stefan Marr
In reply to this post by Levente Uzonyi-2

Hello:

On 06 Nov 2010, at 21:42, Levente Uzonyi wrote:

> On Sat, 6 Nov 2010, Igor Stasenko wrote:
>
>> I suppose its related to fact, that with small number of cases (like up to 16)
>> its not worth using indirect jump, because with branch prediction you
>> can achieve same or even better performance.
>> That's why i think, modern compilers do not using jump tables.
>> Having more than 16 number of cases, is an edge case. Not saying about 256.
>
> I expect a compiler to generate the fastest possible code, if I ask for that. I think we agree, that gcc doesn't do this.
I don't really follow this discussion.
Are you assuming that GCC does not use jump tables?
Well, if I read the assembly code correctly, it does. See below.

However, Levente assumed that using jump tables amounts to threaded code.
But no, threaded code is something different: http://en.wikipedia.org/wiki/Threaded_code
Michael Haupt has a few nice slides visualizing what threaded code is about :)
The idea is that you avoid the jump back to the switch/case and chain the bytecode implementations for a given method.

Best regards
Stefan


void Squeak_Interpreter::dispatch(u_char currentByte) {
  if (Check_Prefetch)  have_executed_currentBytecode = true;
  switch (currentByte) {
  case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7: case 8: case 9:
  case 10: case 11: case 12: case 13: case 14: case 15:
    pushReceiverVariableBytecode(); break;
  case 16: case 17: case 18: case 19:
  case 20: case 21: case 22: case 23: case 24: case 25: case 26: case 27: case 28: case 29:
  case 30: case 31:
    pushTemporaryVariableBytecode(); break;

GCC 4.2 -O3

0x00013b50  <+0000>  push   %ebp
0x00013b51  <+0001>  mov    %esp,%ebp
0x00013b53  <+0003>  mov    0x8(%ebp),%edx
0x00013b56  <+0006>  movb   $0x1,0x4171(%edx)
0x00013b5d  <+0013>  movzbl 0xc(%ebp),%eax
0x00013b61  <+0017>  jmp    *0xc32ec(,%eax,4)
0x00013b68  <+0024>  mov    %edx,0x8(%ebp)
0x00013b6b  <+0027>  leave  
0x00013b6c  <+0028>  jmp    0x42d50 <_ZN18Squeak_Interpreter15unknownBytecodeEv>
0x00013b71  <+0033>  mov    %edx,0x8(%ebp)
0x00013b74  <+0036>  leave  
0x00013b75  <+0037>  jmp    0x46e10 <_ZN18Squeak_Interpreter20pushNewArrayBytecodeEv>
0x00013b7a  <+0042>  mov    %edx,0x8(%ebp)
0x00013b7d  <+0045>  leave  
0x00013b7e  <+0046>  jmp    0x43950 <_ZN18Squeak_Interpreter25pushActiveContextBytecodeEv>
0x00013b83  <+0051>  mov    %edx,0x8(%ebp)
0x00013b86  <+0054>  leave  
0x00013b87  <+0055>  jmp    0x44450 <_ZN18Squeak_Interpreter20duplicateTopBytecodeEv>
0x00013b8c  <+0060>  mov    %edx,0x8(%ebp)
0x00013b8f  <+0063>  leave  
0x00013b90  <+0064>  jmp    0x42ec0 <_ZN18Squeak_Interpreter16popStackBytecodeEv>
0x00013b95  <+0069>  mov    %edx,0x8(%ebp)
0x00013b98  <+0072>  leave  
0x00013b99  <+0073>  jmp    0x45f50 <_ZN18Squeak_Interpreter26secondExtendedSendBytecodeEv>
0x00013b9e  <+0078>  mov    %edx,0x8(%ebp)
0x00013ba1  <+0081>  leave  






--
Stefan Marr
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://soft.vub.ac.be/~smarr
Phone: +32 2 629 2974
Fax:   +32 2 629 3525


Re: RoarVM: The Manycore SqueakVM

Ian Piumarta

On Nov 7, 2010, at 06:02 , Stefan Marr wrote:

> Are you assuming that GCC is not using jumptables?
> Well, if I read the assembly code correctly, it does. See below.

GCC does produce jump tables for switch() whenever it makes sense to.  The conditions & solution we need are more complex.

Given:

1. an integer driving a switch that is guaranteed to be in a known range (in this case 8-bit unsigned);
2. every case ends with a break to the end of the switch statement;
3. the switch is inside an infinite loop that generates the next integer for the switch,

then the compiler could decide to:

1. remove the range check;
2. turn off common subexpression elimination (a useless extra jump in many bytecodes);
3. combine the head of the loop (fetch) with the switch's indirect jump through the table;
4. distribute the combined fetch+jump over the entire switch as the final step in each branch (another useless jump in every bytecode).

That's an awful lot to ask of an optimiser for a general-purpose programming language.

On Nov 6, 2010, at 2:05 PM, Igor Stasenko <[hidden email]> wrote:

> In an ideal world it would all be implemented in Smalltalk (or something similar) and dynamically translated directly into machine code.

Agreed.  Critical optimisations should be part of the program specification, not the compiler implementation.

> Michael Haupt has a few nice slides visualizing what threaded code is about :)

A short explanation is here, too:  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.513&rep=rep1&type=pdf

Cheers,
Ian


Re: RoarVM: The Manycore SqueakVM

Michael Haupt-3

Hi,

Am 06.11.2010 um 22:45 schrieb Ian Piumarta <[hidden email]>:
>
>> Michael Haupt has a few nice slides visualizing what threaded code is about :)
>
> A short explanation is here, too:  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.513&rep=rep1&type=pdf

one of the sources extensively used in the slides. ;-)

Best,

Michael

Re: RoarVM: The Manycore SqueakVM

ungar
 
Back when I did Berkeley Smalltalk (early 1980's), I fretted a lot about incremental performance improvements to the interpreter.
Every few days I was able to get another few percent, as I recall. Then Peter and Alan did the first OO JIT ([Deutsch-Schiffman '84])
and it was so much faster than BS, it became clear to me that I didn't want to spend any more of my time optimizing an interpreter
when I could be adding a simple compiler instead. But that's just my own experience.

 - David




Re: RoarVM: The Manycore SqueakVM

Michael Haupt-3

Hi Dave,

Am 07.11.2010 um 02:47 schrieb [hidden email]:
> Back when I did Berkeley Smalltalk (early 1980's), I fretted a lot about incremental performance improvements to the interpreter.
> Every few days I was able to get another few percent, as I recall. Then Peter and Alan did the first OO JIT ([Deutsch-Schiffman '84])
> and it was so much faster than BS, it became clear to me that I didn't want to spend any more of my time optimizing an interpreter
> when I could be adding a simple compiler instead. But that's just own experience.

sure! Those interpreter considerations are just one part of the slides ... JIT compilers are there as well.

One interesting observation that students made with their threading implementations is that it did not significantly improve performance: today's branch predictors seem to be too advanced to be impressed by threading. (They measured this using HPM.) Of course, the VM used in that course is really simple, and also has a very simple and small bytecode set.

Best,

Michael


Re: RoarVM: The Manycore SqueakVM

ungar
 
Michael,

Yup. I was mainly trying to save folks who might invest a lot of work in threading, etc., from repeating my own mistakes.

Interesting about today's branch predictors!

Thank you,

- David




Re: RoarVM: The Manycore SqueakVM

Stefan Marr
In reply to this post by Michael Haupt-3

Hi:

On 07 Nov 2010, at 08:06, Michael Haupt wrote:

> today's branch predictors seem to be too advanced to be impressed by threading.
Just as a side note: the share of machines with CPUs that have awesome branch-predictors is likely to decline. With all those low-powered systems and with manycore systems, the cost of using the die space/your power envelope for branch-prediction, out-of-order execution, and the like does not seem to be worth it when compared to just spending those transistors on additional cores.
So, if you stay with an interpreter, these kinds of optimizations will remain interesting for sequential performance.

Best regards
Stefan

--
Stefan Marr
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://soft.vub.ac.be/~smarr
Phone: +32 2 629 2974
Fax:   +32 2 629 3525


Re: RoarVM: The Manycore SqueakVM

Bert Freudenberg

On 07.11.2010, at 10:20, Stefan Marr wrote:

> Hi:
>
> On 07 Nov 2010, at 08:06, Michael Haupt wrote:
>
>> today's branch predictors seem to be too advanced to be impressed by threading.
> Just as a side note: the share of machines with CPUs that have awesome branch-predictors is likely to decline. With all those low-powered systems and with manycore systems, the cost of using the die space/your power envelope for branch-prediction, out-of-order execution, and the like does not seem to be worth it when compared to just spending those transistors on additional cores.
> So, when you stay with an interpreter, these kind of optimizations will remain interesting for sequential performance.

And given the current security paranoia, which leads certain platforms to prevent JITting (memory pages are writable *or* executable, but not both), interpreters might have to stay around for quite some time ...

- Bert -
