RoarVM: The Manycore SqueakVM

Stefan Marr

Dear Smalltalk community:


We are happy to announce, now officially, RoarVM: the first single-image
manycore virtual machine for Smalltalk.


The RoarVM supports the parallel execution of Smalltalk programs on
x86-compatible multicore systems and Tilera TILE64-based manycore systems. It is
tested with standard Squeak 4.1 closure-enabled images, and with a stripped-down
version of an MVC-based Squeak 3.7 image. Support for Pharo 1.2 is
currently limited to 1 core, but we are working on it.

A small teaser:
  1 core   66286897 bytecodes/sec;  2910474 sends/sec
  8 cores 470588235 bytecodes/sec; 19825677 sends/sec
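
To put the two lines in relation to each other, here is a small workspace sketch that works out the speedup and parallel efficiency they imply. It uses only the four figures quoted above. The variable names are ad hoc, and the efficiency (speedup divided by the number of cores) is a derived quantity, not something reported in the announcement.

```smalltalk
"Speedup implied by the teaser, scaled by 100 so that plain Integer
 arithmetic suffices (710 means 7.10x). Names are ad hoc."
| bytecodes1 bytecodes8 sends1 sends8 cores bytecodeSpeedup sendSpeedup |
bytecodes1 := 66286897.
bytecodes8 := 470588235.
sends1 := 2910474.
sends8 := 19825677.
cores := 8.

bytecodeSpeedup := (bytecodes8 * 100 / bytecodes1) rounded.
sendSpeedup := (sends8 * 100 / sends1) rounded.

Transcript
	show: 'bytecodes: speedup x100 = ', bytecodeSpeedup printString;
	show: ', efficiency % = ', (bytecodeSpeedup / cores) rounded printString; cr;
	show: 'sends: speedup x100 = ', sendSpeedup printString;
	show: ', efficiency % = ', (sendSpeedup / cores) rounded printString; cr.
```

That is roughly a 7.1x speedup for bytecodes and a 6.8x speedup for sends on 8 cores.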


RoarVM is based on the work of David Ungar and Sam S. Adams at IBM Research.
They open-sourced their VM, formerly known as the Renaissance VM (RVM), under
the Eclipse Public License [1]. The port to x86 multicore systems was done by me.
Official announcement of the IBM source code release [2]:
  http://soft.vub.ac.be/~smarr/rvm-open-source-release/

The source code of the RoarVM has been released as open source to enable the
Smalltalk community to evaluate the ideas and possibly integrate them into
existing systems. So, the RoarVM is meant to experiment with Smalltalk systems
on multi- and manycore machines.

The open source project and its downloads can be found on GitHub:
    http://github.com/smarr/RoarVM
    http://github.com/smarr/RoarVM/downloads

For more detailed information, please refer to the README file [3].
Instructions to compile the RoarVM on Linux and OS X can be found at [4].
Windows is currently not supported. There is a good chance that it will work
with Cygwin or pthreads for Win32, but that has not been verified in any way.
If you feel brave, please give it a shot and report back.

If the community does not object, we would like to occupy the
[hidden email] mailing list for related discussions. So, if
you run into any trouble while experimenting with the RoarVM, do not hesitate
to report any problems and ask any questions.

You can also follow us on Twitter @roarvm [5].

Best regards
Stefan Marr

[1] http://www.eclipse.org/legal/epl-v10.html
[2] http://soft.vub.ac.be/~smarr/rvm-open-source-release/
[3] http://github.com/smarr/RoarVM/blob/master/README.rst
[4] http://github.com/smarr/RoarVM/blob/master/INSTALL.rst
[5] http://twitter.com/roarvm

--
Stefan Marr
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://soft.vub.ac.be/~smarr
Phone: +32 2 629 2974
Fax:   +32 2 629 3525


Re: RoarVM: The Manycore SqueakVM

Igor Stasenko
 
On 3 November 2010 15:13, Stefan Marr <[hidden email]> wrote:

[snip]

> If the community does not object, we would like to occupy the
> [hidden email] mailing list for related discussions. So, if
> you run into any trouble while experimenting with the RoarVM, do not hesitate
> to report any problems and ask any questions.
>

No objections from me! You are welcome there!



--
Best regards,
Igor Stasenko AKA sig.

Re: RoarVM: The Manycore SqueakVM

Yanni Chiu
In reply to this post by Stefan Marr
 
On 03/11/10 9:13 AM, Stefan Marr wrote:
>
> A small teaser:
>    1 core   66286897 bytecodes/sec;  2910474 sends/sec
>    8 cores 470588235 bytecodes/sec; 19825677 sends/sec

I'm trying to understand what is meant by this benchmark. How does
tinyBenchmarks get run on 8 cores? How does the work get distributed
among the cores? I cannot find the answer quickly - see below.

> For more detailed information, please refer to the README file[3].

The links take you to the ACM portal. One promising link on the portal:
     DLS '09 Proceedings of the 5th symposium on Dynamic languages
is a dead link to HPI (www.uni-potsdam.de). I doubt I want to subscribe
to ACM portal at this point, just to find out more about RoarVM. And,
given that ACM portal is the only place to find out more, I doubt I'm
going to look at RoarVM for any more than the brief glance through the
github source that I've already done.

Re: RoarVM: The Manycore SqueakVM

Andreas.Raab
In reply to this post by Stefan Marr
 
On 11/3/2010 6:13 AM, Stefan Marr wrote:
> We are happy to announce, now officially, RoarVM: the first single-image
> manycore virtual machine for Smalltalk.

Congrats! That's a great step forward.

> RoarVM is based on the work of David Ungar and Sam S. Adams at IBM Research.
> The port to x86 multicore systems was done by me. They open-sourced their VM,
> formerly known as Renaissance VM (RVM), under the Eclipse Public License [1].
> Official announcement of the IBM source code release:
>    http://soft.vub.ac.be/~smarr/rvm-open-source-release/
>
> The source code of the RoarVM has been released as open source to enable the
> Smalltalk community to evaluate the ideas and possibly integrate them into
> existing systems. So, the RoarVM is meant to experiment with Smalltalk systems
> on multi- and manycore machines.

Can you give us a high-level overview about what "the trick" is? I.e.,
which approach did you take to make this possible?

> If the community does not object, we would like to occupy the
> [hidden email] mailing list for related discussions. So, if
> you run into any trouble while experimenting with the RoarVM, do not hesitate
> to report any problems and ask any questions.

Sounds great. Thanks for sharing.

Cheers,
   - Andreas

Re: RoarVM: The Manycore SqueakVM

Stefan Marr
In reply to this post by Yanni Chiu

Hi Yanni:

On 03 Nov 2010, at 17:45, Yanni Chiu wrote:

> On 03/11/10 9:13 AM, Stefan Marr wrote:
>>
>> A small teaser:
>>   1 core   66286897 bytecodes/sec;  2910474 sends/sec
>>   8 cores 470588235 bytecodes/sec; 19825677 sends/sec
>
> I'm trying to understand what is meant by this benchmark. How does tinyBenchmarks get run on 8 cores? How does the work get distributed among the cores? I cannot find the answer quickly - see below.
It is an adapted version of the tinyBenchmarks.
I will make the code available soonish...

Anyway, the idea is that the benchmark is started n times, in n Smalltalk processes.
Then the overall execution time is measured. If more than one core is actually scheduling the started processes for execution, you will see the increase shown above. Otherwise, you just see your normal sequential performance.

>> For more detailed information, please refer to the README file[3].
>
> The links take you to the ACM portal. One promising link on the portal:
>    DLS '09 Proceedings of the 5th symposium on Dynamic languages
> is a dead link to HPI (www.uni-potsdam.de). I doubt I want to subscribe to ACM portal at this point, just to find out more about RoarVM. And, given that ACM portal is the only place to find out more, I doubt I'm going to look at RoarVM for any more than the brief glance through the github source that I've already done.
I am sorry that there is a paywall, but I can't really do anything about it...
You can always ask the authors of a particular paper to give you a copy, I suppose.

So, the question is how the work gets distributed?
Well, like in Multiprocessor Smalltalk:

You have one scheduler, as in a standard image; it is only slightly adapted to accommodate the fact that there can be more than one activeProcess (see http://github.com/smarr/RoarVM/raw/master/image.st/RVM-multicore-support.mvc.st).

Once a core gets to a point where it needs to re-evaluate its scheduling decision, it goes to the central scheduler and picks a process to execute.
Nothing fancy here, no sophisticated work stealing or distribution process.
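
The pick-from-a-central-scheduler idea described above can be mimicked inside an image. What follows is a minimal sketch, assuming a closure-enabled Squeak image: forked worker processes stand in for cores, and a mutex-protected OrderedCollection stands in for the scheduler's run queue. All names (runQueue, workers, done, results) are invented for the illustration. This is not the code in RVM-multicore-support.mvc.st, and it is not how the VM itself is implemented.

```smalltalk
"Illustration only: workers play the role of cores, runQueue the role
 of the central scheduler. None of these names come from RoarVM."
| runQueue mutex done results workers |
runQueue := ((1 to: 8) collect: [:i | [i * i]]) asOrderedCollection.
mutex := Semaphore forMutualExclusion.
done := Semaphore new.
results := OrderedCollection new.
workers := 4.

workers timesRepeat: [
	[ | job |
	  "Go back to the central queue whenever the current job is finished."
	  [ job := mutex critical: [
	        runQueue isEmpty ifTrue: [nil] ifFalse: [runQueue removeFirst] ].
	    job notNil ]
	      whileTrue: [ | value |
	        value := job value.
	        mutex critical: [ results add: value ] ].
	  done signal ] fork ].

workers timesRepeat: [done wait].      "wait until every worker ran out of work"
results asSortedCollection asArray
```

There is no work stealing and no per-worker queue here, matching the "nothing fancy" description: every worker simply takes the next entry under the same lock.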

Does that answer your question?

Best regards
Stefan




Re: RoarVM: The Manycore SqueakVM

Stefan Marr
In reply to this post by Andreas.Raab

Hi Andreas:

On 03 Nov 2010, at 17:57, Andreas Raab wrote:

>> The source code of the RoarVM has been released as open source to enable the
>> Smalltalk community to evaluate the ideas and possibly integrate them into
>> existing systems. So, the RoarVM is meant to experiment with Smalltalk systems
>> on multi- and manycore machines.
>
> Can you give us a high-level overview about what "the trick" is? I.e., which approach did you take to make this possible?
Ehm, sorry, I do not really know where to start.
Could you be a bit more specific with your question?

An overview, probably a bit too condensed, is given in the technical section of the README.
http://github.com/smarr/RoarVM/blob/master/README.rst

Best regards
Stefan




Re: RoarVM: The Manycore SqueakVM

Andreas.Raab
 
On 11/3/2010 10:27 AM, Stefan Marr wrote:
> On 03 Nov 2010, at 17:57, Andreas Raab wrote:
>>> The source code of the RoarVM has been released as open source to enable the
>>> Smalltalk community to evaluate the ideas and possibly integrate them into
>>> existing systems. So, the RoarVM is meant to experiment with Smalltalk systems
>>> on multi- and manycore machines.
>>
>> Can you give us a high-level overview about what "the trick" is? I.e., which approach did you take to make this possible?
> Ehm, sorry, I do not really know where to start.
> Could you be a bit more specific with your question?

Well, I was hoping you could tell us where to start. After all you know
more about this stuff than we do :-)

> A probably a bit too condensed overview is given in the technical section of the README.
> http://github.com/smarr/RoarVM/blob/master/README.rst

Any chance you could make versions of the referenced papers available?
This would surely help.

cheers,
   - Andreas

Re: RoarVM: The Manycore SqueakVM

John Dougan
 
http://www.duke.edu/~jd135/papers/  has copies of the two papers referenced in the readme.

Cheers,
  -- John

On Wed, Nov 3, 2010 at 10:39, Andreas Raab <[hidden email]> wrote:

> On 11/3/2010 10:27 AM, Stefan Marr wrote:
>> On 03 Nov 2010, at 17:57, Andreas Raab wrote:
>>>> The source code of the RoarVM has been released as open source to enable the
>>>> Smalltalk community to evaluate the ideas and possibly integrate them into
>>>> existing systems. So, the RoarVM is meant to experiment with Smalltalk systems
>>>> on multi- and manycore machines.
>>>
>>> Can you give us a high-level overview about what "the trick" is? I.e., which approach did you take to make this possible?
>> Ehm, sorry, I do not really know where to start.
>> Could you be a bit more specific with your question?
>
> Well, I was hoping you could tell us where to start. After all you know more about this stuff than we do :-)
>
>> A probably a bit too condensed overview is given in the technical section of the README.
>> http://github.com/smarr/RoarVM/blob/master/README.rst
>
> Any chance you could make versions of the referenced papers available? This would surely help.
>
> cheers,
>  - Andreas



--
John Dougan
[hidden email]

Re: RoarVM: The Manycore SqueakVM

Stefan Marr
In reply to this post by Andreas.Raab

Hi Andreas:


On 03 Nov 2010, at 18:39, Andreas Raab wrote:

> On 11/3/2010 10:27 AM, Stefan Marr wrote:
>> On 03 Nov 2010, at 17:57, Andreas Raab wrote:
>>>> The source code of the RoarVM has been released as open source to enable the
>>>> Smalltalk community to evaluate the ideas and possibly integrate them into
>>>> existing systems. So, the RoarVM is meant to experiment with Smalltalk systems
>>>> on multi- and manycore machines.
>>>
>>> Can you give us a high-level overview about what "the trick" is? I.e., which approach did you take to make this possible?
>> Ehm, sorry, I do not really know where to start.
>> Could you be a bit more specific with your question?
>
> Well, I was hoping you could tell us where to start. After all you know more about this stuff than we do :-)

Part of the question is also what degree of detail you want.

The VM runs one interpreter per core.
At the moment, you decide on the number of cores you want to use at startup.

Once the interpreters are initialized, they try to grab some work.
For this, there is the standard scheduler, just protected by a mutex.
As soon as an interpreter gets hold of a process, it starts to execute it.
After that, there are the normal points at which it can re-evaluate the scheduling decision.

Other than that, there are some important simplifications in the VM. For instance it uses an object table to be able to move objects around easily. The GC uses a simple compacting mark/sweep stop-the-world algorithm. Many of the design decisions are made with respect to the manycore architecture we are running the VM on. So, they are not necessarily optimal for today's x86 multicore systems.
For instance, the interpreters pass messages instead of using shared memory for certain information that needs to be kept consistent.
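
To illustrate why an object table makes moving objects easy, here is a toy model that can be run in a workspace. It is only a sketch under my own assumptions: references are plain indices ("oops") into a table, and each table entry records which heap currently holds the object. The names (heaps, objectTable, fetch) are made up and do not reflect RoarVM's actual data structures.

```smalltalk
"Illustration only: a toy object table. Every reference holds an index
 into objectTable; the entry (heapIndex -> slotIndex) says where the
 object currently lives. None of these names come from RoarVM."
| heaps objectTable oop fetch |
heaps := Array with: OrderedCollection new with: OrderedCollection new.
objectTable := OrderedCollection new.

"Allocate an object in heap 1 and register it in the table."
(heaps at: 1) add: 'some object'.
objectTable add: 1 -> (heaps at: 1) size.
oop := objectTable size.

"Every access goes through the table: one extra indirection per access."
fetch := [:anOop | | entry |
	entry := objectTable at: anOop.
	(heaps at: entry key) at: entry value].

"Move the object to heap 2: copy it, then update the single table entry.
 No reference to the object has to be found or rewritten."
(heaps at: 2) add: (fetch value: oop).
objectTable at: oop put: 2 -> (heaps at: 2) size.

fetch value: oop
```

The trade-off is visible in fetch: moving is cheap because only one entry changes, but every object access pays for the additional lookup.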

So, well, I hope that gives you some points you could ask more detailed questions about :)


>> A probably a bit too condensed overview is given in the technical section of the README.
>> http://github.com/smarr/RoarVM/blob/master/README.rst
>
> Any chance you could make versions of the referenced papers available? This would surely help.
I asked David.

Best regards
Stefan





Re: [Bulk] Re: RoarVM: The Manycore SqueakVM

Yanni Chiu
In reply to this post by Stefan Marr
 
On 03/11/10 1:11 PM, Stefan Marr wrote:
> So, the question is how the work gets distributed?
> Well, like in Multiprocessor Smalltalk:
>
> You have one scheduler, like in a standard image, it is only slightly adapted to accommodate the fact that there can be more than one activeProcess (See http://github.com/smarr/RoarVM/raw/master/image.st/RVM-multicore-support.mvc.st).
>
> Once a core gets to a point where it needs to re-evaluate its scheduling decision, it goes to the central scheduler and picks a process to execute.
> Nothing fancy here, no sophisticated work stealing or distribution process.
>
> Does that answer your question?

Not on first reading, but combined with your other answers, I think I've
got it.

Re: [squeak-dev] RoarVM: The Manycore SqueakVM

Bert Freudenberg
In reply to this post by Stefan Marr

On 03.11.2010, at 14:13, Stefan Marr wrote:

> A small teaser:
>  1 core   66286897 bytecodes/sec;  2910474 sends/sec
>  8 cores 470588235 bytecodes/sec; 19825677 sends/sec

I tried your precompiled OS X VM and the Sly3 image.

1 core:  93,910,491 bytecodes/sec; 4,056,440 sends/sec
2 cores: 91,559,370 bytecodes/sec; 4,007,927 sends/sec
3 cores: can't start
4 cores: 90,844,570 bytecodes/sec; 3,935,516 sends/sec
5 cores: can't start
6 cores: can't start
7 cores: can't start
8 cores: 89,698,668 bytecodes/sec; 3,910,787 sends/sec

So it looks like you have to use a power-of-two number of cores?

And the benchmark invocation should be different if you want to actually use multiple cores. What's the magic incantation?

I tried something myself:

n := 16.
q := SharedQueue new.
time := Time millisecondsToRun:
        [n timesRepeat: [[q nextPut: [30 benchFib] timeToRun] fork].
        n timesRepeat: [Transcript space; show: q next]].
Transcript space; show: time; cr

1 core:  664 664 665 666 667 662 664 664 668 665 667 665 666 669 666 10700
2 cores: 675 674 672 669 677 669 669 672 678 670 668 669 674 668 668 5425
4 cores: 721 726 729 740 713 728 740 734 731 737 721 737 734 756 788 749 3030
8 cores: 786 807 837 847 865 872 916 840 800 873 792 880 846 865 829 1820

Now that scales pretty nicely :) The overhead is about 25% at 8 cores, 12% for 4 cores.
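
One plausible way to arrive at such overhead figures is to compare the total run times (the last number on each line) against perfect scaling. A small sketch of that arithmetic follows; the reading of "overhead" as 100% minus parallel efficiency is an assumption on my part, and the names are ad hoc.

```smalltalk
"Efficiency = (1-core total) / (n-core total * n), as a rounded percentage.
 Overhead is taken to be what is missing to 100%. Names are ad hoc."
| t1 totals overheads |
t1 := 10700.                              "total on 1 core, in ms"
totals := #(2 5425  4 3030  8 1820).      "pairs of: cores, total ms"
overheads := Dictionary new.
totals pairsDo: [:cores :total | | efficiency |
	efficiency := (t1 * 100 / (total * cores)) rounded.
	overheads at: cores put: 100 - efficiency.
	Transcript
		show: cores printString, ' cores: efficiency ', efficiency printString, '%';
		show: ', overhead ', (100 - efficiency) printString, '%'; cr].
overheads
```

With this reading, 4 cores come out at 12% and 8 cores at 27%, which is in line with the rough figures given above.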

For our regular interpreter (*) I get:
1 core: 162 159 157 158 158 160 159 159 159 159 159 158 160 158 159 2585

So RoarVM is about 4 times slower in sends, even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. So the good news is that it can beat the old interpreter :)  But why is it so much slower than the normal interpreter?

Btw, user interrupt didn't work on the Mac.

And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.

- Bert -

(*) For comparison, a regular interpreter (not Cog) on this machine gets
    789,514,263 bytecodes/sec; 17,199,374 sends/sec
and Cog does
    880,481,513 bytecodes/sec; 70,113,306 sends/sec
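
The "about 4 times slower" statement can be checked against the numbers in this message. A quick sketch, using only figures quoted here (regular interpreter versus RoarVM on 1 core); the variable names are ad hoc.

```smalltalk
"Ratios regular interpreter : RoarVM on one core, scaled by 10
 (42 means 4.2x). All inputs are numbers quoted in this message."
| sendsRatio bytecodesRatio fibRatio |
sendsRatio := (17199374 * 10 / 4056440) rounded.
bytecodesRatio := (789514263 * 10 / 93910491) rounded.
fibRatio := (10700 * 10 / 2585) rounded.      "total ms of the 16 benchFib runs"
Transcript
	show: 'sends x10: ', sendsRatio printString; cr;
	show: 'bytecodes x10: ', bytecodesRatio printString; cr;
	show: 'benchFib total x10: ', fibRatio printString; cr.
```

So the gap is roughly 4.2x for sends, 8.4x for bytecodes, and 4.1x for the forked benchFib runs.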


Re: [squeak-dev] RoarVM: The Manycore SqueakVM

Stefan Marr

Hi Bert:


On 04 Nov 2010, at 19:07, Bert Freudenberg wrote:

> So it looks like you have to use a power-of-two number of cores?
Yes, that is right. At the moment, the system isn't able to handle other numbers of cores.


> And the benchmark invocation should be different if you want to actually use multiple cores. What's the magic incantation?
The code I used to generate the numbers isn't actually in any image yet.
I pasted it below for reference; it's just a quick hack to have a parallel tinyBenchmarks version.


> So RoarVM is about 4 times slower in sends, even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. So the good news is that it can beat the old interpreter :)  But why is it so much slower than the normal interpreter?
Well, on the one hand, we don't use stuff like the GCC label-as-value extension to have threaded-interpretation, which should help quite a bit.
Then, the current implementation based on pthreads is quite a bit slower than our version which uses plain Unix processes.
The GC is really not state of the art.
And all that adds up rather quickly, I suppose...


> Btw, user interrupt didn't work on the Mac.
Cmd+. ? Works for me ;) Well, can you be a bit more specific? In which situation did it not work?


> And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.
Yes, same here. Sorry. Any hints where to start looking to fix such issues are appreciated.



Best regards
Stefan


My tiny Benchmarks:

!Integer methodsFor: 'benchmarks' stamp: 'sm 10/11/2010 22:30'!
tinyBenchmarksParallel16Processes
    "Report the results of running the two tiny Squeak benchmarks.
    ar 9/10/1999: Adjusted to run at least 1 sec to get more stable results"
    "0 tinyBenchmarks"
    "On a 292 MHz G3 Mac: 22727272 bytecodes/sec; 984169 sends/sec"
    "On a 400 MHz PII/Win98:  18028169 bytecodes/sec; 1081272 sends/sec"
    | t1 t2 r n1 n2 |
    n1 := 1.
    [t1 := Time millisecondsToRun: [n1 benchmark].
    t1 < 1000] whileTrue: [n1 := n1 * 2]. "Note: #benchmark's runtime is about O(n)"

    "now n1 is the value for which we do the measurement"
    t1 := Time millisecondsToRun: [self run: #benchmark on: n1 times: 16].

    n2 := 28.
    [t2 := Time millisecondsToRun: [r := n2 benchFib].
    t2 < 1000] whileTrue: [n2 := n2 + 1].
    "Note: #benchFib's runtime is about O(k^n),
    where k is the golden number = (1 + 5 sqrt) / 2 = 1.618...."

    "now we have our target n2 and r value.
    let's take the time for it"
    t2 := Time millisecondsToRun: [self run: #benchFib on: n2 times: 16].

    ^ { ((n1 * 16 * 500000 * 1000) // t1). " printString, ' bytecodes/sec; ',"
        ((r * 16 * 1000) // t2) " printString, ' sends/sec'"
      }! !
!Integer methodsFor: 'benchmarks' stamp: 'sm 10/11/2010 22:29'!
run: aSymbol on: aReceiver times: nTimes
    | mtx sig n |

    mtx := Semaphore forMutualExclusion.
    sig := Semaphore new.
    n := nTimes.

    nTimes timesRepeat: [
        [ aReceiver perform: aSymbol.
          mtx critical: [
              n := n - 1.
              (n == 0) ifTrue: [sig signal]]
        ] fork
    ].
    sig wait.! !



Re: [squeak-dev] RoarVM: The Manycore SqueakVM

Igor Stasenko
In reply to this post by Bert Freudenberg

On 4 November 2010 20:07, Bert Freudenberg <[hidden email]> wrote:

>
> On 03.11.2010, at 14:13, Stefan Marr wrote:
>
>> A small teaser:
>>  1 core   66286897 bytecodes/sec;  2910474 sends/sec
>>  8 cores 470588235 bytecodes/sec; 19825677 sends/sec
>
> I tried your precompiled OS X VM and the Sly3 image.
>
> 1 core:  93,910,491 bytecodes/sec; 4,056,440 sends/sec
> 2 cores: 91,559,370 bytecodes/sec; 4,007,927 sends/sec
> 3 cores: can't start
> 4 cores: 90,844,570 bytecodes/sec; 3,935,516 sends/sec
> 5 cores: can't start
> 6 cores: can't start
> 7 cores: can't start
> 8 cores: 89,698,668 bytecodes/sec; 3,910,787 sends/sec
>
> So it looks like you have to use a power-of-two number of cores?
>
> And the benchmark invocation should be different if you want to actually use multiple cores. What's the magic incantation?
>
> I tried something myself:
>
> n := 16.
> q := SharedQueue new.
> time := Time millisecondsToRun:
>        [n timesRepeat: [[q nextPut: [30 benchFib] timeToRun] fork].
>        n timesRepeat: [Transcript space; show: q next]].
> Transcript space; show: time; cr
>
> 1 core:  664 664 665 666 667 662 664 664 668 665 667 665 666 669 666 10700
> 2 cores: 675 674 672 669 677 669 669 672 678 670 668 669 674 668 668 5425
> 4 cores: 721 726 729 740 713 728 740 734 731 737 721 737 734 756 788 749 3030
> 8 cores: 786 807 837 847 865 872 916 840 800 873 792 880 846 865 829 1820
>
> Now that scales pretty nicely :) The overhead is about 25% at 8 cores, 12% for 4 cores.
>
I don't like this tendency. For 16 cores it will be 50%, and for 32, 100% :)
That doesn't sound like 'designed for manycore systems'.
But I suspect that it's because the code you are running doesn't take the new VM
capabilities into account.
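
The extrapolation behind those two figures can be written down explicitly. The sketch assumes that the overhead keeps doubling with every doubling of the core count, starting from the roughly 25% reported for 8 cores. That is a projection of this assumption only, not a measurement, and the names are ad hoc.

```smalltalk
"Projection under the assumption that overhead doubles whenever the
 number of cores doubles, starting from about 25% at 8 cores."
| cores overhead projection |
cores := 8.
overhead := 25.
projection := OrderedCollection new.
[cores < 32] whileTrue: [
	cores := cores * 2.
	overhead := overhead * 2.
	projection add: cores -> overhead].
projection
```

This yields 16 -> 50 and 32 -> 100, i.e. the 50% and 100% mentioned above.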

> For our regular interpreter (*) I get:
> 1 core: 162 159 157 158 158 160 159 159 159 159 159 158 160 158 159 2585
>
> So RoarVM is about 4 times slower in sends, even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. So the good news is that it can beat the old interpreter :)  But why is it so much slower than the normal interpreter?
>

I would not care much about single-core performance for now. Once
you have the potential of hundreds of cores at your disposal,
you can even run things at a lower clock rate, because it does not really
matter anymore.

> Btw, user interrupt didn't work on the Mac.
>
> And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.
>
Not a surprise. An image and code which are not aware of the new VM
capabilities usually gain nothing, and even lose compared to the
'standard' VM.

Hydra VM was able to run multiple interpreters in a single process
space, and the overhead of this is a 5-10% performance degradation.
But given that you can run N interpreters in parallel, such a slowdown
can be neglected.


> - Bert -
>
> (*) For comparison, a regular interpreter (not Cog) on this machine gets
>    789,514,263 bytecodes/sec; 17,199,374 sends/sec
> and Cog does
>    880,481,513 bytecodes/sec; 70,113,306 sends/sec
>
>



--
Best regards,
Igor Stasenko AKA sig.

Re: RoarVM: The Manycore SqueakVM

Bert Freudenberg
In reply to this post by Stefan Marr

On 04.11.2010, at 19:36, Stefan Marr wrote:

> On 04 Nov 2010, at 19:07, Bert Freudenberg wrote:
>
>> So it looks like you have to use a power-of-two number of cores?
> Yes, that is right. At the moment, the system isn't able to handle other numbers of cores.
>
>
>> And the benchmark invocation should be different if you want to actually use multiple cores. What's the magic incantation?
> The code I used to generate the numbers isn't actually in any image yet.
> I pasted it below for reference; it's just a quick hack to have a parallel tinyBenchmarks version.
>
>
>> So RoarVM is about 4 times slower in sends, even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. So the good news is that it can beat the old interpreter :)  But why is it so much slower than the normal interpreter?
> Well, on the one hand, we don't use stuff like the GCC label-as-value extension to have threaded-interpretation, which should help quite a bit.
> Then, the current implementation based on pthreads is quite a bit slower than our version which uses plain Unix processes.
> The GC is really not state of the art.
> And all that adds up rather quickly I suppose...

Hmm, that doesn't sound like it should make it 4x slower ...

>> Btw, user interrupt didn't work on the Mac.
> Cmd+. ? Works for me ;) Well, can you be a bit more specific? In which situation did it not work?

I was doing the equivalent of

        SharedQueue new next

and that does not seem to be interruptible. Also, when there are multiple processes, closing the window does not quit all processes, and even ctrl-c did not quit the VM.
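
For context: next on an empty SharedQueue waits until something is put into the queue, so the expression above never returns on its own. Below is a minimal sketch of one way to try this out without blocking the UI process, assuming a standard Squeak image. It only illustrates the blocking behaviour; it does not explain why the user interrupt fails on RoarVM. The names (q, waiter, received) are ad hoc.

```smalltalk
"Let a forked process do the blocking read, then stop it explicitly."
| q waiter received |
q := SharedQueue new.
received := nil.
waiter := [received := q next] fork.      "the wait happens in the forked process"
(Delay forMilliseconds: 100) wait.        "give the waiter time to run and block"
"Nothing was ever put into q, so the waiter is still blocked; terminate it."
waiter terminate.
received
```

Since nothing is ever put into the queue, received stays nil and the waiting process has to be terminated from the outside.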

>> And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.
> Yes, same here. Sorry. Any hints where to start looking to fix such issues are appreciated.

Hmm. There are long freezes of many seconds and I would have no idea where to start even ...

- Bert -



Re: RoarVM: The Manycore SqueakVM

Stefan Marr

Hi Bert:

On 04 Nov 2010, at 20:20, Bert Freudenberg wrote:

>>> So RoarVM is about 4 times slower in sends, even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. So the good news is that it can beat the old interpreter :)  But why is it so much slower than the normal interpreter?
>> Well, on the one hand, we don't use stuff like the GCC label-as-value extension to have threaded-interpretation, which should help quite a bit.
>> Then, the current implementation based on pthreads is quite a bit slower than our version which uses plain Unix processes.
>> The GC is really not state of the art.
>> And all that adds up rather quickly I suppose...
>
> Hmm, that doesn't sound like it should make it 4x slower ...
Do you know some numbers for the switch/case-based vs. the threaded version on the standard VM?
How much do you typically gain by it?

One thing I forgot to mention in this context is the object table we use.
That is also something which is not exactly making the VM faster.



>>> Btw, user interrupt didn't work on the Mac.
>> Cmd+. ? Works for me ;) Well, can you be a bit more specific? In which situation did it not work?
>
> I was doing the equivalent of
>
> SharedQueue new next
>
> and that does not seem to be interruptible. Also, when there are multiple processes, closing the window does not quit all processes, and even ctrl-c did not quit the VM.

Closing the window, how does that relate to processes? You mean a window inside the image?
Per se, processes are not really owned or managed, so there is also nobody to kill processes. However, I am not sure what you are referring to exactly.


Ctrl-C doesn't work, that's true.
However, closing the X11 window usually does the trick ;)

>
>>> And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.
>> Yes, same here. Sorry. Any hints where to start looking to fix such issues are appreciated.
>
> Hmm. There are long freezes of many seconds and I would have no idea where to start even ...
Ok, several seconds, hm. That does not really sound like the GC pauses I see.
But I haven't used the Squeak image a lot myself on the RoarVM.
I was more thinking in the direction of what kind of tricks are currently pulled to make things fast.
Perhaps, the X11 interface is already not the fastest compared to the standard approach, or there are some plugins in the VM which help performance but aren't included in RoarVM yet. I have never looked at the Squeak VM code myself, so I don't know a lot about what is actually done there.

Best regards
Stefan


>
> - Bert -
>
>





Re: RoarVM: The Manycore SqueakVM

Bert Freudenberg

On 04.11.2010, at 21:18, Stefan Marr wrote:

> Hi Bert:
>
> On 04 Nov 2010, at 20:20, Bert Freudenberg wrote:
>
>>>> So RoarVM is about 4 times slower in sends, even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. So the good news is that it can beat the old interpreter :)  But why is it so much slower than the normal interpreter?
>>> Well, on the one hand, we don't use stuff like the GCC label-as-value extension to have threaded-interpretation, which should help quite a bit.
>>> Then, the current implementation based on pthreads is quite a bit slower than our version which uses plain Unix processes.

Wait. What do you mean by "current version" vs. "our version"?

>>> The GC is really not state of the art.
>>> And all that adds up rather quickly I suppose...
>>
>> Hmm, that doesn't sound like it should make it 4x slower ...
> Do you know some numbers for the switch/case-based vs. the threaded version on the standard VM?
> How much do you typically gain by it?

I don't really remember but it was well below 50%, more like 10%-20% I think.

> One thing I forgot to mention in this context is the object table we use.
> That is also something which is not exactly making the VM faster.

Ah, yes. That could make quite a difference. You wouldn't be calling a function for each object access though I hope?

>>>> Btw, user interrupt didn't work on the Mac.
>>> Cmd+. ? Works for me ;) Well, can you be a bit more specific? In which situation did it not work?
>>
>> I was doing the equivalent of
>>
>> SharedQueue new next
>>
>> and that does not seem to be interruptible. Also, when there are multiple processes, closing the window does not quit all processes, and even ctrl-c did not quit the VM.
>
> Closing the window, how does that relate to processes? You mean a window inside the image?
> Per se, processes are not really owned or managed, so there is also nobody to kill processes. However, I am not sure what you are referring to exactly.

The X11 window.

> Ctrl-C doesn't work, that's true.
> However, closing the X11 window usually does the trick ;)

Well, it didn't. At least not immediately. Took several seconds and ctrl-c's until I got my prompt back.

>>>> And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.
>>> Yes, same here. Sorry. Any hints where to start looking to fix such issues are appreciated.
>>
>> Hmm. There are long freezes of many seconds and I would have no idea where to start even ...
> Ok, several seconds, hm. That does not really sound like the GC pauses I see.
> But I haven't used the Squeak image a lot myself on the RoarVM.
> I was more thinking in the direction of what kind of tricks are currently pulled to make things fast.
> Perhaps, the X11 interface is already not the fastest compared to the standard approach, or there are some plugins in the VM which help performance but aren't included in RoarVM yet. I have never looked at the Squeak VM code myself, so I don't know a lot about what is actually done there.
>
> Best regards
> Stefan

I guess some profiling is in order ...

- Bert -



Re: RoarVM: The Manycore SqueakVM

ungar
In reply to this post by Andreas.Raab
 
The "trick" seems to me to have been lots of blood sweat and tears.
Debugging this thing has been some of the toughest work I've tackled at times, and I bet Stefan would agree.
A separate interpreter per core, a common address space for the objects, separate memory areas for each core, each object freely references any other object,
an object table to make it easy to move objects from core to core, a straightforward extension of ProcessorScheduler, and a too-simple-at-the-moment garbage collector.

Lots of interesting bits in the safepoint and messaging systems.
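
To make the object-table idea above concrete, here is a minimal sketch in Python. It is not RoarVM code: the class `ObjectTable` and its methods `allocate`, `deref`, `core_of`, and `move` are hypothetical names chosen for illustration. The assumption is simply that references store a stable table index rather than a direct address, so relocating an object only touches one table slot.

```python
class ObjectTable:
    """Hypothetical object table: maps stable object ids to (core, payload)."""

    def __init__(self):
        self._slots = []  # list index doubles as the object id

    def allocate(self, core, payload):
        # Create an object in the given core's memory area and return its id.
        self._slots.append((core, payload))
        return len(self._slots) - 1

    def deref(self, oid):
        # Every object access pays for this extra lookup.
        return self._slots[oid][1]

    def core_of(self, oid):
        return self._slots[oid][0]

    def move(self, oid, new_core):
        # Only the table slot changes; objects holding the id are untouched.
        _, payload = self._slots[oid]
        self._slots[oid] = (new_core, payload)


table = ObjectTable()
point = table.allocate(core=0, payload={"x": 3, "y": 4})
holder = table.allocate(core=1, payload={"ref": point})  # stores the id, not an address

table.move(point, new_core=5)
target = table.deref(table.deref(holder)["ref"])  # still reachable after the move
```

The trade-off is visible in `deref`: moving an object is cheap, but every access goes through one more indirection.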

- David




On Nov 3, 2010, at 9:57 AM, Andreas Raab wrote:

> On 11/3/2010 6:13 AM, Stefan Marr wrote:
>> We are happy to announce, now officially, RoarVM: the first single-image
>> manycore virtual machine for Smalltalk.
>
> Congrats! That's a great step forward.
>
>> RoarVM is based on the work of David Ungar and Sam S. Adams at IBM Research.
>> The port to x86 multicore systems was done by me. They open-sourced their VM,
>> formerly known as Renaissance VM (RVM), under the Eclipse Public License [1].
>> Official announcement of the IBM source code release:
>>   http://soft.vub.ac.be/~smarr/rvm-open-source-release/
>>
>> The source code of the RoarVM has been released as open source to enable the
>> Smalltalk community to evaluate the ideas and possibly integrate them into
>> existing systems. So, the RoarVM is meant to experiment with Smalltalk systems
>> on multi- and manycore machines.
>
> Can you give us a high-level overview about what "the trick" is? I.e., which approach did you take to make this possible?
>
>> If the community does not object, we would like to occupy the
>> [hidden email] mailinglist for related discussions. So, if
>> you run into any trouble while experimenting with the RoarVM, do not hesitate
>> to report any problems and ask any questions.
>
> Sounds great. Thanks for sharing.
>
> Cheers,
>  - Andreas


Re: [squeak-dev] RoarVM: The Manycore SqueakVM

ungar
In reply to this post by Stefan Marr
 
Hold on... don't we run on 56 cores on Tilera?


On Nov 4, 2010, at 11:36 AM, Stefan Marr wrote:

>
> Hi Bert:
>
>
> On 04 Nov 2010, at 19:07, Bert Freudenberg wrote:
>
>> So it looks like you have to use a power-of-two cores?
> Yes, that is right. At the moment, the system isn't able to handle other numbers of cores.
>
>
>> And the benchmark invocation should be different if you want to actually use multiple cores. What's the magic incantation?
> The code I used to generate the numbers isn't actually in any image yet.
> I pasted it below for reference, it's just a quick hack to have a parallel tinyBenchmarks version.
>
>
>> So RoarVM is about 4 times slower in sends, even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. So the good news is that it can beat the old interpreter :)  But why is it so much slower than the normal interpreter?
> Well, on the one hand, we don't use stuff like the GCC label-as-value extension to have threaded-interpretation, which should help quite a bit.
> Then, the current implementation based on pthreads is quite a bit slower than our version which uses plain Unix processes.
> The GC is really not state of the art.
> And all that adds up rather quickly I suppose...
>
>
>> Btw, user interrupt didn't work on the Mac.
> Cmd+. ? Works for me ;) Well, can you be a bit more specific? In which situation did it not work?
>
>
>> And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.
> Yes, same here. Sorry. Any hints where to start looking to fix such issues are appreciated.
>
>
>
> Best regards
> Stefan
>
>
> My tiny Benchmarks:
>> !Integer methodsFor: 'benchmarks' stamp: 'sm 10/11/2010 22:30'!
>> tinyBenchmarksParallel16Processes
>>     "Report the results of running the two tiny Squeak benchmarks.
>>     ar 9/10/1999: Adjusted to run at least 1 sec to get more stable results"
>>     "0 tinyBenchmarks"
>>     "On a 292 MHz G3 Mac: 22727272 bytecodes/sec; 984169 sends/sec"
>>     "On a 400 MHz PII/Win98:  18028169 bytecodes/sec; 1081272 sends/sec"
>>     | t1 t2 r n1 n2 |
>>     n1 := 1.
>>     [t1 := Time millisecondsToRun: [n1 benchmark].
>>      t1 < 1000] whileTrue: [n1 := n1 * 2]. "Note: #benchmark's runtime is about O(n)"
>>
>>     "now n1 is the value for which we do the measurement"
>>     t1 := Time millisecondsToRun: [self run: #benchmark on: n1 times: 16].
>>
>>     n2 := 28.
>>     [t2 := Time millisecondsToRun: [r := n2 benchFib].
>>      t2 < 1000] whileTrue: [n2 := n2 + 1].
>>     "Note: #benchFib's runtime is about O(k^n),
>>     where k is the golden ratio = (1 + 5 sqrt) / 2 = 1.618...."
>>
>>     "now we have our target n2 and r value.
>>     let's take the time for it"
>>     t2 := Time millisecondsToRun: [self run: #benchFib on: n2 times: 16].
>>
>>     ^ { ((n1 * 16 * 500000 * 1000) // t1). " printString, ' bytecodes/sec; ',"
>>         ((r * 16 * 1000) // t2) " printString, ' sends/sec'"
>>       }! !
>> !Integer methodsFor: 'benchmarks' stamp: 'sm 10/11/2010 22:29'!
>> run: aSymbol on: aReceiver times: nTimes
>>     | mtx sig n |
>>
>>     mtx := Semaphore forMutualExclusion.
>>     sig := Semaphore new.
>>     n := nTimes.
>>
>>     nTimes timesRepeat: [
>>         [ aReceiver perform: aSymbol.
>>           mtx critical: [
>>               n := n - 1.
>>               (n == 0) ifTrue: [sig signal]]
>>         ] fork
>>     ].
>>     sig wait.! !
>
> --
> Stefan Marr
> Software Languages Lab
> Vrije Universiteit Brussel
> Pleinlaan 2 / B-1050 Brussels / Belgium
> http://soft.vub.ac.be/~smarr
> Phone: +32 2 629 2974
> Fax:   +32 2 629 3525
>
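
For readers who do not read Smalltalk, the fork-and-join pattern in the quoted `run:on:times:` helper can be sketched roughly as follows in Python. This is an illustration only: the function names `run_parallel` and `sends_per_sec` are hypothetical, collecting the results in a list is an addition that the Smalltalk version does not have, and Python threads will not show the multicore speedup the benchmark is meant to measure.

```python
import threading


def run_parallel(task, n_times):
    """Start n_times workers running task() and block until the last one ends.

    Like the quoted Smalltalk helper, this waits forever if n_times is 0.
    """
    mutex = threading.Lock()          # plays the role of mtx
    done = threading.Semaphore(0)     # plays the role of sig
    remaining = [n_times]             # shared countdown, guarded by mutex
    results = []                      # addition: not in the Smalltalk version

    def worker():
        value = task()
        with mutex:
            results.append(value)
            remaining[0] -= 1
            if remaining[0] == 0:
                done.release()        # the last worker wakes the waiting caller

    for _ in range(n_times):
        threading.Thread(target=worker).start()
    done.acquire()
    return results


def sends_per_sec(sends_per_run, n_times, elapsed_ms):
    # Same arithmetic as (r * 16 * 1000) // t2 in the quoted benchmark.
    return (sends_per_run * n_times * 1000) // elapsed_ms


results = run_parallel(lambda: 7, 16)
```

The semaphore starts at zero, so the caller blocks until the worker that decrements the counter to zero signals it, which mirrors `sig wait` in the original.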


Re: RoarVM: The Manycore SqueakVM

ungar
In reply to this post by Bert Freudenberg
 
On performance: we have done very little to tune it. Please feel free to pitch in! I suspect you might find some tasty, low-hanging fruit.

- David


On Nov 4, 2010, at 1:49 PM, Bert Freudenberg wrote:

>
> On 04.11.2010, at 21:18, Stefan Marr wrote:
>
>> Hi Bert:
>>
>> On 04 Nov 2010, at 20:20, Bert Freudenberg wrote:
>>
>>>>> So RoarVM is about 4 times slower in sends, even more so for bytecodes. It needs 8 cores to be faster than the regular interpreter on a single core. So the good news is that it can beat the old interpreter :)  But why is it so much slower than the normal interpreter?
>>>> Well, on the one hand, we don't use stuff like the GCC label-as-value extension to have threaded-interpretation, which should help quite a bit.
>>>> Then, the current implementation based on pthreads is quite a bit slower than our version which uses plain Unix processes.
>
> Wait. What do you mean by "current version" vs. "our version"?
>
>>>> The GC is really not state of the art.
>>>> And all that adds up rather quickly I suppose...
>>>
>>> Hmm, that doesn't sound like it should make it 4x slower ...
>> Do you know some numbers for the switch/case-based vs. the threaded version on the standard VM?
>> How much do you typically gain by it?
>
> I don't really remember but it was well below 50%, more like 10%-20% I think.
>
>> One thing I forgot to mention in this context is the object table we use.
>> That is also something which is not exactly making the VM faster.
>
> Ah, yes. That could make quite a difference. You wouldn't be calling a function for each object access though I hope?
>
>>>>> Btw, user interrupt didn't work on the Mac.
>>>> Cmd+. ? Works for me ;) Well, can you be a bit more specific? In which situation did it not work?
>>>
>>> I was doing the equivalent of
>>>
>>> SharedQueue new next
>>>
>>> and that does not seem to be interruptible. Also, when there are multiple processes, closing the window does not quit all processes, and even ctrl-c did not quit the VM.
>>
>> Closing the window, how does that relate to processes? You mean a window inside the image?
>> Per se, processes are not really owned or managed, so there is also nobody to kill processes. However, I am not sure what you are referring to exactly.
>
> The X11 window.
>
>> Ctrl-C doesn't work, that's true.
>> However, closing the X11 window usually does the trick ;)
>
> Well, it didn't. At least not immediately. It took several seconds and several ctrl-c's until I got my prompt back.
>
>>>>> And in the Squeak-4.1 image, when running on 2 or more cores Morphic gets incredibly sluggish, pretty much unusably so.
>>>> Yes, same here. Sorry. Any hints where to start looking to fix such issues are appreciated.
>>>
>>> Hmm. There are long freezes of many seconds and I would have no idea where to start even ...
>> Ok, several seconds, hm. That does not really sound like the GC pauses I see.
>> But I haven't used the Squeak image a lot myself on the RoarVM.
>> I was more thinking in the direction of what kind of tricks are currently pulled to make things fast.
>> Perhaps, the X11 interface is already not the fastest compared to the standard approach, or there are some plugins in the VM which help performance but aren't included in RoarVM yet. I have never looked at the Squeak VM code myself, so I don't know a lot about what is actually done there.
>>
>> Best regards
>> Stefan
>
> I guess some profiling is in order ...
>
> - Bert -
>
>


Re: RoarVM: The Manycore SqueakVM

Stefan Marr
In reply to this post by ungar

Hi David:

On 05 Nov 2010, at 05:07, [hidden email] wrote:
>>> So it looks like you have to use a power-of-two cores?
>> Yes, that is right. At the moment, the system isn't able to handle other numbers of cores.
> Hold on... don't we run on 56 cores on Tilera?
Yes, however, on OSX and standard Linux there are problems with certain configurations.
If I remember correctly mmap or something fails with certain kinds of requests.
Something related to the number of cores.

Have it on my todo list.
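
Until then, a launcher script could round a requested core count down to the nearest power of two before starting the VM. A minimal sketch in Python, with hypothetical helper names that are not part of RoarVM:

```python
def is_power_of_two(n):
    # A positive power of two has exactly one bit set.
    return n > 0 and (n & (n - 1)) == 0


def largest_power_of_two_at_most(n):
    # Assumes n >= 1; keeps only the highest set bit of n.
    return 1 << (n.bit_length() - 1)


requested = 56
usable = requested if is_power_of_two(requested) else largest_power_of_two_at_most(requested)
```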

Best regards
Stefan


--
Stefan Marr
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://soft.vub.ac.be/~smarr
Phone: +32 2 629 2974
Fax:   +32 2 629 3525
