Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)


Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Clément Béra
 
Hi all,

Tim's just shared this lovely article about a 10,000+ core ARM machine. With machines like this, it's a bit stupid to use only 1 core when you have 10,000+. I believe we have to find a way to introduce multi-threading in Squeak / Pharo. For co-processors like the Xeon Phi or graphics cards, I guess it's OK not to use them because they're not general-purpose processors while the VM is general purpose, but all those 10,000 cores...

For parallel programming, we could consider doing something cheap like the parallel C# loops (Parallel.For and co). The Smalltalk programmer would then explicitly write "collection parallelDo: aBlock" instead of "collection do: aBlock", and if the block is long enough to execute, the cost of parallelisation becomes negligible compared to the performance boost it brings. The block has to perform independent tasks: if multiple blocks executed in parallel read/write the same memory location then, as in C#, the behavior is undefined, leading to freezes / crashes. It's the responsibility of the programmer to find out whether loop iterations are independent or not (and it's not obvious).
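
To make that concrete, here is a minimal Smalltalk sketch of what such a parallelDo: could look like, using today's fork and Semaphore machinery. To be clear: on the current VM all these Processes share one native thread, so this only illustrates the intended API and contract, not real multi-core execution, and parallelDo: itself does not exist yet.

    SequenceableCollection >> parallelDo: aBlock
        "Evaluate aBlock with each element, each in its own Process, and wait
        for all of them to finish. The iterations must be independent."
        | done |
        done := Semaphore new.
        self do: [:each |
            [[aBlock value: each] ensure: [done signal]] fork].
        self size timesRepeat: [done wait]

Usage would then be, for example, (1 to: 100) parallelDo: [:i | self heavyComputationOn: i], where heavyComputationOn: is of course just a placeholder.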

For concurrent programming, there's the design from E: we could have an actor model in Smalltalk where each actor is completely independent of the others, with one native thread per actor, and all the common objects (including what's necessary for look-up, such as method dictionaries) could be shared as long as they're read-only or immutable. Mutating a shared object, such as installing a method in a method dictionary, would be detected because such objects are read-only, and we could stop all the threads sharing the object in order to mutate it. The programmer has to keep mutation of shared objects uncommon to get good performance.
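
As a rough sketch of the detection half, assuming a Spur-style per-object read-only bit roughly like the one Pharo exposes (the selector names beReadOnlyObject and ModificationForbidden are my assumption of that API and may differ between versions):

    | shared |
    shared := Array with: 1 with: 2 with: 3.
    shared beReadOnlyObject.
    [shared at: 1 put: 42]
        on: ModificationForbidden
        do: [:err |
            "In the proposed design, this is where the VM would stop every
            thread sharing the object, apply the write, and resume them."
            Transcript show: 'write to a shared object intercepted'; cr]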

Both designs have different goals in using multiple cores (parallel vs. concurrent programming), but in both cases we don't need to rewrite any library to make Squeak / Pharo multi-threaded, as they did in Java.

What do you think ? 

Is there anybody on the mailing list with ideas on how to introduce threads in Squeak / Pharo in a cheap way that does not require rewriting all the core/collection libraries ?

I'm not really into multi-threading myself, but I believe the Cog VM will die 10 years from now if we don't add something to support multi-threading, so I would like to hear suggestions.

Best,

Clement

PhD Student
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq

Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Levente Uzonyi
 
A bit more than seven years ago, I had this idea:

http://forum.world.st/Multiprocessing-with-Squeak-td1312224.html

And I think it would still be the best solution.

Levente

Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

David T. Lewis
In reply to this post by Clément Béra
 
Hi Clement,

Well, since you mentioned "cheap" ;-)

One idea that I have been interested in for some time is that of using
RemoteTask (http://wiki.squeak.org/squeak/6176) for multiprocessing, in
conjunction with something like Nicolas' Smallapack
(http://www.squeaksource.com/Smallapack).

I am expecting that the Spur memory model will allow memory writes in the
object memory to be very well localized compared to the V3 memory. That
should open the possibility of very efficient memory utilization with
coarse-grained multiprocessing, even for very large object memories and
large numbers of cooperating images. And it would completely bypass the
difficult problem of implementing multi-threading in the image/VM.
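
As a usage sketch only (the selectors here are from memory and purely illustrative, so please check the wiki page for the real protocol), the idea is simply to hand a block to a forked child image and get the answer back, letting the OS put the child on another core:

    "Hypothetical usage sketch of the RemoteTask idea; the exact API may differ."
    | result |
    result := RemoteTask do: [(1 to: 1000000) inject: 0 into: [:sum :each | sum + each]].
    Transcript show: result printString; cr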

Dave





Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

cbc
In reply to this post by Clément Béra
 
Another option mentioned occasionally: RoarVM, a multi-core (research) VM for Squeak, tested on 'up to 59 cores'.

-cbc




Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Ben Coman
In reply to this post by Clément Béra
 

My naive idea is that lots might be simplified by having spawned
cputhreads use a different bytecode set that enforces a functional
style of programming by having no write bytecodes.  While restrictive, my
inspiration is that functional languages are supposedly more suited to
parallelism by having no shared state.  So all algorithms must work on
the stack only, which may be simpler than managing multiple updaters to
the object space.  This may(?) avoid the need to garbage collect the 1000
cputhreads, since everything gets cleared away when the stack dies with
the thread.  On the flip side, we might not want to scan those 1000
cputhreads when garbage collecting the main image thread.  So these
cputhreads might have a marshaling area that reference-counts object
accesses external to the thread, and the garbage collector only needs
to scan that area.  Or, alternatively, each cputhread maintains its own
object space that pulls in copies of objects, Spoon style.

Would each cputhread need its own method cache?  Since the application
may have a massive number of individually short-lived calculations, to
minimise method lookups perhaps a self-contained
mini-object-space/method-cache could be seeded / warmed up by the single-
threaded main image and copied to each spawned cputhread along with the
parameters passed to the first invoked function.

Presumably a major use case for these multiple threads would be
numeric calculations.  So perhaps you get enough bang for the buck by
restricting cputhreads to operate only on immediate types?
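
For instance, a block like the one below computes only with SmallIntegers,
so in principle a restricted cputhread could run it without ever touching
the shared object space.  Just a sketch of the idea, and note that
overflowing into LargePositiveInteger would already step outside the
immediate world, which shows how tight the restriction is.

    | sumOfSquares |
    sumOfSquares := [:n | | sum |
        sum := 0.
        1 to: n do: [:i | sum := sum + (i * i)].
        sum].
    sumOfSquares value: 1000    "=> 333833500, still a SmallInteger"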

Another idea is for cputhreads to be written in Slang, dynamically
compiled and executed as native code, completely avoiding the complexity
of managing multiple accesses to the object space.

cheers -ben

Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

timrowledge
 
Why worry about sharing object space by sharing the same memory space? A gazillion VMs running on a gazillion processors, whether separate chips or same-die cores or, most likely, a mix of both, with communication channels between them, seems more likely to be effective. Forget trying to coherently garbage collect across many cores, etc. Each looks after its own world, and in a sense each becomes its own object, in something getting a bit closer to some of the ideas Alan originally wrote about regarding cells and messages in a biological sense.

Leave the OS to sort out shared memory on any particular core(s) for executable sharing or even object-space sharing. We’d likely want to have the VM normally configured with no internal plugins, since most instances would never need to waste space on file or display stuff, etc.

It’s 35+ years ago but my last experience in very parallel systems left me convinced that the first thing you do is prioritise the inter-process communication and leave the ‘real work’ as something to do in the machine’s spare time. I had a Meiko Transputer Computing Surface when I was an IBM research fellow around about the time coal beds were being laid down.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- A prime candidate for natural deselection.



Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Frank Shearar-3
In reply to this post by Ben Coman
 
On 30 January 2017 at 17:15, Ben Coman <[hidden email]> wrote:


> My naive idea is that lots might be simplified by having spawned
> cputhreads use a different bytecode set that enforces a functional
> style of programming by having no write bytecodes.  While restrictive, my
> inspiration is that functional languages are supposedly more suited to
> parallelism by having no shared state.  So all algorithms must work on
> the stack only

No: functional languages often share state. It's just that they share _immutable_ state. Or if you prefer, you can't tell if two threads are accessing the same data, or merely identical data.

For example, in Erlang, large binaries (bigger than 64 bytes) are shared between processes on the same machine, because it's much more efficient to share a pointer.

To make things slightly more confusing, the rule is more generally "functions are APPARENTLY pure". In languages like Clojure or F#, it's quite acceptable to use locally mutable state, as long as no one gets to see you cheat.

(ML languages are capable of sharing mutable state, it's just that you have to opt into such things through "ref" or "mutable" markers on things.)

frank
 


Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Ronie Salgado
 
Hi all,
 
> It’s 35+ years ago but my last experience in very parallel systems left me convinced that the first thing you do is prioritise the inter-process communication and leave the ‘real work’ as something to do in the machine’s spare time. I had a Meiko Transputer Computing Surface when I was an IBM research fellow around about the time coal beds were being laid down.
I agree. Once you have more than a dozen (or a few dozen) cores, shared memory starts to become the biggest bottleneck. Supercomputers and clusters are not built the way a traditional machine is built. A big single computer with multiple CPUs is usually a NUMA machine (non-uniform memory access). A cluster is composed of several nodes, which are independent computers connected via a very fast network, but the network connection is still slower than shared-memory communication. Each one of the nodes in a cluster can also be a NUMA machine.

For this kind of machine, threads are useless in comparison with inter-process and inter-node communication. IPC is usually done via MPI (Message Passing Interface) instead of shared memory.

Threads are more useful when one needs high performance and low latency in an application that runs in a single computer. High performance video games and (soft) realtime graphics are usually in this domain.

Best regards,
Ronie





Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Clément Béra
 
Hi all,

Ronie, you said: 

> Threads are more useful when one needs high performance and low latency in an application that runs in a single computer. High performance video games and (soft) realtime graphics are usually in this domain.

I know you're working with high-performance video games. If you were to introduce multi-threading in Squeak/Pharo, how would you do it ? In particular, do you have a design in mind that does not require rewriting all the core libraries ?

To sum up previous mails:

1) There's the idea of having multiple images communicating together, each image on a different VM, potentially with 1 native thread per image. I think there is work ongoing in this direction through multiple frameworks. With a minimal image and a minimal VM, the cost of the image+VM pair remains quite cheap: already today, <15Mb for the pair is possible. I believe this idea is great but does not entirely solve the problem.

2) Levente's idea is basically to share objects between images, the shared objects being read-only and lazily duplicated to worker images upon mutation, to get low-memory-footprint images on the same VM. I like the idea. I was thinking of stopping threads to mutate shared objects and giving the programmer the responsibility to define a set of shared objects that are not frequently mutated, instead of duplicating, and then going later in the direction of shared writable memory.

3) Ben's idea is to create a process in a new thread that cannot mutate objects in memory. I have issues with this design because each worker thread, as you say, has to work only with the stack, hence it cannot allocate objects, hence it cannot use closures (see the small example after this list).

4) I need to look into the RoarVM project again and Dave Ungar's work on multi-threaded Smalltalk. I should contact Stefan Marr again, I guess.

5) I didn't mention it earlier, but there's Eliot's work on ThreadedFFI to use multiple native threads when using FFI. It also solves part of the multi-threading problem.
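
Regarding point 3, this is the kind of thing I mean (plain Squeak/Pharo code): creating a block that captures state reifies a BlockClosure object on the heap, so a strictly stack-only, no-allocation thread could not even build one.

    | factor scaler |
    factor := 3.
    scaler := [:x | x * factor].    "allocates a BlockClosure capturing factor"
    scaler value: 14                "=> 42"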

Thanks for sharing ideas.







Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Ben Coman
 
On Tue, Jan 31, 2017 at 4:51 PM, Clément Bera <[hidden email]> wrote:
>
> Hi all,
>
> 3) Ben's idea is to create a process in a new thread that cannot mutate objects in memory. I have issues with this design because each worker thread as you say have to work only with the stack, hence they cannot allocate objects, hence they cannot use closures.

Hi Clement,

Could you expand on this a little? I sense there is something for me
to learn here.
What objects need to be allocated?
One case is where a block references a method's local variable. That
seems okay when calling forward to other methods, but may be a problem
when returning the block from the method.
Is the problem that the block is an object, and passing arguments
into it mutates the closure?
Or something else?

cheers -ben

Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Stefan Marr-3
In reply to this post by Clément Béra
 
Hi:

> On 31 Jan 2017, at 09:51, Clément Bera <[hidden email]> wrote:
>
> 1) There's this idea of having a multiple images communicating together, each image on a different VM, potentially 1 native thread per image. I think there is work on-going in this direction through multiple frameworks. With a minimal image and a minimal VM, the cost of the pair image+VM remains quite cheap, already today <15Mb for the pair is possible. I believe this idea is great but does not solve entirely the problem.

That’s the cheapest solution there is. No VM changes required, just plugging together existing things.

> 2) Levente's idea is basically to share objects between images, the shared objects being read-only and lazily duplicated to worker images upon mutation to have low-memory footprint images on the same VM. I like the idea, I was thinking of stopping threads to mutate shared objects and to give the programmer the responsibility to define a set of shared objects that are not frequently mutated instead of duplication, and go later in the direction of shared writable memory.

There are all kinds of variations possible on that theme.

Also, the question is: does it really need to be objects? Alternatives include things like tuple spaces (think Linda) and low-level shared-memory buffers (Python and others, and apparently ECMAScript 2017).
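
To make that alternative concrete, a Linda-style tuple space in Smalltalk could look roughly like this. It is entirely hypothetical, none of these classes or selectors exist; it only shows the flavour of coordinating through a space instead of sharing objects directly:

    | space |
    space := TupleSpace new.
    space out: { #task. 42 }.                 "publish a tuple into the shared space"
    space in: { #task. nil } do: [:tuple |    "blocking, destructive match on the pattern"
        Transcript show: tuple second printString; cr]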

If you go with objects, the problem is that you need to support GC. And, I suppose Eliot will agree that GC for multithreaded systems isn’t exactly zero cost.

> 3) Ben's idea is to create a process in a new thread that cannot mutate objects in memory. I have issues with this design because each worker thread as you say have to work only with the stack, hence they cannot allocate objects, hence they cannot use closures.
>
> 4) I need to look into the Roar VM project again and Dave Ungar's work on multithreaded Smalltalk. I should contact again Stefan Marr I guess.

I am here, and reading…
You still need a GC that’s capable of working in a multithreaded system.
Well, and the rest of the VM should also be designed for that; but there, the image changes and the ‘safety’ for concurrency were minimal.
This is as cheap as it gets for shared multithreading, but of course, all the burden of getting things right is on the application developer.

For a Smalltalk-like language, I’d argue, you’d always want at least a GC/VM that does the right thing.
That’s not easy.

On the language level, with classes, globals, and all those things, I fear Smalltalk as a language isn’t any better than Java. So, if you don’t plan to make a real cut, things will always be messy and strange. Ruby struggles with the same problem. They are talking about ‘Guilds’ (http://olivierlacan.com/posts/concurrency-in-ruby-3-with-guilds/), but you still have shared classes/globals. Python and others with their global interpreter lock are in the same boat, and they work around it with things like ‘multiprocessing’, essentially giving a nicer interface to option 1.

So, option 1 seems to be a rather clean solution. It also gives you a good, natural programming model and the right expectation: strong isolation.
From that, one could think about having multiple independent interpreters with separate heaps within the same CogVM process, to avoid marshaling overhead and stuff. That’s very similar to JavaScript web workers.
From there, one could consider lifting some of the restrictions, perhaps like option 2, or like work we did for JavaScript: http://stefan-marr.de/downloads/oopsla16-bonetta-et-al-gems-shared-memory-parallel-programming-for-nodejs.pdf
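
In Smalltalk terms, such a worker-style surface might look something like the following — again entirely hypothetical, just to show the programming model of separate heaps plus copying message passing:

    | worker result |
    worker := WorkerImage spawn: 'minimal.image'.       "own heap, own interpreter, own GC"
    result := worker send: #primesUpTo: with: 1000000.  "arguments and result are copied, never shared"
    Transcript show: result size printString; cr.
    worker terminate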

Those ways seem to avoid huge VM changes and rewriting a lot of code. Whether the programming model is nice or not is up to personal taste, I suppose.
If you want a programming model that doesn’t introduce any surprises and avoids low-level concurrency issues from the start, you’ll have to bite the bullet and get rid of globals and global classes anyway. Everything else is just as problematic as Java, C#, etc. in that department. But I am biased, because I still like the tradeoffs I get from Newspeak for my work.

Best regards
Stefan

--
Stefan Marr
Johannes Kepler Universität Linz
http://stefan-marr.de/research/




Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Clément Béra
 
Thanks for your advice, Stefan. I was just reading part of your thesis to understand what has to be done. I believe there is work ongoing to remove all the globals (at least in Pharo).

To conclude this thread:

To introduce multi-threading in Squeak / Pharo, the easiest way is to start with multiple image+VM pairs communicating together. It's clean, simple and it works. The problem then lies in the time spent in communication between the image+VM pairs. To lower this communication time, we can have all the images running on the same VM, though they still have independent heaps, caches and interpreters, with communication APIs implemented in the VM. Once this is done, we can try to remove restrictions, for example by having shared memory buffers between images.

Thanks everyone for sharing ideas and remarks.

Best,

Clement





Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Levente Uzonyi
In reply to this post by Stefan Marr-3
 
On Tue, 31 Jan 2017, Stefan Marr wrote:

>
> Hi:
>
>> On 31 Jan 2017, at 09:51, Clément Bera <[hidden email]> wrote:
>>
>> 1) There's this idea of having a multiple images communicating together, each image on a different VM, potentially 1 native thread per image. I think there is work on-going in this direction through multiple frameworks. With a minimal image and a minimal VM, the cost of the pair image+VM remains quite cheap, already today <15Mb for the pair is possible. I believe this idea is great but does not solve entirely the problem.
>
> That’s the cheapest solution there is. No VM changes required, just plugging together existing things.
>
>> 2) Levente's idea is basically to share objects between images, the shared objects being read-only and lazily duplicated to worker images upon mutation to have low-memory footprint images on the same VM. I like the idea, I was thinking of stopping threads to mutate shared objects and to give the programmer the responsibility to define a set of shared objects that are not frequently mutated instead of duplication, and go later in the direction of shared writable memory.
>
> There are all kind of variations possible on that theme.
>
> Also the question is does it really need to be objects? Alternatives include things like tuple spaces (think Linda), low-level shared memory buffers (Python and others, and apparently ECMAScript 2017).
You'd actually share a segment with objects stored in it. Low-level
buffers are very restricting. They force you to serialize objects if you
want to keep using them. And that has some unwanted overhead.

>
> If you go with objects, the problem is that you need to support GC. And, I suppose Eliot will agree that GC for multithreaded systems isn’t exactly zero cost.

You don't need multi-threaded GC here, just many independent
single-threaded GCs, which we have already.
Btw, this is the same thing Erlang does.

Levente


Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Stefan Marr-3
 
Hi Levente:

> On 31 Jan 2017, at 15:22, Levente Uzonyi <[hidden email]> wrote:
>
>> Also the question is does it really need to be objects? Alternatives include things like tuple spaces (think Linda), low-level shared memory buffers (Python and others, and apparently ECMAScript 2017).
>
> You'd actually share a segment with objects stored in it. Low-level buffers are very restricting. They force you to serialize objects if you want to keep using them. And that has some unwanted overhead.

What’s a segment? Who controls the lifetime of it? Are you doing local GC plus global reference counting?
Somehow you’d still manage those objects, no?


>> If you go with objects, the problem is that you need to support GC. And, I suppose Eliot will agree that GC for multithreaded systems isn’t exactly zero cost.
>
> You don't need multi-threaded GC here, just many independent single-threaded GCs, which we have already.
> Btw, this is the same thing Erlang does.

I am probably missing something, but I’d think you need some global GC mechanism. If you got shared objects, you need to coordinate the local GCs.
In Erlang, most messages are copied, only large data chunks are shared by reference. So, that restricts the need for globally coordinated GC quite a bit, but you still need it as far as I can tell.

Best regards
Stefan


--
Stefan Marr
Johannes Kepler Universität Linz
http://stefan-marr.de/research/




Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Levente Uzonyi
 
On Tue, 31 Jan 2017, Stefan Marr wrote:

>
> Hi Levente:
>
>> On 31 Jan 2017, at 15:22, Levente Uzonyi <[hidden email]> wrote:
>>
>>> Also the question is does it really need to be objects? Alternatives include things like tuple spaces (think Linda), low-level shared memory buffers (Python and others, and apparently ECMAScript 2017).
>>
>> You'd actually share a segment with objects stored in it. Low-level buffers are very restricting. They force you to serialize objects if you want to keep using them. And that has some unwanted overhead.
>
> What’s a segment?
It's a read-only chunk of memory holding objects.

> Who controls the lifetime of it?

It's permanent.

> Are you doing local GC plus global reference counting?

GC never touches that memory, because it can't change.

> Somehow you’d still manage those objects, no?

No.

>
>
>>> If you go with objects, the problem is that you need to support GC. And, I suppose Eliot will agree that GC for multithreaded systems isn’t exactly zero cost.
>>
>> You don't need multi-threaded GC here, just many independent single-threaded GCs, which we have already.
>> Btw, this is the same thing Erlang does.
>
> I am probably missing something, but I’d think you need some global GC mechanism. If you got shared objects, you need to coordinate the local GCs.

All shared objects are permanent and read-only.

> In Erlang, most messages are copied, only large data chunks are shared by reference. So, that restricts the need for globally coordinated GC quite a bit, but you still need it as far as I can tell.

Here objects shared by reference would be permanent, therefore no GC would
be required.
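
The intended usage is roughly this (hypothetical selectors, nothing like this
exists yet): objects get copied into a segment once, the segment is frozen,
and from then on any image can reference its contents with no GC coordination
at all.

    | segment |
    segment := SharedSegment copyingRoots: { 'shared lookup table' copy. #(1 2 3) }.
    segment freeze.    "becomes permanent and read-only; local GCs never trace into it"
    segment roots do: [:each | self assert: [each isReadOnlyObject]]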

Levente


Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Ronie Salgado
 
Hi All,

>> Threads are more useful when one needs high performance and low latency in an application that runs in a single computer. High performance video games and (soft) realtime graphics are usually in this domain.

> I know you're working with high-performance video games. If you were to introduce multi-threading in Squeak/Pharo, how would you do it ? In particular, do you have a design in mind that does not require rewriting all the core libraries ?
For Pharo I am going the GPGPU route, using either OpenCL or a low-level graphics API (Vulkan, D3D 12 or Metal). This way, I do not have to change the VM or Pharo to use the many threads present in the GPGPU. I am modifying Pharo and the VM for other purposes, such as being able to submit lots of data to the GPU so that it can be kept busy.

For actual CPU-side multithreading, I am leaving Pharo and the VM behind by making an ahead-of-time compiler (something similar to Bee Smalltalk), where I am using the OpalCompiler as a frontend and an SSA-based intermediate representation, very similar to the one offered by LLVM but written in Pharo. I had to make this SSA IR to be able to generate the shaders for Vulkan from Pharo, so for this AoT compiler I am just reusing it by adding a machine-code backend. With my framework I am able to generate an elf32 or elf64 object that can be linked directly with any C library, such as a minimalistic runtime ( https://github.com/ronsaldo/slvm-native ) providing Smalltalk facilities such as message sends, object allocation, GC, segmented stacks, etc.

I have already gotten some things working, like message sends, the segmented stack, and block closure creation and activation. For the object model, I am using the Spur object model with some slight modifications. Object interiors are aligned to 16 bytes so SSE instructions can be used. There is a small preheader for implementing the LISP2 GC algorithm (I chose it for its simplicity), become, and heap management. The preheader is not used by generated code, except for serializing objects in the object file. I changed the CompiledMethod object type into a generic mixed oop-and-native-data object. For GC and multithreading, I will just be stopping the whole world at safe points and doing GC in a single thread. By disabling the GC, the user could schedule the GC to happen at times the user won't perceive, such as just after sending a frame-rendering command.

AoT compilation of Smalltalk is going to make modifications to method dictionaries a very rare operation, because you cannot AoT-compile methods at runtime, so you do not need the compiler in a shipping application. This places the burden of thread safety on a small number of places that can be protected explicitly by using some mutexes.
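
A sketch of what one of those few explicitly protected places could look like; Mutex>>critical: exists in both Squeak and Pharo, while the "installed methods" table and its names are made up for illustration:

| installLock installedMethods |
installLock := Mutex new.
installedMethods := Dictionary new.

"The rare write path (method installation): writers serialize on the lock."
installLock critical: [installedMethods at: #double put: [:x | x * 2]].

"Readers take the same lock; because installs are rare, contention stays negligible."
Transcript
    show: (installLock critical: [(installedMethods at: #double) value: 21]) printString;
    cr.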

My plan with this infrastructure is to keep Pharo and the standard VM as a game prototyping and development environment, but to do the actual deployment with this very experimental Ahead-of-Time compiler and the minimalistic Smalltalk runtime.

Best regards,
Ronie

2017-01-31 12:57 GMT-03:00 Levente Uzonyi <[hidden email]>:
 
On Tue, 31 Jan 2017, Stefan Marr wrote:


Hi Levente:

On 31 Jan 2017, at 15:22, Levente Uzonyi <[hidden email]> wrote:

Also the question is does it really need to be objects? Alternatives include things like tuple spaces (think Linda), low-level shared memory buffers (Python and others, and apparently ECMAScript 2017).

You'd actually share a segment with objects stored in it. Low-level buffers are very restricting. They force you to serialize objects if you want to keep using them. And that has some unwanted overhead.

What’s a segment?

It's a read-only chunk of memory holding objects.

Who controls the lifetime of it?

It's permanent.

Are you doing local GC plus global reference counting?

GC never touches that memory, because it can't change.

Somehow you’d still manage those objects, no?

No.



If you go with objects, the problem is that you need to support GC. And, I suppose Eliot will agree that GC for multithreaded systems isn’t exactly zero cost.

You don't need multi-threaded GC here, just many independent single-threaded GCs, which we have already.
Btw, this is the same thing Erlang does.

I am probably missing something, but I’d think you need some global GC mechanism. If you got shared objects, you need to coordinate the local GCs.

All shared objects are permanent and read-only.

In Erlang, most messages are copied, only large data chunks are shared by reference. So, that restricts the need for globally coordinated GC quite a bit, but you still need it as far as I can tell.

Here objects shared by reference would be permanent, therefore no GC would be required.

Levente


Best regards
Stefan


--
Stefan Marr
Johannes Kepler Universität Linz
http://stefan-marr.de/research/


Reply | Threaded
Open this post in threaded view
|

Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Ben Coman
 
On Wed, Feb 1, 2017 at 2:08 AM, Ronie Salgado <[hidden email]> wrote:

>
> Hi All,
>
>> Threads are more useful when one needs high performance and low latency in an application that runs in a single computer. High performance video games and (soft) realtime graphics are usually in this domain.
>>
>> I know you're working with high performance video games. If you would introduce multi-threading in Squeak/Pharo, how would you do it ? Especially, do you have a design in mind that does not require to rewrite all the core libraries ?
>
> For Pharo I am going the GPGPU route, using either OpenCL or a low-level graphics API (Vulkan, D3D 12 or Metal). This way, I do not have to change the VM or Pharo to use the many threads present in the GPGPU. I am modifying Pharo and the VM for other purposes, such as being able to submit lots of data to the GPU so that it can be kept busy.
>
> For actual CPU-side multithreading, I am abandoning Pharo and the VM by making an Ahead-of-Time compiler (something similar to Bee Smalltalk), where I am using the OpalCompiler as a frontend and an SSA-based intermediate representation that is very similar to the one offered by LLVM, but written in Pharo. I had to build this SSA IR to be able to generate the shaders for Vulkan from Pharo, so for the AoT compiler I am just reusing it by adding a machine-code backend. With my framework I am able to generate an ELF32 or an ELF64 file that can be linked directly with any C library, such as a minimalistic runtime ( https://github.com/ronsaldo/slvm-native ) providing Smalltalk facilities such as message sends, object allocation, GC, a segmented stack, etc.
>
> I have already gotten some things working, like message sends, the segmented stack, and block closure creation and activation. For the object model, I am using the Spur object model, but with some slight modifications. Object interiors are aligned to 16 bytes to be able to use SSE instructions. There is a small preheader for implementing the LISP2 GC algorithm (I chose it for its simplicity), become, and heap management. The preheader is not used by generated code, except for serializing objects into the object file. I changed the CompiledMethod object type into a generic mixed oop and native data object type. For GC and multithreading, I will just be stopping the whole world at safe points and doing GC in a single thread. By disabling the GC, the user could schedule it to happen at times the user does not perceive, such as just after sending a frame rendering command.
>
> AoT compilation of Smalltalk is going to make modifications to method dictionaries a very rare operation, because you cannot AoT-compile methods at runtime, so you do not need the compiler in a shipping application. This places the burden of thread safety on a small number of places that can be protected explicitly by using some mutexes.
>
> My plan with this infrastructure is to keep Pharo and the standard VM as a game prototyping and development environment, but to do the actual deployment with this very experimental Ahead-of-Time compiler and the minimalistic Smalltalk runtime.

Thanks for that detailed rundown. Very interesting stuff. Just
curious: would you keep Pharo as a user scripting engine? Also, are
you planning any facility to spawn your SLVM directly from Pharo, and
perhaps implementing the server-side requirements of PharmIDE so you
can debug your SLVM images from Pharo?

cheers -ben
Reply | Threaded
Open this post in threaded view
|

Re: Ideas on cheap multi-threading for Squeak / Pharo ? (from Tim's article)

Ben Coman
In reply to this post by Clément Béra
 


On Tue, Jan 31, 2017 at 4:19 AM, Clément Bera <[hidden email]> wrote:
 
Hi all,

Tim's just shared this lovely article with a 10,000+ core ARM machine. With this kind of machines, it's a bit stupid to use only 1 core when you have 10,000+. I believe we have to find a way to introduce multi-threading in Squeak / Pharo. For co-processors like the Xeon Phi or the graphic cards, I guess it's ok not to use them because their not general purpose processors while the VM is general purpose, but all those 10,000 cores...

For parallel programming, we could consider doing something cheap like the parallel C# loops (Parallel.for and co). The Smalltalk programmer would then explicitly write "collection parallelDo: aBlock" instead of "collection do: aBlock", and if the block is long enough to execute, the cost of parallelisation becomes negligible compared to the performance boost of parallelisation. The block has to perform independent tasks, and if multiple blocks executed in parallel read/write the same memory location, as in C#, the behavior is undefined leading to freezes / crashes. It's the responsibility of the programmer to find out if loop iterations are independent or not (and it's not obvious).

For concurrent programming, there's this design from E where we could have an actor model in Smalltalk where each actor is completely independent from each other, one native thread per actor, and all the common objects (including what's necessary for look-up such as method dictionaries) could be shared as long as they're read-only or immutable. Mutating a shared object such as installing a method in a method dictionary would be detected because such objects are read-only and we can stop all the threads sharing such object to mutate it. The programmer has to keep uncommon the mutation of shared objects to have good performance.

Both design have different goals using multiple cores (parallel and concurrent programming), but in both cases we don't need to rewrite any library to make Squeak / Pharo multi-threaded like they did in Java.

What do you think ? 

Is there anybody on the mailing list having ideas on how to introduce threads in Squeak / Pharo in a cheap way that does not require rewriting all core/collection libraries ?

I'm not really into multi-threading myself but I believe the Cog VM will die in 10 years from now if we don't add something to support multi-threading, so I would like to hear suggestions.


Just came across [1] about PyParallel, which advocates, at the point
that a parallel computation is invoked:
1. Suspend the main thread
2. Just prior to suspension, write-protect all pages of the main thread
3. After the parallel contexts/threads have finished, restore normal protection and resume the main thread

Those slides also mention somewhere that each computational thread gets its own heap,
and rather than garbage collect those heaps they just throw them away. I guess one way
that could work (sketched below) is for each computation thread to return a single object,
copy anything reachable from that object into the main heap, and throw the rest away.
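
None of the following exists in Squeak or Pharo today; it is purely a sketch of that contract (frozen main heap, throwaway per-worker heaps, exactly one object copied back per worker), with ParallelRegion and its selectors entirely hypothetical:

"Hypothetical ParallelRegion: the main image is write-protected for the duration,
each block runs against its own throwaway heap, and only the one object each block
returns is copied back into the main heap."
| partialSums total |
partialSums := ParallelRegion
    runAll: ((1 to: 4) collect: [:chunk |
        [ ((chunk - 1) * 250000 + 1 to: chunk * 250000)
            inject: 0 into: [:sum :each | sum + each] ]])
    copyResultsBackWith: [:result | result].   "SmallIntegers need no real copying"
total := partialSums inject: 0 into: [:sum :each | sum + each].
Transcript show: total printString; cr.   "500000500000, the sum of 1 to 1,000,000"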


cheers -ben