Multi-core CPUs


Re: Multi-core CPUs

Jason Johnson-5
Hi Andreas,

Glad you weren't scared off by this thread. :)

(comments below)

On 10/25/07, Andreas Raab <[hidden email]> wrote:

>
> Deadlock can only happen if one process waits for another. E does
> *never* wait, there is no wait instruction. Instead, it schedules a
> message to be run when the desired computation completes. This leaves
> the Vat free to execute other messages in the meantime. In Croquet, this
> looks like here:
>
>    "future is Croquet's variant for sending async messages"
>    promise := rcvr future doSomething.
>    promise whenResolved:[:value|
>         "do something with the result of the computation
>           this block will be executed once the concurrent computation
>          is completed and the response message is being processed
>          in this Vat/Island."
>    ].

This is interesting.  Is this already truly parallel in Croquet (or
even E for that matter)?

The thing I have always worried about with futures is data versioning.
 For example, if you have a really big chunk of data that gets passed
to another process, how does that work?

In an Erlang model, the process would simply block until it had copied
the entire structure and sent it.  In a future model you don't have to
do that (right?), but don't you have to do a local in-image copy to
ensure that the local process doesn't mutate the structure while it's
being traversed by the remote reading process?  Or perhaps I'm missing
something.

> Because there is no wait, "classic deadlock" simply cannot happen. There
> is an equivalent situation which is called "data lock" where circular
> dependencies will cause a computation not to make progress (because a is
> computed in response to the completion of b, b in response to completion
> of c and c in response to completion of a). But there are *major*
> differences to deadlocks: First, datalock is deterministic, it only
> depends on the sequence of messages which can be examined. Second,
> because the Vat is *not* blocked, you are free to send further messages
> to resolve any one of the dependencies and continue making progress.
>
> In other words, the "control flow problem" of deadlock has been turned
> around into a "data flow problem" (promises and completion messages)
> with much less dramatic consequences when things go wrong.
>
> Cheers,
>    - Andreas

Yes this is indeed quite interesting.  Do you have a feel for how
complex the futures model is, and how complex it would be to make it
truly parallel (assuming it isn't already)?
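
(To make Andreas' datalock case concrete, here is a minimal sketch in the
style of the Croquet snippet above.  Only #future and #whenResolved: are
taken from his example; promiseA, the receivers and the selectors are made
up for illustration.)

    "Assume promiseA can only be resolved by the innermost handler below.
     Its handler schedules B, B's handler schedules C, and C's handler is
     the only thing that would ever resolve promiseA.  Nothing blocks the
     Vat, but none of the three steps ever runs: a datalock."
    promiseA whenResolved: [:va |
        (rcvrB future computeBWith: va) whenResolved: [:vb |
            (rcvrC future computeCWith: vb) whenResolved: [:vc |
                rcvrA resolveAWith: vc]]].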


Re: Multi-core CPUs

Jason Johnson-5
In reply to this post by Marcel Weiher-3
Interesting.  I really think that to make real progress in
parallelization we will have to walk away from the Intel model of
shared memory stacked on 3 levels of cache, snoopy busses to propagate
writes and so on.  Of course they will force that model to scale to a
point, but it can't go on forever and there just has to be a simpler
way.

On 10/26/07, Marcel Weiher <[hidden email]> wrote:

>
> On Oct 25, 2007, at 12:28 PM, Peter William Lount wrote:
>
> > The Tile-64 processor is expected to grow to about 4096 processors
> > by pushing the limits of technology beyond what they are today. To
> > reach the levels you are talking about for a current Smalltalk image
> > with millions of objects each having their own thread (or process)
> > isn't going to happen anytime soon.
> >
> > I work with real hardware.
>
>
> A couple of numbers:
>
> - Montecito, the new dual-core Itanic has 1.72 billion transistors.
> - The ARM6 macrocell has around 35000 transistors
> - divide the two, and you will find that you could get more ARM6 cores
> for the Montecito transistor budget than the ARM6 has transistors
>
> So we can have a 35K object system with every processor having its own
> CPU core and all message-passing being asynchronous.  This is likely
> to be highly inefficient, with most of the CPUs waiting/idle most of
> the time, say 99%.  With 1% efficiency, and say, a 200MHz clock, the
> effective throughput would still be 200M * 35000 / 100 =  70 billion
> instructions per second.  That's a lot of instructions.  And wait what
> happens if we have some really parallel algorithm that cranks
> efficiency up to 10%!
>
> I am not saying any of these numbers are valid or that this is a
> realistic system, but I do find the numbers of that little thought
> experiment...interesting.  And of course, while Moore's law appears to
> have stopped for cycle times, it does seem to still be going for
> transistors per chip.
>
> Marcel
>
>
>
>
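
(Marcel's throughput estimate can be replayed as a workspace doIt; the
figures below are just his stated assumptions, not measurements.)

    | clock cores efficiency |
    clock := 200e6.        "200 MHz per ARM6-class core"
    cores := 35000.        "roughly Montecito's transistor budget / ARM6 macrocell size"
    efficiency := 0.01.    "99% of the cores assumed idle at any moment"
    clock * cores * efficiency
        "=> 7.0e10, i.e. roughly 70 billion instructions per second"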


Re: Multi-core CPUs

timrowledge
In reply to this post by Marcel Weiher-3

On 25-Oct-07, at 9:25 PM, Marcel Weiher wrote:

>
> - Montecito, the new dual-core Itanic has 1.72 billion transistors.
> - The ARM6 macrocell has around 35000 transistors
> - divide the two, and you will find that you could get more ARM6  
> cores for the Montecito transistor budget than the ARM6 has  
> transistors

Nicely pointed out Marcel! I've been trying to make a similar point  
for about, oh two decades now....

In fact around ten years ago TI announced some new technology  
relating to wafer scale fabrication (I think, don't hold me to this)  
and as an illustration of its possibilities they said it meant they  
could put (something like) 128 StrongARM cpus each with 4MB ram on a  
wafer.  Now let's say we take an easy path and put a mere 1000 ARM  
cores on a chip, so as to leave some room for caches and transputer-
like links (I think someone actually did those for ARM at some point  
in the past) and interface stuff. ARM 1176 cores are rated for 800MHz
with claims of up to 1GHz, so we have the potential for a trillion
instructions per second. Even Microsoft would surely have trouble
soaking up that much cpu with pointless fiddle-faddle.

If we got no better than 1% useful work because of poor code we'd  
still be getting 10 gips.
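
(The same back-of-the-envelope sum for the 1000-core scenario above,
restated as a workspace doIt; the clock rates and the 1% figure are the
assumptions stated in the text.)

    | cores |
    cores := 1000.
    (Array with: 800e6 with: 1e9) collect: [:clock | cores * clock * 0.01]
        "=> 8.0e9 and 1.0e10: about 8 to 10 billion useful instructions
         per second, i.e. 8-10 gips, at 1% efficiency"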


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful Latin Phrases:- Utinam logica falsa tuam philosophiam totam  
suffodiant! = May faulty logic undermine your entire philosophy!




RE: Multi-core CPUs

Sebastian Sastre-2
In reply to this post by Marcel Weiher-3

> -----Original Message-----
> From: [hidden email]
> [mailto:[hidden email]] On behalf
> of Marcel Weiher
> Sent: Friday, 26 October 2007 01:26
> To: [hidden email]; The general-purpose Squeak developers list
> Subject: Re: Multi-core CPUs
...

> A couple of numbers:
>
> - Montecito, the new dual-core Itanic has 1.72 billion transistors.
> - The ARM6 macrocell has around 35000 transistors
> - divide the two, and you will find that you could get more
> ARM6 cores for the Montecito transistor budget than the ARM6
> has transistors
>
> So we can have a 35K object system with every processor
> having its own CPU core and all message-passing being
> asynchronous.  This is likely to be highly inefficient, with
> most of the CPUs waiting/idle most of the time, say 99%.  
> With 1% efficiency, and say, a 200MHz clock, the effective
> throughput would still be 200M * 35000 / 100 =  70 billion
> instructions per second.  That's a lot of instructions.  And
> wait what happens if we have some really parallel algorithm
> that cranks efficiency up to 10%!
>
> I am not saying any of these numbers are valid or that this
> is a realistic system, but I do find the numbers of that
> little thought experiment...interesting.  And of course,
> while Moore's law appears to have stopped for cycle times, it
> does seem to still be going for transistors per chip.
>
> Marcel
>
>
>
Marcel, since Smalltalk has always shined at scaling in complexity, I posted
near the beginning of this thread about the different dimensions of
scalability. For the CPU, if cycle time is the vertical dimension and cores
are the horizontal one, we are, as you suggest, entering a period of
horizontal cpu scaling (the next few years), measurable in transistors per chip.

No matter which model we choose to map the conceptual model onto boolean
processors (thanks to the holy transistor), it will have an impedance mismatch.

When we select a solution, our trade-off will necessarily be to balance that
impedance, which comes from complexity, between machines (the boolean domain)
and people (the conceptual domain).

Ironically, this industry made by people has an incredible talent for making
things easier for machines at the cost of polluting the conceptual model.

Given that, we can choose a path that pollutes the conceptual model or one
that does not. As I see things, polluting the conceptual model is a shot in
the foot. I think the Smalltalk community should again prioritize the
heuristic spirit of Smalltalk by showing a willingness to keep pollution out
of the conceptual model. Anyway, it's our choice.

Regards,

Sebastian



Re: Multi-core CPUs

Rob Withers
Peter,

I also want to thank you for this link:
http://www.greenteapress.com/semaphores/downey05semaphores.pdf
I started to read it after David's comment about it and it is entertaining
and I am learning lots.

I also plan on using it in phase 3 of my multi-threaded vm project.

Phase 1, my current phase, is to get all msg sends to be interceptable by
the SqueakElib promise framework.  This includes things that have been macro
transformed by the Compiler, like #ifTrue:, #ifNil:, #whileTrue:, and so on.
It also includes bytecodeMethods like #class and #==.

Phase 2 is to allow all primitives and bytecode methods to have a promise as
an argument.  Here, my plan is to stop the primitive call short and send the
encapsulated primitive call to the promise(s) as part of a whenMoreResolved
call.  When the promise resolves, the primitive call will be made.  QoS can
be satisfied by joining the promise with a timer, such that if the promise
does not resolve in xxx milliseconds, it will become broken and the
primitive call will "fail".
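
(A rough sketch of that timeout idea as an image-level helper method, in
plain Squeak rather than the real SqueakElib API: #whenMoreResolved: is the
hook named above, while #invokeWith:, #isResolved and #breakWith: are
stand-ins invented for illustration.  Delay and #fork are ordinary Squeak.)

    schedule: primitiveCall on: aPromise timeoutAfterMilliseconds: ms
        "Defer the primitive until the promise resolves, and arm a watchdog
         that breaks the promise, and thereby fails the primitive, on timeout."
        aPromise whenMoreResolved: [:value | primitiveCall invokeWith: value].
        [(Delay forMilliseconds: ms) wait.
         aPromise isResolved
             ifFalse: [aPromise breakWith: #timedOut]] fork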

Phase 3 is to make the Interpreter multithreaded, while protecting
ObjectMemory with Semaphores.  I have a quad-core chip and so I want 4
Interpreter threads (Vats).  Only one of them can be inside of ObjectMemory
at a time and that could be for purposes of allocation, mutation, or GC.
It's possible that a simple mutex semaphore would suffice, initially.   In
this model, references to objects in other Vats will be ThreadRefs (a form
of a FarRef) and msgs will be serialized to the other Vat (reassigned the
VatID in the same shared ObjectMemory, or copied to a different but
co-located ObjectMemory).
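
(The simple-mutex variant, sketched at the image level just to show its
shape.  Semaphore>>forMutualExclusion and #critical: are standard Squeak;
memoryLock, objectMemory and #allocateInstanceOf: are placeholders for
whatever the multithreaded Interpreter would actually use.)

    memoryLock := Semaphore forMutualExclusion.

    "Every Vat/interpreter thread takes the single ObjectMemory lock before
     allocating, mutating or collecting; plain reads could skip it as long
     as nothing relocates objects underneath them."
    memoryLock critical:
        [newOop := objectMemory allocateInstanceOf: classOop].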

I don't think having a single ObjectMemory will scale to tens of
"processors"; it will probably also need to be split up, with one ObjectMemory
per Vat.  That is good from the standpoint of having no shared memory.  One
challenge then is what happens if refs from two Vats are involved in the same
primitive call.  Well, memory reads don't have to be protected, unless memory
can be relocated, that is.  One thing at a time, I tell myself.

I have 0 experience in this area (Interpreter+ObjectMemory), but I thought
it would be fun.  Your link will help tremendously.

Cheers,
Rob



Re: Multi-core CPUs

pwl
Rob Withers wrote:

> Peter,
>
> I also want to thank you for this link:
> http://www.greenteapress.com/semaphores/downey05semaphores.pdf
> I started to read it after David's comment about it and it is
> entertaining and I am learning lots.
>
> I also plan on using it in phase 3 of my multi-threaded vm project.
>
> Phase 1, my current phase, is to get all msg sends to be interceptable
> by the SqueakElib promise framework.  This includes things that have
> been macro transformed by the Compiler, like #ifTrue:, #ifNil:,
> #whileTrue:, and so on. It also includes bytecodeMethods like #class
> and #==.
>
> Phase 2 is to allow all primitives and bytecode methods to have a
> promise as an argument.  Here, my plan is to stop short the primitive
> call and send the encapsulated primitive call to the promise(s) as
> part of a whenMoreResolved call.  When the promise resolves, the
> primitive call will be made.  QoS can be satisfied by joining the
> promise with a timer, such that if the promise does not resolve in xxx
> milliseconds, it will become broken and the primitive call will "fail".
>
> Phase 3 is to make the Interpreter multithreaded, while protecting
> ObjectMemory with Semaphores.  I have a quad-core chip and so I want 4
> Interpreter threads (Vats).  Only one of them can be inside of
> ObjectMemory at a time and that could be for purposes of allocation,
> mutation, or GC. It's possible that a simple mutex semaphore would
> suffice, initially.   In this model, references to objects in other
> Vats will be ThreadRefs (a form of a FarRef) and msgs will be
> serialized to the other Vat (reassigned the VatID in the same shared
> ObjectMemory, or copied to a different but co-located ObjectMemory).
>
> I don't think having a single ObjectMemory will scale to 10's of
> "processors", but will probably also need to be multithreaded with one
> per Vat.  It's good from the standpoint of no shared memory.  One
> challenge then is what if refs from 2 Vats are involved in the same
> primitive call.  Well, memory reads don't have to be protected, unless
> memory can be relocated, that is.  One thing at a time, I tell myself.
>
> I have 0 experience in this area (Interpreter+ObjectMemory), but I
> thought it would be fun.  Your link will help tremendously.
>
> Cheers,
> Rob
>
>
Hi Rob,

Yeah, it's an awesome little book that cuts right to the chase. I
particularly like that they show some solutions that won't work, as
concurrency is quite difficult and sometimes you think a solution is correct
when it isn't. It's good to learn about those pitfalls.

Your plan sounds excellent. Thank you for taking up the task of making
Squeak VM multi-threaded with native threads!

If you need anything...

All the best,

Peter



RE: Multi-core CPUs

Ron Teitelbaum
http://news.squeak.org/2007/10/26/wait-for-it-the-little-book-of-semaphores/


I like it too!  :)

Ron Teitelbaum
Squeak News Team Leader

> -----Original Message-----
> From: Peter William Lount
>
> Rob Withers wrote:
> > Peter,
> >
> > I also want to thank you for this link:
> > http://www.greenteapress.com/semaphores/downey05semaphores.pdf
> > I started to read it after David's comment about it and it is
> > entertaining and I am learning lots.
> >
> > I also plan on using it in phase 3 of my multi-threaded vm project.
> >
> > Phase 1, my current phase, is to get all msg sends to be interceptable
> > by the SqueakElib promise framework.  This includes things that have
> > been macro transformed by the Compiler, like #ifTrue:, #ifNil:,
> > #whileTrue:, and so on. It also includes bytecodeMethods like #class
> > and #==.
> >
> > Phase 2 is to allow all primitives and bytecode methods to have a
> > promise as an argument.  Here, my plan is to stop short the primitive
> > call and send the encapsulated primitive call to the promise(s) as
> > part of a whenMoreResolved call.  When the promise resolves, the
> > primitive call will be made.  QoS can be satisfied by joining the
> > promise with a timer, such that if the promise does not resolve in xxx
> > milliseconds, it will become broken and the primitive call will "fail".
> >
> > Phase 3 is to make the Interpreter multithreaded, while protecting
> > ObjectMemory with Semaphores.  I have a quad-core chip and so I want 4
> > Interpreter threads (Vats).  Only one of them can be inside of
> > ObjectMemory at a time and that could be for purposes of allocation,
> > mutation, or GC. It's possible that a simple mutex semaphore would
> > suffice, initially.   In this model, references to objects in other
> > Vats will be ThreadRefs (a form of a FarRef) and msgs will be
> > serialized to the other Vat (reassigned the VatID in the same shared
> > ObjectMemory, or copied to a different but co-located ObjectMemory).
> >
> > I don't think having a single ObjectMemory will scale to 10's of
> > "processors", but will probably also need to be multithreaded with one
> > per Vat.  It's good from the standpoint of no shared memory.  One
> > challenge then is what if refs from 2 Vats are involved in the same
> > primitive call.  Well, memory reads don't have to be protected, unless
> > memory can be relocated, that is.  One thing at a time, I tell myself.
> >
> > I have 0 experience in this area (Interpreter+ObjectMemory), but I
> > thought it would be fun.  Your link will help tremendously.
> >
> > Cheers,
> > Rob
> >
> >
> Hi Rob,
>
> Yeah, it's an awesome little book that cuts right to the chase. I
> particularly like that they show some solutions that won't work as
> concurrency is quite difficult and sometimes you think it's correct when
> it isn't. It's good to learn about those pitfalls.
>
> Your plan sounds excellent. Thank you for taking up the task of making
> Squeak VM multi-threaded with native threads!
>
> If you need anything...
>
> All the best,
>
> Peter
>




Re: Multi-core CPUs

pwl
In reply to this post by timrowledge
tim Rowledge wrote:

>
> On 25-Oct-07, at 9:25 PM, Marcel Weiher wrote:
>>
>> - Montecito, the new dual-core Itanic has 1.72 billion transistors.
>> - The ARM6 macrocell has around 35000 transistors
>> - divide the two, and you will find that you could get more ARM6
>> cores for the Montecito transistor budget than the ARM6 has transistors
>
> Nicely pointed out Marcel! I've been trying to make a similar point
> for about, oh two decades now....
>
> In fact around ten years ago TI announced some new technology relating
> to wafer scale fabrication (I think, don't hold me to this) and as an
> illustration of its possibilities they said it meant they could put
> (something like) 128 StrongARM cpus each with 4MB ram on a wafer.  Now
> let's say we take an easy path and put a mere 1000 ARM cores on a
> chip, so as to leave some room for caches and transputer-like links (I
> think someone actually did those for ARM at some point in the past)
> and interface stuff. ARM 1176 cores are rated for 800MHz with claims
> of up to 1GHz, so we have the potential for a trillion instructions per
> second. Even Microsoft would surely have trouble soaking up that much
> cpu with pointless fiddle-faddle.
>
> If we got no better than 1% useful work because of poor code we'd
> still be getting 10 gips.

Hi,

That is essentially what Tilera is doing with their Tile-N processors
(where N is 36, 64, 128, 1024, 4096, ...). They are shipping the Tile-64
chip now or shortly. http://www.Tilera.com.

They have a design "kill rule" which states that if they increase the
surface area by N% the cpu performance must also increase by at least N%.

The Itanium, however, is an awesome processor in its own right regardless
of the number of transistors it uses. It has predicate registers plus
128 64-bit integer registers and 128 floating-point registers. That is a lot
of registers, so the arguments about not having enough registers can be put
to bed. In fact the register file is somewhat like, but not quite like, the
Sun SPARC processors'. It has instruction-level parallelism, which is good
for a great many problems. Overall a very interesting and powerful processor.

When it comes to transistor budgets your analysis is correct... and may
win the day in the marketplace. We'll see whether Tilera or Intel will
bring these internally networked grid chips to the mainstream market.

Peter



Re: Multi-core CPUs - Synchronization Patterns as Classes?

pwl
In reply to this post by Ron Teitelbaum
Hi,

It seems that the "patterns" of synchronization in "The Little Book of
Semaphores" are just that, patterns. Like other patterns, they could be
implemented as abstract and concrete classes so that, rather than having
to be rewritten from scratch each time, the solutions are off the shelf and
available for use. A class library of synchronization built on semaphores
might help people leverage multi-threading with N-core
cpus in Smalltalk (where N is greater than or equal to 1), using
green threads, native threads, or both.

Just a thought.

Cheers,

peter

"The Little Book of Semaphores"
http://www.greenteapress.com/semaphores/downey05semaphores.pdf
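
(As a taste of what one shelf item might look like, here is a minimal
single-use barrier in plain Squeak, loosely following the barrier pattern
from the book.  The class and selector names are invented; Semaphore,
#critical: and #timesRepeat: are standard.)

Object subclass: #Barrier
    instanceVariableNames: 'capacity count mutex turnstile'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Concurrency-Patterns'

Barrier class >> for: nProcesses
    ^self new setCapacity: nProcesses

Barrier >> setCapacity: nProcesses
    capacity := nProcesses.
    count := 0.
    mutex := Semaphore forMutualExclusion.
    turnstile := Semaphore new

Barrier >> wait
    "Block until 'capacity' processes have arrived, then release them all.
     Single-use: make a fresh Barrier for each rendezvous."
    mutex critical:
        [count := count + 1.
         count = capacity ifTrue: [capacity timesRepeat: [turnstile signal]]].
    turnstile wait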


Re: Multi-core CPUs

Steven Elkins
In reply to this post by Jason Johnson-5
The Design and Implementation of ConcurrentSmalltalk

http://www.amazon.com/Implementation-Concurrent-Smalltalk-Computer-Science/dp/9810201125

From the Introduction: "In Concurrent Smalltalk, an object is not only
a unit of data abstraction but also a unit of execution."

On 10/25/07, Jason Johnson <[hidden email]> wrote:

> On 10/24/07, Sebastian Sastre <[hidden email]> wrote:
> >
> > So I'm stating here that in a smalltalk image of the future *every object
> > should have a process*. Every instance. All of them.
>
> That is an interesting idea.  That would open a door to a new way of
> Garbage collection, because it can then be tied to the exit of a
> process.
>
> > Said that I return to the problem you stated about the need of copy copy
> > copy, saying that this premise changes things and you don't need to copy
> > anymore because a VM like that, no matter who or when, an instVar of an
> > object is to be modified it will provide you of guarantee that the write
> > will be made by the process that corresponds to that instance.
>
> Yes, in such a system, you don't need to copy because all that gets
> passed around are references to processes.
>
>
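
(A crude sketch of "object as a unit of execution" in plain Squeak, just to
make the idea tangible: the wrapped object gets its own Process and mailbox,
and all sends to it are funnelled through that process, so nothing else ever
touches its state.  SharedQueue, #fork and #perform:withArguments: are
standard Squeak; ActiveWrapper and its selectors are invented for
illustration.)

Object subclass: #ActiveWrapper
    instanceVariableNames: 'target mailbox worker'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Concurrency-Sketch'

ActiveWrapper >> on: anObject
    target := anObject.
    mailbox := SharedQueue new.
    "The object's own process: pull one queued send at a time and run it."
    worker := [[true] whileTrue: [mailbox next value]] fork

ActiveWrapper >> asyncSend: aSelector withArguments: anArray
    "Queue the send; only this wrapper's process ever performs it, so callers
     never need to copy or lock the target's state."
    mailbox nextPut: [target perform: aSelector withArguments: anArray]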


Re: Multi-core CPUs

Rob Withers
In reply to this post by pwl

----- Original Message -----
From: "Peter William Lount" <[hidden email]>

> Your plan sounds excellent. Thank you for taking up the task of making
> Squeak VM multi-threaded with native threads!
>
> If you need anything...

I don't want to make it sound like I can't use some help, especially if it's
offered.  I can't do this alone; forget it, especially with the day job.
No, I figure it to be a 2-year task, at least.  But I would rather build
something than talk about all the theory.  I fleshed out the phases I posted
earlier with what I thought were some more manageable tasks.  I'd like to
point out that Phase 3, implementing the multithreaded vm, is entirely
independent of SqueakElib and would be usable by anyone wanting to do
multithreading.

Here's the new page, add what you like, help where you can, holler to talk
it over: http://wiki.squeak.org/squeak/6011

Cheers,
Rob



Re: Multi-core CPUs

Jason Johnson-5
In reply to this post by pwl
On 10/26/07, Peter William Lount <[hidden email]> wrote:
>
> Your plan sounds excellent. Thank you for taking up the task of making
> Squeak VM multi-threaded with native threads!

Yes, thanks.  I will need a true multi-threaded VM at some point as
well.  I just have to make it transparent to the processes running in
the VM. :)


Re: Multi-core CPUs

pwl
Jason Johnson wrote:
> On 10/26/07, Peter William Lount <[hidden email]> wrote:
>> Your plan sounds excellent. Thank you for taking up the task of making
>> Squeak VM multi-threaded with native threads!
>
> Yes, thanks.  I will need a true multi-threaded VM at some point as
> well.  I just have to make it transparent to the processes running in
> the VM. :)

Hi,

I really do like the notion of easy multi-threading - really. I've admired Erlang for what it's achieved in that regard for years now. I encourage everyone interested in that to keep persevering and searching for a practical way forward towards your vision.

All the best,

Peter




Re: Multi-core CPUs

Jason Johnson-5
Well, what I plan to try out isn't the only way, and probably not the
best, but I think it's a baby step in the right direction.  As Andreas
pointed out, there are other solutions that may even be better from a
high level point of view (message passing still requires careful
design).

I really believe shared-state concurrency with fine-grained locking
can't scale much further than it already has.  And I'm by no means the
only one.  Here is another thread on the matter:

http://lambda-the-ultimate.org/node/2048

On 10/27/07, Peter William Lount <[hidden email]> wrote:

>
>  Jason Johnson wrote:
>  On 10/26/07, Peter William Lount <[hidden email]> wrote:
>
>
>  Your plan sounds excellent. Thank you for taking up the task of making
> Squeak VM multi-threaded with native threads!
>
>  Yes, thanks. I will need a true multi-threaded VM at some point as
> well. I just have to make it transparent to the processes running in
> the VM. :)
>
>
>  Hi,
>
>  I really do like the notion of easy multi threading - really. I've admired
> Erlang for what it's achieved in that regard for years now. I encourage
> everyone interested in that to keep persevering and searching for a
> practical way forward towards your vision.
>
>  All the best,
>
>  Peter
>
>
>
>
>
