Hi Andreas,
Glad you weren't scared off by this thread. :) (comments below)

On 10/25/07, Andreas Raab <[hidden email]> wrote:
> Deadlock can only happen if one process waits for another. E *never*
> waits; there is no wait instruction. Instead, it schedules a message to
> be run when the desired computation completes. This leaves the Vat free
> to execute other messages in the meantime. In Croquet, this looks like:
>
>   "future is Croquet's variant for sending async messages"
>   promise := rcvr future doSomething.
>   promise whenResolved: [:value |
>       "do something with the result of the computation.
>        This block will be executed once the concurrent computation
>        is completed and the response message is being processed
>        in this Vat/Island."
>   ].

This is interesting. Is this already truly parallel in Croquet (or even in E, for that matter)?

The thing I have always worried about with futures is data versioning. For example, if you have a really big chunk of data that gets passed to another process, how does that work? In an Erlang model, the process would simply block until it had copied the entire structure and sent it. In a future model you don't have to do that (right?), but don't you have to do a local in-image copy to ensure that the local process doesn't mutate the structure while it's being traversed by the remote reading process? Or perhaps I'm missing something.

> Because there is no wait, "classic deadlock" simply cannot happen. There
> is an equivalent situation called "datalock" where circular dependencies
> will cause a computation not to make progress (because a is computed in
> response to the completion of b, b in response to the completion of c,
> and c in response to the completion of a). But there are *major*
> differences from deadlock: First, datalock is deterministic; it only
> depends on the sequence of messages, which can be examined. Second,
> because the Vat is *not* blocked, you are free to send further messages
> to resolve any one of the dependencies and continue making progress.
>
> In other words, the "control flow problem" of deadlock has been turned
> around into a "data flow problem" (promises and completion messages)
> with much less dramatic consequences when things go wrong.
>
> Cheers,
>   - Andreas

Yes, this is indeed quite interesting. Do you have a feel for how complex the futures model is, and how complex it would be to make it truly parallel (assuming it isn't already)?
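The future/whenResolved: pattern Andreas describes can be mimicked with a minimal single-threaded event loop. Here is a rough Python sketch; the Vat and Promise classes are illustrative inventions for this thread, not E's or Croquet's actual API. The key property is that a send never blocks: it queues a message and immediately returns a promise, and callbacks registered with when_resolved fire once the queued computation completes.

```python
# Toy single-threaded "vat": an event loop running queued messages.
# Illustrative only; names do not match E or Croquet.
from collections import deque

class Promise:
    def __init__(self):
        self.resolved = False
        self.value = None
        self.callbacks = []

    def when_resolved(self, fn):
        # Register a callback; fire immediately if already resolved.
        if self.resolved:
            fn(self.value)
        else:
            self.callbacks.append(fn)

    def resolve(self, value):
        self.resolved = True
        self.value = value
        for fn in self.callbacks:
            fn(value)
        self.callbacks = []

class Vat:
    def __init__(self):
        self.queue = deque()

    def send(self, fn, *args):
        """Asynchronous send: queue the message, return a promise now."""
        p = Promise()
        self.queue.append((fn, args, p))
        return p

    def run(self):
        # Drain the queue; resolving a promise runs its callbacks.
        while self.queue:
            fn, args, p = self.queue.popleft()
            p.resolve(fn(*args))

vat = Vat()
promise = vat.send(lambda: 6 * 7)        # "rcvr future doSomething"
results = []
promise.when_resolved(results.append)    # "whenResolved: [:value | ...]"
vat.run()
print(results)  # [42]
```

Datalock in this model is visible as promises left unresolved after the queue drains, rather than a hung thread.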
In reply to this post by Marcel Weiher-3
Interesting. I really think that to make real progress in
parallelization we will have to walk away from the Intel model of shared memory stacked on 3 levels of cache, snoopy buses to propagate writes, and so on. Of course they will force that model to scale to a point, but it can't go on forever, and there just has to be a simpler way.

On 10/26/07, Marcel Weiher <[hidden email]> wrote:
> On Oct 25, 2007, at 12:28 PM, Peter William Lount wrote:
>
> > The Tile-64 processor is expected to grow to about 4096 processors by
> > pushing the limits of technology beyond what they are today. To reach
> > the levels you are talking about for a current Smalltalk image with
> > millions of objects each having their own thread (or process) isn't
> > going to happen anytime soon.
> >
> > I work with real hardware.
>
> A couple of numbers:
>
> - Montecito, the new dual-core Itanic, has 1.72 billion transistors.
> - The ARM6 macrocell has around 35000 transistors.
> - Divide the two, and you will find that you could get more ARM6 cores
>   for the Montecito transistor budget than the ARM6 has transistors.
>
> So we can have a 35K-object system with every object having its own CPU
> core and all message-passing being asynchronous. This is likely to be
> highly inefficient, with most of the CPUs waiting/idle most of the time,
> say 99%. With 1% efficiency and, say, a 200MHz clock, the effective
> throughput would still be 200M * 35000 / 100 = 70 billion instructions
> per second. That's a lot of instructions. And what happens if we have
> some really parallel algorithm that cranks efficiency up to 10%!
>
> I am not saying any of these numbers are valid or that this is a
> realistic system, but I do find the numbers of that little thought
> experiment... interesting. And of course, while Moore's law appears to
> have stopped for cycle times, it does seem to still be going for
> transistors per chip.
>
> Marcel
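Marcel's back-of-envelope numbers check out. A quick recomputation of the thought experiment (using his published figures only):

```python
# Recomputing Marcel's thought experiment.
montecito_transistors = 1_720_000_000    # dual-core Itanium 2 "Montecito"
arm6_transistors = 35_000                # ARM6 macrocell, roughly

# Cores that fit in the Montecito transistor budget:
cores = montecito_transistors // arm6_transistors   # ~49,000 > 35,000

clock_hz = 200_000_000                   # his assumed 200 MHz clock
utilization = 0.01                       # 99% of CPUs idle

# Using his round 35K-core figure:
throughput = 35_000 * clock_hz * utilization
print(cores)             # 49142 possible ARM6 cores
print(throughput / 1e9)  # 70.0 (billion instructions per second)
```

So even at 1% efficiency the aggregate is 70 billion instructions per second, and the transistor budget would actually allow some 49,000 ARM6 cores, more than the ARM6 has transistors, as he says.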
In reply to this post by Marcel Weiher-3
On 25-Oct-07, at 9:25 PM, Marcel Weiher wrote:
>
> - Montecito, the new dual-core Itanic, has 1.72 billion transistors.
> - The ARM6 macrocell has around 35000 transistors.
> - Divide the two, and you will find that you could get more ARM6 cores
>   for the Montecito transistor budget than the ARM6 has transistors.

Nicely pointed out, Marcel! I've been trying to make a similar point for about, oh, two decades now....

In fact, around ten years ago TI announced some new technology relating to wafer-scale fabrication (I think; don't hold me to this), and as an illustration of its possibilities they said it meant they could put (something like) 128 StrongARM CPUs, each with 4MB RAM, on a wafer. Now let's say we take an easy path and put a mere 1000 ARM cores on a chip, so as to leave some room for caches, transputer-like links (I think someone actually did those for ARM at some point in the past), and interface stuff. ARM1176 cores are rated for 800MHz, with claims of up to 1GHz, so we have the potential for a trillion instructions per second. Even Microsoft would surely have trouble soaking up that much CPU with pointless fiddle-faddle.

If we got no better than 1% useful work because of poor code, we'd still be getting 10 gips.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful Latin Phrases:- Utinam logica falsa tuam philosophiam totam suffodiant! = May faulty logic undermine your entire philosophy!
In reply to this post by Marcel Weiher-3
> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On behalf of Marcel Weiher
> Sent: Friday, October 26, 2007 01:26
> To: [hidden email]; The general-purpose Squeak developers list
> Subject: Re: Multi-core CPUs
>
> [Marcel's thought-experiment numbers snipped]
>
> Marcel

in the beginning of this matter, about the different dimensions of scalability. For the CPU, if cycle time is vertical and cores are horizontal, we are, as you suggest, entering a horizontal CPU scaling moment (the next years), measurable in transistors per chip. No matter which model we choose to map the conceptual model onto boolean processors (due to the holy transistors), it will have an impedance mismatch. When we select a solution, our trade-off will necessarily be to balance that impedance, born of complexity, between machines (the boolean domain) and persons (the conceptual domain).

Ironically, this industry made by persons has an incredible talent for making things easier for machines at the cost of polluting the conceptual model. Given that, we can choose a path that pollutes the conceptual model, or one that does not. As I see things, polluting the conceptual model is a shot in the foot. I think the Smalltalk community should again prioritize the heuristic spirit of Smalltalk by showing a willingness to evade the injection of pollution into the conceptual model. Anyway, it's our choice.

Regards,
Sebastian
Peter,
I also want to thank you for this link:
http://www.greenteapress.com/semaphores/downey05semaphores.pdf
I started to read it after David's comment about it, and it is entertaining and I am learning lots. I also plan on using it in phase 3 of my multi-threaded VM project.

Phase 1, my current phase, is to get all msg sends to be interceptable by the SqueakElib promise framework. This includes things that have been macro-transformed by the Compiler, like #ifTrue:, #ifNil:, #whileTrue:, and so on. It also includes bytecode methods like #class and #==.

Phase 2 is to allow all primitives and bytecode methods to have a promise as an argument. Here, my plan is to stop the primitive call short and send the encapsulated primitive call to the promise(s) as part of a whenMoreResolved call. When the promise resolves, the primitive call will be made. QoS can be satisfied by joining the promise with a timer, such that if the promise does not resolve in xxx milliseconds, it will become broken and the primitive call will "fail".

Phase 3 is to make the Interpreter multithreaded, while protecting ObjectMemory with semaphores. I have a quad-core chip, so I want 4 Interpreter threads (Vats). Only one of them can be inside ObjectMemory at a time, whether for allocation, mutation, or GC. It's possible that a simple mutex semaphore would suffice, initially. In this model, references to objects in other Vats will be ThreadRefs (a form of FarRef), and msgs will be serialized to the other Vat (reassigned the VatID in the same shared ObjectMemory, or copied to a different but co-located ObjectMemory).

I don't think a single ObjectMemory will scale to 10's of "processors"; it will probably also need to be split, with one per Vat. That's good from the standpoint of no shared memory. One challenge then is: what if refs from 2 Vats are involved in the same primitive call? Well, memory reads don't have to be protected, unless memory can be relocated, that is. One thing at a time, I tell myself.

I have 0 experience in this area (Interpreter+ObjectMemory), but I thought it would be fun. Your link will help tremendously.

Cheers,
Rob
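Rob's Phase 2 QoS idea, a promise joined with a timer so that an unresolved promise eventually "breaks" and the deferred call fails, can be sketched in Python. This is a hypothetical illustration, not SqueakElib code; the class and method names are invented.

```python
# Sketch of a promise that breaks if not resolved before a deadline.
# Illustrative only; not SqueakElib's actual API.
import threading

class TimedPromise:
    def __init__(self, timeout_s):
        self.lock = threading.Lock()
        self.state = "pending"             # pending -> resolved | broken
        self.value = None
        self.callbacks = []                # (on_resolved, on_broken) pairs
        self.timer = threading.Timer(timeout_s, self._break)
        self.timer.start()

    def when_resolved(self, on_resolved, on_broken):
        with self.lock:
            if self.state == "pending":
                self.callbacks.append((on_resolved, on_broken))
                return
            state, value = self.state, self.value
        if state == "resolved":
            on_resolved(value)
        else:
            on_broken()

    def resolve(self, value):
        with self.lock:
            if self.state != "pending":
                return                     # too late: already broken
            self.state, self.value = "resolved", value
            self.timer.cancel()
            callbacks, self.callbacks = self.callbacks, []
        for on_resolved, _ in callbacks:
            on_resolved(value)             # the deferred call proceeds

    def _break(self):
        with self.lock:
            if self.state != "pending":
                return
            self.state = "broken"
            callbacks, self.callbacks = self.callbacks, []
        for _, on_broken in callbacks:
            on_broken()                    # the deferred call "fails"

log = []
p = TimedPromise(timeout_s=0.5)
p.when_resolved(log.append, lambda: log.append("p broken"))
p.resolve(99)                              # resolved well within the deadline

q = TimedPromise(timeout_s=0.05)
q.when_resolved(log.append, lambda: log.append("q broken"))
threading.Event().wait(0.2)                # never resolved; the timer fires
print(log)  # [99, 'q broken']
```

The on_broken path is where the encapsulated primitive call would signal failure in Rob's design.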
Rob Withers wrote:
> Peter,
>
> I also want to thank you for this link:
> http://www.greenteapress.com/semaphores/downey05semaphores.pdf
> I started to read it after David's comment about it and it is
> entertaining and I am learning lots.
>
> [rest of Rob's multi-threaded VM plan snipped]

Hi Rob,

Yeah, it's an awesome little book that cuts right to the chase. I particularly like that they show some solutions that won't work; concurrency is quite difficult, and sometimes you think a solution is correct when it isn't. It's good to learn about those pitfalls.

Your plan sounds excellent. Thank you for taking up the task of making the Squeak VM multi-threaded with native threads!

If you need anything...

All the best,

Peter
http://news.squeak.org/2007/10/26/wait-for-it-the-little-book-of-semaphores/
I like it too! :)

Ron Teitelbaum
Squeak News Team Leader

> -----Original Message-----
> From: Peter William Lount
>
> Rob Withers wrote:
> > [Rob's multi-threaded VM plan snipped]
>
> Hi Rob,
>
> Yeah, it's an awesome little book that cuts right to the chase. I
> particularly like that they show some solutions that won't work, as
> concurrency is quite difficult and sometimes you think it's correct
> when it isn't. It's good to learn about those pitfalls.
>
> Your plan sounds excellent. Thank you for taking up the task of making
> the Squeak VM multi-threaded with native threads!
>
> If you need anything...
>
> All the best,
>
> Peter
In reply to this post by timrowledge
tim Rowledge wrote:
> On 25-Oct-07, at 9:25 PM, Marcel Weiher wrote:
>>
>> - Montecito, the new dual-core Itanic, has 1.72 billion transistors.
>> - The ARM6 macrocell has around 35000 transistors.
>> - Divide the two, and you will find that you could get more ARM6 cores
>>   for the Montecito transistor budget than the ARM6 has transistors.
>
> Nicely pointed out, Marcel! I've been trying to make a similar point
> for about, oh, two decades now....
>
> [tim's 1000-ARM-cores-on-a-chip estimate snipped]

Hi,

That is essentially what Tilera is doing with their Tile-N processors (where N is 36, 64, 128, 1024, 4096, ...). They are shipping the Tile-64 chip now or shortly: http://www.Tilera.com. They have a design "kill rule" which states that if they increase the surface area by N%, the CPU performance must also increase by at least N%.

The Itanium, however, is an awesome processor in its own right, regardless of the number of transistors it's using. It has predicate registers, plus 128 64-bit integer registers and 128 floating-point registers. With that many registers, the arguments about not having enough registers can be put to bed. In fact, the register file is sort of like, but not quite like, the Sun SPARC processors'. It has instruction-level parallelism, which is good for a great many problems. Overall, a very interesting and powerful processor.

When it comes to transistor budgets, your analysis is correct... and may win the day in the marketplace. We'll see if Tilera or Intel will bring these internally networked grid chips to the mainstream market.

Peter
In reply to this post by Ron Teitelbaum
Hi,
It seems that the "patterns" of synchronization in "The Little Book of Semaphores" are just that: patterns. Like other patterns, they could be implemented as abstract and concrete classes, so that rather than having to rewrite the solutions all over each time, they are off the shelf and available for use.

A class library of synchronization using semaphores might help enable people to leverage multi-threading with N-core CPUs in Smalltalk (where N is greater than or equal to 1), using green threads, native threads, or both.

Just a thought.

Cheers,
Peter

"The Little Book of Semaphores"
http://www.greenteapress.com/semaphores/downey05semaphores.pdf
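As an illustration of what one entry in such a library might look like, here is one of the book's patterns, the reusable two-phase barrier, packaged as a class. The sketch is in Python rather than Smalltalk for brevity, built only from counting semaphores and a counter, as in the book.

```python
# Reusable two-phase barrier from "The Little Book of Semaphores",
# packaged as a library class. Python stand-in for a Smalltalk class.
import threading

class Barrier:
    def __init__(self, n):
        self.n = n
        self.count = 0
        self.mutex = threading.Semaphore(1)
        self.turnstile = threading.Semaphore(0)    # gate for phase 1
        self.turnstile2 = threading.Semaphore(0)   # gate for phase 2

    def phase1(self):
        with self.mutex:
            self.count += 1
            if self.count == self.n:               # last arrival opens gate 1
                for _ in range(self.n):
                    self.turnstile.release()
        self.turnstile.acquire()

    def phase2(self):
        with self.mutex:
            self.count -= 1
            if self.count == 0:                    # last to leave opens gate 2
                for _ in range(self.n):
                    self.turnstile2.release()
        self.turnstile2.acquire()

    def wait(self):
        self.phase1()
        self.phase2()

N = 4
barrier = Barrier(N)
arrived = []

def worker(i):
    arrived.append(i)
    barrier.wait()              # nobody proceeds until all N have arrived
    assert len(arrived) == N    # guaranteed by the barrier

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(arrived))  # [0, 1, 2, 3]
```

The second turnstile is what makes the barrier safely reusable in a loop, one of those subtle points the book is good at showing.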
In reply to this post by Jason Johnson-5
The Design and Implementation of ConcurrentSmalltalk
http://www.amazon.com/Implementation-Concurrent-Smalltalk-Computer-Science/dp/9810201125

From the Introduction: "In Concurrent Smalltalk, an object is not only a unit of data abstraction but also a unit of execution."

On 10/25/07, Jason Johnson <[hidden email]> wrote:
> On 10/24/07, Sebastian Sastre <[hidden email]> wrote:
> >
> > So I'm stating here that in a Smalltalk image of the future *every
> > object should have a process*. Every instance. All of them.
>
> That is an interesting idea. That would open a door to a new way of
> garbage collection, because it can then be tied to the exit of a
> process.
>
> > Having said that, I return to the problem you stated about the need to
> > copy, copy, copy: this premise changes things, and you don't need to
> > copy anymore, because a VM like that will guarantee that, no matter
> > who or when, whenever an instVar of an object is to be modified, the
> > write will be made by the process that corresponds to that instance.
>
> Yes, in such a system, you don't need to copy, because all that gets
> passed around are references to processes.
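The "every object has a process" idea, where all writes to an object happen in that object's own process and only references cross process boundaries, can be sketched as an active-object pattern. The ActiveObject class and its mailbox protocol below are illustrative assumptions, not ConcurrentSmalltalk's actual design.

```python
# Sketch of "every object has a process": each object owns a worker
# thread, and all reads/writes go through its mailbox, so callers pass
# references around and never copy or lock shared state themselves.
import queue
import threading

class ActiveObject:
    def __init__(self):
        self.mailbox = queue.Queue()
        self.state = {}
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        # This object's private process: the only thread touching state.
        while True:
            selector, args, reply = self.mailbox.get()
            reply.put(getattr(self, selector)(*args))

    def send(self, selector, *args):
        """Async send: queue a message, return a queue for the reply."""
        reply = queue.Queue()
        self.mailbox.put((selector, args, reply))
        return reply

    # "Methods" below run only in this object's own process:
    def at_put(self, key, value):
        self.state[key] = value
        return value

    def at(self, key):
        return self.state[key]

obj = ActiveObject()
obj.send("at_put", "x", 41)              # a write, serialized by the mailbox
result = obj.send("at", "x").get()       # only a reference crossed threads
print(result + 1)  # 42
```

Because the mailbox is FIFO and drained by a single thread, the write is guaranteed to land before the read, with no copying and no explicit locks in the caller.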
In reply to this post by pwl
----- Original Message ----- From: "Peter William Lount" <[hidden email]>
> Your plan sounds excellent. Thank you for taking up the task of making
> the Squeak VM multi-threaded with native threads!
>
> If you need anything...

I don't want to make it sound like I can't use some help, especially if it's offered. I can't do this alone. Forget it, especially with the day job. No, I figure it to be a 2-year task, at least. But I would rather build something than talk about all the theory.

I fleshed out the phases I posted earlier with what I thought were some more manageable tasks. I'd like to point out that Phase 3, implementing the multithreaded VM, is entirely independent of SqueakElib and would be usable by anyone wanting to do multithreading. Here's the new page; add what you like, help where you can, holler to talk it over:
http://wiki.squeak.org/squeak/6011

Cheers,
Rob
In reply to this post by pwl
On 10/26/07, Peter William Lount <[hidden email]> wrote:
>
> Your plan sounds excellent. Thank you for taking up the task of making
> the Squeak VM multi-threaded with native threads!

Yes, thanks. I will need a truly multi-threaded VM at some point as well. I just have to make it transparent to the processes running in the VM. :)
Jason Johnson wrote:
Hi,

> On 10/26/07, Peter William Lount <[hidden email]> wrote:
> > Your plan sounds excellent. Thank you for taking up the task of making
> > the Squeak VM multi-threaded with native threads!
>
> Yes, thanks. I will need a truly multi-threaded VM at some point as
> well. I just have to make it transparent to the processes running in
> the VM. :)

I really do like the notion of easy multi-threading, really. I've admired Erlang for what it's achieved in that regard for years now. I encourage everyone interested in that to keep persevering and searching for a practical way forward towards your vision.

All the best,

Peter
Well, what I plan to try out isn't the only way, and probably not the
best, but I think it's a baby step in the right direction. As Andreas pointed out, there are other solutions that may even be better from a high-level point of view (message passing still requires careful design).

I really believe shared-state concurrency with fine-grained locking can't scale much further than it already has. And I'm by no means the only one. Here is another thread on the matter:
http://lambda-the-ultimate.org/node/2048

On 10/27/07, Peter William Lount <[hidden email]> wrote:
> I really do like the notion of easy multi threading - really. I've
> admired Erlang for what it's achieved in that regard for years now. I
> encourage everyone interested in that to keep persevering and searching
> for a practical way forward towards your vision.
>
> All the best,
>
> Peter