On 10/29/07, Rob Withers <[hidden email]> wrote:
> This is what I am trying to do with SqueakElib. Any old object referenced
> in the system is an eventual local ref, but the system should handle
> promises or non-local refs anywhere.

Are you going to make the "vats" Smalltalk processes? Are you going to
isolate them from other Smalltalk processes in the same image? If so then
I will probably have some synergy with that.
In reply to this post by Andreas.Raab
On 10/29/07, Andreas Raab <[hidden email]> wrote:
> > Not "all messages sends". Only messages between concurrent entities > (islands). This is the main difference to the all-out actors model > (where each object is its own unit of concurrency) and has the advantage > that you can reuse all of todays single-threaded code. I hope you, or anyone else are not under the impression that I am pushing for an "all-out" actors model. I want actor sending to be as explicit as Croquet futures are. And this would mean the same; all of today's single-threaded code continues to work. > The similarity is striking. Both in terms of tradeoffs (trade low-level > control for better productivity) as well as the style of arguments made > against it ;-) I see the similarity! :) >Not that I mind by the way, I find these discussions > necessary. Ah good, I assumed you had kill-filed me for this one, if not before. :) |
In reply to this post by Igor Stasenko
On 10/30/07, Igor Stasenko <[hidden email]> wrote:
> which is _NOT_ concurrent
> computing anymore, simply because it's not using shared memory, and in
> fact there is no sharing at all, only a glimpse of it.

Huh? What does sharing have to do with concurrency? The one and only
thing shared state has to do with concurrency is the desire to speed
it up, i.e. a premature optimization. That's it.
In reply to this post by Igor Stasenko
On 10/30/07, Igor Stasenko <[hidden email]> wrote:
> Simply because it does not scale well. Consider 1000 and 1 Vats (1001
> tales comes to mind :).
> 1000 Vats send a message to the same object, which is scheduled in a
> single Vat. So there will be a HUGE difference in time between when the
> first sender and the last sender receive an answer.

Are you somehow under the impression that a shared-state solution would
actually scale *better*? Think about that. In the E solution, those 1000
vats basically post an event to that same object in the single vat...
then they go on about their business.

Shared state, on the other hand... if there is *any* code *anywhere* in
the image that can modify this object, then *all* access to the object
has to be synchronized [1]. This means that while the E code is chugging
away doing all kinds of things, your synchronized, native-thread code has
1000 processes all sitting in a queue waiting their turn to get the lock.
Of course you could spawn a new thread for every blocking access so you
don't have to wait, but then you'll just be where E already is, with much
higher resource usage.

[1] The exception here would be if the object is so simple that you can
*prove* that any writes to it are atomic. But you had better put a huge
flashing comment on it saying that if anyone adds *anything* they will
have to add synchronization in a *lot* of places, and will possibly
fundamentally alter the run-time performance of the program.
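To make the contrast concrete, here is a minimal sketch (the #future send
is Croquet-style; `counter`, `lock`, and #increment are invented for
illustration, with lock being a Semaphore forMutualExclusion):

    "Event-loop style: each of the 1000 vats merely enqueues an event
    in the counter's vat and continues immediately."
    counter future increment.

    "Shared-state style: each of the 1000 processes must acquire the
    lock before touching the object, so they serialize in its queue."
    lock critical: [counter increment].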
In reply to this post by Andreas.Raab
I wish you had gotten involved in this thread earlier on. I think you
explained everything better in this one message than I have in the whole
thread. :)

On 10/30/07, Andreas Raab <[hidden email]> wrote:
> Igor Stasenko wrote:
> > How would you define the boundaries of these entities in the same image?
>
> It is defined implicitly by the island in which a message executes. All
> objects created by the execution of a message are part of the island the
> computation occurs in.
>
> To create an object in another island you need to artificially "move the
> computation" there. That's why islands implement the #new: message, so
> that you can create an object in another island by moving the
> computation, for example:
>
>     space := island future new: TSpace.
>
> This will create an instance of TSpace in the target island. Once we
> have created the "root object", further messages that create objects
> will be inside that island, too. For example, take this method:
>
>     TSpace>>makeNewCube
>         "Create a new cube in this space"
>         cube := TCube new.
>         self addChild: cube.
>         ^cube
>
> and then:
>
>     cube := space future makeNewCube.
>
> Both cube and space will be in the same island.
>
> > Could you illustrate by some simple examples, or a strategy, how they
> > can be used for concurrent execution within a single VM?
>
> I'm confused about your use of the term "concurrent". Earlier you wrote
> "There is a BIG difference between concurrency (parallel execution with
> shared memory) and distributed computing." which seems to imply that you
> discount all means of concurrency that do not use shared memory. If that
> is really what you mean (which is clearly different from the usual
> meaning of the term concurrent) then indeed, there is no way for it to
> be "concurrent" because there simply is no shared mutable state between
> islands.
>
> > I'm very interested in practical usage of futures myself.
> > What will you do, or how would you avoid, the situation where two
> > different islands holding a reference to the same object in the VM
> > send direct messages to it, causing a race condition?
>
> The implementation of future message sending uses locks and mutexes. You
> might say "aha! so it *is* using locks and mutexes" but just as with
> automatic garbage collection (which uses pointers and pointer arithmetic
> and explicit freeing) it is simply a means to implement the higher-level
> semantics. And since no mutual/nested locks are required,
> deadlock-freeness can again be proven.
>
> > Yes. But this example is a significant one. Sometimes I want these
> > messages to run in parallel, sometimes I don't. Even for a single
> > 'island'.
>
> In the island model, this is not an option. The unit of concurrency is
> an island, period. If you want to run computations in parallel that
> share data, you either make the data immutable (which can enable sharing
> in some limited cases) or you copy the needed data to "worker islands".
> Basic load balancing.
>
> > Then, for a general solution we need these islands to be either very
> > small (the smallest being a single object) or to contain a big number
> > of objects. The question is how to give control of their sizes to the
> > developer. How can a developer define the boundaries of an island
> > within a single image?
>
> By sending messages. See above.
>
> > I will not accept any solutions like 'multiple images' because this
> > drives us into the distributed computing domain, which is _NOT_
> > concurrent computing anymore, simply because it's not using shared
> > memory, and in fact there is no sharing at all, only a glimpse of it.
>
> Again, you have a strange definition of the term concurrency. It does
> not (neither in general English nor in CS) require the use of shared
> memory. There are two main classes of concurrent systems, namely those
> relying on (mutable) shared memory and those relying on message passing
> (sometimes utilizing immutable shared memory for optimization purposes,
> because it's indistinguishable from copying). Erlang and E (and Croquet,
> as long as you use it "correctly") all fall into the latter category.
>
> >> This may be the outcome for an interim period. The good thing here is
> >> that you can *prove* that your program is deadlock-free simply by not
> >> using waits. And ain't that a nice property to have.
> >
> > you mean waits like this (consider the following two lines of code run
> > in parallel):
> >
> >     [a isUnlocked] whileFalse: []. b unlock.
> >
> > and
> >
> >     [b isUnlocked] whileFalse: []. a unlock.
>
> Just like in your previous example, this code is meaningless in Croquet.
> You are assuming that a and b can be sent synchronous messages and that
> those resolve while you are in the busy-loop. As I have pointed out
> earlier, this simply doesn't happen. Think of it this way: results are
> themselves communicated using future messages, e.g.,
>
>     Island>>invokeMessage: aMessage
>         "Invoke the message and post the result back to the sender island"
>         result := aMessage value. "compute result of the message"
>         aMessage promise future value: result. "resolve associated promise"
>
> so you cannot possibly wait for the response to a message you just
> scheduled. It is simply not possible, neither actively nor passively.
>
> > And how could you guarantee that any bit of code in the current ST
> > image does not contain such hidden locks - like loops or recursive
> > loops which will never return until some external entity changes the
> > state of some object(s)?
>
> No more than I can or have to guarantee that any particular bit of the
> Squeak library is free of infinite loops. All we need to guarantee is
> that we don't introduce new dependencies, which thanks to future
> messages and promises we can guarantee. So if the subsystem is
> deadlock-free before, it will stay so in our usage of it. If it's not
> then, well, broken code is broken code no matter how you look at it.
>
> >>> I pointed out that futures as an 'automatic lock-free' approach are
> >>> not quite parallel to 'automatic memory management by GC'.
> >>
> >> The similarity is striking. Both in terms of tradeoffs (trade low-level
> >> control for better productivity) as well as the style of arguments made
> >> against it ;-) Not that I mind by the way, I find these discussions
> >> necessary.
> >
> > The striking thing is that introducing GC does good things - removing
> > the necessity to care about memory, which helps a lot in development
> > and makes code clearer and smaller. But I can't see how futures do the
> > same. There are still lots of things for the developer to consider
> > even when using futures.
>
> The main advantages are increased robustness and productivity. We worry
> a *lot* about deadlocks since some of our usage of Croquet shows exactly
> the kind of "mixed usage" that you pointed out. But never, not once,
> have we had a deadlock, or even had to worry about one, in places where
> we used event-loop concurrency consistently.
> (Interesting aside:
> just today we had a very complex deadlock on one of our servers, and my
> knee-jerk reaction was to try to convert it to event-loop concurrency,
> because although we got stack traces we may not be able to completely
> figure out how the system ended up in that deadlock :-)
>
> We've gradually continued to move to event-loop concurrency more and
> more in many areas of our code, because the knowledge that this code
> will be deadlock-free allows us to concentrate on solving the problem at
> hand instead of figuring out the most unlikely occurrences that can
> cause deadlock - I suspect that I'll be faster rewriting the code from
> today as event-loops than figuring out what caused that deadlock and how
> to avoid it.
>
> And that is, in my understanding, the most important part - how many
> hours have you spent thinking about how exactly a highly concurrent
> system could possibly deadlock? What if you could spend this time on
> improving the system instead, knowing that deadlock *simply cannot
> happen*?
>
> Cheers,
>   - Andreas
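The practical upshot of Andreas's Island>>invokeMessage: example is that a
client never waits on an answer; it registers interest in the promise
instead. A sketch, assuming an E-style #whenResolved: callback (the exact
selector in Croquet may differ, and #paintRed is invented):

    "Nothing here blocks: the send returns a promise at once..."
    promise := space future makeNewCube.

    "...and the continuation runs later, back in the sender's island,
    once the result has been posted."
    promise whenResolved: [:cube | cube future paintRed].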
In reply to this post by Igor Stasenko
On 10/30/07, Igor Stasenko <[hidden email]> wrote:
> We really don't need to have more than a fixed number of threads in the
> VM (one for each core, and maybe one more for GC).

I'm totally on board with the idea of native threads being internal to
the VM, with client code not being aware of them. But my plan is to
provide a way to change how many threads that is, as one per CPU is not
always optimal. When Erlang first did this, their approach was to make a
scheduler 1-to-1 with a native thread. They seemed to hit max performance
at about 4 native threads per CPU core (if you want the reference I will
try to dig it up, but if your google-fu is strong, you should find it
fairly quickly).
In reply to this post by Igor Stasenko
Igor Stasenko wrote:
> I'd like to hear more critiques of such a model :) If it proves to be
> viable and more or less easily doable (compared to other models) then
> I could start working on it :)

Hi Igor, I like your approach.
The main problem I see is that a lot of methods in the current image are
not multi-process safe!
Imagine one SortedCollection: one Process iterates over it, another adds
to it. Even now, with a single-threaded image, you have to care! (see
http://bugs.squeak.org/view.php?id=6030)

Exactly the OrderedCollection counter-argument you served to Andreas!
Except Andreas knows he has to carefully choose shared state, and he also
has explicit futures and promises which he uses sparingly and can
identify easily.

Your users will need atomic operations.
Thus you have to introduce another state attached to the receiver object,
telling that you can read it concurrently but not put a write lock on it
(in fact, you have 3 states: read-locked <-> free <-> write-locked).

From a pragmatic POV, prepare to put atomic blocks everywhere before
having something usable! (maybe an <#atomic> pragma to be lighter).
You cannot simply state "that's a programmer problem, I provide the
framework". Bugs might occur from time to time, very hard-to-debug ones!
And your framework won't help much.

Besides, your users will have to deal with deadlock cases too...
We'd better start thinking about automatic deadlock detection...

Nicolas
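The three-state scheme Nicolas describes is essentially a readers-writer
lock. A minimal sketch built from standard Squeak semaphores (RWLock is a
hypothetical class, not something in the image; note that this naive
version can starve writers, which rather underlines his point about
hard-to-debug concurrency):

    Object subclass: #RWLock
        instanceVariableNames: 'readers mutex writeGate'
        classVariableNames: ''
        category: 'Concurrency-Sketch'

    RWLock>>initialize
        readers := 0.
        mutex := Semaphore forMutualExclusion.
        writeGate := Semaphore forMutualExclusion.

    RWLock>>readLocked: aBlock
        "Any number of readers may enter; the first one in bars writers."
        mutex critical: [
            readers := readers + 1.
            readers = 1 ifTrue: [writeGate wait]].
        ^aBlock ensure: [
            mutex critical: [
                readers := readers - 1.
                readers = 0 ifTrue: [writeGate signal]]]

    RWLock>>writeLocked: aBlock
        "One writer at a time, and only when no reader holds the lock."
        writeGate wait.
        ^aBlock ensure: [writeGate signal]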
On 30/10/2007, nicolas cellier <[hidden email]> wrote:
> Igor Stasenko wrote:
> > I'd like to hear more critiques of such a model :) If it proves to be
> > viable and more or less easily doable (compared to other models) then
> > I could start working on it :)
>
> Hi Igor, I like your approach.
> The main problem I see is that a lot of methods in the current image are
> not multi-process safe!
> Imagine one SortedCollection: one Process iterates over it, another adds
> to it. Even now, with a single-threaded image, you have to care! (see
> http://bugs.squeak.org/view.php?id=6030)
>
> Exactly the OrderedCollection counter-argument you served to Andreas!
> Except Andreas knows he has to carefully choose shared state, and he
> also has explicit futures and promises which he uses sparingly and can
> identify easily.
>
> Your users will need atomic operations.
> Thus you have to introduce another state attached to the receiver
> object, telling that you can read it concurrently but not put a write
> lock on it (in fact, you have 3 states: read-locked <-> free <->
> write-locked).
>
> From a pragmatic POV, prepare to put atomic blocks everywhere before
> having something usable! (maybe an <#atomic> pragma to be lighter).
> You cannot simply state "that's a programmer problem, I provide the
> framework". Bugs might occur from time to time, very hard-to-debug ones!
> And your framework won't help much.
>
> Besides, your users will have to deal with deadlock cases too...
> We'd better start thinking about automatic deadlock detection...

These are multithreading problems at the language side. There is no way
to deal with them at a low level (such as the VM). And if you read the
previous discussions, it was shown that there can't be a single generic
solution for all the problems which arise when we go into the parallel
world. Some solutions work best for some problems, but can be too
ineffective for others.
That's why I proposed not to bind VM parallelism to language parallelism.
We can't have multiple ways of doing concurrency in a single VM
implementation, simply because the complexity of such a system would be
paramount. So we must choose a single solution (be it good or bad :) ).
In the same way as you can currently have multiple ST processes running
in parallel, you will be able to in a future VM. The rest is in your own
hands. You are free to use mutexes/futures or anything else you like to
deal with concurrency.
A new VM simply should utilize CPU power better, so that as you get more
and more cores from year to year, your code runs faster and faster. Of
course this happens only if you are using algorithms which can divide
your task into a number of parallel sub-tasks.

> Nicolas

--
Best regards,
Igor Stasenko AKA sig.
In reply to this post by Jason Johnson-5
On 30/10/2007, Jason Johnson <[hidden email]> wrote:
> On 10/30/07, Igor Stasenko <[hidden email]> wrote:
> >
> > We really don't need to have more than a fixed number of threads in
> > the VM (one for each core, and maybe one more for GC).
>
> I'm totally on board with the idea of native threads being internal to
> the VM, and client code not being aware. But my plan is to provide a
> way to change how many threads that is, as one per CPU is not always
> optimal.

I agree on that. But such details can be discovered later.

> When Erlang first did this, their approach was to make a scheduler
> 1-to-1 with a native thread. They seemed to hit max performance at
> about 4 native threads per CPU core (if you want the reference I will
> try to dig it up, but if your google-fu is strong, you should find it
> fairly quickly).

Most of the reason why a CPU is not utilized at 100% is the use of
blocking I/O calls. So the simplest solution is to not use them, and
instead of blowing up the number of threads, use asynchronous I/O.
Most major platforms support asynchronous I/O, and there are many
libraries which support async data handling in almost every area we need.
We just need to build on top of them.

--
Best regards,
Igor Stasenko AKA sig.
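The shape Igor is advocating, sketched against an ordinary FileStream
(the async selector #readBytes:onCompletionDo: is invented for
illustration -- Squeak's real AsyncFile API differs):

    "Blocking style: the calling native thread is parked until the
    read completes."
    buffer := file next: 4096.

    "Async style: the read is merely issued; the process carries on and
    the callback fires on completion. No extra native thread needed."
    file readBytes: 4096 onCompletionDo: [:buffer | self process: buffer].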
In reply to this post by Jason Johnson-5
On 30/10/2007, Jason Johnson <[hidden email]> wrote:
> On 10/30/07, Igor Stasenko <[hidden email]> wrote:
> > which is _NOT_ concurrent
> > computing anymore, simply because it's not using shared memory, and in
> > fact there is no sharing at all, only a glimpse of it.
>
> Huh? What does sharing have to do with concurrency? The one and only
> thing shared state has to do with concurrency is the desire to speed
> it up, i.e. a premature optimization. That's it.

Look. Current multi-core architectures use shared memory. So the logical
way to utilize such an architecture at maximum power is to build on top
of it.
Anything else (such as share-nothing) introduces too much noise on such
architectures.

--
Best regards,
Igor Stasenko AKA sig.
Igor Stasenko wrote:
> Look. Current multi-core architectures use shared memory. So the logical
> way to utilize such an architecture at maximum power is to build on top
> of it.

That's like saying: "Look. A current multi-core architecture uses x86
instructions. So the logical way to utilize it is to write assembler
programs". It's neither logical nor true.

> Anything else (such as share-nothing) introduces too much noise on such
> architectures.

And if that statement were true, then Erlang shouldn't be able to benefit
from multiple cores. In fact, not even operating systems should be able
to, because running processes in their own address spaces is a
"share-nothing" approach.

More importantly, you are missing a major point in the comparison:
programmer effectiveness. The biggest cost factor for any software is
programmer hours. If you can get 80% of the efficiency by investing half
the resources, you have a powerful advantage already. If you put this
together with added robustness in a modern 24/7 software-as-a-service
environment, you've got a winner. In this sense it may be true that these
solutions "introduce noise" (I'm not really sure what you mean when
saying this, but whatever) but they more than make up for it in the
amount of time you spend actually solving the problem at hand.

Cheers,
  - Andreas
In reply to this post by Jason Johnson-5
I would like to mention some of my previous work in this area:
- tinySelf 1 (1996)
  http://www.lsi.usp.br/~jecel/tiny.html#rel1

  This was a Self interpreter written in Self which implemented the
  one-thread-per-object model. All messages were future messages, but
  since sending a message to an unresolved future would block, you would
  have deadlock on any recursion (direct or indirect). This problem was
  solved by detecting the cycles and preempting the blocked message with
  the one it depends on. This results in interleaved execution, but since
  the semantics are exactly the same as in a sequential execution of the
  recursive code, any bugs that appear won't be due to concurrency. I was
  able to test simple expressions and was very happy with how much
  parallelism I was able to extract from seemingly sequential code, but I
  made the mistake of introducing a significant optimization (tail send
  elimination) that made debugging so much harder that I was unable to
  finish in the two weeks I was able to dedicate to this project.

- 64-node Smalltalk machine (1992)
  http://www.lsi.usp.br/~jecel/ms8702.html

  The most interesting result in this project was the notion that most
  objects in the system are immutable at any given time, and that a
  security system might be used to detect this. For example, just because
  you can edit some font today doesn't mean that you will do it. And if
  you and everyone currently logged on to the local system only have read
  permission for that font, then it is effectively immutable. Only when
  the font's owner logs in is this assumption invalid. The advantage of
  knowing that an object is immutable is that you can replicate it and
  you can allow multiple threads to access it at the same time.

  The only paper in English from this project describes how adaptive
  compilation could be used to trim away excessive concurrency by
  transforming future message passing into sequential message passing
  (the semantics allow this) and then inlining it away. So if a machine
  has 64 processors and the application initially starts out with 10
  thousand threads, the compiler will eventually change this into code
  with 200 or so threads (some are blocked at any given instant, so going
  down to 64 threads would not be good).
  http://www.lsi.usp.br/~jecel/jabs1.html

- operating system in an Objective-C-like language (1988)
  http://www.lsi.usp.br/~jecel/atos.html
  (this page has download links but the text still hasn't been written)

  This operating system for 286 machines used the virtual memory of that
  hardware to isolate groups of objects, with one thread per group. This
  would be similar to the vat/island model. All messages were sent in
  exactly the same way: if the receiver was a local object then it was
  just a fancy subroutine call, but for remote objects you got a "segment
  not present" fault and the message was packed up and sent to the other
  task (possibly over the network). All messages were synchronous since I
  was not aware of futures at that time.

-- current model --

I moved back to the one-thread-per-object-group model since I feel that
makes it easier for programmers to control things without having to worry
too much about details most of the time. Since my target is children,
this is particularly important. An alternative that I experimented with
was having a separation between active and passive objects. A passive
object could be known only to a single active one, but it is just too
hard to program without ever accidentally letting references to passive
objects "leak".
With the group/vat/island model there is just one kind of object, and
things are simpler for the programmer (but more complicated for the
implementor). I have a limitation that you can only create new objects in
your own group or in an entirely new group - I think forcing some other
random group to create an object for you is rude, though of course you
can always ask an object there to please do it.

Some of the loaded groups are read/write, but many are read-only. The
latter don't actually have their own threads; instead their code executes
in the thread of the calling group. I have hardware support for this.

Speaking of hardware, I would like to stress how fantastically slow
(relatively speaking) main memory is these days. If I have a good network
connecting processor cores on a single chip, then I can probably send a
message from one to another, get a reply, send a second message and get
another reply in the time it takes to read a byte from external RAM. So
we should start thinking of DDR SDRAM as a really fast disk to swap
objects to/from, and not as a shared memory. We should start to take
message passing seriously.

-- Jecel
In reply to this post by Igor Stasenko
On Oct 30, 2007, at 3:28 PM, Igor Stasenko wrote:

> On 30/10/2007, Jason Johnson <[hidden email]> wrote:
>> On 10/30/07, Igor Stasenko <[hidden email]> wrote:
>>> which is _NOT_ concurrent
>>> computing anymore, simply because it's not using shared memory, and in
>>> fact there is no sharing at all, only a glimpse of it.
>>
>> Huh? What does sharing have to do with concurrency? The one and only
>> thing shared state has to do with concurrency is the desire to speed
>> it up, i.e. a premature optimization. That's it.
>>
> Look. Current multi-core architectures use shared memory. So the logical
> way to utilize such an architecture at maximum power is to build on top
> of it.
> Anything else (such as share-nothing) introduces too much noise on such
> architectures.

It is unreasonable to assume that ad-hoc, fine-grained sharing of objects
between processors will give you the fastest performance on the upcoming
machines with 100s and 1000s of cores. What about memory locality and
cache coherency? It is not cheap to juggle an object between processors
now, and it will become more expensive as the number of cores increases.

In a different email in the thread, you made it clear that you consider
distributed computing to be a fundamentally different beast from
concurrency. Intel's chip designers don't see it this way. In fact, they
explicitly formulate inter-core communication as a networking problem.
For example, see
http://www.intel.com/technology/itj/2007/v11i3/1-integration/4-on-die.htm
(I've seen better links in the past, but this is the best I could quickly
find now).

I think that your proposal is very "clever", elegant, and fun to think
about. But I don't see what real problem it solves. It doesn't help the
application programmer write correct programs (you delegate this
responsibility to the language/libraries). It doesn't make code run at
maximum speed, since it doesn't handle memory locality. In short, it
seems like too much work to do for such uncertain gains... I think that
we can get farther by examining some of our assumptions before we start,
and revising our goals accordingly.

Cheers,
Josh

> --
> Best regards,
> Igor Stasenko AKA sig.
In reply to this post by Andreas.Raab
On 31/10/2007, Andreas Raab <[hidden email]> wrote:
> Igor Stasenko wrote:
> > Look. Current multi-core architectures use shared memory. So the
> > logical way to utilize such an architecture at maximum power is to
> > build on top of it.
>
> That's like saying: "Look. A current multi-core architecture uses x86
> instructions. So the logical way to utilize it is to write assembler
> programs". It's neither logical nor true.

I'm not saying that it shouldn't be able to benefit. I'm just saying that
this has a considerable cost (noise). And my point is to select the model
which has the minimum noise for a given architecture (for what noise
means, read below).

> More importantly, you are missing a major point in the comparison:
> programmer effectiveness. The biggest cost factor for any software is
> programmer hours. If you can get 80% of the efficiency by investing
> half the resources, you have a powerful advantage already. If you put
> this together with added robustness in a modern 24/7
> software-as-a-service environment, you've got a winner. In this sense
> it may be true that these solutions "introduce noise" (I'm not really
> sure what you mean when saying this, but whatever) but they more than
> make up for it in the amount of time you spend actually solving the
> problem at hand.

By noise I mean this: imagine that to perform a simple sum of two
integers we need (for instance) to write the first integer to file A,
write the second to file B, then run an OS script which reads them,
computes the sum, and writes the result to file C. Then you read from
file C, convert the result from text, and finally get your result.
That's what I call noise :)

> Cheers,
>   - Andreas

--
Best regards,
Igor Stasenko AKA sig.
In reply to this post by Jason Johnson-5
----- Original Message -----
From: "Jason Johnson" <[hidden email]>
To: "The general-purpose Squeak developers list" <[hidden email]>
Sent: Tuesday, October 30, 2007 1:31 PM
Subject: Re: Concurrent Futures

> On 10/29/07, Rob Withers <[hidden email]> wrote:
>>
>> This is what I am trying to do with SqueakElib. Any old object
>> referenced in the system is an eventual local ref, but the system
>> should handle promises or non-local refs anywhere.
>
> Are you going to make the "vats" Smalltalk processes?

They already are.

> Are you going to isolate them from other Smalltalk processes in the
> same image? If so then I will probably have some synergy with that.

I am trying to ensure that objects in one Vat don't get directly
manipulated by objects in other Processes.

Rob
In reply to this post by Joshua Gargus-2
On 31/10/2007, Joshua Gargus <[hidden email]> wrote:
> On Oct 30, 2007, at 3:28 PM, Igor Stasenko wrote:
>
> > On 30/10/2007, Jason Johnson <[hidden email]> wrote:
> >> On 10/30/07, Igor Stasenko <[hidden email]> wrote:
> >>> which is _NOT_ concurrent
> >>> computing anymore, simply because it's not using shared memory, and
> >>> in fact there is no sharing at all, only a glimpse of it.
> >>
> >> Huh? What does sharing have to do with concurrency? The one and only
> >> thing shared state has to do with concurrency is the desire to speed
> >> it up, i.e. a premature optimization. That's it.
> >>
> > Look. Current multi-core architectures use shared memory. So the
> > logical way to utilize such an architecture at maximum power is to
> > build on top of it.
> > Anything else (such as share-nothing) introduces too much noise on
> > such architectures.
>
> It is unreasonable to assume that ad-hoc, fine-grained sharing of
> objects between processors will give you the fastest performance on
> the upcoming machines with 100s and 1000s of cores. What about
> memory locality and cache coherency? It is not cheap to juggle an
> object between processors now, and it will become more expensive as
> the number of cores increases.
>
> In a different email in the thread, you made it clear that you
> consider distributed computing to be a fundamentally different beast
> from concurrency. Intel's chip designers don't see it this way. In
> fact, they explicitly formulate inter-core communication as a
> networking problem. For example, see
> http://www.intel.com/technology/itj/2007/v11i3/1-integration/4-on-die.htm
> (I've seen better links in the past, but this is the best I could
> quickly find now).

Then I wonder why they don't drop the idea of having shared memory at
all? Each CPU could then have its own memory, and they could interact by
sending messages in a network-style fashion. And we would then write code
which uses such an architecture in the best way. But while this is not
the case, should we assume that such code will work faster than code
which 'knows' that there is a single shared memory for all CPUs and uses
that knowledge in the best way?

> I think that your proposal is very "clever", elegant, and fun to
> think about.

Thanks :)

> But I don't see what real problem it solves. It doesn't help the
> application programmer write correct programs (you delegate this
> responsibility to the language/libraries). It doesn't make code run at
> maximum speed, since it doesn't handle memory locality. In short, it
> seems like too much work to do for such uncertain gains... I think that
> we can get farther by examining some of our assumptions before we
> start, and revising our goals accordingly.

I thought the goals were pretty clear. We have a single image. And we
want to run multiple native threads upon it, to utilize all the cores of
multi-core CPUs.
What we currently have is a VM which can't do that. So, I think, anything
else, even naively implemented, which can, is better than nothing.
If you have any ideas how such a VM would look, I'm glad to hear them.

--
Best regards,
Igor Stasenko AKA sig.
On Oct 30, 2007, at 5:26 PM, Igor Stasenko wrote:

> On 31/10/2007, Joshua Gargus <[hidden email]> wrote:
>>
>> On Oct 30, 2007, at 3:28 PM, Igor Stasenko wrote:
>>
>>> On 30/10/2007, Jason Johnson <[hidden email]> wrote:
>>>> On 10/30/07, Igor Stasenko <[hidden email]> wrote:
>>>>> which is _NOT_ concurrent
>>>>> computing anymore, simply because it's not using shared memory, and
>>>>> in fact there is no sharing at all, only a glimpse of it.
>>>>
>>>> Huh? What does sharing have to do with concurrency? The one and only
>>>> thing shared state has to do with concurrency is the desire to speed
>>>> it up, i.e. a premature optimization. That's it.
>>>>
>>> Look. Current multi-core architectures use shared memory. So the
>>> logical way to utilize such an architecture at maximum power is to
>>> build on top of it.
>>> Anything else (such as share-nothing) introduces too much noise on
>>> such architectures.
>>
>> It is unreasonable to assume that ad-hoc, fine-grained sharing of
>> objects between processors will give you the fastest performance on
>> the upcoming machines with 100s and 1000s of cores. What about
>> memory locality and cache coherency? It is not cheap to juggle an
>> object between processors now, and it will become more expensive as
>> the number of cores increases.
>>
>> In a different email in the thread, you made it clear that you
>> consider distributed computing to be a fundamentally different beast
>> from concurrency. Intel's chip designers don't see it this way. In
>> fact, they explicitly formulate inter-core communication as a
>> networking problem. For example, see
>> http://www.intel.com/technology/itj/2007/v11i3/1-integration/4-on-die.htm
>> (I've seen better links in the past, but this is the best I could
>> quickly find now).
>>
> Then I wonder why they don't drop the idea of having shared memory
> at all?

It's convenient for programmers. Aside from the huge complexity of
programming everything this way, we might also have to program AMD chips
differently from Intel ones (at least until a standard emerged).

> Each CPU could then have its own memory, and they could interact by
> sending messages in a network-style fashion. And we would then write
> code which uses such an architecture in the best way. But while this is
> not the case, should we assume that such code will work faster than
> code which 'knows' that there is a single shared memory for all CPUs
> and uses that knowledge in the best way?

Could you restate this? I don't understand what you mean.

>> I think that your proposal is very "clever", elegant, and fun to
>> think about.
>
> Thanks :)

You're welcome :-)

>> But I don't see what real problem it solves. It doesn't help the
>> application programmer write correct programs (you delegate this
>> responsibility to the language/libraries). It doesn't make code run at
>> maximum speed, since it doesn't handle memory locality. In short, it
>> seems like too much work to do for such uncertain gains... I think
>> that we can get farther by examining some of our assumptions before we
>> start, and revising our goals accordingly.
>
> I thought the goals were pretty clear. We have a single image. And we
> want to run multiple native threads upon it, to utilize all the cores
> of multi-core CPUs.
> What we currently have is a VM which can't do that. So, I think,
> anything else, even naively implemented, which can, is better than
> nothing.
> If you have any ideas how such a VM would look, I'm glad to hear them.
Very briefly, because this is Andreas's idea (my original one was similar
but worse), and I think I convinced him to write it up. My take on it is
to rework the VM so that it can support multiple images in one process,
each with its own thread. Give each image an event-loop to process
messages from other images. Make it easy for an image to spawn another
image. A small Spoon-like image could be spawned in very few
milliseconds.

I like this approach because it is dead simple, and it keeps all objects
that can talk directly to one another on the same processor. Because of
its simplicity, it is a low-risk way to quickly get somewhere useful.
Once we have some experience with it, we can decide whether we need
finer-grained object access between threads (as I've said, I don't think
that we will).

Cheers,
Josh

> --
> Best regards,
> Igor Stasenko AKA sig.
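To make the shape of the idea concrete, spawning and talking to such an
image might look something like this from Smalltalk (ImageVat and its
selectors are invented for illustration; nothing like them exists today):

    "Fire off a minimal Spoon-like image in its own native thread."
    worker := ImageVat spawnFrom: 'spoon-minimal.image'.

    "Talk to it only through its event loop; no direct object
    references cross the boundary."
    worker postEvent: (Message selector: #renderFrame: argument: 42).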
On 31/10/2007, Joshua Gargus <[hidden email]> wrote:
> > Then I wonder why they don't drop the idea of having shared memory
> > at all?
>
> It's convenient for programmers. Aside from the huge complexity of
> programming everything this way, we might also have to program AMD
> chips differently from Intel ones (at least until a standard emerged).
>
> > Each CPU could then have its own memory, and they could interact by
> > sending messages in a network-style fashion. And we would then write
> > code which uses such an architecture in the best way. But while this
> > is not the case, should we assume that such code will work faster
> > than code which 'knows' that there is a single shared memory for all
> > CPUs and uses that knowledge in the best way?
>
> Could you restate this? I don't understand what you mean.

I don't know what to add to the above. I just said that we should use the
approaches which best fit the architecture our project(s) will run on.
Of course, what fits best is arguable. But I don't think we should drop
shared-memory model support when we're building a system on top of an
architecture which has it.

> >> I think that your proposal is very "clever", elegant, and fun to
> >> think about.
> > Thanks :)
>
> You're welcome :-)
>
> Very briefly, because this is Andreas's idea (my original one was
> similar but worse), and I think I convinced him to write it up. My take
> on it is to rework the VM so that it can support multiple images in one
> process, each with its own thread. Give each image an event-loop to
> process messages from other images. Make it easy for an image to spawn
> another image. A small Spoon-like image could be spawned in very few
> milliseconds.
>
> I like this approach because it is dead simple, and it keeps all
> objects that can talk directly to one another on the same processor.
> Because of its simplicity, it is a low-risk way to quickly get
> somewhere useful. Once we have some experience with it, we can decide
> whether we need finer-grained object access between threads (as I've
> said, I don't think that we will).

Ah, yes, this was mentioned before. And I like that idea in general
because of its simplicity, but I don't like the memory overhead, and
there are some other issues, like:
- persistence
- generic usage

Let's imagine that I run a Squeak, start a couple of servers, start a
Tetris game and play it. Then at some moment I feel the urgent need to
shut down my PC to insert an additional 64 cores into my CPU, to be able
to run two Tetrises instead of one. ;)
What is interesting now is how I can have a dead-simple 'save image(s)'
one-click action so that after I restart the system I will be able to
continue running the same set of images from the same point, as we have
in current single-image Squeak?
If we store each separate image in separate file(s), then soon there will
be thousands of them polluting the same directory, and I will have lost
track of what I really need to move my project to a different PC.
Or maybe we should merge multiple images into the same file? That is much
better. But what about the .changes file?

About generic usage: small, ant-like images are good when you deal with
small tasks and the job of each ant is simple and monotonic.
But what about real-world applications such as CAD systems, or business
applications, which can have a couple hundred megabytes of image size?
Spawning multiple images of such size is a waste of resources.
OK, then let's suppose we can have a single 'queen' image and smaller
'ant' images. But now we need a sophisticated framework for coordinating
their moves.
And it is also clear that most of the workload will still be in the queen
image (because most of the objects are located there), which means that
it doesn't scale well.

--
Best regards,
Igor Stasenko AKA sig.
In reply to this post by Igor Stasenko
Igor Stasenko wrote:
> If you have any ideas how such a VM would look, I'm glad to hear them.

Okay, so Josh convinced me to write up the ideas. The main problem, as I
see it, with a *practical* solution to the problem is that all of the
solutions so far require huge leaps and can't be implemented step-by-step
(which almost certainly dooms them to failure).

So what do we know and what do we actually all pretty much agree on? It's
that we need to be able to utilize multiple cores and that we need a
practical way to get there (if you disagree with the latter, this message
is not meant for you ;-)

Running multiple processes is one option, but it is not always
sufficient. For example, some OSes would have trouble firing off a couple
of thousand processes, whereas the same OS may have no problem at all
with a couple of thousand threads in one process. To give an example,
starting a thread on Windows costs somewhere in the range of a
millisecond, which is admittedly slow, but still orders of magnitude
faster than creating a new process. Then there are issues with resource
sharing (like file handles) which are practically guaranteed not to work
across process boundaries, etc. So while there are perfectly good reasons
to run multiple processes, there are reasons just as good for wanting to
run multiple threads in one process.

The question then is, can we find an easy way to extend the Squeak VM to
run multiple threads, and if so, how? Given the simplistic nature of the
Squeak interpreter, there is actually very little global state that is
not encapsulated in objects on the Squeak heap - basically all the
variables in class Interpreter. So if we were to put them into state that
is local to each thread, we could trivially run multiple instances of the
bytecode interpreter in the same VM. This gets us to the two major
questions:

* How do we encapsulate the interpreter state?
* How do we deal with primitives and plugins?

Let's start with the first one. Obviously, the answer is "make it an
object". The way I would go about it is by modifying the CCodeGenerator
such that it generates all functions with an argument of type "struct
VM", prefixes variable accesses properly, and passes the extra argument
along in all function calls. In short, what used to be translated as:

    sqInt primitiveAdd(void) {
        integerResult = stackIntegerValue(1) + stackIntegerValue(0);
        /* etc. */
    }

will then become something like:

    sqInt primitiveAdd(struct VM *vm) {
        integerResult = stackIntegerValue(vm, 1) + stackIntegerValue(vm, 0);
        /* etc. */
    }

This is a *purely* mechanical step that can be done independently of
anything else. It should be possible to generate code that is entirely
equivalent to today's code, and with a bit of tweaking it should be
possible to make that code roughly as fast as what we have today (not
that I think it matters, but understanding the speed difference between
this and the default interpreter is important for judging relative speed
improvements later).

The above takes care of the interpreter, but there are still primitives
and plugins that need to be dealt with. What I would do here is define
operations like ioLock(struct VM) and ioUnlock(struct VM) that are the
effective equivalent of Python's GIL (global interpreter lock) and allow
exclusive access to primitives that have not been converted to
multi-threading yet. How exactly this conversion should happen is
deliberately left open here; maybe changing the VM's major proxy version
is the right thing to do to indicate the changed semantics.
In any case, the GIL allows us to readily reuse all existing plugins
without having to worry about conversion early on.

So now we've taken care of the two major parts of Squeak: we have the
ability to run new interpreters and we have the ability to use
primitives. This is when the fun begins, because at this point we have
options: for example, if you are into shared-state concurrency, you might
implement a primitive that forks a new instance of the interpreter
running in the same object memory that your previous interpreter is
running in.

Or, and that would be the path that I would take, implement a primitive
that loads an image into a new object memory (I can explain in more
detail how memory allocation needs to work for that; it is a fairly
straightforward scheme, but a little too long for this message) and run
that interpreter. And at this point, the *real* fun begins, because we
can now start to define the communication patterns we'd like to use
(initially sockets, later shared memory or event queues or whatever
else). We can have tiny worker images that only do minimal stuff, but we
can also do a Spoon-like thing where we have a "master image" that
contains all the code possibly needed and fire off micro-images that (via
imprinting) swap in just the code they need to run.

[Whoa! I just got interrupted by a little 5.6 quake some 50 miles away]

Sorry, but I lost my train of thought here. Happens at 5.6 Richter ;-)

Anyway, the main thing I'm trying to say in the above is that for a
*practical* solution to the problem there are some steps that are pretty
much required whichever way you look at it. And I think that regardless
of your interest in shared-state or message-passing concurrency, we may
be able to define a road that leads to interesting experiments without
sacrificing the practical artifact. A VM built as described above would
be strictly a superset of the current VM, so it would be able to run any
current images and leave room for further experiments.

Cheers,
  - Andreas
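For concreteness, the image-side view of the two primitives Andreas
sketches might look something like this (VMThread and all of its
selectors are invented; nothing like this exists in any current VM):

    "Option 1: fork another interpreter over the *same* object memory
    (shared-state concurrency)."
    thread := VMThread forkInterpreter.

    "Option 2: load a second image into a fresh object memory, run it
    in its own native thread, and talk to it over a socket at first."
    worker := VMThread spawnImage: 'worker.image'.
    socket := Socket newTCP.
    socket connectTo: NetNameResolver localHostAddress port: worker servicePort.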
In reply to this post by Igor Stasenko
On Oct 30, 2007, at 7:56 PM, Igor Stasenko wrote:

> On 31/10/2007, Joshua Gargus <[hidden email]> wrote:
>>
>>> Then I wonder why they don't drop the idea of having shared memory
>>> at all?
>>
>> It's convenient for programmers. Aside from the huge complexity of
>> programming everything this way, we might also have to program AMD
>> chips differently from Intel ones (at least until a standard emerged).
>>
>>> Each CPU could then have its own memory, and they could interact by
>>> sending messages in a network-style fashion. And we would then write
>>> code which uses such an architecture in the best way. But while this
>>> is not the case, should we assume that such code will work faster
>>> than code which 'knows' that there is a single shared memory for all
>>> CPUs and uses that knowledge in the best way?
>>
>> Could you restate this? I don't understand what you mean.
>>
> I don't know what to add to the above. I just said that we should use
> the approaches which best fit the architecture our project(s) will run
> on.
> Of course, what fits best is arguable. But I don't think we should drop
> shared-memory model support when we're building a system on top of an
> architecture which has it.

OK, this is basically what I thought you meant, but I wasn't sure.

<snip>

> Let's imagine that I run a Squeak, start a couple of servers, start a
> Tetris game and play it. Then at some moment I feel the urgent need to
> shut down my PC to insert an additional 64 cores into my CPU, to be
> able to run two Tetrises instead of one. ;)

Only 2 Tetrises? I hope we can do better than that :-)

> What is interesting now is how I can have a dead-simple 'save image(s)'
> one-click action so that after I restart the system I will be able to
> continue running the same set of images from the same point, as we have
> in current single-image Squeak?
> If we store each separate image in separate file(s), then soon there
> will be thousands of them polluting the same directory, and I will have
> lost track of what I really need to move my project to a different PC.
> Or maybe we should merge multiple images into the same file? That is
> much better. But what about the .changes file?

Or source code control in general. Good point. There are a number of
options. For simplicity, I'd say to connect a single blessed development
image to the single changes file. Another would be to use a
Monticello-like approach (although there are still things that changesets
are needed for, and we don't want to bring in all sorts of new
fundamental dependencies). However, these are the first thoughts that
popped into my head... you seem to have thought about this more than I
have.

Re: persistence in general... I used to do all of my thinking in a Squeak
image, using morphs and projects. I grew tired of the breakage of my
media when I updated my code to the latest version, or of screwing up my
image in some way. I much prefer the way that we deal with data in
Croquet, using a separate serialization format instead of the Squeak
image format. So for me, it's acceptable for many of the images to be
transient... created on startup and discarded ("garbage-collected" :-) )
at shutdown.

(BTW, I hope you'll excuse me for continually saying "we" even though
there's no chance I'll have time to work on it with you all... I'll have
to be content with cheering from the sidelines :-( )

> About generic usage: small, ant-like images are good when you deal with
> small tasks and the job of each ant is simple and monotonic.
> But what about real-world applications such as CAD systems, or business
> applications, which can have a couple hundred megabytes of image size?
> Spawning multiple images of such size is a waste of resources.

Each image can be as big as it needs to be. We don't need to spawn off
identical clones of a single image.

I don't have a lot of first-hand experience with business systems. I'm
picturing large tables of relational data... I'm not sure how I would
approach this with the model I've described. Honestly, it's difficult for
me to think about this stuff outside of the context of Croquet. There
might be use cases that aren't addressed well. But my instinct is that if
the model is good enough for a world-wide network of interconnected,
collaborative 3D spaces (and I believe that it is), then it is good
enough for most anything you'd want to do with it.

> OK, then let's suppose we can have a single 'queen' image and smaller
> 'ant' images. But now we need a sophisticated framework for
> coordinating their moves.

If the application is that sophisticated, it will need a sophisticated
framework for coordinating concurrency anyway. No way around it. It
doesn't matter if it is one queen and many ants, or a heterogeneous mix
of islands.

Cheers,
Josh

> And it is also clear that most of the workload will still be in the
> queen image (because most of the objects are located there), which
> means that it doesn't scale well.
>
> --
> Best regards,
> Igor Stasenko AKA sig.