On 10/29/07, Rob Withers <[hidden email]> wrote:
> This is what I am trying to do with SqueakElib. Any old object referenced
> in the system is an eventual local ref, but the system should handle
> promises or non-local refs anywhere.

Are you going to make the "vats" Smalltalk processes? Are you going to
isolate them from other Smalltalk processes in the same image? If so then
I will probably have some synergy with that.
In reply to this post by Andreas.Raab
On 10/29/07, Andreas Raab <[hidden email]> wrote:
> > Not "all messages sends". Only messages between concurrent entities > (islands). This is the main difference to the all-out actors model > (where each object is its own unit of concurrency) and has the advantage > that you can reuse all of todays single-threaded code. I hope you, or anyone else are not under the impression that I am pushing for an "all-out" actors model. I want actor sending to be as explicit as Croquet futures are. And this would mean the same; all of today's single-threaded code continues to work. > The similarity is striking. Both in terms of tradeoffs (trade low-level > control for better productivity) as well as the style of arguments made > against it ;-) I see the similarity! :) >Not that I mind by the way, I find these discussions > necessary. Ah good, I assumed you had kill-filed me for this one, if not before. :) |
In reply to this post by Igor Stasenko
On 10/30/07, Igor Stasenko <[hidden email]> wrote:
> which is _NOT_ concurrent
> computing anymore, simply because it's not using shared memory, and in
> fact there is no sharing at all, only a glimpse of it.

Huh? What does sharing have to do with concurrency? The one and only
thing shared state has to do with concurrency is the desire to speed
it up, i.e. a premature optimization. That's it.
In reply to this post by Igor Stasenko
On 10/30/07, Igor Stasenko <[hidden email]> wrote:
> Simply because it does not scale well. Consider 1000 and 1 Vats (1001
> tales comes to mind :).
> 1000 Vats send a message to the same object, which is scheduled in a
> single Vat. So there will be a HUGE difference in time between when the
> first sender and the last sender receive an answer.

Are you somehow under the impression that a shared-state solution would
actually scale *better*? Think about that. In the E solution, those 1000
vats basically post an event to that same object in the single vat...
then they go on about their business.

Shared state, on the other hand... if there is *any* code *anywhere* in
the image that can modify this object, then *all* access to the object
has to be synchronized [1]. This means that while the E code is chugging
away doing all kinds of things, your synchronized, native-thread code has
1000 processes all sitting in a queue waiting their turn to get the lock.
Of course you could spawn a new thread for every blocking access so you
don't have to wait, but then you'll just be where E already is, with much
higher resource usage.

[1] The exception here would be if the object is so simple that you can
*prove* that any writes to it are atomic. But you had better put a huge
flashing comment on it saying that if anyone adds *anything* they will
have to add synchronization in a *lot* of places, and will possibly
fundamentally alter the run-time performance of the program.
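To make the contrast concrete, here is a minimal sketch (the #future send
is Croquet-style; `counter`, `lock`, and #increment are invented for
illustration, with lock being a Semaphore forMutualExclusion):

    "Event-loop style: each of the 1000 vats merely enqueues an event
    in the counter's vat and continues immediately."
    counter future increment.

    "Shared-state style: each of the 1000 processes must acquire the
    lock before touching the object, so they serialize in its queue."
    lock critical: [counter increment].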
In reply to this post by Andreas.Raab
I wish you had gotten involved in this thread earlier on. I think you
explained everything better in this one message than I have in the whole
thread. :)

On 10/30/07, Andreas Raab <[hidden email]> wrote:
> Igor Stasenko wrote:
> > How would you define the boundaries of these entities in the same image?
>
> It is defined implicitly by the island in which a message executes. All
> objects created by the execution of a message are part of the island the
> computation occurs in.
>
> To create an object in another island you need to artificially "move the
> computation" there. That's why islands implement the #new: message, so
> that you can create an object in another island by moving the
> computation, for example:
>
>     space := island future new: TSpace.
>
> This will create an instance of TSpace in the target island. Once we
> have created the "root object", further messages that create objects
> will be inside that island, too. For example, take this method:
>
>     TSpace>>makeNewCube
>         "Create a new cube in this space"
>         cube := TCube new.
>         self addChild: cube.
>         ^cube
>
> and then:
>
>     cube := space future makeNewCube.
>
> Both cube and space will be in the same island.
>
> > Could you illustrate by some simple examples, or a strategy, how they
> > can be used for concurrent execution within a single VM?
>
> I'm confused about your use of the term "concurrent". Earlier you wrote
> "There is a BIG difference between concurrency (parallel execution with
> shared memory) and distributed computing." which seems to imply that you
> discount all means of concurrency that do not use shared memory. If that
> is really what you mean (which is clearly different from the usual
> meaning of the term concurrent) then indeed, there is no way for it to
> be "concurrent" because there simply is no shared mutable state between
> islands.
>
> > I'm very interested in practical usage of futures myself.
> > What will you do, or how would you avoid, the situation where two
> > different islands holding a reference to the same object in the VM
> > send direct messages to it, causing a race condition?
>
> The implementation of future message sending uses locks and mutexes. You
> might say "aha! so it *is* using locks and mutexes" but just as with
> automatic garbage collection (which uses pointers and pointer arithmetic
> and explicit freeing) it is simply a means to implement the higher-level
> semantics. And since no mutual/nested locks are required,
> deadlock-freeness can again be proven.
>
> > Yes. But this example is a significant one. Sometimes I want these
> > messages to run in parallel, sometimes I don't. Even for a single
> > 'island'.
>
> In the island model, this is not an option. The unit of concurrency is
> an island, period. If you want to run computations in parallel that
> share data, you either make the data immutable (which can enable sharing
> in some limited cases) or you copy the needed data to "worker islands".
> Basic load balancing.
>
> > Then, for a general solution we need these islands to be either very
> > small (the smallest being a single object) or to contain a big number
> > of objects. The question is how to give control of their sizes to the
> > developer. How can a developer define the boundaries of an island
> > within a single image?
>
> By sending messages. See above.
>
> > I will not accept any solutions like 'multiple images' because this
> > drives us into the distributed computing domain, which is _NOT_
> > concurrent computing anymore, simply because it's not using shared
> > memory, and in fact there is no sharing at all, only a glimpse of it.
>
> Again, you have a strange definition of the term concurrency. It does
> not (neither in general English nor in CS) require the use of shared
> memory. There are two main classes of concurrent systems, namely those
> relying on (mutable) shared memory and those relying on message passing
> (sometimes utilizing immutable shared memory for optimization purposes,
> because it's indistinguishable from copying). Erlang and E (and Croquet,
> as long as you use it "correctly") all fall into the latter category.
>
> >> This may be the outcome for an interim period. The good thing here is
> >> that you can *prove* that your program is deadlock-free simply by not
> >> using waits. And ain't that a nice property to have.
> >
> > you mean waits like this (consider the following two lines of code run
> > in parallel):
> >
> >     [a isUnlocked] whileFalse: []. b unlock.
> >
> > and
> >
> >     [b isUnlocked] whileFalse: []. a unlock.
>
> Just like in your previous example, this code is meaningless in Croquet.
> You are assuming that a and b can be sent synchronous messages and that
> those resolve while you are in the busy-loop. As I have pointed out
> earlier, this simply doesn't happen. Think of it this way: results are
> themselves communicated using future messages, e.g.,
>
>     Island>>invokeMessage: aMessage
>         "Invoke the message and post the result back to the sender island"
>         result := aMessage value. "compute result of the message"
>         aMessage promise future value: result. "resolve associated promise"
>
> so you cannot possibly wait for the response to a message you just
> scheduled. It is simply not possible, neither actively nor passively.
>
> > And how could you guarantee that any bit of code in the current ST
> > image does not contain such hidden locks - like loops or recursive
> > loops which will never return until some external entity changes the
> > state of some object(s)?
>
> No more than I can or have to guarantee that any particular bit of the
> Squeak library is free of infinite loops. All we need to guarantee is
> that we don't introduce new dependencies, which thanks to future
> messages and promises we can guarantee. So if the subsystem is
> deadlock-free before, it will stay so in our usage of it. If it's not
> then, well, broken code is broken code no matter how you look at it.
>
> >>> I pointed out that futures as an 'automatic lock-free' approach are
> >>> not quite parallel to 'automatic memory management by GC'.
> >>
> >> The similarity is striking. Both in terms of tradeoffs (trade low-level
> >> control for better productivity) as well as the style of arguments made
> >> against it ;-) Not that I mind by the way, I find these discussions
> >> necessary.
> >
> > The striking thing is that introducing GC does good things - removing
> > the necessity to care about memory, which helps a lot in development
> > and makes code clearer and smaller. But I can't see how futures do the
> > same. There are still lots of things for the developer to consider
> > even when using futures.
>
> The main advantages are increased robustness and productivity. We worry
> a *lot* about deadlocks since some of our usage of Croquet shows exactly
> the kind of "mixed usage" that you pointed out. But never, not once,
> have we had a deadlock, or even had to worry about one, in places where
> we used event-loop concurrency consistently.
> (Interesting aside:
> just today we had a very complex deadlock on one of our servers, and my
> knee-jerk reaction was to try to convert it to event-loop concurrency,
> because although we got stack traces we may not be able to completely
> figure out how the system ended up in that deadlock :-)
>
> We've gradually continued to move to event-loop concurrency more and
> more in many areas of our code, because the knowledge that this code
> will be deadlock-free allows us to concentrate on solving the problem at
> hand instead of figuring out the most unlikely occurrences that can
> cause deadlock - I suspect that I'll be faster rewriting the code from
> today as event-loops than figuring out what caused that deadlock and how
> to avoid it.
>
> And that is, in my understanding, the most important part - how many
> hours have you spent thinking about how exactly a highly concurrent
> system could possibly deadlock? What if you could spend this time on
> improving the system instead, knowing that deadlock *simply cannot
> happen*?
>
> Cheers,
>   - Andreas
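The practical upshot of Andreas's Island>>invokeMessage: example is that a
client never waits on an answer; it registers interest in the promise
instead. A sketch, assuming an E-style #whenResolved: callback (the exact
selector in Croquet may differ, and #paintRed is invented):

    "Nothing here blocks: the send returns a promise at once..."
    promise := space future makeNewCube.

    "...and the continuation runs later, back in the sender's island,
    once the result has been posted."
    promise whenResolved: [:cube | cube future paintRed].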
In reply to this post by Igor Stasenko
On 10/30/07, Igor Stasenko <[hidden email]> wrote:
> We really don't need to have more than a fixed number of threads in the
> VM (one for each core, and maybe one more for GC).

I'm totally on board with the idea of native threads being internal to
the VM, with client code not being aware of them. But my plan is to
provide a way to change how many threads that is, as one per CPU is not
always optimal. When Erlang first did this, their approach was to make a
scheduler 1-to-1 with a native thread. They seemed to hit max performance
at about 4 native threads per CPU core (if you want the reference I will
try to dig it up, but if your google-fu is strong, you should find it
fairly quickly).
In reply to this post by Igor Stasenko
Igor Stasenko wrote:
> I'd like to hear more critiques of such a model :) If it proves to be
> viable and more or less easily doable (compared to other models) then
> I could start working on it :)

Hi Igor, I like your approach.
The main problem I see is that a lot of methods in the current image are
not multi-process safe!
Imagine one SortedCollection: one Process iterates over it, another adds
to it. Even now, with a single-threaded image, you have to care! (see
http://bugs.squeak.org/view.php?id=6030)

Exactly the OrderedCollection counter-argument you served to Andreas!
Except Andreas knows he has to carefully choose shared state, and he also
has explicit futures and promises which he uses sparingly and can
identify easily.

Your users will need atomic operations.
Thus you have to introduce another state attached to the receiver object,
telling that you can read it concurrently but not put a write lock on it
(in fact, you have 3 states: read-locked <-> free <-> write-locked).

From a pragmatic POV, prepare to put atomic blocks everywhere before
having something usable! (maybe an <#atomic> pragma to be lighter).
You cannot simply state "that's a programmer problem, I provide the
framework". Bugs might occur from time to time, very hard-to-debug ones!
And your framework won't help much.

Besides, your users will have to deal with deadlock cases too...
We'd better start thinking about automatic deadlock detection...

Nicolas
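The three-state scheme Nicolas describes is essentially a readers-writer
lock. A minimal sketch built from standard Squeak semaphores (RWLock is a
hypothetical class, not something in the image; note that this naive
version can starve writers, which rather underlines his point about
hard-to-debug concurrency):

    Object subclass: #RWLock
        instanceVariableNames: 'readers mutex writeGate'
        classVariableNames: ''
        category: 'Concurrency-Sketch'

    RWLock>>initialize
        readers := 0.
        mutex := Semaphore forMutualExclusion.
        writeGate := Semaphore forMutualExclusion.

    RWLock>>readLocked: aBlock
        "Any number of readers may enter; the first one in bars writers."
        mutex critical: [
            readers := readers + 1.
            readers = 1 ifTrue: [writeGate wait]].
        ^aBlock ensure: [
            mutex critical: [
                readers := readers - 1.
                readers = 0 ifTrue: [writeGate signal]]]

    RWLock>>writeLocked: aBlock
        "One writer at a time, and only when no reader holds the lock."
        writeGate wait.
        ^aBlock ensure: [writeGate signal]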
On 30/10/2007, nicolas cellier <[hidden email]> wrote:
> Igor Stasenko wrote:
> > I'd like to hear more critiques of such a model :) If it proves to be
> > viable and more or less easily doable (compared to other models) then
> > I could start working on it :)
>
> Hi Igor, I like your approach.
> The main problem I see is that a lot of methods in the current image are
> not multi-process safe!
> Imagine one SortedCollection: one Process iterates over it, another adds
> to it. Even now, with a single-threaded image, you have to care! (see
> http://bugs.squeak.org/view.php?id=6030)
>
> Exactly the OrderedCollection counter-argument you served to Andreas!
> Except Andreas knows he has to carefully choose shared state, and he
> also has explicit futures and promises which he uses sparingly and can
> identify easily.
>
> Your users will need atomic operations.
> Thus you have to introduce another state attached to the receiver
> object, telling that you can read it concurrently but not put a write
> lock on it (in fact, you have 3 states: read-locked <-> free <->
> write-locked).
>
> From a pragmatic POV, prepare to put atomic blocks everywhere before
> having something usable! (maybe an <#atomic> pragma to be lighter).
> You cannot simply state "that's a programmer problem, I provide the
> framework". Bugs might occur from time to time, very hard-to-debug ones!
> And your framework won't help much.
>
> Besides, your users will have to deal with deadlock cases too...
> We'd better start thinking about automatic deadlock detection...

These are multithreading problems at the language side. There is no way
to deal with them at a low level (such as the VM). And if you read the
previous discussions, it was shown that there can't be a single generic
solution for all the problems which arise when we go into the parallel
world. Some solutions work best for some problems, but can be too
ineffective for others.
That's why I proposed not to bind VM parallelism to language parallelism.
We can't have multiple ways of doing concurrency in a single VM
implementation, simply because the complexity of such a system would be
paramount. So we must choose a single solution (be it good or bad :) ).
In the same way as you can currently have multiple ST processes running
in parallel, you will be able to in a future VM. The rest is in your own
hands. You are free to use mutexes/futures or anything else you like to
deal with concurrency.
A new VM simply should utilize CPU power better, so that as you get more
and more cores from year to year, your code runs faster and faster. Of
course this happens only if you are using algorithms which can divide
your task into a number of parallel sub-tasks.

> Nicolas

--
Best regards,
Igor Stasenko AKA sig.
In reply to this post by Jason Johnson-5
On 30/10/2007, Jason Johnson <[hidden email]> wrote:
> On 10/30/07, Igor Stasenko <[hidden email]> wrote:
> >
> > We really don't need to have more than a fixed number of threads in
> > the VM (one for each core, and maybe one more for GC).
>
> I'm totally on board with the idea of native threads being internal to
> the VM, and client code not being aware. But my plan is to provide a
> way to change how many threads that is, as one per CPU is not always
> optimal.

I agree on that. But such details can be discovered later.

> When Erlang first did this, their approach was to make a scheduler
> 1-to-1 with a native thread. They seemed to hit max performance at
> about 4 native threads per CPU core (if you want the reference I will
> try to dig it up, but if your google-fu is strong, you should find it
> fairly quickly).

Most of the reason why a CPU is not utilized at 100% is the use of
blocking I/O calls. So the simplest solution is to not use them, and
instead of blowing up the number of threads, use asynchronous I/O.
Most major platforms support asynchronous I/O, and there are many
libraries which support async data handling in almost every area we need.
We just need to build on top of them.

--
Best regards,
Igor Stasenko AKA sig.
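The shape Igor is advocating, sketched against an ordinary FileStream
(the async selector #readBytes:onCompletionDo: is invented for
illustration -- Squeak's real AsyncFile API differs):

    "Blocking style: the calling native thread is parked until the
    read completes."
    buffer := file next: 4096.

    "Async style: the read is merely issued; the process carries on and
    the callback fires on completion. No extra native thread needed."
    file readBytes: 4096 onCompletionDo: [:buffer | self process: buffer].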
In reply to this post by Jason Johnson-5
On 30/10/2007, Jason Johnson <[hidden email]> wrote:
> On 10/30/07, Igor Stasenko <[hidden email]> wrote:
> > which is _NOT_ concurrent
> > computing anymore, simply because it's not using shared memory, and in
> > fact there is no sharing at all, only a glimpse of it.
>
> Huh? What does sharing have to do with concurrency? The one and only
> thing shared state has to do with concurrency is the desire to speed
> it up, i.e. a premature optimization. That's it.

Look. Current multi-core architectures use shared memory. So the logical
way to utilize such an architecture at maximum power is to build on top
of it.
Anything else (such as share-nothing) introduces too much noise on such
architectures.

--
Best regards,
Igor Stasenko AKA sig.
Igor Stasenko wrote:
> Look. Current multi-core architectures use shared memory. So the logical
> way to utilize such an architecture at maximum power is to build on top
> of it.

That's like saying: "Look. A current multi-core architecture uses x86
instructions. So the logical way to utilize it is to write assembler
programs". It's neither logical nor true.

> Anything else (such as share-nothing) introduces too much noise on such
> architectures.

And if that statement were true, then Erlang shouldn't be able to benefit
from multiple cores. In fact, not even operating systems should be able
to, because running processes in their own address spaces is a
"share-nothing" approach.

More importantly, you are missing a major point in the comparison:
programmer effectiveness. The biggest cost factor for any software is
programmer hours. If you can get 80% of the efficiency by investing half
the resources, you have a powerful advantage already. If you put this
together with added robustness in a modern 24/7 software-as-a-service
environment, you've got a winner. In this sense it may be true that these
solutions "introduce noise" (I'm not really sure what you mean when
saying this, but whatever) but they more than make up for it in the
amount of time you spend actually solving the problem at hand.

Cheers,
  - Andreas
In reply to this post by Jason Johnson-5
I would like to mention some of my previous work in this area:
- tinySelf 1 (1996)
  http://www.lsi.usp.br/~jecel/tiny.html#rel1

  This was a Self interpreter written in Self which implemented the
  one-thread-per-object model. All messages were future messages, but
  since sending a message to an unresolved future would block, you would
  have deadlock on any recursion (direct or indirect). This problem was
  solved by detecting the cycles and preempting the blocked message with
  the one it depends on. This results in interleaved execution, but since
  the semantics are exactly the same as in a sequential execution of the
  recursive code, any bugs that appear won't be due to concurrency. I was
  able to test simple expressions and was very happy with how much
  parallelism I was able to extract from seemingly sequential code, but I
  made the mistake of introducing a significant optimization (tail send
  elimination) that made debugging so much harder that I was unable to
  finish in the two weeks I was able to dedicate to this project.

- 64-node Smalltalk machine (1992)
  http://www.lsi.usp.br/~jecel/ms8702.html

  The most interesting result in this project was the notion that most
  objects in the system are immutable at any given time, and that a
  security system might be used to detect this. For example, just because
  you can edit some font today doesn't mean that you will do it. And if
  you and everyone currently logged on to the local system only have read
  permission for that font, then it is effectively immutable. Only when
  the font's owner logs in is this assumption invalid. The advantage of
  knowing that an object is immutable is that you can replicate it and
  you can allow multiple threads to access it at the same time.

  The only paper in English from this project describes how adaptive
  compilation could be used to trim away excessive concurrency by
  transforming future message passing into sequential message passing
  (the semantics allow this) and then inlining it away. So if a machine
  has 64 processors and the application initially starts out with 10
  thousand threads, the compiler will eventually change this into code
  with 200 or so threads (some are blocked at any given instant, so going
  down to 64 threads would not be good).
  http://www.lsi.usp.br/~jecel/jabs1.html

- operating system in an Objective-C-like language (1988)
  http://www.lsi.usp.br/~jecel/atos.html
  (this page has download links but the text still hasn't been written)

  This operating system for 286 machines used the virtual memory of that
  hardware to isolate groups of objects, with one thread per group. This
  would be similar to the vat/island model. All messages were sent in
  exactly the same way: if the receiver was a local object then it was
  just a fancy subroutine call, but for remote objects you got a "segment
  not present" fault and the message was packed up and sent to the other
  task (possibly over the network). All messages were synchronous since I
  was not aware of futures at that time.

-- current model --

I moved back to the one-thread-per-object-group model since I feel that
makes it easier for programmers to control things without having to worry
too much about details most of the time. Since my target is children,
this is particularly important. An alternative that I experimented with
was having a separation between active and passive objects. A passive
object could be known only to a single active one, but it is just too
hard to program without ever accidentally letting references to passive
objects "leak".
With the group/vat/island model there is just one kind of object, and
things are simpler for the programmer (but more complicated for the
implementor). I have a limitation that you can only create new objects in
your own group or in an entirely new group - I think forcing some other
random group to create an object for you is rude, though of course you
can always ask an object there to please do it.

Some of the loaded groups are read/write, but many are read-only. The
latter don't actually have their own threads; instead their code executes
in the thread of the calling group. I have hardware support for this.

Speaking of hardware, I would like to stress how fantastically slow
(relatively speaking) main memory is these days. If I have a good network
connecting processor cores on a single chip, then I can probably send a
message from one to another, get a reply, send a second message and get
another reply in the time it takes to read a byte from external RAM. So
we should start thinking of DDR SDRAM as a really fast disk to swap
objects to/from, and not as a shared memory. We should start to take
message passing seriously.

-- Jecel
In reply to this post by Igor Stasenko
On Oct 30, 2007, at 3:28 PM, Igor Stasenko wrote:

> On 30/10/2007, Jason Johnson <[hidden email]> wrote:
>> On 10/30/07, Igor Stasenko <[hidden email]> wrote:
>>> which is _NOT_ concurrent
>>> computing anymore, simply because it's not using shared memory, and in
>>> fact there is no sharing at all, only a glimpse of it.
>>
>> Huh? What does sharing have to do with concurrency? The one and only
>> thing shared state has to do with concurrency is the desire to speed
>> it up, i.e. a premature optimization. That's it.
>>
> Look. Current multi-core architectures use shared memory. So the logical
> way to utilize such an architecture at maximum power is to build on top
> of it.
> Anything else (such as share-nothing) introduces too much noise on such
> architectures.

It is unreasonable to assume that ad-hoc, fine-grained sharing of objects
between processors will give you the fastest performance on the upcoming
machines with 100s and 1000s of cores. What about memory locality and
cache coherency? It is not cheap to juggle an object between processors
now, and it will become more expensive as the number of cores increases.

In a different email in the thread, you made it clear that you consider
distributed computing to be a fundamentally different beast from
concurrency. Intel's chip designers don't see it this way. In fact, they
explicitly formulate inter-core communication as a networking problem.
For example, see
http://www.intel.com/technology/itj/2007/v11i3/1-integration/4-on-die.htm
(I've seen better links in the past, but this is the best I could quickly
find now).

I think that your proposal is very "clever", elegant, and fun to think
about. But I don't see what real problem it solves. It doesn't help the
application programmer write correct programs (you delegate this
responsibility to the language/libraries). It doesn't make code run at
maximum speed, since it doesn't handle memory locality. In short, it
seems like too much work to do for such uncertain gains... I think that
we can get farther by examining some of our assumptions before we start,
and revising our goals accordingly.

Cheers,
Josh

> --
> Best regards,
> Igor Stasenko AKA sig.
In reply to this post by Andreas.Raab
On 31/10/2007, Andreas Raab <[hidden email]> wrote:
> Igor Stasenko wrote:
> > Look. Current multi-core architectures use shared memory. So the
> > logical way to utilize such an architecture at maximum power is to
> > build on top of it.
>
> That's like saying: "Look. A current multi-core architecture uses x86
> instructions. So the logical way to utilize it is to write assembler
> programs". It's neither logical nor true.

I'm not saying that it shouldn't be able to benefit. I'm just saying that
this has a considerable cost (noise). And my point is to select the model
which has the minimum noise for a given architecture (for what noise
means, read below).

> More importantly, you are missing a major point in the comparison:
> programmer effectiveness. The biggest cost factor for any software is
> programmer hours. If you can get 80% of the efficiency by investing
> half the resources, you have a powerful advantage already. If you put
> this together with added robustness in a modern 24/7
> software-as-a-service environment, you've got a winner. In this sense
> it may be true that these solutions "introduce noise" (I'm not really
> sure what you mean when saying this, but whatever) but they more than
> make up for it in the amount of time you spend actually solving the
> problem at hand.

By noise I mean this: imagine that to perform a simple sum of two
integers we need (for instance) to write the first integer to file A,
write the second to file B, then run an OS script which reads them,
computes the sum, and writes the result to file C. Then you read from
file C, convert the result from text, and finally get your result.
That's what I call noise :)

> Cheers,
>   - Andreas

--
Best regards,
Igor Stasenko AKA sig.
In reply to this post by Jason Johnson-5
----- Original Message -----
From: "Jason Johnson" <[hidden email]>
To: "The general-purpose Squeak developers list" <[hidden email]>
Sent: Tuesday, October 30, 2007 1:31 PM
Subject: Re: Concurrent Futures

> On 10/29/07, Rob Withers <[hidden email]> wrote:
>>
>> This is what I am trying to do with SqueakElib. Any old object
>> referenced in the system is an eventual local ref, but the system
>> should handle promises or non-local refs anywhere.
>
> Are you going to make the "vats" Smalltalk processes?

They already are.

> Are you going to isolate them from other Smalltalk processes in the
> same image? If so then I will probably have some synergy with that.

I am trying to ensure that objects in one Vat don't get directly
manipulated by objects in other Processes.

Rob
In reply to this post by Joshua Gargus-2
On 31/10/2007, Joshua Gargus <[hidden email]> wrote:
> On Oct 30, 2007, at 3:28 PM, Igor Stasenko wrote:
>
> > On 30/10/2007, Jason Johnson <[hidden email]> wrote:
> >> On 10/30/07, Igor Stasenko <[hidden email]> wrote:
> >>> which is _NOT_ concurrent
> >>> computing anymore, simply because it's not using shared memory, and
> >>> in fact there is no sharing at all, only a glimpse of it.
> >>
> >> Huh? What does sharing have to do with concurrency? The one and only
> >> thing shared state has to do with concurrency is the desire to speed
> >> it up, i.e. a premature optimization. That's it.
> >>
> > Look. Current multi-core architectures use shared memory. So the
> > logical way to utilize such an architecture at maximum power is to
> > build on top of it.
> > Anything else (such as share-nothing) introduces too much noise on
> > such architectures.
>
> It is unreasonable to assume that ad-hoc, fine-grained sharing of
> objects between processors will give you the fastest performance on
> the upcoming machines with 100s and 1000s of cores. What about
> memory locality and cache coherency? It is not cheap to juggle an
> object between processors now, and it will become more expensive as
> the number of cores increases.
>
> In a different email in the thread, you made it clear that you
> consider distributed computing to be a fundamentally different beast
> from concurrency. Intel's chip designers don't see it this way. In
> fact, they explicitly formulate inter-core communication as a
> networking problem. For example, see
> http://www.intel.com/technology/itj/2007/v11i3/1-integration/4-on-die.htm
> (I've seen better links in the past, but this is the best I could
> quickly find now).

Then I wonder why they don't drop the idea of having shared memory at
all? Each CPU could then have its own memory, and they could interact by
sending messages in a network-style fashion. And we would then write code
which uses such an architecture in the best way. But while this is not
the case, should we assume that such code will work faster than code
which 'knows' that there is a single shared memory for all CPUs and uses
that knowledge in the best way?

> I think that your proposal is very "clever", elegant, and fun to
> think about.

Thanks :)

> But I don't see what real problem it solves. It doesn't help the
> application programmer write correct programs (you delegate this
> responsibility to the language/libraries). It doesn't make code run at
> maximum speed, since it doesn't handle memory locality. In short, it
> seems like too much work to do for such uncertain gains... I think that
> we can get farther by examining some of our assumptions before we
> start, and revising our goals accordingly.

I thought the goals were pretty clear. We have a single image. And we
want to run multiple native threads upon it, to utilize all the cores of
multi-core CPUs.
What we currently have is a VM which can't do that. So, I think, anything
else, even naively implemented, which can, is better than nothing.
If you have any ideas how such a VM would look, I'm glad to hear them.

--
Best regards,
Igor Stasenko AKA sig.
On Oct 30, 2007, at 5:26 PM, Igor Stasenko wrote:

> On 31/10/2007, Joshua Gargus <[hidden email]> wrote:
>>
>> On Oct 30, 2007, at 3:28 PM, Igor Stasenko wrote:
>>
>>> On 30/10/2007, Jason Johnson <[hidden email]> wrote:
>>>> On 10/30/07, Igor Stasenko <[hidden email]> wrote:
>>>>> which is _NOT_ concurrent
>>>>> computing anymore, simply because it's not using shared memory, and
>>>>> in fact there is no sharing at all, only a glimpse of it.
>>>>
>>>> Huh? What does sharing have to do with concurrency? The one and only
>>>> thing shared state has to do with concurrency is the desire to speed
>>>> it up, i.e. a premature optimization. That's it.
>>>>
>>> Look. Current multi-core architectures use shared memory. So the
>>> logical way to utilize such an architecture at maximum power is to
>>> build on top of it.
>>> Anything else (such as share-nothing) introduces too much noise on
>>> such architectures.
>>
>> It is unreasonable to assume that ad-hoc, fine-grained sharing of
>> objects between processors will give you the fastest performance on
>> the upcoming machines with 100s and 1000s of cores. What about
>> memory locality and cache coherency? It is not cheap to juggle an
>> object between processors now, and it will become more expensive as
>> the number of cores increases.
>>
>> In a different email in the thread, you made it clear that you
>> consider distributed computing to be a fundamentally different beast
>> from concurrency. Intel's chip designers don't see it this way. In
>> fact, they explicitly formulate inter-core communication as a
>> networking problem. For example, see
>> http://www.intel.com/technology/itj/2007/v11i3/1-integration/4-on-die.htm
>> (I've seen better links in the past, but this is the best I could
>> quickly find now).
>>
> Then I wonder why they don't drop the idea of having shared memory
> at all?

It's convenient for programmers. Aside from the huge complexity of
programming everything this way, we might also have to program AMD chips
differently from Intel ones (at least until a standard emerged).

> Each CPU could then have its own memory, and they could interact by
> sending messages in a network-style fashion. And we would then write
> code which uses such an architecture in the best way. But while this is
> not the case, should we assume that such code will work faster than
> code which 'knows' that there is a single shared memory for all CPUs
> and uses that knowledge in the best way?

Could you restate this? I don't understand what you mean.

>> I think that your proposal is very "clever", elegant, and fun to
>> think about.
>
> Thanks :)

You're welcome :-)

>> But I don't see what real problem it solves. It doesn't help the
>> application programmer write correct programs (you delegate this
>> responsibility to the language/libraries). It doesn't make code run at
>> maximum speed, since it doesn't handle memory locality. In short, it
>> seems like too much work to do for such uncertain gains... I think
>> that we can get farther by examining some of our assumptions before we
>> start, and revising our goals accordingly.
>
> I thought the goals were pretty clear. We have a single image. And we
> want to run multiple native threads upon it, to utilize all the cores
> of multi-core CPUs.
> What we currently have is a VM which can't do that. So, I think,
> anything else, even naively implemented, which can, is better than
> nothing.
> If you have any ideas how such a VM would look, I'm glad to hear them.
Very briefly, because this is Andreas's idea (my original one was similar
but worse), and I think I convinced him to write it up. My take on it is
to rework the VM so that it can support multiple images in one process,
each with its own thread. Give each image an event-loop to process
messages from other images. Make it easy for an image to spawn another
image. A small Spoon-like image could be spawned in very few
milliseconds.

I like this approach because it is dead simple, and it keeps all objects
that can talk directly to one another on the same processor. Because of
its simplicity, it is a low-risk way to quickly get somewhere useful.
Once we have some experience with it, we can decide whether we need
finer-grained object access between threads (as I've said, I don't think
that we will).

Cheers,
Josh

> --
> Best regards,
> Igor Stasenko AKA sig.
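To make the shape of the idea concrete, spawning and talking to such an
image might look something like this from Smalltalk (ImageVat and its
selectors are invented for illustration; nothing like them exists today):

    "Fire off a minimal Spoon-like image in its own native thread."
    worker := ImageVat spawnFrom: 'spoon-minimal.image'.

    "Talk to it only through its event loop; no direct object
    references cross the boundary."
    worker postEvent: (Message selector: #renderFrame: argument: 42).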
On 31/10/2007, Joshua Gargus <[hidden email]> wrote:
> > Then I wonder why they don't drop the idea of having shared memory
> > at all?
>
> It's convenient for programmers. Aside from the huge complexity of
> programming everything this way, we might also have to program AMD
> chips differently from Intel ones (at least until a standard emerged).
>
> > Each CPU could then have its own memory, and they could interact by
> > sending messages in a network-style fashion. And we would then write
> > code which uses such an architecture in the best way. But while this
> > is not the case, should we assume that such code will work faster
> > than code which 'knows' that there is a single shared memory for all
> > CPUs and uses that knowledge in the best way?
>
> Could you restate this? I don't understand what you mean.

I don't know what to add to the above. I just said that we should use the
approaches which best fit the architecture our project(s) will run on.
Of course, what fits best is arguable. But I don't think we should drop
shared-memory model support when we're building a system on top of an
architecture which has it.

> >> I think that your proposal is very "clever", elegant, and fun to
> >> think about.
> > Thanks :)
>
> You're welcome :-)
>
> Very briefly, because this is Andreas's idea (my original one was
> similar but worse), and I think I convinced him to write it up. My take
> on it is to rework the VM so that it can support multiple images in one
> process, each with its own thread. Give each image an event-loop to
> process messages from other images. Make it easy for an image to spawn
> another image. A small Spoon-like image could be spawned in very few
> milliseconds.
>
> I like this approach because it is dead simple, and it keeps all
> objects that can talk directly to one another on the same processor.
> Because of its simplicity, it is a low-risk way to quickly get
> somewhere useful. Once we have some experience with it, we can decide
> whether we need finer-grained object access between threads (as I've
> said, I don't think that we will).

Ah, yes, this was mentioned before. And I like that idea in general
because of its simplicity, but I don't like the memory overhead, and
there are some other issues, like:
- persistence
- generic usage

Let's imagine that I run a Squeak, start a couple of servers, start a
Tetris game and play it. Then at some moment I feel the urgent need to
shut down my PC to insert an additional 64 cores into my CPU, to be able
to run two Tetrises instead of one. ;)
What is interesting now is how I can have a dead-simple 'save image(s)'
one-click action so that after I restart the system I will be able to
continue running the same set of images from the same point, as we have
in current single-image Squeak?
If we store each separate image in separate file(s), then soon there will
be thousands of them polluting the same directory, and I will have lost
track of what I really need to move my project to a different PC.
Or maybe we should merge multiple images into the same file? That is much
better. But what about the .changes file?

About generic usage: small, ant-like images are good when you deal with
small tasks and the job of each ant is simple and monotonic.
But what about real-world applications such as CAD systems, or business
applications, which can have a couple hundred megabytes of image size?
Spawning multiple images of such size is a waste of resources.
OK, then let's suppose we can have a single 'queen' image and smaller
'ant' images. But now we need a sophisticated framework for coordinating
their moves.
And it is also clear that most of the workload will still be in the queen
image (because most of the objects are located there), which means that
it doesn't scale well.

--
Best regards,
Igor Stasenko AKA sig.
In reply to this post by Igor Stasenko
Igor Stasenko wrote:
> If you have any ideas how such a VM would look, I'm glad to hear them.

Okay, so Josh convinced me to write up the ideas. The main problem, as I
see it, with a *practical* solution to the problem is that all of the
solutions so far require huge leaps and can't be implemented step-by-step
(which almost certainly dooms them to failure).

So what do we know and what do we actually all pretty much agree on? It's
that we need to be able to utilize multiple cores and that we need a
practical way to get there (if you disagree with the latter, this message
is not meant for you ;-)

Running multiple processes is one option, but it is not always
sufficient. For example, some OSes would have trouble firing off a couple
of thousand processes, whereas the same OS may have no problem at all
with a couple of thousand threads in one process. To give an example,
starting a thread on Windows costs somewhere in the range of a
millisecond, which is admittedly slow, but still orders of magnitude
faster than creating a new process. Then there are issues with resource
sharing (like file handles) which are practically guaranteed not to work
across process boundaries, etc. So while there are perfectly good reasons
to run multiple processes, there are reasons just as good for wanting to
run multiple threads in one process.

The question then is, can we find an easy way to extend the Squeak VM to
run multiple threads, and if so, how? Given the simplistic nature of the
Squeak interpreter, there is actually very little global state that is
not encapsulated in objects on the Squeak heap - basically all the
variables in class Interpreter. So if we were to put them into state that
is local to each thread, we could trivially run multiple instances of the
bytecode interpreter in the same VM. This gets us to the two major
questions:

* How do we encapsulate the interpreter state?
* How do we deal with primitives and plugins?

Let's start with the first one. Obviously, the answer is "make it an
object". The way I would go about it is by modifying the CCodeGenerator
such that it generates all functions with an argument of type "struct
VM", prefixes variable accesses properly, and passes the extra argument
along in all function calls. In short, what used to be translated as:

    sqInt primitiveAdd(void) {
        integerResult = stackIntegerValue(1) + stackIntegerValue(0);
        /* etc. */
    }

will then become something like:

    sqInt primitiveAdd(struct VM *vm) {
        integerResult = stackIntegerValue(vm, 1) + stackIntegerValue(vm, 0);
        /* etc. */
    }

This is a *purely* mechanical step that can be done independently of
anything else. It should be possible to generate code that is entirely
equivalent to today's code, and with a bit of tweaking it should be
possible to make that code roughly as fast as what we have today (not
that I think it matters, but understanding the speed difference between
this and the default interpreter is important for judging relative speed
improvements later).

The above takes care of the interpreter, but there are still primitives
and plugins that need to be dealt with. What I would do here is define
operations like ioLock(struct VM) and ioUnlock(struct VM) that are the
effective equivalent of Python's GIL (global interpreter lock) and allow
exclusive access to primitives that have not been converted to
multi-threading yet. How exactly this conversion should happen is
deliberately left open here; maybe changing the VM's major proxy version
is the right thing to do to indicate the changed semantics.
In any case, the GIL allows us to readily reuse all existing plugins
without having to worry about conversion early on.

So now we've taken care of the two major parts of Squeak: we have the
ability to run new interpreters and we have the ability to use
primitives. This is when the fun begins, because at this point we have
options: for example, if you are into shared-state concurrency, you might
implement a primitive that forks a new instance of the interpreter
running in the same object memory that your previous interpreter is
running in.

Or, and that would be the path that I would take, implement a primitive
that loads an image into a new object memory (I can explain in more
detail how memory allocation needs to work for that; it is a fairly
straightforward scheme, but a little too long for this message) and run
that interpreter. And at this point, the *real* fun begins, because we
can now start to define the communication patterns we'd like to use
(initially sockets, later shared memory or event queues or whatever
else). We can have tiny worker images that only do minimal stuff, but we
can also do a Spoon-like thing where we have a "master image" that
contains all the code possibly needed and fire off micro-images that (via
imprinting) swap in just the code they need to run.

[Whoa! I just got interrupted by a little 5.6 quake some 50 miles away]

Sorry, but I lost my train of thought here. Happens at 5.6 Richter ;-)

Anyway, the main thing I'm trying to say in the above is that for a
*practical* solution to the problem there are some steps that are pretty
much required whichever way you look at it. And I think that regardless
of your interest in shared-state or message-passing concurrency, we may
be able to define a road that leads to interesting experiments without
sacrificing the practical artifact. A VM built as described above would
be strictly a superset of the current VM, so it would be able to run any
current images and leave room for further experiments.

Cheers,
  - Andreas
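For concreteness, the image-side view of the two primitives Andreas
sketches might look something like this (VMThread and all of its
selectors are invented; nothing like this exists in any current VM):

    "Option 1: fork another interpreter over the *same* object memory
    (shared-state concurrency)."
    thread := VMThread forkInterpreter.

    "Option 2: load a second image into a fresh object memory, run it
    in its own native thread, and talk to it over a socket at first."
    worker := VMThread spawnImage: 'worker.image'.
    socket := Socket newTCP.
    socket connectTo: NetNameResolver localHostAddress port: worker servicePort.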
In reply to this post by Igor Stasenko
On Oct 30, 2007, at 7:56 PM, Igor Stasenko wrote:

> On 31/10/2007, Joshua Gargus <[hidden email]> wrote:
>>
>>> Then I wonder why they don't drop the idea of having shared memory
>>> at all?
>>
>> It's convenient for programmers. Aside from the huge complexity of
>> programming everything this way, we might also have to program AMD
>> chips differently from Intel ones (at least until a standard emerged).
>>
>>> Each CPU could then have its own memory, and they could interact by
>>> sending messages in a network-style fashion. And we would then write
>>> code which uses such an architecture in the best way. But while this
>>> is not the case, should we assume that such code will work faster
>>> than code which 'knows' that there is a single shared memory for all
>>> CPUs and uses that knowledge in the best way?
>>
>> Could you restate this? I don't understand what you mean.
>>
> I don't know what to add to the above. I just said that we should use
> the approaches which best fit the architecture our project(s) will run
> on.
> Of course, what fits best is arguable. But I don't think we should drop
> shared-memory model support when we're building a system on top of an
> architecture which has it.

OK, this is basically what I thought you meant, but I wasn't sure.

<snip>

> Let's imagine that I run a Squeak, start a couple of servers, start a
> Tetris game and play it. Then at some moment I feel the urgent need to
> shut down my PC to insert an additional 64 cores into my CPU, to be
> able to run two Tetrises instead of one. ;)

Only 2 Tetrises? I hope we can do better than that :-)

> What is interesting now is how I can have a dead-simple 'save image(s)'
> one-click action so that after I restart the system I will be able to
> continue running the same set of images from the same point, as we have
> in current single-image Squeak?
> If we store each separate image in separate file(s), then soon there
> will be thousands of them polluting the same directory, and I will have
> lost track of what I really need to move my project to a different PC.
> Or maybe we should merge multiple images into the same file? That is
> much better. But what about the .changes file?

Or source code control in general. Good point. There are a number of
options. For simplicity, I'd say to connect a single blessed development
image to the single changes file. Another would be to use a
Monticello-like approach (although there are still things that changesets
are needed for, and we don't want to bring in all sorts of new
fundamental dependencies). However, these are the first thoughts that
popped into my head... you seem to have thought about this more than I
have.

Re: persistence in general... I used to do all of my thinking in a Squeak
image, using morphs and projects. I grew tired of the breakage of my
media when I updated my code to the latest version, or of screwing up my
image in some way. I much prefer the way that we deal with data in
Croquet, using a separate serialization format instead of the Squeak
image format. So for me, it's acceptable for many of the images to be
transient... created on startup and discarded ("garbage-collected" :-) )
at shutdown.

(BTW, I hope you'll excuse me for continually saying "we" even though
there's no chance I'll have time to work on it with you all... I'll have
to be content with cheering from the sidelines :-( )

> About generic usage: small, ant-like images are good when you deal with
> small tasks and the job of each ant is simple and monotonic.
> But what about real-world applications such as CAD systems, or business
> applications, which can have a couple hundred megabytes of image size?
> Spawning multiple images of such size is a waste of resources.

Each image can be as big as it needs to be. We don't need to spawn off
identical clones of a single image.

I don't have a lot of first-hand experience with business systems. I'm
picturing large tables of relational data... I'm not sure how I would
approach this with the model I've described. Honestly, it's difficult for
me to think about this stuff outside of the context of Croquet. There
might be use cases that aren't addressed well. But my instinct is that if
the model is good enough for a world-wide network of interconnected,
collaborative 3D spaces (and I believe that it is), then it is good
enough for most anything you'd want to do with it.

> OK, then let's suppose we can have a single 'queen' image and smaller
> 'ant' images. But now we need a sophisticated framework for
> coordinating their moves.

If the application is that sophisticated, it will need a sophisticated
framework for coordinating concurrency anyway. No way around it. It
doesn't matter if it is one queen and many ants, or a heterogeneous mix
of islands.

Cheers,
Josh

> And it is also clear that most of the workload will still be in the
> queen image (because most of the objects are located there), which
> means that it doesn't scale well.
>
> --
> Best regards,
> Igor Stasenko AKA sig.