Re: [Pharo-dev] Pony for Pharo VM


Re: [Pharo-dev] Pony for Pharo VM

Shaping1
 

Just to get in the right frame of mind, consider that because of the Blub Paradox (http://www.paulgraham.com/avg.html)
you are going to have a hard time convincing people to "change to this language because of Feature X"
just by saying so.  You need to dig deeper.  

 

The Pony compiler and runtime need to be studied.

 

I’m not trying to convince; I’m presenting facts, observations, and resources for study of the problem and its solution.  Hardware constraints now are intensely multicore, and everyone knows this.  The changing programming paradigm is apparent.  Hardware structure is forcing that change.  Convincing yourself will not be difficult when you have the facts.  You likely do already, at least on the problem side.

 

The issue is not whether to use Pony.  I don’t like Pony, the language; it’s okay, even very good, but it’s not Smalltalk.  I like Smalltalk, whose concurrency model is painfully lame.  I like Orca because it works on many cores (as many as 64, currently) without a synchronization step for GC, and has wonderful concurrency abilities.  Pony and Orca were co-designed.  The deferred reference counts managed by Orca run on the messages between the actors (send/receive tracing).  GCs happen in Pony/Orca when each actor finishes its response to the last received message, and goes idle.  The actor then GCs all objects no longer referenced by other actors.  The runtime scheduler takes this time needed for each actor’s GCing into account.  No actor waits to GC objects.  An actor’s allocated objects’ ref counts are checked at idle-time, and unreferenced objects are GCed in an ongoing, fluid way, in small, high-frequency bursts, with very small, predictable tail latencies, as a result.  That’s very interesting if you need smoothly running apps (graphics), design/program real-time control systems, or process data at high rates, as in financial applications at banks and exchanges.
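
To make the GC timing concrete, here is a minimal Pony sketch (my own illustration, with invented names, not code from the paper); the point is only where collection happens:

use "collections"

actor Ingestor
  be tick(n: USize) =>
    // Objects allocated while handling this message live on this
    // actor's own small heap.
    let scratch = Array[U64](n)
    for i in Range(0, n) do
      scratch.push(i.u64())
    end
    // When this behaviour returns and the mailbox is empty, the actor
    // is idle and collects its own heap: no other actor pauses, and
    // there is no global synchronization step.

actor Main
  new create(env: Env) =>
    let a = Ingestor
    a.tick(1000)  // asynchronous send; Main does not block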

 

The issue is how most efficiently to use Orca, which happens to be working in Pony.  Pony is in production in two internal, speed-demanding, banking apps and in Wallaroo Labs’ high-rate streaming product.  Pony is a convenient way to study and use a working implementation of Orca.  Ergo, use Pony, even if we only study it as a good example of how to use Orca.  Some tweaks (probably a lot of them) could allow use of dynamic types.  We could roll our own implementation of Orca for the current Pharo VM, but that seems like more work than tweaking a working Pony compiler and runtime.  I’m not sure about that.  You know the VM better than I.  (I was beginning my study of the Pharo/OpenSmalltalkVM when I found Pony.)



Now I took a quick look to see what I could learn about Pony
and the main interesting thing is its "Reference Capabilities".
Indeed, it seems Pony's purpose is the POC for that research.
I watched two videos...

[1] Øredev 2017 - Joe McIlvain - Pony - A Language for Provably Safe Lockless Concurrency
https://www.youtube.com/watch?v=9NH4bVfbvYI  

[2] Sophia Drossopoulou - Pony for Safe, Fast, concurrent programs - Codemesh 2017
https://www.youtube.com/watch?v=e_bES30tFqI  

So my "quick" assessment is to consider Reference Capabilities not a Type System
but a Meta-type System.

 

Yes, ref-caps control sharing of both mutable and immutable objects.

 [2] says "They are not a property of an object, but define how I can look at an object."  To summarize the videos...

[Slide screenshots elided: [1]@29:15 (two slides), [2]@13:40, [2]@24:40, [2]@28:30.]
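
Since the screenshots don't survive here, a rough gloss of the six ref-caps in Pony source form (my own hedged summary with invented names, not the slides' content):

class Notes
  var text: String = ""

actor Main
  new create(env: Env) =>
    let a: Notes iso = recover iso Notes end  // iso: the only reference; read/write; sendable
    let b: Notes val = recover val Notes end  // val: deeply immutable; freely sharable
    let c: Notes ref = Notes                  // ref: ordinary mutable reference; this actor only
    let d: Notes box = c                      // box: read-only view; this actor only
    let e: Notes tag = c                      // tag: opaque identity; can compare or send to it, not read it
    env.out.print("trn is the sixth: write-unique while building, later convertible to val")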

 

 

 


So you say...
> I want to use the Pony concurrency model at the Smalltalk level.  That’s the gist of it. 

> Otherwise, I want a Smalltalk, not a statically compiled language.

 

I believe you have the wrong end of the stick in promoting that the VM needs to be redeveloped using Pony to get the Reference Capabilities at the Smalltalk level.

 

See my recent response to Ken in vm-dev. 

 

After you Ahead-Of-Time compile the VM, the compiler is left behind.  The features of the compiler don't automatically flow through to the Smalltalk level.

 

 

Yes.  JIT and AOT are needed.  Adjustments need to be made for dynamic typing, but such adjustments need not violate invariants needed to provide the ref-cap concurrency guarantees. 

 

A third option could be to extend the existing "Immutability capability" of the VM.

 

Ref-caps control access to both immutable and mutable objects.  Pony ascertains at compile time that only one actor at a time can hold a reference to a mutable object at runtime.
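
As a hedged illustration of that compile-time check (invented names; a sketch, not code from the Pony docs): to send a mutable object, you must hold it as iso and give it up with consume.

actor Worker
  be take(buf: Array[U8] iso) =>
    // Worker now holds the only reference to buf and may mutate it freely.
    buf.push(42)

actor Main
  new create(env: Env) =>
    let buf: Array[U8] iso = recover iso Array[U8] end
    let w = Worker
    w.take(consume buf)  // sending a mutable object requires giving up our alias
    // buf.push(1)       // rejected at compile time: buf was consumed above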

 

The six Reference Capabilities might be stored in the Spur Object Header using a re-purposed Immutability bit plus the two free "green" bits (three bits give eight states, enough for the six capabilities).

https://clementbera.wordpress.com/2014/01/16/spurs-new-object-format/  

 

We can’t use a single big system heap, and also use the actor model and many cores productively.

 

The big heap must go.

 

That’s the crux of what we need to learn at this juncture.  I’m not sure everyone is getting this.  I don’t need to convince you; you can do that for yourself.  We don’t currently have a better choice of programming model or memory management, if we want maximum speed, and are paying attention to hard constraints.  Multicore hardware is here to stay.  If you know a way to use your multicore CPUs more efficiently than by use of actor-based programming and Orca memory management, please share your facts.

 

Everyone needs to understand the big-heap problem in order to commit to fine-grain, actor-based (and therefore state-machine-based) programming.  We won’t realize productive multicore programming without an actor model.  We still need new tools to help, as mentioned, lest our state-machine programming be tedious and painful, causing us to default to old coding habits, and not make the needed state-machines.  I’m working on one of those tools.  Orca manages memory as fast as possible in an actor-based program/runtime, without any synchronization.  The big-heap problem has been solved for the statically typed case in Pony.  The lack of discipline/tools for building state-machines remains a broad problem.  Such new tools are therefore essential during the change of programming model.

 

I suggest studying the current Pony/Orca solution (see the Orca paper linked below), to see how it can be tweaked to accommodate dynamic types.  I’ve not studied the Pony compiler and runtime enough to be certain, but they may be closer to what we need, even if the first rough implementation of a Smalltalk-with-Orca must be AOT, instead of JIT.  We want both ultimately.  We should do first whichever is easiest to implement and test in small steps.  

 

"Reference Capabilities for Dynamic Languages" could be a strong PhD project.  

 

It’s a good idea.  Is this needed to make a Smalltalk that works with Orca?  I don’t think so.

Orca matters more than Pony.  But both can be useful to us.  Studying the Orca paper and its implementation in the Pony compiler and runtime is the easiest way to understand the constraints imposed by the current static type system in Pony—and how to change them to accommodate dynamic types. 

 

Everyone, please read and study https://www.ponylang.io/media/papers/orca_gc_and_type_system_co-design_for_actor_languages.pdf if you are interested in building a VM/runtime that guarantees no data-races, and fully uses all your machine’s cores, with the least programming effort.

 

Then study Pony:  https://www.ponylang.io/

And ask questions on Pony Zulip:  https://ponylang.zulipchat.com/

 

(BTW, Zulip seems easier to use than Discord and e-mail.  Has anyone considered using it for the Pharo/Squeak lists?)

 

 

Shaping

 

 

 

 

On Fri, 10 Apr 2020 at 17:56, Shaping <[hidden email]> wrote:


>
> Hi Ken.
>
> Not to discourage people, but I have not seen cases where a "strong
> type system" would be able to scale for _real_ Smalltalk applications.
>
> You’re right.  It doesn't.  I'm not suggesting that.
>
> The type safety is not for app-level Smalltalk development.  It's for building the VM only.
>
> The six ref-cap ideas for sharing data reliably between actors are not hard to grasp, but they take some getting used to, and involve some mental load.  I don't want that much (concurrency-related or any other) implementation detail in the domain layer, much in the same vein as:  I don’t use Forth because I don’t want to see stack acrobatics (ROTs and DUPs, etc.) amidst domain-level state-changes (FirePhotonTorpedo).  It’s distracting.  It dilutes focus on the domain work/layer, and tends to cause mistakes there.
>
>  
>
> The programmer’s domain logic and the concurrency-integrity provided by the ref-caps are different layers of thought and structure.  The ref-caps are, however, mixed freely with domain logic in current Pony code.  I think that’s a mistake.  But that’s how it is now.  I think of this layer mixing as an intermediate design stage of Pony.  I want to abstract-out some or all ref-caps as the VM is built.
>
> Pony language is not the remarkable thing here. I see it as a better C or better Rust.  It’s very good (as Algol-60-ish-looking crap-syntaxes go), but that’s not what this is about.  It’s about the actor programming, the concurrency model, and the guarantees given by use of the ref-caps.  We would still obviously need to respect the Pony compiler’s determination of where ref-cap use is correct or not.  Your Pony program won’t compile if you don’t use the ref-caps correctly, and you don’t get the guarantees or a running program without the compile.  Much easier, therefore, might be to go the other way by changing Pony to support dynamic types without violating the invariants that allow the ref-caps (under the hood, abstracted out) to make the concurrency guarantees.  Work Smalltalk dynamism into Pony, instead of building a Smalltalk VM with Pony. 


Re: [Pharo-dev] Pony for Pharo VM

Robert Withers-2
 

You may find this package has an event-loop processing model with promise-based capabilities to resolve asynchrony safely. Please take a look at the RefsTest after loading into Squeak. I just split the local capabilities from the Raven package. Raven has the remote capabilities and is currently broken.

Installer ss
    project: 'Cryptography';
    install: 'CapabilitiesLocal'

K, r


-- 
Kindly,
Robert

Re: [Pharo-dev] Pony for Pharo VM

Robert Withers-2
In reply to this post by Shaping1
 

Hi Shaping,

I saw your email and thought to respond to what you say. Hopefully, helpfully.

On 4/15/20 4:59 AM, Shaping wrote:

Just to get in the right frame of mind, consider that because of the Blub Paradox (http://www.paulgraham.com/avg.html)
you are going to have a hard time convincing people to "change to this language because of Feature X"
just by saying so.  You need to dig deeper.  

 

The Pony compiler and runtime need to be studied.

What better way than to bring the Pony compiler into Squeak? Build a Pony runtime inside Squeak, with the VM simulator. Build a VM. Then people will learn Pony and it would be great!

I’m not trying to convince; I’m presenting facts, observations, and resources for study of the problem and its solution.  Hardware constraints now are intensely multicore, and everyone knows this.  The changing programming paradigm is apparent.  Hardware structure is forcing that change.  Convincing yourself will not be difficult when you have the facts.  You likely do already, at least on the problem side.

The solution is easy. As different event loops on different cores will use the same externalizing remote interface to reach other event loops, we do not need a runtime that can run on all of those cores. We just need to start the minimal image on the CogVM with remote capabilities to share workload. The biggest challenge, I think you would agree, is the system/application design that provides the opportunities to take advantage of parallelism. It kinda fits the microservices arch. So, we would run 64 instances of Squeak to take the multicore to town.

The issue is not whether to use Pony.  I don’t like Pony, the language; it’s okay, even very good, but it’s not Smalltalk.  I like Smalltalk, whose concurrency model is painfully lame.

Squeak concurrency model.

Installer ss
    project: 'Cryptography';
    install: 'CapabilitiesLocal'

I like Orca because it works on many cores (as many as 64, currently) without a synchronization step for GC, and has wonderful concurrency abilities.  Pony and Orca were co-designed.  The deferred reference counts managed by Orca run on the messages between the actors (send/receive tracing).  GCs happen in Pony/Orca when each actor finishes its response to the last received message, and goes idle.  The actor then GCs all objects no longer referenced by other actors.  The runtime scheduler takes this time needed for each actor’s GCing into account.  No actor waits to GC objects.  An actor’s allocated objects’ ref counts are checked at idle-time, and unreferenced objects are GCed in an ongoing, fluid way, in small, high-frequency bursts, with very small, predictable tail latencies, as a result.  That’s very interesting if you need smoothly running apps (graphics), design/program real-time control systems, or process data at high rates, as in financial applications at banks and exchanges.

So your use of Pony is purely to access the Orca vm? I think you will find the CogVM quite interesting and performant.

 

The issue is how most efficiently to use Orca, which happens to be working in Pony.  Pony is in production in two internal, speed-demanding, banking apps and in Wallaroo Labs’ high-rate streaming product.  Pony is a convenient way to study and use a working implementation of Orca.  Ergo, use Pony, even if we only study it as a good example of how to use Orca.  Some tweaks (probably a lot of them) could allow use of dynamic types.  We could roll our own implementation of Orca for the current Pharo VM, but that seems like more work than tweaking a working Pony compiler and runtime.  I’m not sure about that.  You know the VM better than I.  (I was beginning my study of the Pharo/OpenSmalltalkVM when I found Pony.)

Sounds like you might regret your choice and took the wrong path. Come back to Squeak! ^,^

Kindly, Robert




-- 
Kindly,
Robert

Re: [Pharo-dev] Pony for Pharo VM

Shaping1
 

Just to get in the right frame of mind, consider that because of the Blub Paradox (http://www.paulgraham.com/avg.html)
you are going to have a hard time convincing people to "change to this language because of Feature X"
just by saying so.  You need to dig deeper.  

 

The Pony compiler and runtime need to be studied.

What better way than to bring the Pony compiler into Squeak? Build a Pony runtime inside Squeak, with the VM simulator. Build a VM. Then people will learn Pony and it would be great!

 

Yes, that is one way.  Then we can simulate the new collector with Smalltalk in the usual way, whilst also integrating ref-caps and dynamic types (the main challenge).  We already know that Orca works in Pony (in high-performance production—not an experiment or toy).  Still, there will be bugs and perhaps room for improvements.  Smalltalk simulation would help greatly there.  The simulated Pony-Orca (the term used in the Orca paper) or simulated Smalltalk-Orca, if we can tag classes with ref-caps and keep Orca working, will run even more slowly in simulation mode with all that message-passing added to the mix.

 

I’m starting to study the Pharo VM.  Can someone suggest what to read?  I see what appears to be outdated VM-related material.  I’m not sure what to study (besides the source code) and what to ignore.  I’m especially interested to know what not to read.

I’m not trying to convince; I’m presenting facts, observations, and resources for study of the problem and its solution.  Hardware constraints now are intensely multicore, and everyone knows this.  The changing programming paradigm is apparent.  Hardware structure is forcing that change.  Convincing yourself will not be difficult when you have the facts.  You likely do already, at least on the problem side.

The solution is easy.

 

The problem is easy to understand.  It reduces to StW GCing in a large heap, and how instead to make many small, well-managed heaps, one per actor.  Orca does that already and demonstrates very high performance.  That’s what the Orca paper is about.

 

The solution for Smalltalk is more complicated, and will involve a concurrent collector.  The best one I can find now is Orca.  If you know a better one, please share your facts.

 

As different event loops on different cores will use the same

 

externalizing remote interface

 

This idea is not clear.  Is there a description of it?

 

to reach other event loops, we do not need a runtime that can run on all of those cores. We just need to start the minimal image on the CogVM with remote capabilities

 

Pony doesn’t yet have machine-node remoteness.  The networked version is being planned, but is a ways off still.  By remote, do you mean:  another machine or another OS/CogVM process on the same machine?  I think the Pony runtime is still creating by default just one OS process per app and as many threads as needed, with each actor having only one thread of execution by definition of what an actor is (single-threaded, very simple, very small).  A scheduler keeps all cores busy, running and interleaving all the current actor threads.  Message tracing maintains ref counts.  A cycle-detector keeps things tidy.  Do Squeak and Pharo have those abilities?

 

to share workload.

 

With Pony-Orca, sharing of the workload doesn’t need to be managed by the programmer.  That’s one of the basic reasons for the existence of Pony-Orca.  The Pony-Orca dev writes his actors, and they run automatically in load-balance, via the actor-thread scheduler and work-stealing, when possible, on all the cores.  Making Smalltalk work with Orca is, at this early stage, about understanding how Orca works (study the C++ and program in Pony) and how to implement it, if possible, in a Smalltalk simulator.  Concerning Orca in particular, if you notice at the end of the paper, they tested Orca against the Erlang VM, C4, and G1, and it performed much better than all of them.

 

 

The biggest challenge, I think you would agree, is the system/application design that provides the opportunities to take advantage of parallelism. It kinda fits the microservices arch. So, we would run 64 instances of Squeak to take the multicore to town.

 

No, that’s much slower.  Squeak/Pharo still has the basic threading handicap:  a single large heap.

 

Here’s the gist of the problem again:  the big heap will not work and must go away, if we are to have extreme speed and a generalized multithreading programming solution.  

 

My current understanding is that Pony-Orca (or Smalltalk-Orca) starts one OS process, and then spawns threads, as new actors begin working.  You don’t need to do anything special as a programmer to make that happen.  You just write the actors, keep them small, use the ref-caps correctly so that the program compiles (the ref-caps must also be applied to Smalltalk classes), and organize your synchronous code into classes, as usual.  Functions run synchronous code.  Behaviours run asynchronous code.
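
In Pony syntax, that split looks roughly like this (a hedged sketch with invented names):

actor Counter
  var _count: U64 = 0

  fun tag name(): String =>  // fun: synchronous code, runs in the caller
    "counter"

  be bump(by: U64) =>        // be: asynchronous behaviour, handled one message at a time
    _count = _count + by

actor Main
  new create(env: Env) =>
    let c = Counter
    c.bump(2)                 // returns immediately; the increment happens later
    env.out.print(c.name())   // a fun tag can be called synchronously on an actor reference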



The issue is not whether to use Pony.  I don’t like Pony, the language; it’s okay, even very good, but it’s not Smalltalk.  I like Smalltalk, whose concurrency model is painfully lame.

Squeak concurrency model.

Installer ss
    project: 'Cryptography';
    install: 'CapabilitiesLocal'

What abilities does the above install give Squeak?

 

I like Orca because it works on many cores (as many as 64, currently) without a synchronization step for GC, and has wonderful concurrency abilities.  Pony and Orca were co-designed.  The deferred reference counts managed by Orca run on the messages between the actors (send/receive tracing).  GCs happen in Pony/Orca when each actor finishes its response to the last received message, and goes idle.  The actor then GCs all objects no longer referenced by other actors.  The runtime scheduler takes this time needed for each actor’s GCing into account.  No actor waits to GC objects.  An actor’s allocated objects’ ref counts are checked at idle-time, and unreferenced objects are GCed in an ongoing, fluid way, in small, high-frequency bursts, with very small, predictable tail latencies, as a result.  That’s very interesting if you need smoothly running apps (graphics), design/program real-time control systems, or process data at high rates, as in financial applications at banks and exchanges.

So your use of Pony is purely to access the Orca vm?

 

Orca is not a VM; it’s a garbage collection protocol for actor-based systems. 

 

I suggest using Pony-Orca to learn how Orca works, and then replace the Pony part of Pony-Orca with Smalltalk (dynamic typing), keeping the ref-caps (because they provide the guarantees).  I realize that this is a big undertaking.  Or:  write a new implementation of Orca in Smalltalk for the VM.  This is currently second choice, but that could change.

 

I think you will find the CogVM quite interesting and performant.

 

Not with its current architecture.

 

If the CogVM is not able to:

1) dynamically schedule unlimited actor-threads on all cores

2) automatically load-balance

3) support actor-based programs innately

4) guarantee no data-races

 

then, no, it is definitely not as interesting as the best concurrent collectors, like Orca, with an integrated type system and language.  Orca has been applied successfully to Pony.  Orca was also applied to the language Encore.  If CogVM can be changed to implement a concurrent collector, then CogVM is interesting.  That’s a big change.  The main value of CogVM now seems to be as a possible building/rebuilding tool for the VM itself. 

 

Did you study the Wallaroo learning experience concerning performance?

 

I’ve no interest in coding custom, one-off, multi-core apps (or settling for a much slower general solution, as in the Erlang-like concurrency model in Squeak).  Custom-coded multithreading is too costly and too error-prone.  It’s not fun, productive, or even needed, unless you really do need an extremely optimized concurrent solution for a specific domain.  I don’t want inter-process communication before inter-thread communication (much faster) has been exhausted.  The concurrent collector, Orca in this case, in conjunction with the ref-caps generalize the multicore solution, efficiently (that’s the point of it) for any actor-based program, and the zero-copy message passing gives much more speed than IPC.  The tiny heaps cause tiny pauses on async collection.  Runtime message tracing costs decrease as use of mutable types does.  Message tracing happens only because there are mutable types to track and eventually collect; none of that applies to immutable types.  See the test results in the paper for details.  

 

The issue is how most efficiently to use Orca, which happens to be working in Pony.  Pony is in production in two internal, speed-demanding, banking apps and in Wallaroo Labs’ high-rate streaming product.  Pony is a convenient way to study and use a working implementation of Orca.  Ergo, use Pony, even if we only study it as a good example of how to use Orca.  Some tweaks (probably a lot of them) could allow use of dynamic types.  We could roll our own implementation of Orca for the current Pharo VM, but that seems like more work than tweaking a working Pony compiler and runtime.  I’m not sure about that.  You know the VM better than I.  (I was beginning my study of the Pharo/OpenSmalltalkVM when I found Pony.)

Sounds like you might regret your choice and took the wrong path.

I don’t see how you form that conclusion.  I’ve not chosen yet.

I seek the easiest integration/mutation path for a concurrent collector and ref-cap system.  

I can start with Pony or a Smalltalk VM simulator.  Either direction may be chosen.  Squeak/Pharo’s current architecture (it has one big heap) is not suitable for general, automatic, fast multithreading.  If all the VM C code can be simulated in Smalltalk before compiling it to an exe, then simulation may be the better path.

Come back to Squeak! ^,^

I see the Actors for Squeak page.  That is not a suitable implementation. 

I’ve not used Squeak since 2004, and don’t know its current state.  I assume that it does not have the four concurrency-related abilities listed above.  Does it? 

If you know, please share the current facts about Squeak’s concurrency abilities.  I prefer to skip the work needed to adapt Smalltalk to a concurrent collector like Orca, if those abilities already exist in Squeak/Pharo.

If most of what Squeak/Pharo offers is pleasant/productive VM simulation, much work still remains to achieve even a basic actor system and collector, but the writing of VM code in Smalltalk and compiling it to C may be much more productive than writing C++.  The C++ for the Pony compiler and runtime, however, already compiles and works well.  Thus, starting the work in C++ is somewhat tempting.  Can someone explain the limits of how the VM simulator can be used?  How much VM core C is not a part of what can be compiled from Smalltalk?  Can all VM C code be compiled from Smalltalk?

 

Shaping


Re: [Pharo-dev] Pony for Pharo VM

Robert Withers-2
 

Hi Shaping,

On 4/18/20 5:15 AM, Shaping wrote:

Just to get in the right frame of mind, consider that because of the Blub Paradox (http://www.paulgraham.com/avg.html)
you are going to have a hard time convincing people to "change to this language because of Feature X"
just by saying so.  You need to dig deeper.  

 

The Pony compiler and runtime need to be studied.

What better way than to bring the Pony compiler into Squeak? Build a Pony runtime inside Squeak, with the vm simulator. Build a VM. Then people will learn Pony and it would be great!

 

Yes, that is one way.  Then we can simulate the new collector with Smalltalk in the usual way, whilst also integrating ref-caps and dynamic types (the main challenge).  We already know that Orca works in Pony (in high-performance production—not an experiment or toy).  Still, there will be bugs and perhaps room for improvements.  Smalltalk simulation would help greatly there.  The simulated Pony-Orca (the term used in the Orca paper) or simulated Smalltalk-Orca, if we can tag classes with ref-caps and keep Orca working, will run even more slowly in simulation mode with all that message-passing added to the mix.

The cost of message passing comes down when using the CogVM JIT. It is indeed somewhat slower when running in the simulator. I think the objective should be to run the Pony bytecodes on the jitting CogVM. This VM allows you to install your own BytecodeEncoderSet. Note that I was definitely promoting a solution of running Pony on the CogVM, not Orca.

I’m starting to study the Pharo VM.  Can someone suggest what to read?  I see what appears to be outdated VM-related material.  I’m not sure what to study (besides the source code) and what to ignore.  I’m especially interested to know what not to read.

I would suggest sticking to Squeak, instead of Pharo, as that is where the VM is designed and developed. Here are a couple of interesting blogs covering the CogVM [1][2] regarding VM documentation.

I’m not trying to convince; I’m presenting facts, observations, and resources for study of the problem and its solution.  Hardware constraints now are intensely multicore, and everyone knows this.  The changing programming paradigm is apparent.  Hardware structure is forcing that change.  Convincing yourself will not be difficult when you have the facts.  You likely do already, at least on the problem side.

The solution is easy.

 

The problem is easy to understand.  It reduces to StW GCing in a large heap, and how instead to make many small, well-managed heaps, one per actor.  Orca does that already and demonstrates very high performance.  That’s what the Orca paper is about.

The CogVM has a single heap, divided into what I believe are called "segments", which let it dynamically grow to gain new heap space. The performance of the GC in the CogVM is demonstrated by this profiling result from running all Cryptography tests. Load Cryptography with this script, open the Test Runner, select the Cryptography tests, and click 'Run Profiled':
Installer ss
    project: 'Cryptography';
    install: 'ProCrypto-1-1-1';
    install: 'ProCryptoTests-1-1-1'.

Here are the profiling results.

 - 12467 tallies, 12696 msec.

**Leaves**
13.8% {1752ms} RGSixtyFourBitRegister64>>loadFrom:
8.7% {1099ms} RGSixtyFourBitRegister64>>bitXor:
7.2% {911ms} RGSixtyFourBitRegister64>>+=
6.0% {763ms} SHA256Inlined64>>processBuffer
5.9% {751ms} RGThirtyTwoBitRegister64>>loadFrom:
4.2% {535ms} RGThirtyTwoBitRegister64>>+=
3.9% {496ms} Random>>nextBytes:into:startingAt:
3.5% {450ms} RGThirtyTwoBitRegister64>>bitXor:
3.4% {429ms} LargePositiveInteger(Integer)>>bitShift:
3.3% {413ms} [] SystemProgressMorph(Morph)>>updateDropShadowCache
3.0% {382ms} RGSixtyFourBitRegister64>>leftRotateBy:
2.2% {280ms} RGThirtyTwoBitRegister64>>leftRotateBy:
1.6% {201ms} Random>>generateStates
1.5% {188ms} SHA512p256(SHA512)>>processBuffer
1.5% {184ms} SHA256Test(TestCase)>>timeout:after:
1.4% {179ms} SHA1Inlined64>>processBuffer
1.4% {173ms} RGSixtyFourBitRegister64>>bitAnd:

**Memory**
    old            -16,777,216 bytes
    young        +18,039,800 bytes
    used        +1,262,584 bytes
    free        -18,039,800 bytes

**GCs**
    full            1 totalling 86 ms (0.68% uptime), avg 86 ms
    incr            307 totalling 81 ms (0.6% uptime), avg 0.3 ms
    tenures        7,249 (avg 0 GCs/tenure)
    root table    0 overflows
As shown, 1 full GC occurred in 86 ms and 307 incremental GCs occurred for a total of 81 ms. All of this GC activity occurred within a profile run lasting 12.7 seconds. The total GC time is just 1.31% of the total time. Very fast.

 The solution for Smalltalk is more complicated, and will involve a concurrent collector.  The best one I can find now is Orca.  If you know a better one, please share your facts.

 

As different event loops on different cores will use the same

 

externalizing remote interface

 

This idea is not clear.  Is there a description of it?

So I gather that the Orca/Pony solution does not treat inter-actor messages within the same process as remote calls? If each core has a separate thread, and thus a separate event loop, it makes sense to treat references to actors in other event loops as remote actors. Thus the parallelism is well defined.

 

to reach other event loops, we do not need a runtime that can run on all of those cores. We just need to start the minimal image on the CogVM with remote capabilities

 

Pony doesn’t yet have machine-node remoteness.  The networked version is being planned, but is a ways off still.  By remote, do you mean:  another machine or another OS/CogVM process on the same machine?

Yes, I mean both. I also mean between two event loops within the same process, different threads.


I think the Pony runtime is still creating by default just one OS process per app and as many threads as needed, with each actor having only one thread of execution by definition of what an actor is (single-threaded, very simple, very small).  A scheduler keeps all cores busy, running and interleaving all the current actor threads.  Message tracing maintains ref counts.  A cycle-detector keeps things tidy.  Do Squeak and Pharo have those abilities?

 

to share workload.

 

With Pony-Orca, sharing of the workload doesn’t need to be managed by the programmer.

When I said sharing of workload is a primary challenge, I did not mean explicitly managing concurrency; the event loop ensures concurrency safety. I meant that the design of a parallelized application into concurrent actors is the challenge, and it exists for Smalltalk capabilities and Pony capabilities alike. In fact, instead of talking about actors, concurrency, and parallel applications, I prefer to speak of a capabilities model, inherently on an event loop, which is the focal point for safe concurrency.

That’s one of the basic reasons for the existence of Pony-Orca.  The Pony-Orca dev writes his actors, and they run automatically in load-balance, via the actor-thread scheduler and work-stealing, when possible, on all the cores.  Making Smalltalk work with Orca is, at this early stage, about understanding how Orca works (study the C++ and program in Pony) and how to implement it, if possible, in a Smalltalk simulator.  Concerning Orca in particular, if you notice at the end of the paper, they tested Orca against the Erlang VM, C4, and G1, and it performed much better than all of them.

I suppose it should be measured against the CogVM, to know for sure whether the single large heap is a performance bottleneck as compared to Pony/Orca performance with tiny per-actor heaps.

The biggest challenge, I think you would agree, is the system/application design that provides the opportunities to take advantage of parallelism. It kinda fits the microservices arch. So, we would run 64 instances of Squeak to take the multicore to town.

 

No, that’s much slower.  Squeak/Pharo still has the basic threading handicap:  a single large heap.

In my proposal, with 64 separate Squeak processes running across 64 cores, there will be 64 heaps, 1 per process. There will be a finite number of Capability actors in each event loop. This finite set of actors within one event loop will be GC-able by the global collector, full and incremental. As all inter-event-loop interaction occurs through remote message passing, the differences between inter-vat communication (a vat is the event loop) within one process (create two local Vats), inter-vat communication between event loops in different processes on the same machine, and inter-vat communication between event loops in different processes on different machines are all modeled exactly the same way: remote event loops.

 Here’s the gist of the problem again:  the big heap will not work and must go away, if we are to have extreme speed and a generalized multithreading programming solution. 

I am not convinced of this.

 My current understanding is that Pony-Orca (or Smalltalk-Orca) starts one OS process, and then spawns threads, as new actors begin working.  You don’t need to do anything special as a programmer to make that happen.  You just write the actors, keep them small, use the ref-caps correctly so that the program compiles (the ref-caps must also be applied to Smalltalk classes), and organize your synchronous code into classes, as usual.  Functions run synchronous code.  Behaviours run asynchronous code.

My point was that "writing the actors" and "organizing your synchronous code into classes" are challenging in the sense of choosing what is asynchronous and what is synchronous. The parallel design space holds primacy.

The issue is not whether to use Pony.  I don’t like Pony, the language; it’s okay, even very good, but it’s not Smalltalk.  I like Smalltalk, whose concurrency model is painfully lame.

Squeak concurrency model.

Installer ss
    project: 'Cryptography';
    install: 'CapabilitiesLocal'

What abilities does the above install give Squeak?

This installs a local-only (no remote capabilities) capabilities model that attempts to implement the E-Rights capabilities model [3] in Squeak. This also ensures inter-actor concurrency safety.

I like Orca because it works on many cores (as many as 64, currently) without a synchronization step for GC, and has wonderful concurrency abilities.  Pony and Orca were co-designed.  The deferred reference counts managed by Orca run on the messages between the actors (send/receive tracing).  GCs happen in Pony/Orca when each actor finishes its response to the last received message, and goes idle.  The actor then GCs all objects no longer referenced by other actors.  The runtime scheduler takes this time needed for each actor’s GCing into account.  No actor waits to GC objects.  An actor’s allocated objects’ ref counts are checked at idle-time, and unreferenced objects are GCed in an ongoing, fluid way, in small, high-frequency bursts, with very small, predictable tail latencies, as a result.  That’s very interesting if you need smoothly running apps (graphics), design/program real-time control systems, or process data at high rates, as in financial applications at banks and exchanges.

So your use of Pony is purely to access the Orca vm?

 

Orca is not a VM; it’s a garbage collection protocol for actor-based systems. 

 

I suggest using Pony-Orca to learn how Orca works, and then replace the Pony part of Pony-Orca with Smalltalk (dynamic typing), keeping the ref-caps (because they provide the guarantees).  I realize that this is a big undertaking.  Or:  write a new implementation of Orca in Smalltalk for the VM.  This is currently second choice, but that could change.

 

I think you will find the CogVM quite interesting and performant.

 

Not with its current architecture.

 

If the CogVM is not able to:

1) dynamically schedule unlimited actor-threads on all cores

Why not separate actor event-loop processes on each core, communicating remotely? [4][5]

2) automatically load-balance

Use of mobility with actors would allow for automated rebalancing.

3) support actor-based programs innately

With this code, asynchronous computation of "number eventual * 100" occurs in an event loop and resolves the promise

[:number | number eventual * 100] value: 0.03 "returning an unresolved promise until the async computation completes and resolves the promise"

Am I wrong to state that this model provides innate support for actors? Or were you somehow stating that the VM would need innate support? Why does the VM have to know?

4) guarantee no data-races

The issue to observe is whether computations are long-running and livelock the event loop, preventing it from handling other activations. This is a shared issue, as Pony/Orca is also susceptible to it. E-Rights' event loops ensure no data races, as long as actor objects are not accessible from more than one event loop.

 then, no, it is definitely not as interesting as the best concurrent collectors, like Orca, with an integrated type system and language.  Orca has been applied successfully to Pony.  Orca was also applied to the language Encore.  If CogVM can be changed to implement a concurrent collector, then CogVM is interesting.  That’s a big change.  The main value of CogVM now seems to be as a possible building/rebuilding tool for the VM itself. 

 

Did you study the Wallaroo learning experience concerning performance?

 

I’ve no interest in coding custom, one-off, multi-core apps (or settling for a much slower general solution, as in the Erlang-like concurrency model in Squeak).  Custom-coded multithreading is too costly and too error-prone.  It’s not fun, productive, or even needed, unless you really do need an extremely optimized concurrent solution for a specific domain.  I don’t want inter-process communication before inter-thread communication (much faster) has been exhausted.

Imagine a cloud-based compute engine that uses inter-machine actors to process events from a massively parallel Cassandra database. Inter-thread communication is not sufficient, as there are hundreds of separate nodes. Design-wise, it makes much sense to treat inter-thread, inter-process, and inter-machine concurrency as the same remote interface.

 The concurrent collector, Orca in this case, in conjunction with the ref-caps generalizes the multicore solution efficiently (that’s the point of it) for any actor-based program, and the zero-copy message passing gives much more speed than IPC.  The tiny heaps cause tiny pauses on async collection.  Runtime message-tracing costs decrease as use of mutable types decreases.  Message tracing happens only because there are mutable types to track and eventually collect; none of that applies to immutable types.  See the test results in the paper for details.  
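To make the mutable/immutable split concrete, here is a minimal Pony sketch (the actor and variable names are mine, for illustration only): a string literal is immutable (String’s default capability is val), so the same reference can be handed to any number of actors with no copying and no ownership transfer.

    actor Printer
      be show(s: String, out: OutStream) =>
        // s is val (deeply immutable), so sharing it across actors is safe
        out.print(s)

    actor Main
      new create(env: Env) =>
        let msg = "shared, immutable"   // string literals are String val
        let a = Printer
        let b = Printer
        a.show(msg, env.out)   // both actors receive the same reference;
        b.show(msg, env.out)   // no bytes are copied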

 

The issue is how most efficiently to use Orca, which happens to be working in Pony.  Pony is in production in two internal, speed-demanding, banking apps and in Wallaroo Labs’ high-rate streaming product.  Pony is a convenient way to study and use a working implementation of Orca.  Ergo, use Pony, even if we only study it as a good example of how to use Orca.  Some tweaks (probably a lot of them) could allow use of dynamic types.  We could roll our own implementation of Orca for the current Pharo VM, but that seems like more work than tweaking a working Pony compiler and runtime.  I’m not sure about that.  You know the VM better than I.  (I was beginning my study of the Pharo/OpenSmalltalkVM when I found Pony.)

Sounds like you might regret your choice and took the wrong path.

I don’t see how you form that conclusion.  I’ve not chosen yet.

You stated you are not thrilled with using Pony.

I seek the easiest integration/mutation path for a concurrent collector and ref-cap system.  

I can start with Pony or a Smalltalk VM simulator.  Either direction may be chosen.  Squeak/Pharo’s current architecture (it has one big heap) is not suitable for general, automatic, fast multithreading.  If all the VM C code can be simulated in Smalltalk before compiling it to an exe, then simulation may be the better path.

Come back to Squeak! ^,^

I see the Actors for Squeak page.  That is not a suitable implementation. 

I’ve not used Squeak since 2004, and don’t know its current state.  I assume that it does not have the four concurrency-related abilities listed above.  Does it? 

If you know, please share the current facts about Squeak’s concurrency abilities.  I prefer to skip the work needed to adapt Smalltalk to a concurrent collector like Orca, if those abilities already exist in Squeak/Pharo.

If most of what Squeak/Pharo offers is pleasant/productive VM simulation, much work still remains to achieve even a basic actor system and collector, but the writing of VM code in Smalltalk and compiling it to C may be much more productive than writing C++.  The C++ for the Pony compiler and runtime, however, already compiles and works well.  Thus, starting the work in C++ is somewhat tempting.  Can someone explain the limits of how the VM simulator can be used?  How much VM core C is not a part of what can be compiled from Smalltalk?  Can all VM C code be compiled from Smalltalk?

 

Shaping

-- 
Kindly,
Robert

[1] Cog Blog - http://www.mirandabanda.org/cogblog/
[2] Smalltalk, Tips 'n Tricks - https://clementbera.wordpress.com/
[3] Capability Computation - http://erights.org/elib/capability/index.html
[4] Concurrency (Event Loops) - http://erights.org/elib/concurrency/index.html
[5] Distributed Programming - http://erights.org/elib/distrib/index.html


Re: [Pharo-dev] Pony for Pharo VM

Shaping1
 

 

The Pony compiler and runtime need to be studied.

What better way than to bring the Pony compiler into Squeak? Build a Pony runtime inside Squeak, with the vm simulator. Build a VM. Then people will learn Pony and it would be great!

 

Yes, that is one way.  Then we can simulate the new collector with Smalltalk in the usual way, whilst also integrating ref-caps and dynamic types (the main challenge).  We already know that Orca works in Pony (in high-performance production—not an experiment or toy).  Still there will be bugs and perhaps room for improvements.  Smalltalk simulation would help greatly there.  The simulated Pony-Orca (the term used in the Orca paper) or simulated Smalltalk-Orca, if we can tag classes with ref-caps and keep Orca working, will run even more slowly in simulation-mode with all that message-passing added to the mix.

The cost of message passing comes down when using the CogVM JIT. It is indeed somewhat slower when running in the simulator. I think the objective should be to run the Pony bytecodes

 

Pony is a language, compiler and runtime.  The compiler converts Pony source to machine code.

 on the jitting CogVM. This VM allows you to install your own BytecodeEncoderSet. Note that I was definitely promoting a solution of running Pony on the CogVM, not Orca.

 

Pony is not a VM, either--no bytecodes.  We would be studying the Orca structure in the Pony C/C++, how that fits with the ref-caps, and then determining how to write something similar in the VM, or working Smalltalk dynamic types into the existing Pony C/C++ (not nearly as fun, probably).



 I’m starting to study the Pharo VM.  Can someone suggest what to read?  I see what appears to be outdated VM-related material.  I’m not sure what to study (besides the source code) and what to ignore.  I’m especially interested to know what not to read.

I would suggest sticking to Squeak, instead of Pharo, as that is where the VM is designed & developed.

Okay.

How do Pharo’s and Squeak’s VMs differ?  I thought OpenSmalltalkVM was the common VM.  I also read something recently from Eliot that seemed to indicate a fork. 

I thought Pharo had the new tools, like GT, but I’m not sure.  I don’t follow Squeak anymore. 

Here are a couple of interesting blogs covering the CogVM [1][2], regarding VM documentation.

 

The problem is easy to understand.  It reduces to stop-the-world GCing in a large heap, and how to make instead many small, well-managed heaps, one per actor.  Orca does that already and demonstrates very high performance.  That’s what the Orca paper is about.

The CogVM has a single heap, divided into what I believe are called "segments", which let it grow dynamically to gain new heap space.

 

Yeah—no, it won’t work.  Sympathies.  Empathies.

https://ponylang.zulipchat.com/#narrow/search/lzip

 

Read the thread above and watch the video to sharpen your imagination and mental model, somewhat, for how real object-oriented programs work at run-time.  The video details are fuzzy, but you can get a good feel for message flow.

 

This should have happened first in Smalltalk. 

 

The performance of the GC in the CogVM is demonstrated with this profiling result from running all Cryptography tests. Load Cryptography with this script, open the Test Runner, select the Cryptography tests, and click 'Run Profiled':

Installer ss
    project: 'Cryptography';
    install: 'ProCrypto-1-1-1';
    install: 'ProCryptoTests-1-1-1'.

Here are the profiling results.

 - 12467 tallies, 12696 msec.

**Leaves**
13.8% {1752ms} RGSixtyFourBitRegister64>>loadFrom:
8.7% {1099ms} RGSixtyFourBitRegister64>>bitXor:
7.2% {911ms} RGSixtyFourBitRegister64>>+=
6.0% {763ms} SHA256Inlined64>>processBuffer
5.9% {751ms} RGThirtyTwoBitRegister64>>loadFrom:
4.2% {535ms} RGThirtyTwoBitRegister64>>+=
3.9% {496ms} Random>>nextBytes:into:startingAt:
3.5% {450ms} RGThirtyTwoBitRegister64>>bitXor:
3.4% {429ms} LargePositiveInteger(Integer)>>bitShift:
3.3% {413ms} [] SystemProgressMorph(Morph)>>updateDropShadowCache
3.0% {382ms} RGSixtyFourBitRegister64>>leftRotateBy:
2.2% {280ms} RGThirtyTwoBitRegister64>>leftRotateBy:
1.6% {201ms} Random>>generateStates
1.5% {188ms} SHA512p256(SHA512)>>processBuffer
1.5% {184ms} SHA256Test(TestCase)>>timeout:after:
1.4% {179ms} SHA1Inlined64>>processBuffer
1.4% {173ms} RGSixtyFourBitRegister64>>bitAnd:

**Memory**
    old            -16,777,216 bytes
    young        +18,039,800 bytes
    used        +1,262,584 bytes
    free        -18,039,800 bytes

**GCs**
    full            1 totalling 86 ms (0.68% uptime), avg 86 ms
    incr            307 totalling 81 ms (0.6% uptime), avg 0.3 ms
    tenures        7,249 (avg 0 GCs/tenure)
    root table    0 overflows

As shown, 1 full GC occurred in 86 ms

 

Not acceptable.  Too long. 

 

and 307 incremental GCs occurred for a total of 81 ms. All of this GC activity occurred within a profile run lasting 12.7 seconds. The total GC time is just 1.31% of the total time. Very fast.

 

Not acceptable.  Too long.  And, worse, it won’t scale.  The problem is not the percentage; it’s the big delays amidst other domain-specific computation.  These times must be much smaller and spread out across many pauses during domain-specific computations.   No serious real-time apps can be made in this case.

I suggest studying the Pony and Orca material, if the video and accompanying explanation don’t clarify Pony-Orca speed and scale.  

 

 The solution for Smalltalk is more complicated, and will involve a concurrent collector.  The best one I can find now is Orca.  If you know a better one, please share your facts.

 

As different event loops on different cores will use the same

 

externalizing remote interface

 

This idea is not clear.  Is there a description of it?

So I gather that the Orca/Pony solution does not treat inter-actor messages within the same process as remote calls?

 

Why would the idea of ‘remote’ enter here?  The execution scope is an OS process.  Pony actors run on their respective threads in one OS process.  Message passing is zero-copy; all “passing” is done by reference.  No data is actually copied.  The scheduler interleaves all threads needing to share a core if there are more actors than cores.  Switching time for actor threads, in that case, is 5 to 15 ns.  This was mentioned before.  Opportunistic work stealing happens.  That means that all the cores stay as busy as possible if there is any work at all left to do.  All of this happens by design without intervention or thought from the programmer.  You can read about this in the links given earlier.  I suggest we copy the design for Smalltalk.
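As a sketch of what that zero-copy passing looks like at the language level (illustrative names, not taken from the runtime sources): a mutable buffer typed iso has exactly one reference anywhere, so the runtime can move the pointer itself between actors, and the compiler forbids any use of the old name after the consume.

    actor Consumer
      be take(buf: Array[U8] iso) =>
        // We hold the only reference, so mutation is race-free by construction.
        let b: Array[U8] ref = consume buf
        b.push(42)

    actor Main
      new create(env: Env) =>
        let c = Consumer
        let buf = recover iso Array[U8] end
        c.take(consume buf)   // ownership moves by reference; no copy is made,
                              // and using buf after this line is a compile error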

 

If each core has a separate thread and thus a separate event loop, it makes sense to treat references to actors in other event loops as remote actors. Thus the parallelism is well defined.

 

 

to reach other event loops, we do not need a runtime that can run on all of those cores. We just need to start the minimal image on the CogVM with remote capabilities

 

Pony doesn’t yet have machine-node remoteness.  The networked version is being planned, but is a ways off still.  By remote, do you mean:  another machine or another OS/CogVM process on the same machine?

Yes, I mean both. I also mean between two event loops within the same process, different threads.

I think the Pony runtime still creates, by default, just one OS process per app and as many threads as needed, with each actor having only one thread of execution, by definition of what an actor is (single-threaded, very simple, very small).  A scheduler keeps all cores busy, running and interleaving all the current actor threads.  Message tracing maintains ref counts.  A cycle-detector keeps things tidy.  Do Squeak and Pharo have those abilities?

 

 

to share workload.

 

With Pony-Orca, sharing of the workload doesn’t need to be managed by the programmer.

When I said sharing of workload is a primary challenge, I did not mean explicitly managing concurrency; the event loop ensures concurrency safety. I meant that the design of a parallelized application into concurrent actors is the challenge,

 

If you can write a state-machine with actors that each do one very simple, preferably reusable thing in response to received async messages, then it’s not a challenge.  We do have to learn how to do it.  It’s not what most of us are used to.  Pony is a good tool for practicing, even if the syntax is not interesting.  Still, as mentioned, we should make tools to help with that state-machine construction.  That comes later, but it must happen.
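A toy Pony sketch of that style (every name here is made up for illustration): one actor owns one small piece of state, and each async message is one transition of the state-machine.

    primitive Open
    primitive Closed
    type DoorState is (Open | Closed)

    actor Door
      var _state: DoorState = Closed

      be open(out: OutStream) =>
        _state = Open          // one behaviour = one transition; an actor
        out.print("open")      // handles its messages one at a time

      be close(out: OutStream) =>
        _state = Closed
        out.print("closed")

    actor Main
      new create(env: Env) =>
        let d = Door
        d.open(env.out)    // async sends; messages from one sender
        d.close(env.out)   // to one receiver arrive in order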

 

Pony has Actors.  It also has Classes.  The actors have behaviours.  Think of these as async methods.  Smalltalk would need new syntax for Actors, behaviours, and the ref-caps that type the objects.  Doing this last bit well is the task that concerns me most. 
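For readers who have not seen the split, a minimal Pony sketch (illustrative names only): classes carry synchronous code in functions (fun), actors expose asynchronous behaviours (be), and behaviours have no return value because the caller does not wait.

    class Accumulator
      var _total: U64 = 0

      fun ref add(n: U64): U64 =>   // fun: synchronous, can return a value
        _total = _total + n
        _total

    actor Summer
      let _acc: Accumulator = Accumulator

      be add(n: U64) =>             // be: asynchronous, fire-and-forget
        _acc.add(n)                 // plain synchronous call inside the behaviour

      be report(out: OutStream) =>
        out.print(_acc.add(0).string())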

 

that exists for Smalltalk capabilities and Pony capabilities. In fact, instead of talking about actors, concurrency & parallel applications, I prefer to speak of a capabilities model, inherently on an event loop, which is the focal point for safe concurrency.

 

I suggest a study of the Pony scheduler.  There are actors, mailboxes, message queues, and the scheduler, mainly.   You don’t need to be concerned about safety.  It’s been handled for you by the runtime and ref-caps.



  That’s one of the basic reasons for the existence of Pony-Orca.  The Pony-Orca dev writes his actors, and they run automatically in load-balance, via the actor-thread scheduler and work-stealing, when possible, on all the cores.  Making Smalltalk work with Orca is, at this early stage, about understanding how Orca works (study the C++ and program in Pony) and how to implement it, if possible, in a Smalltalk simulator.  Concerning Orca in particular: if you notice, at the end of the paper they tested Orca against the Erlang VM, C4, and G1, and it performed much better than all of them.

I suppose it should be measured against the CogVM, to know for sure if the single large heap is a performance bottleneck as compared to Pony/Orca performance with tiny per-actor heaps.

 

I don’t have time for Pony programming these days--I can’t even read about it these days.  Go ahead if you wish.

 

Your time is better spent in other ways, though.  The speed and scale advantages of Orca over the big-heap approach have been demonstrated.  That was done some time ago.   Read the paper by Clebsch and friends for details.  Read about Wallaroo Labs’ field experience whilst preparing to use Pony.  Or better, learn to write a Pony program.  If your resources don’t allow that, chat with Rocco Bowling (link above).  Everyone on Pony Zulip is very helpful and super-enthusiastic about Pony—and it doesn’t even have its own debugger, the last time I checked.  The tooling is poor, and people still love this thing.  Odd.



The biggest challenge, I think you would agree is the system/application design that provides the opportunities to take advantage of parallelism. It kinda fits the microservices arch. So, we would run 64 instances of squeak to take the multicore to town.

 

No, that’s much slower.  Squeak/Pharo still has the basic threading handicap:  a single large heap.

In my proposal, with 64 separate squeak processes running across 64 cores, there will be 64 heaps,

 

That would be too few actors, in general.  We are not thinking on the same scale for speed and actor-count. 

Expect actor counts to scale into the thousands or tens of thousands.  There are about 100 in the app above.  

 

1 per process. There will be a finite number of Capability actors in each event loop. This finite set of actors within one event loop will be GC-able by the global collector, full & incremental. As all inter-event-loop interaction occurs through remote message passing, inter-vat communication (a vat is the event loop) within one process (create two local Vats), between event-loops in different processes on the same machine, and between event-loops in different processes on different machines are all modeled exactly the same way: as remote event loops.

 Here’s the gist of the problem again:  the big heap will not work and must go away, if we are to have extreme speed and a generalized multithreading programming solution. 

I am not convinced of this.

You must read others’ measurements, or write your own programs and run the tests to get those measurements.  Read about the measurements made in the academic paper I cited.   That’s the easy way.  You can also read the one from Sebastian Blessing, from 2013:  https://www.ponylang.io/media/papers/a_string_of_ponies.pdf

 

 My current understanding is that Pony-Orca (or Smalltalk-Orca) starts one OS process and then spawns threads as new actors begin working.  You don’t need to do anything special as a programmer to make that happen.  You just write the actors, keep them small, use the ref-caps correctly so that the program compiles (the ref-caps must also be applied to Smalltalk classes), and organize your synchronous code into classes, as usual.  Functions run synchronous code.  Behaviours run asynchronous code.

My point was "writing the actors" and "organizing your synchronous code into classes" are challenging in the sense of choosing what is asynchronous and what is synchronous.

 

Yup, but only for a while.  Then you get used to it, and can’t imagine anything different, like not having a big heap.

 The parallel design space holds primacy.

 

No, strictly, the state-machine design does.  The parallelization is done for you. 

 

You’re not parallelizing anything.  That’s not your job.  (What a relief, yes?)  You’re an application programmer.  You’re writing a state-machine for your app, and distributing its work across specialized actors, which you code and whose async messages to each other change object data slots (wherever they happen to live—which need not concern you), and thus change the state of the state-machine you designed. 

 

You can’t use the multicore hardware you already own, or the goodness in the Orca and ref-cap design, if you can’t write a state-machine and use actors, or don’t have a tool to help you do that.  Most of us will want to use such a tool even if we are fluent at state-machine design.  This doesn’t even exist in Pony.  It’s very raw over there, but you get used to the patterns, as with any new strategy.  Still, I want a tool.   Don’t you?

 

Two tasks:  1) build tools to help us make state-machines in a reliable, pleasant way, so that we feel compelled and happy to do it; and 2) make Pony-style scheduling, ref-caps, and Orca memory management work in Smalltalk.



 The issue is not whether to use Pony.  I don’t like Pony, the language; it’s okay, even very good, but it’s not Smalltalk.  I like Smalltalk, whose concurrency model is painfully lame.

Squeak concurrency model.

Installer ss
    project: 'Cryptography';
    install: 'CapabilitiesLocal'

What abilities does the above install give Squeak?

This installs a local-only (no remote capabilities) capabilities model that attempts to implement the E-Rights capabilities model [3] in Squeak. This also ensures inter-actor concurrency safety.

[…]

 

If the CogVM is not able to:

1) dynamically schedule unlimited actor-threads on all cores

Why not separate actor event-loop processes on each core, communicating remotely? [4][5]

 

--Because it will continue the current Smalltalk-concurrency lameness.  It’s a patch.  And still it will not allow the system to scale.  The concurrency problem has been solved nearly optimally and at high resolution in the current Pony-Orca.  There’s room for improvement, but it’s already in a completely different performance league compared to any big-heap Smalltalk.  If I’m to work hard on an implementation of this design for Smalltalk, I need a much greater speed-up and scaling ability than what these patches give. 

 

2) automatically load-balance

Use of mobility with actors would allow for automated rebalancing.

Speed hit.

Too slow/wasteful.   Moving an actor isn’t needed if each has its own heap.

 

3) support actor-based programs innately

With this code, asynchronous computation of "number eventual * 100" occurs in an event loop and resolves the promise

[:number | number eventual * 100] value: 0.03 "returning an unresolved promise until the async computation completes and resolves the promise"

 

Promises and notifications are fine.  Both happen in Pony-Orca.  But the promises don’t fix the big performance problems.

Am I wrong to state that this model provides innate support for actors? Or were you somehow stating that the VM would need innate support? Why does the VM have to know?

It’s not enough.  We still have the big pauses from GCs in a large heap.

4) guarantee no data-races

The issue to observe is whether computations are long-running and livelock the event loop, keeping it from handling other activations. This is a shared issue, as Pony/Orca is also susceptible to it.

 

Yes, and a dedicated cycle-detecting actor watches for this in Pony-Orca. 

 

E-right's event loops ensure no data races, as long as actor objects are not accessible from more than one event-loop.

 

Speed hit.

No blocking and no write barriers exist in Pony-Orca.  You can’t wait.  If you need to “wait,” you set a timer and respond to the event when the timer fires.    
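For reference, that timer pattern with the Pony standard library’s "time" package looks roughly like this (the Tick class name is mine):

    use "time"

    class Tick is TimerNotify
      let _out: OutStream

      new iso create(out: OutStream) =>
        _out = out

      fun ref apply(timer: Timer, count: U64): Bool =>
        _out.print("tick")   // runs when the timer fires; nothing ever blocked
        true                 // true keeps the repeating timer alive

    actor Main
      new create(env: Env) =>
        let timers = Timers
        let t = Timer(Tick(env.out), 1_000_000_000, 1_000_000_000)  // fire in 1 s, then every 1 s
        timers(consume t)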



Imagine a cloud-based compute engine that uses inter-machine actors to process events from a massively parallel Cassandra database. Inter-thread communication is not sufficient, as there are hundreds of separate nodes.

 

Yes; I didn’t claim otherwise.  The networked version is coming.  See above.   My point is that the ‘remote’ characterization is not needed.  It’s not helping us describe and understand.

 

Design-wise, it makes much sense to treat inter-thread, inter-process, and inter-machine concurrency as the same remote interface.

 

No new design is needed for concurrency and interfacing.  There is much to implement, however.

 

The design is already done, modulo the not-yet-present network extension.  Interfacing between actors is always by async messaging.  Messaging will work as transparently as possible in the networked version across machine nodes. 

  

[…]

You stated you are not thrilled with using Pony.

 

I don’t like the Pony language syntax.  I don’t like anything that looks like Algol-60.  Pony is a language, compiler, and runtime implementing Orca.  The other stuff is good.  And I’ve not had much time to use it; I suspect I could like it more.



[…]

If most of what Squeak/Pharo offers is pleasant/productive VM simulation, much work still remains to achieve even a basic actor system and collector, but the writing of VM code in Smalltalk and compiling it to C may be much more productive than writing C++.  The C++ for the Pony compiler and runtime, however, already compiles and works well.  Thus, starting the work in C++ is somewhat tempting.  Can someone explain the limits of how the VM simulator can be used?  How much VM core C is not a part of what can be compiled from Smalltalk?  Can all VM C code be compiled from Smalltalk?

 

Can someone answer the above question?

[1] Cog Blog - http://www.mirandabanda.org/cogblog/
[2] Smalltalk, Tips 'n Tricks - https://clementbera.wordpress.com/
[3] Capability Computation - http://erights.org/elib/capability/index.html
[4] Concurrency (Event Loops) - http://erights.org/elib/concurrency/index.html
[5] Distributed Programming - http://erights.org/elib/distrib/index.html

 

Shaping

 


Re: [Pharo-dev] Pony for Pharo VM

Robert Withers-2
 

Your link does not work. https://ponylang.zulipchat.com/#narrow/search/lzip

On 4/21/20 2:05 AM, Shaping wrote:

 

[…]

-- 
Kindly,
Robert

Re: [Pharo-dev] Pony for Pharo VM

Shaping1
 

Yeah, the search probably works for me because I’m logged in as a member.  Try this:

 

https://ponylang.zulipchat.com/ 

 

and search for ‘lzip’ at the top.  This should take you to bits of the thread by Rocco.

 

In those bits you should find the link to his YT vid showing the actor network with message flows:  https://www.youtube.com/watch?v=BslZY0D_xAg

 

 

Shaping

From: Robert [mailto:[hidden email]]
Sent: Tuesday, 21 April, 2020 02:38
To: Shaping <[hidden email]>; 'Open Smalltalk Virtual Machine Development Discussion' <[hidden email]>; 'Pharo Development List' <[hidden email]>
Subject: Re: [Vm-dev] [Pharo-dev] Pony for Pharo VM

 

Your link does not work. https://ponylang.zulipchat.com/#narrow/search/lzip

On 4/21/20 2:05 AM, Shaping wrote:

 

The Pony compiler and runtime need to be studied.

What better way than to bring the Pony compiler into Squeak? Build a Pony runtime inside Squeak, with the vm simulator. Build a VM. Then people will learn Pony and it would be great!

 

Yes, that is one way.  Then we can simulate the new collector with Smalltalk in the usual way, whilst also integrating ref-caps and dynamic types (the main challenge).  We already know that Orca works in Pony (in high-performance production—not an experiment or toy).  Still there will be bugs and perhaps room for improvements.  Smalltalk simulation would help greatly there.  The simulated Pony-Orca (the term used in the Orca paper) or simulated Smalltalk-Orca, if we can tag classes with ref-caps and keep Orca working, will run even more slowly in simulation-mode with all that message-passing added to the mix.

The cost of message passing reduces down when using the CogVM JIT. It is indeed somewhat slower when running in the simulator. I think the objective should be to run the Pony bytecodes

 

Pony is a language, compiler and runtime.  The compiler converts Pony source to machine code.

 on the jitting CogVM. This VM allows you to install your own BytecodeEncoderSet. Note that I was definitely promoting a solution of running Pony on the CogVM, not Orca.

 

Pony is not a VM, either--no bytes codes.  We would be studying Orca structure in the Pony  C/C++, how that fits with the ref-caps, and then determine how to write something similar in the VM or work Smalltalk dynamic types into the existing Pony C/C++ (not nearly as fun, probably).




 I’m starting to study the Pharo VM.  Can someone suggest what to read.  I see what appears to be outdated VM-related material.  I’m not sure what to study (besides the source code) and what to ignore.  I’m especially interested to know what not to read.

I would suggest sticking to Squeak, instead of Pharo, as that is where the VM is designed & developed.

Okay.

How do Pharo’s and Squeak’s VMs differ?  I thought OpenSmalltalkVM was the common VM.  I also read something recently from Eliot that seemed to indicate a fork. 

I thought Pharo had the new tools, like GT, but I’m not sure.  I don’t follow Squeak anymore. 

Here's a couple of interesting blogs covering the CogVM [1][2] regarding VM documentation.

 

The problem is easy to understand.  It reduces to StW GCing in a large heap and how to make instead may small, well-managed heaps, one per actor.  Orca does that already and demonstrates very high performance.  That’s what the Orca paper is about.

The CogVM has a single heap, divided into "segments" I believe they are called, to dynamically grow to gain new heap space.

 

Yeah—no, it won’t work.  Sympathies.  Empathies.

https://ponylang.zulipchat.com/#narrow/search/lzip

 

Read the thread above and watch the video to sharper your imagination and mental model, somewhat, for how real object-oriented programs work at run-time.  The video details are fuzzy, but you can get a good feel for message flow.

 

This should have happened first in Smalltalk. 

 

The performance of the GC in the CogVM is demonstrated with this profiling result running all Cryptography tests. Load Cryptography with this script, open the Test Runner select Cryptography tests and click 'Run Profiled':

Installer ss
    project: 'Cryptography';
    install: 'ProCrypto-1-1-1';
    install: 'ProCryptoTests-1-1-1'.

Here are the profiling results.

 - 12467 tallies, 12696 msec.

**Leaves**
13.8% {1752ms} RGSixtyFourBitRegister64>>loadFrom:
8.7% {1099ms} RGSixtyFourBitRegister64>>bitXor:
7.2% {911ms} RGSixtyFourBitRegister64>>+=
6.0% {763ms} SHA256Inlined64>>processBuffer
5.9% {751ms} RGThirtyTwoBitRegister64>>loadFrom:
4.2% {535ms} RGThirtyTwoBitRegister64>>+=
3.9% {496ms} Random>>nextBytes:into:startingAt:
3.5% {450ms} RGThirtyTwoBitRegister64>>bitXor:
3.4% {429ms} LargePositiveInteger(Integer)>>bitShift:
3.3% {413ms} [] SystemProgressMorph(Morph)>>updateDropShadowCache
3.0% {382ms} RGSixtyFourBitRegister64>>leftRotateBy:
2.2% {280ms} RGThirtyTwoBitRegister64>>leftRotateBy:
1.6% {201ms} Random>>generateStates
1.5% {188ms} SHA512p256(SHA512)>>processBuffer
1.5% {184ms} SHA256Test(TestCase)>>timeout:after:
1.4% {179ms} SHA1Inlined64>>processBuffer
1.4% {173ms} RGSixtyFourBitRegister64>>bitAnd:

**Memory**
    old            -16,777,216 bytes
    young        +18,039,800 bytes
    used        +1,262,584 bytes
    free        -18,039,800 bytes

**GCs**
    full            1 totalling 86 ms (0.68% uptime), avg 86 ms
    incr            307 totalling 81 ms (0.6% uptime), avg 0.3 ms
    tenures        7,249 (avg 0 GCs/tenure)
    root table    0 overflows

As shown, 1 full GC occurred in 86 ms

 

Not acceptable.  Too long. 

 

and 307 incremental GCs occurred for a total of 81 ms. All of this GC activity occurred within a profile run lasting 12.7 seconds. The total GC time is just 1.31% of the total time. Very fast.

 

Not acceptable.  Too long.  And, worse, it won’t scale.  The problem is not the percentage; it’s the big delays amidst other domain-specific computation.  These times must be much smaller and spread out across many pauses during domain-specific computations.   No serious real-time apps can be made in this case.

I suggest studying the Pony and Orca material, if the video and accompanying explanation don’t clarify Pony-Orca speed and scale.  

 

 The solution for Smalltalk is more complicated, and will involve a concurrent collector.  The best one I can find now is Orca.  If you know a better one, please share your facts.

 

As different event loops on different cores will use the same

 

externalizing remote interface

 

This idea is not clear.  Is there a description of it?

So I gather that the Orca/Pony solution does not treat inter-actor messages, within the same process to be remote calls?

 

Why would the idea of ‘remote’ enter here?  The execution scope is an OS process.  Pony actors run on their respective threads in one OS process.  Message passing is zero-copy; all “passing” is done by reference.  No data is actually copied.  The scheduler interleaves all threads needing to share a core if there are more actors than cores.  Switching time for actor threads, in that case, is 5 to 15 ns.  This was mentioned before.  Opportunistic work stealing happens.  That means that all the cores stay as busy as possible if there is any work at all left to do.  All of this happens by design without intervention or thought from the programmer.  You can read about this in the links given earlier.  I suggest we copy the design for Smalltalk.

 

If each core has a separate thread and thus a separate event loop, it makes sense to have references to actors in other event loops as a remote actor. Thus the parallelism is well defined.

 

 

to reach other event loops, we do not need a runtime that can run on all of those cores. We just need to start the minimal image on the CogVM with remote capabilities

 

Pony doesn’t yet have machine-node remoteness.  The networked version is being planned, but is a ways off still.  By remote, do you mean:  another machine or another OS/CogVM process on the same machine?

Yes, I mean both. I also mean between two event loops within the same process, different threads.

I think the Pony runtime is still creating by default just one OS process per app and as many threads as needed, with each actor having only one thread of execution by definition of what an actor is (single-threaded, very simple, very small).  A scheduler keeps all cores busy, running and interleaving all the current actor threads.  Message tracing maintains ref counts.  A cycle-detector keep things tidy.  Do Squeak and Pharo have those abilities?

 

 

to share workload.

 

With Pony-Orca, sharing of the workload doesn’t need to be managed by the programmer.

When I said sharing of workload is a primary challenge, I do not mean explicitly managing concurrency, the event loop ensures that concurrency safety. I meant that the design of a parallelized application into concurrent actors is the challenge,

 

If you can write a state-machine with actors that each do one very simple, preferably reusable thing in response to received async messages, then it’s not a challenge.  We do have to learn how to do it.  It’s not what most of us are used to.  Pony is a good tool for practicing, even if the syntax is not interesting.  Still, as mentioned, we should make tools to help with that state-machine construction.  That comes later, but it must happen.

 

Pony has Actors.  It also has Classes.  The actors have behaviours.  Think of these as async methods.  Smalltalk would need new syntax for Actors, behaviours, and the ref-caps that type the objects.  Doing this last bit well is the task that concerns me most. 

 

that exists for Smalltalk capabilities and Pony capabilities. In fact, instead of talking about actors, concurrency & parallel applications, I prefer to speak of a capabilities model, inherently on an event loop which is the foal point for safe concurrency.

 

I suggest a study of the Pony scheduler.  There are actors, mailboxes, message queues, and the scheduler, mainly.   You don’t need to be concerned about safety.  It’s been handled for you by the runtime and ref-caps.




  That’s one of basic reasons for the existence of Pony-Orca.  The Pony-Orca dev writes his actors, and they run automatically in load-balance, via the actor-thread scheduler and work-stealing, when possible, on all the cores.  Making Smalltalk work with Orca is, at this early stage, about understanding how Orca works (study the C++ and program in Pony) and how to implement it, if possible, in a Smalltalk simulator.  Concerning Orca in particular, if you notice at end of the paper, they tested Orca against Erlang VM, C4, and G1, and it performed much better than all.

I suppose it should be measured against the CogVM, to know for sure whether the single large heap is a performance bottleneck as compared to Pony/Orca performance with tiny per-actor heaps.

 

I don’t have time for Pony programming these days--I can’t even read about it these days.  Go ahead if you wish.

 

Your time is better spent in other ways, though.  The speed and scale advantages of Orca over the big-heap approach have been demonstrated.  That was done some time ago.   Read the paper by Clebsch and friends for details.  Read Wallaroo Labs’ field experience whilst preparing to use Pony.  Or better, learn to write a Pony program.  If your resources don’t allow that, chat with Rocco Bowling (link above).  Everyone on Pony Zulip is very helpful and super-enthusiastic about Pony—and it doesn’t even have its own debugger the last time I checked.  The tooling is poor, and people still love this thing.  Odd.




The biggest challenge, I think you would agree, is the system/application design that provides the opportunities to take advantage of parallelism. It kinda fits the microservices arch. So, we would run 64 instances of squeak to take the multicore to town.

 

No, that’s much slower.  Squeak/Pharo still has the basic threading handicap:  a single large heap.

In my proposal, with 64 separate squeak processes running across 64 cores, there will be 64 heaps,

 

That would be too few actors, in general.  We are not thinking on the same scale for speed and actor-count. 

Expect actor counts to scale into the thousands or tens of thousands.  There are about 100 in the app above.  

 

1 per process. There will be a finite number of Capability actors in each event loop. This finite set of actors within one event loop will be GC-able by the global collector, full & incremental. As all inter-event loop interaction occurs through remote message passing, the differences between inter-vat (a vat is the event loop) communication within one process (create two local Vats), inter-vat communication between event-loops in different processes on the same machine and inter-vat communication between event-loops in different processes on different machines are all modeled exactly the same: remote event loops.


 Here’s the gist of the problem again:  the big heap will not work and must go away, if we are to have extreme speed and a generalized multithreading programming solution. 

I am not convinced of this.


You must read of others’ measurements, or write your own programs, and do the tests to get those measurements.  Read about the measurements made in the academic paper I cited.   That’s the easy way.  You can also read the one from Sebastian Blessing from 2013:  https://www.ponylang.io/media/papers/a_string_of_ponies.pdf

 

 My current understanding is that Pony-Orca (or Smalltalk-Orca) starts one OS process, and then spawns threads, as new actors begin working.  You don’t need to do anything special as a programmer to make that happen.  You just write the actors, keep them small, use the ref-caps correctly so that the program compiles (the ref-caps must also be applied to Smalltalk classes), and organize your synchronous code into classes, as usual.  Functions run synchronous code.  Behaviours run asynchronous code.

My point was "writing the actors" and "organizing your synchronous code into classes" are challenging in the sense of choosing what is asynchronous and what is synchronous.

 

Yup, but only for a while.  Then you get used to it, and can’t imagine anything different, like not having a big heap.

 The parallel design space holds primacy.

 

No, strictly, the state-machine design does.  The parallelization is done for you. 

 

You’re not parallelizing anything.  That’s not your job.  (What a relief, yes?)  You’re an application programmer.  You’re writing a state-machine for your app, and distributing its work across specialized actors, which you code and whose async messages to each other change object data slots (wherever they happen to live—which need not concern you), and thus change the state of the state-machine you designed. 

 

You can’t use the multicore hardware you already own or the goodness in the Orca and ref-cap design if you can’t write a state-machine, and use actors, or don’t have a tool to help you do that.  Most of us will want to use such a tool even if we are fluent at state-machine design.  This doesn’t even exist in Pony.  It’s very raw over there, but you get used to the patterns, as with any new strategy.  Still I want a tool.   Don’t you?

 

Two tasks:  1) build tools to help us make state-machines in a reliable, pleasant way, so that we feel compelled and happy to do it; and 2) make Pony-style scheduling, ref-caps, and Orca memory management work in Smalltalk.




The issue is not whether to use Pony.  I don’t like Pony, the language; it’s okay, even very good, but it’s not Smalltalk.  I like Smalltalk, whose concurrency model is painfully lame.

Squeak concurrency model.

Installer ss
    project: 'Cryptography';
    install: 'CapabilitiesLocal'

What abilities does the above install give Squeak?

This installs a local-only (no remote capabilities) capabilities model that attempts to implement the following in Squeak, the E-Rights capabilities model. [3] This also ensures inter-actor concurrency safety.

So your use of Pony is purely to access the Orca vm?

 

Orca is not a VM; it’s a garbage collection protocol for actor-based systems. 

 

I suggest using Pony-Orca to learn how Orca works, and then replace the Pony part of Pony-Orca with Smalltalk (dynamic typing), keeping the ref-caps (because they provide the guarantees).  I realize that this is a big undertaking.  Or:  write a new implementation of Orca in Smalltalk for the VM.  This is currently second choice, but that could change.

 

I think you will find the CogVM quite interesting and performant.

 

--Not with its current architecture.

 

If the CogVM is not able to:

1) dynamically schedule unlimited actor-threads on all cores

Why not separate actor event-loop processes on each core, communicating remotely? [4][5]

 

--Because it will continue the current Smalltalk-concurrency lameness.  It’s a patch.  And still it will not allow the system to scale.  The concurrency problem has been solved nearly optimally and at high resolution in the current Pony-Orca.  There’s room for improvement, but it’s already in a completely different performance league compared to any big-heap Smalltalk.  If I’m to work hard on an implementation of this design for Smalltalk, I need a much greater speed-up and scaling ability than what these patches give. 

 

2) automatically load-balance

Use of mobility with actors would allow for automated rebalancing.


Speed hit.

Too slow/wasteful.   Moving an actor isn’t needed if each has its own heap.

 

3) support actor-based programs innately

With this code, asynchronous computation of "number eventual * 100" occurs in an event loop and resolves the promise

[:number | number eventual * 100] value: 0.03 "returning an unresolved promise until the async computation completes and resolves the promise"

 

Promises and notifications are fine.  Both happen in Pony-Orca.  But the promises don’t fix the big performance problems.
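
For comparison, here is roughly how an eventual computation like the one above looks with Pony's stdlib "promises" package (a sketch only; the F64 payload and the printing are mine, for illustration):

use "promises"

actor Main
  new create(env: Env) =>
    let p = Promise[F64]
    // Runs later, as an async message, once the promise is fulfilled.
    p.next[None]({(x: F64)(env) => env.out.print((x * 100).string()) } iso)
    p(0.03) // fulfil the promise; the callback then fires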

Am I wrong to state that this model allows innate support to actors? Or were you somehow stating that the VM would need innate support? Why does the VM have to know?

It’s not enough.  We still have the big pauses from GCs in a large heap.

4) guarantee no data-races

The issue to observe is whether computations are long running and livelock the event loop from handling other activations. This is a shared issue, as Pony/Orca are also susceptible to this.

 

Yes, and a dedicated cycle-detecting actor watches for this in Pony-Orca. 

 

E-Rights' event loops ensure no data races, as long as actor objects are not accessible from more than one event-loop.

 

Speed hit.

No blocking and no write barriers exist in Pony-Orca.  You can’t wait.  If you need to “wait,” you set a timer and respond to the event when the timer fires.    
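
A sketch of that timer pattern using Pony's stdlib "time" package (the Tick class and its printed message are mine): nothing blocks; you register a notify object and react when the timer fires.

use "time"

class Tick is TimerNotify
  let _out: OutStream
  new iso create(out: OutStream) =>
    _out = out

  fun ref apply(timer: Timer, count: U64): Bool =>
    _out.print("timer fired")
    false // one-shot: do not reschedule

actor Main
  new create(env: Env) =>
    let timers = Timers
    let t = Timer(Tick(env.out), 1_000_000_000) // fire once, ~1 s (nanoseconds)
    timers(consume t)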




Imagine a cloud-based compute engine, processing Cassandra events, that uses inter-machine actors to process the massively parallel Cassandra database. Inter-thread communication is not sufficient as there are hundreds of separate nodes.

 

Yes; I didn’t claim otherwise.  The networked version is coming.  See above.   My point is that the ‘remote’ characterization is not needed.  It’s not helping us describe and understand.

 

Design-wise, it makes much sense to treat inter-thread, inter-process and inter-machine concurrency as the same remote interface.

 

No new design is needed for concurrency and interfacing.  There is much to implement, however.

 

The design is already done, modulo the not-yet-present network extension.  Interfacing between actors is always by async messaging.  Messaging will work as transparently as possible in the networked version across machine nodes. 

  

The issue is how most efficiently to use Orca, which happens to be working in Pony.  Pony is in production in two internal, speed-demanding, banking apps and in Wallaroo Labs’ high-rate streaming product.  Pony is a convenient way to study and use a working implementation of Orca.  Ergo, use Pony, even if we only study it as a good example of how to use Orca.  Some tweaks (probably a lot of them) could allow use of dynamic types.  We could roll our own implementation of Orca for the current Pharo VM, but that seems like more work than tweaking a working Pony compiler and runtime.  I’m not sure about that.  You know the VM better than I.  (I was beginning my study of the Pharo/OpenSmalltalkVM when I found Pony.)

Sounds like you might regret your choice and took the wrong path.

I don’t see how you form that conclusion.  I’ve not chosen yet.

You stated you are not thrilled with using Pony.

 

I don’t like the Pony language syntax.  I don’t like anything that looks like Algol-60.  Pony is a language, compiler and runtime implementing Orca.  The other stuff is good.  And I’ve not had much time to use it; I suspect I could like it more.




[…]

If most of what Squeak/Pharo offers is pleasant/productive VM simulation, much work still remains to achieve even a basic actor system and collector, but the writing of VM code in Smalltalk and compiling it to C may be much more productive than writing C++.  The C++ for the Pony compiler and runtime, however, already compiles and works well.  Thus, starting the work in C++ is somewhat tempting.  Can someone explain the limits of how the VM simulator can be used?  How much VM core C is not a part of what can be compiled from Smalltalk?  Can all VM C code be compiled from Smalltalk?

 

Can someone answer the above question?

[1] Cog Blog - http://www.mirandabanda.org/cogblog/
[2] Smalltalk, Tips 'n Tricks - https://clementbera.wordpress.com/
[3] Capability Computation - http://erights.org/elib/capability/index.html
[4] Concurrency (Event Loops) - http://erights.org/elib/concurrency/index.html
[5] Distributed Programming - http://erights.org/elib/distrib/index.html

 

Shaping

 

-- 
Kindly,
Robert

Re: [Pharo-dev] Pony for Pharo VM

Robert Withers-2
 

Hi Shaping,

I was unable to find the message thread on lzip at zulipchat. I watched the video and that is how Smalltalk is built, from the basic activity of message passing. Locally (Stack), message passing implements calling another object and returns a computed value. Remotely (Queue), message passing is sending and returns a promise. Sending occurs between vats (event loops), whether inter-image between Vats, inter-process or inter-machine. All use the remote capabilities API.

More comments below....

On 4/21/20 4:56 AM, Shaping wrote:

Yeah, the search probably works for me because I’m logged in as a member.  Try this:

 

https://ponylang.zulipchat.com/ 

 

and search for ‘lzip’ at the top.  This should take you to bits of the thread by Rocco.

 

In those bits you should find the link to his YT vid showing the actor network with message flows:  https://www.youtube.com/watch?v=BslZY0D_xAg

 

 

Shaping

From: Robert [[hidden email]]
Sent: Tuesday, 21 April, 2020 02:38
To: Shaping [hidden email]; 'Open Smalltalk Virtual Machine Development Discussion' [hidden email]; 'Pharo Development List' [hidden email]
Subject: Re: [Vm-dev] [Pharo-dev] Pony for Pharo VM

 

Your link does not work. https://ponylang.zulipchat.com/#narrow/search/lzip

On 4/21/20 2:05 AM, Shaping wrote:

 

The Pony compiler and runtime need to be studied.

What better way than to bring the Pony compiler into Squeak? Build a Pony runtime inside Squeak, with the vm simulator. Build a VM. Then people will learn Pony and it would be great!

 

Yes, that is one way.  Then we can simulate the new collector with Smalltalk in the usual way, whilst also integrating ref-caps and dynamic types (the main challenge).  We already know that Orca works in Pony (in high-performance production—not an experiment or toy).  Still there will be bugs and perhaps room for improvements.  Smalltalk simulation would help greatly there.  The simulated Pony-Orca (the term used in the Orca paper) or simulated Smalltalk-Orca, if we can tag classes with ref-caps and keep Orca working, will run even more slowly in simulation-mode with all that message-passing added to the mix.

The cost of message passing reduces when using the CogVM JIT. It is indeed somewhat slower when running in the simulator. I think the objective should be to run the Pony bytecodes

 

Pony is a language, compiler and runtime.  The compiler converts Pony source to machine code.

 on the jitting CogVM. This VM allows you to install your own BytecodeEncoderSet. Note that I was definitely promoting a solution of running Pony on the CogVM, not Orca.

 

Pony is not a VM, either--no bytecodes.

Oh, well then it is not possible.

We would be studying Orca structure in the Pony C/C++ and how that fits with the ref-caps, and then determining how to write something similar in the VM, or working Smalltalk dynamic types into the existing Pony C/C++ (not nearly as fun, probably).




I’m starting to study the Pharo VM.  Can someone suggest what to read?  I see what appears to be outdated VM-related material.  I’m not sure what to study (besides the source code) and what to ignore.  I’m especially interested to know what not to read.

I would suggest sticking to Squeak, instead of Pharo, as that is where the VM is designed & developed.

Okay.

How do Pharo’s and Squeak’s VMs differ?  I thought OpenSmalltalkVM was the common VM.  I also read something recently from Eliot that seemed to indicate a fork. 

I thought Pharo had the new tools, like GT, but I’m not sure.  I don’t follow Squeak anymore. 

Pharo may; it is fast moving and they drop historical support as new tools come online. I don't follow Pharo anymore. There is a common VM but the builds are separate.

Here's a couple of interesting blogs covering the CogVM [1][2] regarding VM documentation.

 

The problem is easy to understand.  It reduces to StW GCing in a large heap, and how to make instead many small, well-managed heaps, one per actor.  Orca does that already and demonstrates very high performance.  That’s what the Orca paper is about.

The CogVM has a single heap, divided into what I believe are called "segments", which let it grow dynamically to gain new heap space.

 

Yeah—no, it won’t work.  Sympathies.  Empathies.

https://ponylang.zulipchat.com/#narrow/search/lzip

Here was the thread reference I was unable to follow. You provided it around a discussion of why the CogVM was "not acceptable". Say what? So we are adding a near real-time requirement? I would suggest that the CogVM meets near real-time requirements. The longest GC pause may be 100 ms, let us say. That is still near real-time.

 

Read the thread above and watch the video to sharpen your imagination and mental model, somewhat, for how real object-oriented programs work at run-time.  The video details are fuzzy, but you can get a good feel for message flow.

Exactly the way Smalltalk operates at runtime. Smalltalk was made with the benefit that the core message passing paradigm is the exact model of interaction we see remotely: message passing. Squeak is a native message passing machine.

 

This should have happened first in Smalltalk. 

It did.

 

The performance of the GC in the CogVM is demonstrated with this profiling result running all Cryptography tests. Load Cryptography with this script, open the Test Runner select Cryptography tests and click 'Run Profiled':

Installer ss
    project: 'Cryptography';
    install: 'ProCrypto-1-1-1';
    install: 'ProCryptoTests-1-1-1'.

Here are the profiling results.

 - 12467 tallies, 12696 msec.

**Leaves**
13.8% {1752ms} RGSixtyFourBitRegister64>>loadFrom:
8.7% {1099ms} RGSixtyFourBitRegister64>>bitXor:
7.2% {911ms} RGSixtyFourBitRegister64>>+=
6.0% {763ms} SHA256Inlined64>>processBuffer
5.9% {751ms} RGThirtyTwoBitRegister64>>loadFrom:
4.2% {535ms} RGThirtyTwoBitRegister64>>+=
3.9% {496ms} Random>>nextBytes:into:startingAt:
3.5% {450ms} RGThirtyTwoBitRegister64>>bitXor:
3.4% {429ms} LargePositiveInteger(Integer)>>bitShift:
3.3% {413ms} [] SystemProgressMorph(Morph)>>updateDropShadowCache
3.0% {382ms} RGSixtyFourBitRegister64>>leftRotateBy:
2.2% {280ms} RGThirtyTwoBitRegister64>>leftRotateBy:
1.6% {201ms} Random>>generateStates
1.5% {188ms} SHA512p256(SHA512)>>processBuffer
1.5% {184ms} SHA256Test(TestCase)>>timeout:after:
1.4% {179ms} SHA1Inlined64>>processBuffer
1.4% {173ms} RGSixtyFourBitRegister64>>bitAnd:

**Memory**
    old            -16,777,216 bytes
    young        +18,039,800 bytes
    used        +1,262,584 bytes
    free        -18,039,800 bytes

**GCs**
    full            1 totalling 86 ms (0.68% uptime), avg 86 ms
    incr            307 totalling 81 ms (0.6% uptime), avg 0.3 ms
    tenures        7,249 (avg 0 GCs/tenure)
    root table    0 overflows

As shown, 1 full GC occurred in 86 ms

 

Not acceptable.  Too long.

What is your near real-time requirement?

 and 307 incremental GCs occurred for a total of 81 ms. All of this GC activity occurred within a profile run lasting 12.7 seconds. The total GC time is just 1.31% of the total time. Very fast.

 

Not acceptable.  Too long.  And, worse, it won’t scale.

I am unaware of any scaling problems. In Networking, 1000s of concurrent connections are supported. In computations, 10,000s of objects. What are your timing requirements? Each incremental took a fraction of a millisecond to compute: 264 microseconds.

 The problem is not the percentage; it’s the big delays amidst other domain-specific computation.  These times must be much smaller and spread out across many pauses during domain-specific computations.

See the 307 incremental GCs? These are 264 microsecond delays spread out across domain-specific computations.

   No serious real-time apps can be made in this case.

Of course they can. Model the domain as resilient & accepting of 100 ms pauses, for full GCs. It may be that more could be done to the CogVM for near real-time; I am not very knowledgeable about the VM.

I suggest studying the Pony and Orca material, if the video and accompanying explanation don’t clarify Pony-Orca speed and scale.

Yeah, the video did not suggest anything other than using message passing. I could not find the thread discussing GC. Would you please post the specific URL to get to that resource? I do not want to guess any longer.

 

 The solution for Smalltalk is more complicated, and will involve a concurrent collector.  The best one I can find now is Orca.  If you know a better one, please share your facts.

 

As different event loops on different cores will use the same

 

externalizing remote interface

 

This idea is not clear.  Is there a description of it?

So I gather that the Orca/Pony solution does not treat inter-actor messages within the same process as remote calls?

 

Why would the idea of ‘remote’ enter here?  The execution scope is an OS process.  Pony actors run on their respective threads in one OS process.  Message passing is zero-copy; all “passing” is done by reference.  No data is actually copied.

In the SqueakELib Capabilities model, between Vats (in-process, inter-process & inter-machine-node) most references to Actors are remote and then we have zero-copy. Sometimes we need to pass numbers/strings/collections and those are pass-by-copy.

  The scheduler interleaves all threads needing to share a core if there are more actors than cores.  Switching time for actor threads, in that case, is 5 to 15 ns.  This was mentioned before.  Opportunistic work stealing happens.  That means that all the cores stay as busy as possible if there is any work at all left to do.  All of this happens by design without intervention or thought from the programmer.  You can read about this in the links given earlier.  I suggest we copy the design for Smalltalk.

Which specific links? Could you send a summary email?

 

If each core has a separate thread and thus a separate event loop, it makes sense to have references to actors in other event loops as a remote actor. Thus the parallelism is well defined.

 

 

to reach other event loops, we do not need a runtime that can run on all of those cores. We just need to start the minimal image on the CogVM with remote capabilities

 

Pony doesn’t yet have machine-node remoteness.  The networked version is being planned, but is a ways off still.  By remote, do you mean:  another machine or another OS/CogVM process on the same machine?

Yes, I mean both. I also mean between two event loops within the same process, different threads.

I think the Pony runtime still creates by default just one OS process per app and as many threads as needed, with each actor having only one thread of execution by definition of what an actor is (single-threaded, very simple, very small).  A scheduler keeps all cores busy, running and interleaving all the current actor threads.  Message tracing maintains ref counts.  A cycle-detector keeps things tidy.  Do Squeak and Pharo have those abilities?

In the case of remote capability references, there is reference counting. This occurs inside the Scope object; there are 6 tables: 2 for third party introduction (gift tables), 2 for outgoing references (#answers & #export) and 2 for incoming references (#questions & #imports). These tables manage all the remote reference counting, once again between any two Vats (in-process, inter-process & inter-machine-node). There are 2 GC messages sent back from a remote node (GCAnswer & GCExport) for each of the outgoing references. Say Alice has a reference to a remote object in Bob: when the internal references to Alice's reference end and the RemoteERef is to be garbage collected, a GC message is sent to the hosting Vat, Bob.

 

 

to share workload.

 

With Pony-Orca, sharing of the workload doesn’t need to be managed by the programmer.

When I said sharing of workload is a primary challenge, I do not mean explicitly managing concurrency; the event loop ensures concurrency safety. I meant that the design of a parallelized application into concurrent actors is the challenge,

 

If you can write a state-machine with actors that each do one very simple, preferably reusable thing in response to received async messages, then it’s not a challenge.  We do have to learn how to do it.  It’s not what most of us are used to.  Pony is a good tool for practicing, even if the syntax is not interesting.  Still, as mentioned, we should make tools to help with that state-machine construction.  That comes later, but it must happen.

I have some experience with state-machine construction to implement security protocols. In Squeak, DoIt to this script to load Crypto and ParrotTalk & SSL (currently broken) and see some state-machines:

Installer ss
    project: 'Cryptography'; install: 'ProCrypto-1-1-1';
    project: 'Cryptography'; install: 'ProCryptoTests-1-1-1';
    project: 'Cryptography'; install: 'CapabilitiesLocal';
    project: 'Oceanside'; install: 'ston-config-map';
    project: 'Cryptography'; install: 'SSLLoader';
    project: 'Cryptography'; install: 'Raven'.

Pony has Actors.  It also has Classes.  The actors have behaviours.  Think of these as async methods.  Smalltalk would need new syntax for Actors, behaviours, and the ref-caps that type the objects.  Doing this last bit well is the task that concerns me most.

Which? "ref-caps that type the objects"? What does that mean? With the CapabilitiesLocal I pointed you to, we have an Actor model with async message passing to behaviors of an Actor. Squeak has Actors supporting remote references (3-way introductions, through the gift tables is broken. Remote references from Alice to Bob is working. See the tests in ThunkHelloWorldTest: #testConnectAES, #testConnectAESBufferOrdering & #testConnectAESBuffered.

 

that exists for Smalltalk capabilities and Pony capabilities. In fact, instead of talking about actors, concurrency & parallel applications, I prefer to speak of a capabilities model, inherently on an event loop, which is the focal point for safe concurrency.

 

I suggest a study of the Pony scheduler.  There are actors, mailboxes, message queues, and the scheduler, mainly.   You don’t need to be concerned about safety.  It’s been handled for you by the runtime and ref-caps.

Same with Raven, plus remote. We have all of that. See the PriorityVat class.




That’s one of the basic reasons for the existence of Pony-Orca.  The Pony-Orca dev writes his actors, and they run automatically, load-balanced via the actor-thread scheduler and work-stealing, when possible, on all the cores.  Making Smalltalk work with Orca is, at this early stage, about understanding how Orca works (study the C++ and program in Pony) and how to implement it, if possible, in a Smalltalk simulator.  Concerning Orca in particular, if you look at the end of the paper, you’ll see they tested Orca against the Erlang VM, C4, and G1, and it performed much better than all of them.

I suppose it should be measured against the CogVM, to know for sure whether the single large heap is a performance bottleneck as compared to Pony/Orca performance with tiny per-actor heaps.

 

I don’t have time for Pony programming these days--I can’t even read about it these days.  Go ahead if you wish.

 

Your time is better spent in other ways, though.

I communicate with you about what could be, but I agree I must stay focused on my primary target, which is porting SSL to use my new ThunkStack framework for remote encrypted communications. End-to-end encryption is what I am about. Here is a visualization of what I aim for, with TLS 1.3 and Signal as to-be-done projects; it is currently vaporware. I have ParrotTalk done and am working on SSL, then I will move to SSH. The script I listed above will load all remote packages, except for SSH. I'm attaching the flyer I created to broadcast Squeak's ProCrypto configuration.


The speed and scale advantages of Orca over the big-heap approach have been demonstrated.  That was done some time ago.   Read the paper by Clebsch and friends for details. 

Regarding capabilities please read the ELib documentation on ERights website: http://erights.org/elib/index.html

Read Wallaroo Labs’ field experience whilst preparing to use Pony.  Or better, learn to write a Pony program.  If your resources don’t allow that, chat with Rocco Bowling (link above).  Everyone on Pony Zulip is very helpful and super-enthusiastic about Pony—and it doesn’t even have its own debugger the last time I checked.  The tooling is poor, and people still love this thing.  Odd.

You state that Pony is just to access Orca. What makes Orca so great?

The biggest challenge, I think you would agree, is the system/application design that provides the opportunities to take advantage of parallelism. It kinda fits the microservices arch. So, we would run 64 instances of squeak to take the multicore to town.

 

No, that’s much slower.  Squeak/Pharo still has the basic threading handicap:  a single large heap.

In my proposal, with 64 separate squeak processes running across 64 cores, there will be 64 heaps,

 

That would be too few actors, in general.  We are not thinking on the same scale for speed and actor-count. 

There are definitely more than one Actor per Vat.

Expect actor counts to scale into the thousands or tens of thousands.  There are about 100 in the app above.  

Ditto in ELib, many Actors per Vat.

 

1 per process. There will be a finite number of Capability actors in each event loop.

But more than one, and they scale into thousands per Vat.

This finite set of actors within one event loop will be GC-able by the global collector, full & incremental. As all inter-event loop interaction occurs through remote message passing, the differences between inter-vat (a vat is the event loop) communication within one process (create two local Vats), inter-vat communication between event-loops in different processes on the same machine and inter-vat communication between event-loops in different processes on different machines are all modeled exactly the same: remote event loops.


 Here’s the gist of the problem again:  the big heap will not work and must go away, if we are to have extreme speed and a generalized multithreading programming solution. 

I am not convinced of this.


You must read of others’ measurements, or write your own programs, and do the tests to get those measurements.  Read about the measurements made in the academic paper I cited.   That’s the easy way.  You can also read the one from Sebastian Blessing from 2013:  https://www.ponylang.io/media/papers/a_string_of_ponies.pdf

Alright, reading.

My current understanding is that Pony-Orca (or Smalltalk-Orca) starts one OS process, and then spawns threads, as new actors begin working.  You don’t need to do anything special as a programmer to make that happen.  You just write the actors, keep them small, use the ref-caps correctly so that the program compiles (the ref-caps must also be applied to Smalltalk classes), and organize your synchronous code into classes, as usual.  Functions run synchronous code.  Behaviours run asynchronous code.

My point was "writing the actors" and "organizing your synchronous code into classes" are challenging in the sense of choosing what is asynchronous and what is synchronous.

 

Yup, but only for a while.  Then you get used to it, and can’t imagine anything different, like not having a big heap.

 The parallel design space holds primacy.

 

No, strictly, the state-machine design does.  The parallelization is done for you. 

 

You’re not parallelizing anything.  That’s not your job.  (What a relief, yes?)  You’re an application programmer.  You’re writing a state-machine for your app, and distributing its work across specialized actors, which you code and whose async messages to each other change object data slots (wherever they happen to live—which need not concern you), and thus change the state of the state-machine you designed. 

 

You can’t use the multicore hardware you already own or the goodness in the Orca and ref-cap design if you can’t write a state-machine, and use actors, or don’t have a tool to help you do that.  Most of us will want to use such a tool even if we are fluent at state-machine design.  This doesn’t even exist in Pony.  It’s very raw over there, but you get used to the patterns, as with any new strategy.  Still I want a tool.   Don’t you?

 

Two tasks:  1) build tools to help us make state-machines in a reliable, pleasant way, so that we feel compelled and happy to do it; and 2) make Pony-style scheduling, ref-caps, and Orca memory management work in Smalltalk.

Here is the state machine specification for ParrotTalk version 3.7, which is compiled by the ProtocolStateCompiler. This stateMap models states, triggers, transitions, defaults and callbacks, and is simple to use.

ParrotTalkSessionOperations_V3_7 class>>#stateMap

    "(((ParrotTalkSessionOperations_v3_7 stateMap compile)))"

    | desc |
    desc := ProtocolStateCompiler initialState: #initial.
    (desc newState: #initial -> (#processInvalidRequest: -> #dead))
        add: #answer -> (nil -> #receivingExpectHello);
        add: #call -> (nil -> #receivingExpectResponse).
    (desc newState: #connected -> (#processInvalidRequest: -> #dead))
        addInteger: 7 -> (#processBytes: -> #connected).
    (desc newState: #dead -> (#processInvalidRequest: -> #dead)).

    (desc newState: #receivingExpectHello -> (#processInvalidRequest: -> #dead))
        addInteger: 16 -> (#processHello: -> #receivingExpectSignature).
    (desc newState: #receivingExpectSignature -> (#processInvalidRequest: -> #dead))
        addInteger: 18 -> (#processSignature: -> #connected);
        addInteger: 14 -> (#processDuplicateConnection: -> #dead);
        addInteger: 15 -> (#processNotMe: -> #dead).

    (desc newState: #receivingExpectResponse -> (#processInvalidRequest: -> #dead))
        addInteger: 17 -> (#processResponse: -> #connected);
        addInteger: 14 -> (#processDuplicateConnection: -> #dead);
        addInteger: 15 -> (#processNotMe: -> #dead).
    ^desc.

The issue is not whether to use Pony.  I don’t like Pony, the language; it’s okay, even very good, but it’s not Smalltalk.  I like Smalltalk, whose concurrency model is painfully lame.

Squeak concurrency model.

Installer ss
    project: 'Cryptography';
    install: 'CapabilitiesLocal'

What abilities does the above install give Squeak?

This installs a local-only (no remote capabilities) capabilities model that attempts to implement the following in Squeak, the E-Rights capabilities model. [3] This also ensures inter-actor concurrency safety.

So your use of Pony is purely to access the Orca vm?

 

Orca is not a VM; it’s a garbage collection protocol for actor-based systems. 

 

I suggest using Pony-Orca to learn how Orca works, and then replace the Pony part of Pony-Orca with Smalltalk (dynamic typing), keeping the ref-caps (because they provide the guarantees).  I realize that this is a big undertaking.  Or:  write a new implementation of Orca in Smalltalk for the VM.  This is currently second choice, but that could change.

 

I think you will find the CogVM quite interesting and performant.

 

--Not with its current architecture.

 

If the CogVM is not able to:

1) dynamically schedule unlimited actor-threads on all cores

Why not separate actor event-loop processes on each core, communicating remotely? [4][5]

 

--Because it will continue the current Smalltalk-concurrency lameness.

The only identified difference is not the Actor model; it is the near real-time requirements on the Garbage Collector, yes? So what lameness do you reference?

  It’s a patch.  And still it will not allow the system to scale.  The concurrency problem has been solved nearly optimally and at high resolution in the current Pony-Orca.  There’s room for improvement, but it’s already in a completely different performance league compared to any big-heap Smalltalk.

Have you tried implementing SmallInteger class>>#tinyBenchmarks in Pony?

  If I’m to work hard on an implementation of this design for Smalltalk, I need a much greater speed-up and scaling ability than what these patches give. 

 

2) automatically load-balance

Use of mobility with actors would allow for automated rebalancing.


Speed hit.

Too slow/wasteful.   Moving an actor isn’t needed if each has its own heap.

In ELib, why not allow an Actor to be mobile and move from Alice's Vat to Bob's Vat? Then automated management apps can really, truly rebalance Actors. Only at rare moments, not for every call.

 

3) support actor-based programs innately

With this code, asynchronous computation of "number eventual * 100" occurs in an event loop and resolves the promise

[:number | number eventual * 100] value: 0.03 "returning an unresolved promise until the async computation completes and resolves the promise"

 

Promises and notifications are fine.  Both happen in Pony-Orca.  But the promises don’t fix the big performance problems.

Once again, the near real-time behavior of the Orca Garbage Collector. I am not convinced, but I am reading some papers.

Am I wrong to state that this model allows innate support to actors? Or were you somehow stating that the VM would need innate support? Why does the VM have to know?

It’s not enough.  We still have the big pauses from GCs in a large heap.

86 ms will not break the contract, I propose.

4) guarantee no data-races

The issue to observe is whether computations are long running and livelock the event loop from handling other activations. This is a shared issue, as Pony/Orca are also susceptible to this.

 

Yes, and a dedicated cycle-detecting actor watches for this in Pony-Orca.

I don't watch for it, but it is a strong design consideration. Keep Actor behaviors short and sweet.

 

E-Rights' event loops ensure no data races, as long as actor objects are not accessible from more than one event-loop.

 

Speed hit.

???

No blocking and no write barriers exist in Pony-Orca.  You can’t wait.  If you need to “wait,” you set a timer and respond to the event when the timer fires.    




Imagine a cloud-based compute engine, processing Cassandra events, that uses inter-machine actors to process the massively parallel Cassandra database. Inter-thread communication is not sufficient as there are hundreds of separate nodes.

 

Yes; I didn’t claim otherwise.  The networked version is coming.  See above.   My point is that the ‘remote’ characterization is not needed.  It’s not helping us describe and understand.

It does so for me: either we have intra-Vat message calling (immediate, adding to the stack), or we have inter-Vat message sending (asynchronous, adding to the queue).

 

Design-wise, it makes much sense to treat inter-thread, inter-process and inter-machine concurrency as the same remote interface.

 

No new design is needed for concurrency and interfacing.  There is much to implement, however.

 

The design is already done, modulo the not-yet-present network extension.  Interfacing between actors is always by async messaging.  Messaging will work as transparently as possible in the networked version across machine nodes.

Must remember, we are still vulnerable to network failure errors.

  

The issue is how most efficiently to use Orca, which happens to be working in Pony.  Pony is in production in two internal, speed-demanding, banking apps and in Wallaroo Labs’ high-rate streaming product.  Pony is a convenient way to study and use a working implementation of Orca.  Ergo, use Pony, even if we only study it as a good example of how to use Orca.  Some tweaks (probably a lot of them) could allow use of dynamic types.  We could roll our own implementation of Orca for the current Pharo VM, but that seems like more work than tweaking a working Pony compiler and runtime.  I’m not sure about that.  You know the VM better than I.  (I was beginning my study of the Pharo/OpenSmalltalkVM when I found Pony.)

Sounds like you might regret your choice and took the wrong path.

I don’t see how you form that conclusion.  I’ve not chosen yet.

You stated you are not thrilled with using Pony.

 

I don’t like the Pony language syntax.  I don’t like anything that looks like Algol-60.  Pony is a language, compiler and runtime implementing Orca.  The other stuff is good.  And I’ve not had much time to use it; I suspect I could like it more.

No argument from me! Squeak is a language, a compiler and a profound image-based runtime. Is there another language with such an image-based runtime? I think not.

Kindly,
Robert




[…]

If most of what Squeak/Pharo offers is pleasant/productive VM simulation, much work still remains to achieve even a basic actor system and collector, but the writing of VM code in Smalltalk and compiling it to C may be much more productive than writing C++.  The C++ for the Pony compiler and runtime, however, already compiles and works well.  Thus, starting the work in C++ is somewhat tempting.  Can someone explain the limits of how the VM simulator can be used?  How much VM core C is not a part of what can be compiled from Smalltalk?  Can all VM C code be compiled from Smalltalk?

 

Can someone answer the above question?

[1] Cog Blog - http://www.mirandabanda.org/cogblog/
[2] Smalltalk, Tips 'n Tricks - https://clementbera.wordpress.com/
[3] Capability Computation - http://erights.org/elib/capability/index.html
[4] Concurrency (Event Loops) - http://erights.org/elib/concurrency/index.html
[5] Distributed Programming - http://erights.org/elib/distrib/index.html

 

Shaping



[Attachment: ProCrypto-1-1-1.pdf (159K)]

Re: [Pharo-dev] Pony for Pharo VM

Robert Withers-2
 

Oh! OO is async message passing? Which Alan Kay lectures are you referencing? I would like to watch them.

The only thing I have heard you state is some requirement for near real-time that evidently the CogVM fails to meet. You have a language you do not really like, running on top of this near real-time garbage collector you claim Squeak fails at. This language introduces a type system to do compile-time analysis of actor messaging for some false sense of safety, as well as optimization for performance. I would accept the second reasoning, if the near real-time requirement requires such. I do not believe it does. A worst-case 90 ms response time to a real-time event is absolutely sufficient. It sounds to me as if you are claiming requirements that only Pony/Orca can meet, thus justifying use of your language. In practically every near real-time system, such performance guarantees are met with a 90 ms delay. Unless you are justifying the language used by setting such a high bar...I have seen it happen.

No other language feature that I can see, but I may very well be "not following the main thrust of the ideas we are discussing." So please review; I am a little lost.

I re-added the list for discussions on requirements.

k, r

On 4/21/20 12:56 PM, Shaping wrote:

You’re not understanding what object-orientation is.  Note the underlined bit.  OO has nothing to do with the devices chosen for coding and organizing coded programs:  Inheritance, classes, mixins, methods.  These and several others are conventions.  These can change often and have.  Some are good; some are okay; some are terrible, like mixins.

 

OO is instead about changing program state asynchronously by messages.  See the many Kay lectures on this.

 

The extreme is the point of the comment.  The irony is that the ideal and first language for object-orientation utterly failed to get it right at run-time.  You’re not following the main thrust of the ideas we are discussing.

 

Shaping

 

From: Robert [[hidden email]]
Sent: Tuesday, 21 April, 2020 10:47
To: Shaping [hidden email]
Subject: Re: [Vm-dev] [Pharo-dev] Pony for Pharo VM

 

Hi Shaping,

I tried again with the url in your image, but I am not getting to the post you took an image of. If you would just send me the link for that entire thread, evidently discussing "how real object-oriented programs work at run-time", I would be grateful. Which is hilarious by the way, as Smalltalk was the first object-oriented system ever developed. From our perspective, real object oriented systems utilize message passing and are late binding.

On 4/21/20 10:54 AM, Shaping wrote:

Read the thread above and watch the video to sharpen your imagination and mental model, somewhat, for how real object-oriented programs work at run-time. 

-- 
Kindly,
Robert
-- 
Kindly,
Robert

Re: [Pharo-dev] Pony for Pharo VM

Shaping1
In reply to this post by Robert Withers-2
 

How do Pharo’s and Squeak’s VMs differ?  I thought OpenSmalltalkVM was the common VM.  I also read something recently from Eliot that seemed to indicate a fork. 

I thought Pharo had the new tools, like GT, but I’m not sure.  I don’t follow Squeak anymore. 

Pharo may, it is fast moving and they drop historical support as new tools come online. I don't follow Pharo anymore. There is a common VM but the builds are separate.

 

I mostly don’t follow Squeak, but do follow Pharo on and off, and may port as soon as the GUI formatting problems are fixed or are fixable by my own use of Spec2.   

 

I tried Squeak 5.3 a few days ago for the first time in 16 years.  It has a nicer induction/setup process, but menus were malfunctioning (rendering spastically) before I finished getting reacquainted with the new surface.   I don’t have time these days to finish playing with it.  I may get back to it, but why do that if the Pharo GUI is more advanced?  Besides avoiding Pharo framework bloat and confusion, what about Squeak compels you to use it instead of Pharo for VM dev?

 

 

https://ponylang.zulipchat.com/#narrow/search/lzip

Here was the thread reference I was unable to follow. You provided it around a discussion of why the CogVM was "not acceptable". Say what? So we are adding a near real-time requirement? I would suggest that the CogVM meets near real-time requirements. The longest GC pause may be 100 ms, let us say. That is still near real-time.

 

5 to 10 ms is what I need.  Even Pony’s GCing barely keeps up with this, but the effect is smoother because all actors are running all the time, except when each actor GCs its own little heap.  The timing issue is more about smoothness and predictably small tail latencies at the right end of a very steep and narrow latency distribution.



 

Read the thread above and watch the video to sharpen your imagination and mental model, somewhat, for how real object-oriented programs work at run-time.  The video details are fuzzy, but you can get a good feel for message flow.

Exactly the way Smalltalk operates at runtime. Smalltalk was made with the benefit that the core message passing paradigm is the exact model of interaction we see remotely: message passing. Squeak is a native message passing machine.

 

This should have happened first in Smalltalk. 

It did.

 

No, the idea of asynchronous messaging did, but not the implementation.   We’re not discussing the same phenomenon.

 

Smalltalk does not in general have asynchronous messaging between all actors all the time.  It doesn’t even have actors by default in the core language. You have to design them as an afterthought.  That’s just wrong.

 

Smalltalk does not have true actors as a baseline implementation of the OO programming paradigm.  You have to engineer it if you want it, and it doesn’t scale well with the green threads.  Just non-blocking FFI doesn’t count; that is necessary and good, but not sufficient.

 

Async messaging:  that was the original vision, and it still works best for state-machine construction, because no blocking and no read/write-barriers are needed.  Here’s the gist, and you’ll find that Kay says the same in his talks, repeatedly (apparently no one listens and thinks about using it):  advance the state of the state-machine only by exchange of asynchronous messages between actors.  That’s the whole thing.  Then you have the tooling on top of that to make the SM building systematic, reliable, and pleasant.  That’s missing too, and must be done, as well, or the core idea is hard to use fully—which is largely why we program as we do today.  Most programmers are old dogs coding in the same old wrong way, because the new way, which is very much better, is even harder to do without the right tools and guarantees, and we don’t have those yet.  Functional programming (great for many domains) is a much better choice for general-purpose programming than the actor model with the current actor-based tools.
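
As a minimal illustration of that discipline (all names are mine, purely illustrative): the only way the state below changes is inside a behaviour, in response to an async message, and the reply is itself another async message; nothing blocks and no locks exist.

actor Light
  var _on: Bool = false

  be toggle(requester: Driver) =>
    _on = not _on            // state advances only inside a behaviour
    requester.toggled(_on)   // the reply is another async message

actor Driver
  let _env: Env
  new create(env: Env) =>
    _env = env

  be toggled(on: Bool) =>
    _env.out.print(if on then "on" else "off" end)

actor Main
  new create(env: Env) =>
    let light = Light
    light.toggle(Driver(env))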



 

The performance of the GC in the CogVM is demonstrated with this profiling result running all Cryptography tests. Load Cryptography with this script, open the Test Runner select Cryptography tests and click 'Run Profiled':

Installer ss
    project: 'Cryptography';
    install: 'ProCrypto-1-1-1';
    install: 'ProCryptoTests-1-1-1'.

Here are the profiling results.

 - 12467 tallies, 12696 msec.

**Leaves**
13.8% {1752ms} RGSixtyFourBitRegister64>>loadFrom:
8.7% {1099ms} RGSixtyFourBitRegister64>>bitXor:
7.2% {911ms} RGSixtyFourBitRegister64>>+=
6.0% {763ms} SHA256Inlined64>>processBuffer
5.9% {751ms} RGThirtyTwoBitRegister64>>loadFrom:
4.2% {535ms} RGThirtyTwoBitRegister64>>+=
3.9% {496ms} Random>>nextBytes:into:startingAt:
3.5% {450ms} RGThirtyTwoBitRegister64>>bitXor:
3.4% {429ms} LargePositiveInteger(Integer)>>bitShift:
3.3% {413ms} [] SystemProgressMorph(Morph)>>updateDropShadowCache
3.0% {382ms} RGSixtyFourBitRegister64>>leftRotateBy:
2.2% {280ms} RGThirtyTwoBitRegister64>>leftRotateBy:
1.6% {201ms} Random>>generateStates
1.5% {188ms} SHA512p256(SHA512)>>processBuffer
1.5% {184ms} SHA256Test(TestCase)>>timeout:after:
1.4% {179ms} SHA1Inlined64>>processBuffer
1.4% {173ms} RGSixtyFourBitRegister64>>bitAnd:

**Memory**
    old            -16,777,216 bytes
    young        +18,039,800 bytes
    used        +1,262,584 bytes
    free        -18,039,800 bytes

**GCs**
    full            1 totalling 86 ms (0.68% uptime), avg 86 ms
    incr            307 totalling 81 ms (0.6% uptime), avg 0.3 ms
    tenures        7,249 (avg 0 GCs/tenure)
    root table    0 overflows

As shown, 1 full GC occurred in 86 ms

 

Not acceptable.  Too long.

What is your near real-time requirement?

5 to 10 ms pauses per actor, not globally whilst all actors wait.  Think smooth.

 

 and 307 incremental GCs occurred for a total of 81 ms. All of this GC activity occurred within a profile run lasting 12.7 seconds. The total GC time is just 1.31% of the total time. Very fast.

 

Not acceptable.  Too long.  And, worse, it won’t scale.

I am unaware of any scaling problems. In Networking, 1000s of concurrent connections are supported. In computations, 10,000s of objects. What are your timing requirements? Each incremental took a fraction of a millisecond to compute: 264 microseconds.

 

 The problem is not the percentage; it’s the big delays amidst other domain-specific computation.  These times must be much smaller and spread out across many pauses during domain-specific computations.

See the 307 incremental GCs? These are 264 microsecond delays spread out across domain-specific computations.

 

We have to watch definitions and constraints carefully. 

 

Where memory management is concerned, this thread tries to compare the merits of the two extremes:  per-actor memory management, as in Orca (practically, Pony), and global stop-the-world (StW) collection, as in a classical Smalltalk. 

 

You seem to be presenting something intermediate above, where there are segments that are GCed in turn.  Are you stopping all domain threads/messaging during the incremental GCs, or just the ones for objects in a certain section of the heap? Or, are the heap partitions divided by more traditional criteria, like object size and lifespan?  What is the spacing between the incremental GCs?  Steady frequency of pauses, smallness of pauses, and narrowness of distribution of longest pauses, especially, are the most important criteria. 

 

   No serious real-time apps can be made in this case.

Of course they can. Model the domain as resilient & accepting of 100 ms pauses, for full GCs. It may be that more could be done to the CogVM for near real-time; I am not very knowledgeable about the VM.

We are discussing different app domains.

 

I can’t use 100 ms pauses in my real-time app.  I need sub 10 ms pauses.  Again even Pony needs better GCing, but the report I have from Rocco shows more or less acceptable pauses.  He was kind enough to run the program again with the GC stats turned on.   I’ll try to find the document and attach it.

 

I suggest studying the Pony and Orca material, if the video and accompanying explanation don’t clarify Pony-Orca speed and scale.

Yeah, the video did not suggest anything other than using message passing.

 

You can’t miss this detail; it’s almost everything that matters:  asynchronous message passing between all actors, all the time, on the metal, not as an afterthought, with a guarantee of no data-races.

 

I could not find the thread discussing GC. Would you please post the specific URL to get to that resource? I do not want to guess any longer.

I don’t get it.  It works for me in a newly opened tab.  I don’t know what the problem is.  Zulip should work for you as it does for me.  Login to your Zulip and search on ‘Rocco’.  You’ll see Rocco’s stuff, which is largely about the parsing app he has.  Better, just ask for the info you want.      

 

 

Why would the idea of ‘remote’ enter here?  The execution scope is an OS process.  Pony actors run on their respective threads in one OS process.  Message passing is zero-copy; all “passing” is done by reference.  No data is actually copied.

In the SqueakELib Capabilities model, between Vats (in-process, inter-process & inter-machine-node) most references to Actors are remote and then we have zero-copy. Sometimes we need to pass numbers/strings/collections and those are pass-by-copy.

 

Yeah, the copying won’t work well.  Just never do it, at least not in one OS process.  0-copy messaging is the rule in Pony.  Even this may not matter much eventually.  The MM is being reworked with the hope of eliminating all consensus-driven messaging, which can be expensive, even if only transiently, on highly mutable object sets.  See the Verona project, which seems to be slowly converging with Pony:  https://github.com/microsoft/verona/blob/master/docs/faq.md.  The core weakness of Orca memory management is still the fact that all messaging must be traced so that the collector knows which actors still refer to a given object.  This is true of all concurrent collectors.  That problem (and it’s a big problem) goes away completely if the Pony runtime becomes Verona-ized.

 

 

  The scheduler interleaves all threads needing to share a core if there are more actors than cores.  Switching time for actor threads, in that case, is 5 to 15 ns.  This was mentioned before.  Opportunistic work stealing happens.  That means that all the cores stay as busy as possible if there is any work at all left to do.  All of this happens by design without intervention or thought from the programmer.  You can read about this in the links given earlier.  I suggest we copy the design for Smalltalk.

Which specific links? Could you send a summary email?

 

Not now, maybe later.

 

Some of this data comes from lectures by Clebsch, but you’ll find most of the meat in the Pony papers, for which links can be found on the community page:  https://www.ponylang.io/community/

 

 

I think the Pony runtime still creates by default just one OS process per app and as many threads as needed, with each actor having only one thread of execution by definition of what an actor is (single-threaded, very simple, very small).  A scheduler keeps all cores busy, running and interleaving all the current actor threads.  Message tracing maintains ref counts.  A cycle-detector keeps things tidy.  Do Squeak and Pharo have those abilities?

In the case of remote capability references, there is reference counting. This occurs inside the Scope object; there are 6 tables: 2 for third party introduction (gift tables), 2 for outgoing references (#answers & #export) and 2 for incoming references (#questions & #imports). These tables manage all the remote reference counting, once again between any two Vats (in-process, inter-process & inter-machine-node). There are 2 GC messages sent back from a remote node (GCAnswer & GCExport) for each of the outgoing references. Say Alice has a reference to a remote object in Bob: when the internal references to Alice's reference end and the RemoteERef is to be garbage collected, a GC message is sent to the hosting Vat, Bob.

 

In Squeak/Pharo do all actors stop for the GCs, even for the smaller incremental ones?



I have some experience with state-machine construction to implement security protocols. In Squeak, DoIt to this script to load Crypto and ParrotTalk & SSL (currently broken) and see some state-machines:

Installer ss
    project: 'Cryptography'; install: 'ProCrypto-1-1-1';
    project: 'Cryptography'; install: 'ProCryptoTests-1-1-1';
    project: 'Cryptography'; install: 'CapabilitiesLocal';
    project: 'Oceanside'; install: 'ston-config-map';
    project: 'Cryptography'; install: 'SSLLoader';
    project: 'Cryptography'; install: 'Raven'.

Pony has Actors.  It also has Classes.  The actors have behaviours.  Think of these as async methods.  Smalltalk would need new syntax for Actors, behaviours, and the ref-caps that type the objects.  Doing this last bit well is the task that concerns me most.

Which? "ref-caps that type the objects"? What does that mean?

 

This is the best page on ref-caps:  https://www.ponylang.io/learn/#reference-capabilities

 

The six ref caps define which objects can mutate which others at compile time.   Ref-caps provide the guarantees.  The ref caps connect the code of your language (Pony so far) to the MM runtime at compile time.

 

With the CapabilitiesLocal I pointed you to, we have an Actor model with async message passing to behaviors of an Actor. Squeak has Actors supporting remote references (3-way introductions, through the gift tables is broken. Remote references from Alice to Bob is working. See the tests in ThunkHelloWorldTest: #testConnectAES, #testConnectAESBufferOrdering & #testConnectAESBuffered.

 

 

that exists for Smalltalk capabilities and Pony capabilities. In fact, instead of talking about actors, concurrency & parallel applications, I prefer to speak of a capabilities model, inherently on an event loop which is the foal point for safe concurrency.

 

I suggest a study of the Pony scheduler.  There are actors, mailboxes, message queues, and the scheduler, mainly.   You don’t need to be concerned about safety.  It’s been handled for you by the runtime and ref-caps.

Same with Raven, plus remote. We have all of that. See the PriorityVat class.

 

Last time I checked, Squeak/Pharo didn’t have actors that run without interruption by the GC.  Or can Smalltalk actors do that now?  How hard to implement on OSVM would a per-actor memory architecture be?   You need a concurrent collector algo to make it work.  I don’t think you have that in the current VM. 



 

I don’t have time for Pony programming these days--I can’t even read about these days.  Go ahead if you wish.

 

Your time is better spent in other ways, though.

I communicate with you about what could be, but I agree I must stay focused on my primary target, which is porting SSL to use my new ThunkStack framework for remote encrypted communications. End-to-end encryption is what I am about. Here is a visualization of what I aim for with TLS 1.3 and Signal as to be done projects, it is currently vaporware. I have ParrotTalk done and am working SSL, then I will move to SSH. The script I listed above will load all remote packages, except for SSH. I m attaching the flyer I created to broadcast Squeak's ProCrypto configuration.

 

The speed and scale advantages of Orca over the big-heap approach have been demonstrated.  That was done some time ago.   Read the paper by Clebsch and friends for details. 

Regarding capabilities please read the ELib documentation on ERights website: http://erights.org/elib/index.html

 

The material is not well organized, and is very hard to even want to read.  Maybe you are the one to change that.

 

I’m looking for:  1) definitions of all terms that are new or nonstandard; 2) problem-constraint invariants (one can’t reason about anything without invariants); 3) problem-solution objectives expressed in measurable terms, like data-rate or latency.

 

Can I get those?  This might be a good time and place to describe your proposed concurrency solution for Smalltalk in terse, measurable terms. 

 

Have you measured a running actor-based app on one node?  On two or more?



You state that Pony is just to access Orca.

 

The two are codesigned.

 

What makes Orca so great?

 

The measurements.  Read the end of the Orca paper if you can’t read the whole thing.  Orca is the best concurrent MM protocol now.  

 

Aside:  I want to be rid of general purpose GC.  I think the purely functional approach with persistent data structures can work better (think Haskell).  You still need to maintain state in processing queues (think Clojure; not sure how Haskell handles this).  You still need temporal-coherency control devices at the periphery for IO (atoms for example).      

 

 

There are definitely more than one Actor per Vat.

 

Why have another construct?  It appears to be scoped to the node.  Is that the reason for the vat’s existence?  Does it control machine-node-specific messaging and resource management for the actors in it contains?



 

1 per process. There will be a finite number of Capability actors in each event loop.

But more than one and scale into thousands per Vat.

 

Is there a formal definition of vat?  I found this:

 

“A vat is the part of the Neocosm implementation that has a unique network identity. We expect that normal circumstances, there will only be one vat running on a particular machine at one time. Neocom currently (28 May 1998) supports only one avatar per vat.”

 

 

Here is the state machine specification for ParrotTalk version 3.7, which is compiled by the ProtocolStateCompiler. This stateMap models states, triggers, transitions, default and callbacks and is simple to use.

ParrotTalkSessionOperations_V3_7 class>>#stateMap

    "(((ParrotTalkSessionOperations_v3_7 stateMap compile)))"

    | desc |
    desc := ProtocolStateCompiler initialState: #initial.
    (desc newState: #initial -> (#processInvalidRequest: -> #dead))
        add: #answer -> (nil -> #receivingExpectHello);
        add: #call -> (nil -> #receivingExpectResponse).
    (desc newState: #connected -> (#processInvalidRequest: -> #dead))
        addInteger: 7 -> (#processBytes: -> #connected).
    (desc newState: #dead -> (#processInvalidRequest: -> #dead)).

    (desc newState: #receivingExpectHello -> (#processInvalidRequest: -> #dead))
        addInteger: 16 -> (#processHello: -> #receivingExpectSignature).
    (desc newState: #receivingExpectSignature -> (#processInvalidRequest: -> #dead))
        addInteger: 18 -> (#processSignature: -> #connected);
        addInteger: 14 -> (#processDuplicateConnection: -> #dead);
        addInteger: 15 -> (#processNotMe: -> #dead).

    (desc newState: #receivingExpectResponse -> (#processInvalidRequest: -> #dead))
        addInteger: 17 -> (#processResponse: -> #connected);
        addInteger: 14 -> (#processDuplicateConnection: -> #dead);
        addInteger: 15 -> (#processNotMe: -> #dead).
    ^desc.

 

 

--Because it will continue the current Smalltalk-concurrency lameness.

 

The only identified difference is not the Actor model, it is the near real-time requirements on the Garbage Collector, yes? So what lameness do you reference?

 

The actor model is a problem too if you stop all actors with a StW GC.  Fix the problem by letting each actor run until it finishes processing the last message.  Then let it collect its own garbage.  Then let it take the next message from the queue.  All actors are single-threaded by definition.  This maximizes processing rate and smoothness of GC-disruption of domain work.  It also increases tracing overhead transiently when large numbers of mutable objects are used (21% peak CPU consumption ascribable to tracing when you throw the tree/ring exercise at Pony with as many mutable types as possible).  We will be turning to functional programming strategies for at least some (if not eventually all) core (not peripheral IO) parallelization efforts, but I digress somewhat.

 

Two big problems impede parallelization of programs in Smalltalk:  1) the GC stops all actors, all of them or large chunks of them at once, depending on how the GC works.  Neither situation is acceptable.  That this pause is small is not as important as the work lost from all the actors during that period;  2) core Smalltalk doesn’t have a guaranteed concurrency-integrity model that automatically starts threads on different cores; it can only interleave them on one core (green threading).   

 

These ideas should not be add-ons or frameworks.  They should be core features.  If we can do the above two in Squeak/Pharo, I’ll use Squeak/Pharo to work on a VM.

 

Have you tried implementing SmallInteger class>>#tinyBenchmarks in Pony?

 

No, sadly.  I’ve yet to build Pony.  I’m still reading about the techniques and background.  I plan to parse large CSV files into objects, as in Rocco’s exercise.  I have some files I can use for that purpose, and can get the same data over HTTP to test a connection too.  That would be a nice first experiment.  We need to be clinical about this.  All the talk and hand-waving is an okay start, but at some point, we must measure and keep on doing that in a loop, as architecture is tweaked.  I would like to compare a Pony parsing program to my VW parsing program as a start.  But the Pony work has to wait.



Too slow/wasteful.   Moving an actor isn’t needed if the each has its own heap.

In ELib, why not allow an Actor to be mobile and move from Alice's Vat to Bob's Vat?

 

Are Vats for scoping actors to specific machine nodes?  If so, then yes move the actors to another machine node if it is better suited to what the actor does and needs.

 

Then automated management apps can really truly rebalance Actors. Only on a rare moment, not for every call.

 

Yes, we must have run-time feedback to adjust resources, which are the actors and the machine nodes they are made to run on.  A coordination actor would watch all other actors and clusters of them.


 

Once again the near real-time of the Orca Garbage Collector. I am not convinced, but reading some papers.

Runtime action with an Orca collector is definitely smoother.   This we can know before the fact (before we measure) from the fine-grain MM.  What is not clear without testing is overall speed.  We need to test and measure.

 

 

86 ms will not break the contract, I propose.

 

It’s okay for a business app, certainly--not such much for machine-control or 3D graphics.  It’s good enough for prototyping a 3D sim, but not for deployment. 



 

Yes, and a dedicated cycle-detecting actor watches for this in Pony-Orca.

I don't watch for it, but it is a strong design consideration. Keep Actor behaviors short and sweet.

 

The cycles are about chaining of messages/dependencies between actors.  When a programmer makes these actor clusters manually, messing up is easy.   A higher-level tools is need to manage these constructions.  The actor-cycles will exist and become problematic even if each actor is fast.  The detector finds and exposes them to the programmer. 



 

E-right's event loops ensure no data races, as long as actor objects are not accessible from more than one event-loop.

 

Speed hit.

???

This constraint is not needed to guarantee correct concurrency in Pony, which happens at compile time and takes no CPU cycles.   The approach above sounds dynamic; there are some CPU cycles involved.

 

Yes; I didn’t claim otherwise.  The networked version is coming.  See above.   My point is that the ‘remote’ characterization is not needed.  It’s not helping us describe and understand.

It does so for me, either we have inner-Vat message calling, immediate, adding to the stack. And we have inter-Vat message sending, asynchronous, adding to the queue.

I still don’t see a clear definition of vat.

 

None of the above language is needed when the concurrency scheme is simpler and doesn’t use those ideas and devices. 

Design wise, it makes much sense to treat inter-thread, inter-process and inter-machine concurrency as the same remote interface.

 

The design is already done, modulo the not-yet-present network extension.  Interfacing between actors is always by async messaging.  Messaging will work as transparently as possible in the networked version across machine nodes.

Must remember, we are still vulnerable to network failure errors.

Yes, they are keenly aware of that.  It’s a big task and won’t happen for a while.  But it will happen.

  

 

I don’t like the Pony language syntax.  I don’t like anything that looks like Algo-60.  Pony is a language, compiler and runtime implementing Orca.  The other stuff is good.  And I’ve not had much time to use it; I suspect I could like it more.

No argument from me! Squeak is a language, a compiler and a profound image-based runtime. Is there another language with such an image-based runtime? I think not.

Yes, we all love Smalltalk.  It’s still too slow.

We’re not talking about coding ergonomics and testing dynamic.  We all agree that Smalltalk is a better that way to code and interact with the evolving program.  But that’s not what the thread is about.   This is all about speed and scale across cores, CPUs, and machine nodes.  The solution must be implemented close to the metal (VM).  It can’t be an add-on framework.  We need an Actor class and syntax for making actors and their async behaviours.  Then we need the VM to understand the new format and bytecodes associated with actors.

 

Shaping

 


Re: [Pharo-dev] Pony for Pharo VM

Robert Withers-2
 

Hi Shaping,

Thank you for the helpful response and for taking the time to compose it. I think you are making two statements: 1) you require sub-10 ms GC times, and 2) static typing of ref-cap states lets you present a proxy type that reflects the correct protocol for the subject farRef and statically type all interactions with it.

1) is interesting.
2) In Raven, any message may be sent, passed as a string in a DeliveryMessage or DeliveryOnlyMessage. The transitions of ERefs are well defined: (NearERef -> PromiseERef -> a new Near/FarERef); (RemotePromiseERef -> FarERef -> a new RemotePromiseERef -> a new Near/FarERef).

The issue is how a user/developer would know that the subject farRef they are holding understands any particular protocol. The need to limit some protocol by withholding capabilities is solved by a securityProxy, specific to each exported class, that provides only the allowed protocol. We do not need, nor find valuable, any static typing done by the compiler. Let's stay dynamic and allow run-time detection of failed protocol use, so we can use testing to validate the state machine. Static typing is a false security blanket; unit testing is king!

On 5/8/20 1:06 AM, Shaping wrote:

How do Pharo’s and Squeak’s VMs differ?  I thought OpenSmalltalkVM was the common VM.  I also read something recently from Eliot that seemed to indicate a fork. 

I thought Pharo had the new tools, like GT, but I’m not sure.  I don’t follow Squeak anymore. 

Pharo may; it is fast-moving, and they drop historical support as new tools come online. I don't follow Pharo anymore. There is a common VM, but the builds are separate.

 

I mostly don’t follow Squeak, but do follow Pharo on and off, and may port as soon as the GUI formatting problems are fixed or are fixable by my own use of Spec2.   

 

I tried Squeak 5.3 a few days ago for the first time in 16 years.  It has a nicer induction/setup process, but menus were malfunctioning (rendering spastically) before I finished getting reacquainted with the new surface.  I don’t have time these days to finish playing with it.  I may get back to it, but why do that if the Pharo GUI is more advanced?  Besides avoiding Pharo framework bloat and confusion, what about Squeak compels you to use it instead of Pharo for VM dev?

What can I say? It is home.


 https://ponylang.zulipchat.com/#narrow/search/lzip

Here was the thread reference I was unable to follow. You provided it during a discussion of why the CogVM was "not acceptable". Say what? So we are adding a near real-time requirement? I would suggest that the CogVM meets near real-time requirements. The longest GC pause, let us say, may be 100 ms. That is still near real-time.

 

5 to 10 ms is what I need.  Even Pony’s GCing barely keeps up with this, but the effect is smoother because all actors are running all the time, except when each actor GCs its own little heap.  The timing issue is more about smoothness and predictably small tail latencies at the right end of a very steep and narrow latency distribution.

Yes, indeed. I can see this would be valuable and I support this for the CogVM of some flavor. This dispersed cost can yield much finer resolutions. Alas, above my pay grade, Captain.


Read the thread above and watch the video to sharpen your imagination and mental model, somewhat, for how real object-oriented programs work at run-time.  The video details are fuzzy, but you can get a good feel for message flow.

Exactly the way Smalltalk operates at runtime. Smalltalk was made with the benefit that the core message passing paradigm is the exact model of interaction we see remotely: message passing. Squeak is a native message passing machine.

 

This should have happened first in Smalltalk. 

It did.

 

No, the idea of asynchronous messaging did, but not the implementation.   We’re not discussing the same phenomenon.

 

Smalltalk does not in general have asynchronous messaging between all actors all the time.  It doesn’t even have actors by default in the core language. You have to design them as an afterthought.  That’s just wrong.

 

Smalltalk does not have true actors as a baseline implementation of the OO programming paradigm.  You have to engineer it if you want it, and it doesn’t scale well with the green threads.  Just non-blocking FFI doesn’t count; that is necessary and good, but not sufficient.

 

Async messaging:  that was the original vision, and it still works best for state-machine construction, because no blocking and no read/write-barriers are needed.  Here’s the gist, and you’ll find that Kay says the same in his talks, repeatedly (apparently no one listens and thinks about using it):  advance the state of the state-machine only by exchange of asynchronous messages between actors.  That’s the whole thing.  Then you have the tooling on top of that to make the SM building systematic, reliable, and pleasant.  That’s missing too, and must be done, as well, or the core idea is hard to use fully—which is largely why we program as we do today.  Most programmers are old dogs coding in the same old wrong way, because the new way, which is very much better, is even harder to do without the right tools and guarantees, and we don’t have those yet.  Functional programming (great for many domains) is a much better choice for general-purpose programming than the actor model with the current actor-based tools.



 

The performance of the GC in the CogVM is demonstrated with this profiling result from running all Cryptography tests. Load Cryptography with this script, open the Test Runner, select the Cryptography tests, and click 'Run Profiled':

Installer ss
    project: 'Cryptography';
    install: 'ProCrypto-1-1-1';
    install: 'ProCryptoTests-1-1-1'.

Here are the profiling results.

 - 12467 tallies, 12696 msec.

**Leaves**
13.8% {1752ms} RGSixtyFourBitRegister64>>loadFrom:
8.7% {1099ms} RGSixtyFourBitRegister64>>bitXor:
7.2% {911ms} RGSixtyFourBitRegister64>>+=
6.0% {763ms} SHA256Inlined64>>processBuffer
5.9% {751ms} RGThirtyTwoBitRegister64>>loadFrom:
4.2% {535ms} RGThirtyTwoBitRegister64>>+=
3.9% {496ms} Random>>nextBytes:into:startingAt:
3.5% {450ms} RGThirtyTwoBitRegister64>>bitXor:
3.4% {429ms} LargePositiveInteger(Integer)>>bitShift:
3.3% {413ms} [] SystemProgressMorph(Morph)>>updateDropShadowCache
3.0% {382ms} RGSixtyFourBitRegister64>>leftRotateBy:
2.2% {280ms} RGThirtyTwoBitRegister64>>leftRotateBy:
1.6% {201ms} Random>>generateStates
1.5% {188ms} SHA512p256(SHA512)>>processBuffer
1.5% {184ms} SHA256Test(TestCase)>>timeout:after:
1.4% {179ms} SHA1Inlined64>>processBuffer
1.4% {173ms} RGSixtyFourBitRegister64>>bitAnd:

**Memory**
    old            -16,777,216 bytes
    young        +18,039,800 bytes
    used        +1,262,584 bytes
    free        -18,039,800 bytes

**GCs**
    full            1 totalling 86 ms (0.68% uptime), avg 86 ms
    incr            307 totalling 81 ms (0.6% uptime), avg 0.3 ms
    tenures        7,249 (avg 0 GCs/tenure)
    root table    0 overflows

As shown, 1 full GC occurred in 86 ms

 

Not acceptable.  Too long.

What is your near real-time requirement?

5 to 10 ms pauses per actor, not globally whilst all actors wait.  Think smooth.

 

 and 307 incremental GCs occurred for a total of 81 ms. All of this GC activity occurred within a profile run lasting 12.7 seconds. The total GC time is just 1.31% of the total time. Very fast.

 

Not acceptable.  Too long.  And, worse, it won’t scale.

I am unaware of any scaling problems. In networking, 1000s of concurrent connections are supported; in computations, 10,000s of objects. What are your timing requirements? Each incremental took a fraction of a millisecond to compute: 264 microseconds.

 

 The problem is not the percentage; it’s the big delays amidst other domain-specific computation.  These times must be much smaller and spread out across many pauses during domain-specific computations.

See the 307 incremental GCs? These are 264 microsecond delays spread out across domain-specific computations.

 

We have to watch definitions and constraints carefully. 

 

Where memory management is concerned, this thread tries to compare the merits of the two extremes:  per-actor memory management, as in the Orca (Pony practically), and global stop-the-world (StW) collection as in a classical Smalltalk. 

 

You seem to be presenting something intermediate above, where there are segments that are GCed in turn.  Are you stopping all domain threads/messaging during the incremental GCs, or just the ones for objects in a certain section of the heap? Or, are the heap partitions divided by more traditional criteria, like object size and lifespan?  What is the spacing between the incremental GCs?  Steady frequency of pauses, smallness of pauses, and narrowness of distribution of longest pauses, especially, are the most important criteria. 

 

   No serious real-time apps can be made in this case.

Of course they can. Model the domain as resilient and accepting of 100 ms pauses for full GCs. It may be that more could be done to the CogVM for near real-time; I am not very knowledgeable about the VM.

We are discussing different app domains.

 

I can’t use 100 ms pauses in my real-time app.  I need sub-10 ms pauses.  Again, even Pony needs better GCing, but the report I have from Rocco shows more or less acceptable pauses.  He was kind enough to run the program again with the GC stats turned on.  I’ll try to find the document and attach it.

 

I suggest studying the Pony and Orca material, if the video and accompanying explanation don’t clarify Pony-Orca speed and scale.

Yeah, the video did not suggest anything other than using message passing.

 

You can’t miss this detail; it’s almost everything that matters:  asynchronous message passing between all actors, all the time, on the metal, not as an afterthought, with a guarantee of no data-races.

 

I could not find the thread discussing GC. Would you please post the specific URL to get to that resource, please? I do not want to guess any longer.

I don’t get it.  It works for me in a newly opened tab.  I don’t know what the problem is.  Zulip should work for you as it does for me.  Login to your Zulip and search on ‘Rocco’.  You’ll see Rocco’s stuff, which is largely about the parsing app he has.  Better, just ask for the info you want.    

I have found several documents. One has over 100 pages! On GC.


Why would the idea of ‘remote’ enter here?  The execution scope is an OS process.  Pony actors run on their respective threads in one OS process.  Message passing is zero-copy; all “passing” is done by reference.  No data is actually copied.

In the SqueakELib Capabilities model, between Vats (in-process, inter-process & inter-machine-node), most references to Actors are remote, and there we have zero-copy. Sometimes we need to pass numbers/strings/collections, and those are pass-by-copy.

 

Yeah, the copying won’t work well.  Just never do it, at least not in one OS process.  0-copy messaging is the rule in Pony.  Even this may not matter much eventually.  The MM is being reworked with the hope of eliminating all consensus-driven messaging, which can be expensive, even if only transiently, on highly mutable object sets.  See the Verona project, which seems to be slowly converging with Pony:  https://github.com/microsoft/verona/blob/master/docs/faq.md.  The core weakness of Orca memory management is still the fact that all messaging must be traced so that the collector knows which actors still refer to a given object.  This is true of all concurrent collectors.  That problem (and it’s a big problem) goes away completely if the Pony runtime becomes Verona-ized.

My hope is that this is interesting to those more knowledgeable of Squeak's(Pharo's) VM.


  The scheduler interleaves all threads needing to share a core if there are more actors than cores.  Switching time for actor threads, in that case, is 5 to 15 ns.  This was mentioned before.  Opportunistic work stealing happens.  That means that all the cores stay as busy as possible if there is any work at all left to do.  All of this happens by design without intervention or thought from the programmer.  You can read about this in the links given earlier.  I suggest we copy the design for Smalltalk.

Which specific links? Could you send a summary email?

 

Not now, maybe later.

Very well, as you find your time. I appreciate you.

I believe the same is true of the Local/Remote Capabilities of #CapabilitiesLocal and #Raven. The entire mechanism of #sending messages, rather than #calling them, enables 0-copy (an app design choice!) with enough Vats running to use the cores. There are not usually multiple Vats in one process; they are in separate processes. Therefore, the interactions of messages between Vats are remote, through a socket. I suppose a SharedMemoryThunk could be designed. All of this is automatic.


Some of this data comes from lectures by Clebsch, but you’ll find most of the meat in the Pony papers, for which links can be found on the community page:  https://www.ponylang.io/community/

I found some material to read, thank you.


I think the Pony runtime is still creating by default just one OS process per app and as many threads as needed, with each actor having only one thread of execution by definition of what an actor is (single-threaded, very simple, very small).  A scheduler keeps all cores busy, running and interleaving all the current actor threads.  Message tracing maintains ref counts.  A cycle-detector keeps things tidy.  Do Squeak and Pharo have those abilities?

In the case of remote capability references, there is reference counting. This occurs inside the Scope object, where there are 6 tables: 2 for third-party introduction (gift tables), 2 for outgoing references (#answers & #export) and 2 for incoming references (#questions & #imports). These tables manage all the remote reference counting, once again between any two Vats (in-process, inter-process & inter-machine-node). There are 2 GC messages sent back from a remote node (GCAnswer & GCExport) for each of the outgoing references. Say Alice has a reference to a remote object in Bob; when the internal references to Alice's reference end and the RemoteERef is to be garbage collected, a GC message is sent to the hosting Vat, Bob.

 

In Squeak/Pharo do all actors stop for the GCs, even for the smaller incremental ones?

I suppose my answer to you is yes, to the best of my knowledge.


I have some experience with state-machine construction to implement security protocols. In Squeak, DoIt to this script to load Crypto and ParrotTalk & SSL (currently broken) and see some state-machines:

Installer ss
    project: 'Cryptography'; install: 'ProCrypto-1-1-1';
    project: 'Cryptography'; install: 'ProCryptoTests-1-1-1';
    project: 'Cryptography'; install: 'CapabilitiesLocal';
    project: 'Oceanside'; install: 'ston-config-map';
    project: 'Cryptography'; install: 'SSLLoader';
    project: 'Cryptography'; install: 'Raven'.

Pony has Actors.  It also has Classes.  The actors have behaviours.  Think of these as async methods.  Smalltalk would need new syntax for Actors, behaviours, and the ref-caps that type the objects.  Doing this last bit well is the task that concerns me most.
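
To make that concrete, here is a purely hypothetical sketch of what such syntax might look like; neither an Actor superclass nor a <behaviour> pragma exists in Squeak/Pharo today, and every name below is illustrative only:

"Hypothetical sketch: an actor class whose behaviours are async methods."
Actor subclass: #Logger
    instanceVariableNames: 'lines'
    classVariableNames: ''
    category: 'Actors-Sketch'.

Logger >> initialize
    lines := OrderedCollection new.

Logger >> log: aString
    <behaviour>
    "Async by declaration: the send returns immediately; this body runs
     later, to completion, on the actor's own thread, against its own heap."
    lines add: aString.

"Usage: an ordinary-looking send that would actually enqueue a message."
| l |
l := Logger new.
l log: 'hello'.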

Which? "ref-caps that type the objects"? What does that mean?

 

This is the best page on ref-caps:  https://www.ponylang.io/learn/#reference-capabilities

This was the material I found to read! Thank you.



The six ref-caps define, at compile time, which objects can mutate which others.  Ref-caps provide the guarantees.  They connect the code of your language (Pony so far) to the MM runtime at compile time.
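
For reference, the six capabilities in Pony are: iso (isolated: mutable and globally unique, so it can be sent to another actor by giving up the reference), trn (transition: write-unique, convertible to val once writing stops), ref (freely mutable, but usable only inside one actor), val (deeply immutable, freely shareable across actors), box (a read-only view of possibly mutable data), and tag (an opaque identity that can only be compared or sent async messages).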

This is static typing, unrelated to GC timing requirements. Use of a dynamic implementation of Promise capabilities, with the state machine the references themselves undergo, is perfectly fine.


With the CapabilitiesLocal I pointed you to, we have an Actor model with async message passing to behaviors of an Actor. Squeak has Actors supporting remote references (3-way introduction through the gift tables is broken; remote references from Alice to Bob are working). See the tests in ThunkHelloWorldTest: #testConnectAES, #testConnectAESBufferOrdering & #testConnectAESBuffered.

 

 

that exists for Smalltalk capabilities and Pony capabilities. In fact, instead of talking about actors, concurrency & parallel applications, I prefer to speak of a capabilities model, inherently on an event loop, which is the focal point for safe concurrency.

 

I suggest a study of the Pony scheduler.  There are actors, mailboxes, message queues, and the scheduler, mainly.   You don’t need to be concerned about safety.  It’s been handled for you by the runtime and ref-caps.

Same with Raven, plus remote. We have all of that. See the PriorityVat class.

Here is my model of what a Vat looks like. The Vat has a green thread (called a Process) that runs all sends and calls. A Vat has a current stack for calling methods, a message queue for sending messages async in priority-sorted FIFO order (scheduling), and a Pool of event-loop semaphores waited upon for continuations (to be built) (scheduling). All interactions with Actors managed by that Vat must come through the Queue, processed by the vatProcess. No Actor can be reached by more than one Vat. No mailboxes, yet. Need published public keys under an identity so they can be encrypted.
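
As a minimal, runnable sketch of that shape in today's Squeak, assuming only the standard SharedQueue and Process #fork; the class and selector names are illustrative, not SqueakELib's, and the priority sorting and semaphore pool are omitted:

Object subclass: #SketchVat
    instanceVariableNames: 'queue vatProcess'
    classVariableNames: ''
    category: 'Vat-Sketch'.

SketchVat >> initialize
    queue := SharedQueue new.
    "One green Process drains the queue; all work for this vat's actors
     runs here, so no two sends into this vat can ever race."
    vatProcess := [[true] whileTrue: [queue next value]] fork.

SketchVat >> send: aBlock
    "Async send: enqueue and return immediately (FIFO, unprioritized)."
    queue nextPut: aBlock

"Usage:"
"SketchVat new send: [Transcript show: 'from the vat process'; cr]."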


Last time I checked, Squeak/Pharo didn’t have actors that run without interruption by the GC.  Or can Smalltalk actors do that now?  How hard to implement on OSVM would a per-actor memory architecture be?   You need a concurrent collector algo to make it work.  I don’t think you have that in the current VM. 

I do not know, myself, but this is a separate issue from the static typing of ref-caps.


I don’t have time for Pony programming these days--I can’t even read about it these days.  Go ahead if you wish.

 

Your time is better spent in other ways, though.

I communicate with you about what could be, but I agree I must stay focused on my primary target, which is porting SSL to use my new ThunkStack framework for remote encrypted communications. End-to-end encryption is what I am about. Here is a visualization of what I aim for, with TLS 1.3 and Signal as to-be-done projects; it is currently vaporware. I have ParrotTalk done and am working on SSL; then I will move to SSH. The script I listed above will load all remote packages, except for SSH. I am attaching the flyer I created to broadcast Squeak's ProCrypto configuration.

 

The speed and scale advantages of Orca over the big-heap approach have been demonstrated.  That was done some time ago.   Read the paper by Clebsch and friends for details. 

Regarding capabilities please read the ELib documentation on ERights website: http://erights.org/elib/index.html

 

The material is not well organized, and is very hard to even want to read.  Maybe you are the one to change that.

I can't change that, but I found them to be very informative for my purposes, which was to replicate this design in Squeak. See Raven. I have all the same classes and a few new ones (see Scope). I have read over and over that documentation I referenced for ELib, many, many times, over the years (16 years now I have worked on SqueakELib).



I’m looking for:  1) definitions of all terms that are new or nonstandard; 2) problem-constraint invariants (one can’t reason about anything without invariants); 3) problem-solution objectives expressed in measurable terms, like data-rate or latency.

 

Can I get those?  This might be a good time and place to describe your proposed concurrency solution for Smalltalk in terse, measurable terms. 

 

Have you measured a running actor-based app on one node?  On two or more?



You state that Pony is just to access Orca.

 

The two are codesigned.

 

What makes Orca so great?

 

The measurements.  Read the end of the Orca paper if you can’t read the whole thing.  Orca is the best concurrent MM protocol now.  

 

Aside:  I want to be rid of general purpose GC.  I think the purely functional approach with persistent data structures can work better (think Haskell).  You still need to maintain state in processing queues (think Clojure; not sure how Haskell handles this).  You still need temporal-coherency control devices at the periphery for IO (atoms for example).      
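
As a small illustration of such a peripheral device, here is a minimal atom-like sketch in Smalltalk, assuming only the standard Mutex class; the names are illustrative:

Object subclass: #SketchAtom
    instanceVariableNames: 'value lock'
    classVariableNames: ''
    category: 'FP-Sketch'.

SketchAtom >> initialize
    lock := Mutex new.

SketchAtom >> swap: aBlock
    "Apply a pure function to the current value under the lock, in the
     spirit of Clojure's swap!; all mutation is serialized here."
    lock critical: [value := aBlock value: value].

SketchAtom >> value
    ^lock critical: [value]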

 

 

There is definitely more than one Actor per Vat.

 

Why have another construct?  It appears to be scoped to the node.  Is that the reason for the vat’s existence?  Does it control machine-node-specific messaging and resource management for the actors it contains?

That's right; it is the stack, queue & pool for establishing a single-threaded event loop.


1 per process. There will be a finite number of Capability actors in each event loop.

But more than one, and they scale into the thousands per Vat.

 

Is there a formal definition of vat?  I found this:

 

“A vat is the part of the Neocosm implementation that has a unique network identity. We expect that [under] normal circumstances, there will only be one vat running on a particular machine at one time. Neocom currently (28 May 1998) supports only one avatar per vat.”

 

 

Here is the state machine specification for ParrotTalk version 3.7, which is compiled by the ProtocolStateCompiler. This stateMap models states, triggers, transitions, defaults, and callbacks, and is simple to use.

ParrotTalkSessionOperations_V3_7 class>>#stateMap

    "(((ParrotTalkSessionOperations_v3_7 stateMap compile)))"

    | desc |
    desc := ProtocolStateCompiler initialState: #initial.
    (desc newState: #initial -> (#processInvalidRequest: -> #dead))
        add: #answer -> (nil -> #receivingExpectHello);
        add: #call -> (nil -> #receivingExpectResponse).
    (desc newState: #connected -> (#processInvalidRequest: -> #dead))
        addInteger: 7 -> (#processBytes: -> #connected).
    (desc newState: #dead -> (#processInvalidRequest: -> #dead)).

    (desc newState: #receivingExpectHello -> (#processInvalidRequest: -> #dead))
        addInteger: 16 -> (#processHello: -> #receivingExpectSignature).
    (desc newState: #receivingExpectSignature -> (#processInvalidRequest: -> #dead))
        addInteger: 18 -> (#processSignature: -> #connected);
        addInteger: 14 -> (#processDuplicateConnection: -> #dead);
        addInteger: 15 -> (#processNotMe: -> #dead).

    (desc newState: #receivingExpectResponse -> (#processInvalidRequest: -> #dead))
        addInteger: 17 -> (#processResponse: -> #connected);
        addInteger: 14 -> (#processDuplicateConnection: -> #dead);
        addInteger: 15 -> (#processNotMe: -> #dead).
    ^desc.

 

 

--Because it will continue the current Smalltalk-concurrency lameness.

 

The only identified difference is not the Actor model; it is the near real-time requirements on the Garbage Collector, yes? So what lameness do you reference?

 

The actor model is a problem too if you stop all actors with a StW GC.  Fix the problem by letting each actor run until it finishes processing the last message.  Then let it collect its own garbage.  Then let it take the next message from the queue.  All actors are single-threaded by definition.  This maximizes processing rate and smooths the GC disruption of domain work.  It also increases tracing overhead transiently when large numbers of mutable objects are used (21% peak CPU consumption ascribable to tracing when you throw the tree/ring exercise at Pony with as many mutable types as possible).  We will be turning to functional programming strategies for at least some (if not eventually all) core (not peripheral IO) parallelization efforts, but I digress somewhat.
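
In pseudocode, the per-actor discipline just described looks roughly like this (hypothetical selectors; no per-actor heap exists in the OpenSmalltalk VM today):

"Sketch only: each actor alternates message processing with collecting
 its own private heap, while every other actor keeps running."
PonyLikeActor >> runLoop
    [true] whileTrue: [
        | msg |
        msg := mailbox next.    "block until the next message arrives"
        self perform: msg selector withArguments: msg arguments.
        self collectOwnHeap]    "GC this actor's heap only; no global pause"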

 

Two big problems impede parallelization of programs in Smalltalk:  1) the GC stops all actors, either all of them or large chunks of them at once, depending on how the GC works.  Neither situation is acceptable.  That the pause is small is not as important as the work lost from all the actors during that period.  2) Core Smalltalk doesn’t have a guaranteed concurrency-integrity model that automatically starts threads on different cores; it can only interleave them on one core (green threading).

 

These ideas should not be add-ons or frameworks.  They should be core features.  If we can do the above two in Squeak/Pharo, I’ll use Squeak/Pharo to work on a VM.

 

Have you tried implementing SmallInteger class>>#tinyBenchmarks in Pony?
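
For context, the benchmark in question is usually run with a print-it in a Workspace; it answers a string reporting bytecodes per second and sends per second:

"Print-it in a Workspace."
0 tinyBenchmarks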

 

No, sadly.  I’ve yet to build Pony.  I’m still reading about the techniques and background.  I plan to parse large CSV files into objects, as in Rocco’s exercise.  I have some files I can use for that purpose, and can get the same data over HTTP to test a connection too.  That would be a nice first experiment.  We need to be clinical about this.  All the talk and hand-waving is an okay start, but at some point, we must measure and keep on doing that in a loop, as architecture is tweaked.  I would like to compare a Pony parsing program to my VW parsing program as a start.  But the Pony work has to wait.

What are you saying, you don't code?


Too slow/wasteful.  Moving an actor isn’t needed if each has its own heap.

In ELib, why not allow an Actor to be mobile and move from Alice's Vat to Bob's Vat?

 

Are Vats for scoping actors to specific machine nodes?  If so, then yes move the actors to another machine node if it is better suited to what the actor does and needs.

Right.


 

Then automated management apps can truly rebalance Actors, only at rare moments, not on every call.

 

Yes, we must have run-time feedback to adjust resources, which are the actors and the machine nodes they are made to run on.  A coordination actor would watch all other actors and clusters of them.


 

Once again, the near real-time behavior of the Orca Garbage Collector. I am not convinced, but I am reading some papers.

Runtime action with an Orca collector is definitely smoother.   This we can know before the fact (before we measure) from the fine-grain MM.  What is not clear without testing is overall speed.  We need to test and measure.

 

 

86 ms will not break the contract, I propose.

 

It’s okay for a business app, certainly--not so much for machine-control or 3D graphics.  It’s good enough for prototyping a 3D sim, but not for deployment.



 

Yes, and a dedicated cycle-detecting actor watches for this in Pony-Orca.

I don't watch for it, but it is a strong design consideration. Keep Actor behaviors short and sweet.

 

The cycles are about chaining of messages/dependencies between actors.  When a programmer makes these actor clusters manually, messing up is easy.  A higher-level tool is needed to manage these constructions.  The actor-cycles will exist and become problematic even if each actor is fast.  The detector finds and exposes them to the programmer.



 

ERights' event loops ensure no data races, as long as actor objects are not accessible from more than one event loop.

 

Speed hit.

???

This constraint is not needed to guarantee correct concurrency in Pony, where the guarantee happens at compile time and takes no CPU cycles.  The approach above sounds dynamic; there are some CPU cycles involved.

 

Yes; I didn’t claim otherwise.  The networked version is coming.  See above.   My point is that the ‘remote’ characterization is not needed.  It’s not helping us describe and understand.

It does for me: either we have inner-Vat message calling (immediate, adding to the stack), or we have inter-Vat message sending (asynchronous, adding to the queue).

I still don’t see a clear definition of vat.

 

None of the above language is needed when the concurrency scheme is simpler and doesn’t use those ideas and devices. 

Design-wise, it makes much sense to treat inter-thread, inter-process and inter-machine concurrency as the same remote interface.

 

The design is already done, modulo the not-yet-present 3-way gift-giving network extension.  Interfacing between actors is always by async messaging, even local sends to nearRefs.  Messaging will work as transparently as possible in the networked version across machine nodes.

Must remember, we are still vulnerable to network failure errors.

Yes, they are keenly aware of that.  It’s a big task and won’t happen for a while.  But it will happen.

  

 

I don’t like the Pony language syntax.  I don’t like anything that looks like Algol-60.  Pony is a language, compiler, and runtime implementing Orca.  The other stuff is good.  And I’ve not had much time to use it; I suspect I could like it more.

No argument from me! Squeak is a language, a compiler and a profound image-based runtime. Is there another language with such an image-based runtime? I think not.

Yes, we all love Smalltalk.  It’s still too slow.

We’re not talking about coding ergonomics and testing dynamics.  We all agree that Smalltalk is a better way to code and interact with the evolving program.  But that’s not what the thread is about.  This is all about speed and scale across cores, CPUs, and machine nodes.  The solution must be implemented close to the metal (VM).

The Way
  1. Get it to run;
  2. Get it to run right;
  3. Get it to run fast.

This has always been the Way. If that third step requires a per Actor GC, then so be it.

It can’t be an add-on framework.  We need an Actor class and syntax for making actors and their async behaviours. 

This we have exactly in CapabilitiesLocal and Raven.


Then we need the VM to understand the new format and bytecodes associated with actors.

We have had no need for special bytecodes. Squeak just does [:it | it with: #Class]. I have thought of making the VM aware of ERefs. I am busy on layer 5, building secure Sessions for SSL, SSH, Signal and TLS 1.3, as I mentioned. Then I will turn to Raven, again.


 

Shaping

 

-- 
Kindly and with good fortune,
Robert