
Amazing ARM simulator experience


Eliot Miranda-2

Amazing ARM simulator experience

Hi All,

just had to tell people about this morning's experience using the ARM simulator. I've been building Smalltalk VMs since 1983, so 33 years. My first, on the Three Rivers PERQ, was dog slow. My undergraduate project was done on the National Semiconductor 32016-based Whitechapel Computer Works workstation and its 32032-based successor.

This morning I was revamping register management in Cog's ARMv5/v6 back end, making more registers available by using two of the C argument registers for two of the registers the Cogit assigns to various fixed tasks, and using store-multiple and load-multiple instructions to concisely save and restore the registers around calls into the runtime.
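(Illustrative aside, not Cog code: the sketch below shows why store/load multiple is concise. On 32-bit ARM a single STMDB sp!, {...} (push) or LDMIA sp!, {...} (pop) instruction names every register to transfer in a 16-bit mask, so one instruction word saves or restores the whole set. The register list chosen here is hypothetical and is not the set the Cogit actually saves.)

```python
# Encode A32 PUSH/POP as STMDB sp!, {...} / LDMIA sp!, {...} with condition AL.
# The low 16 bits of the instruction are a bit mask: bit n set means "transfer rn".

REG_NUMBERS = {f"r{i}": i for i in range(13)}
REG_NUMBERS.update({"sp": 13, "lr": 14, "pc": 15})

STMDB_SP_WRITEBACK = 0xE92D0000  # push: store multiple, decrement before
LDMIA_SP_WRITEBACK = 0xE8BD0000  # pop: load multiple, increment after


def register_mask(names):
    """Return the 16-bit register-list mask for the given register names."""
    mask = 0
    for name in names:
        mask |= 1 << REG_NUMBERS[name]
    return mask


def encode_push(names):
    return STMDB_SP_WRITEBACK | register_mask(names)


def encode_pop(names):
    return LDMIA_SP_WRITEBACK | register_mask(names)


# Hypothetical caller-saved set to preserve around a call into the runtime.
saved = ["r0", "r1", "r2", "r3", "r12", "lr"]
print(f"push: {encode_push(saved):#010x}")
print(f"pop:  {encode_pop(saved):#010x}")
```

Six registers are saved and restored with two instruction words in total, where individual stores and loads would need twelve.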

Remember the architecture here. The simulator generates ARM machine code into the ByteArray that represents the address space, which holds all of the generated machine code, a small C stack, and the Smalltalk heap. A plugin, derived from Gdb's ARM simulator written in C, interprets that machine code, running for a couple of milliseconds at a time in a loop, applying breakpoint calculations and asserts on every call into the run-time. Calls into the run-time and accesses of variables in the simulator are done by using illegal addresses in the generated machine code. Each illegal access causes the Gdb-derived machine code interpreter to return with an error. This error is turned into an exception, whose handler maps the specific illegal address to a variable or message selector, accesses the variable or activates the message selector, provides the result, and allows execution to continue.
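(Illustrative aside, not Cog code: a toy Python model of the trap-on-illegal-address scheme described above. The three-opcode instruction set, the addresses 0xF000 and 0xF100, and the names ToyCPU, Simulator, stack_limit and ce_send are all invented for this sketch. The real system interprets ARM code through the Gdb-derived plugin and handles the exception in Smalltalk.)

```python
import struct

MEMORY_SIZE = 0x100          # toy address space; any address >= this is "illegal"
LOAD, CALL, HALT = 1, 2, 3   # toy opcodes; LOAD and CALL take a 16-bit address


class IllegalAccess(Exception):
    def __init__(self, address):
        super().__init__(f"illegal address {address:#x}")
        self.address = address


class ToyCPU:
    """Stands in for the machine-code interpreter plugin."""

    def __init__(self, memory):
        self.memory = memory
        self.pc = 0
        self.acc = 0

    def run(self):
        """Interpret until HALT; raise IllegalAccess on an out-of-range address."""
        while True:
            op = self.memory[self.pc]
            if op == HALT:
                return
            (address,) = struct.unpack_from("<H", self.memory, self.pc + 1)
            self.pc += 3  # execution resumes after the trapping instruction
            if address >= MEMORY_SIZE:
                raise IllegalAccess(address)
            if op == LOAD:
                self.acc = self.memory[address]
            else:
                raise ValueError("in this toy, CALL only targets the run-time")


class Simulator:
    """Maps illegal addresses to simulator variables or run-time routines."""

    def __init__(self):
        self.stack_limit = 42
        self.log = []
        self.trap_table = {0xF000: ("variable", "stack_limit"),
                           0xF100: ("selector", "ce_send")}

    def ce_send(self, cpu):
        self.log.append(("ce_send", cpu.acc))
        return cpu.acc + 1

    def execute(self, cpu):
        while True:
            try:
                cpu.run()
                return
            except IllegalAccess as trap:
                kind, name = self.trap_table[trap.address]
                if kind == "variable":
                    cpu.acc = getattr(self, name)       # access the variable
                else:
                    cpu.acc = getattr(self, name)(cpu)  # activate the selector


def assemble(*instructions):
    code = bytearray(MEMORY_SIZE)
    offset = 0
    for op, address in instructions:
        struct.pack_into("<BH", code, offset, op, address)
        offset += 3
    code[offset] = HALT
    return code


cpu = ToyCPU(assemble((LOAD, 0xF000), (CALL, 0xF100)))
sim = Simulator()
sim.execute(cpu)
print(cpu.acc, sim.log)
```

The point of the design is that the generated code needs no special "call the simulator" instruction: an ordinary load or call to an address outside the simulated memory is enough to hand control back to the host.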

In changing the register management I had a test case that worked, an image that prompted for an expression and evaluated it, which worked both in the simulator and with the generated VM. But the real VM crashed when used on a proper image. So to debug I started launching the real interactive image in the simulator.

Well, the amazing experience is that that image, whose machine code is being _interpreted by a C program_, feels /faster/ than my 32016-based implementation back in 1984-ish. Quite amazing. I can open windows, type text, access source code (was playing with Message Names) etc. It's sluggish, but usable. Amazing how fast modern machines are. All on my 2012 vintage 2.2 GHz Core i7 MacBook Pro. I'm blown away :-)

_,,,^..^,,,_

best, Eliot

kilon.alios

Re: Amazing ARM simulator experience

My first computer, which my father bought me in 1988, was an Amstrad CPC 6128 with a 4 MHz CPU and 128 KB of RAM.

My latest computer is a late 2013 quad-core 3 GHz iMac with 8 GB of memory, which makes it at least 3000 times faster than my Amstrad. Technology has come a long way in the past 30 years. I could not have imagined this future even in my wildest dreams.

By Gdb, do you mean the GNU debugger?

On Wed, Jun 1, 2016 at 8:43 PM Eliot Miranda <[hidden email]> wrote:

[snip]

Eliot Miranda-2

Re: Amazing ARM simulator experience

Hi Dimitris,

On Wed, Jun 1, 2016 at 3:07 PM, Dimitris Chloupis <[hidden email]> wrote:

> By Gdb you mean the GNU debugger?

Yes. Gdb comes with a lot of different simulators. So far, for executing machine code in Cog, we've used Bochs for x86 and x86_64, Gdb for ARMv6, and hand-written Smalltalk for MIPSEL.

_,,,^..^,,,_

best, Eliot

Ryan Macnak

Re: Amazing ARM simulator experience

I'll second that simulators are an essential tool for building a JIT. In the Dart VM, we have our own simulators for ARM, ARM64 and MIPS that allow us to test changes against all the architectures we support, locally on our x64 workstations. When we first got the VM running on iOS, we were even running the ARM simulator on the iPhone to work around the no-JITing-unless-you're-Apple policy (we have since completed an AOT mode). Although it was sluggish compared to its JIT counterpart running on Android, it was certainly usable. And given our loading code is also implemented in Dart, having simulators allows us to cross-compile AOT code for Android and iOS from x64 desktops.

On Wed, Jun 1, 2016 at 6:21 PM, Eliot Miranda <[hidden email]> wrote:

[snip]

Ben Coman

Re: Amazing ARM simulator experience

On Thu, Jun 2, 2016 at 10:19 AM, Ryan Macnak <[hidden email]> wrote:
>
> I'll second that simulators are an essential tool for building a JIT. In the Dart VM, we have our own simulators for ARM, ARM64 and MIPS that allow us to test changes against all the architectures we support, locally on our x64 workstations. When we first got the VM running on iOS, we were even running the ARM simulator on the iPhone to work around the no-JITing-unless-you're-Apple policy (we have since completed an AOT mode). Although it was sluggish compared to its JIT counterpart running on Android, it was certainly usable. And given our loading code is also implemented in Dart, having simulators allows us to cross-compile AOT code for Android and iOS from x64 desktops.

One thing I've been contemplating for a while: given that Sista will,
IIUC, cache hotspot info in the image, enabling a hot start, would that
be a reasonable workaround for Apple's no-JIT policy? You could use
unit tests to warm up Sista, then code-sign the whole resultant image.

btw I got curious what exactly the policy[1] was... "Further
protection is provided by iOS using ARM’s Execute Never (XN) feature,
which marks memory pages as non-executable. Memory pages marked as
both writable and executable can be used only by apps under tightly
controlled conditions: The kernel checks for the presence of the
Apple-only dynamic code-signing entitlement. Even then, only a single
mmap call can be made to request an executable and writable page,
which is given a randomized address. Safari uses this functionality
for its JavaScript JIT compiler."

[1] https://www.apple.com/business/docs/iOS_Security_Guide.pdf

cheers -ben

Clément Béra

Re: Amazing ARM simulator experience

On Thu, Jun 2, 2016 at 7:49 AM, Ben Coman <[hidden email]> wrote:

> One thing I've been contemplating for a while: given that Sista will,
> IIUC, cache hotspot info in the image, enabling a hot start, would that
> be a reasonable workaround for Apple's no-JIT policy? You could use
> unit tests to warm up Sista, then code-sign the whole resultant image.

Yes and no.

One problem is that the Sista image has optimized code in the form of bytecoded methods. The baseline JIT is still required to generate the machine code. So the application would need a prepackaged machine code zone, which is not possible without some work right now. Currently Sista methods are optimized to use the baseline JIT as the back end and are not optimized for the interpreter.

Another problem is things like inline caches, which patch the machine code. We would need to change that logic. One way would be to keep the cache values in a non-executable memory zone; another would be to have inline cache failures never patch the code.
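(Illustrative aside, not Cog code: a toy Python model of the first option, where the call site itself is never modified and the cached class and method live in a separate writable table. The names CallSite, InlineCaches and Point are invented for this sketch, and a frozen dataclass merely stands in for read-only, code-signed machine code.)

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallSite:
    """Stands in for immutable code: it only names a selector and a data slot."""
    selector: str
    cache_index: int


class InlineCaches:
    """Writable side table holding what a patching JIT would write into code."""

    def __init__(self):
        self.entries = []   # one (cached_class, method) pair per call site
        self.misses = 0

    def new_site(self, selector):
        self.entries.append((None, None))
        return CallSite(selector, len(self.entries) - 1)

    def send(self, site, receiver, *args):
        cached_class, method = self.entries[site.cache_index]
        if cached_class is not type(receiver):
            # Cache miss: do the full lookup and update the data table,
            # leaving the call site untouched.
            self.misses += 1
            method = getattr(type(receiver), site.selector)
            self.entries[site.cache_index] = (type(receiver), method)
        return method(receiver, *args)


class Point:
    def __init__(self, x):
        self.x = x

    def double(self):
        return self.x * 2


caches = InlineCaches()
site = caches.new_site("double")
results = [caches.send(site, Point(n)) for n in (1, 2, 3)]
print(results, caches.misses)
```

The cost of this arrangement is an extra data load on every send compared with a cache embedded directly in the instruction stream.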

Currently the Stack VM works on iOS, and the Stack VM interpreter is very fast (between 10 and 20% overhead compared to the ASM template production version of Java's HotSpot). There are multiple solutions to boost performance on iOS using the existing infrastructure, but there is no obvious way to make that production-ready in less than (optimistically) 6 months of work.

btw I got curious what exactly the policy[1] was... "Further
protection is provided by iOS using ARM’s Execute Never (XN) feature,
which marks memory pages as non-executable. Memory pages marked as
both writable and executable can be used only by apps under tightly
controlled conditions: The kernel checks for the presence of the
Apple-only dynamic code-signing entitlement. Even then, only a single
mmap call can be made to request an executable and writable page,
which is given a randomized address. Safari uses this functionality
for its JavaScript JIT compiler."

Ahah. "Apple-only". How fancy.


Ben Coman

Re: Amazing ARM simulator experience

Hi Clement,

On Thu, Jun 2, 2016 at 6:10 PM, Clément Bera <[hidden email]> wrote:

[snip]

As I bump into this in the archives and re-read it with an improved understanding of Sista, I wonder if it would be fair to expect that Sista's in-image bytecode inlining would improve performance on iOS even without JITing?

cheers -ben


Clément Béra

Re: Amazing ARM simulator experience

On Wed, May 3, 2017 at 6:23 PM, Ben Coman <[hidden email]> wrote:

> As I bump into this in the archives and re-read it with an improved understanding of Sista,
> I wonder if it would be fair to expect that Sista's in-Image bytecode inlining would improve the performance on iOS even without JITing?

Just a detail, the performance of Sista is not only about inlining...

My experience with inlining is that closure inlining improves performance, and inlining of specific methods may improve performance if it leads to specific patterns optimised by the VM (like a SmallInteger comparison followed by a branch), but overall, inlining of methods does not improve performance significantly enough to be noticeable on Linux/Mac on modern MacBook Pros (and I have a hundred benchmarks to prove this for those who disagree).

The Sista bytecode set is not designed for fast interpretation. But if one designs such a set (basically you need to encode the same operations, but performance-critical instructions need to be encoded in a single byte while uncommon instructions need to be encoded in multiple bytes), and changes the StackInterpreter to use the optimisation flags it currently ignores (like no store check for a given store), I am pretty sure you could get some speed-up.
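(Illustrative aside, not the Sista bytecode set: a toy Python encoder and decoder showing the single-byte versus multi-byte split described above. All opcode names, their numbering and the 0xFF extension prefix are invented for this sketch.)

```python
# Frequent operations get a one-byte opcode; rare ones are reached through a
# prefix byte followed by a second byte, so they cost two bytes to encode.

COMMON = {"push_receiver": 0x00, "push_temp0": 0x01,
          "send_add": 0x02, "return_top": 0x03}
EXTENSION_PREFIX = 0xFF
UNCOMMON = {"push_this_context": 0x00, "call_primitive": 0x01}

COMMON_BY_CODE = {code: name for name, code in COMMON.items()}
UNCOMMON_BY_CODE = {code: name for name, code in UNCOMMON.items()}


def encode(operations):
    out = bytearray()
    for name in operations:
        if name in COMMON:
            out.append(COMMON[name])
        else:
            out += bytes([EXTENSION_PREFIX, UNCOMMON[name]])
    return bytes(out)


def decode(code):
    names = []
    index = 0
    while index < len(code):
        if code[index] == EXTENSION_PREFIX:
            names.append(UNCOMMON_BY_CODE[code[index + 1]])
            index += 2
        else:
            names.append(COMMON_BY_CODE[code[index]])
            index += 1
    return names


method = ["push_receiver", "push_temp0", "send_add",
          "push_this_context", "return_top"]
encoded = encode(method)
print(len(encoded), encoded.hex())
```

An interpreter dispatching on the first byte reaches the common operations in one fetch and only pays for a second fetch on the rare prefixed ones.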

I guess I could provide a Sista Pharo image + VM for those who are interested. Building a Sista Squeak image is possible (it worked a couple of years ago), but the Sista VM requires FullBlockClosures instead of BlockClosures to work, and currently Squeak does not feature those AFAIK.
