ARM Cog progress

40 messages
ARM Cog progress

timrowledge

We can now run the ARM cog/spur system on a Pi 2; it’s debug compiled and lots of asserts fail, and it exits quite aggressively when you upset it, but - actual real compiled code, running on an actual ARM machine, using the actual morphic ui to do actual stuff.

3+4 does indeed = elephant.
100 factorial is a very long number.
1 tinyBenchmarks is utterly meaningless (gcc debug settings + lots of expensive runtime asserts) but still reports 120mbc/s and 7m sends/s or about 4x the stack vm.

Happy-happy.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful Latin Phrases:- Quantum materiae materietur marmota monax si marmota monax materiam possit materiari? = How much wood would a woodchuck chuck if a woodchuck could chuck wood?



Re: ARM Cog progress

Casey Ransberger-2

That's fantastic news. Gotta fix 3+4 though, that should answer #cowTools.

--C

P.S.

Suddenly wondering if anyone has ever confused one of our symbols for a hashtag...

> On May 18, 2015, at 5:43 PM, tim Rowledge <[hidden email]> wrote:
>
>
> We can now run the ARM cog/spur system on a Pi 2; it’s debug compiled and lots of asserts fail, and it exits quite aggressively when you upset it, but - actual real compiled code, running on an actual ARM machine, using the actual morphic ui to do actual stuff.
>
> 3+4 does indeed = elephant.
> 100 factorial is a very long number.
> 1 tinyBenchmarks is utterly meaningless (gcc debug settings + lots of expensive runtime asserts) but still reports 120mbc/s and 7m sends/s or about 4x the stack vm.
>
> Happy-happy.
>
> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> Useful Latin Phrases:- Quantum materiae materietur marmota monax si marmota monax materiam possit materiari? = How much wood would a woodchuck chuck if a woodchuck could chuck wood?
>
>

Re: ARM Cog progress

Bert Freudenberg
 

> On 19.05.2015, at 03:19, Casey Ransberger <[hidden email]> wrote:
>
>
> That's fantastic news. Gotta fix 3+4 though, that should answer #cowTools.
>
> --C
>
> P.S.
>
> Suddenly wondering if anyone has ever confused one of our symbols for a hashtag…
#Smalltalk. Hashtagging for 35 years.

- Bert -





Re: ARM Cog progress

KenDickey
In reply to this post by timrowledge
 
On Mon, 18 May 2015 17:43:17 -0700
tim Rowledge <[hidden email]> wrote:

>  
> We can now run the ARM cog/spur system on a Pi 2; it’s debug compiled and lots of asserts fail, and it exits quite aggressively when you upset it, but - actual real compiled code, running on an actual ARM machine, using the actual morphic ui to do actual stuff.

Congrats, Tim!

Looking forward to running it on my Samsung ARM Chromebook!

Thanks for all the great work!

-KenD

Re: ARM Cog progress

timfelgentreff
Does the Pi1 work, too? Or are you using code specific to the newer cpu?

Re: ARM Cog progress

Eliot Miranda-2

Hi Tim,

On May 21, 2015, at 12:47 AM, timfelgentreff <[hidden email]> wrote:

>
> Does the Pi1 work, too? Or are you using code specific to the newer cpu?

   TimR and I were talking about this yesterday.  The current code generator targets ARMv5, and so works on Pi1.

Pi2 uses ARMv7 which, so TimR tells me, has a 16-bit literal load instruction, which means a 32-bit literal can be generated using two 32-bit instructions.  ARMv5 either requires 4 32-bit instructions, or 1 32-bit instruction to access 1 32-bit literal out-of-line using PC-relative addressing.  I'd like to know what the situation is for ARMv8 (the 64-bit ISA).
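A rough sketch of the two in-line strategies described above. It assumes the ARMv5 sequence is one MOV followed by three ORRs, each contributing one byte of the literal, and that the ARMv7 sequence is MOVW (low half) followed by MOVT (high half). The function names are illustrative only and are not taken from the Cog sources.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv5 style: one byte per instruction (MOV, then ORR three times).
   bytes[0] is the least significant byte. */
void split_into_bytes(uint32_t literal, uint8_t bytes[4])
{
    for (int i = 0; i < 4; i++)
        bytes[i] = (uint8_t)(literal >> (8 * i));
}

/* ARMv7 style: low half for MOVW, high half for MOVT. */
void split_into_halves(uint32_t literal, uint16_t *low, uint16_t *high)
{
    *low  = (uint16_t)(literal & 0xFFFFu);
    *high = (uint16_t)(literal >> 16);
}

/* What the target register holds once the four-instruction sequence has run. */
uint32_t rebuild_from_bytes(const uint8_t bytes[4])
{
    uint32_t value = 0;
    for (int i = 0; i < 4; i++)
        value |= (uint32_t)bytes[i] << (8 * i);
    return value;
}

/* What the target register holds once MOVW and MOVT have run. */
uint32_t rebuild_from_halves(uint16_t low, uint16_t high)
{
    return ((uint32_t)high << 16) | low;
}
```

Either split rebuilds the same 32-bit value; the difference is four instructions against two.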

The temptation is to move to ARMv7 to get that more compact and faster literal generation.  But it would mean either dropping Pi1 or two VMs.  I'm not afraid of two VMs but it is more stuff, with all the headaches for newbies that entails. Another alternative might be to have the JIT test whether the system is v7 or not and generate the appropriate code, but that is problematic; the JIT will bloat and scanning machine code for object references will slow down.

Knowing what ARMv8 does for 64-bit literal synthesis would help me make up my mind.  Whether the JIT should support out-of-line literal load is a somewhat significant issue; it's not something to write unless it's necessary.

Eliot (phone)
>
> --
> View this message in context: http://forum.world.st/ARM-Cog-progress-tp4827195p4827779.html
> Sent from the Squeak VM mailing list archive at Nabble.com.

Re: ARM Cog progress

KenDickey
 
On Thu, 21 May 2015 05:58:41 -0700
Eliot Miranda <[hidden email]> wrote:

> Pi2 uses ARMv7 which, so TimR tells me, has a 16-bit literal load instruction, which means a 32-bit literal can be generated using two 32-bit instructions.  ARMv5 either requires 4 32-bit instructions, or 1 32-bit instruction to access 1 32-bit literal out-of-line using PC-relative addressing.  I'd like to know what the situation is for ARMv8 (the 64-bit ISA).
>
> The temptation is to move to ARMv7 to get that more compact and faster literal generation.  But it would mean either dropping Pi1 or two VMs.  I'm not afraid of two VMs but it is more stuff, with all the headaches for newbies that entails. Another alternative might be to have the JIT test whether the system is v7 or not and generate the appropriate code, but that is problematic; the JIT will bloat and scanning machine code for object references will slow down.

..but the test for ARM5/7/8/.. should happen once, and the codegen could be specialized at that time -- after which the ARM specialization code itself is no longer needed, so no bloat.

Dynamic specialization does work, right?  ;^)
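One way to picture the "test once, then specialize" idea is a function pointer chosen at start-up. This is only a sketch under assumptions: the enum, the emitter functions and the instruction counts they return are hypothetical stand-ins, not the Cogit's actual structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical architecture tags; a real VM would probe the CPU at start-up. */
typedef enum { ARCH_ARMV5, ARCH_ARMV7 } arm_arch;

/* Each emitter reports how many 32-bit instructions it would generate
   to synthesize one 32-bit literal in-line. */
typedef int (*literal_emitter)(void);

int emit_literal_v5(void) { return 4; }  /* MOV + 3 x ORR */
int emit_literal_v7(void) { return 2; }  /* MOVW + MOVT */

static literal_emitter current_emitter = NULL;

/* Run once; afterwards the code generation path contains no version test. */
void specialize_codegen(arm_arch arch)
{
    current_emitter = (arch == ARCH_ARMV7) ? emit_literal_v7 : emit_literal_v5;
}

int emit_literal(void)
{
    assert(current_emitter != NULL);  /* specialize_codegen must run first */
    return current_emitter();
}
```

Note that this only covers the generation side. Code that later scans machine code for object references would still have to recognise whichever sequence was chosen.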

--
-KenD

Re: ARM Cog progress

Ryan Macnak
In reply to this post by Eliot Miranda-2
 
On Thu, May 21, 2015 at 5:58 AM, Eliot Miranda <[hidden email]> wrote:

> Hi Tim,
>
> On May 21, 2015, at 12:47 AM, timfelgentreff <[hidden email]> wrote:
>
> >
> > Does the Pi1 work, too? Or are you using code specific to the newer cpu?
>
> TimR and I were talking about this yesterday.  The current code generator targets ARMv5, and so works on Pi1.
>
> Pi2 uses ARMv7 which, so TimR tells me, has a 16-bit literal load instruction, which means a 32-bit literal can be generated using two 32-bit instructions.  ARMv5 either requires 4 32-bit instructions, or 1 32-bit instruction to access 1 32-bit literal out-of-line using PC-relative addressing.  I'd like to know what the situation is for ARMv8 (the 64-bit ISA).
>
> The temptation is to move to ARMv7 to get that more compact and faster literal generation.  But it would mean either dropping Pi1 or two VMs.  I'm not afraid of two VMs but it is more stuff, with all the headaches for newbies that entails. Another alternative might be to have the JIT test whether the system is v7 or not and generate the appropriate code, but that is problematic; the JIT will bloat and scanning machine code for object references will slow down.

Dart puts all object references off into a pool to avoid this.

> Knowing what ARMv8 does for 64-bit literal synthesis would help me make up my mind.  Whether the JIT should support out-of-line literal load is a somewhat significant issue; it's not something to write unless it's necessary.

Four 32-bit instructions loading 16-bit pieces, or one pc-relative load.
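For the in-line case, the four instructions would presumably be one MOVZ followed by three MOVKs. The sketch below encodes them using the A64 "move wide immediate" layout as I understand it (base opcode, then a 2-bit halfword index at bit 21, the 16-bit immediate at bit 5, and the register in the low 5 bits); the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* MOVZ Xd, #imm16, LSL #(16 * hw): zero the register, then place one piece. */
uint32_t encode_movz(unsigned rd, uint16_t imm16, unsigned hw)
{
    assert(rd < 32 && hw < 4);
    return 0xD2800000u | ((uint32_t)hw << 21) | ((uint32_t)imm16 << 5) | rd;
}

/* MOVK Xd, #imm16, LSL #(16 * hw): replace one piece, keep the other bits. */
uint32_t encode_movk(unsigned rd, uint16_t imm16, unsigned hw)
{
    assert(rd < 32 && hw < 4);
    return 0xF2800000u | ((uint32_t)hw << 21) | ((uint32_t)imm16 << 5) | rd;
}

/* Fill out[0..3] with the four instructions that build `literal` in Xd. */
void synthesize_literal64(unsigned rd, uint64_t literal, uint32_t out[4])
{
    out[0] = encode_movz(rd, (uint16_t)literal, 0);
    for (unsigned hw = 1; hw < 4; hw++)
        out[hw] = encode_movk(rd, (uint16_t)(literal >> (16 * hw)), hw);
}
```

Because MOVK leaves the other bits untouched, no extra shift or add is needed to combine the pieces.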


Re: ARM Cog progress

timrowledge
In reply to this post by timfelgentreff
 

On 21-05-2015, at 12:47 AM, timfelgentreff <[hidden email]> wrote:

>
> Does the Pi1 work, too? Or are you using code specific to the newer cpu?
Yes, and no. Yet.


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- A prime candidate for natural deselection.



Re: ARM Cog progress

Eliot Miranda-2
In reply to this post by Ryan Macnak
 
Hi Ryan, Hi Tim,

On May 21, 2015, at 8:55 AM, Ryan Macnak <[hidden email]> wrote:

> On Thu, May 21, 2015 at 5:58 AM, Eliot Miranda <[hidden email]> wrote:
>
> > Hi Tim,
> >
> > On May 21, 2015, at 12:47 AM, timfelgentreff <[hidden email]> wrote:
> >
> > >
> > > Does the Pi1 work, too? Or are you using code specific to the newer cpu?
> >
> > TimR and I were talking about this yesterday.  The current code generator targets ARMv5, and so works on Pi1.
> >
> > Pi2 uses ARMv7 which, so TimR tells me, has a 16-bit literal load instruction, which means a 32-bit literal can be generated using two 32-bit instructions.  ARMv5 either requires 4 32-bit instructions, or 1 32-bit instruction to access 1 32-bit literal out-of-line using PC-relative addressing.  I'd like to know what the situation is for ARMv8 (the 64-bit ISA).
> >
> > The temptation is to move to ARMv7 to get that more compact and faster literal generation.  But it would mean either dropping Pi1 or two VMs.  I'm not afraid of two VMs but it is more stuff, with all the headaches for newbies that entails. Another alternative might be to have the JIT test whether the system is v7 or not and generate the appropriate code, but that is problematic; the JIT will bloat and scanning machine code for object references will slow down.
>
> Dart puts all object references off into a pool to avoid this.
>
> > Knowing what ARMv8 does for 64-bit literal synthesis would help me make up my mind.  Whether the JIT should support out-of-line literal load is a somewhat significant issue; it's not something to write unless it's necessary.
>
> Four 32-bit instructions loading 16-bit pieces, or one pc-relative load.

So out-of-line = 12 bytes vs in-line = 16 bytes.  For me, given that ARM has always supported out-of-line, and it should have good performance, I'd go for out-of-line.  But its performance could be much worse.  Anyone have any numbers?
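The 12 versus 16 figure can be reproduced with a little arithmetic. The sketch assumes fixed 4-byte instructions, one in-line instruction per 16-bit piece of the literal, and one load plus the literal itself for the out-of-line case; it ignores any alignment padding a literal pool might need. The helper names are hypothetical.

```c
#include <assert.h>

enum { INSTRUCTION_BYTES = 4 };

/* In-line: one instruction per 16-bit piece of the literal. */
int inline_literal_bytes(int literal_bytes)
{
    assert(literal_bytes > 0 && literal_bytes % 2 == 0);
    return (literal_bytes / 2) * INSTRUCTION_BYTES;
}

/* Out-of-line: one PC-relative load plus the literal stored in a pool. */
int out_of_line_literal_bytes(int literal_bytes)
{
    assert(literal_bytes > 0);
    return INSTRUCTION_BYTES + literal_bytes;
}
```

With an 8-byte literal this gives 16 bytes in-line against 12 out-of-line. With a 4-byte literal and 16-bit pieces both come to 8 bytes, so under these assumptions the size advantage only appears for 64-bit literals.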

Eliot (phone)

Re: ARM Cog progress

timrowledge


On 21-05-2015, at 10:19 AM, Eliot Miranda <[hidden email]> wrote:
>>
>> Four 32-bit instructions loading 16-bit pieces, or one pc-relative load.
>
> So out-of-line = 12 bytes vs in-line = 16 bytes.  For me, given that ARM has always supported out-of-line, and it should have good performance, I'd go for out-of-line.  But its performance could be much worse.  Anyone have any numbers?

Also no absolute need to do 64bit oops with AArch64. It will be quite happy to do 32 bit oops. So 2x 16bit chunks would be fine for both that and v7. So far as I can work out all the operations can work in 32bit quantities, even rotations/shifts/compare.

And there is the hilarious concept of conditional comparisons - if the condition flags match a vector of condition flags, then do a compare of some sort and if that is true, set the flags as appropriate, otherwise set the flags to another vector. I’d love to see the logic that persuaded them to do that.
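A small model of what the conditional compare does, as far as I understand the AArch64 CCMP instruction: if the condition holds on the current flags, the flags become the result of an ordinary compare; otherwise they are set to an immediate NZCV value carried in the instruction. The types and function names below are invented for illustration, and the condition is passed in already evaluated to keep the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, c, v; } flags;

/* Flags as set by CMP a, b (a - b with the result discarded). */
flags compare(uint64_t a, uint64_t b)
{
    uint64_t diff = a - b;
    flags f;
    f.n = (diff >> 63) != 0;
    f.z = diff == 0;
    f.c = a >= b;                               /* carry set means no borrow */
    f.v = (((a ^ b) & (a ^ diff)) >> 63) != 0;  /* signed overflow */
    return f;
}

/* CCMP a, b, #fallback, cond: compare only if cond held, else load fallback. */
flags conditional_compare(bool cond_holds, uint64_t a, uint64_t b, flags fallback)
{
    return cond_holds ? compare(a, b) : fallback;
}

/* x == 1 && y == 2 evaluated as CMP followed by CCMP, with no branch between. */
bool both_equal(uint64_t x, uint64_t y)
{
    flags all_clear = { false, false, false, false };  /* Z clear reads as "not equal" */
    flags f = compare(x, 1);
    f = conditional_compare(f.z, y, 2, all_clear);
    return f.z;
}
```

Seen this way the likely motivation is chaining several tests of an `&&` or `||` expression into one final flag check without intermediate branches.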

I swear I spotted a WTF instruction in there somewhere.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
State-of-the-art: What we could do with enough money.



Re: ARM Cog progress

Nicolas Cellier
 


2015-05-21 19:42 GMT+02:00 tim Rowledge <[hidden email]>:


> On 21-05-2015, at 10:19 AM, Eliot Miranda <[hidden email]> wrote:
> >>
> >> Four 32-bit instructions loading 16-bit pieces, or one pc-relative load.
> >
> > So out-of-line = 12 bytes vs in-line = 16 bytes.  For me, given that ARM has always supported out-of-line, and it should have good performance, I'd go for out-of-line.  But its performance could be much worse.  Anyone have any numbers?
>
> Also no absolute need to do 64bit oops with AArch64. It will be quite happy to do 32 bit oops. So 2x 16bit chunks would be fine for both that and v7. So far as I can work out all the operations can work in 32bit quantities, even rotations/shifts/compare.
>
> And there is the hilarious concept of conditional comparisons - if the condition flags match a vector of condition flags, then do a compare of some sort and if that is true, set the flags as appropriate, otherwise set the flags to another vector. I’d love to see the logic that persuaded them to do that.
>
> I swear I spotted a WTF instruction in there somewhere.

Also a ROTFL.
I think it just matches the conversion from SmallDouble to native double that Eliot naively coded with many more instructions.

> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> State-of-the-art: What we could do with enough money.




Re: ARM Cog progress

Casey Ransberger-2
In reply to this post by timrowledge


> On May 21, 2015, at 10:42 AM, tim Rowledge <[hidden email]> wrote:
>
> I swear I spotted a WTF instruction in there somewhere.

Every ISA should have a WTF instruction. It should ideally be a noop, but more likely does something really ambitious completely wrong.

Hah.

--C

Re: ARM Cog progress

Eliot Miranda-2
In reply to this post by Eliot Miranda-2
 
Hi Doug, Hi Tim,  Hi All,

    so yesterday I finally switched on the Raspberry Pi Doug gave me as an xmas present, built the Spur ARM Cog VM and ... we definitely have a working VM.  I was able to update a Spur image from mid February all the way to tip and run tests.  3751 run, 3628 passes, 24 expected failures, 89 failures, 10 errors, 0 unexpected passes
Fun!  So I want to revisit the literal load question.

Doug got me a Pi 1 B.  cat /proc/cpuinfo reveals
processor : 0
model name : ARMv6-compatible processor rev 7 (v6l)
Features : swp half thumb fastmult vfp edsp java tls 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xb76
CPU revision : 7

Hardware : BCM2708
Revision : 000e
Serial : 00000000fe7b08eb

And the ARM Assembler User Guide says (emphasis added)
4.4 Load immediate values using MOV and MVN
The MOV and MVN instructions can write a range of immediate values to a register.

In ARM state:
MOV can load any 8-bit immediate value, giving a range of 0x0-0xFF (0-255).
It can also rotate these values by any even number.
These values are also available as immediate operands in many data processing operations, without being loaded in a separate instruction.
MVN can load the bitwise complements of these values. The numerical values are -(n+1), where n is the value available in MOV.
In ARMv6T2 and later, MOV can load any 16-bit number, giving a range of 0x0-0xFFFF (0-65535).
The following table shows the range of 8-bit values that can be loaded in a single ARM MOV or MVN instruction (for data processing operations). The value to load must be a multiple of the value shown in the Step column.
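The rule quoted above (an 8-bit value rotated by an even amount, or the complement of one for MVN) can be checked mechanically. A minimal sketch, with function names of my own choosing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

uint32_t rotate_left(uint32_t value, unsigned amount)
{
    amount &= 31;
    return amount == 0 ? value : (value << amount) | (value >> (32 - amount));
}

/* True if `value` is an 8-bit constant rotated right by an even amount,
   i.e. loadable by a single ARM-state MOV. */
bool is_arm_immediate(uint32_t value)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a rotate-right by `rot` and see whether only 8 bits remain. */
        if (rotate_left(value, rot) <= 0xFFu)
            return true;
    }
    return false;
}

/* MVN loads the bitwise complement of an encodable value. */
bool is_arm_mvn_immediate(uint32_t value)
{
    return is_arm_immediate(~value);
}
```

For example 0xFF000000 and 0x3FC pass, while 0x1FE (which would need an odd rotation) and an arbitrary address such as 0x1A2B3C4D do not, which is why general 32-bit literals need either several instructions or an out-of-line load.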


So it looks to me that the right approach is to add an ARMv6 subclass to CogARMInstruction that uses the 16-bit literal load instructions and use that as our standard 32-bit ARM code generator.  But I'm ignorant as to the processor versions used in the Raspberry Pi.  Are all RPis ARMv6?  What exactly is ARMv6T2?  I'm guessing that T2 refers to Thumb2, is that correct?  And on the specific question can anyone think of a good reason /not/ to use the 16-bit literal load approach?



--
best,
Eliot

Re: ARM Cog progress

timrowledge


On 06-06-2015, at 8:15 AM, Eliot Miranda <[hidden email]> wrote:
>     so yesterday I finally switched on the Raspberry Pi Doug gave me as an xmas present, built the Spur ARM Cog VM and ... we definitely have a working VM.

It’s really nice to get to this. There are still some ‘exciting’ parts to get working though… floating point for example.

>  I was able to update a Spur image from mid February all the way to tip and run tests.  3751 run, 3628 passes, 24 expected failures, 89 failures, 10 errors, 0 unexpected passes

Did this include the FloatMathPluginTests? Because on my Pi2 that segfaults in all versions of the vm - interpreter, stack, cog. Then again my Pi2 is segfaulting on any vm compiled with -O2 right now whereas Eliot’s PiB is just fine with that. Good old GCC strikes again.

> Fun!  So I want to revisit the literal load question.
> In ARMv6T2 and later, MOV can load any 16-bit number, giving a range of 0x0-0xFFFF (0-65535).
> The following table shows the range of 8-bit values that can be loaded in a single ARM MOV or MVN instruction (for data processing operations). The value to load must be a multiple of the value shown in the Step column.
>
Sadly the Pi B/+ are NOT 6T2 cpus. I checked this with Eben a while back. One of the side-effects of the flexibility ARM provides to actual manufacturers is a fairly complex range of possible features within any particular architecture level.

That doesn’t mean we can’t do tricks to make the Pi_2_ use the nice v7 features whilst using out of line data loads on the older machines. In the best case, where the data is already in the cache (we can use PLD to help with that) a LDR takes 2 cycles as opposed to the 4 currently used by our mov/orr^3 unit. Using the v7 MOVW/MOVT is also two instructions but *always* two cycles with no possibility of an out-of-cache delay, so I still think it is probably better.


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: EIV: Erase IPL Volume



Re: ARM Cog progress

Eliot Miranda-2
 


On Sat, Jun 6, 2015 at 9:33 AM, tim Rowledge <[hidden email]> wrote:

> On 06-06-2015, at 8:15 AM, Eliot Miranda <[hidden email]> wrote:
> >     so yesterday I finally switched on the Raspberry Pi Doug gave me as an xmas present, built the Spur ARM Cog VM and ... we definitely have a working VM.
>
> It’s really nice to get to this. There are still some ‘exciting’ parts to get working though… floating point for example.
>
> >  I was able to update a Spur image from mid February all the way to tip and run tests.  3751 run, 3628 passes, 24 expected failures, 89 failures, 10 errors, 0 unexpected passes
>
> Did this include the FloatMathPluginTests? Because on my Pi2 that segfaults in all versions of the vm - interpreter, stack, cog. Then again my Pi2 is segfaulting on any vm compiled with -O2 right now whereas Eliot’s PiB is just fine with that. Good old GCC strikes again.
>
> > Fun!  So I want to revisit the literal load question.
> > In ARMv6T2 and later, MOV can load any 16-bit number, giving a range of 0x0-0xFFFF (0-65535).
> > The following table shows the range of 8-bit values that can be loaded in a single ARM MOV or MVN instruction (for data processing operations). The value to load must be a multiple of the value shown in the Step column.
> >
> Sadly the Pi B/+ are NOT 6T2 cpus. I checked this with Eben a while back. One of the side-effects of the flexibility ARM provides to actual manufacturers is a fairly complex range of possible features within any particular architecture level.

Damn, you're right.  gcc with the -march=armv6t2 option will generate 16-bit literal loads, e.g.

long it() { return 0x1A2B3C4D; }

=>

        .arch armv6t2
        ...
        .text
        .align 2
        .global it
        .type it, %function
it:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        movw r0, #15437         @ low half, 0x3C4D
        movt r0, 6699           @ high half, 0x1A2B
        bx lr

 

but compiling, linking and running does indeed signal Illegal instruction.  That's /my/ weekend ruined ;-)


> That doesn’t mean we can’t do tricks to make the Pi_2_ use the nice v7 features whilst using out of line data loads on the older machines. In the best case, where the data is already in the cache (we can use PLD to help with that) a LDR takes 2 cycles as opposed to the 4 currently used by our mov/orr^3 unit. Using the v7 MOVW/MOVT is also two instructions but *always* two cycles with no possibility of an out-of-cache delay, so I still think it is probably better.

Except that in 64-bits don't we end up with 6 cycles (2 x MOVW/MOVT plus a shift and an add, or maybe 5 cycles if MOVW/MOVT leave other bits undisturbed) vs 2 for the out-of-line literal load?  In which case, the out-of-line is a clear win for 64-bits and that's likely our most important target, given the ubiquity of smart phones.

> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> Strange OpCodes: EIV: Erase IPL Volume




--
best,
Eliot

Re: ARM Cog progress

Eliot Miranda-2
In reply to this post by timrowledge
 


On Sat, Jun 6, 2015 at 9:33 AM, tim Rowledge <[hidden email]> wrote:

> On 06-06-2015, at 8:15 AM, Eliot Miranda <[hidden email]> wrote:
> >     so yesterday I finally switched on the Raspberry Pi Doug gave me as an xmas present, built the Spur ARM Cog VM and ... we definitely have a working VM.
>
> It’s really nice to get to this. There are still some ‘exciting’ parts to get working though… floating point for example.
>
> >  I was able to update a Spur image from mid February all the way to tip and run tests.  3751 run, 3628 passes, 24 expected failures, 89 failures, 10 errors, 0 unexpected passes
>
> Did this include the FloatMathPluginTests? Because on my Pi2 that segfaults in all versions of the vm - interpreter, stack, cog. Then again my Pi2 is segfaulting on any vm compiled with -O2 right now whereas Eliot’s PiB is just fine with that. Good old GCC strikes again.
>
> > Fun!  So I want to revisit the literal load question.
> > In ARMv6T2 and later, MOV can load any 16-bit number, giving a range of 0x0-0xFFFF (0-65535).
> > The following table shows the range of 8-bit values that can be loaded in a single ARM MOV or MVN instruction (for data processing operations). The value to load must be a multiple of the value shown in the Step column.
> >
> Sadly the Pi B/+ are NOT 6T2 cpus. I checked this with Eben a while back. One of the side-effects of the flexibility ARM provides to actual manufacturers is a fairly complex range of possible features within any particular architecture level.
>
> That doesn’t mean we can’t do tricks to make the Pi_2_ use the nice v7 features whilst using out of line data loads on the older machines. In the best case, where the data is already in the cache (we can use PLD to help with that) a LDR takes 2 cycles as opposed to the 4 currently used by our mov/orr^3 unit. Using the v7 MOVW/MOVT is also two instructions but *always* two cycles with no possibility of an out-of-cache delay, so I still think it is probably better.

Ha!  Turns out that at least for sends we're in the clear for out-of-line literal load.  i.e. from https://www.raspberrypi.org/forums/viewtopic.php?f=72&t=78090

Looking at the ARM1176jzf-s TRM, section "Cycle timings and interlock behaviour" we see that:

MOV Rn, x -> 1 cycle
MVN Rn, x -> 1 cycle
LDR Rn, [PC, #constant] -> 1 cycle, with a latency of 3 cycles on Rn 


And the send sequence would look like

    LDR Rclass, [PC, #constant]
    BLX method.entry

with the entry code being

00001828: ands r0, r0, #1
0000182c: b 0x00001844
entry:
00001830: ands r0, r7, #3
00001834: bne 0x00001828
00001838: ldr r0, [r7]
0000183c: mvn ip, #0
00001840: ands r0, r0, ip, lsr #10
00001844: cmp r0, Rclass
00001848: bne 0x00001820
noCheckEntry:

i.e. we don't actually access the register loaded in the LDR for at least 7 cycles.  So it should work a lot better; 11 cycles vs 14 cycles for the send sequence.  In fact the only code that should be impacted by the latency is a conditional branch of a method result (we subtract true or false from the result) or a constant assign.  Most of the time a literal will be passed as an argument and there will be quite a few cycles before it is used.
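The latency argument can be written down as a tiny model. This is my reading of the figures above rather than anything measured: a loaded register stalls its first consumer only if the consumer arrives before the result latency has elapsed, and the 11 versus 14 difference is taken to be four single-cycle MOV/ORR instructions replaced by one single-cycle LDR. The function names are hypothetical.

```c
#include <assert.h>

/* Cycles the first consumer of a loaded register has to wait. */
int load_use_stall(int result_latency, int cycles_until_use)
{
    assert(result_latency >= 0 && cycles_until_use >= 0);
    int stall = result_latency - cycles_until_use;
    return stall > 0 ? stall : 0;
}

/* Send cost after swapping MOV + 3 x ORR (4 cycles) for one LDR (1 cycle),
   with the class register first read about 7 cycles after the load. */
int send_cycles_with_ldr(int send_cycles_with_mov_orr)
{
    return send_cycles_with_mov_orr - 4 + 1 + load_use_stall(3, 7);
}
```

With a 3-cycle latency and 7 cycles before the compare there is no stall, so 14 cycles become 11. A use one cycle after the load would instead wait 2 cycles, which is the conditional-branch and constant-assign case mentioned above.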

OK, so that implies doing the out-of-line literal load, with the advantage that there's a single VM, and the same approach is used for the 64-bit ARM system.



    

> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> Strange OpCodes: EIV: Erase IPL Volume





--
best,
Eliot

Re: ARM Cog progress

timrowledge
In reply to this post by Eliot Miranda-2


On 06-06-2015, at 10:03 AM, Eliot Miranda <[hidden email]> wrote:
>
> Except that in 64-bits don't we end up with 6 cycles (2 x MOVW/MOVT plus a shift and an add, or maybe 5 cycles if MOVW/MOVT leave other bits undisturbed) vs 2 for the out-of-line literal load?  In which case, the out-of-line is a clear win for 64-bits and that's likely our most important target, given the ubiquity of smart phones.
>  

Let’s not forget that the v8 ARMs can (apparently) happily do 32bit data stuff; even the rotates and shifts can behave correctly for 32 bit. So we could use the same 32 bit image format and save some space, which may have some value for small machines like phones.

I can’t believe I’m referring to things with quadcore 64 bit cpus and 1/2/4Gb ram as small machines...

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: CMN: Convert to Mayan Numerals



Re: ARM Cog progress

timrowledge
In reply to this post by Eliot Miranda-2


On 06-06-2015, at 10:18 AM, Eliot Miranda <[hidden email]> wrote:
>
> Ha!  Turns out that at least for sends we're in the clear for out-of-line literal load.  i.e. from https://www.raspberrypi.org/forums/viewtopic.php?f=72&t=78090

Excellent. So the only ‘fun’ is creating, managing and accessing the pools of out of line constants. Where should we place the pool though? My first thought was just in front of the ‘entry’ address but that would screw the assorted entry/nocheck offsets we have as constants. At the end, just before the metadata?
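Wherever the pool ends up, each load has to encode its distance to the pool entry. A sketch of that calculation for ARM state, assuming the usual rule that PC reads as the address of the instruction plus 8 and that the offset is a 12-bit magnitude with a separate add/subtract bit. The function name and the example addresses are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Encode LDR Rt, [PC, #offset] for a load at instr_addr that reads the
   pool entry at literal_addr. */
uint32_t encode_ldr_literal(unsigned rt, uint32_t instr_addr, uint32_t literal_addr)
{
    /* In ARM state the PC value seen by the load is instr_addr + 8. */
    int64_t offset = (int64_t)literal_addr - ((int64_t)instr_addr + 8);

    assert(rt < 16);
    assert(offset >= -4095 && offset <= 4095);  /* 12-bit magnitude */

    if (offset >= 0)
        return 0xE59F0000u | ((uint32_t)rt << 12) | (uint32_t)offset;
    return 0xE51F0000u | ((uint32_t)rt << 12) | (uint32_t)(-offset);
}
```

The 4095-byte reach is the practical constraint on placement: a pool at the end of a method works only while every load that references it stays within that distance.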

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
The Static Typing Philosophy: Make it fast. Make it right. Make it run.



Re: ARM Cog progress

timrowledge

GCC is such fun.

Cog VM built on my Pi2, like all the ones built whilst developing this thing, with -O2.
Segfaults very early on Pi2; trying to run under gdb segfaults at pc=0, which really is clever and remarkably effectively obfuscates all the information you might hope to glean.
But copy that executable to an old Pi B+ and it runs perfectly happily.

Mind you, we currently compile the cogit file with no errors nor even warnings, so perhaps it’s simple revenge?

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
How come it's 'Java One' every year? Aren't they making any progress?

