[squeak-dev] Re: floats


[squeak-dev] Re: floats

Nicolas Cellier
Hans,
Tagging/untagging could be very fast! See my other post

1) Untagging a double = no-op
2) Tagging a double = an isnan test (so as to have a representable NaN in Smalltalk)
3) This trick does not add any extra cost to tagging/untagging of other oops

What about the cost of allocating doubles?
Of course, you won't reach the speed of optimized compiled code using the FPU extensively.
BUT you remove the main cost of Smalltalk number crunching: pressure on the ObjectMemory garbage collector!
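Nicolas's three points can be sketched in a few lines of C. This is a hypothetical illustration of the NaN-tagging idea only; the canonical-NaN constant and function names are assumptions, not the Squeak VM's actual code:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical NaN-tagging sketch (not actual Squeak VM code).
   Any bit pattern that is a valid, non-NaN double is stored as-is,
   so untagging a double is a no-op.  Other oops are hidden inside
   the quiet-NaN payload space.  Tagging a freshly computed double
   therefore needs only an isnan test: a genuine NaN result must be
   normalized to one canonical NaN so it cannot be confused with an
   oop living in the NaN space. */

static const uint64_t kCanonicalNaN = 0x7FF8000000000000ULL;

static uint64_t tag_double(double d) {
    uint64_t bits;
    if (isnan(d))                 /* the only extra cost: one NaN test */
        return kCanonicalNaN;     /* keep one representable NaN       */
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

static double untag_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);  /* a no-op apart from the move */
    return d;
}
```

The only cost added to float arithmetic is the isnan test when re-tagging a result; untagging really is the identity.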

 
 


 
 

via gmane.comp.lang.smalltalk.squeak.general, from Hans-Martin Mosner, 15/03/09

Jecel Assumpcao Jr wrote:
That does not work since xxx10 is used as a sentinel value in the
garbage collector.
I think a better approach is to handle floats specially in a JIT, and
keep them unboxed for typical sequences of arithmetic manipulation
within methods.
Of course, using immediate floats does avoid the object creation and
destruction overhead, but you still have some overhead for tagging and
untagging, which on modern architectures is still much higher than the
actual floating point operation costs.

Cheers,
Hans-Martin



 
 


 
 



Re: [squeak-dev] Re: floats

Hans-Martin Mosner
nicolas cellier wrote:
> Hans,
> Tagging/untagging could be very fast! See my other post
>
> 1) Untagging a double = no-op
> 2) Tagging a double = an isnan test (so as to have a representable NaN
> in Smalltalk)
> 3) This trick does not add any extra cost to tagging/untagging of
> other oops
That's true for a 64-bit processor, and on such hardware I see the
advantages of this scheme.
For 32-bit hardware, it won't work.
Hopefully we'll all have suitable hardware in the near future...
But for example, I'm running 32-bit linux here on my 64-bit AMD
processor just because the WLAN card I'm using only has a 32-bit Windows
driver, and ndiswrapper on 64-bit linux would require a 64-bit driver to
work correctly (which is somewhat stupid IMHO but I'm not going to hack
ndiswrapper).
In the real world, there are tons of silly constraints like this which
still prevent people from fully using 64-bit hardware.

Cheers,
Hans-Martin


Re: [squeak-dev] Re: floats

Claus Kick
Hans-Martin Mosner wrote:

> nicolas cellier wrote:
>
*snip*

> In the real world, there are tons of silly constraints like this which
> still prevent people from fully using 64-bit hardware.

Silly questions from a lurker:

Roughly how much current hardware is there which is still 32-bit?

More importantly, how much influence does the OS have on this? Can you
make use of 64-bit features running on a 32-bit OS running on 64-bit
hardware architectures?

Sorry for the stupid questions ...

Claus


Re: [squeak-dev] Re: floats

Nicolas Cellier
In reply to this post by Hans-Martin Mosner
2009/3/15 Hans-Martin Mosner <[hidden email]>
nicolas cellier wrote:
*snip*
That's true for a 64-bit processor, and on such hardware I see the
advantages of this scheme.
For 32-bit hardware, it won't work.
Hopefully we'll all have suitable hardware in the near future...
*snip*

Cheers,
Hans-Martin

Of course, most of the nice properties come from the 64-bit addressing...
Hey, wait, I don't even have a 64-bit processor in my house!
For fun, I imagine we could emulate it by spanning each oop over two int32s:
typedef struct {int32 high, low;} oop;
I would expect a slower VM by roughly a factor of 2 - except for double arithmetic...

Nicolas




Re: [squeak-dev] Re: floats

Jecel Assumpcao Jr
In reply to this post by Claus Kick
Thanks to everyone who is contributing to this thread! I should have
been more explicit about my interest in this area: a good floating point
unit is about the same size as a reasonably compact integer core. So for
the same cost I can have twice as many processors if I am willing to
have slow floating point. The worst case would be to have both half as
many processors (with a FPU each) *and* slow floating point anyway due
to Squeak's limitations.

Squeak does have a scheme for good floating point performance: the
FloatArray. In a previous discussion about this with Bryce, he felt that
between this and being able to compile away boxing/unboxing operations
within a single method (also mentioned by Hans-Martin in this thread) we
could have essentially the same performance as immediate floats (and
Hans-Martin pointed out that the bit pattern I suggested is already in
use anyway).

Nicolas evaluated the advantages of the "64 bit everything is a float"
scheme, whose inventor I unfortunately don't remember. One
trick that some old mainframes used was to represent integers as
denormalized floating point numbers, so you would need no checks nor
conversions. The IEEE 754 standard doesn't seem to support this,
however.

As Bert pointed out, lack of floating point hardware was the reason
given for not choosing the ARM for the first OLPC machine. Ivan
mentioned fixed point as an alternative, and this is actually what I
have used in my projects (especially the Forth-based ones) for most of
the past ten years. But for Squeak I would rather just give people what
they are used to (not counting Fractions, LargeIntegers and such, of
course). Juan gave a list of application domains where floats are
considered fundamental.

Hans-Martin and Claus asked about the availability of 64-bit hardware
for the scheme I mentioned. That is indeed a problem (of the 14 or so
computers I have around here, only my old SPARC machine could run a
64-bit Squeak) but it could be solved by doing some
conversions when saving/loading images. We need to do transformations
when moving between 32- and 64-bit images anyway, and unboxing floats would be
one of the simplest.

-- Jecel



Re: [squeak-dev] Re: floats

Eliot Miranda-2
In reply to this post by Nicolas Cellier


On Sun, Mar 15, 2009 at 1:57 PM, Nicolas Cellier <[hidden email]> wrote:
*snip*

Of course, most of the nice properties come from the 64-bit addressing...
Hey, wait, I don't even have a 64-bit processor in my house!
For fun, I imagine we could emulate it by spanning each oop over two int32s:
typedef struct {int32 high, low;} oop;
I would expect a slower VM by roughly a factor of 2 - except for double arithmetic...

In theory, but only for memory-limited symbolic applications.  If you have an application that fits entirely in cache then I would expect parity.  The argument for symbolic applications is that a 64-bit symbolic app has to move twice the data as a 32-bit symbolic app because each symbolic object is twice the size.

Many Smalltalk applications are large and hence more in the memory-limited range, but many Smalltalk objects are byte data and so a) they are not moving twice the data all the time and b) images do not double in size.  There are also opportunities for optimization in a 64-bit implementation.  In particular in 64-bit VW/HPS I was able to store the number of fixed fields in an object in its header instead of only in the class format word.  Hence 64-bit HPS has much faster at:[put:] than 32-bit.

So the experience with my 64-bit VW implementation was that
    - typical large symbolic benchmarks (e.g. all senders) were 15% to 20% slower in 64-bits than in 32-bits.
    - immediate double arithmetic is about 3 times faster at about half the speed of immediate integer operations
    - images are about 55% larger when converted from 32-bit to 64-bit


Eliot











Re: [squeak-dev] Re: floats

Eliot Miranda-2
In reply to this post by Claus Kick


On Sun, Mar 15, 2009 at 2:35 PM, Jecel Assumpcao Jr <[hidden email]> wrote:
*snip*

Hans-Martin and Claus asked about the availability of 64-bit hardware
for the scheme I mentioned. That is indeed a problem (of the 14 or so
computers I have around here, only my old SPARC machine could run a
64-bit Squeak) but it could be solved by doing some conversions when
saving/loading images. We need to do transformations when moving
between 32- and 64-bit images anyway, and unboxing floats would be one
of the simplest.

...and SPARC is one of the worst 64-bit implementations out there.  Question: how much bigger is a 64-bit literal load instruction vs a 32-bit literal load in x86/x86-64 and SPARC32/SPARC64?









Re: [squeak-dev] Re: floats

Nicolas Cellier
In reply to this post by Eliot Miranda-2
Hi Eliot,
AFAIK, VW does not use the NaN trick, so it has to perform extra conversions on SmallDouble, doesn't it?

Nicolas

2009/3/16 Eliot Miranda <[hidden email]>


So the experience with my 64-bit VW implementation was that
    - typical large symbolic benchmarks (e.g. all senders) were 15% to 20% slower in 64-bits than in 32-bits.
    - immediate double arithmetic is about 3 times faster at about half the speed of immediate integer operations
    - images are about 55% larger when converted from 32-bit to 64-bit


Eliot






Re: [squeak-dev] Re: floats

Philippe Marschall
In reply to this post by Eliot Miranda-2
2009/3/16 Eliot Miranda <[hidden email]>:

*snip*

>> Of course, most of the nice properties come from the 64-bit addressing...
>> Hey, wait, I don't even have a 64-bit processor in my house!
>> For fun, I imagine we could emulate it by spanning each oop over two int32s:
>> typedef struct {int32 high, low;} oop;
>> I would expect a slower VM by roughly a factor of 2 - except for double arithmetic...
>
> In theory, but only for memory-limited symbolic applications.  If you have
> an application that fits entirely in cache then I would expect parity.  The
> argument for symbolic applications is that a 64-bit symbolic app has to move
> twice the data as a 32-bit symbolic app because each symbolic object is
> twice the size.

Couldn't you compress the oops? AFAIK HotSpot was the last remaining
JVM that got this.

Cheers
Philippe


Re: [squeak-dev] Re: floats

Eliot Miranda-2
In reply to this post by Nicolas Cellier


On Mon, Mar 16, 2009 at 11:41 AM, Nicolas Cellier <[hidden email]> wrote:
Hi Eliot,
AFAIK, VW does not use the nan trick, so it has to perform extra conversions on SmallDouble, doesn't it?

That's right.  The VW 64-bit immediate double representation is 
msb                                                            lsb
| 8 bit exponent | 52 bit mantissa | sign | 3 bit tag |

i.e. immediate doubles occupy the middle range of the doubles, the range that corresponds to single-precision floats: magnitudes of roughly 10^-38 to 10^38.

Putting the sign bit down low means that +/- 0 are the only immediate double values whose bit patterns are <= 15.

Converting an immediate double to an IEEE double then involves
   - logical shift right 3 bits (sign is now lsb)
   - compare against 1 to distinguish +/- 0 from others
   - if > 1 (not +/- 0)
        - add exponent offset (maps 8-bit exponent to 11-bit exponent)
   - rotate right 1 (move sign to sign bit)
   - move integer reg to float reg

Going in the other direction
    - move fp reg to integer reg
    - rotate left 1 bit (sign is now lsb)
    - compare against 1 to distinguish +/- 0 from others
    - if > 1 (not +/- 0)
          - subtract exponent offset (maps 11-bit exponent to 8-bit exponent)
          - fail if overflow (e.g. jump to code that boxes the float)
    - shift left 3
    - add tags

So more complicated than immediate integers but of a similar complexity to the fp unit's internal operations on floats (extracting exponent, shifting mantissa by exponent).
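For concreteness, the two sequences above translate into C roughly as follows. This is an illustrative sketch, not VW's actual code: the 3-bit tag value and the exponent-offset constant are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the immediate-double conversions described above.
   Layout (msb to lsb): 8-bit exponent | 52-bit mantissa | sign | 3-bit tag.
   The tag pattern and exponent offset are illustrative assumptions. */

#define TAG_BITS   3
#define IMM_TAG    6ULL                /* hypothetical tag pattern */
#define EXP_OFFSET (896ULL << 53)      /* (1023 - 127) << 53: widens the
                                          8-bit exponent to 11 bits */

static uint64_t rotr1(uint64_t x) { return (x >> 1) | (x << 63); }
static uint64_t rotl1(uint64_t x) { return (x << 1) | (x >> 63); }

/* immediate oop -> IEEE double */
static double imm_to_double(uint64_t oop) {
    uint64_t v = oop >> TAG_BITS;      /* drop the tag; sign is now lsb */
    if (v > 1)                         /* not +0.0 / -0.0 */
        v += EXP_OFFSET;               /* 8-bit -> 11-bit exponent */
    v = rotr1(v);                      /* move the sign to the sign bit */
    double d;
    memcpy(&d, &v, sizeof d);          /* integer reg -> float reg */
    return d;
}

/* IEEE double -> immediate oop; returns false when the value must be boxed */
static bool double_to_imm(double d, uint64_t *oop) {
    uint64_t v;
    memcpy(&v, &d, sizeof v);          /* float reg -> integer reg */
    v = rotl1(v);                      /* sign is now lsb */
    if (v > 1) {                       /* not +0.0 / -0.0 */
        if (v < EXP_OFFSET + 2)        /* exponent underflows 8 bits (the +2
                                          avoids colliding with +/-0) */
            return false;
        v -= EXP_OFFSET;               /* 11-bit -> 8-bit exponent */
        if (v >> (64 - TAG_BITS))      /* exponent overflows 8 bits: box it
                                          (this also rejects NaN/Inf) */
            return false;
    }
    *oop = (v << TAG_BITS) | IMM_TAG;
    return true;
}
```

Doubles whose exponent falls outside the 8-bit immediate range, including NaNs and infinities, take the failure path and get boxed.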













Re: [squeak-dev] Re: floats

Eliot Miranda-2
In reply to this post by Philippe Marschall


On Mon, Mar 16, 2009 at 2:15 PM, Philippe Marschall <[hidden email]> wrote:
*snip*

Couldn't you compress the oops? AFAIK HotSpot was the last remaining
JVM that got this.

I don't see the point.  Memory is cheap, getting cheaper.  64-bits means extremely cheap address space.  Why slow down the critical path to save space? 







Re: [squeak-dev] Re: floats

Philippe Marschall
2009/3/16 Eliot Miranda <[hidden email]>:

*snip*
>> Couldn't you compress the oops? AFAIK HotSpot was the last remaining
>> JVM that got this.
>
> I don't see the point.  Memory is cheap, getting cheaper.

But memory access isn't.

> 64-bits means
> extremely cheap address space.  Why slow down the critical path to save
> space?

Because it's faster (you have to move around less data) and it gets
you closer to 32-bit speed.

http://wikis.sun.com/display/HotSpotInternals/CompressedOops
http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/#comments
http://www.lowtek.ca/roo/2008/java-performance-in-64bit-land/
http://www.devwebsphere.com/devwebsphere/2008/10/websphere-nd-70.html
http://webspherecommunity.blogspot.com/2008/10/64-bit-performance-thoughputmemory.html

Cheers
Philippe


Re: [squeak-dev] Re: floats

Claus Kick
In reply to this post by Eliot Miranda-2
Eliot Miranda wrote:
*snip*
>
> ...and SPARC is one of the worst 64-bit implementations out there.
>  Question, how much bigger is a 64-bit literal load instruction vs a 32-bit
> literal load in x86/x86-64 and SPARC32/SPARC64?

Interesting though off-topic tidbit, hence an OT question: is that
aimed at SPARC as an architecture (for lack of a better word), or do
you have a specific implementation in mind? (*curious*)


Re: [squeak-dev] Re: floats

Eliot Miranda-2


On Tue, Mar 17, 2009 at 2:45 PM, Claus Kick <[hidden email]> wrote:
*snip*

Interesting though off topic tidbit, therefore an OT question: is that aimed at SPARC as a (for lack of better word) architecture or do you have a specific implementation in mind? (*curious*)

I know nothing about SPARC internals and so cannot suggest an implementation.

Part of my complaint is the name: Scalable Processor ARChitecture.  The current SPARC requires 6 (read it and weep, _6_) 32-bit instructions to synthesize an arbitrary 64-bit literal.  It hasn't scaled to 64 bits; consequently there are a range of addressing models in 64-bit SPARC compilers: 20-something bits, 40-something bits (I forget the details), and 64 bits.  By contrast, there are 10-byte instructions that do 64-bit literal loads in x86-64.  So a 200% overhead vs a 25% overhead.

One can try and use the branch-and-link instruction to jump over the literal, grab the pc and indirect through it, but IIRC that's a slow 5-word sequence that can't be used in leaf routines.  But this is off the top of my head, so don't quote me.

I would have thought that somehow one could define a three-word instruction saying "load the next two words into a register and skip them"; or, if the anachronism of the delay slot must still be respected, a 4-word instruction saying "load the two words after the following instruction into a register and skip them, executing the instruction in the delay slot".





Re: [squeak-dev] Re: floats

Andreas Wacknitz

On 18.03.2009 at 04:47, Eliot Miranda wrote:



*snip*

> Part of my complaint is the name: Scalable Processor ARChitecture.  The current SPARC requires 6 (read it and weep, _6_) 32-bit instructions to synthesize an arbitrary 64-bit literal.  It hasn't scaled to 64 bits;

As far as I know, Scalable Processor ARChitecture was meant to make it possible to add processors. Sun has built servers with up to 112 processors with nearly linear performance gain.
So in this regard, SPARC is scalable.

> consequently there are a range of addressing models in 64-bit SPARC compilers: 20-something bits, 40-something bits (I forget the details), and 64 bits.  By contrast, there are 10-byte instructions that do 64-bit literal loads in x86-64.  So a 200% overhead vs a 25% overhead.

I assume that you have met the RISC vs. CISC paradigms here. It's the heart of the RISC idea to have just a few, but fast, addressing modes and instructions.
SPARC was faster than x86 until around the introduction of the UltraSPARC III and the Pentium 4. Since then, Intel and AMD have surpassed SPARC with their x86 ISA.

> One can try and use the branch-and-link instruction to jump over the literal, grab the pc and indirect through it, but IIRC that's a slow 5-word sequence that can't be used in leaf routines.  But this is off the top of my head, so don't quote me.
>
> I would have thought that somehow one could define a three-word instruction saying "load the next two words into a register and skip them"; or, if the anachronism of the delay slot must still be respected, a 4-word instruction saying "load the two words after the following instruction into a register and skip them, executing the instruction in the delay slot".



Regards
Andreas



[squeak-dev] Re: floats

Iain Bason
In reply to this post by Eliot Miranda-2

On Mar 17, 2009, at 11:47 PM, Eliot Miranda wrote:

> Part of my complaint is the name, Scaleable Processor ARCitecture.  
> The current SPARC requires 6 (reads it and weep, _6_) 32-bit  
> instructions to synthesize an arbitrary 64-bit literal.  It hasn't  
> scaled to 64-bits; consequently there are a range of addressing  
> models in 64-bit SPARC compilers, 20-something-bits 40-something  
> bits (I forget the details) and 64-bits.  By contrast there are 10-
> byte instructions that do 64-bit literals loads in x86-64.  So a  
> 200% overhead vs a 25% overhead.
>

It doesn't seem to matter, though, for C/C++/Fortran programs.  In  
those benchmarks where SPARC is slower in 64-bit mode than 32-bit  
mode, the slowdown is due to the benchmark's data structures being  
larger because of 64-bit pointers.  Loading a 64-bit literal is ugly,
but so what?

Is there some reason why Smalltalk would need to do more loads of
64-bit literals than C/C++/Fortran?

Iain



Re: [squeak-dev] Re: floats

Eliot Miranda-2


On Thu, Mar 19, 2009 at 6:28 AM, Iain Bason <[hidden email]> wrote:

*snip*


It doesn't seem to matter, though, for C/C++/Fortran programs.  In those benchmarks where SPARC is slower in 64-bit mode than 32-bit mode, the slowdown is due to the benchmark's data structures being larger because of 64-bit pointers.  Loading a 64-bit literal is ugly, but so what?

A JIT has to update instructions.  A JIT that embeds literals in instructions will have to update instructions on garbage collection, or throw away code containing them, or use an indirection.  The SPARC makes the update-instructions approach painfully complex and slow.
 
Is there some reason why Smalltalk would need to do more loads of 64 bit literals than C/C++/Fortran?

Yes.  Object references in code.  None of C, C++ or Fortran have implementations that use moveable literals.  Many Smalltalk implementations do.

 








Re: [squeak-dev] Re: floats

Eliot Miranda-2
In reply to this post by Philippe Marschall
Hi Philippe,

On Mon, Mar 16, 2009 at 10:52 PM, Philippe Marschall <[hidden email]> wrote:
*snip*
> 64-bits means
> extremely cheap address space.  Why slow down the critical path to save
> space?

Because it's faster (you have to move around less data) and it gets
you closer to 32-bit speed.

http://wikis.sun.com/display/HotSpotInternals/CompressedOops
http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/#comments
http://www.lowtek.ca/roo/2008/java-performance-in-64bit-land/
http://www.devwebsphere.com/devwebsphere/2008/10/websphere-nd-70.html
http://webspherecommunity.blogspot.com/2008/10/64-bit-performance-thoughputmemory.html

OK, and this is a reasonable stop-gap until machines catch up with the potential of the 64-bit address space.  It reminds me of segmented approaches to 16-bit limits on PDP-11s, 8086s et al.  Basically these guys are scaling 32-bit oops by 8, allowing a maximum heap size of 32Gb and 4G small objects.  There are other approaches like using an indirection table for intra-segment object references and using 32-bit oops within a segment, which would fit well with a Train algorithm.
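The arithmetic behind that 32 GB figure is simple: compressed oops are 32-bit, object-aligned offsets from a heap base. A minimal C sketch, with an assumed base address and 8-byte alignment (illustrative only, not HotSpot's code):

```c
#include <assert.h>
#include <stdint.h>

/* Compressed-oops sketch.  A 64-bit heap address is stored as a 32-bit
   offset from the heap base, scaled by the 8-byte object alignment:
   2^32 slots * 8 bytes of granularity = a 32 GB addressable heap.
   HEAP_BASE is a hypothetical value for illustration. */

#define HEAP_BASE   0x100000000ULL
#define ALIGN_SHIFT 3                 /* objects are 8-byte aligned */

static uint32_t compress(uint64_t addr) {
    return (uint32_t)((addr - HEAP_BASE) >> ALIGN_SHIFT);
}

static uint64_t decompress(uint32_t noop) {   /* "narrow oop" */
    return HEAP_BASE + ((uint64_t)noop << ALIGN_SHIFT);
}
```

Every load of an object reference pays the shift-and-add on the critical path, which is exactly the trade-off at issue here.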

My gut feeling is that these stop-gaps are a temporary thing.  After all, if speed were so compelling we'd see lots of small 16-bit apps in places like Windows, where there used to be good support for 16-bit code until quite recently.  But in fact 16-bit apps have died the death and we favour the regularity of 32-bit code.  Somewhat analogously, Smalltalk trades performance for regularity.  So I don't find these approaches particularly compelling.  In any case they require engineering teams that can afford to support multiple memory models in the VM, something I'm not going to assume in Cog :)

Thanks for the links.

Best
Eliot

