Hans,
Tagging/untagging could be very fast! See my other post:

1) Untagging a double = no-op
2) Tagging a double = an isnan test (so as to have a representable NaN in Smalltalk)
3) This trick does not add any extra cost to tagging/untagging of other oops

What about the cost of allocating doubles? Of course, you won't reach the speed of optimized compiled code using the FPU extensively. BUT you remove the main cost of Smalltalk number crunching: pressure on the ObjectMemory garbage collector!

Sent by nicolas cellier via Google Reader, via gmane.comp.lang.smalltalk.squeak.general, from Hans-Martin Mosner, 15/03/09:

> Jecel Assumpcao Jr schrieb:
> *snip*

That does not work since xxx10 is used as a sentinel value in the garbage collector. I think a better approach is to handle floats specially in a JIT, and keep them unboxed for typical sequences of arithmetic manipulation within methods. Of course, using immediate floats does avoid the object creation and destruction overhead, but you still have some overhead for tagging and untagging, which on modern architectures is still much higher than the actual floating point operation costs.

Cheers,
Hans-Martin
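Nicolas's three points can be sketched in C. This is an illustrative sketch, not Squeak VM code: the `oop` typedef, function names, and the choice to canonicalize NaNs to the single C `NAN` are all assumptions.

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

typedef uint64_t oop;  /* hypothetical: one oop is 64 bits wide */

/* 1) Untagging a double is a no-op: the oop bits already ARE the
      IEEE-754 double, so we just reinterpret them. */
static double untag_double(oop o) {
    double d;
    memcpy(&d, &o, sizeof d);
    return d;
}

/* 2) Tagging needs only an isnan test: all NaN bit patterns except one
      canonical NaN are reserved for non-double oops, so an incoming NaN
      must be collapsed to the single NaN representable in Smalltalk. */
static oop tag_double(double d) {
    if (isnan(d))
        d = (double)NAN;  /* canonicalize (assumed policy) */
    oop o;
    memcpy(&o, &d, sizeof o);
    return o;
}
```

Point 3 follows because non-double oops simply keep their usual tag bits inside the reserved NaN payload space, so their tag/untag paths are unchanged.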
nicolas cellier schrieb:
> Hans,
> Tagging/untagging could be very fast! See my other post
>
> 1) UnTagging a double= No op
> 2) Tagging a double= a isnan test (so as to have a representable nan
> in Smalltalk)
> 3) This trick does not had any extra cost to tagging/untagging of
> other oops

That's true for a 64-bit processor, and on such hardware I see the advantages of this scheme. For 32-bit hardware, it won't work. Hopefully we'll all have suitable hardware in the near future... But for example, I'm running 32-bit Linux here on my 64-bit AMD processor just because the WLAN card I'm using only has a 32-bit Windows driver, and ndiswrapper on 64-bit Linux would require a 64-bit driver to work correctly (which is somewhat stupid IMHO, but I'm not going to hack ndiswrapper). In the real world, there are tons of silly constraints like this which still prevent people from fully using 64-bit hardware.

Cheers,
Hans-Martin
Hans-Martin Mosner wrote:
> nicolas cellier schrieb:
> *snip*
> In the real world, there are tons of silly constraints like this which
> still prevent people from fully using 64-bit hardware.

Silly questions from a lurker: roughly how much current hardware is still 32-bit? More importantly, how much influence does the OS have on this? Can you make use of 64-bit features running on a 32-bit OS on 64-bit hardware architectures?

Sorry for the stupid questions ...

Claus
In reply to this post by Hans-Martin Mosner
2009/3/15 Hans-Martin Mosner <[hidden email]>
> nicolas cellier schrieb:
> *snip*

Of course, most of the nice properties come from the 64-bit addressing... Hey, wait, I don't even have a 64-bit processor in my house! For fun, I imagine we could emulate it by spanning each oop over two int32s:

    typedef struct { int32 high, low; } oop;

I would expect a slower VM by roughly a factor of 2 - except for double arithmetic...

Nicolas
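Nicolas's factor-of-2 estimate can be illustrated by fleshing out his struct (helper names are invented for illustration): every primitive oop operation on the emulated 64-bit oop touches two 32-bit words, so a single native compare becomes two.

```c
#include <stdint.h>

/* Emulated 64-bit oop on 32-bit hardware, per Nicolas's sketch:
   each oop spans two 32-bit words. */
typedef int32_t int32;
typedef struct { int32 high, low; } oop;

/* One native 32-bit identity compare becomes two compares. */
static int oop_equal(oop a, oop b) {
    return a.high == b.high && a.low == b.low;
}

/* Hypothetical constructor, for convenience only. */
static oop oop_make(int32 high, int32 low) {
    oop o = { high, low };
    return o;
}
```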
In reply to this post by Claus Kick
Thanks to everyone who is contributing to this thread! I should have
been more explicit about my interest in this area: a good floating point unit is about the same size as a reasonably compact integer core. So for the same cost I can have twice as many processors if I am willing to have slow floating point. The worst case would be to have both half as many processors (with an FPU each) *and* slow floating point anyway due to Squeak's limitations.

Squeak does have a scheme for good floating point performance: the FloatArray. In a previous discussion about this with Bryce, he felt that between this and being able to compile away boxing/unboxing operations within a single method (also mentioned by Hans-Martin in this thread) we could have essentially the same performance as immediate floats (and Hans-Martin pointed out that the bit pattern I suggested is already in use anyway).

Nicolas evaluated the advantages of the "64 bit everything is a float" scheme, though I unfortunately don't remember who invented it. One trick that some old mainframes used was to represent integers as denormalized floating point numbers, so you would need no checks nor conversions. The IEEE 754 standard doesn't seem to support this, however.

As Bert pointed out, lack of floating point hardware was the reason given for not choosing the ARM for the first OLPC machine. Ivan mentioned fixed point as an alternative, and this is actually what I have used in my projects (especially the Forth based ones) for most of the past ten years. But for Squeak I would rather just give people what they are used to (not counting Fractions, LargeIntegers and such, of course). Juan gave a list of application domains where floats are considered fundamental.

Hans-Martin and Claus asked about the availability of 64-bit hardware for the scheme I mentioned. That is indeed a problem (of the 14 or so computers I have around here, only my old Sparc machine would be able to run a 64-bit Squeak, for example) but it could be solved by doing some conversions when saving/loading images. We need to do transformations when moving between 32- and 64-bit images anyway, and unboxing floats would be one of the simplest.

-- Jecel
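The FloatArray advantage Jecel mentions can be illustrated in C (illustrative only, not Squeak's actual object layout): boxed floats cost one heap object and one pointer indirection per element, while a FloatArray-style representation stores raw values contiguously, with no per-element allocation.

```c
#include <stddef.h>

/* A boxed float: one heap object per value (GC pressure per element). */
typedef struct { double value; } BoxedFloat;

/* Summing boxed floats: a pointer chase per element. */
static double sum_boxed(BoxedFloat **a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i]->value;
    return s;
}

/* Summing a FloatArray-style contiguous buffer: no boxes, no chasing. */
static double sum_float_array(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```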
In reply to this post by Nicolas Cellier
On Sun, Mar 15, 2009 at 1:57 PM, Nicolas Cellier <[hidden email]> wrote:
In theory, but only for memory-limited symbolic applications. If you have an application that fits entirely in cache then I would expect parity. The argument for symbolic applications is that a 64-bit symbolic app has to move twice the data as a 32-bit symbolic app because each symbolic object is twice the size.
Many Smalltalk applications are large and hence more in the memory-limited range, but many Smalltalk objects are byte data and so a) they are not moving twice the data all the time and b) images do not double in size. There are also opportunities for optimization in a 64-bit implementation. In particular in 64-bit VW/HPS I was able to store the number of fixed fields in an object in its header instead of only in the class format word. Hence 64-bit HPS has much faster at:[put:] than 32-bit.
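Eliot's header optimization can be sketched with a hypothetical 64-bit header layout (all field names and widths here are invented, not HPS's actual format): with the fixed-field count in the header itself, an at:put: bounds check reads one word instead of fetching the class's format word.

```c
#include <stdint.h>

/* Hypothetical 64-bit object header: widths are illustrative only. */
typedef struct {
    uint64_t hash         : 22;
    uint64_t format       : 5;
    uint64_t fixed_fields : 16;  /* named instance variables */
    uint64_t size_words   : 21;  /* fixed fields + indexable slots */
} ObjectHeader;

/* at:/at:put: succeed only within the indexable part, which starts
   after the fixed fields; one header read suffices. */
static int in_indexable_bounds(ObjectHeader h, uint64_t index /* 1-based */) {
    return index >= 1 && index <= h.size_words - h.fixed_fields;
}
```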
So the experience with my 64-bit VW implementation was that:
- typical large symbolic benchmarks (e.g. all senders) were 15% to 20% slower in 64 bits than in 32 bits
- immediate double arithmetic is about 3 times faster, at about half the speed of immediate integer operations
- images are about 55% larger when converted from 32-bit to 64-bit

Eliot
In reply to this post by Claus Kick
On Sun, Mar 15, 2009 at 2:35 PM, Jecel Assumpcao Jr <[hidden email]> wrote:
> Thanks to everyone who is contributing to this thread! I should have
> *snip*

...and SPARC is one of the worst 64-bit implementations out there. Question: how much bigger is a 64-bit literal load instruction vs a 32-bit literal load on x86/x86-64 and SPARC32/SPARC64?
In reply to this post by Eliot Miranda-2
Hi Eliot,
2009/3/16 Eliot Miranda <[hidden email]>
> *snip*

AFAIK, VW does not use the NaN trick, so it has to perform extra conversions on SmallDouble, doesn't it?

Nicolas
In reply to this post by Eliot Miranda-2
2009/3/16 Eliot Miranda <[hidden email]>:
> *snip*
> In theory, but only for memory-limited symbolic applications. If you have
> an application that fits entirely in cache then I would expect parity. The
> argument for symbolic applications is that a 64-bit symbolic app has to
> move twice the data as a 32-bit symbolic app because each symbolic object
> is twice the size.

Couldn't you compress the oops? AFAIK HotSpot was the last remaining JVM that got this.

Cheers
Philippe
In reply to this post by Nicolas Cellier
On Mon, Mar 16, 2009 at 11:41 AM, Nicolas Cellier <[hidden email]> wrote:
> Hi Eliot,
> *snip*

That's right. The VW 64-bit immediate double representation is

    msb                                               lsb
    | 8-bit exponent | 52-bit mantissa | sign | 3-bit tag |

i.e. immediate doubles occupy the middle range of the doubles, the range that corresponds to single-precision floats, ~10^±38. Putting the sign bit down low means that +/- 0 are the only immediate double values whose bit patterns are <= 15.

Converting an immediate double to an IEEE double then involves:
- logical shift right 3 bits (sign is now lsb)
- compare against 1 to distinguish +/- 0 from others
- if > 1 (not +/- 0)
    - add exponent offset (maps 8-bit exponent to 11-bit exponent)
- rotate right 1 (move sign to sign bit)
- move integer reg to float reg

Going in the other direction:
- move fp reg to integer reg
- rotate left 1 bit (sign is now lsb)
- compare against 1 to distinguish +/- 0 from others
- if > 1 (not +/- 0)
    - subtract exponent offset (maps 11-bit exponent to 8-bit exponent)
    - fail if overflow (e.g. jump to code that boxes the float)
- shift left 3
- add tags

So more complicated than immediate integers, but of a similar complexity to the fp unit's internal operations on floats (extracting the exponent, shifting the mantissa by the exponent).
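The steps above can be sketched in C. This is a reconstruction from Eliot's description, not VW source: the concrete 3-bit tag value is an assumption, and the exponent offset 1023 - 127 = 896 follows from mapping the 8-bit bias to the 11-bit IEEE bias.

```c
#include <stdint.h>
#include <string.h>

#define TAG        ((uint64_t)6)               /* hypothetical 3-bit tag */
#define EXP_OFFSET ((uint64_t)(1023 - 127) << 53)  /* at the exponent lsb */

static uint64_t ror1(uint64_t x) { return (x >> 1) | (x << 63); }
static uint64_t rol1(uint64_t x) { return (x << 1) | (x >> 63); }

/* immediate -> IEEE double */
static double immediate_to_double(uint64_t imm) {
    uint64_t u = imm >> 3;      /* drop tag; sign is now lsb */
    if (u > 1)                  /* not +/- 0 */
        u += EXP_OFFSET;        /* widen 8-bit exponent to 11-bit */
    u = ror1(u);                /* move sign up to the sign bit */
    double d;
    memcpy(&d, &u, sizeof d);
    return d;
}

/* IEEE double -> immediate; returns 0 on success, -1 if the value's
   exponent is outside the 8-bit range and it must be boxed instead. */
static int double_to_immediate(double d, uint64_t *imm) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    u = rol1(u);                /* sign is now lsb */
    if (u > 1) {                /* not +/- 0 */
        uint64_t e = u >> 53;   /* 11-bit biased exponent */
        if (e < 896 || e > 896 + 255)
            return -1;          /* overflow: box the float */
        u -= EXP_OFFSET;        /* narrow to 8-bit exponent */
    }
    *imm = (u << 3) | TAG;      /* add tags */
    return 0;
}
```

Note how +/- 0 fall out naturally: +0 tags to just the tag bits and -0 to 8 plus the tag bits, the only two immediate doubles <= 15, exactly as described above.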
In reply to this post by Philippe Marschall
On Mon, Mar 16, 2009 at 2:15 PM, Philippe Marschall <[hidden email]> wrote:
> *snip*
> Couldn't you compress the oops? AFAIK HotSpot was the last remaining
> JVM that got this.

I don't see the point. Memory is cheap, getting cheaper. 64 bits means extremely cheap address space. Why slow down the critical path to save space?
2009/3/16 Eliot Miranda <[hidden email]>:
> *snip*
>> Couldn't you compress the oops? AFAIK HotSpot was the last remaining
>> JVM that got this.
>
> I don't see the point. Memory is cheap, getting cheaper.

But memory access isn't.

> 64-bits means extremely cheap address space. Why slow down the critical
> path to save space?

Because it's faster (you have to move less data around) and gets you closer to 32-bit speed.

http://wikis.sun.com/display/HotSpotInternals/CompressedOops
http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/#comments
http://www.lowtek.ca/roo/2008/java-performance-in-64bit-land/
http://www.devwebsphere.com/devwebsphere/2008/10/websphere-nd-70.html
http://webspherecommunity.blogspot.com/2008/10/64-bit-performance-thoughputmemory.html

Cheers
Philippe
In reply to this post by Eliot Miranda-2
Eliot Miranda wrote:
*snip*
> ...and SPARC is one of the worst 64-bit implementations out there.
> Question, how much bigger is a 64-bit literal load instruction vs a 32-bit
> literal load in x86/x86-64 and SPARC32/SPARC64?

Interesting though off-topic tidbit, therefore an OT question: is that aimed at SPARC as a (for lack of a better word) architecture, or do you have a specific implementation in mind? (*curious*)
On Tue, Mar 17, 2009 at 2:45 PM, Claus Kick <[hidden email]> wrote:
> Eliot Miranda wrote:
> *snip*
I know nothing about SPARC internals and so cannot suggest an implementation. Part of my complaint is the name, Scalable Processor ARChitecture. The current SPARC requires 6 (read it and weep, _6_) 32-bit instructions to synthesize an arbitrary 64-bit literal. It hasn't scaled to 64 bits; consequently there are a range of addressing models in 64-bit SPARC compilers: 20-something bits, 40-something bits (I forget the details), and 64 bits. By contrast, there are 10-byte instructions that do 64-bit literal loads in x86-64. So a 200% overhead vs a 25% overhead.
One can try and use the branch and link instruction to jump over the literal, grab the pc and indirect through it, but IIRC that's a slow 5 word sequence that can't be used in leaf routines. But this is off the top of my head so don't quote me.
I would have thought that somehow one could define a three-word instruction saying "load the next two words into a register and skip them"; or, if the anachronism of the delay slot must still be respected, a four-word instruction saying "load the two words after the following instruction into a register and skip them, executing the instruction in the delay slot".
On 18.03.2009 at 04:47, Eliot Miranda wrote:
> *snip*

As far as I know, Scalable Processor ARChitecture was meant to make it possible to add processors. Sun has built servers with up to 112 processors with nearly linear performance gain, so in this regard SPARC is scalable. I assume that what you have run into here is the RISC vs. CISC paradigms. It's at the heart of the RISC idea to have just a few, but fast, addressing modes and instructions. SPARC was faster than x86 until around the introduction of the UltraSPARC-III / the introduction of the Pentium 4. Since then Intel and AMD have surpassed SPARC with their x86 ISA.
Andreas |
In reply to this post by Eliot Miranda-2
On Mar 17, 2009, at 11:47 PM, Eliot Miranda wrote:

> Part of my complaint is the name, Scaleable Processor ARCitecture.
> The current SPARC requires 6 (reads it and weep, _6_) 32-bit
> instructions to synthesize an arbitrary 64-bit literal. It hasn't
> scaled to 64-bits; consequently there are a range of addressing
> models in 64-bit SPARC compilers, 20-something-bits 40-something
> bits (I forget the details) and 64-bits. By contrast there are
> 10-byte instructions that do 64-bit literals loads in x86-64. So a
> 200% overhead vs a 25% overhead.

It doesn't seem to matter, though, for C/C++/Fortran programs. In those benchmarks where SPARC is slower in 64-bit mode than 32-bit mode, the slowdown is due to the benchmark's data structures being larger because of 64-bit pointers. Loading a 64-bit literal is ugly, but so what? Is there some reason why Smalltalk would need to do more loads of 64-bit literals than C/C++/Fortran?

Iain
On Thu, Mar 19, 2009 at 6:28 AM, Iain Bason <[hidden email]> wrote:
A JIT has to update instructions. A JIT that embeds literals in instructions will have to update instructions on garbage collection or throw away code containing them or use an indirection. The SPARC makes the update instructions approach painfully complex and slow.
> Is there some reason why Smalltalk would need to do more loads of 64-bit
> literals than C/C++/Fortran?

Yes: object references in code. None of C, C++ or Fortran have implementations that use moveable literals. Many Smalltalk implementations do.
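The indirection alternative Eliot mentions can be sketched as follows. This is a toy illustration (names invented), not any real VM's scheme: jitted code loads object references through a literal table, so when the GC moves an object it patches one table slot instead of rewriting instructions.

```c
#include <stdint.h>

typedef uintptr_t oop;

/* A method's literal frame: one memory word per literal. Generated
   code loads literal_table[i] instead of embedding the oop as an
   instruction-stream constant the GC would have to rewrite. */
static oop literal_table[1];

/* Stand-in for a jitted method body that references literal 0. */
static oop jitted_load_literal0(void) {
    return literal_table[0];
}

/* When the GC moves the referenced object, only the table is patched;
   the machine code is untouched. */
static void gc_relocate_literal0(oop new_location) {
    literal_table[0] = new_location;
}
```

The trade-off is one extra memory load per literal reference, against never having to patch (or flush) instruction memory when objects move.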
In reply to this post by Philippe Marschall
Hi Philippe,
On Mon, Mar 16, 2009 at 10:52 PM, Philippe Marschall <[hidden email]> wrote:
OK, and this is a reasonable stop-gap until machines catch up with the potential of the 64-bit address space. It reminds me of segmented approaches to 16-bit limits on PDP-11s, 8086s et al. Basically these guys are scaling 32-bit oops by 8, allowing a maximum heap size of 32 GB and 4G small objects. There are other approaches, like using an indirection table for inter-segment object references and 32-bit oops within a segment, which would fit well with a Train algorithm.
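The scaling scheme Eliot describes is a two-line transformation (a sketch; `heap_base` and the function names are illustrative, but the arithmetic matches the scale-32-bit-oops-by-8 description): since objects are 8-byte aligned, the low 3 address bits are free, so a 32-bit oop can address 2^32 * 8 = 32 GB of heap.

```c
#include <stdint.h>

/* Assumed: the heap is one contiguous mapping starting at heap_base. */
static uint64_t heap_base;

/* Compressed oop -> 64-bit address: scale by 8, add the base. */
static uint64_t decode_oop(uint32_t coop) {
    return heap_base + ((uint64_t)coop << 3);
}

/* 64-bit address -> compressed oop: subtract the base, unscale. */
static uint32_t encode_oop(uint64_t addr) {
    return (uint32_t)((addr - heap_base) >> 3);
}
```

The decode shift-and-add is the extra work on the critical path that Eliot objects to, in exchange for halving the size of every object reference.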
My gut feeling is that these stop-gaps are a temporary thing. After all, if speed were so compelling we'd see lots of small 16-bit apps in places like Windows, where there used to be good support for 16-bit code until quite recently. But in fact 16-bit apps have died the death, and we favour the regularity of 32-bit code. Somewhat analogously, Smalltalk trades performance for regularity. So I don't find these approaches particularly compelling. In any case they require engineering teams that can afford to support multiple memory models in the VM, something I'm not going to assume in Cog :)
Thanks for the links. Best Eliot