On Jan 9, 2009, at 3:08 AM, Paolo Bonzini wrote:

>> Is it at all possible to get the x86 FPU to produce the correct result
>> for this situation?
>
> You can also try compiling the Squeak VM with -mfpmath=sse.

That should work, as long as you are running it on x86 processors that support SSE2 with double-precision scalar operations. The SSE floating-point unit(s) perform true single- and double-precision arithmetic.

-- tim
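For anyone building the VM this way, GCC exposes a predefined macro that makes the configuration checkable at compile time. A minimal sketch (the macro is GCC-specific; other compilers would need their own check):

    /* GCC defines __SSE2_MATH__ when double-precision math is actually
       compiled to SSE2 scalar instructions (-msse2 -mfpmath=sse).
       Dropping this into a VM source file makes a misconfigured build
       fail loudly instead of silently double-rounding on the x87. */
    #ifndef __SSE2_MATH__
    # error "doubles are not using SSE2 math; expect x87 double rounding"
    #endif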
In reply to this post by Hans-Martin Mosner
On Jan 9, 2009, at 2:29 AM, Hans-Martin Mosner wrote:

> First of all, does it matter? If I understand correctly, this behavior
> is only present for denormalized numbers.

If you set the internal rounding-precision mode in the x86 control register to double precision, then yes, the multiple-rounding issue goes away for most computations. It remains only when generating denormal results, because the exponent field stays at extended-precision size during the computation, so the denormalized result is generated only during the conversion back to double-precision format, leading to multiple rounding operations.

> Do these appear in real-world cases?

That's hard to say. It might be interesting to instrument the VM to check for denormal operands or results in the float operations, to get a feel for how often (if ever) they occur.

> I've tried to analyze the case in question and came to the following
> results:
> The exact mantissa after multiplication is 3B16EF930A76E.80002C69F96C2
> (the hex digits after the point are those that should be rounded off
> when going to a 52-bit mantissa). The result with "A76F" as the last hex
> digits would therefore be the correct value for an IEEE-754 double
> precision multiplication (rounding to the nearest representable number),
> so the PPC implementation does it right.
> When doing an extended double precision multiplication, there are some
> more bits in the mantissa, and the mantissa of the intermediate result
> looks like ...A76E.800 which is exactly in the middle between two
> representable numbers. Converting to double precision involves rounding
> the mantissa again, and the rounding rule for this case (exact middle)
> says to round to the nearest even number, which is ...A76E.

That's roughly what is going on, but it is complicated here by the way denormals are handled. What actually happens is:

1st operand (binary):
1.0011001000001000101000100101111000000100111010000111
(note the "hidden" 1 bit to the left of the radix point, because it is a normalized value)

2nd operand (binary):
0.0011000101101101110100011101000000101101000110101110
(note no "hidden" 1 bit, because it is a denormalized value)

Before multiplication, the 2nd operand is renormalized, adjusting the exponent accordingly:
1.1000101101101110100011101000000101101000110101110000

The exact product is:
1.1101100010110111011111001001100001010011101101110100,0000000000000001...
The comma (,) marks the double-precision rounding position.

PPC operation:
The product exponent is smaller than can be represented in normalized double-precision format, so the result is first denormalized:
0.0011101100010110111011111001001100001010011101101110,1000000000000000001...
Then round-to-nearest adds an ULP, because the bits to the right of the rounding position are greater than halfway:
0.0011101100010110111011111001001100001010011101101111

x86 operation (extended precision, rounding-precision mode set to double):
Because the exponent field is still sized for 80-bit extended floats, the exact product is still representable as a normalized number:
1.1101100010110111011111001001100001010011101101110100,0000000000000001...
Round-to-nearest then drops the bits to the right of the double-precision rounding point without adding an ULP, because they are less than halfway:
1.1101100010110111011111001001100001010011101101110100
The result is then converted to double-precision representation when storing to memory, which causes it to become denormalized:
0.0011101100010110111011111001001100001010011101101110,100
This time round-to-nearest does not add an ULP: the bits to the right of the rounding position are exactly halfway, and in that case the nearest even result is selected:
0.0011101100010110111011111001001100001010011101101110

> Is it at all possible to get the x86 FPU to produce the correct result
> for this situation?

Not with the extended-precision FPU, short of an extreme performance loss: you would basically have to raise an exception on every inexact result (lots of them!) and check each one in software. If you perform the operations in the SSE unit instead, that works, provided all the x86 platforms you run on support SSE2 with double-precision scalars.

> Interestingly, this article
> (http://www.vinc17.org/research/extended.en.html) claims that the FPU is
> set up for incorrect rounding under Linux only, but I could not
> reproduce the test case given there with Squeak (which probably means
> that I mistranslated the Java example).

That example shows the effect for normalized intermediate results: the intermediate result is rounded to extended precision, then rounded again when converting to double-precision format. That can be fixed by setting the rounding-precision mode to double in the FPU control register, but, as shown in the example above, the fix does not cover denormalized intermediate results. I suspect that your Squeak VM has the rounding-precision mode set to double, which fixes most of the cases.

-- tim
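The whole walkthrough can be reproduced with a few lines of C. A minimal sketch (compiler flags are illustrative: an x87 build such as gcc -m32 -mfpmath=387 should print ...a76e, an SSE2 build with -msse2 -mfpmath=sse should print ...a76f):

    /* Reproduce the thread's test vector from exact bit patterns. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    static double from_bits(uint64_t bits)
    {
        double d;
        memcpy(&d, &bits, sizeof d);   /* bit-exact; no decimal conversion */
        return d;
    }

    int main(void)
    {
        double a = from_bits(0x3ff3208a25e04e87ULL);  /* between 1.0 and 2.0 */
        double b = from_bits(0x000316dd1d02d1aeULL);  /* a denormal value */
        double p = a * b;
        uint64_t bits;
        memcpy(&bits, &p, sizeof bits);
        printf("%016llx\n", (unsigned long long)bits);
        return 0;
    }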
In reply to this post by Hans-Martin Mosner
On Jan 8, 2009, at 9:30 AM, Hans-Martin Mosner wrote:

> AFAIK, the Croquet project needs to have exactly reproducible float
> arithmetics on all platforms.

Are you using the exact same code on all platforms for transcendentals and other floating-point functions? If not, that is going to be the greatest source of different results. They tend to vary widely in accuracy, since there is no 1/2-ULP accuracy requirement for those functions, unlike for the defined IEEE operations (+, -, *, /, sqrt).

-- Tim Olson
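To see this concretely, one can print the raw bits of a libm result and diff them across platforms. A sketch (link with -lm where needed; the argument 1e22 is merely a well-known stress case for trigonometric argument reduction):

    /* Print a transcendental result bit-for-bit so platform libms can
       be compared; +, -, *, / and sqrt would match, sin() may not. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    int main(void)
    {
        double r = sin(1e22);
        uint64_t bits;
        memcpy(&bits, &r, sizeof bits);
        printf("sin(1e22) bits: %016llx\n", (unsigned long long)bits);
        return 0;
    }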
On 09.01.2009, at 16:16, Tim Olson wrote:

> On Jan 8, 2009, at 9:30 AM, Hans-Martin Mosner wrote:
>
>> AFAIK, the Croquet project needs to have exactly reproducible float
>> arithmetics on all platforms.
>
> Are you using the exact same code on all platforms for transcendentals
> and other floating-point functions? If not, that is going to be the
> greatest source of different results. They tend to vary widely in
> accuracy, since there is no 1/2-ULP accuracy requirement for those
> functions, unlike for the defined IEEE operations (+, -, *, /, sqrt).

Yes, Andreas wrote a new plugin to ensure bit-identical floating-point ops across platforms.

- Bert -
In reply to this post by Tim Olson-4
On 09.01.2009, at 05:31, Tim Olson wrote:

> On Jan 8, 2009, at 9:30 AM, Hans-Martin Mosner wrote:
>
>> AFAIK, the Croquet project needs to have exactly reproducible float
>> arithmetics on all platforms. Would these rounding effects on the x86
>> affect that goal (perhaps with denormal intermediate results)?
>
> Yes.
>
>> Maybe it would be nice to write a unit test which would uncover
>> different rounding behaviors between purely-double-precision FP
>> hardware and extended-precision hardware.
>
> I'd have to play around a bit to get this into a Squeak unit test,
> but here's a test vector which shows the effect:
>
> 3ff3208a25e04e87 * 000316dd1d02d1ae
>
> x86 result:     0003b16ef930a76e
> powerPC result: 0003b16ef930a76f
>
> These are IEEE double-precision floating-point numbers, specified in hex
> (big-endian) so that they are bit-exact (no conversion error from a
> decimal representation). The first number is between 1.0 and 2.0, while
> the second number is a denormal value. The product shows an lsb
> difference between an x86 platform (extended-precision FPU) and a PPC
> platform (double-precision FPU), even when the x86 floating-point
> control word is set to use double-precision rounding for all results.

On my Intel Mac with VM 3.8.17b6 I get ...a76f too, so this is not an x86 issue.

  a := Float newFrom: #(16r3FF3208A 16r25E04E87).
  b := Float newFrom: #(16r000316DD 16r1D02D1AE).
  {a hex. b hex. (a*b) hex}
    "=> #('3FF3208A25E04E87' '000316DD1D02D1AE' '0003B16EF930A76F')"

I do get ...a76e under Linux (x86 VMware emulation).

- Bert -
In reply to this post by Bert Freudenberg
Years back we coded up a FloatMathPlugin to do IEEE math, a front end to http://www.netlib.org/fdlibm/, in April of '06. Testing showed we could do IEEE math using the plugin on Linux/Windows/Mac (PowerPC). I believe Qwaq took over building this for Mac/Intel.

Also note that the Unix and Carbon VMs do fiddle with the FP unit control register via the code below (this is from the copy I have; you would need to check the build tree for the Unix version to see if it is different).

#if defined(__GNUC__) && ( defined(i386) || defined(__i386) || defined(__i386__) \
                        || defined(i486) || defined(__i486) || defined (__i486__) \
                        || defined(intel) || defined(x86) || defined(i86pc) )
  static void fldcw(unsigned int cw)
  {
    __asm__("fldcw %0" :: "m"(cw));
  }
#else
# define fldcw(cw)
#endif

#if defined(__GNUC__) && ( defined(ppc) || defined(__ppc) || defined(__ppc__) \
                        || defined(POWERPC) || defined(__POWERPC) || defined (__POWERPC__) )
  void mtfsfi(unsigned long long fpscr);
  void mtfsfi(unsigned long long fpscr)
  {
    __asm__("lfd f0, %0" :: "m"(fpscr));
    __asm__("mtfsf 0xff, f0");
  }
#else
# define mtfsfi(fpscr)
#endif

On 9-Jan-09, at 7:20 AM, Bert Freudenberg wrote:

> On 09.01.2009, at 16:16, Tim Olson wrote:
>
>> On Jan 8, 2009, at 9:30 AM, Hans-Martin Mosner wrote:
>>
>>> AFAIK, the Croquet project needs to have exactly reproducible float
>>> arithmetics on all platforms.
>>
>> Are you using the exact same code on all platforms for transcendentals
>> and other floating-point functions? If not, that is going to be the
>> greatest source of different results. They tend to vary widely in
>> accuracy, since there is no 1/2-ULP accuracy requirement for those
>> functions, unlike for the defined IEEE operations (+, -, *, /, sqrt).
>
> Yes, Andreas wrote a new plugin to ensure bit-identical floating-point
> ops across platforms.
>
> - Bert -

--
===========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================
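For reference, a plausible way to use the x86 helper above to request double-precision rounding (a sketch: 0x027F is the x87 power-up default control word 0x037F with the precision-control field, bits 8-9, changed from extended to double; check your platform's startup code before assuming this value):

    /* Request 53-bit (double) rounding precision, round-to-nearest.
       As discussed earlier in the thread, this does NOT fix the
       denormal double-rounding case. */
    fldcw(0x027F);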
In reply to this post by Tim Olson-4
Tim Olson wrote:
> On Jan 8, 2009, at 9:30 AM, Hans-Martin Mosner wrote:
>
>> AFAIK, the Croquet project needs to have exactly reproducible float
>> arithmetics on all platforms.

Slightly off-topic, but since we have such a knowledgeable crowd...

As noted by John, Croquet uses fdlibm for bit-identical floating-point math. Does anyone have a feeling for how difficult (or impossible) it will be to achieve identical computation on OpenCL-compliant devices? For example, how difficult would it be to port, say, fdlibm, so that transcendentals use the exact same code? Any other show-stoppers that might not occur to the naive mind :-) ?

Thanks,
Josh

> Are you using the exact same code on all platforms for transcendentals
> and other floating-point functions? If not, that is going to be the
> greatest source of different results. They tend to vary widely in
> accuracy, since there is no 1/2-ULP accuracy requirement for those
> functions, unlike for the defined IEEE operations (+, -, *, /, sqrt).
>
> -- Tim Olson
In reply to this post by Bert Freudenberg
>> 3ff3208a25e04e87 * 000316dd1d02d1ae
>>
>> x86 result:     0003b16ef930a76e
>> powerPC result: 0003b16ef930a76f
>
> On my Intel Mac with VM 3.8.17b6 I get ...a76f too, so this is not an
> x86 issue.

Mac OS X uses SSE by default for math. It is an *x87* issue, not x86.

Paolo
In reply to this post by Joshua Gargus-2
On 10.01.2009, at 11:01, Josh Gargus wrote:
> As noted by John, Croquet uses fdlibm for bit-identical floating-point
> math. Does anyone have a feeling for how difficult (or impossible) it
> will be to achieve identical computation on OpenCL-compliant devices?

The numerical behavior of compliant OpenCL implementations is covered in section 7 of the OpenCL spec. In particular, table 7.1 gives the error bounds for the various operations. If I interpret it correctly, very few functions are guaranteed to behave bit-identically.

> For example, how difficult would it be to port, say, fdlibm, so that
> transcendentals use the exact same code? Any other show-stoppers that
> might not occur to the naive mind :-) ?

Well, OpenCL only requires single precision; double-precision support is optional, whereas fdlibm is double-precision only. I don't know how compliant current implementations actually are.

- Bert -
Bert Freudenberg wrote:
> On 10.01.2009, at 11:01, Josh Gargus wrote:
>
>> As noted by John, Croquet uses fdlibm for bit-identical floating-point
>> math. Does anyone have a feeling for how difficult (or impossible) it
>> will be to achieve identical computation on OpenCL-compliant devices?
>
> The numerical behavior of compliant OpenCL implementations is covered
> in section 7 of the OpenCL spec. In particular, table 7.1 gives the
> error bounds for the various operations. If I interpret it correctly,
> very few functions are guaranteed to behave bit-identically.

Oops, I was skimming by the time I read that part of the spec. I saw that the transcendental functions need not return identical results (which was why I mentioned porting fdlibm), but I missed that even x/y isn't precisely specified.

>> For example, how difficult would it be to port, say, fdlibm, so that
>> transcendentals use the exact same code? Any other show-stoppers that
>> might not occur to the naive mind :-) ?
>
> Well, OpenCL only requires single precision; double-precision support
> is optional, whereas fdlibm is double-precision only.

I didn't know that. Maybe that's what the "d" stands for in "fdlibm".

> I don't know how compliant current implementations actually are.

I think there's only one implementation right now, available only to paid-up Apple developers. Anyway, I'm more interested in what the spec says than in current conformance... implementations will gradually become more compliant.

Thanks,
Josh

> - Bert -
Reading the CUDA 2.0 programming guide, I saw an interesting difference between the error bounds for single- and double-precision floating point:

Double-precision:
  Operation   ULPs
  x+y         0 (IEEE-754 round-to-nearest-even)
  x*y         0 (IEEE-754 round-to-nearest-even)
  x/y         0 (IEEE-754 round-to-nearest-even)
  1/x         0 (IEEE-754 round-to-nearest-even)
  sqrt(x)     0 (IEEE-754 round-to-nearest-even)

Single-precision:
  Operation   ULPs
  x+y         0 (IEEE-754 round-to-nearest-even)
  x*y         0 (IEEE-754 round-to-nearest-even)
  x/y         2
  1/x         1
  sqrt(x)     3

So there might be some hope, at least for double-precision ops. Currently, AFAIK, the latest NVIDIA GPUs are the only ones with double-precision FP support. But things are changing rapidly in this area: in about a year, Intel will release Larrabee, which will "fully support IEEE standards for single and double precision floating-point arithmetic". Hopefully this forces the other vendors to follow suit, and future OpenCL revisions reflect this.

Cheers,
Josh

Josh Gargus wrote:
> Bert Freudenberg wrote:
>> On 10.01.2009, at 11:01, Josh Gargus wrote:
>>> As noted by John, Croquet uses fdlibm for bit-identical floating-point
>>> math. Does anyone have a feeling for how difficult (or impossible) it
>>> will be to achieve identical computation on OpenCL-compliant devices?
>>
>> The numerical behavior of compliant OpenCL implementations is covered
>> in section 7 of the OpenCL spec. In particular, table 7.1 gives the
>> error bounds for the various operations. If I interpret it correctly,
>> very few functions are guaranteed to behave bit-identically.
>
> Oops, I was skimming by the time I read that part of the spec. I saw
> that the transcendental functions need not return identical results
> (which was why I mentioned porting fdlibm), but I missed that even x/y
> isn't precisely specified.
>
>>> For example, how difficult would it be to port, say, fdlibm, so that
>>> transcendentals use the exact same code? Any other show-stoppers that
>>> might not occur to the naive mind :-) ?
>>
>> Well, OpenCL only requires single precision; double-precision support
>> is optional, whereas fdlibm is double-precision only.
>
> I didn't know that. Maybe that's what the "d" stands for in "fdlibm".
>
>> I don't know how compliant current implementations actually are.
>
> I think there's only one implementation right now, available only to
> paid-up Apple developers. Anyway, I'm more interested in what the spec
> says than in current conformance... implementations will gradually
> become more compliant.
>
> Thanks,
> Josh
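When checking results against ULP bounds like these, a small helper along the following lines can be handy (a sketch, assuming both values are positive, finite, and non-NaN; the sign-magnitude encoding of doubles needs extra handling for negative or mixed-sign pairs):

    /* Distance in ULPs between two positive finite doubles: for such
       values the IEEE-754 bit patterns are monotonically ordered, so
       the integer difference counts the representable values between
       them. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    static int64_t ulp_diff(double x, double y)
    {
        int64_t ix, iy;
        memcpy(&ix, &x, sizeof ix);
        memcpy(&iy, &y, sizeof iy);
        return llabs(ix - iy);
    }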