BUG FFI/unix vm ? Hackers help requested !

Nicolas Cellier-3

BUG FFI/unix vm ? Hackers help requested !

Hello gods of FFI and of the VM,

I have been blocked for a while in the development of Smallapack (a Smalltalk
interface to LAPACK) on Squeak.

I have isolated a very strange behaviour in a small test case (see below):

I want to call DLANGE, a LAPACK FORTRAN routine that computes the norm of a
matrix.

On the seventh call, I always get a strange result (the result is 0.0 but it
answers false to = 0.0).

This fails on Unix with image 3.9alpha7029.
The VM is (output of squeak -version):
3.7-7 #1 Sat Mar 19 13:12:20 PST 2005 gcc 3.3.5
Squeak3.7 of '4 September 2004' [latest update: #5989]
Linux squeak.hpl.hp.com 2.4.27-1-386 #1 Fri Sep 3 06:24:46 UTC 2004 i686
GNU/Linux
default plugin location: /usr/local/lib/squeak/3.7-7/*.so

The same code seems to work on Windows (I tried it today)...

Beware, I presume this can corrupt your image.
Maybe I am doing something wrong; could someone please explain?

Nicolas

----------------------------------------------------------------------------------------------------

My definition of dlange2.c (translated with f2c then modified to simply answer
0.0) is:

/* #include "f2c.h" */
typedef double doublereal;
typedef long integer;
typedef long ftnlen;
/*< DOUBLE PRECISION FUNCTION DLANGE( NORM, M, N, A, LDA, WORK ) >*/
doublereal dlange2_(char *norm, integer *m, integer *n, doublereal *a,
                    integer *lda, doublereal *work, ftnlen norm_len)
{
    /* Stub: ignores every argument and always answers 0.0. */
    return 0.0;
} /* dlange2_ */

You just compile it with:

gcc -c dlange2.c; ld -shared -o libdlange2.so dlange2.o

and call it from Squeak with:

ExternalLibrary subclass: #DLANGE2Library
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Smallapack-Test-DLANGE'

DLANGE2Library class>>moduleName
    ^'dlange2'

DLANGE2Library>>dlange2Withnorm: norm m: m n: n a: a lda: lda work: work length: lengthOfnorm
    <cdecl: double 'dlange2_'( char * long * long * double * long * double * long )>
    ^self externalCallFailed

DLANGE2Library class>>testDLANGE2
    "DLANGE2Library testDLANGE2"

    | a m n lda norm cm cn clda work |
    m := lda := 3.
    n := 4.
    a := ExternalData
        fromHandle: (ByteArray new: m*n*8)
        type: ExternalType double.

    "AS FORTRAN IS PASSING POINTERS, DO ALLOCATE ExternalData"
    cm := ExternalData
        fromHandle: ((ByteArray new: 4) signedLongAt: 1 put: m; yourself)
        type: ExternalType long.
    cn := ExternalData
        fromHandle: ((ByteArray new: 4) signedLongAt: 1 put: n; yourself)
        type: ExternalType long.
    clda := ExternalData
        fromHandle: ((ByteArray new: 4) signedLongAt: 1 put: lda; yourself)
        type: ExternalType long.
    norm := 'M'.
    work := nil.
    ^(1 to: 10) collect: [:i | (self new
        dlange2Withnorm: norm m: cm n: cn a: a lda: clda work: work length: 1)
            = 0.0]

"you always obtain false from the seventh entry on..."
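
One way to check whether the shared library itself is at fault is to call the stub from plain C, outside the VM. The following is only a minimal sketch: it repeats the stub and typedefs from dlange2.c so that it is self-contained, and count_zero_results is a hypothetical helper that is not part of the original test case. It mirrors the arguments used in testDLANGE2 (m = lda = 3, n = 4, norm = 'M', work = nil).

```c
#include <stddef.h>

/* Same typedefs as in dlange2.c. */
typedef double doublereal;
typedef long integer;
typedef long ftnlen;

/* The stub from dlange2.c, repeated so this sketch compiles on its own. */
doublereal dlange2_(char *norm, integer *m, integer *n, doublereal *a,
                    integer *lda, doublereal *work, ftnlen norm_len)
{
    (void)norm; (void)m; (void)n; (void)a;
    (void)lda; (void)work; (void)norm_len;
    return 0.0;
}

/* Hypothetical helper (an assumption, not in the original post):
   calls the stub `calls` times with the same arguments as testDLANGE2
   and returns how many results compared equal to 0.0. */
int count_zero_results(int calls)
{
    integer m = 3, n = 4, lda = 3;
    doublereal a[3 * 4] = {0};   /* 3x4 matrix, all zeros */
    char norm[] = "M";
    int zeros = 0;

    for (int i = 0; i < calls; i++) {
        doublereal r = dlange2_(norm, &m, &n, a, &lda, NULL, 1);
        if (r == 0.0)
            zeros++;
    }
    return zeros;
}
```

If count_zero_results(10) answers 10 when built against the same stub, the failing comparison is more likely to originate on the Squeak/FFI side than in the library, although this does not prove it.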

Nicolas Cellier-3

Re: BUG FFI/unix vm ? Hackers help requested !

I am just attaching the files:

DLANGE2Library.st (1K)
libdlange2.so (2K)
dlange2.c (336 bytes)

Nicolas Cellier-3

Re: BUG FFI/unix vm ? Hackers help requested !

Nobody alive on the thread?
The bug is now at http://bugs.impara.de/view.php?id=3929

Nicolas

bouchet vincent

Re: BUG FFI/unix vm ? Hackers help requested !

Hi nicolas,

I just tested this:

^(1 to: 10) collect: [:i | r := (self new dlange2Withnorm: norm m: cm n: cn a: a lda: clda work: work length: 1) = 0.0.
Transcript show: r asString.
r ].

And it works... but if I remove the "Transcript show", I obtain: #(true true false false false false false false false false)

vincent.

2006/6/28, nicolas cellier <[hidden email]>:

Nobody alive on the thread?
The bug is now at http://bugs.impara.de/view.php?id=3929

Nicolas

Nicolas Cellier-3

Re: BUG FFI/unix vm ? Hackers help requested !

In reply to this post by Nicolas Cellier-3

Hello Vincent,

Yes, the FFI call always returns 0.0, but then the VM fails somewhere afterwards when testing = 0.0.
If you execute step by step in the debugger, it does not fail... something weird.
I guess we'd better quit without saving the image when such symptoms occur...

Nicolas

________________________________________________________________________
iFRANCE, exprimez-vous !
http://web.ifrance.com

Dave Hylands

Re: BUG FFI/unix vm ? Hackers help requested !

In reply to this post by Nicolas Cellier-3

Hi Nicolas,

> On the seventh call, I always get a strange result (the result is 0.0 but it
> answers false to = 0.0).

I don't know a whole lot about FFI, but I know that generally accepted
practice (at least in C which is the language I'm most familiar with)
is that you NEVER test for equality when using floating point numbers.

You always use some small epsilon and test that the number you've got
is within epsilon of the comparison number.

This is because due to roundoff and other such effects, you can wind
up with really tiny numbers, like 1 x 10^-53 which is essentially
zero, but is not equal to zero.

This might not even be relevant to your discussion, but seeing the
equality test on 0.0 raised a red flag for me.
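
The epsilon test described above can be sketched in C as follows. This is only an illustration: nearly_equal is a hypothetical helper name, and the choice of epsilon depends on the computation being tested.

```c
#include <stdbool.h>

/* Hypothetical helper: answers true when a and b differ by at most eps.
   The absolute difference is computed by hand to avoid needing libm. */
bool nearly_equal(double a, double b, double eps)
{
    double diff = a - b;
    if (diff < 0.0)
        diff = -diff;
    return diff <= eps;
}
```

With this helper, a tiny residue such as 1e-53 is treated as zero for an epsilon of, say, 1e-12, whereas the exact test 1e-53 == 0.0 is false.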

--
Dave Hylands
Vancouver, BC, Canada
http://www.DaveHylands.com/

bouchet vincent

Re: BUG FFI/unix vm ? Hackers help requested !

Hi all,

I suspect a "synchronisation" problem: I remember a similar problem with C++ (in another life), where the value is read before it has been written. (With a Transcript or a debug screen, execution is slower.)

Nicolas Cellier-3

Re: BUG FFI/unix vm ? Hackers help requested !

In reply to this post by Nicolas Cellier-3

Dave,
you are perfectly right, and in my original TestCase I used such a carefully crafted epsilon, based on the matrix dimensions and Float precision.
And the test (norm < epsilon) also failed...

What you see here is the result of my peregrinations in isolating and tracking the bug.
And since I force the value to 0.0, which has an exact representation in IEEE floating point, there is no problem using equality; this is one of the rare cases where this construct is legitimate. Besides, if I retry the expression while in the debugger, it never fails.

If you load my test case, you can replace the test = 0.0 with < 1.0e-5, ~= 1 or whatever; I guess it will still fail on the seventh call (sorry, I have no Unix image at hand to verify what I say).
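
To illustrate why an exact comparison is legitimate for a value forced to 0.0, here is a minimal C sketch. It assumes that double is an IEEE 754 binary64 value, which holds on common platforms but is not guaranteed by the C standard; double_bits is a hypothetical helper introduced only for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: answers the raw 64-bit pattern of a double.
   memcpy is used because it is a well-defined way to reinterpret bytes. */
uint64_t double_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}
```

Under that assumption, double_bits(0.0) is 0: positive zero is stored as all-zero bits, so no rounding is involved and = 0.0 should answer true for a result that really is 0.0.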

Nicolas

Nicolas Cellier-3

Re: BUG FFI/unix vm ? Hackers help requested !

In reply to this post by Nicolas Cellier-3

bouchet vincent:
> Hi all,
>
> I suspect a "synchronisation" problem: I remember a similar problem with C++ (in another life), where the value is read before it has been written. (With a Transcript or a debug screen, execution is slower.)
>

Interesting,
now I have at least two solutions: a Transcript in the loop, or stealing a 486DX from the museum.
I thought there was a single thread of execution, and I do not have the necessary background to understand what you say, but I will pass the suggestion on to Ian.
Thank you

Nicolas

David T. Lewis

Re: BUG FFI/unix vm ? Hackers help requested !

On Thu, Jun 29, 2006 at 05:52:25PM +0200, [hidden email] wrote:

>
> bouchet vincent:
> > Hi all,
> >
> > I suspect a "synchronisation" problem: I remember a similar problem with C++ (in another life), where the value is read before it has been written. (With a Transcript or a debug screen, execution is slower.)
> >
>
> Interesting,
> now I have at least two solutions: a Transcript in the loop, or stealing a 486DX from the museum.
> I thought there was a single thread of execution, and I do not have the necessary background to understand what you say, but I will pass the suggestion on to Ian.
>

Nicolas,

This is probably not related to your problem, but I will mention it just in
case. The Transcript is a very unreliable way to debug something that may be
related to timing, because it must be updated in the Squeak user interface
process. Instead of the Transcript, it may be better to write to console
standard output. If you have OSProcess loaded in your image, you can
use OSProcess class>>debugMessage: and OSProcess class>>trace: to write to
the console output immediately from the active Squeak process.

Again, I do *not* think that this is directly related to the problem you
are trying to solve, but maybe it will help make your debugging more
predictable.

Dave