
Platform file encoding for FFI


alistairgrant

Platform file encoding for FFI

Hi Esteban, Guille and Everyone,

I haven't looked at using FFI much, however it is easy to imagine that
different file encoding rules on different platforms will make writing
FFI calls more difficult, i.e. some of the different formats are:

- OSX uses Mac specific decomposed UTF8 encoding
- Windows uses Wide Strings (16 bit Unicode characters)
- Linux allows pretty much anything, but precomposed UTF8 is common
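
To make the difference between these formats concrete, here is a minimal Python sketch (Python is used only because its standard library ships Unicode normalization; nothing here is Pharo or VM code). It encodes one file name the three ways listed above. Two assumptions are baked in: standard NFD stands in for the Mac-specific decomposed form, which is only an approximation, and UTF-16LE stands in for Windows wide strings. The function name is made up for illustration.

```python
import unicodedata


def platform_file_name_bytes(name: str) -> dict:
    """Return the bytes each convention listed above would produce for `name`.

    Assumptions (not taken from the thread): standard NFD approximates the
    Mac-specific decomposed form, and UTF-16LE approximates Windows wide strings.
    """
    return {
        "osx": unicodedata.normalize("NFD", name).encode("utf-8"),
        "windows": name.encode("utf-16-le"),
        "linux": unicodedata.normalize("NFC", name).encode("utf-8"),
    }


# U+00E9 is a precomposed e-acute; NFD splits it into "e" plus U+0301.
encoded = platform_file_name_bytes("\u00e9.txt")
for platform, data in encoded.items():
    print(platform, data.hex())
```

The same visible name yields three different byte sequences, which is why the bytes handed to an FFI call have to be prepared per platform.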

Believe it or not, I'm still working on getting the
FileAttributesPlugin working (file name encoding on Windows being the
latest issue - the tests in Pharo need to be extended).

Would it be useful for future FFI work to have primitives available
which convert file names to and from the various platform specific
formats? (Linux is basically a no-op, and Windows could be written
in-image, but OSX requires the platform routines to be called).

Cheers,
Alistair

Guillermo Polito

Re: Platform file encoding for FFI

On Mon, Sep 17, 2018 at 6:52 PM Alistair Grant <[hidden email]> wrote:

> Hi Esteban, Guille and Everyone,
>
> I haven't looked at using FFI much, however it is easy to imagine that
> different file encoding rules on different platforms will make writing
> FFI calls more difficult,

Well not really (from my point of view :))

From the point of view of the FFI call, an encoded string is just a bunch of bytes. FFI does not do any interpretation of them.

> i.e. some of the different formats are:
>
> - OSX uses Mac specific decomposed UTF8 encoding
> - Windows uses Wide Strings (16 bit Unicode characters)
> - Linux allows pretty much anything, but precomposed UTF8 is common

On the image side, we could have a strategy that, depending on the OS, encodes in one encoding or another, or does not encode at all.

> Believe it or not, I'm still working on getting the
> FileAttributesPlugin working (file name encoding on Windows being the
> latest issue - the tests in Pharo need to be extended).

I believe you, don't worry ^^.

> Would it be useful for future FFI work to have primitives available
> which convert file names to and from the various platform specific
> formats? (Linux is basically a no-op, and Windows could be written
> in-image, but OSX requires the platform routines to be called).

Maybe... Are the OSX routines exposed as C functions (that we can call through FFI), or are they Objective-C methods/functions (that are more complicated to map)?

Thanks Alistair!

Henrik Sperre Johansen

Re: Platform file encoding for FFI

Guillermo Polito wrote
> On Mon, Sep 17, 2018 at 6:52 PM Alistair Grant <[hidden email]> wrote:
>
>> Hi Esteban, Guille and Everyone,
>>
>> I haven't looked at using FFI much, however it is easy to imagine that
>> different file encoding rules on different platforms will make writing
>> FFI calls more difficult,
>
> Well not really (from my point of view :))
> From the point of view of the FFI call an encoded string is just a bunch
> of bytes. FFI does not do any interpretation of them.

It *would* be pretty handy to add some auto-conversion to the
marshaller based on parameter encoding options though... (other than
for file names, it could be done in Smalltalk using existing encoders)

    self
        ffiCall: #(bool saveContentsToFile(String fileName, String contents))
        options: #(+stringEncodings(fileName return, platformAPI contents))

(And yes, I've probably badly mangled the options syntax)

That is much less verbose than having to manually convert Strings to the proper
platform Unicode encodings before calling.
It depends a bit on whether the primitive argument is
Byte/WideStrings (latin1/utf32), or if it accepts only utf8 bytes and one has
to convert first anyway.

It's not as if this isn't a pain point; there are plenty of currently used
APIs that break if you try to use non-ASCII.

Cheers,
Henry

--
Sent from: http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html

Henrik Sperre Johansen

Re: Platform file encoding for FFI

In reply to this post by Guillermo Polito

Guillermo Polito wrote
> On Mon, Sep 17, 2018 at 6:52 PM Alistair Grant <[hidden email]> wrote:
>
>
>> Would it be useful for future FFI work to have primitives available
>> which convert file names to and from the various platform specific
>> formats? (Linux is basically a no-op, and Windows could be written
>> in-image, but OSX requires the platform routines to be called).
>>
>
> Maybe... Are the OSX routines exposed as C functions (that we can call
> through FFI) or they are objective-C methods/functions (that are more
> complicated to map)?
>
> Thanks Alistair!

+1. From the image point of view, the non-standard normal form used on OSX
is the biggest issue.
If it's available through FFI, the platform-specific String encoding options
I mentioned previously could be implemented entirely in the image.
If there are extra hoops to jump through, like having to provide utf8 to said
FFI function, it might still be worth it for the reduced performance
overhead.

Cheers,
Henry


Guillermo Polito

Re: Platform file encoding for FFI

In reply to this post by Henrik Sperre Johansen

On Tue, Sep 18, 2018 at 10:43 AM Henrik Sperre Johansen <[hidden email]> wrote:

> Guillermo Polito wrote
>> On Mon, Sep 17, 2018 at 6:52 PM Alistair Grant <[hidden email]> wrote:
>>
>>> Hi Esteban, Guille and Everyone,
>>>
>>> I haven't looked at using FFI much, however it is easy to imagine that
>>> different file encoding rules on different platforms will make writing
>>> FFI calls more difficult,
>>
>> Well not really (from my point of view :))
>> From the point of view of the FFI call an encoded string is just a bunch
>> of bytes. FFI does not do any interpretation of them.
>
> It *would* be pretty handy for adding some auto-conversion into the
> marshaller based on parameter encoding options though... (other than
> filename, could be done in smalltalk using existing encoders)
>
>     self
>         ffiCall: #(bool saveContentsToFile(String fileName, String contents))
>         options: #(+stringEncodings(fileName return, platformAPI contents))

Well, I like this idea.

> (And yes, I've probably badly mangled the options syntax)
>
> Is much less verbose than having to manually convert Strings to the proper
> platform Unicode encodings before calling.
> Depends a bit on whether the primitive argument is
> Byte/Widestrings(latin1/utf32), or if it accepts only utf8 bytes and one has
> to convert first anyways.
>
> It's not like this isn't a pain point, there are plenty of currently used
> API's that are broken if you try to use non-ascii.

Yes, but I think this may be because in general people tend to not know how encodings work... (even myself I don't feel I know enough :))

But this makes me think that we should make encoding explicit?

Maybe we should force people to specify an encoding if they specify a callout using a string.

And then, either they specify it at the level of the callout, or at the level of the library (like setting a default encoding for all strings).

Because this also raises the question of what the default encoding should be.

And I'd say that there is no satisfactory default encoding...
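
The "explicit encoding" idea above (callout level first, then a per-library default, and no silent global fallback) could be sketched roughly as follows. This is Python for illustration only, not uFFI code; `resolve_encoding`, `marshal_string` and `MissingEncodingError` are hypothetical names, and the NUL terminator handling is an assumption about what a C callee would expect.

```python
from typing import Optional


class MissingEncodingError(Exception):
    """Raised when neither the callout nor the library names an encoding."""


def resolve_encoding(callout_encoding: Optional[str],
                     library_default: Optional[str]) -> str:
    # The callout-level setting wins; the library default is the fallback.
    if callout_encoding is not None:
        return callout_encoding
    if library_default is not None:
        return library_default
    # No implicit global default: the caller is forced to decide.
    raise MissingEncodingError("string argument needs an explicit encoding")


def marshal_string(value: str,
                   callout_encoding: Optional[str] = None,
                   library_default: Optional[str] = None) -> bytes:
    """Encode `value` and append a NUL terminator of the right width.

    Use BOM-less codec names such as "utf-8" or "utf-16-le"; a codec that
    writes a byte-order mark would also put one in the terminator.
    """
    encoding = resolve_encoding(callout_encoding, library_default)
    return value.encode(encoding) + "\0".encode(encoding)
```

With this shape, a binding author who forgets to state an encoding gets an error at call time instead of silently corrupted non-ASCII text.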

EstebanLM

Re: Platform file encoding for FFI

In reply to this post by Henrik Sperre Johansen

> On 18 Sep 2018, at 10:42, Henrik Sperre Johansen <[hidden email]> wrote:
>
> Guillermo Polito wrote
>> On Mon, Sep 17, 2018 at 6:52 PM Alistair Grant <[hidden email]> wrote:
>>
>>> Hi Esteban, Guille and Everyone,
>>>
>>> I haven't looked at using FFI much, however it is easy to imagine that
>>> different file encoding rules on different platforms will make writing
>>> FFI calls more difficult,
>>
>> Well not really (from my point of view :))
>> From the point of view of the FFI call an encoded string is just a bunch
>> of bytes. FFI does not do any interpretation of them.
>
> It *would* be pretty handy for adding some auto-conversion into the
> marshaller based on parameter encoding options though... (other than
> filename, could be done in smalltalk using existing encoders)
>
>     self
>         ffiCall: #(bool saveContentsToFile(String fileName, String contents))
>         options: #(+stringEncodings(fileName return, platformAPI contents))

This is cool.
What I do not like is to rely on primitives to do that encoding.
This should be in image… using FFI if needed (this is all because we want to rely less and less on plugins :P)

Esteban


EstebanLM

Re: Platform file encoding for FFI

In reply to this post by Guillermo Polito

On 18 Sep 2018, at 11:04, Guillermo Polito <[hidden email]> wrote:

> On Tue, Sep 18, 2018 at 10:43 AM Henrik Sperre Johansen <[hidden email]> wrote:
>
>> Guillermo Polito wrote
>>> On Mon, Sep 17, 2018 at 6:52 PM Alistair Grant <[hidden email]> wrote:
>>>
>>>> Hi Esteban, Guille and Everyone,
>>>>
>>>> I haven't looked at using FFI much, however it is easy to imagine that
>>>> different file encoding rules on different platforms will make writing
>>>> FFI calls more difficult,
>>>
>>> Well not really (from my point of view :))
>>> From the point of view of the FFI call an encoded string is just a bunch
>>> of bytes. FFI does not do any interpretation of them.
>>
>> It *would* be pretty handy for adding some auto-conversion into the
>> marshaller based on parameter encoding options though... (other than
>> filename, could be done in smalltalk using existing encoders)
>>
>>     self
>>         ffiCall: #(bool saveContentsToFile(String fileName, String contents))
>>         options: #(+stringEncodings(fileName return, platformAPI contents))
>
> Well, I like this idea.
>
>> (And yes, I've probably badly mangled the options syntax)
>>
>> Is much less verbose than having to manually convert Strings to the proper
>> platform Unicode encodings before calling.
>> Depends a bit on whether the primitive argument is
>> Byte/Widestrings(latin1/utf32), or if it accepts only utf8 bytes and one has
>> to convert first anyways.
>>
>> It's not like this isn't a pain point, there are plenty of currently used
>> API's that are broken if you try to use non-ascii.
>
> Yes, but I think this may be because in general people tend to not know how encodings work... (even myself I don't feel I know enough :))
> But this makes me think that we should make encoding explicit?

Yup, explicit please. Nothing hidden under the carpet :)

> Maybe we should force people to specify an encoding if they specify a callout using a string.
>
> And then, either they specify it at the level of the callout, or at the level of the library (like setting a default encoding for all strings).

You can have some global FFI settings (I was thinking of adding some global options settings for FFI in general, btw) and even “library based settings”, to simplify.

Esteban

> Because this raises also the question of what is the default encoding?
> And I'd say that there is no satisfactory default encoding...

alistairgrant

Re: Platform file encoding for FFI

Hi Guille, Esteban and Henry,

Thanks for your replies.

On Tue, Sep 18, 2018 at 10:09:02AM +0200, Guillermo Polito wrote:

>
>
> On Mon, Sep 17, 2018 at 6:52 PM Alistair Grant <[hidden email]> wrote:
>
> Hi Esteban, Guille and Everyone,
>
> I haven't looked at using FFI much, however it is easy to imagine that
> different file encoding rules on different platforms will make writing
> FFI calls more difficult,
>
>
> Well not really (from my point of view :))
> From the point of view of the FFI call an encoded string is just a bunch of
> bytes. FFI does not do any interpretation of them.

Right, but getting the appropriately encoded bunch of bytes is the issue. :-)

> i.e. some of the different formats are:
>
> - OSX uses Mac specific decomposed UTF8 encoding
> - Windows uses Wide Strings (16 bit Unicode characters)
> - Linux allows pretty much anything, but precomposed UTF8 is common
>
>
>
> At the image side, we could have an strategy that, depending on the OS, could
> encode in one encoding or another, or even not encode at all.
>
>
> Believe it or not, I'm still working on getting the
> FileAttributesPlugin working (file name encoding on Windows being the
> latest issue - the tests in Pharo need to be extended).
>
>
> I believe you, don't worry ^^.
>
>
> Would it be useful for future FFI work to have primitives available
> which convert file names to and from the various platform specific
> formats? (Linux is basically a no-op, and Windows could be written
> in-image, but OSX requires the platform routines to be called).
>
>
> Maybe... Are the OSX routines exposed as C functions (that we can call through
> FFI) or they are objective-C methods/functions (that are more complicated to
> map)?

The OSX routines are exposed as C functions (and available as
Objective-C methods), see convertChars() in
platforms/unix/vm/sqUnixCharConv.c.

On Tue, Sep 18, 2018 at 11:21:41AM +0200, Esteban Lorenzano wrote:
> > self
> >     ffiCall: #(bool saveContentsToFile(String fileName, String contents))
> >     options: #(+stringEncodings(fileName return, platformAPI contents))
>
> This is cool.
> What I do not like is to rely on primitives to do that encoding.
> This should be in image… using FFI if needed (this is all because we
> want to rely less and less on plugins :P)

I realise of course that this could all be done in FFI, and I agree with
all Esteban's arguments in favour of FFI. My main motivation was that
the code is already in the VM, and reusing it avoids code duplication, with
the obvious benefit that if a bug is fixed the fix applies everywhere.

On Tue, Sep 18, 2018 at 11:23:56AM +0200, Esteban Lorenzano wrote:

>
>
> On 18 Sep 2018, at 11:04, Guillermo Polito <[hidden email]>
> wrote:
>
>
>
> On Tue, Sep 18, 2018 at 10:43 AM Henrik Sperre Johansen <
> [hidden email]> wrote:
>
> It *would* be pretty handy for adding some auto-conversion into the
> marshaller based on parameter encoding options though... (other than
> filename, could be done in smalltalk using exisiting encoders)
>
>     self
>         ffiCall: #(bool saveContentsToFile(String fileName, String contents))
>         options: #(+stringEncodings(fileName return, platformAPI contents))
>
>
> Well, I like this idea.
>
>
> (And yes, I've probably badly mangled the options syntax)
>
> Is much less verbose than having to manually convert Strings to the proper
> platform Unicode encodings before calling.
> Depends a bit on whether the primitive argument is
> Byte/Widestrings(latin1/utf32), or if it accepts only utf8 bytes and one has
> to convert first anyways.
>
> It's not like this isn't a pain point, there are plenty of currently used
> API's that are broken if you try to use non-ascii.
>
>
> Yes, but I think this may be because in general people tend to not know how
> encodings work... (even myself I don't feel I know enough :))
> But this makes me think that we should make encoding explicit?
>
>
> Yup, explicit please. Nothing hide behind the carpet :)
>
>
>
> Maybe we should force people to specify an encoding if they specify a
> callout using a string.
>
> And then, either they specify it at the level of the callout, or at the
> level of the library (like setting a default encoding for all strings).
>
>
>
> You can have some global FFI settings (I was thinking on adding some global
> options settings for FFI in general, btw) and even “library based settings”, to
> simplify.
>
> Esteban
>
>
>
> Because this raises also the question of what is the default encoding?
> And I'd say that in there is no satisfactory default encoding...

I'll defer to Sven every time when it comes to character encoding, but
my understanding is that the only platform that has consistent encoding
rules is OSX, which uses the platform specific decomposed UTF8.

Both Windows and Linux use precomposed UTF8, but other character
encodings are possible (particularly for older files).

So we certainly shouldn't make the encoding hard-coded. UTF8 as the default
encoding I think does make sense (this is what FilePlugin currently
uses).

Cheers,
Alistair

Guillermo Polito

Re: Platform file encoding for FFI

On Tue, Sep 18, 2018 at 4:40 PM Alistair Grant <[hidden email]> wrote:

> I haven't looked at using FFI much, however it is easy to imagine that
> different file encoding rules on different platforms will make writing
> FFI calls more difficult,
>
> Well not really (from my point of view :))
> From the point of view of the FFI call an encoded string is just a bunch of
> bytes. FFI does not do any interpretation of them.

> Right, but getting the appropriately encoded bunch of bytes is the issue. :-)

Yes, the thing is that this would require some new extensions in uFFI to support encodings.

The good thing is that it would have a positive impact on *ALL* FFI bindings using strings (by making it explicit to people that they should care about encodings :)).

> Maybe... Are the OSX routines exposed as C functions (that we can call through
> FFI) or they are objective-C methods/functions (that are more complicated to
> map)?

> The OSX routines are exposed as C functions (and available as
> Objective-C methods), see convertChars() in
> platforms/unix/vm/sqUnixCharConv.c.

Nice!

> On Tue, Sep 18, 2018 at 11:21:41AM +0200, Esteban Lorenzano wrote:
> > > self
> > >     ffiCall: #(bool saveContentsToFile(String fileName, String contents))
> > >     options: #(+stringEncodings(fileName return, platformAPI contents))
> >
> > This is cool.
> > What I do not like is to rely on primitives to do that encoding.
> > This should be in image… using FFI if needed (this is all because we
> > want to rely less and less on plugins :P)
>
> I realise of course that this could all be done in FFI, and I agree with
> all Esteban's arguments in favour of FFI, my main motivation was that
> the code is already in the VM, and to avoid code duplication with the
> obvious benefit that if a bug is fixed it will apply everywhere.

Yeh. At the end it's a matter of debugging cycles.

Imagine making the "compile-restart" steps that you're facing while changing the plugin almost negligible in the "change-compile-restart-test" loop :).

alistairgrant

Re: Platform file encoding for FFI

On Wed, 19 Sep 2018 at 10:26, Guillermo Polito
<[hidden email]> wrote:

>
> On Tue, Sep 18, 2018 at 4:40 PM Alistair Grant <[hidden email]> wrote:
>>
>> I realise of course that this could all be done in FFI, and I agree with
>> all Esteban's arguments in favour of FFI, my main motivation was that
>> the code is already in the VM, and to avoid code duplication with the
>> obvious benefit that if a bug is fixed it will apply everywhere.
>
>
> Yeh. At the end it's a matter of debugging cycles.
> Imagine making the "compile-restart" steps that you're facing while changing the plugin almost negligible in the "change-compile-restart-test" loop :).

This is true if the code only resides in the image, but in this case
the code won't be going away from the VM any time soon.

Anyway, for whoever does implement the code for FFI the option is always there.

Thanks to everyone for their replies!

Cheers,
Alistair

Eliot Miranda-2

Re: Platform file encoding for FFI

In reply to this post by Henrik Sperre Johansen

Hi Henry,

On Tue, Sep 18, 2018 at 1:43 AM Henrik Sperre Johansen <[hidden email]> wrote:

> Guillermo Polito wrote
>> On Mon, Sep 17, 2018 at 6:52 PM Alistair Grant <[hidden email]> wrote:
>>
>>> Hi Esteban, Guille and Everyone,
>>>
>>> I haven't looked at using FFI much, however it is easy to imagine that
>>> different file encoding rules on different platforms will make writing
>>> FFI calls more difficult,
>>
>> Well not really (from my point of view :))
>> From the point of view of the FFI call an encoded string is just a bunch
>> of bytes. FFI does not do any interpretation of them.
>
> It *would* be pretty handy for adding some auto-conversion into the
> marshaller based on parameter encoding options though... (other than
> filename, could be done in smalltalk using existing encoders)
>
>     self
>         ffiCall: #(bool saveContentsToFile(String fileName, String contents))
>         options: #(+stringEncodings(fileName return, platformAPI contents))
>
> (And yes, I've probably badly mangled the options syntax)

Why not go for some generic escape sequence that can inject Smalltalk code into the marshaling? Right now e.g.

    primExport: aName value: aValue
        ^ self ffiCall: #(void moz_preferences_set_bool (short* aName, bool aValue))

is compiled as

    primExport: arg1 value: arg2
        | tmp1 tmp2 |
        '<an unprintable nonliteral value>'
            invokeWithArguments:
                {(tmp2 := arg1 packToArity: 1).
                arg2}

where '<an unprintable nonliteral value>' is the ExternalFunction object (it could usefully print itself as a literal and then decompilation would be meaningful; there is already code in the Squeak FFI repository).

Let's say one added {}'s as characters that can't ever appear in C parameter lists (of course, and alas, []'s can because of arrays). Then you could perhaps write

    primExport: aName value: aValue
        ^ self ffiCall: #(void moz_preferences_set_bool ( { short* aName } asUTF8String, bool aValue))

and have that generate a send of asUTF8String to arg1 or tmp2. One could surround the whole thing to apply a coercion to the return value, but there's no need because one can write e.g.

    primExport: aName value: aValue
        ^ (self ffiCall: #(void moz_preferences_set_bool ( { short* aName } asUTF8String, bool aValue))) fromUTF8String

So then there would be a generic mechanism for injecting Smalltalk code into the marshaling, and one could develop the string encoding support independently from the FFI. The options syntax, however, requires parsing support, more documentation, constant extension to support new facilities, etc.
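
As a rough illustration of how little parsing the brace escape would need, here is a Python sketch that splits a parameter list into C declarations and optional conversion selectors. It is not FFI code: the `{ declaration } selector` grammar is only the hypothetical syntax proposed above, the function names are invented, and the comma split is deliberately naive.

```python
import re

# Hypothetical syntax from the proposal above: "{ <C declaration> } <selector>".
_ESCAPED_PARAM = re.compile(
    r"^\s*\{\s*(?P<decl>[^{}]+?)\s*\}\s*(?P<selector>\w+)\s*$"
)


def parse_parameter(spec: str):
    """Split one parameter into (C declaration, conversion selector or None)."""
    match = _ESCAPED_PARAM.match(spec)
    if match:
        return match.group("decl"), match.group("selector")
    return spec.strip(), None


def parse_parameters(param_list: str):
    # Naive comma split: adequate for this sketch, not for function-pointer types.
    return [parse_parameter(part) for part in param_list.split(",")]


params = parse_parameters("{ short* aName } asUTF8String, bool aValue")
print(params)
```

A code generator could then emit a send of the selector to the matching argument before marshaling, and leave parameters without a selector untouched.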


_,,,^..^,,,_

best, Eliot