Better management of encoding of environment variables

Re: Better management of encoding of environment variables

Sven Van Caekenberghe-2
Dave,

> On 18 Jan 2019, at 01:54, David T. Lewis via Pharo-dev <[hidden email]> wrote:
>
>
> From: "David T. Lewis" <[hidden email]>
> Subject: Re: [Pharo-dev] Better management of encoding of environment variables
> Date: 18 January 2019 at 01:54:34 GMT+1
> To: Pharo Development List <[hidden email]>
>
>
> On Thu, Jan 17, 2019 at 04:57:18PM +0100, Sven Van Caekenberghe wrote:
>>
>>> On 16 Jan 2019, at 23:23, Eliot Miranda <[hidden email]> wrote:
>>>
>>> On Wed, Jan 16, 2019 at 2:37 AM Sven Van Caekenberghe <[hidden email]> wrote:
>>>
>>> The image side is perfectly capable of dealing with platform differences
>>> in a clean/clear way, and at least we can then use the full power of our
>>> language and our tools.
>>>
>> Agreed.  At the same time I think it is very important that we don't rely
>> on the FFI for environment variable access.  This is a basic cross-platform
>> facility.  So I would like to see the environment accessed through primitives,
>> but have the image place interpretation on the result of the primitive(s),
>> and have the primitive(s) answer a raw result, just a sequence of uninterpreted
>> bytes.
>>
>> OK, I can understand that ENV VAR access is more fundamental than FFI
>> (although FFI is already essential for Pharo, also during startup).
>>
>>> VisualWorks takes this approach and provides a class UninterpretedBytes
>>> that the VM is aware of.  That's always seemed like an ugly name and
>>> overkill to me.  I would just use ByteArray and provide image level
>>> conversion from ByteArray to String, which is what I believe we have anyway.
>>
>> Right, bytes are always uninterpreted, else they would be something else.
>> We got ByteArray>>#decodedWith: and ByteArray>>#utf8Decoded and our ByteArray
>> inspector decodes automatically if it can.
>>
>
> Hi Sven,
>
> I am the author of the getenv primitives, and I am also sadly uninformed
> about matters of character sets and strings in a multilingual environment.
>
> The primitives answer environment variable values as ByteString
> rather than ByteArray. This made sense to me at the time that I wrote it,
> because ByteString is easy to display in an inspector, and because it is
> easily converted to ByteArray.
>
> For an American English speaker this seems like a good choice, but I
> wonder now if it is a bad decision. After all, it is also trivially easy
> to convert a ByteArray to ByteString for display in the image.
>
> Would it be helpful to have getenv primitives that answer ByteArray
> instead, and to let all conversion (including in OSProcess) be done in
> the image?
>
> Thanks,
> Dave

Normally, the correct way to represent uninterpreted bytes is with a ByteArray. Decoding these bytes as characters is the specific task of a character encoder/decoder, with a deliberate choice as to which to use.
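A minimal sketch of that deliberate choice, written in Python rather than Pharo. The helper name `decode_env_bytes` and the sample bytes are assumptions made up for illustration; the point is only that the same raw bytes yield different text depending on the decoder picked.

```python
def decode_env_bytes(raw: bytes, encoding: str) -> str:
    """Decode raw environment bytes with an explicitly chosen encoding.

    Strict error handling makes a wrong choice fail loudly where possible,
    instead of silently producing replacement characters.
    """
    return raw.decode(encoding, errors="strict")


# Hypothetical raw value as the OS might hand it over: "café" encoded as UTF-8.
RAW_VALUE = b"caf\xc3\xa9"

AS_UTF8 = decode_env_bytes(RAW_VALUE, "utf-8")      # the intended text
AS_LATIN1 = decode_env_bytes(RAW_VALUE, "latin-1")  # same bytes, wrong decoder
```

Decoding with Latin-1 does not fail, it just produces mojibake, which is why the choice of decoder has to be made knowingly by whoever receives the bytes.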

Since the getenv() C library function uses simple C strings, it is understandable that this was carried over. It is probably not worth changing that, or too risky to do so, as long as the receiver understands that it is a raw OS string that needs more work.

Like with file path encoding/decoding, environment variable encoding/decoding is plain messy and complex. IMHO it is better to manage that at the image level where we are more agile and can better handle that complexity.

Sven

BTW: using funny Unicode chars, like 🎈 [https://www.fileformat.info/info/unicode/char/1f388/index.htm] is something even English speakers do.




Re: Better management of encoding of environment variables

Pharo Smalltalk Developers mailing list
In reply to this post by Sven Van Caekenberghe-2




On Wed, 16 Jan 2019 at 18:37, Sven Van Caekenberghe <[hidden email]> wrote:
> Still, one of the conclusions of previous discussions about the encoding of environment variables was/is that there is no single correct solution. OS's are not consistent in how the encoding is done in all (historical) contexts (like sometimes,
> 1 env var defines the encoding to use for others,

ouch.  That one point nearly made me retract my comment in the next paragraph, but is there much more complexity?
or just a case of  utf8<==>appSpecificEncoding  rather than  ascii<==>appSpecificEncoding ?

Sorry if I'm rehashing past discussion (do you have a link?), but considering...
* 92% of web pages are UTF8 encoded[1] such that pragmatically UTF8 *is* the standard for text 
* Strings so pervasive in a system
...would there be an overall benefit to adopt UTF8 as the encoding for Strings 
consistently provided across the cross-platform vm interface?
(i.e. fixing platforms that don't comply to the standard due to their historical baggage)

And I found it interesting Microsoft are making some moves towards UTF8 [2]...
"With insider build 17035 and the April 2018 update (nominal build 17134) for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox appeared for setting the locale code page to UTF-8.[a] This allows for calling "narrow" functions, including fopen and SetWindowTextA, with UTF-8 strings. " 

The approach vm-side could be similar to Section 10 How to do text on Windows [3] 
with the philosophy of "performing the [conversions] as close to API calls as possible, 
and never holding the [converted] data."
 
> different applications do different things, and other such nice stuff), and certainly not across platforms.
>
> So this is really complex.
>
> Do we want to hide this in some obscure VM C code that very few people can see, read, let alone help with ?
>
> The image side is perfectly capable of dealing with platform differences in a clean/clear way, and at least we can then use the full power of our language and our tools.

Big question... Do we currently have primitives of the same name returning 
different encodings on different platforms?  I presume that would be awkward.
If the image is to handle encoding differences, should separate primitives be
used? e.g. utf8GetEnv & utf16getEnv

Could I get some feedback on [4], which says...
"**The Single Most Important Fact About Encodings**
If you completely forget everything I just explained, please remember one extremely important fact.
It does not make sense to have a string without knowing what encoding it uses."

And so... does our String nowadays require an 'encoding' instance variable such that this is *always* associated? 
This might remove any need for separate utf8GetEnv & utf16getEnv (if that was even a reasonable idea).

cheers -ben




> On 16 Jan 2019, at 10:59, Guillermo Polito <[hidden email]> wrote:
>
> Hi Nicolas,
>
> On Wed, Jan 16, 2019 at 10:25 AM Nicolas Cellier <[hidden email]> wrote:
> IMO, the Windows VM (and plugins) should do the UCS2 -> UTF8 conversion because the purpose of a VM is to provide an OS-independent façade.
> I made progress recently in this area, but we should finish the job/test/consolidate.
>
> I'm following your changes for windows from the shadows and I think they are awesome :).
>
> If someone bypasses the VM and uses the Windows API directly through FFI, then they take the responsibility, but uniformity doesn't hurt.
>
>  So far we are using FFI for this: as you say, we first create Win32WideStrings from UTF-8 strings and then we use FFI calls to the *W functions.
> I don't think we can make it for Pharo7.0.0. The cycle to build, do some acceptance tests, and then bless a new VM as stable is far too long for our imminent release :).
>
> But this could be for a 7.1.0, and if you like I can surely give a hand on this.
>
> Guille



Re: Better management of encoding of environment variables

Guillermo Polito


On Fri, Jan 18, 2019 at 1:48 PM Ben Coman via Pharo-dev <[hidden email]> wrote:

> ouch.  That one point nearly made me retract my comment in the next paragraph, but is there much more complexity?
> or just a case of  utf8<==>appSpecificEncoding  rather than  ascii<==>appSpecificEncoding ?

It's not muuuuch more complex. The problem is that the bugs that arise from mishandling such conversions are usually super obscure.

> And so... does our String nowadays require an 'encoding' instance variable such that this is *always* associated?
> This might remove any need for separate utf8GetEnv & utf16getEnv (if that was even a reasonable idea).

I think that will just overcomplicate things. Right now, all Strings in Pharo are Unicode strings. Characters are represented by their corresponding Unicode code point.
If all characters in a string have code points < 256, they are stored in a ByteString. Otherwise they are stored in a WideString.

I think assuming a single representation for strings, and then encoding when interacting with external apps/APIs, is MUCH simpler.
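A rough illustration of that storage rule, sketched in Python rather than Pharo. The function name `storage_class_for` is hypothetical, and the snippet only mimics the "all code points < 256" test described above; it is not how the Pharo image actually chooses the class.

```python
def storage_class_for(text: str) -> str:
    """Name the Pharo string class that could hold `text`.

    Mirrors the rule from the thread: if every character's code point
    fits in one byte, a compact ByteString is enough; otherwise a
    WideString is needed. No encoding is involved at this level.
    """
    if all(ord(ch) < 256 for ch in text):
        return "ByteString"
    return "WideString"


def code_points(text: str) -> list:
    """Return the abstract Unicode code points of `text`."""
    return [ord(ch) for ch in text]
```

For example, `storage_class_for("caf\u00e9")` gives `"ByteString"` because U+00E9 is below 256, while a string containing U+1F388 needs a `"WideString"`. Encoding only enters the picture when the string leaves the image.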






--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13


Re: [Vm-dev] Better management of encoding of environment variables

ducasse
In reply to this post by Sven Van Caekenberghe-2

> What's important is to create abstraction layers that insulate the unneeded complexity in the lowest layers possible.
> The VM excels at insulating, of course.
> On the image side we have to take responsibility for not leaking too much ourselves.
>
> As Eliot said, right now the VM (and FFI) just take sequences of uninterpreted bytes (ByteArray) and pass them to the API.
> The conversion ByteString/WideString <-> specifically-encoded ByteArray is performed on the image side.
>
> With FFI, we could eventually make this conversion platform specific instead of always UTF-8.
> The purpose would be to reduce back-and-forth conversions in chained API calls, for example.
> For sanity, we had better follow these rules:
> - the image does not attempt direct interaction with these opaque data (other than through the OS API)
> - nor does it preserve them across snapshots.
> Beware, conversion is not platform specific, but can be library specific (some libraries on Windows will take UTF-8).
> So we may reify the library and always double dispatch to the library, or we create upper-level abstract messages that may chain several low-level OS API calls.
> We would thus let complexity creep one more level, but only if we have good reason to do so.
> We don't want to trade uniformity for small gains.
> BTW, note that the xxxW API is already a huge uniformisation progress compared to the code-page specific xxxA API!

Hi Nicolas,

I'm reading and trying to understand, but the xxx lost me. :)

> Another strategy is to create more complex abstractions (i.e. parameterized) that can deal with a zoo of different underlying conventions.
> For example, this would be the EncodedString of VW.
> This strategy could be tempting, because it enables dealing with lower-level platform-specific-encoded objects and still interacting with them in the image transparently.
> But I strongly advise thinking twice (or more) before introducing such complexity:
> - it breaks former invariants (thus potentially a lot of code)
> - complexity tends to spread to many places
> I don't recommend it.
>
> PS: oops, sorry for the out-of-band message, I wanted to send it, but it seems that I did not press the button properly...



Re: [Vm-dev] Better management of encoding of environment variables

Nicolas Cellier


On Fri, 18 Jan 2019 at 14:35, ducasse <[hidden email]> wrote:

>> BTW, note that the xxxW API is already a huge uniformisation progress compared to the code-page specific xxxA API!
>
> Hi Nicolas,
>
> I'm reading and trying to understand, but the xxx lost me. :)

Sorry, I was talking about the Windows API variants: W for wide characters, A for ANSI (in practice, whichever code page is currently in effect).
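To make the difference concrete, here is a small sketch in Python (not Pharo, and not calling any Windows API). The helper names are hypothetical; it only shows the bytes that the two families of functions would expect for the same text, assuming UTF-16LE for the W variants and a caller-chosen code page for the A variants. Null terminators are left out.

```python
def encode_for_wide_api(text: str) -> bytes:
    """Bytes for a xxxW function: UTF-16, little-endian, one fixed convention."""
    return text.encode("utf-16-le")


def encode_for_ansi_api(text: str, code_page: str) -> bytes:
    """Bytes for a xxxA function: depends on the code page in effect.

    Raises UnicodeEncodeError when the code page cannot represent the text.
    """
    return text.encode(code_page)


E_ACUTE = "\u00e9"      # é
BALLOON = "\U0001F388"  # the balloon emoji mentioned earlier in the thread

WIDE_E = encode_for_wide_api(E_ACUTE)                # same on every machine
ANSI_E_1252 = encode_for_ansi_api(E_ACUTE, "cp1252")  # Western European code page
ANSI_E_437 = encode_for_ansi_api(E_ACUTE, "cp437")    # original IBM PC code page
```

The wide form is the same everywhere, while the A form changes with the code page, and a character such as the balloon cannot be passed through `cp1252` at all.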


Another strategy is to create more complex abstractions (i.e. parameterized) that can deal with a zoo of different underlying conventions.
For example, this would be the EncodedString of VW.
This strategy could be tempting, because it enables dealing with lower level platform-specific-encoded objects and still interact with them in the image transparently.
But I strongly advise to think twice (or more) before introducing such complexity:
- it breaks former invariants (thus potentially lot of code)
- complexity tends to spread in many places
I don't recommend it.

PS: oups, sorry for out of band message, I wanted to send, but it seems that I did not press the button properly...

> On 16 Jan 2019, at 10:59, Guillermo Polito <[hidden email]> wrote:
>
> Hi Nicolas,
>
> On Wed, Jan 16, 2019 at 10:25 AM Nicolas Cellier <[hidden email]> wrote:
> IMO, windows VM (and plugins) should do the UCS2 -> UTF8 conversion because the purpose of a VM is to provide an OS independant façade.
> I made progress recently in this area, but we should finish the job/test/consolidate.
>
> I'm following your changes for windows from the shadows and I think they are awesome :).

> If someone bypass the VM and use direct windows API thru FFI, then he takes the responsibility, but uniformity doesn't hurt.
>
>  So far we are using FFI for this, as you say we create first Win32WideStrings from utf8 strings and then we use ffi calls to the *W functions.
> I don't think we can make it for Pharo7.0.0. The cycle to build, do some acceptance tests, and then bless a new VM as stable is far too long for our inminent release :).
>
> But this could be for a 7.1.0, and if you like I can surely give a hand on this.
>
> Guille




--
_,,,^..^,,,_
best, Eliot


Re: Better management of encoding of environment variables

Sven Van Caekenberghe-2
In reply to this post by Guillermo Polito


> On 18 Jan 2019, at 14:23, Guillermo Polito <[hidden email]> wrote:
>
>
> I think that will just overcomplicate things. Right now, all Strings in Pharo are unicode strings. Characters are represented with their corresponding unicode codepoint.
> If all characters in a string have codepoints < 256 then they are just stored in a bytestring. Otherwise they are WideStrings.
>
> I think assuming a single representation for strings, and then encode when interacting with external apps/APIs is MUCH simpler.

Absolutely !

(and yes I know that for outgoing FFI calls that might mean a UTF-8 encoding step, so be it).

Re: Better management of encoding of environment variables

Ben Coman


On Fri, 18 Jan 2019 at 21:39, Sven Van Caekenberghe <[hidden email]> wrote:


> On 18 Jan 2019, at 14:23, Guillermo Polito <[hidden email]> wrote:
>
>
> I think that will just overcomplicate things. Right now, all Strings in Pharo are unicode strings.

Cool. I didn't realise that.  But to be pedantic, which unicode encoding? 
Should I presume from Sven's "UTF-8 encoding step" comment below 
and the WideString class comment  "This class represents the array of 32 bit wide characters"
that the WideString encoding is UTF-32?  So should its comment be updated to advise that?
 
cheers -ben


Re: Better management of encoding of environment variables

David T. Lewis
In reply to this post by Sven Van Caekenberghe-2
On Fri, Jan 18, 2019 at 01:40:26PM +0100, Sven Van Caekenberghe wrote:

> Dave,
>
> Normally, the correct way to represent uninterpreted bytes is with a ByteArray. Decoding these bytes as characters is the specific task of a character encoder/decoder, with a deliberate choice as to which to use.
>
> Since the getenv() C library function uses simple C strings, it is understandable that this was carried over. It is probably not worth changing that, or too risky to do so, as long as the receiver understands that it is a raw OS string that needs more work.
>
> Like with file path encoding/decoding, environment variable encoding/decoding is plain messy and complex. IMHO it is better to manage that at the image level where we are more agile and can better handle that complexity.
>

Thanks Sven, that makes perfect sense to me.

>
> BTW: using funny Unicode chars, like 🎈 [https://www.fileformat.info/info/unicode/char/1f388/index.htm] is something even English speakers do.
>

You are right. I wrote those getenv primitives 20 years ago and
back then we were still doing our emoticons like this:

;-)

Thanks,
Dave
 


Re: Better management of encoding of environment variables

Guillermo Polito
In reply to this post by Ben Coman


On Fri, Jan 18, 2019 at 2:46 PM Ben Coman <[hidden email]> wrote:

>> On 18 Jan 2019, at 14:23, Guillermo Polito <[hidden email]> wrote:
>>
>> I think that will just overcomplicate things. Right now, all Strings in Pharo are unicode strings.
>
> Cool. I didn't realise that.  But to be pedantic, which unicode encoding?
> Should I presume from Sven's "UTF-8 encoding step" comment below
> and the WideString class comment  "This class represents the array of 32 bit wide characters"
> that the WideString encoding is UTF-32?  So should its comment be updated to advise that?

None :D

That's the funny thing, they are not encoded.

Actually, you should see Strings as collections of Characters, and Characters defined in terms of their abstract code points.
ByteStrings are an optimized (just more compact) version that stores codepoints that fit in a byte.



Re: Better management of encoding of environment variables

Sven Van Caekenberghe-2
In reply to this post by Ben Coman


> On 18 Jan 2019, at 14:45, Ben Coman <[hidden email]> wrote:
>
>
>
> On Fri, 18 Jan 2019 at 21:39, Sven Van Caekenberghe <[hidden email]> wrote:
>
>
> > On 18 Jan 2019, at 14:23, Guillermo Polito <[hidden email]> wrote:
> >
> >
> > I think that will just overcomplicate things. Right now, all Strings in Pharo are unicode strings.
>
> Cool. I didn't realise that.  But to be pedantic, which unicode encoding?
> Should I presume from Sven's "UTF-8 encoding step" comment below
> and the WideString class comment  "This class represents the array of 32 bit wide characters"
> that the WideString encoding is UTF-32?  So should its comment be updated to advise that?

Not really, Pharo Strings are a collection of Characters, each of which is a Unicode code point (yes a 32 bit one).

An encoding projects this rather abstract notion onto a sequence of bytes.

UTF-32 (ZnUTF32Encoder, https://en.wikipedia.org/wiki/UTF-32) is for example endian dependent.
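As a small sketch of that point, in Python rather than with the Zinc encoders: the single code point U+1F388 projects onto different byte sequences depending on the encoding, and for UTF-32 also on the byte order. The constant names are made up for illustration.

```python
BALLOON = "\U0001F388"  # one character, one abstract code point

# The code point itself is just a number; no bytes are implied yet.
CODE_POINT = ord(BALLOON)

# The same code point projected onto bytes by three different encodings.
UTF32_LITTLE_ENDIAN = BALLOON.encode("utf-32-le")
UTF32_BIG_ENDIAN = BALLOON.encode("utf-32-be")
UTF8 = BALLOON.encode("utf-8")

# Decoding with the matching encoding recovers the original character.
ROUND_TRIP = UTF32_BIG_ENDIAN.decode("utf-32-be")
```

The two UTF-32 results contain the same four bytes in opposite order, which is what "endian dependent" means here; UTF-8 has no such byte-order variants.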

Read the first part of

https://ci.inria.fr/pharo-contribution/job/EnterprisePharoBook/lastSuccessfulBuild/artifact/book-result/Zinc-Encoding-Meta/Zinc-Encoding-Meta.html

> cheers -ben
>
> Characters are represented with their corresponding unicode codepoint.
> > If all characters in a string have codepoints < 256 then they are just stored in a bytestring. Otherwise they are WideStrings.
> >
> > I think assuming a single representation for strings, and then encode when interacting with external apps/APIs is MUCH simpler.
>
> Absolutely !
>
> (and yes I know that for outgoing FFI calls that might mean a UTF-8 encoding step, so be it).



Re: Better management of encoding of environment variables

Pharo Smalltalk Developers mailing list

> Read the first part of
>
> https://ci.inria.fr/pharo-contribution/job/EnterprisePharoBook/lastSuccessfulBuild/artifact/book-result/Zinc-Encoding-Meta/Zinc-Encoding-Meta.html

I love that book :)

It is so cool to have such cool docs.

Re: Better management of encoding of environment variables

Eliot Miranda-2
In reply to this post by Guillermo Polito

> On Jan 18, 2019, at 2:04 AM, Guillermo Polito <[hidden email]> wrote:
[snip]
>
> Well, personally I would like getenv/setenv and getcwd/setcwd support to be a basic service provided by the VM rather than living in a plugin.

+1000

> Cheers,
> Guille


Re: Better management of encoding of environment variables

Eliot Miranda-2
In reply to this post by Guillermo Polito
Hi Guille,

On Jan 18, 2019, at 6:04 AM, Guillermo Polito <[hidden email]> wrote:

On Fri, Jan 18, 2019 at 2:46 PM Ben Coman <[hidden email]> wrote:

On Fri, 18 Jan 2019 at 21:39, Sven Van Caekenberghe <[hidden email]> wrote:

> On 18 Jan 2019, at 14:23, Guillermo Polito <[hidden email]> wrote:
>
>
> I think that will just overcomplicate things. Right now, all Strings in Pharo are unicode strings.

Cool. I didn't realise that.  But to be pedantic, which unicode encoding? 
Should I presume from Sven's "UTF-8 encoding step" comment below 
and the WideString class comment  "This class represents the array of 32 bit wide characters"
that the WideString encoding is UTF-32?  So should its comment be updated to advise that?

None :D

That's the funny thing, they are not encoded.

Actually, you should see Strings as collections of Characters, and Characters defined in terms of their abstract code points.
ByteStrings are an optimized (just more compact) version that stores codepoints that fit in a byte.

And Spur supports 16-bit strings too, which would be versions that store code points that fit in doublebytes.

cheers -ben

Characters are represented with their corresponding unicode codepoint.
> If all characters in a string have codepoints < 256 then they are just stored in a bytestring. Otherwise they are WideStrings.
>
> I think assuming a single representation for strings, and then encode when interacting with external apps/APIs is MUCH simpler.

Absolutely !

(and yes I know that for outgoing FFI calls that might mean a UTF-8 encoding step, so be it).
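
To make the point about unencoded strings concrete, here is a minimal sketch that can be tried in a Pharo playground. It assumes a Pharo image where String>>#utf8Encoded is available alongside the ByteArray>>#utf8Decoded mentioned earlier in the thread; the variable names are only illustrative.

```smalltalk
| narrow wide bytes |
"A string whose code points all fit in a byte."
narrow := 'cafe'.
"A one-character string holding the euro sign, code point 16r20AC."
wide := String with: (Character value: 16r20AC).

narrow class.            "expected: ByteString"
wide class.              "expected: WideString"
(wide at: 1) codePoint.  "8364, the abstract code point, no encoding involved"

"Encoding only happens when bytes are needed, e.g. before an FFI call."
bytes := wide utf8Encoded.   "a ByteArray: #[226 130 172]"
bytes utf8Decoded = wide.    "true: decoding restores the same characters"
```

The storage class (ByteString or WideString) is only a compactness choice; the characters themselves stay the same until an explicit encoder turns them into bytes.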


--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13


Win32OSProcessPlugin question (was: Better management of encoding of environment variables)

David T. Lewis
In reply to this post by Nicolas Cellier
Hi Nicolas,

Motivated by the discussion on Pharo list, I added new primitives to
OSProcessPlugin to answer environment variables and path information as
raw ByteArray, such that those byte arrays can be converted in the
image to strings with any encoding.

I did the changes in "trunk" OSPP, and am now implementing them in the
oscog branch. I have tested the Unix changes in both branches.

Unfortunately I do not have a Windows development environment at the moment, so I
need to ask for help.

In Win32OSProcessPlugin, I am adding two primitives:
  #primitiveGetCurrentWorkingDirectoryAsBytes
  #primitiveGetEnvironmentStringsAsBytes

These are based on the two string primitives, which remain available:
  #primitiveGetCurrentWorkingDirectory
  #primitiveGetEnvironmentStrings

The string primitives include your recent changes for UTF8 encoding,
and the "xxxAsBytes" variants are based on my earlier logic without
the UTF8 support.

This means that OSProcess on Windows will use your UTF8 string support,
and the additional primitives (based on the old logic, and answering
raw bytes) will be available for people who want to do the encoding
work in the image.

Does this sound like the right approach?

Thanks,
Dave


I have a question concerning the Win32 changes.


On Wed, Jan 16, 2019 at 10:24:51AM +0100, Nicolas Cellier wrote:

> IMO, the Windows VM (and plugins) should do the UCS2 -> UTF8 conversion because
> the purpose of a VM is to provide an OS-independent façade.
> I made progress recently in this area, but we should finish the
> job/test/consolidate.
> If someone bypasses the VM and uses the Windows API directly through FFI, then
> they take the responsibility, but uniformity doesn't hurt.
>
> On Wed, 16 Jan 2019 at 10:14, Guillermo Polito <[hidden email]> wrote:
>
> > Hi Stephan,
> >
> > I'm sorry for the noise.
> >
> > At the time, both #at: and #getEnv: variants existed. The changes
> > backported from the PharoLauncher were only using the getter versions of
> > getEnv, but for Pharo I decided to implement also the setter versions. And
> > after checking the code and its users in image, I've finally decided to go
> > for an at:[[ifAbsent]put:] version. So I'd say that the leading
> > **guideline** was at the end the one here in the mailing list, but also if
> > you check the PR I've introduced a more complete and consistent API,
> > following the one of dictionaries.
> >
> > https://github.com/pharo-project/pharo/pull/1980/files
> >
> > at:
> > at:ifAbsent:
> > at:ifPresent:
> > at:ifPresent:ifAbsent:
> > at:put:
> > removeKey:
> >
> > Plus, in *nix, variants where an encoding can be specified.
> >
> > I'm sorry if I've introduced some confusion.
> >
> >
> > On Wed, Jan 16, 2019 at 9:47 AM Stephan Eggermont <[hidden email]> wrote:
> > >
> > > Guillermo Polito <[hidden email]> wrote:
> > > > Hi all,
> > > >
> > > > following the meeting we had here @Inria headquarters, I'll be backporting
> > > > some of the improvements we did in the launcher this last month regarding
> > > > the encoding of environment variables.
> > > >
> > > > I've opened for this issue https://pharo.fogbugz.com/f/cases/22658/
> > > >
> > > > We have already studied possible alternatives with Pablo and Christophe,
> > > > we have some conclusions, and we propose some changes:
> > > >
> > > > API Proposal for OSEnvironment
> > > > =========================
> > > >
> > > >
> > > >    - *at: aVariableName*
> > > >
> > > > Gets the String value of an environment variable called `aVariableName`.
> > > > It is the system's responsibility to manage the encoding.
> > > > Rationale: A common denominator for all platforms providing an already
> > > > decoded string, because Windows does not (compared to *nix systems) provide
> > > > an encoded byte representation of the value. Windows has instead its own
> > > > wide string representation.
> > > >
> > > >    - *[optionally] rawAt: anEncodedVariableName*
> > > >
> > > > Gets the byte value of an environment variable called
> > > > `anEncodedVariableName`.
> > > > It is the user's responsibility to encode and decode the argument and
> > > > return values in the encoding of their preference.
> > > > Rationale: Some systems may want to have the liberty to use different
> > > > encodings, or even to put binary data in the variables.
> > > >
> > > >    - *[optionally] at: aVariableName encoding: anEncoding*
> > > >
> > > > Gets the value of an environment variable called `aVariableName` using
> > > > `anEncoding` to encode/decode arguments and return values.
> > > > Rationale: *nixes could potentially use different encodings for their
> > > > environment variables, or even use different encodings in different parts
> > > > of their file system.
> > > >
> > > > Other Implementation details
> > > > =========================
> > > >
> > > >    - VM primitives returning path Strings should be carefully managed to
> > > >    decode them, since they are actually C strings (so byte arrays)
> > > >    disguised as ByteStrings.
> > > >    - Windows requires calling the right *Wide version of the functions
> > > >    from C, plus the correct encoding routine. This could be implemented as
> > > >    an FFI call or by modifying the VM to do it properly instead of calling
> > > >    the ASCII version.
> > > >
> > > >
> > >
> > > What is the conclusion from this and issue 22658? See PR 2238. #getEnv: is
> > > public API
> > >
> > > Stephan
> > >
> > >
> > >
> >
> >
> >


Re: Better management of encoding of environment variables

David T. Lewis
In reply to this post by David T. Lewis

On Fri, Jan 18, 2019 at 08:58:07AM -0500, David T. Lewis wrote:

> On Fri, Jan 18, 2019 at 01:40:26PM +0100, Sven Van Caekenberghe wrote:
> >
> > > On 18 Jan 2019, at 01:54, David T. Lewis via Pharo-dev <[hidden email]> wrote:
> > >
> > > On Thu, Jan 17, 2019 at 04:57:18PM +0100, Sven Van Caekenberghe wrote:
> > >>
> > >> Right, bytes are always uninterpreted, else they would be something else.
> > >> We got ByteArray>>#decodedWith: and ByteArray>>#utf8Decoded and our ByteArray
> > >> inspector decodes automatically if it can.
> > >
> > > Hi Sven,
> > >
> > > I am the author of the getenv primitives, and I am also sadly uninformed
> > > about matters of character sets and strings in a multilingual environment.
> > >
> > > The primitives answer environment variable values as ByteString
> > > rather than ByteArray. This made sense to me at the time that I wrote it,
> > > because ByteString is easy to display in an inspector, and because it is
> > > easily converted to ByteArray.
> > >
> > > For an American English speaker this seems like a good choice, but I
> > > wonder now if it is a bad decision. After all, it is also trivially easy
> > > to convert a ByteArray to ByteString for display in the image.
> > >
> > > Would it be helpful to have getenv primitives that answer ByteArray
> > > instead, and to let all conversion (including in OSProcess) be done in
> > > the image?
> > >
> > > Thanks,
> > > Dave
> >
> > Normally, the correct way to represent uninterpreted bytes is with a
> > ByteArray. Decoding these bytes as characters is the specific task of
> > a character encoder/decoder, with a deliberate choice as to which to use.
> >
> > Since the getenv() system call uses simple C strings, it is understandable
> > that this was carried over. Changing that is probably not worth the effort,
> > or too risky, as long as the receiver understands that it is a raw OS
> > string that needs more work.
> >
> > Like with file path encoding/decoding, environment variable encoding/decoding
> > is plain messy and complex. IMHO it is better to manage that at the
> > image level where we are more agile and can better handle that complexity.
> >
>
> Thanks Sven, that makes perfect sense to me.
>

I added some new primitives to OSProcessPlugin that answer ByteArray instead of ByteString.

For Unix (Linux, OS X):
        <primitive: 'primitiveGetCurrentWorkingDirectoryAsBytes' module: 'UnixOSProcessPlugin'>
        <primitive: 'primitiveArgumentAtAsBytes' module: 'UnixOSProcessPlugin'>
        <primitive: 'primitiveEnvironmentAtAsBytes' module: 'UnixOSProcessPlugin'>
        <primitive: 'primitiveEnvironmentAtSymbolAsBytes' module: 'UnixOSProcessPlugin'>
        <primitive: 'primitiveRealpathAsBytes' module: 'UnixOSProcessPlugin'>

For Windows:
        <primitive: 'primitiveGetCurrentWorkingDirectoryAsBytes' module: 'Win32OSProcessPlugin'>
        <primitive: 'primitiveGetEnvironmentStringsAsBytes' module: 'Win32OSProcessPlugin'>

These should be in the latest VM builds now.

If you are using OSProcess, update it to the latest version to get accessor methods
for the new primitives. For example, OSProcess accessor primGetCurrentWorkingDirectory
calls the original primitive that answers a ByteString, and to get raw bytes
you can use OSProcess accessor primGetCurrentWorkingDirectoryAsBytes instead.
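
As a rough sketch of how the raw-bytes accessor might be used from the image: this assumes OSProcess is loaded, the VM includes the new primitives, and the platform encodes paths as UTF-8 (an assumption, pick another decoder if yours differs). The decoding selectors are the Pharo ones Sven mentioned earlier in the thread; the variable names are only illustrative.

```smalltalk
| rawBytes path |
"Raw, uninterpreted bytes straight from the new primitive."
rawBytes := OSProcess accessor primGetCurrentWorkingDirectoryAsBytes.

"Decode in the image with a deliberately chosen encoding (UTF-8 assumed here)."
path := rawBytes utf8Decoded.

"The same step, naming the encoding explicitly. The #utf8 identifier is
 assumed to be one that the character encoders in your image recognise."
path := rawBytes decodedWith: #utf8.
```

Keeping the primitive's answer as a ByteArray leaves the choice of decoder to image-side code, which is the division of labour discussed above.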

Dave



