Environment variables encoding ?

Sven Van Caekenberghe-2
Hi,

The dictionary

 OSPlatform current environment

contains a copy of the OS's environment variables (more correctly of the VM process), as key/value pairs.

These are obtained via the following system calls:

on macOS & *nix

  LIBC environ

on Windows

  KERNEL32 GetEnvironmentStrings

It is, however, unclear how these are encoded. On macOS & *nix that seems to be UTF-8; on Windows there are some reports that it appears to be Latin-1. Both might be locale-specific; I don't know either way.

Does anyone know for sure ?

I furthermore think that OSEnvironment and its subclasses, which make this call, should be responsible for decoding the C strings into proper Pharo strings, and should not leave that responsibility to their users.

Fundamentally, in the following, the decoding is still not done correctly and that is wrong/confusing IMHO.

$ FOO=benoît ./pharo Pharo.image eval 'OSEnvironment current associations'
{'TERM_PROGRAM'->'Apple_Terminal'. 'TERM'->'xterm-256color'. 'SHELL'->'/bin/bash'. 'TMPDIR'->'/var/folders/sy/sndrtj9j1tq06j0lfnshmrl80000gn/T/'. 'FOO'->'benoît'. 'Apple_PubSub_Socket_Render'->'/private/tmp/com.apple.launchd.uWk7pivcLT/Render'. 'TERM_PROGRAM_VERSION'->'404'. 'TERM_SESSION_ID'->'845BECCD-0AB0-4686-B7F9-3A0FF84BDCB7'. 'USER'->'sven'. 'SSH_AUTH_SOCK'->'/private/tmp/com.apple.launchd.y5oCwdUyaG/Listeners'. 'PATH'->'/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/texbin:/opt/X11/bin'. 'PWD'->'/tmp/benoît'. 'XPC_FLAGS'->'0x0'. 'XPC_SERVICE_NAME'->'0'. 'HOME'->'/Users/sven'. 'SHLVL'->'2'. 'LOGNAME'->'sven'. 'LC_CTYPE'->'UTF-8'. 'DISPLAY'->'/private/tmp/com.apple.launchd.lsgASYFiWW/org.macosforge.xquartz:0'. 'SECURITYSESSIONID'->'186a9'. 'OLDPWD'->'/tmp/benoît'. '_'->'/tmp/benoît/pharo-vm/Pharo.app/Contents/MacOS/Pharo'. '__CF_USER_TEXT_ENCODING'->'0x1F5:0x0:0x0'}
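
To illustrate the kind of mismatch in question, here is a minimal sketch in Python (used only as a neutral illustration, it is not Pharo code). It assumes the shell placed UTF-8 bytes in FOO, which is an assumption and not something the output above proves. The same bytes give different strings depending on which decoder is applied:

```python
# The UTF-8 bytes a shell would typically place in FOO for 'benoît' (assumption).
raw = "benoît".encode("utf-8")

# Treating every byte as one character (i.e. Latin-1) yields mojibake.
as_latin1 = raw.decode("latin-1")

# Decoding with the encoding that was actually used restores the text.
as_utf8 = raw.decode("utf-8")

print(list(raw))    # [98, 101, 110, 111, 195, 174, 116]
print(as_latin1)    # benoÃ®t
print(as_utf8)      # benoît
```

A string that displays as 'benoÃ®t' is therefore a strong hint that UTF-8 bytes were read one byte per character.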

Of course, if we change this, we will need to fix callers.

Opinions ?

Sven

PS: Furthermore, I note that there is a subtle difference in how $FOO and $PWD in the above are UTF-8 encoded. In the former, normalisation was done, in the latter not. Maybe that could lead to problems (when comparing/composing them). This is a difficult/complex subject (https://medium.com/concerning-pharo/an-implementation-of-unicode-normalization-7c6719068f43).
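
As a rough illustration of the normalisation difference (a Python sketch, not Pharo code; which form each variable actually holds is an assumption, since the raw bytes are not shown here), the precomposed and decomposed spellings of the same text are different code point sequences and different UTF-8 byte sequences:

```python
import unicodedata

composed = unicodedata.normalize("NFC", "benoît")    # 'î' as one code point, U+00EE
decomposed = unicodedata.normalize("NFD", "benoît")  # 'i' followed by U+0302

print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 6 7
print(composed.encode("utf-8").hex())    # 62656e6fc3ae74
print(decomposed.encode("utf-8").hex())  # 62656e6f69cc8274

# Comparing after normalizing both sides to the same form succeeds.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

So a naive comparison of $FOO against a component of $PWD could fail even though both display identically.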




Re: Environment variables encoding ?

Pharo Smalltalk Developers mailing list
I clearly remember that while debugging that problem, the %LOCALAPPDATA% environment variable at some point held that string encoded in Latin-1 (I'm on Windows 10, French version). Unfortunately, I have not been able to reproduce the exact sequence that led to that specific case...


-----------------
Benoît St-Jean


On Tuesday, April 17, 2018, 3:37:01 a.m. EDT, Sven Van Caekenberghe <[hidden email]> wrote:

> Does anyone know for sure ?



Re: Environment variables encoding ?

Damien Pollet
In reply to this post by Sven Van Caekenberghe-2
It seems macOS normalizes UTF-8 differently from everyone else in file names (I think base character + composing instead of precomposed codepoint). That might affect PWD.
For environment variables, even if most sensible platforms should have adopted UTF-8 by now, I wouldn't be surprised if there's no official encoding whatsoever (i.e. they're just bytes with a 0 at the end…)
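
If the entries really are just NUL-terminated bytes, one cautious approach is to split them as bytes and decode only on demand. A minimal Python sketch follows (not Pharo code; the function names, the sample data and the fallback order are all hypothetical choices made for illustration):

```python
def split_environ_entries(entries):
    """Split raw 'NAME=value' byte strings at the first '=' only, without decoding."""
    env = {}
    for entry in entries:
        name, sep, value = entry.partition(b"=")
        if sep:  # skip malformed entries that have no '='
            env[name] = value
    return env


def decode_value(raw, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in turn; Latin-1 accepts any byte sequence."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")


# Hypothetical sample data: one UTF-8 value, one Latin-1 value, one value containing '='.
sample = [b"FOO=beno\xc3\xaet", b"BAR=beno\xeet", b"OPTS=a=b"]
env = split_environ_entries(sample)
print(decode_value(env[b"FOO"]))   # benoît
print(decode_value(env[b"BAR"]))   # benoît
print(env[b"OPTS"])                # b'a=b'
```

Guessing like this is only a heuristic: a Latin-1 value whose bytes happen to form valid UTF-8 would be decoded as UTF-8, which is why knowing the real encoding matters.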


--
Damien Pollet
type less, do more [ | ] http://people.untyped.org/damien.pollet

Re: Environment variables encoding ?

Sven Van Caekenberghe-2


> On 17 Apr 2018, at 09:57, Damien Pollet <[hidden email]> wrote:
>
> It seems macOS normalizes UTF-8 differently from everyone else in file names (I think base character + composing instead of precomposed codepoint). That might affect PWD.
> For environment variables, even if most sensible platforms should have adopted UTF-8 by now, I wouldn't be surprised if there's no official encoding whatsoever (i.e. they're just bytes with a 0 at the end…)

;-)

We can decode everything, we have all the tools, but of course, we first have to know what encoding is being used. Hence my question.


Re: Environment variables encoding ?

Nicolai Hess-3-2


2018-04-17 10:05 GMT+02:00 Sven Van Caekenberghe <[hidden email]>:


> These are obtained via the following system calls:
>
> on macOS & *nix
>
>   LIBC environ
>
> on Windows
>
>   KERNEL32 GetEnvironmentStrings


Interestingly, this only affects the dictionary operations (asDictionary, keysAndValuesDo:, ...).
If you access the variable directly with getEnv:, it works:

OSPlatform current environment setEnv:'FOO' value:'benoît'.
OSPlatform current environment getEnv:'FOO'. "'benoît'"
OSPlatform current environment asDictionary at: 'FOO'. "'benoŒt'"
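
One conversion that happens to reproduce exactly this 'benoŒt' is storing the text in the OEM code page 850 and reading it back as Windows-1252. The Python sketch below (not Pharo code) shows only that the bytes line up. Whether this is what actually happens between setEnv:value: and the dictionary access is not established here, and both code pages are assumptions. As far as I know, Microsoft's documentation does note that the ANSI version of GetEnvironmentStrings returns OEM characters, which would be consistent with it.

```python
original = "benoît"

# Assumption: the value is stored by one API in the OEM code page (here 850)...
stored = original.encode("cp850")

# ...and read back through another path as Windows-1252.
misread = stored.decode("cp1252")

print(stored)    # b'beno\x8ct'
print(misread)   # benoŒt
```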

 

Re: Environment variables encoding ?

Henrik Sperre Johansen
In reply to this post by Sven Van Caekenberghe-2
primitiveGetenv returns values in the current locale's code page on Windows;
a value bound to € comes back as a string with the single character 128 under
Windows-1252 (MS1252, Western European), at least.
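
For reference, the byte values involved can be checked with a short Python sketch (an illustration only, not the primitive's code):

```python
euro = "€"

cp1252_bytes = euro.encode("cp1252")   # single byte 0x80 (decimal 128)
utf8_bytes = euro.encode("utf-8")      # three bytes: E2 82 AC

print(list(cp1252_bytes))  # [128]
print(list(utf8_bytes))    # [226, 130, 172]

# Reading the code-page byte one-to-one as a character gives U+0080, not U+20AC.
print(hex(ord(cp1252_bytes.decode("latin-1"))))  # 0x80
print(hex(ord(cp1252_bytes.decode("cp1252"))))   # 0x20ac
```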

On Windows, there are three versions of each API call with string
parameters/returns:
xxx (resolves to either *A or *W, depending on whether UNICODE is defined)
xxxA (ASCII, or, more accurately, current code page)*
xxxW (UTF-16)

IIRC, the intention is that primitives receiving/sending char* to the image
will expect/return UTF-8, so a conversion macro before passing it on to the
syscall would be necessary on Windows; I believe one exists already and is
used in at least the file plugin primitives.
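
The conversion step itself is small. Here is a hedged Python sketch (not the VM's C macro) that turns a UTF-16 environment block, laid out as the *W call is documented to return it (NAME=value entries separated by NUL, with an extra NUL at the end), into UTF-8 name/value pairs. The function name and the sample block are hypothetical:

```python
def environment_block_to_utf8(block: bytes) -> dict:
    """Convert a UTF-16LE environment block into UTF-8 encoded name/value pairs."""
    text = block.decode("utf-16-le")
    result = {}
    for entry in text.split("\x00"):
        if not entry:
            break  # an empty entry marks the end of the block
        # Search from index 1 so entries such as '=C:=C:\\tmp' keep their leading '='.
        split_at = entry.find("=", 1)
        if split_at == -1:
            continue
        name, value = entry[:split_at], entry[split_at + 1:]
        result[name.encode("utf-8")] = value.encode("utf-8")
    return result


# Hypothetical sample block, built here only to exercise the function.
sample = "FOO=benoît\x00PRICE=€\x00\x00".encode("utf-16-le")
converted = environment_block_to_utf8(sample)
print(converted[b"FOO"])    # b'beno\xc3\xaet'
print(converted[b"PRICE"])  # b'\xe2\x82\xac'
```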

Cheers,
Henry

* The Windows FFI fallback, used if the primitive fails, calls the *A version
directly. It can be changed to call *W correctly, but there's a fair bit of
wrapping fluff involved:
https://youtu.be/Um41DPPs5ZA?list=PL843D1D545F9F52B6&t=1591



--
Sent from: http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html


Re: Environment variables encoding ?

Sven Van Caekenberghe-2
In reply to this post by Nicolai Hess-3-2


> On 17 Apr 2018, at 10:40, Nicolai Hess <[hidden email]> wrote:
> OSPlatform current environment setEnv:'FOO' value:'benoît'.
> OSPlatform current environment getEnv:'FOO'. "'benoît'"
> OSPlatform current environment asDictionary at: 'FOO'. "'benoŒt'"

Hmm, not for me (on macOS):

$ FOO=benoît ./pharo Pharo.image eval "OSPlatform current environment getEnv:'FOO'"
'benoît'

If you put it in yourself, are you not cheating then ?


Re: Environment variables encoding ?

Sven Van Caekenberghe-2
In reply to this post by Henrik Sperre Johansen
Yes, I also saw the *W variants. My initial reaction was that those would make more sense, but it is hard to foresee all the consequences.

> On 17 Apr 2018, at 10:49, Henrik Sperre Johansen <[hidden email]> wrote:
> On Windows, there are three versions of each API call with string
> parameters/returns:
> xxx (resolves to either *A or *W, depending on whether UNICODE is defined)
> xxxA (ASCII, or, more accurately, current code page)*
> xxxW (UTF-16)

Re: Environment variables encoding ?

Henrik Sperre Johansen
In reply to this post by Damien Pollet
Damien Pollet wrote
> It seems macOS normalizes UTF-8 differently from everyone else in file
> names (I think base character + composing instead of precomposed
> codepoint). That might affect PWD.
> For environment variables, even if most sensible platforms should have
> adopted UTF-8 by now, I wouldn't be surprised if there's no official
> encoding whatsoever (i.e. they're just bytes with a 0 at the end…)

If by different you mean that it actually normalizes the file names, then
yes.
All Mac filenames are in a well-defined form: NFD.
On Linux, they're just arrays of bytes, and anything goes.
That the bytes mostly happen to be valid UTF-8 strings in NFC is just a
by-product of the fact that that's the format most programs use when calling the
file primitives.
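
To make "anything goes" concrete, here is a small Python sketch (not Pharo code; the file name is a made-up example). A name written as Latin-1 bytes is not valid UTF-8, so a strict decoder rejects it. Python's own 'surrogateescape' error handler is one way a runtime can still carry such a name around and write it back unchanged. This shows only how one runtime copes and is not a claim about what Pharo does.

```python
# A name written by a program that used Latin-1: not valid UTF-8.
raw_name = b"beno\xeet.txt"

# Strict decoding fails...
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...but 'surrogateescape' keeps the undecodable byte so it can be written back unchanged.
text = raw_name.decode("utf-8", errors="surrogateescape")
round_tripped = text.encode("utf-8", errors="surrogateescape")

print(strict_ok)                   # False
print(round_tripped == raw_name)   # True
```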

Cheers,
Henry




Re: Environment variables encoding ?

Guillermo Polito
Hi,

I think this problem is not exclusive to environment variables. It also affects file paths and other things. So far, Pharo does not detect the locale to choose the encoding, and it would be nice to do so.
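
As a point of comparison only, this is roughly what locale-based detection looks like in another runtime (a Python sketch; it says nothing about how Pharo would or should implement it):

```python
import codecs
import locale

# Ask the C library / OS which encoding the user's locale prefers.
# Passing False avoids changing the process-wide locale as a side effect.
preferred = locale.getpreferredencoding(False)

# Resolve the reported name to a codec to make sure it can be used for decoding.
codec = codecs.lookup(preferred)

print(preferred)    # e.g. 'UTF-8' or 'cp1252', depending on the machine
print(codec.name)
```

The detected name would then be used to decode environment values and file paths received from the OS.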

On Tue, Apr 17, 2018 at 10:56 AM, Henrik Sperre Johansen <[hidden email]> wrote:

If by different you mean that it actually normalizes the file names, then
yes.
All Mac filenames are in a well-defined form: NFD.
On Linux, they're just arrays of bytes, and anything goes.
That the bytes mostly happen to be valid UTF-8 strings in NFC is just a
by-product of the fact that that's the format most programs use when calling the
file primitives.

--
Guille Polito
Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille
CRIStAL - UMR 9189
French National Center for Scientific Research - http://www.cnrs.fr

Web: http://guillep.github.io
Phone: +33 06 52 70 66 13


Re: Environment variables encoding ?

Sven Van Caekenberghe-2


> On 19 Apr 2018, at 10:21, Guillermo Polito <[hidden email]> wrote:
>
> Hi,
>
> I think this problem is not exclusive to environment variables. It also affects file paths and other things. So far, Pharo does not detect the locale to choose the encoding, and it would be nice to do so.

Sure, it would be nice/good/helpful to detect the locale (BTW, don't we have that already, more or less?).

But I would be surprised if an OS API delivered differently encoded data to a process depending on the locale - I mean in general. That would be setting things up for a huge disaster, IMHO. A modern OS should just deliver UTF-8 (full Unicode code points) and be done with it.
