Better management of encoding of environment variables

Guillermo Polito
Hi all,

Following the meeting we had here at Inria headquarters, I'll be backporting some of the improvements we made in the launcher this last month regarding the encoding of environment variables.

I've opened an issue for this: https://pharo.fogbugz.com/f/cases/22658/

We have already studied possible alternatives with Pablo and Christophe, reached some conclusions, and propose some changes:

API Proposal for OSEnvironment
=========================

  • at: aVariableName
Gets the String value of an environment variable called `aVariableName`.
It is the system's responsibility to manage the encoding.
Rationale: A common denominator for all platforms is to provide an already decoded string, because Windows does not (unlike *nix systems) provide an encoded byte representation of the value. Windows instead has its own wide string representation.
  • [optionally] rawAt: anEncodedVariableName
Gets the byte value of an environment variable called `anEncodedVariableName`.
It is the user's responsibility to encode and decode arguments and return values in the encoding of their preference.
Rationale: Some systems may want the liberty to use different encodings, or even to put binary data in the variables.
  • [optionally] at: aVariableName encoding: anEncoding
Gets the value of an environment variable called `aVariableName` using `anEncoding` to encode/decode arguments and return values.
Rationale: *nixes could potentially use different encodings for their environment variables or even use different encodings in different parts of their file system.

Other Implementation details
=========================  
  • VM primitives returning path Strings should be carefully managed to decode them, since they are actually C strings (so byte arrays) disguised as ByteStrings.
  • Windows requires calling the right *Wide version of the functions from C, plus the correct encoding routine. This could be implemented as an FFI call or by modifying the VM to do it properly instead of calling the ASCII version.

Re: Better management of encoding of environment variables

Ben Coman


I haven't been using environment variables a lot so I don't have a strong technical opinion (although at a glance it makes reasonable sense).
But I wanted to say I really like the way you've presented your offline discussion, conclusions and proposal.  Thanks.

cheers -ben


Re: Better management of encoding of environment variables

Guillermo Polito
Hi all,

Thanks Ben for reading.

For those wanting a follow-up, I've proposed this pull request: https://github.com/pharo-project/pharo/pull/1980.
I'm still working on avoiding dependencies on UFFI and on fixing one other test.
It is almost finished, however, and given that I had to adapt the original abstract proposal to fit the real system, here is an updated version:

API Proposal for OSEnvironment and friends
=========================

OSEnvironment is the common denominator for all platforms. Each platform should implement at least the following messages with the following semantics:
  • at: aVariableName [ifAbsent:/ifAbsentPut:/ifPresent:ifAbsent:]
Gets the String value of an environment variable called `aVariableName`.
It is the system's responsibility to manage the encoding of both arguments and return values.
  • at: aVariableName put: aValue
Sets the environment variable called `aVariableName` to the value `aValue`.
It is the system's responsibility to manage the encoding of both arguments and return values.
  • removeKey: aVariableName
Removes the environment variable called `aVariableName`.
It is the system's responsibility to manage the encoding of both arguments and return values.
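
For illustration, here is a minimal playground sketch of how the common API above might be used. It assumes the selectors behave as described in this proposal and that `OSEnvironment current` answers the environment of the running platform; `PHARO_DEMO_VAR` is a made-up variable name.

```smalltalk
| env before after |
env := OSEnvironment current.

"Set a variable; the system is expected to encode name and value for the OS."
env at: 'PHARO_DEMO_VAR' put: 'café'.

"Read it back as an already decoded String."
before := env at: 'PHARO_DEMO_VAR' ifAbsent: [ nil ].

"Remove it again, then fall back to a default because it is missing."
env removeKey: 'PHARO_DEMO_VAR'.
after := env at: 'PHARO_DEMO_VAR' ifAbsent: [ 'unset' ].
```

The caller never mentions an encoding here; that is the point of the common API.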

API Extensions for *Nix Systems (OSX & Linux)
=========================

Since *nix environment variables are binary data that could be in any encoding, the following methods provide more flexibility to access such data in the encoding of the user's choice, or even in binary form.
  • at: aVariableName encoding: anEncoding [ifAbsent:/ifAbsentPut:/ifPresent:ifAbsent:/put:]  /  removeKey: aVariableName encoding: anEncoding
Variants of the common API from OSEnvironment.
The encoding passed as argument will be used to encode/decode both arguments and return values.
  • rawAt: anEncodedVariableName encoding: anEncoding [ifAbsent:/ifAbsentPut:/ifPresent:ifAbsent:/put:]  /  removeRawKey: anEncodedVariableName
Variants of the common API from OSEnvironment.
These methods assume arguments and return values are encoded/decoded by the user, so no marshalling or decoding is done by the system.
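
To make the difference between the raw and the explicitly encoded variants concrete, the sketch below simulates a *nix environment with a plain Dictionary holding ByteArrays. It is not the implementation in the PR: `rawEnv` and `DEMO_NAME` are stand-ins, and the encoders come from Zinc's ZnCharacterEncoder.

```smalltalk
| utf8 latin1 rawEnv rawValue asLatin1 asUtf8 |
utf8 := ZnCharacterEncoder newForEncoding: 'utf8'.
latin1 := ZnCharacterEncoder newForEncoding: 'latin1'.

"Stand-in for the process environment: encoded names map to encoded values."
rawEnv := Dictionary new.
rawEnv
	at: (utf8 encodeString: 'DEMO_NAME')
	put: (latin1 encodeString: 'café au lait').

"Raw access: the caller encodes the name and gets undecoded bytes back."
rawValue := rawEnv at: (utf8 encodeString: 'DEMO_NAME').

"Access with an explicit encoding: the same bytes, decoded as the caller asks."
asLatin1 := latin1 decodeBytes: rawValue.

"The wrong encoding fails here: 16rE9 followed by a space is not valid UTF-8."
asUtf8 := [ utf8 decodeBytes: rawValue ] on: Error do: [ :error | nil ].
```

Only the caller knows which encoding a given variable was written in, which is why the raw variant leaves the bytes alone.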

Rationale
=========================
  • Encoding/decoding should be applied not only to values but to variable names too. In most cases ASCII overlaps with the utf* and Latin* encodings, but this cannot simply be assumed.
  • Windows requires calling the right *Wide version of the functions from C, plus the correct encoding routine. This could be implemented as an FFI call or by modifying the VM to do it properly instead of calling the ASCII version.
  • Unix file systems and environment variables could mix strings in different encodings, hence the flexibility added by the low-level *nix extensions.
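
As a small illustration of the first point, the following sketch compares the bytes produced for a variable name under two encodings. `DEMO_VAR` and `DÉMO_VAR` are invented names, and the encoders are Zinc's.

```smalltalk
| utf8 latin1 asciiSame accentedSame |
utf8 := ZnCharacterEncoder newForEncoding: 'utf8'.
latin1 := ZnCharacterEncoder newForEncoding: 'latin1'.

"A pure ASCII name yields the same bytes under both encodings."
asciiSame := (utf8 encodeString: 'DEMO_VAR') = (latin1 encodeString: 'DEMO_VAR').

"A name with a non-ASCII character does not, so a lookup made with the wrong encoding would miss it."
accentedSame := (utf8 encodeString: 'DÉMO_VAR') = (latin1 encodeString: 'DÉMO_VAR').
```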

Other Implementation Details
=========================
  • VM primitives returning path Strings should be carefully managed to decode them, since they are actually C strings (so byte arrays) disguised as ByteStrings.
  • Similar changes had to be applied to correctly obtain the current working directory in case it is a wide string.
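
The "disguised C string" situation can be reproduced in a playground without touching the VM. The sketch below builds a ByteString whose characters are really UTF-8 bytes (an assumption about what such a primitive hands back) and then decodes it properly.

```smalltalk
| bytes disguised decoded |
"UTF-8 bytes for 'café', as a C function would return them."
bytes := #[99 97 102 195 169].

"What a naive primitive answers: one Character per byte, not yet decoded."
disguised := bytes asString.

"Treat the ByteString as bytes again and decode them explicitly."
decoded := disguised asByteArray utf8Decoded.
```

The disguised string has five characters while the decoded one has four, which is how such bugs usually show up with non-ASCII paths.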


--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13


Re: Better management of encoding of environment variables

Guillermo Polito
I should also add:

Other Implementation Details
=========================
  • VM primitives returning path Strings should be carefully managed to decode them, since they are actually C strings (so byte arrays) disguised as ByteStrings.
  • Similar changes had to be applied to correctly obtain the current working directory in case it is a wide string.
  • The default encoding assumed so far on *nixes is utf8. A more robust implementation could obtain it from the locale, but that would be too big a change, as all the locale code would have to be revisited too.

Re: Better management of encoding of environment variables

Stephan Eggermont-3
In reply to this post by Guillermo Polito
What is the conclusion from this and issue 22658? See PR 2238. #getEnv: is
public API

Stephan




Re: Better management of encoding of environment variables

Guillermo Polito
Hi Stephan,

I'm sorry for the noise.

At the time, both #at: and #getEnv: variants existed. The changes backported from the PharoLauncher only used the getter versions of getEnv, but for Pharo I decided to implement the setter versions as well. After checking the code and its users in the image, I finally decided to go for an at:[[ifAbsent]put:] version. So I'd say that the leading **guideline** was in the end the one here on the mailing list, but if you check the PR you will see that I've introduced a more complete and consistent API, following that of dictionaries.

https://github.com/pharo-project/pharo/pull/1980/files

at:
at:ifAbsent:
at:ifPresent:
at:ifPresent:ifAbsent:
at:put:
removeKey:

Plus, in *nix, variants where an encoding can be specified.
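
A small sketch of what "following the one of dictionaries" buys in practice: code written against the selectors listed above does not care whether it talks to a Dictionary or to the environment. The names `describe`, `fake`, `DEMO_HOME` and `DEMO_MISSING` are invented for the example.

```smalltalk
| describe fake fromDictionary fromMissing |
"The same block works with anything that understands at:ifPresent:ifAbsent:."
describe := [ :store :name |
	store
		at: name
		ifPresent: [ :value | name , '=' , value ]
		ifAbsent: [ name , ' is not set' ] ].

"A plain Dictionary standing in for the environment."
fake := Dictionary new.
fake at: 'DEMO_HOME' put: '/home/demo'.
fromDictionary := describe value: fake value: 'DEMO_HOME'.
fromMissing := describe value: fake value: 'DEMO_MISSING'.

"With the API listed above, the real environment can be passed in the same way."
describe value: OSEnvironment current value: 'HOME'.
```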

I'm sorry if I've introduced some confusion.



Re: Better management of encoding of environment variables

Nicolas Cellier
IMO, the Windows VM (and plugins) should do the UCS2 -> UTF8 conversion, because the purpose of a VM is to provide an OS-independent façade.
I made progress recently in this area, but we should finish the job/test/consolidate.
If someone bypasses the VM and uses the Windows API directly through FFI, then they take the responsibility, but uniformity doesn't hurt.


Re: Better management of encoding of environment variables

Stephan Eggermont-3
In reply to this post by Guillermo Polito
Guillermo Polito <[hidden email]>
wrote:
> Hi Stephan,
>
> I'm sorry for the noise.

No problem, just trying to understand where we want to go.

Stephan





Re: Better management of encoding of environment variables

Stephan Eggermont-3
In reply to this post by Nicolas Cellier
Nicolas Cellier
<[hidden email]> wrote:
> IMO, the Windows VM (and plugins) should do the UCS2 -> UTF8 conversion, because
> the purpose of a VM is to provide an OS-independent façade.
> I made progress recently in this area, but we should finish the
> job/test/consolidate.
> If someone bypasses the VM and uses the Windows API directly through FFI, then
> they take the responsibility, but uniformity doesn't hurt.

That sounds like a very good idea. Do you suggest doing that after the
Pharo 7 release? Or is it simple enough that it can be done in time? On the
unix side, do I need the explicit encoding or can I ask the OSEnvironment
for the one I need?

Before the Pharo 7 release we need at least #getEnv: back and a class
comment corresponding to what is expected. If we want to change to the new
API, #getEnv: needs to be deprecated.

Stephan



Re: Better management of encoding of environment variables

Guillermo Polito
In reply to this post by Nicolas Cellier
Hi Nicolas,

On Wed, Jan 16, 2019 at 10:25 AM Nicolas Cellier <[hidden email]> wrote:
> IMO, the Windows VM (and plugins) should do the UCS2 -> UTF8 conversion, because the purpose of a VM is to provide an OS-independent façade.
> I made progress recently in this area, but we should finish the job/test/consolidate.

I'm following your changes for Windows from the shadows and I think they are awesome :).

> If someone bypasses the VM and uses the Windows API directly through FFI, then they take the responsibility, but uniformity doesn't hurt.

So far we are using FFI for this: as you say, we first create Win32WideStrings from utf8 strings and then use FFI calls to the *W functions.
I don't think we can make it for Pharo7.0.0. The cycle to build, do some acceptance tests, and then bless a new VM as stable is far too long for our imminent release :).

But this could be for a 7.1.0, and if you like I can surely give a hand with this.

Guille

Re: Better management of encoding of environment variables

Sven Van Caekenberghe-2
Still, one of the conclusions of previous discussions about the encoding of environment variables was, and is, that there is no single correct solution. OSes are not consistent in how the encoding is done in all (historical) contexts (sometimes one env var defines the encoding to use for others, different applications do different things, and other such nice stuff), and certainly not across platforms.

So this is really complex.

Do we want to hide this in some obscure VM C code that very few people can see or read, let alone help with?

The image side is perfectly capable of dealing with platform differences in a clean/clear way, and at least we can then use the full power of our language and our tools.


Purpose of VM [was: Re: Better management of encoding of environment variables]

Martin McClure-2
In reply to this post by Nicolas Cellier
On 1/16/19 1:24 AM, Nicolas Cellier wrote:
> IMO, the Windows VM (and plugins) should do the UCS2 -> UTF8 conversion,
> because the purpose of a VM is to provide an OS-independent façade.

I have not looked at this particular problem in detail, so I have no
opinion on whether the VM is the right place for this particular
functionality.

However, I feel that in general trying to put everything that might be
OS-specific into the VM is not the best design. To me, the purpose of a
Smalltalk VM is to present an object-oriented abstraction of the
underlying machine.

Thinking that way leads me to believe that the following are examples of
things that are good for a VM to do:

* Memory is garbage-collected objects, not bytes.

* Instructions are bytecodes, not underlying machine instructions.

This works well to hide the differences between machine instruction
sets, memory access, and other low-level things. However, no Smalltalk
implementation that I know of has been able to use the VM to iron out
all differences between different OSes.

I do believe that it is a good idea to have cleanly-designed layers of
the system, and that there should be an OS-independent layer and an
OS-dependent layer with clean separation. But I think it might be better
to put most of the OS-dependent layer in the image rather than in the
VM. For one thing, the image is easier to change if there is a bug, or a
lacking feature, or you're trying to support a new OS.

And if it's in the image you get to do the programming in Smalltalk
rather than C or Slang, which is more fun for most of us. And, let's
face it, fun is an important metric in an open-source project -- things
that are fun are much more likely to get done.

Regards,

-Martin



Re: Purpose of VM [was: Re: Better management of encoding of environment variables]

Nicolas Cellier
To be clear, I am not arguing for putting everything in the VM. I prefer a lean VM.
I'm cleaning up what already exists in the VM.
If something is in the VM, then it should behave as we expect from a VM: uniformly.
A small fix in the existing code base is more efficient than a full rewrite.
But if a full rewrite is wanted for other reasons, no problem.

On Thu, 17 Jan 2019 at 02:00, Martin McClure <[hidden email]> wrote:
> On 1/16/19 1:24 AM, Nicolas Cellier wrote:
> > IMO, the Windows VM (and plugins) should do the UCS2 -> UTF8 conversion,
> > because the purpose of a VM is to provide an OS-independent façade.
>
> I have not looked at this particular problem in detail, so I have no
> opinion on whether the VM is the right place for this particular
> functionality.

It is not in the VM currently, only in OSProcessPlugin (and its Windows variant).
If some other plugins depended on environment variables, then it might be interesting to provide this feature as a VM service.
I don't know if that is the case.
Also, some platforms could have other strategies, like querying the Registry on Windows, a configuration file, etc.
So we might provide a service for the basic or multi-level policy.

> However, I feel that in general trying to put everything that might be
> OS-specific into the VM is not the best design. To me, the purpose of a
> Smalltalk VM is to present an object-oriented abstraction of the
> underlying machine.
>
> Thinking that way leads me to believe that the following are examples of
> things that are good for a VM to do:
>
> * Memory is garbage-collected objects, not bytes.
>
> * Instructions are bytecodes, not underlying machine instructions.
>
> This works well to hide the differences between machine instruction
> sets, memory access, and other low-level things. However, no Smalltalk
> implementation that I know of has been able to use the VM to iron out
> all differences between different OSes.

Files? Sockets?
Until the threaded FFI is consolidated, there is a category of async algorithms that we cannot easily program at image side.

> I do believe that it is a good idea to have cleanly-designed layers of
> the system, and that there should be an OS-independent layer and an
> OS-dependent layer with clean separation. But I think it might be better
> to put most of the OS-dependent layer in the image rather than in the
> VM. For one thing, the image is easier to change if there is a bug, or a
> lacking feature, or you're trying to support a new OS.

For supporting a new OS, I'm not sure.
Going through "edit on a known platform, save, resume on the new platform, crash" cycles is no pleasure.
Well, for certain people, the pleasure can be proportional to the height of the hurdle ;)

You must have the bare minimum services (GUI) running before the pleasure comes back.
A Smalltalk without an IDE is not superior to C with a good IDE.
Trial and error is not superior to a good debugger.

> And if it's in the image you get to do the programming in Smalltalk
> rather than C or Slang, which is more fun for most of us. And, let's
> face it, fun is an important metric in an open-source project -- things
> that are fun are much more likely to get done.
>
> Regards,
>
> -Martin

For example, the first time I wrote something like Smallapack, there was no choice: it was through user-defined primitives (st80 or Objectworks).
So I had to write a wrapper in C for each function exposed, with all the marshalling of arguments in C!
When DLLCC came along, it became more fun.

But this does not apply to every library.
If the library depends on tons of preprocessor definitions and macros, then we currently lack the tools to do an efficient job on the image side.
Some possible tools have been sketched on these mailing lists, but so far they are virtual.



Re: Better management of encoding of environment variables

Sven Van Caekenberghe-2
In reply to this post by Sven Van Caekenberghe-2


> On 16 Jan 2019, at 23:23, Eliot Miranda <[hidden email]> wrote:
>
> Hi Sven,
>
> On Wed, Jan 16, 2019 at 2:37 AM Sven Van Caekenberghe <[hidden email]> wrote:
> Still, one of the conclusions of previous discussions about the encoding of environment variables was/is that there is no single correct solution. OS's are not consistent in how the encoding is done in all (historical) contexts (like sometimes, 1 env var defines the encoding to use for others, different applications do different things, and other such nice stuff), and certainly not across platforms.
>
> So this is really complex.
>
> Do we want to hide this in some obscure VM C code that very few people can see, read, let alone help with ?
>
> The image side is perfectly capable of dealing with platform differences in a clean/clear way, and at least we can then use the full power of our language and our tools.
>
> Agreed.  At the same time I think it is very important that we don't rely on the FFI for environment variable access.  This is a basic cross-platform facility.  So I would like to see the environment accessed through primitives, but have the image place interpretation on the result of the primitive(s), and have the primitive(s) answer a raw result, just a sequence of uninterpreted bytes.

OK, I can understand that ENV VAR access is more fundamental than FFI (although FFI is already essential for Pharo, also during startup).

> VisualWorks takes this approach and provides a class UninterpretedBytes that the VM is aware of.  That's always seemed like an ugly name and overkill to me.  I would just use ByteArray and provide image level conversion from ByteArray to String, which is what I believe we have anyway.

Right, bytes are always uninterpreted, else they would be something else.
We have ByteArray>>#decodedWith: and ByteArray>>#utf8Decoded, and our ByteArray inspector decodes automatically if it can.
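
As a minimal playground sketch of that image-side conversion: the byte values below are made-up sample data standing in for whatever a raw primitive might answer, and the 'iso-8859-1' identifier is assumed to be one the image recognises.

```smalltalk
"Sample raw bytes: the UTF-8 encoding of 'café' (hypothetical primitive result)."
| raw viaUtf8 viaLatin1 |
raw := #[99 97 102 195 169].

"Interpret the bytes as UTF-8."
viaUtf8 := raw utf8Decoded.

"Interpret the very same bytes as Latin-1 instead."
viaLatin1 := raw decodedWith: 'iso-8859-1'.

viaUtf8 size.    "4 characters"
viaLatin1 size.  "5 characters, one per byte"
```

The same bytes give two different strings, which is why the choice of encoding has to be made explicitly somewhere.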

> > On 16 Jan 2019, at 10:59, Guillermo Polito <[hidden email]> wrote:
> >
> > Hi Nicolas,
> >
> > On Wed, Jan 16, 2019 at 10:25 AM Nicolas Cellier <[hidden email]> wrote:
> > IMO, the Windows VM (and plugins) should do the UCS2 -> UTF8 conversion because the purpose of a VM is to provide an OS-independent façade.
> > I made progress recently in this area, but we should finish the job/test/consolidate.
> >
> > I'm following your changes for windows from the shadows and I think they are awesome :).
> >  
> > If someone bypasses the VM and uses the Windows API directly through FFI, then they take the responsibility, but uniformity doesn't hurt.
> >
> >  So far we are using FFI for this: as you say, we first create Win32WideStrings from UTF-8 strings and then use FFI calls to the *W functions.
> > I don't think we can make it for Pharo7.0.0. The cycle to build, do some acceptance tests, and then bless a new VM as stable is far too long for our imminent release :).
> >
> > But this could be for a 7.1.0, and if you like I can surely give a hand on this.
> >
> > Guille
>
>
>
>
> --
> _,,,^..^,,,_
> best, Eliot



Re: Purpose of VM [was: Re: Better management of encoding of environment variables]

Sven Van Caekenberghe-2
In reply to this post by Martin McClure-2


> On 17 Jan 2019, at 02:00, Martin McClure <[hidden email]> wrote:
>
> On 1/16/19 1:24 AM, Nicolas Cellier wrote:
>> IMO, the Windows VM (and plugins) should do the UCS2 -> UTF8 conversion because the purpose of a VM is to provide an OS-independent façade.
>
> I have not looked at this particular problem in detail, so I have no opinion on whether the VM is the right place for this particular functionality.
>
> However, I feel that in general trying to put everything that might be OS-specific into the VM is not the best design. To me, the purpose of a Smalltalk VM is to present an object-oriented abstraction of the underlying machine.
>
> Thinking that way leads me to believe that the following are examples of things that are good for a VM to do:
>
> * Memory is garbage-collected objects, not bytes.
>
> * Instructions are bytecodes, not underlying machine instructions.
>
> This works well to hide the differences between machine instruction sets, memory access, and other low-level things. However, no Smalltalk implementation that I know of has been able to use the VM to iron out all differences between different OSes.
>
> I do believe that it is a good idea to have cleanly-designed layers of the system, and that there should be an OS-independent layer and an OS-dependent layer with clean separation. But I think it might be better to put most of the OS-dependent layer in the image rather than in the VM. For one thing, the image is easier to change if there is a bug, or a lacking feature, or you're trying to support a new OS.
>
> And if it's in the image you get to do the programming in Smalltalk rather than C or Slang, which is more fun for most of us. And, let's face it, fun is an important metric in an open-source project -- things that are fun are much more likely to get done.

+100

> Regards,
>
> -Martin
>
>



Re: Purpose of VM [was: Re: Better management of encoding of environment variables]

Eliot Miranda-2


On Thu, Jan 17, 2019 at 8:02 AM Sven Van Caekenberghe <[hidden email]> wrote:

> On 17 Jan 2019, at 02:00, Martin McClure <[hidden email]> wrote:
>
> On 1/16/19 1:24 AM, Nicolas Cellier wrote:
>> IMO, the Windows VM (and plugins) should do the UCS2 -> UTF8 conversion because the purpose of a VM is to provide an OS-independent façade.
>
> I have not looked at this particular problem in detail, so I have no opinion on whether the VM is the right place for this particular functionality.
>
> However, I feel that in general trying to put everything that might be OS-specific into the VM is not the best design. To me, the purpose of a Smalltalk VM is to present an object-oriented abstraction of the underlying machine.
>
> Thinking that way leads me to believe that the following are examples of things that are good for a VM to do:
>
> * Memory is garbage-collected objects, not bytes.
>
> * Instructions are bytecodes, not underlying machine instructions.
>
> This works well to hide the differences between machine instruction sets, memory access, and other low-level things. However, no Smalltalk implementation that I know of has been able to use the VM to iron out all differences between different OSes.
>
> I do believe that it is a good idea to have cleanly-designed layers of the system, and that there should be an OS-independent layer and an OS-dependent layer with clean separation. But I think it might be better to put most of the OS-dependent layer in the image rather than in the VM. For one thing, the image is easier to change if there is a bug, or a lacking feature, or you're trying to support a new OS.
>
> And if it's in the image you get to do the programming in Smalltalk rather than C or Slang, which is more fun for most of us. And, let's face it, fun is an important metric in an open-source project -- things that are fun are much more likely to get done.

+100

The VM *is* developed in Smalltalk
 
> Regards,
>
> -Martin
>
>




--
_,,,^..^,,,_
best, Eliot

Re: Better management of encoding of environment variables

Pharo Smalltalk Developers mailing list
In reply to this post by Sven Van Caekenberghe-2
On Thu, Jan 17, 2019 at 04:57:18PM +0100, Sven Van Caekenberghe wrote:

>
> > On 16 Jan 2019, at 23:23, Eliot Miranda <[hidden email]> wrote:
> >
> > On Wed, Jan 16, 2019 at 2:37 AM Sven Van Caekenberghe <[hidden email]> wrote:
> >
> > The image side is perfectly capable of dealing with platform differences
> > in a clean/clear way, and at least we can then use the full power of our
> > language and our tools.
> >
> Agreed.  At the same time I think it is very important that we don't rely
> on the FFI for environment variable access.  This is a basic cross-platform
> facility.  So I would like to see the environment accessed through primitives,
> but have the image place interpretation on the result of the primitive(s),
> and have the primitive(s) answer a raw result, just a sequence of uninterpreted
>  bytes.
>
> OK, I can understand that ENV VAR access is more fundamental than FFI
> (although FFI is already essential for Pharo, also during startup).
>
> > VisualWorks takes this approach and provides a class UninterpretedBytes
> > that the VM is aware of.  That's always seemed like an ugly name and
> > overkill to me.  I would just use ByteArray and provide image level
> > conversion from ByteArray to String, which is what I believe we have anyway.
>
> Right, bytes are always uninterpreted, else they would be something else.
> We got ByteArray>>#decodedWith: and ByteArray>>#utf8Decoded and our ByteArray
>  inspector decodes automatically if it can.
>

Hi Sven,

I am the author of the getenv primitives, and I am also sadly uninformed
about matters of character sets and strings in a multilingual environment.

The primitives answer environment variable values as ByteString
rather than ByteArray. This made sense to me at the time that I wrote it,
because ByteString is easy to display in an inspector, and because it is
easily converted to ByteArray.

For an American English speaker this seems like a good choice, but I
wonder now if it is a bad decision. After all, it is also trivially easy
to convert a ByteArray to ByteString for display in the image.

Would it be helpful to have getenv primitives that answer ByteArray
instead, and to let all conversion (including in OSProcess) be done in
the image?

Thanks,
Dave
 


Re: Better management of encoding of environment variables

Guillermo Polito
On Fri, Jan 18, 2019 at 1:58 AM David T. Lewis via Pharo-dev <[hidden email]> wrote:
On Thu, Jan 17, 2019 at 04:57:18PM +0100, Sven Van Caekenberghe wrote:
>
> > On 16 Jan 2019, at 23:23, Eliot Miranda <[hidden email]> wrote:
> >
> > On Wed, Jan 16, 2019 at 2:37 AM Sven Van Caekenberghe <[hidden email]> wrote:
> >
> > The image side is perfectly capable of dealing with platform differences
> > in a clean/clear way, and at least we can then use the full power of our
> > language and our tools.
> >
> Agreed. 

+1

> At the same time I think it is very important that we don't rely
> on the FFI for environment variable access.  This is a basic cross-platform
> facility.  So I would like to see the environment accessed through primitives,
> but have the image place interpretation on the result of the primitive(s),
> and have the primitive(s) answer a raw result, just a sequence of uninterpreted
>  bytes.

Having looked at it not so long ago, I'll add my 2cts.

Environment access is a very particular scenario.
We have in Pharo many startup actions that directly or indirectly (FileLocator home?) require environment variable access, and thus we have to be really careful and picky to make sure that they all work, dependencies are installed in the right order and so on...

In Pharo6 this was especially difficult because FFI was dynamically compiling methods,
  => which required access to argument names,
     => which required access to the sources files,
       => which required access to the env vars (because in Pharo the source/changes files are looked up in other directories than the image/vm ones)
          => which loops :)

In Pharo7 argument names in FFI calls are embedded in the method meta-data so all that is avoided.

Still I'd agree that moving this support to a primitive would make it less fragile.
I'd apply the same to getting/setting the working directory.

>
> OK, I can understand that ENV VAR access is more fundamental than FFI
> (although FFI is already essential for Pharo, also during startup).
>
> > VisualWorks takes this approach and provides a class UninterpretedBytes
> > that the VM is aware of.  That's always seemed like an ugly name and
> > overkill to me.  I would just use ByteArray and provide image level
> > conversion from ByteArray to String, which is what I believe we have anyway.
>
> Right, bytes are always uninterpreted, else they would be something else.
> We got ByteArray>>#decodedWith: and ByteArray>>#utf8Decoded and our ByteArray
>  inspector decodes automatically if it can.
>

Hi Sven,

I am the author of the getenv primitives, and I am also sadly uninformed
about matters of character sets and strings in a multilingual environment.

The primitives answer environment variable values as ByteString
rather than ByteArray. This made sense to me at the time that I wrote it,
because ByteString is easy to display in an inspector, and because it is
easily converted to ByteArray.

For an American English speaker this seems like a good choice, but I
wonder now if it is a bad decision.

Well, as soon as you want to manage some internationalisation, indeed it is.
But it is also a source of bugs, because assuming ASCII is not right for English either.
Most platforms assume UTF-8 by default, and the two do not agree on many symbols :).

For example,

Character allByteCharacters size. => 256
Character allByteCharacters utf8Encoded size. => 384

Character allByteCharacters select: [ :c |
c asString utf8Encoded size > 1 ].

'€ ‚ƒ„…†‡ˆ‰Š‹Œ Ž ‘’“”•–—˜™š›œ žŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
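
The 384 above is simply 128 one-byte characters plus 128 two-byte ones. A quick playground sketch to check that breakdown (the variable names are illustrative only):

```smalltalk
| all single double |
all := Character allByteCharacters.

"Code points 0..127 encode to one UTF-8 byte each."
single := all count: [ :c | c asString utf8Encoded size = 1 ].

"Code points 128..255 need two bytes each."
double := all count: [ :c | c asString utf8Encoded size = 2 ].

single + (double * 2).  "128 + (128 * 2) = 384"
```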

Of course, many of those characters may not be used day-to-day by many people, but sooner or later we run into one of them (I'm thinking about the not-so-strange case of a database storing names :)).
Also think about the poor Windows users (like myself since 2 weeks ago), who have to think about UTF-16!

BTW, I hope I'm not breaking anybody's mail client by pasting strange characters here :D (and if so, you may want to suggest that they review how they manage encoding :))

After all, it is also trivially easy
to convert a ByteArray to ByteString for display in the image.

Yes, but it's sometimes difficult to find such places, as there are many primitives spread all over doing the wrong thing, which is a source of bugs...
I'd like to fix it at the root; the question is how to do it without breaking things ^^.
In Pharo we do this in many places:

self primitiveXXX asByteArray utf8Decoded

So making the primitives return ByteArray instances instead of ByteString should be safe enough :).
But in my opinion this is clearly a hack rather than a fix for the real problem, and we have to be careful to guard such patterns with comments everywhere explaining why the ByteArray conversion is really needed there...
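
To spell out why that conversion is needed, here is a small playground sketch. The two bytes are sample data standing in for a primitive's result (they are the UTF-8 encoding of 'é'); no real primitive is called.

```smalltalk
| disguised |
"Simulate a primitive that answers UTF-8 bytes wrapped in a ByteString."
disguised := #[195 169] asString.

disguised size.                     "2: each byte shows up as its own character"
disguised asByteArray utf8Decoded.  "a 1-character string: 'é'"
```

Without the asByteArray step the image would happily treat the two bytes as two characters, which is exactly the kind of silent bug described above.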

 
Would it be helpful to have getenv primitives that answer ByteArray
instead, and to let all conversion (including in OSProcess) be done in
the image?

Well, personally I would like getenv/setenv and getcwd/setcwd support to live not in a plugin but in a basic service provided by the VM.

Cheers,
Guille
 

Thanks,
Dave




--

   

Guille Polito

Research Engineer

Centre de Recherche en Informatique, Signal et Automatique de Lille

CRIStAL - UMR 9189

French National Center for Scientific Research - http://www.cnrs.fr


Web: http://guillep.github.io

Phone: +33 06 52 70 66 13


Re: Purpose of VM [was: Re: Better management of encoding of environment variables]

Pharo Smalltalk Developers mailing list
In reply to this post by Eliot Miranda-2

> 
> And if it's in the image you get to do the programming in Smalltalk rather than C or Slang, which is more fun for most of us. And, let's face it, fun is an important metric in an open-source project -- things that are fun are much more likely to get done.

+100

The VM *is* developed in Smalltalk

That is not the point of Martin's message. I imagine that Martin and Sven understand perfectly well that the VM is written in Slang and that there
is a simulator. Still, many of us agree with their analysis. The VM should focus on execution and delegate most of the rest to the image.

Stef

Re: Better management of encoding of environment variables

Pharo Smalltalk Developers mailing list
In reply to this post by Guillermo Polito

So making the primitives return ByteArray instances instead of ByteString should be safe enough :).
But in my opinion this is clearly a hack rather than a fix for the real problem, and we have to be careful to guard such patterns with comments everywhere explaining why the ByteArray conversion is really needed there…

Guillermo, what is the correct way to do it?

Would it be helpful to have getenv primitives that answer ByteArray
instead, and to let all conversion (including in OSProcess) be done in
the image?

Well, personally I would like getenv/setenv and getcwd/setcwd support to live not in a plugin but in a basic service provided by the VM.

Cheers,
Guille
 

Thanks,
Dave




