Dave,

> On 18 Jan 2019, at 01:54, David T. Lewis via Pharo-dev <[hidden email]> wrote:
>
>
> From: "David T. Lewis" <[hidden email]>
> Subject: Re: [Pharo-dev] Better management of encoding of environment variables
> Date: 18 January 2019 at 01:54:34 GMT+1
> To: Pharo Development List <[hidden email]>
>
>
> On Thu, Jan 17, 2019 at 04:57:18PM +0100, Sven Van Caekenberghe wrote:
>>
>>> On 16 Jan 2019, at 23:23, Eliot Miranda <[hidden email]> wrote:
>>>
>>> On Wed, Jan 16, 2019 at 2:37 AM Sven Van Caekenberghe <[hidden email]> wrote:
>>>
>>> The image side is perfectly capable of dealing with platform differences
>>> in a clean/clear way, and at least we can then use the full power of our
>>> language and our tools.
>>>
>> Agreed. At the same time I think it is very important that we don't rely
>> on the FFI for environment variable access. This is a basic cross-platform
>> facility. So I would like to see the environment accessed through primitives,
>> but have the image place interpretation on the result of the primitive(s),
>> and have the primitive(s) answer a raw result, just a sequence of uninterpreted
>> bytes.
>>
>> OK, I can understand that ENV VAR access is more fundamental than FFI
>> (although FFI is already essential for Pharo, also during startup).
>>
>>> VisualWorks takes this approach and provides a class UninterpretedBytes
>>> that the VM is aware of. That's always seemed like an ugly name and
>>> overkill to me. I would just use ByteArray and provide image level
>>> conversion from ByteArray to String, which is what I believe we have anyway.
>>
>> Right, bytes are always uninterpreted, else they would be something else.
>> We got ByteArray>>#decodedWith: and ByteArray>>#utf8Decoded and our ByteArray
>> inspector decodes automatically if it can.
>>
>
> Hi Sven,
>
> I am the author of the getenv primitives, and I am also sadly uninformed
> about matters of character sets and strings in a multilingual environment.
>
> The primitives answer environment variable values as ByteString
> rather than ByteArray. This made sense to me at the time that I wrote it,
> because ByteString is easy to display in an inspector, and because it is
> easily converted to ByteArray.
>
> For an American English speaker this seems like a good choice, but I
> wonder now if it is a bad decision. After all, it is also trivially easy
> to convert a ByteArray to ByteString for display in the image.
>
> Would it be helpful to have getenv primitives that answer ByteArray
> instead, and to let all conversion (including in OSProcess) be done in
> the image?
>
> Thanks,
> Dave

Normally, the correct way to represent uninterpreted bytes is with a ByteArray. Decoding these bytes as characters is the specific task of a character encoder/decoder, with a deliberate choice as to which to use.

Since the getenv() system call uses simple C strings, it is understandable that this was carried over. It is probably not worth changing that, or too risky, as long as the receiver understands that it is a raw OS string that needs more work.

Like with file path encoding/decoding, environment variable encoding/decoding is plain messy and complex. IMHO it is better to manage that at the image level where we are more agile and can better handle that complexity.

Sven

BTW: using funny Unicode chars, like 🎈 [https://www.fileformat.info/info/unicode/char/1f388/index.htm] is something even English speakers do.
|
In reply to this post by Sven Van Caekenberghe-2
On Wed, 16 Jan 2019 at 18:37, Sven Van Caekenberghe <[hidden email]> wrote:

> Still, one of the conclusions of previous discussions about the encoding of environment variables was/is that there is no single correct solution. OS's are not consistent in how the encoding is done in all (historical) contexts (like sometimes, 1 env var defines the encoding to use for others,

ouch. That one point nearly made me retract my comment in the next paragraph, but is there much more complexity? Or is it just a case of utf8<==>appSpecificEncoding rather than ascii<==>appSpecificEncoding?

Sorry if I'm rehashing past discussion (do you have a link?), but considering...
* 92% of web pages are UTF8 encoded[1] such that pragmatically UTF8 *is* the standard for text
* Strings are so pervasive in a system

...would there be an overall benefit to adopt UTF8 as the encoding for Strings consistently provided across the cross-platform vm interface? (i.e. fixing platforms that don't comply to the standard due to their historical baggage)

And I found it interesting Microsoft are making some moves towards UTF8 [2]...
"With insider build 17035 and the April 2018 update (nominal build 17134) for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox appeared for setting the locale code page to UTF-8. This allows for calling "narrow" functions, including fopen and SetWindowTextA, with UTF-8 strings."

The approach vm-side could be similar to Section 10 How to do text on Windows [3] with the philosophy of "performing the [conversions] as close to API calls as possible, and never holding the [converted] data."

> different applications do different things, and other such nice stuff), and certainly not across platforms.

Big question... Do we currently have primitives of the same name returning different encodings on different platforms? I presume that would be awkward. If the image is to handle encoding differences, should separate primitives be used? e.g. utf8GetEnv & utf16getEnv

Could I get some feedback on [4] saying...
"**The Single Most Important Fact About Encodings**
If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses."

And so... does our String nowadays require an 'encoding' instance variable such that this is *always* associated? This might remove any need for separate utf8GetEnv & utf16getEnv (if that was even a reasonable idea).

cheers -ben
|
On Fri, Jan 18, 2019 at 1:48 PM Ben Coman via Pharo-dev <[hidden email]> wrote:
It's not muuuuch more complex. The problem is that usually the bugs that arise from wrongly managing such conversions can be super obscure.
I think that will just overcomplicate things. Right now, all Strings in Pharo are Unicode strings. Characters are represented with their corresponding Unicode code point. If all characters in a string have code points < 256, then they are just stored in a ByteString. Otherwise they are stored in a WideString. I think assuming a single representation for strings, and then encoding when interacting with external apps/APIs, is MUCH simpler.
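
As an aside, the storage rule described above can be sketched outside Pharo. The following Python snippet is only an illustration under stated assumptions: `storage_class_for` is a made-up helper, and "ByteString" and "WideString" are returned as plain labels rather than real Pharo classes. It simply checks whether every code point in a string is below 256.

```python
def storage_class_for(text: str) -> str:
    """Label the Pharo-style representation that could hold `text`.

    Hypothetical helper for illustration only; it does not talk to Pharo.
    """
    # ord() answers the Unicode code point of a single character.
    if all(ord(ch) < 256 for ch in text):
        return "ByteString"   # every code point fits in one byte
    return "WideString"       # at least one code point needs more room


if __name__ == "__main__":
    for sample in ["PATH=/usr/bin", "caf\u00e9", "party \U0001F388"]:
        print(repr(sample), "->", storage_class_for(sample))
```

Note that the label says nothing about an encoding: "café" fits in a ByteString because U+00E9 is below 256, even though its UTF-8 form takes two bytes for that character.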
|
In reply to this post by Sven Van Caekenberghe-2
Hi Nicolas, I'm reading and trying to understand, but the xxx lost me. :)
|
On Fri, 18 Jan 2019 at 14:35, ducasse <[hidden email]> wrote:
Sorry, I was talking about the Windows API variants: W for wide characters, A for ANSI (or rather, the code page currently in effect).
|
In reply to this post by Guillermo Polito
> On 18 Jan 2019, at 14:23, Guillermo Polito <[hidden email]> wrote:
>
>
> I think that will just overcomplicate things. Right now, all Strings in Pharo are Unicode strings. Characters are represented with their corresponding Unicode code point.
> If all characters in a string have code points < 256, then they are just stored in a ByteString. Otherwise they are stored in a WideString.
>
> I think assuming a single representation for strings, and then encoding when interacting with external apps/APIs, is MUCH simpler.

Absolutely!

(And yes, I know that for outgoing FFI calls that might mean a UTF-8 encoding step, so be it.)
|
On Fri, 18 Jan 2019 at 21:39, Sven Van Caekenberghe <[hidden email]> wrote:
Cool. I didn't realise that. But to be pedantic, which Unicode encoding?
Should I presume from Sven's "UTF-8 encoding step" comment below
and the WideString class comment "This class represents the array of 32 bit wide characters"
that the WideString encoding is UTF-32? So should its comment be updated to advise that?

cheers -ben

> Characters are represented with their corresponding Unicode code point.
|
In reply to this post by Sven Van Caekenberghe-2
On Fri, Jan 18, 2019 at 01:40:26PM +0100, Sven Van Caekenberghe wrote:
> Dave,
>
> [snip]
>
> Normally, the correct way to represent uninterpreted bytes is with a ByteArray. Decoding these bytes as characters is the specific task of a character encoder/decoder, with a deliberate choice as to which to use.
>
> Since the getenv() system call uses simple C strings, it is understandable that this was carried over. It is probably not worth changing that, or too risky, as long as the receiver understands that it is a raw OS string that needs more work.
>
> Like with file path encoding/decoding, environment variable encoding/decoding is plain messy and complex. IMHO it is better to manage that at the image level where we are more agile and can better handle that complexity.
>

Thanks Sven, that makes perfect sense to me.

>
> BTW: using funny Unicode chars, like 🎈 [https://www.fileformat.info/info/unicode/char/1f388/index.htm] is something even English speakers do.
>

You are right. I wrote those getenv primitives 20 years ago and back then we were still doing our emoticons like this: ;-)

Thanks,
Dave
|
In reply to this post by Ben Coman
On Fri, Jan 18, 2019 at 2:46 PM Ben Coman <[hidden email]> wrote:
None :D That's the funny thing, they are not encoded. Actually, you should see Strings as collections of Characters, and Characters defined in terms of their abstract code points. ByteStrings are an optimized (just more compact) version that stores codepoints that fit in a byte.
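
The difference between abstract code points and an encoding can be shown in a few lines. This is a sketch in Python rather than Pharo, and the variable names are arbitrary; it only illustrates that one sequence of code points maps to different byte sequences depending on the encoder chosen, including the byte order for UTF-32.

```python
text = "a\U0001F388"  # the letter a followed by the balloon character U+1F388

# The abstract view: a sequence of code points, with no bytes involved yet.
code_points = [ord(ch) for ch in text]

# An encoding projects those code points onto bytes.
# Each encoder answers a different byte sequence for the same string.
utf8 = text.encode("utf-8")
utf32_le = text.encode("utf-32-le")  # little-endian, no byte order mark
utf32_be = text.encode("utf-32-be")  # big-endian, no byte order mark

if __name__ == "__main__":
    print([hex(cp) for cp in code_points])
    print(utf8.hex(" "))
    print(utf32_le.hex(" "))
    print(utf32_be.hex(" "))
    # Decoding with the matching decoder gives the original string back.
    print(utf8.decode("utf-8") == utf32_be.decode("utf-32-be") == text)
```

The string itself never changes; only the byte projection does, which is why the encoder has to be chosen deliberately at the boundary with the outside world.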
|
In reply to this post by Ben Coman
> On 18 Jan 2019, at 14:45, Ben Coman <[hidden email]> wrote:
>
> On Fri, 18 Jan 2019 at 21:39, Sven Van Caekenberghe <[hidden email]> wrote:
> > > On 18 Jan 2019, at 14:23, Guillermo Polito <[hidden email]> wrote:
> > >
> > > I think that will just overcomplicate things. Right now, all Strings in Pharo are Unicode strings.
>
> Cool. I didn't realise that. But to be pedantic, which Unicode encoding?
> Should I presume from Sven's "UTF-8 encoding step" comment below
> and the WideString class comment "This class represents the array of 32 bit wide characters"
> that the WideString encoding is UTF-32? So should its comment be updated to advise that?

Not really. Pharo Strings are a collection of Characters, each of which is a Unicode code point (yes, a 32-bit one). An encoding projects this rather abstract notion onto a sequence of bytes; UTF-32 (ZnUTF32Encoder, https://en.wikipedia.org/wiki/UTF-32), for example, is endian dependent.

Read the first part of https://ci.inria.fr/pharo-contribution/job/EnterprisePharoBook/lastSuccessfulBuild/artifact/book-result/Zinc-Encoding-Meta/Zinc-Encoding-Meta.html

> cheers -ben
>
> > Characters are represented with their corresponding Unicode code point.
> > If all characters in a string have code points < 256, then they are just stored in a ByteString. Otherwise they are stored in a WideString.
> >
> > I think assuming a single representation for strings, and then encoding when interacting with external apps/APIs, is MUCH simpler.
>
> Absolutely!
>
> (And yes, I know that for outgoing FFI calls that might mean a UTF-8 encoding step, so be it.)
|
It is so cool to have such cool documentation.
|
In reply to this post by Guillermo Polito
> On Jan 18, 2019, at 2:04 AM, Guillermo Polito <[hidden email]> wrote:

[snip]

>
> Well, personally I would like that getenv/setenv and getcwd setcwd support are not in a plugin but as a basic service provided by the vm.

+1000

> Cheers,
> Guille
|
In reply to this post by Guillermo Polito
Hi Guille,
|
In reply to this post by Nicolas Cellier
Hi Nicolas,
I have a question concerning the Win32 changes.

Motivated by the discussion on Pharo list, I added new primitives to
OSProcessPlugin to answer environment variables and path information as
raw ByteArray, such that those byte arrays can be converted in the image
to strings with any encoding.

I did the changes in "trunk" OSPP, and am now implementing them in the
oscog branch. I have tested the Unix changes in both branches. Unfortunately
I do not have a Windows development environment at the moment, so I need to
ask for help.

In Win32OSProcessPlugin, I am adding two primitives:

  #primitiveGetCurrentWorkingDirectoryAsBytes
  #primitiveGetEnvironmentStringsAsBytes

These are based on the two string primitives, which remain available:

  #primitiveGetCurrentWorkingDirectory
  #primitiveGetEnvironmentStrings

The string primitives include your recent changes for UTF8 encoding, and
the "xxxAsBytes" variants are based on my earlier logic without the UTF8
support.

This means that OSProcess on Windows will use your UTF8 string support,
and the additional primitives (based on the old logic, and answering raw
bytes) will be available for people who want to do the encoding work in
the image.

Does this sound like the right approach?

Thanks,
Dave

On Wed, Jan 16, 2019 at 10:24:51AM +0100, Nicolas Cellier wrote:
> IMO, the Windows VM (and plugins) should do the UCS2 -> UTF8 conversion because
> the purpose of a VM is to provide an OS-independent façade.
> I made progress recently in this area, but we should finish the
> job/test/consolidate.
> If someone bypasses the VM and uses the Windows API directly through FFI, then
> they take the responsibility, but uniformity doesn't hurt.
>
> On Wed, 16 Jan 2019 at 10:14, Guillermo Polito <[hidden email]> wrote:
>
> > Hi Stephan,
> >
> > I'm sorry for the noise.
> >
> > At the time, both #at: and #getEnv: variants existed. The changes
> > backported from the PharoLauncher were only using the getter versions of
> > getEnv, but for Pharo I decided to implement also the setter versions. And
> > after checking the code and its users in image, I've finally decided to go
> > for an at:[[ifAbsent]put:] version. So I'd say that the leading
> > **guideline** was at the end the one here in the mailing list, but also if
> > you check the PR I've introduced a more complete and consistent API,
> > following the one of dictionaries.
> >
> > https://github.com/pharo-project/pharo/pull/1980/files
> >
> > at:
> > at:ifAbsent:
> > at:ifPresent:
> > at:ifPresent:ifAbsent:
> > at:put:
> > removeKey:
> >
> > Plus, in *nix, variants where an encoding can be specified.
> >
> > I'm sorry if I've introduced some confusion.
> >
> >
> > On Wed, Jan 16, 2019 at 9:47 AM Stephan Eggermont <[hidden email]>
> > wrote:
> >
> > > Guillermo Polito <[hidden email]>
> > > wrote:
> > > > Hi all,
> > > >
> > > > following the meeting we had here @Inria headquarters, I'll be backporting
> > > > some of the improvements we did in the launcher this last month regarding
> > > > the encoding of environment variables.
> > > >
> > > > I've opened for this issue https://pharo.fogbugz.com/f/cases/22658/
> > > >
> > > > We have already studied possible alternatives with Pablo and Christophe and
> > > > we have some conclusions and we propose some changes:
> > > >
> > > > API Proposal for OSEnvironment
> > > > =========================
> > > >
> > > > - *at: aVariableName*
> > > >
> > > > Gets the String value of an environment variable called `aVariableName`.
> > > > It is the system's responsibility to manage the encoding.
> > > > Rationale: A common denominator for all platforms providing an already
> > > > decoded string, because windows does not (compared to *nix systems) provide
> > > > an encoded byte representation of the value. Windows has instead its own
> > > > wide string representation.
> > > >
> > > > - *[optionally] rawAt: anEncodedVariableName*
> > > >
> > > > Gets the Byte value of an environment variable called
> > > > `anEncodedVariableName`.
> > > > It is the user's responsibility to encode and decode argument and return
> > > > values in the encoding of this preference.
> > > > Rationale: Some systems may want to have the liberty to use different
> > > > encodings, or even to put binary data in the variables.
> > > >
> > > > - *[optionally] at: aVariableName encoding: anEncoding*
> > > >
> > > > Gets the value of an environment variable called `aVariableName` using
> > > > `anEncoding` to encode/decode arguments and return values.
> > > > Rationale: *xes could potentially use different encodings for their
> > > > environment variables or even use different encodings in different parts of
> > > > their file system.
> > > >
> > > > Other Implementation details
> > > > =========================
> > > >
> > > > - VM primitives returning paths Strings should be carefully managed to
> > > > decode them, since they are actually C strings (so byte arrays) disguised
> > > > as ByteStrings.
> > > > - Windows requires calling the right *Wide version of the functions from
> > > > C, plus the correct encoding routine. This could be implemented as an FFI
> > > > call or by modifying the VM to do it properly instead of calling the Ascii
> > > > version
> > > >
> > >
> > > What is the conclusion from this and issue 22658? See PR 2238. #getEnv: is
> > > public API
> > >
> > > Stephan
> > >
> >
> > --
> >
> > Guille Polito
> > Research Engineer
> > Centre de Recherche en Informatique, Signal et Automatique de Lille
> > CRIStAL - UMR 9189
> > French National Center for Scientific Research - http://www.cnrs.fr
> >
> > Web: http://guillep.github.io
> > Phone: +33 06 52 70 66 13
|
In reply to this post by David T. Lewis
On Fri, Jan 18, 2019 at 08:58:07AM -0500, David T. Lewis wrote:
> On Fri, Jan 18, 2019 at 01:40:26PM +0100, Sven Van Caekenberghe wrote:
> >
> > [snip]
> >
> > Normally, the correct way to represent uninterpreted bytes is with a
> > ByteArray. Decoding these bytes as characters is the specific task of
> > a character encoder/decoder, with a deliberate choice as to which to use.
> >
> > Since the getenv() system call uses simple C strings, it is understandable
> > that this was carried over. It is probably not worth changing that, or too
> > risky, as long as the receiver understands that it is a raw OS
> > string that needs more work.
> >
> > Like with file path encoding/decoding, environment variable encoding/decoding
> > is plain messy and complex. IMHO it is better to manage that at the
> > image level where we are more agile and can better handle that complexity.
> >
>
> Thanks Sven, that makes perfect sense to me.
>

I added some new primitives to OSProcessPlugin that answer ByteArray
instead of ByteString.

For Unix (Linux, OS X):

  <primitive: 'primitiveGetCurrentWorkingDirectoryAsBytes' module: 'UnixOSProcessPlugin'>
  <primitive: 'primitiveArgumentAtAsBytes' module: 'UnixOSProcessPlugin'>
  <primitive: 'primitiveEnvironmentAtAsBytes' module: 'UnixOSProcessPlugin'>
  <primitive: 'primitiveEnvironmentAtSymbolAsBytes' module: 'UnixOSProcessPlugin'>
  <primitive: 'primitiveRealpathAsBytes' module: 'UnixOSProcessPlugin'>

For Windows:

  <primitive: 'primitiveGetCurrentWorkingDirectoryAsBytes' module: 'Win32OSProcessPlugin'>
  <primitive: 'primitiveGetEnvironmentStringsAsBytes' module: 'Win32OSProcessPlugin'>

These should be in the latest VM builds now. If you are using OSProcess,
update it to the latest version to get accessor methods for the new
primitives. For example, the OSProcess accessor primGetCurrentWorkingDirectory
calls the original primitive that answers a ByteString, and to get raw
bytes you can use the OSProcess accessor primGetCurrentWorkingDirectoryAsBytes
instead.

Dave
|
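
To illustrate why raw bytes plus a deliberate decoding step matter, here is a small sketch in Python rather than Pharo. The helper name `decode_env_value` and the sample bytes are assumptions made up for this example; they are not part of OSProcess or of any primitive mentioned in this thread. The same raw value decodes to different strings depending on the decoder chosen, which is exactly the choice that raw-bytes primitives leave to the image.

```python
def decode_env_value(raw: bytes, encoding: str) -> str:
    """Decode a raw environment value with an explicitly chosen encoding.

    Hypothetical helper: the caller, not the low-level accessor, decides
    which decoder applies. Invalid input raises UnicodeDecodeError.
    """
    return raw.decode(encoding)


# Sample raw value, as a bytes-answering accessor might return it.
raw_value = b"caf\xc3\xa9"

as_utf8 = decode_env_value(raw_value, "utf-8")      # two bytes become one character
as_latin1 = decode_env_value(raw_value, "latin-1")  # every byte becomes one character

if __name__ == "__main__":
    print(len(raw_value), "raw bytes")
    print(repr(as_utf8), len(as_utf8), "characters")
    print(repr(as_latin1), len(as_latin1), "characters")
```

For comparison, Python on POSIX systems exposes a similar split: `os.environ` answers decoded strings, while `os.environb` answers the raw bytes.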