Hi -

Due to a bug reported against Qwaq Forums I needed to look into how the Unix VM encodes file and path names and got terribly confused. My test case was to create a file with an umlaut ("Jürgen") and to see what both Squeak and the Unix shell report with varying settings of -pathenc and -textenc.

I started with the assumption that since the file system I was running this on is UTF-8, the default settings (-textenc MacRoman and -pathenc UTF-8) ought to be correct. However, the result was very surprising. The file name was reported incorrectly both in the file list and by the OS: the file list reported "J?" (truncated after the question mark) and the Unix shell reported "J?rgen" but with a "funky ?" (the glyph is hard to describe without a screenshot; it was neither an umlaut nor a regular question mark).

Playing with the settings I could not find any combination that resulted in a consistent representation for all the different views: either the Unix shell was off or Squeak's view was off, no matter how I set those encodings. Can someone explain to me how I need to set these values to get a consistent view on file names both from Squeak and Unix?

Cheers,
  - Andreas
For Sophie we spent hours (days?) getting this correct. However, we really didn't check out the pure Unix variation.

Actually, let's take some other UTF-32 character instead of ü. Let's say LATIN CAPITAL LETTER SCHWA -> UTF-8 0xC68F, UTF-32 0x0000018F

  Ə

In the OS X 10.5.1 Finder we see

  Ə.png

and in a terminal session we see

  -rw-r--r-- 1 johnmci staff 26451 Apr 10 2007 Ə.png

In both cases (just in case you can't see this in the email) the character is visually correct.

Using Squeak 3.10Alpha 7092 with a Mac Carbon VM 3.8.18b1 set to UTF-8, what we see in the Morphic file list is

  ?.png

The ? is 0x3F. Of course it says it can't open the file, because the Smalltalk code (which code is an exercise for the reader) has mangled the 0xC68F into 0x3F.

When I asked about this a few years back I think I was told it converts the VM data to Latin1. The conversion from MacRoman to Latin1 and back is *usually* workable, mind you only if the characters are <= 0xFF. However, UTF-8 to Latin1 usually ends up broken, which is why the Mac Carbon VM is set to MacRoman by default.

Recall that in OS X, HFS Plus converts all file names to decomposed Unicode, while Macintosh keyboards generally produce precomposed Unicode. The Macintosh Carbon VM converts back and forth between precomposed and decomposed Unicode when it is using UTF-8 encoding. This also depends on the file system and how it wants to store Unicode characters...

Now, when we import this into Sophie, the URI that is generated is

  /Users/johnmci/Work%20In%20Progress/squeak%20Bugs/%C6%8F.png

which becomes

  '/Users/johnmci/Work In Progress/squeak Bugs/Æ∑.png'

But the VM ensures the proper thing is done. In Sophie we store all media paths as encoded URI objects, and convert to what is required when we need to access the media.

Oh, and by the way, if you enter

  file:///Users/johnmci/Work%20In%20Progress/squeak%20Bugs/%C6%8F.png

into Firefox, it's happy too. Oddly, when you enter it into Safari it becomes

  file:///Users/johnmci/Work%20In%20Progress/squeak%20Bugs/Ə.png

Oh, and if I take the Ə.png from a terminal session or the Finder window and paste it into a Sophie text field, yes, it's Ə.png, because the extended clipboard support converts it properly from UTF-8. Mind you, from TextEdit it comes across as RTF, which is a different issue, but it is *still* correctly converted into UTF-32 in Sophie.

People are of course welcome to uncover Unicode character issues with Sophie and how it deals with file names or textual data in text fields.

--
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
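
The byte values and the percent-encoded path in the message above can be checked with a minimal Python sketch. Python is not used anywhere in this thread; it is only a convenient way to poke at encodings, and all variable names are illustrative. The "?" result mirrors what a lossy Latin1 conversion would produce; whether the image does exactly this is not established here.

```python
from urllib.parse import quote, unquote

# LATIN CAPITAL LETTER SCHWA (U+018F) followed by ".png"
name = "\u018f.png"

utf8 = name.encode("utf-8")            # bytes as stored in a UTF-8 path
utf32 = "\u018f".encode("utf-32-be")   # a single 32-bit code unit

# quote() percent-encodes the UTF-8 bytes and leaves "/" alone by default.
path = "/Users/johnmci/Work In Progress/squeak Bugs/" + name
uri_path = quote(path)

# Latin1 has no slot for U+018F, so a lossy conversion substitutes "?" (0x3F).
latin1_lossy = name.encode("latin-1", errors="replace")

print(utf8.hex(), utf32.hex())
print(uri_path)
print(latin1_lossy)
```

Decoding goes the other way with `unquote(uri_path)`, which assumes UTF-8 unless told otherwise.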
In reply to this post by Andreas.Raab
On Dec 29, 2007, at 11:32 PM, Andreas Raab wrote:
> Hi -
>
> Due to a bug reported against Qwaq Forums I needed to look into how
> the Unix VM encodes file and path names and got terribly confused.

Also see

From: [hidden email]
Subject: Re: mac carbon VM goes to unix file names, testers needed
Date: February 6, 2006 12:54:17 AM PST (CA)

> We've been using UTF-8 VM encoding for a while, to be able to access
> files with non-ASCII characters in their path. However, I haven't
> found a way to permanently switch the image (3.8) to UTF-8 encoding.
> It keeps resetting to Latin1 on startup. The only solution for me
> was to put this line in my own startup code:
>
> LanguageEnvironment classPool at: #FileNameConverterClass put: UTF8TextConverter
>
> Did anybody create a better solution than this horrible hack? I feel
> someone else must be using UTF-8, too, now that we support it ...
>
> - Bert -

--
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
Oh, interesting. That reminds me of the fix that I needed for Windows, let me see... yes, here it is: LanguageEnvironment class>>defaultFileNameConverter needs to be fixed since it is (wrongly) guessing the file name encoding based on the currently active locale (which makes no sense, by the way, since the locale doesn't mean jack for file name encodings).

Hm ... lemme try this ... ah, interesting. It appears that I can make the umlauts work on Unix correctly if and only if:
* I fix the above method to return UTF8TextConverter in every case [*1]
* I use -pathenc MacRoman -textenc MacRoman
Which makes no sense to me since neither the path nor the text encoding is MacRoman, but it appears to work. Huh?

[*1] And that of course reminds me that nobody has really made any comment on why the hell we still deal with all of these nonsensical legacy encodings and don't just go straight to UTF-8 in the VM interface, which would simplify *lots* of cruft in the code.

Cheers,
  - Andreas
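
Why the choice of file name converter in the image matters can be illustrated with a small Python sketch. It is only an analogy: Python's codecs stand in for Squeak's UTF8TextConverter and for a Latin1 or MacRoman converter, and `decode_file_name` is a hypothetical helper, not anything from the image or the VM.

```python
def decode_file_name(raw: bytes, converter: str) -> str:
    """Decode bytes handed over by the file system using the named codec."""
    return raw.decode(converter)

# What a UTF-8 file system hands back for "Jürgen" (ü is U+00FC).
raw = "J\u00fcrgen".encode("utf-8")

as_utf8 = decode_file_name(raw, "utf-8")          # the intended name
as_latin1 = decode_file_name(raw, "latin-1")      # two wrong characters for ü
as_macroman = decode_file_name(raw, "mac_roman")  # two different wrong characters

for label, value in [("utf-8", as_utf8), ("latin-1", as_latin1), ("mac_roman", as_macroman)]:
    print(label, repr(value), len(value))
```

Only the UTF-8 decoder yields a six-character name; the single-byte decoders each turn the two bytes of ü into two unrelated characters.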
On Sun, 30-12-2007 at 10:28 +0100, Andreas Raab wrote:
> Oh, interesting. That reminds me of the fix that I needed for Windows,
> let me see... yes, here it is: LanguageEnvironment
> class>>defaultFileNameConverter needs to be fixed since it is (wrongly)
> guessing the file name encoding based on the currently active locale
> (which makes no sense btw, since the locale doesn't mean Jack for file
> name encodings). Hm ... lemme try this ... ah, interesting. It appears
> that I can make the Umlauts work on Unix correctly if and only if:
> * I fix the above method to return UTF8TextConverter in every case [*1]
> * I use -pathenc MacRoman -textenc MacRoman
> Which makes no sense to me since neither the path nor the text encoding
> is MacRoman but it appears to work. Huh?
>

any other with the mime type registered) over the image if the file name contains the umlauts? That has never worked either in the Unix VM...

> [*1] And that of course reminds me that nobody has really made any
> comment on why the hell we still deal with all of these nonsensical
> legacy encodings and don't just go straight to UTF-8 in the VM interface
> which would simplify *lots* of cruft in the code.

Totally agree. Maybe every VM hacker thinks everybody uses English, or it was just too messy to waste time on...
In reply to this post by Andreas.Raab
On Dec 30, 2007, at 1:28 AM, Andreas Raab wrote:
> Oh, interesting. That reminds me of the fix that I needed for
> Windows, let me see... yes, here it is: LanguageEnvironment
> class>>defaultFileNameConverter needs to be fixed since it is
> (wrongly) guessing the file name encoding based on the currently
> active locale (which makes no sense btw, since the locale doesn't
> mean Jack for file name encodings). Hm ... lemme try this ... ah,
> interesting. It appears that I can make the Umlauts work on Unix
> correctly if and only if:
> * I fix the above method to return UTF8TextConverter in every case [*1]
> * I use -pathenc MacRoman -textenc MacRoman
> Which makes no sense to me since neither the path nor the text
> encoding is MacRoman but it appears to work. Huh?

Careful now: is that ü 0x9F in MacRoman or 0x000000FC in UTF-32 that is coming back from the VM? Then what does LanguageEnvironment class>>defaultFileNameConverter do, and which FONT and character set does that font think it lives in? The hex value might not have a proper representation if, after converting from something to Unicode, you use a font that is not a proper Unicode font...

--
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
In reply to this post by Andreas.Raab
On Dec 30, 2007, at 1:28 AM, Andreas Raab wrote:
> [*1] And that of course reminds me that nobody has really made any
> comment on why the hell we still deal with all of these nonsensical
> legacy encodings and don't just go straight to UTF-8 in the VM
> interface which would simplify *lots* of cruft in the code.

I think all the vertical Squeak packages (Scratch, Sophie, Plopp) use UTF-8 and get to ignore the file tools. And don't get Tim started on what he thinks of the classes that deal with files...

Personally I wanted to move the Carbon VM to UTF-8 a while back, but, well, as you see, all the Morphic file tools would break in interesting ways.

--
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
In reply to this post by Andreas.Raab
> Hm ... lemme try this ... ah, interesting. It appears
> that I can make the Umlauts work on Unix correctly if and only if:
> * I fix the above method to return UTF8TextConverter in every case [*1]
> * I use -pathenc MacRoman -textenc MacRoman
> Which makes no sense to me since neither the path nor the text encoding
> is MacRoman but it appears to work. Huh?

Yes, on the Unix VM another historical mishap caused it: "MacRoman" still means "no conversion", so that if the image passes a UTF-8 string, the UTF-8 string is passed to system calls.

> [*1] And that of course reminds me that nobody has really made any
> comment on why the hell we still deal with all of these nonsensical
> legacy encodings and don't just go straight to UTF-8 in the VM interface
> which would simplify *lots* of cruft in the code.

Well, nobody tried to change stuff on all the platforms at once. Windows is doing OK with the 3.10 VM and the OLPC Etoys image (there is still code that deals with older VMs... the typical installation for people is to install stuff from squeakland.org and then use the Etoys image).

-- Yoshiki
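
The "no conversion" behaviour described above can be modeled roughly as follows. This is a sketch in Python, not the VM's code: `convert` is a hypothetical stand-in for the VM's conversion step, under the assumption that equal source and target encodings leave the bytes untouched, and that the image already hands over UTF-8 bytes (that is, it uses a UTF-8 file name converter).

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Hypothetical model of a path conversion between two named encodings."""
    if src == dst:
        # Same encoding on both sides: nothing to do, bytes pass through.
        return data
    return data.decode(src).encode(dst, errors="replace")

# Assumption: the image produced UTF-8 bytes for "Jürgen".
image_bytes = "J\u00fcrgen".encode("utf-8")

# -textenc MacRoman -pathenc MacRoman: the UTF-8 bytes reach the system call intact.
passed_through = convert(image_bytes, "mac_roman", "mac_roman")

# -textenc MacRoman -pathenc UTF-8: the UTF-8 bytes are misread as MacRoman
# and encoded a second time, so the file system sees a different name.
double_encoded = convert(image_bytes, "mac_roman", "utf-8")

print(passed_through)
print(double_encoded)
```

Under these assumptions the MacRoman/MacRoman setting works precisely because it does nothing, which fits the observation that it only helps once the image side produces UTF-8 itself.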
On Dec 30, 2007, at 2:11 AM, Yoshiki Ohshima wrote:
>> Hm ... lemme try this ... ah, interesting. It appears
>> that I can make the Umlauts work on Unix correctly if and only if:
>> * I fix the above method to return UTF8TextConverter in every case [*1]
>> * I use -pathenc MacRoman -textenc MacRoman
>> Which makes no sense to me since neither the path nor the text
>> encoding is MacRoman but it appears to work. Huh?
>
> Yes, on Unix VM, another historical mishap caused it; "MacRoman"
> still means "no conversion" so that if the image passes UTF-8 string,
> the UTF-8 string is passed to system calls.

Er, well, I'm not sure that's quite accurate. Looking at sqUnixCharConv.c, it seems to say that if the text encoding is MacRoman and the path encoding is MacRoman, the translation from Unix path to Squeak would be MacRoman to MacRoman, so nothing would happen.

  Convert(sq,ux, Path, sqTextEncoding, uxPathEncoding, 1, 0); // normalised paths for HFS+
  Convert(ux,sq, Path, uxPathEncoding, sqTextEncoding, 0, 0);

In

  sqInt dir_Lookup(char *pathString, sqInt pathStringLength, sqInt index,
    /* outputs: */ char *name, sqInt *nameLength, sqInt *creationDate, sqInt *modificationDate,
    sqInt *isDirectory, squeakFileOffsetType *sizeIfFile)

we find

  *nameLength= ux2sqPath(dirEntry->d_name, nameLen, name, MAXPATHLEN, 0);

However, I note the Carbon VM does

  if (norm) // HFS+ imposes Unicode2.1 decomposed UTF-8 encoding on all path elements
    CFStringNormalize(str, kCFStringNormalizationFormD); // canonical decomposition
  else
    CFStringNormalize(str, kCFStringNormalizationFormC); // pre-combined

but the Unix VM does not do this, which I think is an error, based on:

From: [hidden email]
Subject: [Vm-dev] Patch for filename normalization of mac vm
Date: March 11, 2007 8:21:06 PM PDT (CA)
To: [hidden email]

> Hi,
>
> I've found the latest mac vm (or recent version) fails to normalize
> UTF file name.
> It seems to be the function convertChars() of
> sqMacUnixFileInterface.c, which normalizes only decompose when
> converting squeak string to unix, but I think it needs pre-combined
> when unix string to squeak, and I noticed normalization form should
> be canonical (exactly should be kCFStringNormalizationFormC) for
> pre-combined.
>
> Patch (diff format of xcode tool) for this problem is attached to
> this mail.
>
> Regards,
> --
> Tetsuya HAYASHI, [hidden email], [hidden email]

PS: I note that if you feed CFStringCreateWithBytes bad data it returns NULL, and then the lurking CFStringCreateMutableCopy core dumps the VM. That's why I check for it in the Carbon VM. Normally you won't see this issue unless you get creative...

--
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
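
The two normalization directions discussed in this message can be sketched in Python, with the standard `unicodedata` module standing in for CFStringNormalize. The function names are hypothetical and this is not the VM's code; it also assumes UTF-8 on both sides and ignores the places where the HFS+ decomposition rules differ from standard NFD.

```python
import unicodedata

def squeak_to_unix_path(name: str) -> bytes:
    """Image -> file system: decompose (form D) before handing the name over."""
    return unicodedata.normalize("NFD", name).encode("utf-8")

def unix_to_squeak_path(raw: bytes) -> str:
    """File system -> image: recompose (form C), as the quoted patch proposes."""
    return unicodedata.normalize("NFC", raw.decode("utf-8"))

precomposed = "J\u00fcrgen"                   # ü as a single code point

on_disk = squeak_to_unix_path(precomposed)    # "u" followed by a combining diaeresis
round_trip = unix_to_squeak_path(on_disk)     # back to the precomposed form
not_recomposed = on_disk.decode("utf-8")      # what the image sees without form C

print(on_disk)
print(round_trip == precomposed, not_recomposed == precomposed)
```

Without the recomposition step the name that comes back is one code point longer than the one that went in, so a string comparison against the precomposed name fails even though both render the same.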
In reply to this post by Yoshiki Ohshima-2
Yoshiki Ohshima wrote:
>> Hm ... lemme try this ... ah, interesting. It appears
>> that I can make the Umlauts work on Unix correctly if and only if:
>> * I fix the above method to return UTF8TextConverter in every case [*1]
>> * I use -pathenc MacRoman -textenc MacRoman
>> Which makes no sense to me since neither the path nor the text encoding
>> is MacRoman but it appears to work. Huh?
>
> Yes, on Unix VM, another historical mishap caused it; "MacRoman"
> still means "no conversion" so that if the image passes UTF-8 string,
> the UTF-8 string is passed to system calls.

Playing around a little, it appears as if the Unix VM always converts path names with the assumption that Squeak uses MacRoman in the image, and only -pathenc affects the translation between file system and the image (i.e., -textenc has *no* effect on path name translation whatsoever). Can someone confirm this? It would explain why -pathenc MacRoman works (since, like you say, it's really the "no conversion" flag) if combined with a proper file name converter in the image.

>> [*1] And that of course reminds me that nobody has really made any
>> comment on why the hell we still deal with all of these nonsensical
>> legacy encodings and don't just go straight to UTF-8 in the VM interface
>> which would simplify *lots* of cruft in the code.
>
> Well, nobody tried to change stuff on the all platforms at once.
> Windows is doing ok with 3.10 VM and OLPC Etoys image (there is still
> code that deals with older VM... typical installation for people is to
> install stuff from squeakland.org and then use Etoys image).

What encoding options are being used on OLPC? Do non-ASCII file names, clipboard, drag and drop etc. work on OLPC?

Cheers,
  - Andreas
In reply to this post by johnmci
> I think all the vertical squeak packages (scratch, sophie, plopp) use
> UTF8 and get to ignore the file tools. And
> don't get Tim started on what he thinks of the classes that deal with
> files...

On IRC yesterday I learned that Rio has at least one fan!

Keith
In reply to this post by Andreas.Raab
On Dec 30, 2007, at 4:09 AM, Andreas Raab wrote:
> Yoshiki Ohshima wrote:
>>> Hm ... lemme try this ... ah, interesting. It appears that I can
>>> make the Umlauts work on Unix correctly if and only if:
>>> * I fix the above method to return UTF8TextConverter in every case [*1]
>>> * I use -pathenc MacRoman -textenc MacRoman
>>> Which makes no sense to me since neither the path nor the text
>>> encoding is MacRoman but it appears to work. Huh?
>> Yes, on Unix VM, another historical mishap caused it; "MacRoman"
>> still means "no conversion" so that if the image passes UTF-8 string,
>> the UTF-8 string is passed to system calls.
>
> Playing around a little it appears as if the Unix VM always converts
> path names with the assumption that Squeak uses MacRoman in the
> image and only -pathenc affects the translation between file system
> and the image (i.e., -textenc has *no* effect on path name
> translation whatsoever). Can someone confirm this? It would explain
> why -pathenc MacRoman works (since like you say it's really the "no
> conversion" flag) if combined with a proper file name converter in
> the image.

Mmm, for -pathenc and -textenc, from what I can see the data coming from the file system is said to exist in the form -pathenc, is translated to a CFString in UTF-32, and is then translated back to a byte string in -textenc.

In sending the data to the file system, it is said to exist in the form -textenc and is translated to a CFString in UTF-32, then

  CFStringNormalize(str, kCFStringNormalizationFormD); // canonical decomposition

is applied, and then it is translated back to a byte string in -pathenc.

I'll note the kCFStringNormalizationFormD operation (and everything above and below) only occurs if this is a Macintosh. If this is a Linux/BSD Unix system then iconv is used. So is this on a Mac or some Linux/BSD system? That said, I think a kCFStringNormalizationFormC is needed in the first step to properly compose the characters.

For background: OK, let's see, in the Mac Carbon VM we get back from the file system for LATIN CAPITAL LETTER SCHWA + LATIN CAPITAL LETTER A WITH ACUTE

  ƏÁ.png

  0xC6, 0x8F, A, 0xCC, 0x81, .png

Note how the A 0xCC81 is the decomposed UTF-8; this is what is stored in the HFS+ file system.

We convert that from UTF-8 to the target of MacRoman for path names by default in the base Carbon VM. This means converting to a CFString from kCFStringEncodingUTF8, then applying

  CFStringNormalize(str, kCFStringNormalizationFormC); // pre-combined

then pulling back the bytes as MacRoman. That becomes

  ?, 0xE7, .png

since the translation of the Ə from UTF-8 to MacRoman fails, but the 0xE7 is correct MacRoman for the Á.

Now if I set the VM up to use UTF-8 as the path name default, then after we apply the kCFStringNormalizationFormC step and pull back the data as UTF-8 it is

  0xC6, 0x8F, 0xC3, 0x81, .png

where the 0xC381 is LATIN CAPITAL LETTER A WITH ACUTE in UTF-8, or 0x00C1 in UTF-16, or 0x000000C1 in UTF-32.

Now if I remove the

  CFStringNormalize(str, kCFStringNormalizationFormC); // pre-combined

which does not exist in the base Unix VM, then I get back

  0xC6, 0x8F, A, 0xCC, 0x81, '.png'

which is the decomposed UTF-8. I'll note that in the file browser it shows as

  Æ∑AÌ∞.png

but it does work...

NOW the question is what the translation does... mmm. Well, if I try

  LanguageEnvironment classPool at: #FileNameConverterClass put: UTF8TextConverter

then it shows

  ?A?.png

which is, mmm, less wrong? But it does work.

Now if in the base Unix VM you take the path encoding and the text encoding and set both to MacRoman,

  Convert(ux,sq, Path, uxPathEncoding, sqTextEncoding, 0, 0);

then we are saying the operating system (Unix) path encoding is MacRoman and the Squeak path name encoding is MacRoman. That gives back

  0xC6, 0x8F, A, 0xCC, 0x81, '.png'

since, as thought, the MacRoman to MacRoman translation does nothing. However, the UTF8TextConverter does not work with decomposed UTF-8, so what the user would see is not correct, since it assumes we are working with precomposed UTF-8 when it converts it to UTF-32 for the font system's enjoyment.

I suspect that to fix this properly a

  CFStringNormalize(str, kCFStringNormalizationFormC);

is needed in the Unix sqUnixCharConv.c, applied when applicable.

Notes from Apple's site:

For example, an Á (A acute) can be encoded either precomposed, as U+00C1 (LATIN CAPITAL LETTER A WITH ACUTE), or decomposed, as U+0041 U+0301 (UTF-8 is 0xCC81) (LATIN CAPITAL LETTER A followed by a COMBINING ACUTE ACCENT). Precomposed characters are more common in the Windows world, whereas decomposed characters are more common on the Mac.

--
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
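
The byte sequences listed in this message can be reproduced with a short Python sketch. As before, `unicodedata.normalize` is only a stand-in for CFStringNormalize and Python's `mac_roman` codec for the VM's MacRoman conversion; the variable names are illustrative.

```python
import unicodedata

# Bytes reported from HFS+ above: schwa, "A", combining acute accent, ".png"
decomposed = b"\xc6\x8fA\xcc\x81.png"

text = decomposed.decode("utf-8")
composed = unicodedata.normalize("NFC", text)    # A + U+0301 becomes U+00C1

# Pulling the composed string back as UTF-8 and as MacRoman.
as_utf8 = composed.encode("utf-8")
as_macroman = composed.encode("mac_roman", errors="replace")

# Skipping the composition step: neither the schwa nor the combining accent
# has a MacRoman slot, so each is replaced separately.
uncomposed_macroman = text.encode("mac_roman", errors="replace")

print(as_utf8)
print(as_macroman)
print(uncomposed_macroman)
```

The composed MacRoman result loses only the schwa, while the uncomposed one also loses the accent, which is the practical difference the form C step makes for a legacy target encoding.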
In reply to this post by Andreas.Raab
Andreas,
> Playing around a little it appears as if the Unix VM always converts
> path names with the assumption that Squeak uses MacRoman in the image
> and only -pathenc affects the translation between file system and the
> image (i.e., -textenc has *no* effect on path name translation
> whatsoever). Can someone confirm this? It would explain why -pathenc
> MacRoman works (since like you say it's really the "no conversion" flag)
> if combined with a proper file name converter in the image.

I think it is. If I remember correctly, this was introduced before 3.8 (around the 3.7 era, I believe) by Ned, and that was only to make something he was trying to do work.

> > Well, nobody tried to change stuff on the all platforms at once.
> > Windows is doing ok with 3.10 VM and OLPC Etoys image (there is still
> > code that deals with older VM... typical installation for people is to
> > install stuff from squeakland.org and then use Etoys image).
>
> What encoding options are being used on OLPC? Do non-ascii file names,
> clipboard, drag and drop etc. work on OLPC?

Filenames are in UTF-8. Clipboard and drag and drop use a somewhat strange mechanism, but they are normalized to UTF-8 as well.

-- Yoshiki
Yoshiki Ohshima wrote:
>> What encoding options are being used on OLPC? Do non-ascii file names,
>> clipboard, drag and drop etc. work on OLPC?
>
> Filenames are in UTF-8. Clipboard and Drag and drop uses somewhat
> strange mechanism, but they are normalized to UTF-8 as well.

Odd. Checking the latest OLPC image I have access to (etoys2.3-1867), there is nothing I can see that would deal properly in the image with non-ASCII file names on Unix. LanguageEnvironment still returns "self currentPlatform class fileNameConverterClass", which won't use a UTF-8 converter anywhere. So unless I'm missing something it probably shouldn't work on OLPC. Maybe try it with some "real" UTF-32 names?

Cheers,
  - Andreas
Andreas,
> Odd. Checking the latest OLPC image I have access to (etoys2.3-1867)
> there is nothing I can see that would deal properly in the image with
> non-ascii file names on Unix. LanguageEnvironment still returns "self
> currentPlatform class fileNameConverterClass" which won't use an UTF-8
> converter anywhere. So unless I'm missing something it probably
> shouldn't work on OLPC. Maybe try it with some "real" UTF-32 names?

Ah, you are right. The code wasn't the way I remembered it should be. The fileNameConverterClass of Latin1Environment (which should be called WesternEuropeanEnvironment) should be a bit more elaborate (or just return UTF8TextConverter).

-- Yoshiki