Hi

I'm trying to remember the situation with the internal representation of strings in Pharo/Squeak, in order to revise
http://book.seaside.st/book/in-action/serving-files/character-encodings/seaside-pharo

I saw that in Pharo we have this NonASCIIMap. I do not remember what has been done in Pharo.
Argh, memory leaks... Nicolas, do you remember the situation?

In this context, what is the squeakToUTF8-related behavior?
Is Squeak still using latin-1, or is it in the midst of changing?

Stef
_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

You should ask the Sophie team; their knowledge is certainly far more
advanced than mine.

String should be a SequenceableCollection of Character.
Internally, for space/speed reasons, strings rather store a code
representing the value of each Character.

In a simple model, this value would be the Unicode code point...
In Squeak, only the lowest 22 bits of a Character value are used to encode
the character (#charCode).
Bits of rank 23 to 30 encode a so-called #leadingChar.
I guess we stopped at bit #30 just to be sure to handle SmallInteger values.
Don't count on me to explain leadingChar, I can't...
For leadingChar ~~ 0, I'm not even sure of the correct charCode interpretation...

For value < 256, the interpretation of the charCode is not exactly Unicode...
It's more CP1252 (with assigned values for codes from 128 to 159).

Once upon a time, it used to be Mac Roman encoding instead...
Let's forget the past (but you could see some remnants in old code).

------------------------------

When marshalling/unmarshalling strings to/from the outside world we
could/should use ByteArray...
Unwisely, we don't.
Instead, we reuse a String as storage for these codes.
As a result, you see all these squeakToUtf8, utf8ToSqueak, etc...
That means that the contents of the String cannot be interpreted
outside of its context... Very very bad IMHO.
From this point of view, the String no longer has a self-contained
meaning, but is just a blob of codes (on 8 or 32 bits).
Fortunately, we mostly use these forms for temporary storage, but
even so, I don't like it.

There are other alternatives, like defining subclasses of String that
encapsulate their encodings and know how to be well-behaved Strings,
not just context-dependent blobs.
For example, you could just as well define a UTF8String.
VW went down this kind of path a long time ago (not sure for utf8 though).

Well, I'm not sure whether I succeeded in explaining something at all
or just added confusion...

Anyway, Unicode is not simple, because it attempts to represent
several centuries of typesetting conventions of different cultures...
So don't expect the code to be as simple as in the ASCII times.
It forces you to ask what a character is at all. Several glyphs exist
for the same character (upper and lower case for a latin example),
some characters can be decomposed into a base character and a
diacritical mark, etc...

Character rendering is even worse, with kerning, ligatures,
anti-aliasing, hinting, etc...
Designing a font of good quality is a lot of work, especially if you
have to support Unicode!
If it's getting too complex and we don't get the task force to handle
it, we'd better hook OS primitives to measure/render.
I guess this is far beyond your original question, but it will arise
soon, because without good fonts and good rendering, Unicode support
is kind of void.

Nicolas

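The bit layout described in the message above can be modelled in a few lines. This is only an illustrative Python sketch, not Squeak code: the helper names `pack` and `unpack` are made up, and the assumption that the 8 leadingChar bits sit directly above the 22 charCode bits is inferred from the description ("lowest 22 bits", "bits of rank 23 to 30").

```python
# Illustrative model of the Character value layout described in the thread.
# Assumption: charCode occupies the low 22 bits and leadingChar the next 8
# bits (bits 23..30, counting from 1). pack/unpack are hypothetical helpers.

CHAR_CODE_BITS = 22
CHAR_CODE_MASK = (1 << CHAR_CODE_BITS) - 1   # 0x3FFFFF
LEADING_CHAR_MASK = 0xFF                     # 8 bits


def pack(leading_char: int, char_code: int) -> int:
    """Combine a leadingChar and a charCode into one integer value."""
    if not 0 <= char_code <= CHAR_CODE_MASK:
        raise ValueError("charCode does not fit in 22 bits")
    if not 0 <= leading_char <= LEADING_CHAR_MASK:
        raise ValueError("leadingChar does not fit in 8 bits")
    return (leading_char << CHAR_CODE_BITS) | char_code


def unpack(value: int) -> tuple:
    """Split a value back into (leadingChar, charCode)."""
    leading_char = (value >> CHAR_CODE_BITS) & LEADING_CHAR_MASK
    char_code = value & CHAR_CODE_MASK
    return leading_char, char_code


if __name__ == "__main__":
    value = pack(0, ord("é"))
    print(hex(value), unpack(value))
    # The largest packed value uses exactly 30 bits.
    print(pack(0xFF, CHAR_CODE_MASK) == 2**30 - 1)
```

With leadingChar equal to zero the packed value is simply the charCode, which is consistent with the remark later in the thread that zero leadingChars make WideString storage look like UTF-32.
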
> You should ask Sophie team, their knowledge certainly is far more
> advanced than mine.

The problem is that most of them disappeared after the Java rewrite announcement.

> In squeak, only lowest 22 bits of a Character value are used to encode
> the character (#charCode).
> Bits of rank 23 to 30 encode a so called #leadingChar.
> I guess we stopped at bit #30 just to be sure to handle SmallInteger values.
> Don't count on me to explain leadingChar, I can't...

:)
I read the comment of the class and some code and got lost.

> Well, I'm not sure whether I succeeded in explaining something at all
> or just added confusion...

Don't worry.
For the Seaside book I started to read the Unicode standard and its history. Now it would be good to
know what to do, and do it :)

> Anyway, Unicode is not simple, because it attempts to represent
> several centuries of typesetting conventions of different cultures...
> So don't expect the code to be as simple as in the ASCII times.
> It forces you to ask what is a character at all? Several glyphs exist
> for the same character (upper and lower case for a latin example),
> some characters can be decomposed as a base character and a
> diacritical mark, etc...

Yes, I read that.

> Character rendering is even worse, with kerning, ligatures, anti
> aliasing, hinting, etc...
> Designing a font of good quality is a lot of work, especially if you
> have to support unicode !
> If it's getting too complex and we don't get the task force to handle
> it, we'd better hook OS primitives to measure/render.
> I guess it is far beyond you original question, but that will arise
> soon, because without good fonts and good rendering, Unicode support
> is kind of void.

Yes, this is all the question of a community not moving for 10 years (not only Squeak)
while the world makes progress and, more importantly, gets more and more complex.
So maybe relying on external libraries will become more and more important (which I do not like).

2010/3/28 Stéphane Ducasse <[hidden email]>:
>> Well, I'm not sure whether I succeeded in explaining something at all
>> or just added confusion...
>
> don;t worry.
> for the seaside book I started to read unicode standard and history now it would be good to
> know what to do and do it :)

Ask the Seaside folks, they certainly have some ideas.

>> If it's getting too complex and we don't get the task force to handle
>> it, we'd better hook OS primitives to measure/render.
>> I guess it is far beyond you original question, but that will arise
>> soon, because without good fonts and good rendering, Unicode support
>> is kind of void.
>
> Yes this is all the question of a community not moving during 10 years (not only squeak)
> and the world making progress and more important getting more and more complex.
> So may be relying on external libraries will be more and more important (which I do not like).

We have the Cuis alternative for simplicity.
Not sure we should waste time competing in areas where we can't win.
Seaside just takes advantage of web standards and browsers, and it's the
main commercial Smalltalk niche these days, isn't it?

Nicolas

I do not think that we should focus only on the web.
Smalltalk is much better than that.
Now we should go step by step.
I think that Pharo is just at the beginning.

Stef

On 2010-03-28, at 10:02 AM, Stéphane Ducasse wrote:

>> You should ask Sophie team, their knowledge certainly is far more
>> advanced than mine.
>
> The problem is that most of them disappeared after the java rewrite announce.

OK, well, it's not as if we disappeared from the planet...

The UTF8 work came out of work done on OLPC/eToys; check with Bert.

You need to carefully check that the UTF8 encoder/decoder methods or classes actually DO produce UTF8.
One of them lied and did a UTF8ToMacRoman (or was that UTF8ToLatin1?) conversion when it claimed to be a UTF8ToUnicode32.

In Sophie, because we wrote its own storage subsystem, we could ensure that all textual data going to storage was UTF8, and
that all text coming in was UTF8->WideStrings. Also, all references to external resources were made into URIs that were UTF8 and HTTP-encoding safe.
Mind you, we did have tiny disagreements at times about Unicode characters and encoding in URLs between Safari and Firefox; I was never quite
sure who had the bug.

--
===========================================================================
John M. McIntosh <[hidden email]>   Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================

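One way to check that a converter "actually DOES make UTF8", as suggested above, is to feed it a few characters whose UTF-8 byte sequences are known and compare the output. The sketch below is in Python rather than Smalltalk and is only an illustration: `produces_utf8` and the two sample converters are hypothetical names, not anything from Squeak, Pharo or Sophie. The last function shows the "UTF8 and HTTP-encoded" form of a name using the standard `urllib.parse.quote`.

```python
from urllib.parse import quote

# Known UTF-8 encodings for 1-, 2-, 3- and 4-byte characters.
KNOWN_UTF8 = {
    "A": b"A",
    "\u00e9": b"\xc3\xa9",              # e with acute accent
    "\u20ac": b"\xe2\x82\xac",          # euro sign
    "\U0001d11e": b"\xf0\x9d\x84\x9e",  # musical symbol G clef
}


def produces_utf8(encode) -> bool:
    """Return True if encode(text) yields the expected UTF-8 bytes."""
    for text, expected in KNOWN_UTF8.items():
        try:
            if encode(text) != expected:
                return False
        except Exception:
            return False
    return True


def honest_converter(text: str) -> bytes:
    return text.encode("utf-8")


def lying_converter(text: str) -> bytes:
    # Claims to produce UTF-8 but really emits Latin-1 (hypothetical example).
    return text.encode("latin-1", errors="replace")


def uri_safe(name: str) -> str:
    """Percent-encode the UTF-8 bytes of a resource name."""
    return quote(name, safe="")


if __name__ == "__main__":
    print(produces_utf8(honest_converter))
    print(produces_utf8(lying_converter))
    print(uri_safe("r\u00e9sum\u00e9.txt"))
```

A check like this only catches converters that mislabel their output; it says nothing about how they treat invalid input.
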
Thanks John.

> You need to carefully check that some of the UTF8 encoders/decoder methods, or classes actually DO make UTF8
> One of them lied and did a UTF8ToMacRoman or was that UTF8ToLatin1 conversion when it claimed a UTF8ToUnicode32.

Argh!
So we should really find a way to rescue what was done in Sophie.

On Mar 28, 2010, at 4:36 13PM, Stéphane Ducasse wrote:

> I saw that in pharo we have this NonASCIIMap. I do not remember what have been done in pharo.
> Argh memory leaks.... Nicolas do you remember the situation?

NonASCIIMap is used for quickly determining whether a string has no character codes > 127 (i.e. only ASCII characters).
It's very useful for doing a primitive-accelerated isAsciiString, which in the case of ASCII-compatible encodings (utf8, latin1, macroman, etc.) would mean no conversion is required for it to be the "appropriate" internal ByteString format.
It's used f.ex. in the nextChunk code.
Strangely, it is also used in FileStream writeSourceCodeFrom: baseName: isSt: ; for some reason we use MacRoman there if the stream contents isAscii, which really makes no sense, but whatever.

John pointed out that some converters were lying. I'm not entirely sure that's true anymore; what IS certain, though, is that the external code format used is inconsistent, depending on where/how you save/load it.
It really should be cleaned up to always store in utf8, and possibly also latin1 if possible.
All this should be cleared up to always try reading as UTF8, then raise an InvalidUTF8 error which can be handled by telling it to use a different converter and restart.
The converter could possibly be chosen from a menu when dropping a file on the image, or an alternative could be chosen automatically if we know the possible other encodings a file could have been saved as. I'm not sure how best to do it for scripts given as parameters when launching the VM.

On the font rendering side, I agree with Nicolas that it's too complicated to do font rendering in-image; FT is an OK compromise though.
As for the bitmap StrikeFont rendering, what is really needed is a way to specify the charset it represents, and mappings from the internal string encodings to its glyphs.
F.ex., Bitmap DejaVu is really latin15, so it will currently render some ByteString characters incorrectly, as well as render as ? some Unicode chars it really has glyphs for (such as the euro sign).

Which all really has nothing to do with your initial question :)
The internal representation of strings really hasn't changed since it was written, with the exception that leadingChar for WideStrings is now zero.
As far as I can tell, that means the internal storage format of WideStrings is now equivalent to utf32; not sure what byte order it uses though, or if that is even consistent across platforms. :)

The point about using WaKomEncoded, and passing all strings going into/out of the image through an encoder, is still valid.

Cheers,
Henry

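The reading strategy proposed in the message above (ASCII fast path, then UTF-8, then an error that a handler resolves by picking another converter) can be sketched as follows. This is a Python illustration under stated assumptions, not Pharo code: the `InvalidUTF8` class only borrows the name used in the message, and `read_text` and `choose_fallback` are hypothetical names standing in for whatever menu or heuristic would pick the alternative encoding.

```python
from typing import Callable


class InvalidUTF8(Exception):
    """Raised when bytes that were expected to be UTF-8 are not."""

    def __init__(self, data: bytes):
        super().__init__("input is not valid UTF-8")
        self.data = data


def is_ascii(data: bytes) -> bool:
    # Same question NonASCIIMap is said to answer: any byte above 127?
    return all(b < 0x80 for b in data)


def read_as_utf8(data: bytes) -> str:
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise InvalidUTF8(data) from exc


def read_text(data: bytes, choose_fallback: Callable[[bytes], str]) -> str:
    """Decode data, asking choose_fallback for a codec name if UTF-8 fails."""
    if is_ascii(data):
        # ASCII is a subset of UTF-8, Latin-1 and MacRoman: nothing to convert.
        return data.decode("ascii")
    try:
        return read_as_utf8(data)
    except InvalidUTF8 as err:
        # Restart with the converter selected by the handler.
        return err.data.decode(choose_fallback(err.data))


if __name__ == "__main__":
    always_latin1 = lambda data: "latin-1"
    print(read_text(b"plain ascii", always_latin1))
    print(read_text("caf\u00e9".encode("utf-8"), always_latin1))
    print(read_text(b"caf\xe9", always_latin1))
```

Trying UTF-8 first works reasonably well in practice because text in a legacy 8-bit encoding that contains non-ASCII bytes is rarely also valid UTF-8, though it is not impossible.
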
I presume that under the idiom "latin1" you refer to code page 1252
rather than iso8859-L1, right?

Nicolas

2010/3/29 Henrik Johansen <[hidden email]>:
> It really should be cleaned up to always store in utf8, and possibly also latin1 if possible.

On Mar 29, 2010, at 11:16 30AM, Nicolas Cellier wrote:

> I presume that under the idiom "latin1" you refer to code page 1252
> rather than iso8859-L1, right ?
>
> Nicolas

Good question :)
What IS the presumed internal encoding of ByteStrings in Squeak?
That's the one I meant; I merely assumed it was latin1, seeing as how the text converter refers to it as such.
Personally I thought it was iso8859-L1, seeing as the ByteString to Unicode conversion does a simple shift of chars > 127 to the 0080 - 00FF range.

Cheers,
Henry

2010/3/29 Henrik Johansen <[hidden email]>:
> What IS the presumed internal encoding of Bytestrings in Squeak?
> That's the one I meant, I merely assumed it was latin1 seeing as how the text converter refers to it as such.
> Personally I thought it was iso8859-L1, seeing as the bytestring to unicode conversion does a simple shift of chars > 127 to the 0080 - 00FF range.

From what I understood, CP1252 is Microsoft's "latin1" and uses codes 128 to 159.
ISO8859-L1 matches the first 256 codes of Unicode (latin-1) and has codes 128
to 159 unused.
You know, when Microsoft "uses" a standard, it's always a better standard ;)

I have nothing against CP1252; it's an optimization which avoids
wasting 32 cheap codes.
But I'm not sure about various compatibility issues in/with the
external world...

Squeak clearly uses CP1252.
For Pharo, there might be a mix of the two since the Sophie-like
refactorings. Surely that is what John was referring to.

Nicolas

On Mar 29, 2010, at 11:52 43AM, Nicolas Cellier wrote:

> Squeak clearly uses CP1252.
> For Pharo, there might be a mix of the two since Sophie-like
> refactorings. Surely what John was refering to.

Ummm...
All the utf8 converters in Squeak use Unicode value:, which maps directly from charCode 128->255 to Unicode value 128->255.
Unicode value 128->255 IS iso8859-L1, so if Squeak uses CP1252 as its internal format, all the converters in Squeak are wrong.

Cheers,
Henry

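The argument above can be checked with any Unicode library. The Python sketch below (not Squeak code; `identity_to_utf8` is a hypothetical stand-in for a converter that treats the byte value as the Unicode value) shows that an identity mapping of codes 128 to 255 is exactly a Latin-1 decode, and that it gives a different UTF-8 result than a CP1252 interpretation would for a code such as 16r80.

```python
def identity_to_utf8(code: int) -> bytes:
    """Treat a byte value as the Unicode code point itself, then emit UTF-8."""
    return chr(code).encode("utf-8")


def cp1252_to_utf8(code: int) -> bytes:
    """Interpret the byte value as CP1252 first, then emit UTF-8."""
    return bytes([code]).decode("cp1252").encode("utf-8")


if __name__ == "__main__":
    high = bytes(range(128, 256))
    # Latin-1 decoding maps every byte to the code point with the same value.
    print([ord(c) for c in high.decode("latin-1")] == list(range(128, 256)))

    # Same result for an accented letter present in both charsets...
    print(identity_to_utf8(0xE9), cp1252_to_utf8(0xE9))
    # ...but not for 0x80, which CP1252 assigns to the euro sign.
    print(identity_to_utf8(0x80), cp1252_to_utf8(0x80))
```

So a converter built on an identity mapping is only correct if the internal byte encoding is taken to be Latin-1; the two interpretations agree everywhere except in the 16r80 to 16r9F range.
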
2010/3/29 Henrik Johansen <[hidden email]>:
> Ummm...
> All the utf8-converters in squeak use Unicode value:, which maps directly from charCode 128->255 to Unicode value 128->255.
> Unicode value 128->255 IS iso8859-L1, so if squeak uses CP1252 as internal format, all the converters in Squeak are wrong.

ISO8859-L1 and CP1252 only differ for code points 16r80 to 16r9F.
Contrary to what I said, in ISO8859-L1 these code points are assigned to C1
control characters (anyone ever used these?).
See http://en.wikipedia.org/wiki/ISO_8859-1 and
http://en.wikipedia.org/wiki/Windows-1252

Now, I'm not so sure anymore why I thought Squeak was CP1252. Is it?
My guess was probably based on the macToSqueak and squeakToMac implementations.
But the rendering of the following snippet isn't CP1252-compliant:

    String withAll: ((16r80 to: 16r9F) collect: [:e | Character value: e])

or

    (16r80 to: 16r9F) collect: [:e | Character value: e] as: String

    ' '

In Squeak 4.1 the different fonts don't agree on rendering these characters...
DefaultFixedTextStyle is still using MacRoman and displays accented characters.
DefaultTextStyle hacks the first 4 entries with caret, underscore, left arrow
and up arrow (probably a Cuis hack).
Accu* just seems to have a hack for left arrow.

Maybe with a bit more clean-up (Character euro is answering the
MacRoman code, for example, and taking the MacRoman conversions from
Sophie/Pharo), we could declare that Squeak is using Unicode...
Great!

Nicolas

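The claim that the two charsets differ only in the range 16r80 to 16r9F can be verified mechanically. The sketch below uses Python's standard `latin-1` and `cp1252` codecs; `differing_bytes` is a hypothetical helper name. One assumption to be aware of: Python's CP1252 table leaves five codes in that range unassigned, and the sketch counts those as differing as well.

```python
import unicodedata


def differing_bytes() -> list:
    """Return the byte values that decode differently in Latin-1 and CP1252."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        latin1_char = raw.decode("latin-1")
        try:
            cp1252_char = raw.decode("cp1252")
        except UnicodeDecodeError:
            cp1252_char = None  # unassigned in Python's CP1252 table
        if cp1252_char != latin1_char:
            diffs.append(value)
    return diffs


if __name__ == "__main__":
    diffs = differing_bytes()
    print([hex(v) for v in (diffs[0], diffs[-1])], len(diffs))
    # Under Latin-1 the whole range decodes to control characters (category Cc).
    print({unicodedata.category(chr(v)) for v in diffs})
```

This is why a string built from codes 16r80 to 16r9F is a good probe: printable glyphs suggest a CP1252-like or MacRoman-like interpretation, while empty boxes suggest Latin-1.
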
On Mar 29, 2010, at 2:00 09PM, Nicolas Cellier wrote:

> 2010/3/29 Henrik Johansen <[hidden email]>:
>>
>> On Mar 29, 2010, at 11:52 43AM, Nicolas Cellier wrote:
>>
>>> 2010/3/29 Henrik Johansen <[hidden email]>:
>>>>
>>>> On Mar 29, 2010, at 11:16 30AM, Nicolas Cellier wrote:
>>>>
>>>>> I presume that under the idiom "latin1" you refer to code page 1252
>>>>> rather than iso8859-L1, right ?
>>>>>
>>>>> Nicolas
>>>> Good question :)
>>>> What IS the presumed internal encoding of ByteStrings in Squeak?
>>>> That's the one I meant, I merely assumed it was latin1 seeing as how the text converter refers to it as such.
>>>> Personally I thought it was iso8859-L1, seeing as the bytestring to unicode conversion does a simple shift of chars > 127 to the 0080 - 00FF range.
>>>>
>>>> Cheers,
>>>> Henry
>>>>
>>>
>>> From what I understood, CP1252 is Microsoft "latin1" and uses codes 128 to 159.
>>> ISO8859-L1 matches the first 256 codes of unicode latin-1 and has codes 128
>>> to 159 unused.
>>> You know, when Microsoft "uses" a standard, it's always a better standard ;)
>>>
>>> I have nothing against CP1252, it's an optimization which avoids
>>> wasting 32 cheap codes.
>>> But I'm not sure about various compatibility issues in/with the
>>> external world...
>>>
>>> Squeak clearly uses CP1252.
>>> For Pharo, there might be a mix of the two since Sophie-like
>>> refactorings. Surely what John was referring to.
>>>
>>> Nicolas
>>
>> Ummm...
>> All the utf8-converters in squeak use Unicode value:, which maps directly from charCode 128->255 to Unicode value 128->255.
>> Unicode value 128->255 IS iso8859-L1, so if squeak uses CP1252 as internal format, all the converters in Squeak are wrong.
>>
>> Cheers,
>> Henry
>>
>
> ISO8859-L1 and CP1252 only differ for code points 16r80 to 16r9F.
> Contrary to what I said, these code points are assigned to C1
> control characters (anyone ever used these ?).
> See http://en.wikipedia.org/wiki/ISO_8859-1 and
> http://en.wikipedia.org/wiki/Windows-1252

Not to my knowledge :)
The strong argument for using latin1 as internal charset for ByteString vs 1252 is the 1-1 mapping to unicode values.

> Now, I'm not so sure anymore why I thought squeak was CP1252. Is it ?

Seems ambiguous.

> My guess was probably based on macToSqueak and squeakToMac implementation.

Yes, that does indeed do MacRoman -> 1252 transformation. As does MacRomanTextConverter, in Pharo as well...
Converters assuming different internal encodings, fonts which render a charset different from both of them... Fun eh?

> But rendering of following snippet isn't CP1252 complying:
>
> String withAll: ((16r80 to: 16r9F) collect: [:e | Character value: e])
> or
> (16r80 to: 16r9F) collect: [:e | Character value: e] as: String
> '•™≠∞≥∑∫Ω√≈…—‘Ÿ⁄∂∆Œ‚„‰ˆ˜˘˙˚˝˛ˇıƒ'
>
> In Squeak 4.1 the different fonts don't agree on rendering these characters...
> DefaultFixedTextStyle is still using MacRoman and display accented characters.
> DefaultTextStyle hack first 4 entries with caret underscore left arrow
> and up arrow (probably a Cuis hack)
> Accu* just seem to have a hack for left arrow

Yeah, they seem to cover... a blend of latin1, latin15 (has euro symbol), and something else (square-root :D ). Wee.

Render with a Unicode font, and you get nothing but []'s, which would be the correct latin1-rendering of said string.

Which is why I said an encoding property for the StrikeFonts was needed, so you can do the proper conversion of internal string charcodes to the charcode values the font expects. (Or rather, bitmap offsets)
This of course means you'd have to come up with a consistent definition of what the internal ByteString encoding in Squeak is first, though.

> Maybe with a bit more clean-up (Character euro is answering the
> MacRoman code for example,

The keyboard input handling in Squeak does strange things, at least on a Mac...
Alt - § (which gives a euro symbol on my keyboard layout) is read as a WideChar with the correct unicode value on Pharo, but as Char 164 in Squeak.
Alt- 5 (∞) does a similar thing: it reads as the correct widechar on Pharo, but on Squeak turns into char 129.

> and taking macRoman conversions from
> Sophie/Pharo), we could declare Squeak is using unicode...
> Great !
>
> Nicolas

That would be my dream as well.
Or really, I'd settle for any unambiguous definition of what the ByteString encoding is.
"A little more clean-up" may or may not be an understatement though: it would involve going through all the converters, all keyboard-input processing code (seems to be more stable in Pharo on Mac), and all places where strings enter/leave the system. :)

Cheers,
Henry
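The "1-1 mapping" argument can be checked mechanically. The following is a small Python sketch, offered only as an illustration from outside the image (none of it is Squeak or Pharo code, and `mismatched_bytes` is a name invented here). It lists the byte values that do not decode to the Unicode code point of the same numeric value. A converter that hands charCodes 128 to 255 straight to Unicode, as described above, would only be correct for a charset where that list is empty.

```python
def mismatched_bytes(encoding):
    """List byte values that do not decode to the code point of the same value.

    Bytes the codec leaves unassigned are counted as mismatches too.
    """
    bad = []
    for value in range(256):
        try:
            char = bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
            continue
        if ord(char) != value:
            bad.append(value)
    return bad


latin1_bad = mismatched_bytes("latin-1")
cp1252_bad = mismatched_bytes("cp1252")

print(latin1_bad)
print([hex(v) for v in cp1252_bad])
```

Under Python's codec tables the two charsets disagree only inside 16r80 to 16r9F, which matches the statement quoted above that ISO 8859-1 and CP1252 differ only in that range.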
On Mar 29, 2010, at 3:54 25PM, Henrik Johansen wrote:

> Yes, that does indeed do MacRoman -> 1252 transformation. As does MacRomanTextConverter, in Pharo as well...
> Converters assuming different internal encodings, fonts which render a charset different from both of them... Fun eh?

Ooops. My bad. MacRomanTextConverter in Pharo actually does work.
(I think... encodings make my head spin)

Cheers,
Henry
On 3/29/10 7:14 AM, Henrik Johansen wrote:
>
> On Mar 29, 2010, at 3:54 25PM, Henrik Johansen wrote:
>
>> Yes, that does indeed do MacRoman -> 1252 transformation. As does
>> MacRomanTextConverter, in Pharo as well... Converters assuming
>> different internal encodings, fonts which render a charset
>> different from both of them... Fun eh?
>
> Ooops. my bad. MacRomanTextConverter in Pharo actually does work. (I
> think... encodings make my head spin)

And I think so does the rest. In Pharo, not necessarily Squeak...

Part of the confusing discussion on the Pharo(!) list might be that you were actually talking about the current state of Squeak. There was a major UTF8 cleanup for Pharo, and messages like macToSqueak shouldn't even exist in Pharo anymore.

Michael
In reply to this post by Henrik Sperre Johansen
2010/3/29 Henrik Johansen <[hidden email]>:
>> But rendering of following snippet isn't CP1252 complying:
>>
>> String withAll: ((16r80 to: 16r9F) collect: [:e | Character value: e])
>> or
>> (16r80 to: 16r9F) collect: [:e | Character value: e] as: String
>> '•™≠∞≥∑∫Ω√≈…—‘Ÿ⁄∂∆Œ‚„‰ˆ˜˘˙˚˝˛ˇıƒ'
>>

I intentionally included the above string in the mail just for the fun of it...
My gmail/firefox browser originally displayed boxed control characters.
Now, in the same browser, I read back some math symbols in your answer...
... centered dot, trade mark, not-equal, infinity, greater-or-equal, summation etc...
At least you can see that "conforming to external world rules" might be pretty difficult.
I would add silly too :)

>> In Squeak 4.1 the different fonts don't agree on rendering these characters...
>> DefaultFixedTextStyle is still using MacRoman and display accented characters.
>> DefaultTextStyle hack first 4 entries with caret underscore left arrow
> Yup, Bitmap DejaVu is latin15 (some characters different from latin1, amongst them the € ), with 4 extra entries as you mentioned.
>
> That would be my dream as well.
> Or really, I'd settle for any unambiguous definition of what the ByteString encoding is.
> "A little more clean-up" may or may not be an understatement though, it would involve going through all the converters, all keyboard-input processing code (seems to be more stable in Pharo on mac), and all places where strings enters/leaves the system. :)
>

I won't answer the following mail; Michael took care of that in Pharo :)
Let's do it in Squeak too.

Nicolas
2010/3/29 Nicolas Cellier <[hidden email]>:
>>> (16r80 to: 16r9F) collect: [:e | Character value: e] as: String
>>> '•™≠∞≥∑∫Ω√≈…—‘Ÿ⁄∂∆Œ‚„‰ˆ˜˘˙˚˝˛ˇıƒ'
>>>
>
> I intentionally included the above string in the mail just for the fun of it...
> My gmail/firefox browser originally displayed boxed control characters.
> Now, in the same browser, I read back some math symbols in your answer...
> ... centered dot, trade mark, not-equal, infinity, greater-or-equal,
> summation etc...
> At least you can see that "conforming to external world rules" might
> be pretty difficult.
> I would add silly too :)
>

And gmail now displays my original mail with the CP1252 interpretation :)
M$ friendly ?
In reply to this post by Henrik Sperre Johansen
On 2010-03-29, at 6:54 AM, Henrik Johansen wrote:

>> Maybe with a bit more clean-up (Character euro is answering the
>> MacRoman code for example,
> The keyboardinput handling in Squeak does strange things, at least on a Mac...
> Alt - § (which gives a euro symbol on my keyboard layout) is read as a WideChar with the correct unicode value on Pharo, but as Char 164 in Squeak.
> Alt- 5 (∞) does a similar thing, reads as correct widechar on Pharo, but on Squeak turns into char 129.

No doubt this is because the InputSensor logic in the "Squeak" image you are using pulls the MacRoman value out of the keyboard event array and passes that unchanged to Morphic.

To summarize, then: I think all current VMs (last couple of years) provide unicode values on the keychar event.

Now the murky underpinnings, and I'll only talk about the Macintosh VM.

The Macintosh Carbon 4.x VM receives the unicode value, MacRoman value, and keycode value for each input character; this is after text services has mucked with the keystrokes to deal with characters that require multiple keystrokes to create. For the case where a text input widget is used to enter data, we convert the unicode values to MacRoman and have no keycode data. In both cases we use CFStringGetCString() with kCFStringEncodingMacRoman to get the MacRoman value.

In the Smalltalk image the code must then take the combination of key down, key char, and key up events (there are three events) and decide what to do. Depending on the framework, it might take data from key up or key char and work with the MacRoman value or the unicode value, or even take just the keycode and look up the MacRoman & unicode values in the image!

In general all VMs attempt to supply the unicode and MacRoman values; the keycodes are platform-specific values.

In the Cocoa-based V5 VM the unicode and keycode are supplied by the event data; however, some keyboard usage, oh say the HOME key, results in a message send rather than a keystroke, so we have to create a synthetic keyboard event and build the unicode & keycode data to match the Carbon-based VM's behaviour. The MacRoman data comes from NSMacOSRomanStringEncoding via getBytes: from the unicode char.

For the iPhone we don't offer keyboard input (*yet*)

The more attentive of you folks might wonder about keyboard input and the iPhone/iPad and Scratch.app. Er yes, obviously we feed keyboard input to the Squeak VM. For that we have the unicode, and then calculate the rest: the MacRoman data comes from NSMacOSRomanStringEncoding, and we build the keycode data using a unicode-to-keycode table we built from KCHR Carbon data.

--
===========================================================================
John M. McIntosh <[hidden email]>   Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================
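To make the unicode versus MacRoman distinction concrete, here is a short Python sketch. It is only an illustration: the VM itself uses CFStringGetCString() or NSMacOSRomanStringEncoding as described above, the function name `key_values` is invented here, and Python's `mac_roman` codec is assumed to agree with what those system calls produce. It shows the two different numbers an image could end up with for the same typed character, depending on which field of the event it reads.

```python
def key_values(char):
    """Return (unicode code point, MacRoman byte value) for one typed character."""
    return ord(char), char.encode("mac_roman")[0]


for ch in ("€", "∞", "é"):
    code_point, mac_byte = key_values(ch)
    print(ch, hex(code_point), mac_byte)
```

If an image passes the MacRoman byte on as though it were a character value, the result is whatever the internal charset assigns to that number, which may be one source of the wrong-character symptoms discussed in this thread.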
In reply to this post by Nicolas Cellier
On 29.03.2010, at 11:52, Nicolas Cellier wrote:

If you know how to easily ensure that

(String with: (Character value: (Integer readFrom: '20AC' base: 16))) = (String with: (Character value: (Integer readFrom: '80' base: 16)))

then you might be safe. With Windows-1252, code points aren't unique anymore: every code point in the range 0x80 - 0x9F exists somewhere else, too. So my estimation would be that it will cause more trouble than it might solve.

In Pharo the 20AC string gives me a euro sign, but the 80 hex one prints a rectangle, which is _an_ interpretation of '?' ;)

Norbert
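The uniqueness problem can be shown in a few lines. This is a Python sketch, not Smalltalk, and it relies on Python's `cp1252` and `latin-1` codec tables rather than on anything in the image. Byte 16r80 read as CP1252 is the same character as code point 16r20AC, while as plain code points 16r80 and 16r20AC are different characters.

```python
euro = "\u20ac"  # the euro sign's Unicode code point

byte_80_as_cp1252 = b"\x80".decode("cp1252")
byte_80_as_latin1 = b"\x80".decode("latin-1")

# Under a CP1252 reading, byte 0x80 is the euro sign.
print(byte_80_as_cp1252 == euro)

# Under a Latin-1 reading, byte 0x80 is the C1 control character U+0080.
print(byte_80_as_latin1 == euro)

# As plain code points the two values are distinct characters.
print(chr(0x80) == chr(0x20AC))
```

So if ByteString contents were defined as CP1252, an equality test like the one above would need extra mapping logic to treat the two values as the same character.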