2008/9/28 Damien Pollet <[hidden email]>:
> On Sat, Sep 27, 2008 at 7:05 PM, Philippe Marschall
> <[hidden email]> wrote:
>> Plus leading char.
>
> You mean the BOM (byte order mark) or something else?

No, I mean the language of the image encoded into every single character with an index bigger than 255. Check the class comment of Character for more information.

Cheers
Philippe
In reply to this post by Yoshiki Ohshima-2
Yoshiki Ohshima wrote:
> For that kind of web applications and servers that deal with stuff
> outside of Squeak, it doesn't serve a good purpose, because editing,
> displaying etc. are out of scope. Needless to say, the original idea
> was to make Squeak the dynamic, interactive, multilingualized
> environment, so there is a mismatch. Web applications etc. historically
> came after that goal.

Which wouldn't be a problem if the code were able to handle the data properly. Unfortunately, the effects of an "invalid" leading char are very, very strange (everything from crashing the scanner to raising weird errors in comparisons, character access etc.). As it stands, an application that uses non-Latin characters off the web is best off keeping everything in UTF-8.

BTW, one way to deal with this properly is to provide a leading char upon input conversion (i.e., utf8ToSqueak would then insert the proper leading char for each character). As a matter of fact, I think this is what Unicode class>>value: should do (instead of substituting the environmental leading char).

> If you need to retain this extra information, sending the strings
> without going through UTF-8 conversion makes more sense.

Or provide it via additional attributes. I still think that language information would best be modeled by a text attribute - in which case we have a plain Unicode implementation for strings as well as the ability to provide the disambiguation in text where required.

Cheers,
  - Andreas
In reply to this post by Andreas.Raab
2008/9/28, Andreas Raab <[hidden email]>:
> Yoshiki Ohshima wrote:
>> For that kind of web applications and servers that deal with stuff
>> outside of Squeak, it doesn't serve a good purpose, because editing,
>> displaying etc. are out of scope. Needless to say, the original idea
>> was to make Squeak the dynamic, interactive, multilingualized
>> environment, so there is a mismatch. Web applications etc. historically
>> came after that goal.
>
> Which wouldn't be a problem if the code were able to handle the data
> properly. Unfortunately, the effects of an "invalid" leading char are
> very, very strange (everything from crashing the scanner to raising
> weird errors in comparisons, character access etc.). As it stands, an
> application that uses non-Latin characters off the web is best off
> keeping everything in UTF-8.
>
> BTW, one way to deal with this properly is to provide a leading char
> upon input conversion (i.e., utf8ToSqueak would then insert the proper
> leading char for each character). As a matter of fact, I think this is
> what Unicode class>>value: should do (instead of substituting the
> environmental leading char).
>
>> If you need to retain this extra information, sending the strings
>> without going through UTF-8 conversion makes more sense.
>
> Or provide it via additional attributes. I still think that language
> information would best be modeled by a text attribute - in which case we
> have a plain Unicode implementation for strings as well as the ability
> to provide the disambiguation in text where required.

+1

Cheers
Philippe
In reply to this post by Andreas.Raab
At Sun, 28 Sep 2008 10:45:00 -0700,
Andreas Raab wrote:
>
> > If you need to retain this extra information, sending the strings
> > without going through UTF-8 conversion makes more sense.
>
> Or provide it via additional attributes. I still think that language
> information would best be modeled by a text attribute - in which case we
> have a plain Unicode implementation for strings as well as the ability
> to provide the disambiguation in text where required.

Well, sure, that is the cleaner approach, at the cost of more work. That is what I've been mentioning from time to time. The consequence would be that a bare character object or string object won't show up in the proper way; but it is not a big problem.

-- Yoshiki
In reply to this post by NorbertHartl
>>> Am I the only one using the generic en/decoding functionality in
>>> Squeak in the form of #convertTo/FromEncoding?
>>>
>>> Convert from "Squeak" to UTF-8
>>>   aString convertToEncoding: 'utf-8'
>>
>> Do I understand correctly that such an aString is a sequence of
>> Unicode code points?
>
> At first the UTF-8 is a sequence of bytes. These bytes are a
> space-optimized encoding of a code point (UTF-8). If you decode those
> bytes you get your code point (Unicode). From a sequence of code points
> you can derive a character. In most cases (for us westerners) it will
> be a single code point AFAIK.

I'm trying to really understand in Squeak. :)
What we call character is what then?
Is it a codepoint? or the looked up glyph in a font table?

Stef
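The bytes / code points / character distinction described in the quoted explanation can be sketched outside Squeak. The snippet below is only an illustration using Python's built-in codecs, not Squeak's converters; the sample strings are arbitrary choices.

```python
# Illustration only: Python's built-in UTF-8 codec, not Squeak's converters.

# Three code points: 'a' (U+0061), e-acute (U+00E9), a CJK ideograph (U+65E5).
text = "a\u00e9\u65e5"

# UTF-8 is a byte sequence; each code point takes 1 to 4 bytes.
utf8_bytes = text.encode("utf-8")

# Decoding the bytes gives back the code points.
code_points = [ord(ch) for ch in utf8_bytes.decode("utf-8")]

# One user-perceived character may still be several code points:
# 'e' followed by COMBINING ACUTE ACCENT (U+0301).
combined = "e\u0301"

print(len(utf8_bytes))                 # number of bytes
print([hex(cp) for cp in code_points])  # the decoded code points
print(len(combined))                   # code points in one visible character
```

Here the three code points occupy 1, 2 and 3 bytes respectively, which is the "space-optimized" aspect of UTF-8 mentioned above.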
On Mon, 2008-09-29 at 18:53 +0200, stephane ducasse wrote:
> >>> Am I the only one using the generic en/decoding functionality in
> >>> Squeak in the form of #convertTo/FromEncoding?
> >>>
> >>> Convert from "Squeak" to UTF-8
> >>>   aString convertToEncoding: 'utf-8'
> >>
> >> Do I understand correctly that such an aString is a sequence of
> >> Unicode code points?
> >
> > At first the UTF-8 is a sequence of bytes. These bytes are a
> > space-optimized encoding of a code point (UTF-8). If you decode those
> > bytes you get your code point (Unicode). From a sequence of code points
> > you can derive a character. In most cases (for us westerners) it will
> > be a single code point AFAIK.
>
> I'm trying to really understand in Squeak. :)
> What we call character is what then?
> Is it a codepoint? or the looked up glyph in a font table?

I don't know. I've never dealt with how Squeak does those things.

Norbert
On 29.09.2008 at 11:11, Norbert Hartl wrote:
> On Mon, 2008-09-29 at 18:53 +0200, stephane ducasse wrote:
>>>>> Am I the only one using the generic en/decoding functionality in
>>>>> Squeak in the form of #convertTo/FromEncoding?
>>>>>
>>>>> Convert from "Squeak" to UTF-8
>>>>>   aString convertToEncoding: 'utf-8'
>>>>
>>>> Do I understand correctly that such an aString is a sequence of
>>>> Unicode code points?
>>>
>>> At first the UTF-8 is a sequence of bytes. These bytes are a
>>> space-optimized encoding of a code point (UTF-8). If you decode those
>>> bytes you get your code point (Unicode). From a sequence of code points
>>> you can derive a character. In most cases (for us westerners) it
>>> will be a single code point AFAIK.
>>
>> I'm trying to really understand in Squeak. :)
>> What we call character is what then?
>> Is it a codepoint? or the looked up glyph in a font table?
>>
> I don't know. I've never dealt with how Squeak does those things

A character represents a single code point. A font maps code points to glyphs.

A character also encodes a language tag (a.k.a. leading char), but we all seem to agree that's a bad idea. It was done to allow easier migration of old code (for many eastern languages a code point and a font are not enough for rendering; you also need to know the language).

- Bert -
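To make the "code point plus language tag in one character value" idea concrete, here is a minimal sketch in Python. The field widths (a 22-bit code point with an 8-bit tag above it), the tag numbers, and the names `pack`, `leading_char` and `code_point` are all assumptions made for illustration; the class comment of Character in the image is the authority on the actual layout.

```python
# Sketch of packing a language tag ("leading char") and a code point into
# one integer. Widths and tag values are ASSUMED for illustration only.

CODE_BITS = 22                     # assumed width of the code-point field
CODE_MASK = (1 << CODE_BITS) - 1
TAG_MASK = 0xFF                    # assumed 8-bit language tag


def pack(tag: int, code: int) -> int:
    """Combine a language tag and a code point into a single value."""
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("language tag out of range")
    if not 0 <= code <= CODE_MASK:
        raise ValueError("code point out of range")
    return (tag << CODE_BITS) | code


def leading_char(value: int) -> int:
    """Extract the language tag from a packed value."""
    return (value >> CODE_BITS) & TAG_MASK


def code_point(value: int) -> int:
    """Extract the plain code point from a packed value."""
    return value & CODE_MASK


# Same code point (U+76F4), two hypothetical language tags.
a = pack(1, 0x76F4)
b = pack(2, 0x76F4)

print(code_point(a) == code_point(b))  # same code point
print(a == b)                          # but the packed values differ
```

Under these assumptions, two characters with the same code point but different tags compare unequal, which is the kind of equality issue raised in this thread; a tag of 0 leaves the value identical to the plain code point.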
Bert Freudenberg wrote:
> A character also encodes a language-tag (a.k.a. leading char) but we all
> seem to agree that's a bad idea, it was done to allow easier migration
> of old code (for many eastern languages a code point and a font is not
> enough for rendering, you also need to know the language).

I wouldn't necessarily call it a bad idea. It is incomplete, for sure, but it is one of the ways one can deal with this problem. Even though I prefer having language information in text attributes, the language tag per se wouldn't cause problems if the code were able to deal with its absence. E.g., if one could use strings with "just Unicode" I wouldn't mind having the ability to add the language tag for disambiguation where necessary (issues of equality etc. notwithstanding, which is why I think using text attributes is the better way to go).

The problem is that too much code relies on both the presence of the language tag and on particular values of it for certain code points, and simply breaks if it isn't filled in "correctly". As such the language tag seems to be mostly redundant with certain code points.

I guess one way to get over this is to add a preference that leaves out the language tag and just try running that way to see what breaks and where.

Cheers,
  - Andreas
In reply to this post by Bert Freudenberg
At Mon, 29 Sep 2008 11:24:36 -0700,
Bert Freudenberg wrote:
>
> >> I'm trying to really understand in Squeak. :)
> >> What we call character is what then?
> >> Is it a codepoint? or the looked up glyph in a font table?
> >>
> > I don't know. I've never dealt with how Squeak does those things
>
> A character represents a single code point.

This I would like to be philosophically false, but Unicode decided that is the way it is. We use Unicode for part of the representation, but we can have a different philosophy there.

> A font maps code points to glyphs.

And the trouble is that "a font" cannot really map code points to the glyphs that the users want, so we need additional information. IOW, if we follow the philosophy of "a character is a code point and a font maps to glyph", we should not be able to print-it "a codepoint" in a workspace. I am not sure that the Squeak community would like to go all the way like that.

-- Yoshiki