[squeak-dev] how to create an UTF-8 character

Philippe Marschall

Re: [squeak-dev] how to create an UTF-8 character

2008/9/28 Damien Pollet <[hidden email]>:
> On Sat, Sep 27, 2008 at 7:05 PM, Philippe Marschall
> <[hidden email]> wrote:
>> Plus leading char.
>
> You mean the BOM (byte order mark) or something else ?

No, I mean the language of the image encoded into every single
character with an index bigger than 255. Check the class comment of
Character for more information.

Cheers
Philippe

Andreas.Raab

[squeak-dev] Re: how to create an UTF-8 character

In reply to this post by Yoshiki Ohshima-2

Yoshiki Ohshima wrote:
> For the kind of web applications and servers that deal with stuff
> outside of Squeak, it doesn't serve a good purpose, because editing,
> displaying etc. are out of scope. Needless to say, the original idea
> was to make Squeak a dynamic, interactive, multilingualized
> environment, so there is a mismatch. Web applications etc. historically
> came after that goal.

Which wouldn't be a problem if the code was able to handle the data
properly. Unfortunately, the effects of an "invalid" leading char are
very, very strange (everything from crashing the scanner to raising
weird errors in comparisons, character access etc). As it stands, an
application that uses non-Latin characters off the web is best off
keeping everything in UTF-8.

BTW, one way to deal with this properly is by providing a leading char
upon input conversion (i.e., utf8ToSqueak would then insert the proper
leading chars for each character). As a matter of fact, I think this is
what Unicode class>>value: should do (instead of substituting the
environmental leading char).

> If you need to retain these extra information, sending the strings
> without going through UTF-8 conversion makes more sense.

Or provide it via additional attributes. I still think that language
information would best be modeled by a text attribute - in which case we
have a plain Unicode implementation for strings as well as the ability
to provide the disambiguation in text where required.

Cheers,
- Andreas
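
As a rough illustration of the text-attribute idea above, here is a minimal sketch in Python (not Squeak code). The names TaggedText, LanguageRun and language_at are hypothetical and do not correspond to any Squeak class; the only point is that the string itself stays plain Unicode while the language information lives in separate runs.

```python
from dataclasses import dataclass, field


@dataclass
class LanguageRun:
    """Hypothetical attribute: characters in [start, stop) are in `language`."""
    start: int
    stop: int
    language: str


@dataclass
class TaggedText:
    """Plain Unicode string plus optional language runs (illustrative only)."""
    string: str
    runs: list = field(default_factory=list)

    def language_at(self, index, default=None):
        # Return the language of the first run covering `index`, if any.
        for run in self.runs:
            if run.start <= index < run.stop:
                return run.language
        return default


# The same code point (U+76F4) tagged for two different languages.
japanese = TaggedText("\u76f4", [LanguageRun(0, 1, "ja")])
chinese = TaggedText("\u76f4", [LanguageRun(0, 1, "zh")])

# The underlying strings compare equal; only the attributes differ.
same_string = japanese.string == chinese.string
same_text = japanese == chinese
```

With this shape, string comparison and character access never see the language, and a renderer that needs disambiguation can ask language_at for it.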

Philippe Marschall

Re: [squeak-dev] Re: how to create an UTF-8 character

In reply to this post by Andreas.Raab

2008/9/28, Andreas Raab <[hidden email]>:

+1

Cheers
Philippe

Yoshiki Ohshima-2

Re: [squeak-dev] Re: how to create an UTF-8 character

In reply to this post by Andreas.Raab

At Sun, 28 Sep 2008 10:45:00 -0700,
Andreas Raab wrote:
>
> > If you need to retain these extra information, sending the strings
> > without going through UTF-8 conversion makes more sense.
>
> Or provide it via additional attributes. I still think that language
> information would best be modeled by a text attribute - in which case we
> have a plain Unicode implementation for strings as well as the ability
> to provide the disambiguation in text where required.

Well, sure, that is the cleaner approach, though it takes more work. That
is what I've been mentioning from time to time. The consequence would be
that a bare character object or string object won't show up in the proper
way; but it is not a big problem.

-- Yoshiki

stephane ducasse

Re: [squeak-dev] how to create an UTF-8 character

In reply to this post by NorbertHartl

>>>> Am I the only one using the generic en/decoding functionality in
>>> Squeak in the form of #convertTo/FromEncoding?
>>>
>>> Convert from "Squeak" to UTF-8
>>> aString convertToEncoding: 'utf-8'
>>
>>
>> do I understand correctly that such a aString is a sequence of
>> unicode
>> codepoints?
>>>
> At first the utf-8 is a sequence of bytes. These bytes are a space
> optimized encoding of a code point (utf-8). If you decode those bytes
> you get your code point (unicode). From a sequence of code points
> you can derive a character. In most cases (for us westerners) it will
> be a single code point AFAIK.

I'm trying to really understand how this works in Squeak. :)
What is it that we call a character, then?
Is it a code point, or the glyph looked up in a font table?

Stef
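
To make the distinction between bytes, code points and characters in the quoted explanation concrete, here is a small sketch in Python (used purely for illustration; it is not Squeak code). It relies only on the standard library, and the sample strings are arbitrary.

```python
import unicodedata

# Two code points: U+00E9 (e with acute) and U+20AC (euro sign).
text = "\u00e9\u20ac"

# UTF-8 is a byte encoding of those code points: 2 bytes + 3 bytes.
utf8_bytes = text.encode("utf-8")

# Decoding the bytes gives the code points back.
code_points = [ord(ch) for ch in utf8_bytes.decode("utf-8")]

# One user-visible character can also be made of several code points:
# "e" followed by U+0301 (combining acute accent).
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(utf8_bytes.hex(" "))
print([hex(cp) for cp in code_points])
print(len(decomposed), len(composed))
```

So there are three levels to keep apart: the encoded bytes, the decoded code points, and what a reader perceives as one character.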

NorbertHartl

Re: [squeak-dev] how to create an UTF-8 character

On Mon, 2008-09-29 at 18:53 +0200, stephane ducasse wrote:

> >>>> Am I the only one using the generic en/decoding functionality in
> >>> Squeak in the form of #convertTo/FromEncoding?
> >>>
> >>> Convert from "Squeak" to UTF-8
> >>> aString convertToEncoding: 'utf-8'
> >>
> >>
> >> do I understand correctly that such a aString is a sequence of
> >> unicode
> >> codepoints?
> >>>
> > At first the utf-8 is a sequence of bytes. These bytes are a space
> > optimized encoding of a code point (utf-8). If you decode those bytes
> > you get your code point (unicode). From a sequence of code points
> > you can derive a character. In most cases (for us westerners) it will
> > be a single code point AFAIK.
>
> I'm trying to really understand in Squeak. :)
> What we call character is what then?
> Is it a codepoint? or the looked up glyph in a font table?
>

I don't know. I've never dealt with how Squeak does those things.

Norbert

Bert Freudenberg

Re: [squeak-dev] how to create an UTF-8 character

Am 29.09.2008 um 11:11 schrieb Norbert Hartl:

> On Mon, 2008-09-29 at 18:53 +0200, stephane ducasse wrote:
>>>>>> Am I the only one using the generic en/decoding functionality in
>>>>> Squeak in the form of #convertTo/FromEncoding?
>>>>>
>>>>> Convert from "Squeak" to UTF-8
>>>>> aString convertToEncoding: 'utf-8'
>>>>
>>>>
>>>> do I understand correctly that such a aString is a sequence of
>>>> unicode
>>>> codepoints?
>>>>>
>>> At first the utf-8 is a sequence of bytes. These bytes are a space
>>> optimized encoding of a code point (utf-8). If you decode those bytes
>>> you get your code point (unicode). From a sequence of code points
>>> you can derive a character. In most cases (for us westerners) it
>>> will
>>> be a single code point AFAIK.
>>
>> I'm trying to really understand in Squeak. :)
>> What we call character is what then?
>> Is it a codepoint? or the looked up glyph in a font table?
>>
> I don't know. I've never dealt with how squeak does those things

A character represents a single code point. A font maps code points to
glyphs.

A character also encodes a language tag (a.k.a. leading char), but we
all seem to agree that's a bad idea. It was done to allow easier
migration of old code (for many eastern languages a code point and a
font are not enough for rendering; you also need to know the language).

- Bert -
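
For readers who want to picture "a code point plus a language tag in one character value", here is a minimal sketch in Python. The field widths (a 22-bit code point with the tag in the bits above it) and the tag values 1 and 2 are assumptions chosen for illustration; the class comment of Character in your image is the authority on the real layout.

```python
CODE_BITS = 22                      # assumed width of the code point field
CODE_MASK = (1 << CODE_BITS) - 1
TAG_MAX = 0xFF                      # assumed 8-bit language tag


def pack(leading_char, code_point):
    """Combine a language tag and a code point into one integer value."""
    if not 0 <= code_point <= CODE_MASK:
        raise ValueError("code point out of range")
    if not 0 <= leading_char <= TAG_MAX:
        raise ValueError("leading char out of range")
    return (leading_char << CODE_BITS) | code_point


def leading_char(value):
    """Extract the language tag from a packed value."""
    return value >> CODE_BITS


def char_code(value):
    """Extract the plain code point from a packed value."""
    return value & CODE_MASK


# Same code point (U+76F4) with two hypothetical language tags.
a = pack(1, 0x76F4)
b = pack(2, 0x76F4)

# The packed values differ even though the code points are identical,
# so a naive comparison treats them as different characters.
print(a == b, char_code(a) == char_code(b))
```

With a tag of 0 the packed value is just the code point, which is one way to read the suggestion of running with "just Unicode" strings.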

Andreas.Raab

[squeak-dev] Re: how to create an UTF-8 character

Bert Freudenberg wrote:
> A character also encodes a language-tag (a.k.a. leading char) but we all
> seem to agree that's a bad idea, it was done to allow easier migration
> of old code (for many eastern languages a code point and a font is not
> enough for rendering, you also need to know the language).

I wouldn't necessarily call it a bad idea. It is incomplete, for sure,
but it is one of the ways one can deal with this problem. Even though I
prefer having language information in text attributes, the language tag
per se wouldn't cause problems if the code were able to deal with
its absence. E.g., if one could use strings with "just unicode" I
wouldn't mind having the ability to add the language tag for
disambiguation where necessary (issues of equality etc. notwithstanding
which is why I think using text attributes is the better way to go).

The problem is that too much code relies on both the presence of the
language tag and particular tag values for certain code points, and simply
breaks if the tag isn't filled in "correctly". As such the language tag
seems to be mostly
redundant with certain code points. I guess one way to get over this is
to add a preference that leaves out the language tag and just try
running that way to see what and where it breaks.

Cheers,
- Andreas

Yoshiki Ohshima-2

Re: [squeak-dev] how to create an UTF-8 character

In reply to this post by Bert Freudenberg

At Mon, 29 Sep 2008 11:24:36 -0700,
Bert Freudenberg wrote:
>
> >> I'm trying to really understand in Squeak. :)
> >> What we call character is what then?
> >> Is it a codepoint? or the looked up glyph in a font table?
> >>
> > I don't know. I've never dealt with how squeak does those things
>
> A character represents a single code point.

This I would like to be philosophically false, but Unicode decided
that this is the way it is. We use Unicode for part of the representation,
but we can have a different philosophy there.

> A font maps code points to glyphs.

And the trouble is that "a font" alone cannot really map code points to
the glyphs that users want, so we need additional information.

IOW, if we follow the philosophy of "a character is a code point and
a font maps to glyph", we should not be able to print-it "a codepoint"
in a workspace. I am not sure that the Squeak community would like to
go all the way like that.

-- Yoshiki