[squeak-dev] how to create an UTF-8 character


[squeak-dev] how to create an UTF-8 character

stephane ducasse
Hi all

I would like to know how I can create an UTF-* character composed for  
example of two bytes

16rC3 and 16rBC

I tried

        WideString fromByteArray: { 16rC3 . 16rBC }

Stef


Re: [squeak-dev] how to create an UTF-8 character

NorbertHartl
On Tue, 2008-09-23 at 10:46 +0200, stephane ducasse wrote:

> Hi all
>
> I would like to know how I can create an UTF-* character composed for  
> example of two bytes
>
> 16rC3 and 16rBC
>
> I tried
>
> WideString fromByteArray: { 16rC3 . 16rBC }
>
> Stef
>

Hmm, I'm not sure what you mean by UTF-* Character but this
way it works

(
  (
    String fromByteArray: (
      ByteArray with: 16rC3 with: 16rBC
    )
  ) convertFromEncoding: #utf8
) at: 1

And it is not a two-byte character because it is a character
that is contained in latin-1.
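
The point about Latin-1 can be checked by decoding the two bytes by hand. This is a minimal workspace sketch of the standard two-byte UTF-8 layout (110xxxxx 10xxxxxx), using only integer arithmetic; the variable names are made up for illustration.

```smalltalk
| lead continuation codePoint |
lead := 16rC3.          "lead byte of a two-byte sequence, bit pattern 110xxxxx"
continuation := 16rBC.  "continuation byte, bit pattern 10xxxxxx"
"Keep the 5 payload bits of the lead byte and the 6 payload bits of the continuation byte."
codePoint := ((lead bitAnd: 16r1F) bitShift: 6) + (continuation bitAnd: 16r3F).
codePoint.                    "252, i.e. 16rFC"
Character value: codePoint.   "the character ü, whose code point fits in a single byte"
```

So the two bytes are only the UTF-8 encoding; the decoded code point 16rFC is below 256 and therefore inside Latin-1.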

I thought there would be an easier/better way to do it! Bert? :)

Norbert



Re: [squeak-dev] how to create an UTF-8 character

Bert Freudenberg
In reply to this post by stephane ducasse
Am 23.09.2008 um 01:46 schrieb stephane ducasse:

> Hi all
>
> I would like to know how I can create an UTF-* character composed  
> for example of two bytes
>
> 16rC3 and 16rBC
>
> I tried
>
> WideString fromByteArray: { 16rC3 . 16rBC }
>
> Stef

There is no such thing as a "UTF-*" character. There are Unicode
Characters, and Unicode Strings, and there are UTF-encoded strings (UTF
means Unicode Transformation Format).

All characters in Squeak use Unicode now. For example, the Cyrillic Б
is

        char := Character value: 16r0411.

this can be made into a String:

        wideString := String with: char.

which of course has the same Unicode code points:

        wideString asArray collect: [:each | each hex]

gives

         #('16r411')

The string can be encoded as UTF-8:

        utf8String := wideString squeakToUtf8.

and to see the values there

        utf8String asArray collect: [:each | each hex]

yields

         #('16rD0' '16r91')

which is the UTF-8 representation of the character we began with (but
if you try to print utf8String directly you get nonsense, because
Squeak does not know it is UTF-8 encoded).

The decoding of UTF-8 to a String is similar:

        #(16rC3 16rBC) asByteArray asString utf8ToSqueak

which returns the String 'ü' and probably is what you wanted in the  
first place - but please try to understand and use the Unicode terms  
correctly to minimize confusion.

Anyway, to convert between a String in UTF-8 and a regular Squeak  
String, it's simplest to use utf8ToSqueak and squeakToUtf8.

- Bert -




Re: [squeak-dev] how to create an UTF-8 character

NorbertHartl
On Tue, 2008-09-23 at 06:48 -0700, Bert Freudenberg wrote:

> Am 23.09.2008 um 01:46 schrieb stephane ducasse:
>
> > Hi all
> >
> > I would like to know how I can create an UTF-* character composed  
> > for example of two bytes
> >
> > 16rC3 and 16rBC
> >
> > I tried
> >
> > WideString fromByteArray: { 16rC3 . 16rBC }
> >
> > Stef
>
> There is no such thing as a "UTF-*" character. There are Unicode  
> Characters, and Unicode Strings, and there are UTF-encoded strings (UTF
> means Unicode Transformation Format).
>
> All characters in Squeak use Unicode now. For example, the cyrillic Б  
> is
>
> char := Character value: 16r0411.
>
> this can be made into a String:
>
> wideString := String with: char.
>
> which of course has the same Unicode code points:
>
> wideString asArray collect: [:each | each hex]
>
> gives
>
> #('16r411')
>
> The string can be encoded as UTF-8:
>
> utf8String := wideString squeakToUtf8.
>
> and to see the values there
>
> utf8String asArray collect: [:each | each hex]
>
> yields
>
> #('16rD0' '16r91')
>
> which is the UTF-8 representation of the character we began with (but  
> if you try to print utf8String directly you get nonsense, because
> Squeak does not know it is UTF-8 encoded).
>
> The decoding of UTF-8 to a String is similar:
>
> #(16rC3 16rBC) asByteArray asString utf8ToSqueak
>
Hmmm, I knew it :) That is the same as what I did, just readable and in one
line (and with more of this "strange method stuff"[tm]).

> which returns the String 'ü' and probably is what you wanted in the  
> first place - but please try to understand and use the Unicode terms  
> correctly to minimize confusion.
>
> Anyway, to convert between a String in UTF-8 and a regular Squeak  
> String, it's simplest to use utf8ToSqueak and squeakToUtf8.
>
> - Bert -
>

Norbert

P.S.: My only hope is that, with my knowledge getting bigger and Pharo
getting smaller, we meet somewhere in between!!!



Re: [squeak-dev] how to create an UTF-8 character

K. K. Subramaniam
In reply to this post by stephane ducasse
On Tuesday 23 Sep 2008 2:16:43 pm stephane ducasse wrote:
> I would like to know how I can create an UTF-* character composed for  
> example of two bytes
>
> 16rC3 and 16rBC
>
> I tried
>
>         WideString fromByteArray: { 16rC3 . 16rBC }

 alphaBeta := WideString from: #(945 946).

gives me a Squeak wide string containing Greek alpha and beta. The numbers are
from Unicode BMP for Greek.

  alphaBeta squeakToUtf8 asByteArray

yields the UTF-8 sequence #(206 177 206 178)

and
 #(206 177 206 178) asByteArray asString utf8ToSqueak

gives me back the original string.

Of course, you should turn on the "usePangoRenderer" preference to see
characters other than Latin-1 rendered correctly.

HTH .. Subbu


Re: [squeak-dev] how to create an UTF-8 character

Philippe Marschall
In reply to this post by Bert Freudenberg
2008/9/23 Bert Freudenberg <[hidden email]>:

> Am 23.09.2008 um 01:46 schrieb stephane ducasse:
>
>> Hi all
>>
>> I would like to know how I can create an UTF-* character composed for
>> example of two bytes
>>
>> 16rC3 and 16rBC
>>
>> I tried
>>
>>        WideString fromByteArray: { 16rC3 . 16rBC }
>>
>> Stef
>
> There is no such thing as a "UTF-*" character. There are Unicode Characters,
> and Unicode Strings, and there are UTF-encoded strings (UTF means Unicode
> Transformation Format).
>
> All characters in Squeak use Unicode now. For example, the cyrillic Б is
>
>        char := Character value: 16r0411.
>
> this can be made into a String:
>
>        wideString := String with: char.
>
> which of course has the same Unicode code points:
>
>        wideString asArray collect: [:each | each hex]
>
> gives
>
>         #('16r411')
>
> The string can be encoded as UTF-8:
>
>        utf8String := wideString squeakToUtf8.
>
> and to see the values there
>
>        utf8String asArray collect: [:each | each hex]
>
> yields
>
>         #('16rD0' '16r91')
>
> which is the UTF-8 representation of the character we began with (but if you
> try to print utf8String directly you get nonsense, because Squeak does not
> know it is UTF-8 encoded).
>
> The decoding of UTF-8 to a String is similar:
>
>        #(16rC3 16rBC) asByteArray asString utf8ToSqueak
>
> which returns the String 'ü' and probably is what you wanted in the first
> place - but please try to understand and use the Unicode terms correctly to
> minimize confusion.
>
> Anyway, to convert between a String in UTF-8 and a regular Squeak String,
> it's simplest to use utf8ToSqueak and squeakToUtf8.
Am I the only one using the generic en/decoding functionality in
Squeak in the form of #convertTo/FromEncoding?

Convert from "Squeak" to UTF-8
aString convertToEncoding: 'utf-8'

Convert from UTF-8 to "Squeak"
aString convertFromEncoding: 'utf-8'

For checking out all the encodings your image supports:
TextConverter allEncodingNames

Cheers
Philippe



Re: [squeak-dev] how to create an UTF-8 character

Damien Pollet
Is there a reason (other than history) why Strings are not collections
of unicode characters (at least as viewed from outside) rather than
bytes in some unknown encoding (which should be encapsulated and only
appear when text goes in and out the image) ? Or is it already like
that ?

On Tue, Sep 23, 2008 at 7:49 PM, Philippe Marschall
<[hidden email]> wrote:

> 2008/9/23 Bert Freudenberg <[hidden email]>:
>> Am 23.09.2008 um 01:46 schrieb stephane ducasse:
>>
>>> Hi all
>>>
>>> I would like to know how I can create an UTF-* character composed for
>>> example of two bytes
>>>
>>> 16rC3 and 16rBC
>>>
>>> I tried
>>>
>>>        WideString fromByteArray: { 16rC3 . 16rBC }
>>>
>>> Stef
>>
>> There is no such thing as a "UTF-*" character. There are Unicode Characters,
>> and Unicode Strings, and there are UTF-encoded strings (UTF means Unicode
>> Transformation Format).
>>
>> All characters in Squeak use Unicode now. For example, the cyrillic Б is
>>
>>        char := Character value: 16r0411.
>>
>> this can be made into a String:
>>
>>        wideString := String with: char.
>>
>> which of course has the same Unicode code points:
>>
>>        wideString asArray collect: [:each | each hex]
>>
>> gives
>>
>>         #('16r411')
>>
>> The string can be encoded as UTF-8:
>>
>>        utf8String := wideString squeakToUtf8.
>>
>> and to see the values there
>>
>>        utf8String asArray collect: [:each | each hex]
>>
>> yields
>>
>>         #('16rD0' '16r91')
>>
>> which is the UTF-8 representation of the character we began with (but if you
>> try to print utf8String directly you get nonsense, because Squeak does not
>> know it is UTF-8 encoded).
>>
>> The decoding of UTF-8 to a String is similar:
>>
>>        #(16rC3 16rBC) asByteArray asString utf8ToSqueak
>>
>> which returns the String 'ü' and probably is what you wanted in the first
>> place - but please try to understand and use the Unicode terms correctly to
>> minimize confusion.
>>
>> Anyway, to convert between a String in UTF-8 and a regular Squeak String,
>> it's simplest to use utf8ToSqueak and squeakToUtf8.
>
> Am I the only one using the generic en/decoding functionality in
> Squeak in the form of #convertTo/FromEncoding?
>
> Convert from "Squeak" to UTF-8
> aString convertToEncoding: 'utf-8'
>
> Convert from UTF-8 to "Squeak"
> aString convertFromEncoding: 'utf-8'
>
> For checking out all the encodings your image supports:
> TextConverter allEncodingNames
>
> Cheers
> Philippe
>
>
>
>


--
Damien Pollet
type less, do more [ | ] http://people.untyped.org/damien.pollet



Re: [squeak-dev] how to create an UTF-8 character

Yoshiki Ohshima-2
At Wed, 24 Sep 2008 10:49:18 +0200,
Damien Pollet wrote:
>
> Is there a reason (other than history) why Strings are not collections
> of unicode characters (at least as viewed from outside) rather than
> bytes in some unknown encoding (which should be encapsulated and only
> appear when text goes in and out the image) ? Or is it already like
> that ?

  I think the answer is that it is already *like that*, although I
can't tell what you mean by "from outside".

  In the image, a ByteString or WideString is a sequence of characters
that hold Unicode code points.  (Note that a Unicode code point is
21 bits wide.)  If all the code points in a string fit within 8 bits, we
use ByteString; if they don't, we use WideString, but the distinction is
more or less hidden from a casual user.  Conversion is only needed
when the String interfaces with the outside of the image.
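
The ByteString/WideString split can be observed in a workspace. This is a small sketch that assumes String with: picks the concrete class from the character's code point, as the wideString example earlier in the thread suggests; the expected results are what I would anticipate, not verified output from a particular image.

```smalltalk
| narrow wide |
narrow := String with: (Character value: 16rFC).   "ü, code point below 256"
wide := String with: (Character value: 16r0411).   "Б, code point above 255"
narrow class.              "expected: ByteString"
wide class.                "expected: WideString"
narrow size = wide size.   "true: each holds exactly one character"
```

Both objects answer to the same String protocol, which is why a casual user rarely notices which class is in play.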

  A Unicode code point doesn't really correspond to the concept of a
character, if you consider an accented character a "character".  The
original concept of Unicode was that such a "character" should always be
represented as a sequence of code points: one base character, and
one or more accent marks.  It was at least pure and fair.

  But they got the "Latin-1 compatibility" idea around 1990 in a
retrofitted way, so the original idea of "Let us make a universal
character set for everybody in the world" turned into: "Let us make
a universal character set for everybody in the world, but let's treat
Westerners nicer."  Of course, this turn created a situation where a
simple accented character has two representations (precomposed and
decomposed).  Squeak is still way behind and prefers the
precomposed "normalization", but the normalization is really lax
there.
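
To make the two representations concrete, here is a sketch that builds both spellings of é with the WideString from: idiom used earlier in the thread. The code points U+00E9 (precomposed) and U+0065 followed by U+0301 (decomposed) come from the Unicode charts; the variable names are illustrative.

```smalltalk
| precomposed decomposed |
"U+00E9 LATIN SMALL LETTER E WITH ACUTE"
precomposed := WideString from: #(16rE9).
"U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT"
decomposed := WideString from: #(16r65 16r0301).
precomposed size.           "1"
decomposed size.            "2"
precomposed = decomposed.   "false: the strings differ in length, and = does not normalize"
```

Both strings should look the same to a reader once rendered, yet they compare as different, which is the cost of having two representations without strict normalization.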

  To me, Han unification is further evidence of the "Westerners first"
idea.  If tracing back to the origin of characters is the concept, i
and j should perhaps be unified as well (just kidding).

  But, Unicode is the standard now, and it does solve a lot of
problems.  So using it as the base but putting necessary information
around it to support it is a good way in principle.

  If so, one could argue that we can just hold every string in
decomposed UTF-8 in the image, and have a couple of variants of at:
and at:put:.  The requirement of O(1) random access is not that
important.  I might go that direction if I redo it now.

-- Yoshiki


Re: [squeak-dev] how to create an UTF-8 character

Colin Putney

On 24-Sep-08, at 2:26 AM, Yoshiki Ohshima wrote:

> At Wed, 24 Sep 2008 10:49:18 +0200,
> Damien Pollet wrote:
>>
>> Is there a reason (other than history) why Strings are not  
>> collections
>> of unicode characters (at least as viewed from outside) rather than
>> bytes in some unknown encoding (which should be encapsulated and only
>> appear when text goes in and out the image) ? Or is it already like
>> that ?
>
>  I think the answer is that it is already *like that*, although I
> can't tell what you mean by "from outside".

I think Damien's confusion comes from the fact that the abstractions  
are a bit leaky. For example, if you do something like this:

'ábc' convertToEncoding: 'utf-8'

the result is 'Ã¡bc'. It's a string where the internal, "encapsulated"
state is such that writing it to a socket or file will produce the  
desired bytes, but all in-image behavior is totally broken.
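
One way to see the leak is to compare sizes before and after encoding. This sketch assumes convertToEncoding: answers a String whose elements are the raw UTF-8 bytes, as described above; the sizes in the comments follow from that assumption and from the UTF-8 encoding of á (16rC3 16rA1), not from a run in a specific image.

```smalltalk
| original encoded |
original := 'ábc'.
encoded := original convertToEncoding: 'utf-8'.
original size.         "3 characters"
encoded size.          "4 if the result holds the raw bytes C3 A1 62 63"
encoded asByteArray.   "inspect the bytes that would be written to a socket or file"
```

Anything that indexes, compares or displays the encoded string in the image then works on bytes rather than on characters.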

VisualWorks tends to do a better job of maintaining the abstractions,
I think. The equivalent of the above example would produce a ByteArray.

> If so, one could argue that we can just hold every string in
> decomposed UTF-8 in the image, and have a couple of variants of at:
> and at:put:.  The requirement of O(1) random access is not that
> important.  I might go that direction if I redo it now.

A UTF8String would be really handy for web applications, where strings  
come in from the net as UTF-8, live in the image for a while, then get  
sent out as UTF-8. O(1) random access isn't very useful, because  
strings are (mostly) uninterpreted, but converting to Squeak's  
internal representation is expensive.

The thing is, as long as the "sequence of characters" abstraction is  
maintained, it doesn't matter (for purposes of correct behavior) what  
the internal representation is. So it's perfectly reasonable to have  
multiple encodings with different performance profiles. UTF8String  
when you need it, WideString when that makes sense.

Colin

Re: [squeak-dev] how to create an UTF-8 character

K. K. Subramaniam
In reply to this post by Yoshiki Ohshima-2
On Wednesday 24 Sep 2008 2:56:43 pm Yoshiki Ohshima wrote:
>   In the image, a ByteString or WideString is a sequence of characters
> that hold Unicode code points.  (Note that a Unicode code point is
> 21 bits wide.)  If all the code points in a string fit within 8 bits, we
> use ByteString; if they don't, we use WideString
You mean a sequence of code points? Instances of Character hold only one code
point (value), while some characters need more than one code point (e.g. ksha
in Devanagari needs three).
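
As a concrete instance, ksha can be built from its three code points with the WideString from: idiom shown earlier in the thread. The code points are KA (U+0915), VIRAMA (U+094D) and SSA (U+0937); the variable name is illustrative.

```smalltalk
| ksha |
"U+0915 DEVANAGARI LETTER KA, U+094D DEVANAGARI SIGN VIRAMA, U+0937 DEVANAGARI LETTER SSA"
ksha := WideString from: #(16r0915 16r094D 16r0937).
ksha size.   "3 code points, although a reader sees the single conjunct क्ष"
```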

Subbu


Re: [squeak-dev] how to create an UTF-8 character

Yoshiki Ohshima-2
In reply to this post by Colin Putney
At Wed, 24 Sep 2008 07:45:38 -0700,
Colin Putney wrote:

>
> A UTF8String would be really handy for web applications, where strings  
> come in from the net as UTF-8, live in the image for a while, then get  
> sent out as UTF-8. O(1) random access isn't very useful, because  
> strings are (mostly) uninterpreted, but converting to Squeak's  
> internal representation is expensive.
>
> The thing is, as long as the "sequence of characters" abstraction is  
> maintained, it doesn't matter (for purposes of correct behavior) what  
> the internal representation is. So it's perfectly reasonable to have  
> multiple encodings with different performance profiles. UTF8String  
> when you need it, WideString when that makes sense.

  The thing is, though, that even on the net UTF-8 is not as dominant
as that.  There are a bunch of other encodings in use.

  And having both UTF8String and WideString makes comparison etc. more
complicated than it should be.  Having a single internal representation
is cleaner.

  Keeping the encoded data in a ByteArray is a sensible thing to do.  That
would have been a much bigger redesign of Squeak, though.

-- Yoshiki



Re: [squeak-dev] how to create an UTF-8 character

Yoshiki Ohshima-2
In reply to this post by K. K. Subramaniam
At Wed, 24 Sep 2008 20:38:18 +0530,
K. K. Subramaniam wrote:
>
> On Wednesday 24 Sep 2008 2:56:43 pm Yoshiki Ohshima wrote:
> >   In the image, a ByteString or WideString is a sequence of characters
> > that hold Unicode code points.  (Note that a Unicode code point is
> > 21 bits wide.)  If all the code points in a string fit within 8 bits, we
> > use ByteString; if they don't, we use WideString
> You mean a sequence of code points? Instances of Character hold only one code
> point (value), while some characters need more than one code point (e.g. ksha
> in Devanagari needs three).

  Yes, a sequence of code points, as rephrased further down in that email.

-- Yoshiki


Re: [squeak-dev] how to create an UTF-8 character

stephane ducasse
In reply to this post by Bert Freudenberg
>> There is no such thing as a "UTF-*" character. There are Unicode  
>> Characters, and Unicode Strings, and there are UTF-encoded strings
>> (UTF means Unicode Transformation Format).

Yes I was sloppy.
Thanks for the answer

> All characters in Squeak use Unicode now.

Do you mean that the characters are all encoded using code point values?

Can you tell me what the "now" refers to?
OLPC? 3.8?
I wanted to check the changes made in OLPC and harvest them in Pharo.
Do you know if there are some tests somewhere?

> For example, the cyrillic Б is
>
> char := Character value: 16r0411.
>
> this can be made into a String:
>
> wideString := String with: char.

When I do char printString,
Squeak 3.9 hangs. :(
>
>
> which of course has the same Unicode code points:
>
> wideString asArray collect: [:each | each hex]
>
> gives
>
> #('16r411')

Here you are talking about code points.
How do I get the corresponding glyph? Using an encoding, I imagine.

> The string can be encoded as UTF-8:
>
> utf8String := wideString squeakToUtf8.
>
> and to see the values there
>
> utf8String asArray collect: [:each | each hex]
>
> yields
>
> #('16rD0' '16r91')
>
> which is the UTF-8 representation of the character we began with  
> (but if you try to print utf8String directly you get nonsense,
> because Squeak does not know it is UTF-8 encoded).

ok
>
>
> The decoding of UTF-8 to a String is similar:
>
> #(16rC3 16rBC) asByteArray asString utf8ToSqueak
>
> which returns the String 'ü' and probably is what you wanted in the  
> first place

Why do I get a visual representation? How is the mapping done from the
Unicode code point to the glyph?
Should we always pass through a transformation?
How does an encoding scheme (UTF-*) associate a code point with its glyph?

> - but please try to understand and use the Unicode terms correctly  
> to minimize confusion.

I learned that over the last weeks, reading a lot of docs.

character sets ~= character encodings

>
> Anyway, to convert between a String in UTF-8 and a regular Squeak  
> String, it's simplest to use utf8ToSqueak and squeakToUtf8.

UTF-8 was just an example. I would like to know what a *ToSqueak is.
I understand that characters are code points in the Unicode system; now,
how do I get to see their visual representation?





Re: [squeak-dev] how to create an UTF-8 character

stephane ducasse
In reply to this post by Philippe Marschall
>>
> Am I the only one using the generic en/decoding functionality in
> Squeak in the form of #convertTo/FromEncoding?
>
> Convert from "Squeak" to UTF-8
> aString convertToEncoding: 'utf-8'


do I understand correctly that such a aString is a sequence of unicode  
codepoints?

>
>
> Convert from UTF-8 to "Squeak"
> aString convertFromEncoding: 'utf-8'
>
> For checking out all the encodings your image supports:
> TextConverter allEncodingNames
>
> Cheers
> Philippe
>



Re: [squeak-dev] how to create an UTF-8 character

NorbertHartl
On Sat, 2008-09-27 at 08:18 +0200, stephane ducasse wrote:

> >>
> > Am I the only one using the generic en/decoding functionality in
> > Squeak in the form of #convertTo/FromEncoding?
> >
> > Convert from "Squeak" to UTF-8
> > aString convertToEncoding: 'utf-8'
>
>
> do I understand correctly that such a aString is a sequence of unicode  
> codepoints?
> >
First of all, UTF-8 text is a sequence of bytes. These bytes are a
space-optimized encoding of code points (UTF-8). If you decode those bytes
you get your code points (Unicode). From a sequence of code points
you can derive a character. In most cases (for us Westerners) it will
be a single code point AFAIK.
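
The encoding direction can be worked through by hand for the Cyrillic Б from earlier in the thread. This is a sketch of the standard two-byte UTF-8 form (110xxxxx 10xxxxxx), which applies to code points from 16r80 to 16r7FF; it uses plain integer arithmetic rather than any converter class.

```smalltalk
| codePoint lead continuation |
codePoint := 16r0411.   "Б"
"The upper 5 bits go into the lead byte, the lower 6 bits into the continuation byte."
lead := 16rC0 + (codePoint bitShift: -6).
continuation := 16r80 + (codePoint bitAnd: 16r3F).
lead.           "208, i.e. 16rD0"
continuation.   "145, i.e. 16r91"
```

These are the same two bytes that squeakToUtf8 produced for this character earlier in the thread.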

Norbert

> >
> > Convert from UTF-8 to "Squeak"
> > aString convertFromEncoding: 'utf-8'
> >
> > For checking out all the encodings your image supports:
> > TextConverter allEncodingNames
> >
> > Cheers
> > Philippe
> >
>
>



Re: [squeak-dev] how to create an UTF-8 character

K. K. Subramaniam
In reply to this post by stephane ducasse
On Saturday 27 Sep 2008 11:45:38 am stephane ducasse wrote:
> Why do I get a visual representation? How is the mapping done from the
> Unicode code point to the glyph?
Unicode code points are processed by a shaping engine to generate a graphic.
The term 'glyph' (carving in Greek) is historical since typefaces were carved
from metal. The shaping engine is trivial in the case of the Latin-1 character
set. The first 256 code points are the same as ISO 8859-1 (Latin-1) and the
graphic can be looked up in a font table. Rendering "hello" on the screen
involves extracting the box dimensions and graphic of h, e, l, o from a font
table, laying out five boxes and then rendering appropriately into the five
boxes. Other languages have thousands of such graphics (pictals?) and the
rendering algorithms are complex enough to require a shaping engine with
pluggable rendering algorithms. Search for Dr. Yannis Haralambous's work for
details.

> Should we always pass through a transformation?
UTF-8 is recommended when passing Unicode strings across programs and machines
for the sake of backward compatibility. Within a program, the choice of
encoding depends on the string handling requirements. For instance, if a
program deals with palindromes, then an encoding for "rés" like:
   <r> <e> <acute> <s>
will break current algorithms that just reverse the string of code points.
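
The reversal problem can be reproduced in a workspace. This sketch assumes a decomposed spelling of "rés" in which the combining acute accent (U+0301) follows its base letter, and builds it with the WideString from: idiom used earlier in the thread; reversed is the ordinary collection-reversal selector.

```smalltalk
| decomposed reversedCodes |
"r, e, U+0301 COMBINING ACUTE ACCENT, s"
decomposed := WideString from: #(16r72 16r65 16r0301 16r73).
reversedCodes := decomposed reversed asArray collect: [:each | each asInteger].
reversedCodes.   "115 769 101 114: the accent now follows s instead of e"
```

A reversal that is meant to respect what readers see as characters would have to keep each base letter together with its combining marks.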

> How does an encoding scheme (UTF-*) associate a code point with its glyph?
The Unicode sequence "hello world" transformed into UTF-8 is the same as its
ASCII encoding. The process is more involved for Asian languages, so
a separate shaping engine is required. Examples are Pango, the Qt shaping
engine, Uniscribe etc.
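
The ASCII case is easy to check with the squeakToUtf8 selector from earlier in the thread. The expected results below follow from the definition of UTF-8 (code points below 128 are encoded as themselves); they are not output copied from an image.

```smalltalk
| ascii |
ascii := 'hello world'.
ascii squeakToUtf8 = ascii.                           "expected: true"
ascii squeakToUtf8 asByteArray = ascii asByteArray.   "expected: true"
```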

Regards .. Subbu


Re: [squeak-dev] how to create an UTF-8 character

Philippe Marschall
In reply to this post by stephane ducasse
2008/9/27 stephane ducasse <[hidden email]>:

>>>
>> Am I the only one using the generic en/decoding functionality in
>> Squeak in the form of #convertTo/FromEncoding?
>>
>> Convert from "Squeak" to UTF-8
>> aString convertToEncoding: 'utf-8'
>
>
> do I understand correctly that such a aString is a sequence of unicode
> codepoints?

Plus leading char. If you look at UTF8TextConverter it will give every
incoming character with an index higher than 255 the language of the
image. I don't need to explain why this is problematic in the context
of a web application, do I?

Cheers
Philippe


[squeak-dev] Re: how to create an UTF-8 character

Andreas.Raab
Philippe Marschall wrote:
> 2008/9/27 stephane ducasse <[hidden email]>:
>> do I understand correctly that such a aString is a sequence of unicode
>> codepoints?
>
> Plus leading char. If you look at UTF8TextConverter it will give every
> incoming character with an index higher than 255 the language of the
> image. I don't need to explain why this is problematic in the context
> of a web application, do I?

Actually, it *is* worthwhile to explain this. The problem is that since
UTF-8 doesn't have the notion of a leading char there is no way to tag
incoming data correctly. The leading char will be taken from the running
image, so an image running in the US (like our servers) will tag input
coming from Chinese browsers as Latin1. In these situations the leading
char isn't just useless, it is actively misleading.
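
To illustrate why a wrong tag matters, here is an arithmetic-only sketch. It assumes a layout in which the leading char occupies the bits above the low 22 bits of a character's value; that layout and the tag values 0 and 5 are assumptions made for illustration, so check the Character class in your own image before relying on them.

```smalltalk
| codePoint tagA tagB valueA valueB |
codePoint := 16r4E00.   "a CJK ideograph"
tagA := 0.              "hypothetical tag value"
tagB := 5.              "hypothetical tag value"
valueA := (tagA bitShift: 22) + codePoint.
valueB := (tagB bitShift: 22) + codePoint.
valueA = valueB.                          "false: same code point, different character values"
(valueB bitAnd: 16r3FFFFF) = codePoint.   "true: the low 22 bits still hold the code point"
valueB bitShift: -22.                     "5: the tag recovered from the upper bits"
```

Under that assumption, the same code point decoded in two differently configured images yields characters that do not compare equal.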

Cheers,
   - Andreas


Re: [squeak-dev] Re: how to create an UTF-8 character

Yoshiki Ohshima-2
At Sat, 27 Sep 2008 10:14:39 -0700,
Andreas Raab wrote:

>
> Philippe Marschall wrote:
> > 2008/9/27 stephane ducasse <[hidden email]>:
> >> do I understand correctly that such a aString is a sequence of unicode
> >> codepoints?
> >
> > Plus leading char. If you look at UTF8TextConverter it will give every
> > incoming character with an index higher than 255 the language of the
> > image. I don't need to explain why this is problematic in the context
> > of a web application, do I?
>
> Actually, it *is* worthwhile to explain this. The problem is that since
> UTF-8 doesn't have the notion of a leading char there is no way to tag
> incoming data correctly. The leading char will be taken from the running
> image, so an image running in the US (like our servers) will tag input
> coming from Chinese browsers as Latin1. In these situations the leading
> char isn't just useless, it is actively misleading.

  For that kind of web application and for servers that deal with stuff
outside of Squeak, it doesn't serve a good purpose, because editing,
displaying etc. are out of scope.  Needless to say, the original idea
was to make Squeak a dynamic, interactive, multilingualized
environment, so there is a mismatch.  Web applications etc. historically
came after that goal.

  If you need to retain this extra information, sending the strings
without going through UTF-8 conversion makes more sense.

-- Yoshiki


Re: [squeak-dev] how to create an UTF-8 character

Damien Pollet
In reply to this post by Philippe Marschall
On Sat, Sep 27, 2008 at 7:05 PM, Philippe Marschall
<[hidden email]> wrote:
> Plus leading char.

You mean the BOM (byte order mark) or something else ?


--
Damien Pollet
type less, do more [ | ] http://people.untyped.org/damien.pollet
