[squeak-dev] UTF-8

peportier

[squeak-dev] UTF-8

Hi !
I am looking for up-to-date links (or tips) on how to work with UTF-8 (instead of the default Latin-1) inside Squeak.
I thank you in advance for your help.
Pierre-Edouard

Philippe Marschall

Re: [squeak-dev] UTF-8

2009/3/27 Pierre-Edouard PORTIER <[hidden email]>:
> Hi !
> I am in search of up to date links (or tips) on how to work with UTF-8
> (instead of the default latin1) inside Squeak.
> I thank you in advance for your help.
> Pierre-Edouard

aString convertToEncoding: 'utf-8'
aString convertFromEncoding: 'utf-8'

Cheers
Philippe

Damien Cassou-3

Re: [squeak-dev] UTF-8

In reply to this post by peportier

On Fri, Mar 27, 2009 at 8:56 PM, Pierre-Edouard PORTIER
<[hidden email]> wrote:
> I am in search of up to date links (or tips) on how to work with UTF-8
> (instead of the default latin1) inside Squeak.
> I thank you in advance for your help.

http://article.gmane.org/gmane.comp.lang.smalltalk.pharo.devel/5065/match=looking+unicode+testers

--
Damien Cassou
http://damiencassou.seasidehosting.st

Pierre-Edouard PORTIER

Re: [squeak-dev] UTF-8

In reply to this post by Philippe Marschall

Thank you Philippe,

I was aware of:
aString squeakToUtf8
aString utf8ToSqueak

But I would like to be able to *see* UTF-8 characters inside the Squeak environment.

Cheers
Pierre-Edouard

Pierre-Edouard PORTIER

Re: [squeak-dev] UTF-8

In reply to this post by Damien Cassou-3

Thank you Damien,
I will be a tester of this fork.
Pierre-Edouard

Philippe Marschall

Re: [squeak-dev] UTF-8

In reply to this post by Pierre-Edouard PORTIER

2009/3/28 Pierre-Edouard PORTIER <[hidden email]>:
> Thank you Philippe,
>
> I was aware of :
> aString squeakToUtf8
> aString utf8ToSqueak
>
> But I would like to be able to *see* utf-8 characters inside the squeak
> environment.

What do you mean by that? What do you understand by a UTF-8 character?

Cheers
Philippe

Michael Rueger-6

Re: [squeak-dev] UTF-8

In reply to this post by Pierre-Edouard PORTIER

On Sat, Mar 28, 2009 at 11:55 AM, Pierre-Edouard PORTIER
<[hidden email]> wrote:

> But I would like to be able to *see* utf-8 characters inside the squeak
> environment.

Are you sure you are not confusing "UTF-8" with "Unicode"? UTF-8 is
just one way of encoding Unicode characters.
You can import UTF-8 encoded characters/strings, but once inside
Squeak they are kept as Unicode characters.

Michael
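
To make the distinction concrete, here is a minimal workspace sketch. It assumes an image in which `convertToEncoding:` (mentioned earlier in this thread) is available and in which `Unicode value:` builds a character from a code point; the variable names are illustrative only.

```smalltalk
"Workspace sketch: one Unicode character versus its UTF-8 bytes."
| lambda inImage utf8 |
lambda := Unicode value: 955.                  "U+03BB GREEK SMALL LETTER LAMDA (assumed constructor)"
inImage := WideString with: lambda.            "what the image holds: one character"
utf8 := inImage convertToEncoding: 'utf-8'.    "external form: one element per byte"

lambda charCode.    "955, the code point without any language tag"
inImage size.       "1 character"
utf8 size.          "2, since U+03BB is encoded as the bytes CE BB"
```

The string in the image is counted in characters, while the converted string is counted in bytes; only the latter is "UTF-8".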

Philippe Marschall

Re: [squeak-dev] UTF-8

2009/3/29 Michael Rueger <[hidden email]>:

> On Sat, Mar 28, 2009 at 11:55 AM, Pierre-Edouard PORTIER
> <[hidden email]> wrote:
>
>
>> But I would like to be able to *see* utf-8 characters inside the squeak
>> environment.
>
> Are you sure you are not confusing "utf-8" with "unicode"? utf-8 is
> just one way of encoding unicode (characters).
> You can import utf-8 encoded characters/strings, but once inside
> Squeak they are kept as unicode characters.

Plus leadingChar, which causes a lot of problems for web applications.

Cheers
Philippe

Janko Mivšek

Re: [squeak-dev] UTF-8

Philippe Marschall wrote:
> Michael Rueger:
>> Pierre-Edouard PORTIER wrote:

>>> But I would like to be able to *see* utf-8 characters inside the squeak
>>> environment.

>> Are you sure you are not confusing "utf-8" with "unicode"? utf-8 is
>> just one way of encoding unicode (characters).
>> You can import utf-8 encoded characters/strings, but once inside
>> Squeak they are kept as unicode characters.

> Plus leadingChar, which causes a lot of problems for web applications.

We don't have any problems with Squeak Unicode in Aida/Web apps,
probably because we strictly use Unicode internally, not UTF-8
encoded strings. All such strings are encoded to and decoded from UTF-8
"at the edge" of the image by the Aida web framework.

Best regards
Janko

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si
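
The "convert at the edge" approach described above can be sketched roughly as follows. This is not Aida code: `decodeIncoming` and `encodeOutgoing` are hypothetical helper blocks, and the only library calls assumed are the `convertFromEncoding:` and `convertToEncoding:` selectors quoted earlier in this thread.

```smalltalk
"Workspace sketch of converting only at the image boundary.
 decodeIncoming and encodeOutgoing are hypothetical helpers, not Aida API."
| decodeIncoming encodeOutgoing wireIn text wireOut |
decodeIncoming := [:utf8String | utf8String convertFromEncoding: 'utf-8'].
encodeOutgoing := [:unicodeString | unicodeString convertToEncoding: 'utf-8'].

"Stand-in for request data: the UTF-8 bytes D0 96 (U+0416, CYRILLIC CAPITAL LETTER ZHE)."
wireIn := String with: (Character value: 16rD0) with: (Character value: 16r96).

text := decodeIncoming value: wireIn.     "inside the image: a single character"
wireOut := encodeOutgoing value: text.    "back to bytes just before sending"

text size.          "1"
wireOut = wireIn.   "true if the round trip is lossless"
```

Everything between the two blocks works on characters, so string operations such as `size` count characters rather than bytes.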

Philippe Marschall

Re: [squeak-dev] UTF-8

2009/3/29 Janko Mivšek <[hidden email]>:

> Philippe Marschall wrote:
>> Michael Rueger:
>>> Pierre-Edouard PORTIER wrote:
>
>>>> But I would like to be able to *see* utf-8 characters inside the squeak
>>>> environment.
>
>>> Are you sure you are not confusing "utf-8" with "unicode"? utf-8 is
>>> just one way of encoding unicode (characters).
>>> You can import utf-8 encoded characters/strings, but once inside
>>> Squeak they are kept as unicode characters.
>
>> Plus leadingChar, which causes a lot of problems for web applications.
>
> We don't have any problems with Squeak Unicode in Aida/Web apps,
> probably because we strictly use Unicode internally, not the UTF-8
> encoded strings. All such strings are then encoded/decoded to the UTF-8
> "at the edge" of image by Aida web framework.

What leadingChar do you use? The one of the image?

Cheers
Philippe

Philippe Marschall

Re: [squeak-dev] UTF-8

In reply to this post by Janko Mivšek

2009/3/29 Janko Mivšek <[hidden email]>:

You cannot do that. Squeak stores the language of a character in
every character. In a web application you don't know the language of
the input, and UTF-8 certainly doesn't contain it. You could take the
language of the image, but that is random and has no relation to the
input. You could also set the language of a character to Unicode (255),
but that only works for non-Latin-1 characters; Latin-1 characters are
interned and all have leadingChar 0. Did I already mention that the
leadingChar is used for #=? So no, I don't believe you.

Cheers
Philippe
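
The #= concern can be illustrated with a small workspace sketch. It assumes the `Character class>>leadingChar:code:`, `leadingChar` and `charCode` methods as found in Squeak images of that period; treat the exact selectors as assumptions if your image differs.

```smalltalk
"Workspace sketch: same code point, different leadingChar."
| plain tagged |
plain := Character value: 955.                     "code 955 with leadingChar 0"
tagged := Character leadingChar: 255 code: 955.    "code 955 tagged as Unicode (255)"

plain charCode = tagged charCode.   "true: both carry code point 955"
plain leadingChar.                  "0"
tagged leadingChar.                 "255"
plain = tagged.                     "false when #= compares the full value, as described above"
```

Two strings that render identically can therefore compare as different if their characters were created through paths that assign different leadingChar values.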

Nicolas Cellier

Re: [squeak-dev] UTF-8

> You can not do that. Squeak stores the language of a character in
> every character. In a web application you don't know the language of
> the input and utf-8 certainly doesn't contain it. You could take the
> language of the image but that is random and has no relation to the
> input. You could also set the language of a character to unicode (255)
> but that only works for non-Latin-1 characters, these are interned and
> all have leadingChar 0. Did I already mention that the leadingChar is
> used for #=? So no, I don't believe you.
>
> Cheers
> Philippe
>

It seems most reasonable to me to switch the Unicode leadingChar to 0.
Why couldn't we just do that?

Of course, all this does not really answer Pierre-Edouard's questions...
Pierre, what do you want Unicode for?
- displaying any arbitrary character inside Squeak
- inputting any character with the keyboard in Squeak
- exchanging files made of arbitrary characters with external world
(utf-8, utf-16 or other formats)
- reading and writing filenames containing arbitrary characters
- anything else?

Nicolas

Pierre-Edouard PORTIER

Re: [squeak-dev] UTF-8

Hi Nicolas !

Thank you for this nice synthesis. I want to:
- display any arbitrary character inside Squeak (for example Greek characters)
- input any character with keyboard inside Squeak
- exchange utf-8 encoded data with external world

Pierre-Edouard

Janko Mivšek

Re: [squeak-dev] UTF-8

In reply to this post by Philippe Marschall

Philippe Marschall wrote:
> Janko Mivšek:

>> We don't have any problems with Squeak Unicode in Aida/Web apps,
>> probably because we strictly use Unicode internally,

> You can not do that. Squeak stores the language of a character in
> every character. In a web application you don't know the language of
> the input and utf-8 certainly doesn't contain it. You could take the
> language of the image but that is random and has no relation to the
> input. You could also set the language of a character to unicode (255)
> but that only works for non-Latin-1 characters, these are interned and
> all have leadingChar 0. Did I already mention that the leadingChar is
> used for #=? So no, I don't believe you.

Well, you should believe me, I have proof!

Look at this Aida/Scribo multilingual demo served from a Squeak image:
http://demo.bioskop.fr/wiki/wiki.html, see especially the Japanese and
Russian text. Even Japanese URLs work correctly:
http://demo.bioskop.fr/wiki/%E3%83%86%E3%82%B9%E3%83%88.html

As for the leading character, I don't even know what it is, except in
theory. That is, I never encountered it as a problem when
porting Aida and its i18n support to Squeak.

Best regards
Janko

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si

Philippe Marschall

Re: [squeak-dev] UTF-8

2009/3/29 Janko Mivšek <[hidden email]>:

> Philippe Marschall wrote:
>> Janko Mivšek:
>
>>> We don't have any problems with Squeak Unicode in Aida/Web apps,
>>> probably because we strictly use Unicode internally,
>
>> You can not do that. Squeak stores the language of a character in
>> every character. In a web application you don't know the language of
>> the input and utf-8 certainly doesn't contain it. You could take the
>> language of the image but that is random and has no relation to the
>> input. You could also set the language of a character to unicode (255)
>> but that only works for non-Latin-1 characters, these are interned and
>> all have leadingChar 0. Did I already mention that the leadingChar is
>> used for #=? So no, I don't believe you.
>
> Well, you should believe me, I have a proof!
>
> Look at this Aida/Scribo multilingual demo served from Squeak image:
> http://demo.bioskop.fr/wiki/wiki.html, see specially Japanese and
> Russian text. Even Japanese urls are working correctly:
> http://demo.bioskop.fr/wiki/%E3%83%86%E3%82%B9%E3%83%88.html

That's just the external representation; it tells you absolutely nothing
about the internal representation and the implementation. I could easily
get the same result on Squeak 3.7.

> About leading character, I even don't know what is that, except in
> theory. That is, I never encounter this character as a problem when
> porting Aida and its i8n support to Squeak.

How can you seriously say everything is working fine when in practice
you can't say what is happening and don't know how Strings and
Characters work in Squeak? I find that quite dubious hyping.

Cheers
Philippe

Janko Mivšek

Re: [squeak-dev] UTF-8

Philippe Marschall wrote:

>> Look at this Aida/Scribo multilingual demo served from Squeak image:
>> http://demo.bioskop.fr/wiki/wiki.html, see specially Japanese and
>> Russian text. Even Japanese urls are working correctly:
>> http://demo.bioskop.fr/wiki/%E3%83%86%E3%82%B9%E3%83%88.html
>
> That's just external representation, that tells absolutely nothing
> about internal representation and the implementation. I could easily
> the the same result on a Squeak 3.7.

For this you need WideStrings and a proper UTF-8 converter. Does Squeak
3.7 have that?

>> About leading character, I even don't know what is that, except in
>> theory. That is, I never encounter this character as a problem when
>> porting Aida and its i8n support to Squeak.
>
> How can you seriously say everything is working fine when in practice
> you can't say what is happening and don't know how Strings and
> Characters work in Squeak? I find that quite dubious hyping.

Not hype at all, but pure reality. And coming from a country where we
already need Unicode characters above 256, you can be sure that I know
what I'm talking about. If there were a problem, I would be the
first to encounter it. But there are no problems with Unicode strings
prepared by Aida, so why should I bother? This is like premature
optimization to me.

Note also that Masashi Umezawa, a Japanese guy, made a preview and a few
modifications to Aida so that it works well with Japanese writing, in all
aspects from URLs to the content. Because of his work I'm even more
sure that we did the Unicode support right!

Janko

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si

Enrico Schwass-2

Re: [squeak-dev] UTF-8

Janko Mivšek <[hidden email]> writes:

Hello Janko

I guess Philippe is talking about in-image Japanese/Arabic/whatever. This
probably needs changes to the VM. Here on Mac OS X it doesn't work. I
just get empty block glyphs. It's not possible to copy non-Latin
characters into the workspace. Linux VMs might handle this better.

ciao
Enno

Philippe Marschall

Re: [squeak-dev] UTF-8

In reply to this post by Janko Mivšek

2009/3/29 Janko Mivšek <[hidden email]>:

> Philippe Marschall wrote:
>
>>> Look at this Aida/Scribo multilingual demo served from Squeak image:
>>> http://demo.bioskop.fr/wiki/wiki.html, see specially Japanese and
>>> Russian text. Even Japanese urls are working correctly:
>>> http://demo.bioskop.fr/wiki/%E3%83%86%E3%82%B9%E3%83%88.html
>>
>> That's just external representation, that tells absolutely nothing
>> about internal representation and the implementation. I could easily
>> the the same result on a Squeak 3.7.
>
> For this you need WideStrings and proper UTF-8 converter.

No, you don't. You just need to emit the right bytes. The simplest way
to achieve this is to return 1:1 what was inserted. This works well as
long as you don't need any String semantics. This is, for example, what
DabbleDB does.
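
A rough workspace sketch of this pass-through approach follows. The variable names are illustrative, and the three bytes are taken from the start of the percent-encoded demo URL quoted above.

```smalltalk
"Workspace sketch: storing UTF-8 bytes untouched and echoing them back."
| received stored |
"Stand-in for form input: the UTF-8 bytes E3 83 86 (U+30C6, KATAKANA LETTER TE)."
received := String
	with: (Character value: 16rE3)
	with: (Character value: 16r83)
	with: (Character value: 16r86).

stored := received copy.     "kept as-is, never decoded"
stored = received.           "true: echoing it back emits the same bytes"
stored size.                 "3, although the user typed one character"
stored first asInteger.      "227, a byte value rather than a code point"
```

The output is correct because nothing interprets the bytes; anything that relies on String semantics (size, indexing, comparison with decoded text) sees bytes instead of characters.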

> Does Squeak 3.7 has that?
>
>>> About leading character, I even don't know what is that, except in
>>> theory. That is, I never encounter this character as a problem when
>>> porting Aida and its i8n support to Squeak.
>>
>> How can you seriously say everything is working fine when in practice
>> you can't say what is happening and don't know how Strings and
>> Characters work in Squeak? I find that quite dubious hyping.
>
> Not hype at all but pure reality. And coming from country where we
> already need Unicode characters above 256, you can be sure that I know
> what I'm talking about.

Then tell us what leadingChar you use. And tell us how you address the
issue that #= takes the leadingChar into account.

> If there would be some problem, I would be the
> first encountering it.

No, as I said as long as you're just outputting the input you won't.

> But there are no problems with Unicode strings
> prepared by Aida, so why should I bother? This is like a premature
> optimization for me.

What, getting semantics of #= right is premature optimization? Having
a working String protocol is premature optimization?

> Note also that Masashi Umezawa, a Japanese guy, made a preview and few
> modifications to Aida to work well with Japanese writing, in all aspects
> from Urls to the content. Because of his work I'm therefore even more
> sure that we did the Unicode support right!

Then tell us how it works and how it addresses the leadingChar issues
outlined in this thread.

Cheers
Philippe

Bert Freudenberg

Re: [squeak-dev] UTF-8

In reply to this post by Enrico Schwass-2

On 29.03.2009, at 16:32, Enrico Schwass wrote:

> Janko Mivšek <[hidden email]> writes:
>
> Hello Janko
>
> I guess, Phillip talks about in-image japanese/arabic/whatever. This
> needs probably changes to the vm. Here on Mac OS X it doesnt work.

The VMs can provide full unicode input now, but not all images have
been adapted to make use of it. And that is completely separate from
unicode font rendering support in the image.

> I just get empty block-glyphs.

Your image needs to use the UTF-32 unicode character that recent VMs
produce along with the old byte-sized character.

Check that "ActiveHand keyboardInterpreter" is in fact a
UTF32InputInterpreter.

> Its not possible to copy non latin characters into the workspace.

Your image needs to make use of the ClipboardExtendedPlugin which does
ship in current Mac VMs.

- Bert -
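
The keyboard interpreter check can be run from a workspace roughly like this. It is only a sketch: it looks the class up by name so that it still evaluates in an image where UTF32InputInterpreter is not loaded.

```smalltalk
"Workspace sketch: inspect which keyboard interpreter the image is using."
| interpreter utf32Class |
interpreter := ActiveHand keyboardInterpreter.
utf32Class := Smalltalk at: #UTF32InputInterpreter ifAbsent: [nil].

interpreter class name.                        "name of the interpreter actually installed"
utf32Class notNil
	and: [interpreter isKindOf: utf32Class].   "true if the check above passes"
```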

Nicolas Cellier

Re: [squeak-dev] UTF-8

2009/3/29 Bert Freudenberg <[hidden email]>:

>
> On 29.03.2009, at 16:32, Enrico Schwass wrote:
>
>> Janko Mivšek <[hidden email]> writes:
>>
>> Hello Janko
>>
>> I guess, Phillip talks about in-image japanese/arabic/whatever. This
>> needs probably changes to the vm. Here on Mac OS X it doesnt work.
>
> The VMs can provide full unicode input now, but not all images have been
> adapted to make use of it. And that is completely separate from unicode font
> rendering support in the image.
>

I presume that a good Font or FontSet with Unicode support should be
in the image for correct rendering.
Any link to a good how-to?

>> I just get empty block-glyphs.
>
> Your image needs to use the UTF-32 unicode character that recent VMs produce
> along with the old byte-sized character.
>
> Check that "ActiveHand keyboardInterpreter" is in fact a
> UTF32InputInterpreter.
>

For images that do not have UTF32InputInterpreter, let me remind you that
Bert's and Yoshiki's work is pending at
http://bugs.squeak.org/view.php?id=7071 ...