Unicode everywhere

Miguel Cobá
This is a newbie question.

I really don't understand the issue with Unicode and the image/vm/os in
Pharo/SqueakVM/Linux,Windows.

I searched the lists for references to utf-8 and for the long
discussions about:
- leadingChar
- encodings
- convert*Encoding methods
- Strings
- WideStrings
- Text
- DisplayText
- WAKom and WAKomEncoded
- keyboard input
- text file input/output
- vm's -encoding -textenc options

So, could the people who know about these things please give a
newbie-level explanation?

What do we users need in order to:

1. Type text with accents on the keyboard (and maybe in the language of
Mordor too) without it showing up as a diamond with a ? sign in the
image's system browser.
2. Have this text sent to a modern web browser and rendered correctly
in utf-8 encoding.
3. Upload a Java message bundle (or any file with characters outside
ASCII) from a web browser to a Seaside application, store it
temporarily on disk on the server and process it correctly inside the
image (by opening it with some FileStream class).

Or is this really hard to do and am I asking for the impossible?

I understand that there are performance issues with a full Unicode
image/vm but, supposing that the rule against premature optimization
applies here, what do we need to do to achieve this (maybe utopian)
goal?


Why the question? Because I put some strings inside the image (they look
fine as I type them) and then used them as labels in my Seaside app.
In the image I typed, and saw in a code browser:

Búsqueda de información

But in the web browser (Firefox) I see:

B�squeda de informaci�n

If I change the web browser encoding to iso8859-1 I see it correctly.

Now if I evaluate

'Búsqueda de información' convertToEncoding: 'utf-8'

this gives:

'BÃºsqueda de informaciÃ³n'

and if I use this weird string, which can't be edited by hand, as the
label in my Seaside app, I see it correctly in the web browser:

Búsqueda de información.

So, I really don't understand. Should I always write my strings in the
image as I want them, send them convertToEncoding:, and use the output
as if that were my string?

Well, thanks for your answers.

P.S. I am using Pharo core 1.0, Seaside 2.8 and squeakvm version:

4.0.3-2202 #1 XShm Sat Apr 17 18:21:07 UTC 2010 gcc 4.4.3

on a 64-bit Debian Linux with a full utf-8 locale:

miguel@laptop:~/proyectos/azteca$ locale
LANG=es_MX.UTF-8
LC_CTYPE="es_MX.UTF-8"
LC_NUMERIC="es_MX.UTF-8"
LC_TIME="es_MX.UTF-8"
LC_COLLATE="es_MX.UTF-8"
LC_MONETARY="es_MX.UTF-8"
LC_MESSAGES="es_MX.UTF-8"
LC_PAPER="es_MX.UTF-8"
LC_NAME="es_MX.UTF-8"
LC_ADDRESS="es_MX.UTF-8"
LC_TELEPHONE="es_MX.UTF-8"
LC_MEASUREMENT="es_MX.UTF-8"
LC_IDENTIFICATION="es_MX.UTF-8"
LC_ALL=

Cheers


--
Miguel Cobá
http://miguel.leugim.com.mx


_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

Re: Unicode everywhere

Stéphane Ducasse

On Jul 19, 2010, at 7:12 AM, Miguel Enrique Cobá Martínez wrote:

> So, could the people who know about these things please give a
> newbie-level explanation?

I would love to have the situation described somewhere, so that we can refer to it later.




Re: Unicode everywhere

Philippe Marschall
On 19.07.2010 07:12, Miguel Enrique Cobá Martínez wrote:
> ...
> 2. Have this text sent to a modern web browser and rendered correctly
> in utf-8 encoding.

In Seaside 2.8
- use WAKomEncoded
- configure the application to use utf-8
- do _not_ send #convertToEncoding:

In Seaside 3.0 rc
- set the encoding on the server to utf-8
- do _not_ send #convertToEncoding:

If that doesn't work, please post on the Seaside mailing list with:
- the exact Pharo image version you have
- the exact Seaside version you have
- the exact Kom version you have
- whether the String displays correctly in an inspector
- the output of (yourString convertToEncoding: 'utf-8') asByteArray
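
Why the manual #convertToEncoding: call should go once the server does the
encoding, and why the byte output is worth posting, can be sketched outside
the image. The following is Python, not Pharo code. It assumes that the image
holds the label as Latin-1 code points and that an encoding server such as
WAKomEncoded turns every outgoing string into utf-8; all function names are
invented for this illustration.

```python
LABEL = "Búsqueda de información"


def manual_convert(text: str) -> str:
    """Rough analogue of convertToEncoding: 'utf-8' (an assumption, not the
    Pharo implementation): the utf-8 bytes kept in a byte string, so each
    byte shows up as one Latin-1 character."""
    return text.encode("utf-8").decode("latin-1")


def browser_view(wire: bytes) -> str:
    """What a browser shows when it reads the bytes on the wire as utf-8."""
    return wire.decode("utf-8", errors="replace")


def hex_bytes(text: str) -> str:
    """utf-8 bytes of text in hex, in the spirit of the asByteArray check."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


weird = manual_convert(LABEL)

# 1. Plain string, nobody encodes: Latin-1 bytes are not valid utf-8.
print(browser_view(LABEL.encode("latin-1")))
# 2. Manual conversion, server passes the bytes through: looks right.
print(browser_view(weird.encode("latin-1")))
# 3. Manual conversion AND an encoding server: encoded twice.
print(browser_view(weird.encode("utf-8")))
# 4. Plain string and an encoding server: right, and still editable.
print(browser_view(LABEL.encode("utf-8")))

# Byte check for a single accented letter, once and twice converted.
print(hex_bytes("ú"))
print(hex_bytes(manual_convert("ú")))
```

In the byte output an accented letter such as ú should appear as the two
bytes c3 ba (195 186 in decimal). Four bytes such as c3 83 c2 ba suggest that
the string had already been converted once before it was encoded.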

> 3. Upload a Java message bundle (or any file with characters outside
> ASCII) from a web browser to a Seaside application, store it
> temporarily on disk on the server and process it correctly inside the
> image (by opening it with some FileStream class).

Java message bundles:
saving:
- open a file stream in binary mode
- save the contents of the upload to the stream
reading:
- open a file stream in text mode with iso-8859-1 as the encoding. On
its own this only works for Latin-1 characters, but Java message bundles
support Unicode through escapes, so you need to process the Unicode
escapes [1] manually.
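
The manual processing of Unicode escapes could look roughly like the
following. It is a Python sketch rather than Pharo code, the function name
and the sample line are made up, and it follows the escape form described in
[1] (a backslash, one or more u characters, four hex digits). It does not
treat an escaped backslash before a u specially, which a complete reader
would have to do.

```python
import re

# A backslash, one or more 'u' characters, then exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u+([0-9A-Fa-f]{4})")


def decode_unicode_escapes(text: str) -> str:
    """Replace Java-style Unicode escapes with the characters they denote."""
    replaced = _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Each escape is one UTF-16 code unit, so a character outside the Basic
    # Multilingual Plane arrives as two surrogates. A round trip through
    # UTF-16 joins such pairs; unpaired surrogates are passed through as-is.
    return replaced.encode("utf-16", "surrogatepass").decode(
        "utf-16", "surrogatepass"
    )


# Bytes as they might sit in an uploaded bundle (made-up sample line).
raw = b"search.title=B\\u00fasqueda de informaci\\u00f3n"
line = raw.decode("iso-8859-1")       # step 1: read the file as iso-8859-1
print(decode_unicode_escapes(line))   # step 2: resolve the escapes
```

Reading as iso-8859-1 first never fails, because every byte maps to a
character in that encoding; the escapes are then resolved on the resulting
text.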

Other files with characters outside ASCII:
saving:
- open a file stream in binary mode
- save the contents of the upload to the stream
reading:
- open a file stream in text mode with the correct encoding for the file.
Knowing the right encoding is hard to impossible. Seaside doesn't know
it because the browser doesn't tell Seaside. The browser doesn't know it
because the operating system doesn't know either.
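
The save-in-binary, read-with-an-explicit-encoding pattern can be sketched as
follows. This is Python, not Pharo code, and every name in it is invented for
the illustration. The guessing function is only a heuristic added here, not
something Seaside or the browser provides: strict utf-8 decoding rejects most
files that are not utf-8, while the iso-8859-1 fallback accepts any bytes and
can therefore be silently wrong.

```python
import os
import tempfile


def save_upload(data: bytes, directory: str, name: str) -> str:
    """Write the uploaded bytes unchanged (binary mode, no conversion)."""
    path = os.path.join(directory, name)
    with open(path, "wb") as stream:
        stream.write(data)
    return path


def read_text(path: str, encoding: str) -> str:
    """Read the file in text mode with an explicitly chosen encoding."""
    with open(path, "r", encoding=encoding, newline="") as stream:
        return stream.read()


def read_text_guessing(path: str) -> str:
    """Heuristic only: try strict utf-8 first, then fall back to iso-8859-1."""
    try:
        return read_text(path, "utf-8")
    except UnicodeDecodeError:
        return read_text(path, "iso-8859-1")


with tempfile.TemporaryDirectory() as directory:
    # Pretend these bytes came from two different uploads.
    as_utf8 = save_upload("información".encode("utf-8"), directory, "a.txt")
    as_latin1 = save_upload("información".encode("iso-8859-1"), directory, "b.txt")
    print(read_text(as_utf8, "utf-8"))
    print(read_text_guessing(as_latin1))
```

If the application can ask the uploader for the encoding, or can fix one by
convention, passing it to the reading step directly avoids the guess.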

 [1] http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.3

Cheers
Philippe

