3.9 and encoding

NorbertHartl

3.9 and encoding

Hi,

I ran into an encoding problem. I'm using Seaside together
with Glorp. For the web server I use WAKomEncoded39.
WAKomEncoded39 converts the output to the browser to utf-8.
But on incoming requests the url-escaped characters are
translated to something different. To me it appears to
be latin-1, but I've no clue why it should be that way.
I detected it because my postgresql session has client
encoding utf-8 turned on and I get an error trying to
store strings containing characters like ö.

I read on the net that this has something to do with 3.9.
Is this still true? Is there a way to make it run or is
the only way to go back to 3.8?

thanks in advance,

Norbert

_______________________________________________
Seaside mailing list
[hidden email]
http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/seaside

Philippe Marschall

Re: 3.9 and encoding

2007/2/28, Norbert Hartl <[hidden email]>:

> Hi,
>
> I ran into an encoding problem. I'm using Seaside together
> with Glorp. For the web server I use WAKomEncoded39.
> WAKomEncoded39 converts the output to the browser to utf-8.
> But on incoming requests the url-escaped characters are
> translated to something different. To me it appears to
> be latin-1, but I've no clue why it should be that way.
> I detected it because my postgresql session has client
> encoding utf-8 turned on and I get an error trying to
> store strings containing characters like ö.

If you run WAKomEncoded39 on Squeak 3.9 you will have strings with
(new) Squeak encoding in your image, which is basically non-unified
unicode. For latin-1 characters this will be indistinguishable from
latin-1. If your database is utf-8 you need to encode your strings to
utf-8 when writing them to your database and decode your strings from
utf-8 when reading from the database (only to convert them back to utf-8
when generating html). You can configure the PostgreSQL database driver
to do this automatically for you.

Another option is to have utf-8 strings in your image. On Squeak 3.9
this requires WAKom and a modified version of KomHttpServer that is not
publicly available. This has the advantage that you don't need to do
any encoding conversion. It has, however, the disadvantage that it won't
work with the debugger, #size doesn't work, and directly indexing into
the string (creating substrings) won't work either. Additionally you need
to convert your string literals to utf-8 (unless they're ascii).
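
The #size and indexing drawbacks can be illustrated outside Squeak. The following is a minimal Python sketch, not Squeak code: it models "utf-8 strings in the image" as utf-8 bytes stored one byte per character, and the sample name "Jörg" is only an assumption for illustration.

```python
# Illustration only (Python, not Squeak): utf-8 bytes held one byte per character.
text = "Jörg"                                       # 4 characters
in_image = text.encode("utf-8").decode("latin-1")   # utf-8 bytes viewed one byte per char

print(len(text))        # 4
print(len(in_image))    # 5: 'ö' occupies two bytes, so the size is off by one

# Indexing can cut a multi-byte sequence in half.
prefix = in_image[:2].encode("latin-1")             # b'J\xc3', an incomplete utf-8 sequence
try:
    prefix.decode("utf-8")
    prefix_is_valid = True
except UnicodeDecodeError:
    prefix_is_valid = False
print(prefix_is_valid)  # False
```

For pure ascii content none of this shows up, which matches the remark that only non-ascii literals need converting.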

Cheers
Philippe


NorbertHartl

Re: 3.9 and encoding

On Wed, 2007-02-28 at 00:26 +0100, Philippe Marschall wrote:

Oh, this seems quite easy. But I didn't find anything to configure
in the Postgres driver. Do you have any hint?

Norbert


Avi Bryant-2

Re: 3.9 and encoding

In reply to this post by Philippe Marschall

On 2/27/07, Philippe Marschall <[hidden email]> wrote:

> Another option is to have utf-8 strings in your image. On Squeak 3.9
> this requires WAKom and a modified version of KomHttpServer not
> publicly available.

What changes were needed? Can you post them?

> This has the advantage that you don't need to do
> encoding conversion; it has however the disadvantage that it won't work
> with the debugger, #size doesn't work and directly indexing into the
> string (creating substrings) won't work either. Additionally you need to
> convert your string literals to utf-8 (unless they're ascii).

... exactly as in Squeak 3.7, right?

Avi

tblanchard

Re: 3.9 and encoding

In reply to this post by Philippe Marschall

I took a quick look at the request processing and I don't see where utf-8 stuff gets decoded. AFAICS, it just doesn't do it - thus producing a one byte to a character transformation, but maybe I'm missing something.

I have done a LOT of this stuff (formerly chief architect at a web I18N company). There are a few things that are not so intuitive when dealing with encodings and http requests.

Escape sequences escape bytes, not characters.

On pass 1, you assume you have latin-1, parse the header and get the content-type and associated charset. Remember this for later translation.

Build a byte array from the string by putting ascii characters in as bytes. Decode escape sequences into single bytes as you go.

Convert the byte array to a string by reading bytes and composing them into code points according to the encoding specified as the charset in the content-type. For utf-8 this means reading a byte, checking the high order bits to find out the length of the byte sequence, then reading the rest of the sequence, composing the code point, etc...

Now you have text - start over and parse as normal.

Some of these steps can be folded but conceptually, this is how it works.
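
The steps above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Kom or Seaside code: the function name `decode_escaped` is hypothetical, the input is assumed to be well-formed (ascii plus valid `%XX` escapes), and `+`-for-space handling of form data is left out.

```python
def decode_escaped(raw: str, charset: str) -> str:
    """Percent-decode `raw` into bytes, then decode those bytes with `charset`."""
    buf = bytearray()
    i = 0
    while i < len(raw):
        if raw[i] == "%":
            # An escape sequence stands for exactly one byte, not one character.
            buf.append(int(raw[i + 1:i + 3], 16))
            i += 3
        else:
            # Unescaped characters are assumed to be ascii: one byte each.
            buf.append(ord(raw[i]))
            i += 1
    # Only now are the bytes composed into code points, using the charset.
    return buf.decode(charset)

print(decode_escaped("J%C3%B6rg", "utf-8"))    # Jörg
print(decode_escaped("J%C3%B6rg", "latin-1"))  # JÃ¶rg
```

The second call shows the symptom described at the start of the thread: the same escaped bytes read one byte per character look like latin-1 text.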

So I don't think WAKomEncoded39 is doing the right thing with respect to request processing, AFAICS.

-Todd Blanchard

On Feb 27, 2007, at 3:26 PM, Philippe Marschall wrote:

> If you run WAKomEncoded39 on Squeak 3.9 you will have strings with
> (new) Squeak encoding in your image which is basically non-unified
> unicode. For latin-1 characters this will be indistinguishable from
> latin-1.


Philippe Marschall

Re: 3.9 and encoding

In reply to this post by Avi Bryant-2

2007/2/28, Avi Bryant <[hidden email]>:
> On 2/27/07, Philippe Marschall <[hidden email]> wrote:
>
> > Another option is to have utf-8 strings in your image. On Squeak 3.9
> > this requires WAKom and a modified version of KomHttpServer not
> > publicly available.
>
> What changes were needed? Can you post them?

The basic problem is that #unescapePercents changed semantics from
Squeak 3.8 to 3.9. To work around that you need to change the sends
from
#unescapePercents
to
#unescapePercentsWithTextEncoding: nil
in HttpRequest >> #initStatusString: and HttpRequest class >>
#decodeUrlEncodedForm:multipleValues:
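
As a rough analogy only (Python's standard library, not the Squeak methods named above), the two behaviours differ like this: unescaping with utf-8 decoding yields characters, while unescaping without interpreting the bytes leaves the raw utf-8 sequence in the string. Mapping "no text encoding" to latin-1 is an assumption made for the illustration.

```python
from urllib.parse import unquote

escaped = "%C3%B6"   # the utf-8 bytes of 'ö', percent-escaped

# Unescape and decode as utf-8: one character comes out.
decoded = unquote(escaped, encoding="utf-8")

# Unescape without interpreting the bytes (latin-1 maps each byte to one
# character), which keeps the raw utf-8 byte sequence in the string.
raw = unquote(escaped, encoding="latin-1")

print(len(decoded), len(raw))   # 1 2
```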

> > This has the advantage that you don't need to do
> > encoding conversion; it has however the disadvantage that it won't work
> > with the debugger, #size doesn't work and directly indexing into the
> > string (creating substrings) won't work either. Additionally you need to
> > convert your string literals to utf-8 (unless they're ascii).
>
> ... exactly as in Squeak 3.7, right?

Exactly


Philippe Marschall

Re: 3.9 and encoding

In reply to this post by tblanchard

2007/2/28, Todd Blanchard <[hidden email]>:
> I took a quick look at the request processing and I don't see where utf-8
> stuff gets decoded. AFAICS, it just doesn't do it - thus producing a one
> byte to a character transformation, but maybe I'm missing something.

#unescapePercents does utf-8 decoding.

> I have done a LOT of this stuff (formerly chief architect at a web I18N
> company). There are a few things that are not so intuitive when dealing
> with encodings and http requests.
>
> Escape sequences escape bytes, not characters.
>
> On pass 1, you assume you have latin-1, parse the header and get the
> content-type and associated charset. Remember this for later translation.

We don't do that. We assume either you are running utf-8 or you don't
want any translation taking place.


Philippe Marschall

Re: 3.9 and encoding

In reply to this post by NorbertHartl

2007/2/28, Norbert Hartl <[hidden email]>:

> Oh, this seems quite easy. But I didn't find anything to configure
> in the Postgres driver. Do you have any hint?

PGConnection class >> #buildDefaultFieldConverters
TestPGConnection >> #testFieldConverter

You need to register a field converter for your string types that does
#convertFromEncoding: #utf8

Sorry, that only does the decoding and not the encoding. I guess in
your case Glorp does the encoding. I don't know how you can customize
the SQL generation there, but if everything else fails you can change
PGConnection >> #execute (yes, this is a hack)

sql := sqlString.
to
sql := sqlString convertToEncoding: #utf8.
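
Why the conversion before sending matters can be shown with a small Python sketch. Nothing here is the PostgreSQL driver: `server_receive` is a hypothetical stand-in for a server whose client encoding is utf-8, and the table and column names are made up for the example.

```python
def server_receive(payload: bytes) -> str:
    """Hypothetical stand-in for a server that expects utf-8 from the client."""
    return payload.decode("utf-8")   # raises UnicodeDecodeError on invalid utf-8

sql = "INSERT INTO person (name) VALUES ('Jörg')"

# Sending the string one byte per character (latin-1) is rejected...
try:
    server_receive(sql.encode("latin-1"))
    accepted_unconverted = True
except UnicodeDecodeError:
    accepted_unconverted = False

# ...while converting to utf-8 first is accepted and round-trips.
stored = server_receive(sql.encode("utf-8"))

print(accepted_unconverted)   # False
print(stored == sql)          # True
```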

Philippe

P.S.:
PGConnection class >> #buildDefaultFieldConverters
has given us a lot of pain because Squeak doesn't have full block closures


NorbertHartl

Re: 3.9 and encoding

On Wed, 2007-02-28 at 10:03 +0100, Philippe Marschall wrote:

> PGConnection class >> #buildDefaultFieldConverters
> TestPGConnection >> #testFieldConverter
>
> You need to register a field converter for your string types that does
> #convertFromEncoding: #utf8
>

This way it is already working. I think that as long as no one touches
the string, it comes as utf-8 from the database and gets encoded a
second time by WAKomEncoded39, which has no effect.

> Sorry, that only does the decoding and not the encoding. I guess in
> your case Glorp does the encoding. I don't know how you can customize
> the SQL generation there, but if everything else fails you can change
> PGConnection >> #execute (yes, this is a hack)
>
I don't think Glorp does encoding and I think it shouldn't.
Glorp should be happy with strings. If there is conversion happening
it should happen in the postgres driver (it is the only one who
could know which encoding is needed for the database).

My strings are carried by ByteString. It seems that a ByteString (obtained
from WAKomEncoded39) contains a bunch of bytes in an arbitrary encoding
(OK, it is the non-unified unicode, you said, and I don't know what
that means :) ).
I can convert it with convertToEncoding: to another encoding, still
using ByteString. But there is no information about the encoding in the
object. I think this is really dangerous. I have to look at WideString.
I'm curious how those deal with the encodings they are created from.

I think there are only two possibilities. Handle it like Java or Lisp
and convert every encoding to the internal one (UCS-2) on string creation.
The other option, which would be easier (I think), is to add the
character encoding information to the string.

What do you think?

> sql := sqlString.
> to
> sql := sqlString convertToEncoding: #utf8.
>
The hack is actually adding the conversion to
SqueakDatabaseAccessor>>basicExecuteSQLString:

I understand a lot more now. Thanks very much.

Norbert
> P.S.:
> PGConnection class >> #buildDefaultFieldConverters
> has given us a lot of pain because Squeak doesn't have full block closures
>
Oh, wow, another day hearing a lot of basic things I don't have any idea
about :) What are "full" block closures?


radoslav hodnicak

Re: 3.9 and encoding

In reply to this post by NorbertHartl

On Wed, 28 Feb 2007, Norbert Hartl wrote:

> Oh, this seems quite easy. But I didn't find anything to configure
> in the Postgres driver. Do you have any hint?

Postgres supports communication in various encodings. You can tell it
which encoding you are using with the SQL command

SET CLIENT_ENCODING TO 'encoding here';

Look in the Postgres docs for all supported encodings.

rado

NorbertHartl

Re: 3.9 and encoding

On Wed, 2007-02-28 at 12:25 +0100, radoslav hodnicak wrote:

> Postgres supports communication in various encodings. You can tell it
> which encoding you are using with the SQL command
>
> SET CLIENT_ENCODING TO 'encoding here';
>
> Look in the Postgres docs for all supported encodings.
>

Yes, but I doubt it supports the "non-unified unicode" encoding
that the Squeak strings are made of :) By the way, I would really like
to have utf-8 encoding. It is a good "common" way to do these things.
So I need a way to convert to it.

I think the postgres driver should be capable of the following:

- If the driver is requested to use a specific encoding, it
  should try to negotiate that with the database.
- If the driver is not requested to use a specific encoding, it
  should be capable of converting data to the encoding the
  database is using.
- Or alternatively, if no encoding is requested, the driver may use
  a "default" encoding which is known to be supported on the database
  side as well as on the Squeak side, so that conversion can take
  place.

But I don't know which effects the client encoding has after all. To me it
appears to be only the communication encoding between client and database.
I assume I can have client encoding latin-1 and send a latin-1 string,
and if I send the same string as utf-8 with client encoding utf-8 it will
be the same string on the database side. Is this correct?
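
The idea behind the question can be sketched in a few lines of Python. This is not the PostgreSQL wire protocol; it only models the assumption that the declared client encoding tells the receiver how to interpret the bytes it gets.

```python
text = "Jörg"

as_latin1 = text.encode("latin-1")   # what a latin-1 client would send
as_utf8 = text.encode("utf-8")       # what a utf-8 client would send

print(as_latin1 != as_utf8)          # True: different bytes on the wire
print(as_latin1.decode("latin-1") == as_utf8.decode("utf-8"))  # True: same text
```

The bytes differ, but interpreted with the matching encoding both describe the same text, so the mismatch only hurts when the bytes sent and the declared encoding disagree.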

Norbert


Martial Boniou

Re: 3.9 and encoding

In reply to this post by Philippe Marschall

Hi,

I thought I had 'corrupted' 3.9 images, but it seems to be a general issue
for 3.9. I had made the hack as Philippe said. Actually I changed the
Dialect class>>basicIsSqueak test so that it passes during
installation (to get the SmalltalkImage things and not Smalltalk); I
added a Dialect class>>basicIsSqueak39 (true for SystemVersion number >
7010); and I subclassed SqueakDatabaseAccessor as Squeak39DatabaseAccessor to
modify the instance method #basicExecuteSQLString: to say:

result := connection execute: (aString asWideString convertToEncoding:
'utf-8').

It works well. I did it to test Ramon Leon's Seaside Blog but I didn't
post this because I wasn't sure it was a common problem and because of
the ugliness of the string conversion.

I attach my two mods.

--
Martial


Attachments: Dialect.st (18K), Squeak39DatabaseAccessor.st (799 bytes)

Philippe Marschall

Re: 3.9 and encoding

2007/2/28, Martial Boniou <[hidden email]>:

> Hi,
>
> I thought I had 'corrupted' 3.9 images, but it seems to be a general issue
> for 3.9. I had made the hack as Philippe said. Actually I changed the
> Dialect class>>basicIsSqueak test so that it passes during
> installation (to get the SmalltalkImage things and not Smalltalk); I
> added a Dialect class>>basicIsSqueak39 (true for SystemVersion number >
> 7010); and I subclassed SqueakDatabaseAccessor as Squeak39DatabaseAccessor to
> modify the instance method #basicExecuteSQLString: to say:
>
> result := connection execute: (aString asWideString convertToEncoding:
> 'utf-8').

Do you really need to send #asWideString? It doesn't look like it
would do anything if you already have a String.

Philippe


Philippe Marschall

Re: 3.9 and encoding

In reply to this post by NorbertHartl

2007/2/28, Norbert Hartl <[hidden email]>:

> I think there are only two possibilities. Handle it like Java or Lisp
> and convert every encoding to the internal one (UCS-2) on string creation.
> The other option, which would be easier (I think), is to add the
> character encoding information to the string.
>
> What do you think?

Strings are a hard problem. It's interesting to see how many languages
fuck up in this area, considering this is a 'basic' data type. Having
more information about a String (what encoding, what escaping, ...)
would definitely help.

UCS-2 is not "the solution" since it handles only characters in the
BMP. Additionally we don't want to do Han unification.
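
The BMP limitation is easy to check. A minimal Python sketch, with the helper name `utf16_units` and the two sample characters chosen only for illustration:

```python
bmp_char = "ö"               # U+00F6, inside the Basic Multilingual Plane
astral_char = "\U0001D11E"   # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to represent `ch` in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2

print(utf16_units(bmp_char))      # 1
print(utf16_units(astral_char))   # 2: a surrogate pair, which UCS-2 cannot express
```

A fixed 16-bit unit per character therefore cannot cover every code point, which is the objection to UCS-2 as the internal representation.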

> > P.S.:
> > PGConnection class >> #buildDefaultFieldConverters
> > has given us a lot of pain because Squeak doesn't have full block closures
> >
> Oh, wow, another day hearing a lot of basic things I don't have any idea
> about :) What are "full" block closures?

The problem is that all these :s block arguments share the
same temporary variable. If several of these blocks are active at the
same time, you have a problem.

See:
http://bugs.impara.de/view.php?id=4636
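
The effect described above can be modelled in Python. This is a simulation, not Squeak semantics: `HomeContext` and `make_block` are hypothetical names that mimic blocks whose argument `:s` lives in one slot shared by the whole method instead of a fresh variable per activation.

```python
class HomeContext:
    """Hypothetical model of a method context with ONE slot `s` that every
    block argument `:s` created in that method writes to."""
    def __init__(self):
        self.s = None

def make_block(home, body):
    """Return a 'block' whose argument is stored in the shared slot of `home`."""
    def block(arg):
        home.s = arg            # no fresh variable per activation
        return body(home)
    return block

home = HomeContext()
inner = make_block(home, lambda h: h.s.upper())

def outer_body(h):
    inner("inner")              # a second block becomes active...
    return h.s                  # ...and has overwritten our argument

outer = make_block(home, outer_body)
result = outer("outer")
print(result)                   # inner
```

With full closures each activation would get its own `s`, and the outer block would still see "outer".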

Philippe


Martial Boniou

Re: 3.9 and encoding

In reply to this post by Philippe Marschall

Philippe Marschall a écrit :
| 2007/2/28, Martial Boniou <[hidden email]>:
| >Hi,
| >
| >I tought I had 'corrupted' 3.9 images but it seems to be a general issue
| >for 3.9. I had made the hack as Philippe said. Actually I changed the
| >Dialect class>>basicIsSqueak test so that it pass well during
| >installation (to get the SmalltalkImage things and not Smalltalk); I
| >added a Dialect class>>basicIsSqueak39 (true for SystemVersion number >
| >7010); I subclass SqueakDatabaseAccessor to Squeak39DatabaseAccessor to
| >modify the instance method #basicExecuteSQLString to say:
| >
| >result := connection execute: (aString asWideString convertToEncoding:
| >'utf-8').
|
| Do you really need to send #asWideString? It doesn't look like it
| would do anything if you already have a String.

Of course! It does nothing there. My brain was short of oxygen at that
moment ;-)

| Philippe
|
| >It works well. I did it to test Ramon Leon's Seaside Blog but I didn't
| >post this because I wasn't sure it was a common problem and because of
| >the ugliness of the string conversion.
| >
| >I attach my two mods.
| >
| >--
| >Martial


NorbertHartl

Re: 3.9 and encoding

In reply to this post by Philippe Marschall

On Wed, 2007-02-28 at 10:03 +0100, Philippe Marschall wrote:

Yes, you are right. For everybody who wants to know: you can fix it by
adding

#(1043) do: [:each |
    converters at: each put: [:s | s convertFromEncoding: #utf8]].

to PGConnection class>>buildDefaultFieldConverters

Hmmm, I guess I'll talk to Yanni about this.

regards,

Norbert


stephane ducasse

Re: 3.9 and encoding

In reply to this post by Philippe Marschall

Just a question.
We introduced this change in 3.9 because apparently it was important.
It was certainly suggested by an eminent Seasider.
Now I would like to know whether this was correct (i.e. it fixed a problem),
and I would like to avoid giving the impression that "they broke
something in 3.9 again", because I can tell you that we ***really***
paid attention. It is also because of that kind of atmosphere
and regular bashing that we lost Marcus.

Stef


Philippe Marschall

Re: 3.9 and encoding

Well, it kind of depends. It is really two things, bug fixes and
semantic changes, and depending on your point of view the semantic
changes are a bug fix. The bug fixes surely help, but the different
semantics in 3.8 vs. 3.9 make it hard to support Squeak 3.8 and 3.9 at
the same time.

I wasn't bashing anyone, just pointing out that the different
semantics require different client code in 3.8 vs. 3.9.

Cheers
Philippe

2007/3/1, stephane ducasse <[hidden email]>:

> Just a question.
> We introduced this change in 3.9 because apparently it was important.
> Certainly suggested by an eminent seasider.
> Now I would like to know if this was correct (ie it fixed a problem)
> and I would like to avoid to give the impression that "they broke
> again something in 3.9" because I can tell you that we ***really***
> payed attention. This is also because of that kind of atmosphere
> and regular bashing that we lost marcus.
>
> Stef

stephane ducasse

Re: 3.9 and encoding

> Well it kinda depends. It is really two things, bug fixes and
> semantics changes and depending on point of view the semantics changes
> are a bugfix. The bug fixes surely helps but the different semantics
> in 3.8 vs. 3.9 make it hard to support Squeak 3.8 and 3.9 at the same
> time.
>
> I wasn't bashing anyone, just pointing out that the different
> semantics require different client code in 3.8 vs. 3.9.

I know. ;)
But I wanted to know whether we introduced a bug or not.
String encoding is a mess. Maybe one day we should have strings that
know their business (encoding and all the rest).

NorbertHartl

Re: 3.9 and encoding

In reply to this post by NorbertHartl

On Wed, 2007-02-28 at 23:41 +0100, Norbert Hartl wrote:

Here's a little update. Actually the conversion matched only varchar.
To match varchar and text columns add

#(1043 25)
do: [:each| converters at: each put: [:s | s convertFromEncoding: #utf8]].
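
For anyone wondering why the converter is needed on the reading side, here
is a small workspace sketch of the round trip. It assumes a Squeak 3.9
image and uses only #convertToEncoding: and #convertFromEncoding: as they
appear earlier in this thread; the sample word and the variable names are
made up for illustration.

```smalltalk
"Sketch only, assuming Squeak 3.9. The sample string is hypothetical."
| original wire restored |
"Build the o-umlaut from its code point (246) to avoid source encoding issues."
original := 'Gr', (String with: (Character value: 246)), 'sse'.
"Roughly what goes to the database when client encoding is utf-8."
wire := original convertToEncoding: 'utf-8'.
"Roughly what a field converter does with the bytes coming back."
restored := wire convertFromEncoding: 'utf-8'.
Transcript
	show: original size printString; cr;
	show: wire size printString; cr;          "larger: the umlaut takes two bytes"
	show: (restored = original) printString; cr.
```

Without the decoding step on the way back, the image would hold the raw
utf-8 bytes as if they were characters, which is the mismatch described
at the start of this thread.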

regards,

Norbert
