pier2 and encoding

NorbertHartl
This is a follow-up to the discussion in "upgrades report" about SIXX and importing. Today I managed to rewrite my Pier importer to handle encoding correctly. After this I exported a Pier kernel from my 2.3.1 environment and imported it into my 2.4x environment.

As I reported in the last mail, the encoding seemed to be broken in my kernel. But now it appears to be quite the opposite. Importing the kernel gives me Pier paragraphs that contain Strings which are encoded correctly. They are displayed as $ö and $ü, just as they should be.

After installing Pier-Setup I did a "PRDistribution new register", which gives me a preconfigured Pier kernel. If I edit a page there and enter an umlaut (ö, ü, ...), I can see the right characters after saving (on the web page). Investigating the objects in GemStone shows that they are still in UTF-8 encoding inside the image. In the inspector I can see 'Köln' instead of 'Köln'.

I haven't completely figured out what is going on. The UTF-8 string, which appears as 'Köln' in the image, is output without conversion, so the browser shows it correctly. By the way, the server (WAGsSwazooAdaptor; I didn't have any success with the others) reports a character set of 'utf-8' in the header. My correctly encoded texts in the Pier kernel are converted to Latin-1, which is why the browser displays the wrong characters.
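
The effect described above can be reproduced outside GemStone. The following is a minimal sketch in Python, not Seaside or GemStone code, and the variable names are made up for illustration. It assumes that "stored raw" means one character per received byte.

```python
# Illustrative only: reproduce the 'Köln' vs 'Köln' effect outside of GemStone.
text = "Köln"

# What a browser sends for a UTF-8 page: 'ö' becomes the two bytes C3 B6.
utf8_bytes = text.encode("utf-8")

# Storing those bytes "raw", one character per byte, is what the inspector shows.
raw_in_image = utf8_bytes.decode("latin-1")

# Writing the raw string back out one byte per character restores the original
# bytes, so a browser told 'charset=utf-8' renders it correctly.
sent_raw = raw_in_image.encode("latin-1")

# A properly decoded string written out as Latin-1 is not valid UTF-8.
sent_latin1 = text.encode("latin-1")

print(repr(raw_in_image))        # 'Köln'
print(sent_raw == utf8_bytes)    # True
print(sent_latin1)               # b'K\xf6ln'
```

The raw string only looks right in the browser because two missing conversions cancel each other out. The correctly decoded string looks wrong because it is sent as a single byte 0xF6 under a 'utf-8' header.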

I'm investigating this further, but hopefully someone has an idea what is going wrong here.

Norbert

Re: pier2 and encoding

Dale Henrichs
Norbert,

this stuff is not straightforward at all, but this is my understanding
of the way that things _should_ be working.

1. Internally GemStone supports String, DoubleByteString, and
QuadByteString ... there are other String implementations in the image,
but I am unsure of their status.

2. UTF8 is an encoding that allows one to encode multi-byte characters
using single byte collections. If you look at a UTF8 encoded String
without decoding, the String will look like random (single byte)
characters. When you decode a UTF8 encoded String in GemStone, the
result is a String or QuadByteString depending upon what is being
decoded (decoding US-ASCII results in a String). If one or more of the
encoded characters does not fit in a single byte, GemStone returns a
QuadByteString instance. There is an option to decode into
DoubleByteString if you want to conserve space.

3. Using GemTools, the Quad and DoubleByteStrings are converted to
WideStrings when they come off the wire to your client Smalltalk image

4. The FastCGI and Swazoo adaptors for Seaside3.0 should convert all
strings to UTF8 (from String, DoubleByteString, or QuadByteString)
before shipping them across the wire to the HTTP client. All strings
that are received from a client are converted from UTF8 to String,
DoubleByteString, or QuadByteString. I've done a fair amount of work
with Seaside3.0 making sure that the correct encoding/decoding was going
  on in the right places (that doesn't mean there are no bugs, tho:)

5. Seaside2.8 did not necessarily correctly encode/decode UTF8, so with
Seaside2.8 it would not be surprising that UTF8 encoded strings became
lodged in your image. Since no automatic encoding/decoding was going on,
the UTF8 representation appeared to work fine:)

6. SIXX (without any UTF8 decoding/encoding) "preserved" the UTF8
encoding of the strings and transported them to the 2.4.4.1 image as is.

7. Strings that exist in your image that are encoded in UTF8 should be
decoded into instances of String or QuadByteString as necessary. Once
that is done, the browsers should be displaying the right characters ...
when using Seaside 3.0
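
Points 2 and 7 can be sketched as follows. This is a Python illustration under stated assumptions, not the GemStone API: `repair` and `width_needed` are hypothetical helpers, and the 1/2/4 widths merely mirror the String, DoubleByteString and QuadByteString sizes named above.

```python
def repair(stored: str) -> str:
    """Decode a string whose characters are really UTF-8 bytes (hypothetical helper)."""
    try:
        raw = stored.encode("latin-1")   # fails if any character is above 255
        return raw.decode("utf-8")       # fails if the bytes are not valid UTF-8
    except (UnicodeEncodeError, UnicodeDecodeError):
        return stored                    # looks already decoded; leave untouched


def width_needed(decoded: str) -> int:
    """Bytes per character a fixed-width string class would need (1, 2 or 4)."""
    top = max(map(ord, decoded), default=0)
    return 1 if top < 0x100 else 2 if top < 0x10000 else 4


print(repair("Köln"))               # Köln
print(repair("Köln"))                # Köln (unchanged)
print(width_needed("Köln"))          # 1
print(width_needed("\u20ac"))        # 2
print(width_needed("\U0001F600"))    # 4
```

The check in `repair` is a heuristic. A genuine Latin-1 string whose bytes happen to form valid UTF-8 would be converted as well, so a one-off migration should be reviewed rather than run blindly.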

Dale

Re: pier2 and encoding

NorbertHartl

On 02.08.2010, at 20:43, Dale Henrichs wrote:

> Norbert,
>
> this stuff is not straightforward at all, but this is my understanding
> of the way that things _should_ be working.
>
> 1. Internally GemStone supports String, DoubleByteString, and
> QuadByteString ... there are other String implementations in the image,
> but I am unsure of their status.
>
ok

> 2. UTF8 is an encoding that allows one to encode multi-byte characters
> using single byte collections. If you look at a UTF8 encoded String
> without decoding, the String will look like random (single byte)
> characters. When you decode a UTF8 encoded String in GemStone, the
> result is a String or QuadByteString depending upon what is being
> decoded (decoding US-ASCII results in a String). If one or more of the
> encoded characters does not fit in a single byte, GemStone returns a
> QuadByteString instance. There is an option to decode into
> DoubleByteString if you want to conserve space.
>
Well, I don't get the first sentence. UTF-8 is a variable-length encoding scheme for code points. Only US-ASCII characters are encoded in a single byte; every other character, including the Latin-1 characters above 127, is encoded in the smallest number of bytes needed. In addition, some characters in the upper range (128-255) can be represented in two ways (composed and decomposed) within the full Unicode range.
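
A short Python sketch makes the byte counts and the composed/decomposed distinction concrete. The helper name `utf8_len` is made up for illustration; `unicodedata.normalize` is part of the Python standard library.

```python
import unicodedata


def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


samples = ["K", "\u00f6", "\u20ac", "\U0001F600"]   # K, o-umlaut, euro sign, emoji
for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")

# The same visible character in composed and decomposed form.
composed = unicodedata.normalize("NFC", "o\u0308")    # single code point U+00F6
decomposed = unicodedata.normalize("NFD", "\u00f6")   # 'o' + combining diaeresis

print(utf8_len(composed), utf8_len(decomposed))       # 2 3
```

The loop prints 1, 2, 3 and 4 bytes respectively. Composed and decomposed forms compare as different strings unless they are normalized first, which matters when stored text is searched or compared.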

> 3. Using GemTools, the Quad and DoubleByteStrings are converted to
> WideStrings when they come off the wire to your client Smalltalk image
>
> 4. The FastCGI and Swazoo adaptors for Seaside3.0 should convert all
> strings to UTF8 (from String, DoubleByteString, or QuadByteString)
> before shipping them across the wire to the HTTP client. All strings
> that are received from a client are converted from UTF8 to String,
> DoubleByteString, or QuadByteString. I've done a fair amount of work with Seaside3.0 making sure that the correct encoding/decoding was going on in the right places (that doesn't mean there are no bugs, tho:)
>
hmmm :)

> 5. Seaside2.8 did not necessarily correctly encode/decode UTF8, so with Seaside2.8 it would not be surprising that UTF8 encoded strings became lodged in your image. Since no automatic encoding/decoding was going on, the UTF8 representation appeared to work fine:)
>
I know the pains with WAKom. But if you used WAKomEncoded, then most of it worked. I once patched the handling of multipart fields to decode UTF-8 as well. And at the same time I fixed the PostgreSQL driver to make Glorp "unicode aware"  ;)

> 6. SIXX (without any UTF8 decoding/encoding) "preserved" the UTF8 encoding of the strings and transported them to the 2.4.4.1 image as is.
>
Well, yes, SIXX does not deal with encoding at all, so what is put on disk depends on the class and the stream being used :) But that is not the way it worked here. I explicitly encoded the strings produced by SIXX to UTF-8, and at import time I explicitly decoded the UTF-8 to GemStone strings. That's why it works; everything else is just hoping to be lucky.

> 7. Strings that exist in your image that are encoded in UTF8 should be decoded into instances of String or QuadByteString as necessary. Once that is done, the browsers should be displaying the right characters ... when using Seaside 3.0
>
Here we start to argue. At this moment (without knowing any better) it appears to me that strings coming from the network are not decoded properly and are stored "raw" inside the image. And strings that are correctly encoded in the image are not put onto the network as UTF-8. I'll check this in depth tomorrow.

Norbert


Re: pier2 and encoding

Dale Henrichs
Norbert Hartl wrote:

> On 02.08.2010, at 20:43, Dale Henrichs wrote:
>> 7. Strings that exist in your image that are encoded in UTF8 should
>> be decoded into instances of String or QuadByteString as necessary.
>> Once that is done, the browsers should be displaying the right
>> characters ... when using Seaside 3.0
>>
> Here we start to argue. At this moment (without knowing any better) it
> appears to me that strings coming from the network are not decoded
> properly and are stored "raw" inside the image. And strings that are
> correctly encoded in the image are not put onto the network as UTF-8.
> I'll check this in depth tomorrow.

No arguments:). I did my UTF8 testing using GemSource and FastCGI. If
you are using Seaside-Adaptors-FastCGI-DaleHenrichs.17 or later, then
you should have my fixes. I don't think I've spent much time with the
Swazoo adaptor, although in my dim memory I seem to remember that I ran
through the test cases with Swazoo, but I wouldn't swear to that:)

Provide a couple of simple test cases using a default Pier kernel and
I'll fix the bugs...

Dale

Re: pier2 and encoding

NorbertHartl

On 02.08.2010, at 22:30, Dale Henrichs wrote:

> No arguments:). I did my UTF8 testing using GemSource and FastCGI. If you are using Seaside-Adaptors-FastCGI-DaleHenrichs.17 or later, then you should have my fixes. I don't think I've spent much time with the Swazoo adaptor, although in my dim memory I seem to remember that I ran through the test cases with Swazoo, but I wouldn't swear to that:)
>
> Provide a couple of simple test cases using a default Pier kernel and I'll fix the bugs...
>
OK, I'll do that. With FastCGI it looks good; with Swazoo it does not.

Norbert

Re: pier2 and encoding

NorbertHartl
In reply to this post by Dale Henrichs

On 02.08.2010, at 22:30, Dale Henrichs wrote:

> No arguments:). I did my UTF8 testing using GemSource and FastCGI. If you are using Seaside-Adaptors-FastCGI-DaleHenrichs.17 or later, then you should have my fixes. I don't think I've spent much time with the Swazoo adaptor, although in my dim memory I seem to remember that I ran through the test cases with Swazoo, but I wouldn't swear to that:)
>
> Provide a couple of simple test cases using a default Pier kernel and I'll fix the bugs...

The problem is that

WAServerAdaptor>>defaultCodec
        ^ GRNullCodec new

and

WAFastCGIAdaptor>>defaultCodec
        ^ GRUtf8GemStoneCodec new

overrides that, while WASwazooAdaptor does not. GRNullCodec is not a very good setting. Unlike in object-oriented programming, where a null codec is feasible, there is no null character encoding on the net. A default encoding per adaptor is only useful because there is a "not so well negotiated" zone on the web that assumes form data is sent in the same encoding as the document containing the form. But I didn't see any code that looks for a character set setting in the header and adjusts the encoding accordingly.
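
For comparison, a header-driven decoding step could look roughly like this. It is a Python sketch using the standard library's `email.message.Message` to parse the header. The function name `decode_body` and the UTF-8 fallback are assumptions for illustration and do not describe anything Seaside does.

```python
from email.message import Message


def decode_body(body: bytes, content_type: str, fallback: str = "utf-8") -> str:
    """Decode a request body using the charset named in its Content-Type header."""
    header = Message()
    header["Content-Type"] = content_type
    # Returns the lower-cased charset parameter, or the fallback if none is given.
    charset = header.get_content_charset(fallback)
    try:
        return body.decode(charset)
    except LookupError:
        # Unknown charset name: fall back to the adaptor-wide default.
        return body.decode(fallback)


print(decode_body(b"K\xf6ln", "text/plain; charset=ISO-8859-1"))               # Köln
print(decode_body(b"K\xc3\xb6ln", "application/x-www-form-urlencoded"))        # Köln
```

Browsers commonly omit the charset parameter on form submissions, which is why a sensible per-adaptor default remains necessary even with such a check in place.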

Usually I would tend to add subclassResponsibility to WAServerAdaptor>>defaultCodec, but I can't think of a single reason why different adaptors would need different encoding settings. So I'll fix it in my image by setting the UTF-8 codec in WAServerAdaptor.
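
The inheritance situation can be modelled in a few lines of Python. This is a sketch with made-up class names that only mirror the Smalltalk classes; it assumes a null codec behaves like a one-byte-per-character pass-through, which is modelled here with Latin-1.

```python
class NullCodec:
    """Pass-through: every byte becomes one character and vice versa."""
    def encode(self, text: str) -> bytes:
        return text.encode("latin-1")   # would fail for characters above 255

    def decode(self, data: bytes) -> str:
        return data.decode("latin-1")


class Utf8Codec:
    def encode(self, text: str) -> bytes:
        return text.encode("utf-8")

    def decode(self, data: bytes) -> str:
        return data.decode("utf-8")


class ServerAdaptor:
    def default_codec(self):
        return NullCodec()              # the problematic base-class default

    def receive(self, data: bytes) -> str:
        return self.default_codec().decode(data)

    def send(self, text: str) -> bytes:
        return self.default_codec().encode(text)


class FastCGIAdaptor(ServerAdaptor):
    def default_codec(self):
        return Utf8Codec()              # overrides the default


class SwazooAdaptor(ServerAdaptor):
    pass                                # inherits the null codec


wire = "Köln".encode("utf-8")
print(repr(FastCGIAdaptor().receive(wire)))   # 'Köln'
print(repr(SwazooAdaptor().receive(wire)))    # 'Köln'
print(SwazooAdaptor().send("Köln"))           # b'K\xf6ln'
```

Changing `ServerAdaptor.default_codec` to return `Utf8Codec()` corresponds to the fix proposed above: every adaptor that does not override the method then encodes and decodes UTF-8.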

I needed some time to find this. First I found an OmniBrowser tool that sets the encoding on an adaptor, but I couldn't open the browser in my image. Is this supposed to work in GemStone?

Norbert



Re: pier2 and encoding

Dale Henrichs


Aha,

I've submitted Issue 156
(http://code.google.com/p/glassdb/issues/detail?id=156). In GemStone,
the WAGsSwazooAdaptor could default to using the utf8 codec.
WAGsSwazooAdaptor also introduces the abort/commit for request handling.

Yes, the OB tool for managing adaptors hasn't been ported yet...

Dale