Seaside 3.0a6ish


Michael Lucas-Smith-3
As it stands right now, the only tests that are failing for VisualWorks
7.7.1 are the three codec tests (one of which is expected to fail on
VisualWorks):

GRCodecTest>>testCodecLatin1
GRUtf8CodecTest>>testCodecUtf8Bom
GRUtf8CodecTest>>testCodecUtf8ShortestForm (expected to fail)

The reason for these failures was raised in a previous email.

Cheers,
Michael
_______________________________________________
seaside-dev mailing list
[hidden email]
http://lists.squeakfoundation.org/mailman/listinfo/seaside-dev

Re: Seaside 3.0a6ish

Philippe Marschall
2010/5/19 Michael Lucas-Smith <[hidden email]>:
> As it stands right now, the only tests that are failing for VisualWorks
> 7.7.1 are the three codec tests (one of which is expected to fail on
> VisualWorks):
>
> GRCodecTest>>testCodecLatin1
> GRUtf8CodecTest>>testCodecUtf8Bom
> GRUtf8CodecTest>>testCodecUtf8ShortestForm (expected to fail)
>
> The reason for these failures was raised in a previous email.

What's the consensus there? String comparison method on platform?

Cheers
Philippe

Re: Seaside 3.0a6ish

Michael Lucas-Smith-3
On 5/19/10 4:55 AM, Philippe Marschall wrote:

> 2010/5/19 Michael Lucas-Smith<[hidden email]>:
>    
>> As it stands right now, the only tests that are failing for VisualWorks
>> 7.7.1 are the three codec tests (one of which is expected to fail on
>> VisualWorks):
>>
>> GRCodecTest>>testCodecLatin1
>> GRUtf8CodecTest>>testCodecUtf8Bom
>> GRUtf8CodecTest>>testCodecUtf8ShortestForm (expected to fail)
>>
>> The reason for these failures was raised in a previous email.
>>      
> What's the consensus there? String comparison method on platform?
>
>    
It seems bizarre that Seaside would have a test that explicitly compares
two Unicode strings, yet not all the Smalltalk platforms can support
this using #=. It would *really* suck if #= couldn't be used in the
general code base for safely comparing two strings. (If you have to do
it in the test, then you, in theory, probably need to do it everywhere.)

Can someone speak to the platforms that have trouble with #= here?

Michael

Re: Seaside 3.0a6ish

Paolo Bonzini-2
In reply to this post by Philippe Marschall
On 05/19/2010 01:55 PM, Philippe Marschall wrote:

> 2010/5/19 Michael Lucas-Smith<[hidden email]>:
>> As it stands right now, the only tests that are failing for VisualWorks
>> 7.7.1 are the three codec tests (one of which is expected to fail on
>> VisualWorks):
>>
>> GRCodecTest>>testCodecLatin1
>> GRUtf8CodecTest>>testCodecUtf8Bom
>> GRUtf8CodecTest>>testCodecUtf8ShortestForm (expected to fail)
>>
>> The reason for these failures was raised in a previous email.
>
> What's the consensus there? String comparison method on platform?

That, or a String conversion method like #greaseStringComparable (which
would return a String or ByteArray that you can send #= to).  I'd put it
in tests though, I don't think we want to expose that to users of Grease
until they complain.

Paolo

Re: Seaside 3.0a6ish

Paolo Bonzini-2
In reply to this post by Michael Lucas-Smith-3
On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
>
> Can someone speak to the platforms that have trouble with #= here?

GNU Smalltalk has problems comparing an encoded string with
#latin1String.  The problem is that the GRCodecTest>>#asString: method
does not store the encoding of the string in its result, so GNU
Smalltalk assumes it is in the default encoding (typically UTF-8).  Then
when "self latin1String" has to be compared with an ISO-8859-1 string
(the output of "codec encode: self decodedString"), GNU Smalltalk fails
because it finds an invalid UTF-8 sequence in "self latin1String".

Comparing bytearrays instead takes encodings out of the picture and works.
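
To make the failure mode concrete, here is a minimal sketch in Python rather than Smalltalk. Python's `str`/`bytes` split is used only as an analogy for the platforms discussed, and the sample text and the helper name `is_valid_utf8` are assumptions. It shows Latin-1 bytes being rejected when read as UTF-8, and why comparing raw bytes avoids the question of encoding altogether.

```python
# Sketch only: Python stands in for the Smalltalk platforms discussed here.
text = "caf\u00e9"                       # hypothetical sample text ("café")

latin1_bytes = text.encode("latin-1")    # b'caf\xe9'
utf8_bytes = text.encode("utf-8")        # b'caf\xc3\xa9'


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A lone 0xE9 byte opens a multi-byte UTF-8 sequence that never completes,
# so Latin-1 output is rejected by anything that assumes UTF-8.
latin1_is_utf8 = is_valid_utf8(latin1_bytes)

# Comparing at the byte level needs no encoding assumption at all.
same_bytes = latin1_bytes == "caf\u00e9".encode("latin-1")
```

Here `latin1_is_utf8` ends up false while `same_bytes` is true, which mirrors the observation that the byte comparison works where the string comparison does not.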

VisualWorks seems to have the opposite problem.  #encode: needs to know
what encoding was applied in order to convert to raw bytes.  This seems
to be a bug to me.  The #encode:-d representation should contain the raw
bytes, not the Unicode characters.

So, I could fix it by adding a platform-specific hack to #asString:, but
it seems wrong.  Can you check what breaks if you return a ByteArray
from your codec's #encode: method?

Paolo

Re: Seaside 3.0a6ish

Michael Lucas-Smith-3
On 5/19/10 10:19 AM, Paolo Bonzini wrote:

> On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
>>
>> Can someone speak to the platforms that have trouble with #= here?
>
> GNU Smalltalk has problems comparing an encoded string with
> #latin1String.  The problem is that the GRCodecTest>>#asString: method
> does not store the encoding of the string in its result, so GNU
> Smalltalk assumes it is in the default encoding (typically UTF-8).  
> Then when "self latin1String" has to be compared with an ISO-8859-1
> string (the output of "codec encode: self decodedString"), GNU
> Smalltalk fails because it finds an invalid UTF-8 sequence in "self
> latin1String".
>
> Comparing bytearrays instead takes encodings out of the picture and
> works.
>
> VisualWorks seems to have the opposite problem.  #encode: needs to
> know what encoding was applied in order to convert to raw bytes.  This
> seems to be a bug to me.  The #encode:-d representation should contain
> the raw bytes, not the Unicode characters.
I think there's a misunderstanding here somewhere. The generic String
object (subclasses ByteString, TwoByteString, FourByteString) represents
Unicode characters. This is completely independent of any encoding at
all. We have some specific ByteEncodedString subclasses for ISO8859L1
and MSCP1252 but they don't really enter the picture here.

The terminology is important here, and maybe that's where we're
struggling. When you start off with bytes, the bytes cannot represent
their encoding (auto-detecting is a fool's game), so you must -decode-
the bytes into characters. Once that is done, we have our String object.
The String object does not have a link back to the original bytes, nor
does it know what encoding you used to create the characters, nor should
it. To turn the string back into bytes, you have to -encode- the
characters into bytes using an encoding.
>
> So, I could fix it by adding a platform-specific hack to #asString:,
> but it seems wrong.  Can you check what breaks if you return a
> ByteArray from your codec's #encode: method?
>
The expectation of the GRCodec is that, unfortunately, you will get back
a String object no matter whether you're doing an encode: or a decode:.
I would *LOVELOVELOVE* to return a ByteArray when you call #encode:,
but this has never worked because Pharo/Squeak could never do it. Maybe
this has changed, but none of the code that calls #decode: gives us a
ByteArray: all of the tests, examples, and actual Seaside code pass in a
String containing characters representing bytes. I would really love it
if the API had a contract like this:

GRCodec>>encode: (String)
     ^(ByteArray)

GRCodec>>decode: (ByteArray)
     ^(String)

That would be absolutely ideal conceptually, but I suspect we'd be
bucking against:
     a) lots of existing code that currently works and that people would
rather not break
     b) the Squeak/Pharo adaptors' existing expectations, which would
require a fair bit of work to fix
Instead, the API works like this:

GRCodec>>encode: (String containing unicode characters)
     ^(String containing characters representing bytes)

GRCodec>>decode: (String containing characters representing bytes)
     ^(String containing unicode characters)
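
The difference between the two contracts can be sketched in Python (not Smalltalk). All four function names are hypothetical, and decoding through Latin-1 is used here as an assumed way to model "a String containing characters representing bytes", since Latin-1 maps each byte value 0-255 to the code point with the same number.

```python
# Sketch only: hypothetical helper names, Python standing in for Smalltalk.

def encode_ideal(text: str, encoding: str) -> bytes:
    """Ideal contract: characters in, raw bytes out."""
    return text.encode(encoding)


def decode_ideal(data: bytes, encoding: str) -> str:
    """Ideal contract: raw bytes in, characters out."""
    return data.decode(encoding)


def encode_current(text: str, encoding: str) -> str:
    """Current contract: each output character stands for one byte (0-255)."""
    return text.encode(encoding).decode("latin-1")


def decode_current(byte_chars: str, encoding: str) -> str:
    """Current contract: input characters are turned back into bytes first."""
    return byte_chars.encode("latin-1").decode(encoding)


ideal = encode_ideal("\u00e9", "utf-8")      # two raw bytes: 0xC3 0xA9
current = encode_current("\u00e9", "utf-8")  # two characters: U+00C3 U+00A9
```

Both results carry the same two numeric values, but only `ideal` has a type that says "these are bytes"; `current` is indistinguishable from ordinary text unless the caller remembers where it came from.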

This is the reality we're in right now and instead of rocking the boat
too much, I have an obligation to make Seaside work in the least
disruptive way possible. I'm all in favor of changing the contract, but only if
everyone else is too. So for now, I accept your accusation that #encode:
is returning a String seemingly incorrectly, but throw back that that's
the expectation of the API.

The two tests in question push the opposite ends of the problem.
#testCodecLatin1 tests the encoded bytes, while #testCodecUtf8Bom tests
the decoded characters. In the case of #testCodecLatin1, you can send
#asByteArray to the ByteString containing characters representing bytes,
because none of the characters go over a value of 255. It is pure
happenstance that this works, and we fully intend to one day deprecate
#asByteArray on String entirely in VisualWorks.

#testCodecUtf8Bom does the opposite: it wants to compare strings
containing Unicode characters, and as such in VisualWorks we end up with
a TwoByteString, to which you cannot send #asByteArray. This is how I
first noticed the problem.
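
A small Python sketch (an analogy, not VisualWorks code) of why the byte conversion works for one test and not the other. The function `as_byte_array` is a hypothetical stand-in for sending #asByteArray to a String, and the two sample characters are assumptions.

```python
# Sketch only: a hypothetical stand-in for sending #asByteArray to a String.
def as_byte_array(text: str) -> bytes:
    """Map each character to one byte; fails for code points above 255."""
    return bytes(ord(ch) for ch in text)


narrow = as_byte_array("\u00e9")   # U+00E9 (233) fits in a single byte

try:
    as_byte_array("\u0117")        # U+0117 (279) does not fit in a byte
    wide_ok = True
except ValueError:
    wide_ok = False
```

The first call succeeds only because every character happens to be at most 255; the second fails as soon as a wider character appears, which is the situation described for the decoded UTF-8 string.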

Oh as a small side note, the #name API is inconsistent. #testCodecLatin1
expects the name to be case insensitive, while #testCodecUtf8 expects
the name to be lowercase. I'm not sure how this 'came to be' but it's
impossible for me to make it pass consistently for both scenarios :)

Michael

Re: Seaside 3.0a6ish

Michael Lucas-Smith-3

>> So, I could fix it by adding a platform-specific hack to #asString:,
>> but it seems wrong.  Can you check what breaks if you return a
>> ByteArray from your codec's #encode: method?
>>
>
Results from changing GRGenericCodec to expect and return ByteArrays
appropriately.

Errors:
     #testCodecLatin1 -- errors because #latin1String is sent to
       #decode: when it should be #latin1Bytes
     #testCodecUtf8 -- errors because #utf8String is sent to #decode:
       when it should be #utf8Bytes
         #testCodecUtf8Bom -- same
     #testCompileString -- errors because #contents calls
       #contentsDecodedUsing: where the @contents variable is already a
       string
         #testCompileStringAgain -- same
     #testDecodedWith -- errors because @user ends up containing a
       String with Unicode characters derived from the % encoding, which
       the codec then tries to decode again

And those are just the tests - I suspect that there'll be even more
errors/failures attempting to actually use Seaside. I'd also need to
adjust the adaptor to do less fiddling, but that's a whole other story
(the Opentalk adaptor, by default, handles encoding -for- you, but
Seaside intends to handle it for itself.. so we have to do a lot of
fiddling at the adaptor level to pass through raw data and let Seaside
handle it its own way).

Cheers,
Michael

Re: Seaside 3.0a6ish

Michael Lucas-Smith-3
In reply to this post by Michael Lucas-Smith-3
On 5/19/10 11:07 AM, Michael Lucas-Smith wrote:
> On 5/19/10 10:19 AM, Paolo Bonzini wrote:
>> On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
>>>
>>> Can someone speak to the platforms that have trouble with #= here?
>>
>> GNU Smalltalk has problems comparing an encoded string with
>> #latin1String.
Perhaps we need to make the way we represent encoded bytes platform
specific. If a platform chooses to represent bytes as characters in a
string, so be it - but at least at the GRPlatform level we can draw a
distinction between "A collection of bytes" and "a collection of
characters" instead of always assuming the collection species is String?

Michael


Re: Seaside 3.0a6ish

Julian Fitzell-3
I'm too exhausted to dig into this any further than reading it right now. What Michael's talking about sounds fairly reasoned, but even assuming everyone agrees, I think this is a disruptive change to throw in at this point. What's the smallest change we can make to get it working "good enough" for now?

My suggestion is that... no, crap, that doesn't work at all. I was going to suggest we take the sprint at ESUG to focus on sorting out encoding/decoding (I'd like it to be a focus for 3.1 anyway). But since Philippe won't be there and Lukas may or may not be, that's not really going to work out. :)

Ok, so I'm not sure when we tackle that, but the question is still what can we do now that won't make us even *less* likely to release?

Julian

On Wed, May 19, 2010 at 7:31 PM, Michael Lucas-Smith <[hidden email]> wrote:
> On 5/19/10 11:07 AM, Michael Lucas-Smith wrote:
>> On 5/19/10 10:19 AM, Paolo Bonzini wrote:
>>> On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
>>>>
>>>> Can someone speak to the platforms that have trouble with #= here?
>>>
>>> GNU Smalltalk has problems comparing an encoded string with #latin1String.
> Perhaps we need to make the way we represent encoded bytes platform specific. If a platform chooses to represent bytes as characters in a string, so be it - but at least at the GRPlatform level we can draw a distinction between "A collection of bytes" and "a collection of characters" instead of always assuming the collection species is String?
>
> Michael


Re: Seaside 3.0a6ish

Michael Lucas-Smith-3
On 5/19/10 1:52 PM, Julian Fitzell wrote:

> I'm too exhausted to dig into this any further than reading it right
> now. What Michael's talking about sounds fairly reasoned, but even
> assuming everyone agrees, I think this is a disruptive change to throw
> in at this point. What's the smallest change we can make to get it
> working "good enough" for now?
>
> My suggestion is that... no, crap, that doesn't work at all. I was
> going to suggest we take the sprint at ESUG to focus on sorting out
> encoding/decoding (I'd like to be a focus for 3.1 anyway). But since
> Philippe won't be there and Lukas may or may not be, that's not really
> going to work out. :)
>
> Ok, so I'm not sure when we tackle that, but the question is still
> what can we do now that won't make us even *less* likely to release?
We have some tests failing - but a working system.

The tests right now are a curious blend of pragmatism and ideal in this
case. They exist to demonstrate what the system should do, but are
implemented in terms of what it can do. In that sense, I do not feel as
though there is a great urgency in resolving this issue, and fixing it at
this point would mean I could not include 3.0a6 in VisualWorks 7.7.1;
it'd have to wait for the next release.

My inclination is to accept these test faults as known and fix up the
encoding/decoding across Smalltalks in the next iteration. In a sense,
my suggestion is to do nothing for now for 3.0, but to revisit this soon
after.

Michael

Re: Seaside 3.0a6ish

Philippe Marschall
In reply to this post by Paolo Bonzini-2
2010/5/19 Paolo Bonzini <[hidden email]>:

> On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
>>
>> Can someone speak to the platforms that have trouble with #= here?
>
> GNU Smalltalk has problems comparing an encoded string with #latin1String.
>  The problem is that the GRCodecTest>>#asString: method does not store the
> encoding of the string in its result, so GNU Smalltalk assumes it is in the
> default encoding (typically UTF-8).  Then when "self latin1String" has to be
> compared with an ISO-8859-1 string (the output of "codec encode: self
> decodedString"), GNU Smalltalk fails because it finds an invalid UTF-8
> sequence in "self latin1String".

Why does there have to be an encoding present? It concatenates
characters from known code points. There are no bytes involved so no
mapping or mapping information is required.

> Comparing bytearrays instead takes encodings out of the picture and works.
>
> VisualWorks seems to have the opposite problem.  #encode: needs to know what
> encoding was applied in order to convert to raw bytes.  This seems to be a
> bug to me.  The #encode:-d representation should contain the raw bytes, not
> the Unicode characters.
>
> So, I could fix it by adding a platform-specific hack to #asString:, but it
> seems wrong.  Can you check what breaks if you return a ByteArray from your
> codec's #encode: method?

I have a train ride today. I can give it a shot. It might actually
work because of a recent stream change.

Cheers
Philippe

Re: Seaside 3.0a6ish

Philippe Marschall
In reply to this post by Michael Lucas-Smith-3
2010/5/19 Michael Lucas-Smith <[hidden email]>:

> On 5/19/10 10:19 AM, Paolo Bonzini wrote:
>>
>> On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
>>>
>>> Can someone speak to the platforms that have trouble with #= here?
>>
>> GNU Smalltalk has problems comparing an encoded string with #latin1String.
>>  The problem is that the GRCodecTest>>#asString: method does not store the
>> encoding of the string in its result, so GNU Smalltalk assumes it is in the
>> default encoding (typically UTF-8).  Then when "self latin1String" has to be
>> compared with an ISO-8859-1 string (the output of "codec encode: self
>> decodedString"), GNU Smalltalk fails because it finds an invalid UTF-8
>> sequence in "self latin1String".
>>
>> Comparing bytearrays instead takes encodings out of the picture and works.
>>
>> VisualWorks seems to have the opposite problem.  #encode: needs to know
>> what encoding was applied in order to convert to raw bytes.  This seems to
>> be a bug to me.  The #encode:-d representation should contain the raw bytes,
>> not the Unicode characters.
>
> I think there's a misunderstanding here somewhere. The generic String object
> (subclasses ByteString, TwoByteString, FourByteString) represents unicode
> characters. This is completely independent of any encoding at all. We have
> some specific ByteEncodedString subclasses for ISO8859L1 and MSCP1252 but
> they don't really enter the picture here.
>
> The terminology is important here, may be that's where we're struggling -
> when you start off with bytes, the bytes cannot represent their encoding
> (auto detecting is a fools game) so you must -decode- the bytes in to
> characters. Once that is done, we have our String object. The String object
> does not have a link back to the original bytes, nor does it know what
> encoding you used to create the characters - not should it. To turn the
> string back in to bytes, you have to -encode- the characters in to bytes
> using an encoding.
>>
>> So, I could fix it by adding a platform-specific hack to #asString:, but
>> it seems wrong.  Can you check what breaks if you return a ByteArray from
>> your codec's #encode: method?
>>
> The expectation of the GRCodec is that, unfortunately, you will get back a
> String object no matter whether you're doing an encode: or a decode: .. I
> would *LOVELOVELOVE* to return a ByteArray when you call #encode: - but this
> has never worked because Pharo/Squeak could never do it. May be this has
> changed, but none of the code that calls #decode: gives us a ByteArray - all
> of the tests, examples, seaside actual code passes in a String containing
> characters representing bytes. I would really love it if the API had a
> contract like this:
>
> GRCodec>>encode: (String)
>    ^(ByteArray)
>
> GRCodec>>decode: (ByteArray)
>    ^(String)

I agree. I believe the first one is doable, I'll hack together a
prototype today. The second is more tricky because the servers
themselves (Comanche and Swazoo) already give us a String which is
actually just a ByteArray.

Cheers
Philippe

Re: Seaside 3.0a6ish

Julian Fitzell-2
On Thu, May 20, 2010 at 6:32 AM, Philippe Marschall <[hidden email]> wrote:
> 2010/5/19 Michael Lucas-Smith <[hidden email]>:
>> The expectation of the GRCodec is that, unfortunately, you will get back a
>> String object no matter whether you're doing an encode: or a decode: .. I
>> would *LOVELOVELOVE* to return a ByteArray when you call #encode: - but this
>> has never worked because Pharo/Squeak could never do it. May be this has
>> changed, but none of the code that calls #decode: gives us a ByteArray - all
>> of the tests, examples, seaside actual code passes in a String containing
>> characters representing bytes. I would really love it if the API had a
>> contract like this:
>>
>> GRCodec>>encode: (String)
>>    ^(ByteArray)
>>
>> GRCodec>>decode: (ByteArray)
>>    ^(String)
>
> I agree. I believe the first one is doable, I'll hack together a
> prototype today. The second is more tricky because the servers
> themselves (Comanche and Swazoo) already give us a String which is
> actually just a ByteArray.

Maybe we could get the servers "fixed" (might imply a mode you can set or something). Obviously the server adaptors can simply deal with it as well, but there's going to be an inefficiency there.

Julian


Re: Seaside 3.0a6ish

Philippe Marschall
In reply to this post by Philippe Marschall
2010/5/20 Philippe Marschall <[hidden email]>:

> 2010/5/19 Michael Lucas-Smith <[hidden email]>:
>> On 5/19/10 10:19 AM, Paolo Bonzini wrote:
>>>
>>> On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
>>>>
>>>> Can someone speak to the platforms that have trouble with #= here?
>>>
>>> GNU Smalltalk has problems comparing an encoded string with #latin1String.
>>>  The problem is that the GRCodecTest>>#asString: method does not store the
>>> encoding of the string in its result, so GNU Smalltalk assumes it is in the
>>> default encoding (typically UTF-8).  Then when "self latin1String" has to be
>>> compared with an ISO-8859-1 string (the output of "codec encode: self
>>> decodedString"), GNU Smalltalk fails because it finds an invalid UTF-8
>>> sequence in "self latin1String".
>>>
>>> Comparing bytearrays instead takes encodings out of the picture and works.
>>>
>>> VisualWorks seems to have the opposite problem.  #encode: needs to know
>>> what encoding was applied in order to convert to raw bytes.  This seems to
>>> be a bug to me.  The #encode:-d representation should contain the raw bytes,
>>> not the Unicode characters.
>>
>> I think there's a misunderstanding here somewhere. The generic String object
>> (subclasses ByteString, TwoByteString, FourByteString) represents unicode
>> characters. This is completely independent of any encoding at all. We have
>> some specific ByteEncodedString subclasses for ISO8859L1 and MSCP1252 but
>> they don't really enter the picture here.
>>
>> The terminology is important here, may be that's where we're struggling -
>> when you start off with bytes, the bytes cannot represent their encoding
>> (auto detecting is a fools game) so you must -decode- the bytes in to
>> characters. Once that is done, we have our String object. The String object
>> does not have a link back to the original bytes, nor does it know what
>> encoding you used to create the characters - not should it. To turn the
>> string back in to bytes, you have to -encode- the characters in to bytes
>> using an encoding.
>>>
>>> So, I could fix it by adding a platform-specific hack to #asString:, but
>>> it seems wrong.  Can you check what breaks if you return a ByteArray from
>>> your codec's #encode: method?
>>>
>> The expectation of the GRCodec is that, unfortunately, you will get back a
>> String object no matter whether you're doing an encode: or a decode: .. I
>> would *LOVELOVELOVE* to return a ByteArray when you call #encode: - but this
>> has never worked because Pharo/Squeak could never do it. May be this has
>> changed, but none of the code that calls #decode: gives us a ByteArray - all
>> of the tests, examples, seaside actual code passes in a String containing
>> characters representing bytes. I would really love it if the API had a
>> contract like this:
>>
>> GRCodec>>encode: (String)
>>    ^(ByteArray)
>>
>> GRCodec>>decode: (ByteArray)
>>    ^(String)
>
> I agree. I believe the first one is doable, I'll hack together a
> prototype today. The second is more tricky because the servers
> themselves (Comanche and Swazoo) already give us a String which is
> actually just a ByteArray.

There's one trouble point:
WAUrlEncoder >> #nextPutAll:

The trouble is we first need to convert a URL to bytes and then
interpret these bytes as Latin-1 and do percent encoding accordingly.
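
The two-step process (characters to bytes, then bytes to percent escapes) can be illustrated with Python's standard `urllib.parse.quote`. This is only a sketch of the general technique, not of WAUrlEncoder itself, and the sample character U+0117 is borrowed from an example quoted later in the thread.

```python
from urllib.parse import quote

# Sketch only: the escapes depend entirely on which codec produced the bytes.
ch = "\u0117"   # "ė", U+0117

# Step 1 and 2 combined: encode to UTF-8 bytes, then escape each byte.
utf8_escaped = quote(ch, encoding="utf-8")

# The same character, converted to UTF-16 little-endian bytes first.
utf16le_escaped = quote(ch.encode("utf-16-le"))
```

The first call yields `%C4%97` and the second `%17%01`. The second has the same form as the `%17%01` output reported later in the thread; whether that is what Seaside actually did is not established here.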

Cheers
Philippe

Re: Seaside 3.0a6ish

Paolo Bonzini-2
On 05/20/2010 11:12 AM, Philippe Marschall wrote:
> There's one trouble point:
> WAUrlEncoder>>  #nextPutAll:
>
> The trouble is we first need to convert a URL to bytes and then
> interpret these bytes as Latin-1 and do percent encoding accordingly.

Really?  Shouldn't the percent-encoded strings use whatever encoding the
page uses?  Surely, browsers use whatever encoding the page sent when
creating their responses.

Alternatively, _do we need to use percent encoding at all_?  Including
HTML-encoded Unicode characters in <a> tags, like

   <a href="http://www.google.com/search?q=&#176;">test</a>

or

   <a href="http://www.google.com/search?q=&#32176;">test</a>

should just work.

Paolo

Re: Seaside 3.0a6ish

Paolo Bonzini-2
In reply to this post by Julian Fitzell-2
On 05/20/2010 10:23 AM, Julian Fitzell wrote:

>      > GRCodec>>encode: (String)
>      >    ^(ByteArray)
>      >
>      > GRCodec>>decode: (ByteArray)
>      >    ^(String)
>
>     I agree. I believe the first one is doable, I'll hack together a
>     prototype today. The second is more tricky because the servers
>     themselves (Comanche and Swazoo) already give us a String which is
>     actually just a ByteArray.
>
> Maybe we could get the servers "fixed" (might imply a mode you can set
> or something). Obviously the server adaptors can simply deal with it as
> well, but there's going to be an inefficiency there.

Or just have decode: accept a ByteArray or a String.  Just add good old
#isByteArray to Grease.  You can call it #isBytes if you want to be
politically correct. :-)
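
As a rough Python sketch of that idea (an analogy only; `decode_lenient` is a hypothetical name, not Grease API), a decode that accepts either raw bytes or a string whose characters each stand for one byte could branch on the argument type like this:

```python
# Sketch only: decode_lenient is a hypothetical name, not Grease API.
def decode_lenient(data, encoding: str) -> str:
    """Decode raw bytes, or a string whose characters each stand for a byte."""
    if isinstance(data, (bytes, bytearray)):   # the "#isByteArray" branch
        raw = bytes(data)
    else:
        raw = data.encode("latin-1")           # characters 0-255 back to bytes
    return raw.decode(encoding)


from_bytes = decode_lenient(b"\xc3\xa9", "utf-8")
from_string = decode_lenient("\u00c3\u00a9", "utf-8")
```

Both calls produce the same single character, so existing callers that pass byte-strings keep working while new callers can pass real bytes.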

Paolo

Re: Seaside 3.0a6ish

Paolo Bonzini-2
In reply to this post by Michael Lucas-Smith-3
On 05/19/2010 08:07 PM, Michael Lucas-Smith wrote:
> This is the reality we're in right now and instead of rocking the boat
> too much, I have an obligation to make Seaside work the least disruptive
> way possible. I'm all in favor of changing the contract, but only if
> everyone else is too. So for now, I accept your accusation that #encode:
> is returning a String seemingly incorrectly, but throw back that that's
> the expectation of the API.

And I can only agree with you, unfortunately.

> #testCodecUtf8Bom does the opposite, it wants to compare strings
> containing unicode characters and as such in VisualWorks we end up with
> a TwoByteString of which you cannot send #asByteArray. This is how I
> first noticed the problem.

Yep, it's contrary to GNU Smalltalk.  In GNU Smalltalk I get an error on
#asUnicodeString for the Latin-1 encoded string.  In VisualWorks you get
an error on #asByteArray for the UTF-8 decoded string. :-)

Paolo
_______________________________________________
seaside-dev mailing list
[hidden email]
http://lists.squeakfoundation.org/mailman/listinfo/seaside-dev
Reply | Threaded
Open this post in threaded view
|

Re: Seaside 3.0a6ish

Paolo Bonzini-2
In reply to this post by Paolo Bonzini-2
[oops we went off list]

On 05/20/2010 04:33 PM, Philippe Marschall wrote:

>>>> Alternatively, _do we need to use percent encoding at all_?
>>>
>>> Yes.
>>
>> So how does it work if I want to link www.google.com/search?q=ė from
>> Seaside?  Currently it will print %17%01 which makes no sense at all.
>
> Might make no sense but is exactly what we have to do because a URL
> must be made out of ASCII characters only because HTTP headers allow
> only ASCII.

But unless it's for a redirect, you don't write HTTP headers, the
browser does.  So (again unless it's for a redirect) I don't see why
Seaside should bother about URL encoding, after all HTML doesn't say
a@href should be only ASCII.

>> I think that should be customizable just like in tomcat.
>
> It is statically, but not dynamically for each URL. The codec has a
> URL codec which might be different. This might bite you if you're
> creating links with non-ASCII characters to two different servers each
> expecting a different URL encoding.

Yeah, instead of having a URL codec in the code, we should have a codec
and a URL codec _in the application configuration_.  And the latter
should only be used by #redirectTo: instead of using #seasideString
(which uses a normal WAHtmlStreamDocument).

This is all stuff for 3.1 of course.

But actually, I don't understand what the problem with WAUrlEncoder is.
You call

        super nextPutAll: (codec url encode: aString)

and WAEncoder calls #greaseInteger on each item of its argument.  So it
works fine with both "each character is really a byte" strings and
ByteArrays.
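
The point about converting each element to an integer can be sketched in Python (an analogy, not the Seaside implementation). `to_int` plays the role described for #greaseInteger, `percent_encode` is a hypothetical helper, and for simplicity it escapes every element, whereas a real URL encoder would leave unreserved characters alone.

```python
# Sketch only: to_int mimics the role described for #greaseInteger;
# percent_encode is a hypothetical helper, not WAEncoder itself.
def to_int(item) -> int:
    """Return a byte value as-is, or the code point of a 1-character string."""
    return item if isinstance(item, int) else ord(item)


def percent_encode(items) -> str:
    """Escape every element, whether items is bytes or a byte-valued string."""
    return "".join("%{:02X}".format(to_int(item)) for item in items)


from_bytes = percent_encode(b"\xc4\x97")
from_chars = percent_encode("\u00c4\u0097")
```

Both inputs give `%C4%97`, because iterating either kind of collection ends up producing the same integers.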

Paolo

Re: Seaside 3.0a6ish

Philippe Marschall
In reply to this post by Philippe Marschall
2010/5/20 Philippe Marschall <[hidden email]>:

> 2010/5/19 Paolo Bonzini <[hidden email]>:
>> On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
>>>
>>> Can someone speak to the platforms that have trouble with #= here?
>>
>> GNU Smalltalk has problems comparing an encoded string with #latin1String.
>>  The problem is that the GRCodecTest>>#asString: method does not store the
>> encoding of the string in its result, so GNU Smalltalk assumes it is in the
>> default encoding (typically UTF-8).  Then when "self latin1String" has to be
>> compared with an ISO-8859-1 string (the output of "codec encode: self
>> decodedString"), GNU Smalltalk fails because it finds an invalid UTF-8
>> sequence in "self latin1String".
>
> Why does there have to be an encoding present? It concatenates
> characters from known code points. There are no bytes involved so no
> mapping or mapping information is required.
>
>> Comparing bytearrays instead takes encodings out of the picture and works.
>>
>> VisualWorks seems to have the opposite problem.  #encode: needs to know what
>> encoding was applied in order to convert to raw bytes.  This seems to be a
>> bug to me.  The #encode:-d representation should contain the raw bytes, not
>> the Unicode characters.
>>
>> So, I could fix it by adding a platform-specific hack to #asString:, but it
>> seems wrong.  Can you check what breaks if you return a ByteArray from your
>> codec's #encode: method?
>
> I have a train ride today. I can give it a shot. It might actually
> work because of a recent stream change.

It does work [1]. We lose the ability to handle MacRoman and UTF-16,
but that could be added if needed. Everything else seems to be working
just fine.

 [1] http://www.squeaksource.com/Seaside31

Cheers
Philippe

Re: Seaside 3.0a6ish

Paolo Bonzini-2
> It does work [1]. We loose the ability to handle macroman and utf-16
> but that could be added if needed.

Is that a limitation of Squeak or what?

Paolo