As it stands right now, the only tests that are failing for VisualWorks 7.7.1 are the three codec tests (one of which is expected to fail on VisualWorks):

GRCodecTest>>testCodecLatin1
GRUtf8CodecTest>>testCodecUtf8Bom
GRUtf8CodecTest>>testCodecUtf8ShortestForm (expected to fail)

The reason for these failures was raised in a previous email.

Cheers,
Michael
_______________________________________________
seaside-dev mailing list
[hidden email]
http://lists.squeakfoundation.org/mailman/listinfo/seaside-dev

2010/5/19 Michael Lucas-Smith <[hidden email]>:
> As it stands right now, the only tests that are failing for VisualWorks
> 7.7.1 are the three codec tests (one of which is expected to fail on
> VisualWorks):
>
> GRCodecTest>>testCodecLatin1
> GRUtf8CodecTest>>testCodecUtf8Bom
> GRUtf8CodecTest>>testCodecUtf8ShortestForm (expected to fail)
>
> The reason for these failures was raised in a previous email.

What's the consensus there? String comparison method on platform?

Cheers
Philippe

On 5/19/10 4:55 AM, Philippe Marschall wrote:
> 2010/5/19 Michael Lucas-Smith<[hidden email]>:
>> The reason for these failures was raised in a previous email.
>
> What's the consensus there? String comparison method on platform?

[...] two unicode strings, but not all the Smalltalk platforms can support this using #=. It would *really* suck if #= couldn't be used in the general code base for safely comparing two strings. (If you have to do it in the test then, in theory, you probably need to do it everywhere.)

Can someone speak to the platforms that have trouble with #= here?

Michael

In reply to this post by Philippe Marschall
On 05/19/2010 01:55 PM, Philippe Marschall wrote:
> 2010/5/19 Michael Lucas-Smith<[hidden email]>:
>> The reason for these failures was raised in a previous email.
>
> What's the consensus there? String comparison method on platform?

That, or a String conversion method like #greaseStringComparable (which would return a String or ByteArray that you can send #= to). I'd put it in tests though, I don't think we want to expose that to users of Grease until they complain.

Paolo

In reply to this post by Michael Lucas-Smith-3
On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
> Can someone speak to the platforms that have trouble with #= here?

GNU Smalltalk has problems comparing an encoded string with #latin1String. The problem is that the GRCodecTest>>#asString: method does not store the encoding of the string in its result, so GNU Smalltalk assumes it is in the default encoding (typically UTF-8). Then when "self latin1String" has to be compared with an ISO-8859-1 string (the output of "codec encode: self decodedString"), GNU Smalltalk fails because it finds an invalid UTF-8 sequence in "self latin1String".

Comparing byte arrays instead takes encodings out of the picture and works.

VisualWorks seems to have the opposite problem. #encode: needs to know what encoding was applied in order to convert to raw bytes. This seems to be a bug to me. The #encode:-d representation should contain the raw bytes, not the Unicode characters.

So, I could fix it by adding a platform-specific hack to #asString:, but it seems wrong. Can you check what breaks if you return a ByteArray from your codec's #encode: method?

Paolo

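The GNU Smalltalk failure described above can be sketched in Python. This is only an analogy: Python's bytes/str split stands in for ByteArray/String, and none of the names below come from Grease or GNU Smalltalk. It shows that ISO-8859-1 bytes for a non-ASCII string are generally not a valid UTF-8 sequence, while comparing the raw bytes needs no encoding at all.

```python
# Illustrative analogy only; these names are hypothetical, not Grease API.
decoded = "\u00e9t\u00e9"  # the characters e-acute, t, e-acute

latin1_bytes = decoded.encode("latin-1")  # one byte per character
utf8_bytes = decoded.encode("utf-8")      # two bytes per e-acute


def can_decode_as_utf8(data: bytes) -> bool:
    """Return True if data is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Treating Latin-1 bytes as if they were in the default encoding (UTF-8)
# fails, because 0xE9 starts a multi-byte sequence that is never completed.
latin1_is_valid_utf8 = can_decode_as_utf8(latin1_bytes)

# Comparing raw bytes takes encodings out of the picture.
bytes_compare_equal = latin1_bytes == bytes([0xE9, 0x74, 0xE9])
```
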
On 5/19/10 10:19 AM, Paolo Bonzini wrote:
> On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
>> Can someone speak to the platforms that have trouble with #= here?
>
> VisualWorks seems to have the opposite problem. #encode: needs to
> know what encoding was applied in order to convert to raw bytes. This
> seems to be a bug to me. The #encode:-d representation should contain
> the raw bytes, not the Unicode characters.

I think there's a misunderstanding here somewhere. The generic String object (subclasses ByteString, TwoByteString, FourByteString) represents unicode characters. This is completely independent of any encoding at all. We have some specific ByteEncodedString subclasses for ISO8859L1 and MSCP1252 but they don't really enter the picture here.

The terminology is important here, maybe that's where we're struggling. When you start off with bytes, the bytes cannot represent their encoding (auto-detecting is a fool's game), so you must -decode- the bytes into characters. Once that is done, we have our String object. The String object does not have a link back to the original bytes, nor does it know what encoding you used to create the characters, nor should it. To turn the string back into bytes, you have to -encode- the characters into bytes using an encoding.

> So, I could fix it by adding a platform-specific hack to #asString:,
> but it seems wrong. Can you check what breaks if you return a
> ByteArray from your codec's #encode: method?

The expectation of the GRCodec is that, unfortunately, you will get back a String object no matter whether you're doing an encode: or a decode:. I would *LOVELOVELOVE* to return a ByteArray when you call #encode:, but this has never worked because Pharo/Squeak could never do it. Maybe this has changed, but none of the code that calls #decode: gives us a ByteArray: all of the tests, the examples and the actual Seaside code pass in a String containing characters representing bytes. I would really love it if the API had a contract like this:

    GRCodec>>encode: (String)
        ^(ByteArray)

    GRCodec>>decode: (ByteArray)
        ^(String)

That would be absolutely ideal conceptually, but I suspect we'd be bucking against:

a) lots of existing code that currently works and people would rather not break
b) the Squeak/Pharo adaptors' existing expectations, which would require a fair bit of work to fix

Instead, the API works like this:

    GRCodec>>encode: (String containing unicode characters)
        ^(String containing characters representing bytes)

    GRCodec>>decode: (String containing characters representing bytes)
        ^(String containing unicode characters)

This is the reality we're in right now and instead of rocking the boat too much, I have an obligation to make Seaside work the least disruptive way possible. I'm all in favor of changing the contract, but only if everyone else is too. So for now, I accept your accusation that #encode: is returning a String seemingly incorrectly, but throw back that that's the expectation of the API.

The two tests in question push the opposite ends of the problem. #testCodecLatin1 tests the encoded bytes, while #testCodecUtf8Bom tests the decoded characters. In the case of testCodecLatin1, you can send #asByteArray to the ByteString containing characters representing bytes, because none of the characters go over a value of 255. It is pure happenstance that this works, and we fully intend to one day deprecate asByteArray on String entirely in VisualWorks.

#testCodecUtf8Bom does the opposite: it wants to compare strings containing unicode characters, and as such in VisualWorks we end up with a TwoByteString, to which you cannot send #asByteArray. This is how I first noticed the problem.

Oh, as a small side note, the #name API is inconsistent. #testCodecLatin1 expects the name to be case insensitive, while #testCodecUtf8 expects the name to be lowercase. I'm not sure how this 'came to be' but it's impossible for me to make it pass consistently for both scenarios :)

Michael

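The two contracts discussed above can be contrasted in a short Python sketch. It is an analogy under stated assumptions: Python's bytes plays the role of ByteArray, str plays the role of String, and the four function names are hypothetical, not part of GRCodec. The "current" pair models a string whose characters merely represent bytes by mapping each byte to the character with the same code point.

```python
# Hypothetical names for illustration; this is not the GRCodec API.

def encode_ideal(text: str, encoding: str) -> bytes:
    """Ideal contract: characters in, raw bytes out."""
    return text.encode(encoding)


def decode_ideal(data: bytes, encoding: str) -> str:
    """Ideal contract: raw bytes in, characters out."""
    return data.decode(encoding)


def encode_current(text: str, encoding: str) -> str:
    """Current contract: the result is a string whose characters
    each stand for one encoded byte (code points 0..255)."""
    return text.encode(encoding).decode("latin-1")


def decode_current(byte_string: str, encoding: str) -> str:
    """Current contract: a string of byte-valued characters in,
    a string of real characters out."""
    return byte_string.encode("latin-1").decode(encoding)


original = "\u0117"  # one character, code point U+0117
ideal = encode_ideal(original, "utf-8")      # two raw bytes
current = encode_current(original, "utf-8")  # two byte-valued characters
```

Both variants round-trip, but only the ideal one makes it visible in the type whether a value holds bytes or characters, which is the distinction the thread keeps tripping over.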
>> So, I could fix it by adding a platform-specific hack to #asString:,
>> but it seems wrong. Can you check what breaks if you return a
>> ByteArray from your codec's #encode: method?

Results from changing GRGenericCodec to expect and return ByteArrays appropriately.

Errors:

#testCodecLatin1 -- errors because #latin1String is sent to #decode: when it should be #latin1Bytes
#testCodecUtf8 -- errors because #utf8String is sent to #decode: when it should be #utf8Bytes
#testCodecUtf8Bom -- same
#testCompileString -- errors because #contents called #contentsDecodedUsing: where the @contents variable is already a string
#testCompileStringAgain -- same
#testDecodedWith -- errors because @user ends up containing a String with unicode characters derived from the % encoding, but is then attempted to be decoded using the codec

And those are just the tests - I suspect that there'll be even more errors/failures attempting to actually use Seaside. I'd also need to adjust the adaptor to do less fiddling, but that's a whole other story (the Opentalk adaptor, by default, handles encoding -for- you, but Seaside intends to handle it for itself, so we have to do a lot of fiddling at the adaptor level to pass through raw data and let Seaside handle it its own way).

Cheers,
Michael

In reply to this post by Michael Lucas-Smith-3
On 5/19/10 11:07 AM, Michael Lucas-Smith wrote:
> On 5/19/10 10:19 AM, Paolo Bonzini wrote:
>> On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
>>> Can someone speak to the platforms that have trouble with #= here?
>>
>> GNU Smalltalk has problems comparing an encoded string with
>> #latin1String.

Perhaps we need to make the way we represent encoded bytes platform specific. If a platform chooses to represent bytes as characters in a string, so be it - but at least at the GRPlatform level we can draw a distinction between "a collection of bytes" and "a collection of characters" instead of always assuming the collection species is String?

Michael

I'm too exhausted to dig into this any further than reading it right now. What Michael's talking about sounds fairly reasoned, but even assuming everyone agrees, I think this is a disruptive change to throw in at this point. What's the smallest change we can make to get it working "good enough" for now?
My suggestion is that... no, crap, that doesn't work at all. I was going to suggest we take the sprint at ESUG to focus on sorting out encoding/decoding (I'd like that to be a focus for 3.1 anyway). But since Philippe won't be there and Lukas may or may not be, that's not really going to work out. :)

Ok, so I'm not sure when we tackle that, but the question is still what can we do now that won't make us even *less* likely to release?

Julian

On Wed, May 19, 2010 at 7:31 PM, Michael Lucas-Smith <[hidden email]> wrote:

On 5/19/10 1:52 PM, Julian Fitzell wrote:
> I'm too exhausted to dig into this any further than reading it right
> now. What Michael's talking about sounds fairly reasoned, but even
> assuming everyone agrees, I think this is a disruptive change to throw
> in at this point. What's the smallest change we can make to get it
> working "good enough" for now?
>
> My suggestion is that... no, crap, that doesn't work at all. I was
> going to suggest we take the sprint at ESUG to focus on sorting out
> encoding/decoding (I'd like to be a focus for 3.1 anyway). But since
> Philippe won't be there and Lukas may or may not be, that's not really
> going to work out. :)
>
> Ok, so I'm not sure when we tackle that, but the question is still
> what can we do now that won't make us even *less* likely to release?

The tests right now are a curious blend of pragmatism and ideal in this case. They exist to demonstrate what the system should do, but are implemented in terms of what it can do. In that sense, I do not feel as though there is a great urgency in resolving this issue, and to fix it at this point would mean I could not include 3.0a6 in VisualWorks 7.7.1 - it'd have to wait for the next release.

My inclination is to accept these test faults as known and fix up the encoding/decoding across Smalltalks in the next iteration. In a sense, my suggestion is to do nothing for now for 3.0, but to revisit this soon after.

Michael

In reply to this post by Paolo Bonzini-2
2010/5/19 Paolo Bonzini <[hidden email]>:
> On 05/19/2010 06:58 PM, Michael Lucas-Smith wrote:
>> Can someone speak to the platforms that have trouble with #= here?
>
> GNU Smalltalk has problems comparing an encoded string with #latin1String.
> The problem is that the GRCodecTest>>#asString: method does not store the
> encoding of the string in its result, so GNU Smalltalk assumes it is in the
> default encoding (typically UTF-8). Then when "self latin1String" has to be
> compared with an ISO-8859-1 string (the output of "codec encode: self
> decodedString"), GNU Smalltalk fails because it finds an invalid UTF-8
> sequence in "self latin1String".

Why does there have to be an encoding present? It concatenates characters from known code points. There are no bytes involved so no mapping or mapping information is required.

> So, I could fix it by adding a platform-specific hack to #asString:, but it
> seems wrong. Can you check what breaks if you return a ByteArray from your
> codec's #encode: method?

I have a train ride today. I can give it a shot. It might actually work because of a recent stream change.

Cheers
Philippe

In reply to this post by Michael Lucas-Smith-3
2010/5/19 Michael Lucas-Smith <[hidden email]>:
> I think there's a misunderstanding here somewhere. The generic String object
> (subclasses ByteString, TwoByteString, FourByteString) represents unicode
> characters. This is completely independent of any encoding at all. We have
> some specific ByteEncodedString subclasses for ISO8859L1 and MSCP1252 but
> they don't really enter the picture here.
>
> [...] I would really love it if the API had a
> contract like this:
>
> GRCodec>>encode: (String)
>     ^(ByteArray)
>
> GRCodec>>decode: (ByteArray)
>     ^(String)

I agree. I believe the first one is doable, I'll hack together a prototype today. The second is more tricky because the servers themselves (Comanche and Swazoo) already give us a String which is actually just a ByteArray.

Cheers
Philippe

On Thu, May 20, 2010 at 6:32 AM, Philippe Marschall <[hidden email]> wrote:
Maybe we could get the servers "fixed" (might imply a mode you can set or something). Obviously the server adaptors can simply deal with it as well, but there's going to be an inefficiency there.

Julian

In reply to this post by Philippe Marschall
2010/5/20 Philippe Marschall <[hidden email]>:
> 2010/5/19 Michael Lucas-Smith <[hidden email]>:
>> I would really love it if the API had a
>> contract like this:
>>
>> GRCodec>>encode: (String)
>>     ^(ByteArray)
>>
>> GRCodec>>decode: (ByteArray)
>>     ^(String)
>
> I agree. I believe the first one is doable, I'll hack together a
> prototype today. The second is more tricky because the servers
> themselves (Comanche and Swazoo) already give us a String which is
> actually just a ByteArray.

There's one trouble point:
WAUrlEncoder >> #nextPutAll:

The trouble is we first need to convert a URL to bytes and then interpret these bytes as Latin-1 and do percent encoding accordingly.

Cheers
Philippe

On 05/20/2010 11:12 AM, Philippe Marschall wrote:
> There's one trouble point:
> WAUrlEncoder>> #nextPutAll:
>
> The trouble is we first need to convert a URL to bytes and then
> interpret these bytes as Latin-1 and do percent encoding accordingly.

Really? Shouldn't the percent-encoded strings use whatever encoding the page uses? Surely, browsers use whatever encoding the page sent when creating their responses.

Alternatively, _do we need to use percent encoding at all_? Including HTML-encoded Unicode characters in <a> tags, like

<a href="http://www.google.com/search?q=°">test</a>

or

<a href="http://www.google.com/search?q=綰">test</a>

should just work.

Paolo

In reply to this post by Julian Fitzell-2
On 05/20/2010 10:23 AM, Julian Fitzell wrote:
>>> GRCodec>>encode: (String)
>>>     ^(ByteArray)
>>>
>>> GRCodec>>decode: (ByteArray)
>>>     ^(String)
>>
>> I agree. I believe the first one is doable, I'll hack together a
>> prototype today. The second is more tricky because the servers
>> themselves (Comanche and Swazoo) already give us a String which is
>> actually just a ByteArray.
>
> Maybe we could get the servers "fixed" (might imply a mode you can set
> or something). Obviously the server adaptors can simply deal with it as
> well, but there's going to be an inefficiency there.

Or just have decode: accept a ByteArray or a String. Just add good old #isByteArray to Grease. You can call it #isBytes if you want to be politically correct. :-)

Paolo

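A rough Python sketch of a decode that accepts either representation, as suggested above. The names is_bytes and decode are hypothetical stand-ins for #isBytes and #decode:, and the str branch assumes the "each character is really a byte" convention, so it fails loudly if a character above 255 sneaks in.

```python
# Hypothetical sketch; is_bytes and decode are not existing Grease selectors.

def is_bytes(obj) -> bool:
    """Analogue of the proposed #isBytes / #isByteArray test."""
    return isinstance(obj, (bytes, bytearray))


def decode(data, encoding: str = "utf-8") -> str:
    """Decode real bytes, or a string whose characters represent bytes."""
    if not is_bytes(data):
        # Assumption: every character stands for one byte (0..255).
        # str.encode("latin-1") raises UnicodeEncodeError otherwise.
        data = data.encode("latin-1")
    return bytes(data).decode(encoding)


from_bytes = decode(b"\xc4\x97")       # what a fixed server would hand over
from_byte_string = decode("\xc4\x97")  # what a String-based server hands over today
```

The extra branch is where the inefficiency Julian mentions would live: string input costs one additional copy before decoding.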
In reply to this post by Michael Lucas-Smith-3
On 05/19/2010 08:07 PM, Michael Lucas-Smith wrote:
> This is the reality we're in right now and instead of rocking the boat
> too much, I have an obligation to make Seaside work the least disruptive
> way possible. I'm all in favor of changing the contract, but only if
> everyone else is too. So for now, I accept your accusation that #encode:
> is returning a String seemingly incorrectly, but throw back that that's
> the expectation of the API.

And I can only agree with you, unfortunately.

> #testCodecUtf8Bom does the opposite, it wants to compare strings
> containing unicode characters and as such in VisualWorks we end up with
> a TwoByteString of which you cannot send #asByteArray. This is how I
> first noticed the problem.

Yep, it's contrary to GNU Smalltalk. In GNU Smalltalk I get an error on #asUnicodeString for the Latin-1 encoded string. In VisualWorks you get an error on #asByteArray for the UTF-8 decoded string. :-)

Paolo

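The VisualWorks half of that symmetry can be sketched in Python as well. The function as_byte_array below is a hypothetical analogue of sending #asByteArray to a String, not the VisualWorks implementation: it works only while every character value fits in a byte, and fails as soon as a decoded string contains a code point above 255.

```python
# Hypothetical analogue of String>>asByteArray; not VisualWorks code.

def as_byte_array(text: str) -> bytes:
    """Map each character to the byte with the same value.

    bytes() raises ValueError for any value outside 0..255.
    """
    return bytes(ord(ch) for ch in text)


def fits_in_bytes(text: str) -> bool:
    """True if every character value is at most 255."""
    return all(ord(ch) <= 255 for ch in text)


byte_like = "\u00e9"   # fits in one byte, so conversion happens to work
two_byte = "\u0117"    # code point 279, needs a wider string class

converted = as_byte_array(byte_like)
can_convert_two_byte = fits_in_bytes(two_byte)
```
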
In reply to this post by Paolo Bonzini-2
[oops we went off list]
On 05/20/2010 04:33 PM, Philippe Marschall wrote:
>>>> Alternatively, _do we need to use percent encoding at all_?
>>>
>>> Yes.
>>
>> So how does it work if I want to link www.google.com/search?q=ė from
>> Seaside? Currently it will print %17%01 which makes no sense at all.
>
> Might make no sense but is exactly what we have to do because a URL
> must be made out of ASCII characters only because HTTP headers allow
> only ASCII.

But unless it's for a redirect, you don't write HTTP headers, the browser does. So (again unless it's for a redirect) I don't see why Seaside should bother about URL encoding, after all HTML doesn't say a@href should be only ASCII.

>> I think that should be customizable just like in tomcat.
>
> It is statically, but not dynamically for each URL. The codec has a
> URL codec which might be different. This might bite you if you're
> creating links with non-ASCII characters to two different servers each
> expecting a different URL encoding.

Yeah, instead of having a URL codec in the code, we should have a codec and a URL codec _in the application configuration_. And the latter should only be used by #redirectTo: instead of using #seasideString (which uses a normal WAHtmlStreamDocument).

This is all stuff for 3.1 of course. But actually, I don't understand what the problem is with WAUrlEncoder. You call

super nextPutAll: (codec url encode: aString)

and WAEncoder calls #greaseInteger on each item of its argument. So it works fine with both "each character is really a byte" strings and ByteArrays.

Paolo

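For reference, here is how the percent-encoded form of "ė" (U+0117) depends on the bytes being escaped, sketched with Python's standard urllib.parse.quote. This is not Seaside code. One reading of the %17%01 output quoted above, offered as an assumption rather than a diagnosis, is that it matches the two bytes of U+0117 in little-endian UTF-16 order, whereas a UTF-8 URL codec would yield %C4%97.

```python
from urllib.parse import quote

query = "\u0117"  # the character from the example URL, code point U+0117

# Percent-encode the UTF-8 bytes of the character.
as_utf8 = quote(query, encoding="utf-8")

# Percent-encode the little-endian UTF-16 bytes (0x17, 0x01).
# Assumption: this only reproduces the %17%01 seen in the thread;
# it does not claim to be how Seaside arrived at it.
as_utf16le = quote(query.encode("utf-16-le"))

# The two-step route described earlier in the thread: encode to bytes,
# view those bytes as Latin-1 characters, then percent-encode each one.
byte_string = query.encode("utf-8").decode("latin-1")
via_latin1 = quote(byte_string, encoding="latin-1")
```

The last two lines show why a "string of byte-valued characters" and a real byte array are interchangeable for the percent encoder: both present the same sequence of integers.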
In reply to this post by Philippe Marschall
2010/5/20 Philippe Marschall <[hidden email]>:
> 2010/5/19 Paolo Bonzini <[hidden email]>:
>> So, I could fix it by adding a platform-specific hack to #asString:, but it
>> seems wrong. Can you check what breaks if you return a ByteArray from your
>> codec's #encode: method?
>
> I have a train ride today. I can give it a shot. It might actually
> work because of a recent stream change.

It does work [1]. We lose the ability to handle macroman and utf-16 but that could be added if needed. Everything else seems to be working just fine.

[1] http://www.squeaksource.com/Seaside31

Cheers
Philippe

> It does work [1]. We lose the ability to handle macroman and utf-16
> but that could be added if needed.

Is that a limitation of Squeak or what?

Paolo