All,
I'm currently working on a Seaside app which will go "live" next Monday. Currently I'm using Damien's 3.10-based sq3.10.2-7179web09.02.1 image.

Yesterday I had a horrifying experience when a Mac beta tester told me that the whole app got screwed up when he entered a euro sign (€) in the app. All following responses were total garbage. Other characters like German umlauts (or even the euro sign from a Windows client!) were fine, and I had tested them before, although they appear as (UTF-8?) gibberish in the image.

This triggered some memory and I changed from WAKom to WAKomEncoded, which apparently "solved" the problem. However this morning (after reading the class comments) I'm a bit confused about the whole issue and I have two questions:

1) What's the difference between WAKom, WAKomEncoded and WAKomEncoded39? I read the class comments but do not really understand the issue. In addition they are dealing with 3.8 vs. 3.9 and I'm on 3.10... If somebody advises me to deploy on 3.9 I'll do it - I just need a working configuration.

2) I assume that the browser side always uses UTF-8. But in which format does the image "see" the input?

CU,

Udo
_______________________________________________
seaside mailing list
[hidden email]
http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/seaside
> 1) What's the difference between WAKom, WAKomEncoded and WAKomEncoded39? I
> read the class comments but do not really understand the issue. In addition
> they are dealing with 3.8 vs. 3.9 and I'm on 3.10... If somebody advises me
> to deploy on 3.9 I'll do it - I just need a working configuration.

WAKom does no conversion at all. So if you assume that the browser side uses UTF-8 then you end up with UTF-8 strings inside your image. However since the internal encoding of Squeak is *not UTF-8* many strings will appear scrambled when looking at them using an inspector. It works well though as long as you do not perform heavy string scrambling, because the strings are sent back as is. If you have string literals with foreign characters in your application code you need to make sure that these are valid UTF-8 as well. This is very efficient, but you need to be aware of the implications.

WAKomEncoded converts incoming data from UTF-8 to the internal encoding of Squeak, and converts outgoing data from the internal encoding back to UTF-8. This way, all strings are valid from within the image, and common string operations like #=, #size and #copyFrom:to: work as you would expect. If you use an external database that expects UTF-8 you need to convert again. Since all incoming and outgoing data needs to be converted, this approach is slightly less efficient.

WAKomEncoded39 is for compatibility with strange versions of Kom and Squeak. You should not need to use it.

Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch
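As a rough illustration of the two strategies described above, here is a minimal sketch in Python (chosen only because it keeps bytes and characters as separate types; it is not Seaside or Squeak code). The function names `passthrough_in`, `decoding_in` and `encoding_out` are hypothetical stand-ins for what WAKom and WAKomEncoded do at the request/response boundary, and the sample input is made up.

```python
# Illustration only: hypothetical stand-ins, not the Seaside implementation.

def passthrough_in(raw: bytes) -> bytes:
    """WAKom-like: hand the request bytes to the application unchanged."""
    return raw

def decoding_in(raw: bytes) -> str:
    """WAKomEncoded-like: decode UTF-8 into real characters on the way in."""
    return raw.decode("utf-8")

def encoding_out(text: str) -> bytes:
    """WAKomEncoded-like: encode characters back to UTF-8 on the way out."""
    return text.encode("utf-8")

# What a UTF-8 browser would send for the text "5 <euro sign>".
request_body = "5 \u20ac".encode("utf-8")

as_bytes = passthrough_in(request_body)
as_text = decoding_in(request_body)

print(len(as_bytes))  # 5: one byte each for '5' and ' ', three for the euro sign
print(len(as_text))   # 3: three characters
print(encoding_out(as_text) == request_body)  # True: the round trip is lossless
```

With the pass-through variant the application only ever sees the five bytes, which is harmless as long as they are echoed back untouched; with the converting variant the application sees three characters and pays for one decode and one encode per request.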
Lukas,
thanks for your detailed explanation.

> WAKom does no conversion at all. So if you assume that the browser
> side uses UTF-8 then you end up with UTF-8 strings inside your image.
[...]
> need to make sure that these are valid UTF-8 as well. This is very
> efficient, but you need to be aware of the implications.

This was the effect I was seeing before. I didn't care about it because I thought that as long as the stuff that gets in gets out again unchanged, everything is fine. However I learned my lesson with Safari/Mac/Euro sign, which screwed up everything.

> WAKomEncoded converts incoming data from UTF-8 to the internal
> encoding of Squeak, and converts outgoing data from the
> internal encoding back to UTF-8. This way, all strings are valid from
> within the image, and common string operations like #=, #size and
> #copyFrom:to: work as you would expect. If you use an external
> database that expects UTF-8 you need to convert again.

The current DB backend is Magma - so I assume I can ignore UTF-8 conversion for the DB for quite some time :-)

> WAKomEncoded39 is for compatibility with strange versions of Kom and
> Squeak. You should not need to use it.

Good to hear. I'm currently running the tests against WAKomEncoded and everything looks great.

Thank you very much for your help.

CU,

Udo
In reply to this post by Lukas Renggli
Lukas wrote:
> However since the internal encoding of Squeak is *not UTF-8* many
> strings will appear scrambled when looking at them using an inspector.
> It works well though as long as you do not perform heavy string
> scrambling, because the strings are sent back as is. If you have
> string literals with foreign characters in your application code you
> need to make sure that these are valid UTF-8 as well. This is very
> efficient, but you need to be aware of the implications.

What happens if Squeak is made to use UTF-8 internally? I.e. the Unix man page and various postings on squeak-dev/newbies suggest that a recent Squeak VM/image combo started with '-encoding utf8' should work well as a UTF-8 image (provided the correct font is supplied, etc). In such a case, should plain WAKom be used? With no issues with regard to string operations like #=, #size and #copyFrom:to:? Or is there still a need to convert between the incoming UTF-8 and Squeak's WideString (and vice versa)?

> WAKomEncoded converts incoming data from UTF-8 to the internal
> encoding of Squeak, and converts outgoing data from the
> internal encoding back to UTF-8.

The code and comments in #utf8ToSqueak: suggest that this is only true if Squeak uses Latin-1 internally (which it does by default), right?

> Since all incoming and outgoing data needs to be converted,
> this approach is slightly less efficient.

Has anybody quantified the inefficiency? I'm starting a clean slate Seaside server, so I'd like to pick the optimal configuration...

Michal
2009/2/24 Michal <[hidden email]>:
> Lukas wrote:
>> However since the internal encoding of Squeak is *not UTF-8* many
>> strings will appear scrambled when looking at them using an inspector.
>> It works well though as long as you do not perform heavy string
>> scrambling, because the strings are sent back as is. If you have
>> string literals with foreign characters in your application code you
>> need to make sure that these are valid UTF-8 as well. This is very
>> efficient, but you need to be aware of the implications.
>
> What happens if Squeak is made to use UTF-8 internally?

String and Character lose all semantics. For example #size will answer the number of bytes, not the number of characters. #at: will answer the byte at the given index, not the Character at the given index. For example ä will be represented as (String with: (Character value: 195) with: (Character value: 164)), which displays as 'Ã¤'.

> I.e. the Unix
> man page and various postings on squeak-dev/newbies suggest that a
> recent Squeak VM/image combo started with '-encoding utf8' should work
> well as a UTF-8 image (provided the correct font is supplied, etc).

That's unrelated.

> In such a case, should plain WAKom be used?

If you're cool with the behavior described above, then use WAKom.

> With no issues with regard to
> string operations like #=, #size and #copyFrom:to:?

#= has limited usability due to missing Unicode normalization. It's actually a bit more useful because for WideStrings it would take the leadingChar into account, which is more or less random. #size and #copyFrom:to: answer "random" data unless you know the ins and outs of UTF-8 and Unicode.

> Or is there still
> a need to convert between the incoming UTF-8 and Squeak's WideString (and
> vice versa)?

Yes, UTF-8 conversion won't happen automatically. If you want it, you need to do it yourself.

>> WAKomEncoded converts incoming data from UTF-8 to the internal
>> encoding of Squeak, and converts outgoing data from the
>> internal encoding back to UTF-8.
>
> The code and comments in #utf8ToSqueak: suggest that this is only true
> if Squeak uses Latin-1 internally (which it does by default), right?

Nope, it's required for non-ASCII input.

>> Since all incoming and outgoing data needs to be converted,
>> this approach is slightly less efficient.
>
> Has anybody quantified the inefficiency?

Not that I'm aware of.

> I'm starting a clean slate
> Seaside server, so I'd like to pick the optimal configuration...

What do you want to optimize for?

Cheers
Philippe
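The points above about #size, #copyFrom:to: and #= can be reproduced outside Squeak. The following is a small sketch in Python, not Squeak code: `bytes` plays the role of a String holding undecoded UTF-8, and the sample word is made up for the illustration. The normalization part uses the standard library's `unicodedata.normalize`.

```python
import unicodedata

# "a with diaeresis" as UTF-8: two bytes, 195 and 164, as in the example above.
utf8_bytes = "\u00e4".encode("utf-8")
print(list(utf8_bytes))  # [195, 164]
print(len(utf8_bytes))   # 2: a byte count, not a character count

# Reading the same two bytes as Latin-1 gives the two-character mojibake
# (capital A with tilde, then the currency sign) seen in an inspector.
mojibake = utf8_bytes.decode("latin-1")
print(len(mojibake))     # 2

# Copying by index on undecoded UTF-8 can cut a character in half.
word = "K\u00e4se".encode("utf-8")   # 5 bytes for 4 characters
first_two = word[:2]                 # 'K' plus only the first byte of the umlaut
try:
    first_two.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print(split_ok)          # False: the slice is no longer valid UTF-8

# Equality without normalization: the same visible character can be spelled
# as one code point or as a base letter plus a combining mark.
composed = "\u00e4"
decomposed = "a\u0308"
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

The last two lines show why comparison stays limited even after decoding: without a normalization step, two strings that render identically can still compare as different.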
>> What happens if Squeak is made to use UTF-8 internally?

> String and Character lose all semantics.

That's disappointing! But thanks Philippe for the quick and helpful answer.

>> I'm starting a clean slate Seaside server, so I'd like to pick the
>> optimal configuration...

> What do you want to optimize for?

I was hoping for a clean UTF-8 image, and hence to be able to get rid of "historical cruft" (anything related to MacRoman and ISO-8859-1) and at the same time gain some speed (no conversion needed on input / output while preserving #findString:, #copyFrom:to: and friends).

Michal
2009/3/6 Michal <[hidden email]>:
>>> What happens if Squeak is made to use UTF-8 internally?
>
>> String and Character lose all semantics.
>
> That's disappointing! But thanks Philippe for the quick and helpful
> answer.
>
>>> I'm starting a clean slate Seaside server, so I'd like to pick the
>>> optimal configuration...
>
>> What do you want to optimize for?
>
> I was hoping for a clean UTF-8 image, and hence to be able to get rid
> of "historical cruft" (anything related to MacRoman and ISO-8859-1)
> and at the same time gain some speed (no conversion needed on input /
> output while preserving #findString:, #copyFrom:to: and friends).

In this case I would go for UTF-8 as an external encoding and use WAKomEncoded. That will give you at least better semantics. You'll likely lose some speed, but if you're lucky it won't be a bottleneck and you won't notice it. You'll gain other historical cruft (leadingChar). You might run into some WideString bugs. Some of them have been fixed in Squeak 3.10 and likely will be fixed in Pharo as well [1].

Should you choose to run Squeak 3.10, be aware that Seaside on Squeak 3.10 doesn't receive the same developer attention and testing as Seaside on Squeak 3.9 and Pharo, so there might be hidden Seaside bugs there.

Wow that was quite a reassuring post ;-)

[1] http://code.google.com/p/pharo/issues/detail?id=524

Cheers
Philippe