First, let me assume that WAKomEncoded is what I should be starting, versus WAKom?

Us old Smalltalkers remember starting WAKom, so in WikiServer startup that is what happens. I *guess* it really should be WAKomEncoded?

So what's the fallout? I mean, I can stuff UTF8 chars into PRPages... Happy Happy.

Well, not quite. I got a support email out of South Korea that the UTF8 character that was entered for the Page title was being mangled. In fact, if they use the *wrong* character the app would hang as it's loading from binary storage to instantiate the PRPage.

In looking at this it turns out that because WAKom is used, the UTF8 data from the request is being passed as a String into PRStructure (instance var name). Later, lazy initialization is used to populate title:

    title
        "Answer the title of the receiver, essentially the name but starting uppercase."

        ^ title ifNil: [ title := self name capitalized ]

Now here is the bad part: capitalized runs Character>>asUppercase, which actually is kinda unicode aware, so it's attempting only to deal with wide characters. But since the UTF8 character is multiple bytes in a String, it mangles the first byte to uppercase, thus destroying the meaning of the UTF8 sequence.

However, if I now restart with WAKomEncoded, the squeak to utf8 process then messes up the UTF8 data that was stored in the binary data file.

So thoughts on how to fix things when I load the PRPages from storage, and what fields would need fixing, are welcome.

--
===========================================================================
John M. McIntosh <[hidden email]>   Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================

_______________________________________________
Magritte, Pier and Related Tools ...
https://www.iam.unibe.ch/mailman/listinfo/smallwiki
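The mangling described in the post above can be reproduced outside Squeak. What follows is a minimal Python sketch, not Seaside or Pier code. `as_byte_string` is a hypothetical stand-in for what WAKom is described as doing (one character per request byte, no decoding), and `capitalize_first` is a hypothetical stand-in for #capitalized. It assumes the first byte is uppercased by Latin-1 rules, which is how Python's `upper()` behaves; Squeak's Character>>asUppercase may differ in detail.

```python
def as_byte_string(raw: bytes) -> str:
    # Hypothetical model of WAKom: each request byte becomes one character.
    return raw.decode("latin-1")


def capitalize_first(text: str) -> str:
    # Stand-in for #capitalized: uppercase the first character only.
    return text[:1].upper() + text[1:]


def is_valid_utf8(raw: bytes) -> bool:
    # True when the bytes decode cleanly as UTF-8.
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


title = "한글"                    # a page title as typed by the user
raw = title.encode("utf-8")       # six bytes on the wire
name = as_byte_string(raw)        # six one-byte characters, no decoding
mangled = capitalize_first(name).encode("latin-1")

print(hex(raw[0]), "->", hex(mangled[0]))
print(is_valid_utf8(raw), is_valid_utf8(mangled))
```

In this model only the lead byte of the first character changes, yet that is enough to make the whole title undecodable. Capitalizing the properly decoded title leaves it untouched, because Hangul has no case.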
2010/2/4 John M McIntosh <[hidden email]>:
> First let me assume that WAKomEncoded is what I should be starting, versus WAKom ?

We are talking about Seaside 2.8, right?

WAKom: takes the bytes (!) as sent from the client and creates a ByteString from them without any decoding, which means a character that is encoded in two bytes in UTF-8 will take up two Characters, with their values being the values of the bytes.

WAKomEncoded: does UTF-8 de/encoding on the input, which will create WideStrings for non-Latin-1 strings.

> Us old Smalltalkers remember starting WAKom so in WikiServer startup that is what happens.
>
> I *guess* it really should be WAKomEncoded?

Judging by your problems below: yes (assuming you're cool with WideStrings).

> So what's the fall out, I mean I can stuff UTF8 chars into PRPages... Happy Happy.
>
> Well not quite, I got a support email out of South Korea that the UTF8 character that was entered for the
> Page title was being mangled. In fact if they use the *wrong* character the app would hang as it's loading
> from binary storage to instantiate the PRPage.
>
> In looking at this it turns out that because WAKom is used, the UTF8 data from the request is being passed
> as a String into PRStructure (instance var name). Later lazy initialization is used to populate title
>
> title
>     "Answer the title of the receiver, essentially the name but starting uppercase."
>
>     ^ title ifNil: [ title := self name capitalized]
>
> Now here is the bad part, the capitalized runs Character>>asUppercase which actually is kinda unicode aware
> so it's attempting only to deal with wide characters but since the UTF8 character is multiple bytes in a String then it mangles
> the first byte to uppercase thus destroying the meaning of the UTF8 sequence.

Yeah, that's expected. ;-)

> However now if I restart with WAKomEncoded the squeak to utf8 process then messes the UTF8 data that was
> stored in the binary data file.
>
> So thoughts on how to fix things when I load the PRPages from storage, and what fields would need fixing are welcome

Assuming you don't already have corrupted data in your image and want to do a migration:

Option 1:
Do a utf-8 decoding on the Strings in your model and use WAKomEncoded from that point on.

Option 2:
Hack the #title method (and the other senders of #capitalized) to first do a utf-8 decoding, then #capitalized and then utf-8 encoding. Continue using WAKom.

Now if you already have corrupted data in your image you'll have to clean that up. That can be tricky:
- find the potential places (senders of #capitalized, can't think of anything else right now)
- find the actual places, e.g. try to do a utf-8 decoding on each candidate and see if you get an exception
- undo the "capitalization"; if you can't, replace the String with "ERROR" or something
- choose one of the options above

Sorry for the inconvenience
Philippe
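The cleanup steps listed in the reply above (try a UTF-8 decoding, undo the capitalization on failure, fall back to "ERROR") can be sketched as follows. This is an illustration in Python under stated assumptions, not Pier code: stored names are modeled as strings holding one character per byte, and `decode_stored` and `repair` are hypothetical helper names. It also assumes the damage was a Latin-1 style uppercase of the first byte, so lowercasing that byte reverses it.

```python
def decode_stored(name: str) -> str:
    # Option 1 in miniature: reinterpret one-character-per-byte data as UTF-8.
    return name.encode("latin-1").decode("utf-8")


def repair(name: str) -> str:
    # Step 1: data that was never capitalized decodes directly.
    try:
        return decode_stored(name)
    except UnicodeDecodeError:
        pass
    # Step 2: assume the first byte was uppercased and try to undo that.
    candidate = name[:1].lower() + name[1:]
    try:
        return decode_stored(candidate)
    except UnicodeDecodeError:
        # Step 3: nothing recoverable, mark the value as suggested.
        return "ERROR"


clean = "한글".encode("utf-8").decode("latin-1")
corrupted = clean[:1].upper() + clean[1:]

print(repair(clean) == "한글")
print(repair(corrupted) == "한글")
print(repair("\x95\x9c"))
```

The decode attempt doubles as the detector: a candidate that raises an exception is one of the "actual places", and only those are touched by the undo step.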
Ok, for WikiServer 1.5.1, which I pushed to the app store review team last night, I'm doing Option 2. I actually run

    self name squeakToUTF8 capitalized utf8ToSqueak

This converts the String to a utf8 widestring, then capitalizes it, then converts it back to String. Fortunately there is only one usage of capitalized in Pier, so the fix was easy.

As for Option 1, I guess I'll have to look at all the Pier objects and see where the data comes from, then build a post loader that, if it decides the database is not UTF8 clean, runs the UTF8 converter on all the instance fields that are being targeted. Once the Pier data model is saved I can mark it clean.

On 2010-02-04, at 1:03 PM, Philippe Marschall wrote:

> Assuming you don't already have corrupted data in your image and want
> to do a migration:
>
> Option 1:
> Do a utf-8 decoding on the Strings in your model and use WAKomEncoded
> from that point on.
>
> Option 2:
> Hack #title method (and the other senders of #capitalized) to first do
> a utf-8 decoding, then #capitalized and then utf-8 encoding. Continue
> using WAKom.
>
> Now if you already have corrupted data in your image you'll have to
> clean that up. That can be tricky:
> - find the potential places (senders of #capitalized, can't think of
> anything else right now)
> - find the actual places, eg. try to do a utf-8 decoding on each
> candidate and see if you get an exception
> - undo the "capitalization", if you can't replace the String with
> "ERROR" or something
> - chose one of the options above
>
> Sorry for the inconvenience
> Philippe

--
John M. McIntosh <[hidden email]>
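The post loader idea from the message above can be sketched like this. It is a Python illustration under assumptions, not the WikiServer implementation: `Page`, `Store`, `utf8_clean` and `post_load` are hypothetical names, the targeted fields are taken to be `name` and `title` only, and the stored strings are modeled as one character per byte.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    name: str
    title: Optional[str] = None   # lazily populated, may still be unset


@dataclass
class Store:
    pages: List[Page]
    utf8_clean: bool = False      # hypothetical "already converted" marker


def post_load(store: Store, fields=("name", "title")) -> int:
    # A store already marked clean is left alone.
    if store.utf8_clean:
        return 0
    converted = 0
    for page in store.pages:
        for field_name in fields:
            value = getattr(page, field_name)
            if value is None:
                continue
            # Reinterpret the one-character-per-byte string as UTF-8 text.
            setattr(page, field_name, value.encode("latin-1").decode("utf-8"))
            converted += 1
    # The post marks the model clean once it has been saved; this sketch
    # sets the flag right after converting and leaves saving to the caller.
    store.utf8_clean = True
    return converted


store = Store(pages=[Page(name="한글".encode("utf-8").decode("latin-1"))])
print(post_load(store), store.pages[0].name == "한글")
print(post_load(store))
```

The flag is what makes the loader safe to run on every startup: without it, a second pass would try to convert text that has already been decoded.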
2010/2/4 Philippe Marschall <[hidden email]>:
>> So thoughts on how to fix things when I load the PRPages from storage, and what fields would need fixing are welcome
>
> Assuming you don't already have corrupted data in your image and want
> to do a migration:
>
> Option 1:
> Do a utf-8 decoding on the Strings in your model and use WAKomEncoded
> from that point on.
>
> Option 2:
> Hack #title method (and the other senders of #capitalized) to first do
> a utf-8 decoding, then #capitalized and then utf-8 encoding. Continue
> using WAKom.

On second thought:

Option 3:
Don't do the #capitalized at all and use CSS:

    text-transform: capitalize

Cheers
Philippe