WAKom versus WAKomEncoded

johnmci
First, let me assume that WAKomEncoded is what I should be starting, rather than WAKom?

Us old Smalltalkers remember starting WAKom so in WikiServer startup that is what happens.

I *guess* it really should be WAKomEncoded?

So what's the fall out, I mean I can stuff UTF8 chars into PRPages...  Happy Happy.

Well not quite, I got a support email out of South Korea that the UTF8 character that was entered for the
Page title was being mangled. In fact if they use the *wrong* character the app would hang as it's loading
from binary storage to instantiate the PRPage.

In looking at this it turns out that because WAKom is used, the UTF8 data from the request is being passed
as a String into PRStructure (instance var name). Later lazy initialization is used to populate title

title
        "Answer the title of the receiver, essentially the name but starting uppercase."

        ^ title ifNil: [ title := self name capitalized]
       
Now here is the bad part: capitalized runs Character>>asUppercase, which is kind of Unicode-aware
and expects to deal with whole (wide) characters. But since the UTF8 character is stored as multiple bytes in a String,
it uppercases the first byte, thus destroying the meaning of the UTF8 sequence.
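
To make the failure concrete, here is a minimal sketch in Python (not Smalltalk; it only imitates the byte-level effect). It assumes the undecoded request bytes behave like a Latin-1 string with one character per byte, and the `capitalized` helper and the sample title are hypothetical stand-ins, not Pier code.

```python
def capitalized(text: str) -> str:
    """Uppercase only the first character, imitating Smalltalk's #capitalized."""
    return text[:1].upper() + text[1:]


title = "위키"                           # hypothetical Korean page title
raw = title.encode("utf-8")              # the bytes as sent by the browser
as_byte_string = raw.decode("latin-1")   # one character per byte, no decoding

# Capitalizing the byte string changes the first byte of the UTF-8 sequence.
mangled = capitalized(as_byte_string).encode("latin-1")
print(raw.hex(), mangled.hex())

try:
    mangled.decode("utf-8")
except UnicodeDecodeError as error:
    print("no longer valid UTF-8:", error.reason)
```

The first byte is a lowercase letter when read as Latin-1, so uppercasing it rewrites the lead byte and the rest of the sequence no longer matches it.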

However, if I now restart with WAKomEncoded, the squeak-to-utf8 conversion then messes up the UTF8 data that was
already stored in the binary data file.

So thoughts on how to fix things when I load the PRPages from storage, and on which fields would need fixing, are welcome.
--
===========================================================================
John M. McIntosh <[hidden email]>   Twitter:  squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================





_______________________________________________
Magritte, Pier and Related Tools ...
https://www.iam.unibe.ch/mailman/listinfo/smallwiki

Re: WAKom versus WAKomEncoded

Philippe Marschall
2010/2/4 John M McIntosh <[hidden email]>:
> First, let me assume that WAKomEncoded is what I should be starting, rather than WAKom?

We are talking about Seaside 2.8, right?
WAKom: takes the bytes (!) as sent from the client and creates a
ByteString from them without any decoding, which means a character
that is encoded in two bytes in UTF-8 will take up two Characters,
with their values being the values of the bytes.
WAKomEncoded: does UTF-8 de/encoding on the input, which will create
WideStrings for non-Latin-1 strings.
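
A rough illustration of that difference, written in Python rather than Smalltalk. It assumes that "no decoding" can be modeled as a Latin-1 decode (one character per byte); the variable names are made up for the example.

```python
request_bytes = "é".encode("utf-8")   # one character, two bytes on the wire

# WAKom-style: no decoding, so each byte becomes its own character.
undecoded = request_bytes.decode("latin-1")

# WAKomEncoded-style: the UTF-8 input is decoded into real characters.
decoded = request_bytes.decode("utf-8")

print(len(undecoded), [hex(ord(c)) for c in undecoded])
print(len(decoded), [hex(ord(c)) for c in decoded])
```

The undecoded string holds two characters whose values are the byte values, while the decoded string holds the single character the user typed.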

> Us old Smalltalkers remember starting WAKom so in WikiServer startup that is what happens.
>
> I *guess* it really should be WAKomEncoded?

Judging by your problems below: yes (assuming you're cool with WideStrings)

> So what's the fall out, I mean I can stuff UTF8 chars into PRPages...  Happy Happy.
>
> Well not quite, I got a support email out of South Korea that the UTF8 character that was entered for the
> Page title was being mangled. In fact if they use the *wrong* character the app would hang as it's loading
> from binary storage to instantiate the PRPage.
>
> In looking at this it turns out that because WAKom is used, the UTF8 data from the request is being passed
> as a String into PRStructure (instance var name). Later lazy initialization is used to populate title
>
> title
>        "Answer the title of the receiver, essentially the name but starting uppercase."
>
>        ^ title ifNil: [ title := self name capitalized]
>
> Now here is the bad part: capitalized runs Character>>asUppercase, which is kind of Unicode-aware
> and expects to deal with whole (wide) characters. But since the UTF8 character is stored as multiple bytes in a String,
> it uppercases the first byte, thus destroying the meaning of the UTF8 sequence.

Yeah, that's expected. ;-)

> However, if I now restart with WAKomEncoded, the squeak-to-utf8 conversion then messes up the UTF8 data that was
> already stored in the binary data file.
>
> So thoughts on how to fix things when I load the PRPages from storage, and what fields would need fixing are welcome

Assuming you don't already have corrupted data in your image and want
to do a migration:

Option 1:
Do a utf-8 decoding on the Strings in your model and use WAKomEncoded
from that point on.

Option 2:
Hack #title method (and the other senders of #capitalized) to first do
a utf-8 decoding, then #capitalized and then utf-8 encoding. Continue
using WAKom.
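
Option 2 could look roughly like this. It is a Python sketch of the decode, capitalize, encode round trip, not the actual Pier change; both function names are hypothetical, and the stored name is assumed to hold UTF-8 bytes as one character per byte.

```python
def capitalized(text: str) -> str:
    """Uppercase only the first character, imitating Smalltalk's #capitalized."""
    return text[:1].upper() + text[1:]


def capitalized_utf8_bytes(name: str) -> str:
    """Capitalize a name that holds UTF-8 bytes stored one character per byte.

    The result uses the same byte-per-character representation as the input.
    """
    decoded = name.encode("latin-1").decode("utf-8")   # utf-8 decoding
    return capitalized(decoded).encode("utf-8").decode("latin-1")


stored = "élan".encode("utf-8").decode("latin-1")      # as kept by a WAKom app
result = capitalized_utf8_bytes(stored)
print(result.encode("latin-1").decode("utf-8"))
```

Because the uppercase conversion now sees whole characters, scripts without case (such as Korean) pass through unchanged instead of having a byte rewritten.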

Now if you already have corrupted data in your image you'll have to
clean that up. That can be tricky:
- find the potential places (senders of #capitalized, can't think of
anything else right now)
- find the actual places, e.g. try to do a utf-8 decoding on each
candidate and see if you get an exception
- undo the "capitalization"; if you can't, replace the String with
"ERROR" or something
- choose one of the options above
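
The cleanup steps above can be sketched as follows, again in Python as an illustration only. It assumes each candidate is a byte-per-character string and that the only damage is an uppercased first byte; `is_valid_utf8` and `repair` are hypothetical helpers.

```python
def is_valid_utf8(stored: str) -> bool:
    """Return True if the byte-per-character string decodes as UTF-8."""
    try:
        stored.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return False
    return True


def repair(stored: str) -> str:
    """Undo a capitalized first byte if possible, else fall back to "ERROR"."""
    if is_valid_utf8(stored):
        return stored
    candidate = stored[:1].lower() + stored[1:]   # try to undo #capitalized
    if is_valid_utf8(candidate):
        return candidate
    return "ERROR"


good = "위키".encode("utf-8").decode("latin-1")
bad = good[:1].upper() + good[1:]                 # simulate the corruption
print(is_valid_utf8(good), is_valid_utf8(bad), repair(bad) == good)
```

A string that decodes cleanly is left alone, so running the repair over every candidate should be safe for data that was never touched.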

Sorry for the inconvenience
Philippe

Re: WAKom versus WAKomEncoded

johnmci
Ok, for WikiServer 1.5.1, which I pushed to the App Store review team last night, I'm doing
(2). I actually run   self name squeakToUTF8 capitalized utf8ToSqueak

This converts the String to a UTF8-decoded WideString, capitalizes it, then converts it back to a String.

Fortunately there is only one use of capitalized in Pier, so the fix was easy.

As for Option 1, I guess I'll have to look at all the Pier objects and see where the data comes from, then
build a post-loader that, if the database is not marked UTF8-clean, runs the UTF8 converter on the targeted
fields of all the instances. Once the Pier data model is saved I can mark it clean.
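
One possible shape for such a post-loader, sketched in Python. The `Page` and `Store` classes, the field names, and the `utf8_clean` flag are all hypothetical stand-ins for the Pier model, which would need its own list of affected fields.

```python
from dataclasses import dataclass


@dataclass
class Page:                 # hypothetical stand-in for a PRPage
    name: str
    title: str


@dataclass
class Store:                # hypothetical stand-in for the saved data model
    pages: list
    utf8_clean: bool = False


def decode_field(value: str) -> str:
    """Decode a byte-per-character string as UTF-8; leave it alone on failure."""
    try:
        return value.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return value


def migrate(store: Store) -> None:
    """Run the converter once, then mark the store clean."""
    if store.utf8_clean:
        return
    for page in store.pages:
        page.name = decode_field(page.name)
        page.title = decode_field(page.title)
    store.utf8_clean = True


legacy = "위키".encode("utf-8").decode("latin-1")
store = Store(pages=[Page(name=legacy, title=legacy)])
migrate(store)
print(store.pages[0].name, store.utf8_clean)
```

The flag keeps the conversion from running twice, which matters because decoding already-decoded text would be wrong or fail.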


--
===========================================================================
John M. McIntosh <[hidden email]>   Twitter:  squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================





Re: WAKom versus WAKomEncoded

Philippe Marschall
2010/2/4 Philippe Marschall <[hidden email]>:

> Assuming you don't already have corrupted data in your image and want
> to do a migration:
>
> Option 1:
> Do a utf-8 decoding on the Strings in your model and use WAKomEncoded
> from that point on.
>
> Option 2:
> Hack #title method (and the other senders of #capitalized) to first do
> a utf-8 decoding, then #capitalized and then utf-8 encoding. Continue
> using WAKom.

On second thought:
Option 3:
Don't do the #capitalized at all and use CSS:
text-transform: capitalize

Cheers
Philippe
