
XML Parser, Monticello and unicode?

NorbertHartl

XML Parser, Monticello and unicode?

I'm trying to port the newest XML Parser from SqueakSource to GemStone. XML-Parser-AlexandreBergel.73 introduces a Unicode test that contains a longer Unicode XML snippet. From that version on I cannot load or merge anything.

Apart from the XML snippet looking very strange, it loads in Pharo but not in GemStone. Unpacking the .mcz on the console and examining its contents showed that the encoding is indeed odd: I don't know what it is, but it is neither ASCII nor UTF-8. Pharo loaded the snippet into a WideString instance.

How does Monticello handle WideString instances when they are written to a file?

Norbert
_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

Henrik Sperre Johansen

Re: XML Parser, Monticello and unicode?

On Aug 5, 2010, at 5:37:34 PM, Norbert Hartl wrote:

> I'm trying to port the newest XML Parser from squeaksource to gemstone. In XML-Parser-AlexandreBergel.73 there is a unicode test introduced with a longer unicode xml snippet. From this release on I cannot load or merge anything.
>
> Besides that the xml snippet looks very strange it loads in pharo but not in gemstone. Unpacking the mcz on the console and examine the content showed that the encoding is indeed weird. I don't know what it is but it is neither ascii nor utf-8. Pharo loaded the snippet into a WideString instance.
>
> How does monticello handle WideString instances when written to a file?

Rather randomly. ;)

If you merely need to export a package so you can import in gemstone, you could change:

MCMczWriter >> addString: internalString at: path
    "Encode the source as UTF-8 before adding it to the .mcz archive."
    | member utfConverter utfStringStream |
    utfConverter := TextConverter newForEncoding: 'utf8'. "(Or whatever other format Gemstone thinks .mcz definitions will be)"
    utfStringStream := RWBinaryOrTextStream on: String new.
    "Switch to binary mode just long enough to write the byte order mark."
    utfStringStream binary.
    utfConverter class writeBOMOn: utfStringStream.
    utfStringStream ascii.
    utfConverter nextPutAll: internalString toStream: utfStringStream.
    member := zip addString: utfStringStream contents asString as: path.
    member desiredCompressionMethod: ZipArchive compressionDeflated

(Alternatively use String new writeStream if you don't need/want to write BOM).

A change like this is unlikely to go into the base image without further investigation, as it would probably break reading new packages (saved in proper UTF-8) that contain WideStrings into old images.
I haven't read the import code, but if old images prefer the binary format when it is available, saving the source in UTF-8 might be a reasonable compromise, provided you also include the binary file.
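
The idea behind the workaround, encoding the source text explicitly as UTF-8 (optionally with a BOM) before it goes into the zip archive, can also be sketched outside the image. This is only a Python illustration of the principle, not Monticello code; the function name add_utf8_member is made up for the example, and the member path is the one used later in this thread.

```python
import codecs
import io
import zipfile


def add_utf8_member(archive, path, text, write_bom=True):
    """Encode text as UTF-8 (optionally BOM-prefixed) and add it deflated."""
    data = text.encode("utf-8")
    if write_bom:
        data = codecs.BOM_UTF8 + data
    archive.writestr(path, data, compress_type=zipfile.ZIP_DEFLATED)


# Build a small archive in memory with one non-ASCII source string.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    add_utf8_member(archive, "snapshot/source.st", "^ 'ö \u20ac'")

# Read the raw bytes back, as a different system would.
with zipfile.ZipFile(buffer) as archive:
    raw = archive.read("snapshot/source.st")

# "utf-8-sig" strips the BOM on decoding if one is present.
round_tripped = raw.decode("utf-8-sig")
```

A reader that knows the member is UTF-8 gets the original text back, including characters above 255; a reader that assumes another encoding does not, which is the compatibility concern raised above.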

Cheers,
Henry

PS. Another fun fact I encountered when porting Assets:
Monticello uses MethodReference>>source, which kindly converts every LF and CRLF in your source, including those inside string literals, to CR.
So you can forget about, for example, saving arbitrary ByteArrays as strings in your code and expecting them to be unchanged when you convert them back to ByteArrays after saving to Monticello :)
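
Why that is lossy for binary data can be shown with a few bytes. The sketch below is in Python and only imitates the described effect (LF and CRLF both become CR); it is not the code of MethodReference>>source, and normalize_to_cr is a hypothetical helper.

```python
def normalize_to_cr(data: bytes) -> bytes:
    """Replace CRLF and LF with CR, as a text-oriented line-ending pass would."""
    return data.replace(b"\r\n", b"\r").replace(b"\n", b"\r")


# Arbitrary binary payload that happens to contain line-ending bytes.
payload = bytes([0x01, 0x0A, 0x0D, 0x0A, 0x02])
stored = normalize_to_cr(payload)

print(list(payload))  # [1, 10, 13, 10, 2]
print(list(stored))   # [1, 13, 13, 2]
```

The stored bytes differ from the payload in both content and length, and nothing in them says which CR was originally an LF or a CRLF, so the original cannot be recovered.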

jaayer

Re: XML Parser, Monticello and unicode?

In reply to this post by NorbertHartl

---- On Thu, 05 Aug 2010 08:37:34 -0700 Norbert Hartl wrote ----

>I'm trying to port the newest XML Parser from squeaksource to gemstone. In XML-Parser-AlexandreBergel.73 there is a unicode test introduced with a longer unicode xml snippet. From this release on I cannot load or merge anything.
>
>Besides that the xml snippet looks very strange it loads in pharo but not in gemstone. Unpacking the mcz on the console and examine the content showed that the encoding is indeed weird. I don't know what it is but it is neither ascii nor utf-8. Pharo loaded the snippet into a WideString instance.
>
>How does monticello handle WideString instances when written to a file?
>
>Norbert

It appears those were added to ensure the parser didn't choke on non-UTF-8 input. However, it probably should choke on such input (except in external unparsed entities), and regardless, if such a test is to be added, the non-UTF8 characters should not appear in the literal source code of a method; this was the source of subtle image and package character encoding issues like the one you encountered. I have removed it.

Stéphane Ducasse

Re: XML Parser, Monticello and unicode?

In reply to this post by Henrik Sperre Johansen

Henrik

can you open a bug entry for each of your nice gems? We should document them and make sure that we fix them.

Stef

NorbertHartl

Re: XML Parser, Monticello and unicode?

In reply to this post by Henrik Sperre Johansen

On 06.08.2010, at 14:52, Henrik Johansen wrote:

Has this been reported before? If not, why not? This is really important. I don't think we can wait until Monticello is replaced by something else that fixes this :)

A few things here work together in 98% of all cases. I haven't fully understood what is going on, but

ZipArchiveMember>>contentStream does
    ...
    s := MultiByteBinaryOrTextStream on: (String new: self uncompressedSize).
    s converter: Latin1TextConverter new.
    ...

and

MultiByteBinaryOrTextStream>>defaultConverter
    ^ Latin1TextConverter new.

Both are used when a Monticello package is read, so there is an assumption about the encoding here. Something else in the system does something similar. I don't know InputEvents or how to debug them, but if I create a method

EncTest>>encTest
^ 'ö'

I can see that

((EncTest>>#encTest literalAt: 1) at: 1) asciiValue

is 246, which is the Latin-1 code for 'ö'.

So there is a conversion to Latin-1 (I think at the moment I press the key). I didn't find any conversion in the code that writes a Monticello package, which may be why the files end up as Latin-1 on disk and can be read back with an explicit conversion from Latin-1.
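
The numbers can be checked outside the image. The following Python sketch only illustrates the encodings involved and says nothing about Monticello's actual code path: it shows that 'ö' has the value 246, that it is a single byte in Latin-1 but two bytes in UTF-8, and what happens when UTF-8 bytes are read with a Latin-1 converter.

```python
char = "ö"

code_point = ord(char)                 # 246, the same number as in Latin-1
latin1_bytes = char.encode("latin-1")  # b'\xf6', one byte
utf8_bytes = char.encode("utf-8")      # b'\xc3\xb6', two bytes

# Latin-1 on disk, read back as Latin-1: the round trip works.
read_as_latin1 = latin1_bytes.decode("latin-1")

# UTF-8 on disk, read back as Latin-1: each byte becomes its own character.
misread = utf8_bytes.decode("latin-1")  # 'Ã¶'
```

So a reader that always assumes Latin-1 works as long as the writer also produced Latin-1, and silently garbles anything written as UTF-8.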

But this does not explain how it works with WideString. I would need to dig deeper, but maybe one of you has an idea.

I think we should fix this, so I tried to estimate how feasible a change is. I scanned all of my cached Monticello packages. Most of them are 7-bit clean, so changing the encoding is no problem for them. Besides XML Parser I didn't find any that contain WideStrings, so no problem there either. Some of them are Latin-1 encoded (like Seaside 2.8 or Seaside-InternetExplorer from 3.0). That is the biggest problem, because there is no fallback and Monticello does not have a version number in its file format, right?
I think it is still feasible to change this in Monticello, as the fix for users of older images will probably be only a few lines that can be applied to any version of Monticello, if I'm not wrong. But the change is not that easy.
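
One way a reader could cope with files that carry no format version is a decode-with-fallback heuristic. The Python sketch below is an assumption about what such a fix could look like, not something Monticello does; decode_source is a hypothetical name.

```python
def decode_source(data: bytes) -> str:
    """Decode as UTF-8 if the bytes are valid UTF-8, otherwise as Latin-1."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails.
        return data.decode("latin-1")


new_style = "Grüße".encode("utf-8")    # written by a UTF-8 aware writer
old_style = "Grüße".encode("latin-1")  # written by an old Latin-1 writer

print(decode_source(new_style))  # Grüße
print(decode_source(old_style))  # Grüße
print(decode_source(b"plain"))   # plain
```

7-bit clean files decode identically either way. The heuristic is not airtight: Latin-1 text whose high bytes happen to form a valid UTF-8 sequence would be misread, which is why an explicit marker such as a BOM or a version number would be more robust.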

Norbert

Stéphane Ducasse

Re: XML Parser, Monticello and unicode?

> Has this been reported before? If not why? This is really important. I don't think we can wait until Monticello is replaced by something different that will fix this :)

the answer is:
- check bug entries
- add new one if necessary
- propose a fix if possible
- else wait.

>
> There are a few things here that work together in 98% of all cases. I didn't get it fully what is going on but
>
> ZipArchiveMember>>contentStream does
> ...
> s := MultiByteBinaryOrTextStream on: (String new: self uncompressedSize).
> s converter: Latin1TextConverter new.
> ...
>
> and
>
> MultiByteBinaryOrTextStream>>defaultConverter
> ^ Latin1TextConverter new.
>
> These two are being used when a monticello package is being read. So we have an assumption about an encoding here. On the other hand something in the system does something similar. I don't know InputEvents and how to debug them but if I create a method
>
> EncTest>>encTest
> ^ 'ö'
>
> I can see that
>
> ((EncTest>>#encTest literalAt: 1) at: 1) asciiValue
>
> is 246 which is something that matches latin1 to some extent.
>
> This way there is a conversion (I think at the time I press on my keyboard) to latin1. While writing a monticello package I didn't find any conversion so this might be the reason that the files become latin1 on disk and can be read back using an explicit conversion from latin1.
>
> But this does not explain how it does work with WideString. I would need to dig deeper but maybe someone of you have an idea.
>
> To estimate the possibility to change this I think we should fix this. I scanned all of my cached monticello packages. Most of them are 7bit clean.

how do you do that?

> No problem for them if we change encoding. Besides XML Parser I didn't find any that contain WideString so no problem here. Some of them are latin1 encoded (like Seaside 2.8 or Seaside-InternetExplorer from 3.0). That is the biggest problem because there is no fallback and monticello does not have a version number on file format, right?
> I think it is still feasible to change this in monticello as the fix for users of older images will be probably only a few lines that you can apply to any version of monticello if I'm not wrong. But the change is not that easy.

It is not clear to me what the problem and the potential solution are. I have not been focused enough on Pharo :(

NorbertHartl

Re: XML Parser, Monticello and unicode?

On 07.08.2010, at 13:19, Stéphane Ducasse wrote:

>> Has this been reported before? If not why? This is really important. I don't think we can wait until Monticello is replaced by something different that will fix this :)
>
>
> the anwser is:
> - check bug entries
> - add new one if necessary
> - propose a fix if possible
> - else wait.
>

I was replying to Henrik's mail. It was meant as a question to him because he knows the topic. I see I need to be clearer next time. And by the way, this was _not_ an answer to my question.

>>
>> There are a few things here that work together in 98% of all cases. I didn't get it fully what is going on but
>>
>> ZipArchiveMember>>contentStream does
>> ...
>> s := MultiByteBinaryOrTextStream on: (String new: self uncompressedSize).
>> s converter: Latin1TextConverter new.
>> ...
>>
>> and
>>
>> MultiByteBinaryOrTextStream>>defaultConverter
>> ^ Latin1TextConverter new.
>>
>> These two are being used when a monticello package is being read. So we have an assumption about an encoding here. On the other hand something in the system does something similar. I don't know InputEvents and how to debug them but if I create a method
>>
>> EncTest>>encTest
>> ^ 'ö'
>>
>> I can see that
>>
>> ((EncTest>>#encTest literalAt: 1) at: 1) asciiValue
>>
>> is 246 which is something that matches latin1 to some extent.
>>
>> This way there is a conversion (I think at the time I press on my keyboard) to latin1. While writing a monticello package I didn't find any conversion so this might be the reason that the files become latin1 on disk and can be read back using an explicit conversion from latin1.
>>
>> But this does not explain how it does work with WideString. I would need to dig deeper but maybe someone of you have an idea.
>>
>> To estimate the possibility to change this I think we should fix this. I scanned all of my cached monticello packages. Most of them are 7bit clean.
>
> how do you do that?

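# Report the detected encoding of the source in every .mcz below the current directory.
find . -name "*.mcz" | while IFS= read -r i; do
    echo "$i"
    unzip -qc "$i" snapshot/source.st | enca -L none
done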
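
For systems without enca, a rough equivalent of the scan can be sketched in Python. It only distinguishes three cases (7-bit clean, valid UTF-8, other 8-bit data) and does not try to guess which 8-bit encoding was used; the function names and labels are made up for this example.

```python
import zipfile
from pathlib import Path


def classify(data: bytes) -> str:
    """Return a rough encoding label for the raw bytes of a source file."""
    if all(byte < 0x80 for byte in data):
        return "7bit"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "8bit, not utf-8"


def scan(root: Path) -> dict:
    """Map each .mcz file below root to the label of its snapshot/source.st."""
    results = {}
    for mcz in sorted(root.rglob("*.mcz")):
        with zipfile.ZipFile(mcz) as archive:
            try:
                data = archive.read("snapshot/source.st")
            except KeyError:
                # The archive has no member with that name.
                results[mcz] = "no snapshot/source.st"
                continue
        results[mcz] = classify(data)
    return results
```

Calling scan(Path(".")) in a package-cache directory returns one label per package; anything labeled "8bit, not utf-8" is a candidate for Latin-1 or CP1252 content.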

>
>> No problem for them if we change encoding. Besides XML Parser I didn't find any that contain WideString so no problem here. Some of them are latin1 encoded (like Seaside 2.8 or Seaside-InternetExplorer from 3.0). That is the biggest problem because there is no fallback and monticello does not have a version number on file format, right?
>> I think it is still feasible to change this in monticello as the fix for users of older images will be probably only a few lines that you can apply to any version of monticello if I'm not wrong. But the change is not that easy.
>
>
> This is not clear to me what is the problem and potential solution. I was not concentrated enough on pharo :(

It is not clear to me either. That's why I am asking. I'm willing to track this down further. At the very least I want to open a substantial ticket, but a little more information would be helpful.

Norbert

Philippe Marschall-2-3

Re: XML Parser, Monticello and unicode?

In reply to this post by NorbertHartl

On 07.08.2010 13:08, Norbert Hartl wrote:
> ....
> To estimate the possibility to change this I think we should fix this. I scanned all of my cached monticello packages. Most of them are 7bit clean. No problem for them if we change encoding. Besides XML Parser I didn't find any that contain WideString so no problem here. Some of them are latin1 encoded (like Seaside 2.8 or Seaside-InternetExplorer from 3.0). That is the biggest problem because there is no fallback and monticello does not have a version number on file format, right?
> I think it is still feasible to change this in monticello as the fix for users of older images will be probably only a few lines that you can apply to any version of monticello if I'm not wrong. But the change is not that easy.

We try to be 7bit clean in Seaside. Can you report the methods you have
trouble with to either the seaside mailing list or the issue tracker [1]?

[1] http://code.google.com/p/seaside/issues/list

Cheers
Philippe

NorbertHartl

Re: XML Parser, Monticello and unicode?

On 07.08.2010, at 20:58, Philippe Marschall wrote:

> On 07.08.2010 13:08, Norbert Hartl wrote:
>> ....
>> To estimate the possibility to change this I think we should fix this. I scanned all of my cached monticello packages. Most of them are 7bit clean. No problem for them if we change encoding. Besides XML Parser I didn't find any that contain WideString so no problem here. Some of them are latin1 encoded (like Seaside 2.8 or Seaside-InternetExplorer from 3.0). That is the biggest problem because there is no fallback and monticello does not have a version number on file format, right?
>> I think it is still feasible to change this in monticello as the fix for users of older images will be probably only a few lines that you can apply to any version of monticello if I'm not wrong. But the change is not that easy.
>
> We try to be 7bit clean in Seaside. Can you report the methods you have
> trouble with to either the seaside mailing list or the issue tracker [1]?
>
I don't have trouble with any of the Seaside methods. It just seems impossible for me to state things clearly enough :) There was a real problem: WideStrings were creeping into the Monticello package of XML Parser. There is no way Monticello can handle multi-byte strings, as it uses Latin-1 for encoding. GemStone cannot read the Latin-1 encoded WideStrings. But the troublesome piece of code has already been removed from the XML Parser package.

Seaside is not completely 7-bit clean. As an example, take Seaside-InternetExplorer-lr.6.mcz. It contains the sentence "...we have introduced a mechanism to help prevent the untrusted content from compromising your site<92>s security...". The <92> is a RIGHT SINGLE QUOTATION MARK, which Microsoft put into the Latin-1 gap (the 0x80-0x9F range). So I guess this is CP1252, which means the text was copied from a Windows system into the Squeak image.
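
The two readings of byte 0x92 can be checked with Python's standard codecs (an illustration only, unrelated to how the image handles the byte):

```python
import unicodedata

byte = bytes([0x92])

as_cp1252 = byte.decode("cp1252")   # U+2019
as_latin1 = byte.decode("latin-1")  # U+0092, a C1 control character

print(unicodedata.name(as_cp1252))      # RIGHT SINGLE QUOTATION MARK
print(unicodedata.category(as_latin1))  # Cc
```

Read as CP1252 the byte is the typographic apostrophe; read as Latin-1 it is an unprintable control code, which is why it shows up as <92> in a console.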

Norbert

Philippe Marschall-2

Re: XML Parser, Monticello and unicode?

On 08/08/2010 12:14 AM, Norbert Hartl wrote:

>
> On 07.08.2010, at 20:58, Philippe Marschall wrote:
>
>> On 07.08.2010 13:08, Norbert Hartl wrote:
>>> ....
>>> To estimate the possibility to change this I think we should fix this. I scanned all of my cached monticello packages. Most of them are 7bit clean. No problem for them if we change encoding. Besides XML Parser I didn't find any that contain WideString so no problem here. Some of them are latin1 encoded (like Seaside 2.8 or Seaside-InternetExplorer from 3.0). That is the biggest problem because there is no fallback and monticello does not have a version number on file format, right?
>>> I think it is still feasible to change this in monticello as the fix for users of older images will be probably only a few lines that you can apply to any version of monticello if I'm not wrong. But the change is not that easy.
>>
>> We try to be 7bit clean in Seaside. Can you report the methods you have
>> trouble with to either the seaside mailing list or the issue tracker [1]?
>>
> I don't have troubles with any of the seaside methods. It just seems impossible for me to state things clear enough :) There was a real problem. There was WideStrings creeping into the monticello package of XML Parser. There is no way monticello can handle multi byte strings as it uses latin1 for encoding. The latin1 encoded WideStrings cannot be read by gemstone. But the troublesome piece of code has already been removed from the XML Parser package.
>
> Seaside is not completely 7bit clean. As an example take Seaside-InternetExplorer-lr.6.mcz. There is a sentence "...we have introduced a mechanism to help prevent the untrusted content from compromising your site<92>s security...". The <92> is a RIGHT SINGLE QUOTATION MARK that Microsoft put into the latin1 gap. So I guess this is CP1252 which means it has been copied from a windows system into the squeak image.

Yeah, that was copied and pasted from a blog post. My mistake, sorry,
will fix it. CP1252 is evil.

Cheers
Philippe

Alexandre Bergel

Re: XML Parser, Monticello and unicode?

In reply to this post by NorbertHartl

> There is no way monticello can handle multi byte strings as it uses latin1 for encoding.

Hi Norbert,

I often have very large Strings in my tests. Monticello behaves as it should. I haven't seen any problem. The problem you bumped into seems to stem from accented characters.

Cheers,
Alexandre

--
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.

Alexandre Bergel

Re: XML Parser, Monticello and unicode?

In reply to this post by NorbertHartl

The solution for now seems to be to remove this test. It can be moved into a different package. How does that sound?

Cheers,
Alexandre

NorbertHartl

Re: XML Parser, Monticello and unicode?

In reply to this post by Alexandre Bergel

On 09.08.2010, at 15:00, Alexandre Bergel wrote:

>> There is no way monticello can handle multi byte strings as it uses latin1 for encoding.
>
>
> Hi Norbert,
>
> I often have very large Strings in my tests. Monticello behaves as it should. I haven't seen any problem. The problem you bumped into seems to stem from accented characters.
>
Alexandre,

The problem is not large strings but strings that contain characters with asciiValue > 255. Such strings become WideStrings, and that is the problem. Jaayer removed the problematic method, and after that (thanks again) I could read the Monticello package again. By the way, I have already ported the newest source to GemStone.
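
The distinction can be stated as a simple test on code points. This Python sketch mirrors the ByteString/WideString split by analogy only; needs_wide_string and fits_latin1 are hypothetical helpers, not image methods.

```python
def needs_wide_string(text: str) -> bool:
    """True if any character's code point exceeds 255 (one byte is not enough)."""
    return any(ord(ch) > 255 for ch in text)


def fits_latin1(text: str) -> bool:
    """True if the text can be written with a Latin-1 encoder."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


long_text = "a" * 100_000  # large, but every character fits in one byte
accented = "Grüße"         # accented, but all code points are <= 255
wide = "price: \u20ac 5"   # the euro sign is U+20AC, far above 255

print(needs_wide_string(long_text), fits_latin1(long_text))  # False True
print(needs_wide_string(accented), fits_latin1(accented))    # False True
print(needs_wide_string(wide), fits_latin1(wide))            # True False
```

Neither string length nor Latin-1 accented characters trigger the problem; a single character above 255 does.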

Norbert

Alexandre Bergel

Re: XML Parser, Monticello and unicode?

> the problem is not large strings but strings that contain characters with asciiValue > 255. The strings then become WideStrings and that is the problem. Jaayer removed the problematic method and after that (thanks again) I could read the monticello package again. And btw. I ported the newest source to gemstone already.

Ok

Alexandre
