Unicode problem on parsing XML

Bèrto ëd Sèra

Unicode problem on parsing XML

Hi all,

lately I started to see an out-of-index error when parsing Unicode
text from XML files. I am *not* 100% sure this isn't due to some
changes database-side that are now exposing a wider amount of text to
the parser, so I cannot safely claim it's new. Yet, now even common
accented Latin chars get warped into something unusable when read from
the parser, and this *surely* was not happening the last time I worked
on the interface, say 3 months ago, with Iliad 7.0.

Now I'm using gst 3.2 and iliad 0.8. What I get from the following code:
content := 'taxonomy.xml' asFile.
parser := XML.XMLParser new.
parser validate: false.
parser parse: content readStream.

is an error you can easily reproduce by putting the
http://eng.i-iter.org/graph/taxonomy.xml file into your local dir.
Before you go crazy (as I did) digging around the text looking for
the guilty chars, I can tell you the breakers are, for example:
1)...the æ and œ ligatures, ...
2) Devanāgarī script for Hindi
3) Japanese Rōmaji script

I was wondering what changed... or, most probably, what kind of silly
mistake I'm making...

Bèrto

--
==============================
Constitution du 24 juin 1793 - Article 35. - Quand le gouvernement
viole les droits du peuple, l'insurrection est, pour le peuple et pour
chaque portion du peuple, le plus sacré des droits et le plus
indispensable des devoirs.

_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

Bèrto ëd Sèra

Re: Unicode problem on parsing XML

I really should get more sleep... I wrote Unicode all around, but the
text I'm parsing is actually UTF-8...

Bèrto

Paolo Bonzini-2

Re: Unicode problem on parsing XML

In reply to this post by Bèrto ëd Sèra

On 05/05/2010 05:36 AM, Bèrto ëd Sèra wrote:
> Now I'm using gst 3.2 and iliad 0.8. What I get from the following code:
> content := 'taxonomy.xml' asFile.
> parser := XML.XMLParser new.
> parser validate: false.
> parser parse: content readStream.

Not related, but it's best to use "XML.SAXParser defaultParserClass
new". Unlike XMLParser, other parsers may not construct the DOM by
default, so you end up with:

PackageLoader fileInPackage: 'XML-XMLParser'.
content := 'taxonomy.xml' asFile.
parser := XML.SAXParser defaultParserClass new.
parser validate: false.
parser saxDriver: (driver := XML.DOM_SAXDriver new).
parser parse: content readStream.
driver document

This won't fix the bug but will make you a good citizen (see NEWS file
in GST 3.2).

> the breakers are, for example:
> 1)...the æ and œ ligatures, ...
> 2) Devanāgarī script for Hindi
> 3) Japanese Rōmaji script

The breakers are _entities_, not characters.
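
A minimal way to try this out, assuming the trigger is a numeric
character reference to a code point above 255 (the one-line test
document below is made up for illustration and has not been run against
an unpatched 3.2):

```smalltalk
"Sketch only: same parser setup as in the original report,
 but fed from a string instead of taxonomy.xml."
PackageLoader fileInPackage: 'XML-XMLParser'.
parser := XML.XMLParser new.
parser validate: false.
"&#1234; is a numeric character reference to U+04D2, i.e. above 255."
parser parse: '<test><a>&#1234;</a></test>' readStream.
```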

> I was wondering what changed... or, most probably, what kind of silly
> mistake I'm making...

Nothing, it's a bug. The easiest way to fix it is to use the XML-Expat
package. You just have to replace the first line above with these two:

PackageLoader fileInPackage: 'XML-Expat'.
PackageLoader fileInPackage: 'XML-DOM'.

It's _thousands_ of times faster too.
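
Put together, the Expat-based variant would presumably look like the
following. This is only the snippet above with its first line swapped
for the two PackageLoader lines, so treat it as a sketch rather than a
tested script:

```smalltalk
"Load the Expat-backed parser and the DOM classes instead of XML-XMLParser."
PackageLoader fileInPackage: 'XML-Expat'.
PackageLoader fileInPackage: 'XML-DOM'.
content := 'taxonomy.xml' asFile.
"defaultParserClass picks whichever SAX parser is loaded."
parser := XML.SAXParser defaultParserClass new.
parser validate: false.
"Ask explicitly for a DOM, since not every parser builds one by default."
parser saxDriver: (driver := XML.DOM_SAXDriver new).
parser parse: content readStream.
driver document
```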

But if you insist, this patch fixes it:

diff --git a/packages/xml/parser/XML.st b/packages/xml/parser/XML.st
index 309cf36..a9ebb7f 100644
--- a/packages/xml/parser/XML.st
+++ b/packages/xml/parser/XML.st
@@ -2950,7 +2950,7 @@ Instance Variables:
ifTrue:
[sax fatalError: (BadCharacterSignal new
messageText: 'A character with Unicode value %1 is not legal' % {n})].
- data nextPut: (Character value: n).
+ data display: (Character codePoint: n).
self getNextChar
]

diff --git a/packages/xml/parser/package.xml b/packages/xml/parser/package.xml
index 2e0bcce..fc72811 100644
--- a/packages/xml/parser/package.xml
+++ b/packages/xml/parser/package.xml
@@ -13,6 +13,7 @@

<prereq>XML-SAXParser</prereq>
<prereq>XML-DOM</prereq>
+ <prereq>Iconv</prereq>

<filein>XML.st</filein>
<file>XML.st</file>

Paolo

Bèrto ëd Sèra

Re: Unicode problem on parsing XML

In reply to this post by Bèrto ëd Sèra

Hi!

yes, sorry, it's definitely entities :) And expat works like a breeze,
so no chance I'd *ever* go back :) You made my day, and I'm glad I
helped fix a bug :)

Bèrto

On 5 May 2010 20:17, Paolo Bonzini <[hidden email]> wrote:

> On 05/05/2010 11:14 AM, Bèrto ëd Sèra wrote:
>>
>> I really should get more sleep... I wrote Unicode all around, but the
>> text I'm parsing is actually UTF-8...
>
> no, there are entities in place of the characters >255.
>
> as a test, this is enough:
>
> <test><a>&#1234;</a></test>
>
> Paolo
>
