Unicode problem on parsing XML


Unicode problem on parsing XML

Bèrto ëd Sèra
Hi all,

Lately I started to see an out-of-index error when parsing Unicode
text from XML files. I am *not* 100% sure this isn't due to some
database-side changes that now expose a wider range of text to
the parser, so I cannot safely claim it's new. Yet now even common
accented Latin chars get warped into something unusable when read by
the parser, and this *surely* was not happening the last time I worked
on the interface, say 3 months ago, with Iliad 7.0.

Now I'm using gst 3.2 and iliad 0.8. What I get from the following code:
content := 'taxonomy.xml' asFile.
parser := XML.XMLParser new.
parser validate: false.
parser parse: content readStream.

is an error you can easily reproduce by putting the
http://eng.i-iter.org/graph/taxonomy.xml file into your local dir.
Before you go crazy (as I did) digging through the text looking for
the guilty chars, I can tell you the breakers are, for example:
1)...the æ and œ ligatures, ...
2) Devanāgarī script for Hindi
3) Japanese Rōmaji script

I was wondering what changed... or, most probably, what kind of silly
mistake I'm making...

Bèrto

--
==============================
Constitution du 24 juin 1793 - Article 35. - Quand le gouvernement
viole les droits du peuple, l'insurrection est, pour le peuple et pour
chaque portion du peuple, le plus sacré des droits et le plus
indispensable des devoirs.


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

Re: Unicode problem on parsing XML

Bèrto ëd Sèra
I really should get more sleep... I wrote Unicode all over the place,
but the text I'm parsing is actually UTF-8...

Bèrto



Re: Unicode problem on parsing XML

Paolo Bonzini-2
In reply to this post by Bèrto ëd Sèra
On 05/05/2010 05:36 AM, Bèrto ëd Sèra wrote:
> Now I'm using gst 3.2 and iliad 0.8. What I get from the following code:
> content := 'taxonomy.xml' asFile.
> parser := XML.XMLParser new.
> parser validate: false.
> parser parse: content readStream.

Not related, but it's best to use "XML.SAXParser defaultParserClass
new". Unlike XMLParser, other parsers may not construct the DOM by
default, so you end up with:

     PackageLoader fileInPackage: 'XML-XMLParser'.
     content := 'taxonomy.xml' asFile.
     parser := XML.SAXParser defaultParserClass new.
     parser validate: false.
     parser saxDriver: (driver := XML.DOM_SAXDriver new).
     parser parse: content readStream.
     driver document

This won't fix the bug but will make you a good citizen (see NEWS file
in GST 3.2).

> the breakers are, for example:
> 1)...the æ and œ ligatures, ...
> 2) Devanāgarī script for Hindi
> 3) Japanese Rōmaji script

The breakers are _entities_, not characters.
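
A minimal sketch for reproducing this without the full taxonomy.xml,
using the one-line test document quoted later in this thread. It assumes
that parse: accepts a read stream over a String just as it does over a
file (not confirmed in this thread), and that the XML-XMLParser package
is loaded without the patch:

```smalltalk
"Load the pure-Smalltalk parser, as in the snippet above."
PackageLoader fileInPackage: 'XML-XMLParser'.

"A single numeric entity above 255 (the test case from this thread)."
source := '<test><a>&#1234;</a></test>'.

parser := XML.SAXParser defaultParserClass new.
parser validate: false.
parser saxDriver: (driver := XML.DOM_SAXDriver new).

"Assumption: a ReadStream on a String is an acceptable input for parse:.
 On an unpatched parser this is where the error is expected to show up."
parser parse: source readStream.
driver document printNl.
```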

> I was wondering what changed... or, most probably, what kind of silly
> mistake I'm making...

Nothing, it's a bug.  The easiest way to fix it is to use the XML-Expat
package.  You just have to replace the first line above with these two:

     PackageLoader fileInPackage: 'XML-Expat'.
     PackageLoader fileInPackage: 'XML-DOM'.

It's _thousands_ of times faster too.
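
Putting the above together, the whole Expat-based snippet might look
like this. It is only a sketch: it combines the lines already shown in
this message, and assumes that the XML-Expat package (and the expat
library it wraps) is installed and that taxonomy.xml is in the current
directory:

```smalltalk
"Load the Expat-backed parser and the DOM classes instead of XML-XMLParser."
PackageLoader fileInPackage: 'XML-Expat'.
PackageLoader fileInPackage: 'XML-DOM'.

content := 'taxonomy.xml' asFile.
"With XML-Expat loaded, defaultParserClass is expected to be the Expat parser."
parser := XML.SAXParser defaultParserClass new.
parser validate: false.
"Ask explicitly for a DOM, since not every parser builds one by default."
parser saxDriver: (driver := XML.DOM_SAXDriver new).
parser parse: content readStream.
driver document printNl.
```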

But if you insist, this patch fixes it:

diff --git a/packages/xml/parser/XML.st b/packages/xml/parser/XML.st
index 309cf36..a9ebb7f 100644
--- a/packages/xml/parser/XML.st
+++ b/packages/xml/parser/XML.st
@@ -2950,7 +2950,7 @@ Instance Variables:
     ifTrue:
  [sax fatalError: (BadCharacterSignal new
     messageText: 'A character with Unicode value %1 is not legal' % {n})].
- data nextPut: (Character value: n).
+ data display: (Character codePoint: n).
  self getNextChar
      ]

diff --git a/packages/xml/parser/package.xml b/packages/xml/parser/package.xml
index 2e0bcce..fc72811 100644
--- a/packages/xml/parser/package.xml
+++ b/packages/xml/parser/package.xml
@@ -13,6 +13,7 @@

    <prereq>XML-SAXParser</prereq>
    <prereq>XML-DOM</prereq>
+  <prereq>Iconv</prereq>

    <filein>XML.st</filein>
    <file>XML.st</file>
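
The XML.st hunk swaps Character value: for Character codePoint:. A small
sketch of the difference, assuming (as the out-of-index error suggests,
but not confirmed in this thread) that Character value: rejects integers
above 255 in GST:

```smalltalk
"Assumption: this signals an error because 1234 is outside the range
 that Character value: accepts; the handler just prints the message."
[Character value: 1234]
    on: Error
    do: [:e | e messageText displayNl].

"Character codePoint: is the call used in the patch; it is expected to
 return a character object for any valid Unicode code point."
ch := Character codePoint: 1234.
ch printNl.
ch value printNl.
```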

Paolo



Re: Unicode problem on parsing XML

Bèrto ëd Sèra
In reply to this post by Bèrto ëd Sèra
Hi!

Yes, sorry, it's definitely entities :) And expat works like a breeze,
so no chance I'd *ever* go back :) You made my day, and I'm glad I
helped fix a bug :)

Bèrto

On 5 May 2010 20:17, Paolo Bonzini <[hidden email]> wrote:

> On 05/05/2010 11:14 AM, Bèrto ëd Sèra wrote:
>>
>> I really should have more sleep... I wrote Unicode all around, but the
>> text I'm parsing it's actually UTF-8...
>
> no, there are entities in place of the characters >255.
>
> as a test, this is enough:
>
> <test><a>&#1234;</a></test>
>
> Paolo
>


