Hi all,
lately I started to see an out-of-index error when parsing Unicode text from XML files. I am *not* 100% sure this isn't due to some database-side changes that now expose a larger amount of text to the parser, so I cannot safely claim it's new. Yet now even common accented Latin chars get warped into something unusable when read from the parser, and this *surely* was not happening the last time I worked on the interface, say 3 months ago, with Iliad 7.0.

Now I'm using gst 3.2 and iliad 0.8. What I get from the following code:

    content := 'taxonomy.xml' asFile.
    parser := XML.XMLParser new.
    parser validate: false.
    parser parse: content readStream.

is an error you can easily reproduce by putting the http://eng.i-iter.org/graph/taxonomy.xml file into your local dir.

Before you go crazy (as I did) digging around the text looking for the guilty chars, I can tell you the breakers are, for example:
1) ...the æ and œ ligatures, ...
2) Devanāgarī script for Hindi
3) Japanese Rōmaji script

I was wondering what changed... or, most probably, what kind of silly mistake I'm making...

Bèrto

--
==============================
Constitution of 24 June 1793 - Article 35. - When the government violates the rights of the people, insurrection is, for the people and for each portion of the people, the most sacred of rights and the most indispensable of duties.
_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
I really should get more sleep... I wrote Unicode all over the place, but the
text I'm parsing is actually UTF-8...

Bèrto
On 05/05/2010 05:36 AM, Bèrto ëd Sèra wrote:
> Now I'm using gst 3.2 and iliad 0.8. What I get from the following code:
> content := 'taxonomy.xml' asFile.
> parser := XML.XMLParser new.
> parser validate: false.
> parser parse: content readStream.

Not related, but it's best to use "XML.SAXParser defaultParserClass new". Unlike XMLParser, other parsers may not construct the DOM by default, so you end up with:

    PackageLoader fileInPackage: 'XML-XMLParser'.
    content := 'taxonomy.xml' asFile.
    parser := XML.SAXParser defaultParserClass new.
    parser validate: false.
    parser saxDriver: (driver := XML.DOM_SAXDriver new).
    parser parse: content readStream.
    driver document

This won't fix the bug but will make you a good citizen (see the NEWS file in GST 3.2).

> the breakers are, for example:
> 1)...the æ and œ ligatures, ...
> 2) Devanāgarī script for Hindi
> 3) Japanese Rōmaji script

The breakers are _entities_, not characters.

> I was wondering what changed... or, most probably, what kind of silly
> mistake I'm making...

Nothing, it's a bug. The easiest way to fix it is to use the XML-Expat package. You just have to replace the first line above with these two:

    PackageLoader fileInPackage: 'XML-Expat'.
    PackageLoader fileInPackage: 'XML-DOM'.

It's _thousands_ of times faster too. But if you insist, this patch fixes it:

diff --git a/packages/xml/parser/XML.st b/packages/xml/parser/XML.st
index 309cf36..a9ebb7f 100644
--- a/packages/xml/parser/XML.st
+++ b/packages/xml/parser/XML.st
@@ -2950,7 +2950,7 @@ Instance Variables:
 			ifTrue: [sax fatalError: (BadCharacterSignal new messageText: 'A character with Unicode value %1 is not legal' % {n})].
-		data nextPut: (Character value: n).
+		data display: (Character codePoint: n).
 		self getNextChar
 	]
diff --git a/packages/xml/parser/package.xml b/packages/xml/parser/package.xml
index 2e0bcce..fc72811 100644
--- a/packages/xml/parser/package.xml
+++ b/packages/xml/parser/package.xml
@@ -13,6 +13,7 @@
   <prereq>XML-SAXParser</prereq>
   <prereq>XML-DOM</prereq>
+  <prereq>Iconv</prereq>
 
   <filein>XML.st</filein>
   <file>XML.st</file>

Paolo
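Putting the two suggestions in this reply together (the defaultParserClass snippet, with its first line swapped for the two XML-Expat lines), the complete Expat-based version would presumably look like the sketch below. It is assembled only from lines quoted in this message and has not been run here; taxonomy.xml is the file name from the original report and is assumed to sit in the current directory.

```smalltalk
"Load the Expat-backed parser and the DOM classes; these two lines
 replace the XML-XMLParser line of the earlier snippet."
PackageLoader fileInPackage: 'XML-Expat'.
PackageLoader fileInPackage: 'XML-DOM'.

"The file from the original report, assumed to be in the current directory."
content := 'taxonomy.xml' asFile.

"Ask for the default parser class instead of naming XML.XMLParser directly."
parser := XML.SAXParser defaultParserClass new.
parser validate: false.

"Attach a DOM-building SAX driver explicitly, since parsers other than
 XMLParser may not construct the DOM by default."
parser saxDriver: (driver := XML.DOM_SAXDriver new).
parser parse: content readStream.

"The parsed DOM is then read back from the driver."
driver document
```

Nothing else in the calling code should need to change, since only the package being loaded differs from the XML-XMLParser variant.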
Hi!
yes, sorry, it's definitely entities :) And expat works like a breeze, so no chance I'd *ever* go back :) You made my day, and I'm glad I helped fix a bug :)

Bèrto

On 5 May 2010 20:17, Paolo Bonzini <[hidden email]> wrote:
> On 05/05/2010 11:14 AM, Bèrto ëd Sèra wrote:
>>
>> I really should have more sleep... I wrote Unicode all around, but the
>> text I'm parsing it's actually UTF-8...
>
> no, there are entities in place of the characters >255.
>
> this is enough as a test:
>
> <test><a>&#1234;</a></test>
>
> Paolo