Hello all,
I have come quite a way in building my own framework to parse a particular XML file (one that contains information about audio tracks). I can read and parse all the data, but the overall performance degrades dramatically as the XML files grow. It takes about 3 minutes to parse an XML file with ca. 10560 tracks.

When using Ian's profiler, I found that 80% of that time was spent in the method IXMLDOMNode>>text. After some more experimentation I also found that the bigger the XML file was, the longer each method execution took, even if it was for the same node at the same position (close to the beginning of the file).

To test the latter I used this code:

    root := (IXMLDOMDocument new loadURL: 'file://P:\My Music\mp3testAll.xml') documentElement.
    n := (root childNodes at: 2) firstChild firstChild.
    Time millisecondsToRun: [ 10000 timesRepeat: [ n text ] ]

and I changed the URL for the various files that I was testing with. My results (taken from the 3rd or 4th execution of the code above for each URL) are:

    No. of tracks in file   File size [KB]   Execution time [s]
                        1                4                 0.17
                      500              614                 0.38
                     1000             1275                 0.6
                     5000             6598                 9.6
                    10559            14262                20.4

If anyone cares to reproduce this, I have uploaded the files I tested with to http://www.geocities.com/bernhard_kohlhaas/xmltest.zip

I find this behavior very peculiar, because I would have expected the execution time to be more or less constant. Is this a Dolphin issue or a problem with Microsoft's implementation of DOM or something else completely?

Finally I loaded Steve Waring's Spray package and tried to parse this with the SAXHandler and the XMLDOMParser, but both choked on the file 'mp3test500.xml' with a walkback:

    SAXHandler parseDocumentFromFileNamed: 'mp3test500.xml'
    XMLDOMParser parseDocumentFromFileNamed: 'mp3test500.xml'

So it looks like I am stuck with the MS parser, and I'd really like to get it to work faster.

I'd be very grateful for any suggestions on the matter,
Bernhard
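For anyone reproducing the measurement above, here is a minimal workspace sketch that reports the cost per #text call for more than one file in a single run. It only reuses the calls shown in the post plus Transcript output; the base directory is the one from the post and will need adjusting, and the two file names are the ones mentioned in the thread.

```smalltalk
"Sketch only: time IXMLDOMNode>>text per call for several test files.
 The base directory is the one from the post; adjust it to your machine."
| base calls |
base := 'file://P:\My Music\'.
calls := 10000.
#('mp3test500.xml' 'mp3testAll.xml') do: [:name |
    | root n ms |
    root := (IXMLDOMDocument new loadURL: base , name) documentElement.
    n := (root childNodes at: 2) firstChild firstChild.
    ms := Time millisecondsToRun: [calls timesRepeat: [n text]].
    "ms * 1000 / calls = microseconds per call"
    Transcript
        show: name; show: ': ';
        show: (ms * 1000 / calls) asFloat printString;
        show: ' microseconds per call'; cr]
```

Reporting the time per call rather than the total makes it easier to see whether the cost stays constant or grows with the file size.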
Bernhard Kohlhaas wrote:
> I find this behavior very peculiar, because I would have expected the
> execution time to be more or less constant. Is this a Dolphin issue or a
> problem with Microsoft's implementation of DOM or something else
> completely?

I'd say it's a Microsoft implementation problem. The call to #text almost immediately calls down into the COM interfaces, so I don't see how Dolphin could be affecting it.

But something /very/ odd is going on.

I tried to reproduce it using your data. I found that on my machine

    n text.

took about 10 microseconds when it was looking at the shortest file, growing (linearly) to about 625 microseconds for the longest file. Just as you found.

I got interested so I tried using the XOM package by Elliotte Rusty Harold. That's a Java package but I wanted to check anyway. Driving that from Dolphin it took about 20 microseconds regardless of the file size, just as you'd expect.

So, I wondered if your long lines were confusing the Microsoft version in some way, and so ran your XML through the XOM pretty-printer to create new files. Then went back to XMLDomParser and tested it to see how it went with pretty-printed XML. Oddly enough it seemed to take around 10 microseconds regardless of file size. Odd. But much odder, I then tried it again on the original files. And it /still/ took around 10 microseconds regardless of file size. So it had suddenly started working properly.

The only explanation I can come up with is that IXMLDOMParser had seen XOM doing its stuff, and got envious and decided to work properly in future...

> Finally I loaded Steve Waring's Spray package and tried to parse this
> with the SAXHandler and the XMLDOMParser, but both choked on the file
> 'mp3test500.xml' with a walkback:

Your files are in UTF-8 and the Spray stuff isn't really Unicode aware. You can probably get it to work by skipping the Byte Order Mark in the first three bytes of your files (XML files in UTF-8 shouldn't really be including a Byte Order Mark anyway).

Something like:

    stream := FileStream read: 'mp3testAll.xml'.
    stream skip: 3.
    root := XMLDOMParser parseDocumentFrom: stream.
    stream close.

However, it /seems/ that there may be something wrong with the way that the Spray parser uses FileStreams. I suspect a fault somewhere (quite possibly in FileStream) to do with #position: and similar methods. A few times I saw the Spray parser fail, when loading the same data into memory and then parsing that worked OK. Unfortunately I don't have a reproducible case. If you find the same thing then you could do

    stream := FileStream read: 'mp3testAll.xml'.
    contents := stream skip: 3; upToEnd.
    stream close.
    root := XMLDOMParser parseDocumentFrom: contents readStream.

HTH

    -- chris
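The snippets above skip three bytes unconditionally, which would eat real content in a file that has no Byte Order Mark. A hedged variation that skips only when the UTF-8 BOM (bytes 239 187 191) is actually there might look like the sketch below. It assumes the file holds at least three bytes, that the stream hands back the raw bytes as characters (as in the Dolphin version discussed here), and that String>>asByteArray is available; none of that has been checked against the Spray package itself.

```smalltalk
"Sketch only: skip the UTF-8 byte order mark only if it is present.
 Assumes at least three bytes in the file and a byte-per-character stream."
| stream root |
stream := FileStream read: 'mp3testAll.xml'.
[((stream next: 3) asByteArray = #[239 187 191])
    ifFalse: [stream reset].    "no BOM: rewind and parse from the start"
root := XMLDOMParser parseDocumentFrom: stream]
    ensure: [stream close].
```

The ensure: block closes the file even if the parser raises a walkback halfway through.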
Chris,
Thanks a lot for taking the time to look into this.

> I got interested so I tried using XOM package by Elliotte Rusty Harold. That's
> a Java package but I wanted to check anyway. Driving that from Dolphin it took
> about 20 microseconds regardless of the file size, just as you'd expect.
>
> So, I wondered if your long lines were confusing the Microsoft version in some
> way, and so ran your XML through the XOM pretty-printer to create new files.
> Then went back to XMLDomParser and tested it to see how it went with
> pretty-printed XML. Oddly enough it seemed to take around 10 microseconds
> regardless of file size. Odd. But much odder, I then tried it again on the
> original files. And it /still/ took around 10 microseconds regardless of file
> size. So it had suddenly started working properly.

Inspired by your findings I searched for an XML pretty printer on the web and found Tidy at http://tidy.sourceforge.net/ . I added line breaks to mp3testAll.xml and saved the result in file mp3testAllpp.xml (which I also added to the archive at http://www.geocities.com/bernhard_kohlhaas/xmltest.zip ).

Strangely enough I did not notice any performance difference between loading the two files. So it seems that XOM does some truly miraculous things beyond adding line breaks. I'd be curious how mp3testAllpp.xml behaves on your system.

> The only explanation I can come up with is that IXMLDOMParser had seen XOM
> doing its stuff, and got envious and decided to work properly in future...

XOM's supernatural strength seems to reach far ;)

Thank you for all those pointers and suggestions regarding SAX. That'll help me get a lot further.

Bernhard
Bernhard,
> Strangely enough I did not notice any performance difference between
> loading the two files. So it seems that XOM does some truely miraculous
> things beyond adding line breaks. I'd be curious how mp3testAllpp.xml
> behaves on your system.

I'm afraid that it turns out to be my mistake. In the second set of tests (the ones that appeared to run in constant time) I was setting

    n := (root childNodes at: 2) firstChild firstChild.

where root was the IXMLDOMDocument rather than its #documentElement. I'm sorry for the mislead...

BTW, it seems that MS still haven't fixed the problem. Forcing the use of MSXML 4 with code like:

    base := 'file://C:\...whatever...'.
    doc := IXMLDOMDocument createObject: 'Msxml2.DOMDocument.4.0'.
    doc loadURL: (base , 'mp3testAll.xml').
    root := doc documentElement.
    node := (root childNodes at: 2) firstChild firstChild.

is about 30% faster, but is still linear in the size of the file...

> Thank you for all those pointers and suggestions regarding SAX. That'll
> help me to get me a lot further.

I did manage to find the problem with the YAXO parser when reading directly from file. It's caused by the Squeak compatibility method Stream>>nextOrNil, which reads:

    nextOrNil
        <primitive: 65>
        ^ nil.

which doesn't work if the primitive fails because the stream has reached the end of its internal buffer (rather than the end of the file): it then answers nil even though more data is available. If you change it to:

    nextOrNil
        <primitive: 65>
        ^ self atEnd ifFalse: [self next].

then it seems to work OK.

    -- chris
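One way to check whether the #nextOrNil change above has taken effect is to count how many elements it hands back for a whole file and compare that with what #upToEnd reads. This is only a workspace sketch under the assumption that the early nil shows up as a short count; the file name is the one used earlier in the thread.

```smalltalk
"Sketch only: check that #nextOrNil walks the whole file instead of
 stopping early at an internal buffer boundary."
| stream expected count |
stream := FileStream read: 'mp3testAll.xml'.
[expected := stream upToEnd size.
stream reset.
count := 0.
[stream nextOrNil notNil] whileTrue: [count := count + 1]]
    ensure: [stream close].
count = expected    "expected to be true once the patch is applied"
```

A count smaller than the expected size would suggest that #nextOrNil is still answering nil before the real end of the file.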
Chris,
> I'm afraid that it turns out to be my mistake. In the second set of tests (the
> ones that appeared to run in constant time) I was setting
>     n := (root childNodes at: 2) firstChild firstChild.
> where root was the IXMLDOMDocument rather than its #documentElement. I'm sorry
> for the mislead...

Not a problem. :)

[...]

> I did manage to find the problem with the YAXO parser when reading directly
> from file. It's caused by the Squeak compatibility method Stream>>nextOrNil
> which reads:
>
>     nextOrNil
>         <primitive: 65>
>         ^ nil.
>
> which doesn't work if the primitive fails because the stream has reached the
> end of its internal buffer. If you change it to:
>
>     nextOrNil
>         <primitive: 65>
>         ^ self atEnd ifFalse: [self next].
>
> then it seems to work OK.

Thank you very much for finding this. I applied the patch and was able to parse my document with the Yaxo parser. My program now runs twice as fast as with the MS XML parser, which is a good step forward. :)

The comparison should become even more favorable for the Yaxo parser as I add the retrieval of more XML nodes from the document: the MS parser would run slower (due to more invocations of IXMLDOMNode>>text), whereas the Yaxo parser seems to spend all its time on the initial parsing of the document, with basically instant access to the nodes' content later on.

Thanks again for all the help & Best Regards,
Bernhard