
Strange performance behavior of IXMLDOMNode>>text in DPRO 5.1.4


Bernhard Kohlhaas-6

Hello all,

I have come quite a way in building my own framework to parse a
particular XML file (that contains information about audio tracks). I
can read and parse all the data, however the overall performance slows
down dramatically with growing XML files. It takes about 3 minutes to
parse an XML file with ca. 10560 tracks.

When using Ian's profiler, I found that 80% of that time was spent in
the method IXMLDOMNode>>text. After some more experimentation I also
found that the bigger the XML file was, the longer each method execution
took, even if it was for the same node at the same position (close to
the beginning of the file).

To test the latter I used the following code:

root := (IXMLDOMDocument new
    loadURL: 'file://P:\My Music\mp3testAll.xml')
        documentElement.
n := (root childNodes at: 2) firstChild firstChild.
Time millisecondsToRun: [10000 timesRepeat: [n text]]

and I changed the URL for each of the files that I was testing with. My
results (taken from the 3rd or 4th execution of the code above for
each URL) are:

No. of tracks in file   File size [KB]   Execution time [s]
                    1                4                 0.17
                  500              614                 0.38
                 1000             1275                 0.6
                 5000             6598                 9.6
                10559            14262                20.4
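
Dividing each execution time by the 10000 calls in the timed loop gives the cost of a single #text call, which makes the growth easier to see. A minimal workspace sketch, using only the figures from the table above (the variable names are illustrative):

```smalltalk
"Per-call cost of #text implied by the table: each timed run made 10000 calls."
sizesKB := #(4 614 1275 6598 14262).
secondsPer10000 := #(0.17 0.38 0.6 9.6 20.4).
sizesKB with: secondsPer10000 do: [:kb :secs |
    "Convert seconds per 10000 calls into microseconds per call."
    microsPerCall := secs * 1000000 / 10000.
    Transcript
        show: kb printString;
        show: ' KB: ';
        show: microsPerCall rounded printString;
        show: ' microseconds per call';
        cr]
```

This works out to roughly 17 microseconds per call for the smallest file and about 2040 for the largest, so the per-call cost grows by a factor of about 120 while the node being read stays the same.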

If anyone cares to reproduce this, I have uploaded the files I tested
this with to http://www.geocities.com/bernhard_kohlhaas/xmltest.zip

I find this behavior very peculiar, because I would have expected the
execution time to be more or less constant. Is this a Dolphin issue or a
problem with Microsoft's implementation of DOM or something else completely?

Finally I loaded Steve Waring's Spray package and tried to parse this
with the SAXHandler and the XMLDOMParser, but both choked on the file
'mp3test500.xml' with a walkback:

SAXHandler parseDocumentFromFileNamed: 'mp3test500.xml'
XMLDOMParser parseDocumentFromFileNamed: 'mp3test500.xml'

So it looks like I am stuck with the MS parser, thus I'd really like to
get it to work faster.

I'd be very grateful for any suggestions on the matter,

Bernhard

Chris Uppal-3

Re: Strange performance behavior of IXMLDOMNode>>text in DPRO 5.1.4

Bernhard Kohlhaas wrote:

> I find this behavior very peculiar, because I would have expected the
> execution time to be more or less constant. Is this a Dolphin issue or a
> problem with Microsoft's implementation of DOM or something else
> completely?

I'd say it's a Microsoft implementation problem. The call to #text almost
immediately calls down into the COM interfaces, so I don't see how Dolphin
could be affecting it.

But something /very/ odd is going on. I tried to reproduce it using your data.
I found that on my machine
n text.
took about 10 microseconds when it was looking at the shortest file, growing
(linearly) to about 625 microseconds for the longest file. Just as you found.

I got interested so I tried using the XOM package by Elliotte Rusty Harold. That's
a Java package but I wanted to check anyway. Driving that from Dolphin it took
about 20 microseconds regardless of the file size, just as you'd expect.

So, I wondered if your long lines were confusing the Microsoft version in some
way, and so ran your XML through the XOM pretty-printer to create new files.
Then went back to XMLDomParser and tested it to see how it went with
pretty-printed XML. Oddly enough it seemed to take around 10 microseconds
regardless of file size. Odd. But much odder, I then tried it again on the
original files. And it /still/ took around 10 microseconds regardless of file
size. So it had suddenly started working properly.

The only explanation I can come up with is that IXMLDOMParser had seen XOM
doing its stuff, and got envious and decided to work properly in future...

> Finally I loaded Steve Waring's Spray package and tried to parse this
> with the SAXHandler and the XMLDOMParser, but both choked on the file
> 'mp3test500.xml' with a walkback:

Your files are in UTF-8 and the Spray stuff isn't really Unicode aware. You can
probably get it to work by skipping the Byte Order Mark in the first three
bytes of your files (XML files in UTF-8 shouldn't really be including a Byte
Order Mark anyway). Something like:

stream := FileStream read: 'mp3testAll.xml'.
stream skip: 3.
root := XMLDOMParser parseDocumentFrom: stream.
stream close.

However, it /seems/ that there may be something wrong with the way that the
Spray parser uses FileStreams. I suspect a fault somewhere (quite possibly in
FileStream) to do with #position: and similar methods. A few times I saw the
Spray parser fail, when loading the same data into memory and then parsing that
worked OK. Unfortunately I don't have a reproducible case. If you find the
same thing then you could do

stream := FileStream read: 'mp3testAll.xml'.
contents := stream skip: 3; upToEnd.
stream close.
root := XMLDOMParser parseDocumentFrom: contents readStream.
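
Skipping three bytes unconditionally would eat the start of the XML declaration in a file that has no Byte Order Mark. A hedged variant of the snippet above checks for the UTF-8 BOM (bytes EF BB BF) first. It is only a sketch: it assumes the file holds at least three bytes and that the FileStream hands back one Character per byte; the names head and hasBOM are illustrative.

```smalltalk
"Sketch: skip the first three bytes only if they are the UTF-8 BOM (239 187 191).
 Assumes at least three bytes in the file and one Character per byte."
stream := FileStream read: 'mp3testAll.xml'.
head := stream next: 3.
hasBOM := (head asArray collect: [:each | each asInteger]) = #(239 187 191).
"No BOM: go back to the start so that nothing is lost."
hasBOM ifFalse: [stream reset].
contents := stream upToEnd.
stream close.
root := XMLDOMParser parseDocumentFrom: contents readStream.
```

Like the second snippet above, this parses from memory, so it also sidesteps the suspected FileStream positioning problem.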

HTH

-- chris

Bernhard Kohlhaas-6

Re: Strange performance behavior of IXMLDOMNode>>text in DPRO 5.1.4

Chris,

Thanks a lot for taking the time to look into this.

> I got interested so I tried using the XOM package by Elliotte Rusty Harold. That's
> a Java package but I wanted to check anyway. Driving that from Dolphin it took
> about 20 microseconds regardless of the file size, just as you'd expect.
>
> So, I wondered if your long lines were confusing the Microsoft version in some
> way, and so ran your XML through the XOM pretty-printer to create new files.
> Then went back to XMLDomParser and tested it to see how it went with
> pretty-printed XML. Oddly enough it seemed to take around 10 microseconds
> regardless of file size. Odd. But much odder, I then tried it again on the
> original files. And it /still/ took around 10 microseconds regardless of file
> size. So it had suddenly started working properly.

Inspired by your findings I searched for an XML pretty printer on the
web and found Tidy at http://tidy.sourceforge.net/ . I added line breaks
to mp3testAll.xml and saved the result in file mp3testAllpp.xml (which I
also added to the archive at
http://www.geocities.com/bernhard_kohlhaas/xmltest.zip ).

Strangely enough I did not notice any performance difference between
loading the two files. So it seems that XOM does some truly miraculous
things beyond adding line breaks. I'd be curious how mp3testAllpp.xml
behaves on your system.

> The only explanation I can come up with is that IXMLDOMParser had seen XOM
> doing its stuff, and got envious and decided to work properly in future...

XOM's supernatural strength seems to reach far ;)

Thank you for all those pointers and suggestions regarding SAX. That'll
help me get a lot further.

Bernhard

Chris Uppal-3

Re: Strange performance behavior of IXMLDOMNode>>text in DPRO 5.1.4

Bernhard,

> Strangely enough I did not notice any performance difference between
> loading the two files. So it seems that XOM does some truly miraculous
> things beyond adding line breaks. I'd be curious how mp3testAllpp.xml
> behaves on your system.

I'm afraid that it turns out to be my mistake. In the second set of tests (the
ones that appeared to run in constant time) I was setting
n := (root childNodes at: 2) firstChild firstChild.
where root was the IXMLDOMDocument rather than its #documentElement. I'm sorry
for the mislead...

BTW, it seems that MS still haven't fixed the problem. Forcing the use of MSXML
4 with code like:

base := 'file://C:\...whatever...'.
doc := IXMLDOMDocument createObject: 'Msxml2.DOMDocument.4.0'.
doc loadURL: (base , 'mp3testAll.xml').
root := doc documentElement.
node := (root childNodes at: 2) firstChild firstChild.

is about 30% faster, but is still linear in the size of the file...

> Thank you for all those pointers and suggestions regarding SAX. That'll
> help me get a lot further.

I did manage to find the problem with the YAXO parser when reading directly
from file. It's caused by the Squeak compatibility method Stream>>nextOrNil
which reads:

nextOrNil
    <primitive: 65>
    ^ nil.

which doesn't work if the primitive fails because the stream has reached the
end of its internal buffer. If you change it to:

nextOrNil
    <primitive: 65>
    ^ self atEnd ifFalse: [self next].

then it seems to work OK.
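
One way to check the patch, sketched here under assumptions rather than taken from the package: read a whole file one element at a time with #nextOrNil and compare the count with the size of the same file read via #upToEnd. It assumes the Squeak compatibility method is loaded (it comes with the Spray/YAXO packages); expected and count are illustrative names.

```smalltalk
"Sketch: #nextOrNil should answer nil only at the real end of the file,
 not when the stream merely reaches the end of its internal buffer."
stream := FileStream read: 'mp3test500.xml'.
expected := stream upToEnd size.
stream reset.
count := 0.
[stream nextOrNil notNil] whileTrue: [count := count + 1].
stream close.
"Should answer true once the patched method is in place."
count = expected
```

Going by the diagnosis above, the unpatched method would be expected to answer nil at the first buffer boundary, so count would come out smaller than expected.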

-- chris

Bernhard Kohlhaas-7

Re: Strange performance behavior of IXMLDOMNode>>text in DPRO 5.1.4

Chris,

> I'm afraid that it turns out to be my mistake. In the second set of tests (the
> ones that appeared to run in constant time) I was setting
> n := (root childNodes at: 2) firstChild firstChild.
> where root was the IXMLDOMDocument rather than its #documentElement. I'm sorry
> for the mislead...

Not a problem. :)

[...]

> I did manage to find the problem with the YAXO parser when reading directly
> from file. It's caused by the Squeak compatibility method Stream>>nextOrNil
> which reads:
>
> nextOrNil
> <primitive: 65>
> ^ nil.
>
> which doesn't work if the primitive fails because the stream has reached the
> end of its internal buffer. If you change it to:
>
> nextOrNil
> <primitive: 65>
> ^ self atEnd ifFalse: [self next].
>
> then it seems to work OK.

Thank you very much for finding this. I applied the patch and was able
to parse my document with the Yaxo parser. My program now runs twice as
fast as with the MS XML parser, which is a good step forward. :)

It should become even more favorable for the Yaxo parser, as I am adding
the retrieval of more XML nodes from the document, because the MS parser
would run slower (due to more invocations of IXMLDOMNode>>text),
whereas the Yaxo parser seems to need all its time for the initial
parsing of the document with basically instant access to the nodes'
content later on.

Thanks again for all the help & Best Regards,

Bernhard