tricks for XML parsing.

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

tricks for XML parsing.

stepharo
Hi guys

these free books may be interesting for you

     http://aosabook.org/

http://aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html


stef


Reply | Threaded
Open this post in threaded view
|

Re: tricks for XML parsing.

SergeStinckwich
There is also a chapter very similar to the one ObjVLisp here:
http://aosabook.org/en/500L/a-simple-object-model.html

On Wed, Jul 13, 2016 at 3:27 PM, stepharo <[hidden email]> wrote:

> Hi guys
>
> these free books may be interesting for you
>
>     http://aosabook.org/
>
> http://aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html
>
>
> stef
>
>



--
Serge Stinckwich
UCBN & UMI UMMISCO 209 (IRD/UPMC)
Every DSL ends up being Smalltalk
http://www.doesnotunderstand.org/

Reply | Threaded
Open this post in threaded view
|

Re: tricks for XML parsing.

monty-3
In reply to this post by stepharo
Thanks for the link.

In-place parsing is a non-starter because it means storing the entire input as a string in memory, so you could only parse files that fit in Pharo's address space. The multi-gigabyte OpenStreetMap docs the article mentions would be unparsable with SAX in a 32-bit VM.

Linked lists for storing child nodes is common (LibXML2 and Xerces do it) and provides constant time insertion and sibling access, but arrays/vectors (Arrays/OrderedCollections) are more cache friendly and faster for sequential access and in Pharo are almost always the correct choice.

There is always the option of an FFI-based parser, but it shouldn't be a hybrid like Python's minidom (FFI Expat with a Python DOM implementation), because something like that already exists in Smalltalk/X (FFI Expat with a Smalltalk DOM) and it was slower than a St/X port of XMLParser in my tests (I assume due to the FFI overhead), so it's probably not worth it. But a non-hybrid parser with everything (including the DOM) done in C should definitely be faster.

> Sent: Wednesday, July 13, 2016 at 10:27 AM
> From: stepharo <[hidden email]>
> To: "Pharo Development List" <[hidden email]>
> Subject: [Pharo-dev] tricks for XML parsing.
>
> Hi guys
>
> these free books may be interesting for you
>
>      http://aosabook.org/
>
> http://aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html
>
>
> stef
>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: tricks for XML parsing.

Jan Vrany
On Thu, 2016-07-14 at 01:58 +0200, monty wrote:
> Thanks for the link.
>
> In-place parsing is a non-starter because it means storing the entire
> input as a string in memory, so you could only parse files that fit
> in Pharo's address space. The multi-gigabyte OpenStreetMap docs the
> article mentions would be unparsable with SAX in a 32-bit VM.

I do not understand. I only know expat which does - AFAIK - in-place
parsing and surelt does not need the whole input in memory. 

> There is always the option of an FFI-based parser, but it shouldn't
> be a hybrid like Python's minidom (FFI Expat with a Python DOM
> implementation),
> because something like that already exists in Smalltalk/X (FFI Expat
> with a Smalltalk DOM)

I guess you refer to the implementation I did ages ago. 

> and it was slower than a St/X port of XMLParser in my tests (I assume
> due to the FFI overhead), so it's probably not worth it.

Very, very interesting. Where can I find the benchmarks? 

I just run a very simple benchmark on 112MB document (http://www.xml-be
nchmark.org/downloads.html) and results are quite the opposite: 

Benchmark resut:
Generated at :14-07-2016 07:32:25 AM

           Benchmark      Execution Time [ms]      # of M&S GCs
[1]      # of newspace GCs [1]   Parameters
BenchmarkXML
            SAX -
VW                    93418                     0                      
 2060   
      SAX -
XMLSuite                     9921                     0                
        410   

As you can see, the latter is roughly 10 times faster. 

I agree my implementation which uses Expat is clearly suboptimal 
and need to be improved (for example it does not use a ILC-based 
send to driver so you have a lot of cache misses and does a lot 
of unnecessary memcpy()s, but this can be easily improved)

Jan
  

> But a non-hybrid parser with everything (including the DOM) done in C
> should definitely be faster.
>
> > Sent: Wednesday, July 13, 2016 at 10:27 AM
> > From: stepharo <[hidden email]>
> > To: "Pharo Development List" <[hidden email]>
> > Subject: [Pharo-dev] tricks for XML parsing.
> >
> > Hi guys
> >
> > these free books may be interesting for you
> >
> >      http://aosabook.org/
> >
> > http://aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html
> >
> >
> > stef
> >
> >
> >
>

Reply | Threaded
Open this post in threaded view
|

Re: tricks for XML parsing.

stepharo
In reply to this post by monty-3
thanks for the analysis :)

I just skimmed over the tabl of contents and this caugth my eyes.



Le 14/7/16 à 01:58, monty a écrit :

> Thanks for the link.
>
> In-place parsing is a non-starter because it means storing the entire input as a string in memory, so you could only parse files that fit in Pharo's address space. The multi-gigabyte OpenStreetMap docs the article mentions would be unparsable with SAX in a 32-bit VM.
>
> Linked lists for storing child nodes is common (LibXML2 and Xerces do it) and provides constant time insertion and sibling access, but arrays/vectors (Arrays/OrderedCollections) are more cache friendly and faster for sequential access and in Pharo are almost always the correct choice.
>
> There is always the option of an FFI-based parser, but it shouldn't be a hybrid like Python's minidom (FFI Expat with a Python DOM implementation), because something like that already exists in Smalltalk/X (FFI Expat with a Smalltalk DOM) and it was slower than a St/X port of XMLParser in my tests (I assume due to the FFI overhead), so it's probably not worth it. But a non-hybrid parser with everything (including the DOM) done in C should definitely be faster.
>
>> Sent: Wednesday, July 13, 2016 at 10:27 AM
>> From: stepharo <[hidden email]>
>> To: "Pharo Development List" <[hidden email]>
>> Subject: [Pharo-dev] tricks for XML parsing.
>>
>> Hi guys
>>
>> these free books may be interesting for you
>>
>>       http://aosabook.org/
>>
>> http://aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html
>>
>>
>> stef
>>
>>
>>
>


Reply | Threaded
Open this post in threaded view
|

Re: tricks for XML parsing.

monty-3
In reply to this post by Jan Vrany


> Sent: Thursday, July 14, 2016 at 2:50 AM
> From: "Jan Vrany" <[hidden email]>
> To: [hidden email]
> Subject: Re: [Pharo-dev] tricks for XML parsing.
>
> On Thu, 2016-07-14 at 01:58 +0200, monty wrote:
> > Thanks for the link.
> >
> > In-place parsing is a non-starter because it means storing the entire
> > input as a string in memory, so you could only parse files that fit
> > in Pharo's address space. The multi-gigabyte OpenStreetMap docs the
> > article mentions would be unparsable with SAX in a 32-bit VM.
>
> I do not understand. I only know expat which does - AFAIK - in-place
> parsing and surelt does not need the whole input in memory. 

From the article, footnote 3: "This creates a lifetime dependency–the entire source buffer must outlive all document nodes for the technique to work"

> > There is always the option of an FFI-based parser, but it shouldn't
> > be a hybrid like Python's minidom (FFI Expat with a Python DOM
> > implementation),
> > because something like that already exists in Smalltalk/X (FFI Expat
> > with a Smalltalk DOM)
>
> I guess you refer to the implementation I did ages ago. 
>
> > and it was slower than a St/X port of XMLParser in my tests (I assume
> > due to the FFI overhead), so it's probably not worth it.
>
> Very, very interesting. Where can I find the benchmarks? 

This was well over a year ago, and it was DOM parsing. I was testing if St/X (your branch, I think) could be supported by XMLParser in addition to Pharo, Squeak, and GS, but I ran into too many incompatibilities, like Monticello not working (had to load in .st files), #new not sending #initialize, not being able to modify the value of a dictionary association directly, #lf/#cr weirdness, so I gave up. But not before hacking it enough to kind-of run and compared it with the other parsers.

> I just run a very simple benchmark on 112MB document (http://www.xml-be
> nchmark.org/downloads.html) and results are quite the opposite: 
>
> Benchmark resut:
> Generated at :14-07-2016 07:32:25 AM
>
>            Benchmark      Execution Time [ms]      # of M&S GCs
> [1]      # of newspace GCs [1]   Parameters
> BenchmarkXML
>             SAX -
> VW                    93418                     0                      
>  2060   
>       SAX -
> XMLSuite                     9921                     0                
>         410   
>
> As you can see, the latter is roughly 10 times faster. 

That's the VW parser, which is slower than XMLParser. And again, it was of DOM parsing.

> I agree my implementation which uses Expat is clearly suboptimal 
> and need to be improved (for example it does not use a ILC-based 
> send to driver so you have a lot of cache misses and does a lot 
> of unnecessary memcpy()s, but this can be easily improved)

Your implementation was fine, and particularly, its XPath/Query was very impressive. I wasn't attacking you. My point was just the hybrid approach built on Expat (which is a non-validating parser, BTW) should be avoided, in case anyone is considering it, based on my experience with minidom vs lxml.etree in Python and with St/X's v2 parser. A parser based on LibXLM2, Xerces, or something else for SAX, DOM, XPath, etc is probably a better way of creating an alternative to pure-Smalltalk parsers.

> Jan
>   
>
> > But a non-hybrid parser with everything (including the DOM) done in C
> > should definitely be faster.
> >
> > > Sent: Wednesday, July 13, 2016 at 10:27 AM
> > > From: stepharo <[hidden email]>
> > > To: "Pharo Development List" <[hidden email]>
> > > Subject: [Pharo-dev] tricks for XML parsing.
> > >
> > > Hi guys
> > >
> > > these free books may be interesting for you
> > >
> > >      http://aosabook.org/
> > >
> > > http://aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html
> > >
> > >
> > > stef
> > >
> > >
> > >
> >
>
>

Reply | Threaded
Open this post in threaded view
|

Re: tricks for XML parsing.

Jan Vrany
On Thu, 2016-07-14 at 11:11 +0200, monty wrote:

>
> > Sent: Thursday, July 14, 2016 at 2:50 AM
> > From: "Jan Vrany" <[hidden email]>
> > To: [hidden email]
> > Subject: Re: [Pharo-dev] tricks for XML parsing.
> >
> > On Thu, 2016-07-14 at 01:58 +0200, monty wrote:
> > > Thanks for the link.
> > >
> > > In-place parsing is a non-starter because it means storing the
> > > entire
> > > input as a string in memory, so you could only parse files that
> > > fit
> > > in Pharo's address space. The multi-gigabyte OpenStreetMap docs
> > > the
> > > article mentions would be unparsable with SAX in a 32-bit VM.
> >
> > I do not understand. I only know expat which does - AFAIK - in-
> > place
> > parsing and surelt does not need the whole input in memory. 
>
> From the article, footnote 3: "This creates a lifetime dependency–the
> entire source buffer must outlive all document nodes for the
> technique to work"

Ah, I see. I assumed that the consumer would copy data if it needs to
retain them...then the parser can do in-place tricks. 
Perhaps the difference is that I do not consider creation of a DOM
being part of "parsing" (I'm not saying it's not, just that this is a
way I think of the problem which may cause confusion :-) 

>
> > > There is always the option of an FFI-based parser, but it
> > > shouldn't
> > > be a hybrid like Python's minidom (FFI Expat with a Python DOM
> > > implementation), 
> > > because something like that already exists in Smalltalk/X (FFI
> > > Expat
> > > with a Smalltalk DOM) 
> >
> > I guess you refer to the implementation I did ages ago. 
> >
> > > and it was slower than a St/X port of XMLParser in my tests (I
> > > assume
> > > due to the FFI overhead), so it's probably not worth it. 
> >
> > Very, very interesting. Where can I find the benchmarks? 
>
> This was well over a year ago, and it was DOM parsing. I was testing
> if St/X (your branch, I think) could be supported by XMLParser in
> addition to Pharo, Squeak, and GS, but I ran into too many
> incompatibilities, like Monticello not working (had to load in .st
> files), #new not sending #initialize, not being able to modify the
> value of a dictionary association directly, #lf/#cr weirdness, so I
> gave up. But not before hacking it enough to kind-of run and compared
> it with the other parsers.

Fair enough. If you want to discuss these problems,  I'd be happy to
help.

>
> > I just run a very simple benchmark on 112MB document (http://www.xm
> > l-be
> > nchmark.org/downloads.html) and results are quite the opposite: 
> >
> > Benchmark resut:
> > Generated at :14-07-2016 07:32:25 AM
> >
> >            Benchmark      Execution Time [ms]      # of M&S GCs
> > [1]      # of newspace GCs [1]   Parameters
> > BenchmarkXML
> >             SAX -
> > VW                    93418                     0                  
> >     
> >  2060   
> >       SAX -
> > XMLSuite                     9921                     0            
> >     
> >         410   
> >
> > As you can see, the latter is roughly 10 times faster. 
>
> That's the VW parser, which is slower than XMLParser. And again, it
> was of DOM parsing.

I see. I got confused since VW's parser class is XMLParser...

>
> > I agree my implementation which uses Expat is clearly suboptimal 
> > and need to be improved (for example it does not use a ILC-based 
> > send to driver so you have a lot of cache misses and does a lot 
> > of unnecessary memcpy()s, but this can be easily improved)
>
> Your implementation was fine, and particularly, its XPath/Query was
> very impressive. I wasn't attacking you.

No worry, I know you were not! Just that I was surprised since my
observations - performance wise - were very different. Now I see we
were talking about different things. 

> My point was just the hybrid approach built on Expat (which is a non-
> validating parser, BTW) should be avoided, in case anyone is
> considering it, based on my experience with minidom vs lxml.etree in
> Python and with St/X's v2 parser. A parser based on LibXLM2, Xerces,
> or something else for SAX, DOM, XPath, etc is probably a better way
> of creating an alternative to pure-Smalltalk parsers.

I tend to disagree, but I think we have different points of view,
that;s why. I have to admit I have not much experience (and need) 
with XML. 
Anyways, I'd like to discuss this more to understand the details.
Perhaps privately as it's not strictly Pharo-development related.


Best, Jan