Hi guys
these free books may be interesting for you http://aosabook.org/ http://aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html stef |
There is also a chapter very similar to the one ObjVLisp here:
http://aosabook.org/en/500L/a-simple-object-model.html On Wed, Jul 13, 2016 at 3:27 PM, stepharo <[hidden email]> wrote: > Hi guys > > these free books may be interesting for you > > http://aosabook.org/ > > http://aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html > > > stef > > -- Serge Stinckwich UCBN & UMI UMMISCO 209 (IRD/UPMC) Every DSL ends up being Smalltalk http://www.doesnotunderstand.org/ |
In reply to this post by stepharo
Thanks for the link.
In-place parsing is a non-starter because it means storing the entire input as a string in memory, so you could only parse files that fit in Pharo's address space. The multi-gigabyte OpenStreetMap docs the article mentions would be unparsable with SAX in a 32-bit VM. Linked lists for storing child nodes is common (LibXML2 and Xerces do it) and provides constant time insertion and sibling access, but arrays/vectors (Arrays/OrderedCollections) are more cache friendly and faster for sequential access and in Pharo are almost always the correct choice. There is always the option of an FFI-based parser, but it shouldn't be a hybrid like Python's minidom (FFI Expat with a Python DOM implementation), because something like that already exists in Smalltalk/X (FFI Expat with a Smalltalk DOM) and it was slower than a St/X port of XMLParser in my tests (I assume due to the FFI overhead), so it's probably not worth it. But a non-hybrid parser with everything (including the DOM) done in C should definitely be faster. > Sent: Wednesday, July 13, 2016 at 10:27 AM > From: stepharo <[hidden email]> > To: "Pharo Development List" <[hidden email]> > Subject: [Pharo-dev] tricks for XML parsing. > > Hi guys > > these free books may be interesting for you > > http://aosabook.org/ > > http://aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html > > > stef > > > |
On Thu, 2016-07-14 at 01:58 +0200, monty wrote:
> Thanks for the link. > > In-place parsing is a non-starter because it means storing the entire > input as a string in memory, so you could only parse files that fit > in Pharo's address space. The multi-gigabyte OpenStreetMap docs the > article mentions would be unparsable with SAX in a 32-bit VM. I do not understand. I only know expat which does - AFAIK - in-place parsing and surelt does not need the whole input in memory. > There is always the option of an FFI-based parser, but it shouldn't > be a hybrid like Python's minidom (FFI Expat with a Python DOM > implementation), > because something like that already exists in Smalltalk/X (FFI Expat > with a Smalltalk DOM) I guess you refer to the implementation I did ages ago. > and it was slower than a St/X port of XMLParser in my tests (I assume > due to the FFI overhead), so it's probably not worth it. Very, very interesting. Where can I find the benchmarks? I just run a very simple benchmark on 112MB document (http://www.xml-be nchmark.org/downloads.html) and results are quite the opposite: Benchmark resut: Generated at :14-07-2016 07:32:25 AM Benchmark Execution Time [ms] # of M&S GCs [1] # of newspace GCs [1] Parameters BenchmarkXML SAX - VW 93418 0 2060 SAX - XMLSuite 9921 0 410 As you can see, the latter is roughly 10 times faster. I agree my implementation which uses Expat is clearly suboptimal and need to be improved (for example it does not use a ILC-based send to driver so you have a lot of cache misses and does a lot of unnecessary memcpy()s, but this can be easily improved) Jan > But a non-hybrid parser with everything (including the DOM) done in C > should definitely be faster. > > > Sent: Wednesday, July 13, 2016 at 10:27 AM > > From: stepharo <[hidden email]> > > To: "Pharo Development List" <[hidden email]> > > Subject: [Pharo-dev] tricks for XML parsing. > > > > Hi guys > > > > these free books may be interesting for you > > > > http://aosabook.org/ > > > > http://aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html > > > > > > stef > > > > > > > |
In reply to this post by monty-3
thanks for the analysis :)
I just skimmed over the tabl of contents and this caugth my eyes. Le 14/7/16 à 01:58, monty a écrit : > Thanks for the link. > > In-place parsing is a non-starter because it means storing the entire input as a string in memory, so you could only parse files that fit in Pharo's address space. The multi-gigabyte OpenStreetMap docs the article mentions would be unparsable with SAX in a 32-bit VM. > > Linked lists for storing child nodes is common (LibXML2 and Xerces do it) and provides constant time insertion and sibling access, but arrays/vectors (Arrays/OrderedCollections) are more cache friendly and faster for sequential access and in Pharo are almost always the correct choice. > > There is always the option of an FFI-based parser, but it shouldn't be a hybrid like Python's minidom (FFI Expat with a Python DOM implementation), because something like that already exists in Smalltalk/X (FFI Expat with a Smalltalk DOM) and it was slower than a St/X port of XMLParser in my tests (I assume due to the FFI overhead), so it's probably not worth it. But a non-hybrid parser with everything (including the DOM) done in C should definitely be faster. > >> Sent: Wednesday, July 13, 2016 at 10:27 AM >> From: stepharo <[hidden email]> >> To: "Pharo Development List" <[hidden email]> >> Subject: [Pharo-dev] tricks for XML parsing. >> >> Hi guys >> >> these free books may be interesting for you >> >> http://aosabook.org/ >> >> http://aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html >> >> >> stef >> >> >> > |
In reply to this post by Jan Vrany
> Sent: Thursday, July 14, 2016 at 2:50 AM > From: "Jan Vrany" <[hidden email]> > To: [hidden email] > Subject: Re: [Pharo-dev] tricks for XML parsing. > > On Thu, 2016-07-14 at 01:58 +0200, monty wrote: > > Thanks for the link. > > > > In-place parsing is a non-starter because it means storing the entire > > input as a string in memory, so you could only parse files that fit > > in Pharo's address space. The multi-gigabyte OpenStreetMap docs the > > article mentions would be unparsable with SAX in a 32-bit VM. > > I do not understand. I only know expat which does - AFAIK - in-place > parsing and surelt does not need the whole input in memory. From the article, footnote 3: "This creates a lifetime dependency–the entire source buffer must outlive all document nodes for the technique to work" > > There is always the option of an FFI-based parser, but it shouldn't > > be a hybrid like Python's minidom (FFI Expat with a Python DOM > > implementation), > > because something like that already exists in Smalltalk/X (FFI Expat > > with a Smalltalk DOM) > > I guess you refer to the implementation I did ages ago. > > > and it was slower than a St/X port of XMLParser in my tests (I assume > > due to the FFI overhead), so it's probably not worth it. > > Very, very interesting. Where can I find the benchmarks? This was well over a year ago, and it was DOM parsing. I was testing if St/X (your branch, I think) could be supported by XMLParser in addition to Pharo, Squeak, and GS, but I ran into too many incompatibilities, like Monticello not working (had to load in .st files), #new not sending #initialize, not being able to modify the value of a dictionary association directly, #lf/#cr weirdness, so I gave up. But not before hacking it enough to kind-of run and compared it with the other parsers. > I just run a very simple benchmark on 112MB document (http://www.xml-be > nchmark.org/downloads.html) and results are quite the opposite: > > Benchmark resut: > Generated at :14-07-2016 07:32:25 AM > > Benchmark Execution Time [ms] # of M&S GCs > [1] # of newspace GCs [1] Parameters > BenchmarkXML > SAX - > VW 93418 0 > 2060 > SAX - > XMLSuite 9921 0 > 410 > > As you can see, the latter is roughly 10 times faster. That's the VW parser, which is slower than XMLParser. And again, it was of DOM parsing. > I agree my implementation which uses Expat is clearly suboptimal > and need to be improved (for example it does not use a ILC-based > send to driver so you have a lot of cache misses and does a lot > of unnecessary memcpy()s, but this can be easily improved) Your implementation was fine, and particularly, its XPath/Query was very impressive. I wasn't attacking you. My point was just the hybrid approach built on Expat (which is a non-validating parser, BTW) should be avoided, in case anyone is considering it, based on my experience with minidom vs lxml.etree in Python and with St/X's v2 parser. A parser based on LibXLM2, Xerces, or something else for SAX, DOM, XPath, etc is probably a better way of creating an alternative to pure-Smalltalk parsers. > Jan > > > > But a non-hybrid parser with everything (including the DOM) done in C > > should definitely be faster. > > > > > Sent: Wednesday, July 13, 2016 at 10:27 AM > > > From: stepharo <[hidden email]> > > > To: "Pharo Development List" <[hidden email]> > > > Subject: [Pharo-dev] tricks for XML parsing. > > > > > > Hi guys > > > > > > these free books may be interesting for you > > > > > > http://aosabook.org/ > > > > > > http://aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html > > > > > > > > > stef > > > > > > > > > > > > > |
On Thu, 2016-07-14 at 11:11 +0200, monty wrote:
> > > Sent: Thursday, July 14, 2016 at 2:50 AM > > From: "Jan Vrany" <[hidden email]> > > To: [hidden email] > > Subject: Re: [Pharo-dev] tricks for XML parsing. > > > > On Thu, 2016-07-14 at 01:58 +0200, monty wrote: > > > Thanks for the link. > > > > > > In-place parsing is a non-starter because it means storing the > > > entire > > > input as a string in memory, so you could only parse files that > > > fit > > > in Pharo's address space. The multi-gigabyte OpenStreetMap docs > > > the > > > article mentions would be unparsable with SAX in a 32-bit VM. > > > > I do not understand. I only know expat which does - AFAIK - in- > > place > > parsing and surelt does not need the whole input in memory. > > From the article, footnote 3: "This creates a lifetime dependency–the > entire source buffer must outlive all document nodes for the > technique to work" Ah, I see. I assumed that the consumer would copy data if it needs to retain them...then the parser can do in-place tricks. Perhaps the difference is that I do not consider creation of a DOM being part of "parsing" (I'm not saying it's not, just that this is a way I think of the problem which may cause confusion :-) > > > > There is always the option of an FFI-based parser, but it > > > shouldn't > > > be a hybrid like Python's minidom (FFI Expat with a Python DOM > > > implementation), > > > because something like that already exists in Smalltalk/X (FFI > > > Expat > > > with a Smalltalk DOM) > > > > I guess you refer to the implementation I did ages ago. > > > > > and it was slower than a St/X port of XMLParser in my tests (I > > > assume > > > due to the FFI overhead), so it's probably not worth it. > > > > Very, very interesting. Where can I find the benchmarks? > > This was well over a year ago, and it was DOM parsing. I was testing > if St/X (your branch, I think) could be supported by XMLParser in > addition to Pharo, Squeak, and GS, but I ran into too many > incompatibilities, like Monticello not working (had to load in .st > files), #new not sending #initialize, not being able to modify the > value of a dictionary association directly, #lf/#cr weirdness, so I > gave up. But not before hacking it enough to kind-of run and compared > it with the other parsers. Fair enough. If you want to discuss these problems, I'd be happy to help. > > > I just run a very simple benchmark on 112MB document (http://www.xm > > l-be > > nchmark.org/downloads.html) and results are quite the opposite: > > > > Benchmark resut: > > Generated at :14-07-2016 07:32:25 AM > > > > Benchmark Execution Time [ms] # of M&S GCs > > [1] # of newspace GCs [1] Parameters > > BenchmarkXML > > SAX - > > VW 93418 0 > > > > 2060 > > SAX - > > XMLSuite 9921 0 > > > > 410 > > > > As you can see, the latter is roughly 10 times faster. > > That's the VW parser, which is slower than XMLParser. And again, it > was of DOM parsing. I see. I got confused since VW's parser class is XMLParser... > > > I agree my implementation which uses Expat is clearly suboptimal > > and need to be improved (for example it does not use a ILC-based > > send to driver so you have a lot of cache misses and does a lot > > of unnecessary memcpy()s, but this can be easily improved) > > Your implementation was fine, and particularly, its XPath/Query was > very impressive. I wasn't attacking you. No worry, I know you were not! Just that I was surprised since my observations - performance wise - were very different. Now I see we were talking about different things. > My point was just the hybrid approach built on Expat (which is a non- > validating parser, BTW) should be avoided, in case anyone is > considering it, based on my experience with minidom vs lxml.etree in > Python and with St/X's v2 parser. A parser based on LibXLM2, Xerces, > or something else for SAX, DOM, XPath, etc is probably a better way > of creating an alternative to pure-Smalltalk parsers. I tend to disagree, but I think we have different points of view, that;s why. I have to admit I have not much experience (and need) with XML. Anyways, I'd like to discuss this more to understand the details. Perhaps privately as it's not strictly Pharo-development related. Best, Jan > |
Free forum by Nabble | Edit this page |