Example of XML SAX Parser usage

Larry Gadallah
Hi all:

I'm reviving a thread from long ago now that I've gotten a few minutes
to look at this question again: How is XML data best parsed using a
SAX parser in Pharo Smalltalk?

I tried to look at the GenomeTools project that Miguel references
below, but it seems that the class he mentions (GTNCBIBlastParser) is
no longer in it. Perhaps there's a newer, better example of how to
drive the SAX parser somewhere?

> Message: 4
> Date: Tue, 20 Jul 2010 12:25:29 -0500
> From: Miguel Enrique Cobá Martínez <[hidden email]>
> Subject: Re: [Pharo-project] Markup Builder in Smalltalk (XMLWriter)
> To: [hidden email]
> Message-ID: <[hidden email]>
> Content-Type: text/plain; charset="UTF-8"
>
> This good summary should go directly to the collaborative book.
>
>
> On Tue, 20-07-2010 at 14:11 -0300, Hernán Morales Durand wrote:
>> An XML parser just creates a representation of an XML document according
>> to a parsing model. Ideally you should choose an XML parser
>> specifically for your needs. You have different parsing models:
>>
>> -Tree Parser: This is what you will read everywhere as the "DOM parser"
>> -Event Parser: This is denoted by S*X and could be
>> --SAX Parser: Known as the "Push parser"
>> --StAX Parser: Also known as the "Pull parser"
>> -VTD Parser : This is known as "Virtual Token Descriptor"
>>
>> Now there are several classifications depending on the parser
>> characteristics and on what you want to do and how. You may be
>> interested in:
>>
>> Making modifications or just processing?
>> -For modifications: The parser creates long-lived representations from
>> the XML document (necessary for modifications): You should choose DOM
>> or VTD
>> --Do you *need* to query or modify the objects (parser creates nodes): DOM
>> --You do not need the objects (parser creates integers and location
>> caches): VTD
>> -For processing: The parser doesn't create long-lived objects: SAX or StAX.
>>
>> Type of Access
>> -Back-and-forth: Access the data after the parsing is complete: DOM or VTD
>> --Massive or very frequent access: Choose DOM
>> --Rare or simple access: Choose VTD
>> -Sequential: Access the data while you're processing the document: SAX or StAX
>> --Processing all tokens: SAX
>> --Processing only the tokens of interest (allows skipping forward): StAX
>>
>> Briefly
>> -Streaming applications (very large documents): SAX or StAX
>> -Database applications: DOM or VTD
>> -Hardware acceleration?: VTD
>>
>> For the S*X parsers you need to know the XML token types because, for
>> example in the case of XMLParser in Pharo/Squeak, you probably would
>> subclass SAXHandler and override one or several methods in the content
>> category to do your own processing. See GTNCBIBlastParser in
>> http://www.squeaksource.com/GenomeTools.html for an example of a SAX
>> Parser.
>>
>> XML token types:
>> Start element: <Hit>....
>> End element: ...</Hit>
>> Text: <...>Text value</...>
>> etc.
>>
>> For DOM usage examples you may see
>> http://community.ofset.org/index.php/Les_bases_de_XML_dans_Squeak (it
>> is in French but is a good document)
>>
>> What we have in Pharo/Squeak
>>
>> Parsers:
>> 1) XMLParser : Supports SAX and DOM. http://www.squeaksource.com/XMLSupport.html
>> 2) VWXML Parser : Supports SAX and DOM (AFAIK)
>> http://www.squeaksource.com/VWXML.html
>> 3) XMLPullParser : Supports StAX. http://www.squeaksource.com/XMLPullParser.html
>>
>> XML Query tools
>> 1) Pastell : Supports X-Path like queries. Requires XMLParser.
>> http://www.squeaksource.com/Pastell.html
>> 2) XPath library : Supports XPath partially. Requires XMLParser.
>> http://www.squeaksource.com/XPath.html
>>
>> There are several additional tools in SqueakSource that I haven't
>> reviewed yet. A VTD parser would be ideal for Smalltalk because it uses
>> integer arrays, reducing the object allocation overhead in memory. I
>> haven't found an implementation of an XML VTD parser in Smalltalk as of
>> today.
>> Cheers,
>>

Thanks,
--
Larry Gadallah, VE6VQ/W7                          lgadallah AT gmail DOT com
PGP Sig: 917E DDB7 C911 9EC1 0CD9  C06B 06C4 835F 0BB8 7336

Re: Example of XML SAX Parser usage

hernanmd
Hi Larry,

I renamed the class to GTBLASTResultsParser. To try it, you may evaluate:

GTBLASTResultReader new
                results: ( FileStream readOnlyFileNamed: 'K9W70UJU014-Alignment.xml' );
                hitNodes

which takes about 50 seconds on my machine for an alignment file from a
nucleotide BLAST with 500 target sequences (an 18 MB file). I suspect a
considerable part of that time is spent adding strings to a result
collection, and choosing the collections carefully may improve it. I
haven't started to think about optimizations yet, so just use it as a
sample to see how it works.

If anyone wants the sample alignment just drop me an e-mail.
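
For anyone without GenomeTools loaded, the general pattern from the
quoted summary (subclass SAXHandler and override methods in the content
category) can be sketched roughly as below. This is only a minimal
sketch under assumptions: the class name HitDefCollector, the package
name SAXExample, the element name Hit_def and the tiny inline document
are all made up for illustration, and the selectors are the ones found
in later XMLParser releases, so they may differ in the version you have
loaded. Each "HitDefCollector >> ..." heading names one method to define
in the class browser; it is not itself an expression to evaluate.

```smalltalk
"Hypothetical handler class; evaluate this definition first."
SAXHandler subclass: #HitDefCollector
	instanceVariableNames: 'definitions buffer'
	classVariableNames: ''
	package: 'SAXExample'.

HitDefCollector >> definitions
	"Answer the element texts collected so far."
	^ definitions ifNil: [ definitions := OrderedCollection new ]

HitDefCollector >> startElement: aQualifiedName attributes: aDictionary
	"Start element event: begin buffering text when a <Hit_def> opens."
	aQualifiedName = 'Hit_def'
		ifTrue: [ buffer := WriteStream on: String new ]

HitDefCollector >> characters: aString
	"Text event: text may arrive in several pieces, so append to the buffer."
	buffer ifNotNil: [ buffer nextPutAll: aString ]

HitDefCollector >> endElement: aQualifiedName
	"End element event: store the buffered text and stop buffering."
	(aQualifiedName = 'Hit_def' and: [ buffer notNil ])
		ifTrue: [
			self definitions add: buffer contents.
			buffer := nil ]

"Usage, evaluated in a Playground (assumes parse: answers the handler)."
(HitDefCollector parse:
	'<Hits><Hit><Hit_def>first</Hit_def></Hit><Hit><Hit_def>second</Hit_def></Hit></Hits>')
		definitions.
```

The handler keeps only the strings it is interested in and never builds
a tree, which is the point of the push model for large files.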
Cheers,

2011/1/17 Larry Gadallah <[hidden email]>:

> Hi all:
>
> I'm reviving a thread from long ago now that I've gotten a few minutes
> to look at this question again: How is XML data best parsed using a
> SAX parser in Pharo Smalltalk?
>
> I tried to look at the GenomeTools project that Miguel references
> below, but it seems that the class he mentions (GTNCBIBlastParser) is
> no longer in it. Perhaps there's a newer, better example of how to
> drive the SAX parser somewhere?
>
> Thanks,
> --
> Larry Gadallah, VE6VQ/W7                          lgadallah AT gmail DOT com
> PGP Sig: 917E DDB7 C911 9EC1 0CD9  C06B 06C4 835F 0BB8 7336
>
>



--
Hernán Morales
Information Technology Manager,
Institute of Veterinary Genetics.
National Scientific and Technical Research Council (CONICET).
La Plata (1900), Buenos Aires, Argentina.
Telephone: +54 (0221) 421-1799.
Internal: 422
Fax: 425-7980 or 421-1799.