Hi all:
I'm reviving an old thread now that I've found a few minutes to look at this question again: what is the best way to parse XML data with a SAX parser in Pharo Smalltalk?

I tried to look at the GenomeTools project that Miguel references below, but it seems that the class he mentions (GTNCBIBlastParser) is no longer in it. Perhaps there's a newer, better example of how to drive the SAX parser somewhere?

> Message: 4
> Date: Tue, 20 Jul 2010 12:25:29 -0500
> From: Miguel Enrique Cobá Martínez <[hidden email]>
> Subject: Re: [Pharo-project] Markup Builder in Smalltalk (XMLWriter)
> To: [hidden email]
> Message-ID: <[hidden email]>
> Content-Type: text/plain; charset="UTF-8"
>
> This good summary should go directly into the collaborative book.
>
> On Tue, 20-07-2010 at 14:11 -0300, Hernán Morales Durand wrote:
>> An XML parser just creates a representation of an XML document according
>> to a parsing model. Ideally you should choose an XML parser
>> specifically for your needs. You have different parsing models:
>>
>> -Tree Parser: This is what you will read everywhere as the "DOM parser"
>> -Event Parser: This is denoted by S*X and could be
>> --SAX Parser: Known as the "Push parser"
>> --StAX Parser: Also known as the "Pull parser"
>> -VTD Parser: This is known as "Virtual Token Descriptor"
>>
>> Now there are several classifications depending on the parser
>> characteristics and on what you want to do, or how. You may be interested
>> in:
>>
>> Making modifications or just processing?
>> -For modifications: The parser creates long-lived representations from
>> the XML document (necessary for modifications): You should choose DOM
>> or VTD
>> --You *need* to query or modify the objects (parser creates nodes): DOM
>> --You do not need the objects (parser creates integers and location
>> caches): VTD
>> -For processing: The parser doesn't create long-lived objects: SAX or StAX.
>>
>> Type of Access
>> -Back-and-forth: Access the data after the parsing is complete: DOM or VTD
>> --Massive or very frequent access: Choose DOM
>> --Rare or simple access: Choose VTD
>> -Sequential: Access the data while you're processing the document: SAX or StAX
>> --Processing all tokens: SAX
>> --Processing only the tokens of interest (allows skipping forward): StAX
>>
>> Briefly
>> -Streaming applications (very large documents): SAX or StAX
>> -Database applications: DOM or VTD
>> -Hardware acceleration?: VTD
>>
>> For the S*X parsers you need to know the XML token types because, for
>> example in the case of XMLParser in Pharo/Squeak, you would probably
>> subclass SAXHandler and override one or several methods in the content
>> category to do your own processing. See GTNCBIBlastParser in
>> http://www.squeaksource.com/GenomeTools.html for an example of a SAX
>> parser.
>>
>> XML token types:
>> Start element: <Hit>....
>> End element: ...</Hit>
>> Text: <...>Text value</...>
>> etc.
>>
>> For DOM usage examples you may see
>> http://community.ofset.org/index.php/Les_bases_de_XML_dans_Squeak (it
>> is in French but is a good document)
>>
>> What we have in Pharo/Squeak
>>
>> Parsers:
>> 1) XMLParser: Supports SAX and DOM. http://www.squeaksource.com/XMLSupport.html
>> 2) VWXML Parser: Supports SAX and DOM (AFAIK)
>> http://www.squeaksource.com/VWXML.html
>> 3) XMLPullParser: Supports StAX. http://www.squeaksource.com/XMLPullParser.html
>>
>> XML Query tools
>> 1) Pastell: Supports XPath-like queries. Requires XMLParser.
>> http://www.squeaksource.com/Pastell.html
>> 2) XPath library: Supports XPath partially. Requires XMLParser.
>> http://www.squeaksource.com/XPath.html
>>
>> There are several additional tools in SqueakSource that I haven't reviewed yet.
>> A VTD parser would be ideal for Smalltalk because it uses integer
>> arrays, reducing the object allocation overhead in memory. I haven't
>> found an implementation of an XML VTD parser in Smalltalk as of today.
>> Cheers,
>>

Thanks,
--
Larry Gadallah, VE6VQ/W7
lgadallah AT gmail DOT com
PGP Sig: 917E DDB7 C911 9EC1 0CD9 C06B 06C4 835F 0BB8 7336
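
For readers landing on this thread, the approach described in the quoted summary (subclass SAXHandler and override the content callbacks) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the GenomeTools code: the class BlastHitIdCollector, its instance variables, the package name and the sample document are hypothetical, and the selectors startElement:attributes:, characters:, endElement:, on: and parseDocument are assumed to match the XMLParser version loaded in your image. Older versions may name the callbacks differently, so check the content category of SAXHandler before relying on them. The "Class >> selector" headers are only the usual way of showing methods in text; each method is entered in the browser under that class.

```smalltalk
"Hypothetical handler that collects the text of every <Hit_id> element."
SAXHandler subclass: #BlastHitIdCollector
    instanceVariableNames: 'hitIds buffer'
    classVariableNames: ''
    package: 'SAX-Example'.

BlastHitIdCollector >> hitIds
    "Answer the ids collected so far (created lazily)."
    ^ hitIds ifNil: [ hitIds := OrderedCollection new ]

BlastHitIdCollector >> startElement: aQualifiedName attributes: aDictionary
    "Assumed callback name. Start buffering only for the element of interest."
    aQualifiedName = 'Hit_id'
        ifTrue: [ buffer := WriteStream on: String new ]

BlastHitIdCollector >> characters: aString
    "Text can arrive in several chunks, so append to the stream."
    buffer ifNotNil: [ buffer nextPutAll: aString ]

BlastHitIdCollector >> endElement: aQualifiedName
    "Assumed callback name. Keep the buffered text and stop buffering."
    (aQualifiedName = 'Hit_id' and: [ buffer notNil ])
        ifTrue: [
            self hitIds add: buffer contents.
            buffer := nil ]

"Workspace usage, assuming on: accepts a string or a stream."
| handler |
handler := BlastHitIdCollector on: '<Hit><Hit_id>example-1</Hit_id></Hit>'.
handler parseDocument.
handler hitIds
```

Buffering text in a WriteStream and storing it once per element keeps the handler from concatenating strings on every characters: call, which matters for multi-megabyte inputs.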
Hi Larry,
I renamed the class to GTBLASTResultsParser. To try it, you can evaluate

    GTBLASTResultReader new
        results: (FileStream readOnlyFileNamed: 'K9W70UJU014-Alignment.xml');
        hitNodes

which takes about 50 seconds on my machine for the alignment file of a nucleotide BLAST with 500 target sequences (an 18 MB file). I suspect that a considerable part of that time is spent adding strings to a result collection, and that it could be improved by choosing the collections carefully. I haven't started thinking about optimizations yet, so just use it as a sample to see how it works.

If anyone wants the sample alignment, just drop me an e-mail.

Cheers,

2011/1/17 Larry Gadallah <[hidden email]>:
> Hi all:
>
> I'm reviving a thread from long ago now that I've gotten a few minutes
> to look at this question again: How is XML data best parsed using a
> SAX parser in Pharo Smalltalk?
>
> I tried to look at the GenomeTools project that Miguel references
> below, but it seems that the class he mentions (GTNCBIBlastParser) is
> no longer in it. Perhaps there's a newer, better example of how to
> drive the SAX parser somewhere?

--
Hernán Morales
Information Technology Manager,
Institute of Veterinary Genetics.
National Scientific and Technical Research Council (CONICET).
La Plata (1900), Buenos Aires, Argentina.
Telephone: +54 (0221) 421-1799.
Internal: 422
Fax: 425-7980 or 421-1799.
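
Before changing any collections, it may be worth confirming where the 50 seconds actually go. One possible way, sketched here on the assumption that GenomeTools is loaded, that the alignment file sits in the image's working directory, and that MessageTally and Time respond to these selectors in your image, is to wrap the evaluation from the reply above in the profiler:

```smalltalk
"Opens a report of where the time was spent during the block."
MessageTally spyOn: [
    GTBLASTResultReader new
        results: (FileStream readOnlyFileNamed: 'K9W70UJU014-Alignment.xml');
        hitNodes ].

"Or just measure the elapsed time, in milliseconds, to compare two variants."
Time millisecondsToRun: [
    GTBLASTResultReader new
        results: (FileStream readOnlyFileNamed: 'K9W70UJU014-Alignment.xml');
        hitNodes ]
```

If the report shows most of the time under collection growth or string copying, that would support the suspicion about the result collection; if it is dominated by the tokenizer, changing collections is unlikely to help much.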