Dear all,
I want to process large XML files (typically event logs with more than 80 MB of data), but I am not able to do that at the moment with the XML DOM parser (it says I reach the read limit after 3094 XML lines), and I guess I will have problems managing such a large file in memory after that.

Should I switch to an event-driven XML parser in order to avoid loading the whole XML file into memory? Do we have such a parser for Pharo?

--
Serge Stinckwich
UCBN & UMI UMMISCO 209 (IRD/UPMC)
Every DSL ends up being Smalltalk
http://www.doesnotunderstand.org/
_______________________________________________
Moose-dev mailing list
[hidden email]
https://www.iam.unibe.ch/mailman/listinfo/moose-dev
I encountered a similar situation before, but there is a way to go beyond that limit.

I do not have the code at my disposal right now, but look at the XMLDOMParser constructor, and at some point you will see a hardcoded limit value. You should be able to pass another one in.

Cheers,
Doru
"Every thing has its own flow"
(In reply to Serge Stinckwich, Fri, Jun 6, 2014 at 11:22 AM)
Hi,

OK, I looked a bit, and here is a way to get it to work. It is not at all ideal, given that it loads the entire string before parsing it, but it will surely work with an XML file of your size:
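"Read the whole file into a single string."
contents := fileReference readStreamDo: [ :stream | stream contents ].
"Raise the read limit to the size of the document, then build the DOM tree."
svnlog := (XMLDOMParser on: contents)
	documentReadLimit: contents size;
	parseDocument.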
Cheers,
Doru
(In reply to Tudor Girba, Fri, Jun 6, 2014 at 12:50 PM)
Serge, let us know how it goes. What you are trying to do is really important.
Alexandre Bergel
http://www.bergel.eu
(In reply to Tudor Girba, Jun 7, 2014, at 4:56 PM)
In reply to this post by Tudor Girba-2
Thank you Doru for the help.
First case: a 74 MB XML file. I am able to load the string into memory, but the XML parser crashes the image before reaching the end.

Second case: a 10 MB XML file. I am able to load the string into memory and parse the XML file. When I start to process the parsed XML in order to create an object structure, the "Space is too low" panel appears.

I would like to give the XMLPullParser a try. I guess it is working correctly, because I found it in the Configuration Browser.

--
Serge Stinckwich
(In reply to Tudor Girba, Sat, Jun 7, 2014 at 10:56 PM)
Serge, you are using a Mac, aren't you? Have you tried increasing the memory available to the image? Open the Info.plist file, contained in the VM folder, with a text editor. From what I found on the internet, 1880000000 is apparently the biggest value possible.

Cheers,
Alexandre
(In reply to Serge Stinckwich, Jun 8, 2014, at 4:35 PM)
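The message as archived does not name the setting to change in Info.plist. As an assumption (the key name is not given anywhere in this thread), the entry in question is probably SqueakMaxHeapSize, which the Mac VM reads at startup. A sketch of the edited entry, using the value Alexandre mentions:

```xml
<!-- Assumed key name; the thread itself only gives the file and the value. -->
<key>SqueakMaxHeapSize</key>
<integer>1880000000</integer>
```

The VM reads this file when it launches, so the image has to be restarted for a new value to apply.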
Thank you for the trick, Alex. We should put this information somewhere, maybe in a tuning chapter of the Pharo Enterprise book?

Unfortunately, even with the trick, I was unable to parse a 10 MB XML file: Pharo's memory grows to 1.9 GB and the image crashes after that. I will now try the pull parser.

--
Serge Stinckwich
(In reply to Alexandre Bergel, Sun, Jun 8, 2014 at 11:05 PM)
I think you are doing something wrong. Could you pass me the XML file?

Doru
(In reply to Serge Stinckwich, Mon, Jun 9, 2014 at 9:49 PM)
You can try this in a fresh Moose 5.0 image:

==================================================================
| graph builder contents document |

"Load the XES package."
Gofer it
	package: 'Moose-XES';
	load.

"Download the gzipped event log and decompress it into a string."
contents := ((GZipReadStream on: 'http://data.3tu.nl/repository/uuid:f5ea9bc6-536f-4744-9c6f-9eb45a907178/DATA1' asUrl retrieveContents) upToEnd) asString.

"Parse the whole document into a DOM tree, then build the graph from it."
document := (XMLDOMParser on: contents)
	documentReadLimit: contents size;
	parseDocument.
graph := (XESAlphaAlgorithm on: (XESParser new parseDocument: document)) run.

"Draw transitions as labelled ellipses and places as boxes."
builder := RTGraphBuilder new.
builder nodes
	if: [ :m | m type = #transition ];
	shape: (RTEllipse new size: 20) + RTLabel.
builder nodes
	if: [ :m | m type = #place ];
	shape: (RTBox new size: 20).
builder edges
	seed: graph edges;
	connectFrom: #from;
	connectTo: #to;
	useInLayout.
builder layout forceCharge: -350.
builder addAll: graph nodes.
builder open.
==================================================================
The problem is not parsing the XML file but generating the object tree after that.

Thank you.

--
Serge Stinckwich
(In reply to Tudor Girba, Mon, Jun 9, 2014 at 10:15 PM)
I am not sure I understand. The issue is not the XML loading at all. Instead, it seems to be in the code that produces the model out of the DOM tree. In this case, why would you look for the PullXMLParser?
Doru
(In reply to Serge Stinckwich, Mon, Jun 9, 2014 at 10:41 PM)
On Wed, Jun 11, 2014 at 4:22 PM, Tudor Girba <[hidden email]> wrote:
> I am not sure I understand.
>
> The issue is not the XML loading at all. Instead, it seems to be in the code that produces the model out of the DOM tree.

Yes, you are right.

> In this case, why would you look for the PullXMLParser?

Saving some memory space for my model ;-)

--
Serge Stinckwich
I see. But I think your problem is of a different nature, given that you said that it went up to 1.9 GB :)

Doru
(In reply to Serge Stinckwich, Wed, Jun 11, 2014 at 4:38 PM)
Hmm, you are right. I have to think about how to reduce the footprint of my model creation...
I don't have time to wait for Spur ;-)

--
Serge Stinckwich
(In reply to Tudor Girba, Wed, Jun 11, 2014 at 4:53 PM)
I am *very* surprised by this thread, given that I have reported several times on the Pharo mailing lists that XML DOM parsing is not the right solution for processing large XML files. XML DOM parsing is not even adequate for files larger than 10 MB. Why do some people insist on using it?

The XMLPullParser is the right way to go.

Cheers,
Hernán
(In reply to Serge Stinckwich, 2014-06-11 12:01 GMT-03:00)
Hi,

I would distinguish between use cases. XMLDOM is extremely convenient for prototyping, in particular in conjunction with GTInspector. I do not have to leave the inspection context to write code somewhere else and come back at a later time. Instead, I focus on extracting what I want. I have worked like this with XML files of ~100 MB without any problems other than the funny default size limit that you have to override explicitly.

However, it is true that I would not recommend it for long-term usage.

Cheers,
Doru
(In reply to Hernán Morales Durand, Sat, Jun 14, 2014 at 11:00 PM)
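The event-driven approach asked about at the start of the thread can be sketched roughly as follows. This is only an illustration under assumptions, not code from any of the posters: it uses a SAX-style handler rather than the XMLPullParser, it assumes that SAXHandler from the same XML parser package accepts the on:, documentReadLimit: and parseDocument messages used on XMLDOMParser above, and the class name XESEventCounter, its category, the eventCount variable, and the 'event' element name are all hypothetical. Methods are shown in the usual Class >> selector notation and would be entered through the browser.

```smalltalk
"Hypothetical handler that counts elements without building a DOM tree."
SAXHandler subclass: #XESEventCounter
	instanceVariableNames: 'eventCount'
	classVariableNames: ''
	category: 'XES-Sketch'.

"Callback invoked once when parsing starts."
XESEventCounter >> startDocument
	eventCount := 0

"Callback invoked for each opening tag; nothing is kept once it returns.
 The element name 'event' is an assumption about the log format."
XESEventCounter >> startElement: aQualifiedName attributes: aDictionary
	aQualifiedName = 'event'
		ifTrue: [ eventCount := eventCount + 1 ]

"Accessor for the result."
XESEventCounter >> eventCount
	^ eventCount

"Usage in a workspace, reusing the contents string from the earlier snippet."
counter := XESEventCounter on: contents.
counter documentReadLimit: contents size.
counter parseDocument.
counter eventCount.
```

Note that this still reads the file into one string, as in the earlier snippet; what it avoids is holding the DOM tree next to the model being built, which is where the memory went in the cases reported above.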