I have a CSV file that has several subsections, each with its own format. What I'd like to do is parse one, reset the NeoCSVReader, set it up for the next section, and continue. I didn't see an API for this. Is it possible? Thanks.
Cheers,
Sean
There is no way to perform this with NeoCSV or any other CSV framework I'm aware of.

I had to deal with that kind of "format" (which is likely an export format), and the way to deal with it was to process each "segment" using a different instance of the CSV reader. The segments were located in the stream using the delimiting heuristics of your choice (headers, blank lines, etc.), and then each segment was extracted and passed as an argument to the reader for that segment.

The drawback was that if the file was big there was no way to have "a stream over a stream" (like a window function), so passing the segment to the reader implied copying the string contents between the segment delimiters.

It's something I have already put some thought into, but never had the will to code and share publicly.

Regards,

Esteban A. Maringolo

2017-07-26 12:02 GMT-03:00 Sean P. DeNigris <[hidden email]>:
> I have a CSV file that has several subsections, each with its own format.
> What I'd like to do is parse one, reset the NeoCSVReader, set it up for the
> next section, and continue. I didn't see an API for this. Is it possible?
> Thanks.
>
> -----
> Cheers,
> Sean
> --
> View this message in context: http://forum.world.st/NeoCSV-on-Irregular-Files-tp4956850.html
> Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.
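As a rough illustration of the segment-per-reader approach described above, here is a minimal sketch in Python rather than Pharo, so it does not use NeoCSV at all. It assumes that segments are separated by blank lines and that each segment only differs in its field delimiter; the function names `split_segments` and `parse_segments` are made up for this example. Note that it copies each segment into its own string, which is exactly the drawback mentioned for large files.

```python
import csv
import io


def split_segments(text):
    """Split text into segments separated by one or more blank lines."""
    segments, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the segment collected so far.
            segments.append("\n".join(current))
            current = []
    if current:
        segments.append("\n".join(current))
    return segments


def parse_segments(text, delimiters):
    """Parse each segment with its own, separately configured csv.reader.

    Assumption: one delimiter is supplied per segment, in order.
    zip() silently stops at the shorter of the two sequences.
    """
    results = []
    for segment, delimiter in zip(split_segments(text), delimiters):
        reader = csv.reader(io.StringIO(segment), delimiter=delimiter)
        results.append(list(reader))
    return results


sample = "name,qty\nbolt,4\n\nid;depth;pressure\n1;100.5;3.2\n"
tables = parse_segments(sample, [",", ";"])
```

A quoted field that itself contains a blank line would be split incorrectly by this heuristic, so it only fits formats where blank lines never occur inside a record.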
I agree.
If the file is non-homogeneous, it is no longer CSV by definition.

Holding on to the original stream and creating a new reader for each section is one option; another could be to add a #reset method.

The big question is how to know where one section begins or ends. NeoCSVReader holds a one-character buffer, so you could peek for something, just maybe. Then you could discover the section switches while parsing (a bit like how #upToEnd uses #atEnd, you could add an #atSectionEnd). But it all depends on your specific format.

> On 26 Jul 2017, at 17:45, Esteban A. Maringolo <[hidden email]> wrote:
>
> There is no way to perform this with NeoCSV or any other CSV framework I'm aware of. [...]
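The peek-based idea can be sketched roughly as follows, again in Python and not with NeoCSV itself. This is a line-based analogue (one line of lookahead instead of one character), and it assumes blank lines separate the sections; the class `SectionedLines` and its methods `at_section_end` and `skip_separators` are hypothetical names chosen to mirror the suggested #atSectionEnd, not an existing API.

```python
import csv


class SectionedLines:
    """Line source with one line of lookahead, split into sections by blank lines."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self._peeked = next(self._lines, None)  # the one-line buffer

    def at_end(self):
        return self._peeked is None

    def at_section_end(self):
        # A section ends at the end of input or just before a blank line.
        return self._peeked is None or not self._peeked.strip()

    def _advance(self):
        line = self._peeked
        self._peeked = next(self._lines, None)
        return line

    def skip_separators(self):
        """Consume the blank lines between two sections."""
        while not self.at_end() and self.at_section_end():
            self._advance()

    def section(self):
        """Yield the lines of the current section, stopping before the separator."""
        while not self.at_section_end():
            yield self._advance()


source = SectionedLines(["a,b", "1,2", "", "x;y;z", "7;8;9"])
first = list(csv.reader(source.section()))
source.skip_separators()
second = list(csv.reader(source.section(), delimiter=";"))
```

Because every reader pulls lines from the same underlying source, nothing is copied into a per-section buffer; a fresh reader is simply configured for each section.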
2017-07-26 13:04 GMT-03:00 Sven Van Caekenberghe <[hidden email]>:
> I agree.
>
> If the file is non-homogeneous, it is no longer CSV by definition.
>
> Holding on to the original stream and creating a new reader for each section is one option; another could be to add a #reset method.
>
> The big question is how to know where one section begins or ends.

In my experience, I looked for certain delimiters, like a header row with the field names. Oil & gas telemetry instruments generate output like that: a concatenation of several CSVs into one, maybe even preceded by a non-CSV header of 10 rows of data.

What I had to do to deal with that was either:
a) Read it line by line, buffering the whole "segment" until EOF or the next delimiter is found, or
b) Pre-scan the whole file, marking the start and end positions of each segment, then create a new readStream on each segment's contents and pass it to the CSV parser (which neither knows nor cares about segments).

> NeoCSVReader holds a one-character buffer, so you could peek for something, just maybe. Then you could discover the section switches while parsing (a bit like how #upToEnd uses #atEnd, you could add an #atSectionEnd). But it all depends on your specific format.

It's harder to do if it is char-based instead of line-based. Or at least harder to code.

Regards!

Esteban A. Maringolo
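Option b) might look roughly like the sketch below, written in Python and not against NeoCSV. It assumes that every segment starts with a header row that is known in advance, and that anything before the first header is a preamble to ignore; `scan_segments` and `read_segment` are illustrative names, and the sample data is invented.

```python
import csv
import io


def scan_segments(text, headers):
    """Pre-scan text and return (start, end) character offsets, one pair per segment.

    Assumption: a segment starts at any line equal to one of the known header rows.
    """
    starts = []
    offset = 0
    for line in text.splitlines(keepends=True):
        if line.strip() in headers:
            starts.append(offset)
        offset += len(line)
    # Each segment ends where the next one starts; the last ends at end of text.
    ends = starts[1:] + [len(text)]
    return list(zip(starts, ends))


def read_segment(text, start, end):
    """Hand one slice of the text to a CSV parser that knows nothing about segments."""
    return list(csv.DictReader(io.StringIO(text[start:end])))


text = "preamble line\nwell,depth\nA,100\nB,250\ntime,pressure\n0,3.2\n"
spans = scan_segments(text, {"well,depth", "time,pressure"})
tables = [read_segment(text, start, end) for start, end in spans]
```

Slicing still copies each segment, so for very large files the offsets would instead be used to position a single stream, which is where the missing "stream over a stream" would help.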
In reply to this post by Sven Van Caekenberghe-2
Thank you both. This worked quite well (23 LOC) vs. parsing the raw file with NeoCSV and then trying to clean up and interpret the resulting arrays (126 LOC).
Cheers,
Sean