NeoCSV on Irregular Files

Sean P. DeNigris

Jul 26, 2017; 3:02pm

I have a CSV file that has several subsections, each with its own format. What I'd like to do is parse one, reset the NeoCSVReader, set it up for the next section, and continue. I didn't see an API for this. Is it possible? Thanks.

Cheers,
Sean

Esteban A. Maringolo

Jul 26, 2017; 3:45pm

There is no way to perform this with NeoCSV or any other CSV
framework I'm aware of.

I had to deal with that kind of "format" (which is likely an export
format), and the way to deal with it was to process each "segment"
using a different instance of the CSV reader. The segments were
located in the stream using the delimiting heuristics of your choice
(headers, blank lines, etc.), and then each segment was extracted and
passed as an argument to the reader for that segment.

The drawback was that if the file was big there was no way to have "a
stream over a stream" (like a window function), so passing the segment
to the reader implied copying its string contents within the segment
delimiters.
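
A minimal sketch of the per-segment approach described above, for a Pharo playground with NeoCSV loaded. The sample data and the choice of a blank line as the segment delimiter are assumptions for illustration; `NeoCSVReader on:`, `readHeader` and `upToEnd` are NeoCSV's regular API. Note that each segment is copied into its own string, which is exactly the drawback mentioned above.

```smalltalk
"Sketch: one fresh NeoCSVReader per blank-line-delimited segment.
 The sample input and the blank-line delimiter are assumptions."
| input segments current results |
input := String lf join: #(
	'name,age'
	'alice,30'
	'bob,25'
	''
	'sku,qty,price'
	'A1,3,9.99').

"Group the lines into segments, starting a new one at every blank line."
segments := OrderedCollection new.
current := OrderedCollection new.
input lines do: [ :line |
	line isEmpty
		ifTrue: [
			current isEmpty ifFalse: [ segments add: current ].
			current := OrderedCollection new ]
		ifFalse: [ current add: line ] ].
current isEmpty ifFalse: [ segments add: current ].

"Parse each segment with its own reader; the reader never sees the others."
results := segments collect: [ :segmentLines |
	| reader header |
	reader := NeoCSVReader on: (String lf join: segmentLines) readStream.
	header := reader readHeader.
	header -> reader upToEnd ].
```

Each element of `results` is then an association from a segment's header fields to its records, so every segment can have a different number of columns. A reader could also be configured differently per segment (separator, field converters) before calling `upToEnd`.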

It's something I already put some thought into, but never had the will
to code and share publicly.

Regards,

Esteban A. Maringolo

Sven Van Caekenberghe-2

Jul 26, 2017; 4:04pm

I agree.

If the file is non-homogeneous, it is no longer CSV by definition.

Holding on to the original stream and creating new readers for each section is one option; another could be to add a #reset method.

The big question is how to know when one section begins/ends.

NeoCSVReader holds a one char buffer, so you could peek for something, just maybe. Then you could discover the section switches while parsing (a bit like #atEnd is used from #upToEnd, add a #atSectionEnd). But it all depends on your specific format.

Esteban A. Maringolo

Jul 26, 2017; 5:05pm

2017-07-26 13:04 GMT-03:00 Sven Van Caekenberghe <[hidden email]>:
> I agree.
>
> If the file is non-homogeneous, it is no longer CSV by definition.
>
> Holding on to the original stream and creating new readers for each section is one option; another could be to add a #reset method.
>
> The big question is how to know when one section begins/ends.

In my experience I looked for certain delimiters, like a header row
with the field names.
Oil & Gas telemetry instruments generate outputs like that, like a
concatenation of several CSVs into one, maybe even with a non-csv like
header of 10 rows of data.

What I had to do to deal with that was either:
a) Reading it line by line, buffering the whole "segment" until EOF or
the next delimiter is found, or...
b) Pre-scanning the whole file, marking the start and end positions of
each segment, generating a new readStream with the contents, and passing
it to the CSV parser (which neither knows nor cares about segments).
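
Option (a) could be sketched roughly as follows in a Pharo playground with NeoCSV loaded. The sample data and the `isHeader` block (here: a line starting with `time,`) are hypothetical stand-ins for whatever heuristic fits the actual format; only one segment is buffered at a time.

```smalltalk
"Sketch of option (a): read line by line, buffer the current segment,
 and hand it to a fresh NeoCSVReader when the next header row appears.
 The sample input and the isHeader heuristic are assumptions."
| stream isHeader buffer parseBuffer tables line |
stream := (String lf join: #(
	'time,pressure'
	'0,101.3'
	'1,101.9'
	'time,depth,temp'
	'0,1500,88'
	'1,1510,89')) readStream.

"Hypothetical heuristic: a header row starts with the field name 'time'."
isHeader := [ :aLine | aLine beginsWith: 'time,' ].

tables := OrderedCollection new.
buffer := WriteStream on: String new.

"Parse whatever has been buffered so far, then start a new empty buffer."
parseBuffer := [
	buffer contents isEmpty ifFalse: [
		tables add: (NeoCSVReader on: buffer contents readStream) upToEnd ].
	buffer := WriteStream on: String new ].

[ stream atEnd ] whileFalse: [
	line := stream nextLine.
	(isHeader value: line) ifTrue: [ parseBuffer value ].
	buffer nextPutAll: line; lf ].
parseBuffer value.
```

With this input, `tables` should end up holding one collection of records per segment, each starting with its header row. Option (b) would differ only in recording start and end positions during a first pass and copying each range into a readStream afterwards.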

> NeoCSVReader holds a one char buffer, so you could peek for something, just maybe. Then you could discover the section switches while parsing (a bit like #atEnd is used from #upToEnd, add a #atSectionEnd). But it all depends on your specific format.

It's harder to do if it is char based, instead of "line" based. Or at
least harder to code.

Regards!

Esteban A. Maringolo

Sean P. DeNigris

Jul 30, 2017; 2:01am

In reply to this post by Sven Van Caekenberghe-2

Sven Van Caekenberghe-2 wrote

creating new readers for each section is one option

Thank you both. This worked quite well (23 LOC) vs. parsing the raw file with NeoCSV and then trying to clean up and interpret the resulting arrays (126 LOC).

Cheers,
Sean