NeoCSV on Irregular Files

Sean P. DeNigris
Administrator
I have a CSV file that has several subsections, each with its own format. What I'd like to do is parse one, reset the NeoCSVReader, set it up for the next section, and continue. I didn't see an API for this. Is it possible? Thanks.
Cheers,
Sean

Re: NeoCSV on Irregular Files

Esteban A. Maringolo
There is no way to do this with NeoCSV or any other CSV
framework I'm aware of.

I had to deal with that kind of "format" (which is likely an export
format), and the way to handle it was to process each "segment"
with a different instance of the CSV reader. The segments were
located in the stream using the delimiting heuristics of your choice
(headers, blank lines, etc.), and then each segment was extracted and
passed as an argument to the reader for that segment.
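
A minimal sketch of the idea above, written in Python with its standard csv module rather than Pharo and NeoCSV, purely to illustrate the technique. It assumes segments are separated by blank lines; the helper names split_segments and parse_segment are hypothetical.

```python
import csv
import io


def split_segments(text):
    """Split text into segments separated by one or more blank lines."""
    segments, current = [], []
    for line in text.splitlines(keepends=True):
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the segment collected so far.
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments


def parse_segment(segment, delimiter=","):
    """Parse one segment with its own, separately configured reader."""
    return list(csv.reader(io.StringIO(segment), delimiter=delimiter))


# Two sections with different layouts and different separators.
text = "id,name\n1,ann\n\nx;y;z\n1;2;3\n"
first, second = split_segments(text)
people = parse_segment(first)
points = parse_segment(second, delimiter=";")
```

Each segment is copied into its own string before it reaches a reader, so the sketch has the same memory cost as the approach it illustrates.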

The drawback was that if the file was big there was no way to have "a
stream over a stream" (like a window function), so passing the segment
to the reader implied copying its string contents within the segment
delimiters.

It's something I already put some thought into, but never had the will
to code and share publicly.

Regards,

Esteban A. Maringolo


Re: NeoCSV on Irregular Files

Sven Van Caekenberghe-2
I agree.

If the file is non-homogeneous, it is no longer CSV by definition.

Holding on to the original stream and creating new readers for each section is one option; another could be to add a #reset method.

The big question is how to know when one section begins/ends.

NeoCSVReader holds a one-character buffer, so you could peek for something, just maybe. Then you could discover the section switches while parsing (much as #upToEnd uses #atEnd, you could add an #atSectionEnd). But it all depends on your specific format.
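
A rough sketch of the peeking idea, in Python rather than Pharo, so none of it is NeoCSV API. It assumes every section starts with a title line beginning with "#"; that marker and the names peek, at_section_end and read_section are hypothetical.

```python
import csv
import io


def peek(stream):
    """Return the next character without consuming it ('' at end of input)."""
    position = stream.tell()
    char = stream.read(1)
    stream.seek(position)
    return char


def at_section_end(stream, marker="#"):
    """True at end of input or when the next line starts a new section."""
    return peek(stream) in ("", marker)


def read_section(stream, marker="#"):
    """Read records until the next section marker or the end of input."""
    rows = []
    while not at_section_end(stream, marker):
        # Line-based parsing: quoted fields spanning lines are not handled.
        rows.extend(csv.reader([stream.readline()]))
    return rows


data = io.StringIO("#people\nname,age\nann,31\n#cities\nlyon,FR\n")
sections = {}
while peek(data) != "":
    title = data.readline().strip().lstrip("#")
    sections[title] = read_section(data)
```

The single stream is consumed in one pass and no segment is copied up front, but the section test only works when one character is enough to tell a marker line from a data line.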


Re: NeoCSV on Irregular Files

Esteban A. Maringolo
2017-07-26 13:04 GMT-03:00 Sven Van Caekenberghe <[hidden email]>:
> I agree.
>
> If the file is non-homogeneous, it is no longer CSV by definition.
>
> Holding on to the original stream and creating new readers for each section is one option; another could be to add a #reset method.
>
> The big question is how to know when one section begins/ends.

In my experience, I looked for certain delimiters, such as a header row
with the field names.
Oil & Gas telemetry instruments generate output like that: a
concatenation of several CSVs into one file, sometimes preceded by a
non-CSV header of around 10 rows of data.

What I had to do to deal with that was either:
a) Read it line by line, buffering the whole "segment" until EOF or
the next delimiter is found, or...
b) Pre-scan the whole file, marking the start and end positions of
each segment, then create a new readStream over each segment's contents
and pass it to the CSV parser (which neither knows nor cares about segments).
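
Option b) could look roughly like the following, again in Python with the standard csv module instead of Pharo and NeoCSV. It assumes the header rows are known in advance; mark_segments, HEADERS and the sample data are made up for the illustration.

```python
import csv
import io


def mark_segments(text, is_header):
    """Pre-scan text and return one (start, end) offset pair per segment."""
    starts = []
    offset = 0
    for line in text.splitlines(keepends=True):
        if is_header(line):
            starts.append(offset)
        offset += len(line)
    # Each segment ends where the next one starts; the last ends at EOF.
    ends = starts[1:] + [len(text)]
    return list(zip(starts, ends))


# Hypothetical export: a non-CSV preamble followed by two tables.
text = (
    "device: X1\n"
    "firmware: 2\n"
    "time,pressure\n"
    "0,101.3\n"
    "1,101.9\n"
    "depth,temp,flow\n"
    "10,55,3.2\n"
)
HEADERS = {"time,pressure", "depth,temp,flow"}

spans = mark_segments(text, lambda line: line.strip() in HEADERS)
# Every segment gets a fresh stream and a fresh reader.
tables = [list(csv.DictReader(io.StringIO(text[s:e]))) for s, e in spans]
```

Anything before the first recognised header, such as the preamble here, falls outside every span and is skipped.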

> NeoCSVReader holds a one-character buffer, so you could peek for something, just maybe. Then you could discover the section switches while parsing (much as #upToEnd uses #atEnd, you could add an #atSectionEnd). But it all depends on your specific format.

That is harder to do if the detection is character-based rather than
line-based, or at least harder to code.

Regards!

Esteban A. Maringolo


Re: NeoCSV on Irregular Files

Sean P. DeNigris
Administrator
In reply to this post by Sven Van Caekenberghe-2
Sven Van Caekenberghe-2 wrote:
> creating new readers for each section is one option
Thank you both. This worked quite well: it took 23 LOC, versus 126 LOC for parsing the raw file with NeoCSV and then trying to clean up and interpret the resulting arrays.
Cheers,
Sean