Securely skipping the last x lines in a "somewhat" CSV file


Securely skipping the last x lines in a "somewhat" CSV file

jtuchel

Hi there,

I guess some of you know the situation: you have to solve a simple task and fail over and over again. I am sitting in front of such a beast at the moment.

The task is easy: parse a CSV file which has a few leading and trailing lines that do not conform to the CSV format of the middle lines (let's call that part the payload of the file).

So all I have is a definition of the columns of the CSV payload part and the number of lines to ignore at the head and the tail of the file.

For the payload, I use NeoCSVReader. There is very little trouble with this, as long as I only use that payload part of the ReadStream for it.

Now to the "hard" part: ignoring irrelevant lines at the beginning and the end of the file.
The problem is the same for the first and last lines, but the trailing lines are a bit harder, because you need to know that the upcoming line is the x-last or (x-1)-last line of the file.

The thing is: if the lines to be ignored were just lines, I could use #nextLine. But in some cases they are CSV lines which may or may not have the same number of columns as the payload lines. The hardest part is, however, that sometimes these lines are CSV (same or different number of columns as the payload) and contain quoted fields with line breaks.

So I know how to solve this in principle: I need to first cut the file into "real" lines, respecting the possibility of line breaks within quoted fields and sometimes even quoted fields with quotes in them, stash them in a collection, remove the first y and last x lines, and feed the rest into NeoCSVReader. The hard part is first identifying the "real" lines.

So #nextLine is not enough, and it seems like what I need to do is write something very similar to NeoCSVReader that knows how to distinguish between line breaks inside quoted fields and "real" line breaks. I wonder if there is anything around that can make my journey shorter...?
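For illustration, the splitting-and-trimming steps described above could be sketched like this (Python rather than Smalltalk, purely to show the logic; the helper names are invented, and doubled quotes are assumed to be the escape convention inside quoted fields):

```python
def real_lines(text):
    """Split CSV-ish text into "real" lines: a newline inside a quoted
    field does not terminate a line, and "" inside quotes is an
    escaped quote, not the end of the field."""
    records, buf, in_quotes = [], [], False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == '"':
            if in_quotes and i + 1 < n and text[i + 1] == '"':
                buf.append('""')          # escaped quote, stay inside
                i += 2
                continue
            in_quotes = not in_quotes
            buf.append(ch)
        elif ch == '\n' and not in_quotes:
            records.append(''.join(buf))  # a "real" line break
            buf = []
        else:
            buf.append(ch)
        i += 1
    if buf:
        records.append(''.join(buf))
    return records

def payload(text, skip_head, skip_tail):
    """Drop the first skip_head and last skip_tail real lines."""
    lines = real_lines(text)
    return lines[skip_head:len(lines) - skip_tail]
```

A Smalltalk version would keep the same two states (inside/outside quotes) and feed the joined payload lines into NeoCSVReader.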

Joachim
--
You received this message because you are subscribed to the Google Groups "VA Smalltalk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/va-smalltalk/a7e8540e-9145-4054-8fc7-472b5e9a79a6n%40googlegroups.com.

Re: Securely skipping the last x lines in a "somewhat" CSV file

Esteban A. Maringolo
Hi Joachim,

I had to deal with similar situations, like having a "merged CSV" file with non-CSV headers in between (old-style text headers) and other fancy stuff. What I did was wrap the underlying stream with my own class that filtered out the "invalid rows" [*], etc.

AFAIR NeoCSV is "character based", as in it operates on a per-character feed, and it knows whether it is inside a quoted text with a newline in between, or at the end of the line. What I'd do is also wrap the underlying stream, with my own logic for toggling whether to pass the read character along or skip it until the real CSV content begins. This way, if you're inside the CSV content of the file, calling #next will return the read character; but if you're in the "non-CSV" part of the file, calling #next will keep skipping until it reaches the beginning of the proper CSV.

This might be tricky, since you'd likely need to also modify #skip:, #atEnd and similar methods used by NeoCSV. But you either wrap the stream to make NeoCSV believe it is dealing with a proper CSV, or you modify NeoCSVReader (which is also an option, and could have this "wrapping logic" inside of it).
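As a rough illustration of that wrapping idea (Python instead of Smalltalk; the class and method names are invented, and this sketch skips the preamble by physical lines, which only works when the preamble itself contains no quoted line breaks):

```python
class PreambleSkippingStream:
    """Wraps text as a character stream; the constructor silently
    consumes a fixed number of preamble lines, so a consumer calling
    next()/at_end() believes the stream starts at the real CSV."""

    def __init__(self, text, skip_lines):
        self._text = text
        self._pos = 0
        # skip the given number of physical preamble lines up front
        for _ in range(skip_lines):
            nl = text.find('\n', self._pos)
            if nl == -1:
                self._pos = len(text)
                break
            self._pos = nl + 1

    def next(self):
        # analogous to #next: hand out one character of the payload
        if self.at_end():
            return None
        ch = self._text[self._pos]
        self._pos += 1
        return ch

    def at_end(self):
        # analogous to #atEnd
        return self._pos >= len(self._text)
```

A real wrapper along Esteban's lines would also have to decide where the trailing non-CSV part begins, which is exactly the hard part of the original problem.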

So my answer is not a solution to your problem, but at least you have
some ideas to explore. :-)

Best regards!

Esteban A. Maringolo

On Tue, Jan 19, 2021 at 6:03 AM Joachim Tuchel <[hidden email]> wrote:

>
>
> Hi there,
>
> I guess some of you know the situation: you have to solve a simple task and fail over and over again. I am sitting in from of such a beast at the moment.
>
> The task is easy: parse a CSV file which has a few leading and trailing lines that do not conform to the CSV format of the middle lines (let's call that part the payload of the file).
>
> So all I have is a definition of the columns of the CSV payload part and the number of lines to ignore at the head and the tail of the file.
>
> For the payload; I use NeoCSVReader.There is very little trouble with this, as long as I only use that payload part of the ReadStream for it.
>
> Now to the "hard" part: ignoring irrevelant lines at the beginnig and the end of the file.
> The problem is the same for the first and last lines, but the trainling lines are a bit harder, because you need to know that this upcoming line is the x-last or (x-1)-last line of the file.
>
> The thing is: if the lines to be ignored were just lines, I coul used #nextLine. But in some cases they are CSV-lines which may or may not have the same number of columns as the payload lines. The hardest part is, however, that sometimes these lines are CSV (same or different number of columns as the payload) and contain qoeted fields with line breaks.
>
> So I know ho to solve this in principle: I need to first cut the file into "real" lines, respecting the possibility of line breaks within quoted and sometime even quoted fields with quotes in them, stash them in a collection, remove the first y and last x lines, and feed the rest into NeoCSVReader. The hard part is first identifying the "real" lines.
>
> So #nextLine is not enough, and it seems like what I need to do is write something very similar to NeoCSVReader that knows how to distinguish between line breaks inside quoted fields and "real" line breaks. I wonder if there is anything around that can make my journey shorter...?
>
> Joachim
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> --
> You received this message because you are subscribed to the Google Groups "VA Smalltalk" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
> To view this discussion on the web visit https://groups.google.com/d/msgid/va-smalltalk/a7e8540e-9145-4054-8fc7-472b5e9a79a6n%40googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "VA Smalltalk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/va-smalltalk/CAJMgPCJOhz7_sJ34voBzaqpCwfbFvsB7W8V3agS-gWmNyy85jQ%40mail.gmail.com.
Reply | Threaded
Open this post in threaded view
|

Re: Securely skipping the last x lines in a "somewhat" CSV file

Marcus Wagner
In reply to this post by jtuchel
Hi,
I hope I will not frustrate you all by adding my experiences here.

I had a case where the CSV was created out of Excel. So it is common that line breaks are part of a cell, e.g.

field1,field2,field3
1,2,"3
4,5,5"
6,7,8
gives a header and two lines with three cells each (from a Smalltalk inspector):
First line
('1' '2' '3
4,5,5')
Second line
('6' '7' '8')
The third cell of the first line contains a comma and a line break.

So the whole story was a nightmare, too.

Consider that the " character in itself needs special treatment. Consider:
"a",abcde"
"a",abcde"
should read as two lines (in a Smalltalk inspector)
('"a"' 'abcde"')
('"a"' 'abcde"')
And further
"a","ab,cde"
"a","abcde"","
should give you
('"a"' 'ab,cde')
('"a"' 'abcde",')
or
"abc","bde" garbage
should read as
('"abc"' '"bde"')

-> there may be garbage even within a CSV (better said: it is a malformed CSV)

-> #nextLine definitely cannot handle this

-> I created a deterministic automaton to handle ", line breaks and the comma to find the meta-structure in a first pass. The idea is based on "lex" in Unix; maybe the support for regular expressions in Smalltalk will help.

-> I isolated leading and trailing garbage on this meta structure.

-> I retransformed the meta-structure back into something which NeoCSVReader understands in a second pass, as the chances to integrate this logic into NeoCSVReader are very low.
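Such a first-pass automaton can be tiny. A sketch in Python (two states, outside/inside quotes; quote doubling is assumed to be the escape convention, and the names are illustrative, not Marcus's actual code):

```python
# States of the automaton
OUTSIDE, INSIDE = 0, 1

def classify_newlines(text):
    """First pass: return the indices of the newlines that really end
    a record, i.e. those seen while the automaton is OUTSIDE quotes.
    A doubled quote inside a quoted field ("") does not change state."""
    state, ends, i = OUTSIDE, [], 0
    while i < len(text):
        ch = text[i]
        if ch == '"':
            if state == INSIDE and text[i + 1:i + 2] == '"':
                i += 2                    # escaped quote, stay INSIDE
                continue
            state = INSIDE if state == OUTSIDE else OUTSIDE
        elif ch == '\n' and state == OUTSIDE:
            ends.append(i)                # a record-terminating newline
        i += 1
    return ends
```

With the record boundaries known, isolating leading and trailing garbage is just slicing on that meta-structure, and the remainder can be handed back to NeoCSVReader.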

Note also that a very common producer of CSV, Excel, has some additional particularities:
° it considers the ' at the beginning of a cell a meta symbol (in particular when the first character of a cell should be =, but not act as a formula),
° cells are limited in size,
° not to speak of the fact that Excel sometimes uses ; (or any other nationally defined delimiter) instead of ,.

Good luck!
Marcus
[hidden email] wrote on Tuesday, 19 January 2021 at 10:03:04 UTC+1:


Re: Securely skipping the last x lines in a "somewhat" CSV file

jtuchel
Marcus, Esteban,

thanks a lot for your input. I see my problem is not really esoteric ;-), even if the problem at hand has more to do with the fact that the files I have to read are not strictly CSV files....

I think I found an easy way to use NeoCSVReader with minimal changes: enable correct counting of lines in the CSV file and then parse only the "real" CSV lines with it.
I've described my ideas and changes to NeoCSVReader in a GitHub issue here: https://github.com/svenvc/NeoCSV/issues/20
The approach is low-tech and not suitable for large files, because essentially I parse the file/stream twice and store it in a collection in its entirety...
I hope people will agree this is useful and that it will make its way into NeoCSV. If not, I will have to maintain this change on my own through upcoming NeoCSV releases... Luckily, it can be done as extension methods with a little bit of code duplication if needed...

@Marcus: thank god I don't have to deal with Excel output! That would make things a lot more interesting. We've had a few interesting effects of Excel usage on standard CSV-like files (blog post in German: https://www.kontolino.de/warum-sie-datev-buchungsstapel-besser-nicht-mit-excel-co-bearbeiten-sollten/ ). There are enough stupid ideas around concerning CSV export formats even without the layers of intelligent "number" formatting Excel adds on top of that to keep the game interesting ;-)

Joachim





[hidden email] wrote on Tuesday, 19 January 2021 at 15:16:22 UTC+1:

Re: Securely skipping the last x lines in a "somewhat" CSV file

vinref
Hi

I did something vaguely similar long ago. I imported the CSV file into an SQLite in-memory DB and selected for the compliant lines. The SQLite in-memory DB is very fast.

(Actually, I did a SELECT INSERT of all the non-compliant lines into an error table for reporting back to the user, and if the error table was empty then I would process the data.)
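In outline, that looks something like the following (Python with sqlite3 purely as an illustration; the table and column names are invented, and "compliant" is reduced here to "has all three columns"):

```python
import sqlite3

# Raw rows as they came out of a first, tolerant parsing pass;
# the payload is expected to have exactly three columns.
rows = [('header junk', None, None),
        ('1', '2', '3'),
        ('4', '5', '6'),
        ('trailer junk', None, None)]

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE raw (c1, c2, c3)')
db.execute('CREATE TABLE errors (c1, c2, c3)')
db.executemany('INSERT INTO raw VALUES (?, ?, ?)', rows)

# Move the non-compliant rows (missing columns) into the error table
db.execute('INSERT INTO errors SELECT * FROM raw '
           'WHERE c2 IS NULL OR c3 IS NULL')
payload = db.execute('SELECT * FROM raw '
                     'WHERE c2 IS NOT NULL AND c3 IS NOT NULL').fetchall()
errors = db.execute('SELECT count(*) FROM errors').fetchone()[0]
```

If `errors` is zero, the payload can be processed; otherwise the error table is reported back to the user.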

Vince




Re: Securely skipping the last x lines in a "somewhat" CSV file

jtuchel
Vince,

your approach sounds like fun and pretty performant. A (of course solvable) problem has to be solved first, however: reading exactly one line, be it with or without line breaks in quoted fields.
So you'd either have to write a little pseudo-CSV parser (or CSV-subset parser) or reuse the one that's already available in NeoCSV.
Either way, and also in your case, you have to do multiple passes over the file/stream contents.
Your approach has one charming characteristic, however: you can name exactly the line/column which doesn't parse nicely. That's currently missing from NeoCSV.

Joachim


[hidden email] wrote on Friday, 22 January 2021 at 03:54:42 UTC+1: