Conditional CSV parsing

Conditional CSV parsing

hernanmd
I used to use a CSV parser from Squeak where I could attach conditional iterations:

csvParser rowsSkipFirst: 2 do: [ :row | "some action ignoring first 2 fields on each row" ].
csvParser rowsSkipLast: 2 do: [ :row | "some action ignoring last 2 fields on each row" ].
csvParser rowsWhere: [ "a condition block" ] do: [ :row | "..." ].
csvParser rowsUpTo: 500000 do: [ :row | "some action for rows up to 500000" ].
csvParser rowsFrom: 2000 to: 5000 do: [ :row | "some action for rows between 2000 and 5000" ].

I want to replace that parser with NeoCSVReader; is this easily possible with the current implementation?
Cheers,

Hernán


Re: Conditional CSV parsing

Sven Van Caekenberghe-2
Hi Hernán,

> On 23 Jan 2015, at 19:50, Hernán Morales Durand <[hidden email]> wrote:
>
> I used to use a CSV parser from Squeak where I could attach conditional iterations:
>
> csvParser rowsSkipFirst: 2 do: [ :row | "some action ignoring first 2 fields on each row" ].
> csvParser rowsSkipLast: 2 do: [ :row | "some action ignoring last 2 fields on each row" ].

With NeoCSVReader you can describe how each field is read and converted; using the same mechanism you can ignore fields. Have a look at the senders of #addIgnoredField in the unit tests.

> csvParser rowsWhere: [ "a condition block" ] do: [ :row | "..." ].

NeoCSVReader behaves a bit like a stream and a bit like a collection: there are #next, #atEnd and #upToEnd, as well as #do: and #select:.
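For example, the old #rowsWhere:do: can be emulated with #do: alone; a sketch, assuming a configured reader, with csvStream and the condition as placeholders:

(NeoCSVReader on: csvStream) do: [ :row |
        (row first = 'chr1') ifTrue: [
                "process only the rows matching the condition"
                Transcript show: row printString; cr ] ].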

> csvParser rowsUpTo: 500000 do: [ :row | "some action for rows up to 500000" ].
> csvParser rowsFrom: 2000 to: 5000 do: [ :row | "some action for rows between 2000 and 5000" ].

Those are not there; you will have to count yourself. #next can be used to skip records.
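Such counting could look like this; a sketch emulating rowsFrom: 2000 to: 5000 do:, where csvStream is a placeholder and field definitions are omitted:

| reader |
reader := NeoCSVReader on: csvStream.
1999 timesRepeat: [ reader atEnd ifFalse: [ reader next ] ]. "skip rows 1 to 1999"
3001 timesRepeat: [
        reader atEnd ifFalse: [ | row |
                row := reader next.
                "process rows 2000 to 5000"
                Transcript show: row printString; cr ] ].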

> I want to replace that parser with NeoCSVReader; is this easily possible with the current implementation?

Should work, let me know if you have any problems.


Sven



Re: Conditional CSV parsing

hernanmd
Hi Sven,

2015-01-23 16:06 GMT-03:00 Sven Van Caekenberghe <[hidden email]>:
> With NeoCSVReader you can describe how each field is read and converted; using the same mechanism you can ignore fields. Have a look at the senders of #addIgnoredField in the unit tests.

I am trying to understand the implementation. I see you included #addIgnoredFields: for consecutive fields in Neo-CSV-Core-SvenVanCaekenberghe.21.
A question about usage, then: does adding ignored field(s) require adding field types for all the other remaining fields?

> NeoCSVReader behaves a bit like a stream and a bit like a collection: there are #next, #atEnd and #upToEnd, as well as #do: and #select:.

I was using the version from the Configuration and missed the #select: / #select:thenDo: update.

> Those are not there; you will have to count yourself. #next can be used to skip records.

Ok.

> Should work, let me know if you have any problems.

Thank you, Sven!

Cheers,

Hernán




Re: Conditional CSV parsing

Sven Van Caekenberghe-2

> On 23 Jan 2015, at 20:53, Hernán Morales Durand <[hidden email]> wrote:
>
> I am trying to understand the implementation. I see you included #addIgnoredFields: for consecutive fields in Neo-CSV-Core-SvenVanCaekenberghe.21.
> A question about usage, then: does adding ignored field(s) require adding field types for all the other remaining fields?

Yes, like this:

testReadWithIgnoredField
        | input |
        input := (String crlf join: #( '1,2,a,3' '1,2,b,3' '1,2,c,3' '')).
        self
                assert: ((NeoCSVReader on: input readStream)
                                        addIntegerField;
                                        addIntegerField;
                                        addIgnoredField;
                                        addIntegerField;
                                        upToEnd)
                equals: {
                        #(1 2 3).
                        #(1 2 3).
                        #(1 2 3).}

> I was using the version from the Configuration and missed the #select: / #select:thenDo: update.

Yes, please use #bleedingEdge; I have to update #stable.
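A possible load expression; a sketch, where the repository URL is an assumption based on where the other Neo projects are hosted:

Metacello new
        configuration: 'NeoCSV';
        repository: 'http://mc.stfx.eu/Neo';
        version: #bleedingEdge;
        load.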




Re: Conditional CSV parsing

hernanmd


2015-01-23 18:00 GMT-03:00 Sven Van Caekenberghe <[hidden email]>:




In case you make another pass over NeoCSV, you may like to know that for some data sets I have 1 million columns; an #addFieldsInterval: or some such would be nice.

Thank you.

Hernán


Re: Conditional CSV parsing

Sven Van Caekenberghe-2

> On 26 Jan 2015, at 06:32, Hernán Morales Durand <[hidden email]> wrote:
>
> In case you make another pass over NeoCSV, you may like to know that for some data sets I have 1 million columns; an #addFieldsInterval: or some such would be nice.

1 million columns? How is that possible, or useful?

The reader is like a builder. You could try to do this yourself by writing a little loop or two.
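Such loops might look like this; a sketch, where the split into 6 leading fields and ignored trailing fields is only an assumed layout:

| reader |
reader := NeoCSVReader on: csvStream.
6 timesRepeat: [ reader addField ]. "read the first 6 fields as strings"
999994 timesRepeat: [ reader addIgnoredField ]. "ignore the remaining fields"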

But still, 1 million?




Re: Conditional CSV parsing

hernanmd
It is possible :)
I work with DNA sequences, there could be millions of common SNPs in a genome.

Cheers,

Hernán


2015-01-26 3:33 GMT-03:00 Sven Van Caekenberghe <[hidden email]>:

> 1 million columns? How is that possible, or useful?
> [snip]



Re: Conditional CSV parsing

Sven Van Caekenberghe-2
Hernán,

> On 26 Jan 2015, at 08:00, Hernán Morales Durand <[hidden email]> wrote:
>
> It is possible :)
> I work with DNA sequences, there could be millions of common SNPs in a genome.

Still weird for CSV. How many records are there then? I assume they all have the same number of fields?

Anyway, could you point me to the specification of the format you want to read?
And to the older parser that you used to use?

Thx,

Sven



Re: Conditional CSV parsing

hernanmd


2015-01-26 9:01 GMT-03:00 Sven Van Caekenberghe <[hidden email]>:
> Still weird for CSV. How many records are there then?

We genotyped a few individuals (24 records), but now we have a genotyping platform (GeneTitan) with array plates allowing up to 96 samples, which means up to 2.6 million markers. The first run I completed generated CSVs of 1 million records (see the attached affy_run.jpg). Sadly, the high-level analysis of this data (annotation, clustering, discrimination) is now performed in R, with packages like SNPolisher.

And this is microarray analysis; NGS platforms produce larger volumes of data in a shorter period of time (several genomes in a day). See http://www.slideshare.net/allenday/renaissance-in-medicine-strata-nosql-and-genomics for the 2014-2020 predictions.

Feel free to contact me if you want to experiment with metrics.

> I assume they all have the same number of fields?

Yes, I have never seen a CSV file with a variable number of fields (in this domain).

> Anyway, could you point me to the specification of the format you want to read?

Actually I am in no rush for this; I want to avoid awk, sed and shell scripts in the next run. I would also like to avoid Python, but it spreads like a virus.

I will be working mostly with CSVs from Axiom annotation files [1] and genotyping results. Other file formats I use are genotype formats for programs like PLINK [2] (PED files, column 7 onwards) and HaploView. It is worse than you might think, because you have to transpose the output generated by the genotyping platforms (millions of records), and then filter and cut it by chromosome, because those Java programs cannot deal with all chromosomes at the same time.
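A sketch of such a transpose with NeoCSV, workable only while the whole table fits in memory (csvStream and outStream are placeholders):

| rows transposed |
rows := (NeoCSVReader on: csvStream) upToEnd. "all records, as arrays of strings"
transposed := (1 to: rows first size) collect: [ :column |
        rows collect: [ :row | row at: column ] ].
(NeoCSVWriter on: outStream) nextPutAll: transposed.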
 
> And to the older parser that you used to use?

http://www.smalltalkhub.com/#!/~hernan/CSV

[1] http://www.affymetrix.com/support/technical/annotationfilesmain.affx
[2] http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped

Cheers,
Hernán




Attachment: affy_run.jpg (303K)

Re: Conditional CSV parsing

stepharo
Hernán,

If you need some help, you can also find a smart student and ask ESUG to sponsor him during a SummerTalk.

Stef

On 26/1/15 21:03, Hernán Morales Durand wrote:
> [snip]




Re: Conditional CSV parsing

stepharo
Sven, I just added a section to your chapter :)


Re: Conditional CSV parsing

Sven Van Caekenberghe-2
There are some cool ideas here:

https://github.com/BurntSushi/xsv

> On 31 Jan 2015, at 14:26, stepharo <[hidden email]> wrote:
>
> Hernán,
>
> If you need some help, you can also find a smart student and ask ESUG to sponsor him during a SummerTalk.
>
> Stef


Re: Conditional CSV parsing

stepharo
I will add that to the topic list!

Stef


On 22/2/15 22:15, Sven Van Caekenberghe wrote:

> There are some cool ideas here:
>
> https://github.com/BurntSushi/xsv