I used to use a CSV parser from Squeak where I could attach conditional iterations:

csvParser rowsSkipFirst: 2 do: [ :row | " some action ignoring first 2 fields on each row " ].
csvParser rowsSkipLast: 2 do: [ :row | " some action ignoring last 2 fields on each row " ].
csvParser rowsWhere: [ " a condition block " ] do: [ :row | " ... " ].
csvParser rowsUpTo: 500000 do: [ " some action for rows up to 500000 " ].
csvParser rowsFrom: 2000 to: 5000 do: [ " some action for rows between 2000 and 5000 " ].

I want to replace this parser with NeoCSVReader. Is that easily possible with the current implementation?

Hernán
Hi Hernán,
> On 23 Jan 2015, at 19:50, Hernán Morales Durand <[hidden email]> wrote:
>
> I used to use a CSV parser from Squeak where I could attach conditional iterations:
>
> csvParser rowsSkipFirst: 2 do: [ :row | " some action ignoring first 2 fields on each row " ].
> csvParser rowsSkipLast: 2 do: [ :row | " some action ignoring last 2 fields on each row " ].

With NeoCSVReader you describe how each field is read and converted; using the same mechanism you can ignore fields. Have a look at the senders of #addIgnoredField in the unit tests.

> csvParser rowsWhere: [ " a condition block " ] do: [ :row | " ... " ].

NeoCSVReader behaves a bit like a stream and a bit like a collection: there are #next, #atEnd and #upToEnd as well as #do: and #select:.

> csvParser rowsUpTo: 500000 do: [ " some action for rows up to 500000 " ].
> csvParser rowsFrom: 2000 to: 5000 do: [ " some action for rows between 2000 and 5000 " ].

Those are not there; you will have to count yourself. #next can be used to skip rows.

> I want to replace this parser with NeoCSVReader. Is that easily possible with the current implementation?

It should work, let me know if you have any problems.

Sven
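(Editorial sketch, not part of Sven's reply: to make the mapping concrete, the old skip/limit/filter iterations might be emulated with the #next, #atEnd and #select: protocol mentioned above. The input data and the row counts below are made-up example values.)

    | input reader count selected |
    input := String crlf join: #( '1,a' '2,b' '3,c' '4,d' '5,e' '' ).
    reader := NeoCSVReader on: input readStream.

    "old rowsSkipFirst: 2 do: [...] -- skip rows by consuming them with #next"
    2 timesRepeat: [ reader atEnd ifFalse: [ reader next ] ].

    "old rowsUpTo: 2 do: [...] -- process at most 2 more rows, counting manually"
    count := 0.
    [ reader atEnd or: [ count >= 2 ] ] whileFalse: [
        Transcript show: reader next printString; cr.
        count := count + 1 ].

    "old rowsWhere: [...] do: [...] -- #select: collects the remaining matching rows"
    selected := reader select: [ :row | row first = '5' ].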
Hi Sven,
I am trying to understand the implementation. I see you included #addIgnoredFields: for consecutive fields in Neo-CSV-Core-SvenVanCaekenberghe.21. A question about usage, then: does adding ignored field(s) require adding field types for all the other remaining fields?

> csvParser rowsWhere: [ " a condition block " ] do: [ :row | " ... " ].

I was using the version from the Configuration and missed the #select:[thenDo:] update.

> csvParser rowsUpTo: 500000 do: [ " some action for rows up to 500000 " ].

Ok.

> I want to replace this parser with NeoCSVReader. Is that easily possible with the current implementation?

Thank you Sven!

Cheers,
> On 23 Jan 2015, at 20:53, Hernán Morales Durand <[hidden email]> wrote:
>
> I am trying to understand the implementation. I see you included #addIgnoredFields: for consecutive fields in Neo-CSV-Core-SvenVanCaekenberghe.21. A question about usage, then: does adding ignored field(s) require adding field types for all the other remaining fields?

Yes, like this:

testReadWithIgnoredField
    | input |
    input := String crlf join: #( '1,2,a,3' '1,2,b,3' '1,2,c,3' '' ).
    self
        assert: ((NeoCSVReader on: input readStream)
            addIntegerField;
            addIntegerField;
            addIgnoredField;
            addIntegerField;
            upToEnd)
        equals: {
            #(1 2 3).
            #(1 2 3).
            #(1 2 3) }

> I was using the version from the Configuration and missed the #select:[thenDo:] update.

Yes, please use #bleedingEdge, I have to update #stable.

> Thank you Sven!

Sven
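(Editorial sketch, not from Sven's message: assuming the newer #addIgnoredFields: simply takes the number of consecutive columns to skip, its usage would look something like the following, with made-up data.)

    | input |
    input := String crlf join: #( '1,x,y,z,2' '3,x,y,z,4' '' ).
    (NeoCSVReader on: input readStream)
        addIntegerField;
        addIgnoredFields: 3;    "skip the three middle columns in one call"
        addIntegerField;
        upToEnd.
    "expected: { #(1 2). #(3 4) }, if the assumption about the selector holds"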
Maybe you would like to know, in case you make another pass over NeoCSV: for some data sets I have 1 million columns, so an #addFieldsInterval: or something similar would be nice.

Thank you.

Hernán
> On 26 Jan 2015, at 06:32, Hernán Morales Durand <[hidden email]> wrote:
>
> Maybe you would like to know, in case you make another pass over NeoCSV: for some data sets I have 1 million columns, so an #addFieldsInterval: or something similar would be nice.

1 million columns? How is that possible, or useful?

The reader is like a builder. You could try to do this yourself by writing a little loop or two.

But still, 1 million?
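(Editorial sketch of the "little loop" idea, not part of NeoCSV itself; the column count and the kept interval are made-up example values. The reader is configured in a loop, adding a converted field for the columns you want and an ignored field for everything else.)

    | input reader keepFrom keepTo totalColumns |
    input := String crlf join: #( '1,2,3,4,5' '6,7,8,9,10' '' ).
    keepFrom := 2.
    keepTo := 4.
    totalColumns := 5.
    reader := NeoCSVReader on: input readStream.
    1 to: totalColumns do: [ :i |
        (i between: keepFrom and: keepTo)
            ifTrue: [ reader addIntegerField ]
            ifFalse: [ reader addIgnoredField ] ].
    reader upToEnd.
    "expected: { #(2 3 4). #(7 8 9) }"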
It is possible :) I work with DNA sequences; there can be millions of common SNPs in a genome.

Cheers,

Hernán
Hernán,
> On 26 Jan 2015, at 08:00, Hernán Morales Durand <[hidden email]> wrote:
>
> It is possible :) I work with DNA sequences; there can be millions of common SNPs in a genome.

Still weird for CSV. How many records are there then? I assume they all have the same number of fields?

Anyway, could you point me to the specification of the format you want to read? And to the older parser that you used to use?

Thx,

Sven
> Still weird for CSV. How many records are there then?

We genotyped a few individuals (24 records), but now we have a genotyping platform (GeneTitan) with array plates allowing up to 96 samples, which is up to 2.6 million markers. The first run I completed generated CSVs of 1 million records (see the attached affy_run.jpg). Sadly, the high-level analysis of this data (annotation, clustering, discrimination) is now performed in R with packages like SNPolisher.

And this is microarray analysis; NGS platforms produce larger volumes of data in a shorter period of time (several genomes in a day). See http://www.slideshare.net/allenday/renaissance-in-medicine-strata-nosql-and-genomics for the 2014-2020 predictions.

Feel free to contact me if you want to experiment with metrics.

> I assume they all have the same number of fields?

Yes, I have never seen a CSV file with a variable number of fields (in this domain).

> Anyway, could you point me to the specification of the format you want to read?

Actually I am in no rush for this; I just want to avoid awk, sed and shell scripts in the next run. I would also like to avoid Python, but it spreads like a virus.

I will be working mostly with CSVs from Axiom annotation files [1] and genotyping results. Other file formats I use are genotype file formats for programs like PLINK [2] (PED files, column 7 onwards) and HaploView. It is worse than you might think, because you have to transpose the output generated by the genotyping platforms (millions of records), and then filter and cut them by chromosome, because those Java programs cannot deal with all chromosomes at the same time.

> And to the older parser that you used to use?

http://www.smalltalkhub.com/#!/~hernan/CSV

Cheers,
Hernán

[1] http://www.affymetrix.com/support/technical/annotationfilesmain.affx
[2] http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped
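(Editorial aside, not something discussed in the thread: the filter-and-cut-by-chromosome step could be done in a streaming fashion with NeoCSVReader and NeoCSVWriter, so the whole file never has to sit in memory. The file names and the chromosome column position below are assumptions for illustration only.)

    'genotypes.csv' asFileReference readStreamDo: [ :in |
        'chr1.csv' asFileReference writeStreamDo: [ :out |
            | reader writer |
            reader := NeoCSVReader on: in.
            writer := NeoCSVWriter on: out.
            reader skipHeader.
            reader do: [ :row |
                "assume the chromosome identifier is in column 2"
                row second = 'chr1'
                    ifTrue: [ writer nextPut: row ] ] ] ].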
Hernán,

if you need some help you could also find a smart student and ask ESUG to sponsor him during a SummerTalk.

Stef
|
Sven, I just added a section on your chapter :)
There are some cool ideas here:
https://github.com/BurntSushi/xsv
I will add that to the topic list!
Stef