NeoCSV and big files

SergeStinckwich
Dear all,
We are currently setting up a small ROASSAL team to participate in the
#Datathon Data for Development:
http://simplon.co/datathon-data-for-development-rdv-les-7-et-8-avril-a-montreuil/

We are looking for ways to load big CSV tables into a Pharo image.
Apparently some of the CSV files provided will be huge (around 5 GB
for one month of data). The format of the data is described here:
http://arxiv.org/abs/1407.4885

Is it possible with NeoCSV to read only the lines that satisfy
some condition?

If some people want to help online, we can set up a chat to coordinate.
Regards,
--
Serge Stinckwich
UCBN & UMI UMMISCO 209 (IRD/UPMC)
Every DSL ends up being Smalltalk
http://www.doesnotunderstand.org/

Re: NeoCSV and big files

NorbertHartl

> On 04.04.2015 at 19:23, Serge Stinckwich <[hidden email]> wrote:
>
> Is it possible with NeoCSV to read only the lines that satisfy
> some condition?
>
The NeoCSVReader supports the necessary stream protocol. Once you have set up the CSV reader, you can call #next on it and filter by condition. There is also #atEnd, so a simple loop should work. But I have never used the CSV reader myself, so Sven might have much better options.
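
For readers following along, the loop described above could look roughly like this. It is only a sketch: the file name 'calls.csv', the presence of a header line, and the test on the first field are assumptions for illustration, not details from this thread. It also assumes #skipHeader and FileReference>>readStreamDo: are available in your image.

```smalltalk
"Sketch: stream through a CSV file and keep only the matching records.
 'calls.csv' and the condition on the first field are hypothetical."
| matches |
matches := OrderedCollection new.
'calls.csv' asFileReference readStreamDo: [ :stream |
	| reader |
	reader := NeoCSVReader on: stream.
	reader skipHeader.	"assumes the first line is a header; drop this otherwise"
	[ reader atEnd ] whileFalse: [
		| record |
		record := reader next.	"one record at a time, an Array of fields by default"
		(record first = 'some-value') ifTrue: [ matches add: record ] ] ].
matches size
```

Only the records added to matches stay in memory; everything else is read and discarded, so the size of the file matters less than the size of the selection.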

Norbert


Re: NeoCSV and big files

Sven Van Caekenberghe-2
There are also #select: and #select:thenDo: convenience methods.

NeoCSV streams properly, so it should not introduce memory consumption problems itself. But note that you cannot load more than about 1 GB of permanent data in the current VM.
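
A rough sketch of how #select:thenDo: might be used to stay within that limit, by processing each matching record as it is read instead of collecting them. The file name 'calls.csv', the choice of the second field, and the value compared against are hypothetical; adapt them to the actual data format.

```smalltalk
"Sketch: count matching records without keeping any of them in memory.
 'calls.csv', the field index and 'some-value' are assumptions."
| count |
count := 0.
'calls.csv' asFileReference readStreamDo: [ :stream |
	(NeoCSVReader on: stream)
		select: [ :record |
			"guard against short records before indexing the second field"
			record size >= 2 and: [ (record at: 2) = 'some-value' ] ]
		thenDo: [ :record | count := count + 1 ] ].
count
```

As far as I understand it, plain #select: collects all matching records and returns them, so it is the better fit only when the selection itself is small enough to keep in the image.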

One known performance limitation is in handling extremely long lines/records.

If you have a question or problem, just ask.

Sven


Re: NeoCSV and big files

abergel
Thanks Sven for your support!

Alexandre


--
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel  http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.