NeoCSV and big files


NeoCSV and big files

SergeStinckwich
Dear all,
We are currently setting up a small ROASSAL team to participate in the
#Datathon Data for Development:
http://simplon.co/datathon-data-for-development-rdv-les-7-et-8-avril-a-montreuil/

We are looking for ways to load big CSV tables into a Pharo image.
Apparently some of the CSV files provided will be huge (around 5 GB
for one month of data). The format of the data is described here:
http://arxiv.org/abs/1407.4885

Is it possible with NeoCSV to read only a fraction of the lines,
selected by some conditions?

If some people want to help online, we can set up a chat to coordinate.
Regards,
--
Serge Stinckwich
UCBN & UMI UMMISCO 209 (IRD/UPMC)
Every DSL ends up being Smalltalk
http://www.doesnotunderstand.org/


Re: NeoCSV and big files

NorbertHartl

> On 04.04.2015 at 19:23, Serge Stinckwich <[hidden email]> wrote:
>
> Is it possible with NeoCSV to read only a fraction of the lines,
> selected by some conditions?
>
The NeoCSVReader supports the necessary stream protocol. Once you have set up the CSV reader, you can call #next on it and filter the records by condition. There is also #atEnd, so a simple loop should do. But I have never used the CSV reader myself, so Sven might have much better options.
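
A minimal sketch of such a loop, assuming a Pharo image with NeoCSV loaded. The sample data, the column layout and the filter condition are made up for illustration; a real run would pass a file read stream instead of an in-memory string.

```smalltalk
| reader matching |
"Hypothetical sample data standing in for a real CSV file."
reader := NeoCSVReader on: 'id,zone,count
1,A,10
2,B,3
3,A,7' readStream.

"Ignore the header line."
reader skipHeader.

matching := OrderedCollection new.
[ reader atEnd ] whileFalse: [
	| record |
	record := reader next.
	"Without field definitions, each record is an Array of Strings."
	(record at: 2) = 'A'
		ifTrue: [ matching add: record ] ].

"Only the records that passed the filter are kept in memory."
matching
```

Records that do not match are dropped as soon as the next one is read, so only the selected subset stays in the image.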

Norbert




Re: NeoCSV and big files

Sven Van Caekenberghe-2
There are also #select: and #select:thenDo: convenience methods.
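
As a rough sketch of how these two methods might be used (the sample data and the filter are hypothetical, and a Pharo image with NeoCSV loaded is assumed):

```smalltalk
| input selected total |
"Hypothetical sample data standing in for a real CSV file."
input := 'id,zone,count
1,A,10
2,B,3
3,A,7'.

"#select: answers a collection with only the records for which the block is true."
selected := (NeoCSVReader on: input readStream)
	skipHeader;
	select: [ :record | (record at: 2) = 'A' ].

"#select:thenDo: evaluates the second block for each matching record
without collecting the records themselves."
total := 0.
(NeoCSVReader on: input readStream)
	skipHeader;
	select: [ :record | (record at: 2) = 'A' ]
	thenDo: [ :record | total := total + (record at: 3) asInteger ].
```

The second form is probably the better fit for very large inputs, since nothing is accumulated unless the block does so itself.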

NeoCSV is properly streaming, so it should not introduce memory consumption problems itself. But note that you cannot load more than about 1 GB of permanent data in the current VM.
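
One way to stay well under that limit is to stream the file and keep only an aggregate instead of the rows. The following is a sketch under assumptions, not a tuned solution: the file name, the columns and the per-zone totals are invented for illustration, and the small sample file is written first only to make the snippet self-contained.

```smalltalk
| file totals |
"Hypothetical sample file, written first so that the sketch is self-contained."
file := FileLocator temp / 'neocsv-sample.csv'.
file ensureDelete.
file writeStreamDo: [ :out |
	out nextPutAll: 'id,zone,count
1,A,10
2,B,3
3,A,7' ].

"Stream the file record by record and keep only running totals per zone,
so memory use depends on the number of zones, not on the file size."
totals := Dictionary new.
file readStreamDo: [ :in |
	(NeoCSVReader on: in)
		skipHeader;
		do: [ :record |
			| zone |
			zone := record at: 2.
			totals
				at: zone
				put: (totals at: zone ifAbsent: [ 0 ]) + (record at: 3) asInteger ] ].
totals
```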

One known performance limitation is in handling extremely long lines/records.

If you have a question or problem, just ask.

Sven

> On 04 Apr 2015, at 19:54, Norbert Hartl <[hidden email]> wrote:
>
> The NeoCSVReader supports the necessary stream protocol. Once you have set up the CSV reader, you can call #next on it and filter the records by condition. There is also #atEnd, so a simple loop should do. But I have never used the CSV reader myself, so Sven might have much better options.
>
> Norbert



Re: NeoCSV and big files

abergel
Thanks Sven for your support!

Alexandre


> On Apr 4, 2015, at 3:02 PM, Sven Van Caekenberghe <[hidden email]> wrote:
>
> There are also #select: and #select:thenDo: convenience methods.
>
> NeoCSV is properly streaming, so it should not introduce memory consumption problems itself. But note that you cannot load more than about 1 GB of permanent data in the current VM.
>
> One known performance limitation is in handling extremely long lines/records.
>
> If you have a question or problem, just ask.
>
> Sven

--
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel  http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.