running out of memory while processing a 220MB csv file with NeoCSVReader - tips?


running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Paul DeBruicker
Hi -

I'm processing 9 GB of CSV files (the biggest file is 220MB or so).  I'm not sure if it's because of the size of the files or the code I've written to keep track of the domain objects I'm interested in, but I'm getting out-of-memory errors & crashes in Pharo 3 on Mac with the latest VM.  I haven't checked other VMs.

I'm going to profile my own code and attempt to split the files manually for now to see what else it could be.


Right now I'm doing something similar to

        | file reader |
        file := '/path/to/file/myfile.csv' asFileReference readStream.
        reader := NeoCSVReader on: file.

        reader
                recordClass: MyClass;
                skipHeader;
                addField: #myField:;
                ....
       

        reader do:[:eachRecord | self seeIfRecordIsInterestingAndIfSoKeepIt: eachRecord].
        file close.



Is there a facility in NeoCSVReader to read a file in batches (e.g. 1000 lines at a time) or an easy way to do that ?
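
e.g. something like this rough sketch (assuming NeoCSVReader also answers the stream-style #next and #atEnd; #processBatch: is just a placeholder for my own code):

        | file reader batch |
        file := '/path/to/file/myfile.csv' asFileReference readStream.
        reader := NeoCSVReader on: file.
        reader skipHeader.
        "collect and hand off 1000 records at a time"
        [ reader atEnd ] whileFalse: [
                batch := OrderedCollection new: 1000.
                [ batch size < 1000 and: [ reader atEnd not ] ]
                        whileTrue: [ batch add: reader next ].
                self processBatch: batch ].
        file close.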




Thanks

Paul

Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Sven Van Caekenberghe-2
Hi Paul,

I think you must be doing something wrong with your class; #do: is implemented as streaming over the records one by one, never holding more than one in memory.

This is what I tried:

'paul.csv' asFileReference writeStreamDo: [ :file|
  ZnBufferedWriteStream on: file do: [ :out |
    (NeoCSVWriter on: out) in: [ :writer |
      writer writeHeader: { #Number. #Color. #Integer. #Boolean}.
      1 to: 1e7 do: [ :each |
        writer nextPut: { each. #(Red Green Blue) atRandom. 1e6 atRandom. #(true false) atRandom } ] ] ] ].

This results in a 300MB file:

$ ls -lah paul.csv
-rw-r--r--@ 1 sven  staff   327M Nov 14 20:45 paul.csv
$ wc paul.csv
 10000001 10000001 342781577 paul.csv

This is a selective read and collect (loads about 10K records):

Array streamContents: [ :out |
  'paul.csv' asFileReference readStreamDo: [ :in |
    (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
      reader skipHeader; addIntegerField; addSymbolField; addIntegerField; addFieldConverter: [ :x | x = #true ].
      reader do: [ :each | each third < 1000 ifTrue: [ out nextPut: each ] ] ] ] ].

This worked fine on my MacBook Air, no memory problems. It takes a while to parse that much data, of course.

Sven




Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Paul DeBruicker
Hi Sven,

Thanks for taking a look and testing the NeoCSVReader portion for me.  You're right of course that there's something I'm doing that's slow.  But there is something I can't figure out yet.

To provide a little more detail:

When the 'csv reading' process completes successfully, profiling shows that most of the time is spent in NeoCSVReader>>#peekChar and in the NeoCSVReader>>#addField: converter that turns a string into a DateAndTime.  Dropping the DateAndTime conversion speeds things up but doesn't stop it from running out of memory.

I start the image with

./pharo-ui --memory 1000m myimage.image  

Splitting the CSV file helps:
~1.5MB  5,000 lines = 1.2 seconds.
~15MB   50,000 lines = 8 seconds.
~30MB   100,000 lines = 16 seconds.
~60MB   200,000 lines  = 45 seconds.
 

It seems that when the CSV file crosses ~70MB in size, performance goes haywire, which leads to the out-of-memory condition.  The processing never ends.  Sending "kill -SIGUSR1" prints a stack primarily composed of:

0xbffc5d08 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d20 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d38 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d50 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d68 M OutOfMemory class(Behavior)>basicNew 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d80 M OutOfMemory class(Behavior)>new 0x1f7ac060: a(n) OutOfMemory class
0xbffc5d98 M OutOfMemory class(Exception class)>signal 0x1f7ac060: a(n) OutOfMemory class

So it seems like it's trying to signal that it's out of memory after it's already out of memory, which triggers another OutOfMemory error.  So that's why progress stops.


** Aside - OutOfMemory should probably be refactored to be able to signal itself without taking up more memory, so it doesn't trigger itself infinitely.  Maybe it & its signalling morph infrastructure would be good as a singleton **



I'm confused about why it runs out of memory.  According to htop the image only takes up about 520-540 MB of RAM when it reaches the 'OutOfMemory' condition.  This MacBook Air laptop has 4GB, so there is plenty of room for the image to grow.  Also, I've specified a 1,000MB image size when starting, so it should have plenty of room.  Is there something I should check, or a flag somewhere that prevents it from growing on a Mac?  This is the latest Pharo30 VM.


Thanks for helping me get to the bottom of this

Paul

Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Sven Van Caekenberghe-2
Can you successfully run my example code ?



Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Paul DeBruicker
Yes. With the image & VM I'm having trouble with, I get an array with 9,942 elements in it.  So it works as you'd expect.

While processing the CSV file the image stays at about 60MB in RAM.

Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Sven Van Caekenberghe-2
OK then, you *can* read/process 300MB .csv files ;-)

What does your CSV file look like, can you show a couple of lines ?
You are using a custom record class of your own, what does that look like or do ?
Maybe you can try using Array again ?

What percentage of records read do you keep ? In my example it was very small. Have you tried calculating your memory usage ?



Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Paul DeBruicker
Hi Sven

Yes, like I said earlier after your first email, I think it's not so much a problem with NeoCSV as with what I'm doing, plus an out-of-memory condition.

Have you ever seen a stack after sending kill -SIGUSR1 that looks like this:

output file stack is full.
output file stack is full.
output file stack is full.
output file stack is full.
output file stack is full.
....


What does that mean?

Answers to your questions below.

Thanks again for helping me out


Sven Van Caekenberghe-2 wrote
> OK then, you *can* read/process 300MB .csv files ;-)
>
> What does your CSV file look like, can you show a couple of lines ?

here are 2 lines + a header:

"provnum","Provname","address","city","state","zip","survey_date_output","SurveyType","defpref","tag","tag_desc","scope","defstat","statdate","cycle","standard","complaint","filedate"
"015009","BURNS NURSING HOME, INC.","701 MONROE STREET NW","RUSSELLVILLE","AL","35653","2013-09-05","Health","F","0314","Give residents proper treatment to prevent new bed (pressure) sores or heal existing bed sores.","D","Deficient, Provider has date of correction","2013-10-10",1,"Y","N","2014-01-01"
"015009","BURNS NURSING HOME, INC.","701 MONROE STREET NW","RUSSELLVILLE","AL","35653","2013-09-05","Health","F","0315","Ensure that each resident who enters the nursing home without a catheter is not given a catheter, unless medically necessary, and that incontinent patients receive proper services to prevent urinary tract infections and restore normal bladder functions.","D","Deficient, Provider has date of correction","2013-10-10",1,"Y","N","2014-01-01"


> You are using a custom record class of your own, what does that look like or do ?

A custom record class.  This is all publicly available data, but I'm keeping track of the performance of US-based health care providers during their annual inspections.  So the records are notes of a deficiency during an inspection, and I'm keeping those notes in a collection in an instance of the health care provider's class.  The custom record class just converts the CSV record to objects (Integers, Strings, DateAndTimes), and the result gets stuffed into the health care provider's deficiency-history OrderedCollection (which has about 100 items).  Again, I don't think it's what I'm doing as much as the image isn't growing when it needs to.
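
For illustration, a minimal sketch of that kind of record class (the class, field, and selector names here are made up, not the real code; the setters do the string-to-object conversion, and the field order has to match the CSV columns):

Object subclass: #DeficiencyRecord
        instanceVariableNames: 'provnum tagDesc statdate'
        classVariableNames: ''
        category: 'Example-Records'.

DeficiencyRecord >> provnum: aString
        provnum := aString

DeficiencyRecord >> tagDesc: aString
        tagDesc := aString

DeficiencyRecord >> statdate: aString
        "this conversion is where the profiled DateAndTime cost shows up"
        statdate := DateAndTime fromString: aString

reader
        recordClass: DeficiencyRecord;
        skipHeader;
        addField: #provnum:;
        addField: #tagDesc:;
        addField: #statdate:.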




> Maybe you can try using Array again ?

I've attempted a version where I parse and convert the entire CSV into domain objects and then add them to the image; the parsing works fine, but the system runs out of resources during the update phase.


> What percentage of records read do you keep ? In my example it was very small. Have you tried calculating your memory usage ?


I'm keeping some data from every record, but it doesn't load more than 500MB of the data before falling over.  I am not attempting to load the 9GB of CSV files into one image.  For 95% of the records in the CSV file, 20 of the 22 columns of data are the same from file to file; just a 'published date' and a 'time to expiration' date change.  Each file covers a month, with about 500k deficiencies.  Each month some deficiencies are added to the file and some are resolved, so the total number of deficiencies in the image is about 500k.  For those records that don't expire in a given month, I'm adding the published date to a collection of published dates on the record, and also adding the 'time to expiration' to a collection of those, to record what was made public, and letting the rest of the data get GC'd.  I don't load only those two fields because the other fields of the record in the CSV could change.
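
Roughly, that update step looks like this sketch (all class, variable, and selector names are invented placeholders, not the real code):

reader do: [ :record |
        | provider deficiency |
        provider := providers at: record provnum.
        deficiency := provider deficiencyMatching: record.
        deficiency
                ifNil: [ provider addDeficiency: record ]
                ifNotNil: [
                        deficiency publishedDates add: record filedate.
                        deficiency timesToExpiration add: record timeToExpiration ].
        "everything else in the record is dropped and can be GC'd" ]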

I have not calculated the memory usage for the collection because I thought it would have no problem fitting in the 2GB of RAM I have on this machine.  




Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Sven Van Caekenberghe-2


I don't know, but I think that you are really out of memory.
BTW, I think that setting no flags is better; memory will then expand maximally.
I think the useful maximum is closer to 1GB than 2GB.

It is difficult to follow what you are doing exactly, but I think that you underestimate how much memory a parsed, structured/nested object uses. Taking the second line of your example, the 20+ fields, with 3 DateAndTimes, easily cost between 512 and 1024 bytes per record. That would limit you to between 1M and 2M records.
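
For reference, the back-of-the-envelope arithmetic behind that estimate, assuming roughly 1GB of usable object memory:

(1024 * 1024 * 1024) // 1024.   "about 1M records at 1KB each"
(1024 * 1024 * 1024) // 512.    "about 2M records at 512 bytes each"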

I tried this:

Array streamContents: [ :data |
        5e2 timesRepeat: [
                data nextPut: (Array streamContents: [ :out |
                        20 timesRepeat: [ out nextPut: Character alphabet ].
                        3 timesRepeat: [ out nextPut: DateAndTime now ] ]) ] ].

It worked for 5e5 but not for 5e6 (varying the outer timesRepeat: count) - I didn't try numbers in between as it takes very long.

Good luck, if you can solve this, please tell us how you did it.




Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

stepharo
In reply to this post by Sven Van Caekenberghe-2
Thanks for this cool discussion.
I will add this to the NeoCSV chapter :)



Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Alain Rastoul-2
In reply to this post by Paul DeBruicker
Hi Paul,

There are some VM memory-allocation issues that exist on the Windows VM too, but this is more related to your approach, which is debatable, than to a real VM problem.
Suppose tomorrow you have to deal with all North American health care providers, including Canada, or you add South America, or the rate is no longer 95%: the data won't fit in memory, even with 2GB or 4GB, maybe not even with 16GB.
I saw the same approach and the same kind of problems in .NET.

Let's say it is your current requirement and you want to do it like that; here is a trick that may help you: during personal experiments with loading data into memory and computing statistics from databases, I found that most often 70 to 80% of the real data is duplicated.

If you really need to load all that in memory, you can intern your data in a dictionary.
Very simple to do: at the beginning of your processing,

                | internedData |
                internedData := Dictionary new.

And for each field, before storing it in your array:

                field := internedData at: field ifAbsentPut: [ field ].

This will of course add some extra processing time, but I guess it should be ok.
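
A rough sketch of how this could plug into the NeoCSV reading loop (Array records assumed; #keepIfInteresting: is just a placeholder for your own bookkeeping):

| internedData |
internedData := Dictionary new.
'/path/to/file/myfile.csv' asFileReference readStreamDo: [ :in |
        (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
                reader skipHeader.
                reader do: [ :record |
                        | compact |
                        "replace each field with its canonical copy so equal strings are shared"
                        compact := record collect: [ :field |
                                internedData at: field ifAbsentPut: [ field ] ].
                        self keepIfInteresting: compact ] ] ]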

:)

Regards,

Alain

Le 14/11/2014 23:14, Paul DeBruicker a écrit :

> Hi Sven
>
> Yes, like I said earlier, after your first email, that I think its not a
> problem with NeoCSV as with what I'm doing and an out of memory condition.
>
> Have you ever seen a stack after sending kill -SIGUSR1 that looks like this:
>
> output file stack is full.
> output file stack is full.
> output file stack is full.
> output file stack is full.
> output file stack is full.
> ....
>
>
> What does that mean?
>
> Answers to your questions below.
>
> Thanks again for helping me out
>
>
>

Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Paul DeBruicker
In reply to this post by Sven Van Caekenberghe-2
Hi Sven,

I think you are right that I am underestimating how big these objects are, especially in large numbers.  But I still think there is a separate problem: the image is not growing to the machine or VM limits.

Eliot Miranda created a script to see how big a heap (using Spur) could grow on different platforms (the email is here: http://forum.world.st/New-Cog-VMs-available-td4764823.html#a4764840).  I've adapted it ever so slightly to run on Pharo 3:

| them |
them := OrderedCollection new.
"Allocate 16 MB chunks until the VM signals OutOfMemory, printing the object memory size (MB) after each one."
[ [ them addLast: (ByteArray new: 16000000).
    Transcript cr; show: ((Smalltalk vm parameterAt: 3) / (1024 * 1024.0) printShowingDecimalPlaces: 1); flush ] repeat ]
  on: OutOfMemory
  do: [ :ex | "free every other chunk, garbage collecting as we go"
    2 to: them size by: 2 do: [ :i | them at: i put: nil. Smalltalk garbageCollect ] ].
Transcript cr; show: ((Smalltalk vm parameterAt: 3) / (1024 * 1024.0) printShowingDecimalPlaces: 1); flush.
them := nil.
Smalltalk garbageCollect.
Transcript cr; show: ((Smalltalk vm parameterAt: 3) / (1024 * 1024.0) printShowingDecimalPlaces: 1); flush

When I run it in Pharo 3 (& Pharo 1.4) on my laptop this is the output:


49.4
64.6
79.9
95.2
110.4
125.7
140.9
156.2
171.5
186.7
202.0
217.2
232.5
247.8
263.0
278.3
293.5
308.8
324.1
339.3
354.6
369.8
385.1
400.3
415.6
430.9
446.1
461.4
476.6
491.9
507.2
278.3
34.1

It shows that the heap grows to about 500MB and then an OutOfMemory error is thrown.  


My expectation was that this test would make the image grow to either the limit of the machine or the limit of the VM, whichever came first.

Is there a setting I need to change to make the image grow to, say, 1GB for this test?

Starting the image with the '--memory 1000m' command line argument doesn't change the test result.  


Also - that weird stack with 'output file stack is full' was a result of running the MessageTally profiler and hitting the issue that John McIntosh described here: http://forum.world.st/Squeak-hang-at-full-cpu-help-tp3006008p3007628.html


And for my immediate needs of processing the CSV files I ported everything to GemStone and am all set in that regard.





Sven Van Caekenberghe-2 wrote
> On 14 Nov 2014, at 23:14, Paul DeBruicker <[hidden email]> wrote:
>
> Hi Sven
>
> Yes, like I said earlier, after your first email, that I think its not a
> problem with NeoCSV as with what I'm doing and an out of memory condition.  
>
> Have you ever seen a stack after sending kill -SIGUSR1 that looks like this:
>
> output file stack is full.
> output file stack is full.
> output file stack is full.
> output file stack is full.
> output file stack is full.
> ....
>
>
> What does that mean?

I don't know, but I think that you are really out of memory.
BTW, I think that setting no flags is better, memory will expand maximally then.
I think the useful maximum is closer to 1GB than 2GB.

> Answers to your questions below.

It is difficult to follow what you are doing exactly, but I think that you underestimate how much memory a parsed, structured/nested object uses. Taking the second line of your example, the 20+ fields, with 3 DateAndTimes, easily cost between 512 and 1024 bytes per record. That would limit you to between 1M and 2M records.

I tried this:

Array streamContents: [ :data |
        5e2 timesRepeat: [
                data nextPut: (Array streamContents: [ :out |
                        20 timesRepeat: [ out nextPut: Character alphabet ].
                        3 timesRepeat: [ out nextPut: DateAndTime now ] ]) ] ].

it worked to 5e5, but not for 5e6 - I didn't try numbers in between as it takes very long.

Good luck, if you can solve this, please tell us how you did it.
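
A quick back-of-the-envelope check of that estimate, assuming roughly 1 KB per record and the ~500k deficiencies per month mentioned earlier in the thread:

"number of records * bytes per record, expressed in MB"
(500000 * 1024) / (1024 * 1024.0)    "about 488 MB - roughly where the image falls over"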


Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Alain Rastoul-2
Ah, this reminded me of an old thread about memory on Windows, about why the
Windows setting was 512 MB by default:
http://lists.pharo.org/pipermail/pharo-dev_lists.pharo.org/2011-April/047594.html

And about VM options, it reminded me too that on Windows the option had
a trailing ':' that didn't exist on the Mac, IIRC
(and no double '-' on the Mac, I think):
-memory: 1024 versus -memory 1024

It may be a stupid idea, but perhaps you could try it?





Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Paul DeBruicker
Hi Alain,

Thanks for the link to the discussion.  I attempted your suggestion for changing the command line parameters and it had no effect.  Adding the colon prevented the image from starting, as did using a single hyphen.  


Paul



Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Alain Rastoul-2
OK, bad guess, sorry.

My previous suggestion about interning data in this kind of processing
is more reliable. It could help you reduce the memory footprint by at
least a factor of 3 (depending on whether you have lots of integers or
floats; it works very well with strings and dates) and should be very
easy to try (just a few lines to change in the CSV reading method). I
used it several times and was always puzzled by that "statistical
truth" about data.
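
For example, a minimal sketch of that interning idea (the file name is a
placeholder and the loop body is left as a comment): equal strings are
looked up in a Dictionary, so each distinct value is stored only once.

| interned intern |
interned := Dictionary new.
intern := [ :string | interned at: string ifAbsentPut: [ string ] ].

'myfile.csv' asFileReference readStreamDo: [ :in |
  (NeoCSVReader on: (ZnBufferedReadStream on: in)) in: [ :reader |
    reader skipHeader.
    reader do: [ :record |
      "replace every field by its one canonical, shared copy"
      1 to: record size do: [ :i |
        record at: i put: (intern value: (record at: i)) ].
      "record now holds shared strings; keep whatever is interesting" ] ] ]

The same block could also be hooked in per column with addFieldConverter:,
as in Sven's earlier example.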



Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Igor Stasenko
The never-ending memory consumption problem.
Hopefully with the 64-bit version of the VM we'll have way more space to waste,
and it should take more effort to bring the system to its knees.


--
Best regards,
Igor Stasenko.

Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Stephan Eggermont-3
In reply to this post by Paul DeBruicker
Open the package contents of your VM,
open Contents,
and take a look at the Info.plist:

        <key>SqueakMaxHeapSize</key>
        <integer>541065216</integer>

That value (541,065,216 bytes, about 516 MB) needs to be increased to be able to use more than ~512 MB; setting it to 1073741824, for example, should let the heap grow to about 1 GB.

Alain wrote:
>Let say it's your current requirement, and you want to do it like that,
>a trick that may help you : during personal experiments about loading
>data in memory and statistics from databases, I found that most often 70
>to 80 % of real data is the same.

It is easy to confirm if this is the case in your data: just zip the csv file.
Reasonably structured relational database output often reduces to 10% of size.
With explicitly denormalized data I've seen 99% reduction.

In addition, DateAndTime has a rather wasteful representation for your purpose.
Just reduce each date to one SmallInteger, or, with Pharo 4, use slots to get a
more compact record representation.
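
A minimal sketch of that reduction, assuming Date fromString: accepts the
yyyy-mm-dd strings used in this file (the toDays/fromDays block names are
just for illustration):

"keep each date column as a SmallInteger (its Julian day number)"
| toDays fromDays days |
toDays := [ :string | (Date fromString: string) julianDayNumber ].
fromDays := [ :integer | Date julianDayNumber: integer ].

days := toDays value: '2013-09-05'.   "a SmallInteger"
fromDays value: days                  "back to a Date, only for the records that are kept"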

Stephan

Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Alain Rastoul-2
You are saying that the zip ratio is somewhat related to normalized data;
an interesting view, and certainly true :)
And right, interning effectively normalizes all fields, a technique used in
specialized column-store databases (MonetDB and others), often BI
databases with an id representing each value (those were my experiments).
About DateAndTimes, I think this is no different than with other values:
using a pointer to an interned value should be equivalent to using an
int, as it would be a 32-bit pointer, and with this approach compact
records should not make a big difference either if there are not a lot
of different values.
The point I made is that in real life this "normalizing ratio"
is very high for almost every kind of data, and that is what puzzles me
(not the technique).

Regards,
Alain


Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Alain Rastoul-2
Our referential world is very restricted, whatever area we are talking about.


Re: running out of memory while processing a 220MB csv file with NeoCSVReader - tips?

Stephan Eggermont-3
In reply to this post by Paul DeBruicker
Alain wrote:
>You are saying that the zip ratio is somewhat related to normalized data;
>an interesting view, and certainly true :)

I find it a nice heuristic to help me get started.
Just sort the tables by size, start compressing them, and
begin with the ones that compress best.

>About DateAndTimes, I think this is no different than with other values:
>using a pointer to an interned value should be equivalent to using an
>int, as it would be a 32-bit pointer, and with this approach compact
>records should not make a big difference either if there are not a lot
>of different values.

Combining multiple booleans into one word still helps a lot, as does
introducing extra objects for highly correlated fields.
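
As a rough illustration (the column names are the 'standard' and 'complaint'
Y/N flags from the header shown earlier in the thread; the packing block is
hypothetical), two such booleans can share the bits of one SmallInteger:

| pack flags |
pack := [ :standard :complaint |
  (standard = 'Y' ifTrue: [ 1 ] ifFalse: [ 0 ])
    + ((complaint = 'Y' ifTrue: [ 1 ] ifFalse: [ 0 ]) bitShift: 1) ].

flags := pack value: 'Y' value: 'N'.
(flags bitAnd: 1) > 0.   "true - the standard flag"
(flags bitAnd: 2) > 0    "false - the complaint flag"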

>The point I made is that in real life this "normalizing ratio"
>is very high for almost every kind of data, and that is what puzzles me
>(not the technique).

My impression is that a lot of the design decisions for relational databases
are cargo cult, dating from the time when most databases did not fit
into RAM and query optimizers were not good at dealing with
lots of joins.

Stephan