Has anyone tried compiling the Pharo VM into JS using Emscripten?


Has anyone tried compiling the Pharo VM into JS using Emscripten?

Andy Burnett
I just saw this implementation of SQLite as a JS system, via Emscripten, and I was curious whether something similar would be even vaguely possible for the VM.

Cheers
Andy

Re: Has anyone tried compiling the Pharo VM into JS using Emscripten?

Alain Rastoul-2
Hi Andy,

If I understand your question, you want to remake Dan Ingalls's Lively
Kernel? :)
http://lively-web.org/welcome.html

Cheers,
Alain


On 14/11/2014 22:31, Andy Burnett wrote:
> I just saw this implementation of SQLite as a JS system, via Emscripten,
> and I was curious whether something similar would be even vaguely
> possible for the VM.
>
> Cheers
> Andy




Re: Has anyone tried compiling the Pharo VM into JS using Emscripten?

Andy Burnett
In reply to this post by Andy Burnett
Alain wrote
<<<
Hi Andy,

If I understand your question, you want to remake Dan Ingalls's Lively
Kernel?
>>>

Hi Alain,
Although I am really impressed with Dan's work, Emscripten seems to be very different.

In theory - and my knowledge in this area is very limited - it might allow the VM C files to be transpiled into a very tight subset of JS. On the face of it, this is a crazy idea, but they have achieved amazing performance with things like the Qt library. This made me wonder whether a JS version of Pharo is possible.

Just idle conjecturing on a Friday night :-)
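
For concreteness, here is roughly what Emscripten does, as a minimal
sketch: a stand-in C file compiled to JavaScript with emcc. The file
name, the add function, and the flags are illustrative, not taken from
any Pharo build.

/* hello.c -- an illustrative stand-in for "a VM C file".
   Compile to JavaScript with Emscripten's compiler driver:
       emcc hello.c -o hello.js
   then run the result under node, or load hello.js from a web page. */
#include <stdio.h>

static int add(int a, int b) {
    return a + b;
}

int main(void) {
    printf("2 + 3 = %d\n", add(2, 3)); /* prints: 2 + 3 = 5 */
    return 0;
}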





Re: Has anyone tried compiling the Pharo VM into JS using Emscripten?

Alain Rastoul-2
Yes, that's right.
Dan Ingalls also did the Potato VM, a rewrite of the Squeak VM in Java
that looked very much like the VM interpreter's C source file,
interpreting each bytecode in a loop with a switch statement, and it ran
surprisingly well (a bit slow, but surprisingly smooth).
A different approach, but interesting to mention.
So, about your idea: with some additional work around the plugins and in
the image, why not? It sounds like an interesting idea.
Do you plan to make a POC?
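
(For readers who haven't seen that style, here is a minimal sketch of
such a loop-and-switch bytecode interpreter in C. It is not the Squeak
or Potato source; the opcodes and names are made up for illustration.)

#include <stdint.h>
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

static void interpret(const uint8_t *code) {
    int stack[32];
    int sp = 0;    /* stack pointer */
    size_t ip = 0; /* instruction pointer */
    for (;;) {
        switch (code[ip++]) {        /* fetch and dispatch one bytecode */
        case OP_PUSH:  stack[sp++] = code[ip++]; break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_PRINT: printf("%d\n", stack[sp - 1]); break;
        case OP_HALT:  return;
        }
    }
}

int main(void) {
    const uint8_t program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
    interpret(program); /* prints 5 */
    return 0;
}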


On 15/11/2014 01:21, Andy Burnett wrote:

> Alain wrote
> <<<
> Hi Andy,
>
> If I understand your question, you want to remake Dan Ingalls's Lively
> Kernel?
>>>>
>
> Hi Alain,
> Although I am really impressed with Dan's work, Emscripten seems to be very different.
>
> In theory - and my knowledge in this area is very limited - it might allow the VM C files to be transpiled into a very tight subset of JS. On the face of it, this is a crazy idea, but they have achieved amazing performance with things like the Qt library. This made me wonder whether a JS version of Pharo is possible.
>
> Just idle conjecturing on a Friday night :-)
>




Re: Has anyone tried compiling the Pharo VM into JS using Emscripten?

kilon.alios
In reply to this post by Alain Rastoul-2


"Hi Andy,

If I understand your question, you want to remake Dan Ingalls's Lively Kernel?
:)
http://lively-web.org/welcome.html

Cheers,
Alain"

No, these are two separate things. Compiling the Pharo VM with Emscripten means you would still have the Pharo VM, but now running on top of a browser, since it would be pure JavaScript. The Lively Kernel, on the other hand, is not related to Smalltalk at all; it's just a GUI on top of JavaScript, meant to be used as a JavaScript library.

What Andy wants is something similar to SqueakJS, but for Pharo. It would be great to have something like that, but I fear it would be quite a challenging project in itself, since you would need to worry about things like graphics and event handling inside the browser. Maybe someone could take SqueakJS as a template and work from that.
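
To illustrate the event-handling point: an Emscripten-compiled program
cannot sit in its own blocking event loop the way the VM does natively;
it has to hand the browser a callback instead. A minimal sketch, assuming
only Emscripten's emscripten.h (the tick body is illustrative, not real
VM code):

#include <stdio.h>
#include <emscripten.h>

static int frames = 0;

static void tick(void) {
    /* one slice of interpreter work would go here */
    if (++frames % 60 == 0)
        printf("still running after %d frames\n", frames);
}

int main(void) {
    /* fps = 0: let the browser drive the callback via requestAnimationFrame;
       simulate_infinite_loop = 1: this call never returns, like a real
       event loop. Build with e.g.: emcc loop.c -o loop.html */
    emscripten_set_main_loop(tick, 0, 1);
    return 0;
}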

On Sat, Nov 15, 2014 at 1:09 AM, Alain Rastoul <[hidden email]> wrote:



On 14/11/2014 22:31, Andy Burnett wrote:
I just saw this implementation of SQLite as a JS system, via Emscripten,
and I was curious whether something similar would be even vaguely
possible for the VM.

Cheers
Andy






Re: Has anyone tried compiling the Pharo VM into JS using Emscripten?

Alain Rastoul-2
Yes, about the implementation you are right, but I was talking more about
the idea, and the fact that Dan Ingalls is one of the Smalltalk creators
(see also my other answer about his Potato VM).
I agree it would be really cool :)

On 15/11/2014 09:38, kilon alios wrote:

>
>
> "Hi Andy,
>
> If I understand your question, you want to remake Dan Ingalls's Lively
> Kernel?
> :)
> http://lively-web.org/welcome.html
>
> Cheers,
> Alain"
>
> No, these are two separate things. Compiling the Pharo VM with Emscripten
> means you would still have the Pharo VM, but now running on top of a
> browser, since it would be pure JavaScript. The Lively Kernel, on the
> other hand, is not related to Smalltalk at all; it's just a GUI on top of
> JavaScript, meant to be used as a JavaScript library.
>
> What Andy wants is something similar to SqueakJS, but for Pharo. It would
> be great to have something like that, but I fear it would be quite a
> challenging project in itself, since you would need to worry about things
> like graphics and event handling inside the browser. Maybe someone could
> take SqueakJS as a template and work from that.
>
> On Sat, Nov 15, 2014 at 1:09 AM, Alain Rastoul <[hidden email]> wrote:
>
>
>
>
>     On 14/11/2014 22:31, Andy Burnett wrote:
>
>         I just saw this implementation of SQLite as a JS system, via
>         Emscripten,
>         and I was curious whether something similar would be even vaguely
>         possible for the VM.
>
>         Cheers
>         Andy
>
>
>
>
>




Re: Has anyone tried compiling the Pharo VM into JS using Emscripten?

stepharo
In reply to this post by Alain Rastoul-2
No.
The question is whether it is possible to run the VM on top of V8.

Stef

On 15/11/14 00:09, Alain Rastoul wrote:

> Hi Andy,
>
> If I understand your question, you want to remake Dan Ingalls's Lively
> Kernel?
> :)
> http://lively-web.org/welcome.html
>
> Cheers,
> Alain
>
>
> On 14/11/2014 22:31, Andy Burnett wrote:
>> I just saw this implementation of SQLite as a JS system, via Emscripten,
>> and I was curious whether something similar would be even vaguely
>> possible for the VM.
>>
>> Cheers
>> Andy
>
>
>
>



Re: Has anyone tried compiling the Pharo VM into JS using Emscripten?

stepharo
In reply to this post by Andy Burnett

> Hi Alain,
> Although I am really impressed with Dan's work, Emscripten seems to be very different.
>
> In theory - and my knowledge in this area is very limited - it might allow the VM C files to be transpiled into a very tight subset of JS. On the face of it, this is a crazy idea, but they have achieved amazing performance with things like the Qt library. This made me wonder whether a JS version of Pharo is possible.

We thought about it about a year and a half ago, but lacked the
resources... you know the end of the story

>
> Just idle conjecturing on a Friday night :-)
>