[From StackOverflow] How to parse ndjson in Pharo with NeoJSON


[From StackOverflow] How to parse ndjson in Pharo with NeoJSON

EstebanLM
Hi, 

there is a question I don’t know how to answer:

http://stackoverflow.com/questions/34904337/how-to-parse-ndjson-in-pharo-with-neojson
Transcript: 

I want to parse ndjson (newline delimited json) data with NeoJSON on Pharo Smalltalk.

ndjson data looks like this:

{"smalltalk": "cool"}
{"pharo": "cooler"}

At the moment I convert my file stream to a string, split it on newline and then parse the individual parts using NeoJSON. This seems to use an unnecessary (and extremely huge) amount of memory and time, probably because of converting streams to strings and vice versa all the time. What would be an efficient way to do this task?


Takers?
Esteban

Re: [From StackOverflow] How to parse ndjson in Pharo with NeoJSON

Sven Van Caekenberghe-2
(I don't do StackOverflow)

Reading the 'format' is easy: just keep sending #next for each JSON expression (whitespace between expressions is ignored).

| data reader |
data := '{"smalltalk": "cool"}
{"pharo": "cooler"}'.
reader := NeoJSONReader on: data readStream.
"Collect each top-level JSON expression until the stream is exhausted."
Array streamContents: [ :out |
  [ reader atEnd ] whileFalse: [ out nextPut: reader next ] ].

Preventing intermediary data structures is easy too: use streaming.

| client reader data networkStream |
(client := ZnClient new)
  streaming: true; "hand back the response body as a stream, not a string"
  url: 'https://github.com/NYPL-publicdomain/data-and-utilities/blob/master/items/pd_items_1.ndjson?raw=true';
  get.
"Decode the raw byte stream to characters before handing it to NeoJSON."
networkStream := ZnCharacterReadStream on: client contents.
reader := NeoJSONReader on: networkStream.
data := Array streamContents: [ :out |
  [ reader atEnd ] whileFalse: [ out nextPut: reader next ] ].
client close.
data.

It took a couple of seconds; after all, it is 80 MB+ over the network for 50K items.


HTH,

Sven 




Re: [From StackOverflow] How to parse ndjson in Pharo with NeoJSON

MartinW
Thank you, Sven! (I asked the question on StackOverflow)

And also let me thank you for NeoJSON, NeoCSV and Zinc, which I use a lot and which are a joy to use! Also the documentation is very good and helps a lot.

Your code works well and I save a bit of memory by avoiding intermediary data structures, but this operation still uses a lot more memory than I had expected (the example file I use is 80 MB). I tried parsing with PetitParser but the results were similar. I guess I have to learn to find out where all the memory goes.

Best regards,
Martin.



Screen Shot 2016-01-21 at 13.33.57.png (480K) <http://forum.world.st/attachment/4873112/0/Screen%20Shot%202016-01-21%20at%2013.33.57.png>

Re: [From StackOverflow] How to parse ndjson in Pharo with NeoJSON

Sven Van Caekenberghe-2

> On 22 Jan 2016, at 16:13, MartinW <[hidden email]> wrote:
>
> Thank you, Sven! (I asked the question on StackOverflow)
>
> And also let me thank you for NeoJSON, NeoCSV and Zinc, which I use a lot
> and which are a joy to use! Also the documentation is very good and helps a
> lot.

Thanks, Martin.

> Your code works well and I save a bit of memory by avoiding intermediary
> data structures, but still this operation uses a lot more memory than I had
> expected (the example file I use is 80 MB).

Well, it is quite a bit of data (I didn't look too deeply): 50,000 records of structured/nested data with quite a lot of strings. If each record is 1 KB, that makes 50 MB.

How do you measure your memory consumption? What did you expect?

Right now, your JSON is parsed into a combination of lists (Arrays) and maps (Dictionaries). If you know/understand well what is inside it, and it is regular enough, you could try to build your own specialised/optimised data/domain model for it. NeoJSON can also parse directly into your own objects instead of the general ones (a process called mapping). This is some work, of course, and it might not be worth it, YMMV.
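To make that concrete, a minimal sketch of such a mapping, assuming a hypothetical domain class NdjsonItem whose instance variables match the keys of each ndjson record (#mapInstVarsFor: and #nextAs: are NeoJSON's mapping entry points):

| reader items |
"Map each ndjson record onto an NdjsonItem (hypothetical class)
instead of a generic Dictionary."
reader := NeoJSONReader on: 'pd_items_1.ndjson' asFileReference readStream.
reader mapInstVarsFor: NdjsonItem.
items := Array streamContents: [ :out |
  [ reader atEnd ] whileFalse: [ out nextPut: (reader nextAs: NdjsonItem) ] ].
items

A compact domain object can be considerably smaller than the equivalent Dictionary, since it stores no per-record key strings.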

Sven  



Re: [From StackOverflow] How to parse ndjson in Pharo with NeoJSON

MartinW
Sven Van Caekenberghe-2 wrote
Well, it is quite a bit of data (I didn't look too deeply): 50,000 records of structured/nested data with quite a lot of strings. If each record is 1 KB, that makes 50 MB.

How do you measure your memory consumption? What did you expect?
I only thought about memory when my first attempts to parse the file hit the VM's memory limit, which seemed to be ~500 MB on OS X out of the box. So far I have only watched the memory from the outside, using OS X's Activity Monitor; after I gave the VM more memory, the image grew to 1.2 GB while parsing and inspecting the 80 MB file. But I have not yet investigated where the memory went - perhaps it is all in the Inspector that I opened to view the result :)
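One way to see where the memory went from inside the image, rather than from Activity Monitor, is a class-by-class space census; a minimal sketch, assuming the SpaceTally utility that ships with Pharo/Squeak images of this era:

"Force a full GC first so the census reflects live objects only,
then print instance counts and bytes used per class."
Smalltalk garbageCollect.
SpaceTally new printSpaceAnalysis

Inspecting the resulting per-class totals should show whether the bytes are mostly in Strings, Dictionaries, or the Inspector's own structures.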

Sven Van Caekenberghe-2 wrote
Right now, your JSON is parsed and the result is a combination of lists (Array) and maps (Dictionary). If you know/understand well what is inside it, and it is regular enough, you could try to build your own specialised/optimised data/domain model for it. NeoJSON can also parse directly to your objects, instead of the general ones (a process called mapping). This is some work, of course, and it might not be worth it, YMMV.
Yes, I have used mappings in the past. Here I was just toying with the New York Public Library's public domain data for a bit...
