
[VWNC] DLLCC and memory mapped files advice


Mike Hales

[VWNC] DLLCC and memory mapped files advice

I have a bunch of data with a very simple structure: an integer representing a time stamp, then two doubles, repeated over and over. Originally this was stored as text in a .csv file. There are several million records, so it took quite a while to iterate over them all. I want to just stream over the records and process them as fast as I can. I am free to store them any way I want, so I would love suggestions. First I stuck them all in a PostgreSQL database, but that wasn't very fast (many minutes to iterate over them all). I tried BerkeleyDB, and that was pretty fast (about 24 seconds). Then I stuck them in a binary file and used binary streams to read in chunks of bytes; by using some primitives to instantiate the Smalltalk objects out of the bytes, I can get through the file a little faster (about 20 seconds).
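For readers following along, the fixed-size record described above can be sketched as follows. This is Python rather than VisualWorks code, and the layout is an assumption: the thread does not state the integer width or byte order, so the sketch uses a little-endian signed 64-bit timestamp followed by two 64-bit doubles, with no padding. The names `RECORD`, `pack_records` and `unpack_records` are hypothetical.

```python
import struct

# Assumed layout (not stated in the thread): little-endian, no padding,
# one signed 64-bit timestamp followed by two IEEE 754 doubles.
RECORD = struct.Struct("<qdd")


def pack_records(records):
    """Serialize an iterable of (timestamp, a, b) tuples into one bytes object."""
    return b"".join(RECORD.pack(t, a, b) for t, a, b in records)


def unpack_records(data):
    """Return an iterator of (timestamp, a, b) tuples decoded from bytes-like data."""
    return RECORD.iter_unpack(data)


# Small round trip with made-up sample values.
sample = [(1280512740, 1.5, -2.25), (1280512741, 3.0, 4.0)]
blob = pack_records(sample)
decoded = list(unpack_records(blob))
```

Because every record has the same size under this layout, the byte offset of record `i` is simply `i * RECORD.size`, which is what makes both chunked reads and pointer-style iteration straightforward.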

I was thinking about trying to memory map the file and iterate over it using pointer math, and use the DLLCC machinery to make my SmallInteger and Double instances from the bits. Could this be faster? Does anyone have any advice on how to do this? If I need to write a small DLL I can, but is mmap() available already somehow (maybe as a primitive or in one of the system support libraries)?

My plan is something like this:

Create my data
  mmap() the data file and get a pointer back.
  Use the pointer to alternately write my data then increment the pointer, repeat.
  close the map.

Retrieve my data
  mmap() the data file and get a pointer back.
  Use the pointer to alternately create instances of my Smalltalk data then increment the pointer, repeat.
  close the map.

It seems like this shouldn't be too hard, and could be really fast since I'm using scalar data types.
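The plan above (map the file, write or read one record, advance the offset, unmap) can be sketched with Python's standard `mmap` module. This is only an illustration of the technique, not DLLCC or VisualWorks code, and it says nothing about whether mmap() is reachable from the VisualWorks image. The record layout (`"<qdd"`) and the names `write_records`, `read_records` and `demo_path` are assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout: little-endian int64 timestamp plus two doubles, no padding.
RECORD = struct.Struct("<qdd")


def write_records(path, records):
    """Size the file, map it read-write, and pack each record in place."""
    records = list(records)
    size = RECORD.size * len(records)
    with open(path, "wb") as f:
        f.truncate(size)  # the file must have its final size before mapping
    if size == 0:
        return  # an empty file cannot be mapped
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_WRITE) as m:
            offset = 0
            for t, a, b in records:
                RECORD.pack_into(m, offset, t, a, b)  # write at the "pointer"
                offset += RECORD.size                 # then advance it
            m.flush()


def read_records(path):
    """Map the file read-only and yield one (timestamp, a, b) tuple per record."""
    if os.path.getsize(path) == 0:
        return
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            offset = 0
            while offset + RECORD.size <= len(m):
                yield RECORD.unpack_from(m, offset)
                offset += RECORD.size


# Round trip through a temporary file with made-up values.
demo_path = os.path.join(tempfile.mkdtemp(), "records.bin")
sample = [(1, 0.5, 2.0), (2, -1.25, 8.0), (3, 0.0, 1e10)]
write_records(demo_path, sample)
loaded = list(read_records(demo_path))
```

One design point worth noting: a mapping cannot grow the file on its own, so the writer sizes the file up front. That requires knowing the record count (or an upper bound) before mapping.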

Any feedback would be great.

Mike

Mike Hales
Engineering Manager
KnowledgeScape
www.kscape.com

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

Terry Raymond

Re: [VWNC] DLLCC and memory mapped files advice

Mike

Along with reading a binary file, make sure your code does as little copying as possible. This means examining the stream methods you are using and making sure they are not creating intermediate buffers.

Terry

===========================================================
Terry Raymond
Crafted Smalltalk
80 Lazywood Ln.
Tiverton, RI 02878
(401) 624-4517 [hidden email]
<http://www.craftedsmalltalk.com>
===========================================================


Mike Hales

Re: [VWNC] DLLCC and memory mapped files advice

Wow, so I tried using IOAccessor directly and just looping while reading into my own buffer, rather than using any streams at all. Time to get through the whole file is down to 2.5 seconds. That's fast enough that I'm not going to worry about it any more, but I am still interested in mmap() if anybody has ideas.
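The approach described here (skip the stream layer and loop while reading into a buffer you own) can be sketched in Python as an analogue. It does not use or imitate the VisualWorks IOAccessor API. The sketch assumes the same hypothetical `"<qdd"` record layout, opens the file unbuffered, and reuses one preallocated `bytearray` so no intermediate buffers are created per read. `iter_records`, `demo_path` and the chunk size are illustrative names and values.

```python
import os
import struct
import tempfile

# Assumed layout: little-endian int64 timestamp plus two doubles, no padding.
RECORD = struct.Struct("<qdd")


def iter_records(path, records_per_chunk=4096):
    """Yield records by reading raw chunks into one reusable buffer."""
    buf = bytearray(RECORD.size * records_per_chunk)
    view = memoryview(buf)  # slicing a memoryview does not copy bytes
    filled = 0              # bytes carried over from a partial record
    with open(path, "rb", buffering=0) as f:  # unbuffered: no stream layer
        while True:
            n = f.readinto(view[filled:])
            if not n:
                break
            filled += n
            usable = filled - filled % RECORD.size
            if usable:
                yield from RECORD.iter_unpack(view[:usable])
            # Move any trailing partial record to the front of the buffer.
            rest = filled - usable
            view[:rest] = view[usable:filled]
            filled = rest
    if filled:
        raise ValueError("file ends with a truncated record")


# Demo: write five made-up records, then read them back two at a time.
demo_path = os.path.join(tempfile.mkdtemp(), "records.bin")
sample = [(i, i * 0.5, i * 2.0) for i in range(5)]
with open(demo_path, "wb") as out:
    for rec in sample:
        out.write(RECORD.pack(*rec))
loaded = list(iter_records(demo_path, records_per_chunk=2))
```

The buffer is sized as a whole number of records, so in the common case every read ends on a record boundary and the carry-over step moves zero bytes.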

Mike


Mike Hales

Re: [VWNC] DLLCC and memory mapped files advice

In reply to this post by Mike Hales

I forgot to mention it, but BOSS was one of the first things I tried. I ran into several exceptions (subscript out of bounds and other collection-oriented errors) once I ramped up to the full-size data set. I didn't bother to debug further.

Mike


On Fri, Jul 30, 2010 at 1:51 PM, Anthony Lander <[hidden email]> wrote:

Hi Mike,

I know this sounds pedestrian, but have you tried BOSS? I know you're trying to get even better performance by memory mapping, but this might get the data in memory relatively quickly without as much work. I believe it is quite optimized for primitive data (say, arrays of an int and two doubles). Once it's all in RAM, it should be fairly quick to process.

This is assuming you will process the structure many times, and that the reading is the big problem.

  -Anthony
