[VWNC] DLLCC and memory mapped files advice


Mike Hales
I have a bunch of data whose structure is very simple: an integer representing a time stamp, then two doubles, repeated over and over. Originally this was stored as text in a .csv file. There are several million records, so it took quite a while to iterate over them all. I want to just stream over the records and process them as fast as I can. I am free to store them any way I want, so I would love suggestions. First I stuck them all in a PostgreSQL database, but this wasn't that fast (many minutes to iterate over them all). I tried BerkeleyDB, and that was pretty fast (about 24 seconds). Then I stuck them in a binary file and used binary streams to read in chunks of bytes; by using some primitives to instantiate the Smalltalk objects out of the bytes, I can get through the file a little faster (about 20 seconds).

I was thinking about trying to memory map the file, iterate over it using pointer math, and use the DLLCC machinery to make my SmallInteger and Double instances from the bits. Could this be faster? Does anyone have any advice on how to do this? If I need to write a small DLL I can, but is mmap() already available somehow (maybe as a primitive or in one of the system support libraries)?

My plan is something like this:

Create my data
  mmap() the data file and get a pointer back.
  Use the pointer to alternately write my data then increment the pointer, repeat.
  close the map.

Retrieve my data
  mmap() the data file and get a pointer back.
  Use the pointer to alternately create instances of my Smalltalk data then increment the pointer, repeat.
  close the map.

It seems like this shouldn't be too hard, and could be really fast since I'm using scalar data types.
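
For readers who want to see the shape of the plan above, here is a minimal sketch in Python rather than VisualWorks, so it says nothing about what DLLCC itself offers. The record layout (a little-endian 64-bit integer time stamp followed by two 64-bit doubles, no padding) and the names RECORD, write_records and read_records are assumptions made for illustration only.

```python
import mmap
import os
import struct

# Assumed layout: int64 time stamp + two float64 values, little-endian, no padding.
RECORD = struct.Struct("<qdd")


def write_records(path, records):
    """Size the file up front, map it, then pack each record at its offset."""
    records = list(records)
    size = len(records) * RECORD.size
    with open(path, "w+b") as f:
        if size == 0:
            return  # an empty file cannot be memory mapped
        f.truncate(size)  # extend the file to its final length
        with mmap.mmap(f.fileno(), size) as m:
            for index, (stamp, first, second) in enumerate(records):
                # "Pointer math": the offset is simply index * record size.
                RECORD.pack_into(m, index * RECORD.size, stamp, first, second)
            m.flush()  # push dirty pages back to the file


def read_records(path):
    """Map the file read-only and yield (stamp, first, second) tuples."""
    if os.path.getsize(path) == 0:
        return
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Step through whole records only; a trailing partial record is skipped.
            for offset in range(0, len(m) - RECORD.size + 1, RECORD.size):
                yield RECORD.unpack_from(m, offset)
```

Whether this beats plain buffered reads depends on the platform and on how expensive it is to build each object from the mapped bytes, so it is worth measuring rather than assuming.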

Any feedback would be great.

Mike

Mike Hales
Engineering Manager
KnowledgeScape
www.kscape.com

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Re: [VWNC] DLLCC and memory mapped files advice

Terry Raymond

Mike

Along with reading a binary file, make sure your code does as little copying as possible. This means examining the stream methods you are using and making sure they are not creating intermediate buffers.

Terry

===========================================================
Terry Raymond
Crafted Smalltalk
80 Lazywood Ln.
Tiverton, RI  02878
(401) 624-4517      [hidden email]
<http://www.craftedsmalltalk.com>
===========================================================

From: [hidden email] [mailto:[hidden email]] On Behalf Of Mike Hales
Sent: Friday, July 30, 2010 1:59 PM
To: [hidden email]
Subject: [vwnc] [VWNC] DLLCC and memory mapped files advice

 


Re: [VWNC] DLLCC and memory mapped files advice

Mike Hales
Wow, so I tried using IOAccessor directly and just looping while reading into my own buffer, rather than using any streams at all. Time to get through the whole file is down to 2.5 seconds. That's fast enough that I'm not going to worry about it any more, but I am still interested in mmap() if anybody has ideas.
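
The pattern described here (skip the stream layer, loop, and read into one buffer that you own and reuse) can be sketched as follows. This is Python, not IOAccessor code, and it is only an illustration of the idea: the record layout, the function name scan and the chunk_records parameter are all assumptions.

```python
import struct

# Assumed layout: int64 time stamp + two float64 values, little-endian, no padding.
RECORD = struct.Struct("<qdd")


def scan(path, handle, chunk_records=4096):
    """Call handle(stamp, first, second) per record; return the record count."""
    buf = bytearray(RECORD.size * chunk_records)  # allocated once, reused
    view = memoryview(buf)                        # slicing a view does not copy
    count = 0
    filled = 0
    # buffering=0 gives a raw file object, so there is no hidden second buffer.
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(view[filled:])         # read straight into our buffer
            if not n:
                break                             # end of file
            filled += n
            usable = filled - filled % RECORD.size
            for offset in range(0, usable, RECORD.size):
                handle(*RECORD.unpack_from(buf, offset))
                count += 1
            if usable:
                # Keep any partial record at the front for the next read.
                tail = filled - usable
                buf[:tail] = buf[usable:filled]
                filled = tail
    # Bytes left over at end of file (an incomplete record) are ignored.
    return count
```

A call such as scan("records.bin", process) would then invoke a hypothetical process function once per record, with no intermediate byte strings created along the way.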

Mike



On Fri, Jul 30, 2010 at 12:31 PM, Terry Raymond <[hidden email]> wrote:

Mike

Along with reading a binary file, make sure your code does as little copying as possible. This means examining the stream methods you are using and making sure they are not creating intermediate buffers.

Terry


Re: [VWNC] DLLCC and memory mapped files advice

Mike Hales
I forgot to mention it, but BOSS was one of the first things I tried. I ran into several exceptions, with subscript out of bounds and other collection-oriented errors, once I ramped up to the full-size data set. I didn't bother to debug further.

Mike



On Fri, Jul 30, 2010 at 1:51 PM, Anthony Lander <[hidden email]> wrote:
Hi Mike,

I know this sounds pedestrian, but have you tried BOSS? I know you're trying to get even better performance by memory mapping, but this might get the data into memory relatively quickly without as much work. I believe it is quite optimized for primitive data (say, arrays of an int and two doubles). Once it's all in RAM, it should be fairly quick to process.

This is assuming you will process the structure many times, and that the reading is the big problem.
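
To illustrate the "load once, then process many times in RAM" idea in a language-neutral way, here is a small Python sketch. It is not BOSS and makes no claim about how BOSS stores data; the record layout and the name load_columns are assumptions for illustration.

```python
import struct
from array import array

# Assumed layout: int64 time stamp + two float64 values, little-endian, no padding.
RECORD = struct.Struct("<qdd")


def load_columns(path):
    """Read the whole file once and return three compact typed arrays."""
    stamps, firsts, seconds = array("q"), array("d"), array("d")
    with open(path, "rb") as f:
        data = f.read()
    # iter_unpack needs a whole number of records, so drop any partial tail.
    usable = len(data) - len(data) % RECORD.size
    for stamp, first, second in RECORD.iter_unpack(memoryview(data)[:usable]):
        stamps.append(stamp)
        firsts.append(first)
        seconds.append(second)
    return stamps, firsts, seconds
```

After the one-time load, repeated passes such as sum(firsts) or max(seconds) touch only memory, which is the case this suggestion assumes.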

  -Anthony

