I have a bunch of data with a very simple structure: an integer representing a timestamp, then two doubles, repeated over and over. Originally this was stored as text in a .csv file. There are several million records, so it took quite a while to iterate over them all. I just want to stream over the records and process them as fast as I can. I am free to store them any way I want, so I would love suggestions. First I stuck them all in a PostgreSQL database, but that wasn't very fast (many minutes to iterate over them all). I tried BerkeleyDB, and that was pretty fast (about 24 seconds). Then I stuck them in a binary file and used binary streams to read in chunks of bytes; by using some primitives to instantiate the Smalltalk objects from the bytes, I can get through the file a little faster (about 20 seconds).
I was thinking about trying to memory-map the file, iterate over it using pointer math, and use the DLLCC machinery to make my SmallInteger and Double instances from the bits. Could this be faster? Does anyone have any advice on how to do this? If I need to write a small DLL I can, but is mmap() already available somehow (maybe as a primitive or in one of the system support libraries)?
My plan is something like this:

Create my data:
1. mmap() the data file and get a pointer back.
2. Use the pointer to alternately write my data then increment the pointer, repeat.
3. Close the map.

Retrieve my data:
1. mmap() the data file and get a pointer back.
2. Use the pointer to alternately create instances of my Smalltalk data then increment the pointer, repeat.
3. Close the map.

It seems like this shouldn't be too hard, and could be really fast since I'm using scalar data types. Any feedback would be great.

Mike
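For readers who want to see the shape of the plan above, here is a minimal sketch in Python rather than Smalltalk; it says nothing about how DLLCC or VisualWorks would expose mmap(). The record layout is an assumption (the thread does not give field widths or byte order): a little-endian 64-bit integer timestamp followed by two 64-bit doubles. The names `RECORD`, `write_records`, and `read_records` are hypothetical.

```python
import mmap
import os
import struct

# Assumed layout, not stated in the thread: little-endian 64-bit integer
# timestamp followed by two 64-bit doubles (24 bytes per record).
RECORD = struct.Struct("<qdd")


def write_records(path, records):
    """Size the file, map it, and pack each record at its offset."""
    size = RECORD.size * len(records)
    with open(path, "wb") as f:
        f.truncate(size)  # an empty file cannot be mapped, so size it first
    if size == 0:
        return
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), size) as m:
        for i, rec in enumerate(records):
            # "Write, then increment the pointer" becomes an explicit offset.
            RECORD.pack_into(m, i * RECORD.size, *rec)
        m.flush()


def read_records(path):
    """Map the file read-only and yield (timestamp, a, b) tuples."""
    if os.path.getsize(path) == 0:
        return
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Any trailing bytes shorter than one record are ignored.
            for offset in range(0, len(m) - RECORD.size + 1, RECORD.size):
                yield RECORD.unpack_from(m, offset)
```

Whether this beats a plain buffered read depends on the platform and on how expensive it is to build each object from the mapped bytes, so it is worth timing both.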
_______________________________________________ vwnc mailing list [hidden email] http://lists.cs.uiuc.edu/mailman/listinfo/vwnc |
Mike,

Along with reading a binary file, make sure your code does as little copying as possible. This means examining the stream methods you are using and making sure
they are not creating intermediate buffers.

Terry
Wow, so I tried using IOAccessor directly and just looping while reading into my own buffer, rather than using any streams at all. Time to get through the whole file is down to 2.5 seconds. That's fast enough that I'm not going to worry about it any more, but I am still interested in mmap() if anybody has ideas.
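The same idea (skip the stream layer and read straight into one buffer you own) can be sketched in Python; this is an illustration of the technique, not the IOAccessor code used above. The record layout, the chunk size, and the names `RECORD`, `CHUNK_RECORDS`, and `iter_records` are assumptions made for the example.

```python
import struct

# Assumed layout, not stated in the thread: little-endian 64-bit integer
# timestamp followed by two 64-bit doubles (24 bytes per record).
RECORD = struct.Struct("<qdd")
CHUNK_RECORDS = 4096  # arbitrary chunk size chosen for the sketch


def iter_records(path):
    """Yield (timestamp, a, b) tuples, reusing one preallocated buffer."""
    buf = bytearray(RECORD.size * CHUNK_RECORDS)
    view = memoryview(buf)  # slicing a memoryview does not copy bytes
    with open(path, "rb", buffering=0) as f:  # unbuffered: no stream layer
        while True:
            # Fill the buffer; a raw read may return fewer bytes than asked.
            filled = 0
            while filled < len(buf):
                n = f.readinto(view[filled:])
                if not n:
                    break
                filled += n
            if filled == 0:
                break
            # Drop a trailing partial record (only possible at end of file).
            usable = filled - filled % RECORD.size
            yield from RECORD.iter_unpack(view[:usable])
            if filled < len(buf):
                break  # short fill means end of file
```

The only per-record allocation left is the tuple of decoded values; the byte buffer itself is allocated once and reused for every chunk.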
Mike

Mike Hales
Engineering Manager
KnowledgeScape
www.kscape.com

On Fri, Jul 30, 2010 at 12:31 PM, Terry Raymond <[hidden email]> wrote:
I forgot to mention it, but BOSS was one of the first things I tried. Once I ramped up to the full-size data set, I ran into several exceptions, including subscript out of bounds and other collection-oriented errors. I didn't bother to debug further.
Mike

Mike Hales
Engineering Manager
KnowledgeScape
www.kscape.com

On Fri, Jul 30, 2010 at 1:51 PM, Anthony Lander <[hidden email]> wrote: