Fastest matrix implementation?

hernanmd
Hi list

As part of a scientific project, we are building big matrices for
later processing, mostly for export to the custom file formats of
PLINK, HaploView, etc. (bioinformatics tools). I've tested one of
our scripts in Pharo 1.1 (not CogVM) against the corresponding
Python 2.6 implementation (without PyPy), and Python performed
better, about 8x faster than ST.
So I wonder if anyone knows of a faster Matrix implementation than
the one included by default in Collections?

Cheers,

--
Hernán Morales
Information Technology Manager,
Institute of Veterinary Genetics.
National Scientific and Technical Research Council (CONICET).
La Plata (1900), Buenos Aires, Argentina.
Telephone: +54 (0221) 421-1799.
Internal: 422
Fax: 425-7980 or 421-1799.

Re: Fastest matrix implementation?

Benoit St-Jean-2
Have you tried the matrix implementation in the numerical package from Didier H. Besset?

http://squeaksource.com/@Q45T_l348Ag07gGT/VMsGzidC




-----------------
Benoit St-Jean
A standpoint is an intellectual horizon of radius zero.
(Albert Einstein)




Re: Fastest matrix implementation?

hernanmd
Hi Benoit,

I've loaded the package but it seems the port is not complete, i.e. if
you evaluate:

DhbMatrix new: 10

you will get a MessageNotUnderstood: Interval>>asVector because
extension methods were not ported. I uploaded to the SqueakSource a
new version including extension methods and now most tests pass.

Concerning the performance issues, I've narrowed my code to only
measure the writing and reading of a matrix of 710500 elements,
resulting in 58239 milliseconds for the native Matrix implementation
and 56920 for DhbMatrix.
It seems my performance problem lies in reading and parsing a "CSV" file:

Elements  Matrix  DhbMatrix
53400     18274   17329
175960    61043   60722
710500    379276  385278

I will check whether it's worth implementing a primitive for very fast
parsing of CSV files.
Cheers,

Re: Fastest matrix implementation?

Benoit St-Jean-2
What are you using to read those CSV files?  Do you have a file so we can have a look at it and possibly speed up the reading of the CSV file?

Re: Fastest matrix implementation?

hernanmd
I cannot send the CSV files because they are private data currently
being used for research. To give an idea of the scale, I'm parsing
files of 800,000 to 19 million lines.

I'm using http://www.squeaksource.com/SimpleTextParser.html, which is
based on http://www.squeaksource.com/CSV.html plus some additions
that are useful for me.

Re: Fastest matrix implementation?

Mariano Martinez Peck
In reply to this post by hernanmd


On Mon, Dec 6, 2010 at 8:54 PM, Hernán Morales Durand <[hidden email]> wrote:
> It seems my performance problem involves reading and parsing a "CSV" file



Which file stream are you using? Do you need encoding?

You can use the Alexandre Profiler to detect where the problem is.

Cheers

Mariano
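
Whichever profiler is meant here, a first picture of where the time goes can usually be had from the MessageTally that ships with the image. The following is only a workspace sketch: the file name 'sample.csv' is hypothetical, and the loop simply reads the file line by line instead of running the real parser, so the block should be replaced with the actual import code.

```smalltalk
"Workspace sketch: profile a plain line-by-line read of a CSV file.
 'sample.csv' is a hypothetical file name, not one from this thread."
MessageTally spyOn: [
    | stream lineCount |
    lineCount := 0.
    stream := FileStream readOnlyFileNamed: 'sample.csv'.
    [ [ stream atEnd ] whileFalse: [
            "upTo: consumes one line including its terminator"
            stream upTo: Character lf.
            lineCount := lineCount + 1 ] ]
        ensure: [ stream close ].
    Transcript show: lineCount printString, ' lines read'; cr ]
```

If most of the time in the resulting tree sits below the stream's own reading methods rather than in the parser, the file stream (and any character conversion it performs) is the first thing to look at.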
 
Re: Fastest matrix implementation?

Sven Van Caekenberghe
In reply to this post by hernanmd
Hi Hernán,

On 06 Dec 2010, at 20:54, Hernán Morales Durand wrote:

> It seems my performance problem involves reading and parsing a "CSV" file
>
> Elements Matrix DhbMatrix
> 53400    18274  17329
> 175960   61043  60722
> 710500   379276 385278
>
> I will check if it's worth to implement a primitive for very fast parsing of CSV files.

I think that instead of going native, it would be worthwhile to try to optimize in Smalltalk first (it certainly is more fun). 

I thought that this was an interesting problem so I tried writing some code myself, assuming the main problem is getting a CSV matrix in and out of Smalltalk as fast as possible. I simplified further by making the matrix square and containing only Numbers. I also preallocate the matrix and use the fact that I know the row/column sizes.

These are my results:

Size Elements Read Write
250  62500    1013 7858
500  250000   4185 31007
750  562500   9858 71434

I think this is faster, but it is hard to compare. I am still a bit puzzled as to why the writing is slower than the reading though.

The code is available at http://www.squeaksource.com/ADayAtTheBeach.html in the package 'Smalltalk-Hacking-Sven', class NumberMatrix.

This is the write loop:

writeCsvTo: stream
    1 to: size do: [ :row |
        1 to: size do: [ :column |
            column ~= 1 ifTrue: [ stream nextPut: $, ].
            stream print: (self at: row at: column) ].
        stream nextPut: Character lf ]

And this is the read loop:

readCsvFrom: stream
    | numberParser |
    numberParser := SqNumberParser on: stream.
    1 to: size do: [ :row |
        1 to: size do: [ :column |
            self at: row at: column put: numberParser nextNumber.
            column ~= size ifTrue: [ stream peekFor: $, ] ].
        row ~= size ifTrue: [ stream peekFor: Character lf ] ]

I am of course cheating a little bit, but should your CSV file be different, I am sure you can adapt the code (for example to deal with quoting). I am also advancing the stream underneath the SqNumberParser to avoid allocating a new parser every time. I think this code generates little garbage.

What do you think ?

Sven
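
For anyone who wants to try the two loops without loading the package, they can be exercised in a workspace against an in-memory stream. This is only a sketch under some assumptions: it uses the stock Matrix class instead of NumberMatrix, a tiny 3x3 matrix with made-up integer values, and unconditional peekFor: calls in place of the row/column checks above (peekFor: leaves the stream alone when the character does not match).

```smalltalk
"Workspace sketch: CSV round trip of a small square matrix of integers."
| size matrix csv stream parser copy |
size := 3.
matrix := Matrix rows: size columns: size.
1 to: size do: [ :row |
    1 to: size do: [ :column |
        matrix at: row at: column put: row * 10 + column ] ].

"Write: a comma between cells, a line feed after each row."
csv := String streamContents: [ :out |
    1 to: size do: [ :row |
        1 to: size do: [ :column |
            column ~= 1 ifTrue: [ out nextPut: $, ].
            out print: (matrix at: row at: column) ].
        out nextPut: Character lf ] ].

"Read: a single parser, with the separators skipped on the shared stream."
stream := ReadStream on: csv.
parser := SqNumberParser on: stream.
copy := Matrix rows: size columns: size.
1 to: size do: [ :row |
    1 to: size do: [ :column |
        copy at: row at: column put: parser nextNumber.
        stream peekFor: $, ].
    stream peekFor: Character lf ].

"Inspect csv and copy to compare them with the original matrix."
copy
```

Wrapping each half in Time millisecondsToRun: [ ... ] with a larger size gives separate read and write timings comparable to the table above.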

Re: Fastest matrix implementation?

hernanmd
In reply to this post by hernanmd
Hi Benoit,

Thanks for your suggestions! I did some metrics again with a subset of
my files with fewer lines, and here are the results

Lines Milliseconds
116 59
2784 1415
63936 18675
175840 48216
534760 149812

After implementing your improvements:
Lines Milliseconds
116 44
2784 1067
63936 13991
175840 37521
534760 112906

Since I can assume my CSV files don't include quotes, I've removed
the quoted-character check by subclassing a special parser for files
without quotes:
Lines Milliseconds
116 37
2784 888
63936 11521
175840 31905
534760 96549

which is an acceptable improvement considering files of millions of
lines. If I can make some time for implementing a primitive, I will
update with more results.
Best regards,

2010/12/7 Benoit St-Jean <[hidden email]>:

> Hi Hernan,
>
> I had a look at the Text Parser package and here's a little suggestion...
>
> In class STextParser, there seems to be room for optimization (I know, I
> *hate* using this word with Smalltalk, I've always preferred simplicity over
> harder-to-read code) in one particular method, namely #nextInLine.
>
> A quick test in a workspace shows that removing the message sends for
> #cr and #lf (making them class variables instead) as well as using #==
> instead of #= makes this method 3 to 5 times faster.  Since it's probably
> called millions and millions of times in your case, this might help...
>
> So it would look like:
>
> nextInLine
>     | next |
>
>     next := stream next.
>     (next == MyCr or: [next == MyLf])
>         ifTrue: [stream skip: -1.
>                  next := nil].
>     ^ next
>
>
> The original method executes in 23.45 seconds.
> Removing the message sends (Character cr and Character lf) and replacing
> them with class vars brings that down to 10.9 seconds.
> Then, replacing #= by #== brings the average time to 4.25 seconds.
>
> Tests were executed for 10,000,000 characters.
>
> Hope this helps!
>
> Keep me posted!
>
>

Re: Fastest matrix implementation?

hernanmd
In reply to this post by Mariano Martinez Peck
Hi Mariano,

2010/12/7 Mariano Martinez Peck <[hidden email]>:

> which file stream are you using ?  do you need encoding?
>

I'm just using FileStream. How do I know whether I need encoding?

> You can use the Alexandre Profiler to detect where the problem is.
>

You mean the PetitProfiler or just Spy?
Cheers,


Re: Fastest matrix implementation?

hernanmd
In reply to this post by Sven Van Caekenberghe
Hi Sven,

Thanks for your comments. My matrix includes names, floats and
characters, and it isn't square because I need to build a
transposed matrix while reading the CSV. However, I will take a look
at the SqNumberParser.

In case anyone is wondering how this is done, this is the code for a
matrix with 20 rows.

First I fill some predefined columns in which a value is repeated,
using the Generator ported by Lukas Renggli:

"rowCount and colCount are assumed to hold the matrix dimensions."
matrix := Matrix rows: 20 columns: 29044.
firstCol := ( Generator on: [: g | rowCount timesRepeat: [ g yield: 1 ] ] ) upToEnd.
matrix atColumn: 1 put: firstCol.
"... fill remaining 5 columns ..."
index := 0.
parserResult rowsDo: [: rs |
        matrix
                at: index \\ 20 + 1
                at: index // 20 + 7
                put: 'test'.
        index := index + 1 ].
1 to: rowCount do: [: rIndex |
        1 to: colCount do: [: cIndex |
                outputFile nextPutAll: ( matrix at: rIndex at: cIndex ) asString; tab ].
        outputFile cr ].
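
The index arithmetic in the middle loop is what performs the transposition: every 20 consecutive input values fill one column from top to bottom, and the first six columns are skipped because they are already filled. A small workspace sketch makes the mapping visible; the names rowsPerColumn and firstFreeColumn are introduced here only for illustration.

```smalltalk
"Workspace sketch: where does the value with a given running index land?"
| rowsPerColumn firstFreeColumn |
rowsPerColumn := 20.
firstFreeColumn := 7.
#(0 1 19 20 21 40) collect: [ :index |
    "row -> column, using the same expressions as the loop above"
    (index \\ rowsPerColumn + 1) -> (index // rowsPerColumn + firstFreeColumn) ]
"Evaluates to the pairs 1->7, 2->7, 20->7, 1->8, 2->8, 1->9"
```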

Hope it helps someone,
Cheers
