Fastest matrix implementation?

hernanmd

Dec 05, 2010; 8:33pm

Fastest matrix implementation?

Hi list

In the context of a scientific project, we are building big matrices
for later processing, mostly exporting to custom file formats for
PLINK, HaploView, etc. (bioinformatics tools). I've tested one of our
scripts in Pharo 1.1 (not CogVM) against the corresponding Python 2.6
implementation (without PyPy), and the Python version was about 8x
faster than ST.
So I wonder if anyone knows the fastest (or at least a faster)
implementation of Matrix than the one included by default in Collections?

Cheers,

--
Hernán Morales
Information Technology Manager,
Institute of Veterinary Genetics.
National Scientific and Technical Research Council (CONICET).
La Plata (1900), Buenos Aires, Argentina.
Telephone: +54 (0221) 421-1799.
Internal: 422
Fax: 425-7980 or 421-1799.

Benoit St-Jean-2

Dec 06, 2010; 12:57am

Re: Fastest matrix implementation?

Have you tried the matrix implementation in the numerical package from Didier H. Besset?

http://squeaksource.com/@Q45T_l348Ag07gGT/VMsGzidC

-----------------
Benoit St-Jean
A standpoint is an intellectual horizon of radius zero.
(Albert Einstein)

hernanmd

Dec 06, 2010; 7:54pm

Re: Fastest matrix implementation?

Hi Benoit,

I've loaded the package, but it seems the port is not complete. For
example, if you evaluate:

DhbMatrix new: 10

you will get a MessageNotUnderstood: Interval>>asVector because the
extension methods were not ported. I uploaded a new version to
SqueakSource that includes the extension methods, and now most tests pass.

Concerning the performance issues, I've narrowed my code to only
measure the writing and reading of a matrix of 710500 elements,
resulting in 58239 milliseconds for the native Matrix implementation
and 56920 for DhbMatrix.
It seems my performance problem involves reading and parsing a "CSV" file:

Elements   Matrix   DhbMatrix
   53400    18274       17329
  175960    61043       60722
  710500   379276      385278

I will check whether it's worth implementing a primitive for very fast
parsing of CSV files.
Cheers,

Benoit St-Jean-2

Dec 07, 2010; 3:55am

Re: Fastest matrix implementation?

What are you using to read those CSV files? Do you have a file so we can have a look at it and possibly speed up the reading of the CSV file?

hernanmd

Dec 07, 2010; 6:10am

Re: Fastest matrix implementation?

I cannot send the CSV files because they are private data currently
being used for research. Just to give a hint, I'm parsing files of
800,000 to 19 million lines.

I'm using http://www.squeaksource.com/SimpleTextParser.html which is
based on http://www.squeaksource.com/CSV.html plus some useful
additions (for me).

Mariano Martinez Peck

Dec 07, 2010; 8:38am

Re: Fastest matrix implementation?

In reply to this post by hernanmd

Which file stream are you using? Do you need encoding?

You can use the Alexandre Profiler to detect where the problem is.
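
As a rough illustration of profiling the parsing step, here is a minimal sketch that uses MessageTally, the profiler built into the image, rather than the profiler named above. The CSV content is made up, and the loop only stands in for the real parser, so treat the names and data as assumptions.

```smalltalk
| csv |
"Hypothetical sample data: 10000 identical CSV lines held in memory."
csv := String streamContents: [ :out |
    10000 timesRepeat: [
        out nextPutAll: 'rs123,A,0.25,G'; nextPut: Character lf ] ].

"Profile a naive line-by-line split; the report lists the methods
 where most of the time is spent."
MessageTally spyOn: [
    | in |
    in := ReadStream on: csv.
    [ in atEnd ] whileFalse: [
        (in upTo: Character lf) subStrings: ',' ] ]
```

Replacing the body of the block with the real file-reading code should show whether the time goes into the stream, the parser, or the matrix.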

Cheers

Mariano

Sven Van Caekenberghe

Dec 07, 2010; 10:13am

Re: Fastest matrix implementation?

In reply to this post by hernanmd

Hi Hernán,

On 06 Dec 2010, at 20:54, Hernán Morales Durand wrote:

> It seems my performance problem involves reading and parsing a "CSV" file
>
> Elements   Matrix   DhbMatrix
>    53400    18274       17329
>   175960    61043       60722
>   710500   379276      385278
>
> I will check if it's worth to implement a primitive for very fast parsing of CSV files.

I think that instead of going native, it would be worthwhile to try to optimize in Smalltalk first (it certainly is more fun).

I thought that this was an interesting problem so I tried writing some code myself, assuming the main problem is getting a CSV matrix in and out of Smalltalk as fast as possible. I simplified further by making the matrix square and containing only Numbers. I also preallocate the matrix and use the fact that I know the row/column sizes.

These are my results:

Size   Elements   Read   Write
 250      62500   1013    7858
 500     250000   4185   31007
 750     562500   9858   71434

I think this is faster, but it is hard to compare. I am still a bit puzzled as to why the writing is slower than the reading though.

The code is available at http://www.squeaksource.com/ADayAtTheBeach.html in the package 'Smalltalk-Hacking-Sven', class NumberMatrix.

This is the write loop:

writeCsvTo: stream
    1 to: size do: [ :row |
        1 to: size do: [ :column |
            column ~= 1 ifTrue: [ stream nextPut: $, ].
            stream print: (self at: row at: column) ].
        stream nextPut: Character lf ]

And this is the read loop:

readCsvFrom: stream
    | numberParser |
    numberParser := SqNumberParser on: stream.
    1 to: size do: [ :row |
        1 to: size do: [ :column |
            self at: row at: column put: numberParser nextNumber.
            column ~= size ifTrue: [ stream peekFor: $, ] ].
        row ~= size ifTrue: [ stream peekFor: Character lf ] ]

I am of course cheating a little bit, but should your CSV file be different, I am sure you can adapt the code (for example to deal with quoting). I am also advancing the stream under the SqNumberParser to avoid allocating a new one every time. I think this code generates little garbage.
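
One possible adaptation, sketched here for a non-square matrix: the same two loops applied to the standard Matrix class through in-memory streams, so it can be tried in a workspace. The 2 by 3 size and the values are made up, and SqNumberParser is assumed to be available under that name in the image.

```smalltalk
| matrix out in parser copy |
"Hypothetical 2 x 3 matrix filled with small integers."
matrix := Matrix rows: 2 columns: 3.
1 to: 2 do: [ :r |
    1 to: 3 do: [ :c | matrix at: r at: c put: r * 10 + c ] ].

"Write it as CSV, one matrix row per line."
out := WriteStream on: String new.
1 to: matrix rowCount do: [ :r |
    1 to: matrix columnCount do: [ :c |
        c ~= 1 ifTrue: [ out nextPut: $, ].
        out print: (matrix at: r at: c) ].
    out nextPut: Character lf ].

"Read it back with a single parser that shares the stream."
in := ReadStream on: out contents.
parser := SqNumberParser on: in.
copy := Matrix rows: 2 columns: 3.
1 to: copy rowCount do: [ :r |
    1 to: copy columnCount do: [ :c |
        copy at: r at: c put: parser nextNumber.
        in peekFor: $, ].
    in peekFor: Character lf ].

Transcript show: (copy at: 2 at: 3) printString; cr.
```

Here peekFor: is sent unconditionally, since it only consumes the separator when one is actually present.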

What do you think?

Sven

hernanmd

Dec 07, 2010; 6:30pm

Re: Fastest matrix implementation?

In reply to this post by hernanmd

Hi Benoit,

Thanks for your suggestions! I did some metrics again with a subset of
my files with fewer lines, and here are the results

 Lines   Milliseconds
   116             59
  2784           1415
 63936          18675
175840          48216
534760         149812

After implementing your improvements:
 Lines   Milliseconds
   116             44
  2784           1067
 63936          13991
175840          37521
534760         112906

Since I can assume my CSV files don't include quotes, I've removed
the quoted-character check from the parser and subclassed a special
parser for files with quotes:

 Lines   Milliseconds
   116             37
  2784            888
 63936          11521
175840          31905
534760          96549

which is an acceptable improvement considering files of millions of
lines. If I can make some time for implementing a primitive, I will
update with more results.
Best regards,

2010/12/7 Benoit St-Jean <[hidden email]>:

> Hi Hernan,
>
> I had a look at the Text Parser package and here's a little suggestion...
>
> In class STextParser, there seems to be room for optimization (I know, I
> *hate* using this word with Smalltalk, I've always preferred simplicity over
> harder-to-read code) in one particular method, namely #nextInLine.
>
> A quick test in a workspace shows that removing the message sends for
> #cr and #lf (making them class variables instead) as well as using #==
> instead of #= makes this method 3 to 5 times faster. Since it's probably
> called millions and millions of times in your case, this might help...
>
> So it would look like:
>
> nextInLine
>     | next |
>
>     next := stream next.
>     (next == MyCr or: [next == MyLf])
>         ifTrue:    [stream skip: -1.
>                 next := nil].
>     ^ next
>
>
> The original method executes in 23.45 seconds
> Removing the message sends (Character cr and Character lf) and replacing
> them with class vars brings that down to 10.9 seconds.
> Then, replacing #= by #== brings the average time to 4.25 seconds.
>
> Tests were executed for 10000000 characters.
>
> Hope this helps!
>
> Keep me posted!
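
A workspace comparison along the lines described in the quoted message could look like the following minimal sketch. The input string and its size are made up, and the temporaries cr and lf stand in for the class variables MyCr and MyLf; the timings will differ from the figures quoted above depending on VM and image.

```smalltalk
| data cr lf withSends withCached |
"Hypothetical input: one million characters with no line ends."
data := String new: 1000000 withAll: $a.
cr := Character cr.
lf := Character lf.

"Variant 1: fetch the characters with message sends and compare with #=."
withSends := Time millisecondsToRun: [
    data do: [ :ch | ch = Character cr or: [ ch = Character lf ] ] ].

"Variant 2: cached characters compared by identity with #==."
withCached := Time millisecondsToRun: [
    data do: [ :ch | ch == cr or: [ ch == lf ] ] ].

Transcript
    show: 'sends and #=: ', withSends printString, ' ms'; cr;
    show: 'cached and #==: ', withCached printString, ' ms'; cr.
```

Comparing by identity is safe here only because the characters involved are unique objects in the image; that holds for cr and lf, but it is worth checking before applying the same trick to other values.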

hernanmd

Dec 07, 2010; 6:46pm

Re: Fastest matrix implementation?

In reply to this post by Mariano Martinez Peck

Hi Mariano,

2010/12/7 Mariano Martinez Peck <[hidden email]>:

> which file stream are you using ? do you need encoding?

I'm just using FileStream. How do I know if I need encoding?

> You can use the Alexandre Profiler to detect where the problem is.
>

You mean the PetitProfiler or just Spy?
Cheers,

hernanmd

Dec 07, 2010; 7:44pm

Re: Fastest matrix implementation?

In reply to this post by Sven Van Caekenberghe

Hi Sven,

Thanks for your comments. My matrix includes names, floats and
characters, and the matrix isn't square because I need to build a
transposed matrix while reading the CSV. However, I will take a look
at the SqNumberParser.

In case anyone is wondering how this is done, this is the code for a
matrix with 20 rows:

First I fill some predefined columns in which a value is repeated,
using the Generator ported by Lukas Renggli.

matrix := Matrix rows: 29044 columns: 20.
firstCol := (Generator on: [ :g | rowCount timesRepeat: [ g yield: 1 ] ]) upToEnd.
matrix atColumn: 1 put: firstCol.
"... fill remaining 5 columns ..."
index := 0.
parserResult rowsDo: [ :rs | | rowInd colInd |
    matrix
        at: (rowInd := index \\ 20 + 1)
        at: (colInd := index // 20 + 7)
        put: 'test'.
    index := index + 1 ].
1 to: rowCount do: [ :rIndex |
    1 to: colCount do: [ :cIndex |
        outputFile nextPutAll: (matrix at: rIndex at: cIndex) asString; tab ].
    outputFile cr ].
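
To make the index arithmetic above easier to follow, this small workspace sketch prints where a few sample values of index end up. The sample values are arbitrary; the constants 20 and 7 are the ones used in the code above.

```smalltalk
"Map a running index to the (row, column) pair used in the loop above."
#(0 1 19 20 21 40) do: [ :index |
    Transcript
        show: index printString;
        show: ' -> row ';
        show: (index \\ 20 + 1) printString;
        show: ', column ';
        show: (index // 20 + 7) printString;
        cr ]
```

The first 20 parsed rows land in rows 1 to 20 of column 7, the next 20 in column 8, and so on, which is how the matrix ends up transposed relative to the file.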

Hope it helps someone,
Cheers
