Hi list
In the context of a scientific project we are building big matrices for later processing, mostly exporting them to custom file formats for PLINK, HaploView, etc. (bioinformatics tools). I've tested one of our scripts in Pharo 1.1 (not CogVM) against the corresponding Python 2.6 implementation (without PyPy), and Python was about 8x faster than Smalltalk.

So I wonder: does anyone know of the fastest (or at least a faster) Matrix implementation than the one included by default in Collections?

Cheers,

--
Hernán Morales
Information Technology Manager,
Institute of Veterinary Genetics.
National Scientific and Technical Research Council (CONICET).
La Plata (1900), Buenos Aires, Argentina.
Telephone: +54 (0221) 421-1799.
Internal: 422
Fax: 425-7980 or 421-1799.
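(For anyone who wants to take a comparable measurement in Pharo, a minimal sketch of timing a matrix fill; the matrix dimensions and the fill expression below are placeholders, not the actual workload:)

    | matrix ms |
    matrix := Matrix rows: 1000 columns: 100.        "placeholder size"
    ms := [ 1 to: 1000 do: [ :r |
            1 to: 100 do: [ :c |
                matrix at: r at: c put: r * c ] ] ] timeToRun.
    Transcript show: 'fill took ', ms printString, ' ms'; cr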
Have you tried the matrix implementation in the numerical package from Didier H. Besset?
http://squeaksource.com/@Q45T_l348Ag07gGT/VMsGzidC

-----------------
Benoit St-Jean
A standpoint is an intellectual horizon of radius zero.
(Albert Einstein)
Hi Benoit,
I've loaded the package but it seems the port is not complete, i.e. if you evaluate:

    DhbMatrix new: 10

you will get a MessageNotUnderstood: Interval>>asVector, because the extension methods were not ported. I uploaded a new version including the extension methods to SqueakSource, and now most tests pass.

Concerning the performance issue, I've narrowed my code down to measuring only the writing and reading of a matrix of 710500 elements, resulting in 58239 milliseconds for the native Matrix implementation and 56920 for DhbMatrix. It seems my performance problem is in reading and parsing a "CSV" file:

    Elements    Matrix    DhbMatrix
      53400      18274       17329
     175960      61043       60722
     710500     379276      385278

I will check whether it's worth implementing a primitive for very fast parsing of CSV files.

Cheers,

Hernán
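(One way to separate the read cost from the write cost, just a skeleton with the real work left as comments inside the blocks:)

    | readMs writeMs |
    readMs := Time millisecondsToRun: [
        "parse the CSV file and fill the matrix here" ].
    writeMs := Time millisecondsToRun: [
        "write the matrix out in the export format here" ].
    Transcript show: 'read: ', readMs printString, ' ms, write: ', writeMs printString, ' ms'; cr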
What are you using to read those CSV files? Do you have a file so we can have a look at it and possibly speed up the reading of the CSV file?
-----------------
Benoit St-Jean
I cannot send the CSV files because they are private data currently being used for research. Just to give a hint: I'm parsing files of 800,000 to 19 million lines.

I'm using http://www.squeaksource.com/SimpleTextParser.html, which is based on http://www.squeaksource.com/CSV.html plus some additions that are useful to me.

Hernán
In reply to this post by hernanmd
Which file stream are you using? Do you need encoding?

You can use the Alexandre Profiler to detect where the problem is.

Cheers

Mariano
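(If neither profiler package is at hand, the built-in MessageTally gives a quick first breakdown; a sketch, with the block body standing in for the actual import code:)

    MessageTally spyOn: [
        "read and parse the CSV file, build the matrix, write the export here" ]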
In reply to this post by hernanmd
Hi Hernán,
On 06 Dec 2010, at 20:54, Hernán Morales Durand wrote:

> It seems my performance problem involves reading and parsing a "CSV" file

I think that instead of going native, it would be worthwhile to try to optimize in Smalltalk first (it certainly is more fun).

I thought this was an interesting problem, so I tried writing some code myself, assuming the main problem is getting a CSV matrix in and out of Smalltalk as fast as possible. I simplified further by making the matrix square and containing only Numbers. I also preallocate the matrix and use the fact that I know the row/column sizes.

These are my results:

    Size    Elements    Read    Write
     250       62500    1013     7858
     500      250000    4185    31007
     750      562500    9858    71434

I think this is faster, but it is hard to compare. I am still a bit puzzled as to why the writing is slower than the reading, though.

The code is available at http://www.squeaksource.com/ADayAtTheBeach.html in the package 'Smalltalk-Hacking-Sven', class NumberMatrix.

This is the write loop:

    writeCsvTo: stream
        1 to: size do: [ :row |
            1 to: size do: [ :column |
                column ~= 1 ifTrue: [ stream nextPut: $, ].
                stream print: (self at: row at: column) ].
            stream nextPut: Character lf ]

And this is the read loop:

    readCsvFrom: stream
        | numberParser |
        numberParser := SqNumberParser on: stream.
        1 to: size do: [ :row |
            1 to: size do: [ :column |
                self at: row at: column put: numberParser nextNumber.
                column ~= size ifTrue: [ stream peekFor: $, ] ].
            row ~= size ifTrue: [ stream peekFor: Character lf ] ]

I am of course cheating a little bit, but should your CSV file be different, I am sure you can adapt the code (for example to deal with quoting). I am also advancing the stream under the SqNumberParser to avoid allocating a new one every time. I think this code generates little garbage.

What do you think?

Sven
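(A possible way to exercise the two loops above with in-memory streams; note that the instance-creation message NumberMatrix new: is a guess, only writeCsvTo: and readCsvFrom: come from the package:)

    | matrix csv copy |
    matrix := NumberMatrix new: 250.        "assumed instance-creation message"
    "... fill matrix with numbers ..."
    csv := String streamContents: [ :out | matrix writeCsvTo: out ].
    copy := NumberMatrix new: 250.
    copy readCsvFrom: csv readStream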
In reply to this post by hernanmd
Hi Benoit,
Thanks for your suggestions! I did some metrics again with a subset of my files with fewer lines, and here are the results:

    Lines     Milliseconds
      116               59
     2784             1415
    63936            18675
   175840            48216
   534760           149812

After implementing your improvements:

    Lines     Milliseconds
      116               44
     2784             1067
    63936            13991
   175840            37521
   534760           112906

Since I can assume my CSV files don't include quotes, I removed the quoted-character checking, moving it into a special parser subclass for files with quotes:

    Lines     Milliseconds
      116               37
     2784              888
    63936            11521
   175840            31905
   534760            96549

which is an acceptable improvement considering files of millions of lines. If I can make some time to implement a primitive, I will update with more results.

Best regards,

Hernán

2010/12/7 Benoit St-Jean <[hidden email]>:
> Hi Hernan,
>
> I had a look at the Text Parser package and here's a little suggestion...
>
> In class STextParser, there seems to be room for optimization (I know, I
> *hate* using this word with Smalltalk, I've always preferred simplicity
> over harder-to-read code) in one particular method, namely #nextInLine.
>
> A quick test in a workspace shows that removing the message sends for
> #cr and #lf (making them class variables instead) as well as using #==
> instead of #= makes this method 3 to 5 times faster. Since it's probably
> called millions and millions of times in your case, this might help...
>
> So it would look like:
>
>     nextInLine
>         | next |
>         next := stream next.
>         (next == MyCr or: [ next == MyLf ])
>             ifTrue: [
>                 stream skip: -1.
>                 next := nil ].
>         ^ next
>
> The original method executes in 23.45 seconds.
> Removing the message sends (Character cr and Character lf) and replacing
> them with class vars brings that down to 10.9 seconds.
> Then, replacing #= by #== brings the average time to 4.25 seconds.
>
> Tests were executed for 10000000 characters.
>
> Hope this helps!
>
> Keep me posted!
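(For completeness: the MyCr and MyLf variables in the quoted snippet above would have to be declared as class variables of STextParser and set once, e.g. in a class-side initialize; a sketch under that assumption:)

    STextParser class >> initialize
        "Cache the line-ending characters so #nextInLine avoids two message sends per call"
        MyCr := Character cr.
        MyLf := Character lf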
In reply to this post by Mariano Martinez Peck
Hi Mariano,
2010/12/7 Mariano Martinez Peck <[hidden email]>:
> which file stream are you using? do you need encoding?

I'm using just FileStream. How do I know if I need encoding?

> You can use the Alexandre Profiler to detect where the problem is.

You mean the PetitProfiler, or just Spy?

Cheers,

Hernán
In reply to this post by Sven Van Caekenberghe
Hi Sven,
Thanks for your comments. My matrix includes names, floats and characters, and it isn't square because I need to build a transposed matrix while reading the CSV. However, I will take a look at SqNumberParser.

In case anyone is wondering how this is done, this is the code for a matrix with 20 rows. First I make some predefined columns for which I have to repeat values, so I used the Generator ported by Lukas Renggli:

    "rowCount, colCount, parserResult and outputFile are set up elsewhere"
    matrix := Matrix rows: 29044 columns: 20.
    firstCol := (Generator on: [ :g |
        rowCount timesRepeat: [ g yield: 1 ] ]) upToEnd.
    matrix atColumn: 1 put: firstCol.
    "... fill the remaining 5 predefined columns ..."

    index := 0.
    parserResult rowsDo: [ :rs | | rowInd colInd |
        matrix
            at: (rowInd := index \\ 20 + 1)
            at: (colInd := index // 20 + 7)
            put: 'test'.
        index := index + 1 ].

    1 to: rowCount do: [ :rIndex |
        1 to: colCount do: [ :cIndex |
            outputFile nextPutAll: (matrix at: rIndex at: cIndex) asString; tab ].
        outputFile cr ].

Hope it helps someone,

Cheers

Hernán
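(A side note on the constant columns: if a whole column holds one repeated value, a plain Array would do the same job without the Generator; a small sketch, assuming rowCount equals the matrix's number of rows:)

    firstCol := Array new: rowCount withAll: 1.    "same content as the Generator version"
    matrix atColumn: 1 put: firstCol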