Hi,
I'm looking for a fast way to write a large Collection of ints (over 6
million elements) to a file. My attempt was:

aFile := CrLfFileStream fileNamed: aFilename.
aLargeCollection do: [ :int |
  aFile nextPutAll: int printString, String cr.
].
aFile close.

Unfortunately, this takes more than fifteen minutes. I suspect this is
due to my implementation.

Is there any smart way I can do this faster?

Thanks,
André
One possible optimization is to not use the #, message, because it
creates a copy of the String. So you can try:

aFile := CrLfFileStream fileNamed: aFilename.
aLargeCollection do: [ :int |
  aFile nextPutAll: int printString; nextPut: Character cr.
].
aFile close.

Mth
> aFile nextPutAll: int printString; nextPut: Character cr.

Or even smarter:

aStream print: int; nextPut: Character cr

Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch
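For reference, dropping Lukas's line into the original loop would look
like this (a sketch, untested):

aFile := CrLfFileStream fileNamed: aFilename.
aLargeCollection do: [ :int |
  aFile print: int; nextPut: Character cr ].
aFile close.

This avoids both the copy made by #, and the intermediate #printString
result, since #print: asks the integer to print itself directly onto
the stream.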
Lukas Renggli wrote:
> Or even smarter:
>
> aStream print: int; nextPut: Character cr

Thanks for your suggestions! I've tried, but it didn't give the boost I
was hoping for...

André
On Jan 26, 2008, at 16:57 , André Wendt wrote:
> Thanks for your suggestions! I've tried, but it didn't give the boost
> I was hoping for...

Try printing to a memory buffer in chunks of 10000 integers, and
putting that on the file. Unbuffered I/O is slow.

If that's still not enough (or even before), profile.

- Bert -
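For the profiling step Bert recommends, TimeProfileBrowser in the
standard image can be pointed at the write loop, as David Lewis does
further down the thread. A minimal sketch:

TimeProfileBrowser onBlock: [
  aFile := CrLfFileStream fileNamed: 'foo.txt'.
  (1 to: 100000) do: [ :int |
    aFile nextPutAll: int printString, String cr ].
  aFile close ]

The resulting call tree shows where the time really goes, e.g.
character conversion versus the I/O primitives themselves.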
In reply to this post by André Wendt-3
Hi!
Try this:

writeIntegersFrom: aLargeCollection to: aFileName

  | file crLf buffer |
  crLf := String cr, String lf.
  buffer := WriteStream on: (String new: 65536).
  [ file := StandardFileStream fileNamed: aFileName.
    aLargeCollection do: [ :int |
      buffer nextPutAll: int printString; nextPutAll: crLf.
      buffer position > 64000 ifTrue: [
        file nextPutAll: buffer contents.
        buffer position: 0 ] ].
    file nextPutAll: buffer contents ]
      ensure: [ file ifNotNil: [ file close ] ]

I'm sure there's a better solution, this is just a fast and dirty one.
(30 secs for 6M integers on my machine)

Levente
In reply to this post by André Wendt-3
If you have only small integers and free memory, the fastest method
(just one I/O) would be:

| memory fOut |
memory := Bitmap new: 6 * 1000 * 1000.
memory atAllPut: SmallInteger maxVal.
[(fOut := StandardFileStream newFileNamed: 'f.out')
  nextPutAll: memory asByteArray]
  ensure: [fOut close]

You can fill the memory instance with #integerAt:put:

/Klaus
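Filling the Bitmap from the collection with the #integerAt:put: that
Klaus mentions might look like this (a sketch, assuming
aLargeCollection is a SequenceableCollection of SmallIntegers):

1 to: aLargeCollection size do: [ :i |
  memory integerAt: i put: (aLargeCollection at: i) ].

Note this writes raw 32-bit words rather than printed numbers; as Bert
points out below, that is not quite what the original post asked for.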
> If you have only small integers and free memory, the fastest method
> (just one I/O) would be [...]

On a side note, watching this thread, I'm a bit curious why everyone is
manually closing their files. Why not:

| memory |
memory := Bitmap new: 6 * 1000 * 1000.
memory atAllPut: SmallInteger maxVal.
FileStream
  newFileNamed: 'f.out'
  do: [:f | f nextPutAll: memory asByteArray]

Why is everyone ignoring that every file open method has a
corresponding #do: that encapsulates the close and removes the need for
a temp?

Ramon Leon
http://onsmalltalk.com
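The #do: variants Ramon refers to presumably just wrap the ensure/close
pattern; conceptually (a sketch, not the actual image source) something
like:

newFileNamed: aName do: aBlock
  | stream |
  stream := self newFileNamed: aName.
  ^ [ aBlock value: stream ] ensure: [ stream close ]

so the stream's lifetime is bounded by the block and no temporary leaks
into the calling method.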
On Sat, 26 Jan 2008 18:44:13 +0100, Ramon Leon wrote:
> Why is everyone ignoring that every file open method has a
> corresponding #do: that encapsulates the close and removes the need
> for a temp?

I don't know why Mr.+Ms. everyone does so, but for me it's a matter of
expressing what I think. And I think that #newFileNamed:do: is not the
same as #ensure: + #close.

/Klaus
In reply to this post by Ramon Leon-5
On Jan 26, 2008, at 18:44 , Ramon Leon wrote:
>> | memory fOut |
>> memory := Bitmap new: 6 * 1000 * 1000.
>> memory atAllPut: SmallInteger maxVal.
>> [...]

The original post asked for printed numbers, not binary 32-bit
big-endian numbers.

> Why is everyone ignoring that every file open method has a
> corresponding #do: that encapsulates the close and removes the need
> for a temp?

Because that's a rather new addition (since 3.9 maybe?) that won't work
in many images.

- Bert -
In reply to this post by Bert Freudenberg
Bert Freudenberg wrote:
> Try printing to a memory buffer in chunks of 10000 integers, and
> putting that on the file. Unbuffered I/O is slow.

Any reason why Squeak should use unbuffered I/O?

It seems strange that we have to emulate a basic function every
underlying OS performs so well.

Nicolas
On Sat, Jan 26, 2008 at 11:38:21PM +0100, nicolas cellier wrote:
> Any reason why Squeak should use unbuffered I/O?
>
> It seems strange that we have to emulate a basic function every
> underlying OS performs so well.

The Windows VM uses direct I/O to a Windows HANDLE, and all other VMs
are using buffered I/O. I have never seen any measurement data to show
that one is better than the other, or under what circumstances one
might be better than the other. I certainly would not assume anything
without seeing the numbers. It would be straightforward to implement
either approach on any of the supported platforms, so I assume that
these were simply design choices of the individual VM implementers.

Furthermore, it is not necessarily the case that file I/O is the
performance bottleneck in this case. I did a quick check of this:

TimeProfileBrowser onBlock: [
  aFilename := 'foo.txt'.
  aLargeCollection := 1 to: 100000.
  aFile := CrLfFileStream fileNamed: aFilename.
  aLargeCollection do: [ :int |
    aFile nextPutAll: int printString, String cr].
  aFile close]

This shows that for the particular VM and image that I was using, the
majority of the processing time was spent in multibyte character
conversion and conversion of integers to strings, and less than seven
percent was spent in I/O primitives.

Dave
Actually, if you open a FileStream you get a MultiByteFileStream
instance.

If the stream is binary, it invokes methods on the superclass
StandardFileStream to put a character or a collection of characters.
However, if it is text, it reads or writes one character at a time,
which does indeed cause a discrete file I/O primitive call per
character.

So say you need a UTF8 stream, you have 1 million characters, and you
say:

foo nextPutAll: millionCharacterString

This causes 1 million file I/O operations, which takes a *long* time.

In Sophie I coded a SophieMultiByteMemoryFileStream which fronts the
real stream with a buffer the size of the stream. That way the
translators get/put bytes to the buffer, and at close time I flush the
entire buffer to disk as a binary file. Obviously this is not a
general-purpose solution, since it relies on the fact that in Sophie we
know the UTF8 files we are working with will only be a few MB in size.

On Jan 27, 2008, at 9:38 AM, David T. Lewis wrote:
> Which shows that for the particular VM and image that I was using,
> the majority of the processing time was spent in multibyte character
> conversion and conversion of integers to strings, and less than
> seven percent was spent in I/O primitives.

--
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.
http://www.smalltalkconsulting.com
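John's buffer-then-flush idea can be sketched in a few lines
(illustrative only; Sophie's actual class is more involved, and writing
through StandardFileStream assumes the buffered text needs no further
encoding):

| buffer file |
buffer := WriteStream on: (String new: 1000000).
1 to: 1000000 do: [ :i |
  buffer nextPutAll: i printString; nextPut: Character cr ].
file := StandardFileStream newFileNamed: 'out.txt'.
[ file nextPutAll: buffer contents ] ensure: [ file close ].

One nextPutAll: on the file replaces millions of per-character
primitive calls.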
In reply to this post by David T. Lewis
Forgot to say: what is needed is a BUFFERED multibyte stream, but that
is a non-trivial coding problem.

--
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.
http://www.smalltalkconsulting.com
In reply to this post by johnmci
On Sun, Jan 27, 2008 at 02:05:35PM -0800, John M McIntosh wrote:
> So say you need a UTF8 stream, you have 1 million characters, and you
> say:
>
> foo nextPutAll: millionCharacterString
>
> This causes 1 million file I/O operations, which takes a *long* time.

Quite right. But however inefficient this might be, the file I/O
primitive call is still not the bottleneck. Most of the time (over 40%
in my example) is eaten up in MultiByteFileStream>>nextPut:, of which a
small portion is the actual I/O primitive.

I tried this test on an AMD-64 laptop running a Linux VM, and then with
a Windows VM on the same machine. The results were about the same for
both VMs (about 4% of the run time taken up in the I/O primitives), so
for this kind of file I/O there is no significant difference in
performance between a VM that uses buffered (stdio) I/O and a VM that
uses lower-level (write to HANDLE) I/O. Furthermore, the I/O
performance in the VM is only a small part of the overall file
processing time, despite calling a primitive for every single character
processed.

For the record, the test case was:

TimeProfileBrowser onBlock: [
  aFilename := 'foo.txt'.
  aLargeCollection := 1 to: 100000.
  aFile := CrLfFileStream fileNamed: aFilename.
  aLargeCollection do: [ :int |
    aFile nextPutAll: int printString, String cr].
  aFile close]

> In Sophie I coded a SophieMultiByteMemoryFileStream which fronts the
> real stream with a buffer the size of the stream [...]

That sounds like a good idea.

Dave
In reply to this post by Lukas Renggli
Lukas Renggli wrote:
> Or even smarter:
>
> aStream print: int; nextPut: Character cr

I tried this and didn't get the improvement I was expecting, < 2%. I
had a deeper look and noticed Integer>>#printOn:base: generates an
unnecessary string representation, which it then reverses and prints to
the stream. I got closer to 20% with these changes to Integer as well:

printOn: aStream base: base
  self negative ifTrue: [aStream nextPut: $-].
  self abs normalize printDigitsOn: aStream base: base.

printDigitsOn: aStream base: base
  (self >= base)
    ifTrue: [(self // base) printDigitsOn: aStream base: base].
  aStream nextPut: (Character digitValue: self \\ base)

I'm not sure why one needs to send #normalize, but it was being sent
before...?

I'll create an entry in Mantis when I get a moment to check the unit
tests and the other printXXX methods on Integer.

Regards,
Zulq.
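A quick way to sanity-check such a change from a workspace (a sketch):

(String streamContents: [ :s | -12345 printOn: s base: 10 ]) = '-12345'

which should answer true both before and after the rewrite.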
On Mon, Jan 28, 2008 at 05:48:01AM +0000, Zulq Alam wrote:
> I tried this and didn't get the improvement I was expecting, < 2%. I
> had a deeper look and noticed Integer>>#printOn:base: generates an
> unnecessary string representation, which it then reverses and prints
> to the stream.

The #reversed is not unnecessary. Other algorithms are of course
possible, but this one is not wrong.

> I'm not sure why one needs to send #normalize, but it was being sent
> before...?

The #normalize is there for large integers. If you remove it, you will
break printing for large integers, and you may not see the problem
right away, since it would only apply to non-normalized large integers.
This is the worst sort of bug: it shows up later, and nobody can
remember how or why it got introduced.

> I'll create an entry in Mantis when I get a moment to check the unit
> tests and the other printXXX methods on Integer.

It would be best not to open a new Mantis entry unless there actually
is a new issue to be resolved.

Dave
David T. Lewis wrote:
> The #reversed is not unnecessary. Other algorithms are of course
> possible, but this one is not wrong.

It is the creation of an intermediate string that is unnecessary.
#printOn:base: is passed a stream on which it can print directly. I
mentioned the reverse because it was another (the third) iteration over
the integer's digits (create, reverse and print). In fact, a negative
number requires yet another iteration.

> It would be best not to open a new Mantis entry unless there actually
> is a new issue to be resolved.

There is a "tweak" severity that suits this, I think. The change yields
an ~18% improvement when hooked up to #printStringBase: in this
benchmark:

TimeProfileBrowser spyOn: [-1000000 to: 1000000 do: [:i | i asString]]

There are also significantly fewer objects produced. This may matter
very little in practice, but the only way to know is to measure it.

Regards,
Zulq.