Writing a large Collection of integers to a file fast

Writing a large Collection of integers to a file fast

André Wendt-3
Hi,

I'm looking for a fast way to write a large Collection of ints (over 6
million elements) to a file. My attempt was:

aFile := CrLfFileStream fileNamed: aFilename.
aLargeCollection do: [ :int |
  aFile nextPutAll: int printString, String cr.
].
aFile close.

Unfortunately, this takes more than fifteen minutes. I suspect this is
due to my implementation.

Is there any smart way I can do this faster?

Thanks,
André


Re: Writing a large Collection of integers to a file fast

Mathieu SUEN
One possible optimization is not to use the #, message, because it
creates a copy of the String.
So you can try:

aFile := CrLfFileStream fileNamed: aFilename.
aLargeCollection do: [ :int |
  aFile nextPutAll: int printString; nextPut: Character cr.
].
aFile close.



        Mth





Re: Writing a large Collection of integers to a file fast

Lukas Renggli
> One possible optimization is not to use the #, message, because it
> creates a copy of the String.
> So you can try:
>
> aFile := CrLfFileStream fileNamed: aFilename.
> aLargeCollection do: [ :int |
>   aFile nextPutAll: int printString; nextPut: Character cr.
> ].
> aFile close.

Or even smarter:

    aStream print: int; nextPut: Character cr
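
Applied to the original loop, that would give (same structure, just no
intermediate string copies):

aFile := CrLfFileStream fileNamed: aFilename.
aLargeCollection do: [ :int |
  "print: writes the digits straight onto the stream"
  aFile print: int; nextPut: Character cr ].
aFile close.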

Lukas




--
Lukas Renggli
http://www.lukas-renggli.ch


Re: Writing a large Collection of integers to a file fast

André Wendt-3
Lukas Renggli wrote:

>> One possible optimization is not to use the #, message, because it
>> creates a copy of the String.
>> So you can try:
>>
>> aFile := CrLfFileStream fileNamed: aFilename.
>> aLargeCollection do: [ :int |
>>   aFile nextPutAll: int printString; nextPut: Character cr.
>> ].
>> aFile close.
>
> Or even smarter:
>
>     aStream print: int; nextPut: Character cr
>
> Lukas

Thanks for your suggestions! I've tried, but it didn't give the boost I
was hoping for...

André




Re: Writing a large Collection of integers to a file fast

Bert Freudenberg

On Jan 26, 2008, at 16:57 , André Wendt wrote:

> Thanks for your suggestions! I've tried, but it didn't give the boost I
> was hoping for...

Try printing to a memory buffer in chunks of 10000 integers, and  
putting that on the file. Unbuffered I/O is slow.

If that's still not enough (or even before), profile.
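
For example (TimeProfileBrowser ships with the image; aFile and
aLargeCollection are the names from the original snippet):

TimeProfileBrowser onBlock: [
  aLargeCollection do: [ :int |
    aFile nextPutAll: int printString, String cr ] ].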

- Bert -




Re: Writing a large Collection of integers to a file fast

Levente Uzonyi-2
In reply to this post by André Wendt-3
Hi!

Try this:

writeIntegersFrom: aLargeCollection to: aFileName

        | file crLf buffer |
        crLf := String cr, String lf.
        buffer := WriteStream on: (String new: 65536).
        [ file := StandardFileStream fileNamed: aFileName.
        aLargeCollection do: [ :int |
                buffer
                        nextPutAll: int printString;
                        nextPutAll: crLf.
                buffer position > 64000 ifTrue: [
                        file nextPutAll: buffer contents.
                        buffer position: 0 ] ].
        file nextPutAll: buffer contents ]
                ensure: [
                        file ifNotNil: [ file close ] ]

I'm sure there's a better solution; this is just a quick and dirty one.
(30 secs for 6M integers on my machine)
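
Usage would be something like (assuming the method is installed on some
handy class, e.g. as a class-side utility):

  self writeIntegersFrom: (1 to: 6000000) to: 'ints.txt'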

Levente



Re: Writing a large Collection of integers to a file fast

Klaus D. Witzel
In reply to this post by André Wendt-3
If you have only small integers and free memory, the fastest method (just
one I/O) would be

  | memory fOut |
  memory := Bitmap new: 6 * 1000 * 1000.
  memory atAllPut: SmallInteger maxVal.
  [(fOut := StandardFileStream newFileNamed: 'f.out')
        nextPutAll: memory asByteArray]
   ensure: [fOut close]

You can fill the memory instance with #integerAt:put:
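
For example (a sketch; assumes aLargeCollection is sequenceable and every
element fits in a 32-bit Bitmap slot):

  | memory |
  memory := Bitmap new: aLargeCollection size.
  aLargeCollection doWithIndex: [ :int :i |
        memory integerAt: i put: int ]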

/Klaus





RE: Writing a large Collection of integers to a file fast

Ramon Leon-5
> If you have only small integers and free memory, the fastest method
> (just one I/O) would be
>
>   | memory fOut |
>   memory := Bitmap new: 6 * 1000 * 1000.
>   memory atAllPut: SmallInteger maxVal.
>   [(fOut := StandardFileStream newFileNamed: 'f.out')
>         nextPutAll: memory asByteArray]
>    ensure: [fOut close]
>
> You can fill the memory instance with #integerAt:put:
>
> /Klaus

On a side note, watching this thread, I'm a bit curious why everyone is
manually closing their files, why not...

    | memory |
    memory := Bitmap new: 6 * 1000 * 1000.
    memory atAllPut: SmallInteger maxVal.
    FileStream
        newFileNamed: 'f.out'
        do:[:f | f nextPutAll: memory asByteArray]

Why is everyone ignoring that every file open method has a corresponding
#do: that encapsulates the close and removes the need for a temp?

Ramon Leon
http://onsmalltalk.com



Re: Writing a large Collection of integers to a file fast

Klaus D. Witzel
On Sat, 26 Jan 2008 18:44:13 +0100, Ramon Leon wrote:

...
> Why is everyone ignoring that every file open method has a corresponding
> #do: that encapsulates the close and removes the need for a temp?

I don't know why Mr.+Ms. Everyone does so, but for me it's a matter of
expressing what I think. And I think that #newFileNamed:do: is not the
same as #ensure: + #close.

/Klaus

> Ramon Leon
> http://onsmalltalk.com




Re: Writing a large Collection of integers to a file fast

Bert Freudenberg
In reply to this post by Ramon Leon-5

On Jan 26, 2008, at 18:44 , Ramon Leon wrote:

>> If you have only small integers and free memory, the fastest method
>> (just one I/O) would be
>>
>>   | memory fOut |
>>   memory := Bitmap new: 6 * 1000 * 1000.
>>   memory atAllPut: SmallInteger maxVal.
>>   [(fOut := StandardFileStream newFileNamed: 'f.out')
>>         nextPutAll: memory asByteArray]
>>    ensure: [fOut close]
>>
>> You can fill the memory instance with #integerAt:put:
>>
>> /Klaus


The original post asked for printed numbers, not binary 32-bit
big-endian numbers.

> On a side note, watching this thread, I'm a bit curious why everyone is
> manually closing their files, why not...
>
>     | memory |
>     memory := Bitmap new: 6 * 1000 * 1000.
>     memory atAllPut: SmallInteger maxVal.
>     FileStream
>         newFileNamed: 'f.out'
>         do:[:f | f nextPutAll: memory asByteArray]
>
> Why is everyone ignoring that every file open method has a corresponding
> #do: that encapsulates the close and removes the need for a temp?

Because that's a rather new addition (since 3.9 maybe?) that won't work
in many images.

- Bert -




Re: Writing a large Collection of integers to a file fast

Nicolas Cellier-3
In reply to this post by Bert Freudenberg
Bert Freudenberg wrote:
>
> Try printing to a memory buffer in chunks of 10000 integers, and putting
> that on the file. Unbuffered I/O is slow.
>

Any reason why Squeak should use unbuffered I/O?

It seems strange that we have to emulate a basic function that every
underlying OS performs so well.

Nicolas



Re: Writing a large Collection of integers to a file fast

David T. Lewis
On Sat, Jan 26, 2008 at 11:38:21PM +0100, nicolas cellier wrote:

> Bert Freudenberg wrote:
> >
> >Try printing to a memory buffer in chunks of 10000 integers, and putting
> >that on the file. Unbuffered I/O is slow.
> >
>
> Any reason why Squeak should use unbuffered I/O?
>
> It seems strange that we have to emulate a basic function that every
> underlying OS performs so well.

The Windows VM uses direct I/O to a Windows HANDLE, and all other
VMs are using buffered I/O. I have never seen any measurement data
to show that one is better than the other, or under what circumstances
one might be better than the other. I certainly would not assume
anything without seeing the numbers.

It would be straightforward to implement either approach on any
of the supported platforms, so I assume that these were simply
design choices of the individual VM implementers.

Furthermore, it is not necessarily the case that file I/O is the
performance bottleneck in this case. I did a quick check of this:

  TimeProfileBrowser onBlock:
    [aFilename := 'foo.txt'.
    aLargeCollection := 1 to: 100000.
    aFile := CrLfFileStream fileNamed: aFilename.
    aLargeCollection do: [ :int |
      aFile nextPutAll: int printString, String cr].
    aFile close]

Which shows that for the particular VM and image that I was using,
the majority of the processing time was spent in multibyte character
conversion and conversion of integers to strings, and less than
seven percent was spent in I/O primitives.

Dave



Re: Writing a large Collection of integers to a file fast

johnmci
Actually, if you open a FileStream you get a MultiByteFileStream
instance.

If the stream is binary, it invokes methods on the superclass
StandardFileStream to put a character or a collection of characters.

However, if it is text, it proceeds to read or write one character at a
time, which causes a discrete file I/O primitive call for each
character.


So say you need a UTF8 stream and you have 1 million characters, and
you say
foo nextPutAll: millionCharacterString

This causes 1 million file I/O operations, which takes a *long* time.

In Sophie I coded a SophieMultiByteMemoryFileStream which fronts the
real stream with a buffer the size of the stream; that way the
Translators get/put bytes to the buffer, and at close time I flush
the entire buffer to disk as a binary file. Obviously this is not a
general-purpose solution, since it relies on the fact that in Sophie we
know the UTF8 files we are working with will only be a few MB in size.
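
The general shape of that idea, as a rough sketch (not the Sophie code;
for pure-ASCII output like printed integers, asByteArray stands in for
the real encoding step):

  | buffer file |
  buffer := WriteStream on: (String new: 100000).
  aLargeCollection do: [ :int |
        buffer print: int; nextPut: Character cr ].
  file := StandardFileStream newFileNamed: 'out.txt'.
  "one binary write instead of one primitive call per character"
  [ file binary; nextPutAll: buffer contents asByteArray ]
        ensure: [ file close ]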


On Jan 27, 2008, at 9:38 AM, David T. Lewis wrote:

> Which shows that for the particular VM and image that I was using,
> the majority of the processing time was spent in multibyte character
> conversion and conversion of integers to strings, and less than
> seven percent was spent in I/O primitives.
>
> Dave

--
========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
========================================================================




Re: Writing a large Collection of integers to a file fast

johnmci
In reply to this post by David T. Lewis
Forgot to say: what is needed is a BUFFERED multibyte stream, but
that is a non-trivial coding problem.



--
========================================================================
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
========================================================================




Re: Writing a large Collection of integers to a file fast

David T. Lewis
In reply to this post by johnmci
On Sun, Jan 27, 2008 at 02:05:35PM -0800, John M McIntosh wrote:

> However, if it is text, it proceeds to read or write one character at a
> time, which causes a discrete file I/O primitive call for each
> character.
>
> So say you need a UTF8 stream and you have 1 million characters, and
> you say
> foo nextPutAll: millionCharacterString
>
> This causes 1 million file I/O operations, which takes a *long* time.

Quite right. But however inefficient this might be, the file I/O
primitive call is still not the bottleneck. Most of the time (over
40% in my example) is eaten up in MultiByteFileStream>>nextPut:,
of which a small portion is the actual I/O primitive.

I tried this test on an AMD-64 laptop running a Linux VM, and then
with a Windows VM on the same machine. The results were about the same
for both VMs (about 4% of the run time taken up in the I/O primitives),
so for this kind of file I/O there is no significant difference in
performance between a VM that uses buffered (stdio) I/O and a VM
that uses lower-level (write to HANDLE) I/O. Furthermore, the I/O
performance in the VM is only a small part of the overall file
processing time, despite calling a primitive for every single character
processed.

For the record, the test case was:

  TimeProfileBrowser onBlock:
    [aFilename := 'foo.txt'.
    aLargeCollection := 1 to: 100000.
    aFile := CrLfFileStream fileNamed: aFilename.
    aLargeCollection do: [ :int |
      aFile nextPutAll: int printString, String cr].
    aFile close]

> In Sophie I coded a SophieMultiByteMemoryFileStream which fronts the
> real stream with a buffer the size of the stream; that way the
> Translators get/put bytes to the buffer, and at close time I flush
> the entire buffer to disk as a binary file.

That sounds like a good idea.

Dave
 


Re: Writing a large Collection of integers to a file fast

Zulq Alam-2
In reply to this post by Lukas Renggli
Lukas Renggli wrote:
>
> Or even smarter:
>
>     aStream print: int; nextPut: Character cr
>
> Lukas
>

I tried this and didn't get the improvement I was expecting (< 2%). I had
a deeper look and noticed Integer>>#printOn:base: generates an
unnecessary string representation which it then reverses and prints to
the stream.

I got closer to 20% with these changes to Integer as well:

printOn: aStream base: base
   self negative ifTrue: [aStream nextPut: $-].
   self abs normalize printDigitsOn: aStream base: base.

printDigitsOn: aStream base: base
   (self >= base)
     ifTrue:
       [(self // base) printDigitsOn: aStream base: base].

   aStream nextPut: (Character digitValue: self \\ base)

I'm not sure why one needs to send #normalize, but it was being sent
before...?

I'll create an entry in Mantis when I get a moment to check the unit
tests and the other printXXX methods on Integer.

Regards,
Zulq.



Re: Writing a large Collection of integers to a file fast

David T. Lewis
On Mon, Jan 28, 2008 at 05:48:01AM +0000, Zulq Alam wrote:

> Lukas Renggli wrote:
> >
> >Or even smarter:
> >
> >    aStream print: int; nextPut: Character cr
> >
> >Lukas
> >
>
> I tried this and didn't get the improvement I was expecting (< 2%). I had
> a deeper look and noticed Integer>>#printOn:base: generates an
> unnecessary string representation which it then reverses and prints to
> the stream.

The #reversed is not unnecessary. Other algorithms are of course
possible, but this one is not wrong.

> I got closer to 20% with these changes to Integer as well:
>
> printOn: aStream base: base
>   self negative ifTrue: [aStream nextPut: $-].
>   self abs normalize printDigitsOn: aStream base: base.
>
> printDigitsOn: aStream base: base
>   (self >= base)
>     ifTrue:
>       [(self // base) printDigitsOn: aStream base: base].
>
>   aStream nextPut: (Character digitValue: self \\ base)
>
> I'm not sure why one needs to send #normalize, but it was being sent
> before...?

The #normalize is there for large integers. If you remove it, you
will break printing for large integers, and you may not see the
problem right away, since it would only apply to non-normalized
large integers. This is the worst sort of bug: it shows up later, and
nobody can remember how or why it got introduced.

> I'll create an entry in Mantis when I get a moment to check the unit
> tests and the other printXXX methods on Integer.

It would be best not to open a new Mantis entry unless there actually
is a new issue to be resolved.

Dave



Re: Writing a large Collection of integers to a file fast

Zulq Alam-2
David T. Lewis wrote:
>
> The #reversed is not unnecessary. Other algorithms are of course
> possible, but this one is not wrong.
>

It is the creation of an intermediate string that is unnecessary. The
#printOn:base: is passed a stream on which it can print directly.

I mentioned the reverse because it was another (the third) iteration over
the integer's digits (create, reverse and print). In fact, a negative
number requires yet another iteration.

>
> It would be best not to open a new Mantis entry unless there actually
> is a new issue to be resolved.
>

There is a "tweak" severity that suits this I think.

The change yields a ~18% improvement when hooked up to
#printStringBase: in this benchmark:

TimeProfileBrowser spyOn:
        [-1000000 to: 1000000 do: [:i | i asString]]

There are also significantly fewer objects produced.

This may matter very little in practice, but the only way to know is
to measure it.

Regards,
Zulq.