Hi,
I'm looking for a fast way to write a large Collection of ints (over 6
million elements) to a file. My attempt was:

aFile := CrLfFileStream fileNamed: aFilename.
aLargeCollection do: [ :int |
  aFile nextPutAll: int printString, String cr.
].
aFile close.

Unfortunately, this takes more than fifteen minutes. I suspect this is
due to my implementation.

Is there any smart way I can do this faster?

Thanks,
André
One possible optimization is to not use the #, message, because it
creates a copy of the String. So you can try:

aFile := CrLfFileStream fileNamed: aFilename.
aLargeCollection do: [ :int |
  aFile nextPutAll: int printString; nextPut: Character cr.
].
aFile close.

Mth
> aFile nextPutAll: int printString; nextPut: Character cr.

Or even smarter:

aStream print: int; nextPut: Character cr

Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch
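For reference, dropping Lukas's line into the original loop would look
like this (a sketch, untested):

aFile := CrLfFileStream fileNamed: aFilename.
aLargeCollection do: [ :int |
  aFile print: int; nextPut: Character cr ].
aFile close.

This avoids both the copy made by #, and the intermediate #printString
result, since #print: asks the integer to print itself directly onto
the stream.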
Lukas Renggli wrote:
> Or even smarter:
>
> aStream print: int; nextPut: Character cr

Thanks for your suggestions! I've tried, but it didn't give the boost I
was hoping for...

André
On Jan 26, 2008, at 16:57 , André Wendt wrote:
> Thanks for your suggestions! I've tried, but it didn't give the boost
> I was hoping for...

Try printing to a memory buffer in chunks of 10000 integers, and
putting that on the file. Unbuffered I/O is slow.

If that's still not enough (or even before), profile.

- Bert -
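For the profiling step Bert recommends, TimeProfileBrowser in the
standard image can be pointed at the write loop, as David Lewis does
further down the thread. A minimal sketch:

TimeProfileBrowser onBlock: [
  aFile := CrLfFileStream fileNamed: 'foo.txt'.
  (1 to: 100000) do: [ :int |
    aFile nextPutAll: int printString, String cr ].
  aFile close ]

The resulting call tree shows where the time really goes, e.g.
character conversion versus the I/O primitives themselves.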
In reply to this post by André Wendt-3
Hi!
Try this:

writeIntegersFrom: aLargeCollection to: aFileName

  | file crLf buffer |
  crLf := String cr, String lf.
  buffer := WriteStream on: (String new: 65536).
  [ file := StandardFileStream fileNamed: aFileName.
    aLargeCollection do: [ :int |
      buffer nextPutAll: int printString; nextPutAll: crLf.
      buffer position > 64000 ifTrue: [
        file nextPutAll: buffer contents.
        buffer position: 0 ] ].
    file nextPutAll: buffer contents ]
      ensure: [ file ifNotNil: [ file close ] ]

I'm sure there's a better solution, this is just a fast and dirty one.
(30 secs for 6M integers on my machine)

Levente
In reply to this post by André Wendt-3
If you have only small integers and free memory, the fastest method
(just one I/O) would be:

| memory fOut |
memory := Bitmap new: 6 * 1000 * 1000.
memory atAllPut: SmallInteger maxVal.
[(fOut := StandardFileStream newFileNamed: 'f.out')
  nextPutAll: memory asByteArray]
  ensure: [fOut close]

You can fill the memory instance with #integerAt:put:

/Klaus
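Filling the Bitmap from the collection with the #integerAt:put: that
Klaus mentions might look like this (a sketch, assuming
aLargeCollection is a SequenceableCollection of SmallIntegers):

1 to: aLargeCollection size do: [ :i |
  memory integerAt: i put: (aLargeCollection at: i) ].

Note this writes raw 32-bit words rather than printed numbers; as Bert
points out below, that is not quite what the original post asked for.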
> If you have only small integers and free memory, the fastest method
> (just one I/O) would be [...]

On a side note, watching this thread, I'm a bit curious why everyone is
manually closing their files. Why not:

| memory |
memory := Bitmap new: 6 * 1000 * 1000.
memory atAllPut: SmallInteger maxVal.
FileStream
  newFileNamed: 'f.out'
  do: [:f | f nextPutAll: memory asByteArray]

Why is everyone ignoring that every file open method has a
corresponding #do: that encapsulates the close and removes the need for
a temp?

Ramon Leon
http://onsmalltalk.com
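The #do: variants Ramon refers to presumably just wrap the ensure/close
pattern; conceptually (a sketch, not the actual image source) something
like:

newFileNamed: aName do: aBlock
  | stream |
  stream := self newFileNamed: aName.
  ^ [ aBlock value: stream ] ensure: [ stream close ]

so the stream's lifetime is bounded by the block and no temporary leaks
into the calling method.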
On Sat, 26 Jan 2008 18:44:13 +0100, Ramon Leon wrote:
> Why is everyone ignoring that every file open method has a
> corresponding #do: that encapsulates the close and removes the need
> for a temp?

I don't know why Mr.+Ms. everyone does so, but for me it's a matter of
expressing what I think. And I think that #newFileNamed:do: is not the
same as #ensure: + #close.

/Klaus
In reply to this post by Ramon Leon-5
On Jan 26, 2008, at 18:44 , Ramon Leon wrote:
>> | memory fOut |
>> memory := Bitmap new: 6 * 1000 * 1000.
>> memory atAllPut: SmallInteger maxVal.
>> [...]

The original post asked for printed numbers, not binary 32-bit
big-endian numbers.

> Why is everyone ignoring that every file open method has a
> corresponding #do: that encapsulates the close and removes the need
> for a temp?

Because that's a rather new addition (since 3.9 maybe?) that won't work
in many images.

- Bert -
In reply to this post by Bert Freudenberg
Bert Freudenberg wrote:
> Try printing to a memory buffer in chunks of 10000 integers, and
> putting that on the file. Unbuffered I/O is slow.

Any reason why Squeak should use unbuffered I/O?

It seems strange that we have to emulate a basic function every
underlying OS performs so well.

Nicolas
On Sat, Jan 26, 2008 at 11:38:21PM +0100, nicolas cellier wrote:
> Any reason why Squeak should use unbuffered I/O?
>
> It seems strange that we have to emulate a basic function every
> underlying OS performs so well.

The Windows VM uses direct I/O to a Windows HANDLE, and all other VMs
are using buffered I/O. I have never seen any measurement data to show
that one is better than the other, or under what circumstances one
might be better than the other. I certainly would not assume anything
without seeing the numbers. It would be straightforward to implement
either approach on any of the supported platforms, so I assume that
these were simply design choices of the individual VM implementers.

Furthermore, it is not necessarily the case that file I/O is the
performance bottleneck in this case. I did a quick check of this:

TimeProfileBrowser onBlock: [
  aFilename := 'foo.txt'.
  aLargeCollection := 1 to: 100000.
  aFile := CrLfFileStream fileNamed: aFilename.
  aLargeCollection do: [ :int |
    aFile nextPutAll: int printString, String cr].
  aFile close]

This shows that for the particular VM and image that I was using, the
majority of the processing time was spent in multibyte character
conversion and conversion of integers to strings, and less than seven
percent was spent in I/O primitives.

Dave
Actually, if you open a FileStream you get a MultiByteFileStream
instance.

If the stream is binary, it invokes methods on the superclass
StandardFileStream to put a character or a collection of characters.
However, if it is text, it reads or writes one character at a time,
which does indeed cause a discrete file I/O primitive call per
character.

So say you need a UTF8 stream, you have 1 million characters, and you
say:

foo nextPutAll: millionCharacterString

This causes 1 million file I/O operations, which takes a *long* time.

In Sophie I coded a SophieMultiByteMemoryFileStream which fronts the
real stream with a buffer the size of the stream. That way the
translators get/put bytes to the buffer, and at close time I flush the
entire buffer to disk as a binary file. Obviously this is not a
general-purpose solution, since it relies on the fact that in Sophie we
know the UTF8 files we are working with will only be a few MB in size.

On Jan 27, 2008, at 9:38 AM, David T. Lewis wrote:
> Which shows that for the particular VM and image that I was using,
> the majority of the processing time was spent in multibyte character
> conversion and conversion of integers to strings, and less than
> seven percent was spent in I/O primitives.

--
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.
http://www.smalltalkconsulting.com
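John's buffer-then-flush idea can be sketched in a few lines
(illustrative only; Sophie's actual class is more involved, and writing
through StandardFileStream assumes the buffered text needs no further
encoding):

| buffer file |
buffer := WriteStream on: (String new: 1000000).
1 to: 1000000 do: [ :i |
  buffer nextPutAll: i printString; nextPut: Character cr ].
file := StandardFileStream newFileNamed: 'out.txt'.
[ file nextPutAll: buffer contents ] ensure: [ file close ].

One nextPutAll: on the file replaces millions of per-character
primitive calls.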
In reply to this post by David T. Lewis
Forgot to say: what is needed is a BUFFERED multibyte stream, but that
is a non-trivial coding problem.

--
John M. McIntosh <[hidden email]>
Corporate Smalltalk Consulting Ltd.
http://www.smalltalkconsulting.com
In reply to this post by johnmci
On Sun, Jan 27, 2008 at 02:05:35PM -0800, John M McIntosh wrote:
> So say you need a UTF8 stream, you have 1 million characters, and you
> say:
>
> foo nextPutAll: millionCharacterString
>
> This causes 1 million file I/O operations, which takes a *long* time.

Quite right. But however inefficient this might be, the file I/O
primitive call is still not the bottleneck. Most of the time (over 40%
in my example) is eaten up in MultiByteFileStream>>nextPut:, of which a
small portion is the actual I/O primitive.

I tried this test on an AMD-64 laptop running a Linux VM, and then with
a Windows VM on the same machine. The results were about the same for
both VMs (about 4% of the run time taken up in the I/O primitives), so
for this kind of file I/O there is no significant difference in
performance between a VM that uses buffered (stdio) I/O and a VM that
uses lower-level (write to HANDLE) I/O. Furthermore, the I/O
performance in the VM is only a small part of the overall file
processing time, despite calling a primitive for every single character
processed.

For the record, the test case was:

TimeProfileBrowser onBlock: [
  aFilename := 'foo.txt'.
  aLargeCollection := 1 to: 100000.
  aFile := CrLfFileStream fileNamed: aFilename.
  aLargeCollection do: [ :int |
    aFile nextPutAll: int printString, String cr].
  aFile close]

> In Sophie I coded a SophieMultiByteMemoryFileStream which fronts the
> real stream with a buffer the size of the stream [...]

That sounds like a good idea.

Dave
In reply to this post by Lukas Renggli
Lukas Renggli wrote:
> Or even smarter:
>
> aStream print: int; nextPut: Character cr

I tried this and didn't get the improvement I was expecting, < 2%. I
had a deeper look and noticed Integer>>#printOn:base: generates an
unnecessary string representation, which it then reverses and prints to
the stream. I got closer to 20% with these changes to Integer as well:

printOn: aStream base: base
  self negative ifTrue: [aStream nextPut: $-].
  self abs normalize printDigitsOn: aStream base: base.

printDigitsOn: aStream base: base
  (self >= base)
    ifTrue: [(self // base) printDigitsOn: aStream base: base].
  aStream nextPut: (Character digitValue: self \\ base)

I'm not sure why one needs to send #normalize, but it was being sent
before...?

I'll create an entry in Mantis when I get a moment to check the unit
tests and the other printXXX methods on Integer.

Regards,
Zulq.
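A quick way to sanity-check such a change from a workspace (a sketch):

(String streamContents: [ :s | -12345 printOn: s base: 10 ]) = '-12345'

which should answer true both before and after the rewrite.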
On Mon, Jan 28, 2008 at 05:48:01AM +0000, Zulq Alam wrote:
> I tried this and didn't get the improvement I was expecting, < 2%. I
> had a deeper look and noticed Integer>>#printOn:base: generates an
> unnecessary string representation, which it then reverses and prints
> to the stream.

The #reversed is not unnecessary. Other algorithms are of course
possible, but this one is not wrong.

> I'm not sure why one needs to send #normalize, but it was being sent
> before...?

The #normalize is there for large integers. If you remove it, you will
break printing for large integers, and you may not see the problem
right away, since it would only apply to non-normalized large integers.
This is the worst sort of bug: it shows up later, and nobody can
remember how or why it got introduced.

> I'll create an entry in Mantis when I get a moment to check the unit
> tests and the other printXXX methods on Integer.

It would be best not to open a new Mantis entry unless there actually
is a new issue to be resolved.

Dave
David T. Lewis wrote:
> The #reversed is not unnecessary. Other algorithms are of course
> possible, but this one is not wrong.

It is the creation of an intermediate string that is unnecessary.
#printOn:base: is passed a stream on which it can print directly. I
mentioned the reverse because it was another (the third) iteration over
the integer's digits (create, reverse and print). In fact, a negative
number requires yet another iteration.

> It would be best not to open a new Mantis entry unless there actually
> is a new issue to be resolved.

There is a "tweak" severity that suits this, I think. The change yields
an ~18% improvement when hooked up to #printStringBase: in this
benchmark:

TimeProfileBrowser spyOn: [-1000000 to: 1000000 do: [:i | i asString]]

There are also significantly fewer objects produced. This may matter
very little in practice, but the only way to know is to measure it.

Regards,
Zulq.