Related to the Matrix CSV input/output optimization quest, I was puzzled why writing seemed so much slower than reading.
Here is a simple example:

[ FileStream fileNamed: '/tmp/numbers.txt' do: [ :stream |
    100000 timesRepeat: [ stream print: 100 atRandom; space ] ] ] timeToRun.
1558

[ FileStream fileNamed: '/tmp/numbers.txt' do: [ :stream |
    100000 timesRepeat: [ Integer readFrom: stream. stream peekFor: $ ] ] ] timeToRun.
183

[ FileStream fileNamed: '/tmp/numbers.txt' do: [ :stream |
    100000 timesRepeat: [ stream nextPut: ($a to: $z) atRandom; space ] ] ] timeToRun.
1705

[ FileStream fileNamed: '/tmp/numbers.txt' do: [ :stream |
    100000 timesRepeat: [ stream next. stream peekFor: $ ] ] ] timeToRun.
47

Clearly, the writing is close to an order of magnitude slower than reading.

This was on Pharo 1.1 with Cog, but I double-checked with Pharo 1.2 and Squeak 4.1.

On my machine (a MacBook Pro), this is what another dynamic language (Common Lisp) does:

> (time (with-output-to-file (out "/tmp/numbers.txt")
    (loop repeat 100000 do (format out "~d " (random 100)))))
Timing the evaluation of (WITH-OUTPUT-TO-FILE (OUT "/tmp/numbers.txt") (LOOP REPEAT 100000 DO (FORMAT OUT "~d " (RANDOM 100))))

User time    = 0.413
System time  = 0.002
Elapsed time = 0.401
Allocation   = 2502320 bytes
0 Page faults
Calls to %EVAL 1700063
NIL

> (time (with-open-file (in "/tmp/numbers.txt")
    (loop repeat 100000 do (read in))))
Timing the evaluation of (WITH-OPEN-FILE (IN "/tmp/numbers.txt") (LOOP REPEAT 100000 DO (READ IN)))

User time    = 0.328
System time  = 0.001
Elapsed time = 0.315
Allocation   = 2500764 bytes
0 Page faults
Calls to %EVAL 1400056
NIL

So Pharo Smalltalk clearly matches the read/parse speed, which is great, but fails at simple writing.

Maybe I am doing something wrong here (I know these are MultiByteFileStreams), but I fail to see what. Something with buffering/flushing?

Anybody any idea?

Sven
On Tue, 7 Dec 2010, Sven Van Caekenberghe wrote:
> Related to the Matrix CSV input/output optimization quest, I was puzzled why writing seemed so much slower than reading.

snip

> Maybe I am doing something wrong here (I know these are MultiByteFileStreams), but I fail to see what. Something with buffering/flushing?
>
> Anybody any idea?

That's because filestreams are read buffered, but not write buffered. I implemented a subclass of FileStream (intended as a possible replacement of StandardFileStream) which is read and write buffered. It gives the same performance for reading as the current implementation and a significant boost for writes, so it can be done. But write buffering has side effects, while read buffering doesn't. Maybe it can be added as a separate subclass of FileStream if there's need for it, but the multibyte stuff has to be duplicated in this case (note that it's already duplicated in MultiByteFileStream and MultiByteBinaryOrTextStream).

I also had an idea to create a MultiByteStream, which would be a stream that wraps another stream and does the conversion stuff using a TextConverter. It'd be a lot of work to do it and I don't expect more than a 30% performance improvement (for the read performance).

There are several stream libraries (for example XTreams) that can easily support write buffering without the need to care about compatibility.

Levente
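A minimal sketch of that asymmetry (illustrative method bodies for a hypothetical buffered stream; buffer, position, count and #refillBufferFromFile are made-up names, not the actual StandardFileStream code):

    "Read side: refilling the buffer is invisible to callers, because reading
    changes nothing outside the image. #refillBufferFromFile is assumed to read
    the next chunk of the file into buffer and reset position and count."
    next
        position >= count ifTrue: [ self refillBufferFromFile ].
        position := position + 1.
        ^ buffer at: position

    "Write side: after #nextPut: returns, the character exists only in the
    in-image buffer; the file on disk stays stale until someone sends #flush.
    That delay is the side effect that makes write buffering non-transparent."
    nextPut: aCharacter
        position = buffer size ifTrue: [ self flush ].
        position := position + 1.
        buffer at: position put: aCharacter.
        ^ aCharacter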
Levente,
On 07 Dec 2010, at 16:20, Levente Uzonyi wrote:

> That's because filestreams are read buffered, but not write buffered.

snip

Thanks for the explanation; some quick and dirty buffering makes a huge difference:

[ FileStream fileNamed: '/tmp/numbers.txt' do: [ :fileStream |
    1000 timesRepeat: [
        fileStream nextPutAll:
            (String streamContents: [ :stream |
                100 timesRepeat: [ stream print: 100 atRandom; space ] ]) ] ] ] timeToRun.
159

(Each batch of 100 numbers is assembled in memory first, so roughly 200,000 tiny writes to the file stream become 1,000 larger #nextPutAll: calls.)

Still, the asymmetry is a bit strange. Can't the side effects be dealt with using #flush?

> There are several stream libraries (for example XTreams) that can easily support write buffering without the need to care about compatibility.

Yeah, although the Smalltalk Collection and Stream classes were better than everything else 20, 30 years ago, lots of things have changed and there is lots of competition. The fact that these classes are so nice to use seems to have prevented necessary improvements.

I think I might file this as a Pharo issue.

Sven
In reply to this post by Levente Uzonyi-2
On 12/07/2010 04:20 PM, Levente Uzonyi wrote:
> On Tue, 7 Dec 2010, Sven Van Caekenberghe wrote:
>
>> Related to the Matrix CSV input/output optimization quest, I was
>> puzzled why writing seemed so much slower than reading.

snip

> That's because filestreams are read buffered, but not write buffered. I
> implemented a subclass of FileStream (intended as a possible replacement
> of StandardFileStream) which is read and write buffered.

snip

No, buffering should not be in a subclass or even the file stream class itself. Buffering should be another class that wraps the file stream.

Cheers
Philippe
In reply to this post by Sven Van Caekenberghe
On Tue, 7 Dec 2010, Sven Van Caekenberghe wrote:
> Thanks for the explanation; some quick and dirty buffering makes a huge difference:

snip

> Still, the asymmetry is a bit strange. Can't the side effects be dealt with using #flush?

Let's go back in time. A year ago there was no read buffering (Pharo 1.0 was not released, Squeak 3.10.2 was out) and reading from a file was as slow as writing is currently. Read buffering could be added transparently, so it could give a huge speed improvement to all existing code.

Write buffering could be done the same way, but it would break code, because currently a write is immediately done, while with buffering it wouldn't be. Some files would be written only when the finalization process closes the file. The solution for this could be automatic flushing on each write, which could be turned off by a method. But that would be the same as not using write buffering at all. But with the same effort you could use another stream implementation that does write buffering. And write buffering can't be used to speed up existing code without reviewing it.

Levente
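The kind of code that would break, sketched with a hypothetical write-buffered stream (BufferedFileStream is an illustrative name, not an existing class):

    log := BufferedFileStream fileNamed: '/tmp/app.log'.
    log nextPutAll: 'transaction committed'; cr.
    "Unbuffered, the line is on disk now and an external reader sees it.
    Buffered, it sits in the image until the buffer fills or the finalization
    process closes the file - unless the code is reviewed to add an explicit:"
    log flush.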
In reply to this post by Philippe Marschall-2
On Tue, 7 Dec 2010, Philippe Marschall wrote:
snip

> No, buffering should not be in a subclass or even the file stream class
> itself. Buffering should be another class that wraps the file stream.

Stream wrappers are cool, but they are totally different from the current stream design. One could create new stream classes that could do buffering, conversion, compression, etc. and rewrite existing code to use them, but if you just rewrite the existing code using an existing stream library (for example XTreams), then you'll get pretty much the same thing with less effort.

Levente
2010/12/8 Levente Uzonyi <[hidden email]>:
> Stream wrappers are cool, but they are totally different from the current
> stream design.

snip

By now, you have at least two wrapper libraries:

The VW Xtreams port at http://www.squeaksource.com/Xtreams.html. These are wrappers focused on cleanness / cross-dialect portability; efficiency is not the top priority but might be improved later.

Also the SqueaXTream at http://www.squeaksource.com/XTream.html, which is more focused on efficiency.

I don't count Nile, which provides a few wrappers but is more oriented toward Traits composition.

Nicolas
In reply to this post by Levente Uzonyi-2
On 08 Dec 2010, at 00:25, Levente Uzonyi wrote:

> Write buffering could be done the same way, but it would break code, because currently a write is immediately done, while with buffering it wouldn't be. Some files would be written only when the finalization process closes the file.

snip

Thanks again for the explanation.

OK, I tried writing my own buffered write stream class:

[ FileStream fileNamed: '/tmp/numbers.txt' do: [ :fileStream | | bufferedStream |
    bufferedStream := ZnBufferedWriteStream on: fileStream.
    100000 timesRepeat: [ bufferedStream print: 100 atRandom; space ].
    bufferedStream flush ] ] timeToRun.
165

That wasn't too hard. And indeed, it is necessary to manually send #flush or #close to force the buffer out.

But I do not completely agree with the fact that it would be that much work. Stream>>#flush is already a no-op. Adding it to #streamContents: and some others can not be that much work. In fact, SocketStream already does both input and output buffering (and thus requires #flush or #close), so it would potentially fail in certain situations according to your reasoning. No?

Sven
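For reference, a wrapper like that can be sketched as follows (illustrative code with a made-up class name, not the actual ZnBufferedWriteStream source; the full stream protocol, binary mode and error handling are omitted):

    Object subclass: #DemoBufferedWriteStream
        instanceVariableNames: 'stream buffer position'
        classVariableNames: ''
        category: 'Demo-Streams'

    DemoBufferedWriteStream class >> on: aWriteStream
        ^ self new setStream: aWriteStream

    DemoBufferedWriteStream >> setStream: aWriteStream
        stream := aWriteStream.
        buffer := String new: 4096.    "fixed-size in-image buffer"
        position := 0

    DemoBufferedWriteStream >> nextPut: aCharacter
        position = buffer size ifTrue: [ self flush ].    "buffer full: do one real write"
        position := position + 1.
        buffer at: position put: aCharacter.
        ^ aCharacter

    DemoBufferedWriteStream >> nextPutAll: aString
        aString do: [ :each | self nextPut: each ].
        ^ aString

    DemoBufferedWriteStream >> print: anObject
        self nextPutAll: anObject printString

    DemoBufferedWriteStream >> space
        self nextPut: Character space

    DemoBufferedWriteStream >> flush
        "The only place the wrapped file stream is touched."
        position > 0 ifTrue: [
            stream nextPutAll: (buffer copyFrom: 1 to: position).
            position := 0 ]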
In reply to this post by Levente Uzonyi-2
On 12/08/2010 12:32 AM, Levente Uzonyi wrote:
> On Tue, 7 Dec 2010, Philippe Marschall wrote: > > snip > >> No, buffering should not be in a subclass or even the file stream class >> itself. Buffering should be an other class that wraps file stream. > > Stream wrappers are cool, but they are totally different from the > current stream design. One could create new stream classes that could do > buffering, conversion, compression, etc. and rewrite existing code to > use them, but if you just rewrite the existing code using an existing > stream library (for example XTreams), then you'll get pretty much the > same thing with less effort. Nope sorry, writing a stream that buffers is not less effort that porting and maintaining a whole stream library and rewriting your code. Cheers Philippe |
In reply to this post by Sven Van Caekenberghe
On Wed, 8 Dec 2010, Sven Van Caekenberghe wrote:
snip

> That wasn't too hard. And indeed, it is necessary to manually send #flush or #close to force the buffer out.
>
> But I do not completely agree with the fact that it would be that much work. Stream>>#flush is already a no-op. Adding it to #streamContents: and some others can not be that much work. In fact, SocketStream already does both input and output buffering (and thus requires #flush or #close), so it would potentially fail in certain situations according to your reasoning. No?

It would be much work to add the write buffering to StandardFileStream (it should also work with MultiByteFileStream) and to fix all the places in the image that use StandardFileStream or MultiByteFileStream for writing to a file.

I don't get how #streamContents: could be used to send #flush. That's a method of SequenceableCollection IIRC.

SocketStream is unrelated here, because it doesn't write to files, and buffering was always implemented in it AFAIK.

Levente
In reply to this post by Philippe Marschall-2
On Wed, 8 Dec 2010, Philippe Marschall wrote:
snip

> Nope, sorry, writing a stream that buffers is not less effort than
> porting and maintaining a whole stream library and rewriting your code.

It would be: writing and maintaining new stream classes vs. using an existing port.

Levente
In reply to this post by Sven Van Caekenberghe
snip

> Yeah, although the Smalltalk Collection and Stream classes were better than everything else 20, 30 years ago, lots of things have changed and there is lots of competition. The fact that these classes are so nice to use seems to have prevented necessary improvements.

Yes, time to move.
In reply to this post by Levente Uzonyi-2
Levente,
On 08 Dec 2010, at 13:03, Levente Uzonyi wrote:

> I don't get how #streamContents: could be used to send #flush. That's a method of SequenceableCollection IIRC.

Yeah, you're right: I stand corrected, this is not related.

I tried adding an #on:do: modeled after #fileNamed:do: to ensure that #flush is called, which is more elegant IMHO:

[ FileStream fileNamed: '/tmp/numbers.txt' do: [ :fileStream |
    ZnBufferedWriteStream on: fileStream do: [ :bufferedStream |
        100000 timesRepeat: [ bufferedStream print: 100 atRandom; space ] ] ] ] timeToRun.

> SocketStream is unrelated here, because it doesn't write to files and buffering was always implemented in it AFAIK.

I think it is relevant: it proves that people can perfectly live with output buffering.

Can you give an example of which code would break (so needs to call #flush) with output buffering? Would there be many cases?

Sven
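That class-side #on:do: could plausibly look like this (a sketch; the actual implementation isn't shown in the thread). The #ensure: guarantees the flush even when the block returns early or signals an error:

    ZnBufferedWriteStream class >> on: aWriteStream do: aBlock
        "Run aBlock with a buffered wrapper around aWriteStream,
        flushing the buffer no matter how the block exits."
        | bufferedStream |
        bufferedStream := self on: aWriteStream.
        ^ [ aBlock value: bufferedStream ] ensure: [ bufferedStream flush ]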