20X speedup for read file upToEnd

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

20X speedup for read file upToEnd

David T. Lewis
Attached is a small change that gives a big performance boost for reading
a file upToEnd. My use case (where I noticed this) is reading an image
file, where the file is opened, the header is read, and the remainder of
the file upToEnd is read in as the object memory:

  fs := FileStream readOnlyFileNamed: Smalltalk imageName. fs binary.
  t := Time millisecondsToRun: [ImageSnapshot fromStream: fs].
  fs close.
  t
  ==> 12428 "original implementation"
  ==> 645 "new version"

Overall speedup is about 20X for MultiByteFileStream and over 100x for
StandardFileStream (difference due to Levente's earlier improvements
in MultiByteFileStream).

This small change touches two packages so I'm posting it as a change set
for comment. I'll put it in trunk if there are no issues.

Dave




Speedup-file-upToEnd-dtl.1.cs (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: 20X speedup for read file upToEnd

timrowledge
Looks pretty sensible to me. As a small aside, I noticed when reading '.po' files (that Scratch uses for language translation dictionaries) that explicitly using StandardFileStream in place of MultiByteFileStream in appropriate places could speed up file reading immensely; I suspect your change will solve that worry.

> On 2021-01-22, at 3:24 PM, David T. Lewis <[hidden email]> wrote:
>
> <Speedup-file-upToEnd-dtl.1.cs>


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Who is General Failure and why is he reading my disk?



Reply | Threaded
Open this post in threaded view
|

Re: 20X speedup for read file upToEnd

David T. Lewis
On Fri, Jan 22, 2021 at 03:53:00PM -0800, tim Rowledge wrote:

>
> > On 2021-01-22, at 3:24 PM, David T. Lewis <[hidden email]> wrote:
> >
> > <Speedup-file-upToEnd-dtl.1.cs>
>
> Looks pretty sensible to me. As a small aside, I noticed when reading
> '.po' files (that Scratch uses for language translation dictionaries) that
> explicitly using StandardFileStream in place of MultiByteFileStream in
> appropriate places could speed up file reading immensely; I suspect your
> change will solve that worry.
>

I suspect that these are two separate issues, but I'm also pretty sure
there are plenty of other small places where I/O can be improved.

For the case of #upToEnd I found that MultiByteFileStream was already
five times faster than StandardFileStream due to improvements that Levente
did back in 2009. So I took Levente's logic, added the "don't read it one
byte at a time" change, and moved it up to StandardFileStream.

It would be worth profiling the '.po' file reading in Scratch and seeing
where the time is going. Maybe there is another I/O method that could
be easily improved.

Dave


Reply | Threaded
Open this post in threaded view
|

Re: 20X speedup for read file upToEnd

Levente Uzonyi
In reply to this post by David T. Lewis
Hi Dave,

Good catch. That method of StandardFileStream, unlike #upTo: and
#upToAnyOf:do:, has not been optimized.

IMO, StandardFileStream >> #upToEnd should simply be

  ^self next: self size - self position

And I suggest your implementation be added to MultiByteFileStream with the
following modifications:

upToEnd
  "Answer a subcollection from the current access position through the last element of the receiver."

  | remainingEstimate |
  self isBinary ifTrue: [ ^super upToEnd ].
  remainingEstimate := self size - self position.
  ^self collectionSpecies
  new: remainingEstimate
  streamContents: [ :stream |
  | elements chunkSize |
  chunkSize := remainingEstimate min: 2000. "It's not worth allocating larger chunks"
  [ (elements := self next: chunkSize) isEmpty ] whileFalse: [
  stream nextPutAll: elements ] ]


Levente

On Fri, 22 Jan 2021, David T. Lewis wrote:

> Attached is a small change that gives a big performance boost for reading
> a file upToEnd. My use case (where I noticed this) is reading an image
> file, where the file is opened, the header is read, and the remainder of
> the file upToEnd is read in as the object memory:
>
>  fs := FileStream readOnlyFileNamed: Smalltalk imageName. fs binary.
>  t := Time millisecondsToRun: [ImageSnapshot fromStream: fs].
>  fs close.
>  t
>  ==> 12428 "original implementation"
>  ==> 645 "new version"
>
> Overall speedup is about 20X for MultiByteFileStream and over 100x for
> StandardFileStream (difference due to Levente's earlier improvements
> in MultiByteFileStream).
>
> This small change touches two packages so I'm posting it as a change set
> for comment. I'll put it in trunk if there are no issues.
>
> Dave
>
>

Reply | Threaded
Open this post in threaded view
|

Re: 20X speedup for read file upToEnd

David T. Lewis
Hi Levente,

Thanks, that looks better. Do you want to commit the update to keep
the author initials right? If not I'll commit it and credit you in
the commit comment.

As an aside, this also makes me notice that #upToEnd does not work
for file streams that do not know their size, such as FileStream
stdin, on any file stream on an OS Pipe.

Here is the variation that I did (back in 2006) for OSProcess, which
uses AttachableFileStream subclassed from StandardFileStream:

AttachableFileStream>>upToEnd
        "Answer a subcollection from the current access position through the last element
        of the receiver. This is slower than the method in StandardFileStream, but it
        works with pipes which answer false to #atEnd when no further input is
        currently available, but the pipe is not yet closed."

        | newStream buffer nextBytes |
        buffer := buffer1 species new: 1000.
        newStream := WriteStream on: (buffer1 species new: 100).
        [self atEnd or: [(nextBytes := self nextInto: buffer) isEmpty]]
                whileFalse: [newStream nextPutAll: nextBytes].
        ^ newStream contents

The AttachableFileStream hack in OSProcess really needs to go away,
so I think I'd to find a way to make these things work with good
performance on any kind of file stream. But that is a topic for another
thread.

Thanks,
Dave

On Sat, Jan 23, 2021 at 08:58:07PM +0100, Levente Uzonyi wrote:

> Hi Dave,
>
> Good catch. That method of StandardFileStream, unlike #upTo: and
> #upToAnyOf:do:, has not been optimized.
>
> IMO, StandardFileStream >> #upToEnd should simply be
>
> ^self next: self size - self position
>
> And I suggest your implementation be added to MultiByteFileStream with the
> following modifications:
>
> upToEnd
> "Answer a subcollection from the current access position through the
> last element of the receiver."
>
> | remainingEstimate |
> self isBinary ifTrue: [ ^super upToEnd ].
> remainingEstimate := self size - self position.
> ^self collectionSpecies
> new: remainingEstimate
> streamContents: [ :stream |
> | elements chunkSize |
> chunkSize := remainingEstimate min: 2000. "It's not
> worth allocating larger chunks"
> [ (elements := self next: chunkSize) isEmpty ]
> whileFalse: [
> stream nextPutAll: elements ] ]
>
>
> Levente
>
> On Fri, 22 Jan 2021, David T. Lewis wrote:
>
> >Attached is a small change that gives a big performance boost for reading
> >a file upToEnd. My use case (where I noticed this) is reading an image
> >file, where the file is opened, the header is read, and the remainder of
> >the file upToEnd is read in as the object memory:
> >
> > fs := FileStream readOnlyFileNamed: Smalltalk imageName. fs binary.
> > t := Time millisecondsToRun: [ImageSnapshot fromStream: fs].
> > fs close.
> > t
> > ==> 12428 "original implementation"
> > ==> 645 "new version"
> >
> >Overall speedup is about 20X for MultiByteFileStream and over 100x for
> >StandardFileStream (difference due to Levente's earlier improvements
> >in MultiByteFileStream).
> >
> >This small change touches two packages so I'm posting it as a change set
> >for comment. I'll put it in trunk if there are no issues.
> >
> >Dave
> >
> >
>

Reply | Threaded
Open this post in threaded view
|

Re: 20X speedup for read file upToEnd

Levente Uzonyi
Hi Dave,

On Sat, 23 Jan 2021, David T. Lewis wrote:

> Hi Levente,
>
> Thanks, that looks better. Do you want to commit the update to keep
> the author initials right? If not I'll commit it and credit you in
> the commit comment.

I'll commit those two methods.

>
> As an aside, this also makes me notice that #upToEnd does not work
> for file streams that do not know their size, such as FileStream
> stdin, on any file stream on an OS Pipe.

It depends on what the expected behavior is in that case. If it is to read
all available data, then I think it works. If it is to read until the
stream is #atEnd, then no, it doesn't work.


Levente

>
> Here is the variation that I did (back in 2006) for OSProcess, which
> uses AttachableFileStream subclassed from StandardFileStream:
>
> AttachableFileStream>>upToEnd
> "Answer a subcollection from the current access position through the last element
> of the receiver. This is slower than the method in StandardFileStream, but it
> works with pipes which answer false to #atEnd when no further input is
> currently available, but the pipe is not yet closed."
>
> | newStream buffer nextBytes |
> buffer := buffer1 species new: 1000.
> newStream := WriteStream on: (buffer1 species new: 100).
> [self atEnd or: [(nextBytes := self nextInto: buffer) isEmpty]]
> whileFalse: [newStream nextPutAll: nextBytes].
> ^ newStream contents
>
> The AttachableFileStream hack in OSProcess really needs to go away,
> so I think I'd to find a way to make these things work with good
> performance on any kind of file stream. But that is a topic for another
> thread.
>
> Thanks,
> Dave
>
> On Sat, Jan 23, 2021 at 08:58:07PM +0100, Levente Uzonyi wrote:
>> Hi Dave,
>>
>> Good catch. That method of StandardFileStream, unlike #upTo: and
>> #upToAnyOf:do:, has not been optimized.
>>
>> IMO, StandardFileStream >> #upToEnd should simply be
>>
>> ^self next: self size - self position
>>
>> And I suggest your implementation be added to MultiByteFileStream with the
>> following modifications:
>>
>> upToEnd
>> "Answer a subcollection from the current access position through the
>> last element of the receiver."
>>
>> | remainingEstimate |
>> self isBinary ifTrue: [ ^super upToEnd ].
>> remainingEstimate := self size - self position.
>> ^self collectionSpecies
>> new: remainingEstimate
>> streamContents: [ :stream |
>> | elements chunkSize |
>> chunkSize := remainingEstimate min: 2000. "It's not
>> worth allocating larger chunks"
>> [ (elements := self next: chunkSize) isEmpty ]
>> whileFalse: [
>> stream nextPutAll: elements ] ]
>>
>>
>> Levente
>>
>> On Fri, 22 Jan 2021, David T. Lewis wrote:
>>
>> >Attached is a small change that gives a big performance boost for reading
>> >a file upToEnd. My use case (where I noticed this) is reading an image
>> >file, where the file is opened, the header is read, and the remainder of
>> >the file upToEnd is read in as the object memory:
>> >
>> > fs := FileStream readOnlyFileNamed: Smalltalk imageName. fs binary.
>> > t := Time millisecondsToRun: [ImageSnapshot fromStream: fs].
>> > fs close.
>> > t
>> > ==> 12428 "original implementation"
>> > ==> 645 "new version"
>> >
>> >Overall speedup is about 20X for MultiByteFileStream and over 100x for
>> >StandardFileStream (difference due to Levente's earlier improvements
>> >in MultiByteFileStream).
>> >
>> >This small change touches two packages so I'm posting it as a change set
>> >for comment. I'll put it in trunk if there are no issues.
>> >
>> >Dave
>> >
>> >
>>

Reply | Threaded
Open this post in threaded view
|

Re: 20X speedup for read file upToEnd

David T. Lewis
On Sun, Jan 24, 2021 at 07:39:01PM +0100, Levente Uzonyi wrote:

> Hi Dave,
>
> On Sat, 23 Jan 2021, David T. Lewis wrote:
>
> >Hi Levente,
> >
> >Thanks, that looks better. Do you want to commit the update to keep
> >the author initials right? If not I'll commit it and credit you in
> >the commit comment.
>
> I'll commit those two methods.
>

Thanks. I should add that with your final version, I now see a 35x
speedup for MultiByteFileSteam and 200x speedup for StandardFileStream.

Not bad for a small tweak :-)

Dave