Attached is a small change that gives a big performance boost for reading
a file upToEnd. My use case (where I noticed this) is reading an image file, where the file is opened, the header is read, and the remainder of the file upToEnd is read in as the object memory: fs := FileStream readOnlyFileNamed: Smalltalk imageName. fs binary. t := Time millisecondsToRun: [ImageSnapshot fromStream: fs]. fs close. t ==> 12428 "original implementation" ==> 645 "new version" Overall speedup is about 20X for MultiByteFileStream and over 100x for StandardFileStream (difference due to Levente's earlier improvements in MultiByteFileStream). This small change touches two packages so I'm posting it as a change set for comment. I'll put it in trunk if there are no issues. Dave Speedup-file-upToEnd-dtl.1.cs (1K) Download Attachment |
Looks pretty sensible to me. As a small aside, I noticed when reading '.po' files (that Scratch uses for language translation dictionaries) that explicitly using StandardFileStream in place of MultiByteFileStream in appropriate places could speed up file reading immensely; I suspect your change will solve that worry.
> On 2021-01-22, at 3:24 PM, David T. Lewis <[hidden email]> wrote: > > <Speedup-file-upToEnd-dtl.1.cs> tim -- tim Rowledge; [hidden email]; http://www.rowledge.org/tim Who is General Failure and why is he reading my disk? |
On Fri, Jan 22, 2021 at 03:53:00PM -0800, tim Rowledge wrote:
> > > On 2021-01-22, at 3:24 PM, David T. Lewis <[hidden email]> wrote: > > > > <Speedup-file-upToEnd-dtl.1.cs> > > Looks pretty sensible to me. As a small aside, I noticed when reading > '.po' files (that Scratch uses for language translation dictionaries) that > explicitly using StandardFileStream in place of MultiByteFileStream in > appropriate places could speed up file reading immensely; I suspect your > change will solve that worry. > I suspect that these are two separate issues, but I'm also pretty sure there are plenty of other small places where I/O can be improved. For the case of #upToEnd I found that MultiByteFileStream was already five times faster than StandardFileStream due to improvements that Levente did back in 2009. So I took Levente's logic, added the "don't read it one byte at a time" change, and moved it up to StandardFileStream. It would be worth profiling the '.po' file reading in Scratch and seeing where the time is going. Maybe there is another I/O method that could be easily improved. Dave |
In reply to this post by David T. Lewis
Hi Dave,
Good catch. That method of StandardFileStream, unlike #upTo: and #upToAnyOf:do:, has not been optimized. IMO, StandardFileStream >> #upToEnd should simply be ^self next: self size - self position And I suggest your implementation be added to MultiByteFileStream with the following modifications: upToEnd "Answer a subcollection from the current access position through the last element of the receiver." | remainingEstimate | self isBinary ifTrue: [ ^super upToEnd ]. remainingEstimate := self size - self position. ^self collectionSpecies new: remainingEstimate streamContents: [ :stream | | elements chunkSize | chunkSize := remainingEstimate min: 2000. "It's not worth allocating larger chunks" [ (elements := self next: chunkSize) isEmpty ] whileFalse: [ stream nextPutAll: elements ] ] Levente On Fri, 22 Jan 2021, David T. Lewis wrote: > Attached is a small change that gives a big performance boost for reading > a file upToEnd. My use case (where I noticed this) is reading an image > file, where the file is opened, the header is read, and the remainder of > the file upToEnd is read in as the object memory: > > fs := FileStream readOnlyFileNamed: Smalltalk imageName. fs binary. > t := Time millisecondsToRun: [ImageSnapshot fromStream: fs]. > fs close. > t > ==> 12428 "original implementation" > ==> 645 "new version" > > Overall speedup is about 20X for MultiByteFileStream and over 100x for > StandardFileStream (difference due to Levente's earlier improvements > in MultiByteFileStream). > > This small change touches two packages so I'm posting it as a change set > for comment. I'll put it in trunk if there are no issues. > > Dave > > |
Hi Levente,
Thanks, that looks better. Do you want to commit the update to keep the author initials right? If not I'll commit it and credit you in the commit comment. As an aside, this also makes me notice that #upToEnd does not work for file streams that do not know their size, such as FileStream stdin, on any file stream on an OS Pipe. Here is the variation that I did (back in 2006) for OSProcess, which uses AttachableFileStream subclassed from StandardFileStream: AttachableFileStream>>upToEnd "Answer a subcollection from the current access position through the last element of the receiver. This is slower than the method in StandardFileStream, but it works with pipes which answer false to #atEnd when no further input is currently available, but the pipe is not yet closed." | newStream buffer nextBytes | buffer := buffer1 species new: 1000. newStream := WriteStream on: (buffer1 species new: 100). [self atEnd or: [(nextBytes := self nextInto: buffer) isEmpty]] whileFalse: [newStream nextPutAll: nextBytes]. ^ newStream contents The AttachableFileStream hack in OSProcess really needs to go away, so I think I'd to find a way to make these things work with good performance on any kind of file stream. But that is a topic for another thread. Thanks, Dave On Sat, Jan 23, 2021 at 08:58:07PM +0100, Levente Uzonyi wrote: > Hi Dave, > > Good catch. That method of StandardFileStream, unlike #upTo: and > #upToAnyOf:do:, has not been optimized. > > IMO, StandardFileStream >> #upToEnd should simply be > > ^self next: self size - self position > > And I suggest your implementation be added to MultiByteFileStream with the > following modifications: > > upToEnd > "Answer a subcollection from the current access position through the > last element of the receiver." > > | remainingEstimate | > self isBinary ifTrue: [ ^super upToEnd ]. > remainingEstimate := self size - self position. > ^self collectionSpecies > new: remainingEstimate > streamContents: [ :stream | > | elements chunkSize | > chunkSize := remainingEstimate min: 2000. "It's not > worth allocating larger chunks" > [ (elements := self next: chunkSize) isEmpty ] > whileFalse: [ > stream nextPutAll: elements ] ] > > > Levente > > On Fri, 22 Jan 2021, David T. Lewis wrote: > > >Attached is a small change that gives a big performance boost for reading > >a file upToEnd. My use case (where I noticed this) is reading an image > >file, where the file is opened, the header is read, and the remainder of > >the file upToEnd is read in as the object memory: > > > > fs := FileStream readOnlyFileNamed: Smalltalk imageName. fs binary. > > t := Time millisecondsToRun: [ImageSnapshot fromStream: fs]. > > fs close. > > t > > ==> 12428 "original implementation" > > ==> 645 "new version" > > > >Overall speedup is about 20X for MultiByteFileStream and over 100x for > >StandardFileStream (difference due to Levente's earlier improvements > >in MultiByteFileStream). > > > >This small change touches two packages so I'm posting it as a change set > >for comment. I'll put it in trunk if there are no issues. > > > >Dave > > > > > |
Hi Dave,
On Sat, 23 Jan 2021, David T. Lewis wrote: > Hi Levente, > > Thanks, that looks better. Do you want to commit the update to keep > the author initials right? If not I'll commit it and credit you in > the commit comment. I'll commit those two methods. > > As an aside, this also makes me notice that #upToEnd does not work > for file streams that do not know their size, such as FileStream > stdin, on any file stream on an OS Pipe. It depends on what the expected behavior is in that case. If it is to read all available data, then I think it works. If it is to read until the stream is #atEnd, then no, it doesn't work. Levente > > Here is the variation that I did (back in 2006) for OSProcess, which > uses AttachableFileStream subclassed from StandardFileStream: > > AttachableFileStream>>upToEnd > "Answer a subcollection from the current access position through the last element > of the receiver. This is slower than the method in StandardFileStream, but it > works with pipes which answer false to #atEnd when no further input is > currently available, but the pipe is not yet closed." > > | newStream buffer nextBytes | > buffer := buffer1 species new: 1000. > newStream := WriteStream on: (buffer1 species new: 100). > [self atEnd or: [(nextBytes := self nextInto: buffer) isEmpty]] > whileFalse: [newStream nextPutAll: nextBytes]. > ^ newStream contents > > The AttachableFileStream hack in OSProcess really needs to go away, > so I think I'd to find a way to make these things work with good > performance on any kind of file stream. But that is a topic for another > thread. > > Thanks, > Dave > > On Sat, Jan 23, 2021 at 08:58:07PM +0100, Levente Uzonyi wrote: >> Hi Dave, >> >> Good catch. That method of StandardFileStream, unlike #upTo: and >> #upToAnyOf:do:, has not been optimized. >> >> IMO, StandardFileStream >> #upToEnd should simply be >> >> ^self next: self size - self position >> >> And I suggest your implementation be added to MultiByteFileStream with the >> following modifications: >> >> upToEnd >> "Answer a subcollection from the current access position through the >> last element of the receiver." >> >> | remainingEstimate | >> self isBinary ifTrue: [ ^super upToEnd ]. >> remainingEstimate := self size - self position. >> ^self collectionSpecies >> new: remainingEstimate >> streamContents: [ :stream | >> | elements chunkSize | >> chunkSize := remainingEstimate min: 2000. "It's not >> worth allocating larger chunks" >> [ (elements := self next: chunkSize) isEmpty ] >> whileFalse: [ >> stream nextPutAll: elements ] ] >> >> >> Levente >> >> On Fri, 22 Jan 2021, David T. Lewis wrote: >> >> >Attached is a small change that gives a big performance boost for reading >> >a file upToEnd. My use case (where I noticed this) is reading an image >> >file, where the file is opened, the header is read, and the remainder of >> >the file upToEnd is read in as the object memory: >> > >> > fs := FileStream readOnlyFileNamed: Smalltalk imageName. fs binary. >> > t := Time millisecondsToRun: [ImageSnapshot fromStream: fs]. >> > fs close. >> > t >> > ==> 12428 "original implementation" >> > ==> 645 "new version" >> > >> >Overall speedup is about 20X for MultiByteFileStream and over 100x for >> >StandardFileStream (difference due to Levente's earlier improvements >> >in MultiByteFileStream). >> > >> >This small change touches two packages so I'm posting it as a change set >> >for comment. I'll put it in trunk if there are no issues. >> > >> >Dave >> > >> > >> |
On Sun, Jan 24, 2021 at 07:39:01PM +0100, Levente Uzonyi wrote:
> Hi Dave, > > On Sat, 23 Jan 2021, David T. Lewis wrote: > > >Hi Levente, > > > >Thanks, that looks better. Do you want to commit the update to keep > >the author initials right? If not I'll commit it and credit you in > >the commit comment. > > I'll commit those two methods. > Thanks. I should add that with your final version, I now see a 35x speedup for MultiByteFileSteam and 200x speedup for StandardFileStream. Not bad for a small tweak :-) Dave |
Free forum by Nabble | Edit this page |