The idea of stacking things like character coding and buffering &c by using layers of wrapping is very much in the spirit of Java, and is neither good for performance nor helpful for correctness.

On 20 March 2018 at 05:19, Guillermo Polito <[hidden email]> wrote:
Well, I'd say we did it in the name of modularity. And yes, I believe that having separate responsibilities helps in designing, testing and ensuring the correctness of each of the parts in isolation more easily.

I've also done some profiling and it does not look like we've lost performance either (reading and decoding a 35MB file):

[ file := Smalltalk sourcesFile fullName.
  (File named: file) readStreamDo: [ :binaryFile |
    (ZnCharacterReadStream
      on: (ZnBufferedReadStream on: binaryFile)
      encoding: 'utf8') next: binaryFile size ] ] timeToRun. "0:00:00:01.976"

[ file := Smalltalk sourcesFile fullName.
  (MultiByteFileStream fileNamed: file)
    converter: (TextConverter newForEncoding: 'utf8');
    upToEnd ] timeToRun. "0:00:00:02.147"

On Mon, Mar 19, 2018 at 5:51 PM, Richard O'Keefe <[hidden email]> wrote:
Thanks Guille for this great post! I will turn it into a doc :)

On Mon, Mar 19, 2018 at 6:10 PM, Guillermo Polito <[hidden email]> wrote:
In reply to this post by Guillermo Polito
I love the idea of backing the statement with measurements :)
On Mon, Mar 19, 2018 at 6:10 PM, Guillermo Polito <[hidden email]> wrote:
In reply to this post by Guillermo Polito
Guillermo Polito wrote
> I've also done some profiling and it does not look like we've lost
> performance either (reading and decoding a 35MB file):

Your 1st (new API) example: "0:00:00:01.535"
Your 2nd (old API) example: "0:00:00:01.732"
*** but ***
Your 1st (new API) example using the same consuming selector as example #2 (#upToEnd): "0:00:00:07.816", or 4.5x slower.

Cheers,
Sean
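For reference, a sketch of what the slow measurement presumably ran (an assumption, since the exact code was not posted): Guille's first example with #next: swapped for #upToEnd.

[ file := Smalltalk sourcesFile fullName.
  (File named: file) readStreamDo: [ :binaryFile |
    (ZnCharacterReadStream
      on: (ZnBufferedReadStream on: binaryFile)
      encoding: 'utf8') upToEnd ] ] timeToRun.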
> On 20 Mar 2018, at 00:19, Sean P. DeNigris <[hidden email]> wrote:
>
> Your 1st (new API) example: "0:00:00:01.535"
> Your 2nd (old API) example: "0:00:00:01.732"
> *** but ***
> Your 1st (new API) example using the same consuming selector as example #2
> (#upToEnd): "0:00:00:07.816", or 4.5x slower.

Yes and no ;-)

Look at MultiByteFileStream>>#upToEnd, where they cheat a bit and do basically the same as example 1 using #next:.

The Zn #upToEnd implementation does not use size knowledge (as streams can be infinite).

As a side note: using #size and #position[:] on variable-length encoded streams like UTF-8 is *very* dangerous and should be avoided. Counting/indexing bytes in the underlying stream is *not* the same as counting/indexing decoded characters. It can be the source of subtle bugs.
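A minimal sketch of the byte/character mismatch being warned about here (the string literal is arbitrary):

'éé' size.              "2: the number of decoded characters"
'éé' utf8Encoded size.  "4: the number of bytes in the underlying binary stream"

A #position taken from the byte stream therefore cannot be used as a character index on the decoded stream.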
On Tue, Mar 20, 2018 at 8:57 AM, Sven Van Caekenberghe <[hidden email]> wrote:
Sean,

I think the buffered stream can be reused in the context of sockets too. And there, #size has no meaning.

Basically, the cost here is in growing and copying the internal buffer used to read. This buffer for the 35MB file ends up being ~35MB, so it has to grow and be copied several times... Using #next: will preallocate the right size for the internal buffer.

Now, I'd say first that if people want performance they should use the right API :).

Maybe the ZnBuffered stream could ask its underlying stream for its #preferredInternalBufferSizeForUpToAll. But that would contaminate the whole stream hierarchy... Is it worth it? Dunno...
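A sketch of the preallocation point, assuming a placeholder file name: when the total size is known up front, #next: sizes the result collection once instead of growing it repeatedly.

(File named: 'data.bin') readStreamDo: [ :binary | | buffered |
	buffered := ZnBufferedReadStream on: binary.
	"one allocation of the right size, no repeated growing"
	buffered next: binary size ].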
We can cover this case with an extra variable, maxBufferSize. By default it would be equal to the size given in #sizeBuffer:, and special users will specify a concrete value. FileHandler would use the file size for this. It would also improve the buffer size for small files, where we do not need a big buffer.

2018-03-20 9:57 GMT+01:00 Guillermo Polito <[hidden email]>:
2018-03-20 10:15 GMT+01:00 Denis Kudriashov <[hidden email]>:
> We can cover this case with an extra variable, maxBufferSize.

Operations like #upToEnd would then allocate at most the maximum buffer. #next:-related methods can also be improved by it. We will need an extra method:

ZnBufferedReadStream>>nextBuffer: requiredSize
	newBufferSize := requiredSize min: maxBufferSize max: minBufferSize.
	buffer size = newBufferSize ifFalse: [
		buffer := self collectionSpecies new: newBufferSize ].
	limit := stream readInto: buffer startingAt: 1 count: buffer size.
	position := 1

But it will require another variable, minBufferSize.
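For context, the buffer size can already be tuned by hand via #sizeBuffer: (a minimal sketch; aReadStream stands for any underlying binary stream):

| buffered |
buffered := ZnBufferedReadStream on: aReadStream.
buffered sizeBuffer: 64 * 1024. "use a 64 KB internal buffer"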
I would very much prefer to take things slowly.
First we have to make the whole migration work; we are not there yet, and there are lots of important things to fix. If there are serious performance regressions, then yes, we have to fix them. But improving things or adding features should be done when everything is stable again.
This is great!
And, +1 on the need for some more convenience constructors...

Word of warning (which would be nice to mention in the guide): using writeStreamDo: and manually constructed buffered writers will not #flush the buffered stream automatically on close, so it's crucial this is done manually.

charGen := Random seed: 16543986234.
output := String new: 1024*1024*50.
1 to: output size do: [ :ix |
	output at: ix put: ((charGen nextInt: 256) - 1) asCharacter ].

[ (File named: 'utf16test.txt') writeStreamDo: [ :ws | | encStream |
	encStream := ZnCharacterWriteStream
		on: (ZnBufferedWriteStream on: ws)
		encoding: 'utf16'.
	encStream nextPutAll: output ] ] timeToRun.

FileSize: 102 336 KB.

[ (File named: 'utf16test2.txt') writeStreamDo: [ :ws | | encStream |
	encStream := ZnCharacterWriteStream
		on: (ZnBufferedWriteStream on: ws)
		encoding: 'utf16'.
	encStream nextPutAll: output; flush ] ] timeToRun.

FileSize: 102 400 KB.

It's about 4 times slower than the corresponding

[ FileStream forceNewFileNamed: 'utf16test2.txt' do: [ :ws |
	ws converter: UTF16TextConverter new.
	ws nextPutAll: output ] ] timeToRun

but it's 100MB of UTF16, and 4 secs vs 1, so probably not relevant to real use cases.

Another addition to the guide, valuable to those of us still stuck in the stone age, would be code equivalents to the FileStream methods:
forceNewFileNamed:do:
newFileNamed:do:
oldFileNamed:do:

Cheers,
Henry
> On 20 Mar 2018, at 12:56, Henrik Sperre Johansen <[hidden email]> wrote:
>
> Word of warning (which would be nice to mention in the guide): using
> writeStreamDo: and manually constructed buffered writers will not #flush the
> buffered stream automatically on close, so it's crucial this is done
> manually.

well… in fact it should flush it, shouldn't it?

> It's about 4 times slower than the corresponding
> [ FileStream forceNewFileNamed: 'utf16test2.txt' do: ... ] timeToRun
> but it's 100MB of UTF16, and 4 secs vs 1, so probably not relevant to real
> use cases.

It is always relevant. I wonder why it would be 4x slower than the 2nd one.

Esteban
EstebanLM wrote
> well… in fact it should flush it, shouldn't it?

The binary filestream flushes to disk on each write operation iirc (or, at least, when it is being closed), but it has no way to know it is wrapped by a buffered stream which also needs to be flushed to actually write the remaining buffer contents to the file.

Cheers,
Henry
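A minimal sketch of the safe idiom, with a placeholder file name: flush the wrapper before the underlying stream is closed, even if the writing block fails.

(File named: 'out.txt') writeStreamDo: [ :ws | | buffered |
	buffered := ZnBufferedWriteStream on: ws.
	[ buffered nextPutAll: 'some output' ]
		ensure: [ buffered flush ] ].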
2018-03-20 13:20 GMT+01:00 Henrik Sperre Johansen <[hidden email]>:
> The binary filestream flushes to disk on each write operation iirc (or, at
> least, when it is being closed), but it has no way to know it is wrapped by a
> buffered stream which also needs to be flushed to actually write the
> remaining buffer contents to the file.

Could a binary filestream that has flush requirements (and close requirements as well) be made aware of the presence of the buffered stream above it, to ensure a close/flush is done correctly?

Could structures where you have multiple buffered streams concurrently opened on a single binary file stream be considered as not exactly correct?

Thierry
In reply to this post by Richard O'Keefe
I like the structure of streams wrapping streams from the point of view of modularity of implementation and flexibility of reuse (as in applying buffering and encoding to in-memory or network streams).
But I don't like exposing all the machinery to the user. For example, instead of

bufferedReadStream := ZnBufferedReadStream on: aReadStream.
bufferedWriteStream := ZnBufferedWriteStream on: aWriteStream.

why not

bufferedReadStream := ZnBuffer on: aReadStream.
bufferedWriteStream := ZnBuffer on: aWriteStream.

where the class that implements the buffering looks at the stream and decides whether to do read or write buffering (or both, for a read-write stream)?

The meaning of #position: on an encoded Unicode stream is perfectly well-defined. It's just not the same as the meaning of position on the underlying byteStream — one counts in codePoints, the other in bytes. The old file streams got this wrong; I'm hoping that the new ones get it right. However, the comment that

"As a side note: using #size and #position[:] on variable-length encoded streams like UTF-8 is *very* dangerous and should be avoided."

makes me worry that they do not. Please, either implement #size and #position[:] correctly on the encoded stream, or don't implement them at all! Do *not* implement them incorrectly, and then tell me to avoid them because they are dangerous!

Andrew
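A hypothetical sketch of the proposed facade (ZnBuffer does not exist in Pharo, and the #isReadable test is an assumption about the wrapped stream's protocol):

ZnBuffer class >> on: aStream
	"Purely illustrative: pick the wrapper by inspecting the stream."
	^ aStream isReadable
		ifTrue: [ ZnBufferedReadStream on: aStream ]
		ifFalse: [ ZnBufferedWriteStream on: aStream ]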
In reply to this post by Henrik Sperre Johansen
> On 20 Mar 2018, at 12:56, Henrik Sperre Johansen <[hidden email]> wrote:
>
> Word of warning (which would be nice to mention in the guide): using
> writeStreamDo: and manually constructed buffered writers will not #flush the
> buffered stream automatically on close, so it's crucial this is done
> manually.

Yes, you are right, but it is clearly documented in the class comment of ZnBufferedWriteStream. There is also ZnBufferedWriteStream class>>#on:do: to help.

Now, these are system classes more than user classes. Normal users should get a ready-made stream without knowing much about the construction process.

> It's about 4 times slower than the corresponding
>
> [ FileStream forceNewFileNamed: 'utf16test2.txt' do: [ :ws |
> 	ws converter: UTF16TextConverter new.
> 	ws nextPutAll: output ] ] timeToRun
>
> but it's 100MB of UTF16, and 4 secs vs 1, so probably not relevant to real
> use cases.

Ah, but you are cheating (deliberately or not ;-). Your test string is a ByteString, not a WideString, and UTF16TextConverter contains a fast path that basically does nothing. Try another string, like

1 to: output size do: [ :ix |
	output at: ix put: ((charGen nextInt: 1024) - 1) asCharacter ].

Zn is the same speed; the old one does not even stop for me.

I am not sure we have to optimise UTF16; maybe... adding the same fast path would not be too difficult, I guess.
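A minimal sketch of the #on:do: helper mentioned above, with a placeholder file name; it flushes the buffer for you when the block completes:

(File named: 'out.txt') writeStreamDo: [ :ws |
	ZnBufferedWriteStream on: ws do: [ :buffered |
		buffered nextPutAll: 'no manual flush needed' ] ].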
In reply to this post by AndrewBlack
> On 20 Mar 2018, at 14:51, Andrew P. Black <[hidden email]> wrote:
>
> why not
>
> bufferedReadStream := ZnBuffer on: aReadStream.
> bufferedWriteStream := ZnBuffer on: aWriteStream.
>
> where the class that implements the buffering looks at the stream and decides
> whether to do read or write buffering (or both, for a read-write stream)?

Hmm, then I would prefer real object messages, like #buffered, not class facades.

> The meaning of #position: on an encoded Unicode stream is perfectly
> well-defined. It's just not the same as the meaning of position on the
> underlying byteStream — one counts in codePoints, the other in bytes.

Yes, agreed.

> Please, either implement #size and #position[:] correctly on the encoded
> stream, or don't implement them at all!

Well, I will repeat my opinion: #size, #position, #position: and #reset have no place in streams.

I have an experimental implementation of a ZnPositionableReadStream that adds 'excursion behaviour' as an add-on on top of more primitive streams that do not support it. It allows for code like

stream savingPositionDo: [ "read ahead" ].
"continue reading as if nothing happened"

Here is the class comment:

I am ZnPositionableReadStream. I am polymorphic with (the most used/important methods of) ReadStream and PositionableStream.

I wrap another read stream and store the elements that I read in a sliding circular buffer, so that I am able to go back to any position inside that buffer. Essentially, I implement #position and #position: to be used to back out of reading ahead.

Note that the size of my buffer limits how far I can go backwards. A SubscriptOutOfBounds exception will be signalled when an attempt is made to go too far backwards.

The index returned by #position should be considered abstract, without concrete meaning, but it is currently implemented as the count of elements read by #next on the wrapped stream. On a simple stream over an in-memory collection, that will be equivalent to an integer index into that collection. But on network streams, or streams that were already further along, this will no longer be the case.

The most elementary example of my capabilities can be seen in my implementation of #peek. See also the unit tests #testPlainExcursion and #testSearch. Of course, backing out of an excursion is only possible within the window of the buffer size.
Implementation:

- stream <ReadStream>: the read stream that I wrap and add positioning to
- buffer <String|ByteArray>: sliding, circular buffer
- index <PositiveInteger>: zero-based index into the buffer, where the next element will be stored
- count <PositiveInteger>: number of next operations done on the wrapped stream
- delta <PositiveInteger>: number of positions that I was moved backwards

The real core methods are #next, #atEnd, #position and #position:; they are used to implement the rest.

Part of Zinc HTTP Components
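An illustration of the excursion idiom, following the message names in the post above (the API is experimental, so treat this as a sketch):

| stream |
stream := ZnPositionableReadStream on: 'Pharo Streams' readStream.
stream savingPositionDo: [ stream next: 5 ]. "read ahead: 'Pharo'"
stream next: 5. "'Pharo' again: the excursion was backed out"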
Hi all,

Don't forget also that these are the low-level bricks used in Pharo. They are utterly required by the bootstrap and the minimal image, thus keeping dependencies small and controlled is a big +.

Magic and automatic stuff can be added on top. Nothing prevents adding a little layer on top as a separate library (I even encourage people to do so!): a builder, or a DSL to ease the construction of streams in the most performant way possible. Pharo has extension methods, and even this could be done with some extra objects managing the creation of the stream, without being too complex.

Guille

On Tue, Mar 20, 2018 at 3:12 PM, Sven Van Caekenberghe <[hidden email]> wrote:
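A hypothetical convenience extension of the kind Guille suggests; the selector #utf8ReadStreamDo: is invented for illustration and is not part of Pharo:

File >> utf8ReadStreamDo: aBlock
	"Open me, wrap the binary stream in buffering and UTF-8 decoding,
	and evaluate aBlock with the resulting character stream."
	^ self readStreamDo: [ :binary |
		aBlock value: (ZnCharacterReadStream
			on: (ZnBufferedReadStream on: binary)
			encoding: 'utf8') ]

With it, the earlier benchmarks would read simply: (File named: file) utf8ReadStreamDo: [ :in | in upToEnd ].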