Hi,
I uploaded a new version of the Multilingual package to the Inbox for
review. It speeds up MultiByteFileStream >> #nextChunk by a factor of ~3.7
(if the file has UTF-8 encoding).
The speedup doesn't come free; the code assumes a few things about the file
it's reading:
- it assumes that ! is encoded as byte 33, and that whenever byte 33 occurs
in the encoded stream that byte is an encoded ! character
- the stream doesn't convert line endings (though this can be worked around
if necessary)

Are these assumptions valid? Can we have stricter assumptions? For example,
can we say that every source file is UTF-8 encoded, just like
CompressedSourceStreams?

Here is the benchmark which shows the speedup:

(1 to: 3) collect: [ :run |
	Smalltalk garbageCollect.
	[ CompiledMethod allInstancesDo: #getSourceFromFile ] timeToRun ]

Current: #(7039 7037 7051)
New: #(1923 1903 1890)

(Note that further minor speedups are still possible, but I didn't bother
with them.)


While digging through the code of FileStream and its subclasses, I found
that it may be worth implementing MultiByteFileStream and
MultiByteBinaryOrTextStream in a different way. Instead of subclassing
existing stream classes (and adding cruft to the whole Stream hierarchy) we
could use a separate class named MultiByteStream which would encapsulate a
stream (a FileStream or an in-memory stream), the converter, line-end
conversion, etc. This would let us
- get rid of the basic* methods of the stream hierarchy (which are broken)
- remove duplicate code
- find, deprecate and remove obsolete code
- achieve better performance
We may also be able to use two-level buffering.

What do you think? Should we do this (even if it will not be 100% backwards
compatible)?


Levente
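(A minimal workspace sketch -- not the Inbox code itself -- just to make the
assumption concrete: if $! is always byte 33 and byte 33 is always $!, the
chunk terminator can be located by scanning the raw, undecoded bytes, and
the chunk body can then be decoded in one go:)

| source bytes bangIndex |
source := 'caf' , (String with: (Character value: 233)) , '!'.
bytes := source squeakToUtf8 asByteArray.	"#[99 97 102 195 169 33]"
bangIndex := bytes indexOf: 33.			"scan the undecoded bytes for the bang"
(bytes copyFrom: 1 to: bangIndex - 1) asString utf8ToSqueak
	"==> 'caf' plus the accented e, decoded only after the cheap byte scan"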
On 30 January 2010 04:09, Levente Uzonyi <[hidden email]> wrote:
> [...]
> What do you think? Should we do this (even if it will not be 100% backwards
> compatible)?

I am with you. Wrapping or delegation is what I think a MultiByteStream
should use, i.e. use an existing stream to read the data from and do its own
conversion. It may slow down things a little due to stream chaining, but it
would clean a lot of cruft out of the implementation.

--
Best regards,
Igor Stasenko AKA sig.
On Sat, 30 Jan 2010, Igor Stasenko wrote:
> I am with you. Wrapping or delegation is what I think a MultiByteStream
> should use, i.e. use an existing stream to read the data from and do its
> own conversion. It may slow down things a little due to stream chaining,
> but it would clean a lot of cruft out of the implementation.

Currently:

MultiByteFileStream >> #next
	sends TextConverter >> #nextFromStream:
	sends MultiByteFileStream >> #basicNext
	sends StandardFileStream >> #next

With encapsulation it could be:

MultiByteFileStream(MultiByteStream) >> #next
	sends TextConverter >> #nextFromStream:
	sends StandardFileStream >> #next

So I expect it to be a bit faster.


Levente
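(For illustration only -- a sketch of the wrapping idea, not code from the
Inbox version; everything except TextConverter >> #nextFromStream: is made
up here:)

Object subclass: #MultiByteStream
	instanceVariableNames: 'stream converter'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Multilingual-Sketch'

MultiByteStream class >> on: aBinaryStream converter: aTextConverter
	^ self new setStream: aBinaryStream converter: aTextConverter

MultiByteStream >> setStream: aBinaryStream converter: aTextConverter
	stream := aBinaryStream.
	converter := aTextConverter

MultiByteStream >> next
	"Hand the wrapped stream straight to the converter --
	no #basicNext detour through a subclass."
	^ converter nextFromStream: stream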
In reply to this post by Levente Uzonyi-2
On Fri, Jan 29, 2010 at 6:09 PM, Levente Uzonyi <[hidden email]> wrote:
> - it assumes that ! is encoded as byte 33, and that whenever byte 33 occurs
> in the encoded stream that byte is an encoded ! character

The "whenever byte 33 occurs in the encoded stream that byte is an
encoded ! character" part of this seems suspect to me. Are you checking the
bytes for byte 33, or are you still checking characters, and if one of the
characters is byte 33 you assume it is !? If you are just scanning bytes, I
would assume that some UTF-8 characters could have a byte 33 encoded in
them. Although I'm not a UTF-8 expert.

-Chris
In reply to this post by Igor Stasenko
2010/1/30 Igor Stasenko <[hidden email]>:
> On 30 January 2010 04:09, Levente Uzonyi <[hidden email]> wrote:
>> [...]
>> What do you think? Should we do this (even if it will not be 100% backwards
>> compatible)?
>
> I am with you. Wrapping or delegation is what I think a MultiByteStream
> should use, i.e. use an existing stream to read the data from and do its
> own conversion. It may slow down things a little due to stream chaining,
> but it would clean a lot of cruft out of the implementation.

Yes, subclassing was the worst choice wrt hacking. basicNext, bareNext
etc... should not exist.
IMO the wrapper implementation will not only be cleaner, it shall also be
faster. http://www.squeaksource.com/XTream demonstrates that a comfortable
speedup is possible:

{
[| tmp |
tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles at: 2) name)
	ascii;
	wantsLineEndConversion: false;
	converter: UTF8TextConverter new.
1 to: 10000 do: [:i | tmp upTo: Character cr].
tmp close] timeToRun.

[| tmp |
tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles at: 2) name)
		readXtream ascii buffered
		decodeWith: (UTF8TextConverter new installLineEndConvention: nil))
	buffered.
1 to: 10000 do: [:i | tmp upTo: Character cr].
tmp close] timeToRun.
}

#(332 19)

Nicolas
In reply to this post by cbc
On 29.01.2010, at 20:07, Chris Cunningham wrote:
> On Fri, Jan 29, 2010 at 6:09 PM, Levente Uzonyi <[hidden email]> wrote:
>> - it assumes that ! is encoded as byte 33, and that whenever byte 33
>> occurs in the encoded stream that byte is an encoded ! character
>
> [...] If you are just scanning bytes, I would assume that some UTF-8
> characters could have a byte 33 encoded in them.

Wrong.

> Although I'm not a UTF-8 expert.

Obviously ;) See

http://en.wikipedia.org/wiki/UTF-8#Description

- Bert -
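(A quick workspace check of the property Bert is pointing at: every byte of
a multi-byte UTF-8 sequence has its high bit set, so byte 33 in the encoded
stream can only ever be a plain ASCII $!.)

| s |
s := String with: (Character value: 233) with: $!.	"an accented e, then a bang"
s squeakToUtf8 asByteArray				"==> #[195 169 33]"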
On 30 January 2010 09:15, Bert Freudenberg <[hidden email]> wrote:
> On 29.01.2010, at 20:07, Chris Cunningham wrote:
>> [...] If you are just scanning bytes, I would assume that some UTF-8
>> characters could have a byte 33 encoded in them.
>
> Wrong.
>
>> Although I'm not a UTF-8 expert.
>
> Obviously ;) See
>
> http://en.wikipedia.org/wiki/UTF-8#Description

Either way, the presence of the ! character should be tested after
decoding the utf8 data.

> - Bert -

--
Best regards,
Igor Stasenko AKA sig.
On Sat, 30 Jan 2010, Igor Stasenko wrote:
> On 30 January 2010 09:15, Bert Freudenberg <[hidden email]> wrote:
>> [...]
>> http://en.wikipedia.org/wiki/UTF-8#Description
>
> Either way, the presence of the ! character should be tested after
> decoding the utf8 data.

Why? UTF-8 is ASCII compatible.


Levente
2010/1/31 Levente Uzonyi <[hidden email]>:
> On Sat, 30 Jan 2010, Igor Stasenko wrote:
>> Either way, the presence of the ! character should be tested after
>> decoding the utf8 data.
>
> Why? UTF-8 is ASCII compatible.

Well, utf8 is an octet stream (bytes), not characters, while we are seeking
for the '!' character, not the byte.
Logically, the data flow should be the following:

<primitive> -> ByteArray -> utf8 reader -> character stream -> '!'

Sure, due to the nature of the utf8 encoding you could shortcut, but then,
because of such hacks, you won't be able to switch to a different encoding
without pain:

<primitive> -> ByteArray -> <XYZ> reader -> character stream -> '!'

--
Best regards,
Igor Stasenko AKA sig.
On 2010-01-31, at 10:54 AM, Igor Stasenko wrote:

>> Why? UTF-8 is ASCII compatible.
>
> Well, utf8 is an octet stream (bytes), not characters, while we are
> seeking for the '!' character, not the byte.
> Logically, the data flow should be the following:
>
> <primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
>
> Sure, due to the nature of the utf8 encoding you could shortcut, but
> then, because of such hacks, you won't be able to switch to a different
> encoding without pain:
>
> <primitive> -> ByteArray -> <XYZ> reader -> character stream -> '!'

+1

Bytes and characters are not the same thing.

Colin
In reply to this post by Igor Stasenko
On Sun, 31 Jan 2010, Igor Stasenko wrote:
> 2010/1/31 Levente Uzonyi <[hidden email]>:
>> Why? UTF-8 is ASCII compatible.
>
> Well, utf8 is an octet stream (bytes), not characters, while we are
> seeking for the '!' character, not the byte.
> Logically, the data flow should be the following:
>
> <primitive> -> ByteArray -> utf8 reader -> character stream -> '!'

This is far from reality, because
- #nextChunk doesn't work in binary mode:
	'This is a chunk!!!' readStream nextChunk "===> 'This is a chunk!'"
	'This is a chunk!!!' asByteArray readStream nextChunk "===> MNU"
- text converters don't do any conversion if the stream is binary

> Sure, due to the nature of the utf8 encoding you could shortcut, but
> then, because of such hacks, you won't be able to switch to a different
> encoding without pain:
>
> <primitive> -> ByteArray -> <XYZ> reader -> character stream -> '!'

That's what my original questions were about (and they are still
unanswered):
- is it safe to assume that the encoding of source files will be compatible
with this "hack"?
- is it safe to assume that the source files are always UTF-8 encoded?


Levente
Levente Uzonyi wrote:
> On Sun, 31 Jan 2010, Igor Stasenko wrote:
>> Well, utf8 is an octet stream (bytes), not characters, while we are
>> seeking for the '!' character, not the byte.
>> Logically, the data flow should be the following:
>> <primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
>
> This is far from reality, because
> - #nextChunk doesn't work in binary mode:
> 	'This is a chunk!!!' readStream nextChunk "===> 'This is a chunk!'"
> 	'This is a chunk!!!' asByteArray readStream nextChunk "===> MNU"
> - text converters don't do any conversion if the stream is binary

Right, although I think Igor's point is slightly different. You could
implement #upTo: for example by applying the encoding to the argument and
then do #upToAllEncoded: which takes an encoded character sequence as the
argument. This would preserve the generality of #upTo: with the potential
for more general speedup. I.e.,

upTo: aCharacter
  => upToEncoded: bytes
    => primitive read
    <= return encodedBytes
  <= converter decode: encodedBytes
<= returns characters

(one assumption here is that the converter doesn't "embed" a particular
character sequence as a part of another one, which is true for UTF-8 but
I'm not sure about other encodings).

> That's what my original questions were about (and they are still
> unanswered):
> - is it safe to assume that the encoding of source files will be
> compatible with this "hack"?
> - is it safe to assume that the source files are always UTF-8 encoded?

I think UTF-8 is going to be the only standard going forward. Precisely
because it has such (often overlooked) extremely useful properties. So yes,
I think it'd be safe to assume that this will work going forward.

Cheers,
  - Andreas
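(A rough sketch of that outline, assuming the MultiByteStream wrapper
sketched earlier in the thread; #encode:, #decode: and the byte-level
#upToAll: are stand-ins for whatever the converter and the underlying
stream actually offer, not existing API:)

MultiByteStream >> upTo: aCharacter
	"Encode the delimiter once, let the wrapped binary stream scan for
	the encoded bytes, then decode the collected bytes in one go."
	| delimiterBytes encodedBytes |
	delimiterBytes := converter encode: (String with: aCharacter).
	encodedBytes := stream upToAll: delimiterBytes.
	^ converter decode: encodedBytes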
I don't at all like having a String be a blob of bits subject to
encoding interpretation.
String is a collection of characters, and there should be a canonical
encoding known to the VM.
utf8ToSqueak, squeakToUtf8 etc... are quick and dirty hacks.

We should use ByteArray, or better, introduce a UTF8String if it becomes
that important.
Code will be much much much cleaner and foolproof.

Nicolas

2010/2/2 Andreas Raab <[hidden email]>:
> [...]
> I think UTF-8 is going to be the only standard going forward. Precisely
> because it has such (often overlooked) extremely useful properties. So
> yes, I think it'd be safe to assume that this will work going forward.
>
> Cheers,
>   - Andreas
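(Purely as a strawman -- nothing below exists -- the smallest version of
such a class might just hold on to its bytes and decode lazily:)

Object subclass: #UTF8String
	instanceVariableNames: 'bytes'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Multilingual-Sketch'

UTF8String class >> fromByteArray: aByteArray
	^ self new setBytes: aByteArray

UTF8String >> setBytes: aByteArray
	bytes := aByteArray

UTF8String >> asString
	"Decode to an ordinary Squeak String only when characters are really
	needed; until then the data stays as undecoded UTF-8 bytes."
	^ bytes asString utf8ToSqueak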
On 03.02.2010, at 00:13, Nicolas Cellier wrote:
> I don't at all like having a String be a blob of bits subject to
> encoding interpretation.
> String is a collection of characters, and there should be a canonical
> encoding known to the VM.
> utf8ToSqueak, squeakToUtf8 etc... are quick and dirty hacks.
>
> We should use ByteArray, or better, introduce a UTF8String if it becomes
> that important.
> Code will be much much much cleaner and foolproof.

That's what Scratch did ...

- Bert -
On Wednesday 03 February 2010 01:50:05 pm Bert Freudenberg wrote:
> On 03.02.2010, at 00:13, Nicolas Cellier wrote:
>> We should use ByteArray, or better, introduce a UTF8String if it becomes
>> that important.
>> Code will be much much much cleaner and foolproof.
>
> That's what Scratch did ...

A much saner choice, I should say, after reading both Squeak and Scratch
sources. Bytes are strictly for memory objects and have no place in higher
level code. Higher level code should only deal with encoded values -
integer, character (ASCII), string (ASCII), utf8string (UTF-8) etc.

Nicolas, why the "if it becomes that important" qualifier for UTF-8? Wake
up :-).

Subbu
In reply to this post by Andreas.Raab
On Mon, 1 Feb 2010, Andreas Raab wrote:
> Right, although I think Igor's point is slightly different. You could
> implement #upTo: for example by applying the encoding to the argument and
> then do #upToAllEncoded: which takes an encoded character sequence as the
> argument. This would preserve the generality of #upTo: with the potential
> for more general speedup. [...]
>
> (one assumption here is that the converter doesn't "embed" a particular
> character sequence as a part of another one, which is true for UTF-8 but
> I'm not sure about other encodings).

Another way to do this is to let the converter read the next chunk.
"TextConverter >> #nextChunkFrom: stream" could use the current
implementation of MultiByteFileStream >> #nextChunk, while
UTF8TextConverter could use #upTo: (this would also let us avoid the
#basicUpTo: hack). So we could use any encoding, while speeding up the
UTF-8 case.

Maybe we could also move the encoding/decoding related methods/tables from
String and its subclasses to the (class side of the) TextConverters.


Levente
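(Sketched out -- #nextChunkFrom: is the proposed selector, nothing below
exists yet -- the double dispatch could look roughly like this:)

MultiByteFileStream >> nextChunk
	"Let the converter pick the chunk-reading strategy for its encoding;
	the default in TextConverter would keep today's behaviour."
	^ self converter nextChunkFrom: self

UTF8TextConverter >> nextChunkFrom: aStream
	"Since UTF-8 never hides byte 33 inside a multi-byte sequence, the
	terminating bang can be found with a plain #upTo: scan; a doubled
	bang escapes a literal ! inside the chunk."
	| chunk |
	chunk := WriteStream on: String new.
	[ chunk nextPutAll: (aStream upTo: $!).
	  aStream peek = $! ] whileTrue: [ chunk nextPut: aStream next ].
	^ chunk contents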