#nextChunk speedup, the future of multibyte streams

#nextChunk speedup, the future of multibyte streams

Levente Uzonyi-2
Hi,

I uploaded a new version of the Multilingual package to the Inbox for
reviewing. It speeds up MultiByteFileStream >> #nextChunk by a factor of
~3.7 (if the file has UTF-8 encoding).
The speedup doesn't come free; the code assumes a few things about the
file it's reading:
- it assumes that ! (the chunk terminator; see the example below) is
   encoded as byte 33, and that whenever byte 33 occurs in the encoded
   stream, that byte is an encoded ! character
- the stream doesn't convert line endings (though this can be worked
   around if necessary)
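
For context, a chunk is text terminated by a single !, with literal !
characters doubled, so byte 33 is exactly what #nextChunk scans for:

   '3 + 4!' readStream nextChunk.            "===> '3 + 4'"
   'He said !!hi!!.!' readStream nextChunk.  "===> 'He said !hi!.'"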

Are these assumptions valid? Can we have stricter assumptions? For
example, can we say that every source file is UTF-8 encoded, just like
CompressedSourceStreams?

Here is the benchmark which shows the speedup (times in milliseconds):
(1 to: 3) collect: [ :run |
  Smalltalk garbageCollect.
  [ CompiledMethod allInstancesDo: #getSourceFromFile ] timeToRun ]

Current: #(7039 7037 7051)
New: #(1923 1903 1890)

(Note that further minor speedups are still possible, but I didn't
bother with them.)


While digging through the code of FileStream and subclasses, I found that
it may be worth implementing MultiByteFileStream and
MultiByteBinaryOrTextStream in a different way. Instead of subclassing
existing stream classes (and adding cruft to the whole Stream hierarchy) we
could use a separate class named MultiByteStream which would encapsulate a
stream (a FileStream or an in-memory stream), the converter, line-end
conversion, etc. This would let us
- get rid of the basic* methods of the stream hierarchy (which are broken)
- remove duplicate code
- find, deprecate and remove obsolete code
- achieve better performance
We may also be able to use two-level buffering. A rough sketch follows.
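
A minimal sketch (the class shape and #setStream:converter: are only my
suggestions; #nextFromStream: and #nextPut:toStream: are the existing
TextConverter selectors):

Object subclass: #MultiByteStream
    instanceVariableNames: 'stream converter'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Multilingual-Streams'

MultiByteStream class >> on: aStream converter: aTextConverter
    ^ self new setStream: aStream converter: aTextConverter

MultiByteStream >> next
    "Delegate decoding to the converter, which pulls raw data from the
    wrapped stream."
    ^ converter nextFromStream: stream

MultiByteStream >> nextPut: aCharacter
    "Encode through the converter onto the wrapped stream."
    ^ converter nextPut: aCharacter toStream: stream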

What do you think? Should we do this (even if it will not be 100%
backwards compatible)?


Levente



Re: #nextChunk speedup, the future of multibyte streams

Igor Stasenko
On 30 January 2010 04:09, Levente Uzonyi <[hidden email]> wrote:

> [...]
> Instead of subclassing existing stream classes (and adding cruft to the
> whole Stream hierarchy) we could use a separate class named MultiByteStream
> which would encapsulate a stream (a FileStream or an in-memory stream),
> the converter, line-end conversion, etc.
> [...]
> What do you think? Should we do this (even if it will not be 100%
> backwards compatible)?

I am with you. Wrapping or delegation is what I think a
MultiByteStream should use, i.e.
use an existing stream to read the data from and do its own conversion.
It may slow things down a little due to stream chaining, but it would
clean a lot of cruft out of the implementation.


--
Best regards,
Igor Stasenko AKA sig.


Re: #nextChunk speedup, the future of multibyte streams

Levente Uzonyi-2
On Sat, 30 Jan 2010, Igor Stasenko wrote:

> [...]
> I am with you. Wrapping or delegation is what I think a
> MultiByteStream should use, i.e.
> use an existing stream to read the data from and do its own conversion.
> It may slow things down a little due to stream chaining, but it would
> clean a lot of cruft out of the implementation.

The chaining is already there (as sends):

MultiByteFileStream >> #next sends
  TextConverter >> #nextFromStream: sends
    MultiByteFileStream >> #basicNext sends
      StandardFileStream >> #next

With encapsulation it could be:

MultiByteFileStream(MultiByteStream) >> #next sends
  TextConverter >> #nextFromStream: sends
    StandardFileStream >> #next

So I expect it to be a bit faster.


Levente



Re: #nextChunk speedup, the future of multibyte streams

cbc
In reply to this post by Levente Uzonyi-2
On Fri, Jan 29, 2010 at 6:09 PM, Levente Uzonyi <[hidden email]> wrote:
> - it assumes that ! is encoded as byte 33 and whenever byte 33 occurs in
>  the encoded stream that byte is an encoded ! character

The "whenever byte 33 occurs in the encoded stream that byte is an
encoded ! character" part of this seems suspect to me. Are you checking
the raw bytes for byte 33, or are you still checking characters and
assuming that a character read as byte 33 is a ! ? If you are just
scanning bytes, I would assume that some UTF-8 characters could have a
byte 33 encoded in them.

Although I'm not a UTF-8 expert.

-Chris


Re: #nextChunk speedup, the future of multibyte streams

Nicolas Cellier
In reply to this post by Igor Stasenko
2010/1/30 Igor Stasenko <[hidden email]>:

> [...]
> I am with you. Wrapping or delegation is what I think a
> MultiByteStream should use, i.e.
> use an existing stream to read the data from and do its own conversion.
> It may slow things down a little due to stream chaining, but it would
> clean a lot of cruft out of the implementation.

Yes, subclassing was the worst choice with respect to hacking.
basicNext, bareNext etc. should not exist.
IMO the wrapper implementation will not only be cleaner, it should also
be faster.
http://www.squeaksource.com/XTream demonstrates that a comfortable
speedup is possible:

{
    [ | tmp |
    tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles at: 2) name)
        ascii;
        wantsLineEndConversion: false;
        converter: UTF8TextConverter new.
    1 to: 10000 do: [ :i | tmp upTo: Character cr ].
    tmp close ] timeToRun.

    [ | tmp |
    tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles at: 2) name)
        readXtream ascii buffered
        decodeWith: (UTF8TextConverter new installLineEndConvention: nil))
            buffered.
    1 to: 10000 do: [ :i | tmp upTo: Character cr ].
    tmp close ] timeToRun.
} "===> #(332 19)"

Nicolas



Re: #nextChunk speedup, the future of multibyte streams

Bert Freudenberg
In reply to this post by cbc
On 29.01.2010, at 20:07, Chris Cunningham wrote:

> The "whenever byte 33 occurs in the encoded stream that byte is an
> encoded ! character" part of this seems suspect to me. [...] If you are
> just scanning bytes, I would assume that some UTF-8 characters could
> have a byte 33 encoded in them.

Wrong.

> Although I'm not a UTF-8 expert.

Obviously ;) See

http://en.wikipedia.org/wiki/UTF-8#Description

- Bert -




Re: #nextChunk speedup, the future of multibyte streams

Igor Stasenko
On 30 January 2010 09:15, Bert Freudenberg <[hidden email]> wrote:

> [...]
> Wrong.
>
>> Although I'm not a UTF-8 expert.
>
> Obviously ;) See
>
> http://en.wikipedia.org/wiki/UTF-8#Description
Either way, the presence of the ! character should be tested after
decoding the UTF-8 data.



--
Best regards,
Igor Stasenko AKA sig.


Re: #nextChunk speedup, the future of multibyte streams

Levente Uzonyi-2
On Sat, 30 Jan 2010, Igor Stasenko wrote:

> [...]
> Either way, the presence of the ! character should be tested after
> decoding the UTF-8 data.
Why? UTF-8 is ASCII compatible.
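
Every byte of a multi-byte UTF-8 sequence is in the range 16r80-16rFF,
so byte 33 can only ever be an encoded !. A quick workspace check (using
the Multilingual package's #convertToEncoding:; the lambda is just an
arbitrary multi-byte character):

   | encoded |
   encoded := ((Character value: 955) asString , '!')
       convertToEncoding: 'utf-8'.
   encoded asByteArray
       "===> #(206 187 33) - the lambda's two bytes are both >= 128;
       only the ! itself appears as byte 33"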


Levente



Re: #nextChunk speedup, the future of multibyte streams

Igor Stasenko
2010/1/31 Levente Uzonyi <[hidden email]>:

> [...]
> Why? UTF-8 is ASCII compatible.

Well, UTF-8 is an octet stream (bytes), not characters, while we are
seeking the '!' character, not a byte.
Logically, the data flow should be the following:
<primitive> -> ByteArray -> utf8 reader -> character stream -> '!'

Sure, due to the nature of the UTF-8 encoding you could take a shortcut,
but then, because of such hacks, you won't be able to switch to a
different encoding without pain:

<primitive> -> ByteArray -> <XYZ> reader -> character stream -> '!'



--
Best regards,
Igor Stasenko AKA sig.


Re: #nextChunk speedup, the future of multibyte streams

Colin Putney

On 2010-01-31, at 10:54 AM, Igor Stasenko wrote:

> Well, UTF-8 is an octet stream (bytes), not characters, while we are
> seeking the '!' character, not a byte.
> [...]

+1

Bytes and characters are not the same thing.

Colin


Re: #nextChunk speedup, the future of multibyte streams

Levente Uzonyi-2
In reply to this post by Igor Stasenko
On Sun, 31 Jan 2010, Igor Stasenko wrote:

> [...]
> Well, UTF-8 is an octet stream (bytes), not characters, while we are
> seeking the '!' character, not a byte.
> Logically, the data flow should be the following:
> <primitive> -> ByteArray -> utf8 reader -> character stream -> '!'
This is far from reality, because
- #nextChunk doesn't work in binary mode:
   'This is a chunk!!!' readStream nextChunk "===> 'This is a chunk!'"
   'This is a chunk!!!' asByteArray readStream nextChunk "===> MNU"
- text converters don't do any conversion if the stream is binary


> Sure, due to the nature of the UTF-8 encoding you could take a shortcut,
> but then, because of such hacks, you won't be able to switch to a
> different encoding without pain:
>
> <primitive> -> ByteArray -> <XYZ> reader -> character stream -> '!'

That's what my original questions were about (which are still unanswered):
- is it safe to assume that the encoding of source files will be
   compatible with this "hack"?
- is it safe to assume that the source files are always UTF-8 encoded?


Levente



Re: #nextChunk speedup, the future of multibyte streams

Andreas.Raab
Levente Uzonyi wrote:

> On Sun, 31 Jan 2010, Igor Stasenko wrote:
>> Well, UTF-8 is an octet stream (bytes), not characters, while we are
>> seeking the '!' character, not a byte.
> [...]
> This is far from reality, because
> - #nextChunk doesn't work in binary mode
> - text converters don't do any conversion if the stream is binary

Right, although I think Igor's point is slightly different. You could
implement #upTo:, for example, by applying the encoding to the argument
and then doing #upToAllEncoded:, which takes an encoded character
sequence as the argument. This would preserve the generality of #upTo:
with the potential for more general speedup. I.e.,

upTo: aCharacter
   => upToEncoded: bytes
      => primitive read
      <= return encodedBytes
   <= converter decode: encodedBytes
<= returns characters

(One assumption here is that the converter doesn't "embed" a particular
character sequence as part of another one, which is true for UTF-8 but
I'm not sure about other encodings.)
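
In the wrapper design discussed earlier, that could look roughly like
this (every selector except #upTo: is a placeholder name):

MultiByteStream >> upTo: aCharacter
    "Encode the delimiter once, search for it in the raw byte stream,
    and decode only the bytes that were read. Sound only when no encoded
    character can occur inside the encoding of another one (true for
    UTF-8)."
    | delimiterBytes |
    delimiterBytes := converter encodedBytesOf: aCharacter.
    ^ converter decodeBytes: (stream upToAllEncoded: delimiterBytes)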

> That's what my original questions were about (which are still unanswered):
> - is it safe to assume that the encoding of source files will be
>   compatible with this "hack"?
> - is it safe to assume that the source files are always UTF-8 encoded?

I think UTF-8 is going to be the only standard going forward. Precisely
because it has such (often overlooked) extremely useful properties. So
yes, I think it'd be safe to assume that this will work going forward.

Cheers,
   - Andreas



Re: #nextChunk speedup, the future of multibyte streams

Nicolas Cellier
I don't like at all having a String be a blob of bits subject to
encoding interpretation.
String is a collection of characters, and there should be a canonical
encoding known to the VM.
utf8ToSqueak, squeakToUtf8 etc. are quick and dirty hacks.

We should use ByteArray, or better, introduce a UTF8String if it
becomes that important.
Code will be much, much cleaner and more foolproof. A sketch follows.
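
For concreteness, a minimal sketch of what such a class might hold
(entirely hypothetical; it leans on the existing converter hacks only to
stay short):

Object subclass: #UTF8String
    instanceVariableNames: 'bytes'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Multilingual-Experimental'

UTF8String class >> onBytes: aByteArray
    "Wrap already-encoded bytes without copying or decoding them."
    ^ self new setBytes: aByteArray

UTF8String >> setBytes: aByteArray
    bytes := aByteArray

UTF8String >> asString
    "Decode to an ordinary (Wide)String only when character-level access
    is really needed."
    ^ bytes asString utf8ToSqueak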

Nicolas

2010/2/2 Andreas Raab <[hidden email]>:

> [...]
> I think UTF-8 is going to be the only standard going forward. Precisely
> because it has such (often overlooked) extremely useful properties. So yes,
> I think it'd be safe to assume that this will work going forward.


Re: #nextChunk speedup, the future of multibyte streams

Bert Freudenberg
On 03.02.2010, at 00:13, Nicolas Cellier wrote:

> [...]
> We should use ByteArray, or better, introduce a UTF8String if it
> becomes that important.
> Code will be much, much cleaner and more foolproof.

That's what Scratch did ...

- Bert -




Re: #nextChunk speedup, the future of multibyte streams

K. K. Subramaniam
On Wednesday 03 February 2010 01:50:05 pm Bert Freudenberg wrote:
> On 03.02.2010, at 00:13, Nicolas Cellier wrote:
> > We should use ByteArray, or better, introduce a UTF8String if it
> > becomes that important.
> > Code will be much, much cleaner and more foolproof.
>
> That's what Scratch did ...
A much saner choice, I should say, after reading both Squeak and Scratch
sources. Raw bytes are strictly for memory objects and have no place in
higher-level code. Higher-level code should deal with bytes only through
their encoded types - integer, character (ASCII), string (ASCII),
UTF8String (UTF-8), etc.

Nicolas, why the "if it becomes that important" qualifier for UTF-8? Wake up :-).

Subbu


Re: #nextChunk speedup, the future of multibyte streams

Levente Uzonyi-2
In reply to this post by Andreas.Raab
On Mon, 1 Feb 2010, Andreas Raab wrote:

> [...]
> You could implement #upTo:, for example, by applying the encoding to
> the argument and then doing #upToAllEncoded:, which takes an encoded
> character sequence as the argument. This would preserve the generality
> of #upTo: with the potential for more general speedup.
> [...]

Another way to do this is to let the converter read the next chunk.
"TextConverter >> #nextChunkFrom: stream" could use the current
implementation of MultiByteFileStream >> #nextChunk, while
UTF8TextConverter could use #upTo: (this would also let us avoid the
#basicUpTo: hack). So we could use any encoding, while speeding up the
UTF-8 case.
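
Sketched out (the #raw* selectors stand in for whatever byte-level
access the stream would expose; leading-separator skipping is omitted):

MultiByteFileStream >> nextChunk
    "Double dispatch: let the converter pick the fastest strategy for
    its encoding."
    ^ converter nextChunkFrom: self

UTF8TextConverter >> nextChunkFrom: aStream
    "Fast path for UTF-8: scan for byte 33 ($!) without decoding first,
    which is safe because multi-byte sequences contain no ASCII bytes.
    Doubled !s are undoubled per the chunk format."
    | buffer |
    buffer := WriteStream on: (String new: 1000).
    [ buffer nextPutAll: (aStream rawUpTo: $!).
      aStream atEnd not and: [ aStream rawPeek = $! ] ]
        whileTrue: [
            aStream rawNext.
            buffer nextPut: $! ].
    ^ buffer contents utf8ToSqueak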

Maybe we could also move the encoding/decoding related methods/tables
from String and subclasses to the (class side of the) TextConverters.


Levente
