Streams. Status and where to go?


Streams. Status and where to go?

Igor Stasenko
Hello,

I am cross-posting, since I think it is good for all of us to agree on
some common points.

1. Streams need to be rewritten.
2. What do you think would be a good replacement for the current Streams?

Personally, I currently need a fast and concise UTF-8 reader.
UTF8TextConverter is the closest thing to what I would use, but I don't
understand why it is implemented as a non-stream.

#nextFromStream: and #nextPut:toStream: are crying out loud to be just
#next and #nextPut:.
Another thing which makes me sad is this line:

nextFromStream: aStream

        | character1 value1 character2 value2 unicode character3 value3 character4 value4 |
        aStream isBinary ifTrue: [^ aStream basicNext].   <<<<<<<


All external streams are initially binary, but UTF8TextConverter wants
to play with characters instead of octets.
But hey... UTF-8 encoding is exactly about encoding Unicode characters
into binary form.
I'm not even mentioning that operating on bytes (SmallIntegers) is many times
more efficient than operating on characters (objects), because the first
thing it does is:

        character1 := aStream basicNext.
        "#basicNext reads a byte from somewhere and then converts it to an
        instance of Character: 'bonus' overhead here."
        character1 isNil ifTrue: [^ nil].
        value1 := character1 asciiValue.
        "...and, what a surprise, we convert the character back to an
        integer value. What a waste!"
        value1 <= 127 ifTrue: [

I really hope that eventually we will have a good implementation,
where the horse runs ahead of the cart, not the cart ahead of the horse :)
Meanwhile, I think I have no choice but to write yet another
UTF-8 reader in my own package instead of using the existing one.
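
For illustration only, here is a minimal workspace sketch of the byte-oriented approach argued for above: it decodes UTF-8 directly from a binary stream using nothing but SmallInteger arithmetic, so no Character is created just to be turned back into an integer. It is not the code of UTF8TextConverter or of any existing package. The sample bytes and variable names are made up, and it assumes well-formed input (no validation of continuation bytes, overlong forms, or truncated sequences).

```smalltalk
"Sketch only: assumes well-formed UTF-8 input."
| bytes in codePoints lead extra code |
bytes := #[72 195 169 226 130 172 240 159 152 128].   "H, U+00E9, U+20AC, U+1F600"
in := bytes readStream.
codePoints := OrderedCollection new.
[in atEnd] whileFalse: [
    lead := in next.                      "a SmallInteger, not a Character"
    lead < 16r80
        ifTrue: [extra := 0. code := lead]
        ifFalse: [lead < 16rE0
            ifTrue: [extra := 1. code := lead bitAnd: 16r1F]
            ifFalse: [lead < 16rF0
                ifTrue: [extra := 2. code := lead bitAnd: 16r0F]
                ifFalse: [extra := 3. code := lead bitAnd: 16r07]]].
    "Each continuation byte contributes its low 6 bits."
    extra timesRepeat: [
        code := (code bitShift: 6) bitOr: (in next bitAnd: 16r3F)].
    codePoints add: code].
codePoints asArray   "#(72 233 8364 128512)"
```

A Character (or a String) would only need to be built once per decoded code point, for example with `Character value: code`, rather than once per input byte.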

--
Best regards,
Igor Stasenko AKA sig.

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

Re: Streams. Status and where to go?

Nicolas Cellier
2010/2/25 Igor Stasenko <[hidden email]>:

Obviously right: encoded in bytes, decoded in Characters.

There are also ideas being experimented with at http://www.squeaksource.com/XTream.html
Sorry, I hijacked the VW name...
You can download it; it coexists peacefully with Stream.

- use endOfStreamAction instead of an Exception... That means abandoning
the next and nextPut: primitives (no real performance impact, and expect a
boost in the future Cog VM).
- separate CollectionReadStream=concrete class, ReadStream=abstract class
- use a wrapper rather than a subclass for MultiByteFileStream
- implement the sequenceable collection API
- buffer I/O (mostly in Squeak, thanks to Levente)

Of course, there are alternative ideas to pick from Nile, VW XTream, gst generators, etc.

I think mutating the existing library is doable (just a bit tricky because
both the Compiler and source code management use Stream extensively...).

Nicolas


Re: Streams. Status and where to go?

Stéphane Ducasse
In reply to this post by Igor Stasenko
I would love to see this whole part get cleaned up.
I can allocate time to help integrate the result, but I cannot be the lead.

Stef

On Feb 25, 2010, at 7:11 PM, Igor Stasenko wrote:


Re: Streams. Status and where to go?

Igor Stasenko
In reply to this post by Nicolas Cellier
Hello, Nicolas.
I want to try it out.
I tried to load it (XTream-Core) into my image, and it complains about
unresolved dependencies:
----
This package depends on the following classes:
  ByteTextConverter
You must resolve these dependencies before you will be able to load
these definitions:
  ByteTextConverter>>nextFromXtream:
  ByteTextConverter>>nextPut:toXtream:
  ByteTextConverter>>readInto:startingAt:count:fromXtream:
----
I ignored these warnings and pressed Continue; here is what it warns
about in my trunk image:

TextConverter>>next:putAll:startingAt:toXtream: (latin1Map is Undeclared)
TextConverter>>next:putAll:startingAt:toXtream: (latin1Encodings is Undeclared)
TextConverter>>readInto:startingAt:count:fromXtream: (latin1Map is Undeclared)

Is ByteTextConverter a Pharo-specific class?

If you saw my previous message, I think you noticed that the
XXXTextConverter classes are abominations (IMO) and should be
reimplemented as wrapping streams instead.
Would you be willing to change that in XTream? I mean implementing a
conversion stream model, which can wrap around any other stream,
like:

myStream := UTFReaderStream on: otherStream.
myString := myStream contents.

or the other way around:

myString := (someBaseStream wrapWith: UTFReaderStream) contents.

or:

myDecodedString := (someBaseStream wrapWith: (DecodingStreams
decoderFor: myEncoding)) contents.

That would be much nicer than using converters.

Wrappers are more flexible than TextConverters, since they are
not obliged to convert to/from text-based collections only.
For example, we can use the same API for wrapping with a ZIP stream:

myUnpackedData := (someBaseStream wrapWith: ZIPReaderStream) contents.

and many other (ab)uses, like reading changeset chunks:

nextChunk := (fileStream wrapWith: ChunkReaderStream) next.


On 25 February 2010 21:19, Nicolas Cellier
<[hidden email]> wrote:


--
Best regards,
Igor Stasenko AKA sig.


Re: Streams. Status and where to go?

Nicolas Cellier
2010/2/26 Igor Stasenko <[hidden email]>:
> Hello, Nicolas.

Hi Igor.
You should load it in trunk.

> I want to try it out.
> I tried to load it (XTream-Core) into my image, and it bug me about
> unresolved dependencies:
> ----
> This package depends on the following classes:
>  ByteTextConverter
> You must resolve these dependencies before you will be able to load
> these definitions:
>  ByteTextConverter>>nextFromXtream:
>  ByteTextConverter>>nextPut:toXtream:
>  ByteTextConverter>>readInto:startingAt:count:fromXtream:
> ----
> I ignored these warnings, pressing continue, and here what it warns
> about in my trunk image:
>
> TextConverter>>next:putAll:startingAt:toXtream: (latin1Map is Undeclared)
> TextConverter>>next:putAll:startingAt:toXtream: (latin1Encodings is Undeclared)
> TextConverter>>readInto:startingAt:count:fromXtream: (latin1Map is Undeclared)
>
> Is ByteTextConverter a Pharo-specific class?
>

This is a refactoring of TextConverter that I made in trunk.
Pharo did the same before me (it comes from Sophie), but unfortunately
I missed it...

> If you seen my previous message, i think you noticed that
> XXXTextConverter is abdominations (IMO), and should be reimplemented
> as a wrapping-streams instead.
> Would you be willing to change that in XStreams? I mean implementing a
> conversion streams model, which can wrap around any other stream,
> like:
>
> myStream := UTFReaderStream on: otherStream.
> myString := myStream contents.
>
> or using other way:
>
> myString := (someBaseStream wrapWith: UTFReaderStream) contents.
>
> or..
> myDecodedString := (someBaseStream wrapWith: (DecodingStreams
> decoderFor: myEncoding) contents.
>
> That's would be much nicer than using converters.

Currently, I have a ConverterReadXtream and a ConverterWriteXtream,
which are stream wrappers.
They use the old TextConverter to do the real job, but I agree that a full
rewrite of it is needed.
However, I would like to keep these two layers for Stream composition:
- the generic converter stream
- the conversion algorithm

Though the current XTream is a quick hack reusing Yoshiki's TextConverter,
it already demonstrates the possible gains from buffering.
The speed comes from applying the utf8ToSqueak/squeakToUtf8 trick: copy
large ASCII-encoded portions verbatim.
This works very well with Squeak source because 99.99% of the characters are ASCII.
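
As a rough illustration of the "copy ASCII portions verbatim" idea (a sketch under assumptions, not the actual code of utf8ToSqueak or XTream): find the first byte above 127 and convert everything before it in a single bulk operation. Only the remainder would need real UTF-8 decoding. The sample data and variable names are invented for the example.

```smalltalk
"Sketch only: split off the leading pure-ASCII run of a UTF-8 buffer."
| bytes firstNonAscii asciiPrefix |
bytes := 'plain ASCII, then ' asByteArray , #[195 169].   "ends with U+00E9 in UTF-8"
"Index of the first byte that needs real decoding (0 if there is none)."
firstNonAscii := bytes findFirst: [:b | b > 127].
asciiPrefix := firstNonAscii = 0
    ifTrue: [bytes asString]
    ifFalse: [(bytes copyFrom: 1 to: firstNonAscii - 1) asString].
asciiPrefix   "'plain ASCII, then '"
```

With source code that is almost entirely ASCII, most buffers would take the bulk path, and the per-character decoder would run only on the rare non-ASCII tail.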

>
> Wrappers is more flexible comparing to TextConverters, since they are
> not obliged to convert to/from text-based collections only.
> For example, we can use same API for wrapping with ZIP stream:
>
> myUnpackedData := (someBaseStream wrapWith: ZIPReaderStream) contents.
>
> and many other (ab)uses.. Like reading changeset chunks:
>
> nextChunk := (fileStream wrapWith: ChunkReaderStream) next.
>

Yes, that fits my intentions.
What I want is to preserve buffered operations along the chain, and
avoid byte-by-byte conversions when possible.


XTream-Tests gives some usage patterns.
Here are also some timings on various machines, just to check efficiency.
Though XTream does not use any next/nextPut: primitive, it competes quite well.


| str |
str := String new: 1000 withAll: $a.
{
[str readStream upToEnd] bench.
[str readXtream upToEnd] bench.
}
#('583247.75044991 per second.' '597688.862227554 per second.')
#('221266.5466906619 per second.' '221899.4201159768 per second.')
#('218044.1911617676 per second.' '220044.1911617676 per second.')
#('190631.7473010796 per second.' '192736.452709458 per second.')

| str |
str := String new: 1000 withAll: $a.
{
[str readStream upTo: $b] bench.
[str readXtream upTo: $b] bench.
}
#('125180.9638072386 per second.' '126922.0155968806 per second.')
#('120683.8632273545 per second.' '123071.1857628474 per second.')
#('105943.4113177364 per second.' '107742.851429714 per second.')


| str |
str := String new: 1000 withAll: $a.
{
[str readStream upToAnyOf: (CharacterSet crlf)] bench.
[str readXtream upToAnyOf: (CharacterSet crlf)] bench.
}
#('112977.2045590882 per second.' '112393.3213357328 per second.')
#('108469.9060187962 per second.' '108042.9914017197 per second.')
#('91692.0615876825 per second.' '92319.1361727654 per second.')

| str |
str := String new: 1000 withAll: $a.
{
[| tmp | tmp := str readStream. [tmp next==nil] whileFalse] bench.
[| tmp | tmp := str readXtream. [tmp next==nil] whileFalse] bench.
[| tmp | tmp := str readXtream. tmp do: [:e | ]] bench.
}
#('10452.10957808438 per second.' '6419.11617676465 per second.'
'2384.323135372925 per second.')
#('9799.2401519696 per second.' '6436.712657468506 per second.'
'2171.765646870626 per second.')
#('10475.7048590282 per second.' '4569.08618276345 per second.'
'1989.202159568086 per second.')

| str |
str := String new: 80000 withAll: $a.
{
[| tmp | tmp := str readStream. [tmp next==nil] whileFalse] bench.
[| tmp | tmp := str readXtream. [tmp next==nil] whileFalse] bench.
[| tmp | tmp := str readXtream. tmp do: [:e | ]] bench.
}
#('131.1737652469506 per second.' '81.1026767878546 per second.'
'29.96404314822213 per second.')
#('132.388178913738 per second.' '81.701957650819 per second.'
'27.44084310996222 per second.')

| str |
str := String new: 1000 withAll: $a.
{
[str readStream upToAll: 'ab'] bench.
[str readXtream upToAll: 'ab'] bench.
}
#('514.297140571886 per second.' '633.473305338932 per second.')
#('511.795281887245 per second.' '561.487702459508 per second.')
#('513.497300539892 per second.' '557.48850229954 per second.')

| str |
str := String new: 1000 withAll: $a.
{
[str readStream upToAll: 'aab'] bench.
[str readXtream upToAll: 'aab'] bench.
}
#('892.021595680864 per second.' '1427.914417116577 per second.')
#('388.122375524895 per second.' '521.991203518593 per second.')
#('394.5632620427743 per second.' '539.892021595681 per second.')
#('384.6461415433827 per second.' '476.2095161935226 per second.')
#('382.846861255498 per second.' '475.9048190361927 per second.')

{
[| tmp |
        tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles at: 2) name).
        [tmp next==nil] whileFalse. tmp close] timeToRun.
[| tmp |
        tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles at: 2) name) readXtream buffered.
        [tmp next==nil] whileFalse. tmp close] timeToRun.
}
#(1639 1491)
#(3121 2892)
#(3213 2799)
#(2591 2115)
#(2146 2030) #(2153 1988) #(2770 2574) #(2319 2089) #(2141 1927) #(27008 1947)

{
[| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
at: 2) name) ascii.
        [tmp upTo: Character cr. tmp atEnd] whileFalse. tmp close] timeToRun.
[| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
at: 2) name) readXtream ascii buffered.
        [tmp upTo: Character cr. tmp atEnd] whileFalse. tmp close] timeToRun.
}
#(8779 566)
#(6418 1182)
#(6084 1076)
#(4647 856)
#(4742 881) #(4332 818) #(4859 855) #(4503 1563) #(4347 816) #(4026 835) #(4285 821)

{
[| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
at: 2) name) ascii.
        [tmp nextLine == nil] whileFalse. tmp close] timeToRun.
MessageTally spyOn: [[| tmp | tmp := (StandardFileStream
readOnlyFileNamed: (SourceFiles at: 2) name) readXtream ascii
buffered.
        [tmp nextLine == nil] whileFalse. tmp close] timeToRun].
}
#(2088 1996) #(1920 1814) #(1589 1537) #(1631 1514) #(1587 1449)
#(1490 1434) #(1567 1667) #(1807 1777) #(1785 2159) #(1802 2147)

MessageTally spyOn: [| tmp | tmp := (StandardFileStream
readOnlyFileNamed: (SourceFiles at: 2) name) readXtream ascii
buffered.
        [tmp upTo: Character cr. tmp atEnd] whileFalse. tmp close]
.
{
[| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles at: 2) name) ascii.
        [tmp upToAnyOf: (CharacterSet crlf). tmp atEnd] whileFalse. tmp close] timeToRun.
[| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles at: 2) name) readXtream ascii buffered.
        [tmp upToAnyOf: (CharacterSet crlf). tmp atEnd] whileFalse. tmp close] timeToRun.
}
#(9153 665)
#(6463 1251)
#(5028 996) #(5076 1051) #(5223 949) #(4898 1073) #(5130 1610) #(5092 1776) #(4798 878) #(4757 956) #(5499 1405) #(14522 954) #(75895 1003)


{
[| tmp | tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles
at: 2) name) ascii; wantsLineEndConversion: false; converter:
UTF8TextConverter new.
        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
[| tmp atEnd | tmp := (StandardFileStream readOnlyFileNamed:
(SourceFiles at: 2) name) readXtream ascii buffered decodeWith:
UTF8TextConverter new.
        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
}
#(332 183)
#(558 422) #(678 421) #(686 420) #(675 423) #(673 423) #(662 410)
#(681 558) #(674 550) #(674 928) #(694 1043) #(1668 1112)


{
MessageTally spyOn: [[| tmp | tmp := (MultiByteFileStream
readOnlyFileNamed: (SourceFiles at: 2) name) ascii;
wantsLineEndConversion: false; converter: UTF8TextConverter new.
        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun].
MessageTally spyOn: [[| tmp | tmp := (StandardFileStream
readOnlyFileNamed: (SourceFiles at: 2) name) readXtream ascii buffered
decodeWith: UTF8TextConverter new.
        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun].
}
#(349 189)
#(577 458) #(595 487)
#(574 438) #(699 444) #(714 457) #(722 449) #(724 438) #(692 572)
#(707 698) #(707 693) #(689 670) #(691 663) #(726 957) #(714 1105)
#(724 1150) #(1765 1098)

{
[| tmp | tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles
at: 2) name) ascii; wantsLineEndConversion: false; converter:
UTF8TextConverter new.
      1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
[| tmp | tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles
at: 2) name) readXtream ascii buffered decodeWith: (UTF8TextConverter
new installLineEndConvention: nil)) buffered.
      1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
}
#(318 14)
#(558 38) #(559 44) #(579 43)#(540 32)
#(701 34) #(694 36)


MessageTally spyOn: [
| string1 converter |
string1 := 'à ta santé mon brave' squeakToUtf8.
converter := UTF8TextConverter new installLineEndConvention: nil.
{
        [string1 utf8ToSqueak] bench.
        [(string1 readXtream decodeWith: converter) upToEnd] bench.
        [(string1 readXtream decodeWith: converter) buffered upToEnd] bench.
}
]
#('99488.1023795241 per second.' '27299.1401719656 per second.'
'17217.55648870226 per second.')
#('106710.2579484103 per second.' '30986.6026794641 per second.'
'21273.1453709258 per second.')
#('108047.7904419116 per second.' '31168.56628674265 per second.'
'21107.17856428714 per second.')
#('96647.2705458908 per second.' '28705.25894821036 per second.'
'19899.4201159768 per second.')
#('95075.9848030394 per second.' '32338.5322935413 per second.'
'20242.95140971806 per second.')

MessageTally spyOn: [
| string1 converter |
string1 := 'This ASCII string should not be hard to decode' squeakToUtf8.
converter := UTF8TextConverter new installLineEndConvention: nil.
{
        [string1 utf8ToSqueak] bench.
        [(string1 readXtream decodeWith: converter) upToEnd] bench.
        [(string1 readXtream decodeWith: converter) buffered upToEnd] bench.
}
]
#('810708.458308338 per second.' '15476.30473905219 per second.'
'24907.81843631274 per second.')
#('1.044100979804039e6 per second.' '18131.57368526295 per second.'
'40563.0873825235 per second.')


{
[|ws |
       ws := (String new: 10000) writeStream.
       1 to: 20000 do: [:i | ws nextPut: $0]] bench.
[| ws |
       ws := (String new: 10000) writeXtream.
       1 to: 20000 do: [:i | ws nextPut: $0]] bench.
}
#('442.7114577084583 per second.' '359.3281343731254 per second.')
#('178.4929042574455 per second.' '130.7738452309538 per second.')
#('182.490505696582 per second.' '131.1475409836065 per second.')
#('85.4291417165669 per second.' '128.8453855373552 per second.')
#('86.4789294987018 per second.' '128.374325134973 per second.')


Re: Streams. Status and where to go?

Nicolas Cellier
In the same vein as TextConverter, I reused StandardFileStream for
holding the file I/O primitives.
I should have used IOHandle instead.

Non-blocking variants for reading/writing should be written too.
That would enable extension to sockets...
This kind of feature must be carefully thought through right from the beginning.

Nicolas

2010/2/26 Nicolas Cellier <[hidden email]>:

>>>> where horse runs ahead of cart, not cart ahead of horse :)
>>>> Meanwhile i think i have no choice but make yet-another implementation
>>>> of utf8 reader in my own package, instead of using existing one.
>>>>
>>>> --
>>>> Best regards,
>>>> Igor Stasenko AKA sig.
>>>>
>>>
>>> Obviously right. encoded in bytes, decoded in Characters.
>>>
>>> There are also ideas experimented at http://www.squeaksource.com/XTream.html
>>> Sorry I hijacked VW name...
>>> You can download it; it coexists peacefully with Stream.
>>>
>>> - use endOfStreamAction instead of Exception... That means abandoning
>>> primitives next nextPut: (no real performance impact, and expect a
>>> boost in future COG).
>>> - separate CollectionReadStream=concrete class, ReadStream=abstract class
>>> - use a wrapper rather than a subclass for MultiByteFileStream
>>> - implement sequenceable collection API
>>> - buffer I/O (mostly in Squeak thanks Levente)
>>>
>>> Of course, alternate ideas to pick from Nile, VW XTream, gst generators etc...
>>>
>>> I think mutating existing library is doable (just a bit tricky because
>>> both Compiler and source code management use Stream extensively...).
>>>
>>> Nicolas
>>>
>>>> _______________________________________________
>>>> Pharo-project mailing list
>>>> [hidden email]
>>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>>>>
>>>
>>> _______________________________________________
>>> Pharo-project mailing list
>>> [hidden email]
>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>>>
>>
>>
>>
>> --
>> Best regards,
>> Igor Stasenko AKA sig.
>>
>> _______________________________________________
>> Pharo-project mailing list
>> [hidden email]
>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>
> XTream-Tests gives some usage patterns.
> Here are also some timings on various machines just to check efficiency:
> Though XTream does not use any next/nextPut: primitive, it competes quite well.
>
>
> | str |
> str := String new: 1000 withAll: $a.
> {
> [str readStream upToEnd] bench.
> [str readXtream upToEnd] bench.
> }
> #('583247.75044991 per second.' '597688.862227554 per second.')
> #('221266.5466906619 per second.' '221899.4201159768 per second.')
> #('218044.1911617676 per second.' '220044.1911617676 per second.')
> #('190631.7473010796 per second.' '192736.452709458 per second.')
>
> | str |
> str := String new: 1000 withAll: $a.
> {
> [str readStream upTo: $b] bench.
> [str readXtream upTo: $b] bench.
> }
> #('125180.9638072386 per second.' '126922.0155968806 per second.')
> #('120683.8632273545 per second.' '123071.1857628474 per second.')
> #('105943.4113177364 per second.' '107742.851429714 per second.')
>
>
> | str |
> str := String new: 1000 withAll: $a.
> {
> [str readStream upToAnyOf: (CharacterSet crlf)] bench.
> [str readXtream upToAnyOf: (CharacterSet crlf)] bench.
> }
> #('112977.2045590882 per second.' '112393.3213357328 per second.')
> #('108469.9060187962 per second.' '108042.9914017197 per second.')
> #('91692.0615876825 per second.' '92319.1361727654 per second.')
>
> | str |
> str := String new: 1000 withAll: $a.
> {
> [| tmp | tmp := str readStream. [tmp next==nil] whileFalse] bench.
> [| tmp | tmp := str readXtream. [tmp next==nil] whileFalse] bench.
> [| tmp | tmp := str readXtream. tmp do: [:e | ]] bench.
> }
> #('10452.10957808438 per second.' '6419.11617676465 per second.'
> '2384.323135372925 per second.')
> #('9799.2401519696 per second.' '6436.712657468506 per second.'
> '2171.765646870626 per second.')
> #('10475.7048590282 per second.' '4569.08618276345 per second.'
> '1989.202159568086 per second.')
>
> | str |
> str := String new: 80000 withAll: $a.
> {
> [| tmp | tmp := str readStream. [tmp next==nil] whileFalse] bench.
> [| tmp | tmp := str readXtream. [tmp next==nil] whileFalse] bench.
> [| tmp | tmp := str readXtream. tmp do: [:e | ]] bench.
> }
> #('131.1737652469506 per second.' '81.1026767878546 per second.'
> '29.96404314822213 per second.')
> #('132.388178913738 per second.' '81.701957650819 per second.'
> '27.44084310996222 per second.')
>
> | str |
> str := String new: 1000 withAll: $a.
> {
> [str readStream upToAll: 'ab'] bench.
> [str readXtream upToAll: 'ab'] bench.
> }
> #('514.297140571886 per second.' '633.473305338932 per second.')
> #('511.795281887245 per second.' '561.487702459508 per second.')
> #('513.497300539892 per second.' '557.48850229954 per second.')
>
> | str |
> str := String new: 1000 withAll: $a.
> {
> [str readStream upToAll: 'aab'] bench.
> [str readXtream upToAll: 'aab'] bench.
> }
> #('892.021595680864 per second.' '1427.914417116577 per second.')
> #('388.122375524895 per second.' '521.991203518593 per second.')
> #('394.5632620427743 per second.' '539.892021595681 per second.')
> #('384.6461415433827 per second.' '476.2095161935226 per second.')
> #('382.846861255498 per second.' '475.9048190361927 per second.')
>
> {
> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
> at: 2) name). [tmp next==nil] whileFalse. tmp close] timeToRun.
> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
> at: 2) name) readXtream buffered. [tmp next==nil] whileFalse. tmp
> close] timeToRun.
> }
> #(1639 1491)
> #(3121 2892)
> #(3213 2799)
> #(2591 2115)
> #(2146 2030) #(2153 1988) #(2770 2574) #(2319 2089) #(2141 1927) #(27008 1947)
>
> {
> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
> at: 2) name) ascii.
>        [tmp upTo: Character cr. tmp atEnd] whileFalse. tmp close] timeToRun.
> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
> at: 2) name) readXtream ascii buffered.
>        [tmp upTo: Character cr. tmp atEnd] whileFalse. tmp close] timeToRun.
> }
> #(8779 566)
> #(6418 1182)
> #(6084 1076)
> #(4647 856)
> #(4742 881) #(4332 818) #(4859 855) #(4503 1563) #(4347 816) #(4026
> 835) #(4285 821)
>
> {
> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
> at: 2) name) ascii.
>        [tmp nextLine == nil] whileFalse. tmp close] timeToRun.
> MessageTally spyOn: [[| tmp | tmp := (StandardFileStream
> readOnlyFileNamed: (SourceFiles at: 2) name) readXtream ascii
> buffered.
>        [tmp nextLine == nil] whileFalse. tmp close] timeToRun].
> }
> #(2088 1996) #(1920 1814) #(1589 1537) #(1631 1514) #(1587 1449)
> #(1490 1434) #(1567 1667) #(1807 1777) #(1785 2159) #(1802 2147)
>
> MessageTally spyOn: [| tmp | tmp := (StandardFileStream
> readOnlyFileNamed: (SourceFiles at: 2) name) readXtream ascii
> buffered.
>        [tmp upTo: Character cr. tmp atEnd] whileFalse. tmp close]
> .
> {
> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
> at: 2) name) ascii.
>        [tmp upToAnyOf: (CharacterSet crlf). tmp atEnd] whileFalse. tmp
> close] timeToRun.
> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
> at: 2) name) readXtream ascii buffered.
>        [tmp upToAnyOf: (CharacterSet crlf). tmp atEnd] whileFalse. tmp
> close] timeToRun.
> }
> #(9153 665)
> #(6463 1251)
> #(5028 996) #(5076 1051) #(5223 949) #(4898 1073) #(5130 1610) #(5092
> 1776) #(4798 878) #(4757 956) #(5499 1405) #(14522 954) #(75895 1003)
>
>
> {
> [| tmp | tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles
> at: 2) name) ascii; wantsLineEndConversion: false; converter:
> UTF8TextConverter new.
>        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
> [| tmp atEnd | tmp := (StandardFileStream readOnlyFileNamed:
> (SourceFiles at: 2) name) readXtream ascii buffered decodeWith:
> UTF8TextConverter new.
>        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
> }
> #(332 183)
> #(558 422) #(678 421) #(686 420) #(675 423) #(673 423) #(662 410)
> #(681 558) #(674 550) #(674 928) #(694 1043) #(1668 1112)
>
>
> {
> MessageTally spyOn: [[| tmp | tmp := (MultiByteFileStream
> readOnlyFileNamed: (SourceFiles at: 2) name) ascii;
> wantsLineEndConversion: false; converter: UTF8TextConverter new.
>        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun].
> MessageTally spyOn: [[| tmp | tmp := (StandardFileStream
> readOnlyFileNamed: (SourceFiles at: 2) name) readXtream ascii buffered
> decodeWith: UTF8TextConverter new.
>        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun].
> }
> #(349 189)
> #(577 458) #(595 487)
> #(574 438) #(699 444) #(714 457) #(722 449) #(724 438) #(692 572)
> #(707 698) #(707 693) #(689 670) #(691 663) #(726 957) #(714 1105)
> #(724 1150) #(1765 1098)
>
> {
> [| tmp | tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles
> at: 2) name) ascii; wantsLineEndConversion: false; converter:
> UTF8TextConverter new.
>      1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
> [| tmp | tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles
> at: 2) name) readXtream ascii buffered decodeWith: (UTF8TextConverter
> new installLineEndConvention: nil)) buffered.
>      1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
> }
> #(318 14)
> #(558 38) #(559 44) #(579 43)#(540 32)
> #(701 34) #(694 36)
>
>
> MessageTally spyOn: [
> | string1 converter |
> string1 := 'à ta santé mon brave' squeakToUtf8.
> converter := UTF8TextConverter new installLineEndConvention: nil.
> {
>        [string1 utf8ToSqueak] bench.
>        [(string1 readXtream decodeWith: converter) upToEnd] bench.
>        [(string1 readXtream decodeWith: converter) buffered upToEnd] bench.
> }
> ]
> #('99488.1023795241 per second.' '27299.1401719656 per second.'
> '17217.55648870226 per second.')
> #('106710.2579484103 per second.' '30986.6026794641 per second.'
> '21273.1453709258 per second.')
> #('108047.7904419116 per second.' '31168.56628674265 per second.'
> '21107.17856428714 per second.')
> #('96647.2705458908 per second.' '28705.25894821036 per second.'
> '19899.4201159768 per second.')
> #('95075.9848030394 per second.' '32338.5322935413 per second.'
> '20242.95140971806 per second.')
>
> MessageTally spyOn: [
> | string1 converter |
> string1 := 'This ASCII string should not be hard to decode' squeakToUtf8.
> converter := UTF8TextConverter new installLineEndConvention: nil.
> {
>        [string1 utf8ToSqueak] bench.
>        [(string1 readXtream decodeWith: converter) upToEnd] bench.
>        [(string1 readXtream decodeWith: converter) buffered upToEnd] bench.
> }
> ]
> #('810708.458308338 per second.' '15476.30473905219 per second.'
> '24907.81843631274 per second.')
> #('1.044100979804039e6 per second.' '18131.57368526295 per second.'
> '40563.0873825235 per second.')
>
>
> {
> [|ws |
>       ws := (String new: 10000) writeStream.
>       1 to: 20000 do: [:i | ws nextPut: $0]] bench.
> [| ws |
>       ws := (String new: 10000) writeXtream.
>       1 to: 20000 do: [:i | ws nextPut: $0]] bench.
> }
> #('442.7114577084583 per second.' '359.3281343731254 per second.')
> #('178.4929042574455 per second.' '130.7738452309538 per second.')
> #('182.490505696582 per second.' '131.1475409836065 per second.')
> #('85.4291417165669 per second.' '128.8453855373552 per second.')
> #('86.4789294987018 per second.' '128.374325134973 per second.')
>


Re: Streams. Status and where to go?

Philippe Marschall-2-3
In reply to this post by Igor Stasenko
On 25.02.2010 19:11, Igor Stasenko wrote:

> Hello,
>
> i am cross-posting, since i think it is good for all of us to agree on
> some common points.
>
> 1. Streams needs to be rewritten.
> 2. What do you think is good replacement for current Streams?
>
> personally, i currently need a fast and concise UTF8 reader.
> The UTF8TextConverter is closest thing what i would take, but i don't
> understand, why
> it implemented as a non-stream?

In Seaside we use the utf-8 fastpath by Andreas.

Cheers
Philippe



Re: Streams. Status and where to go?

Igor Stasenko
In reply to this post by Nicolas Cellier
On 26 February 2010 18:59, Nicolas Cellier
<[hidden email]> wrote:
> 2010/2/26 Igor Stasenko <[hidden email]>:
>> Hello, Nicolas.
>
> Hi igor.
> You should load it in trunk.
>
Ah, I think my image is a bit outdated then.

>> I want to try it out.
>> I tried to load it (XTream-Core) into my image, and it bug me about
>> unresolved dependencies:
>> ----
>> This package depends on the following classes:
>>  ByteTextConverter
>> You must resolve these dependencies before you will be able to load
>> these definitions:
>>  ByteTextConverter>>nextFromXtream:
>>  ByteTextConverter>>nextPut:toXtream:
>>  ByteTextConverter>>readInto:startingAt:count:fromXtream:
>> ----
>> I ignored these warnings, pressing continue, and here what it warns
>> about in my trunk image:
>>
>> TextConverter>>next:putAll:startingAt:toXtream: (latin1Map is Undeclared)
>> TextConverter>>next:putAll:startingAt:toXtream: (latin1Encodings is Undeclared)
>> TextConverter>>readInto:startingAt:count:fromXtream: (latin1Map is Undeclared)
>>
>> Is ByteTextConverter a Pharo-specific class?
>>
>
> This is a refactoring of TextConverter I made in trunk.
> Pharo did the same before me (it comes from Sophie), but I missed it
> unfortunately...
>
>> If you have seen my previous message, I think you noticed that the
>> XXXTextConverter classes are abominations (IMO) and should be
>> reimplemented as wrapping streams instead.
>> Would you be willing to change that in XStreams? I mean implementing a
>> conversion streams model, which can wrap around any other stream,
>> like:
>>
>> myStream := UTFReaderStream on: otherStream.
>> myString := myStream contents.
>>
>> or using other way:
>>
>> myString := (someBaseStream wrapWith: UTFReaderStream) contents.
>>
>> or..
>> myDecodedString := (someBaseStream wrapWith: (DecodingStreams
>> decoderFor: myEncoding)) contents.
>>
>> That would be much nicer than using converters.
>
> Currently, I have a ConverterReadXtream and a ConverterWriteXtream
> which are stream wrappers.
> They use old TextConverter to do the real job, but I agree, a full
> rewrite of this one is needed.
> However, I would like to keep these two layers for Stream composition:
> - the generic converter stream
> - the conversion algorithm
>

Why?
In your implementation you have already added
readInto: aCollection startingAt: startIndex count: anInteger
and
next: count into: aString startingAt: startIndex
to the converters, which makes them even more like streams.

So what is stopping you from making an abstract, generic XtreamWrapper class
with a number of subclasses (LatinConversionStream,
UnicodeConversionStream, etc.), as well as a BufferedWrapper?

That would cost us one message dispatch fewer in
a := stream next.

In your model you have:

(converter stream) -> (converter) -> basic stream

whereas with a wrapper it would be just:

(converter wrapper) -> basic stream

> Though current XTream is a quick hack reusing Yoshiki TextConverter,
> it already demonstrates possible gains coming from buffering.
> The speed comes from applying the utf8ToSqueak/squeakToUtf8 trick: copy
> large ASCII-encoded portions verbatim.
> This works very well with Squeak source because 99.99% of characters are ASCII.
>
>>
>> Wrappers are more flexible than TextConverters, since they are
>> not obliged to convert to/from text-based collections only.
>> For example, we can use the same API for wrapping with a ZIP stream:
>>
>> myUnpackedData := (someBaseStream wrapWith: ZIPReaderStream) contents.
>>
>> and many other (ab)uses.. Like reading changeset chunks:
>>
>> nextChunk := (fileStream wrapWith: ChunkReaderStream) next.
>>
>
> Yes, that fits my intentions.
> What I want is to preserve buffered operations along the chain, and
> avoid byte-by-byte conversions when possible.
>

Buffering is just a wrapper. Btw, again, why don't you provide a
generic wrapper class which everyone can subclass?

bufferedStream := anyStreamClass buffered

(buffered wrapper) -> (anyStreamClass)

I don't see where else you would need to care about buffering explicitly
in anyStreamClass.

And how can you avoid byte-by-byte conversion in UTF-8? It has to
iterate over the bytes to determine the characters anyway.
But sure, nothing prevents you from buffering things like this:

reader := anyStream buffered wrapWith: UTF8Reader.
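
To make the "iterate over bytes" point concrete, here is a small workspace sketch of UTF-8 decoding that works on integers read from a binary stream and only creates a Character once the code point is complete. It is an illustration, not the UTF8TextConverter or XTream implementation: it assumes well-formed input, does no validation of continuation bytes, and the sample bytes are made up for the example.

```smalltalk
"Sketch only: decode UTF-8 from a binary stream using integer arithmetic.
 Assumes well-formed input; continuation bytes are not validated."
| bytes in out byte code extra |
bytes := #[72 105 32 195 169 33].            "UTF-8 bytes for: H i space e-acute !"
in := ReadStream on: bytes.
out := WriteStream on: (WideString new: 16).
[in atEnd] whileFalse: [
    byte := in next.                          "an integer, no Character yet"
    byte < 16r80
        ifTrue: [code := byte. extra := 0]    "plain ASCII, one byte"
        ifFalse: [
            byte < 16rE0
                ifTrue: [code := byte bitAnd: 16r1F. extra := 1]
                ifFalse: [
                    byte < 16rF0
                        ifTrue: [code := byte bitAnd: 16r0F. extra := 2]
                        ifFalse: [code := byte bitAnd: 16r07. extra := 3]]].
    "Each continuation byte contributes its low six bits."
    extra timesRepeat: [
        code := (code bitShift: 6) + (in next bitAnd: 16r3F)].
    out nextPut: (Character value: code)].
out contents
```

The loop still visits every byte, but the ASCII branch does no arithmetic beyond one comparison, which is why a long run of ASCII bytes can instead be copied in bulk from a buffer before falling back to this per-byte path.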




--
Best regards,
Igor Stasenko AKA sig.


Re: Streams. Status and where to go?

Nicolas Cellier
2010/2/26 Igor Stasenko <[hidden email]>:

> On 26 February 2010 18:59, Nicolas Cellier
> <[hidden email]> wrote:
>> 2010/2/26 Igor Stasenko <[hidden email]>:
>>> Hello, Nicolas.
>>
>> Hi igor.
>> You should load it in trunk.
>>
> Ah, i think my image is a bit outdated then.
>
>>> I want to try it out.
>>> I tried to load it (XTream-Core) into my image, and it bug me about
>>> unresolved dependencies:
>>> ----
>>> This package depends on the following classes:
>>>  ByteTextConverter
>>> You must resolve these dependencies before you will be able to load
>>> these definitions:
>>>  ByteTextConverter>>nextFromXtream:
>>>  ByteTextConverter>>nextPut:toXtream:
>>>  ByteTextConverter>>readInto:startingAt:count:fromXtream:
>>> ----
>>> I ignored these warnings, pressing continue, and here what it warns
>>> about in my trunk image:
>>>
>>> TextConverter>>next:putAll:startingAt:toXtream: (latin1Map is Undeclared)
>>> TextConverter>>next:putAll:startingAt:toXtream: (latin1Encodings is Undeclared)
>>> TextConverter>>readInto:startingAt:count:fromXtream: (latin1Map is Undeclared)
>>>
>>> Is ByteTextConverter a Pharo-specific class?
>>>
>>
>> This is a refactoring of TextConverter I made in trunk.
>> Pharo did the same before me (it comes from Sophie), but I missed it
>> unfortunately...
>>
>>> If you have seen my previous message, i think you noticed that
>>> the XXXTextConverters are abominations (IMO), and should be reimplemented
>>> as wrapping streams instead.
>>> Would you be willing to change that in XStreams? I mean implementing a
>>> conversion streams model, which can wrap around any other stream,
>>> like:
>>>
>>> myStream := UTFReaderStream on: otherStream.
>>> myString := myStream contents.
>>>
>>> or using other way:
>>>
>>> myString := (someBaseStream wrapWith: UTFReaderStream) contents.
>>>
>>> or..
>>> myDecodedString := (someBaseStream wrapWith: (DecodingStreams
>>> decoderFor: myEncoding)) contents.
>>>
>>> That would be much nicer than using converters.
>>
>> Currently, I have a ConverterReadXtream and a ConverterWriteXtream
>> which are stream wrappers.
>> They use old TextConverter to do the real job, but I agree, a full
>> rewrite of this one is needed.
>> However, I would like to keep these two layers for Stream composition:
>> - the generic converter stream
>> - the conversion algorithm
>>
>
> Why?
> In your implementation you already added the
> readInto: aCollection startingAt: startIndex count: anInteger
> and
> next: count into: aString startingAt: startIndex
> into converters, which makes them even more like streams.
>

Yes, you may be right.
Maybe my ratio of innovating to reusing was a bit low :)

> So, what is stopping you from making an abstract, generic XtreamWrapper class,
> and then a number of subclasses (LatinConversionStream,
> UnicodeConversionStream etc.),

Yes, that's possible. But it's already what ConverterReadXtream and
ConverterWriteXtream are.

> as well as BufferedWrapper?
>

BufferedReadXtream and BufferedWriteXtream are already generic. It's
just that I have separated reading and writing...
So you will find:

ReadXtream>>buffered
        ^(BufferedReadXtream new)
                contentsSpecies: self contentsSpecies bufferSize: self preferredBufferSize;
                source: self

Or do you mean a single buffer for both reading and writing?
That would look more like the original VW, I think.

> So, it will cost us one less message dispatch in
> a := stream next.
>
> In your model you have:
>
> (converter stream) -> (converter) -> basic stream
>
> while with a wrapper it will be just:
> (converter wrapper) -> basic stream
>

I must rethink why I chose this additional indirection...
Maybe it was just for the sake of reuse...
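
For comparison, the direct "(converter wrapper) -> basic stream" shape could
look roughly like the sketch below. It is hypothetical and not code from
XTream: the class name Utf8ReadWrapperSketch and its selectors are invented
for illustration. It assumes a binary source stream that answers
SmallIntegers and nil at end, and it does no validation of malformed UTF-8.

```smalltalk
"Hypothetical sketch, not part of XTream: a UTF-8 decoding wrapper that reads
 bytes (SmallIntegers) directly from a binary source stream.
 Workspace usage: (Utf8ReadWrapperSketch on: #[195 169] readStream) next
 answers Character value: 233."
Object subclass: #Utf8ReadWrapperSketch
	instanceVariableNames: 'source'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Sketch-Streams'!

!Utf8ReadWrapperSketch class methodsFor: 'instance creation'!
on: aBinaryStream
	^self new setSource: aBinaryStream! !

!Utf8ReadWrapperSketch methodsFor: 'private'!
setSource: aBinaryStream
	source := aBinaryStream! !

!Utf8ReadWrapperSketch methodsFor: 'reading'!
next
	"Answer the next decoded Character, or nil at end of the source.
	 Assumes well-formed input: no check for truncated or invalid sequences."
	| byte extra code |
	byte := source next.
	byte isNil ifTrue: [^nil].
	"One-byte (ASCII) case: the byte is the code point."
	byte < 16r80 ifTrue: [^Character value: byte].
	"The lead byte tells how many continuation bytes follow."
	byte < 16rE0
		ifTrue: [extra := 1. code := byte bitAnd: 16r1F]
		ifFalse: [byte < 16rF0
			ifTrue: [extra := 2. code := byte bitAnd: 16r0F]
			ifFalse: [extra := 3. code := byte bitAnd: 16r07]].
	"Each continuation byte contributes its low 6 bits."
	extra timesRepeat: [
		code := (code bitShift: 6) + (source next bitAnd: 16r3F)].
	^Character value: code! !
```

Here the bytes never become Characters before decoding, and one message send
per character is saved. The price is that the decoding algorithm can no
longer be reused apart from the stream.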

>
>> Though the current XTream is a quick hack reusing Yoshiki's TextConverter,
>> it already demonstrates the possible gains from buffering.
>> The speed comes from applying the utf8ToSqueak/squeakToUtf8 trick: copy
>> large ASCII-encoded portions verbatim.
>> This works very well with Squeak source because 99.99% of characters are ASCII.
>>
>>>
>>> Wrappers are more flexible than TextConverters, since they are
>>> not obliged to convert to/from text-based collections only.
>>> For example, we can use the same API for wrapping with a ZIP stream:
>>>
>>> myUnpackedData := (someBaseStream wrapWith: ZIPReaderStream) contents.
>>>
>>> and many other (ab)uses.. Like reading changeset chunks:
>>>
>>> nextChunk := (fileStream wrapWith: ChunkReaderStream) next.
>>>
>>
>> Yes, that fits my intentions.
>> What I want is to preserve buffered operations along the chain, and
>> avoid byte-by-byte conversions when possible.
>>
>
> Buffering is just a wrapper. Btw, again, why don't you provide a
> generic wrapper class which everyone can subclass?
>
> bufferedStream := anyStreamClass buffered
>
> (buffered wrapper) -> (anyStreamClass)
>

See above, it's just split into BufferedReadXtream and BufferedWriteXtream.

Or see this example (a bit heavy):
 tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles at: 2) name)
      readXtream ascii buffered
      decodeWith: (UTF8TextConverter new installLineEndConvention: nil)) buffered.


> i don't see where else you would need to care about buffering explicitly in
> anyStreamClass.
>
> And how can you avoid byte-by-byte conversion in utf8? It has to
> iterate over the bytes to determine the characters anyway.

True, every byte still has to be looked at. It is faster because you
scan for the first non-ASCII byte with a primitive, then copy the whole
ASCII chunk with the replaceFrom:to:with:startingAt: primitive.

Of course, if you handle Cyrillic files, this strategy won't
be efficient. It only pays off for ASCII-dominated files.
UTF-8 itself would not be an optimal choice for Cyrillic anyway...
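
A minimal workspace sketch of that scan-then-copy idea, under stated
assumptions: it uses ByteString class>>findFirstInString:inSet:startingAt:
as found in Squeak at the time, the variable names are invented for
illustration, and it handles only the leading ASCII run. It is not the
actual utf8ToSqueak or XTream code.

```smalltalk
"Sketch of the ASCII fast path. utf8 is a ByteString holding UTF-8 bytes:
 three ASCII bytes followed by the two-byte sequence 195 169."
| utf8 nonAsciiMap firstNonAscii asciiCount prefix |
utf8 := 'abc' , (String
	with: (Character value: 195)
	with: (Character value: 169)).

"Inclusion map for the scan: slot i stands for byte value i - 1,
 so slots 129..256 mark the byte values 128..255."
nonAsciiMap := ByteArray new: 256.
129 to: 256 do: [:i | nonAsciiMap at: i put: 1].

"Primitive-backed scan: answers the index of the first non-ASCII byte, or 0 if none."
firstNonAscii := ByteString
	findFirstInString: utf8
	inSet: nonAsciiMap
	startingAt: 1.
asciiCount := firstNonAscii = 0
	ifTrue: [utf8 size]
	ifFalse: [firstNonAscii - 1].

"Copy the whole ASCII run in one call instead of character by character."
prefix := String new: asciiCount.
prefix replaceFrom: 1 to: asciiCount with: utf8 startingAt: 1.
```

Only the bytes after the ASCII run need the per-character decoder. A full
decoder would repeat the scan after each non-ASCII character.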

> But sure thing, nothing prevents you from buffering things in a way like:
>
> reader := anyStream buffered wrapWith: UTF8Reader.
>

My example above is equivalent to:

reader := (anyStream buffered wrapWith: UTF8Reader) buffered.

Then even if I only send reader next, a whole buffer of UTF-8 is converted
at once (presumably in large chunks)

>>>
>>> On 25 February 2010 21:19, Nicolas Cellier
>>> <[hidden email]> wrote:
>>>> 2010/2/25 Igor Stasenko <[hidden email]>:
>>>>> Hello,
>>>>>
>>>>> i am cross-posting, since i think it is good for all of us to agree on
>>>>> some common points.
>>>>>
>>>>> 1. Streams needs to be rewritten.
>>>>> 2. What do you think is good replacement for current Streams?
>>>>>
>>>>> personally, i currently need a fast and concise UTF8 reader.
>>>>> The UTF8TextConverter is closest thing what i would take, but i don't
>>>>> understand, why
>>>>> it implemented as a non-stream?
>>>>>
>>>>> The #nextFromStream:
>>>>> and #nextPut:toStream:
>>>>> crying out of loud to be just
>>>>> #next
>>>>> and
>>>>> #nextPut:
>>>>>
>>>>> Another thing which makes me sad is this line:
>>>>>
>>>>> nextFromStream: aStream
>>>>>
>>>>>        | character1 value1 character2 value2 unicode character3 value3
>>>>> character4 value4 |
>>>>>        aStream isBinary ifTrue: [^ aStream basicNext].   <<<<<<<
>>>>>
>>>>>
>>>>> All external streams is initially binary , but UTF8TextConverter wants
>>>>> to play with characters, instead of octets..
>>>>> But hey... UTF8 encoding is exactly about encoding unicode characters
>>>>> into binary form..
>>>>> I'm not even mentioning that operating with bytes (smallints) is times
>>>>> more efficient than operating with characters (objects), because first
>>>>> thing it does:
>>>>>
>>>>>        character1 := aStream basicNext.  " a #basicNext, obviously, reads a
>>>>> byte from somewhere and then converts it to instance of Character.
>>>>> 'Bonus' overhead here. "
>>>>>        character1 isNil ifTrue: [^ nil].
>>>>>        value1 := character1 asciiValue.  " and... what a surprise, we
>>>>> converting a character back to integer value.. What a waste! "
>>>>>        value1 <= 127 ifTrue: [
>>>>>
>>>>> I really hope, that eventually we could have a good implementation,
>>>>> where horse runs ahead of cart, not cart ahead of horse :)
>>>>> Meanwhile i think i have no choice but make yet-another implementation
>>>>> of utf8 reader in my own package, instead of using existing one.
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Igor Stasenko AKA sig.
>>>>>
>>>>
>>>> Obviously right. encoded in bytes, decoded in Characters.
>>>>
>>>> Some ideas are also being tried out at http://www.squeaksource.com/XTream.html
>>>> Sorry I hijacked the VW name...
>>>> You can download it; it coexists peacefully with Stream.
>>>>
>>>> - use endOfStreamAction instead of Exception... That means abandoning the
>>>> next and nextPut: primitives (no real performance impact, and expect a
>>>> boost in future COG).
>>>> - separate CollectionReadStream=concrete class, ReadStream=abstract class
>>>> - use a wrapper rather than a subclass for MultiByteFileStream
>>>> - implement sequenceable collection API
>>>> - buffer I/O (mostly in Squeak thanks Levente)
>>>>
>>>> Of course, alternate ideas to pick from Nile, VW XTream, gst generators etc...
>>>>
>>>> I think mutating existing library is doable (just a bit tricky because
>>>> both Compiler and source code management use Stream extensively...).
>>>>
>>>> Nicolas
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Igor Stasenko AKA sig.
>>>
>>
>> XTream-Tests gives some usage patterns.
>> Here are also some timings on various machines, just to check efficiency.
>> Though XTream does not use any next/nextPut: primitive, it competes quite well.
>>
>>
>> | str |
>> str := String new: 1000 withAll: $a.
>> {
>> [str readStream upToEnd] bench.
>> [str readXtream upToEnd] bench.
>> }
>> #('583247.75044991 per second.' '597688.862227554 per second.')
>> #('221266.5466906619 per second.' '221899.4201159768 per second.')
>> #('218044.1911617676 per second.' '220044.1911617676 per second.')
>> #('190631.7473010796 per second.' '192736.452709458 per second.')
>>
>> | str |
>> str := String new: 1000 withAll: $a.
>> {
>> [str readStream upTo: $b] bench.
>> [str readXtream upTo: $b] bench.
>> }
>> #('125180.9638072386 per second.' '126922.0155968806 per second.')
>> #('120683.8632273545 per second.' '123071.1857628474 per second.')
>> #('105943.4113177364 per second.' '107742.851429714 per second.')
>>
>>
>> | str |
>> str := String new: 1000 withAll: $a.
>> {
>> [str readStream upToAnyOf: (CharacterSet crlf)] bench.
>> [str readXtream upToAnyOf: (CharacterSet crlf)] bench.
>> }
>> #('112977.2045590882 per second.' '112393.3213357328 per second.')
>> #('108469.9060187962 per second.' '108042.9914017197 per second.')
>> #('91692.0615876825 per second.' '92319.1361727654 per second.')
>>
>> | str |
>> str := String new: 1000 withAll: $a.
>> {
>> [| tmp | tmp := str readStream. [tmp next==nil] whileFalse] bench.
>> [| tmp | tmp := str readXtream. [tmp next==nil] whileFalse] bench.
>> [| tmp | tmp := str readXtream. tmp do: [:e | ]] bench.
>> }
>> #('10452.10957808438 per second.' '6419.11617676465 per second.'
>> '2384.323135372925 per second.')
>> #('9799.2401519696 per second.' '6436.712657468506 per second.'
>> '2171.765646870626 per second.')
>> #('10475.7048590282 per second.' '4569.08618276345 per second.'
>> '1989.202159568086 per second.')
>>
>> | str |
>> str := String new: 80000 withAll: $a.
>> {
>> [| tmp | tmp := str readStream. [tmp next==nil] whileFalse] bench.
>> [| tmp | tmp := str readXtream. [tmp next==nil] whileFalse] bench.
>> [| tmp | tmp := str readXtream. tmp do: [:e | ]] bench.
>> }
>> #('131.1737652469506 per second.' '81.1026767878546 per second.'
>> '29.96404314822213 per second.')
>> #('132.388178913738 per second.' '81.701957650819 per second.'
>> '27.44084310996222 per second.')
>>
>> | str |
>> str := String new: 1000 withAll: $a.
>> {
>> [str readStream upToAll: 'ab'] bench.
>> [str readXtream upToAll: 'ab'] bench.
>> }
>> #('514.297140571886 per second.' '633.473305338932 per second.')
>> #('511.795281887245 per second.' '561.487702459508 per second.')
>> #('513.497300539892 per second.' '557.48850229954 per second.')
>>
>> | str |
>> str := String new: 1000 withAll: $a.
>> {
>> [str readStream upToAll: 'aab'] bench.
>> [str readXtream upToAll: 'aab'] bench.
>> }
>> #('892.021595680864 per second.' '1427.914417116577 per second.')
>> #('388.122375524895 per second.' '521.991203518593 per second.')
>> #('394.5632620427743 per second.' '539.892021595681 per second.')
>> #('384.6461415433827 per second.' '476.2095161935226 per second.')
>> #('382.846861255498 per second.' '475.9048190361927 per second.')
>>
>> {
>> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
>> at: 2) name). [tmp next==nil] whileFalse. tmp close] timeToRun.
>> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
>> at: 2) name) readXtream buffered. [tmp next==nil] whileFalse. tmp
>> close] timeToRun.
>> }
>> #(1639 1491)
>> #(3121 2892)
>> #(3213 2799)
>> #(2591 2115)
>> #(2146 2030) #(2153 1988) #(2770 2574) #(2319 2089) #(2141 1927) #(27008 1947)
>>
>> {
>> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
>> at: 2) name) ascii.
>>        [tmp upTo: Character cr. tmp atEnd] whileFalse. tmp close] timeToRun.
>> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
>> at: 2) name) readXtream ascii buffered.
>>        [tmp upTo: Character cr. tmp atEnd] whileFalse. tmp close] timeToRun.
>> }
>> #(8779 566)
>> #(6418 1182)
>> #(6084 1076)
>> #(4647 856)
>> #(4742 881) #(4332 818) #(4859 855) #(4503 1563) #(4347 816) #(4026
>> 835) #(4285 821)
>>
>> {
>> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
>> at: 2) name) ascii.
>>        [tmp nextLine == nil] whileFalse. tmp close] timeToRun.
>> MessageTally spyOn: [[| tmp | tmp := (StandardFileStream
>> readOnlyFileNamed: (SourceFiles at: 2) name) readXtream ascii
>> buffered.
>>        [tmp nextLine == nil] whileFalse. tmp close] timeToRun].
>> }
>> #(2088 1996) #(1920 1814) #(1589 1537) #(1631 1514) #(1587 1449)
>> #(1490 1434) #(1567 1667) #(1807 1777) #(1785 2159) #(1802 2147)
>>
>> MessageTally spyOn: [| tmp | tmp := (StandardFileStream
>> readOnlyFileNamed: (SourceFiles at: 2) name) readXtream ascii
>> buffered.
>>        [tmp upTo: Character cr. tmp atEnd] whileFalse. tmp close]
>> .
>> {
>> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
>> at: 2) name) ascii.
>>        [tmp upToAnyOf: (CharacterSet crlf). tmp atEnd] whileFalse. tmp
>> close] timeToRun.
>> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
>> at: 2) name) readXtream ascii buffered.
>>        [tmp upToAnyOf: (CharacterSet crlf). tmp atEnd] whileFalse. tmp
>> close] timeToRun.
>> }
>> #(9153 665)
>> #(6463 1251)
>> #(5028 996) #(5076 1051) #(5223 949) #(4898 1073) #(5130 1610) #(5092
>> 1776) #(4798 878) #(4757 956) #(5499 1405) #(14522 954) #(75895 1003)
>>
>>
>> {
>> [| tmp | tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles
>> at: 2) name) ascii; wantsLineEndConversion: false; converter:
>> UTF8TextConverter new.
>>        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
>> [| tmp atEnd | tmp := (StandardFileStream readOnlyFileNamed:
>> (SourceFiles at: 2) name) readXtream ascii buffered decodeWith:
>> UTF8TextConverter new.
>>        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
>> }
>> #(332 183)
>> #(558 422) #(678 421) #(686 420) #(675 423) #(673 423) #(662 410)
>> #(681 558) #(674 550) #(674 928) #(694 1043) #(1668 1112)
>>
>>
>> {
>> MessageTally spyOn: [[| tmp | tmp := (MultiByteFileStream
>> readOnlyFileNamed: (SourceFiles at: 2) name) ascii;
>> wantsLineEndConversion: false; converter: UTF8TextConverter new.
>>        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun].
>> MessageTally spyOn: [[| tmp | tmp := (StandardFileStream
>> readOnlyFileNamed: (SourceFiles at: 2) name) readXtream ascii buffered
>> decodeWith: UTF8TextConverter new.
>>        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun].
>> }
>> #(349 189)
>> #(577 458) #(595 487)
>> #(574 438) #(699 444) #(714 457) #(722 449) #(724 438) #(692 572)
>> #(707 698) #(707 693) #(689 670) #(691 663) #(726 957) #(714 1105)
>> #(724 1150) #(1765 1098)
>>
>> {
>> [| tmp | tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles
>> at: 2) name) ascii; wantsLineEndConversion: false; converter:
>> UTF8TextConverter new.
>>      1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
>> [| tmp | tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles
>> at: 2) name) readXtream ascii buffered decodeWith: (UTF8TextConverter
>> new installLineEndConvention: nil)) buffered.
>>      1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
>> }
>> #(318 14)
>> #(558 38) #(559 44) #(579 43)#(540 32)
>> #(701 34) #(694 36)
>>
>>
>> MessageTally spyOn: [
>> | string1 converter |
>> string1 := 'à ta santé mon brave' squeakToUtf8.
>> converter := UTF8TextConverter new installLineEndConvention: nil.
>> {
>>        [string1 utf8ToSqueak] bench.
>>        [(string1 readXtream decodeWith: converter) upToEnd] bench.
>>        [(string1 readXtream decodeWith: converter) buffered upToEnd] bench.
>> }
>> ]
>> #('99488.1023795241 per second.' '27299.1401719656 per second.'
>> '17217.55648870226 per second.')
>> #('106710.2579484103 per second.' '30986.6026794641 per second.'
>> '21273.1453709258 per second.')
>> #('108047.7904419116 per second.' '31168.56628674265 per second.'
>> '21107.17856428714 per second.')
>> #('96647.2705458908 per second.' '28705.25894821036 per second.'
>> '19899.4201159768 per second.')
>> #('95075.9848030394 per second.' '32338.5322935413 per second.'
>> '20242.95140971806 per second.')
>>
>> MessageTally spyOn: [
>> | string1 converter |
>> string1 := 'This ASCII string should not be hard to decode' squeakToUtf8.
>> converter := UTF8TextConverter new installLineEndConvention: nil.
>> {
>>        [string1 utf8ToSqueak] bench.
>>        [(string1 readXtream decodeWith: converter) upToEnd] bench.
>>        [(string1 readXtream decodeWith: converter) buffered upToEnd] bench.
>> }
>> ]
>> #('810708.458308338 per second.' '15476.30473905219 per second.'
>> '24907.81843631274 per second.')
>> #('1.044100979804039e6 per second.' '18131.57368526295 per second.'
>> '40563.0873825235 per second.')
>>
>>
>> {
>> [|ws |
>>       ws := (String new: 10000) writeStream.
>>       1 to: 20000 do: [:i | ws nextPut: $0]] bench.
>> [| ws |
>>       ws := (String new: 10000) writeXtream.
>>       1 to: 20000 do: [:i | ws nextPut: $0]] bench.
>> }
>> #('442.7114577084583 per second.' '359.3281343731254 per second.')
>> #('178.4929042574455 per second.' '130.7738452309538 per second.')
>> #('182.490505696582 per second.' '131.1475409836065 per second.')
>> #('85.4291417165669 per second.' '128.8453855373552 per second.')
>> #('86.4789294987018 per second.' '128.374325134973 per second.')
>>
>>
>
>
>
> --
> Best regards,
> Igor Stasenko AKA sig.
>


Re: Streams. Status and where to go?

Stéphane Ducasse
Thanks for the discussion
Continue :)) this is cool

Stef

On Feb 26, 2010, at 8:30 PM, Nicolas Cellier wrote:

> 2010/2/26 Igor Stasenko <[hidden email]>:
>> On 26 February 2010 18:59, Nicolas Cellier
>> <[hidden email]> wrote:
>>> 2010/2/26 Igor Stasenko <[hidden email]>:
>>>> Hello, Nicolas.
>>>
>>> Hi igor.
>>> You should load it in trunk.
>>>
>> Ah, i think my image is a bit outdated then.
>>
>>>> I want to try it out.
>>>> I tried to load it (XTream-Core) into my image, and it bug me about
>>>> unresolved dependencies:
>>>> ----
>>>> This package depends on the following classes:
>>>>  ByteTextConverter
>>>> You must resolve these dependencies before you will be able to load
>>>> these definitions:
>>>>  ByteTextConverter>>nextFromXtream:
>>>>  ByteTextConverter>>nextPut:toXtream:
>>>>  ByteTextConverter>>readInto:startingAt:count:fromXtream:
>>>> ----
>>>> I ignored these warnings, pressing continue, and here what it warns
>>>> about in my trunk image:
>>>>
>>>> TextConverter>>next:putAll:startingAt:toXtream: (latin1Map is Undeclared)
>>>> TextConverter>>next:putAll:startingAt:toXtream: (latin1Encodings is Undeclared)
>>>> TextConverter>>readInto:startingAt:count:fromXtream: (latin1Map is Undeclared)
>>>>
>>>> Is ByteTextConverter a Pharo-specific class?
>>>>
>>>
>>> This is a refactoring of TextConverter I made in trunk.
>>> Pharo did the same before me (it comes from Sophie), but I missed it
>>> unfortunately...
>>>
>>>> If you seen my previous message, i think you noticed that
>>>> XXXTextConverter is abdominations (IMO), and should be reimplemented
>>>> as a wrapping-streams instead.
>>>> Would you be willing to change that in XStreams? I mean implementing a
>>>> conversion streams model, which can wrap around any other stream,
>>>> like:
>>>>
>>>> myStream := UTFReaderStream on: otherStream.
>>>> myString := myStream contents.
>>>>
>>>> or using other way:
>>>>
>>>> myString := (someBaseStream wrapWith: UTFReaderStream) contents.
>>>>
>>>> or..
>>>> myDecodedString := (someBaseStream wrapWith: (DecodingStreams
>>>> decoderFor: myEncoding)) contents.
>>>>
>>>> That would be much nicer than using converters.
>>>
>>> Currently, I have a ConverterReadXtream and a ConverterWriteXtream
>>> which are stream wrappers.
>>> They use old TextConverter to do the real job, but I agree, a full
>>> rewrite of this one is needed.
>>> However, I would like to keep these two layers for Stream composition:
>>> - the generic converter stream
>>> - the conversion algorithm
>>>
>>
>> Why?
>> In your implementation you already added the
>> readInto: aCollection startingAt: startIndex count: anInteger
>> and
>> next: count into: aString startingAt: startIndex
>> into converters, which makes them even more like streams.
>>
>
> Yes, you may be right.
> Maybe my ratio innovating/reusing was a bit low :)
>
>> So, what stopping you from making an abstract, generic XtreamWrapper class,
>> and then a number of subclasses  (LatinConversionStream ,
>> UnicodeConversionStream etc),
>
> Yes, that's possible. But it's already what ConverterReadXtream and
> ConverterWriteXtream are.
>
>> as well as BufferedWrapper?
>>
>
> BufferedReadXtream and BufferedWriteXtream already are generic. It's
> just that I have separated read and write...
> So you will find ReadXtream>>buffered
> ^(BufferedReadXtream new)
> contentsSpecies: self contentsSpecies bufferSize: self preferredBufferSize;
> source: self
>
> Or do you mean a single Buffer for read/write ?
> That would look more like original VW I think.
>
>> So, it will cost us 1 less message dispatch in
>> a := stream next.
>>
>> In your model you having:
>>
>> (converter stream) -> (converter) -> basic stream
>>
>> while if using wrapper it will be just:
>> (converter wrapper) -> basic stream
>>
>
> I must re-think why I made this decision of additional indirection...
> Maybe it was just reusing...
>
>>
>>> Though current XTream is a quick hack reusing Yoshiki TextConverter,
>>> it already demonstrates possible gains coming from buffering.
>>> The speed comes from,applying utf8ToSqueak, squeakToUtf8 trick: copy
>>> large ASCII encoded portions verbatim.
>>> This works very well with squeak source because 99,99% of characters are ASCII.
>>>
>>>>
>>>> Wrappers is more flexible comparing to TextConverters, since they are
>>>> not obliged to convert to/from text-based collections only.
>>>> For example, we can use same API for wrapping with ZIP stream:
>>>>
>>>> myUnpackedData := (someBaseStream wrapWith: ZIPReaderStream) contents.
>>>>
>>>> and many other (ab)uses.. Like reading changeset chunks:
>>>>
>>>> nextChunk := (fileStream wrapWith: ChunkReaderStream) next.
>>>>
>>>
>>> Yes, that fits my intentions.
>>> What I want is to preserve buffered operations along the chain, and
>>> avoid byte-by-byte conversions when possible.
>>>
>>
>> Buffering is just a wrapper. Btw, again, why you don't providing a
>> generic wrapper class which everyone can subclass from?
>>
>> bufferedStream := anyStreamClass buffered
>>
>> (buffered wrapper) -> (anyStreamClass)
>>
>
> See above, it's just split in BufferedRead/WriteXtream
>
> Or see the example (a bit heavy)
> tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles at: 2) name)
>      readXtream ascii buffered decodeWith: (UTF8TextConverter new
> installLineEndConvention: nil)) buffered.
>
>
>> i don't see where else you should care of buffering explicitly in
>> anyStreamClass.
>>
>> And, how you can avoid byte-by-byte conversion in utf8? It should
>> iterate over bytes to determine the characters anyways.
>
> True, it is faster because you scan fast with a primitive,
> then copy a whole chunk with replaceFrom:to:with:startingAt: primitive
>
> Of course, if you handle some cyrillic files, then this strategy won't
> be efficient. It just work in ASCII dominated files.
> UTF8 itself would not be an optimal choice for cyrillic anyway...
>
>> But sure thing, nothing prevents you from buffering things in a way like:
>>
>> reader := anyStream buffered wrapWith: UTF8Reader.
>>
>
> My above example is just equivalent to:
>
> reader := (anyStream buffered wrapWith: UTF8Reader) buffered.
>
> Then even if I use reader next, a whole buffer of UTF8 is converted
> (presumably by large chunks)
>
>>>>
>>>> On 25 February 2010 21:19, Nicolas Cellier
>>>> <[hidden email]> wrote:
>>>>> 2010/2/25 Igor Stasenko <[hidden email]>:
>>>>>> Hello,
>>>>>>
>>>>>> i am cross-posting, since i think it is good for all of us to agree on
>>>>>> some common points.
>>>>>>
>>>>>> 1. Streams needs to be rewritten.
>>>>>> 2. What do you think is good replacement for current Streams?
>>>>>>
>>>>>> personally, i currently need a fast and concise UTF8 reader.
>>>>>> The UTF8TextConverter is closest thing what i would take, but i don't
>>>>>> understand, why
>>>>>> it implemented as a non-stream?
>>>>>>
>>>>>> The #nextFromStream:
>>>>>> and #nextPut:toStream:
>>>>>> crying out of loud to be just
>>>>>> #next
>>>>>> and
>>>>>> #nextPut:
>>>>>>
>>>>>> Another thing which makes me sad is this line:
>>>>>>
>>>>>> nextFromStream: aStream
>>>>>>
>>>>>>        | character1 value1 character2 value2 unicode character3 value3
>>>>>> character4 value4 |
>>>>>>        aStream isBinary ifTrue: [^ aStream basicNext].   <<<<<<<
>>>>>>
>>>>>>
>>>>>> All external streams is initially binary , but UTF8TextConverter wants
>>>>>> to play with characters, instead of octets..
>>>>>> But hey... UTF8 encoding is exactly about encoding unicode characters
>>>>>> into binary form..
>>>>>> I'm not even mentioning that operating with bytes (smallints) is times
>>>>>> more efficient than operating with characters (objects), because first
>>>>>> thing it does:
>>>>>>
>>>>>>        character1 := aStream basicNext.  " a #basicNext, obviously, reads a
>>>>>> byte from somewhere and then converts it to instance of Character.
>>>>>> 'Bonus' overhead here. "
>>>>>>        character1 isNil ifTrue: [^ nil].
>>>>>>        value1 := character1 asciiValue.  " and... what a surprise, we
>>>>>> converting a character back to integer value.. What a waste! "
>>>>>>        value1 <= 127 ifTrue: [
>>>>>>
>>>>>> I really hope, that eventually we could have a good implementation,
>>>>>> where horse runs ahead of cart, not cart ahead of horse :)
>>>>>> Meanwhile i think i have no choice but make yet-another implementation
>>>>>> of utf8 reader in my own package, instead of using existing one.
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Igor Stasenko AKA sig.
>>>>>>
>>>>>
>>>>> Obviously right. encoded in bytes, decoded in Characters.
>>>>>
>>>>> There are also ideas experimented at http://www.squeaksource.com/XTream.html
>>>>> Sorry I hijacked VW name...
>>>>> You can download it, it coexist pacificly with Stream.
>>>>>
>>>>> - use endOfStreamAction instead of Exception... That means abandonning
>>>>> primitives next nextPut: (no real performance impact, and expect a
>>>>> boost in future COG).
>>>>> - separate CollectionReadStream=concrete class, ReadStream=abstract class
>>>>> - use a wrapper rather than a subclass for MultiByteFileStream
>>>>> - implement sequenceable collection API
>>>>> - buffer I/O (mostly in Squeak thanks Levente)
>>>>>
>>>>> Of course, alternate ideas to pick from Nile, VW XTream, gst generators etc...
>>>>>
>>>>> I think mutating existing library is doable (just a bit tricky because
>>>>> both Compiler and source code management use Stream extensively...).
>>>>>
>>>>> Nicolas
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Igor Stasenko AKA sig.
>>>>
>>>
>>> XTream-Tests gives some usage pattern.
>>> Here are also some timings on various machines just to check efficiency:
>>> Though XTream does not use any next/nextPut: primitive, it competes quite well.
>>>
>>>
>>> | str |
>>> str := String new: 1000 withAll: $a.
>>> {
>>> [str readStream upToEnd] bench.
>>> [str readXtream upToEnd] bench.
>>> }
>>> #('583247.75044991 per second.' '597688.862227554 per second.')
>>> #('221266.5466906619 per second.' '221899.4201159768 per second.')
>>> #('218044.1911617676 per second.' '220044.1911617676 per second.')
>>> #('190631.7473010796 per second.' '192736.452709458 per second.')
>>>
>>> | str |
>>> str := String new: 1000 withAll: $a.
>>> {
>>> [str readStream upTo: $b] bench.
>>> [str readXtream upTo: $b] bench.
>>> }
>>> #('125180.9638072386 per second.' '126922.0155968806 per second.')
>>> #('120683.8632273545 per second.' '123071.1857628474 per second.')
>>> #('105943.4113177364 per second.' '107742.851429714 per second.')
>>>
>>>
>>> | str |
>>> str := String new: 1000 withAll: $a.
>>> {
>>> [str readStream upToAnyOf: (CharacterSet crlf)] bench.
>>> [str readXtream upToAnyOf: (CharacterSet crlf)] bench.
>>> }
>>> #('112977.2045590882 per second.' '112393.3213357328 per second.')
>>> #('108469.9060187962 per second.' '108042.9914017197 per second.')
>>> #('91692.0615876825 per second.' '92319.1361727654 per second.')
>>>
>>> | str |
>>> str := String new: 1000 withAll: $a.
>>> {
>>> [| tmp | tmp := str readStream. [tmp next==nil] whileFalse] bench.
>>> [| tmp | tmp := str readXtream. [tmp next==nil] whileFalse] bench.
>>> [| tmp | tmp := str readXtream. tmp do: [:e | ]] bench.
>>> }
>>> #('10452.10957808438 per second.' '6419.11617676465 per second.'
>>> '2384.323135372925 per second.')
>>> #('9799.2401519696 per second.' '6436.712657468506 per second.'
>>> '2171.765646870626 per second.')
>>> #('10475.7048590282 per second.' '4569.08618276345 per second.'
>>> '1989.202159568086 per second.')
>>>
>>> | str |
>>> str := String new: 80000 withAll: $a.
>>> {
>>> [| tmp | tmp := str readStream. [tmp next==nil] whileFalse] bench.
>>> [| tmp | tmp := str readXtream. [tmp next==nil] whileFalse] bench.
>>> [| tmp | tmp := str readXtream. tmp do: [:e | ]] bench.
>>> }
>>> #('131.1737652469506 per second.' '81.1026767878546 per second.'
>>> '29.96404314822213 per second.')
>>> #('132.388178913738 per second.' '81.701957650819 per second.'
>>> '27.44084310996222 per second.')
>>>
>>> | str |
>>> str := String new: 1000 withAll: $a.
>>> {
>>> [str readStream upToAll: 'ab'] bench.
>>> [str readXtream upToAll: 'ab'] bench.
>>> }
>>> #('514.297140571886 per second.' '633.473305338932 per second.')
>>> #('511.795281887245 per second.' '561.487702459508 per second.')
>>> #('513.497300539892 per second.' '557.48850229954 per second.')
>>>
>>> | str |
>>> str := String new: 1000 withAll: $a.
>>> {
>>> [str readStream upToAll: 'aab'] bench.
>>> [str readXtream upToAll: 'aab'] bench.
>>> }
>>> #('892.021595680864 per second.' '1427.914417116577 per second.')
>>> #('388.122375524895 per second.' '521.991203518593 per second.')
>>> #('394.5632620427743 per second.' '539.892021595681 per second.')
>>> #('384.6461415433827 per second.' '476.2095161935226 per second.')
>>> #('382.846861255498 per second.' '475.9048190361927 per second.')
>>>
>>> {
>>> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
>>> at: 2) name). [tmp next==nil] whileFalse. tmp close] timeToRun.
>>> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
>>> at: 2) name) readXtream buffered. [tmp next==nil] whileFalse. tmp
>>> close] timeToRun.
>>> }
>>> #(1639 1491)
>>> #(3121 2892)
>>> #(3213 2799)
>>> #(2591 2115)
>>> #(2146 2030) #(2153 1988) #(2770 2574) #(2319 2089) #(2141 1927) #(27008 1947)
>>>
>>> {
>>> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
>>> at: 2) name) ascii.
>>>        [tmp upTo: Character cr. tmp atEnd] whileFalse. tmp close] timeToRun.
>>> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
>>> at: 2) name) readXtream ascii buffered.
>>>        [tmp upTo: Character cr. tmp atEnd] whileFalse. tmp close] timeToRun.
>>> }
>>> #(8779 566)
>>> #(6418 1182)
>>> #(6084 1076)
>>> #(4647 856)
>>> #(4742 881) #(4332 818) #(4859 855) #(4503 1563) #(4347 816) #(4026
>>> 835) #(4285 821)
>>>
>>> {
>>> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
>>> at: 2) name) ascii.
>>>        [tmp nextLine == nil] whileFalse. tmp close] timeToRun.
>>> MessageTally spyOn: [[| tmp | tmp := (StandardFileStream
>>> readOnlyFileNamed: (SourceFiles at: 2) name) readXtream ascii
>>> buffered.
>>>        [tmp nextLine == nil] whileFalse. tmp close] timeToRun].
>>> }
>>> #(2088 1996) #(1920 1814) #(1589 1537) #(1631 1514) #(1587 1449)
>>> #(1490 1434) #(1567 1667) #(1807 1777) #(1785 2159) #(1802 2147)
>>>
>>> MessageTally spyOn: [| tmp | tmp := (StandardFileStream
>>> readOnlyFileNamed: (SourceFiles at: 2) name) readXtream ascii
>>> buffered.
>>>        [tmp upTo: Character cr. tmp atEnd] whileFalse. tmp close]
>>> .
>>> {
>>> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
>>> at: 2) name) ascii.
>>>        [tmp upToAnyOf: (CharacterSet crlf). tmp atEnd] whileFalse. tmp
>>> close] timeToRun.
>>> [| tmp | tmp := (StandardFileStream readOnlyFileNamed: (SourceFiles
>>> at: 2) name) readXtream ascii buffered.
>>>        [tmp upToAnyOf: (CharacterSet crlf). tmp atEnd] whileFalse. tmp
>>> close] timeToRun.
>>> }
>>> #(9153 665)
>>> #(6463 1251)
>>> #(5028 996) #(5076 1051) #(5223 949) #(4898 1073) #(5130 1610) #(5092
>>> 1776) #(4798 878) #(4757 956) #(5499 1405) #(14522 954) #(75895 1003)
>>>
>>>
>>> {
>>> [| tmp | tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles
>>> at: 2) name) ascii; wantsLineEndConversion: false; converter:
>>> UTF8TextConverter new.
>>>        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
>>> [| tmp atEnd | tmp := (StandardFileStream readOnlyFileNamed:
>>> (SourceFiles at: 2) name) readXtream ascii buffered decodeWith:
>>> UTF8TextConverter new.
>>>        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
>>> }
>>> #(332 183)
>>> #(558 422) #(678 421) #(686 420) #(675 423) #(673 423) #(662 410)
>>> #(681 558) #(674 550) #(674 928) #(694 1043) #(1668 1112)
>>>
>>>
>>> {
>>> MessageTally spyOn: [[| tmp | tmp := (MultiByteFileStream
>>> readOnlyFileNamed: (SourceFiles at: 2) name) ascii;
>>> wantsLineEndConversion: false; converter: UTF8TextConverter new.
>>>        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun].
>>> MessageTally spyOn: [[| tmp | tmp := (StandardFileStream
>>> readOnlyFileNamed: (SourceFiles at: 2) name) readXtream ascii buffered
>>> decodeWith: UTF8TextConverter new.
>>>        1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun].
>>> }
>>> #(349 189)
>>> #(577 458) #(595 487)
>>> #(574 438) #(699 444) #(714 457) #(722 449) #(724 438) #(692 572)
>>> #(707 698) #(707 693) #(689 670) #(691 663) #(726 957) #(714 1105)
>>> #(724 1150) #(1765 1098)
>>>
>>> {
>>> [| tmp | tmp := (MultiByteFileStream readOnlyFileNamed: (SourceFiles
>>> at: 2) name) ascii; wantsLineEndConversion: false; converter:
>>> UTF8TextConverter new.
>>>      1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
>>> [| tmp | tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles
>>> at: 2) name) readXtream ascii buffered decodeWith: (UTF8TextConverter
>>> new installLineEndConvention: nil)) buffered.
>>>      1 to: 10000 do: [:i | tmp upTo: Character cr]. tmp close] timeToRun.
>>> }
>>> #(318 14)
>>> #(558 38) #(559 44) #(579 43)#(540 32)
>>> #(701 34) #(694 36)
>>>
>>>
>>> MessageTally spyOn: [
>>> | string1 converter |
>>> string1 := 'à ta santé mon brave' squeakToUtf8.
>>> converter := UTF8TextConverter new installLineEndConvention: nil.
>>> {
>>>        [string1 utf8ToSqueak] bench.
>>>        [(string1 readXtream decodeWith: converter) upToEnd] bench.
>>>        [(string1 readXtream decodeWith: converter) buffered upToEnd] bench.
>>> }
>>> ]
>>> #('99488.1023795241 per second.' '27299.1401719656 per second.'
>>> '17217.55648870226 per second.')
>>> #('106710.2579484103 per second.' '30986.6026794641 per second.'
>>> '21273.1453709258 per second.')
>>> #('108047.7904419116 per second.' '31168.56628674265 per second.'
>>> '21107.17856428714 per second.')
>>> #('96647.2705458908 per second.' '28705.25894821036 per second.'
>>> '19899.4201159768 per second.')
>>> #('95075.9848030394 per second.' '32338.5322935413 per second.'
>>> '20242.95140971806 per second.')
>>>
>>> MessageTally spyOn: [
>>> | string1 converter |
>>> string1 := 'This ASCII string should not be hard to decode' squeakToUtf8.
>>> converter := UTF8TextConverter new installLineEndConvention: nil.
>>> {
>>>        [string1 utf8ToSqueak] bench.
>>>        [(string1 readXtream decodeWith: converter) upToEnd] bench.
>>>        [(string1 readXtream decodeWith: converter) buffered upToEnd] bench.
>>> }
>>> ]
>>> #('810708.458308338 per second.' '15476.30473905219 per second.'
>>> '24907.81843631274 per second.')
>>> #('1.044100979804039e6 per second.' '18131.57368526295 per second.'
>>> '40563.0873825235 per second.')
>>>
>>>
>>> {
>>> [|ws |
>>>       ws := (String new: 10000) writeStream.
>>>       1 to: 20000 do: [:i | ws nextPut: $0]] bench.
>>> [| ws |
>>>       ws := (String new: 10000) writeXtream.
>>>       1 to: 20000 do: [:i | ws nextPut: $0]] bench.
>>> }
>>> #('442.7114577084583 per second.' '359.3281343731254 per second.')
>>> #('178.4929042574455 per second.' '130.7738452309538 per second.')
>>> #('182.490505696582 per second.' '131.1475409836065 per second.')
>>> #('85.4291417165669 per second.' '128.8453855373552 per second.')
>>> #('86.4789294987018 per second.' '128.374325134973 per second.')
>>>

Re: Streams. Status and where to go?

Igor Stasenko
In reply to this post by Nicolas Cellier
On 26 February 2010 21:30, Nicolas Cellier
<[hidden email]> wrote:

> 2010/2/26 Igor Stasenko <[hidden email]>:
>> On 26 February 2010 18:59, Nicolas Cellier
>> <[hidden email]> wrote:
>>> 2010/2/26 Igor Stasenko <[hidden email]>:
>>>> Hello, Nicolas.
>>>
>>> Hi igor.
>>> You should load it in trunk.
>>>
>> Ah, i think my image is a bit outdated then.
>>
>>>> I want to try it out.
>>>> I tried to load it (XTream-Core) into my image, and it bugs me about
>>>> unresolved dependencies:
>>>> ----
>>>> This package depends on the following classes:
>>>>  ByteTextConverter
>>>> You must resolve these dependencies before you will be able to load
>>>> these definitions:
>>>>  ByteTextConverter>>nextFromXtream:
>>>>  ByteTextConverter>>nextPut:toXtream:
>>>>  ByteTextConverter>>readInto:startingAt:count:fromXtream:
>>>> ----
>>>> I ignored these warnings, pressing continue, and here is what it warns
>>>> about in my trunk image:
>>>>
>>>> TextConverter>>next:putAll:startingAt:toXtream: (latin1Map is Undeclared)
>>>> TextConverter>>next:putAll:startingAt:toXtream: (latin1Encodings is Undeclared)
>>>> TextConverter>>readInto:startingAt:count:fromXtream: (latin1Map is Undeclared)
>>>>
>>>> Is ByteTextConverter a Pharo-specific class?
>>>>
>>>
>>> This is a refactoring of TextConverter I made in trunk.
>>> Pharo did the same before me (it comes from Sophie), but I missed it
>>> unfortunately...
>>>
>>>> If you have seen my previous message, I think you noticed that
>>>> the XXXTextConverter classes are abominations (IMO), and should be
>>>> reimplemented as wrapping streams instead.
>>>> Would you be willing to change that in XStreams? I mean implementing a
>>>> conversion stream model, which can wrap around any other stream,
>>>> like:
>>>>
>>>> myStream := UTFReaderStream on: otherStream.
>>>> myString := myStream contents.
>>>>
>>>> or the other way around:
>>>>
>>>> myString := (someBaseStream wrapWith: UTFReaderStream) contents.
>>>>
>>>> or..
>>>> myDecodedString := (someBaseStream wrapWith: (DecodingStreams
>>>> decoderFor: myEncoding)) contents.
>>>>
>>>> That would be much nicer than using converters.
>>>
>>> Currently, I have a ConverterReadXtream and a ConverterWriteXtream
>>> which are stream wrappers.
>>> They use old TextConverter to do the real job, but I agree, a full
>>> rewrite of this one is needed.
>>> However, I would like to keep these two layers for Stream composition:
>>> - the generic converter stream
>>> - the conversion algorithm
>>>
>>
>> Why?
>> In your implementation you already added the
>> readInto: aCollection startingAt: startIndex count: anInteger
>> and
>> next: count into: aString startingAt: startIndex
>> into converters, which makes them even more like streams.
>>
>
> Yes, you may be right.
> Maybe my ratio innovating/reusing was a bit low :)
>
>> So, what is stopping you from making an abstract, generic XtreamWrapper class,
>> and then a number of subclasses (LatinConversionStream,
>> UnicodeConversionStream etc),
>
> Yes, that's possible. But it's already what ConverterReadXtream and
> ConverterWriteXtream are.
>

I suggest using the following hierarchy:

Xtream -> XtreamWrapper -> ConverterXtream -> (bunch of subclasses)

or just

Xtream -> XtreamWrapper -> (bunch of subclasses)

since I don't think there is enough specific behavior in
ConverterXtream to be worth creating a separate class.
But maybe I'm wrong.

>> as well as BufferedWrapper?
>>
>
> BufferedReadXtream and BufferedWriteXtream already are generic. It's
> just that I have separated read and write...
> So you will find ReadXtream>>buffered
>        ^(BufferedReadXtream new)
>                contentsSpecies: self contentsSpecies bufferSize: self preferredBufferSize;
>                source: self
>
> Or do you mean a single Buffer for read/write ?
> That would look more like original VW I think.
>

Obviously, if you are reading and writing to the same stream, you should take
care of keeping any buffered I/O in sync.
#buffered can decide what kind of stream to create:
self isWritable ifTrue: [ create R/W wrapper ] ifFalse: [ create R/O wrapper ]

This is, of course, if you promote #buffered to the Xtream class, which I
think is worthwhile.

>> So, it will cost us 1 less message dispatch in
>> a := stream next.
>>
>> In your model you have:
>>
>> (converter stream) -> (converter) -> basic stream
>>
>> while with a wrapper it will be just:
>> (converter wrapper) -> basic stream
>>
>
> I must re-think why I made this decision of additional indirection...
> Maybe it was just reusing...
>

I think this is just about reuse. But as I showed in
UTF8TextConverter>>nextFromStream:,
besides the extra dispatch it uses characters instead of
bytes, which can be avoided
if you wrap the stream to be converted and tell it to work in binary
mode, since your wrapper is in control.
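
A minimal sketch of that idea, written as a workspace snippet rather than a real wrapper class: the source is read in binary mode, so #next answers SmallIntegers and a Character is created only once per decoded code point. The variable names are illustrative, the input is assumed to be well-formed UTF-8 (continuation bytes and truncated sequences are not validated), and this is not the XTream or UTF8TextConverter code.

```smalltalk
| in out byte code extra |
"UTF-8 bytes for 'à ta': 195 160 encode à (U+00E0), the rest is ASCII."
in := #[195 160 32 116 97] readStream.
out := WriteStream on: String new.
[in atEnd] whileFalse: [
    byte := in next.    "a SmallInteger, no Character involved yet"
    byte < 128
        ifTrue: [code := byte. extra := 0]
        ifFalse: [byte < 224
            ifTrue: [code := byte bitAnd: 31. extra := 1]
            ifFalse: [byte < 240
                ifTrue: [code := byte bitAnd: 15. extra := 2]
                ifFalse: [code := byte bitAnd: 7. extra := 3]]].
    "Each continuation byte contributes its low 6 bits."
    extra timesRepeat: [
        code := (code bitShift: 6) + (in next bitAnd: 63)].
    "Only here is a Character created, once per code point."
    out nextPut: (Character value: code)].
out contents
```

A wrapper stream would run the same loop body inside its own #next, pulling bytes from the stream it wraps.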

>>
>>> Though current XTream is a quick hack reusing Yoshiki's TextConverter,
>>> it already demonstrates possible gains coming from buffering.
>>> The speed comes from applying the utf8ToSqueak/squeakToUtf8 trick: copy
>>> large ASCII-encoded portions verbatim.
>>> This works very well with Squeak source because 99.99% of characters are ASCII.
>>>
>>>>
>>>> Wrappers are more flexible compared to TextConverters, since they are
>>>> not obliged to convert to/from text-based collections only.
>>>> For example, we can use the same API for wrapping with a ZIP stream:
>>>>
>>>> myUnpackedData := (someBaseStream wrapWith: ZIPReaderStream) contents.
>>>>
>>>> and many other (ab)uses.. Like reading changeset chunks:
>>>>
>>>> nextChunk := (fileStream wrapWith: ChunkReaderStream) next.
>>>>
>>>
>>> Yes, that fits my intentions.
>>> What I want is to preserve buffered operations along the chain, and
>>> avoid byte-by-byte conversions when possible.
>>>
>>
>> Buffering is just a wrapper. Btw, again, why don't you provide a
>> generic wrapper class which everyone can subclass?
>>
>> bufferedStream := anyStreamClass buffered
>>
>> (buffered wrapper) -> (anyStreamClass)
>>
>
> See above, it's just split in BufferedRead/WriteXtream
>
> Or see the example (a bit heavy)
>  tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles at: 2) name)
>      readXtream ascii buffered decodeWith: (UTF8TextConverter new
> installLineEndConvention: nil)) buffered.
>

Yes, it's a bit heavy, but this is how one should build chains
of streams.
Except that there should be only streams in the chain, no non-stream
converters in between :)

>
>> I don't see where else you should care about buffering explicitly in
>> anyStreamClass.
>>
>> And how can you avoid byte-by-byte conversion in UTF-8? It has to
>> iterate over bytes to determine the characters anyway.
>
> True, it is faster because you scan fast with a primitive,
> then copy a whole chunk with replaceFrom:to:with:startingAt: primitive
>
> Of course, if you handle some cyrillic files, then this strategy won't
> be efficient. It only works in ASCII-dominated files.
> UTF8 itself would not be an optimal choice for cyrillic anyway...
>
I prefer to use UTF8 nowadays, instead of the old rubbish encodings, of
which there are many :)
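
The scan-then-copy strategy Nicolas describes above can be sketched roughly as follows. This is only an illustration: #findFirst: with a block stands in for the fast primitive scan, the sample bytes are made up, and it assumes #replaceFrom:to:with:startingAt: can copy from a ByteArray into a ByteString, as it does in Squeak.

```smalltalk
| bytes firstNonAscii asciiCount prefix |
"ASCII text followed by the two UTF-8 bytes of é (U+00E9)."
bytes := 'plain ASCII, then ' asByteArray , #[195 169].
"Index of the first byte above 127, or 0 if the input is pure ASCII."
firstNonAscii := bytes findFirst: [:b | b > 127].
asciiCount := firstNonAscii = 0
    ifTrue: [bytes size]
    ifFalse: [firstNonAscii - 1].
"Copy the whole ASCII run in one bulk operation instead of byte by byte."
prefix := String new: asciiCount.
prefix replaceFrom: 1 to: asciiCount with: bytes startingAt: 1.
prefix
```

Only the bytes from firstNonAscii onward then need the per-byte decoding path, which is why the gain disappears for text that is mostly non-ASCII.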

>> But sure thing, nothing prevents you from buffering things in a way like:
>>
>> reader := anyStream buffered wrapWith: UTF8Reader.
>>
>
> My above example is just equivalent to:
>
> reader := (anyStream buffered wrapWith: UTF8Reader) buffered.
>
> Then even if I use reader next, a whole buffer of UTF8 is converted
> (presumably by large chunks)
>

Right, nobody says that it's not possible to do double-buffering:
first by wrapping the original stream (presumably file-based),
and second by wrapping the output of the utf8 converter.

[snip]

--
Best regards,
Igor Stasenko AKA sig.

Re: Streams. Status and where to go?

Richard Durr-2
So what speaks against using VisualWorks' Xtreams?
http://www.cincomsmalltalk.com/blog/blogView?entry=3444278480&printTitle=Smalltalk_Daily_02/22/10:_Introducing_Xtreams&showComments=true

Re: Streams. Status and where to go?

Levente Uzonyi-2
On Sat, 27 Feb 2010, Richard Durr wrote:

> So what speaks against using VisualWorks' Xtreams?
> http://www.cincomsmalltalk.com/blog/blogView?entry=3444278480&printTitle=Smalltalk_Daily_02/22/10:_Introducing_Xtreams&showComments=true

1. Someone has to port it.
2. It's optimized for VW, so the ported code's performance
will probably be bad.


Levente

>> Igor Stasenko AKA sig.
>>

Re: Streams. Status and where to go?

Nicolas Cellier
In reply to this post by Igor Stasenko
I pushed an embryonic XTream-CharacterCode package to XTream, intended as
a replacement for TextConverter.

Nicolas


Re: Streams. Status and where to go?

Nicolas Cellier
In reply to this post by Levente Uzonyi-2
2010/2/27 Levente Uzonyi <[hidden email]>:

> On Sat, 27 Feb 2010, Richard Durr wrote:
>
>> So what speaks against using VisualWorks' Xtreams?
>>
>> http://www.cincomsmalltalk.com/blog/blogView?entry=3444278480&printTitle=Smalltalk_Daily_02/22/10:_Introducing_Xtreams&showComments=true
>
> 1. Someone has to port it.
> 2. It's optimized for VW, so the ported code's performance will probably be
> bad.
>
>
> Levente
>

Licensing was not clear when I began, so I just picked a few ideas and
re-implemented them from scratch.
Now it would be interesting to try porting VW Xtream (I should say the
original XTream; I just hijacked the name...).
Concerning performance, VW XTream uses exceptions extensively, which I
tried to avoid.
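
To make the contrast concrete, here is a minimal workspace sketch of the two styles of detecting the end of a stream. It is only an illustration on a plain ReadStream: it reproduces neither the VW code nor XTream, and the plain Error is a stand-in for whatever exception class a library would define.

```smalltalk
"Sketch only; Error stands in for a library-specific exception class."
| in sum sum2 |

"Style 1: test for the end before each read, no exception involved."
in := ReadStream on: #(1 2 3).
sum := 0.
[ in atEnd ] whileFalse: [ sum := sum + in next ].

"Style 2: the loop has no exit test of its own; reaching the end
 signals an exception, and the handler terminates the loop."
in := ReadStream on: #(1 2 3).
sum2 := 0.
[ [ in atEnd ifTrue: [ Error new signal: 'end of stream' ].
    sum2 := sum2 + in next ] repeat ]
	on: Error
	do: [ :ex | ex return: nil ].

"Both sums are 6."
{ sum. sum2 }
```

The second style keeps the reading code free of end tests, at the price of one exception signal and handler search per stream; whether that cost matters depends on the VM and on how short the streams are.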

Nicolas


Re: Streams. Status and where to go?

Igor Stasenko
Hi, I also did some hacking. I uploaded XTream-Wrappers-sig.1 to SqS/XTream.

There is a basic XtreamWrapper class, which should work transparently
for any stream (hopefully ;).
On top of it, I created a converter as a subclass. Sure, I could also add a
buffered wrapper, but maybe later :)
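
For readers following along, the kind of byte-oriented decoding such a converter performs can be sketched in a workspace as below. This is not the code in XTream-Wrappers-sig.1; it is a minimal illustration that assumes well-formed UTF-8 input and uses only ReadStream and WriteStream from the base image. The point is that the source stream stays binary, so each element is a SmallInteger, and a Character is created only once the whole code point is known.

```smalltalk
"Sketch only: decode UTF-8 from a binary stream.
 Assumes well-formed input; no validation or error handling."
| bytes in out byte code extra |
"The UTF-8 encoding of 'Hi ' followed by the Cyrillic word for 'hello'."
bytes := #[72 105 32 208 159 209 128 208 184 208 178 208 181 209 130].
in := ReadStream on: bytes.
out := WriteStream on: WideString new.
[ in atEnd ] whileFalse: [
	byte := in next.
	"The lead byte gives the payload bits and the number of continuation bytes."
	byte < 128
		ifTrue: [ code := byte. extra := 0 ]
		ifFalse: [
			byte < 224
				ifTrue: [ code := byte bitAnd: 31. extra := 1 ]
				ifFalse: [
					byte < 240
						ifTrue: [ code := byte bitAnd: 15. extra := 2 ]
						ifFalse: [ code := byte bitAnd: 7. extra := 3 ] ] ].
	"Each continuation byte contributes its low 6 bits."
	extra timesRepeat: [
		code := (code bitShift: 6) bitOr: (in next bitAnd: 63) ].
	out nextPut: (Character value: code) ].
out contents
```

Evaluating the snippet should answer the string 'Hi Привет'. A real wrapper would put the loop body into its own #next and would also have to deal with truncated or invalid sequences.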

Here are some benchmarks. The test file is a UTF-8 encoded Russian text
document (attached).

"First block: XtreamUTF8Converter (from XTream-Wrappers). Second block: UTF8Decoder."
| str |
str := (StandardFileStream readOnlyFileNamed: 'unitext.txt') readXtream.
{
[ str reset. (XtreamUTF8Converter on: str readXtream) upToEnd ] bench.
[ str reset. (UTF8Decoder new source: str readXtream) upToEnd ] bench.
}
#('21.71314741035857 per second.' '14.0371688414393 per second.')
#('22.16896345116836 per second.' '14.5186953062848 per second.')
Next, with a buffered source:

| str |
str := (StandardFileStream readOnlyFileNamed: 'unitext.txt') readXtream.
{
[ str reset. (XtreamUTF8Converter on: str readXtream buffered) upToEnd ] bench.
[ str reset. (UTF8Decoder new source: str readXtream buffered) upToEnd ] bench.
}
#('58.52976428286057 per second.' '25.44225800039754 per second.')
#('58.90575079872205 per second.' '25.87064676616916 per second.')
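
The gain is consistent with what a buffer does: many single-element reads on the underlying stream are replaced by a few bulk reads. The workspace sketch below only counts reads on an in-memory stream, so it illustrates the access pattern rather than the timing; the buffer size of 4096 and all variable names are arbitrary choices for the illustration and are not taken from XTream.

```smalltalk
"Sketch only: count how often the source stream is asked for data."
| data source buffer singleReads bulkReads total |
data := ByteArray new: 10000 withAll: 65.

"Unbuffered: one request to the source per byte."
source := ReadStream on: data.
singleReads := 0.
[ source atEnd ] whileFalse: [
	source next.
	singleReads := singleReads + 1 ].

"Buffered: fetch up to 4096 bytes at once, then consume them locally."
source := ReadStream on: data.
bulkReads := 0.
total := 0.
[ source atEnd ] whileFalse: [
	buffer := source next: (4096 min: data size - source position).
	bulkReads := bulkReads + 1.
	buffer do: [ :byte | total := total + 1 ] ].

"10000 single reads versus 3 bulk reads for the same 10000 bytes."
{ singleReads. bulkReads. total }
```

With a file-based source, each request to the source is far more expensive than an indexed access into a local buffer, which is where the buffered numbers above come from.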


I also tried double-buffering, but neither my class nor yours
currently works with it:

| str |
str := (StandardFileStream readOnlyFileNamed: 'unitext.txt') readXtream.
{
[ str reset. (XtreamUTF8Converter on: str readXtream buffered) buffered upToEnd ] bench.
[ str reset. (UTF8Decoder new source: str readXtream buffered) buffered upToEnd ] bench.
}

Please take a look. There are some quirks that are not due to my
cleanup of the decoding/encoding code;
see the XtreamWrapper>>upToEnd implementation.


--
Best regards,
Igor Stasenko AKA sig.

Attachment: unitext.txt (20K)

Re: Streams. Status and where to go?

Stéphane Ducasse
In reply to this post by Nicolas Cellier
If you feel that this is important, we can take some time and port the code from VW.
Let us know; I can ask Cyrille to give it a try.

Stef

On Feb 27, 2010, at 10:56 PM, Nicolas Cellier wrote:

> 2010/2/27 Levente Uzonyi <[hidden email]>:
>> On Sat, 27 Feb 2010, Richard Durr wrote:
>>
>>> So what speaks against using VisualWorks' Xtreams?
>>>
>>> http://www.cincomsmalltalk.com/blog/blogView?entry=3444278480&printTitle=Smalltalk_Daily_02/22/10:_Introducing_Xtreams&showComments=true
>>
>> 1. Someone has to port it.
>> 2. It's optimized for VW, so the ported code's performance will probably be
>> bad.
>>
>>
>> Levente
>>
>
> Licensing was not clear when I begun, so I just picked a few ideas and
> re-implement from scratch.
> Now it would be interesting to try porting VW Xtream (I should say the
> original XTream, I just hijacked the name...).
> Concerning performance, VW XTream use exeptions extensively, which I
> tried to avoid.
>
> Nicolas
>
>>>
>>> On Fri, Feb 26, 2010 at 11:01 PM, Igor Stasenko <[hidden email]>
>>> wrote:
>>>>
>>>> On 26 February 2010 21:30, Nicolas Cellier
>>>> <[hidden email]> wrote:
>>>>>
>>>>> 2010/2/26 Igor Stasenko <[hidden email]>:
>>>>>>
>>>>>> On 26 February 2010 18:59, Nicolas Cellier
>>>>>> <[hidden email]> wrote:
>>>>>>>
>>>>>>> 2010/2/26 Igor Stasenko <[hidden email]>:
>>>>>>>>
>>>>>>>> Hello, Nicolas.
>>>>>>>
>>>>>>> Hi igor.
>>>>>>> You should load it in trunk.
>>>>>>>
>>>>>> Ah, i think my image is a bit outdated then.
>>>>>>
>>>>>>>> I want to try it out.
>>>>>>>> I tried to load it (XTream-Core) into my image, and it bug me about
>>>>>>>> unresolved dependencies:
>>>>>>>> ----
>>>>>>>> This package depends on the following classes:
>>>>>>>>  ByteTextConverter
>>>>>>>> You must resolve these dependencies before you will be able to load
>>>>>>>> these definitions:
>>>>>>>>  ByteTextConverter>>nextFromXtream:
>>>>>>>>  ByteTextConverter>>nextPut:toXtream:
>>>>>>>>  ByteTextConverter>>readInto:startingAt:count:fromXtream:
>>>>>>>> ----
>>>>>>>> I ignored these warnings, pressing continue, and here what it warns
>>>>>>>> about in my trunk image:
>>>>>>>>
>>>>>>>> TextConverter>>next:putAll:startingAt:toXtream: (latin1Map is
>>>>>>>> Undeclared)
>>>>>>>> TextConverter>>next:putAll:startingAt:toXtream: (latin1Encodings is
>>>>>>>> Undeclared)
>>>>>>>> TextConverter>>readInto:startingAt:count:fromXtream: (latin1Map is
>>>>>>>> Undeclared)
>>>>>>>>
>>>>>>>> Is ByteTextConverter a Pharo-specific class?
>>>>>>>>
>>>>>>>
>>>>>>> This is a refactoring of TextConverter I made in trunk.
>>>>>>> Pharo did the same before me (it comes from Sophie), but I missed it
>>>>>>> unfortunately...
>>>>>>>
>>>>>>>> If you seen my previous message, i think you noticed that
>>>>>>>> XXXTextConverter is abdominations (IMO), and should be reimplemented
>>>>>>>> as a wrapping-streams instead.
>>>>>>>> Would you be willing to change that in XStreams? I mean implementing
>>>>>>>> a
>>>>>>>> conversion streams model, which can wrap around any other stream,
>>>>>>>> like:
>>>>>>>>
>>>>>>>> myStream := UTFReaderStream on: otherStream.
>>>>>>>> myString := myStream contents.
>>>>>>>>
>>>>>>>> or using other way:
>>>>>>>>
>>>>>>>> myString := (someBaseStream wrapWith: UTFReaderStream) contents.
>>>>>>>>
>>>>>>>> or..
>>>>>>>> myDecodedString := (someBaseStream wrapWith: (DecodingStreams
>>>>>>>> decoderFor: myEncoding) contents.
>>>>>>>>
>>>>>>>> That's would be much nicer than using converters.
>>>>>>>
>>>>>>> Currently, I have a ConverterReadXtream and a ConverterWriteXtream
>>>>>>> which are stream wrappers.
>>>>>>> They use old TextConverter to do the real job, but I agree, a full
>>>>>>> rewrite of this one is needed.
>>>>>>> However, I would like to keep these two layers for Stream composition:
>>>>>>> - the generic converter stream
>>>>>>> - the conversion algorithm
>>>>>>>
>>>>>>
>>>>>> Why?
>>>>>> In your implementation you already added the
>>>>>> readInto: aCollection startingAt: startIndex count: anInteger
>>>>>> and
>>>>>> next: count into: aString startingAt: startIndex
>>>>>> into converters, which makes them even more like streams.
>>>>>>
>>>>>
>>>>> Yes, you may be right.
>>>>> Maybe my ratio innovating/reusing was a bit low :)
>>>>>
>>>>>> So, what stopping you from making an abstract, generic XtreamWrapper
>>>>>> class,
>>>>>> and then a number of subclasses  (LatinConversionStream ,
>>>>>> UnicodeConversionStream etc),
>>>>>
>>>>> Yes, that's possible. But it's already what ConverterReadXtream and
>>>>> ConverterWriteXtream are.
>>>>>
>>>>
>>>> I suggesting to use a following hierarchy
>>>>
>>>> Xtream -> XtreamWrapper -> ConverterXtream -> (bunch of subclasses)
>>>>
>>>> or just
>>>>
>>>> Xtream -> XtreamWrapper -> (bunch of subclasses)
>>>>
>>>> since i don't think there a lot of specific behavior in
>>>> ConverterXtream worth creating a separate class.
>>>> But maybe i'm wrong.
>>>>
>>>>>> as well as BufferedWrapper?
>>>>>>
>>>>>
>>>>> BufferedReadXtream and BufferedWriteXtream already are generic. It's
>>>>> just that I have separated read and write...
>>>>> So you will find ReadXtream>>buffered
>>>>>        ^(BufferedReadXtream new)
>>>>>                contentsSpecies: self contentsSpecies bufferSize: self
>>>>> preferredBufferSize;
>>>>>                source: self
>>>>>
>>>>> Or do you mean a single Buffer for read/write ?
>>>>> That would look more like original VW I think.
>>>>>
>>>>
>>>> obviously, if you reading and writing to same stream , you should take
>>>> care of keeping any buffered i/o in sync.
>>>> The #buffered can decide what kind of stream to create
>>>> self isWritable ifTrue: [ create R/W wrapper ] ifFalse: [ create R/O
>>>> wrapper ]
>>>>
>>>> this is , of course if you promote #buffered to Xtream class. Which i
>>>> think worthful thing.
>>>>
>>>>>> So, it will cost us 1 less message dispatch in
>>>>>> a := stream next.
>>>>>>
>>>>>> In your model you having:
>>>>>>
>>>>>> (converter stream) -> (converter) -> basic stream
>>>>>>
>>>>>> while if using wrapper it will be just:
>>>>>> (converter wrapper) -> basic stream
>>>>>>
>>>>>
>>>>> I must re-think why I made this decision of additional indirection...
>>>>> Maybe it was just reusing...
>>>>>
>>>>
>>>> I think this is just about reuse. But as i shown in
>>>> UFT8TextConverter>>nextFromStream:
>>>> its in addition to extra dispatch, using a characters instead of
>>>> bytes, which can be avoided
>>>> if you wrap the stream to be converted and tell it to work in binary
>>>> mode, since your wrapper are in control.
>>>>
>>>>>>
>>>>>>> Though current XTream is a quick hack reusing Yoshiki TextConverter,
>>>>>>> it already demonstrates possible gains coming from buffering.
>>>>>>> The speed comes from,applying utf8ToSqueak, squeakToUtf8 trick: copy
>>>>>>> large ASCII encoded portions verbatim.
>>>>>>> This works very well with squeak source because 99,99% of characters
>>>>>>> are ASCII.
>>>>>>>
>>>>>>>>
>>>>>>>> Wrappers is more flexible comparing to TextConverters, since they are
>>>>>>>> not obliged to convert to/from text-based collections only.
>>>>>>>> For example, we can use same API for wrapping with ZIP stream:
>>>>>>>>
>>>>>>>> myUnpackedData := (someBaseStream wrapWith: ZIPReaderStream)
>>>>>>>> contents.
>>>>>>>>
>>>>>>>> and many other (ab)uses.. Like reading changeset chunks:
>>>>>>>>
>>>>>>>> nextChunk := (fileStream wrapWith: ChunkReaderStream) next.
>>>>>>>>
>>>>>>>
>>>>>>> Yes, that fits my intentions.
>>>>>>> What I want is to preserve buffered operations along the chain, and
>>>>>>> avoid byte-by-byte conversions when possible.
>>>>>>>
>>>>>>
>>>>>> Buffering is just a wrapper. Btw, again, why you don't providing a
>>>>>> generic wrapper class which everyone can subclass from?
>>>>>>
>>>>>> bufferedStream := anyStreamClass buffered
>>>>>>
>>>>>> (buffered wrapper) -> (anyStreamClass)
>>>>>>
>>>>>
>>>>> See above, it's just split in BufferedRead/WriteXtream
>>>>>
>>>>> Or see the example (a bit heavy)
>>>>>  tmp := ((StandardFileStream readOnlyFileNamed: (SourceFiles at: 2)
>>>>> name)
>>>>>      readXtream ascii buffered decodeWith: (UTF8TextConverter new
>>>>> installLineEndConvention: nil)) buffered.
>>>>>
>>>>
>>>> yes. its a bit heavy, but this is a way how one should build a chains
>>>> of streams.
>>>> Except that there should be only streams in chain, no non-stream
>>>> converters in between :)
>>>>
>>>>>
>>>>>> i don't see where else you should care of buffering explicitly in
>>>>>> anyStreamClass.
>>>>>>
>>>>>> And, how you can avoid byte-by-byte conversion in utf8? It should
>>>>>> iterate over bytes to determine the characters anyways.
>>>>>
>>>>> True, but it is faster because you scan fast with a primitive,
>>>>> then copy a whole chunk with the replaceFrom:to:with:startingAt: primitive.
>>>>>
>>>>> Of course, if you handle Cyrillic files, this strategy won't
>>>>> be efficient. It only works for ASCII-dominated files.
>>>>> UTF8 itself would not be an optimal choice for Cyrillic anyway...
>>>>>
>>>> I prefer to use UTF8 nowadays, instead of the old rubbish encodings, of
>>>> which there are many :)
>>>>
>>>>>> But sure thing, nothing prevents you from buffering things in a way
>>>>>> like:
>>>>>>
>>>>>> reader := anyStream buffered wrapWith: UTF8Reader.
>>>>>>
>>>>>
>>>>> My above example is just equivalent to:
>>>>>
>>>>> reader := (anyStream buffered wrapWith: UTF8Reader) buffered.
>>>>>
>>>>> Then even if I use reader next, a whole buffer of UTF8 is converted
>>>>> (presumably by large chunks)
>>>>>
>>>>
>>>> Right, nobody says that it's not possible to do double-buffering:
>>>> first by wrapping the original stream (presumably file-based),
>>>> and second by wrapping the output of the utf8 converter.
>>>>
>>>> [snip]
>>>>
>>>> --
>>>> Best regards,
>>>> Igor Stasenko AKA sig.
>>>>

Re: Streams. Status and where to go?

Nicolas Cellier
In reply to this post by Igor Stasenko
2010/2/28 Igor Stasenko <[hidden email]>:

> Hi, I also did some hacking. I uploaded XTream-Wrappers-sig.1 into SqS/XTream.
>
> There is a basic XtreamWrapper class, which should work transparently
> for any stream (hopefully ;).
> Next, in a subclass I created a converter. Sure, I could also add a
> buffered wrapper, but maybe later :)
>
> Here are some benchmarks. The file I used for testing is a UTF-8 Russian
> text document (attached).
>
> | str |
> str := (StandardFileStream readOnlyFileNamed: 'unitext.txt') readXtream.
> {
> [ str reset. (XtreamUTF8Converter on: str readXtream) upToEnd ] bench.
> [ str reset. (UTF8Decoder new source: str readXtream) upToEnd ] bench.
> }
> #('21.71314741035857 per second.' '14.0371688414393 per second.')
>  #('22.16896345116836 per second.' '14.5186953062848 per second.')
>
> Next, buffered
>
> | str |
> str := (StandardFileStream readOnlyFileNamed: 'unitext.txt') readXtream.
> {
> [ str reset. (XtreamUTF8Converter on: str readXtream buffered) upToEnd ] bench.
> [ str reset. (UTF8Decoder new source: str readXtream buffered) upToEnd ] bench.
> }
> #('58.52976428286057 per second.' '25.44225800039754 per second.')
> #('58.90575079872205 per second.' '25.87064676616916 per second.')
>
>
> I also tried double-buffering, but neither my class nor yours
> currently works with it:
>
> | str |
> str := (StandardFileStream readOnlyFileNamed: 'unitext.txt') readXtream.
> {
> [ str reset. (XtreamUTF8Converter on: str readXtream buffered)
> buffered upToEnd ] bench.
> [ str reset. (UTF8Decoder new source: str readXtream buffered)
> buffered upToEnd ] bench.
> }
>
> Please take a look. There are some quirks that are not caused by my
> cleanup of the decoding/encoding code.
> See the XtreamWrapper>>upToEnd implementation.
>
>

Yes, I published a bit too soon and messed up: one temp in a text
converter method (source) had the same name as a CharacterDecoder inst var
:(
Here is a second attempt:

| str |
str := (StandardFileStream readOnlyFileNamed: 'unitext.txt') readXtream binary.
{
[ str reset. (XtreamUTF8Converter on: str readXtream buffered)
buffered upToEnd ] bench.
[ str reset. (UTF8Decoder new source: str readXtream buffered)
buffered upToEnd ] bench.
}
#('118.0347513481126 per second.' '31.38117129722167 per second.')


As you can see, the optimistic ASCII version is pessimistic in the case of
non-ASCII text...
It creates a composite stream and performs a lot of copies...
This is known and awaiting a better algorithm :)
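
To make the trade-off concrete, here is a minimal sketch of the "optimistic ASCII" idea in Python. It is not the XTream code: the function name is hypothetical, and the scan is an ordinary loop where the Smalltalk version would use a primitive. Runs of ASCII bytes are copied in bulk, and every non-ASCII character falls back to a one-sequence-at-a-time path.

```python
def decode_utf8_optimistic(data: bytes) -> str:
    """Decode UTF-8, copying runs of ASCII bytes in bulk (illustrative sketch)."""
    out = []
    i, n = 0, len(data)
    while i < n:
        # Fast path: find the end of the current run of ASCII bytes.
        j = i
        while j < n and data[j] < 0x80:
            j += 1
        if j > i:
            # Copy the whole ASCII run in one step.
            out.append(data[i:j].decode("ascii"))
            i = j
            continue
        # Slow path: decode exactly one multi-byte sequence.
        lead = data[i]
        if lead >= 0xF0:
            length = 4
        elif lead >= 0xE0:
            length = 3
        else:
            length = 2
        # Malformed or truncated input raises UnicodeDecodeError here.
        out.append(data[i:i + length].decode("utf-8"))
        i += length
    return "".join(out)
```

On mostly-ASCII input the loop body runs only a few times, each time copying a large chunk. On Cyrillic text nearly every character takes the slow path and produces its own tiny chunk, which is the "lot of copies" effect described above.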

Nicolas


Re: Streams. Status and where to go?

Igor Stasenko
On 28 February 2010 12:00, Nicolas Cellier
<[hidden email]> wrote:

> Yes, I published a bit too soon and messed up: one temp in a text
> converter method (source) had the same name as a CharacterDecoder inst var
> :(
> Here is a second attempt:
>
> | str |
> str := (StandardFileStream readOnlyFileNamed: 'unitext.txt') readXtream binary.
> {
> [ str reset. (XtreamUTF8Converter on: str readXtream buffered)
> buffered upToEnd ] bench.
> [ str reset. (UTF8Decoder new source: str readXtream buffered)
> buffered upToEnd ] bench.
> }
> #('118.0347513481126 per second.' '31.38117129722167 per second.')
>
>
> As you can see, the optimistic ASCII version is pessimistic in the case of
> non-ASCII text...
> It creates a composite stream and performs a lot of copies...
> This is known and awaiting a better algorithm :)
>

Whoops.. you got more than 3x speedup, while mine was around 2x.
But please, try it on ASCII files.

 | str |
 str := (String new: 1000 withAll: $a) asByteArray.
 {
 [ (XtreamUTF8Converter on: str readXtream binary)  upToEnd ] bench.
 [ (UTF8Decoder new source: str readXtream binary)  upToEnd ] bench.
 [ str readXtream binary upToEnd ] bench.
 }
 #('2039.392121575685 per second.' '1158.568286342731 per second.'
'92143.1713657269 per second.')

So, going by the numbers above, conversion is roughly 45 to 80 times
slower than just copying data :)
We need to tighten up this gap.
One way would be to optimize #readInto:startingAt:count: using batch-mode
conversion.
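
A rough sketch of what batch-mode conversion could look like, written in Python rather than Smalltalk. The class name UTF8Reader and the method read_into are hypothetical stand-ins for a wrapper class and #readInto:startingAt:count:, and the standard library's incremental decoder does the real work. Bytes are pulled from the wrapped stream a chunk at a time and decoded a chunk at a time, so the per-character overhead is paid once per chunk.

```python
import codecs
import io


class UTF8Reader:
    """Hypothetical wrapper that decodes UTF-8 from a binary stream in batches."""

    def __init__(self, source, chunk_size=4096):
        self.source = source          # any object with read(n) returning bytes
        self.chunk_size = chunk_size
        self.decoder = codecs.getincrementaldecoder("utf-8")()
        self.pending = ""             # decoded characters not yet handed out

    def read_into(self, target, start, count):
        """Store up to `count` characters in target[start:]; return how many."""
        stored = 0
        while stored < count:
            if not self.pending:
                chunk = self.source.read(self.chunk_size)
                # At end of input, final=True makes a truncated sequence raise.
                self.pending = self.decoder.decode(chunk, final=not chunk)
                if not chunk and not self.pending:
                    break             # source exhausted
            take = min(count - stored, len(self.pending))
            # One slice assignment moves the whole batch into the target.
            target[start + stored:start + stored + take] = self.pending[:take]
            self.pending = self.pending[take:]
            stored += take
        return stored


source = io.BytesIO("héllo мир".encode("utf-8"))
reader = UTF8Reader(source, chunk_size=3)   # tiny chunks split multi-byte characters
buffer = [None] * 20
n = reader.read_into(buffer, 0, 20)
print("".join(buffer[:n]))
```

The incremental decoder keeps the bytes of a character that straddles a chunk boundary, so the chunk size affects only speed, not the result.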




--
Best regards,
Igor Stasenko AKA sig.


Re: Streams. Status and where to go?

Nicolas Cellier
2010/2/28 Igor Stasenko <[hidden email]>:

> Whoops.. you got more than 3x speedup, while mine was around 2x.
> But please, try it on ASCII files.
>
>  | str |
>  str := (String new: 1000 withAll: $a) asByteArray.
>  {
>  [ (XtreamUTF8Converter on: str readXtream binary)  upToEnd ] bench.
>  [ (UTF8Decoder new source: str readXtream binary)  upToEnd ] bench.
>  [ str readXtream binary upToEnd ] bench.
>  }
>  #('2039.392121575685 per second.' '1158.568286342731 per second.'
> '92143.1713657269 per second.')
>
> So, going by the numbers above, conversion is roughly 45 to 80 times
> slower than just copying data :)
> We need to tighten up this gap.
> One way would be to optimize #readInto:startingAt:count: using batch-mode
> conversion.
>

Igor, you also have a problem:

| str |
str := (StandardFileStream readOnlyFileNamed: 'unitext.txt') readXtream binary.
(XtreamUTF8Converter on: str readXtream) upToEnd = (StandardFileStream
readOnlyFileNamed: 'unitext.txt') contents utf8ToSqueak
-> false

unless the difference comes from utf8ToSqueak and the leadingChar stuff...
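
One way to tell a real decoding bug from a tagging difference is to look at the first code point where the two results disagree. The Python sketch below is only an illustration: first_mismatch is a hypothetical helper, and the bit position used for the tag in the demo is made up, not taken from the Squeak sources. If the two values differ only in some high bits, the decoders agree on the character and differ only in extra tagging.

```python
def first_mismatch(actual, expected):
    """Return (index, actual value, expected value) at the first difference.

    Both arguments are sequences of integer code points. Returns None when
    they are equal, and (index, None, None) when one is a prefix of the other.
    """
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i, a, e
    if len(actual) != len(expected):
        return min(len(actual), len(expected)), None, None
    return None


expected = [ord(c) for c in "мир"]
# Assumption for the demo only: one decoder sets an extra high bit on each value.
tagged = [cp | (1 << 22) for cp in expected]

index, a, e = first_mismatch(tagged, expected)
print(index, hex(a), hex(e), hex(a ^ e))
```

A difference (a ^ e) that is the same for every mismatching character points at tagging, while a varying difference points at the decoding logic itself.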
