I just gave a try to the BufferedFileStream.
As usual, code is MIT.
Implementation is rough, readOnly, partial (no support for the basicNext crap et al.), untested (it certainly has bugs).
Early timing experiments have shown a 5x to 7x speedup on [stream nextLine] and [stream next] micro-benchmarks; see the class comment of the attachment.

Reminder: this benchmark is versus StandardFileStream. StandardFileStream is the "fast" version; CrLf and MultiByte are far worse! So there is still more room for improvement...

Integrating and testing a read/write version is a lot harder than this experiment, but we should really do it.

Nicolas

BufferedFileStream.st (13K)
Hello Nicolas,
thanks for taking the time to implement this idea.

Since you are going to introduce something more clever than simple-minded primitive-based file operations, I think it's worth thinking about creating separate classes for buffering/caching. Let's call it readStrategy, or writeStrategy, or cacheStrategy. The idea is to redirect all read/write/seek operations to a special layer which, depending on its implementation, could choose whether a given operation will be just a dumb primitive call or something more clever, like read-ahead etc. Then all streams (not only file streams) could be created using a chosen strategy, depending on the user's will.

About the BufferedFileStream implementation: there is some room for improvement. The cache should remember its own starting position + size; then in #skip: you simply do

  self primSetPosition: fileID to: filePosition \\ bufferSize.

without touching the buffer, because you can't predict what operation follows (it could be another #skip:, or truncate, or close), which would make your read-ahead redundant.

The cache should be refreshed only on a direct read request, when some data that needs to be read is outside the range covered by the cache. Let me illustrate the case which shows the suboptimal #skip: behavior:

  ........>........[..........<..........]........

Here, [ ] encloses the cached data, and > is the file position after a #skip: send. Then the caller wants to read bytes up to the < marker. In your case, #skip: will refresh the cache, causing part of the data which was already in the buffer to be re-read, while it is possible to reuse the already-cached data and read only the bytes between > and [ ; the rest can be delivered from the cache. Also, since after the read request the file pointer will point at the < marker, we are still inside the cache and don't need to refresh it.

2009/11/18 Nicolas Cellier <[hidden email]>:
> I just gave a try to the BufferedFileStream.
> [...]

--
Best regards,
Igor Stasenko AKA sig.
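Igor's cache-window idea can be made concrete outside Smalltalk. The following is a minimal illustrative sketch in Python, not the actual BufferedFileStream code; the class and method names are invented for the example. The key points it demonstrates are the ones from the message above: the buffer remembers its own start position and size, #skip: only moves the logical position, and the buffer is refreshed lazily, on the first read that falls outside the cached window.

```python
import io

class LazyBufferedReader:
    """Illustrative sketch of a lazy read buffer (invented names).

    skip() never touches the buffer; read() refills it only when the
    requested range is outside the cached window. Reads larger than
    buffer_size are truncated in this sketch, for brevity."""

    def __init__(self, raw, buffer_size=4096):
        self.raw = raw                  # underlying file-like object
        self.buffer_size = buffer_size
        self.pos = 0                    # logical stream position
        self.buf_start = 0              # file offset of the cached data
        self.buf = b""                  # cached bytes
        self.refills = 0                # instrumentation, for the demo only

    def skip(self, n):
        # Just move the logical position; we can't predict what comes next
        # (another skip, a close...), so reading ahead here would be wasted.
        self.pos += n

    def _fill(self):
        self.raw.seek(self.pos)
        self.buf_start = self.pos
        self.buf = self.raw.read(self.buffer_size)
        self.refills += 1

    def read(self, n):
        # Refresh only if [pos, pos+n) is not covered by the cached window.
        inside = (self.buf_start <= self.pos and
                  self.pos + n <= self.buf_start + len(self.buf))
        if not inside:
            self._fill()
        offset = self.pos - self.buf_start
        data = self.buf[offset:offset + n]
        self.pos += len(data)
        return data
```

With this shape, two consecutive skips cost nothing, and a read that lands back inside the already-cached window is served from memory without any primitive call.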
2009/11/18 Igor Stasenko <[hidden email]>:
> Hello Nicolas,
> thanks for taking a time implementing this idea.
>
> Since you are going to introduce something more clever than simple-minded
> primitive based file operations, i think its worth to think about
> creating a separate classes for buffering/caching.
> [...]
> So, then all streams (not only file stream) could be created using
> choosen strategy depending on user's will.

Yes, delegating is a very good idea. I'm quite sure other Smalltalks do that already (I did not want to be tainted, so I just kept away, reinventing my own wheel). This trial was a minimal proof of concept; it cannot decently pretend to be a clean rewrite.

> About BufferedFileStream implementation. There are some room for improvement:
> cache should remember own starting position + size
> [...]
> Also, since after read request, a file pointer will point at < marker,
> we are still inside a cache, and don't need to refresh it.

Agreed, my current buffer implementation is not lazy enough. It does read ahead before knowing whether that is really necessary :(

If I understand correctly, you would avoid throwing the buffer away until you are sure it won't be reused. I'm not sure the use cases are worth the subtle complications; two consecutive #skip: sends should be rare...

Anyway, all these tricks had better be hidden in a private policy object indeed, otherwise the future subclasses which would inevitably flourish under BufferedFileStream (the Squeak entropy) might well break this masterpiece :)

Cheers

Nicolas
2009/11/18 Nicolas Cellier <[hidden email]>:
> 2009/11/18 Igor Stasenko <[hidden email]>:
>> Hello Nicolas,
>> thanks for taking a time implementing this idea.
>> [...]
>
> Yes, delegating is a very good idea.
> Quite sure other smalltalks do that already (I did not want to be
> tainted, so just kept away, reinventing my own wheel).
> This trial was a minimal proof of concept, it cannot decently pretend
> being a clean rewrite.

But it has shown us the potential for improvements. Seriously, a 5x-7x speedup is not something we can just forget and throw away.

> Agree, my current buffer implementation is not lazy enough.
> It does read ahead before knowing if really necessary :(
>
> If I understand it, you would avoid throwing the buffer away until you
> are sure it won't be reused.
> Not sure if the use cases are worth the subtle complications. Two
> consecutive skip: should be rare...

Yes, it is rare and quite unlikely, but you caught my intent clearly: do not throw away the buffer unless it is deemed necessary. Let's keep in mind that any memory operation is orders of magnitude faster than a disk operation; moreover, the filesystem could be a remotely mounted drive, which adds even more latency to all file-based operations. So fighting that with a cache is a good strategy.

> Anyway, all these tricks should better be hidden in a private policy
> Object indeed, otherwise future subclasses which would inevitably
> flourish under BufferedFileStream (the Squeak entropy) might well
> break this masterpiece :)

Right. A separate layer makes a clean room for experiments, without the need to rewrite the whole stream class hierarchy, and especially the subclasses, where things start exploding exponentially. There should be a very thin layer based on the most simple operations (read, write, seek), with the rest of the stream interface built on top of it.

So, if we can identify this thin layer and make it pluggable, then we can be sure that at least some part of the stream library can be easily customized; and if this part works well, we can be sure the streams are in good shape, without needing to visit and test numerous methods in multiple (sub)classes, which is quite messy.

--
Best regards,
Igor Stasenko AKA sig.
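The "very thin pluggable layer" described above can be sketched like this. This is an illustrative Python sketch under the thread's assumptions, not Squeak code; all class and method names (IOStrategy, InMemoryStrategy, next_line) are invented for the example. The point is that the rich stream protocol is written entirely against the three thin-layer operations, so only the strategy needs swapping or testing.

```python
class IOStrategy:
    """The thin layer: only the most simple operations live here.
    Concrete strategies decide whether a call is a dumb primitive
    call or something cleverer (buffering, read-ahead, ...)."""
    def read(self, n): raise NotImplementedError
    def write(self, data): raise NotImplementedError
    def seek(self, pos): raise NotImplementedError

class InMemoryStrategy(IOStrategy):
    """Stands in for the 'dumb primitive call' file strategy."""
    def __init__(self, data=b""):
        self.data = bytearray(data)
        self.pos = 0
    def read(self, n):
        chunk = bytes(self.data[self.pos:self.pos + n])
        self.pos += len(chunk)
        return chunk
    def write(self, data):
        self.data[self.pos:self.pos + len(data)] = data
        self.pos += len(data)
    def seek(self, pos):
        self.pos = pos

class Stream:
    """The rest of the stream interface is built on the thin layer only:
    every rich method reduces to strategy.read/write/seek."""
    def __init__(self, strategy):
        self.strategy = strategy
    def next(self):
        return self.strategy.read(1)
    def next_line(self):
        out = bytearray()
        while True:
            c = self.strategy.read(1)
            if c in (b"", b"\n"):
                return bytes(out)
            out += c
```

Swapping InMemoryStrategy for a file-backed or socket-backed strategy (or a buffered wrapper around either) would leave Stream untouched, which is exactly the customization point the message argues for.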
On Wed, Nov 18, 2009 at 3:10 AM, Nicolas Cellier <[hidden email]> wrote:
> I just gave a try to the BufferedFileStream.

Just want to wish you every encouragement! This is *really* useful work.
2009/11/18 Eliot Miranda <[hidden email]>:
> On Wed, Nov 18, 2009 at 3:10 AM, Nicolas Cellier <[hidden email]> wrote:
>> I just gave a try to the BufferedFileStream.
>> [...]
>
> Just want to wish you every encouragement! This is *really* useful work.

Beware, I just wrote it from scratch and did not even run one single method since the read/write refactoring... So far, I have rather spent my spare time commenting the implementation (see the class comment too), in case some good souls want to analyze/try it.

It should be reasonably optimized for the readOnly and random read/write cases. For append-only it might not be optimal, due to useless attempts to read past the end, but that should not cost that much. For read/append there is probably room for more efficiency too, but a major improvement vs StandardFileStream should already show up. I'm not sure we really need to introduce these optimizations.

The path to a cleaner/faster stream library is longer than just this little step. Besides testing, we'd have to refactor the hierarchy, insulate all instance variables, and delegate as much as possible, as Igor suggested. We'd better continue on the cleaning path and not just add another FileStream subclass complexifying a bit further an unnecessarily complex library.

Nicolas

BufferedFileStream.st (18K)
On 26-Nov-09, at 2:48 PM, Nicolas Cellier wrote:

> The path to a cleaner/faster stream library is longer than just this
> little step.
> [...]
> We'd better continue on the cleaning path and not just add another
> FileStream subclass complexifying a bit more an unecessarily complex
> library.

I've been thinking about this too. For Filesystem, I've only implemented very basic stream functionality so far. But I do intend to develop its stream functionality further, and to go in a very different direction from the existing design. Some design elements:

- Using handles to decouple the streams from the storage they're operating on. The same stream class should be able to read or write to collections, sockets, files etc.

- Separating ReadStream from WriteStream. I find code that both reads and writes to a particular stream to be very rare in practice, and in the cases where it does happen, reading and writing are separate activities, so using separate streams wouldn't introduce problems. On the other hand, a lot of the complexity in the existing hierarchy stems from the mingling of read and write functionality.

- Simplified protocols. The existing stream classes have accumulated a lot of cruft that should be implemented as objects that use streams, rather than being streams themselves. Examples include fileIn, fileOut, ReferenceStream etc.

- Composition rather than inheritance. As I go about implementing string encoding, buffering, compression etc., I plan to enable the creation of stream pipelines to provide combinations of functionality. Instead of implementing a BufferedUtf8DeflateFileStream, I want to create a sequence of streams like this:

  WriteStream -> Utf8Encoder -> DeflateCompressor -> Buffer -> Handle

- Grow the new streams parallel to the existing ones. Rather than trying to maintain backwards compatibility, leave the old streams in place and continue to improve them while the new ones are being developed. Migration to the new streams can happen gradually. If the new streams don't attract any users, obviously I'm on the wrong track. :-)

So I've been watching your cleanup efforts with interest, particularly the buffering stuff. Keep it up!

Colin
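The pipeline idea above can be sketched concretely. This is an illustrative Python sketch, not the proposed Smalltalk code; the stage names mirror the diagram, BytesSink stands in for the Handle, and everything else is invented for the example. Each stage exposes the same tiny write/close protocol and wraps the next stage downstream.

```python
import zlib

class Utf8Encoder:
    """Turns text into UTF-8 bytes, then hands them downstream."""
    def __init__(self, downstream): self.down = downstream
    def write(self, text): self.down.write(text.encode("utf-8"))
    def close(self): self.down.close()

class DeflateCompressor:
    """Compresses the byte stream with zlib/deflate."""
    def __init__(self, downstream):
        self.down = downstream
        self.z = zlib.compressobj()
    def write(self, data): self.down.write(self.z.compress(data))
    def close(self):
        self.down.write(self.z.flush())  # emit any pending compressed bytes
        self.down.close()

class Buffer:
    """Accumulates bytes and writes downstream in large chunks."""
    def __init__(self, downstream, size=4096):
        self.down = downstream
        self.size = size
        self.pending = bytearray()
    def write(self, data):
        self.pending += data
        if len(self.pending) >= self.size:
            self.down.write(bytes(self.pending))
            self.pending.clear()
    def close(self):
        self.down.write(bytes(self.pending))
        self.pending.clear()
        self.down.close()

class BytesSink:
    """Stands in for the Handle at the end of the pipeline."""
    def __init__(self):
        self.data = bytearray()
        self.closed = False
    def write(self, data): self.data += data
    def close(self): self.closed = True
```

Usage follows the diagram directly: `Utf8Encoder(DeflateCompressor(Buffer(BytesSink())))` builds WriteStream -> Utf8Encoder -> DeflateCompressor -> Buffer -> Handle, and no BufferedUtf8DeflateFileStream class ever needs to exist.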
>>>>> "Nicolas" == Nicolas Cellier <[hidden email]> writes:
Nicolas> The path to a cleaner/faster stream library is longer than just this
Nicolas> little step. [...] We'd better continue on the cleaning path and not
Nicolas> just add another FileStream subclass complexifying a bit more an
Nicolas> unecessarily complex library.

Michael Lucas-Smith gave a nice talk on Xtreams at the Portland Linux Users Group. The most interesting thing out of this is the notion that #atEnd is just plain wrong. For some streams, computing #atEnd is impossible; for most streams, it's just expensive. Instead, Xtreams takes the approach that #do: suffices for most people, and for those cases where it can't, an exception when you read past the end of the stream can provide the proper exit from your loop. Then your loop can concentrate on what happens most of the time, instead of what happens rarely.

Xtreams is under a liberal license, and is currently in the Cincom public store. Instead of reinventing yet another stream package, we should be looking at Xtreams, I think. (As a side effect, Xtreams has as a test a very nice PEG parsing package... so we'd get DSLs relatively for free.)

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<[hidden email]> <URL:http://www.stonehenge.com/merlyn/>
Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.vox.com/ for Smalltalk and Seaside discussion
On 26-Nov-09, at 9:36 PM, Randal L. Schwartz wrote:

> Xtreams is under a liberal license, and is currently in the Cincom public
> store.
>
> Instead of reinventing yet another stream package, we should be looking at
> Xtreams, I think.

Very cool. We definitely need to steal ideas from them. ...and code, perhaps? I did a bit of poking around, but couldn't find anything on the web that said what the license actually is. Can you be more specific than "liberal"?

Colin
2009/11/27 Colin Putney <[hidden email]>:
> On 26-Nov-09, at 2:48 PM, Nicolas Cellier wrote:
> [...]
>
> - Separating ReadStream from WriteStream. I find code that both reads and
> writes to a particular stream to be very rare in practice, and in cases
> where it does happen, reading and writing are separate activities and using
> separate streams wouldn't introduce problems. [...]

Yes, it's mostly a read-append stream usage for the change log... However, a buffered implementation will be difficult with separate read/write buffers in the rare case where we do need read/write capabilities: writing might trash the read buffer, so they are not independent.

> - Simplified protocols. The existing stream classes have accumulated a lot
> of cruft that should be implemented as objects use streams rather than being
> streams themselves. Examples include fileIn, fileOut, RefrenceStream etc.

Yes, packaging and modularization of the core...

> - Composition rather than inheritance. [...]
>
> WriteStream -> Utf8Encoder -> DeflateCompressor -> Buffer -> Handle

Agreed again.

> - Grow the new streams parallel to the existing ones. [...]
>
> So I've been watching your cleanup efforts with interest, particularly the
> buffering stuff. Keep it up!

Obviously, it's just a piece of a larger puzzle.

> Colin

Nicolas
> Nicolas> The path to a cleaner/faster stream library is longer than just this
> Nicolas> little step. [...]
>
> Michael Lucas-Smith gave a nice talk on Xtreams at the Portland Linux Users
> Group. The most interesting thing out of this is the notion that #atEnd is
> just plain wrong. [...] Then, your
> loop can concentrate on what happens most of the time, instead of what happens
> rarely.

I think we need a common superclass of Stream and Collection named Iterable, where #do: is abstract and #select:, #collect:, #reject:, #count:, #detect:, etc. (and quite a lot of the messages in the enumerating category of Collection) are implemented based on #do:.

Of course, Stream can refine the #select:/#reject: methods to answer a FilteredStream that decorates the receiver and applies the filtering on the fly. In the same way, #collect: can return a TransformedStream that decorates the receiver, etc.

Just my 2 cents.

Cheers,

-- Diego
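The Iterable idea above can be sketched quickly. This is an illustrative Python sketch, not a proposal for actual Squeak code; Smalltalk's #do:/#select:/#collect: are spelled do_/select_/collect_ here, and the Interval example class is invented. Subclasses provide only do_, and the whole enumerating protocol derives from it.

```python
from abc import ABC, abstractmethod

class Iterable(ABC):
    """Common superclass: everything derives from do_()."""

    @abstractmethod
    def do_(self, block):
        """Evaluate block for each element."""

    def select_(self, pred):
        out = []
        self.do_(lambda e: out.append(e) if pred(e) else None)
        return out

    def collect_(self, block):
        out = []
        self.do_(lambda e: out.append(block(e)))
        return out

    def count_(self, pred):
        return len(self.select_(pred))

    def detect_(self, pred):
        # Simplified: enumerates fully rather than stopping at the first
        # hit, which is the exact weakness Ralph raises further down.
        hits = self.select_(pred)
        if hits:
            return hits[0]
        raise ValueError("no element satisfies the predicate")

class Interval(Iterable):
    """A Collection-like Iterable over start..stop, inclusive."""
    def __init__(self, start, stop):
        self.start, self.stop = start, stop
    def do_(self, block):
        for i in range(self.start, self.stop + 1):
            block(i)
```

A Stream subclass would override select_/collect_ to answer lazy decorating streams instead of eager lists, as the message suggests.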
2009/11/27 Diego Gomez Deck <[hidden email]>:
>> Nicolas> The path to a cleaner/faster stream library is longer than just this
>> Nicolas> little step. [...]
>
> I think we need a common superclass for Streams and Collection named
> Iterable where #do: is abstract and #select:, #collect:, #reject:,
> #count:, #detect:, etc (and quite a lot of the messages in enumerating
> category of Collection) are implemented based on #do:
> [...]
> Just my 2 cents.

Yes, this is the gst (GNU Smalltalk) approach, and it seems a good one.
> I think we need a common superclass for Streams and Collection named
> Iterable where #do: is abstract and #select:, #collect:, #reject:,
> #count:, #detect:, etc (and quite a lot of the messages in enumerating
> category of Collection) are implemented based on #do:
>
> Of course Stream can refine the #select:/#reject methods to answer a
> FilteredStream that decorates the receiver and apply the filtering on
> the fly. In the same way #collect: can return a TransformedStream that
> decorates the receiver, etc.

Since Stream can't reuse #select: and #collect: (or #count:; and #detect: on an infinite stream is risky), they shouldn't be in the superclass. In that case, what is its purpose?

I think it is fine to give Stream the same interface as Collection. I do this, too. But they will share very little code, and so there is no need to give them a common superclass.

-Ralph Johnson
2009/11/27 Diego Gomez Deck <[hidden email]>:

> Nicolas> The path to a cleaner/faster stream library is longer than just this
> [...]

Maybe I'm wrong, but I think traits are a good (better) solution for that kind of problem: #do: can be a required method, and you can implement the remaining methods based on #do:.
On Friday, 2009-11-27, at 06:15 -0600, Ralph Johnson wrote:
> > I think we need a common superclass for Streams and Collection named
> > Iterable where #do: is abstract and #select:, #collect:, #reject:,
> > #count:, #detect:, etc [...]
>
> Since Stream can't reuse #select: and #collect: (or #count, and
> #detect: on an infinite stream is risky),

Stream and Collection are just the two refinements of Iterable that we're talking about in this thread, but there are a lot of classes that could benefit from Iterable as a superclass.

On the other side, Stream has #do: (and the #atEnd/#next pair), and it's also risky for infinite streams. To push this discussion forward: is an InfiniteStream a real Stream?

> they shouldn't be in the
> superclass. In that case, what is its purpose?
>
> i think it is fine to give Stream the same interface as Collection. I
> do this, too. But they will share very little code, and so there is
> no need to give them a common superclass.
>
> -Ralph Johnson

Cheers,

-- Diego
2009/11/27 Diego Gomez Deck <[hidden email]>:
> El vie, 27-11-2009 a las 06:15 -0600, Ralph Johnson escribió:
>> [...]
>
> On the other side, Stream has #do: (and #atEnd/#next pair) and it's also
> risky for infinite streams. To push this discussion forward, Is
> InfiniteStream a real Stream?

#select: and #collect: are not necessarily dangerous, even on an infinite stream, once you see them as filters and implement them with lazy block evaluation: Stream select: aBlock should return a SelectStream (find a better name here :)). Then you would use it with #next, as with any other InfiniteStream.
2009/11/27 Colin Putney <[hidden email]>:
> On 26-Nov-09, at 2:48 PM, Nicolas Cellier wrote:
> [...]
>
> - Composition rather than inheritance. As I go about implementing string
> encoding, buffering, compression etc. I plan to enable the creation of
> stream pipelines to provide combinations of functionality. Instead of
> implementing BufferedUtf8DelfateFilestream, I want to create a sequence of
> streams like this:
>
> WriteStream -> Utf8Encoder -> DeflateCompressor -> Buffer -> Handle

+100. Just yesterday I was thinking about the same design principle: composition. I call it a StreamAdaptor. It should carry a minimal set of methods providing the basic operations (read/write/seek etc.), and it should also support pipelining in the same way as you illustrated above.

Let's say that initially we create a stream which works with a file:

  Stream -> FileAdaptor

Then we want it to be buffered:

  stream adaptor: (stream adaptor beBuffered)
  Stream -> BufferAdaptor -> FileAdaptor

Then we want it to be compressed:

  stream adaptor: (ZipAdaptor on: stream adaptor)
  Stream -> DeflateCompressor -> BufferAdaptor -> FileAdaptor

and so on. It is easy to see that if we want to create the same structure for a socket connection, all we need is to use a socket adaptor in the chain, while the rest doesn't require any modification.

> - Grow the new streams parallel to the existing ones. [...]
>
> So I've been watching your cleanup efforts with interest, particularly the
> buffering stuff. Keep it up!

--
Best regards,
Igor Stasenko AKA sig.
On Thu, Nov 26, 2009 at 08:56:08PM -0800, Colin Putney wrote:
> I've been thinking about this too. For Filesystem, I've only
> implemented very basic stream functionality so far. [...]
>
> - Using handles to decouple the streams from the storage they're
> operating on. The same stream class should be able to read or write to
> collections, sockets, files etc.

I implemented IOHandle for this; see http://wiki.squeak.org/squeak/996. I have not maintained it since about 2003, but the idea is straightforward. My purpose at that time was to:

* Separate the representation of external IO channels from the representation of streams and communication protocols.
* Provide a uniform representation of IO channels, similar to the Unix notion of treating everything as a 'file'.
* Simplify future refactoring of Socket and FileStream.
* Provide a place for handling asynchronous IO events. Refer to the aio handling in the Unix VM. Files, Sockets, and AsyncFiles could (should) use a common IO event handling mechanism (an aio event signaling a Smalltalk Semaphore).

Since then I have added aio event handling for files (AioPlugin, see http://wiki.squeak.org/squeak/3384), which is a layer on top of Ian's aio event handling in the Unix and OS X VMs and is mainly useful for handling Unix pipes. But I still think that a more unified view of "handles for IO channels" is a good idea. The completely separate representation of files and sockets in Squeak still feels wrong to me, maybe just because I am accustomed to Unix systems.

Dave
>>>>> "Colin" == Colin Putney <[hidden email]> writes:
Colin> ...and code, perhaps? I did a bit of poking around, but couldn't find
Colin> anything on the web that said what the license actually is. Can you be
Colin> more specific than "liberal?"

MLS made it clear at the meeting that Cincom's default release model is now "open source", except for things that are business-differentiating. In fact, they would really like to see Xtreams adopted widely, so the license would have to be MIT-like for that to happen. I'm sure if we poked Arden or James Robertson we could get a statement of the license for Xtreams rather quickly.

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<[hidden email]> <URL:http://www.stonehenge.com/merlyn/>
Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.vox.com/ for Smalltalk and Seaside discussion
On 27-Nov-09, at 8:03 AM, David T. Lewis wrote:

> I implemented IOHandle for this, see http://wiki.squeak.org/squeak/996.
> I have not maintained it since about 2003, but the idea is
> straightforward.

Yes. I looked into IOHandle when implementing Filesystem, but decided to go with a new (simpler, but limited) implementation that would let me explore the requirements for the stream architecture I had in mind.

> My purpose at that time was to:
> [...]

Indeed. Filesystem comes at this from the other direction, but I think we want to end up in the same place. For now I've done TSTTCPW, which is to use the primitives from the FilePlugin. But eventually I want to improve the plumbing. You've done some important work here; perhaps Filesystem can use AioPlugin at some point.

Colin