Re: Generators in Smalltalk??

Re: Generators in Smalltalk??

Ralph Boland
(Warning, long post)

I do not understand all the commotion over generators.

If a Collection class cannot do what you want, use a Stream.
Take a read-only Stream and subclass from it a class called GenMyData.
Inside GenMyData, write code that generates the data you
need on the fly and feeds it to the user upon request via
the ReadStream interface.
Perhaps there should be an abstract class "Generator", a subclass
of Stream, to provide some generic generator functionality.
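
Here is roughly what I mean, as a minimal sketch (the class and
selector names are made up, and to keep the sketch self-contained I
implement the read protocol directly on Object rather than wiring it
into the Stream hierarchy):

    Object subclass: #SquareGenerator
        instanceVariableNames: 'index limit'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Generators-Sketch'

    SquareGenerator >> setLimit: n
        index := 0.
        limit := n

    SquareGenerator >> next
        "Compute the next value on demand; nothing is stored up front."
        index := index + 1.
        ^ index * index

    SquareGenerator >> atEnd
        ^ index >= limit

    "usage, via the ordinary ReadStream protocol:"
    gen := SquareGenerator new setLimit: 5.
    [gen atEnd] whileFalse: [Transcript show: gen next printString; cr]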

What exactly do Python generators provide that my proposal does not?
And is not my interface cleaner?
And more adaptable?

We could generalize this idea to that of a Pipe.  A Pipe would be
a generalization (subclass) of a  ReadWriteStream.  A Pipe can accept
data using the WriteStream interface.  The Pipe
can then process the data in some way and then feed the result
to the user upon request using the ReadStream interface.
A pipe is more complicated than a generator because it needs to
be able to block.  If data is being fed into it but not being taken out
then eventually it may fill up.  Similarly, if data is being removed
from it, then it may run out of data and again need to block.
Blocking may mean a process blocks or it may mean sending a message
to another pipe to send or accept data.  I have not worked out the details.
Given how useful pipes are in Linux, I think a Generic Pipe class
would be VERY useful.  Of particular use would be operators such
as '<<',   '||',  '>>', and  tee.  These operators would provide functionality
similar to what  Linux does with pipes.
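
For the blocking part, Squeak's existing SharedQueue already gives a
taste of it: its #next suspends the calling process until data arrives
(though, being unbounded, its #nextPut: never blocks the writer the way
a bounded pipe would):

    | pipe |
    pipe := SharedQueue new.
    [Transcript show: pipe next printString; cr] fork.  "blocks until fed"
    pipe nextPut: 42.  "wakes the forked reader, which prints 42"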

The '<<' operator would pass the argument (a ReadStream, Generator,
or Pipe)  to the receiver which would use it as if it were a  ReadStream.

The '>>'  operator would pass the argument (a WriteStream or Pipe)
to the receiver which would use it as if it were a WriteStream.
I guess we need another concept here, the reverse of a generator, which accepts
data like a WriteStream, does some processing with the data, and keeps
the result.
Let's call this an Absorber.

The '||' operator (assuming the receiver is a ReadStream, Generator,
or Pipe) requires the argument to be a WriteStream, Absorber, or Pipe.
The result of the operation is an Absorber, a Generator, or a Pipe,
depending upon the classes of the receiver and argument.
If the receiver is a Generator (or ReadStream)
and the argument is an Absorber (or WriteStream),
then the actual calculation is triggered and the result is an Absorber
(or WriteStream) which holds the final result of the computation.
Actually, whenever the receiver is a Generator (or ReadStream), the
computation could be triggered, but it would eventually block if the
argument is a Pipe.
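
In use, a chain might read like this (every class here is hypothetical,
just to show the shape of the code):

    "pull primes, push them through a doubling Pipe, absorb the sum"
    result := Primes new || Doubler new || Sum new.
    "the leftmost receiver is a Generator and the final argument an
     Absorber, so the whole computation is triggered and result holds
     the finished Sum"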

The Tee class is a generalization (subclass) of class Pipe.  It is
instantiated with a collection argument whose elements are WriteStreams,
Absorbers, Pipes, or Tees.  The Tee class will forward the results of its
computation to each of the elements of the collection.
I suppose there could be the reverse of a Tee, ReverseTee,
which accepts input from any of a collection of ReadStreams, Pipes, or
ReverseTees.
ReadStreams, Generators, and Pipes have a method  tee: aCollection
which creates a Tee instance and connects it to the receiver.
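
Something like (hypothetical again, assuming aFileStream is an already
open WriteStream):

    Primes new tee: { Sum new. aFileStream. Transcript }.
    "each value flowing out of Primes is forwarded to all three"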

I am sure there are numerous objects that make sense as Generators, Absorbers
and Pipes.  For example, random number generators should be Generators.
Wow! Wasn't that insightful!

e.g. An instance of class Prime could be a Generator from which prime numbers
can be fetched. It would store the first 10,000 primes, say,
and generate further primes on demand.

e.g.  Class PrimeFactors could be instantiated with a number, which the
instance factors.  The instance would be a Generator from which the factors
of the initial number can be fetched in sorted order.

e.g.  An instance of class Sum could be an Absorber that stores the sum of the
objects (numbers, vectors, matrices, etc.) fed to it.
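
Sum is small enough to sketch in full (names illustrative):

    Object subclass: #Sum
        instanceVariableNames: 'total'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Generators-Sketch'

    Sum >> nextPut: anObject
        "Absorb one value; works for anything that answers #+."
        total := total isNil ifTrue: [anObject] ifFalse: [total + anObject].
        ^ anObject

    Sum >> contents
        "Answer the absorbed result, as a WriteStream would."
        ^ total

    "usage:"
    sum := Sum new.
    #(1 2 3 4) do: [:each | sum nextPut: each].
    sum contents  "answers 10"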

e.g.  An instance of class TextToLines could be a Pipe which acts like a
WriteStream of text and a ReadStream of lines of text.
That is, it is fed text one character at a time and
the same text can be fetched from it one line at a time.
Is that confusing?  I think some terminology needs to be standardized
here so we know what we are talking about.
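
A rough sketch of TextToLines (illustrative only; a real Pipe would
also handle blocking when no complete line is buffered yet):

    Object subclass: #TextToLines
        instanceVariableNames: 'buffer'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Generators-Sketch'

    TextToLines >> buffer
        "Lazy initialization keeps the sketch dialect-neutral."
        ^ buffer ifNil: [buffer := OrderedCollection new]

    TextToLines >> nextPut: aCharacter
        "WriteStream side: accept one character."
        self buffer addLast: aCharacter.
        ^ aCharacter

    TextToLines >> atEnd
        "ReadStream side: a line is available once a cr is buffered."
        ^ (self buffer includes: Character cr) not

    TextToLines >> next
        "ReadStream side: answer the next complete line, sans terminator."
        | line |
        line := WriteStream on: String new.
        [self buffer first = Character cr]
            whileFalse: [line nextPut: self buffer removeFirst].
        self buffer removeFirst.  "discard the cr"
        ^ line contents

    "usage:"
    t := TextToLines new.
    'one line' do: [:ch | t nextPut: ch].
    t nextPut: Character cr.
    t atEnd ifFalse: [Transcript show: t next; cr]  "prints 'one line'"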

I have always wanted to implement Pipes in Smalltalk but have never found the
time.  I think it could be a fun project and VERY useful.


Exercise:  Write code in Smalltalk to copy a file to another file, assuming the
first file exists and is readable and the second file can be created
and written.
That is, ignore any error possibilities.
Then do the same exercise using a pipe.  (Don't implement the '||' operator;
we only want to look at the code to see which is simpler.)
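
For reference, the non-pipe half might look like this in Squeak
(character at a time for clarity, errors ignored as stated, file names
made up):

    | in out |
    in := FileStream readOnlyFileNamed: 'source.txt'.
    out := FileStream newFileNamed: 'copy.txt'.
    [in atEnd] whileFalse: [out nextPut: in next].
    in close.
    out close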

Bugs.
You could create an infinite loop by linking  a sequence of pipes into a loop.
Is this always a bug?
How would the computation be triggered in this case?

Is the hierarchy all wrong?
Should Absorber and Generator be subclasses of Pipe (i.e. pipes with
some functionality removed?), and ReadStream and WriteStream be
subclasses of Generator and Absorber respectively?  This is reopening
the controversy over the best hierarchy for Streams, only worse. :-(
Pragmatically speaking, I think Absorbers, Generators, and Pipes should
have a hierarchy separate from Streams but, of course, share its interface.

So whatayaall think?

Frankly, I think the biggest problem is that pipes would be altogether too
convenient; they would encourage writing code that is easy to write but
has too high an overhead because of the need to deal with blocking.
That is, pipes require good judgment, or that performance not be an issue.
If your code is too slow and some pipes you are using are the problem, then
rewrite that code, perhaps avoiding the pipes.

While I do not have the time to do this project on my own
I could probably spare some time to help out if a group
could form to do this.  Besides, I am sure others could add much insight.

Note that this project involves more than creating the abstract classes
Generator, Absorber, and Pipe.  These three classes would be the
heads of class hierarchies, just as Stream is.
I don't know what the subclasses would be though.
Pipe might have a subclass BufferedPipe.

Also, I hasten to point out that many classes would simply implement the
Pipe, Generator, or Absorber interface and not actually be subclasses
of any of these classes.

Interesting question:  Are there classes in Squeak that implement the
ReadStream, WriteStream, or ReadWriteStream interfaces now but are not
subclasses of these classes?
If not, should there be?
That is, are there classes in Squeak that should implement these interfaces?
In the Java world, Pipe, Generator, and Absorber would be Interfaces.
I don't like Java much, but Interfaces I liked.


Regards to those of you still here.

Ralph Boland




--
Had a wife and kids in Florida, Jack (Nicklaus)
Went out for a "ride" but I couldn't get back
As a ten timer being hunted down
OR
When the wife found out who I was knowing (biblically speaking)
I hit a hydrant and I just kept going (till I hit a tree).
...


Re: Generators in Smalltalk??

Andreas.Raab
Ralph Boland wrote:
> I do not understand all the commotion over generators.

Generators transform a push model (#do:) into a pull model (#next).
Let's say you have two algorithms: one that produces values via a push
model and one that consumes values via a pull model:

producer := SomeAlgorithm new.
producer do: [:value | "produces the value from the algorithm"].

consumer := AnotherAlgorithm new.
consumer process: inputStream. "consumes the values from inputStream"

Plugging these two together can be a royal pain. You basically have the
following three options:

1) Preproduce all the values from the producer and create a stream that
holds them:

values := Array streamContents:[:s| producer do:[:each| s nextPut: each]].
consumer process: values readStream.

This can work very well if the number of values is limited and if you
know you're going to consume all the produced values (doesn't work that
well with infinite sequences of numbers).

2) Rewrite either one of the algorithms. Can work very well if it's your
own code and reasonably simple, and can be extremely painful if it's
complex.

3) Use a Generator. With a generator, the solution becomes simply:

consumer process: (Generator on:[:yield| producer do: yield]).

When you can use neither solution 1) nor 2), generators simplify your life
*greatly*. In fact, I'd say they are indispensable if you're dealing
with a complex existing environment.

> If a Collection class cannot do what you want, use a Stream.
> Take a read only Stream and subclass from it a class called GenMyData.
> Inside GenMyData write code that generates the data you
> need on the fly and feeds it to the user upon request using
> the ReadStream interface.

See above. You're describing option #2. Rewriting an existing algorithm
to "feed" the data properly can be decidedly non-trivial. You can start
with Integer>>primesUpTo:do:, which is a very simple algorithm. The point
being that, while all algorithms can be expressed either way, the generic
way in which a Generator does that removes the need to reimplement each
specific algorithm.
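
Concretely, something like this (a sketch against the Generator class
in recent Squeak images; primesUpTo:do: lives class-side on Integer):

primes := Generator on: [:g |
    Integer primesUpTo: 100 do: [:each | g yield: each]].
primes next. "2"
primes next. "3, and so on; the push-style algorithm was not touched"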

Cheers,
   - Andreas


Re: Generators in Smalltalk??

Andreas.Raab
As additional motivation, the Motivation section of PEP 255 is a great
read to explain why generators are interesting and what the alternatives
are:

Abstract

     This PEP introduces the concept of generators to Python, as well
     as a new statement used in conjunction with them, the "yield"
     statement.

Motivation

     When a producer function has a hard enough job that it requires
     maintaining state between values produced, most programming languages
     offer no pleasant and efficient solution beyond adding a callback
     function to the producer's argument list, to be called with each value
     produced.

     For example, tokenize.py in the standard library takes this approach:
     the caller must pass a "tokeneater" function to tokenize(), called
     whenever tokenize() finds the next token.  This allows tokenize to be
     coded in a natural way, but programs calling tokenize are typically
     convoluted by the need to remember between callbacks which token(s)
     were seen last.  The tokeneater function in tabnanny.py is a good
     example of that, maintaining a state machine in global variables, to
     remember across callbacks what it has already seen and what it hopes to
     see next.  This was difficult to get working correctly, and is still
     difficult for people to understand.  Unfortunately, that's typical of
     this approach.

     An alternative would have been for tokenize to produce an entire parse
     of the Python program at once, in a large list.  Then tokenize clients
     could be written in a natural way, using local variables and local
     control flow (such as loops and nested if statements) to keep track of
     their state.  But this isn't practical:  programs can be very large, so
     no a priori bound can be placed on the memory needed to materialize the
     whole parse; and some tokenize clients only want to see whether
     something specific appears early in the program (e.g., a future
     statement, or, as is done in IDLE, just the first indented statement),
     and then parsing the whole program first is a severe waste of time.

     Another alternative would be to make tokenize an iterator[1],
     delivering the next token whenever its .next() method is invoked.  This
     is pleasant for the caller in the same way a large list of results
     would be, but without the memory and "what if I want to get out early?"
     drawbacks.  However, this shifts the burden on tokenize to remember
     *its* state between .next() invocations, and the reader need only
     glance at tokenize.tokenize_loop() to realize what a horrid chore that
     would be.  Or picture a recursive algorithm for producing the nodes of
     a general tree structure:  to cast that into an iterator framework
     requires removing the recursion manually and maintaining the state of
     the traversal by hand.

     A fourth option is to run the producer and consumer in separate
     threads.  This allows both to maintain their states in natural ways,
     and so is pleasant for both.  Indeed, Demo/threads/Generator.py in the
     Python source distribution provides a usable synchronized-communication
     class for doing that in a general way.  This doesn't work on platforms
     without threads, though, and is very slow on platforms that do
     (compared to what is achievable without threads).

     A final option is to use the Stackless[2][3] variant implementation of
     Python instead, which supports lightweight coroutines.  This has much
     the same programmatic benefits as the thread option, but is much more
     efficient.  However, Stackless is a controversial rethinking of the
     Python core, and it may not be possible for Jython to implement the
     same semantics.  This PEP isn't the place to debate that, so suffice it
     to say here that generators provide a useful subset of Stackless
     functionality in a way that fits easily into the current CPython
     implementation, and is believed to be relatively straightforward for
     other Python implementations.

     That exhausts the current alternatives.  Some other high-level
     languages provide pleasant solutions, notably iterators in Sather[4],
     which were inspired by iterators in CLU; and generators in Icon[5], a
     novel language where every expression "is a generator".  There are
     differences among these, but the basic idea is the same:  provide a
     kind of function that can return an intermediate result ("the next
     value") to its caller, but maintaining the function's local state so
     that the function can be resumed again right where it left off.  A
     very simple example:

         def fib():
             a, b = 0, 1
             while 1:
                 yield b
                 a, b = b, a+b

     When fib() is first invoked, it sets a to 0 and b to 1, then yields b
     back to its caller.  The caller sees 1.  When fib is resumed, from its
     point of view the yield statement is really the same as, say, a print
     statement:  fib continues after the yield with all local state intact.
     a and b then become 1 and 1, and fib loops back to the yield, yielding
     1 to its invoker.  And so on.  From fib's point of view it's just
     delivering a sequence of results, as if via callback.  But from its
     caller's point of view, the fib invocation is an iterable object that
     can be resumed at will.  As in the thread approach, this allows both
     sides to be coded in the most natural ways; but unlike the thread
     approach, this can be done efficiently and on all platforms.  Indeed,
     resuming a generator should be no more expensive than a function call.

     The same kind of approach applies to many producer/consumer functions.
     For example, tokenize.py could yield the next token instead of invoking
     a callback function with it as argument, and tokenize clients could
     iterate over the tokens in a natural way:  a Python generator is a kind
     of Python iterator[1], but of an especially powerful kind.
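
For comparison, the fib example above transliterates almost directly to
the Smalltalk Generator (a sketch):

fib := Generator on: [:g | | a b next |
    a := 0. b := 1.
    [true] whileTrue: [
        g yield: b.
        next := a + b.
        a := b.
        b := next]].
(1 to: 6) collect: [:i | fib next]. "#(1 1 2 3 5 8)"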

Cheers,
   - Andreas


Re: Generators in Smalltalk??

Jecel Assumpcao Jr
Tim Budd has already been mentioned, but I would like to reinforce that
by suggesting that anyone interested in this subject read chapter 8 of
"A Little Smalltalk":

http://stephane.ducasse.free.fr/FreeBooks/LittleSmalltalk/ALittleSmalltalk.pdf

The implementation of generators in Smalltalk is shown (Little Smalltalk
uses them instead of Streams) and several interesting examples are
given.

-- Jecel