Please consider this bug fix for inclusion in the product:
The implementation of Stream>>nextLine (NetClientsBase, VW7.4.1) uses #upTo: Character cr which is sensitive to line end conventions. At other places I've found similar code using #upTo: Character lf. This doesn't work well in a cross-platform situation, especially the MIME packages won't compile and/or parse message headers correctly (see: #linesDo: method). I've had problems with extra line feeds generated for mail messages resulting in "splitted" headers. Messages arrived corrupted. I fixed this with an all-purpose implementation of PeekableStream>>nextLine (see code below). It adapts to any line end convention automatically and can even handle strings mixed from different platforms. The basic idea behind it is to scan up to either CR or LF and then consume any immediately following CR or LF, given it is *not* identical to the first one (i.e. not constituting an empty next line). For some streams, #atEnd blocks when waiting for future characters to come (external connections). This might not be desired when using #nextLine with network protocols. Below you will find a simple implementation of #atEndNoWait that behaves accordingly. Simply use it instead of #atEnd. I didn't check if there are more subclasses of Stream that require a special implementation of #atEndNoWait. Regards, Andre -- here's the code: PeekableStream>>nextLine "This implementation works with Character streams only. It handles all line end conventions automatically." | newStream char crlf | crlf := Array with: Character cr with: Character lf. newStream := (self contentsSpecies new: 64) writeStream. [ self atEnd ] whileFalse:[ char := self next. (crlf includes: char) ifTrue: [ "skip subsequent CR/LF, only if different from the char that constituted the line break" [ | p | self atEnd not and:[ p := self peek. (crlf includes: p) and:[ p ~~ char ]]] whileTrue:[ self next ]. ^newStream contents ]. newStream nextPut: char.]. 
^newStream contents "Replace #atEnd by #atEndNoWait, if waiting for future characters is not desired:" PeekableStream>>atEndNoWait "Default implementation" ^self atEnd BufferedExternalStream>>atEndNoWait "Check for atEnd but don't wait for future elements" ^self basicAtEnd |
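A quick way to check the intended semantics of Andre's algorithm is to prototype it outside Smalltalk. Below is a minimal Python sketch (function name, stream interface and sample strings are mine, not part of the proposed patch): scan up to the first CR or LF, then consume one immediately following CR or LF only when it differs from the break character, so an empty next line is preserved.

```python
import io

CR, LF = "\r", "\n"

def next_line(stream):
    """Read one line from a character stream, accepting CR, LF, CRLF
    and LF-CR as line terminators (a sketch of the nextLine idea)."""
    out = []
    while True:
        ch = stream.read(1)
        if ch == "":                      # at end
            return "".join(out)
        if ch in (CR, LF):
            # Consume one subsequent CR/LF, only if different from the
            # char that constituted the line break (a CRLF or LFCR pair).
            pos = stream.tell()
            nxt = stream.read(1)
            if nxt not in (CR, LF) or nxt == ch:
                stream.seek(pos)          # put it back: start of next line
            return "".join(out)
        out.append(ch)

s = io.StringIO("one\r\ntwo\rthree\nfour")
print([next_line(s) for _ in range(4)])  # ['one', 'two', 'three', 'four']
```

Note that the Smalltalk version loops its skip step, so it could also swallow a run of alternating CR/LF characters; this sketch consumes at most one paired character, which matches the stated intent.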
Here is a faster version; it has different semantics in that it
interprets LF-CR as two lines instead of one.

nextLine
	"This implementation works with Character streams only.
	It handles all 3 line end conventions automatically."

	| newStream char cr lf |
	cr := Character cr.
	lf := Character lf.
	newStream := (self contentsSpecies new: 64) writeStream.
	[self atEnd] whileFalse: [
		char := self next.
		char == cr
			ifFalse: [
				char == lf
					ifFalse: [newStream nextPut: char]
					ifTrue: [^newStream contents]]
			ifTrue: [
				(self atEnd not and: [self peek == lf])
					ifTrue: [self skip: 1].
				^newStream contents]].
	^newStream contents

Timing with

	foocrlf := ('foo' copyWith: Character cr) copyWith: Character lf.
	foocr := 'foo' copyWith: Character cr.
	foolf := 'foo' copyWith: Character lf.
	in := foocrlf , foocr , foolf.

	Time microsecondsToRun: [
		10000 timesRepeat: [in readStream nextLine; nextLine; nextLine]]

yields

	38713 39159 42756 37834  (original)
	25942 24444 28805 25527  (optimised)

What about its end-of-stream behaviour? Should we prefix the method with

	self atEnd ifTrue: [^nil].

?

I also think the remarks about #atEndNoWait are on shaky ground: I can imagine it returning partial lines when used on network streams, which seems useless to me since there is no hint that the result is incomplete.

Finally I think doing this is a bad idea; it feels like we are slowly infecting the image with line-end convention tests (that slow things down) while there is only one line end convention used inside the image: CR. (Ponder the meaning of 'convention' ;-)

I guess this better be fixed by introducing a #mixed line end convention that can be assigned to the external stream (obviously this will only have meaning on ReadStreams). Probably a lot more work, but it feels more like the right thing to do.

A halfway solution could be to create a stream wrapper that will translate all line end occurrences into a CR; less work but more 'right'...

Such a wrapper would also solve the #atEnd problems.
Since it can remember the last seen character across calls in an ivar, you can look back instead of ahead to decide 'late' whether a trailing LF needs to be discarded.

R -

On Nov 21, 2006, at 10:03 AM, Andre Schnoor wrote:
> Please consider this bug fix for inclusion in the product:
> [snip]
|
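Reinout's faster variant can be sketched the same way in Python (names are mine; this illustrates the semantics rather than the Smalltalk code): LF always terminates a line, while CR terminates a line and swallows exactly one immediately following LF, so LF-CR counts as two breaks.

```python
import io

def next_line_fast(stream):
    """LF ends a line; CR ends a line and consumes one immediately
    following LF. So CRLF is one break, but LF-CR is two lines."""
    out = []
    while True:
        ch = stream.read(1)
        if ch in ("", "\n"):              # end of stream, or bare LF
            return "".join(out)
        if ch == "\r":
            pos = stream.tell()
            if stream.read(1) != "\n":
                stream.seek(pos)          # not a CRLF pair; unread it
            return "".join(out)
        out.append(ch)

s = io.StringIO("foo\r\nfoo\rfoo\n")
print([next_line_fast(s) for _ in range(3)])  # ['foo', 'foo', 'foo']

s = io.StringIO("a\n\rb")                 # LF-CR: two breaks, empty line
print([next_line_fast(s) for _ in range(3)])  # ['a', '', 'b']
```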
Reinout Heeck wrote:
> Here is a faster version, it has different semantics in that it
> interprets LF-CR as two lines instead of one.

If performance is really an issue, why not? This method is used in few places only (mail message composition, for example).

> I also think the remarks about #atEndNoWait are on shaky ground, I can
> imagine it returning partial lines when used on network streams which
> seems useless to me since there is no hint that the result is incomplete.

No doubt. I encountered problems with infinite waiting when implementing a simple FTP server reading "command lines" from the control connection. It "peeks" forever after a CR, but the client doesn't send any CR or LF because it's already waiting for the reply. Hence #atEndNoWait is only required in the inner loop that peeks for an immediately following CR or LF.

Your suggested wrapper (looking back instead of ahead) could be a better solution here.

Andre
|
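The look-back idea discussed here (remember the previously seen character in an instance variable, and discard an LF that completes a CRLF after the fact) can be sketched as a small wrapper class. This is a Python illustration with hypothetical names, not the actual ComputingStreams code; it normalizes every break to a single '\n' without ever peeking ahead, which is the property that matters for network streams.

```python
import io

class LineEndNormalizer:
    """Wrap a character stream and translate CR, LF and CRLF breaks
    into a single '\\n', deciding 'late' from the previously seen
    character instead of peeking ahead (so it never blocks on a peer)."""

    def __init__(self, stream):
        self.stream = stream
        self.last = ""                    # the 'ivar' remembering history

    def read_char(self):
        while True:
            ch = self.stream.read(1)
            prev, self.last = self.last, ch
            if ch == "\n" and prev == "\r":
                continue                  # LF completing a CRLF: drop it
            if ch == "\r":
                return "\n"               # normalize a CR break
            return ch                     # plain char, bare LF, or "" at end

    def next_line(self):
        out = []
        while True:
            ch = self.read_char()
            if ch in ("", "\n"):
                return "".join(out)
            out.append(ch)

w = LineEndNormalizer(io.StringIO("a\r\nb\rc\n"))
print([w.next_line() for _ in range(3)])  # ['a', 'b', 'c']
```

In the image's own convention the wrapper would answer CR instead of '\n'; '\n' is just Python's native line end.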
Andre Schnoor wrote:
> This method is used in few places only (mail message composition
> for example).

What makes you think that? Several other packages also need a #nextLine and end up implementing it as

	^self upTo: Character cr

which, as I implied earlier, is the correct implementation IMO. It should be in the base image... Anyway, other apps shouldn't be suffering the performance hit of upTo: vs our algorithms just because it is needed in one domain.

> I encountered problems with infinite waiting when implementing a
> simple FTP server reading "command lines" from the control
> connection. It "peeks" forever after a CR, but the client doesn't
> send any of CR or LF because it's already waiting for the reply.
> Hence, #atEndNoWait is only required in the inner loop that peeks
> for an immediately following CR or LF.

That is a broken FTP client then. Off the cuff I'd say that all FTP clients are obliged to do CRLF. It is always hard to support broken external applications :-(

> Your suggested wrapper (looking back instead of ahead) could be a
> better solution here.

Or, in the FTP case, set the line-end convention to CR and strip leading LF from your lines when present. This is different in that it doesn't accept bare LF as a line separator.

R -
|
Reinout Heeck wrote:
> Andre Schnoor wrote:
>
>> This method is used in few places only (mail message composition for
>> example).
>
> What makes you think that? Several other packages also need a
> #nextLine and end up implementing it as
> ^self upTo: Character cr
> which as I implied earlier is the correct implementation IMO.
> It should be in the base image...

In any case the (one and only) base implementation must support all line end conventions transparently and strip redundant feed characters from the answer, if necessary. Otherwise many packages will still override #nextLine with their own notion of it, effectively corrupting the behavior of other packages.

	self upTo: Character cr

simply doesn't cover this. AFAIK, we have

	CR
	CR LF
	LF

the problem being that the occurrence of CR must not wait forever for a LF to come, in order not to block.

Andre
|
It isn't clear to me that having the nextLine method always
trying to figure out the line end convention on each invocation is
desirable behaviour. While putting <cr> characters into an
<lf> delimited file is odd, it's certainly possible. Wouldn't it be
simpler to do something like
	self upToAll: self lineEndString

and thus just respect the line end convention of the stream. (One would also need to define lineEndString, and could probably do something more efficient, but you get the idea.)

Also, if you know the line end convention, then that seems to eliminate the infinite wait issue. You're only going to wait if you get e.g. just a CR from a stream that's expected to send CRLF, which either means that the other end really isn't done yet, or is in error.

At 09:19 AM 11/21/2006, Andre Schnoor wrote:
> Reinout Heeck wrote:
> [snip]

--
Alan Knight, Cincom Smalltalk Development
"The Static Typing Philosophy: Make it fast. Make it right.
Make it run." - Niall Ross
|
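Alan's suggestion of reading up to a per-stream terminator can be sketched as follows. This is a Python stand-in, not #upToAll: itself: `read_up_to_all` and the hard-coded "\r\n" convention are illustrative substitutes for the real method and a lineEndString accessor.

```python
import io

def read_up_to_all(stream, sep):
    """Read characters until the separator string has been consumed, or
    the stream ends; answer everything that came before the separator."""
    out = ""
    while not out.endswith(sep):
        ch = stream.read(1)
        if ch == "":
            return out                    # no separator before end of stream
        out += ch
    return out[:-len(sep)]

line_end = "\r\n"                         # the stream's configured convention
s = io.StringIO("HELO\r\nNOOP\r\n")
print(read_up_to_all(s, line_end))        # HELO
print(read_up_to_all(s, line_end))        # NOOP
```

Because the terminator is known in advance, there is no speculative peek after a CR, which is the property that removes the infinite-wait issue Alan mentions.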
In reply to this post by Andre Schnoor
Andre,
It seems that most of your troubles are with our MIME implementation. Can you send me some samples that can help me reproduce the issues you ran into? (Note that currently there might be differences in behavior between reading from internal streams and external streams.)

I also don't feel your fix is the right way to deal with this, but we'll know more when we see the cases. Either way, our MIME framework sure shouldn't mess up your line-ends.

Thanks,

Martin
|
In reply to this post by Reinout Heeck
Reinout Heeck wrote:
> Finally I think doing this is a bad idea, it feels like we are slowly
> infecting the image with line-end convention tests (that slow things
> down) while there is only one line end convention used inside the image:
> CR. (Ponder the meaning of 'convention' ;-)
>
> I guess this better be fixed by introducing a #mixed line end convention
> that can be assigned to the external stream (obviously this will only
> have meaning on ReadStreams). Probably a lot more work, but it feels
> more like the right thing to do.
>
> A halfway solution could be to create a stream wrapper that will
> translate all line end occurrences into a CR, less work but more 'right'...
>
> Such a wrapper would also solve the #atEnd problems. Since it can
> remember the last seen character across calls in an ivar, you can look
> back instead of ahead to decide 'late' whether a trailing LF needs to be
> discarded.

Actually, there's already something called AutoLineEndStream in the ComputingStreams package. I hope to do something faster for the block-based APIs, #next:putAll: and #next:into:, at some point, but this should do for now. I'm already using it in the FileBrowser and it seems to work fine so far.

Martin
|
In reply to this post by Andre Schnoor
Andre Schnoor wrote:
> In any case the (one and only) base implementation must support all line
> end conventions transparently and strip redundant feed characters from
> the answer, if necessary. Otherwise many packages will still override
> #nextLine with their own notion of it, effectively corrupting the
> behavior of other packages.
>
> self upTo: Character cr simply doesn't cover this. AFAIK, we have
>
> CR
> CR LF
> LF
>
> the problem being that the occurrence of CR must not wait forever for a
> LF to come in order to not block.
>
> Andre

Note that both external streams and the EncodedStream wrapper do line end convention processing already; it's just a matter of setting the right one. And I think that's where it belongs. As Reinout already pointed out, once in the image strings should use CRs only, no matter on what platform, so strings should get converted to this convention when they are entering the image, and converted back to the platform convention when leaving the image.

Of course it's pretty easy to get a file with mixed conventions, e.g. when you edit it from both Windows and Linux; that's where something like the AutoLineEndStream I mentioned before comes in handy. But I have doubts about trying to do this universally. E.g. when you have a protocol, e.g. MIME, which says explicitly that line ends are CRLF, what should you do if you encounter just CR or LF in the body? Doesn't changing that also constitute munging the original contents? So you need to be able to control the convention processing in some cases.

Martin
|
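Martin's point that some protocols explicitly mandate CRLF suggests a deliberately strict reader as the other end of the design space. Here is a Python sketch (names mine, not any actual VisualWorks API): only the CRLF pair counts as a break, and a bare CR or LF stays in the data rather than being silently converted.

```python
import io

def read_line_crlf_strict(stream):
    """Read a line terminated by CRLF only, as protocols like MIME or
    FTP specify. A bare CR or LF is kept in the data, not treated as a
    break. Answers None at end of stream."""
    out = []
    while True:
        ch = stream.read(1)
        if ch == "":
            return None if not out else "".join(out)
        out.append(ch)
        if out[-2:] == ["\r", "\n"]:
            return "".join(out[:-2])

s = io.StringIO("Subject: hi\r\nX-Bare: a\rb\r\n")
print(repr(read_line_crlf_strict(s)))  # 'Subject: hi'
print(repr(read_line_crlf_strict(s)))  # 'X-Bare: a\rb'  (bare CR kept)
```

Choosing between this and an auto-detecting reader is exactly the "control the convention processing" question: the strict version never munges the original contents, at the cost of tolerating nothing.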
On Nov 21, 2006, at 8:36, Martin Kobetic wrote:
--
Travis Griggs
Objologist
My Other Machine runs OSX. But then... so does this one.
|
In reply to this post by kobetic
Martin Kobetic wrote:
> Of course it's pretty easy to get a file with mixed conventions, e.g.
> when you edit it from both Windows and Linux, that's where something
> like the AutoLineEndStream I mentioned before comes in handy, but I
> have doubts about trying to do this universally.

In my example case a Linux server composes mail messages combining on-the-fly strings created under the Unix convention (from the server) with text templates taken from a PostgreSQL database (unicode) that are edited by a Windows client. Then add some form postings from an HTTP server and you'll get a mixture of many different conventions in a single string.

BTW: I forwarded the problem to Tamara a while ago. We found that my own (now obsolete) implementation of #nextLine (using upTo: Character lf) caused the errors in the MIME message header, so it was actually my own fault (obviously, upTo: Character lf will include the preceding CR at the end of the line being answered). However, after removing the override of #nextLine again, some other code didn't work anymore, because of the mixed platform configuration. So I ended up coding a universal #nextLine that fits all situations.

Fiddling with #nextLine has serious side effects!

Andre
|