Please consider this bug fix for inclusion in the product:
The implementation of Stream>>nextLine (NetClientsBase, VW7.4.1) uses #upTo: Character cr which is sensitive to line end conventions. At other places I've found similar code using #upTo: Character lf. This doesn't work well in a cross-platform situation, especially the MIME packages won't compile and/or parse message headers correctly (see: #linesDo: method). I've had problems with extra line feeds generated for mail messages resulting in "splitted" headers. Messages arrived corrupted. I fixed this with an all-purpose implementation of PeekableStream>>nextLine (see code below). It adapts to any line end convention automatically and can even handle strings mixed from different platforms. The basic idea behind it is to scan up to either CR or LF and then consume any immediately following CR or LF, given it is *not* identical to the first one (i.e. not constituting an empty next line). For some streams, #atEnd blocks when waiting for future characters to come (external connections). This might not be desired when using #nextLine with network protocols. Below you will find a simple implementation of #atEndNoWait that behaves accordingly. Simply use it instead of #atEnd. I didn't check if there are more subclasses of Stream that require a special implementation of #atEndNoWait. Regards, Andre -- here's the code: PeekableStream>>nextLine "This implementation works with Character streams only. It handles all line end conventions automatically." | newStream char crlf | crlf := Array with: Character cr with: Character lf. newStream := (self contentsSpecies new: 64) writeStream. [ self atEnd ] whileFalse:[ char := self next. (crlf includes: char) ifTrue: [ "skip subsequent CR/LF, only if different from the char that constituted the line break" [ | p | self atEnd not and:[ p := self peek. (crlf includes: p) and:[ p ~~ char ]]] whileTrue:[ self next ]. ^newStream contents ]. newStream nextPut: char.]. 
^newStream contents "Replace #atEnd by #atEndNoWait, if waiting for future characters is not desired:" PeekableStream>>atEndNoWait "Default implementation" ^self atEnd BufferedExternalStream>>atEndNoWait "Check for atEnd but don't wait for future elements" ^self basicAtEnd |
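A quick way to check the intended semantics of Andre's algorithm is to prototype it outside Smalltalk. Below is a minimal Python sketch (function name, stream interface and sample strings are mine, not part of the proposed patch): scan up to the first CR or LF, then consume one immediately following CR or LF only when it differs from the break character, so an empty next line is preserved.

```python
import io

CR, LF = "\r", "\n"

def next_line(stream):
    """Read one line from a character stream, accepting CR, LF, CRLF
    and LF-CR as line terminators (a sketch of the nextLine idea)."""
    out = []
    while True:
        ch = stream.read(1)
        if ch == "":                      # at end
            return "".join(out)
        if ch in (CR, LF):
            # Consume one subsequent CR/LF, only if different from the
            # char that constituted the line break (a CRLF or LFCR pair).
            pos = stream.tell()
            nxt = stream.read(1)
            if nxt not in (CR, LF) or nxt == ch:
                stream.seek(pos)          # put it back: start of next line
            return "".join(out)
        out.append(ch)

s = io.StringIO("one\r\ntwo\rthree\nfour")
print([next_line(s) for _ in range(4)])  # ['one', 'two', 'three', 'four']
```

Note that the Smalltalk version loops its skip step, so it could also swallow a run of alternating CR/LF characters; this sketch consumes at most one paired character, which matches the stated intent.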
Here is a faster version; it has different semantics in that it
interprets LF-CR as two lines instead of one.

nextLine
	"This implementation works with Character streams only.
	It handles all 3 line end conventions automatically."

	| newStream char cr lf |
	cr := Character cr.
	lf := Character lf.
	newStream := (self contentsSpecies new: 64) writeStream.
	[self atEnd] whileFalse: [
		char := self next.
		char == cr
			ifFalse: [
				char == lf
					ifFalse: [newStream nextPut: char]
					ifTrue: [^newStream contents]]
			ifTrue: [
				(self atEnd not and: [self peek == lf])
					ifTrue: [self skip: 1].
				^newStream contents]].
	^newStream contents

Timing with

	foocrlf := ('foo' copyWith: Character cr) copyWith: Character lf.
	foocr := 'foo' copyWith: Character cr.
	foolf := 'foo' copyWith: Character lf.
	in := foocrlf , foocr , foolf.

	Time microsecondsToRun: [
		10000 timesRepeat: [in readStream nextLine; nextLine; nextLine]]

yields

	38713 39159 42756 37834  (original)
	25942 24444 28805 25527  (optimised)

What about its end-of-stream behaviour? Should we prefix the method with

	self atEnd ifTrue: [^nil].

?

I also think the remarks about #atEndNoWait are on shaky ground: I can imagine it returning partial lines when used on network streams, which seems useless to me since there is no hint that the result is incomplete.

Finally I think doing this is a bad idea; it feels like we are slowly infecting the image with line-end convention tests (that slow things down) while there is only one line end convention used inside the image: CR. (Ponder the meaning of 'convention' ;-)

I guess this better be fixed by introducing a #mixed line end convention that can be assigned to the external stream (obviously this will only have meaning on ReadStreams). Probably a lot more work, but it feels more like the right thing to do.

A halfway solution could be to create a stream wrapper that will translate all line end occurrences into a CR; less work but more 'right'...

Such a wrapper would also solve the #atEnd problems.
Since it can remember the last seen character across calls in an ivar, you can look back instead of ahead to decide 'late' whether a trailing LF needs to be discarded.

R -

On Nov 21, 2006, at 10:03 AM, Andre Schnoor wrote:
> Please consider this bug fix for inclusion in the product:
> [snip]
|
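Reinout's faster variant can be sketched the same way in Python (names are mine; this illustrates the semantics rather than the Smalltalk code): LF always terminates a line, while CR terminates a line and swallows exactly one immediately following LF, so LF-CR counts as two breaks.

```python
import io

def next_line_fast(stream):
    """LF ends a line; CR ends a line and consumes one immediately
    following LF. So CRLF is one break, but LF-CR is two lines."""
    out = []
    while True:
        ch = stream.read(1)
        if ch in ("", "\n"):              # end of stream, or bare LF
            return "".join(out)
        if ch == "\r":
            pos = stream.tell()
            if stream.read(1) != "\n":
                stream.seek(pos)          # not a CRLF pair; unread it
            return "".join(out)
        out.append(ch)

s = io.StringIO("foo\r\nfoo\rfoo\n")
print([next_line_fast(s) for _ in range(3)])  # ['foo', 'foo', 'foo']

s = io.StringIO("a\n\rb")                 # LF-CR: two breaks, empty line
print([next_line_fast(s) for _ in range(3)])  # ['a', '', 'b']
```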
Reinout Heeck wrote:
> Here is a faster version, it has different semantics in that it
> interprets LF-CR as two lines instead of one.

If performance is really an issue, why not? This method is used in few places only (mail message composition, for example).

> I also think the remarks about #atEndNoWait are on shaky ground, I can
> imagine it returning partial lines when used on network streams which
> seems useless to me since there is no hint that the result is incomplete.

No doubt. I encountered problems with infinite waiting when implementing a simple FTP server reading "command lines" from the control connection. It "peeks" forever after a CR, but the client doesn't send any CR or LF because it's already waiting for the reply. Hence #atEndNoWait is only required in the inner loop that peeks for an immediately following CR or LF.

Your suggested wrapper (looking back instead of ahead) could be a better solution here.

Andre
|
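The look-back idea discussed here (remember the previously seen character in an instance variable, and discard an LF that completes a CRLF after the fact) can be sketched as a small wrapper class. This is a Python illustration with hypothetical names, not the actual ComputingStreams code; it normalizes every break to a single '\n' without ever peeking ahead, which is the property that matters for network streams.

```python
import io

class LineEndNormalizer:
    """Wrap a character stream and translate CR, LF and CRLF breaks
    into a single '\\n', deciding 'late' from the previously seen
    character instead of peeking ahead (so it never blocks on a peer)."""

    def __init__(self, stream):
        self.stream = stream
        self.last = ""                    # the 'ivar' remembering history

    def read_char(self):
        while True:
            ch = self.stream.read(1)
            prev, self.last = self.last, ch
            if ch == "\n" and prev == "\r":
                continue                  # LF completing a CRLF: drop it
            if ch == "\r":
                return "\n"               # normalize a CR break
            return ch                     # plain char, bare LF, or "" at end

    def next_line(self):
        out = []
        while True:
            ch = self.read_char()
            if ch in ("", "\n"):
                return "".join(out)
            out.append(ch)

w = LineEndNormalizer(io.StringIO("a\r\nb\rc\n"))
print([w.next_line() for _ in range(3)])  # ['a', 'b', 'c']
```

In the image's own convention the wrapper would answer CR instead of '\n'; '\n' is just Python's native line end.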
Andre Schnoor wrote:
> This method is used in few places only (mail message composition
> for example).

What makes you think that? Several other packages also need a #nextLine and end up implementing it as

	^self upTo: Character cr

which, as I implied earlier, is the correct implementation IMO. It should be in the base image... Anyway, other apps shouldn't be suffering the performance hit of upTo: vs our algorithms just because it is needed in one domain.

> I encountered problems with infinite waiting when implementing a
> simple FTP server reading "command lines" from the control
> connection. It "peeks" forever after a CR, but the client doesn't
> send any of CR or LF because it's already waiting for the reply.
> Hence, #atEndNoWait is only required in the inner loop that peeks
> for an immediately following CR or LF.

That is a broken FTP client then. Off the cuff I'd say that all FTP clients are obliged to do CRLF. It is always hard to support broken external applications :-(

> Your suggested wrapper (looking back instead of ahead) could be a
> better solution here.

Or, in the FTP case, set the line-end convention to CR and strip leading LF from your lines when present. This is different in that it doesn't accept bare LF as a line separator.

R -
|
Reinout Heeck wrote:
> Andre Schnoor wrote:
>
>> This method is used in few places only (mail message composition for
>> example).
>
> What makes you think that? Several other packages also need a
> #nextLine and end up implementing it as
> ^self upTo: Character cr
> which as I implied earlier is the correct implementation IMO.
> It should be in the base image...

In any case the (one and only) base implementation must support all line end conventions transparently and strip redundant feed characters from the answer, if necessary. Otherwise many packages will still override #nextLine with their own notion of it, effectively corrupting the behavior of other packages.

	self upTo: Character cr

simply doesn't cover this. AFAIK, we have

	CR
	CR LF
	LF

the problem being that the occurrence of CR must not wait forever for a LF to come, in order not to block.

Andre
|
It isn't clear to me that having the nextLine method always
trying to figure out the line end convention on each invocation is
desirable behaviour. While putting <cr> characters into an
<lf> delimited file is odd, it's certainly possible. Wouldn't it be
simpler to do something like
	self upToAll: self lineEndString

and thus just respect the line end convention of the stream. (One would also need to define lineEndString, and could probably do something more efficient, but you get the idea.)

Also, if you know the line end convention, then that seems to eliminate the infinite wait issue. You're only going to wait if you get e.g. just a CR from a stream that's expected to send CRLF, which either means that the other end really isn't done yet, or is in error.

At 09:19 AM 11/21/2006, Andre Schnoor wrote:
> Reinout Heeck wrote:
> [snip]

--
Alan Knight, Cincom Smalltalk Development
"The Static Typing Philosophy: Make it fast. Make it right.
Make it run." - Niall Ross
|
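Alan's suggestion of reading up to a per-stream terminator can be sketched as follows. This is a Python stand-in, not #upToAll: itself: `read_up_to_all` and the hard-coded "\r\n" convention are illustrative substitutes for the real method and a lineEndString accessor.

```python
import io

def read_up_to_all(stream, sep):
    """Read characters until the separator string has been consumed, or
    the stream ends; answer everything that came before the separator."""
    out = ""
    while not out.endswith(sep):
        ch = stream.read(1)
        if ch == "":
            return out                    # no separator before end of stream
        out += ch
    return out[:-len(sep)]

line_end = "\r\n"                         # the stream's configured convention
s = io.StringIO("HELO\r\nNOOP\r\n")
print(read_up_to_all(s, line_end))        # HELO
print(read_up_to_all(s, line_end))        # NOOP
```

Because the terminator is known in advance, there is no speculative peek after a CR, which is the property that removes the infinite-wait issue Alan mentions.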
In reply to this post by Andre Schnoor
Andre,
It seems that most of your troubles are with our MIME implementation. Can you send me some samples that can help me reproduce the issues you ran into? (Note that currently there might be differences in behavior between reading from internal streams and external streams.)

I also don't feel your fix is the right way to deal with this, but we'll know more when we see the cases. Either way, our MIME framework sure shouldn't mess up your line-ends.

Thanks,

Martin
|
In reply to this post by Reinout Heeck
Reinout Heeck wrote:
> Finally I think doing this is a bad idea, it feels like we are slowly
> infecting the image with line-end convention tests (that slow things
> down) while there is only one line end convention used inside the image:
> CR. (Ponder the meaning of 'convention' ;-)
>
> I guess this better be fixed by introducing a #mixed line end convention
> that can be assigned to the external stream (obviously this will only
> have meaning on ReadStreams). Probably a lot more work, but it feels
> more like the right thing to do.
>
> A halfway solution could be to create a stream wrapper that will
> translate all line end occurrences into a CR, less work but more 'right'...
>
> Such a wrapper would also solve the #atEnd problems. Since it can
> remember the last seen character across calls in an ivar, you can look
> back instead of ahead to decide 'late' whether a trailing LF needs to be
> discarded.

Actually, there's already something called AutoLineEndStream in the ComputingStreams package. I hope to do something faster for the block-based APIs, #next:putAll: and #next:into:, at some point, but this should do for now. I'm already using it in the FileBrowser and it seems to work fine so far.

Martin
|
In reply to this post by Andre Schnoor
Andre Schnoor wrote:
> In any case the (one and only) base implementation must support all line
> end conventions transparently and strip redundant feed characters from
> the answer, if necessary. Otherwise many packages will still override
> #nextLine with their own notion of it, effectively corrupting the
> behavior of other packages.
>
> self upTo: Character cr simply doesn't cover this. AFAIK, we have
>
> CR
> CR LF
> LF
>
> the problem being that the occurrence of CR must not wait forever for a
> LF to come in order to not block.
>
> Andre

Note that both external streams and the EncodedStream wrapper do line end convention processing already; it's just a matter of setting the right one. And I think that's where it belongs. As Reinout already pointed out, once in the image strings should use CRs only, no matter on what platform, so strings should get converted to this convention when they are entering the image, and converted back to the platform convention when leaving the image.

Of course it's pretty easy to get a file with mixed conventions, e.g. when you edit it from both Windows and Linux; that's where something like the AutoLineEndStream I mentioned before comes in handy. But I have doubts about trying to do this universally. E.g. when you have a protocol, e.g. MIME, which says explicitly that line ends are CRLF, what should you do if you encounter just CR or LF in the body? Doesn't changing that also constitute munging the original contents? So you need to be able to control the convention processing in some cases.

Martin
|
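Martin's point that some protocols explicitly mandate CRLF suggests a deliberately strict reader as the other end of the design space. Here is a Python sketch (names mine, not any actual VisualWorks API): only the CRLF pair counts as a break, and a bare CR or LF stays in the data rather than being silently converted.

```python
import io

def read_line_crlf_strict(stream):
    """Read a line terminated by CRLF only, as protocols like MIME or
    FTP specify. A bare CR or LF is kept in the data, not treated as a
    break. Answers None at end of stream."""
    out = []
    while True:
        ch = stream.read(1)
        if ch == "":
            return None if not out else "".join(out)
        out.append(ch)
        if out[-2:] == ["\r", "\n"]:
            return "".join(out[:-2])

s = io.StringIO("Subject: hi\r\nX-Bare: a\rb\r\n")
print(repr(read_line_crlf_strict(s)))  # 'Subject: hi'
print(repr(read_line_crlf_strict(s)))  # 'X-Bare: a\rb'  (bare CR kept)
```

Choosing between this and an auto-detecting reader is exactly the "control the convention processing" question: the strict version never munges the original contents, at the cost of tolerating nothing.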
On Nov 21, 2006, at 8:36, Martin Kobetic wrote:
--
Travis Griggs
Objologist
My Other Machine runs OSX. But then... so does this one.
|
In reply to this post by kobetic
Martin Kobetic wrote:
> Of course it's pretty easy to get a file with mixed conventions, e.g.
> when you edit it from both Windows and Linux, that's where something
> like the AutoLineEndStream I mentioned before comes in handy, but I
> have doubts about trying to do this universally.

In my example case a Linux server composes mail messages combining on-the-fly strings created under the Unix convention (from the server) with text templates taken from a PostgreSQL database (unicode) that are edited by a Windows client. Then add some form postings from an HTTP server and you'll get a mixture of many different conventions in a single string.

BTW: I forwarded the problem to Tamara a while ago. We found that my own (now obsolete) implementation of #nextLine (using upTo: Character lf) caused the errors in the MIME message header, so it was actually my own fault (obviously, upTo: Character lf will include the preceding CR at the end of the line being answered). However, after removing the override of #nextLine again, some other code didn't work anymore, because of the mixed platform configuration. So I ended up coding a universal #nextLine that fits all situations.

Fiddling with #nextLine has serious side effects!

Andre
|