Hi everyone,
I think I found a really strange bug in MultiByteFileStream. I am on macOS Sierra and used the latest VM from bintray and an updated trunk image.

I try to parse anchors from a UTF-8 encoded HTML file (see attachment). It uses a MultiByteFileStream with a UTF8TextConverter.

Here is the code that shows the bug:

FileStream readOnlyFileNamed: 'test.html' do: [:stream |
    | result |
    result := OrderedCollection new.
    [stream atEnd] whileFalse: [
        stream match: '<A HREF="'.
        result add: (stream upToAll: '</A>')].
    result at: 13
].

It answers the following string:
'https://www.europa.de/produkte/lebensversicherung">Darlehen sichern: Variable Risiko-Lebensversicherung</A>
    <DT><A HREF="http://orf.at/stories/2358210/2358209/">Banken im Zinsdilemma</A>
    </DL><p>
    </DL><p>
</DL><p>
</HTML>
'

You can see that it did not stop at the </A> as it should have but answers the rest of the file. The strange thing is that the next anchor looks like this:
'http://orf.at/stories/2358210/2358209/">Banken im Zinsdilemma</A>
    </DL><p>
    </DL><p>
</DL><p>
</HTML>
'
So it read part of the file again.

I tried making the file smaller but the bug goes away then.

As a cross check when I read the whole file at once it parses correctly.

FileStream readOnlyFileNamed: 'test.html' do: [:fileStream |
    | stream result |
    stream := fileStream contentsOfEntireFile readStream.
    result := OrderedCollection new.
    [stream atEnd] whileFalse: [
        stream match: '<A HREF="'.
        result add: (stream upToAll: '</A>')].
    result at: 13
].

Any ideas anyone?

Bernhard

Attachment: test.html (2K)
FWIW, if you read the file into memory first:

raw := (FileStream readOnlyFileNamed: '/Users/bob/squeak/test.html') contentsOfEntireFile.

then you get the expected results. Also, there was a change to PositionableStream>>upToAll: in 2017. If you revert to the 1999 version, you will get the expected results. Maybe neither is the best answer, but they may help with finding it.
In reply to this post by bpi
The problem occurs when crossing the 2 KB size of the StandardFileStream variable <collection>. The </A> you were looking for straddled that boundary. Here is a simple test:

============ test1
- write 1000 a's followed by 3 b's
- do this 4 times
- read it back by using upToAll: 'bbb'
- expect 4 1000-byte strings as the result

BUT you get

an OrderedCollection(#(1000 1003) #(1000 2006) #(2006 3009) #(1003 4012))

instead (each pair being the length returned and the stream position afterwards). The positions are right, but the lengths of the third and fourth results are not.
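For anyone who wants to try this themselves, the test1 steps can be sketched roughly as below. This is a reconstruction from the description, not the code that was actually run; the scratch file name is made up, and recording the chunk size together with the stream position is an assumption about how the pairs in the result were produced.

```smalltalk
"Sketch of test1, reconstructed from the description above."
| fileName results |
fileName := 'uptoall-test1.txt'.	"hypothetical scratch file name"

"Write 1000 a's followed by 3 b's, four times."
FileStream forceNewFileNamed: fileName do: [:out |
    4 timesRepeat: [
        out
            nextPutAll: (String new: 1000 withAll: $a);
            nextPutAll: 'bbb']].

"Read it back with upToAll: and record (chunk size, position) pairs."
results := OrderedCollection new.
FileStream readOnlyFileNamed: fileName do: [:in |
    [in atEnd] whileFalse: [ | chunk |
        chunk := in upToAll: 'bbb'.
        results add: {chunk size. in position}]].
results
```

On an image without the bug, every recorded size should be 1000; on an affected image, the sizes go wrong once a 'bbb' delimiter straddles the buffer boundary, as in the result reported above.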
In reply to this post by bpi
The culprit is MultiByteFileStream>>upToPosition: (the fileout header of the changed method reads 'From Squeak6.0alpha of 9 September 2017 [latest update: #17382] on 21 January 2018 at 6:58:58 am'!), which was referencing the instVar <position> directly. Changing that to "self position" allows it to stop at the right place.
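To see why the instVar <position> and "self position" are not interchangeable on a file stream, one can look at both side by side. The snippet below is only an inspection sketch: it assumes the test.html attachment is in the default directory and uses instVarNamed: purely as a debugging aid. Whether the two numbers differ at any given moment depends on the state of the read buffer, so no particular output is claimed.

```smalltalk
"Sketch: compare the file position with the raw <position> instance variable."
FileStream readOnlyFileNamed: 'test.html' do: [:stream |
    "Advance a little so the stream has buffered some data."
    stream match: '<A HREF="'.
    {stream position.	"position as answered by the stream itself"
     stream instVarNamed: 'position'}]	"raw instVar, read only for inspection"
```

If the second number is smaller than the first, the instVar is tracking an offset inside the buffer rather than the offset in the file, which would explain why comparing it against a file position stops in the wrong place.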
I added Bob's fix to trunk.
Dave

On Sun, Jan 21, 2018 at 07:01:37AM -0500, Bob Arning wrote:
> The culprit is
>
> 'From Squeak6.0alpha of 9 September 2017 [latest update: #17382] on 21
> January 2018 at 6:58:58 am'!
>
> !MultiByteFileStream methodsFor: 'accessing' stamp: 'raa 1/21/2018 06:57'!
> upToPosition: anInteger
>     "Answer a subcollection containing items starting from the current
>     position and ending including the given position. Usefully different to
>     #next: in that positions measure *bytes* from the file, where #next:
>     wants to measure *characters*."
>     ^self collectionSpecies new: 1000 streamContents: [ :stream |
>         | ch |
>         [ (ch := self next) == nil or: [ self position > anInteger ] ]
>             whileFalse: [ stream nextPut: ch ] ]! !
>
> which was referencing the instVar <position> directly. Changing that to
> "self position" allows it to stop at the right place.