MultiByteFileStream upToAll: strange bug

bpi
Hi everyone,

I think I found a really strange bug in MultiByteFileStream. I am on macOS Sierra, using the latest VM from Bintray and an updated trunk image.

I am trying to parse anchors from a UTF-8 encoded HTML file (see attachment). The file is read through a MultiByteFileStream with a UTF8TextConverter.

Here is the code that shows the bug:

FileStream readOnlyFileNamed: 'test.html' do: [:stream |
        | result |
        result := OrderedCollection new.
        [stream atEnd] whileFalse: [
                stream match: '<A HREF="'.
                result add: (stream upToAll: '</A>')].
        result at: 13
].

It answers the following string:
'https://www.europa.de/produkte/lebensversicherung">Darlehen sichern: Variable Risiko-Lebensversicherung</A>
                                <DT><A HREF="http://orf.at/stories/2358210/2358209/">Banken im Zinsdilemma</A>
                        </DL><p>
                </DL><p>
        </DL><p>
</HTML>
'

As you can see, it did not stop at the </A> as it should have, but answered the rest of the file. The strange thing is that the next anchor looks like this:
'http://orf.at/stories/2358210/2358209/">Banken im Zinsdilemma</A>
                        </DL><p>
                </DL><p>
        </DL><p>
</HTML>
'
So it read part of the file again.

I tried making the file smaller, but then the bug goes away.

As a cross-check, when I read the whole file into memory first, it parses correctly:

FileStream readOnlyFileNamed: 'test.html' do: [:fileStream |
        | stream result |
        stream := fileStream contentsOfEntireFile readStream.
        result := OrderedCollection new.
        [stream atEnd] whileFalse: [
                stream match: '<A HREF="'.
                result add: (stream upToAll: '</A>')].
        result at: 13
].

Any ideas anyone?

Bernhard




Attachment: test.html (2K)
Re: MultiByteFileStream upToAll: strange bug

Bob Arning-2

FWIW

- if you read the file into memory first

raw := (FileStream readOnlyFileNamed: '/Users/bob/squeak/test.html') contentsOfEntireFile.
stream := raw readStream.

then you get the expected results

- there was a change to PositionableStream>>upToAll: in 2017. If you revert to the 1999 version, you will get the expected results

Maybe neither is the best answer, but they may help with finding it.


In reply to Bernhard Pieber's post of 1/20/18 4:24 PM.

Re: MultiByteFileStream upToAll: strange bug

Bob Arning-2
In reply to this post by bpi

The problem occurs when a read crosses the 2 KB boundary of the StandardFileStream buffer variable <collection>. The </A> you were looking for straddled that boundary. Here is a simple test:

============

test1
"
self test1
"
    | f answer result fn |
   
    fn := 'foo.foo.foo'.
    FileDirectory default deleteFileNamed: fn.
    f := MultiByteFileStream fileNamed: fn.
    {1000. 1000. 1000. 1000} do: [ :len |
        len timesRepeat: [f nextPutAll: 'a'].
        f nextPutAll: 'bbb'.
    ].
    f close.
    result := OrderedCollection new.
    f := MultiByteFileStream fileNamed: fn.
    [f atEnd] whileFalse: [
        answer := f upToAll: 'bbb'.
        result add: {answer size. f position. "answer"}
    ].
    f close.
   
    ^result   

=========

- write 1000 a's followed by 3 b's

- do this 4 times

- read it back by using upToAll: 'bbb'

- expect 4 1000-byte strings as the result

BUT you get

an OrderedCollection(
#(1000 1003)
#(1000 2006)
#(2006 3009)
#(1003 4012))

instead. The positions are right, but the lengths returned are not: the third and fourth reads answer 2006 and 1003 characters instead of 1000.
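
The same check can be phrased as an SUnit-style regression test. This is only a sketch: the selector and the temporary file name are invented for illustration, and it assumes the method lives in a TestCase subclass so that assert:equals: is available.

```smalltalk
testUpToAllAcrossBufferBoundary
	"Hypothetical test: selector and file name are made up for this sketch."
	| fileName file sizes |
	fileName := 'upToAllBoundary.tmp'.
	FileDirectory default deleteFileNamed: fileName.
	"Write four runs of 1000 a's, each followed by 'bbb'."
	file := MultiByteFileStream fileNamed: fileName.
	[4 timesRepeat: [
		file nextPutAll: (String new: 1000 withAll: $a).
		file nextPutAll: 'bbb']]
		ensure: [file close].
	"Read the runs back and remember only the size of each answer."
	sizes := OrderedCollection new.
	file := MultiByteFileStream fileNamed: fileName.
	[[file atEnd] whileFalse: [
		sizes add: (file upToAll: 'bbb') size]]
		ensure: [file close].
	FileDirectory default deleteFileNamed: fileName.
	"Every read should stop just before its own 'bbb', including
	the reads that cross the read buffer boundary."
	self assert: #(1000 1000 1000 1000) equals: sizes asArray
```

Comparing only the sizes keeps a failure message short; with the bug present, the actual value would be expected to show the 2006 and 1003 lengths listed above.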

Re: MultiByteFileStream upToAll: strange bug

Bob Arning-2
In reply to this post by bpi

The culprit is

'From Squeak6.0alpha of 9 September 2017 [latest update: #17382] on 21 January 2018 at 6:58:58 am'!

!MultiByteFileStream methodsFor: 'accessing' stamp: 'raa 1/21/2018 06:57'!
upToPosition: anInteger
    "Answer a subcollection containing items starting from the current position and ending including the given position. Usefully different to #next: in that positions measure *bytes* from the file, where #next: wants to measure *characters*."
    ^self collectionSpecies new: 1000 streamContents: [ :stream |
        | ch |
        [ (ch := self next) == nil or: [ self position > anInteger ] ]
            whileFalse: [ stream nextPut: ch ] ]! !

The previous version of this method referenced the instVar <position> directly. The listing above already contains the fix: changing that to "self position" allows it to stop at the right place.
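
To make the one-token difference explicit, here is a sketch of what the method presumably looked like before the change. It assumes the old version differed from the fixed listing only in the loop condition; it is a reconstruction from the description, not source copied from an image.

```smalltalk
upToPosition: anInteger
	"Presumed pre-fix version (reconstruction, see the assumption above):
	the loop condition reads the instance variable <position> directly
	instead of sending #position."
	^self collectionSpecies new: 1000 streamContents: [ :stream |
		| ch |
		[ (ch := self next) == nil or: [ position > anInteger ] ]
			whileFalse: [ stream nextPut: ch ] ]
```

A plausible reading is that, with read buffering, the instVar <position> need not equal the absolute file position that "self position" answers, which would explain why the wrong lengths only show up once a read crosses the buffer boundary.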


Re: MultiByteFileStream upToAll: strange bug

David T. Lewis
I added Bob's fix to trunk.

Dave


In reply to Bob Arning's post of Sun, Jan 21, 2018 at 07:01:37AM -0500.