The Trunk: Collections-nice.933.mcz

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

The Trunk: Collections-nice.933.mcz

commits-2
Nicolas Cellier uploaded a new version of Collections to project The Trunk:
http://source.squeak.org/trunk/Collections-nice.933.mcz

==================== Summary ====================

Name: Collections-nice.933
Author: nice
Time: 10 April 2021, 9:16:54.312889 pm
UUID: c066bf52-b7a5-474a-9614-90bbc3212e07
Ancestors: Collections-ul.932

Quick fix for double utf8->squeak conversion via nextChunk.

    (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close.
    (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk.

=============== Diff against Collections-ul.932 ===============

Item was changed:
  ----- Method: PositionableStream>>basicUpTo: (in category 'private basic') -----
  basicUpTo: anObject
  "Answer a subcollection from the current access position to the
  occurrence (if any, but not inclusive) of anObject in the receiver. If
  anObject is not in the collection, answer the entire rest of the receiver."
  | newStream element |
  newStream := WriteStream on: (self collectionSpecies new: 100).
+ [self atEnd or: [(element := self basicNext) = anObject]]
- [self atEnd or: [(element := self next) = anObject]]
  whileFalse: [newStream nextPut: element].
  ^newStream contents!



Reply | Threaded
Open this post in threaded view
|

Re: The Trunk: Collections-nice.933.mcz

Nicolas Cellier
Hi all,
currently, MultiByteFileStream nextChunk is incorrect for utf8 stream:
it attempts to decode utf8 twice.

This crafted example will raise an error:

    (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close.
    (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk.

nextChunk sends basicUpTo: with the intention to get an un-converted
string, then sends decodeString: to have fast (bulk) decoding.

Unfortunately basicUpTo: sends next instead of basicNext... This makes
the utf8 decoded twice, which can falsify the source code in some
cases, or even fail with an Error like in the crafted example above.

an accelerated version of basicUpTo: was provided by Levente in
Multilingual-ul.85.mcz
but was removed in Multilingual-ar.119.mcz, and I didn't understand
the intention by reading the commit message...
basicUpTo: was then broken in Collections-ul.438.mcz, and fixed in
Collections-eem.684.mcz with the double decoding problem.


Le sam. 10 avr. 2021 à 21:17, <[hidden email]> a écrit :

>
> Nicolas Cellier uploaded a new version of Collections to project The Trunk:
> http://source.squeak.org/trunk/Collections-nice.933.mcz
>
> ==================== Summary ====================
>
> Name: Collections-nice.933
> Author: nice
> Time: 10 April 2021, 9:16:54.312889 pm
> UUID: c066bf52-b7a5-474a-9614-90bbc3212e07
> Ancestors: Collections-ul.932
>
> Quick fix for double utf8->squeak conversion via nextChunk.
>
>     (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close.
>     (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk.
>
> =============== Diff against Collections-ul.932 ===============
>
> Item was changed:
>   ----- Method: PositionableStream>>basicUpTo: (in category 'private basic') -----
>   basicUpTo: anObject
>         "Answer a subcollection from the current access position to the
>         occurrence (if any, but not inclusive) of anObject in the receiver. If
>         anObject is not in the collection, answer the entire rest of the receiver."
>         | newStream element |
>         newStream := WriteStream on: (self collectionSpecies new: 100).
> +       [self atEnd or: [(element := self basicNext) = anObject]]
> -       [self atEnd or: [(element := self next) = anObject]]
>                 whileFalse: [newStream nextPut: element].
>         ^newStream contents!
>
>

Reply | Threaded
Open this post in threaded view
|

Re: The Trunk: Collections-nice.933.mcz

Levente Uzonyi
Hi Nicolas,

Nice analysis. I'm surprised I didn't notice either me making the
copy-paste mistake in PositionableStream >> #basicUpTo: and Andreas
removing the optimized version from MultiByteFileStream.
As with all basic* methods, PositionableStream >> #basicUpTo:
should simply be

  ^self upTo: anObject

I restored the removed version of MultiByteFileStream >> #basicUpTo: in my
image. It makes reading the sources file 3x faster using the following
"benchmark":

| file |
[
  | hash |
  file := (SourceFiles at: 1) readOnlyCopy.
  hash := 0.
  {
  [ [ file atEnd ] whileFalse: [
  hash := (hash bitXor: file nextChunk hash) hashMultiply ] ] timeToRun.
  hash } ]
  ensure: [ file ifNotNil: [ file close ] ].

"#(1414 265702489) naive"
"#(469 265702489) optimized"

So, I suggest restoring MultiByteFileStream >> #basicUpTo:, and changing
PositionableStream >> #basicUpTo: to the version suggested above.
Also, PositionableStream >> #upTo: could use #streamContents: instead of
duplicating its behavior.


Levente

On Sat, 10 Apr 2021, Nicolas Cellier wrote:

> Hi all,
> currently, MultiByteFileStream nextChunk is incorrect for utf8 stream:
> it attempts to decode utf8 twice.
>
> This crafted example will raise an error:
>
>    (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close.
>    (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk.
>
> nextChunk sends basicUpTo: with the intention to get an un-converted
> string, then sends decodeString: to have fast (bulk) decoding.
>
> Unfortunately basicUpTo: sends next instead of basicNext... This makes
> the utf8 decoded twice, which can falsify the source code in some
> cases, or even fail with an Error like in the crafted example above.
>
> an accelerated version of basicUpTo: was provided by Levente in
> Multilingual-ul.85.mcz
> but was removed in Multilingual-ar.119.mcz, and I didn't understand
> the intention by reading the commit message...
> basicUpTo: was then broken in Collections-ul.438.mcz, and fixed in
> Collections-eem.684.mcz with the double decoding problem.
>
>
> Le sam. 10 avr. 2021 à 21:17, <[hidden email]> a écrit :
>>
>> Nicolas Cellier uploaded a new version of Collections to project The Trunk:
>> http://source.squeak.org/trunk/Collections-nice.933.mcz
>>
>> ==================== Summary ====================
>>
>> Name: Collections-nice.933
>> Author: nice
>> Time: 10 April 2021, 9:16:54.312889 pm
>> UUID: c066bf52-b7a5-474a-9614-90bbc3212e07
>> Ancestors: Collections-ul.932
>>
>> Quick fix for double utf8->squeak conversion via nextChunk.
>>
>>     (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close.
>>     (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk.
>>
>> =============== Diff against Collections-ul.932 ===============
>>
>> Item was changed:
>>   ----- Method: PositionableStream>>basicUpTo: (in category 'private basic') -----
>>   basicUpTo: anObject
>>         "Answer a subcollection from the current access position to the
>>         occurrence (if any, but not inclusive) of anObject in the receiver. If
>>         anObject is not in the collection, answer the entire rest of the receiver."
>>         | newStream element |
>>         newStream := WriteStream on: (self collectionSpecies new: 100).
>> +       [self atEnd or: [(element := self basicNext) = anObject]]
>> -       [self atEnd or: [(element := self next) = anObject]]
>>                 whileFalse: [newStream nextPut: element].
>>         ^newStream contents!
>>
>>

Reply | Threaded
Open this post in threaded view
|

Re: The Trunk: Collections-nice.933.mcz

Nicolas Cellier
Hi Levente,
feel free to publish those changes, this was just a quick fix.
I only detected this by inspecting inbox submissions
(http://source.squeak.org/treated/Collections-cbc.581.diff) and
started to wonder why we would need basicUpTo: at all.
In the mid-term I would prefer that we start using stream composition
rather than basic* protocol, but that's at the risk of bugs and
regressions

Le sam. 10 avr. 2021 à 22:13, Levente Uzonyi <[hidden email]> a écrit :

>
> Hi Nicolas,
>
> Nice analysis. I'm surprised I didn't notice either me making the
> copy-paste mistake in PositionableStream >> #basicUpTo: and Andreas
> removing the optimized version from MultiByteFileStream.
> As with all basic* methods, PositionableStream >> #basicUpTo:
> should simply be
>
>         ^self upTo: anObject
>
> I restored the removed version of MultiByteFileStream >> #basicUpTo: in my
> image. It makes reading the sources file 3x faster using the following
> "benchmark":
>
> | file |
> [
>         | hash |
>         file := (SourceFiles at: 1) readOnlyCopy.
>         hash := 0.
>         {
>                 [ [ file atEnd ] whileFalse: [
>                         hash := (hash bitXor: file nextChunk hash) hashMultiply ] ] timeToRun.
>                 hash } ]
>         ensure: [ file ifNotNil: [ file close ] ].
>
> "#(1414 265702489) naive"
> "#(469 265702489) optimized"
>
> So, I suggest restoring MultiByteFileStream >> #basicUpTo:, and changing
> PositionableStream >> #basicUpTo: to the version suggested above.
> Also, PositionableStream >> #upTo: could use #streamContents: instead of
> duplicating its behavior.
>
>
> Levente
>
> On Sat, 10 Apr 2021, Nicolas Cellier wrote:
>
> > Hi all,
> > currently, MultiByteFileStream nextChunk is incorrect for utf8 stream:
> > it attempts to decode utf8 twice.
> >
> > This crafted example will raise an error:
> >
> >    (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close.
> >    (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk.
> >
> > nextChunk sends basicUpTo: with the intention to get an un-converted
> > string, then sends decodeString: to have fast (bulk) decoding.
> >
> > Unfortunately basicUpTo: sends next instead of basicNext... This makes
> > the utf8 decoded twice, which can falsify the source code in some
> > cases, or even fail with an Error like in the crafted example above.
> >
> > an accelerated version of basicUpTo: was provided by Levente in
> > Multilingual-ul.85.mcz
> > but was removed in Multilingual-ar.119.mcz, and I didn't understand
> > the intention by reading the commit message...
> > basicUpTo: was then broken in Collections-ul.438.mcz, and fixed in
> > Collections-eem.684.mcz with the double decoding problem.
> >
> >
> > Le sam. 10 avr. 2021 à 21:17, <[hidden email]> a écrit :
> >>
> >> Nicolas Cellier uploaded a new version of Collections to project The Trunk:
> >> http://source.squeak.org/trunk/Collections-nice.933.mcz
> >>
> >> ==================== Summary ====================
> >>
> >> Name: Collections-nice.933
> >> Author: nice
> >> Time: 10 April 2021, 9:16:54.312889 pm
> >> UUID: c066bf52-b7a5-474a-9614-90bbc3212e07
> >> Ancestors: Collections-ul.932
> >>
> >> Quick fix for double utf8->squeak conversion via nextChunk.
> >>
> >>     (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close.
> >>     (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk.
> >>
> >> =============== Diff against Collections-ul.932 ===============
> >>
> >> Item was changed:
> >>   ----- Method: PositionableStream>>basicUpTo: (in category 'private basic') -----
> >>   basicUpTo: anObject
> >>         "Answer a subcollection from the current access position to the
> >>         occurrence (if any, but not inclusive) of anObject in the receiver. If
> >>         anObject is not in the collection, answer the entire rest of the receiver."
> >>         | newStream element |
> >>         newStream := WriteStream on: (self collectionSpecies new: 100).
> >> +       [self atEnd or: [(element := self basicNext) = anObject]]
> >> -       [self atEnd or: [(element := self next) = anObject]]
> >>                 whileFalse: [newStream nextPut: element].
> >>         ^newStream contents!
> >>
> >>