Nicolas Cellier uploaded a new version of Collections to project The Trunk:
http://source.squeak.org/trunk/Collections-nice.933.mcz ==================== Summary ==================== Name: Collections-nice.933 Author: nice Time: 10 April 2021, 9:16:54.312889 pm UUID: c066bf52-b7a5-474a-9614-90bbc3212e07 Ancestors: Collections-ul.932 Quick fix for double utf8->squeak conversion via nextChunk. (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close. (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk. =============== Diff against Collections-ul.932 =============== Item was changed: ----- Method: PositionableStream>>basicUpTo: (in category 'private basic') ----- basicUpTo: anObject "Answer a subcollection from the current access position to the occurrence (if any, but not inclusive) of anObject in the receiver. If anObject is not in the collection, answer the entire rest of the receiver." | newStream element | newStream := WriteStream on: (self collectionSpecies new: 100). + [self atEnd or: [(element := self basicNext) = anObject]] - [self atEnd or: [(element := self next) = anObject]] whileFalse: [newStream nextPut: element]. ^newStream contents! |
Hi all,
currently, MultiByteFileStream nextChunk is incorrect for utf8 stream: it attempts to decode utf8 twice. This crafted example will raise an error: (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close. (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk. nextChunk sends basicUpTo: with the intention to get an un-converted string, then sends decodeString: to have fast (bulk) decoding. Unfortunately basicUpTo: sends next instead of basicNext... This makes the utf8 decoded twice, which can falsify the source code in some cases, or even fail with an Error like in the crafted example above. an accelerated version of basicUpTo: was provided by Levente in Multilingual-ul.85.mcz but was removed in Multilingual-ar.119.mcz, and I didn't understand the intention by reading the commit message... basicUpTo: was then broken in Collections-ul.438.mcz, and fixed in Collections-eem.684.mcz with the double decoding problem. Le sam. 10 avr. 2021 à 21:17, <[hidden email]> a écrit : > > Nicolas Cellier uploaded a new version of Collections to project The Trunk: > http://source.squeak.org/trunk/Collections-nice.933.mcz > > ==================== Summary ==================== > > Name: Collections-nice.933 > Author: nice > Time: 10 April 2021, 9:16:54.312889 pm > UUID: c066bf52-b7a5-474a-9614-90bbc3212e07 > Ancestors: Collections-ul.932 > > Quick fix for double utf8->squeak conversion via nextChunk. > > (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close. > (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk. > > =============== Diff against Collections-ul.932 =============== > > Item was changed: > ----- Method: PositionableStream>>basicUpTo: (in category 'private basic') ----- > basicUpTo: anObject > "Answer a subcollection from the current access position to the > occurrence (if any, but not inclusive) of anObject in the receiver. If > anObject is not in the collection, answer the entire rest of the receiver." > | newStream element | > newStream := WriteStream on: (self collectionSpecies new: 100). > + [self atEnd or: [(element := self basicNext) = anObject]] > - [self atEnd or: [(element := self next) = anObject]] > whileFalse: [newStream nextPut: element]. > ^newStream contents! > > |
Hi Nicolas,
Nice analysis. I'm surprised I didn't notice either me making the copy-paste mistake in PositionableStream >> #basicUpTo: and Andreas removing the optimized version from MultiByteFileStream. As with all basic* methods, PositionableStream >> #basicUpTo: should simply be ^self upTo: anObject I restored the removed version of MultiByteFileStream >> #basicUpTo: in my image. It makes reading the sources file 3x faster using the following "benchmark": | file | [ | hash | file := (SourceFiles at: 1) readOnlyCopy. hash := 0. { [ [ file atEnd ] whileFalse: [ hash := (hash bitXor: file nextChunk hash) hashMultiply ] ] timeToRun. hash } ] ensure: [ file ifNotNil: [ file close ] ]. "#(1414 265702489) naive" "#(469 265702489) optimized" So, I suggest restoring MultiByteFileStream >> #basicUpTo:, and changing PositionableStream >> #basicUpTo: to the version suggested above. Also, PositionableStream >> #upTo: could use #streamContents: instead of duplicating its behavior. Levente On Sat, 10 Apr 2021, Nicolas Cellier wrote: > Hi all, > currently, MultiByteFileStream nextChunk is incorrect for utf8 stream: > it attempts to decode utf8 twice. > > This crafted example will raise an error: > > (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close. > (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk. > > nextChunk sends basicUpTo: with the intention to get an un-converted > string, then sends decodeString: to have fast (bulk) decoding. > > Unfortunately basicUpTo: sends next instead of basicNext... This makes > the utf8 decoded twice, which can falsify the source code in some > cases, or even fail with an Error like in the crafted example above. > > an accelerated version of basicUpTo: was provided by Levente in > Multilingual-ul.85.mcz > but was removed in Multilingual-ar.119.mcz, and I didn't understand > the intention by reading the commit message... > basicUpTo: was then broken in Collections-ul.438.mcz, and fixed in > Collections-eem.684.mcz with the double decoding problem. > > > Le sam. 10 avr. 2021 à 21:17, <[hidden email]> a écrit : >> >> Nicolas Cellier uploaded a new version of Collections to project The Trunk: >> http://source.squeak.org/trunk/Collections-nice.933.mcz >> >> ==================== Summary ==================== >> >> Name: Collections-nice.933 >> Author: nice >> Time: 10 April 2021, 9:16:54.312889 pm >> UUID: c066bf52-b7a5-474a-9614-90bbc3212e07 >> Ancestors: Collections-ul.932 >> >> Quick fix for double utf8->squeak conversion via nextChunk. >> >> (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close. >> (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk. >> >> =============== Diff against Collections-ul.932 =============== >> >> Item was changed: >> ----- Method: PositionableStream>>basicUpTo: (in category 'private basic') ----- >> basicUpTo: anObject >> "Answer a subcollection from the current access position to the >> occurrence (if any, but not inclusive) of anObject in the receiver. If >> anObject is not in the collection, answer the entire rest of the receiver." >> | newStream element | >> newStream := WriteStream on: (self collectionSpecies new: 100). >> + [self atEnd or: [(element := self basicNext) = anObject]] >> - [self atEnd or: [(element := self next) = anObject]] >> whileFalse: [newStream nextPut: element]. >> ^newStream contents! >> >> |
Hi Levente,
feel free to publish those changes, this was just a quick fix. I only detected this by inspecting inbox submissions (http://source.squeak.org/treated/Collections-cbc.581.diff) and started to wonder why we would need basicUpTo: at all. In the mid-term I would prefer that we start using stream composition rather than basic* protocol, but that's at the risk of bugs and regressions Le sam. 10 avr. 2021 à 22:13, Levente Uzonyi <[hidden email]> a écrit : > > Hi Nicolas, > > Nice analysis. I'm surprised I didn't notice either me making the > copy-paste mistake in PositionableStream >> #basicUpTo: and Andreas > removing the optimized version from MultiByteFileStream. > As with all basic* methods, PositionableStream >> #basicUpTo: > should simply be > > ^self upTo: anObject > > I restored the removed version of MultiByteFileStream >> #basicUpTo: in my > image. It makes reading the sources file 3x faster using the following > "benchmark": > > | file | > [ > | hash | > file := (SourceFiles at: 1) readOnlyCopy. > hash := 0. > { > [ [ file atEnd ] whileFalse: [ > hash := (hash bitXor: file nextChunk hash) hashMultiply ] ] timeToRun. > hash } ] > ensure: [ file ifNotNil: [ file close ] ]. > > "#(1414 265702489) naive" > "#(469 265702489) optimized" > > So, I suggest restoring MultiByteFileStream >> #basicUpTo:, and changing > PositionableStream >> #basicUpTo: to the version suggested above. > Also, PositionableStream >> #upTo: could use #streamContents: instead of > duplicating its behavior. > > > Levente > > On Sat, 10 Apr 2021, Nicolas Cellier wrote: > > > Hi all, > > currently, MultiByteFileStream nextChunk is incorrect for utf8 stream: > > it attempts to decode utf8 twice. > > > > This crafted example will raise an error: > > > > (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close. > > (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk. > > > > nextChunk sends basicUpTo: with the intention to get an un-converted > > string, then sends decodeString: to have fast (bulk) decoding. > > > > Unfortunately basicUpTo: sends next instead of basicNext... This makes > > the utf8 decoded twice, which can falsify the source code in some > > cases, or even fail with an Error like in the crafted example above. > > > > an accelerated version of basicUpTo: was provided by Levente in > > Multilingual-ul.85.mcz > > but was removed in Multilingual-ar.119.mcz, and I didn't understand > > the intention by reading the commit message... > > basicUpTo: was then broken in Collections-ul.438.mcz, and fixed in > > Collections-eem.684.mcz with the double decoding problem. > > > > > > Le sam. 10 avr. 2021 à 21:17, <[hidden email]> a écrit : > >> > >> Nicolas Cellier uploaded a new version of Collections to project The Trunk: > >> http://source.squeak.org/trunk/Collections-nice.933.mcz > >> > >> ==================== Summary ==================== > >> > >> Name: Collections-nice.933 > >> Author: nice > >> Time: 10 April 2021, 9:16:54.312889 pm > >> UUID: c066bf52-b7a5-474a-9614-90bbc3212e07 > >> Ancestors: Collections-ul.932 > >> > >> Quick fix for double utf8->squeak conversion via nextChunk. > >> > >> (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close. > >> (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk. > >> > >> =============== Diff against Collections-ul.932 =============== > >> > >> Item was changed: > >> ----- Method: PositionableStream>>basicUpTo: (in category 'private basic') ----- > >> basicUpTo: anObject > >> "Answer a subcollection from the current access position to the > >> occurrence (if any, but not inclusive) of anObject in the receiver. If > >> anObject is not in the collection, answer the entire rest of the receiver." > >> | newStream element | > >> newStream := WriteStream on: (self collectionSpecies new: 100). > >> + [self atEnd or: [(element := self basicNext) = anObject]] > >> - [self atEnd or: [(element := self next) = anObject]] > >> whileFalse: [newStream nextPut: element]. > >> ^newStream contents! > >> > >> |
Free forum by Nabble | Edit this page |