Class name with diacritic character and Pharo

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Class name with diacritic character and Pharo

Dominique Dartois-4
Hello all.
If a use french diacritic character in a class name, the code runs but I can’t fileout the package nor save it with Monticello.
For example, the C cedilla in the class name drive me to an ‘ZnInvalidUTF8:Illegal byte for utf-8 encoding' when filing out.

Is it a bug or a feature?
Thank you

---
Dominique Dartois
Reply | Threaded
Open this post in threaded view
|

Re: Class name with diacritic character and Pharo

Dominique Dartois-4
Hi Sven.
Thank you for the time spend for your reply.
I have tried the same code in Pharo 6.1 (21.0) instead of 7.1.0 and I had NO problem.
It seems the class implementation for ZnUTF8Encoder is different.
Thanks again.
---
Dominique

> Le 27 janvier 2019 à 17:03, Sven Van Caekenberghe <[hidden email]> a écrit :
>
>
> Hi Dominique,
>
> > On 27 Jan 2019, at 11:40, Dominique Dartois <[hidden email]> wrote:
> >
> > Hello all.
> > If a use french diacritic character in a class name, the code runs but I can’t fileout the package nor save it with Monticello.
> > For example, the C cedilla in the class name drive me to an ‘ZnInvalidUTF8:Illegal byte for utf-8 encoding' when filing out.
> >
> > Is it a bug or a feature?
> > Thank you
> >
> > <image.png>---
> > Dominique Dartois
>
> Thanks for reporting this. This is most definitely a bug, I can confirm its occurrence.
>
> I'm CC pharo-dev as this is quite important. This will be a long mail.
>
>
> This is one manifestation of a problem that has been present for quite a while.
>
> I'll start by describing what I did, what went well and where/how this fails, some generic points, and two conceptual solutions (that need further verification).
>
> Like you, I created a new subclass:
>
> Object subclass: #ClasseFrançaise
> instanceVariableNames: ''
> classVariableNames: ''
> package: '_UnpackagedPackage'
>
> With comment:
>
> I am ClasseFrançaise.
>
> Try:
>
> ClasseFrançaise new élève.
> ClasseFrançaise new euro.
>
> And two methods (in the 'test' protocol):
>
> élève
> ^ 'élève'
>
> euro
> ^ '€'
>
> I added the euro sign (because that is encoded in UTF-8 with 3 bytes, not 2 like ç).
> Like you said, the system can cope with such class and method names and seems to function fine.
>
> Looking at the .changes file, the correct source code was appended:
>
> ----SNAPSHOT----2019-01-26T23:36:18.548555+01:00 work.image priorSource: 339848!
>
> Object subclass: #ClasseFrançaise
>         instanceVariableNames: ''
>         classVariableNames: ''
>         package: '_UnpackagedPackage'!
> !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:25' prior: 0!
> I am ClasseFrançaise.!
> !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 12:26'!
> élève
>         ^ 'élève'! !
> !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:27' prior: 33898360!
> I am ClasseFrançaise.
>
> Try:
>
>         ClasseFrançaise new élève.
>         ClasseFrançaise new euro.
> !
> !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 12:27'!
> euro
> ^ '€'! !
>
>
> Doing a file out (or otherwise saving the source code) fails. The reason is an incorrect manipulation of this source file while looking for what is called the method preamble, in SourcFileArray>>#getPreambleFrom:at: position
>
> An programmatic way to invoke the same error is by doing
>
> (ClasseFrançaise>>#élève) timeStamp.
> (ClasseFrançaise>>#élève) author.
>
> Both fail with the same error.
>
>
> The source code of methods is (currently) stored in a .sources or .changes file. CompiledMethods know their source pointer, an offset in one of these files. Right before the place where the source starts is a preamble that contains some meta information (including the author and timestamp). To access that preamble, the source code pointer is moved backwards to the beginning of the preamble (which begins and ends with a !).
>
>
> The current approach fails in the presence of non-ASCII characters. More specifically because of a mixup between the concept of byte position and character position when using UTF-8, a variable length encoding (both the .changes and the .sources are UTF-8 encoded).
>
> For example, consider
>
> 'à partir de 10 €' size. "16"
> 'à partir de 10 €' utf8Encoded size. "19"
>
> So although the string contains 16 characters, it is encoded as 19 bytes, à using 2 bytes and € using 3 bytes. In general, moving backwards or forwards in UTF-8 encoded bytes cannot be done without understanding UTF-8 itself.
>
> ZnUTF8Encoder can do both (moving forward is #nextFromStream: while moving backwards is #backOnStream:). However, ZnUTF8Encoder is also strict: it will signal an error when forced to operate in between encoded characters, which is what happens here.
>
> It is thus not possible to move to arbitrary bytes positions and assume/hope to always arrive on the correct character boundaries and it is also wrong to take the difference between two byte positions as the count of characters present (since their encoding is of variable length).
>
> SourcFileArray>>#getPreambleFrom:at: is doing both of these wrong (but gets away with it in 99.99% of all cases since very few people name their classes like that).
>
> There are two solutions: operate mostly on the byte level or operate correctly on the character level. Here are two conceptual solutions (you must execute either solution 1 or 2, not both), with two different inputs.
>
>
> src := '!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe 1/27/2019 12:27''!
> euro
> ^ ''€''! !'.
>
> "startPosition := 83"
>
> str := ZnCharacterReadStream on: (src utf8Encoded readStream).
> str position: 83. "at start of euro, the methods source string"
> str upToEnd.
>
> str position: (83 - 3). "before ! before euro"
>
> "find the previous ! before position"
> position := str position.
> binary := str wrappedStream.
> encoder := str encoder.
>
> "solution 1"
> [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] whileTrue: [ position := position - 1 ].
> position.
> encoder decodeBytes: (binary next: 80 - position).
>
> "solution 2"
> count := 0.
> [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [
> encoder backOnStream: binary; backOnStream: binary. count := count + 1 ].
> str position.
> str next: count.
>
>
> Same code, different input (and starting position):
>
>
> src := '!ABC!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe 1/27/2019 12:27''!
> euro
> ^ ''€''! !'.
>
> "startPosition := 87"
>
> str := ZnCharacterReadStream on: (src utf8Encoded readStream).
> str position: 87. "at start of euro, the methods source string"
> str upToEnd.
>
> str position: (87 - 3). "before ! before euro"
>
> "find the previous ! before position"
> position := str position.
> binary := str wrappedStream.
> encoder := str encoder.
>
> "solution 1"
> [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] whileTrue: [ position := position - 1 ].
> position.
> encoder decodeBytes: (binary next: 80 - position).
>
> "solution 2"
> count := 0.
> [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [
> encoder backOnStream: binary; backOnStream: binary. count := count + 1 ].
> str position.
> str next: count.
>
>
> Solution 1 (at the byte level) works because $! (ASCII code 33) is also encoded as such in UTF-8 and this byte pattern cannot be part of a 2, 3 or 4 byte UTF-8 encoded character. The final byte segment must then be given to the proper encoder to turn it into characters.
>
> Solution 2 (at the character level) works because it essentially counts characters while moving backwards correctly (it moves back 2 steps each time because #next moves 1 step forward, while #peek would not help).
>
> Note that in Solution 1 the moving position is held in an external variable, while in Solution 2 it is held inside the stream.
>
>
> Implementing either solution requires more internal access of the streams (#wrappedStream), so SourcFileArray>>#getPreambleFrom:at: should at least be moved to SourceFile, IMHO.
>
> Obviously this is a sensitive/critical area to touch.
>
> We will also need a regression unit test.
>
> On a higher design level, I think that CompiledMethod>>#timeStamp and CompiledMethod>>#author should be cached at the instance level (a unix timestamp and a symbol would cost very little).
>
> Sven
>
>
>
>
>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: Class name with diacritic character and Pharo

Pharo Smalltalk Users mailing list
In reply to this post by Dominique Dartois-4
While we're at it, a similar problem arises when the author name (in my
case BenoîtStJean) contains a French diacritic.

Just tested it with Pharo 7 (64 bit on Windows 10)...

Fileout works fine.  But filing in crashes!

On 2019-01-27 11:03, Sven Van Caekenberghe wrote:

> Hi Dominique,
>
>> On 27 Jan 2019, at 11:40, Dominique Dartois <[hidden email]> wrote:
>>
>> Hello all.
>> If a use french diacritic character in a class name, the code runs but I can’t fileout the package nor save it with Monticello.
>> For example, the C cedilla in the class name drive me to an ‘ZnInvalidUTF8:Illegal byte for utf-8 encoding' when filing out.
>>
>> Is it a bug or a feature?
>> Thank you
>>
>> <image.png>---
>> Dominique Dartois
> Thanks for reporting this. This is most definitely a bug, I can confirm its occurrence.
>
> I'm CC pharo-dev as this is quite important. This will be a long mail.
>
>
> This is one manifestation of a problem that has been present for quite a while.
>
> I'll start by describing what I did, what went well and where/how this fails, some generic points, and two conceptual solutions (that need further verification).
>
> Like you, I created a new subclass:
>
> Object subclass: #ClasseFrançaise
> instanceVariableNames: ''
> classVariableNames: ''
> package: '_UnpackagedPackage'
>
> With comment:
>
> I am ClasseFrançaise.
>
> Try:
>
> ClasseFrançaise new élève.
> ClasseFrançaise new euro.
>
> And two methods (in the 'test' protocol):
>
> élève
> ^ 'élève'
>
> euro
> ^ '€'
>
> I added the euro sign (because that is encoded in UTF-8 with 3 bytes, not 2 like ç).
> Like you said, the system can cope with such class and method names and seems to function fine.
>
> Looking at the .changes file, the correct source code was appended:
>
> ----SNAPSHOT----2019-01-26T23:36:18.548555+01:00 work.image priorSource: 339848!
>
> Object subclass: #ClasseFrançaise
>          instanceVariableNames: ''
>          classVariableNames: ''
>          package: '_UnpackagedPackage'!
> !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:25' prior: 0!
> I am ClasseFrançaise.!
> !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 12:26'!
> élève
>          ^ 'élève'! !
> !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:27' prior: 33898360!
> I am ClasseFrançaise.
>
> Try:
>
>          ClasseFrançaise new élève.
>          ClasseFrançaise new euro.
> !
> !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 12:27'!
> euro
> ^ '€'! !
>
>
> Doing a file out (or otherwise saving the source code) fails. The reason is an incorrect manipulation of this source file while looking for what is called the method preamble, in SourcFileArray>>#getPreambleFrom:at: position
>
> An programmatic way to invoke the same error is by doing
>
> (ClasseFrançaise>>#élève) timeStamp.
> (ClasseFrançaise>>#élève) author.
>
> Both fail with the same error.
>
>
> The source code of methods is (currently) stored in a .sources or .changes file. CompiledMethods know their source pointer, an offset in one of these files. Right before the place where the source starts is a preamble that contains some meta information (including the author and timestamp). To access that preamble, the source code pointer is moved backwards to the beginning of the preamble (which begins and ends with a !).
>
>
> The current approach fails in the presence of non-ASCII characters. More specifically because of a mixup between the concept of byte position and character position when using UTF-8, a variable length encoding (both the .changes and the .sources are UTF-8 encoded).
>
> For example, consider
>
> 'à partir de 10 €' size. "16"
> 'à partir de 10 €' utf8Encoded size. "19"
>
> So although the string contains 16 characters, it is encoded as 19 bytes, à using 2 bytes and € using 3 bytes. In general, moving backwards or forwards in UTF-8 encoded bytes cannot be done without understanding UTF-8 itself.
>
> ZnUTF8Encoder can do both (moving forward is #nextFromStream: while moving backwards is #backOnStream:). However, ZnUTF8Encoder is also strict: it will signal an error when forced to operate in between encoded characters, which is what happens here.
>
> It is thus not possible to move to arbitrary bytes positions and assume/hope to always arrive on the correct character boundaries and it is also wrong to take the difference between two byte positions as the count of characters present (since their encoding is of variable length).
>
> SourcFileArray>>#getPreambleFrom:at: is doing both of these wrong (but gets away with it in 99.99% of all cases since very few people name their classes like that).
>
> There are two solutions: operate mostly on the byte level or operate correctly on the character level. Here are two conceptual solutions (you must execute either solution 1 or 2, not both), with two different inputs.
>
>
> src := '!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe 1/27/2019 12:27''!
> euro
> ^ ''€''! !'.
>
> "startPosition := 83"
>
> str := ZnCharacterReadStream on: (src utf8Encoded readStream).
> str position: 83. "at start of euro, the methods source string"
> str upToEnd.
>
> str position: (83 - 3). "before ! before euro"
>
> "find the previous ! before position"
> position := str position.
> binary := str wrappedStream.
> encoder := str encoder.
>
> "solution 1"
> [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] whileTrue: [ position := position - 1 ].
> position.
> encoder decodeBytes: (binary next: 80 - position).
>
> "solution 2"
> count := 0.
> [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [
> encoder backOnStream: binary; backOnStream: binary. count := count + 1 ].
> str position.
> str next: count.
>
>
> Same code, different input (and starting position):
>
>
> src := '!ABC!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe 1/27/2019 12:27''!
> euro
> ^ ''€''! !'.
>
> "startPosition := 87"
>
> str := ZnCharacterReadStream on: (src utf8Encoded readStream).
> str position: 87. "at start of euro, the methods source string"
> str upToEnd.
>
> str position: (87 - 3). "before ! before euro"
>
> "find the previous ! before position"
> position := str position.
> binary := str wrappedStream.
> encoder := str encoder.
>
> "solution 1"
> [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] whileTrue: [ position := position - 1 ].
> position.
> encoder decodeBytes: (binary next: 80 - position).
>
> "solution 2"
> count := 0.
> [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [
> encoder backOnStream: binary; backOnStream: binary. count := count + 1 ].
> str position.
> str next: count.
>
>
> Solution 1 (at the byte level) works because $! (ASCII code 33) is also encoded as such in UTF-8 and this byte pattern cannot be part of a 2, 3 or 4 byte UTF-8 encoded character. The final byte segment must then be given to the proper encoder to turn it into characters.
>
> Solution 2 (at the character level) works because it essentially counts characters while moving backwards correctly (it moves back 2 steps each time because #next moves 1 step forward, while #peek would not help).
>
> Note that in Solution 1 the moving position is held in an external variable, while in Solution 2 it is held inside the stream.
>
>
> Implementing either solution requires more internal access of the streams (#wrappedStream), so SourcFileArray>>#getPreambleFrom:at: should at least be moved to SourceFile, IMHO.
>
> Obviously this is a sensitive/critical area to touch.
>
> We will also need a regression unit test.
>
> On a higher design level, I think that CompiledMethod>>#timeStamp and CompiledMethod>>#author should be cached at the instance level (a unix timestamp and a symbol would cost very little).
>
> Sven
>
>
>
>
>
>
>
--
-----------------
Benoît St-Jean
Yahoo! Messenger: bstjean
Twitter: @BenLeChialeux
Pinterest: benoitstjean
Instagram: Chef_Benito
IRC: lamneth
Blogue: endormitoire.wordpress.com
"A standpoint is an intellectual horizon of radius zero".  (A. Einstein)


Reply | Threaded
Open this post in threaded view
|

Re: Class name with diacritic character and Pharo

Pharo Smalltalk Users mailing list
In reply to this post by Dominique Dartois-4
Sorry, my toothache meds are kicking in! lol

Correction to my last post:

I mean *fileouts" don't work when author name has diacritic French
characters!

Obviously, couldn't test filing in!


On 2019-01-27 11:03, Sven Van Caekenberghe wrote:

> Hi Dominique,
>
>> On 27 Jan 2019, at 11:40, Dominique Dartois <[hidden email]> wrote:
>>
>> Hello all.
>> If a use french diacritic character in a class name, the code runs but I can’t fileout the package nor save it with Monticello.
>> For example, the C cedilla in the class name drive me to an ‘ZnInvalidUTF8:Illegal byte for utf-8 encoding' when filing out.
>>
>> Is it a bug or a feature?
>> Thank you
>>
>> <image.png>---
>> Dominique Dartois
> Thanks for reporting this. This is most definitely a bug, I can confirm its occurrence.
>
> I'm CC pharo-dev as this is quite important. This will be a long mail.
>
>
> This is one manifestation of a problem that has been present for quite a while.
>
> I'll start by describing what I did, what went well and where/how this fails, some generic points, and two conceptual solutions (that need further verification).
>
> Like you, I created a new subclass:
>
> Object subclass: #ClasseFrançaise
> instanceVariableNames: ''
> classVariableNames: ''
> package: '_UnpackagedPackage'
>
> With comment:
>
> I am ClasseFrançaise.
>
> Try:
>
> ClasseFrançaise new élève.
> ClasseFrançaise new euro.
>
> And two methods (in the 'test' protocol):
>
> élève
> ^ 'élève'
>
> euro
> ^ '€'
>
> I added the euro sign (because that is encoded in UTF-8 with 3 bytes, not 2 like ç).
> Like you said, the system can cope with such class and method names and seems to function fine.
>
> Looking at the .changes file, the correct source code was appended:
>
> ----SNAPSHOT----2019-01-26T23:36:18.548555+01:00 work.image priorSource: 339848!
>
> Object subclass: #ClasseFrançaise
>          instanceVariableNames: ''
>          classVariableNames: ''
>          package: '_UnpackagedPackage'!
> !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:25' prior: 0!
> I am ClasseFrançaise.!
> !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 12:26'!
> élève
>          ^ 'élève'! !
> !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:27' prior: 33898360!
> I am ClasseFrançaise.
>
> Try:
>
>          ClasseFrançaise new élève.
>          ClasseFrançaise new euro.
> !
> !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 12:27'!
> euro
> ^ '€'! !
>
>
> Doing a file out (or otherwise saving the source code) fails. The reason is an incorrect manipulation of this source file while looking for what is called the method preamble, in SourcFileArray>>#getPreambleFrom:at: position
>
> An programmatic way to invoke the same error is by doing
>
> (ClasseFrançaise>>#élève) timeStamp.
> (ClasseFrançaise>>#élève) author.
>
> Both fail with the same error.
>
>
> The source code of methods is (currently) stored in a .sources or .changes file. CompiledMethods know their source pointer, an offset in one of these files. Right before the place where the source starts is a preamble that contains some meta information (including the author and timestamp). To access that preamble, the source code pointer is moved backwards to the beginning of the preamble (which begins and ends with a !).
>
>
> The current approach fails in the presence of non-ASCII characters. More specifically because of a mixup between the concept of byte position and character position when using UTF-8, a variable length encoding (both the .changes and the .sources are UTF-8 encoded).
>
> For example, consider
>
> 'à partir de 10 €' size. "16"
> 'à partir de 10 €' utf8Encoded size. "19"
>
> So although the string contains 16 characters, it is encoded as 19 bytes, à using 2 bytes and € using 3 bytes. In general, moving backwards or forwards in UTF-8 encoded bytes cannot be done without understanding UTF-8 itself.
>
> ZnUTF8Encoder can do both (moving forward is #nextFromStream: while moving backwards is #backOnStream:). However, ZnUTF8Encoder is also strict: it will signal an error when forced to operate in between encoded characters, which is what happens here.
>
> It is thus not possible to move to arbitrary bytes positions and assume/hope to always arrive on the correct character boundaries and it is also wrong to take the difference between two byte positions as the count of characters present (since their encoding is of variable length).
>
> SourcFileArray>>#getPreambleFrom:at: is doing both of these wrong (but gets away with it in 99.99% of all cases since very few people name their classes like that).
>
> There are two solutions: operate mostly on the byte level or operate correctly on the character level. Here are two conceptual solutions (you must execute either solution 1 or 2, not both), with two different inputs.
>
>
> src := '!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe 1/27/2019 12:27''!
> euro
> ^ ''€''! !'.
>
> "startPosition := 83"
>
> str := ZnCharacterReadStream on: (src utf8Encoded readStream).
> str position: 83. "at start of euro, the methods source string"
> str upToEnd.
>
> str position: (83 - 3). "before ! before euro"
>
> "find the previous ! before position"
> position := str position.
> binary := str wrappedStream.
> encoder := str encoder.
>
> "solution 1"
> [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] whileTrue: [ position := position - 1 ].
> position.
> encoder decodeBytes: (binary next: 80 - position).
>
> "solution 2"
> count := 0.
> [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [
> encoder backOnStream: binary; backOnStream: binary. count := count + 1 ].
> str position.
> str next: count.
>
>
> Same code, different input (and starting position):
>
>
> src := '!ABC!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe 1/27/2019 12:27''!
> euro
> ^ ''€''! !'.
>
> "startPosition := 87"
>
> str := ZnCharacterReadStream on: (src utf8Encoded readStream).
> str position: 87. "at start of euro, the methods source string"
> str upToEnd.
>
> str position: (87 - 3). "before ! before euro"
>
> "find the previous ! before position"
> position := str position.
> binary := str wrappedStream.
> encoder := str encoder.
>
> "solution 1"
> [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] whileTrue: [ position := position - 1 ].
> position.
> encoder decodeBytes: (binary next: 80 - position).
>
> "solution 2"
> count := 0.
> [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [
> encoder backOnStream: binary; backOnStream: binary. count := count + 1 ].
> str position.
> str next: count.
>
>
> Solution 1 (at the byte level) works because $! (ASCII code 33) is also encoded as such in UTF-8 and this byte pattern cannot be part of a 2, 3 or 4 byte UTF-8 encoded character. The final byte segment must then be given to the proper encoder to turn it into characters.
>
> Solution 2 (at the character level) works because it essentially counts characters while moving backwards correctly (it moves back 2 steps each time because #next moves 1 step forward, while #peek would not help).
>
> Note that in Solution 1 the moving position is held in an external variable, while in Solution 2 it is held inside the stream.
>
>
> Implementing either solution requires more internal access of the streams (#wrappedStream), so SourcFileArray>>#getPreambleFrom:at: should at least be moved to SourceFile, IMHO.
>
> Obviously this is a sensitive/critical area to touch.
>
> We will also need a regression unit test.
>
> On a higher design level, I think that CompiledMethod>>#timeStamp and CompiledMethod>>#author should be cached at the instance level (a unix timestamp and a symbol would cost very little).
>
> Sven
>
>
>
>
>
>
>
--
-----------------
Benoît St-Jean
Yahoo! Messenger: bstjean
Twitter: @BenLeChialeux
Pinterest: benoitstjean
Instagram: Chef_Benito
IRC: lamneth
Blogue: endormitoire.wordpress.com
"A standpoint is an intellectual horizon of radius zero".  (A. Einstein)


Reply | Threaded
Open this post in threaded view
|

Re: Class name with diacritic character and Pharo

Sven Van Caekenberghe-2
In reply to this post by Dominique Dartois-4
Yer, author and classname are both part of the preamble, so both are capable of breaking it.

Same problem, same solution.

> On 28 Jan 2019, at 05:10, Benoit St-Jean <[hidden email]> wrote:
>
> Sorry, my toothache meds are kicking in! lol
>
> Correction to my last post:
>
> I mean *fileouts" don't work when author name has diacritic French characters!
>
> Obviously, couldn't test filing in!
>
>
> On 2019-01-27 11:03, Sven Van Caekenberghe wrote:
>> Hi Dominique,
>>
>>> On 27 Jan 2019, at 11:40, Dominique Dartois <[hidden email]> wrote:
>>>
>>> Hello all.
>>> If a use french diacritic character in a class name, the code runs but I can’t fileout the package nor save it with Monticello.
>>> For example, the C cedilla in the class name drive me to an ‘ZnInvalidUTF8:Illegal byte for utf-8 encoding' when filing out.
>>>
>>> Is it a bug or a feature?
>>> Thank you
>>>
>>> <image.png>---
>>> Dominique Dartois
>> Thanks for reporting this. This is most definitely a bug, I can confirm its occurrence.
>>
>> I'm CC pharo-dev as this is quite important. This will be a long mail.
>>
>>
>> This is one manifestation of a problem that has been present for quite a while.
>>
>> I'll start by describing what I did, what went well and where/how this fails, some generic points, and two conceptual solutions (that need further verification).
>>
>> Like you, I created a new subclass:
>>
>> Object subclass: #ClasseFrançaise
>> instanceVariableNames: ''
>> classVariableNames: ''
>> package: '_UnpackagedPackage'
>>
>> With comment:
>>
>> I am ClasseFrançaise.
>>
>> Try:
>>
>> ClasseFrançaise new élève.
>> ClasseFrançaise new euro.
>>
>> And two methods (in the 'test' protocol):
>>
>> élève
>> ^ 'élève'
>>
>> euro
>> ^ '€'
>>
>> I added the euro sign (because that is encoded in UTF-8 with 3 bytes, not 2 like ç).
>> Like you said, the system can cope with such class and method names and seems to function fine.
>>
>> Looking at the .changes file, the correct source code was appended:
>>
>> ----SNAPSHOT----2019-01-26T23:36:18.548555+01:00 work.image priorSource: 339848!
>>
>> Object subclass: #ClasseFrançaise
>>         instanceVariableNames: ''
>>         classVariableNames: ''
>>         package: '_UnpackagedPackage'!
>> !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:25' prior: 0!
>> I am ClasseFrançaise.!
>> !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 12:26'!
>> élève
>>         ^ 'élève'! !
>> !ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:27' prior: 33898360!
>> I am ClasseFrançaise.
>>
>> Try:
>>
>>         ClasseFrançaise new élève.
>>         ClasseFrançaise new euro.
>> !
>> !ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 12:27'!
>> euro
>> ^ '€'! !
>>
>>
>> Doing a file out (or otherwise saving the source code) fails. The reason is an incorrect manipulation of this source file while looking for what is called the method preamble, in SourcFileArray>>#getPreambleFrom:at: position
>>
>> An programmatic way to invoke the same error is by doing
>>
>> (ClasseFrançaise>>#élève) timeStamp.
>> (ClasseFrançaise>>#élève) author.
>>
>> Both fail with the same error.
>>
>>
>> The source code of methods is (currently) stored in a .sources or .changes file. CompiledMethods know their source pointer, an offset in one of these files. Right before the place where the source starts is a preamble that contains some meta information (including the author and timestamp). To access that preamble, the source code pointer is moved backwards to the beginning of the preamble (which begins and ends with a !).
>>
>>
>> The current approach fails in the presence of non-ASCII characters. More specifically because of a mixup between the concept of byte position and character position when using UTF-8, a variable length encoding (both the .changes and the .sources are UTF-8 encoded).
>>
>> For example, consider
>>
>> 'à partir de 10 €' size. "16"
>> 'à partir de 10 €' utf8Encoded size. "19"
>>
>> So although the string contains 16 characters, it is encoded as 19 bytes, à using 2 bytes and € using 3 bytes. In general, moving backwards or forwards in UTF-8 encoded bytes cannot be done without understanding UTF-8 itself.
>>
>> ZnUTF8Encoder can do both (moving forward is #nextFromStream: while moving backwards is #backOnStream:). However, ZnUTF8Encoder is also strict: it will signal an error when forced to operate in between encoded characters, which is what happens here.
>>
>> It is thus not possible to move to arbitrary bytes positions and assume/hope to always arrive on the correct character boundaries and it is also wrong to take the difference between two byte positions as the count of characters present (since their encoding is of variable length).
>>
>> SourcFileArray>>#getPreambleFrom:at: is doing both of these wrong (but gets away with it in 99.99% of all cases since very few people name their classes like that).
>>
>> There are two solutions: operate mostly on the byte level or operate correctly on the character level. Here are two conceptual solutions (you must execute either solution 1 or 2, not both), with two different inputs.
>>
>>
>> src := '!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe 1/27/2019 12:27''!
>> euro
>> ^ ''€''! !'.
>>
>> "startPosition := 83"
>>
>> str := ZnCharacterReadStream on: (src utf8Encoded readStream).
>> str position: 83. "at start of euro, the methods source string"
>> str upToEnd.
>>
>> str position: (83 - 3). "before ! before euro"
>>
>> "find the previous ! before position"
>> position := str position.
>> binary := str wrappedStream.
>> encoder := str encoder.
>>
>> "solution 1"
>> [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] whileTrue: [ position := position - 1 ].
>> position.
>> encoder decodeBytes: (binary next: 80 - position).
>>
>> "solution 2"
>> count := 0.
>> [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [
>> encoder backOnStream: binary; backOnStream: binary. count := count + 1 ].
>> str position.
>> str next: count.
>>
>>
>> Same code, different input (and starting position):
>>
>>
>> src := '!ABC!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe 1/27/2019 12:27''!
>> euro
>> ^ ''€''! !'.
>>
>> "startPosition := 87"
>>
>> str := ZnCharacterReadStream on: (src utf8Encoded readStream).
>> str position: 87. "at start of euro, the methods source string"
>> str upToEnd.
>>
>> str position: (87 - 3). "before ! before euro"
>>
>> "find the previous ! before position"
>> position := str position.
>> binary := str wrappedStream.
>> encoder := str encoder.
>>
>> "solution 1"
>> [ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] whileTrue: [ position := position - 1 ].
>> position.
>> encoder decodeBytes: (binary next: 80 - position).
>>
>> "solution 2"
>> count := 0.
>> [ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [
>> encoder backOnStream: binary; backOnStream: binary. count := count + 1 ].
>> str position.
>> str next: count.
>>
>>
>> Solution 1 (at the byte level) works because $! (ASCII code 33) is also encoded as such in UTF-8 and this byte pattern cannot be part of a 2, 3 or 4 byte UTF-8 encoded character. The final byte segment must then be given to the proper encoder to turn it into characters.
>>
>> Solution 2 (at the character level) works because it essentially counts characters while moving backwards correctly (it moves back 2 steps each time because #next moves 1 step forward, while #peek would not help).
>>
>> Note that in Solution 1 the moving position is held in an external variable, while in Solution 2 it is held inside the stream.
>>
>>
>> Implementing either solution requires more internal access of the streams (#wrappedStream), so SourcFileArray>>#getPreambleFrom:at: should at least be moved to SourceFile, IMHO.
>>
>> Obviously this is a sensitive/critical area to touch.
>>
>> We will also need a regression unit test.
>>
>> On a higher design level, I think that CompiledMethod>>#timeStamp and CompiledMethod>>#author should be cached at the instance level (a unix timestamp and a symbol would cost very little).
>>
>> Sven
>>
>>
>>
>>
>>
>>
>>
> --
> -----------------
> Benoît St-Jean
> Yahoo! Messenger: bstjean
> Twitter: @BenLeChialeux
> Pinterest: benoitstjean
> Instagram: Chef_Benito
> IRC: lamneth
> Blogue: endormitoire.wordpress.com
> "A standpoint is an intellectual horizon of radius zero".  (A. Einstein)
>