isSmalltalkLetter and collation


isSmalltalkLetter and collation

Hans-Martin Mosner-3
Hello,
in our project we've removed the primitive calls in Character and string comparison methods because they delivered different results from the corresponding locale collation methods.
For example, $ü (u-umlaut) is sorted between $a and $z when using the collation method, but before $a when using the primitive VMprCharacterLessThan, which is incorrect according to German collation rules.
After switching from 8.6 to 8.6.3, we found that the new source compression algorithm uses isSmalltalkLetter for character categorization. That is essentially fine, but the method uses #between:and: on characters, which is incorrect for the intended purpose with or without our patch.
Instead of comparing characters (which is locale-sensitive), the method should compare code points:

isSmalltalkLetter
    "Answer true if the receiver is a valid Smalltalk letter as described in the ANSI Smalltalk Standard; otherwise answer false.

     letter ::= uppercaseAlphabetic | lowercaseAlphabetic | nonCaseLetter
     uppercaseAlphabetic ::= ’A’ | ’B’ | ’C’ | ’D’ | ’E’ | ’F’ | ’G’ | ’H’ | ’I’ | ’J’ | ’K’ | ’L’ | ’M’ | ’N’ | ’O’ | ’P’ | ’Q’ | ’R’ | ’S’ | ’T’ | ’U’ | ’V’ | ’W’ | ’X’ | ’Y’ | ’Z’
     lowercaseAlphabetic ::= ’a’ | ’b’ | ’c’ | ’d’ | ’e’ | ’f’ | ’g’ | ’h’ | ’i’ | ’j’ | ’k’ | ’l’ | ’m’ | ’n’ | ’o’ | ’p’ | ’q’ | ’r’ | ’s’ | ’t’ | ’u’ | ’v’ | ’w’ | ’x’ | ’y’ | ’z’
     nonCaseLetter ::= ’_’

    It would be easier to simply send #isLetter, but we cannot do this because some country codes have characters that say they are letters but are not valid Smalltalk syntactic letters. We also need to allow for the nonCaseLetter"

    | cp |
    cp := self codePoint.
    ^(cp between: 97 "$a codePoint" and: 122 "$z codePoint") or: [(cp between: 65 "$A codePoint" and: 90 "$Z codePoint") or: [cp = 95 "$_ codePoint"]]
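
To illustrate why the code point test is independent of the locale, here is a small workspace sketch. It is only an illustration, not part of the proposed patch: the block name isSmalltalkLetterCp and the sample characters are made up for this example, and it assumes that #codePoint and Character class>>value: are available in your image, as the method above already assumes for #codePoint.

```smalltalk
"Workspace sketch: classify characters purely by code point.
 isSmalltalkLetterCp is a hypothetical name used only in this example."
| isSmalltalkLetterCp samples cp |
isSmalltalkLetterCp := [:ch |
    cp := ch codePoint.    "assumed to be available, as in the method above"
    (cp between: 97 and: 122)                "a..z"
        or: [(cp between: 65 and: 90)        "A..Z"
            or: [cp = 95]]].                 "underscore"
samples := Array
    with: $a
    with: $Z
    with: $_
    with: (Character value: 252).            "u-umlaut, code point 252"
samples do: [:ch |
    Transcript
        show: ch codePoint printString;
        show: ' -> ';
        show: (isSmalltalkLetterCp value: ch) printString;
        cr]
```

Because only integers are compared, the block should answer true for $a, $Z and $_ and false for the u-umlaut, regardless of whether Character>>#< goes through the primitive or through the locale collation.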

In addition, something should be done about the discrepancy between the Character comparison method #< and the corresponding locale collation. It is unclear to me which behavior is correct, but the locale collation is more useful to us.

Cheers,
Hans-Martin

--
You received this message because you are subscribed to the Google Groups "VA Smalltalk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/va-smalltalk.
For more options, visit https://groups.google.com/d/optout.

Re: isSmalltalkLetter and collation

Hans-Martin Mosner-3
I should note that this issue came up for us because we fixed the 8.6.2 source compression incompatibility with an image patch that disables the source compression primitives, so that the Smalltalk code is run instead. That is probably not the case for many other users.

Hans-Martin
