Testing a Unicode Character's Category


Prof. Andrew P. Black
The Pharo image has a table of Unicode Character Categories in a class variable GeneralCategory of class Unicode.  But there do not seem to be many methods to interpret this data.  For example, while there is a method

        Unicode class >> #isDigit: aCharacter

that checks if the Unicode category of aCharacter is Nd, and

        Unicode class >> #isLetter: aCharacter

that checks if aCharacter is in one of the letter categories, there does not seem to be a general way of asking “what is the category of this character”.

I want to check if a character is a mathematical symbol, that is, if it is in the Unicode Category Sm.  What’s the right way of doing this?
Would it be reasonable to add a method Unicode class >> #category: aCharacter that answers one of the 29 category symbols #Cc to #Zs?  Or “is” methods for each category?

Andrew
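
To make the two options in the question concrete, here is a minimal playground sketch. It does not touch the real GeneralCategory table: the `categories` dictionary is a hypothetical stand-in covering four characters, and the block names `categoryOf` and `isMathSymbol` are illustrative assumptions, not existing Pharo methods.

```smalltalk
"Playground sketch. The table below is a hypothetical stand-in for the
 real category data in Unicode; it covers four characters only."
| categories categoryOf isMathSymbol |
categories := Dictionary new.
categories
	at: $+ put: #Sm;
	at: $a put: #Ll;
	at: $A put: #Lu;
	at: $5 put: #Nd.

"Option 1: a general query that answers the category symbol.
 Characters missing from the stand-in table answer #Cn (unassigned)."
categoryOf := [ :aCharacter | categories at: aCharacter ifAbsent: [ #Cn ] ].

"Option 2: an intent-revealing test built on top of the query."
isMathSymbol := [ :aCharacter | (categoryOf value: aCharacter) == #Sm ].

isMathSymbol value: $+.	"true"
isMathSymbol value: $a.	"false"
```

The second block only needs the first one, so a per-category "is" method can be layered on a general category query rather than replacing it.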

Re: Testing a Unicode Character's Category

Richard Sargent
Administrator

Hi Andrew,

I would recommend the guideline that "special codes" should be modelled as
private to the class (or family of classes) in question, and that
questions like "is it a mathematical symbol" be answered by a method with
almost that exact name.

There may be a justification for exposing the code itself, but not for
directly interpreting the meaning of the code. A contrived example would be
to answer a Dictionary grouping each UnicodeCharacterData instance by its
General Category so that one could manipulate the set of related instances
in some manner. As I said, contrived.

Also, the existing methods provide a template (pattern?) for future such
methods.




--
Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html


Re: Testing a Unicode Character's Category

Prof. Andrew P. Black
Hi Richard,

Normally I agree with you, and prefer boolean methods

        inCategoryCc: aChar
        isCategorySm: aChar
to
        (Unicode categoryOf: aChar) == #Cc
        (Unicode categoryOf: aChar) == #Sm

In this particular case, though, the category codes are part of the Unicode Standard, so perhaps exposing them isn’t so bad.  Moreover, there is meaning encoded into the symbols — for example, all the letter categories start with L.  So one could write an isLetter test like this

Character >> isLetter
        ^ (Unicode categoryOf: self) first == $L

rather than

Character >> isLetter
        (Unicode inCategoryLl: self) ifTrue: [ ^ true ].
        (Unicode inCategoryLm: self) ifTrue: [ ^ true ].
        (Unicode inCategoryLo: self) ifTrue: [ ^ true ].
        (Unicode inCategoryLt: self) ifTrue: [ ^ true ].
        (Unicode inCategoryLu: self) ifTrue: [ ^ true ].
        ^ false

There is still the disadvantage that a typo in the Category name (typing #L1, for example, when one means #Ll) is likely to go undetected.

        Andrew
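
One way to soften the typo problem, sketched here for a playground: check a category symbol against the set of known symbols before using it. This assumes the 29 categories mentioned earlier are the standard General Category values other than Cn (unassigned); the names `knownCategories` and `checked` are hypothetical.

```smalltalk
"Playground sketch: reject category symbols that are not in the known set.
 Assumption: the 29 symbols are the standard values, leaving out Cn."
| knownCategories checked |
knownCategories := #(Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No
	Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs) asSet.

"Answer the symbol unchanged if it is known, otherwise signal an Error."
checked := [ :aSymbol |
	(knownCategories includes: aSymbol)
		ifFalse: [ Error signal: 'Unknown Unicode category: ' , aSymbol printString ].
	aSymbol ].

checked value: #Ll.	"answers #Ll"
checked value: #L1.	"signals an Error instead of quietly never matching"
```

A check like this only catches the typo at run time, when the comparison is actually reached; a misspelled selector such as inCategoryL1: would at least be flagged when the method is compiled.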




Re: Testing a Unicode Character's Category

Richard Sargent
Administrator
Andrew P. Black wrote

> Hi Richard,
>
> Normally I agree with you, and prefer boolean methods
>
> inCategoryCc: aChar
> isCategorySm: aChar
> to
> (Unicode categoryOf: aChar) == #Cc
> (Unicode categoryOf: aChar) == #Sm
>
> In this particular case, though, the category codes are part of the
> Unicode Standard, so perhaps exposing them isn’t so bad.  Moreover, there
> is meaning encoded into the symbols — for example, all the letter
> categories start with L.  So one could write an isLetter test like this

I understand this argument, but I cannot agree with it. If I were to ask you
what code points were defined in character category Mx, you would
immediately go to a web site that contained the Unicode categories to find
the answer. In other words, you would "ask Unicode".

In general, I would expect to see:

        Character>>#isCapitalLetter
                ^Unicode isCapitalLetter: self "actually, /self codePoint/"

        Character>>#isLetter
                ^Unicode isLetter: self

        Character>>#isMathSymbol
                ^Unicode isMathSymbol: self

Unicode's methods would look at and interpret its category information for
the character ... which might be internally managed via some kind of tree
structure (who knows?).


And I would definitely expect to *not* see a method name like
"isCategorySm:". :-)

Magic numbers, magic codes, etc.: you always want there to be one definitive
expert (class) and you do not want other classes usurping its
responsibilities.








Re: Testing a Unicode Character's Category

Prof. Andrew P. Black
Richard (and others):

Marcus and I committed methods to implement these category tests on Friday.  If any of you have time to review the code, it would be appreciated.  The commit can be found here:  

https://github.com/pharo-project/pharo/pull/326/commits/933b1dcba05b837ab292e19aab413f67b3f9eec5

I noticed that one test is failing, which seems to be entirely unrelated. The test passes in my image.

        Andrew



Re: Testing a Unicode Character's Category

Richard Sargent
Administrator
Andrew, I've tried to comment on the changes. Overall, it was essentially what I was envisioning. My comments on a few methods are mostly specific quibbles about the implementation. One of them reflects my belief that the original implementation contains an error.


Thanks for doing this!


On Sun, Oct 1, 2017 at 7:10 AM, Prof. Andrew P. Black <[hidden email]> wrote:
Richard (and others):

Marcus and I committed methods to implement these category tests on Friday.  If any of you have time to review the code, it would be appreciated.  The commit can be found here:

 https://github.com/pharo-project/pharo/pull/326/commits/933b1dcba05b837ab292e19aab413f67b3f9eec5

I noticed that one test is failing, which seems to be entirely unrelated. The test passes in my image.

        Andrew