Testing a Unicode Character's Category

Prof. Andrew P. Black

Testing a Unicode Character's Category

The Pharo image has a table of Unicode Character Categories in a class variable GeneralCategory of class Unicode. But there do not seem to be many methods to interpret this data. For example, while there is a method

Unicode class >> #isDigit: aCharacter

that checks if the Unicode category of aCharacter is Nd, and

Unicode class >> #isLetter: aCharacter

that checks if aCharacter is in one of the letter categories, there does not seem to be a general way of asking “what is the category of this character”.

I want to check if a character is a mathematical symbol, that is, if it is in the Unicode Category Sm. What’s the right way of doing this?
Would it be reasonable to add a method Unicode class >> #category: aCharacter that answers one of the 29 category symbols #Cc to #Zs? Or “is” methods for each category?

Andrew

Richard Sargent

Re: Testing a Unicode Character's Category

Hi Andrew,

I would recommend the guideline that "special codes" be modelled as
private to the class (or family of classes) in question, and that
questions like "is it a mathematical symbol" be answered by a method with
almost that exact name.

There may be a justification for exposing the code itself, but not for
directly interpreting the meaning of the code. A contrived example would be
to answer a Dictionary grouping each UnicodeCharacterData instance by its
General Category so that one could manipulate the set of related instances
in some manner. As I said, contrived.

Also, the existing methods provide a template (pattern?) for future such
methods.

--
Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html

Prof. Andrew P. Black

Re: Testing a Unicode Character's Category

Hi Richard,

Normally I agree with you, and prefer boolean methods

inCategoryCc: aChar
isCategorySm: aChar
to
categoryOf: aChar == #Cc
categoryOf: aChar == #Sm

In this particular case, though, the category codes are part of the Unicode Standard, so perhaps exposing them isn’t so bad. Moreover, there is meaning encoded into the symbols — for example, all the letter categories start with L. So one could write an isLetter test like this

Character >> isLetter
^ (Unicode categoryOf: self) first == $L

rather than

Character >> isLetter
(Unicode inCategoryLl: self) ifTrue: [ ^ true ].
(Unicode inCategoryLm: self) ifTrue: [ ^ true ].
(Unicode inCategoryLo: self) ifTrue: [ ^ true ].
(Unicode inCategoryLt: self) ifTrue: [ ^ true ].
(Unicode inCategoryLu: self) ifTrue: [ ^ true ].
^ false

There is still the disadvantage that a typo in the Category name (typing #L1, for example, when one means #Ll) is likely to go undetected.
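
To make the trade-off concrete, here is a minimal Playground sketch. The small Dictionary is a hypothetical stand-in for the category table (it is not the real Unicode class API, and `categoryOf` and `isLetter` are illustrative names only). It shows the prefix test working, and a mistyped category symbol comparing as false without any error.

```smalltalk
"Hypothetical stand-in for the category data; not the Unicode class API."
| categoryOf isLetter |
categoryOf := Dictionary new.
categoryOf
	at: $a put: #Ll;
	at: $A put: #Lu;
	at: $+ put: #Sm;
	at: $7 put: #Nd.

"Prefix test: every letter category symbol starts with L."
isLetter := [ :aChar | (categoryOf at: aChar) first = $L ].

isLetter value: $a.            "true"
isLetter value: $+.            "false"

"Typo hazard: #L1 (digit one) is a perfectly valid Symbol, so the
 comparison compiles and quietly answers false."
(categoryOf at: $a) = #L1.     "false"
(categoryOf at: $a) = #Ll.     "true"
```

A dedicated test method per category avoids this, since a misspelled selector is flagged when the method is compiled.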

Andrew

Richard Sargent

Re: Testing a Unicode Character's Category

Andrew P. Black wrote

> Hi Richard,
>
> Normally I agree with you, and prefer boolean methods
>
> inCategoryCc: aChar
> isCategorySm: aChar
> to
> categoryOf: aChar == #Cc
> categoryOf: aChar == #Sm
>
> In this particular case, though, the category codes are part of the
> Unicode Standard, so perhaps exposing them isn’t so bad. Moreover, there
> is meaning encoded into the symbols — for example, all the letter
> categories start with L. So one could write an isLetter test like this

I understand this argument, but I cannot agree with it. If I were to ask you
what code points were defined in character category Mx, you would
immediately go to a web site that contained the Unicode categories to find
the answer. In other words, you would "ask Unicode".

In general, I would expect to see:
Character>>#isCapitalLetter
^Unicode isCapitalLetter: self "actually, /self codePoint/"

Character>>#isLetter
^Unicode isLetter: self

Character>>#isMathSymbol
^Unicode isMathSymbol: self

Unicode's methods would look at and interpret its category information for
the character ... which might be internally managed via some kind of tree
structure (who knows?).

And I would definitely expect to *not* see a method name like
"isCategorySm:". :-)

Magic numbers, magic codes, etc.: you always want there to be one definitive
expert (class) and you do not want other classes usurping its
responsibilities.

Prof. Andrew P. Black

Re: Testing a Unicode Character's Category

Richard (and others):

Marcus and I committed methods to implement these category tests on Friday. If any of you have time to review the code, it would be appreciated. The commit can be found here:

https://github.com/pharo-project/pharo/pull/326/commits/933b1dcba05b837ab292e19aab413f67b3f9eec5

I noticed that there is one test failing, which seems to be entirely unrelated. The test passes in my image.

Andrew

Richard Sargent

Re: Testing a Unicode Character's Category

Andrew, I've tried to comment on the changes. Overall, it was essentially what I was envisioning. My comments on a few methods are mostly specific quibbles about the implementation. One reflects that I think the original implementation contains an error.

Thanks for doing this!

On Sun, Oct 1, 2017 at 7:10 AM, Prof. Andrew P. Black <[hidden email]> wrote:

Richard (and others):

Marcus and I committed methods to implement these category tests on Friday. If any of you have time to review the code, it would be appreciated. The commit can be found here:

https://github.com/pharo-project/pharo/pull/326/commits/933b1dcba05b837ab292e19aab413f67b3f9eec5

I noticed that there is one test failing, which seems to be entirely unrelated. The test passes in my image.

Andrew