The Pharo image has a table of Unicode Character Categories in a class variable GeneralCategory of class Unicode. But there do not seem to be many methods to interpret this data. For example, while there is a method

    Unicode class >> #isDigit: aCharacter

that checks if the Unicode category of aCharacter is Nd, and

    Unicode class >> #isLetter: aCharacter

that checks if aCharacter is in one of the letter categories, there does not seem to be a general way of asking “what is the category of this character”.

I want to check if a character is a mathematical symbol, that is, if it is in the Unicode Category Sm. What’s the right way of doing this?

Would it be reasonable to add a method Unicode class >> #category: aCharacter that answers one of the 29 category symbols #Cc to #Zs? Or “is” methods for each category?

Andrew
Hi Andrew,

I would recommend the guideline that "special codes" should be modelled as private to the class (or family of classes) in question, and that questions like "is it a mathematical symbol" be answered by a method with almost that exact name.

There may be a justification for exposing the code itself, but not for directly interpreting the meaning of the code. A contrived example would be to answer a Dictionary grouping each UnicodeCharacterData instance by its General Category so that one could manipulate the set of related instances in some manner. As I said, contrived.

Also, the existing methods provide a template (pattern?) for future such methods.

-- Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html
Hi Richard,

Normally I agree with you, and prefer boolean methods

    Unicode inCategoryCc: aChar
    Unicode isCategorySm: aChar

to

    (Unicode categoryOf: aChar) == #Cc
    (Unicode categoryOf: aChar) == #Sm

In this particular case, though, the category codes are part of the Unicode Standard, so perhaps exposing them isn’t so bad. Moreover, there is meaning encoded into the symbols: for example, all the letter categories start with L. So one could write an isLetter test like this

    Character >> isLetter
        ^ (Unicode categoryOf: self) first == $L

rather than

    Character >> isLetter
        (Unicode inCategoryLl: self) ifTrue: [ ^ true ].
        (Unicode inCategoryLm: self) ifTrue: [ ^ true ].
        (Unicode inCategoryLo: self) ifTrue: [ ^ true ].
        (Unicode inCategoryLt: self) ifTrue: [ ^ true ].
        (Unicode inCategoryLu: self) ifTrue: [ ^ true ].
        ^ false

There is still the disadvantage that a typo in the category name (typing #L1, for example, when one means #Ll) is likely to go undetected.

Andrew
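One way to address the typo concern above is to route every category code through a check against the set of known codes. What follows is only a playground sketch under stated assumptions: `known` is an illustrative subset of the General Category codes, and `checked` is a hypothetical helper block, not something that exists in the Pharo image.

```smalltalk
"Playground sketch, not Pharo's actual implementation.
 'known' is an illustrative subset of the category codes;
 'checked' is a hypothetical helper block."
| known checked |
known := #(#Ll #Lm #Lo #Lt #Lu #Nd #Sm) asSet.
checked := [ :aSymbol |
	"Reject any symbol that is not a known category code."
	(known includes: aSymbol)
		ifFalse: [ Error signal: 'Unknown category code: ' , aSymbol printString ].
	aSymbol ].

checked value: #Ll.   "answers #Ll"
checked value: #L1.   "signals an Error instead of silently never matching"
```

With a check of this kind, a comparison against a mistyped code fails loudly the first time it runs instead of quietly answering false forever.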
Andrew P. Black wrote
> Hi Richard,
>
> Normally I agree with you, and prefer boolean methods
>
>     Unicode inCategoryCc: aChar
>     Unicode isCategorySm: aChar
> to
>     (Unicode categoryOf: aChar) == #Cc
>     (Unicode categoryOf: aChar) == #Sm
>
> In this particular case, though, the category codes are part of the
> Unicode Standard, so perhaps exposing them isn’t so bad. Moreover, there
> is meaning encoded into the symbols: for example, all the letter
> categories start with L. So one could write an isLetter test like this

I understand this argument, but I cannot agree with it. If I were to ask you what code points were defined in character category Mx, you would immediately go to a web site that contained the Unicode categories to find the answer. In other words, you would "ask Unicode".

In general, I would expect to see:

    Character>>#isCapitalLetter
        ^Unicode isCapitalLetter: self "actually, /self codePoint/"

    Character>>#isLetter
        ^Unicode isLetter: self

    Character>>#isMathSymbol
        ^Unicode isMathSymbol: self

Unicode's methods would look at and interpret its category information for the character ... which might be internally managed via some kind of tree structure (who knows?).

And I would definitely expect to *not* see a method name like "isCategorySm:". :-)

Magic numbers, magic codes, etc.: you always want there to be one definitive expert (class) and you do not want other classes usurping its responsibilities.

> Character >> isLetter
>     ^ (Unicode categoryOf: self) first == $L
>
> rather than
>
> Character >> isLetter
>     (Unicode inCategoryLl: self) ifTrue: [ ^ true ].
>     (Unicode inCategoryLm: self) ifTrue: [ ^ true ].
>     (Unicode inCategoryLo: self) ifTrue: [ ^ true ].
>     (Unicode inCategoryLt: self) ifTrue: [ ^ true ].
>     (Unicode inCategoryLu: self) ifTrue: [ ^ true ].
>     ^ false
>
> There is still the disadvantage that a typo in the category name (typing
> #L1, for example, when one means #Ll) is likely to go undetected.
>
> Andrew
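The "one definitive expert" idea might look roughly like the sketch below, in which the category code appears in exactly one method of Unicode and Character only delegates. Note that `#categoryCodeOf:` is a hypothetical private helper, assumed to answer the General Category symbol for a character; it is not a confirmed selector in the Pharo image, and the methods actually committed may differ.

```smalltalk
"Sketch only. #categoryCodeOf: is a hypothetical private helper that
 answers the General Category symbol (e.g. #Sm) for a character."
Unicode class >> isMathSymbol: aCharacter
	"The only place that knows 'math symbol' means category Sm."
	^ (self categoryCodeOf: aCharacter) == #Sm

Character >> isMathSymbol
	"Delegate to the expert class instead of interpreting the code here."
	^ Unicode isMathSymbol: self
```

Under those assumptions, `$+ isMathSymbol` would be expected to answer true, since U+002B PLUS SIGN is in category Sm, and no caller ever needs to spell the code #Sm itself.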
Richard (and others):

Marcus and I committed methods to implement these category tests on Friday. If any of you have time to review the code, it would be appreciated. The commit can be found here:

https://github.com/pharo-project/pharo/pull/326/commits/933b1dcba05b837ab292e19aab413f67b3f9eec5

I noticed that there is one test failing, which seems to be entirely unrelated. The test passes in my image.

Andrew
Andrew,

I've tried to comment on the changes. Overall, it was essentially what I was envisioning. My comments on a few methods are mostly specific quibbles about the implementation. One reflects that I think the original implementation contains an error.

Thanks for doing this!

On Sun, Oct 1, 2017 at 7:10 AM, Prof. Andrew P. Black <[hidden email]> wrote:
> Richard (and others):