Unicode


Unicode

Christoph Thiede

Hi all! :-)


After some recent fun with the Unicode class, I found out that its data is quite out of date. For example, the comments do not even "know" that code points can exceed four hex digits (that is, lie beyond U+FFFF), and younger characters such as 😺❤🤓 are not categorized correctly. Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it is not completely automated yet and has some glitches, but in the meantime, I have one general question for you:


At the moment, we have 30 class variables, one for each Unicode general category. These class variables map, in alphabetical order, to the integers from 0 to 29. Is this tedious structure really necessary? For several purposes, I would like to get the category name of a specific code point from client code. The current design makes this impossible without writing additional mappings.

TL;DR: I propose dropping these class variables and using Symbols instead. Symbols can be compared as cheaply as integers, and because they are flyweights (each Symbol exists only once), this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers for compatibility, but we could deprecate it in favor of a new #generalTagOf:. This would also let us handle category names that are added later (though I don't know whether that will ever happen).


Examples:
Unicode generalTagOf: $a asUnicode. "#Ll"

Unicode class >> isLetterCode: charCode
  ^ (self generalTagOf: charCode) first = $L

Unicode class >> isAlphaNumericCode: charCode
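  | tag |
  "Use the proposed #generalTagOf: here; #generalCategoryOf: answers a number, which does not understand #first."
  ^ (tag := self generalTagOf: charCode) first = $L
        or: [tag = #Nd]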
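
For illustration only, here is a minimal sketch of how #generalTagOf: could sit on top of the existing numeric lookup. It assumes that #generalCategoryOf: answers an integer from 0 to 29 in the alphabetical order described above. The literal array and the method body are hypothetical, not existing Squeak code:

```smalltalk
Unicode class >> generalTagOf: charCode
	"Hypothetical sketch: answer the general category of charCode as a Symbol.
	Assumes #generalCategoryOf: answers an integer in 0..29, with the 30
	categories numbered in alphabetical order. Code points for which the
	numeric lookup answers something else are not handled here."
	^ #(#Cc #Cf #Cn #Co #Cs
		#Ll #Lm #Lo #Lt #Lu
		#Mc #Me #Mn
		#Nd #Nl #No
		#Pc #Pd #Pe #Pf #Pi #Po #Ps
		#Sc #Sk #Sm #So
		#Zl #Zp #Zs)
			at: (self generalCategoryOf: charCode) + 1 "Arrays are 1-based"
```

In the long run the table itself could store Symbols directly, so that this indirection would only be needed during a deprecation phase.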

What do you think of this proposal? Please let me know and I will go ahead! :D

Best,
Christoph



Re: Unicode

Chris Cunnington-4
I’m not in any position to provide authority for anything, but I’m interested in learning more about what you’re doing. 
I’d like to know more about Unicode in Squeak, so if you post more on the topic, perhaps some examples, you can be sure I’ll be reading them. 

Chris 

On Mar 17, 2020, at 6:51 PM, Thiede, Christoph <[hidden email]> wrote:

[...]




Re: Unicode

Christoph Thiede

Hi Chris, I don't know much about Unicode in Squeak at the moment either, but I will try to document as many insights as possible when I commit related changes :)


However, at the moment this project is blocked for me as I need a second opinion before continuing with my proposed design changes ...


Best,

Christoph


From: Squeak-dev <[hidden email]> on behalf of Chris Cunnington <[hidden email]>
Sent: Sunday, March 29, 2020 18:05:31
To: The general-purpose Squeak developers list
Subject: Re: [squeak-dev] Unicode
 
[...]