Unicode


Unicode

Christoph Thiede

Hi all! :-)


After some recent fun with the Unicode class, I found out that its data is quite out of date. For example, the comments do not even "know" that code points can be longer than four hex digits (16 bits), and newer characters such as 😺❤🤓 are not categorized correctly. Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it is not completely automated yet and has some bugs, but in the meantime, I have one general question for you:


At the moment, we have 30 class variables, one for each Unicode general category. These class variables map, in alphabetical order, to the integers from 0 to 29. Is this tedious structure really necessary? For various purposes, I would like clients to be able to get the category name of a specific code point. The current design makes this impossible without writing additional mappings.

Tl;dr: I would like to propose dropping these class variables and using Symbols instead. They can be compared just like integers, and since they are flyweights, this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to handle category names that are added later (though I don't know whether this will ever happen).


Examples:
Unicode generalTagOf: $a asUnicode. "#Ll"

Unicode class >> isLetterCode: charCode
  "All letter categories (Ll, Lm, Lo, Lt, Lu) start with $L"
  ^ (self generalTagOf: charCode) first = $L

Unicode class >> isAlphaNumericCode: charCode
  "A letter, or a decimal digit (Nd)"
  | tag |
  ^ (tag := self generalTagOf: charCode) first = $L
        or: [tag = #Nd]
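
A minimal workspace sketch of how the old numbers and the proposed Symbols could be converted into each other, so that #generalCategoryOf: could keep answering integers. It assumes the alphabetical 0-to-29 numbering described above; the variable names are made up for illustration and are not part of the Unicode class:

```smalltalk
"Hypothetical sketch: map category numbers to tags and back."
| tags tagOfNumber numberOfTag |
tags := #(#Cc #Cf #Cn #Co #Cs #Ll #Lm #Lo #Lt #Lu #Mc #Me #Mn #Nd #Nl #No
	#Pc #Pd #Pe #Pf #Pi #Po #Ps #Sc #Sk #Sm #So #Zl #Zp #Zs).
"Numbers are 0-based, Array indices are 1-based."
tagOfNumber := [:number | tags at: number + 1].
numberOfTag := [:tag | (tags indexOf: tag) - 1].
tagOfNumber value: 5. "#Ll"
numberOfTag value: #Nd. "13"
```

With such a mapping, the stored data would not have to change at all; only the lookup result would be translated.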

What do you think about this proposal? Please let me know and I will go ahead! :D

Best,
Christoph


Carpe Squeak!

Re: Unicode

Chris Cunnington-4
I’m not in any position to provide authority for anything, but I’m interested in learning more about what you’re doing. 
I’d like to know more about Unicode in Squeak, so if you post more on the topic, perhaps some examples, you can be sure I’ll be reading them. 

Chris 





Re: Unicode

Christoph Thiede

Hi Chris, I don't know much about Unicode in Squeak at the moment either, but I will try to document as many insights as possible when I commit related changes :)


However, at the moment this project is blocked for me as I need a second opinion before continuing with my proposed design changes ...


Best,

Christoph




Carpe Squeak!

Re: Unicode

Christoph Thiede
Hi all,

I would still be interested in resuming this project and supporting the
latest Unicode code points in Squeak. Is there really no one who could find
a few minutes to review my proposed design change?
Or, to put it another way: If I upload these changes to the inbox,
will anyone merge them? :-)

Best,
Christoph



--
Sent from: http://forum.world.st/Squeak-Dev-f45488.html

Carpe Squeak!

Re: Unicode

Levente Uzonyi
Hi Christoph,

On Sat, 5 Sep 2020, Christoph Thiede wrote:

> Hi all,
>
> I would still be interested in resuming this project and supporting the
> latest Unicode codepoints in Squeak. Is there really no one who could find
> some minutes to review my proposed design change?

Your words suggest that it has already been published, but I can't find it
anywhere.

> Or putting it another way: If I will upload these changes into the inbox,
> will anyone merge it? :-)

I will review it, and I'm sure others will as well. Though I can't
promise to merge it without having a look. :)


Levente



Re: Unicode

Eliot Miranda-2
In reply to this post by Christoph Thiede
Hi Christoph, Hi All,

On Mar 17, 2020, at 3:51 PM, Thiede, Christoph <[hidden email]> wrote:

> Hi all! :-)
>
> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as 😺❤🤓 are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:


And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values).  In the 32-bit variant Characters are 30-bit unsigned integers.  In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.

Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant?  It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.

Q2, how many bits should the 64-bit variant VM support for immediate Characters?

Then something to consider is that it is conceptually possible to support something like WideCharacter, which would represent code points outside of the immediate Character range on the 32-bit variant, analogous to LargePositiveInteger beyond SmallInteger maxVal.  This can be made to work seamlessly, just as it does currently with integers, and with Floats where SmallFloat64 is only used on 64-bits.

It has implications in a few parts of the system:
- failure code for WideString (VeryWideString?) at:[put:] primitives that would have to manage overflow into/access from WideCharacter instances
- ImageSegment and other (un)pickling systems that need to convert to/from a bit-specific “wire” protocol/representation
- 32-bit <=> 64-bit image conversion 

All this is easily doable (because we have models of doing it for Float and Integer general instances).  But we need good specifications so we can implement the right thing from the get-go.
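
For comparison, this is how the existing integer model behaves in a workspace; the idea above would give Character the same kind of transparent overflow. WideCharacter is only a proposed name at this point, so this sketch sticks to classes that already exist:

```smalltalk
"Integers overflow from the immediate class into a boxed class and back."
SmallInteger maxVal class. "SmallInteger"
(SmallInteger maxVal + 1) class. "LargePositiveInteger"
(SmallInteger maxVal + 1 - 1) class. "SmallInteger again: results are normalized"
```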


Best, Eliot
_,,,^..^,,,_ (phone)



Re: Unicode

Tobias Pape

> On 06.09.2020, at 19:15, Eliot Miranda <[hidden email]> wrote:
>
> Hi Christoph, Hi All,
>
>> On Mar 17, 2020, at 3:51 PM, Thiede, Christoph <[hidden email]> wrote:
>>
>> Hi all! :-)
>>
>> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as 😺❤🤓 are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
>>
>
> And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values).  In the 32-bit variant Characters are 30-bit unsigned integers.  In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
>
> Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant?  It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
>
> Q2, how many bits should the 64-bit variant VM support for immediate Characters?

Unicode has a max value of 0x10FFFF.
That makes 21 bits.
So no worries there.
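
A quick workspace check of that arithmetic (just a sketch; #highBit answers the index of the highest set bit):

```smalltalk
16r10FFFF highBit. "21"
16r10FFFF < (1 bitShift: 21). "true"
16r10FFFF < (1 bitShift: 30). "true: well inside the 30-bit immediate Character range"
```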

We should just not forget the leading-char stuff (Yoshiki, Andreas,...)


Best regards
        -Tobias





Re: Unicode

Levente Uzonyi
On Sun, 6 Sep 2020, Tobias Pape wrote:

>
>> On 06.09.2020, at 19:15, Eliot Miranda <[hidden email]> wrote:
>>
>> Hi Christoph, Hi All,
>>
>>> On Mar 17, 2020, at 3:51 PM, Thiede, Christoph <[hidden email]> wrote:
>>>
>>> Hi all! :-)
>>>
>>> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as 😺❤🤓 are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
>>>
>>
>> And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values).  In the 32-bit variant Characters are 30-bit unsigned integers.  In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
>>
>> Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant?  It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
>>
>> Q2, how many bits should the 64-bit variant VM support for immediate Characters?
>
> Unicode has a max value of 0x10FFFF.
> That makes 21 bit.
> So no worries there.
>
> We should just not forget the leading-char stuff (Yoshiki, Andreas,...)

AFAIU the leading char only makes sense when you have multiple CJK(V?)
languages in use at the same time. In other cases Unicode (leadingChar =
0) is perfectly fine.
IIRC there are 22 bits available for the codePoint and 8 for the
leadingChar, so we're still good: all Unicode characters fit.
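
A sketch of that layout using plain integers, assuming the split recalled above (22 bits for the code point, 8 bits for the leadingChar on top). The leadingChar value 5 is arbitrary and only there for illustration:

```smalltalk
"Pack a leadingChar and a code point into one 30-bit value and unpack them again."
| codePoint leadingChar packed |
codePoint := 16r1F63A. "😺"
leadingChar := 5.
packed := (leadingChar bitShift: 22) bitOr: codePoint.
(packed bitAnd: 16r3FFFFF) = codePoint. "true: the low 22 bits hold the code point"
(packed bitShift: -22) = leadingChar. "true: the bits above hold the leadingChar"
packed < (1 bitShift: 30). "true: still fits into 30 bits"
```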


Levente



Re: Unicode

Tobias Pape

> On 06.09.2020, at 20:40, Levente Uzonyi <[hidden email]> wrote:
>
> On Sun, 6 Sep 2020, Tobias Pape wrote:
>
>>
>>> On 06.09.2020, at 19:15, Eliot Miranda <[hidden email]> wrote:
>>> Hi Christoph, Hi All,
>>>> On Mar 17, 2020, at 3:51 PM, Thiede, Christoph <[hidden email]> wrote:
>>>> Hi all! :-)
>>>> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as 😺❤🤓 are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
>>> And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values).  In the 32-bit variant Characters are 30-bit unsigned integers.  In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
>>> Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant?  It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
>>> Q2, how many bits should the 64-bit variant VM support for immediate Characters?
>>
>> Unicode has a max value of 0x10FFFF. That makes 21 bit. So no worries there.
>>
>> We should just not forget the leading-char stuff (Yoshiki, Andreas,...)
>
> AFAIU the leading char only makes sense when you have multiple CJK(V?) languages in use at the same time. In other cases Unicode (leadingChar = 0) is perfectly fine.
> IIRC there are 22 bits available for the codePoint and 8 for the leadingChar, so we're still good: all unicode characters fit.
>
>

\o/ hooray!





Re: Unicode

Christoph Thiede

Hi all,


> Your words suggest that it has already been published, but I can't find it anywhere.


Then I must have expressed myself poorly. I have not published any code changes yet, but in my original post from March, you can find a short description of the design changes I'd like to implement. Essentially, I would like to replace the separate class variables for every known character category with Symbols, in favor of greater flexibility.

Eliot, are there any remaining questions regarding the Character size in the VM? The Character size should be sufficient, as discussed in this thread, and of course, I can test any changes in a 32-bit image, too. :-)

WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect: #asCharacter)

Best,
Christoph







Carpe Squeak!

Re: Unicode

Levente Uzonyi
Hi Christoph,

On Tue, 8 Sep 2020, Thiede, Christoph wrote:

>
> Hi all,
>
>
> > Your words suggest that it has already been published, but I can't find it anywhere.
>
>
> Then I must have expressed myself wrong. I did not yet publish any code changes, but in my original post from March, you can find a short description of the design changes I'd like to implement. Essentially, I would like to
> replace the separate class variables for every known character class in favor of greater flexibility.

How would your changes affect GeneralCategory? Would it still be a
SparseLargeTable with ByteArray as arrayClass?
If you just replace those integers with symbols, the size of the table
will be at least 4 or 8 times larger in 32 or 64 bit images,
respectively.
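
A rough workspace estimate of that factor, assuming one byte per entry in a ByteArray and one object pointer per entry in an Array of Symbols. The numbers are an upper bound for a dense table covering every code point; a sparse table stores less, but the per-entry ratio stays the same:

```smalltalk
"Back-of-the-envelope table sizes for all code points up to 16r10FFFF."
| entries |
entries := 16r10FFFF + 1.
entries. "1114112 entries, i.e. 1114112 bytes as a ByteArray"
entries * 4. "4456448 bytes with 32-bit object pointers"
entries * 8. "8912896 bytes with 64-bit object pointers"
```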


Levente.



Re: Unicode

Christoph Thiede

Hi Levente,


Basically, I would only like to get rid of the class variables for every single Unicode category because they offer poor explorability (it's hard to work with the numeric output of #generalCategoryOf:) and poor extensibility (you need to recompile the class definition for adding UTF-16 support). If you are concerned about increasing the size of the SparseLargeTable, I think we could also just add one or two extra dictionaries that map every category symbol to a number and vice versa. What do you think?


Best,

Christoph
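
A minimal workspace sketch of the two-dictionary idea, not code from the image: the numbering follows the description earlier in the thread (30 categories in alphabetical order, mapped to 0 through 29), and the variable names names, symbolToNumber and numberToSymbol are made up for illustration.

```smalltalk
"Hypothetical sketch: category symbols in alphabetical order, numbered 0 to 29."
| names symbolToNumber numberToSymbol |
names := #(#Cc #Cf #Cn #Co #Cs #Ll #Lm #Lo #Lt #Lu #Mc #Me #Mn #Nd #Nl
	#No #Pc #Pd #Pe #Pf #Pi #Po #Ps #Sc #Sk #Sm #So #Zl #Zp #Zs).
symbolToNumber := IdentityDictionary new.
numberToSymbol := Dictionary new.
"Array indices are 1-based, category numbers are 0-based."
names withIndexDo: [:name :index |
	symbolToNumber at: name put: index - 1.
	numberToSymbol at: index - 1 put: name].
symbolToNumber at: #Ll.  "5"
numberToSymbol at: 13.  "#Nd"
```

With this layout the table itself could keep storing small integers, and only the public accessors would translate to and from symbols.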





Carpe Squeak!

Re: Unicode

Levente Uzonyi
Hi Christoph,

On Wed, 9 Sep 2020, Thiede, Christoph wrote:

>
> Hi Levente,
>
>
> basically, I only would like to get rid of the class variables for every single Unicode category because they provide low explorability (it's hard to work with the numeric output of #generalCategoryOf:) and extensibility (you
> need to recompile the class definition for adding UTF-16 support). If you are critical of increasing the size of the SparseLargeTable, I think we would also just make one or two extra dictionaries to map every category symbol
> to a number and vice versa. What do you think?

You mean an array to map the integers to symbols, right? :)
Anyway, I don't think it's worth using symbols internally.
For example, #isLetterCode: is 8-10% slower with the extra array lookup
and checking the category symbol's first letter than the current method
of integer comparisons.

Do you expect these constants to appear outside the Unicode class? If yes,
then using symbols for those cases is probably a good solution.
But for internal use, the integers are better.


Levente
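
The trade-off can be sketched in a workspace as follows. This is an illustration under assumptions, not the code in the Unicode class: the category numbers are assumed to follow the alphabetical 0 to 29 scheme described in the thread, which puts the letter categories Ll through Lu at 5 through 9, and the block names are hypothetical.

```smalltalk
"Hypothetical sketch: an Array maps category numbers (0 to 29) to symbols."
| names tagOf isLetterByTag isLetterByNumber |
names := #(#Cc #Cf #Cn #Co #Cs #Ll #Lm #Lo #Lt #Lu #Mc #Me #Mn #Nd #Nl
	#No #Pc #Pd #Pe #Pf #Pi #Po #Ps #Sc #Sk #Sm #So #Zl #Zp #Zs).
"Public-facing lookup: number to symbol, shifting to the 1-based Array index."
tagOf := [:number | names at: number + 1].
"Symbol-based test: extra array lookup plus a check of the first letter."
isLetterByTag := [:number | (tagOf value: number) first = $L].
"Integer-based test: a plain range comparison, assuming Ll = 5 and Lu = 9."
isLetterByNumber := [:number | number between: 5 and: 9].
"Both tests agree for every category number."
(0 to: 29) allSatisfy: [:n |
	(isLetterByTag value: n) = (isLetterByNumber value: n)]  "true"
```

This matches the split suggested above: integers stay in the internal predicates, and a lookup like tagOf is used only where a client needs the category name.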
